Hadoop's rich and rapidly expanding ecosystem is both a blessing and a curse for users of the open source distributed processing framework. That's because there are so many Hadoop-related tools and technologies available to them, which creates lots of choices -- but also a crowded landscape that can be hard to navigate.
What's needed is a focused approach, with measured steps being taken to deploy and expand a Hadoop cluster architecture, according to Raheem Daya, director of product development and manager of the Hadoop platform at RelayHealth, a McKesson Corp. subsidiary that provides a variety of medical information services to healthcare providers, insurers and pharmacies.
Daya began working with Hadoop at RelayHealth in 2011. Since then, he has introduced other tools into the Alpharetta, Ga., company's Hadoop environment in stages, adding new ones only as they became needed. During a presentation at TDWI's The Analytics Experience conference in Boston, he recommended that implementers pinpoint an initial Hadoop use case and see it through before taking on additional ones. In his case, the first step was to augment an existing operational data store (ODS), which served as a file archive, with a Hadoop-based version that allowed RelayHealth to keep more data for a longer period and add some analysis functionality to the equation.
"Stay focused on what is going to bring you ROI," Daya told conference attendees, referring to a return on investment. "Find something that is going to prove out some cost savings or additional revenue, and start there." He noted that the ODS update helped RelayHealth stay current with government compliance requirements for medical data, while also reducing costly data warehouse and mainframe growth.
Despite his advice to maintain a narrow focus at first, Daya cautioned that the initial Hadoop implementation must soon be accompanied by some longer-term planning. "I realize the two ideas are at loggerheads," he said. "You have to start with a specific use case, but also plan future hardware buying with additional use cases in mind."
Consorting with Kafka, Camus and more
Over the past four years, Daya and his team have built up the Hadoop architecture at RelayHealth considerably, starting with a 10-node cluster running Cloudera's Hadoop distribution, expanding it to 40 machines and adding a host of related software technologies from the Hadoop ecosystem.
Early on came HBase, a wide-column NoSQL database that works within the Hadoop framework. And out of necessity, the database was just the start. "It was fast enough, but then things would break. So I added complexity where I needed to," Daya said. That has led him down a path that also includes Hive, Impala, Storm, Kafka, Camus and the Spark processing engine's Spark Streaming module, which RelayHealth installed earlier this year to run predictive models aimed at identifying medical claims likely to be flagged for review by insurers.
Each added technology has a specific purpose. For example, the Kafka message broker is being used to queue and distribute data within the Hadoop environment. At that point, the information can be handled by any of three technologies: Spark Streaming for predictive analytics; Storm for specialized computations; or Camus, a MapReduce job developed at LinkedIn that serves as a data pipeline between Kafka and Hadoop Distributed File System (HDFS) storage nodes in the cluster.
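The fan-out described above can be illustrated with a minimal sketch. It is not RelayHealth's code and does not use a Kafka client library; it is an in-memory stand-in, written in Python, for the way Kafka lets several consumer groups read the same log independently, each tracking its own offset. The class name `TopicLog`, the group names and the record IDs are all hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """In-memory stand-in for a single Kafka topic partition (an append-only log)."""

    def __init__(self):
        self._records = []
        # Hypothetical bookkeeping: consumer group name -> next offset to read.
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record to the end of the log."""
        self._records.append(record)

    def poll(self, group):
        """Return the records this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


log = TopicLog()
for claim_id in ("claim-1", "claim-2", "claim-3"):
    log.publish(claim_id)

# Each downstream system reads the full stream at its own pace.
scoring_batch = log.poll("spark-streaming")  # predictive analytics
compute_batch = log.poll("storm")            # specialized computations

log.publish("claim-4")

# A batch-oriented reader that polls later still sees every record from the start.
archive_batch = log.poll("camus")            # bulk load toward HDFS
late_scoring = log.poll("spark-streaming")   # only what arrived since its last poll
```

Because each group keeps its own position in the log, a slow batch loader does not hold back a streaming consumer, which is one reason a broker can sit in front of several processing engines at once.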
"I'm a big fan of using a mix of technologies, but with a purpose," Daya said. And while he and his team have become increasingly adept at integrating new components into the Hadoop cluster, going from a proof-of-concept system with 10 nodes to the current 40-node configuration proved to be somewhat daunting, he acknowledged. For example, a 64-GB memory capacity per node seemed like plenty at the start -- but it turned out to fall short as technologies like Impala and Spark were added to the mix, requiring 96 GB and 128 GB, respectively.
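To put the memory figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every one of the 40 nodes would need to reach the stated per-node capacity; the article does not say how RelayHealth actually distributed the upgrades, and the function name is hypothetical.

```python
NODES = 40                 # current cluster size cited in the article
INITIAL_GB_PER_NODE = 64   # per-node memory that "seemed like plenty at the start"
REQUIRED_GB = {"Impala": 96, "Spark": 128}  # per-node figures cited in the article


def shortfall_per_node(installed_gb, required_gb):
    """Memory gap on one node, never negative."""
    return max(0, required_gb - installed_gb)


for name, need in REQUIRED_GB.items():
    gap = shortfall_per_node(INITIAL_GB_PER_NODE, need)
    # Assumption: the gap applies uniformly to all nodes in the cluster.
    print(f"{name}: {gap} GB short per node, {gap * NODES} GB across {NODES} nodes")
```

Under that uniform-upgrade assumption, moving from 64 GB to 128 GB per node doubles the cluster's total memory, which is the kind of jump that is easier to absorb when future use cases are factored into hardware purchases up front.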
Go organic in tilling the Hadoop soil
After several years in the IT spotlight, Hadoop adoption still isn't widespread -- only 10% of the respondents to a Gartner Inc. survey this year said their organizations had deployed Hadoop for production uses, while another 16% said they were running pilot projects or experimenting with the technology. Daya said IT teams working on deployments might get better results if they grow their Hadoop systems -- and expertise -- organically. "We started out with the idea that, if nothing else, we had a solution for our file archive," he said.
The trick in going forward for data professionals, Daya added, is being aware that implementing a Hadoop cluster architecture to support big data applications requires a different mindset than building an enterprise data warehouse does.
"This is not traditional data development -- that's the big stumbling block," he said. "There are so many different pieces, and people get lost in the complexity."
To avoid that, find the pieces you really need, he advised -- and avoid the temptation to do an exhaustive survey of all the different corners of the Hadoop ecosystem. "You'll never come back," he warned.