Using big data platforms for data management, access and analytics
IT teams looking to build big data architectures have an abundance of technology choices they can mix and match to meet their data processing and analytics needs. But there's a downside to that: Putting all the required pieces together is a daunting task.
Finding and deploying the right big data technologies within the expanding Hadoop ecosystem is a lengthy process frequently measured in years, unless corporate executives throw ample amounts of money and resources at projects to speed them up. Missteps are common, and one company's architectural blueprint won't necessarily translate to other organizations, even in the same industry.
"I tell people that it's not something you can order from Amazon or get from the Apple Store," said Bryan Lari, director of institutional analytics at the University of Texas MD Anderson Cancer Center in Houston. Fully constructing a big data architecture, he added, "is complex, and it's a journey. It's not something we're going to implement in six months or a year." Nor is there an easy-to-apply technology formula to follow. "Depending on the use case or user, there are different tools that do what we need to do," Lari said.
MD Anderson's big data environment is centered on a Hadoop cluster that went into production use in March, initially to process vital-signs data collected from monitoring devices in patients' rooms. But the data lake platform also includes HBase, Hadoop's companion NoSQL database; the Hive SQL-on-Hadoop software; and various other Apache open source technologies, such as Pig, Sqoop, Oozie and ZooKeeper. In addition, the cancer treatment and research organization has deployed an Oracle data warehouse as a downstream repository to support analytics and reporting applications, plus IBM's Watson cognitive computing system to provide natural language processing and machine learning capabilities. New data visualization, governance and security tools are due to be added in the future, too.
The IT team at MD Anderson began working with Hadoop in early 2015. To demo some potential applications and learn about the technology, the center first built a pilot cluster using the base Apache Hadoop software; later, it brought in the Hortonworks distribution of Hadoop for the production deployment.
Vamshi Punugoti, associate director of research information systems at MD Anderson, said the experience gained in the pilot project should also help make it easier to cope with modifications to the architecture that likely will be needed as new big data tools emerge to augment or replace existing ones. "It's a continually evolving field, and even the data we're collecting is constantly changing," Punugoti said. "It would be naïve to assume we have it all covered."
Steering toward a better architecture
A platform engineering team at ride-sharing company Uber similarly spent about 12 months building a multifaceted big data architecture, but with even more technology components, and in more of a hurry-up mode. Vinoth Chandar, a senior software engineer on Uber's Hadoop team, said the San Francisco-based company's existing systems couldn't keep up with the volumes of data that its fast-growing business operations were generating. As a result, much of the data couldn't be analyzed in a timely manner -- a big problem because Uber's business "is inherently real time" in nature, Chandar said.
To enable operations managers to become more data-driven, Chandar and his colleagues set up a Hadoop data lake environment that also includes HBase, Hive, the Spark processing engine, the Kafka message queueing system and a mix of other technologies. Some of the technologies are homegrown -- for example, a data ingestion tool called Streamific. With the architecture in place, Uber is "catching up to the state of the art on big data and analytics," Chandar said. But it wasn't easy getting there. "To pull it all together, I can say that 10 people didn't sleep for a year," he added half-jokingly.
The architectural challenges are no laughing matter for organizations, though. Consulting company Gartner predicts that through 2018, 70% of Hadoop installations will fall short of their cost savings and revenue generation goals due to a combination of inadequate skills and technology integration difficulties. The integration hurdles are being exacerbated, Gartner analyst Merv Adrian said, by a steady increase in the number of related big data technologies that Hadoop distribution vendors are supporting as part of their commercial offerings (see "Elephantine Proportions").
In a presentation at the 2016 Pacific Northwest BI Summit in Grants Pass, Ore., Adrian listed a total of 46 open source technologies revolving around Hadoop that are supported by one or more of the distribution vendors. But connecting specific sets of those components into working big data architectures is a job that's left to user organizations, according to Adrian. "Most Hadoop projects of any significance are artisanal -- you're putting the pieces together," he said.
Changes in plans along the way
This kind of piece work can be a tall order even when Hadoop isn't part of the picture. Celtra Inc., which offers a platform for designing online display and video ads, had several fits and starts in deploying a cloud-based processing architecture that now combines Spark and its Spark SQL module with an Amazon Simple Storage Service (S3) repository, MySQL relational databases and a data warehouse system from Snowflake Computing.
"There was a lot of trial and error," said Grega Kespret, engineering director for analytics at the Boston-based company. "What's challenging is to come up with an architecture that meets your business needs but not to overreach." If you do, he cautioned, you might "end up with a mess."
Initially, Celtra collected data on ad interactions by website visitors and other trackable events in S3 and used Spark as an extract, transform and load (ETL) engine to aggregate the information with operational data in MySQL for reporting uses. But it was hard to analyze the raw event data with that setup, Kespret said. The company then added a separate Spark-based analytics system, which helped but still required Celtra's data analysts to stitch together, cleanse and validate the event data themselves -- an undertaking that he said was "kind of error-prone."
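As a rough illustration of the kind of ETL aggregation described above, the following sketch counts raw events per campaign and joins the counts with operational records for reporting. It is plain Python rather than Spark, and the field names (`campaign_id`, `type`) and the `campaigns` lookup are hypothetical stand-ins, not details of Celtra's actual pipeline.

```python
from collections import Counter


def aggregate_events(events, campaigns):
    """Count events per (campaign, event type) and attach campaign names.

    events:    iterable of dicts with hypothetical keys "campaign_id" and "type"
    campaigns: dict mapping campaign_id -> name (stand-in for operational data)
    """
    counts = Counter((e["campaign_id"], e["type"]) for e in events)
    rows = []
    for (campaign_id, event_type), n in sorted(counts.items()):
        rows.append({
            "campaign": campaigns.get(campaign_id, "unknown"),
            "event_type": event_type,
            "count": n,
        })
    return rows


if __name__ == "__main__":
    raw_events = [
        {"campaign_id": 1, "type": "click"},
        {"campaign_id": 1, "type": "view"},
        {"campaign_id": 1, "type": "view"},
        {"campaign_id": 2, "type": "view"},
    ]
    names = {1: "spring-sale", 2: "video-launch"}
    for row in aggregate_events(raw_events, names):
        print(row)
```

A report built this way only sees the aggregates, which is consistent with Kespret's point that the raw event data was hard to analyze with that setup.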
In late 2015, after trying but rejecting other technologies, Kespret and his team turned to Snowflake as a holding tank for the event data after it passes through MySQL and gets organized by user sessions to make working with it easier for the analysts. The Snowflake system went into production use last April after an earlier soft-launch period. Kespret said the next step is to store the data aggregates in Snowflake as well and eliminate a second ETL process that funnels them into another MySQL data store.
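Organizing event data by user sessions, as mentioned above, is commonly done by splitting each user's events wherever the gap between consecutive events exceeds some threshold. The sketch below shows that general technique in plain Python; the 30-minute gap and the `(user_id, timestamp, name)` event shape are assumptions chosen for illustration, not Celtra's actual rules.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity threshold; a real system would tune this value.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, name) tuples into per-user sessions.

    A new session starts whenever consecutive events from the same user
    are more than `gap` seconds apart. Returns a list of
    (user_id, [events]) pairs.
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # by user, then time
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        current = []
        for event in user_events:
            if current and event[1] - current[-1][1] > gap:
                sessions.append((user_id, current))
                current = []
            current.append(event)
        if current:
            sessions.append((user_id, current))
    return sessions


if __name__ == "__main__":
    sample = [
        ("u1", 0, "view"),
        ("u1", 60, "click"),
        ("u1", 4000, "view"),
        ("u2", 10, "view"),
    ]
    for user, session in sessionize(sample):
        print(user, [name for _, _, name in session])
```

Doing this grouping once, before the data reaches analysts, spares them from stitching events together by hand for each query.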
'Wild West days' on big data development
Hadoop co-creator Doug Cutting acknowledged that the plethora of technology choices can complicate the process of building big data architectures. For many user organizations looking to take advantage of Hadoop and its cohort technologies, "there's sort of a froth in these Wild West days," said Cutting, who now is chief architect at Hadoop vendor Cloudera Inc.
But Cutting thinks the benefits of big data systems -- including increased architectural flexibility, support for new kinds of analytics applications and lower IT costs -- are well worth the integration hassles. And he chalked up much of the problem to a lack of familiarity with the open source development and deployment process. "Pretty soon it won't be that daunting," he said. "People will get used to it."
Maybe so, but even IT managers at Yahoo Inc., which claims to be the largest Hadoop user around, aren't entirely immune from feeling the pressure. Cutting worked at Yahoo, based in Sunnyvale, Calif., when Hadoop originated in 2006, and the web search and internet services company was the first production user of the technology. Currently, the company's big data environment includes about 40 clusters that blend Hadoop with HBase, Spark, the Storm real-time processing engine and other technologies.
Overall, the vast technology ecosystem that has been built up around Hadoop is a boon for users, said Sumeet Singh, senior director of product development for cloud and big data platforms at Yahoo. Singh noted that the open source approach accelerates the pace of technology development and lets IT teams take part in planning and creating tools that are useful to their companies without having to do all the work themselves. "I know there's an overkill of open source projects, but not everyone will get widely adopted," he said. "There will be convergence and real clear winners."
The big data universe isn't all sunshine and blue skies, though. "It does come with its set of problems," Singh said, adding that at times his mind "sort of bursts just dealing with" the ongoing open source efforts and the myriad technology permutations that are possible in big data architectures.