Taking full advantage of big data platforms, such as Hadoop and Spark, often requires a new education for IT and analytics teams on how to configure systems and partition data to maximize processing speeds.
For example, when Valence Health was working to deploy a Hadoop cluster in early 2015, the healthcare technology and services provider initially focused its internal training efforts on Drill, an open source SQL-on-Hadoop query engine that IT developers would be using to write extract, transform and load (ETL) scripts for processing incoming data. But that turned out to be the wrong approach.
The bigger issue, Valence CTO Dan Blake said, was getting a better understanding of Hadoop's underlying structure and how to work with it effectively to optimize data processing performance. "We kind of started out with Drill training, but we really needed to do more in-depth training on Hadoop itself first," he said. "It's very different from a relational database."
Blake's team eventually went back to the basics on Hadoop, and he said the team is now working to wring as much processing speed as it can out of the cluster and Drill, both of which went into production use last May.
Chicago-based Valence, which works with hospitals and health systems looking to transition to value-based care methodologies, is using Drill to help pull 3,000 daily data feeds containing 45 different types of healthcare data into the 15-node Hadoop cluster for downstream analysis. That can amount to up to 25 million records on a busy day, Blake said, adding that the cluster -- based on MapR Technologies' Hadoop distribution -- can now handle the processing workload in "an hour or two."
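To put those figures in perspective, the short calculation below converts the peak daily volume into a rough sustained processing rate. It is a back-of-the-envelope sketch only: the 25 million records, the one-to-two-hour window and the 15 nodes come from Blake's description, while the function name and the assumption that work is spread evenly across all nodes are illustrative and not something Valence reported.

```python
# Back-of-the-envelope throughput estimate from the figures Blake cited.
# Assumption (not from Valence): load is spread evenly across every node.

PEAK_RECORDS = 25_000_000  # records on a busy day
NODES = 15                 # size of the Hadoop cluster


def records_per_second(records, hours):
    """Average rate needed to process `records` within `hours`."""
    return records / (hours * 3600)


if __name__ == "__main__":
    for hours in (1, 2):
        rate = records_per_second(PEAK_RECORDS, hours)
        print(
            f"{hours} h window: {rate:,.0f} records/s cluster-wide, "
            f"about {rate / NODES:,.0f} records/s per node"
        )
```

Under those assumptions, the cluster would be sustaining somewhere between roughly 3,500 and 6,900 records per second on a peak day.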
Learning curve to climb on using Hadoop
Figuring out how to fully leverage Hadoop was also a big challenge for developers and systems administrators at Progressive Insurance, according to Chris Barendt, an IT architect at the Mayfield Village, Ohio, auto insurer. "It was kind of a steep learning curve on understanding how to run the environment," he said.
Progressive is using Hive, another open source SQL-on-Hadoop technology, for both ETL and analytics to give its SQL-savvy business analysts and data scientists a familiar programming environment. But echoing Blake, Barendt said that Hadoop is a "completely different" platform to work with than SQL-based relational databases.
And there is configuration work to do in Hadoop, even though it doesn't impose the same kind of rigid data formats that relational systems do. The data in a Hadoop cluster may be unstructured or semi-structured in nature, but it has to be properly set up and partitioned to get good query performance, Barendt said. "You still need to do good design -- it's not magic."
In fact, he added that deploying Hive "was probably the easiest part of using Hadoop" at Progressive, which is running Hortonworks' distribution of the big data framework.
Speed not a given on big data platforms
Sellpoints Inc. faced similar configuration and partitioning hurdles after it began using a cloud-based Spark system from Databricks early last year. "It took us a while to figure out how to make that work," said Benny Blum, vice president of product and data at the Emeryville, Calif., company, which provides online marketing and advertising services to corporate clients.
Sellpoints uses Spark to process online activity data captured from websites, running ETL routines created with the technology's Spark SQL module to prepare the data for analysis. At first, Blum said, his team primarily had to focus on getting the Spark system to "a steady state" on processing -- a common priority for users implementing big data platforms, especially with emerging technologies like Spark. But steady didn't necessarily mean speedy, he added.
Last fall, Sellpoints began working to partition its data sets for faster query performance. ETL jobs that previously took 30 to 45 minutes to run with Spark can now be completed in as little as 10 seconds, Blum said. "You really need to take the time to understand the right way to structure your data," he advised other users, saying that doing so properly "allows you to access your Spark data in a much more efficient way."
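Why partitioning can make that kind of difference is easier to see in a small example. The sketch below uses only the Python standard library, not Spark or Hive, to mimic the `key=value` directory layout that both engines commonly use for partitioned tables. The sample events, column names and function names are all hypothetical, and nothing here reflects how Sellpoints or Progressive actually organized their data; the point is only that a query filtered on the partition key can skip every file outside the matching directory.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data; real feeds would be far larger.
EVENTS = [
    {"event_date": "2016-01-01", "site": "a.example", "clicks": 3},
    {"event_date": "2016-01-01", "site": "b.example", "clicks": 5},
    {"event_date": "2016-01-02", "site": "a.example", "clicks": 7},
    {"event_date": "2016-01-03", "site": "c.example", "clicks": 2},
]


def write_partitioned(rows, root, key):
    """Write rows into key=value subdirectories under root."""
    for row in rows:
        part_dir = Path(root) / f"{key}={row[key]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0000.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(row))
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def full_scan(root, key, value):
    """Open every file and filter row by row (no pruning)."""
    rows, files_opened = [], 0
    for path in sorted(Path(root).glob("*/*.csv")):
        files_opened += 1
        with path.open(newline="") as f:
            rows.extend(r for r in csv.DictReader(f) if r[key] == value)
    return rows, files_opened


def read_partition(root, key, value):
    """Open only the files in the directory matching key=value."""
    rows, files_opened = [], 0
    for path in sorted((Path(root) / f"{key}={value}").glob("*.csv")):
        files_opened += 1
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows, files_opened


def demo():
    """Run both read strategies against the same partitioned data."""
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(EVENTS, root, "event_date")
        pruned, pruned_files = read_partition(root, "event_date", "2016-01-01")
        scanned, scanned_files = full_scan(root, "event_date", "2016-01-01")
    return pruned, pruned_files, scanned, scanned_files


if __name__ == "__main__":
    pruned, pruned_files, scanned, scanned_files = demo()
    print(f"pruned read opened {pruned_files} file(s)")
    print(f"full scan opened {scanned_files} file(s)")
    print("same rows returned:", pruned == scanned)
```

Both reads return the same rows, but the pruned read touches one file where the full scan touches three. On a real cluster the gap grows with the number of partitions, which is one plausible reason a well-chosen partition key can cut job times so sharply. The trade-off is that the key has to match how the data is actually queried; partitioning on a column that queries rarely filter on buys nothing.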