Beachbody LLC runs a conventional on-premises Oracle data warehouse. But last year, when the maker of fitness and nutrition products decided to bulk up its analytics architecture by adding a Hadoop-based data lake, it headed to the cloud.
The big data system went live in December in the Amazon Web Services (AWS) cloud, making Beachbody one of a growing number of organizations turning to cloud platforms for deployments of Hadoop and related technologies. In Beachbody's case, the increased agility and flexibility enabled by the cloud were the biggest selling points in favor of setting up the AWS data lake, said Eric Anderson, executive director of data at the Santa Monica, Calif., company.
"It would have been a very difficult process to do on premises," Anderson said. For starters, his team has limited resources: Even after hiring some new employees with big data skills, he would have needed "a lot more people than I have now" for an on-premises implementation. In addition, open source big data technologies are evolving so quickly that keeping up with the pace of software updates would have been a challenge, he said.
The same goes for putting together all the different technology pieces that Beachbody is using. Anderson, who spoke about the data lake project at Strata + Hadoop World 2017 in San Jose, Calif., said in a follow-up interview that the cloud deployment was completed in six months. Building a similar big data environment internally likely would have taken more than a year, he estimated.
Under the AWS data lake's surface
The cloud system is built on Hortonworks Inc.'s distribution of Hadoop; as part of that, Beachbody also uses the Spark processing engine, the Hive SQL-on-Hadoop query software and the Zeppelin notebook tool for developing and running analytics applications. Incoming data is initially processed in the Hadoop Distributed File System, but is then stored in the Amazon Simple Storage Service (S3), with data integration and ingestion processes handled primarily via tools from Talend.
The Amazon Redshift cloud data warehouse is part of the picture, too -- at least for now. Beachbody pushes some highly processed data into Redshift to get faster performance on analytical queries, although Anderson said he's looking to drop Redshift down the road and move the work it does to the open source Presto query engine and R analytics tools running directly against data in S3.
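For readers curious what "running directly against data in S3" can look like in practice, here is a minimal sketch of one common convention: Hive-style partitioned paths, in which directory names of the form key=value let engines such as Hive and Presto skip partitions a query doesn't need. This is not Beachbody's actual layout. The bucket name, the "raw" prefix, the source label and the Parquet file format are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical bucket name; not taken from the article.
BUCKET = "example-data-lake"


def partition_prefix(source: str, day: date, bucket: str = BUCKET) -> str:
    """Return a Hive-style partitioned S3 prefix for one source and one day.

    The key=value directory names (source=..., dt=...) follow the naming
    convention that Hive and Presto use to map directories to partitions.
    """
    return f"s3://{bucket}/raw/source={source}/dt={day.isoformat()}/"


def object_key(source: str, day: date, part: int) -> str:
    """Return the full S3 URI of one numbered file inside that partition.

    The .parquet extension is an assumption; the article does not say
    which file format is used.
    """
    return f"{partition_prefix(source, day)}part-{part:05d}.parquet"


if __name__ == "__main__":
    # Example: the first file of one day's website activity data.
    print(object_key("web_activity", date(2017, 3, 1), 0))
```

With a layout along these lines, a query that filters on a single day only has to read the files under that day's prefix, which is one reason query engines can work against S3 without first loading the data into a separate warehouse.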
The AWS data lake supports self-service analytics by Beachbody's data scientists and analysts, and it gives them access to a much broader collection of data than the Oracle data warehouse does. The information being pumped into the big data environment includes website activity data, logs from the company's workout-video streaming service, call-center records and external data on customer acquisition and spending, in addition to sales and financial transaction data.
And there's more to come: The data lake will hold "everything, eventually," Anderson said. "We just haven't gotten to it all yet." He added that while data sources can be connected to the cloud system much more quickly than to the data warehouse, his team currently has "a backlog of close to 100 data sources" to work through.
Hortonworks and Hadoop rival Cloudera Inc. both took steps last year to make it easier to deploy cloud-based big data systems with their distributions. Now, Hortonworks, based in Santa Clara, Calif., is adopting a cloud-first rollout strategy for its Hortonworks Data Platform software through business partner Microsoft, which builds its Azure HDInsight managed service on top of HDP. At the Strata conference earlier this month, Microsoft announced an HDInsight update based on a new HDP 2.6 release, which Hortonworks will make available on AWS and for on-premises installations next week.
Slow going for cloud and big data
Big data analytics deployments in the cloud have lagged behind cloud adoption for other types of applications, said Hortonworks CTO Scott Gnau. "But I think we're at the point where that's going to change." Only about 25% of Hortonworks users currently run public cloud systems, Gnau said. Within three years, he expects nearly all of them to have hybrid on-premises and cloud environments.
Also at Strata, IT vendor Unisys Corp. announced a machine learning service that runs on HDP in both the AWS and Microsoft Azure clouds -- the latter through HDInsight. The cloud-based service will apply machine learning algorithms written by data scientists at Unisys to data from corporate clients as an alternative to on-premises analytics platforms that the Blue Bell, Pa., vendor also offers.
The Unisys customer base primarily consists of large organizations with a lot of data -- for example, airlines and financial services firms. Rod Fontecilla, vice president and global leader of advanced data analytics at Unisys, isn't as bullish about cloud adoption as Gnau is, but he projected that about 30% of the companies he works with "will go to the cloud and live happily there."
Beachbody faced some hurdles on its way to the big data cloud. First, the data team had to sell both the data lake and cloud concepts internally, Anderson said. Next, he had to build up the team's big data skills, initially by bringing in consultants to help jump-start the project and then through a combination of new hires and retraining. New methods of documenting and governing data also had to be put in place.
Now, though, Beachbody is taking its IT architecture to the cloud in general, according to Anderson. "The data lake initially was something of an outlier," he said. "It's a more strategic move now." That includes an envisioned shift of the on-premises data warehouse to either Redshift or Presto on S3. Anderson also plans to link the AWS data lake directly to Salesforce and other business applications so end users can access relevant data as part of day-to-day operations.