
Cloud big data clusters test users on migration, management

There are good reasons to move big data systems to the cloud, but doing so also poses challenges for IT teams on migrating workloads and then managing clusters and system instances.

NEW YORK -- Companies are increasingly shifting big data clusters to the cloud for more flexibility and easier scalability. But IT managers who have made the move warn that getting the clusters there isn't easy, and that ongoing complications remain to contend with after they do.

The hurdles start with workload and data migration challenges, and they continue with a variety of management issues, according to speakers and attendees at the 2017 Strata Data Conference here. They pointed to things such as frequent system crashes and the need to carefully manage temporary clusters that are set up to run particular processing jobs and then shut down. In addition, they said some workloads aren't a good fit for the cloud computing model, which can require integration with systems that are left running internally.

The ability to dynamically spin up and modify big data clusters as needed in the cloud makes dealing with the downsides worthwhile for Chris Mills, who leads the big data team at The Meet Group Inc., a New Hope, Pa., company that operates a set of social networking and online dating sites.

After The Meet Group switched from an on-premises big data environment to one in the Amazon Web Services (AWS) cloud, clusters could be added or expanded "in minutes," Mills said. That has reduced IT overhead costs and made experimental and "deep-dive" analytics applications more feasible, he added.

Migrating to cloud-based big data systems was among the topics discussed at the 2017 Strata Data Conference in New York.

But moving to the cloud "is going to cost more and take longer than you planned," Mills cautioned in a conference session. In The Meet Group's case, that was partly due to the project team identifying potential new applications during the migration process. But unexpected issues also cropped up along the way, he said. All told, it took about six months to set up the cloud-based big data architecture, and another six months to fine-tune the environment.

Getting in tune on cloud migrations

At music streaming company Spotify, migrating thousands of processing workloads from an on-premises Hadoop cluster to a new architecture on the Google Cloud Platform created both technical and organizational challenges, said Alison Gilles, director of engineering for its data infrastructure group.

Stockholm-based Spotify couldn't just start moving jobs to the cloud without potentially blocking others from continuing to run successfully, said Gilles, who works at the company's U.S. headquarters in New York. Nor could its 100 or so product engineering and operations teams, which control their own workloads, stop working on projects related to the streaming service to focus on the migration effort.

To make sure processing jobs don't get blocked, Spotify is aggressively copying data back and forth between the on-premises cluster and the cloud architecture, said Josh Baer, who is managing the data migration process. In August, the copying work amounted to 110,000 jobs in its own right.

"We're incurring some technical debt," Baer acknowledged during a joint presentation with Gilles. "But we thought the long-term gain was worth some short-term pain here."

The data infrastructure unit also developed a set of open source software to help streamline migrations. To support "forklifting" of workloads to the cloud platform, Baer said, it built a tool for scheduling batch migration jobs to run in Docker containers via the Kubernetes orchestration engine, plus a technology that automates the setup of temporary big data clusters to handle the migration workflows. A Scala API was created, as well, for teams that want to rewrite their workloads as part of the migration, although Baer said the infrastructure group encourages them to move the applications first.
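As an illustration of the pattern Baer described -- each "forklifted" workload packaged in a Docker image and scheduled as a batch job on Kubernetes -- here is a hedged sketch of how one migration workload might be expressed as a Kubernetes Job manifest. The job name, image tag, and arguments are hypothetical; this is not Spotify's actual tooling, just the general shape of the approach.

```python
# Hypothetical sketch: a batch migration workload expressed as a
# Kubernetes batch/v1 Job manifest. All names and paths are illustrative.

def migration_job_manifest(job_name: str, image: str, args: list) -> dict:
    """Build a Kubernetes Job manifest for one containerized migration job."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            # Retry a failed copy a few times before giving up.
            "backoffLimit": 3,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": job_name,
                        "image": image,
                        "args": args,
                    }],
                },
            },
        },
    }

manifest = migration_job_manifest(
    "copy-listening-events",                       # hypothetical job name
    "example.registry/migrate:1.0",                # hypothetical image
    ["--source=hdfs:///events", "--dest=gs://bucket/events"],
)
```

A scheduler can submit manifests like this one per workload, which is what makes the "forklift" approach repeatable across thousands of jobs: the workload itself is untouched, and only its execution environment moves.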

A push and then a sprint

In addition, infrastructure engineers are being assigned to work directly with teams that have particularly complicated big data pipelines, or that need "a push" to get going, Baer said. The pairings are meant to result in sprint development projects aimed at completing migrations in two weeks or less.

Spotify, which launched the cloud initiative in early 2016, is about 80% done with the migration work overall, Gilles said. She hopes to ramp down most of the processing still being done on the on-premises cluster next year. "But you can imagine that in the last 20% [of the migrations], there are a lot of dragons," she added. "So, they may take longer, relatively speaking."

Based on The Meet Group's experiences, there are more dragons lurking in the big data cloud beyond the migration stage. For example, system instances in the Amazon Elastic Compute Cloud (EC2) service "fail all the time," Mills said. "It's not the same as working in a data center. You might lose five overnight."

Using EC2's Spot Instances feature, which lets AWS customers bid on spare compute nodes for temporary processing uses, also presents perils, according to Mills. It can cut costs by as much as 50% compared to Amazon's regular pricing, but he has seen spot instances stop running in the middle of jobs because other users placed higher bids for the assigned nodes. "You can have a batch job that's 80% complete, and the instance disappears -- poof," Mills said.
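One common defense against the mid-job interruptions Mills describes -- though not one the article attributes to The Meet Group -- is to checkpoint batch progress to durable storage, so a replacement instance can resume where the lost one stopped rather than redoing that 80% of the work. The following is a minimal sketch of the idea; the file path and the stand-in processing step are hypothetical, and a real system would checkpoint to durable storage such as S3 rather than local disk.

```python
# Illustrative sketch: checkpointing batch progress so an interrupted
# spot instance's work can be resumed instead of restarted from scratch.
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical path; use durable storage in practice

def load_checkpoint() -> int:
    """Return the index of the next unprocessed item (0 if starting fresh)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    """Record progress after each item, surviving a mid-job interruption."""
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_index": next_index}, f)

def run_batch(items: list) -> list:
    """Process items starting from the last checkpoint."""
    results = []
    for i in range(load_checkpoint(), len(items)):
        results.append(items[i] * 2)  # stand-in for real processing
        save_checkpoint(i + 1)
    return results
```

If the instance "disappears -- poof" partway through, the replacement node reads the checkpoint and picks up at the first unprocessed item, trading a small amount of bookkeeping overhead for not losing the completed portion of the job.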

Beware of zombie cloud clusters

In addition, cloud big data clusters spun up temporarily on dedicated nodes need to be monitored closely to avoid what Mills described as zombie clusters that aren't doing any processing work, but are still generating costs. "It's unlikely that a data center server is going to rack up a $40,000 bill over the weekend because someone left it running, but that has happened to us in the cloud," he said.
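The monitoring Mills recommends boils down to a simple check: flag any cluster that has no running jobs and has been idle past some threshold, so it can be torn down before it quietly accumulates charges over a weekend. A minimal sketch of that check follows; the cluster records, field names, and two-hour threshold are all made-up illustrative assumptions, not details from the article.

```python
# Hypothetical sketch: detecting "zombie" clusters that are idle but
# still running (and billing). All data below is illustrative.
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(hours=2)  # assumed idle threshold

def find_zombies(clusters: list, now: datetime) -> list:
    """Return names of clusters with no active jobs past the idle limit."""
    return [
        c["name"]
        for c in clusters
        if c["running_jobs"] == 0 and now - c["last_job_finished"] > IDLE_LIMIT
    ]

now = datetime(2017, 9, 27, 12, 0)
clusters = [
    {"name": "etl-nightly", "running_jobs": 0,
     "last_job_finished": datetime(2017, 9, 25, 3, 0)},   # idle for ~2 days
    {"name": "adhoc-analytics", "running_jobs": 3,
     "last_job_finished": datetime(2017, 9, 27, 11, 0)},  # still busy
]
```

In practice a check like this would run on a schedule against the cloud provider's cluster API and either alert an operator or terminate the flagged clusters automatically.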

Immature technologies are another issue in the cloud; for example, Mills said a version of the Apache HBase database that uses the Amazon Simple Storage Service instead of the Hadoop Distributed File System has bugs that can corrupt files and cause other problems. And The Meet Group is still running some on-premises systems in tandem with the AWS framework, including an Oracle database server that holds data on its websites' users and a Kafka system that feeds event data to the cloud clusters.

It's a similar situation at Ivy Tech Community College, which operates 45 campuses and satellite locations in Indiana. Ivy Tech uses an AWS-based big data architecture and Hitachi Vantara's Pentaho data integration and BI tools to support a self-service analytics system for its business users. But the cloud platform isn't the answer for all of the Indianapolis-based college's needs, CTO Lige Hensley said.

"To me, the cloud is part of the toolbox," Hensley said in an interview before a session on Ivy Tech's deployment. "You can use it for a lot of things, but it isn't perfect for everything. There are some workloads that we're never going to put in the cloud."
