Spark and Hadoop analytics efforts often stumble when teams try to turn small pilot projects into larger operational apps meant for data science teams and business analysts. For many, it is an obstacle in their quest to work with big data.
Configuration complexity has sometimes been the stumbling block. A custom-configured prototype built by a lone data scientist can take a long time to recreate, and then fail when it is shared by a wider user pool. To grapple with the problem, some individuals are applying DevOps-oriented container and microservices techniques as they sew together Spark and Hadoop components.
"Our data science teams and business stakeholders don't want to wait days or weeks for us to set up a new Spark cluster or other big data environment with all the tools, versions, configurations and data they need," said Ramesh Thyagarajan, managing director of The Advisory Board Company, a Washington, D.C.-based firm that provides analytics and consulting services to healthcare organizations.
He said he views Docker containers as a key enabling technology on the road to better agility for big data scientists and business users alike.
To bring such DevOps-style deployment to its big data applications, Advisory Board is using BlueData Software's EPIC software platform to run Spark SQL and the Spark analytics engine, as well as Apache Zeppelin developer notebooks. "For us, it's all about agility and faster business innovation," Thyagarajan said, emphasizing the BlueData platform's capabilities as a container-based architecture for big data deployments.
According to Thyagarajan, the platform provides on-demand spin-ups of new Spark clusters to data scientists and business analysts, who are basically shielded from the complexity underlying the configurations required for such deployments.
He said his team built its own framework to bring data into the Hadoop Distributed File System (HDFS). As a result, the on-demand Spark clusters work off a single curated source of data. Such centralization is important. "There is no way we could have supported 400-plus users, each creating their own cluster," he said.
Is it in the script?
It's still early for containers in big data. To date, Spark clusters have mainly been the province of bare metal server implementations, according to Tom Phelan, co-founder and chief architect at BlueData, and a veteran of the virtualization industry.
Bare metal has meant arduous setups and static implementations that are hard to change, he said in a presentation at the recent Spark Summit East 2017 in Boston.
Container implementations can be done by hand using scripting, he said, but this becomes more challenging as big data pipelines feature more components. Today, he said, Spark is often part of complex, orchestrated workloads that aren't necessarily easy to adapt to container methods.
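To make the hand-scripting approach concrete, here is a minimal sketch, not anything BlueData or Phelan described, of a Python helper that assembles the Docker CLI commands for a small Spark standalone cluster. The image name, the container names, the network name and the `/opt/spark` install path are all assumptions chosen for illustration; the helper only builds the commands and does not run them.

```python
from typing import List


def spark_container_commands(
    image: str, workers: int, network: str = "spark-net"
) -> List[List[str]]:
    """Build docker CLI commands for one Spark standalone master and N workers.

    Nothing is executed here; the caller decides how to run the commands.
    """
    # Assumed location of Spark inside the image; adjust for the image in use.
    spark_class = "/opt/spark/bin/spark-class"

    # A user-defined network lets containers resolve each other by name.
    commands = [["docker", "network", "create", network]]

    # One master container with a fixed hostname the workers can point at.
    commands.append([
        "docker", "run", "-d",
        "--name", "spark-master", "--hostname", "spark-master",
        "--network", network,
        image, spark_class, "org.apache.spark.deploy.master.Master",
    ])

    # One container per worker, each registering with the master on port 7077.
    for i in range(1, workers + 1):
        commands.append([
            "docker", "run", "-d",
            "--name", f"spark-worker-{i}",
            "--network", network,
            image, spark_class, "org.apache.spark.deploy.worker.Worker",
            "spark://spark-master:7077",
        ])
    return commands


if __name__ == "__main__":
    # "example/spark:latest" is a hypothetical image name.
    for cmd in spark_container_commands("example/spark:latest", workers=2):
        print(" ".join(cmd))
```

Even this toy version hard-codes names, ports and paths. Adding notebooks, storage mounts or multiple Spark versions multiplies those details, which is the kind of growth in complexity Phelan pointed to.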
"Today you have to navigate a river of container managers," he told conference attendees. That is one of the issues that the BlueData software looks to address, he continued.
A path to elastic scaling
Phelan said that recent updates to the BlueData platform address the implementation needs of Spark-using data scientists, such as those at the Advisory Board. BlueData's latest release, unveiled earlier this month, supports common Spark tools -- such as JupyterHub, RStudio Server and Zeppelin programming notebooks -- as preconfigured Docker images. The goal is to bring more DevOps-style agility to data science.
Behind the use of Docker containers and other microservices methods is a drive to automate more aspects of application deployment. These methods are often a path to elastic scaling, which allows administrators to build up and break down compute resources as workloads require.
That is an increasingly common requirement in cloud computing as well as in on-premises implementations, and something Spark and Hadoop may need to embrace if their use is to widen in the enterprise.
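As a rough illustration of what elastic scaling means in practice, the following sketch shows one simple way a scaling decision could be made: size the worker pool to the pending workload, within fixed bounds. The function name, the parameters and the sizing rule are hypothetical and are not drawn from BlueData, Spark or Hadoop.

```python
import math


def desired_workers(
    pending_tasks: int,
    tasks_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 10,
) -> int:
    """Return how many workers to run for the current backlog.

    Hypothetical rule: one worker per `tasks_per_worker` pending tasks,
    clamped to the range [min_workers, max_workers].
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # The pool grows with the backlog and shrinks back when the queue drains.
    for backlog in (0, 9, 1000):
        print(backlog, desired_workers(backlog, tasks_per_worker=4))
```

A real deployment would hand the resulting number to whatever starts and stops containers; the point here is only that the compute footprint follows the workload rather than staying fixed.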