First developed to make applications easier to deploy, manage and scale, container technologies nonetheless have seen limited use in big data systems due to earlier struggles managing application state and data. But all that is beginning to change, promising more agility and flexibility for these systems.
Containers can be viewed as part of a continuum of infrastructure simplification situated between traditional monolithic infrastructure and serverless functions, said John Gray, CTO at Infiniti Consulting, an InterVision company. Compared to monolithic infrastructure deployments, serverless infrastructure could provide more agility and reduce costs in the short run, while greatly easing management tasks in the long run.
But "containers can still do many things better than serverless functions," Gray added. Refactoring very large monolithic applications, for example, is still better suited to containers. Moving this type of application into a serverless production environment involves managing many more pieces than a legacy monolithic deployment does. Containers also give developers more control over the virtual environment. It may take years before it becomes feasible to roll out bigger apps to serverless platforms, Gray said.
"Containers enable almost immediate provisioning of complex systems in a repeatable and reliable fashion," said Monte Zweben, co-founder and CEO of data platform provider Splice Machine. With orchestration tools like Kubernetes, containers can be continuously monitored, and they can proactively take self-healing actions. For instance, if one of the nodes in a large distributed data system is nonresponsive and thus holding up computation, the system can proactively kill the bad container and add a node that picks up where the old one left off.
Beyond specific apps, big data container technologies can help support an IT strategy that drives intelligent automation and real-time decision-making, said Gordon Van Huizen, vice president of platform strategy at Mendix, maker of a low-code software development platform. "It fits within the broader context of democratized data science, iterative experimentation, individual and team autonomy -- as well as rapid scaling -- and is a foundational component in support of those aims," he said.
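The self-healing behavior Zweben describes, in which an unresponsive node is killed and a replacement picks up where it left off, can be illustrated with a minimal reconciliation loop. This is only a sketch under stated assumptions, not how Kubernetes or Splice Machine implements it: the `Node` and `Supervisor` classes, the `checkpoint` field and the `is_responsive` callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical worker in a distributed data system."""
    name: str
    partition: int       # slice of the work this node owns
    checkpoint: int = 0  # last unit of work known to be completed


class Supervisor:
    """Toy stand-in for an orchestrator's monitoring loop (an assumption)."""

    def __init__(self, is_responsive: Callable[[Node], bool]):
        self.is_responsive = is_responsive
        self.nodes: Dict[str, Node] = {}
        self._ids = count(1)

    def add_node(self, partition: int, checkpoint: int = 0) -> Node:
        node = Node(f"node-{next(self._ids)}", partition, checkpoint)
        self.nodes[node.name] = node
        return node

    def reconcile(self) -> List[Tuple[str, str]]:
        """Replace every unresponsive node; return (old, new) name pairs."""
        replaced = []
        # Iterate over a snapshot because the dict is modified in the loop.
        for node in list(self.nodes.values()):
            if not self.is_responsive(node):
                del self.nodes[node.name]  # "kill the bad container"
                # The replacement inherits the partition and checkpoint,
                # so it resumes where the old node left off.
                new = self.add_node(node.partition, node.checkpoint)
                replaced.append((node.name, new.name))
        return replaced
```

A real orchestrator would run such a check continuously and would rely on externally stored checkpoints; here the checkpoint is simply copied in memory to keep the example short.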
Making data streaming more agile
The tools used for provisioning big data apps in containers are generally the same as those used for other types of applications -- Docker, Kubernetes, Jenkins and GitLab among them. But it's also important to consider using associated tools for data streaming, Van Huizen noted.
Not only is streaming the optimal way to perform real-time analytics, it offers a much more flexible and loosely coupled approach to querying and processing data more broadly. Big data containers, for example, can be combined with the ability to create and manage data pipelines between source systems, "data sink" repositories and processing nodes. Pairing containers with open source streaming and pipeline technologies like Apache Kafka, Beam and Airflow, "will rapidly transform the way we build systems that leverage big data," Van Huizen said.
Data governance tools provider Collibra uses containers together with real-time data pipelines and streaming apps for high-volume data that requires real-time event response, according to Jeff Burk, the company's senior vice president of engineering. As a result, a container holding the big data applications and settings in a consolidated environment can be allocated faster than the traditional process of deploying these tools on servers. Yet the complexity of creating the containerized environments has presented a challenge.
Although big data container technologies can ease distributed processing, some of their value is lost if they're dependent on a monolithic data store. The system might appear to be modular, but each service will have to write to the same shared store.
When scaling a relational database, the common practice has been to scale vertically by adding larger storage devices and increasing the size of the database host. Newer NoSQL databases like MongoDB, Cassandra and Couchbase Server are built with horizontal scalability as the foundational pattern for scaling, without the need for containers. When data volumes increase, new nodes can be joined to the database cluster, and both existing data and incoming requests are redistributed across the old and new nodes.
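One common way to redistribute data when a node joins a cluster is consistent hashing, in which only the keys that fall into the new node's share of the hash space have to move. The sketch below is a generic illustration and makes no claim about how MongoDB, Cassandra or Couchbase Server are actually implemented; the `HashRing` class, the virtual-node count and the node names are assumptions chosen for the example.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several points on the ring so that
        # keys spread more evenly across the cluster.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First ring point at or after the key's position, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"record-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out horizontally
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

Every key that changes owner moves to the newly added node, and the remaining keys stay where they were, which is what keeps scale-out comparatively cheap.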
In addition, Docker and Kubernetes paired with a cloud service like AWS, Google Cloud or Azure can ease deployment of a high-availability database cluster, Burk explained. Tools like Presto, Spark and Kafka, he said, were built with high-volume distributed computing in mind and are more friendly toward Python-based tools.
The YARN controller used for Hadoop clusters limits users to tools focused on Java. By using Kubernetes as the orchestration layer, it's possible to run data science tools like Jupyter, TensorFlow, PyTorch or custom Docker containers on the same cluster.
Challenges to recognize, contain and resolve
Big data apps tend to be state-heavy, creating a challenge when using containers originally developed for stateless web apps. "There is still a significant amount of engineering that needs to be done in both the ecosystems and communities to address gaps in the containers space, which largely originated from the world of stateless or state-externalized web applications," said Vinod Vavilapalli, director of engineering at big data platform maker Cloudera. A common practice is to scale down stateless apps when they're not being used. But that tends to require extensive engineering to work well for stateful apps, he noted.
Another key issue is the persistence of data within the container. "Strides have been made with data persistence, but merging this with big data takes care," said Todd Matters, chief architect and co-founder of RackWare, a hybrid cloud management platform provider. He suggested using microservices coupled with big data containers. "The microservices architecture can help tie the big data container processes together and, more importantly, help manage the data for both persistent and stateless data," he explained.
On the flip side, he added, the ability to wipe away the state of temporary metadata associated with big data processing can actually be beneficial. Data stored directly in containers typically isn't permanent, making it convenient to hold temporary metadata there. When that process is done, it's cleaned up automatically.
In addition, management and infrastructure issues surrounding big data container technologies need to be addressed. Security and networking models in big data systems are different from those of traditional web apps, Vavilapalli said. In the big data space, security is typically handled through Kerberos and short-lived job tokens, while containers use WebAuth and certificates. Similarly, big data environments tend to be engineered for high-throughput data movement between nodes, racks and clusters -- not as big a consideration when it comes to containers.
Big data management tooling is also different from container management technology. "Containers are their own new world, and bridging these two will take time," Vavilapalli said. "Organizations in the interim will have to decide how they can put the duct tape around these systems when using them together till the right and comprehensive solutions emerge."
In the meantime, companies will need to think about writing instrumentation for these systems that differs from what's needed in conventional container deployments on servers, Splice Machine's Zweben said. With traditional containers, instrumentation could be as easy as pinging the container to make sure it's alive. In big data systems, these tests also have to ensure that the containers can perform their function.
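The distinction between a container that is alive and one that can do its job can be sketched as two separate checks. The function names, the `ping` and `run_probe` callbacks, and the three status strings below are hypothetical and introduced only for illustration; in practice the probe might be a small known-answer query against the data system.

```python
from typing import Any, Callable


def liveness_check(ping: Callable[[], bool]) -> bool:
    """Traditional check: does the container respond at all?"""
    try:
        return bool(ping())
    except Exception:
        return False


def functional_check(run_probe: Callable[[], Any], expected: Any) -> bool:
    """Big data check: does a known-answer probe return the right result?"""
    try:
        return run_probe() == expected
    except Exception:
        return False


def health(ping: Callable[[], bool],
           run_probe: Callable[[], Any],
           expected: Any) -> str:
    """Combine both checks into a single status string."""
    if not liveness_check(ping):
        return "dead"
    if not functional_check(run_probe, expected):
        return "degraded"  # alive, but unable to perform its function
    return "healthy"
```

A container that answers pings but fails the probe is reported as "degraded", which is the case a ping-only check would miss.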
There are also challenges surrounding isolation of different users in multi-tenant environments. "You have to make sure tenants of data systems are sufficiently isolated from each other so that one neighbor does not interfere with another," Zweben explained.
The concern is that multiple containers can run on a single physical system, and data-intensive workloads could adversely affect the use of CPU, memory or network by other containers on the same hardware. A good practice is to provide network isolation among tenants and use role-based resource management so that users are granted computational resources like CPU and memory in accordance with their roles.
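Role-based resource management of the kind described above could look roughly like the following sketch, in which each role maps to a CPU and memory ceiling and a request is admitted only if the user stays within it. The role names, quota values and the `ResourceManager` class are assumptions made for the example, not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Quota:
    cpu_cores: float
    memory_gib: float


# Hypothetical roles and limits, chosen only for illustration.
ROLE_QUOTAS: Dict[str, Quota] = {
    "analyst": Quota(cpu_cores=4, memory_gib=16),
    "data_engineer": Quota(cpu_cores=16, memory_gib=64),
}


class ResourceManager:
    """Tracks per-user usage and admits requests within the role's quota."""

    def __init__(self, role_quotas: Dict[str, Quota]):
        self.role_quotas = role_quotas
        self.usage: Dict[str, Quota] = {}  # user -> resources in use

    def request(self, user: str, role: str,
                cpu_cores: float, memory_gib: float) -> bool:
        quota = self.role_quotas[role]
        used = self.usage.get(user, Quota(0, 0))
        wanted = Quota(used.cpu_cores + cpu_cores,
                       used.memory_gib + memory_gib)
        # Reject the request if either dimension would exceed the quota.
        if (wanted.cpu_cores > quota.cpu_cores
                or wanted.memory_gib > quota.memory_gib):
            return False
        self.usage[user] = wanted
        return True
```

For example, an analyst who already holds 3 cores would be refused a further 2-core request, while a data engineer making the same request would be admitted. Network isolation among tenants is a separate concern that this sketch does not cover.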