Docker containers in Kubernetes clusters give IT teams a new framework for deploying big data systems, making it easier to spin up additional infrastructure to handle new data sources. Containers also make it easier to tailor data management to specific data science projects, such as building machine learning models, and they simplify the provisioning of data infrastructure when the enterprise is ready to put those models into production.
And Docker containers provide a consistent packaging mechanism for applications, developer code and tools like Jupyter Notebooks.
"This enables consistency between dev, test, staging and production, or when working across private or public cloud environments," said Brian Gracely, director of product strategy at Red Hat.
This collection of data containers is typically managed using Kubernetes, which provides a highly scalable platform for deploying and running big data applications. Kubernetes can be the consistent platform to run almost any big data, AI or machine learning application workload. This can reduce the need to maintain siloed clusters for each type of big data framework. Emerging Kubernetes technologies, such as the Operator Framework, simplify how big data frameworks and applications are managed, enabling highly automated operations.
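To make the "consistent platform" point concrete, a big data batch job can be described declaratively and handed to Kubernetes like any other workload. The sketch below builds a minimal Kubernetes Job manifest in Python; the image name, job name and resource figures are illustrative assumptions, not recommendations.

```python
import json

def batch_job_manifest(name, image, workers=2):
    """Build a minimal Kubernetes Job manifest (as a dict) for a
    containerized batch-analytics workload. Values are illustrative."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": workers,  # run N worker pods side by side
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,  # hypothetical image reference
                        "resources": {
                            "requests": {"cpu": "2", "memory": "4Gi"},
                        },
                    }],
                    "restartPolicy": "Never",
                },
            },
        },
    }

manifest = batch_job_manifest(
    "wordcount", "registry.example.com/analytics/wordcount:1.0")
print(json.dumps(manifest, indent=2))
```

The same manifest works unchanged whether it is applied to a dev, staging or production cluster, which is what reduces the need for siloed clusters per framework.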
But deploying big data systems in containers can also involve more configurations and settings that must be managed. Here are five key technology, deployment and management issues that businesses should be aware of before big data environments are put into containers:
1. Explore the data container communities
Most data container platforms and Kubernetes are completely open source technologies. Consequently, they have large, vibrant communities supporting and using them.
"The first thing anyone should do is check around the community message boards, GitHub sites, Slack channels and blogs to see if the thing they want to accomplish has already been done before," Gracely said. There's a good chance that a lot of experience already exists.
Projects such as Radanalytics, Kubeflow, TensorFlow, PyTorch, JupyterHub and OperatorHub are great places to get started or find existing integrations to make your use case successful. There are also several great free online tutorials about OpenShift, Kubernetes and containers, Gracely said.
It's also worth looking at how other companies have spun up their big data systems in containers.
The ChRIS project at Massachusetts Open Cloud is solving health issues using a large volume and wide variety of data sources. UPS is running analytics on the edge for delivery trucks. And ACI Worldwide is using containers for doing analytics for credit card fraud detection.
2. Plan for security
Containerization is a relatively new technology that is being adopted quickly, and that pace can introduce a variety of new security vulnerabilities. As a result, organizations need to train people not only to deploy data containers, but also to identify these new vulnerabilities.
"Security and monitoring should never be an afterthought when deploying big data apps on containers," said Ahyoung An, senior product marketing manager at MuleSoft.
An said it is crucial to identify and mitigate issues within containers as they occur, because a container breach could mean unauthorized access to an app, exposure of sensitive data or a compromise of key systems.
The best way to ensure security-by-design is to use the application network that naturally emerges when a business packages up its applications, data and devices as standardized APIs to make them pluggable and reusable. These APIs can then be deployed in data containers, either together or separately, depending on business needs. When they are plugged into an application network, containers can be designed to show exactly what data each application might expose, so that exposure can be turned off in the case of a breach.
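The "turn it off in the case of a breach" idea can be sketched in a few lines: a containerized data API behind a kill switch that, when flipped, stops the service from exposing its data. This is a minimal stdlib sketch, not a production gateway, and the endpoint name and payload are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

EXPOSED = {"enabled": True}  # kill switch: flip to False on a suspected breach

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve data only while the API is enabled on the application network
        if self.path == "/data" and EXPOSED["enabled"]:
            body, status = b'{"rows": 42}', 200
        else:
            body, status = b'{"error": "endpoint disabled"}', 403
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve():
    # Bind an ephemeral port and serve in a background thread
    server = HTTPServer(("127.0.0.1", 0), DataAPI)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = serve()
port = server.server_address[1]
print(urllib.request.urlopen(f"http://127.0.0.1:{port}/data").status)  # 200
EXPOSED["enabled"] = False  # breach detected: further requests get 403
```

In a real deployment the switch would live in the API gateway or service mesh rather than in the service itself, but the principle is the same: because the container only exposes a known API surface, that surface can be disabled precisely.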
3. Plan for container APIs
With containerization, users can package their big data applications with the tools and resources they require. Any microservices used can be deployed within the data container, creating an isolated ecosystem that cannot be affected by downstream processes.
Once created, these containers can be plugged into an application network as APIs and moved across any cloud or data center.
"These microservices containers further isolate your big data application from becoming a single point of failure, providing logical entry points to different aspects of your application," An said.
4. Integrate big data and app infrastructure
Deploying big data apps on containers offers advantages to organizations that are looking to increase isolation, flexibility and portability. Having a single unified interface to cluster environments is a draw for using Kubernetes. Additionally, when a Kubernetes cluster runs non-big data workloads but isn't at full utilization, data science teams and developers can tap into that unused capacity to increase overall resource efficiency across the organization, said Christopher Crosbie, product manager at Google Cloud.
Developers may want to explore best practices for creating custom configurations at the OS level to make it easy to support both application and data containers on the same infrastructure. They will also want to find ways to package containers with libraries for applications, which is handy for more targeted upgrades and migrating workloads.
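One common pattern for packaging containers with libraries is to pin the data libraries in the image itself, so an upgrade becomes a rebuild rather than an in-place change to shared infrastructure. The sketch below assumes a Python analytics job; the base image, library versions and file names are illustrative assumptions.

```dockerfile
# Illustrative only: base image, versions and file names are assumptions
FROM python:3.11-slim

# Pin the analytics libraries in their own layer, so a targeted upgrade
# rebuilds from this line down rather than touching a shared cluster
RUN pip install --no-cache-dir pandas==2.1.4 pyarrow==14.0.2

# The application code rides on top of the pinned library layer
COPY etl_job.py /app/etl_job.py
CMD ["python", "/app/etl_job.py"]
```

Because the libraries travel inside the image, the same workload can migrate between on-premises and cloud clusters without reinstalling anything on the hosts.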
5. Watch out for infrastructure sprawl
"Big data systems that are containerized and use Kubernetes don't magically solve all issues," Crosbie said.
For example, if data teams are using YARN, adding Kubernetes for big data containerization represents another cluster environment to manage. Additionally, audit logs, application networking and distributed stateful data represent additional areas that teams should be aware of before deploying big data systems in containers.