Docker use cases – How to handle big data with Docker

Data Science | Published May 11, 2018

Docker containers provide a way to package applications with everything needed to run them, including base operating system images, databases, libraries, and binaries. Because containers running on a Docker engine interact only with the kernel of the host OS, containerized apps behave the same way regardless of the underlying infrastructure. You can also run multiple apps on a single host machine, which can lead to impressive cost savings by letting enterprises run more apps on existing hardware.
The statistics for Docker are telling with regard to its popularity and potential:

  • Docker adoption increased by 40 percent between 2016 and 2017, and the latest numbers show that 3.5 million applications have been placed in Docker containers.
  • Research firm 451 Research predicts a compound annual growth in container market revenue of 35 percent until 2021.

The rest of this article gives an overview of use cases where Docker helps to handle Big Data: data sets that are fast-moving, voluminous, and drawn from disparate sources in a wide variety of formats. For more info on containers, check out this Docker wiki page.

Docker & big data use cases

Isolate big data tools

Alongside the hardware used to set up and manage Big Data clusters is a set of tools that developers and data scientists use to complete processing jobs or other tasks on Big Data. The problem that often arises is that each developer wants to use their own specific tools, which means distributing a whole gamut of tools and their dependencies to each machine within a Big Data cluster.
With a large number of developers, dependency issues will quickly arise, and one tool’s specific requirements can cause another tool to malfunction.
Docker offers a way to overcome these dependency issues by allowing you to build a Big Data ecosystem in which each tool is self-contained, along with all of its dependencies. Developers can use their own tools for different jobs without worrying about conflict with other tools because each tool is isolated within a container.
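As a minimal sketch of this isolation, the Python snippet below builds `docker run` commands for two tools that need conflicting runtimes. The script names and the `/srv/datasets` directory are hypothetical; `python:2.7` and `python:3.6` are used only as examples of two images that could not share one host installation cleanly.

```python
import shlex


def docker_run_command(image, tool_args, data_dir, mount_point="/data"):
    """Build an argv list that runs one tool in its own container."""
    return [
        "docker", "run",
        "--rm",                                # remove the container when the tool exits
        "-v", f"{data_dir}:{mount_point}:ro",  # mount the shared dataset read-only
        image,
        *tool_args,
    ]


# Hypothetical jobs: each tool brings its own runtime and libraries.
legacy_job = docker_run_command(
    "python:2.7", ["python", "/data/legacy_report.py"], "/srv/datasets"
)
modern_job = docker_run_command(
    "python:3.6", ["python", "/data/train_model.py"], "/srv/datasets"
)

# Print the commands instead of executing them, so the sketch runs without Docker.
print(shlex.join(legacy_job))
print(shlex.join(modern_job))
```

On a machine with Docker installed, passing either list to `subprocess.run(cmd, check=True)` would execute the tool; neither container sees the other's dependencies.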

Run scheduled analytics jobs

A scheduled analytics job is a type of automated data manipulation task that you can run either on a recurring schedule or at a particular time. These jobs are very useful for Big Data, which inundates organizations at high velocity and makes some form of automation necessary to keep up. Docker containers add to the convenience of scheduled jobs by allowing you to run them without manually setting them up on each node in a Big Data cluster.
For example, Chronos is a fault-tolerant job scheduler running on top of Apache Mesos that enables the launching of Docker instances into a Mesos cluster, creating scheduled analytics jobs for those instances. Within Mesos, you can run distributed Big Data applications, such as Hadoop or Spark.
With Chronos and Mesos, your developers or sysadmins can schedule Docker containers to run ETL, batch, and analytics applications on a recurring or time-specific basis, all without the need for any manual setup on cluster nodes.
Aside from the convenience of using Docker for scheduled analytics, the Chronos job scheduler also shows you a job dependency graph to help track dependencies for different jobs.
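To make the idea concrete, here is a rough sketch of what a Chronos job definition for a Docker container might look like, built as a Python dictionary and serialized to JSON. The field names (`name`, `command`, `schedule`, `container`, `cpus`, `mem`) follow the Chronos REST API as commonly documented, but they should be checked against the Chronos version in use. The job name, image, and command are hypothetical.

```python
import json


def chronos_docker_job(name, image, command, start, interval, repeats=None):
    """Return a Chronos-style job definition as a dict.

    The schedule uses ISO 8601 repeating-interval notation:
    R[n]/<start time>/<period>. Leaving n empty repeats indefinitely.
    """
    count = "" if repeats is None else str(repeats)
    return {
        "name": name,
        "command": command,
        "schedule": f"R{count}/{start}/{interval}",
        "container": {"type": "DOCKER", "image": image},
        "cpus": 0.5,   # illustrative resource requests
        "mem": 512,
    }


# Hypothetical nightly ETL job that repeats every 24 hours.
job = chronos_docker_job(
    name="nightly-etl",
    image="example/etl:latest",
    command="python /app/etl.py",
    start="2018-05-12T02:00:00Z",
    interval="PT24H",
)

print(json.dumps(job, indent=2))
```

The resulting JSON would be submitted to the Chronos REST API, and Mesos would then launch the container on whichever cluster node has capacity.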

Provision big data development environments

The ability to provision a Big Data environment on a local computer is useful for developers who want to learn more about the various technologies and tools needed to become proficient with Big Data ecosystems. After all, within a development context, learning by doing is the best way to gain knowledge. Docker can assist with this by enabling the creation of a multiple-node cluster on a single host machine, replicating the typical Big Data setup.
For example, Ferry is a tool that lets you run multiple container nodes on a single host machine using Docker. Developers can define, run, and deploy big data stacks using either YAML, a human-friendly data serialization standard, or JSON. The following definition creates a Big Data stack containing a 5-node Hadoop cluster and a single Linux client to interact with Hadoop:
backend:
   - storage:
        personality: "hadoop"
        instances: 5
        layers:
           - "hive"
connectors:
   - personality: "hadoop-client"
After defining this Big Data stack, you can run it in Docker. Start the Ferry server by running sudo ferry server in your terminal, then launch the stack with ferry start hadoop.
The ability to provision a Big Data stack locally like this is useful for developers who need a local environment for development purposes, but it’s also good for data scientists who want to experiment with Big Data technologies and further their knowledge.
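Since a stack can also be written in JSON, the following Python sketch shows one way to generate the equivalent definition programmatically. The top-level keys `backend`, `layers`, and `connectors` are assumptions based on Ferry's documented stack layout and should be verified against the Ferry version you use.

```python
import json

# Assumed Ferry stack layout: a storage backend plus a client connector.
stack = {
    "backend": [
        {
            "storage": {
                "personality": "hadoop",
                "instances": 5,        # five Hadoop nodes on one host
                "layers": ["hive"],    # Hive layered on top of Hadoop
            }
        }
    ],
    "connectors": [
        {"personality": "hadoop-client"}  # Linux client for interacting with Hadoop
    ],
}

stack_json = json.dumps(stack, indent=2)
print(stack_json)
```

Generating the definition from code makes it easy to vary the number of instances when experimenting with different cluster sizes.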

Build a big data microservices architecture

Docker facilitates the transition to a microservices architecture for Big Data applications. Microservices are independent, modular services, and Docker containers provide a natural platform on which to implement such a setup for Big Data apps.
The main benefits of microservices for Big Data include easier application scalability and better quality data. Ingesting Big Data results in many possible points of failure that can lead to lower data quality. With microservices, development teams have an easier job in testing and maintaining services, reducing the chances of poor data quality.
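As an illustration of why small services are easier to test, the sketch below shows the core of a hypothetical validation service that would sit in an ingestion pipeline and run in its own container. The record schema (`sensor_id`, `timestamp`, `value`) is invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

REQUIRED_FIELDS = ("sensor_id", "timestamp", "value")  # hypothetical schema


def validate_record(record: Dict) -> List[str]:
    """Return a list of problems found in one record (empty if it is valid)."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "value" in record and not isinstance(record["value"], (int, float)):
        errors.append("value is not numeric")
    return errors


def split_valid(records: Iterable[Dict]) -> Tuple[List[Dict], List[Tuple[Dict, List[str]]]]:
    """Separate clean records from rejected ones, keeping the reasons for rejection."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected


batch = [
    {"sensor_id": "a1", "timestamp": "2018-05-11T10:00:00Z", "value": 21.5},
    {"sensor_id": "a2", "timestamp": "2018-05-11T10:00:05Z", "value": "n/a"},
    {"sensor_id": "a3", "value": 19.0},
]
valid, rejected = split_valid(batch)
print(len(valid), "valid,", len(rejected), "rejected")
```

Because the service does one thing, it can be unit-tested and redeployed on its own, without touching the rest of the pipeline.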

Build a multi-cloud distributed big data processing system

The typical obstacles for companies looking to extract meaningful information from large data volumes are the need to provision a powerful data processing system and the requirement to install and use complex big data analytics tools.
As described in this paper, one possible use case for Docker is building a container-based big data processing system that spans multiple clouds and that anyone can use, with Docker Swarm orchestrating the containers.
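The paper's own setup is not reproduced here. As a loose sketch of the general idea, the snippet below builds `docker service create` commands (Docker's swarm mode) that pin replicas to nodes in a particular cloud. It assumes each node has been given a hypothetical `cloud` label beforehand (for example with `docker node update --label-add cloud=aws <node>`), and the image name is invented.

```python
def swarm_service_command(name, image, replicas, cloud_label=None):
    """Build an argv list for `docker service create` in swarm mode."""
    cmd = ["docker", "service", "create", "--name", name, "--replicas", str(replicas)]
    if cloud_label is not None:
        # Restrict placement to nodes carrying the (assumed) label cloud=<value>.
        cmd += ["--constraint", f"node.labels.cloud=={cloud_label}"]
    cmd.append(image)
    return cmd


# Hypothetical worker pools, one per cloud provider.
aws_workers = swarm_service_command("workers-aws", "example/spark-worker:latest", 4, "aws")
gcp_workers = swarm_service_command("workers-gcp", "example/spark-worker:latest", 2, "gcp")

print(" ".join(aws_workers))
print(" ".join(gcp_workers))
```

Run on a swarm manager node, each command would ask the swarm to keep the requested number of worker containers running on nodes in that cloud.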

Wrap up

Docker’s security, performance, and the speed at which it lets you create multi-node Hadoop clusters make it an ideal fit for Big Data workflows. Docker has particular advantages over Big Data ecosystems that use virtual machines: containers are much more lightweight, and setting up Hadoop clusters or other Big Data environments with them takes much less time and effort.