This post was written by Artem Harutyunyan and Jie Yu of Mesosphere.
Containers have brought great efficiencies to the process of configuring and deploying applications, but they don't always bring the same levels of efficiency to other aspects of IT operations. For example, as they proliferate, containers—especially Docker containers—can become a bear when it comes to datacenter storage and network capacity.
It's a problem that's somewhat counterintuitive: Containers are lauded for their ability to run services using a fraction of the resources of a typical virtual machine, so how can they possibly be considered inefficient? The answer lies in the way Docker images are stored and downloaded. Traditionally, each image stored in a repository includes all the files associated with its layers, and updating a Docker image on a host machine requires downloading the entire image from the repository.
This might not seem like too big a problem when you're talking about a handful of containers for a small web app, but imagine the situation when you're talking about hundreds, thousands or even millions of containers. Repository sizes balloon as more containers pop up and as existing ones grow larger with each update. The network becomes congested as gigabytes' worth of Docker downloads move across the pipe at any given time.
Companies running large numbers of containers, such as Twitter, have already experienced this phenomenon. It will only become more common as more companies adopt containers and begin the process of scaling out large container repositories.
We think one solution to the problem of container overload might lie in a piece of technology called the CernVM File System (CernVM-FS). It was developed by CERN, the European Organization for Nuclear Research, with contributions from the worldwide high-energy physics community and, in particular, from Fermilab in the United States. Its purpose is to help distribute the 2 terabytes(!) of software that CERN's scientists develop each year, software that, in some cases, might be distributed over hundreds of thousands of computers and be updated multiple times per week. CernVM-FS uses a combination of extensive indexing, deduplication, caching and geographic distribution in order to minimize the number of components associated with each download, and to speed up the pieces that must be downloaded.
Mesosphere and CERN are now investigating how to integrate CernVM-FS with Apache Mesos, in order to determine how it might apply to container downloads and how much more efficiency we can add to large container environments.
A deeper dive into how CernVM-FS works
CernVM-FS was conceived in 2008, when researchers at CERN looked into hardware virtualization for the same underlying reason that people today are looking into containers: the deployment of applications. Instead of creating images or packages, the idea was to use a globally distributed file system where scientists can install their software once on a web server, and then access it from anywhere in the world.
The first observation in favor of a distributed file system is that at any given point in time, only a tiny fraction of all the available files are actually being accessed. For instance, in order to run a web server, you only need a few libraries from your operating system (e.g., glibc and openssl). A distributed file system that hosts the operating system serves such files only as they are requested; in fact, CernVM-FS downloads and locally caches only what is required, when it is required. The choice of HTTP as a download protocol allows for piggybacking on the available web caching infrastructure (Akamai, CloudFront, Squid and NGINX proxy servers, for example).
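The fetch-on-first-access behavior described above can be illustrated with a minimal Python sketch. It is not CernVM-FS code: the LazyFileCache class, the REMOTE dictionary and the file paths are hypothetical stand-ins, and a real client would issue HTTP requests where this example does a dictionary lookup.

```python
from typing import Callable, Dict


class LazyFileCache:
    """Fetch a file from the remote store on first access, then serve it locally."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                    # hypothetical downloader (e.g. an HTTP GET)
        self._cache: Dict[str, bytes] = {}     # local copies of files already fetched
        self.downloads = 0                     # number of remote fetches performed

    def read(self, path: str) -> bytes:
        if path not in self._cache:
            self._cache[path] = self._fetch(path)
            self.downloads += 1
        return self._cache[path]


# Stand-in for a remote repository that hosts a whole operating system.
REMOTE = {
    "/lib/libc.so": b"libc bytes",
    "/lib/libssl.so": b"openssl bytes",
    "/usr/share/doc/README": b"documentation that is never read at run time",
}

cache = LazyFileCache(REMOTE.__getitem__)
cache.read("/lib/libc.so")
cache.read("/lib/libssl.so")
cache.read("/lib/libc.so")      # served from the local cache, no new download
print(cache.downloads)          # only the two files that were actually used
```

The README is never requested, so it is never transferred; the repeated read of libc costs nothing beyond a local lookup.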
The second observation is that metadata (the information about the presence of a file, rather than its content) is a first-class citizen. More often than not, a file system hosting software is hammered by requests like "is libssl.so in /lib32? Is it in /lib64? Is it perhaps in /usr/lib32?". General-purpose distributed file systems have a hard time with these types of requests. CernVM-FS stores all the metadata in SQLite files that are downloaded and cached just like regular files. Millions of metadata requests are thus resolved locally, and speed-wise CernVM-FS feels almost like a local POSIX file system.
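To make the idea of locally resolved metadata concrete, here is a small sketch using Python's built-in sqlite3 module. The table layout, the hash values and the helper names (exists, find_library) are assumptions made for illustration; they are not the actual CernVM-FS catalog schema.

```python
import sqlite3

# In-memory stand-in for a catalog file that has already been downloaded.
# The schema below is hypothetical.
catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE entries (path TEXT PRIMARY KEY, content_hash TEXT)")
catalog.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [("/lib64/libssl.so", "a1b2c3"), ("/lib64/libc.so", "d4e5f6")],
)


def exists(path: str) -> bool:
    """Answer 'is this file present?' from the local catalog, with no network access."""
    row = catalog.execute(
        "SELECT 1 FROM entries WHERE path = ?", (path,)
    ).fetchone()
    return row is not None


def find_library(name, search_dirs):
    """Probe each directory in turn, the way a dynamic linker searches its path."""
    for directory in search_dirs:
        candidate = f"{directory}/{name}"
        if exists(candidate):
            return candidate
    return None


print(find_library("libssl.so", ["/lib32", "/lib64", "/usr/lib32"]))
```

Each failed probe (/lib32 in this case) is a single indexed lookup in a local database file rather than a round trip to a server.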
The third observation is that the software is only modified at the point of publication—all other consumers are read-only with high availability. Many previous file system designs have struggled with the tradeoffs described by the CAP theorem because they assume that a client's objective is always to read the most-recent version of data available. In this case, software must remain consistent while the application or container runs, so a single snapshot of the filesystem must be delivered throughout the run.
That led CERN to the use of content-addressable storage and Merkle trees. Similar to git, files are internally named by their (unique) cryptographic content hash. Multiple copies of the same file in several directories (such as the "ls" utility in different Ubuntu images) are automatically coalesced into a single file. The hashes of the SQLite file catalogs depend on the content hashes of the files they contain, so the root hash (a 160-bit integer) ultimately identifies an entire file system snapshot. To close the trust chain, a root hash is cryptographically signed and every client verifies every bit of received data to make sure it comes from the expected origin and has not been tampered with.
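The following Python sketch shows how content hashes and a Merkle tree fit together. It is an illustration under assumptions, not the CernVM-FS on-disk format: SHA-1 is used only because it produces a 160-bit digest, directories are modeled as nested dictionaries, and the snapshot function and the text encoding of directory entries are invented for this example.

```python
import hashlib
from typing import Dict, Union

# A directory maps names to file contents (bytes) or to nested directories.
Tree = Dict[str, Union[bytes, "Tree"]]


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def snapshot(tree: Tree, store: Dict[str, bytes]) -> str:
    """Store every file under its content hash and return the directory's hash.

    A directory's hash is computed over the names and hashes of its children,
    so changing any file changes every hash on the path up to the root.
    """
    lines = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, dict):
            lines.append(f"dir {name} {snapshot(node, store)}")
        else:
            file_hash = sha1_hex(node)
            store[file_hash] = node      # identical content lands on the same key
            lines.append(f"file {name} {file_hash}")
    return sha1_hex("\n".join(lines).encode())


store: Dict[str, bytes] = {}
image_a: Tree = {"bin": {"ls": b"ls binary"}, "etc": {"hostname": b"a"}}
image_b: Tree = {"bin": {"ls": b"ls binary"}, "etc": {"hostname": b"b"}}

root_a = snapshot(image_a, store)
root_b = snapshot(image_b, store)

print(len(store))         # the shared "ls" file is stored once
print(root_a != root_b)   # a one-byte difference yields a different root hash
```

Because the root hash depends on everything beneath it, signing that single value is enough to let a client check any file it later receives.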
Today, CernVM-FS delivers several hundred million files and directories of Large Hadron Collider experiment software to roughly 100,000 globally distributed computers.
Containerizers in Mesos
Mesos provides task containerization through several implementations of so-called "containerizers." Containerizers are responsible for isolating running tasks from one another, as well as for limiting the resources (such as CPU, memory, disk and network) available to each task. The functionality of a Mesos containerizer can easily be extended using so-called "isolators," which can be thought of as plugins that perform specific tasks (for example, mounting an external volume into a container).
Last but not least, a containerizer also provides a runtime environment for the task. It does so by allowing users to pre-package all the dependencies, along with the task itself, into a file system image. This image is then distributed around and is used for launching the task.
Mesos currently has implementations for several types of containerizers. The default option is the Mesos containerizer, which uses Linux namespaces and cgroups for task isolation and resource-usage controls, as well as a set of isolators for providing extra functionality. There is also the Docker containerizer, which uses Docker tools for fetching images and launching Docker containers.
The Mesos containerizer is currently being extended to support various image formats, such as Docker and AppC, with the idea being the creation of a "unified containerizer." This approach makes it very easy to add support for a new image format: With the unified containerizer, one just needs to implement a so-called "provisioner" for the new image format and reuse all the isolation bits that are already in place. For example, Mesos users could continue using Docker images, but fetching, isolation and other tasks would be carried out via Mesos rather than Docker tooling.
Integrating CernVM-FS and Mesos to tackle container image distribution
Its use of content-addressable storage, its security features and its proven scalability make CernVM-FS a very attractive candidate for solving the container-image distribution problem for the Mesos containerizer. To test this idea, we created a new CernVM-FS repository and populated it with an Ubuntu installation. We then implemented a Mesos container image provisioner based on CernVM-FS. Instead of downloading the entire image up front, it uses the CernVM-FS client to mount the remote image root directory locally. The provisioner takes as its input the name of the CernVM-FS repository (which internally is mapped to the URL of the CernVM-FS server) as well as a path within the repository that needs to be used as container image root.
This makes it possible to have multiple container images published within the same CernVM-FS repository. Essentially, the CernVM-FS repository plays the role of a secure and scalable container image registry.
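As a rough illustration of the provisioner's two inputs, the sketch below maps a repository name and an in-repository path to a local root directory. It assumes the usual CernVM-FS client convention of exposing repositories under /cvmfs/<repository name>; the function name image_rootfs, the repository name images.example.org and the image paths are hypothetical, and this is not the provisioner code itself.

```python
from pathlib import PurePosixPath

# Assumed mount point of the CernVM-FS client on the host.
CVMFS_MOUNT_ROOT = PurePosixPath("/cvmfs")


def image_rootfs(repository: str, image_path: str) -> PurePosixPath:
    """Return the local directory to use as the container's root file system."""
    relative = PurePosixPath(image_path.lstrip("/"))
    if ".." in relative.parts:
        raise ValueError("image path must stay inside the repository")
    return CVMFS_MOUNT_ROOT / repository / relative


# Two images published side by side in one (hypothetical) repository.
print(image_rootfs("images.example.org", "ubuntu/14.04"))
print(image_rootfs("images.example.org", "centos/7"))
```

Selecting a different image then amounts to passing a different path within the same mounted repository.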
From the point of view of the containerizer, nothing changes. It is still dealing with the local directory that contains the image directory tree, on top of which it needs to start the container. The big advantage, however, is that the fine-grained deduplication of CernVM-FS (based on files or chunks rather than layers, as with Docker) means we now can start a container without actually having to download the entire image. Once the container starts, CernVM-FS downloads the files necessary to run the task on the fly.
In our test, we start with an Ubuntu container and then run a single command in a shell. In a traditional scenario, we need to download the entire Ubuntu image, which is about 300MB. However, when we use the CernVM-FS provisioner, we only download what is actually needed by the task—in this case it is just under 6MB.
Because CernVM-FS uses content-addressable storage, we never need to download the same file twice. So if we go ahead and start another container (let's say a CentOS image) that runs a different command, we will only need to download the files required by the new command and we will reuse all the common dependencies (Bash or libc.so, for example) that we have downloaded previously with the Ubuntu image. In this model, there is no notion of a container layer anymore; deduplication is happening at a much finer level.
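The effect on download volume can be sketched with a client cache keyed by content hash. Everything here is illustrative: the ContentCache class, the publish helper, the file sizes and the paths are made up, and the example simply assumes that the two images contain byte-identical copies of Bash and libc.

```python
import hashlib
from typing import Dict


class ContentCache:
    """Client-side cache keyed by content hash rather than by image or path."""

    def __init__(self, remote: Dict[str, bytes]) -> None:
        self.remote = remote               # hash -> bytes, stand-in for the server
        self.local: Dict[str, bytes] = {}
        self.bytes_downloaded = 0

    def fetch(self, content_hash: str) -> bytes:
        if content_hash not in self.local:
            data = self.remote[content_hash]
            self.local[content_hash] = data
            self.bytes_downloaded += len(data)
        return self.local[content_hash]


def publish(files: Dict[str, bytes], remote: Dict[str, bytes]) -> Dict[str, str]:
    """Upload files under their content hash and return a path -> hash catalog."""
    catalog = {}
    for path, data in files.items():
        content_hash = hashlib.sha1(data).hexdigest()
        remote[content_hash] = data
        catalog[path] = content_hash
    return catalog


remote: Dict[str, bytes] = {}
bash, libc = b"x" * 1000, b"y" * 2000      # assumed identical in both images
ubuntu = publish({"/bin/bash": bash, "/lib/libc.so": libc, "/bin/ls": b"u" * 100}, remote)
centos = publish({"/usr/bin/bash": bash, "/usr/lib64/libc.so": libc, "/usr/bin/cat": b"c" * 50}, remote)

cache = ContentCache(remote)
for path in ("/bin/bash", "/lib/libc.so", "/bin/ls"):
    cache.fetch(ubuntu[path])
after_ubuntu = cache.bytes_downloaded

for path in ("/usr/bin/bash", "/usr/lib64/libc.so", "/usr/bin/cat"):
    cache.fetch(centos[path])
after_centos = cache.bytes_downloaded

print(after_ubuntu, after_centos - after_ubuntu)
```

The second container pays only for the one file the first did not already bring in, even though the shared files live at different paths in a different image.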
We are also planning on adding support to the provisioner for mounting arbitrary CernVM-FS catalogs using corresponding checksums. This will make it really easy for developers to iterate quickly when working on container images and it will also make it easy for operators to switch between various container image versions.
Just in time for the container deluge
The teams at CERN and Mesosphere are very excited about what the integration of CernVM-FS and Apache Mesos could mean for the broader IT community as application containers become commonplace. If companies and institutions are going to fundamentally improve the way they architect applications, deploy code and operate datacenters, they'll need to launch, kill, update and otherwise manage containers at a scale far greater than most are doing today. A tight integration between CernVM-FS and Mesos can go a long way toward ensuring that storage capacity and network bandwidth don't become bottlenecks along the path toward mass containerization.