This is the first part of a blog series on Networking for Docker containers. The second part covers the basics of Service Discovery, how it has evolved over time, along with the evolution in computing and virtualization. The third part lays out the various architectural patterns you could use for Service Discovery, Registration, and Load Balancing.
Networking is one of the main components of any stack. Alas, it's often neglected, and it's not uncommon to hear that "the best network is the one you do not think about". Nonetheless, understanding how networking works is key not only for the network admins who need to keep the network running, but also for the developers and operations teams who need to run applications and microservices on top of the network and make them available to end users and to other services through networking primitives.
The advent of virtualization, cloud, and containerization has introduced several changes in the way workloads are connected to the network, and in how users connect to these workloads. For many with a background in "traditional" networking, this new way of doing things may seem confusing at the beginning. There is also much less reference information on the different options for networking architectures in a containerized world.
In this series of blog posts, we will cover how networking works in a containerized and "cloudy" world. Our objective is to provide a clear picture both to network administrators, who need to operate these systems and ensure they run harmoniously on top of the current network infrastructure, and to software developers, who need to run their applications and microservices on the network and "wire them" properly so that their consumers can access them easily.
We will cover the different options available to connect containers to the network, how connectivity works in a containerized stack, how services are discovered and published to internal and external consumers, and several other aspects relevant to networking with containers. In this first post, we cover how connectivity works for containers. This entry provides the basis for other topics in the series. You can find the second blog here.
Part I - Network Connectivity Options for Containers
An example commonly used to explain what a container is defines it as a "lightweight version of a virtual machine without the hardware emulation." Although the accuracy of this example is debatable, if you've got some knowledge around virtual machines then picturing a container as a "lightweight virtual machine" may be a useful analogy.
One of the differences is that a container does not rely on any hardware emulation. Rather, it's just a process in a host running a container runtime (like Docker), living in its own isolated and controlled namespace and sharing the kernel of that host. As opposed to the way virtual machines work, a container is not connected to emulated hardware like a "virtual network interface card"; instead, it shares one or several network interfaces and/or networking namespaces of the host in which it lives. We can connect the container to the same network interface and namespace that the host uses (e.g. "eth0"), or we can connect it to some sort of "internal" virtual networking interface of the kernel and then do different things to map between this internal interface and the outside world. This is where the different "Networking Mode" options appear, each with its own advantages and compromises.
When we launch a container (or a set of containers forming a distributed service), we choose one of these networking modes to define how it connects to the network.
Let's take a look at each one of them, along with their pros and cons:
HOST mode
Given that a container is just a process running in a host, the simplest option seems to be to connect it to the "host NIC" (or "host networking namespace"). From a networking standpoint, the container will behave just like any other process running in the host. It will use the host's IP address and, very importantly, it will also use the host's TCP port space to expose the service running inside the container. We can run a container in host mode on a Docker host with a command like:
docker run -d --name nginx-1 --net=host nginx
That means that if your container is a web server (say, an NGINX or an Apache container), it will very likely bind by default to ports 80 (HTTP) and 443 (HTTPS) in the host, marking them as "in use" (in the diagram, 192.168.0.2:80 will be "busy"). Let's imagine that we later try to run another standard web-based service on the same host. Unless told otherwise, our second web service container will likely try to bind to the same ports (80 and 443) in the same way. But these ports are now in use by our previous container, so our new container won't be able to launch in that host and will fail.
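The clash described above is ordinary socket behavior and can be reproduced without Docker. The sketch below is only an illustration: `try_listen` and `demo_port_clash` are hypothetical helper names, and the demo uses a kernel-chosen port on the loopback address instead of port 80, which normally requires root.

```python
import errno
import socket


def try_listen(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Bind and listen on host:port, as a service in host mode would."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def demo_port_clash() -> str:
    """Start a first listener, then try a second one on the same port."""
    # Port 0 asks the kernel for any free port, so no privileges are needed.
    first = try_listen(0)
    port = first.getsockname()[1]
    try:
        second = try_listen(port)
    except OSError as exc:
        # Return the symbolic name of the error the second "service" hit.
        return errno.errorcode.get(exc.errno, "UNKNOWN")
    else:
        second.close()
        return "NO_CLASH"
    finally:
        first.close()


if __name__ == "__main__":
    print(demo_port_clash())
```

The second listener plays the role of the second web container: it asks for a port that is already taken in the shared host namespace, and the kernel refuses with an "address already in use" error.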
This issue can be solved in several ways, including using BRIDGE mode as we discuss below (which also has its cons), or using some sort of dynamic port assignment, in which an external orchestration platform tells the container to start on a different, non-default port (see below for details on how DC/OS and Marathon solve this). But that also requires that the container is intelligent enough to "listen" on the dynamic port assigned to it (usually communicated through an environment variable). Check out this example in Python or this example in Java to see how an app running in a container can listen on a port assigned through an environment variable. In some cases, such as when you use off-the-shelf containers from third parties, that won't be possible and the container will just try to start on its default port.
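As a rough illustration of a container that honors a dynamically assigned port, here is a minimal sketch using only the Python standard library. It assumes the port arrives in an environment variable named `PORT` (the name Marathon uses, as described later in this post); `DEFAULT_PORT`, `resolve_port`, and `serve` are hypothetical names chosen for this example.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DEFAULT_PORT = 8080  # hypothetical fallback used when no port is assigned


def resolve_port(environ=os.environ) -> int:
    """Return the port handed to us through the environment, or the default."""
    raw = environ.get("PORT")
    if not raw:
        return DEFAULT_PORT
    port = int(raw)
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port


class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Listen on whatever port the orchestrator assigned (blocks forever)."""
    port = resolve_port()
    server = HTTPServer(("0.0.0.0", port), Hello)
    print(f"listening on port {port}")
    server.serve_forever()
```

Calling `serve()` as the container's entry point starts the server on the assigned port. An off-the-shelf image that hardcodes port 80 has no equivalent of `resolve_port`, which is why it cannot take part in this scheme.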
So, on the upside:
- This is a simple configuration that developers and operators understand immediately, making it easy to use and troubleshoot.
- Does not perform any operations on incoming traffic (NAT or otherwise), so performance is not affected at scale.
- Does not require special configuration or maintenance.
But, on the downside:
- Without an additional "dynamic port assignment" mechanism, services will easily clash on TCP port use and fail.
- "Dynamic port assignment" needs to be managed by a container orchestration platform (more on this later), and usually requires specific code in the container to learn the assigned port.
- Containers share the host network namespace, which may have security implications.
BRIDGE mode
In order to avoid the "port clash" issue discussed above for "Host" mode, a potential solution would be putting the container on a completely separate network namespace, internal to the host where it's living, and then "sharing" the "external" IP address of the host amongst the many containers living in it through the use of Network Address Translation (or NAT for short). That would work much in the same way as your home network connecting to your broadband provider, where the public IP address of your broadband router is shared between the devices in your home network when they reach the internet. Your laptop and cell phone get a private address in your home network, and those get "transformed" (NAT'ed) to the public IP address that your provider assigns you as they traverse the broadband router.
In order to do something similar inside the containerized host, a separate virtual bridge can be created with a completely separate internal network namespace. As an example, the "docker0" bridge in the picture above has private addressing (172.16.0.0/24), just as your home network is separate from the broadband network it's connected to. In this mode, containers are connected to this "private network", and each one gets its own IP address and a full network namespace where all TCP ports are available. The role that your broadband router performs in your home network, translating between "public" and "private" addresses (or, in this case, between the "host" address and "internal" addresses), is performed inside the host by iptables, a well-known Linux utility for configuring network translation rules in the kernel, so that an "external" Host_IP:Host_port combination is "published" and translated to a specific "internal" Container_IP:Container_port one. The Docker runtime allows the operator to configure these "NAT rules" between internal and external ports through a simple flag in the "docker run" command, and configures iptables accordingly.
This way, every time a container is created in "bridge mode", it can run on any port of its internal network namespace (most likely its default port). A first NGINX-1 container created in bridge mode would listen on, for example, 172.16.0.2:80. A second NGINX-2 container would listen on 172.16.0.3:80, avoiding the clash. The internal addresses are assigned automatically by Docker's built-in IP address management (rather than by a DHCP server), so the operator does not need to pick a specific internal address in the private range (172.16.0.0/24 in the example).
In order to allow communication to and from the outside world, the operator needs to "publish" the private container port to the host networking namespace. This means that the operator will use a flag in the container runtime (Docker in this example) to request a NAT mapping from a specific "host address:port" combination that she knows is free to the desired "container port" on the container's internal IP address. For example, we could choose that the host address/port combination 192.168.0.2:10000 is mapped to NGINX-1 on 172.16.0.2:80, while for NGINX-2 we could map 192.168.0.2:10001 to 172.16.0.3:80. Consumers of these services would then access them through 192.168.0.2:10000 and 192.168.0.2:10001 respectively.
These mappings are configured in Docker when running a new container, with the flag "-p" or "--publish" followed by host_port:container_port, so the "docker run" commands would look like:
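Automatic assignment from the private range can be pictured as a small allocator. The sketch below is a toy model and not Docker's actual implementation: `BridgeIpam` is a hypothetical name, and it assumes that the bridge itself takes the first usable address of the subnet and that containers get the lowest free address after it.

```python
import ipaddress


class BridgeIpam:
    """Hands out addresses from a private bridge subnet, lowest free first."""

    def __init__(self, cidr: str = "172.16.0.0/24"):
        self.network = ipaddress.ip_network(cidr)
        hosts = iter(self.network.hosts())
        # Assumption: the bridge (gateway) takes the first usable address.
        self.gateway = next(hosts)
        self._free = list(hosts)
        self._leases = {}

    def allocate(self, container: str) -> str:
        """Return the container's address, assigning one if needed."""
        if container in self._leases:
            return str(self._leases[container])
        if not self._free:
            raise RuntimeError("subnet exhausted")
        addr = self._free.pop(0)
        self._leases[container] = addr
        return str(addr)

    def release(self, container: str) -> None:
        """Return a stopped container's address to the free pool."""
        addr = self._leases.pop(container)
        self._free.append(addr)
        self._free.sort()


if __name__ == "__main__":
    ipam = BridgeIpam()
    print(ipam.allocate("nginx-1"), ipam.allocate("nginx-2"))
```

With the default subnet, the two containers in the example end up on 172.16.0.2 and 172.16.0.3 without the operator choosing either address.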
docker run -d --name nginx-1 -p 10000:80 nginx
docker run -d --name nginx-2 -p 10001:80 nginx
Note that we didn't have to specify --net=bridge because this is the default working mode for Docker. This technique allows multiple containers to run on the same host without requiring the containers themselves to be aware of any dynamically assigned ports. On the other hand, the use of NAT imposes a performance hit in terms of decreased throughput and increased delay.
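To make the translation step concrete, here is a toy model of the table of published ports. It only mimics the destination rewrite conceptually and does not touch iptables; `PortPublisher` and its methods are hypothetical names, and the addresses are the ones used in the example above.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP address, TCP port)


class PortPublisher:
    """Toy model of host_port -> container_ip:container_port NAT rules."""

    def __init__(self, host_ip: str):
        self.host_ip = host_ip
        self._rules: Dict[int, Endpoint] = {}

    def publish(self, host_port: int, container_ip: str, container_port: int) -> None:
        """Register a mapping; a host port can only be published once."""
        if host_port in self._rules:
            raise ValueError(f"host port {host_port} is already published")
        self._rules[host_port] = (container_ip, container_port)

    def translate(self, dst: Endpoint) -> Endpoint:
        """Rewrite an incoming destination the way a NAT rule would."""
        ip, port = dst
        if ip == self.host_ip and port in self._rules:
            return self._rules[port]
        return dst  # no rule matches: the packet stays addressed to the host


nat = PortPublisher("192.168.0.2")
nat.publish(10000, "172.16.0.2", 80)  # nginx-1
nat.publish(10001, "172.16.0.3", 80)  # nginx-2

if __name__ == "__main__":
    print(nat.translate(("192.168.0.2", 10000)))
    print(nat.translate(("192.168.0.2", 10001)))
```

Both containers keep listening on port 80 internally; only the host-side port differs. Every packet has to pass through this extra lookup and rewrite, which is where the performance cost of bridge mode comes from.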
So, on the upside for BRIDGE mode:
- Allows multiple containers to run on the same host without port clashes.
- Simplifies the use of "off the shelf" third party containers, as they can all run in their standard ports without requiring modifications to the code to "listen" to dynamically assigned ports.
- Each container lives in its private network namespace that is separate from the host one, which could be considered safer.
But, on the downside:
- Requires the configuration and maintenance of host_port-to-container_port mappings (although this can be alleviated with the use of a container orchestration platform that does this assignment automatically, as we will see later).
- Impacts network throughput and latency due to the use of NAT.
VIRTUAL NETWORK mode (a.k.a USER mode, a.k.a OVERLAY mode)
In both options presented above, networking is handled by the Linux kernel in the host following the command of the container runtime. Both of them address how to "wire" a container running inside a host to the network, but they don't provide a solution for container communication across hosts.
When our containerized application grows from a single container to many instances in order to cope with increased load, it will very likely outgrow the capacity provided by a single host, and many instances will need to be launched on separate hosts in a server farm. This is quite relevant, given that one of the objectives of containerizing workloads is being able to easily scale out by increasing the number of instances implementing the service (this also requires "wiring" them properly towards a load balancer or service discovery point, but that will be discussed in a future blog post).
Moreover, in order to easily solve the "port clashing" issue we discussed above, it'd be ideal to provide a full IP namespace to a container (sometimes called "IP-per-container") as we do in BRIDGE mode, but without having to compromise on performance or require configuration of port mappings. Finally, we could consider providing a private VLAN-like experience for applications and tenants inside a containerized cloud in order to achieve private addressing across hosts, and possibly isolation and granular security policies between them.
Where have I seen this before? The virtual machine precedent
These issues will likely be familiar to readers who have experience working with virtual machines. The need for multi-host networking, private addressing and the ability to provide security policies between VMs appeared when virtualized environments started to outgrow a few hosts with "manual wiring between them." Hypervisors don't provide networking features, so initially the network was configured by leveraging the basic features of the host where they reside, possibly with a bit of artisan automation around it.
As virtualized environments scale up and get more complex, a few aspects that may be overlooked in small deployments become necessities: providing connectivity without the rapidly growing complexity of maintaining port mapping rules, enabling advanced features (private addressing, security policies, integration with Service Discovery, IPAM), and simplifying operations.
In a VMware environment, solutions such as VMware NSX address this need. Another good example is OpenStack, where this is tackled with the introduction of a component called Neutron, an abstraction interface designed to provide networking services to virtual machines, and "Neutron plugins": specialized networking solutions that "plug" into Neutron to implement those networking services. Neutron offers these "networking services" through an API to other components of OpenStack that may need them. These services include connecting a VM to the network, creating networks and subnets, or adding policies between tenants. But Neutron does NOT implement the networks or process the traffic; it's just an abstraction layer. Instead, it provides a standard API so that networking vendors can write Neutron "plugins" or "drivers" that implement those networking services and process the actual traffic. If you are running OpenStack, you can choose which "networking backend" or "plugin" to use with Neutron. There are open source options (Open vSwitch, OVN), and also vendor-backed options. The latter include providing networking to VMs using hardware devices (such as Cisco's APIC), or providing Software-defined Networking (SDN) with components running on the cloud hosts (such as Juniper's Contrail, Nokia's Nuage VSP or Midokura's MidoNet).
Container Networking Interfaces and Plugins
Much like in the virtual machine world, networking for containers has recently evolved from a "minimal connectivity" approach in which simple primitives and services inside the host running the container were leveraged to provide connectivity, to a "plugin" approach in which networking is considered to be worthy of being separated from the container runtime and implemented in separate specialized services.
In order to achieve this, there is a need for a standard interface that container runtimes and orchestration platforms can use to request these services, and that networking "plugins" need to implement to offer them. In particular, there are two main standards available:
- Container Network Interface (or CNI): Used by Mesos, DC/OS, Cloud Foundry, Kubernetes, rkt and several other runtimes and platforms, CNI is a community-backed, open source project providing the specification and libraries for writing plugins to configure network interfaces in Linux containers, along with a number of supported plugins. CNI is only concerned with network connectivity of containers, which allows it to be simple and leaves it to the plugins to implement advanced services beyond that.
- libnetwork: Developed and used by Docker, libnetwork is an implementation of an abstraction layer for connecting containers. Its goal is to provide a consistent programming interface and the required network abstractions for applications.
Depending on which container orchestration platform you are using (DC/OS, Kubernetes, Docker…), the containers running in your platform may be connected to the network through CNI or libnetwork, or you may even have a mixed scenario in which some pieces of your app use one and others use the other. In any case, there are several things that you should be able to get from a CNI or libnetwork plugin. Things like:
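To give a feel for how small the CNI contract is, the sketch below builds the two inputs a runtime hands to a CNI plugin when attaching a container: a JSON network configuration (passed on standard input) and a set of `CNI_*` environment variables. The field and variable names follow the CNI specification as we understand it, but the concrete values (network name, bridge name, subnet, paths, container ID) are placeholders, and the helper function names are hypothetical, so check the specification for your CNI version before relying on any of this.

```python
import json


def cni_network_config(name: str, subnet: str) -> dict:
    """Network configuration for the reference 'bridge' plugin (placeholder values)."""
    return {
        "cniVersion": "0.4.0",
        "name": name,
        "type": "bridge",      # name of the plugin executable to invoke
        "bridge": "cni0",      # assumed bridge device name
        "isGateway": True,
        "ipMasq": True,
        "ipam": {"type": "host-local", "subnet": subnet},
    }


def cni_add_request(config: dict, container_id: str, netns_path: str,
                    ifname: str = "eth0", plugin_dir: str = "/opt/cni/bin"):
    """Return (environment, stdin) for asking a plugin to ADD a container."""
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,
        "CNI_IFNAME": ifname,
        "CNI_PATH": plugin_dir,
    }
    return env, json.dumps(config)


if __name__ == "__main__":
    conf = cni_network_config("example-net", "10.22.0.0/16")
    env, stdin = cni_add_request(conf, "abc123", "/var/run/netns/abc123")
    print(env["CNI_COMMAND"], stdin)
```

A runtime would execute the plugin binary named by `type` with this environment and feed it the JSON. Everything beyond basic connectivity (policy, isolation, and so on) is left to the plugin.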
- Multi-host connectivity
- Separate address space
- Automatic address management
These features would provide a solution to the issues affecting Host or Bridge mode as discussed above, although possibly at the cost of having to install, configure and operate a third party piece of software in your container orchestration platform, which may or may not be trivial.
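One way to picture how a plugin could deliver the three features listed above is to carve a private overlay range into one subnet per host and assign container addresses from the local subnet. This is only one possible design, not a description of any particular plugin. The class name, the overlay range 10.32.0.0/16 and the /24 per-host size are assumptions made for this sketch, and the actual forwarding of traffic between hosts is left out.

```python
import ipaddress


class OverlayAllocator:
    """Toy IP-per-container allocator with one subnet per host."""

    def __init__(self, overlay_cidr: str = "10.32.0.0/16", per_host_prefix: int = 24):
        network = ipaddress.ip_network(overlay_cidr)
        self._subnets = network.subnets(new_prefix=per_host_prefix)
        self.host_subnets = {}  # host name -> subnet owned by that host
        self._pools = {}        # host name -> iterator over free addresses
        self.locations = {}     # container IP -> host name

    def add_host(self, host: str):
        """Give a host its own slice of the overlay address space."""
        if host not in self.host_subnets:
            subnet = next(self._subnets)
            self.host_subnets[host] = subnet
            self._pools[host] = iter(subnet.hosts())
        return self.host_subnets[host]

    def attach(self, host: str, container: str) -> str:
        """Assign the container an address from its host's subnet."""
        self.add_host(host)
        addr = str(next(self._pools[host]))
        self.locations[addr] = host
        return addr

    def host_for(self, container_ip: str) -> str:
        """Multi-host lookup: which host should receive traffic for this IP?"""
        return self.locations[container_ip]


if __name__ == "__main__":
    overlay = OverlayAllocator()
    print(overlay.attach("host-a", "web-1"), overlay.attach("host-b", "web-2"))
```

Because every container has its own address, two web servers can both listen on port 80 on the same host with no port mappings, and containers on different hosts can reach each other by IP once the plugin forwards traffic to the host that owns the destination subnet.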
Additional features such as policy, isolation, etc. are implementation dependent, so your mileage may vary depending on the plugin used.
Additionally, some plugins provide features specific to a network function (like IP Address Management) or the ability to leverage special hardware or driver functions such as SR-IOV.
Some examples of plugins available for CNI, libnetwork or both are those from Calico, Weave, Cilium, Infoblox or the SR-IOV plugin. Many of the networking vendors that had Neutron plugins are adapting them for use in CNI and libnetwork. As part of DC/OS, at Mesosphere we also provide our Navstar plugin as our default SDN option for USER mode networking.
Simplifying Container Networking with Mesosphere DC/OS
Here at Mesosphere we provide the industry's platform of choice for running containerized applications at scale. This means that we are serving a very diverse set of use cases for a large customer base with our DC/OS platform. As such, we believe in openness and in the power of choice, so we work to enable our customers to freely choose the architecture that best fits their needs, and make available to them the components to build their ideal tailored stack.
Marathon, the container orchestration platform of choice that fuels DC/OS, is able to run containers and connect them to the network using the three networking modes discussed here. Using our GUI, our CLI, or our REST API, users can launch applications in any of these modes.
Here's how Marathon tackles these challenges:
- For HOST mode, Marathon automatically provides dynamic port assignment to containers, so that a container can simply "listen" to the environment variable PORT or PORT0, PORT1, etc. in order to receive port assignments. These ports are brokered by Marathon and communicated to the rest of the components and programs running on DC/OS, so the "wiring" of the applications is done automatically across a large cluster.
- For BRIDGE mode, Marathon provides seamless configuration through GUI, CLI and API, and is also able to automatically provide "host ports" so that an operator or developer just needs to specify a container port and have Marathon do the internal wiring to the rest of the components of DC/OS.
- In VIRTUAL NETWORK mode, DC/OS ships with our Navstar SDN plugin providing IP-per-container, automatic address assignment and configurable per-network address space, amongst many other features. Following our philosophy of providing a flexible and customizable cloud platform, Marathon and Mesos are able to seamlessly leverage CNI and libnetwork plugins. This means that DC/OS ships with "batteries included but replaceable", so if you want to consider an alternate networking plugin you're free to do so.
In summary, DC/OS enables you to run your workloads efficiently in HOST, BRIDGE, or USER mode with a CNI or libnetwork plugin.
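The dynamic port assignment described for HOST mode can be pictured from the orchestrator's side as a broker that reserves free host ports and passes them to each task through its environment. The sketch below is a simplified model and not Marathon's code: `PortBroker` and the 10000-10999 range are assumptions for illustration, while the variable names `PORT`, `PORT0`, `PORT1` are the ones mentioned above (with `PORT` mirroring `PORT0`).

```python
class PortBroker:
    """Toy model of an orchestrator handing out host ports to tasks."""

    def __init__(self, first: int = 10000, last: int = 10999):
        self._free = list(range(first, last + 1))
        self.assigned = {}  # task id -> list of host ports

    def launch_env(self, task_id: str, count: int = 1) -> dict:
        """Reserve `count` ports and return the environment for the task."""
        if count > len(self._free):
            raise RuntimeError("not enough free ports on this host")
        ports = [self._free.pop(0) for _ in range(count)]
        self.assigned[task_id] = ports
        env = {f"PORT{i}": str(port) for i, port in enumerate(ports)}
        if ports:
            env["PORT"] = str(ports[0])  # PORT mirrors PORT0
        return env

    def finish(self, task_id: str) -> None:
        """Give a finished task's ports back to the pool."""
        self._free.extend(self.assigned.pop(task_id))
        self._free.sort()


if __name__ == "__main__":
    broker = PortBroker()
    print(broker.launch_env("nginx-1", count=2))
    print(broker.launch_env("nginx-2"))
```

Since the broker also records which task owns which port, the same information can be published to the rest of the platform, which is what lets the "wiring" happen automatically across a cluster.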
Mesosphere's Universe service catalog includes ready-to-install packages for many of these plugins, and you can check our Examples site to find easy-to-follow instructions on how to install, configure, and run them.
Hopefully this post has helped you understand the options available for connecting your containerized application to the network, either using standalone Docker in a dev environment, or in a container orchestration platform such as DC/OS or Kubernetes when moving beyond a single host. In upcoming posts, we will cover what Service Discovery is, how it works, and how it can help us deliver our applications to our users while hiding the complexity of how they are implemented in containers.