Introducing Marathon 1.4
We are excited to announce the release of Marathon 1.4.2. Marathon is the container orchestrator for Apache Mesos and DC/OS. Marathon 1.4.2 ships with DC/OS 1.9 and is also available for download as a standalone binary. This release includes a number of new features, bug fixes, and improvements: along with performance and scalability gains, Marathon now supports pods, Mesos-based health checks, an improved ZooKeeper storage layout, and better debugging support.
Support for Pods
Pods enable you to share storage, networking, and other resources among a group of applications on a single agent. You can then address them as one group rather than as separate applications and manage health as a unit.
Pods allow quick, convenient coordination between applications that need to work together, such as a primary service and a related analytics service or log scraper. Pods are particularly useful for transitioning legacy applications to a microservices-based architecture.
The existing /v2/apps endpoint is still fully supported; pods are managed through the new /v2/pods endpoint.
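As a sketch, a minimal pod definition posted to /v2/pods might look like the following (the pod ID, container names, commands, and resource values are all illustrative):

```json
{
  "id": "/example-pod",
  "containers": [
    {
      "name": "main-service",
      "exec": { "command": { "shell": "./run-service" } },
      "resources": { "cpus": 0.5, "mem": 128 }
    },
    {
      "name": "log-scraper",
      "exec": { "command": { "shell": "./scrape-logs" } },
      "resources": { "cpus": 0.1, "mem": 32 }
    }
  ],
  "networks": [ { "mode": "host" } ]
}
```

Both containers are scheduled onto the same agent, where they can share volumes and networking and be managed as a single unit.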
Support for Mesos-based health checks for HTTP, HTTPS, and TCP
Health checks are an integral part of application monitoring and have been available in Marathon since version 0.7.
Starting with Apache Mesos 1.1, network-based health checks can be performed directly at the Mesos executor level.
Prior to this, health checks were only performed directly in Marathon. This was not ideal because:
- Marathon had to share the same network as the tasks to monitor so it could reach all launched tasks.
- Network partitions could lead to sub-optimal scheduling decisions.
- The health state of a task was not available via the Mesos state.
- Marathon health checks did not scale to large numbers of tasks.
For a deeper discussion on the tradeoffs between Mesos and Marathon-based health checks refer to this blog post series.
TL;DR: We strongly recommend Mesos-based health checks over Marathon-based health checks.
With this release, we deprecate Marathon-based health checks in favor of Mesos-based health checks.
For more details on how to use Mesos-based health checks, see the updated Marathon Health Check Documentation.
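In an app definition, a Mesos-based health check is selected via the MESOS_HTTP, MESOS_HTTPS, or MESOS_TCP protocol. A sketch (the app ID, path, and timing values are illustrative):

```json
{
  "id": "/my-service",
  "cmd": "./start-server",
  "cpus": 0.5,
  "mem": 128,
  "healthChecks": [
    {
      "protocol": "MESOS_HTTP",
      "path": "/health",
      "portIndex": 0,
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

Because the check runs on the executor, the health state is reported through Mesos task status updates rather than probed over the network by Marathon itself.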
New ZK persistent storage layout
ZooKeeper limits both the number of child nodes a single znode can hold and the size of an individual znode (1 MB by default).
Until version 1.3, Marathon used a flat storage layout in ZooKeeper and encountered limitations with large installations. The latest version of Marathon uses a nested storage layout, which significantly increases the number of nodes that can be stored.
Also, prior Marathon versions stored each group with all subgroups and applications inside a single node, which could lead to a node size larger than 1 MB, exceeding the maximum ZooKeeper node size. In version 1.4, Marathon stores a group only with references in order to keep node size under 1 MB.
When upgrading to 1.4, Marathon automatically migrates the old layout to the new one.
Improved Task Lost behavior
The connection between the Mesos master and an agent can be broken for several reasons (network partition, agent update, etc). When this happens, the master has limited knowledge of the status of the agent's tasks.
Prior versions of Apache Mesos declared such tasks as "lost" after a timeout and killed the tasks if the agent rejoins the cluster. Starting with Mesos 1.1, those tasks are declared unreachable, not lost. The scheduler that launched the tasks decides how to handle unreachable tasks.
Marathon 1.4 takes advantage of this feature and adds an UnreachableStrategy to the AppDefinition and PodDefinition, which allows you to define:
- inactiveAfterSeconds: How long Marathon should wait to start a replacement task.
- expungeAfterSeconds: How long Marathon should wait for a task to come back.
If a task comes back after a replacement has already started, Marathon needs to decide which of the two to kill. You can define a kill selection to determine whether the older or the newer task is killed.
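Put together, an app definition using these settings might look like this sketch (the app ID, timing values, and resource figures are illustrative):

```json
{
  "id": "/my-service",
  "cmd": "./start-server",
  "instances": 3,
  "cpus": 0.5,
  "mem": 128,
  "unreachableStrategy": {
    "inactiveAfterSeconds": 300,
    "expungeAfterSeconds": 600
  },
  "killSelection": "YOUNGEST_FIRST"
}
```

With these values, Marathon would launch a replacement after a task has been unreachable for 5 minutes, and would give up on the original task entirely after 10 minutes; if the original reappears before that, the kill selection removes the newer replacement.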
Insights into the Launch Process - AKA: Why isn't my app starting?
In order to schedule tasks, Marathon needs to find a matching offer from the Mesos Master.
The matching logic must consider a number of factors, such as resource requirements, resource roles, and constraints. In some situations Marathon cannot fulfill a launch request because no matching offers are available.
Previously, it was very hard for users to understand why Marathon could not fulfill launch requests. Marathon 1.4 gives insight into the launch process by analyzing all incoming offers and providing statistics about the reasons why offers are not matching. These statistics can be fetched via the /v2/queue endpoint. See the REST API Reference for more details.
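For example, assuming Marathon is reachable on its default port 8080 (the host name here is illustrative, and a running cluster is required), the launch queue and its offer-matching statistics can be inspected with curl:

```shell
# Fetch the launch queue, including statistics on processed and unmatched offers
curl -s http://marathon.example.com:8080/v2/queue | python -m json.tool
```

The per-app entries in the response indicate how many offers were processed and why unmatched offers were declined, which makes "why isn't my app starting?" much easier to answer.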