State of Marathon
The Marathon user base is growing rapidly, and it's great to see all the discussion and feedback in the Hacker News and GitHub communities. We've been following the comments about failing Marathon installations particularly closely, and we're here to help. In this post, we'll give Marathon users an overview of the issues we've identified, what has been fixed so far, and what we're working on to further improve Marathon's stability for our users and customers.
The 0.7 series of Marathon added lots of new features including:
- App deployments (as opposed to unmonitored app creation before)
- Hierarchical application groups
- Application and group dependencies
- Docker support
Implementing app deployments and groups required drastic changes to the Marathon internals, including how task scheduling works. With changes this large there was great potential for introducing bugs, and some did indeed make it into Marathon.
Here are details on these bugs:
Registering with new framework ID
A Mesos framework is identified by a unique ID that Mesos assigns to it during the first registration. If a framework loses its connection to Mesos, it must re-register to Mesos within a specific time frame or Mesos removes all state associated with this framework, including the running tasks.
In a fresh Marathon HA setup, assuming all instances are started simultaneously, each instance fetches the framework ID from storage at startup, and never again. This means that on failover, every instance assumes there is no existing ID and registers as a new framework. Eventually, after the re-registration timeout, all existing tasks are killed without Marathon being notified: Marathon never receives the status updates, because the tasks were started under a different framework ID.
The same problem occurred when trying to deploy an app after such a faulty failover: Marathon tried to kill the old tasks but never received any status updates from Mesos, and therefore never finished the deployment.
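To illustrate the shape of the fix, here is a minimal Python sketch (not Marathon's actual Scala code; the store interface and names are invented for illustration) of re-reading the persisted framework ID at every registration instead of caching it once at process startup:

```python
class FrameworkIdStore:
    """Toy in-memory stand-in for Marathon's persistent store (ZooKeeper in reality)."""

    def __init__(self):
        self._framework_id = None

    def fetch(self):
        return self._framework_id

    def store(self, framework_id):
        self._framework_id = framework_id


def register(store, assign_new_id):
    """Re-read the stored framework ID on every (re-)registration.

    The 0.7.x bug was, in effect, reading the ID once at startup and caching it:
    after a failover the new leader saw the stale empty value and registered as a
    brand-new framework. Fetching from the store at registration time picks up an
    ID persisted by a previous leader, so running tasks stay associated with it.
    """
    framework_id = store.fetch()
    if framework_id is None:
        # First registration ever: Mesos assigns the ID; persist it for failovers.
        framework_id = assign_new_id()
        store.store(framework_id)
    return framework_id
```

A second registration against the same store then reuses the persisted ID instead of requesting a new one, which is the behavior the fix restores.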
This issue is partially fixed in 0.8.0 and completely fixed in current master.
Health checks never turned green in the UI
In Marathon 0.7.6 we started moving to a new JSON parser (we used Jackson before and are now using Play JSON in many places). In the app response, the output format was inadvertently changed, causing the UI to incorrectly extract the health information. This caused the health status to constantly display red in the task list.
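As a hypothetical illustration of how such a parser migration bites consumers (the field names below are invented for the example, not Marathon's actual wire format), a client can defend against a renamed health field by accepting both spellings:

```python
def extract_healthy(app_json):
    """Return a per-task health flag from an app response.

    Tolerates two hypothetical spellings of the health-check results field, the
    kind of silent rename a serializer switch (e.g. Jackson -> Play JSON) can
    introduce. A task counts as healthy only if it has results and all are alive.
    """
    healthy = []
    for task in app_json.get("tasks", []):
        # Accept either spelling of the health-check results field.
        results = (task.get("healthCheckResults")
                   or task.get("health_check_results")
                   or [])
        healthy.append(bool(results) and all(r.get("alive", False) for r in results))
    return healthy
```

A UI that only knew the first spelling would see an empty result list for every task after the switch and render everything as unhealthy, which is exactly the red-task symptom described above.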
This issue is fixed in 0.8.0.
Segfaults in the Mesos bindings
Marathon uses the mesos.jar to communicate with Mesos. This library uses JNI to call the native code in libmesos. There was a severe bug in the JNI code that led to the premature release of native pointers, causing segmentation faults or failed checks that terminated the JVM process. The problem occurred more frequently with larger task loads and/or slow network connections to ZooKeeper.
This bug is fixed in the Mesos master branch and will be released with Mesos 0.22.1. We realize that upgrading to Mesos 0.22.1 in a timely manner is not possible for everyone, so we are also working on a fix in Marathon that supports older Mesos versions. This fix simply bypasses the Mesos state abstraction and calls into ZooKeeper directly.
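In spirit, the workaround replaces the JNI-backed state wrapper with a plain ZooKeeper client. The sketch below (Python rather than Marathon's Scala; paths, names, and the in-memory client are invented so the example is self-contained) shows a state store that talks to a ZooKeeper-like API directly, with no native code in the path:

```python
class FakeZkClient:
    """Minimal stand-in for a real ZooKeeper client; a dict replaces the znode tree."""

    def __init__(self):
        self._nodes = {}

    def ensure_path(self, path):
        self._nodes.setdefault(path, b"")

    def set(self, path, value):
        self._nodes[path] = value

    def get(self, path):
        return self._nodes.get(path)


class DirectZkState:
    """State store that talks to ZooKeeper directly.

    Sidestepping the JNI-backed Mesos state abstraction keeps native pointers
    out of the read/write path entirely, so the premature-release bug in
    libmesos cannot crash the JVM. Paths and encoding here are illustrative.
    """

    def __init__(self, client, root="/marathon/state"):
        self._zk = client
        self._root = root
        self._zk.ensure_path(root)

    def put(self, key, value):
        path = f"{self._root}/{key}"
        self._zk.ensure_path(path)
        self._zk.set(path, value)

    def fetch(self, key):
        return self._zk.get(f"{self._root}/{key}")
```

Against a real cluster the fake client would be swapped for an actual ZooKeeper client; the store's interface stays the same.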
We plan to release this fix as part of Marathon 0.8.2.
Leader election issues
Marathon 0.7.x had several issues with leader election that led to actions reserved for the leading node being executed on non-leading nodes. This could potentially cause loss or corruption of the Marathon state.
These issues are fixed in 0.8.0.
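The invariant these fixes restore can be sketched as a guard that refuses leader-only actions on non-leading nodes (a minimal illustration with a local flag; real Marathon derives leadership from its ZooKeeper-based election, and the names here are invented):

```python
import threading


class LeaderGuard:
    """Refuse leader-only actions on non-leading nodes.

    The 0.7.x election bugs effectively let this check pass on multiple nodes
    at once, so leader-only work (deployments, task reconciliation) could run
    concurrently and corrupt shared state.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._leader = False

    def elect(self):
        with self._lock:
            self._leader = True

    def abdicate(self):
        with self._lock:
            self._leader = False

    def run_if_leader(self, action):
        with self._lock:
            if not self._leader:
                raise RuntimeError("not the leader; refusing leader-only action")
        return action()
```

A non-leading node calling `run_if_leader` gets an error instead of silently mutating state; after `elect()` the same call succeeds.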
Slow API requests and UI with large numbers of apps/tasks
When many Marathon applications are created at once, or many instances are running, the response time of the app- and task-specific APIs can increase drastically, ultimately leading to a complete denial of service. We identified and fixed many bottlenecks in the Marathon backend and frontend code in master, and we are working to fix the rest. Our goal is to support hundreds of apps and more than 10,000 tasks running simultaneously on Marathon.
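One generic way to keep list-everything endpoints responsive under load — offered here as an illustration of the bottleneck, not as Marathon's actual fix — is to cache the expensive aggregated response for a short interval instead of rebuilding it on every request:

```python
import time


class CachedResponse:
    """Time-bounded cache around an expensive computation.

    `compute` stands in for building the full apps/tasks response; with
    thousands of tasks, doing that per request is what drives response times
    up. A short TTL bounds both staleness and the recomputation rate.
    """

    def __init__(self, compute, ttl_seconds=1.0, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._cached_at = None

    def get(self):
        now = self._clock()
        if self._cached_at is None or now - self._cached_at >= self._ttl:
            # Cache miss or expired: rebuild the response once, serve it to
            # every request that arrives within the TTL window.
            self._cached = self._compute()
            self._cached_at = now
        return self._cached
```

The trade-off is bounded staleness (up to one TTL) in exchange for amortizing the expensive aggregation across many concurrent requests.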
If you encounter any issues with Marathon, we would love to hear from you on GitHub. All contributions, to the codebase itself or to the documentation, are also highly appreciated!
Thanks, The Marathon Team