
Optimizing Marathon: Part II


Jul 27, 2020

Karsten Jeschkies

D2iQ

Part II: Code Culture and Design

Part I covered our team culture, which applies to many different types of work and teams. This part covers the software engineering practices that helped us stabilize Marathon.

 

Static Types and Non-Blocking Calls

Marathon is written in Scala and makes heavy use of Akka Actors and Streams. I probably don’t have to mention that Scala’s type system and immutable data structures prevent a lot of bugs before we even run unit tests.

 

Akka allows us to encapsulate state and process requests asynchronously. While this lowered Marathon’s resource consumption and sped it up, it came at a cost. Asynchronous code can be very hard to debug and opens the door to race conditions. Business logic that is already complicated on paper becomes very hard to reason about in code. At first, we worked around such issues with blocking calls. This approach backfired: we saw thread starvation and random timeouts. Removing the blocking calls in turn made a lot of tests flaky, because some code actually relied on other code to block. I will cover this topic in depth in Part III.

 

We decided to remove almost all blocking code and replace it with async/await blocks. Async and await are syntactic sugar that lets us write and read the logic sequentially while it executes asynchronously.
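
To make the idea concrete, here is a minimal sketch, not Marathon code. It uses a for-comprehension over standard-library Futures rather than the async/await macros, because that needs no extra dependency; the shape is the same, in that the logic reads sequentially while no thread waits between steps. The names `loadInstanceCount`, `loadTargetCount` and `missingInstances` are hypothetical.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object NonBlockingSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-ins for asynchronous storage and scheduler calls.
  def loadInstanceCount(appId: String): Future[Int] = Future(3)
  def loadTargetCount(appId: String): Future[Int] = Future(5)

  // Reads top to bottom like sequential code, yet no thread waits in between:
  // each step runs as a callback once the previous Future has completed.
  def missingInstances(appId: String): Future[Int] =
    for {
      running <- loadInstanceCount(appId)
      target  <- loadTargetCount(appId)
    } yield math.max(0, target - running)

  // The blocking pattern: Await.result parks the calling thread until the
  // Future completes. Kept only for the edge of a program, such as a test.
  def missingInstancesBlocking(appId: String): Int =
    Await.result(missingInstances(appId), 5.seconds)
}
```

Only the last method blocks. Code that calls `Await.result` from inside a thread pool is the kind that can lead to the thread starvation described above.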

 

Next, we made the internals of Marathon declarative. In the past, Marathon would just keep counts of Mesos tasks to launch and react to Mesos events. This was extremely hard to debug, as one could not trace why a certain count was increased or decreased. So we replaced the counts with a declarative state of the world that Marathon wants to achieve. Mesos events trigger updates of the observed state. By comparing the observed state with the declared state, Marathon makes a decision. When we debug Marathon nowadays, we can look at the observed and declared states and determine whether an update was wrong or the decision based on the two states was wrong.
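
The comparison of declared and observed state can be sketched as a pure function. This is an illustration under simplifying assumptions, not Marathon's actual data model: state is reduced to an instance count per app, and the names `State`, `Decision`, `Launch`, `Kill`, `decide` and `onTaskFinished` are hypothetical.

```scala
object ReconcileSketch {
  // Declared state: how many instances of each app should run.
  // Observed state: how many instances of each app are running.
  type State = Map[String, Int]

  sealed trait Decision
  final case class Launch(appId: String, count: Int) extends Decision
  final case class Kill(appId: String, count: Int) extends Decision

  // Pure function: the same two states always yield the same decisions.
  def decide(declared: State, observed: State): List[Decision] = {
    val appIds = (declared.keySet ++ observed.keySet).toList.sorted
    appIds.flatMap { appId =>
      val want = declared.getOrElse(appId, 0)
      val have = observed.getOrElse(appId, 0)
      if (want > have) List(Launch(appId, want - have))
      else if (want < have) List(Kill(appId, have - want))
      else Nil
    }
  }

  // An event only updates the observed state; the decision is derived later.
  def onTaskFinished(observed: State, appId: String): State =
    observed.updated(appId, math.max(0, observed.getOrElse(appId, 0) - 1))
}
```

Because `decide` has no hidden counters, a wrong decision can be reproduced from a dump of the two states, and a wrong update can be traced to the event handler that produced the observed state.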

 

Let it Crash

As scary as this concept is, we decided to adopt the “let it crash” mindset. Any time Marathon reaches an illegal state or observes issues such as connection failures, it crashes. This behaviour has a few consequences. For one, there must be no data loss. Luckily, Marathon saves every request in ZooKeeper before it acts on it, so recovery after a crash simply means loading all data, reconciling, and carrying on. For another, crashing the whole service means that all failure recovery is pushed out to restarting Marathon. There is only one place that deals with recovery (Note 3).

 

Crashing Marathon also makes it obvious to a user immediately that something is wrong, instead of hours or even days later. Since we adopted this behaviour, we have had quite a few incidents in which a customer reported that Marathon was in a “crash loop”. As annoying as this is, it not only surfaced issues early on but also made it easy to see whether a fix worked: once Marathon stopped crashing, the issue was resolved.

 

We prefer to focus on state recovery rather than graceful teardown: we would rather crash and restart than implement many complex error recovery handlers.
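
The pattern described in this section, persist first and then recover through the normal startup path, can be sketched as follows. This is a simplified illustration, not Marathon's implementation: `DurableStore` is a hypothetical in-memory stand-in for ZooKeeper, and `Service` and `scale` are made-up names.

```scala
import scala.collection.mutable

object LetItCrashSketch {
  // Hypothetical stand-in for the ZooKeeper-backed store; it outlives a crash.
  final class DurableStore {
    private val requests = mutable.LinkedHashMap.empty[String, Int]
    def save(appId: String, instances: Int): Unit = requests.update(appId, instances)
    def loadAll(): Map[String, Int] = requests.toMap
  }

  final class Service(store: DurableStore) {
    // Startup and crash recovery share one code path: load what was persisted.
    private var declared: Map[String, Int] = store.loadAll()

    def declaredState: Map[String, Int] = declared

    def scale(appId: String, instances: Int): Unit = {
      // An illegal state is not repaired in place; the exception is left to
      // propagate and end the process.
      if (instances < 0)
        throw new IllegalStateException(s"negative instance count for $appId")
      store.save(appId, instances)                  // persist first ...
      declared = declared.updated(appId, instances) // ... then act
    }
  }
}
```

A restart is modelled by constructing a new `Service` on the same store. It starts from the last persisted request, and no dedicated error handler is involved.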

 

Fail Loud and Proud

Letting Marathon crash is just one piece of the whole. Another is to fail loud and proud. This principle seems similar to the previous one, but there is a difference. “Let it crash” urges one not to be too smart and to just crash. On its own, this approach often backfired in the beginning, because users did not expect a crash and often could not make much of it. This is where “fail loud and proud” comes into play: be explicit about why there is a crash and give users as much information as you have at hand. We started with simple exit codes pointing to different Mesos, ZooKeeper, or networking issues. It made debugging much simpler.
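
A minimal sketch of distinct exit codes paired with an explicit message could look like this. The exit code values, the reason names and the message format are all hypothetical; this post does not list the codes Marathon actually uses.

```scala
object FailLoudSketch {
  // Hypothetical exit codes; Marathon's real values are not given in this post.
  sealed abstract class CrashReason(val exitCode: Int, val description: String)
  case object MesosConnectionLost extends CrashReason(100, "lost connection to the Mesos master")
  case object ZooKeeperUnavailable extends CrashReason(101, "could not reach ZooKeeper")
  case object NetworkFailure extends CrashReason(102, "networking problem")

  // Builds the message separately so that it can be checked without exiting.
  def crashMessage(reason: CrashReason, details: String): String =
    s"Fatal: ${reason.description} (exit code ${reason.exitCode}). Details: $details"

  // Prints everything known about the failure, then exits with the matching code.
  def crash(reason: CrashReason, details: String): Nothing = {
    System.err.println(crashMessage(reason, details))
    sys.exit(reason.exitCode)
  }
}
```

With one code per failure class, an operator or a supervisor script can tell a ZooKeeper outage from a Mesos disconnect by the exit status alone, before reading any logs.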

 

Coming up in a few days: Part III of the blog, on our Test Pipeline. Check back soon!

 

Note 3: A great example is PR 6905 https://github.com/mesosphere/marathon/pull/6905.
