This is a review of the last three years, which we spent stabilizing Marathon. Marathon is the central workload scheduler in DC/OS. Most of the time when you launch an app or a service on DC/OS, it is Marathon that starts it on top of Apache Mesos. Mesos manages the compute and storage resources and Marathon orchestrates the workload. We sometimes dub it the “init.d of DC/OS”. Because Marathon is such an integral part of DC/OS, we must ensure that it keeps functioning. Note that I did not say “crash”. You will learn why later in this three-part series.
Before we start, let me thank my teammates for their hard work, and our customers and support engineers for their patience when Marathon did not function as expected. As hard as we try to avoid them, bugs do end up in production. The on-call sessions with our customers had a big impact on the stability of Marathon. However, in this article I would like to explore what we did to avoid getting into these situations and how we made them easier to handle.
This article series has three parts, which will be posted at intervals of a few days. I’ll start with our team culture, followed by our code culture and design. The third part is a description of our testing pipeline. I will then close with a retrospective and an outlook. Feel free to skip the sections on team and code culture. However, I strongly believe that they had an influence on our technology.
Part I: Team Culture
Over the course of a few years, we established a certain team culture. I would like to share four aspects that I feel had not only an impact on our code quality but also make working in this team exceptional.
Assume Best Intentions
We always assume everyone has the best intentions at their work. This mindset avoids unnecessary conflict and helps us to focus on the technical issue at hand. Think about it. Do you really assume someone wrote buggy code on purpose, or are they, or you, missing some information? We often found that code sections or changes were controversial in the team because the author, the reviewers, or both did not have enough information about or understanding of the problem we wanted to solve. Acknowledging that everyone has the best intentions, i.e., wants to make the product better, opens many doors. This idea leads me to the next principle.
Respect What Came Before
None of my team members were around when the original Marathon code was written. We were often tempted to dismiss the original code as buggy and bad. It is much harder to dig into it and try to understand why it was written this way. The former attitude leads to Not-Invented-Here syndrome and unnecessary rewrites. So we focus on the latter approach and dig in. This hard work often pays off with new insights and code improvements that cover the edge cases the original code tried to handle.
Double Down on Success
The first two principles guide us to learn more about Marathon. What do we do with our new insights? We double down on them. Early in our retrospectives, we found that focusing on the last production failure or deadline miss was not only discouraging but also did not help us to move forward (note 1). As my teammate Tim put it, “it’s like going shopping with a list of things not to buy”. We were none the wiser. So we decided to concentrate on what worked and double down on it. This tool helped us find a bug? Let’s improve it and educate the support team to use it. This code change resolved a race condition? Let’s take the time to make the changes in other parts as well.
Pair Programming and Two Reviewers
OK, we are learning and we are doubling down on success, but what does this look like in day-to-day work? We quickly found that establishing a process is just overhead. Instead, we encouraged our teammates to pair program and enforced two code reviewers. This is our forcing function to increase the chances of applying what we’ve learnt. If two engineers work on a problem and pair, they already have a rough understanding of the code changes. Reviews become simple sign-offs. However, when you add a third party, they ask questions, they raise the Boy Scout Rule (note 2), and they point to a helper class introduced just last week or a pattern the team agreed on. I really enjoyed playing the unknowing reviewer and asking questions that blocked the pull request, to the annoyance of my teammates. However, I strongly believe that my whys and hows improved the code base.
Part II of this blog, Code Culture and Design, will be posted in a few days. Stay tuned!
1. This principle is similar to “Learning from mistakes is overrated” in Rework.
2. Found in Robert C. Martin’s book Clean Code.