This is a guest post by Tom Petr (Infrastructure Tech Lead at HubSpot) and Whitney Sorenson (VP of Platform Infrastructure at HubSpot).
HubSpot is a marketing and sales platform for businesses to run everything from blogging to analytics to customer relationship management. The product has a large surface area and, as a result, has many independent moving parts: about 170 single-page webapps, 300 RESTful web services, 380 background workers, 260 cron jobs and 400 one-off tasks, all individually deployable.
Our development team at HubSpot is all about velocity. Nearly 300 times a day, developers are committing code, testing changes, and deploying to production. Part of this is thanks to our Platform as a Service (PaaS) team, whose high-level goal is to find and implement ways to reduce developer friction. We've been able to make a huge impact over the past year by running more and more of our workloads on Mesos -- and our homemade PaaS called Singularity -- and have seen many benefits to our software, stack and culture along the way.
When people ask what benefits Mesos brings to HubSpot, they usually expect to hear "cost savings" or "auto-scaling." Those are definitely valid points; we've saved money because Mesos allows us to pack services densely onto larger machines, and we can easily scale capacity up or down on any service with just a couple of clicks. However, the most important benefit that Mesos has brought us is the simplicity that comes with abstracting the many machines in our datacenter into one big computer.
Life before Mesos
Before Mesos, our infrastructure required much more human intervention. Provisioning new machines, or replacing impaired ones, was a manual, error-prone process. Developers, who have historically owned the entire stack of their features, would often be paged in the middle of the night because a machine hosting their service died. We also accumulated some single points of failure, such as developers consolidating all their cron jobs on a single machine. When one of these "cron" machines experienced issues, there'd be a mad rush to replace and redeploy.
Our infrastructure was also very opaque in places. We have good machinery for pushing out telemetry data for alerting purposes (e.g., service metrics and health-checks), but developers who wanted to pull out granular data from their services (e.g., taking a heap dump, or grabbing all service logs for the past week) had to manually log into each machine and poke around. We needed an easier, more secure way to give developers access to their service's data, without letting them affect other services.
We moved to Mesos to let our developers do what they do best: write software without having to fully understand every single level of HubSpot's tech stack beneath it. When we began our Mesos journey, we first investigated wiring up Chronos (for cron jobs) and Marathon (for web services and background workers) into our deployment pipeline. This is where Singularity got its name -- it was originally supposed to be a Mesos framework for one-off tasks, something that Chronos and Marathon didn't support at the time.
We ultimately decided not to continue down that route, instead transforming Singularity into a framework that's more of a "PaaS-in-a-box." This means it's just one service that can handle the most-common types of workloads in a datacenter. You can think of it almost as a packaged-up Heroku for your datacenter.
We've also added some unique features to Singularity that have come from real-world use:
- The ability to gracefully stop all tasks on a Mesos slave via "decommissioning" (coming soon to Apache Mesos in the form of inverse offers).
- An artifact download service that operates outside of any Mesos cgroups. While counter-intuitive, this is a good workaround for OOM kills during periods of high disk I/O on some kernels.
- A kinder, gentler OOM killer service that first attempts to smoothly shut down tasks that exceed their memory allocation.
- A generic S3-upload service for long-term logging and miscellaneous data storage for tasks.
- A custom executor, which provides enhanced reporting and log rotation for tasks.
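To make the "kinder, gentler OOM killer" idea concrete, here is a minimal sketch of the general pattern of asking a process to exit before forcing it. This is not Singularity's actual implementation: the function name `stop_gracefully` and the `grace_seconds` parameter are hypothetical, and detecting that a task has exceeded its memory allocation is assumed to happen elsewhere. The sketch assumes a POSIX system.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask a process to exit, escalating to a forced kill only if needed.

    Hypothetical helper for illustration; not part of Singularity or Mesos.
    Returns "already-exited", "terminated", or "killed".
    """
    if proc.poll() is not None:
        # Nothing to do: the process finished on its own.
        return "already-exited"

    # Polite request first (SIGTERM on POSIX), so the task can flush
    # buffers, close connections, and exit cleanly.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # The task ignored the request within the grace period,
        # so fall back to a forced kill (SIGKILL on POSIX).
        proc.kill()
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a task that has exceeded its memory allocation.
    task = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(stop_gracefully(task, grace_seconds=5.0))
```

The design choice is the ordering: a forced kill is kept as a fallback rather than the first response, which gives a well-behaved task a chance to shut down smoothly.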
Rebooting and Recovering
The ability of Mesos to abstract away machines came in handy during the Amazon "rebootpocalypse" of 2014. While other companies (and non-Mesos infrastructure groups inside HubSpot) scrambled to replace all affected instances before they were automatically rebooted, our PaaS team simply spun up new Mesos slaves and decommissioned the affected machines. Singularity automatically moved processes onto healthy machines, so the end result to our developers was ... absolutely nothing.
They experienced zero downtime, and had no idea of any impending doom.
Singularity's ability to quickly scale services up or down has come in handy many times. Sometimes, it's to handle expected spikes in traffic, like when one of our customers was featured on the show Shark Tank. Sometimes, it's in response to back-pressure on our services, like when a huge batch of emails is queued up all at once.
In a pre-Mesos world, these measures would have taken tens of minutes. Now they take tens of seconds, meaning our mean time to recovery is drastically improved.
Open source from the start
Singularity is unique in the sense that it's a core piece of HubSpot infrastructure that's been open source from its inception. Now, companies including Groupon, OpenTable and EverTrue all use Singularity, and our teams get the added value of working on the same framework together. We're all fixing things, building features, and catching bugs before they become huge issues for another company.
It's becoming this really valuable Mesos ecosystem where the byproduct of improving your own infrastructure is making other companies' lives a little bit easier without them knowing it.
Next, our goal is to move new pieces of HubSpot infrastructure into Mesos in order to reap the same benefits we've seen for our apps. This includes Memcache, Kafka, Spark, Hadoop and MySQL. If the past year was summed up as "all apps on Mesos," then this year will probably be "and everything else too."
Check out HubSpot's Product Blog for more from their engineering team.