By late 2013, Yelp Director of Operations Sam Eaton realized the company had an infrastructure problem. Scaling the site across its own datacenters and the Amazon Web Services cloud had become complex and resource-intensive, for both developers and operations staff.
"The question was how we could manage deploying services in both datacenters and AWS in a unified way, without making this overly complex for developers to work with," Eaton explained.
Too many options
Ironically, Yelp's problems had nothing to do with a lack of innovation. It had developed its own systems for deploying services on bare metal, and it used the open source Asgard system, originally developed at Netflix, to deploy to AWS. It was practicing continuous deployment to update the website several times per day.
Yelp was suffering from too many options, and having to deal with the strengths and weaknesses of each. Building new virtual machine images, so-called "golden images," for AWS could take up to an hour, although the built images would launch quickly. Launching services on bare metal gave more control, but scaling was harder. Yelp developers trying to deploy some services locally and some services in the cloud had a hard time dealing with differences between AWS instances and Yelp hardware.
"They had to configure services one way for bare metal, and use a totally different mechanism for AWS," said Eaton. "That was unpleasant for developers."
In addition to the complexity of service deployment, Yelp also had a testing problem. Eaton said developers released new code multiple times per day, but were hamstrung in terms of speed, because it took 90 minutes to run through all the required tests. Parallelizing the test runs was problematic, and Eaton's team found it wasn't easy -- at least without wasting resources -- to analyze all the dependencies and find an optimal way to schedule all the tests.
"It was a 'bin-packing' problem for us and one that wasn't easily solved by putting more VMs or more hardware into play," he said.
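To make the bin-packing framing concrete, here is a minimal sketch of one classic heuristic, first-fit decreasing, applied to test scheduling. It is an illustration only, not Yelp's scheduler: the test names, durations, and the per-worker time budget are all hypothetical, and the sketch ignores the dependencies between tests that Eaton's team also had to analyze.

```python
def first_fit_decreasing(durations, budget):
    """Pack tests into as few workers as the heuristic can manage.

    durations: dict mapping a (hypothetical) test name to its runtime in seconds.
    budget: maximum total runtime allowed on one worker, in seconds.
    Returns (bins, loads): the test names per worker and each worker's total runtime.
    """
    bins = []   # bins[i] is the list of test names assigned to worker i
    loads = []  # loads[i] is the total runtime already assigned to worker i
    # Place the longest tests first; short ones then fill the remaining gaps.
    ordered = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, secs in ordered:
        if secs > budget:
            raise ValueError(f"{name} alone exceeds the {budget}s budget")
        for i, load in enumerate(loads):
            if load + secs <= budget:
                # First worker with enough room takes the test.
                bins[i].append(name)
                loads[i] += secs
                break
        else:
            # No existing worker has room, so open a new one.
            bins.append([name])
            loads.append(secs)
    return bins, loads


# Hypothetical test suite: names and runtimes are made up for illustration.
example_tests = {
    "test_search": 50,
    "test_reviews": 40,
    "test_login": 30,
    "test_photos": 30,
    "test_api": 20,
    "test_ads": 10,
}
example_bins, example_loads = first_fit_decreasing(example_tests, budget=60)
print(example_bins)
print(example_loads)
```

With these made-up numbers the heuristic fills three workers to exactly 60 seconds each. Real suites rarely pack this neatly, which is why simply adding machines tends to waste capacity.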
Here comes Docker
Eaton and his team considered numerous options to solve the conundrum presented by Yelp's growing pools of both cloud and local servers. They thought about deploying OpenStack. They also considered using software from Eucalyptus -- a startup, since acquired by HP, that focused on AWS compatibility -- to build a hybrid environment that spanned cloud and on-premises servers.
However, Yelp's tech team ultimately decided that the future would revolve around Docker containers, not virtual machines. Docker could solve some of Yelp's service deployment issues because it gave developers the ability to manage their own containers, and to deal with packaging and dependency issues rapidly without waiting for golden images to build.
Running millions of containers means Mesosphere and Mesos
After researching their options, Eaton and his team decided on Mesosphere's evolution of Apache Mesos as the best way to run containers at the scale Yelp required. Eaton was impressed with how Mesos would turn Yelp's clusters of machines -- virtual and physical -- into pools of aggregate resources that didn't stop at the boundaries of the servers that comprised them. Developers could launch containers at will without worrying about server configurations.
Rather than just stick with the base Apache Mesos, Eaton opted to test out Mesosphere's packaged version*, which includes tools such as Marathon. Eaton liked that Mesosphere provided native hooks into the Marathon platform-as-a-service framework, which Yelp uses to schedule and orchestrate compute jobs. Moreover, Mesosphere's evolving concept of a Datacenter Operating System (DCOS) also aligned with Yelp's goal to create a single developer environment, toolchain, and set of DevOps processes that spanned local and cloud servers.
(* Mesosphere still offers open source downloads, but now also offers its Datacenter Operating System (DCOS) as a commercial software product. DCOS pre-packages many important components, and includes additional features to greatly simplify the experience of deploying and managing distributed services.)
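For readers unfamiliar with how Marathon is asked to schedule a containerized job, the sketch below builds a Marathon-style app definition in Python. It is illustrative only and is not Yelp's or PaaSTA's configuration: the service name, image, and resource numbers are hypothetical, and the field names follow Marathon's JSON app definition format as commonly documented, so check them against the Marathon version in use.

```python
import json


def marathon_app(service, image, instances, cpus, mem_mb):
    """Build a Marathon-style app definition as a plain dict.

    All argument values used below are hypothetical examples.
    """
    return {
        "id": f"/{service}",     # app identifier within Marathon
        "instances": instances,  # how many copies of the container to keep running
        "cpus": cpus,            # CPU share requested per instance
        "mem": mem_mb,           # memory requested per instance, in MB
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


# Hypothetical service and image names, chosen only for illustration.
example_app = marathon_app(
    service="example-service",
    image="registry.example.com/example-service:1.0",
    instances=3,
    cpus=0.5,
    mem_mb=256,
)
print(json.dumps(example_app, indent=2))
```

A definition like this is typically submitted to Marathon's REST API, after which Marathon keeps the requested number of instances running on whatever machines Mesos offers. The developer states what to run and how many copies, not which server to use.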
Yelp was able to build a functional Mesosphere cluster in three months, bringing it up to near-production readiness.
Serving up PaaSTA
On top of Mesosphere, Eaton's team constructed a Docker-based microservices architecture, which it calls PaaSTA, to allow container-size jobs to run regardless of the computing platform.
Explained Eaton:
"We love that Docker gives our devs an identical environment throughout the stack. Marathon, and tooling in PaaSTA, gives them a better way to schedule the appropriate number of containers to run their services, and hides the underlying infrastructure from developers. They don't have to worry about whether their services run in a datacenter, on AWS, or some other cloud provider. If they write services according to the PaaSTA platform contract, and configure their Docker containers in the right way, PaaSTA takes care of all the other work of getting their service deployed, by providing resource discovery and launching the containers."
The benefits of PaaSTA and Mesos are felt throughout the Yelp engineering team. Service Infrastructure Technical Lead John Billings explained how:
"We used to spend a large amount of time manually assigning services onto individual machines. This caused a bottleneck when launching new services. If there was a machine failure, then we had to individually contact all of the owners of affected services and ask them to move their services onto new hardware. We also had to keep juggling services between machines as traffic volumes increased. As we grew to tens of production services, this was becoming an impossible situation.
"PaaSTA has freed us from these time-consuming tasks by allowing automatic provisioning and migration of services across both in-house and AWS hardware. Developers and operations are completely in love with this technology."
Video: https://vimeo.com/121183491
Soaring with Seagull
In addition to providing the core of the PaaSTA platform-as-a-service system, Yelp also used Mesosphere to turbo-charge its testing infrastructure, developing a new testing platform called Seagull.
By using Mesos and a custom scheduler, Yelp was able to build a much more efficient system for parallelizing and accelerating unit tests. The company now runs about 17 million individual tests every day, some scheduled directly on machines, and some in containers -- and Mesos manages them all. Eaton says that Yelp is already launching 1 million Docker containers per day, with this number increasing as the company moves more tests into containers.
Mesosphere also gave Yelp an unexpected benefit: improved resource utilization meant Eaton's team could save money by participating more actively in the AWS spot market for compute capacity.
"As the number of tests increases, we are bidding for the instances we need to run our test suites beyond our base set of reserved instances," he explained. "If we spot bid for instances, then those instances may disappear midway through a run. Mesos allows us to survive this type of dynamic spot bidding by rescheduling tests onto new instances without interrupting a job."
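The rescheduling behavior Eaton describes can be sketched as a small simulation. This is a toy model under stated assumptions, not the Mesos API or Yelp's Seagull scheduler: the instance names, the test names, and the idea of modeling a spot instance as "finishes N tests, then disappears" are all hypothetical.

```python
from collections import deque


def run_tests(tests, survives):
    """Simulate a test run that tolerates instances disappearing.

    tests: list of (hypothetical) test names.
    survives: dict mapping instance name to the number of tests it finishes
              before being reclaimed; None means it is never reclaimed.
    Returns (completed, retries): which instance finished each test, and how
    many tests had to be rescheduled after their instance disappeared.
    """
    pending = deque(tests)
    remaining = dict(survives)  # copy so the caller's dict is not mutated
    live = list(survives)
    completed = {}
    retries = 0
    while pending:
        if not live:
            raise RuntimeError("no instances left to run the remaining tests")
        for inst in list(live):  # iterate over a copy; live may shrink below
            if not pending:
                break
            test = pending.popleft()
            if remaining[inst] == 0:
                # The instance was reclaimed mid-test: drop it and put the
                # test back on the queue so another instance picks it up.
                live.remove(inst)
                pending.append(test)
                retries += 1
            else:
                if remaining[inst] is not None:
                    remaining[inst] -= 1
                completed[test] = inst
    return completed, retries


# Hypothetical run: one reserved instance plus one spot instance that
# disappears after finishing a single test.
example_tests = ["t1", "t2", "t3", "t4", "t5", "t6"]
example_done, example_retries = run_tests(
    example_tests, {"reserved-1": None, "spot-1": 1}
)
print(example_done)
print(example_retries)
```

In this run the spot instance vanishes while running its second test, that test goes back on the queue, and the reserved instance finishes it. The job as a whole still completes, which is the property that makes aggressive spot bidding tolerable.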
"A combination of Mesosphere and Marathon gives our developers a much more elastic compute capability and lets them radically speed up their deploys," said Eaton. "They spend less time dealing with different platforms and more time working on their code. Mesosphere is good for developers and that's good for Yelp."