How Yelp Saves 2x more with AWS Spot Fleet on Apache Mesos
Nov 21, 2017
The following is a summary of the presentation "How Yelp.com Runs on Apache Mesos in AWS Spot Fleet" delivered by Kyle Anderson, Site Reliability Engineer at Yelp, at MesosCon Los Angeles. Anderson helps build and run "PaaSTA", Yelp's open source platform-as-a-service built on Mesos, which runs on a hybrid infrastructure composed of AWS and bare metal servers. Through its mobile app and website, Yelp gives local businesses a way to connect with customers who are looking for, say, a restaurant or a plumber.
At MesosCon North America this year, Kyle Anderson, Site Reliability Engineer at Yelp, shared his best practices for running Mesos in production atop Amazon Web Services Spot Fleet instances. Spot Fleet instances are time-shared instances that can be revoked at any moment, which makes them highly volatile. However, Yelp has designed a methodology for leveraging this unstable service and, as a result, has achieved 2x savings on infrastructure costs above and beyond what it already saves with Mesos.
While spot instances are volatile, Yelp makes it look easy to run production services without outages, thanks in part to the ability of Apache Mesos to schedule tasks and relaunch slave nodes. Anderson says that diversity is the key principle for managing the risk associated with volatility on Amazon's spot fleet. Diversity spans Availability Zones (AZs) and instance types; in one example, Anderson visualizes 3 AZs and 4-5 instance types.
Best Practices for Managing Amazon Spot Fleet
Yelp manages its Amazon spot fleet diversity through Infrastructure-as-Code, leveraging HashiCorp Terraform for custom configurations and spot fleet autoscaling requests. Through data analysis, the team has determined the best value for spot instances compared to Amazon EC2 Reserved Instances and On-Demand Instances, weighed against acceptable uptime ratios: for example, comparing the cost of a bid price that yields 100% uptime with the cost of one that yields 80% uptime. These data points empower Yelp to run an optimized combination of price, uptime, instance type and availability zone, which yields the desired level of diversity across those attributes.
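The price-versus-uptime comparison described above can be sketched in a few lines of Python. This is only an illustration, not Yelp's actual analysis: the `Offer` class, the `effective_price` helper, and every price and uptime figure below are hypothetical values chosen for the example. The idea is to divide each hourly price by its expected uptime to get the cost of an hour of capacity that is actually delivered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One purchasing option. All figures used below are hypothetical."""
    name: str
    hourly_price: float  # USD per instance-hour
    uptime: float        # expected fraction of time the instance is available


def effective_price(offer: Offer) -> float:
    """Cost per hour of capacity actually delivered (price / uptime)."""
    if not 0 < offer.uptime <= 1:
        raise ValueError("uptime must be in the range (0, 1]")
    return offer.hourly_price / offer.uptime


def cheapest(offers):
    """Return the offer with the lowest effective price."""
    return min(offers, key=effective_price)


# Hypothetical numbers, for illustration only.
offers = [
    Offer("on-demand", 0.40, 1.00),
    Offer("reserved", 0.25, 1.00),
    Offer("spot, high bid", 0.15, 0.95),
    Offer("spot, low bid", 0.10, 0.80),
]

for offer in sorted(offers, key=effective_price):
    print(f"{offer.name:15s} {effective_price(offer):.3f} USD per delivered hour")
```

With these made-up inputs the low spot bid still comes out cheapest even after its lower uptime is priced in. A real analysis would also have to account for the cost of the disruption itself, which this sketch ignores.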
When asked whether the engineering effort was worth it, Anderson responded that Yelp's engineering team of four invested the time to make this work. From that investment, Yelp saves 50% of its projected AWS EC2 Reserved Instances cost every month, which he jested was a sizable monetary value. Also, because the tooling Yelp has developed is now publicly available for anyone to use, an interested party would need a much lower up-front investment to realize similar savings.
Anderson shared some caveats. The Amazon Spot Instance API seems to be unique amongst its peer APIs in its complexity and lack of intuitiveness, and he warns that the learning curve could be steep for someone new. Additionally, while diversity is the key point of this session, it is also the greatest area of risk to be addressed. Without diversity on an already highly volatile infrastructure, you open your applications up to the full risk associated with Amazon's EC2 spot fleet, since any instance could be reclaimed, evicting your workloads, within 2 minutes. Losing 50% of one's infrastructure is not an unlikely possibility, so Yelp already over-provisions its spot fleet with the money it is saving to address this risk.
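To make the over-provisioning trade-off concrete, here is a minimal sketch of the arithmetic. The function names and the price ratio are assumptions made for illustration; the talk summary does not give Yelp's actual spot-to-reserved price ratio. If a fleet must still meet demand after losing a fraction p of its capacity, it needs 1 / (1 - p) times the baseline capacity, and the net savings shrink accordingly.

```python
def overprovision_factor(loss_fraction: float) -> float:
    """Capacity multiplier needed to still meet demand after losing
    `loss_fraction` of the fleet."""
    if not 0 <= loss_fraction < 1:
        raise ValueError("loss_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - loss_fraction)


def net_savings(spot_to_reserved_ratio: float, loss_fraction: float) -> float:
    """Fraction of the reserved-instance bill saved after paying for the
    extra spot capacity. The price ratio is a hypothetical input."""
    return 1.0 - spot_to_reserved_ratio * overprovision_factor(loss_fraction)


# Surviving the loss of half the fleet requires twice the capacity.
print(overprovision_factor(0.5))

# Hypothetical: if spot capacity cost a quarter of the reserved price,
# doubling the fleet would still leave half of the bill saved.
print(net_savings(0.25, 0.5))
```

The 0.25 ratio is purely an example input. The point is that over-provisioning only pays off while the price ratio multiplied by the over-provisioning factor stays below 1.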
Automation is Essential
Automation was key to Yelp's implementation because success depends on the ability to iterate through various models, rebalance workloads, rebalance fleets and optimize bids, weights and instance types. Without this automation, changes would be governed by fear of outages or would be too complex to implement without error. Without the ability to leverage Amazon's EC2 spot fleet, Yelp would be spending 2x as much on infrastructure if it were to purchase EC2 Reserved Instances.
Diversity was a consistent theme in Anderson's session. If you were managing a mutual fund, without diversity you would expose yourself to risk in a particular industrial sector or company. Similarly, without diversity in your spot fleet, you expose yourself to risk within that fault domain of AZ+instance type. Rebalancing across these attributes, like a mutual fund, also reduces risk introduced through entropy bias.
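The fault-domain argument can be illustrated with a small sketch. It assumes capacity is spread evenly across every (AZ, instance type) pool and then asks how much of the fleet disappears if a single pool, or a whole AZ, is reclaimed. The zone and instance type names are placeholders, and the even split is a simplifying assumption rather than how Yelp weights its fleet.

```python
from itertools import product


def even_allocation(zones, instance_types, total_units):
    """Spread capacity evenly over every (zone, instance type) pool."""
    pools = list(product(zones, instance_types))
    share = total_units / len(pools)
    return {pool: share for pool in pools}


def worst_single_pool_loss(allocation):
    """Fraction of capacity lost if the largest single pool is reclaimed."""
    return max(allocation.values()) / sum(allocation.values())


def worst_zone_loss(allocation):
    """Fraction of capacity lost if the most heavily used zone goes away."""
    per_zone = {}
    for (zone, _instance_type), units in allocation.items():
        per_zone[zone] = per_zone.get(zone, 0.0) + units
    return max(per_zone.values()) / sum(allocation.values())


# Placeholder names: 3 AZs x 4 instance types = 12 pools.
zones = ["az-1", "az-2", "az-3"]
instance_types = ["type-a", "type-b", "type-c", "type-d"]

diverse = even_allocation(zones, instance_types, total_units=120)
concentrated = {("az-1", "type-a"): 120.0}

print(f"diverse, one pool lost:      {worst_single_pool_loss(diverse):.1%}")
print(f"diverse, one AZ lost:        {worst_zone_loss(diverse):.1%}")
print(f"concentrated, one pool lost: {worst_single_pool_loss(concentrated):.1%}")
```

Under these assumptions, a fleet spread over 12 pools loses about a twelfth of its capacity when one pool is reclaimed, while a fleet concentrated in a single pool loses everything.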