Dear HQ Trivia, Here’s How to Prevent Outages Using Edge Computing
How to build and scale a video streaming app to support millions of simultaneous live connections.
HQ Trivia's outages during the Super Bowl halftime show remind us that, even for seasoned professionals, designing apps that can handle the deluge that comes with overnight success is hard. It's common now in the tech industry to mistake the public cloud's near-limitless capacity for a substitute for architecting app infrastructure that can scale quickly to meet growing and dynamic demand. It's possible to avoid the "technical difficulties" fate by learning from the successes of other SaaS and mobile apps that faced this problem before.
If you haven't played HQ Trivia yet, you're missing out. It's re-imagining the traditional trivia show format to create a live, mobile experience — allowing thousands of your closest friends to compete in a question-and-answer game for cash prizes. You can't yet earn Jeopardy-level money or notoriety, but it is fun. However, it's begging for some updates, like nixing the annoying chat room and creating ways to keep players engaged after they miss one of the questions.
Game mechanics aside, its Achilles' heel right now is, unfortunately, downtime caused by overwhelming user demand. Like Twitter, Pokemon Go, and dozens of apps before them, HQ Trivia is struggling with its popularity. Building an application that scales to support an average of 600,000 simultaneous users is no small feat, let alone a spike of 1.7 million during the Super Bowl. It's not surprising that the video routinely freezes and connections are often dropped, resulting in emotional outbursts and obscenities in the chat stream. If HQ Trivia wants to keep building on its surging popularity, it must address its current technical shortfalls. Rapid success is both a blessing and a curse for digital businesses, and HQ Trivia is wrestling with some of the hardest demands that can be placed on an application infrastructure. It's a great problem to have, but a real headache for the IT organization.
HQ Trivia's Exponential yet Spiky Demand
HQ Trivia is gaining users at breakneck speed. The hundreds of thousands of general trivia enthusiasts who log in twice every weekday play for only a short time. This results in serious traffic bursts that last tens of minutes, while for much of the day traffic is flat. This dramatic burst pattern is common in industries like retail/ecommerce (Black Friday and Cyber Monday) and financial services, where crunching numbers for end-of-day portfolio analysis can require orders of magnitude more compute. Experiences in these industries have shown that avoiding outages requires a well-thought-out approach and not a quick fix. Adding physical compute and memory through elastic cloud resources is akin to the traditional approach of throwing more hardware at the problem, and it isn't the only, or even the best, answer. Truly elastic application infrastructure that can scale individual components of the backend up and down is crucial for high availability and long-term scalability.
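To make "scale components up and down" concrete, here is a minimal sketch of the arithmetic an autoscaler might apply to one backend component. We don't know how HQ Trivia sizes its services, so every number and name below (users_per_replica, headroom, the replica limits) is a made-up assumption for illustration only.

```python
import math


def desired_replicas(concurrent_users, users_per_replica=5000,
                     min_replicas=2, max_replicas=500, headroom=0.25):
    """Return how many replicas of one backend component to run.

    All defaults are hypothetical: users_per_replica is an assumed capacity
    per instance, and headroom is spare capacity kept for sudden bursts.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    needed = math.ceil(concurrent_users * (1 + headroom) / users_per_replica)
    # Clamp to a floor (for availability) and a ceiling (for cost control).
    return max(min_replicas, min(max_replicas, needed))


# Quiet period, average game, and the Super Bowl spike cited above.
for users in (2_000, 600_000, 1_700_000):
    print(users, desired_replicas(users))
```

The point of the floor and ceiling is that the same rule shrinks the fleet during the flat part of the day and grows it for the twice-daily burst, rather than leaving peak capacity running around the clock.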
In the early years, Twitter faced a similar problem: global site outages, which were labeled "fail whales," became not uncommon occurrences. A user's homepage on Twitter, essentially a firehose of short messages, required multiple loosely coupled data and other services in order to render. An outage in one of the open source services (like MySQL) or any latency in the connection would cascade and snowball into outages. The only way to add capacity to any service at the time was to add servers and use configuration management to provision the application infrastructure.
When the number of users is growing exponentially, these processes become error prone and need to be automated and standardized. At Twitter, Mesosphere founders Florian Leibert and Ben Hindman helped architect one of the solutions that killed the fail whale, including new software infrastructure based on Apache Mesos. This allowed engineering to easily provision and manage the entire lifecycle of those application infrastructure components needed to add capacity as usage spiked. If something went down, Mesos would make sure to spin it back up. Mesosphere was born out of the need to make that type of technology more accessible to the wider business world — not just web-scale startups.
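The "spin it back up" behavior boils down to a reconciliation loop: compare what should be running with what is running, and correct the difference. The sketch below is a toy, in-memory illustration of that idea only. It is not the Apache Mesos API, and the Cluster class, the reconcile function, and the service names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster manager's view of its services."""
    desired: dict                                # service -> instances wanted
    running: dict = field(default_factory=dict)  # service -> healthy instances


def reconcile(cluster):
    """Bring running counts in line with desired counts; return the actions."""
    actions = []
    for service, want in cluster.desired.items():
        have = cluster.running.get(service, 0)
        if have < want:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
        # In this toy model the corrective action always succeeds.
        cluster.running[service] = want
    return actions


cluster = Cluster(desired={"chat": 3, "video-edge": 5},
                  running={"chat": 3, "video-edge": 5})
cluster.running["chat"] = 1   # simulate two chat instances crashing
print(reconcile(cluster))     # [('start', 'chat', 2)]
```

A real scheduler runs this kind of loop continuously and has to handle failed launches, resource offers, and placement, which is exactly the work that becomes error prone when done by hand.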
Real-Time, Data-Rich Delivery
HQ Trivia traffic spikes include demanding, resource-intensive, data-rich video that often requires additional design considerations to ensure a Netflix-like experience. On top of that, the network must support real-time chat interactions. While automation, management, and standardization of application infrastructure are important for data-rich services like video, it is also critical to consider data locality. In order to preserve a compelling user experience, data needs to be geographically close to the user to take advantage of a low-latency connection.
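As a rough illustration of data locality, the sketch below picks the closest site for a user by great-circle distance. The site list and coordinates are invented for the example, and geographic distance is only a proxy: a production system would more likely route on measured latency and current load.

```python
import math

# Hypothetical edge sites: name -> (latitude, longitude) in degrees.
EDGE_SITES = {
    "new-york": (40.71, -74.01),
    "chicago": (41.88, -87.63),
    "san-francisco": (37.77, -122.42),
    "london": (51.51, -0.13),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_site(user_location, sites=EDGE_SITES):
    """Return the name of the site closest to the user's location."""
    return min(sites, key=lambda name: haversine_km(user_location, sites[name]))


print(nearest_site((42.36, -71.06)))    # a player in Boston -> new-york
print(nearest_site((34.05, -118.24)))   # a player in Los Angeles -> san-francisco
print(nearest_site((48.86, 2.35)))      # a player in Paris -> london
```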
One could claim that cloud infrastructure and content delivery networks (CDNs) eliminate the need for regional compute. Cloud providers have sold technical professionals wholesale on the myth that cloud infrastructure bandwidth is infinite and the latency virtually zero. In practice, when signing up for a cloud provider, technologists are often limited by the hardware that is available in a certain service region. For example, a cloud service provider may only have the hardware needed to support your disk or memory requirements in one data center. Once your application is constrained to a particular regional data center, your entire business is limited by that site's location, resources, and load.
CDN services, however, become less useful the more dynamic your content. Very few support live-streaming, and those that do are shockingly expensive at scale. Instead of relying on a large, centralized cloud or "off-the-shelf" CDN to distribute video to customers, engineering should take a hybrid approach, deploying compute at the edge with the right resource management and networking to ensure that the application is highly available. This hybrid approach, with edge computing powering the user experience, is crucial to delivering on rising consumer expectations.
Edge is the Answer
HQ Trivia requires more than a CDN or a global cloud provider to solve their problems. While we don't know for certain what their infrastructure looks like, we believe that what they need is an edge computing architecture built on a combination of many regionally distributed data caching and compute resources that can quickly process data streams from within the chat while preserving data-rich video performance and quality-of-service (QoS).
Edge computing is rapidly gaining popularity. Placing data caching, processing, and analytics at the edge eliminates performance degradation caused by round-trip network latencies between the user device and cloud-based application stacks. Combining edge and cloud resources allows for a snappy user experience with centralized control, analysis, and insight.
If Royal Caribbean can use edge computing while at sea to cope with exponential, spiky demand against a geographically challenged infrastructure, then so can HQ Trivia. As anyone who has been on a cruise can testify, cruise ship internet connections are expensive and tenuous. Not too long ago, passengers who wanted to sign up for on-board activities or excursions had to line up and talk to guest services. By adopting Mesosphere DC/OS, Royal Caribbean was able to extend its compute power all the way to the edge (in this case, cruise ships) to power a reliable mobile app experience, allowing customers to spend more onboard the ship. The suite of Royal Caribbean mobile applications is powered by DC/OS, which manages resources across the datacenter and public cloud to deliver reliable mobile experiences to everyone on the cruise ship, even during peak demand. As this example shows, the right architecture makes it possible to create reliable mobile experiences, even in the middle of the ocean. If Royal Caribbean can conquer the challenges faced by its distributed infrastructure, then so can HQ Trivia.
HQ Trivia could leverage a similar approach to connect concurrent users to content close to their location. Indeed, most organizations should start planning to utilize this edge design. As connectivity to our devices increases, as those devices become smarter and provide deeper interactions, and as our tastes evolve from generic to bespoke, personalized experiences, the need for edge computing becomes ever more salient. Easier said than done, as the HQ Trivia engineering team is likely saying while reading this blog.
At Mesosphere, we're the first to admit that the new reality is hard. Building and operating distributed applications and related services results in hard problems with often half-built solutions. It's why we've made it our mission to make managing dynamic infrastructure, which is increasingly hybrid and multi-cloud, a bit more manageable.