Uber’s Distributed Deep Learning on Apache Mesos w/ GPUs and Gang Scheduling

For Uber, distributed deep learning is an essential part of the work on self-driving vehicles, trip forecasting, and fraud detection.

With distributed deep learning, Uber can speed up complex model training, scale out to hundreds of GPUs, and shard models that cannot fit on a single machine. Uber relies on key Mesos features such as GPU isolation, nested containers, high scalability, and reliability, but it also developed its own system, Peloton, which adds task preemption and placement, as well as a job/task lifecycle manager, on top of Mesos’s resource allocation and task execution.

This presentation, given at MesosCon 2017 in Los Angeles, features Min Cai, Anne Holler, Paul Mikesell, and Alex Sergeev, engineers at Uber, discussing the design and implementation of running distributed TensorFlow on top of Mesos clusters with hundreds of GPUs. The presenters describe the architecture of Peloton and how Uber implements several features in its scheduler to support GPU and gang scheduling, task discovery, and dynamic port allocation. Finally, the speakers show the speedup of distributed training for image classification on Mesos using Horovod, a distributed training framework for TensorFlow.
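
For readers unfamiliar with the term, gang scheduling means that all tasks of a job are placed together or not at all, so a distributed training job never starts with only some of its workers. The following is a minimal Python sketch of that all-or-nothing idea, not Peloton's actual code: the function name `place_gang`, the host names, and the greedy "most free GPUs first" policy are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def place_gang(free_gpus: Dict[str, int],
               gpus_per_task: int,
               num_tasks: int) -> Optional[List[str]]:
    """Place every task of a job, or none of them (hypothetical helper).

    free_gpus maps a host name to its number of unallocated GPUs.
    Returns one host name per task on success, or None if the whole
    gang does not fit. On failure, free_gpus is left untouched.
    """
    # Work on a copy so a failed attempt leaves no partial reservation.
    remaining = dict(free_gpus)
    placement: List[str] = []

    for _ in range(num_tasks):
        # Assumed policy: pick the host with the most free GPUs
        # among those that can still hold one more task.
        candidates = [h for h, g in remaining.items() if g >= gpus_per_task]
        if not candidates:
            return None  # one task does not fit, so nothing is placed
        host = max(candidates, key=lambda h: remaining[h])
        remaining[host] -= gpus_per_task
        placement.append(host)

    # Every task fits: commit the reservations in one step.
    free_gpus.update(remaining)
    return placement
```

A caller would invoke `place_gang(free, gpus_per_task=2, num_tasks=3)` and launch the job only when the result is not `None`. A real scheduler would also have to handle preemption, ports, and concurrent offers, which this sketch leaves out.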
