Today, we are excited to announce the beta release of TensorFlow in the Mesosphere DC/OS Service Catalog. Using a single command, you can now deploy distributed TensorFlow on any bare-metal, virtual, or public cloud infrastructure. As with other packages available for DC/OS, the new TensorFlow package also includes the ability to use GPUs to accelerate your machine learning and deep learning applications.
In the race to leverage deep learning capabilities, data scientists specializing in deep learning are highly sought after. An efficient data science infrastructure allows you to attract the best data scientists and get the best work out of them, which gives your business a strategic advantage over competitors. The addition of distributed TensorFlow to DC/OS furthers Mesosphere's commitment to empowering developers, operators and data scientists.
In this blog post, we provide a brief introduction to TensorFlow, walk through the challenges of running TensorFlow in a distributed setting, and explain how our new DC/OS TensorFlow package addresses these challenges. Even as a beta package under active improvement, TensorFlow on DC/OS provides a simple, easy-to-use experience for running distributed TensorFlow.
(Quick) Introduction to TensorFlow
TensorFlow is an extremely popular open source library for machine learning originally developed by the Google Brain team. In fact, TensorFlow was the #1 most forked GitHub project of 2015 and has remained in the top 10 most-forked projects ever since. TensorFlow's popularity stems from its ability to simplify the development and training of deep neural networks using a computational model based on dataflow graphs.
[caption id="attachment_10545" align="aligncenter" width="1926"] In the example above, the input layer is responsible for finding patterns of local contrast, hidden layer 1 is responsible for finding individual facial features based on those contrasts, and hidden layer 2 is responsible for identifying entire faces based on those facial features. Source: https://www.edureka.co/blog/what-is-deep-learning[/caption]
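To make the idea of a dataflow graph more concrete, here is a minimal sketch in plain Python. It is a toy model of the concept only and does not use TensorFlow's API; the `Node` class and the helper functions `placeholder`, `constant`, `add`, and `multiply` are hypothetical names chosen for illustration. The point it shows is that the graph is built first and only evaluated later, once input values are fed in.

```python
# Toy dataflow graph: nodes are operations, edges carry values between them.
# Illustrative only; this is not TensorFlow's API.

class Node:
    def __init__(self, op, inputs=()):
        self.op = op          # callable that combines the input values
        self.inputs = inputs  # upstream nodes this node depends on

    def evaluate(self, feed):
        # A node listed in the feed dictionary takes its value from there.
        if self in feed:
            return feed[self]
        args = [node.evaluate(feed) for node in self.inputs]
        return self.op(*args)


def placeholder():
    # A node with no operation; its value must be supplied at evaluation time.
    return Node(op=None)


def constant(value):
    return Node(op=lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def multiply(a, b):
    return Node(lambda x, y: x * y, (a, b))


# Build the graph y = w * x + b first, then run it with a concrete input.
x = placeholder()
w = constant(3.0)
b = constant(1.0)
y = add(multiply(w, x), b)

result = y.evaluate({x: 2.0})
print(result)  # 7.0
```

Separating graph construction from execution is what allows a framework to decide where each operation should run before any data flows through it.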
In general, the lifecycle of a deep neural network goes through two distinct phases: training and inference. In our example, the training phase consists of feeding the neural network with thousands of images in order to train it to recognize faces. This training could take hours, days, or even weeks to complete depending on a variety of factors such as the size of the data set, the complexity of the model, and the performance of the hardware. Once this training is complete, however, the neural network can be used to "instantaneously" identify a face in an image.
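The split between the two phases can be sketched with a deliberately tiny model. The example below is plain Python and fits a two-parameter line rather than a neural network; the function names `train` and `infer`, the learning rate, and the data set are all assumptions made for illustration. Training loops over the data many times to adjust the parameters, while inference is a single cheap evaluation with the parameters held fixed.

```python
# Training vs. inference on a tiny linear model y = w * x + b.
# Pure Python for illustration; a real network has far more parameters.

def train(samples, learning_rate=0.05, epochs=2000):
    """Fit w and b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def infer(model, x):
    """Apply the trained parameters to a new input; no learning happens here."""
    w, b = model
    return w * x + b


# Hypothetical training data drawn from y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

model = train(training_data)      # the slow, iterative phase
prediction = infer(model, 10.0)   # the fast phase; close to 21 for this data
print(round(prediction, 3))
```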
The figure below shows the training and inference processes in more detail:
Although TensorFlow lends itself well to the design and implementation of classification networks like the one above, it is not limited to this use case alone. TensorFlow has also been used for object tracking (https://github.com/akosiorek/hart), text-to-speech generation (https://github.com/ibab/tensorflow-wavenet), and even self-driving cars (https://github.com/udacity/self-driving-car/).
TensorFlow eases the development of such deep neural networks by providing basic machine learning primitives that you can integrate directly into your code. TensorFlow provides these primitives in the form of a library, with bindings into multiple popular languages (e.g. C/C++, Go, Java, and Python). Additionally, TensorFlow automatically figures out the best processing unit (CPU, GPU, TPU, etc.) to run your code on.
[caption id="" align="alignnone" width="808"] Developing a TensorFlow application in Python that runs on a combination of CPUs and GPUs.[/caption]
Check out the TensorFlow 101 tutorial to learn more about building your first neural network using TensorFlow.
Single-Node vs. Distributed TensorFlow
Designing and implementing a deep neural network (even with the help of TensorFlow) is no small feat. Data scientists must first build machine learning models that lend themselves to distributed computation, map them onto deep neural networks, and then write the code to power the new model. They also must decide whether it is worth the effort to define and implement their deep neural network in a distributed fashion, or simply design it to run on a single workstation.
Designing a deep neural network for single-node computation is often simpler than designing it for distributed computation, but the resulting network takes considerably longer to train. Designing a deep neural network for distributed computation, on the other hand, can be much more complex, but spreading the work across many machines can cut training time from months to days, hours, or less.
Challenges with Deploying Distributed TensorFlow
Organizations deploying distributed TensorFlow applications encounter a number of challenges that are solved transparently by running the service on DC/OS.
Running distributed computations in TensorFlow requires understanding the complex interactions between many different components. Parameter Servers feed the most up-to-date parameter values to Workers that perform the computations, while the Master coordinates and synchronizes all of this distributed effort.
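The interaction between these roles can be simulated in a few lines of plain Python. This is a single-process sketch under simplifying assumptions, not TensorFlow code: the `ParameterServer` and `Worker` classes and the `run_training` function (which stands in for the coordinating Master) are hypothetical, the model is a two-parameter line, and updates are applied synchronously.

```python
# Single-process simulation of parameter-server training.
# Illustrative only; real deployments run these roles on separate machines.

class ParameterServer:
    """Holds the current parameter values and applies updates to them."""

    def __init__(self, w=0.0, b=0.0):
        self.w, self.b = w, b

    def pull(self):
        return self.w, self.b

    def push(self, grad_w, grad_b, learning_rate):
        self.w -= learning_rate * grad_w
        self.b -= learning_rate * grad_b


class Worker:
    """Computes gradients on its own shard of the training data."""

    def __init__(self, shard):
        self.shard = shard

    def compute_gradients(self, w, b):
        n = len(self.shard)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in self.shard) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in self.shard) / n
        return grad_w, grad_b


def run_training(ps, workers, steps=2000, learning_rate=0.05):
    """Plays the coordinating role: one synchronized update per step."""
    for _ in range(steps):
        w, b = ps.pull()  # workers receive the latest values
        grads = [worker.compute_gradients(w, b) for worker in workers]
        avg_w = sum(g[0] for g in grads) / len(grads)
        avg_b = sum(g[1] for g in grads) / len(grads)
        ps.push(avg_w, avg_b, learning_rate)
    return ps.pull()


# Hypothetical data from y = 2x + 1, split evenly across two workers.
workers = [
    Worker([(0.0, 1.0), (1.0, 3.0)]),
    Worker([(2.0, 5.0), (3.0, 7.0)]),
]
final_w, final_b = run_training(ParameterServer(), workers)
print(round(final_w, 3), round(final_b, 3))
```

Because the shards are the same size, averaging the workers' gradients gives the same update as a single machine processing all of the data, which is why the result matches what single-node training would produce.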
Developers and data scientists take on the challenging tasks of designing models and writing TensorFlow applications that lend themselves to being distributed in this fashion, yet this is only the beginning. Deploying, running and maintaining distributed TensorFlow code on an actual cluster is a labor-intensive task without the help of DC/OS.
[caption id="" align="alignnone" width="960"] The primitives provided by TensorFlow help distribute work across a large cluster of machines[/caption]
The developer is responsible for defining a unique ClusterSpec for each deployment. The ClusterSpec consists of a list of IP addresses and ports where the different workers and parameter servers must be started. Machines must then be manually provisioned to match what has been defined in the ClusterSpec, and only then can code be deployed onto those machines and run. Even in a dynamic cloud-based environment, the ClusterSpec must be manually updated with every infrastructure change.
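As a rough illustration of what such a definition looks like, the sketch below builds the job-to-address mapping in plain Python. The helper `build_cluster_spec`, the IP addresses, and the port are made up for the example; with TensorFlow installed, a dictionary of this shape is what `tf.train.ClusterSpec` accepts.

```python
# Building the job-name -> address-list mapping that describes a cluster.
# The addresses and port below are hypothetical placeholders.

def build_cluster_spec(worker_hosts, ps_hosts, port=2222):
    """Return a dict mapping each job name to its list of host:port strings."""
    return {
        "worker": ["{}:{}".format(host, port) for host in worker_hosts],
        "ps": ["{}:{}".format(host, port) for host in ps_hosts],
    }


cluster = build_cluster_spec(
    worker_hosts=["10.0.0.1", "10.0.0.2"],
    ps_hosts=["10.0.0.3"],
)
print(cluster)

# With TensorFlow installed, the dictionary could then be wrapped as:
#   tf.train.ClusterSpec(cluster)
```

Every address in this mapping has to correspond to a machine that is actually running the matching task, which is why any change to the infrastructure forces a change to the definition.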
Moreover, a traditional TensorFlow implementation embeds the ClusterSpec within the deep learning model code. Configuring and fine-tuning operating parameters therefore requires an all-too-familiar cycle of repeatedly editing the ClusterSpec and restarting workers to test the modifications one by one. DC/OS automates ClusterSpec updates, relieving the data science team of this tedious and error-prone burden.
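One general way to decouple the cluster definition from the model code is to have the code read it from its environment at startup. The sketch below illustrates that pattern only; it is not a description of how the DC/OS package works internally. The environment variable name `CLUSTER_SPEC_JSON`, the function `load_cluster_spec`, and the addresses are all hypothetical.

```python
import json
import os

# Reading the cluster definition from the environment instead of hard-coding it.
# The variable name and addresses are hypothetical.

def load_cluster_spec(env_var="CLUSTER_SPEC_JSON"):
    """Parse a JSON cluster definition injected by whatever launches the task."""
    raw = os.environ.get(env_var)
    if raw is None:
        raise RuntimeError("{} is not set".format(env_var))
    spec = json.loads(raw)
    for job, addresses in spec.items():
        if not isinstance(addresses, list) or not all(
            isinstance(address, str) for address in addresses
        ):
            raise ValueError("job {!r} must map to a list of strings".format(job))
    return spec


# Stand-in for what a scheduler or launch script would set before starting the task.
os.environ["CLUSTER_SPEC_JSON"] = json.dumps(
    {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"], "ps": ["10.0.0.3:2222"]}
)

cluster = load_cluster_spec()
print(sorted(cluster))  # ['ps', 'worker']
```

With this arrangement the model code stays unchanged when machines are added or replaced; only the injected value differs between deployments.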
In addition, recovering from distributed TensorFlow failures is not graceful. If the master node or any of the many parameter servers or workers goes down for any reason, then there is nothing to bring it back online without manual intervention. DC/OS automates this manual effort, removing the need to touch every machine repeatedly in order to maintain a healthy distributed TensorFlow deployment.
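The kind of supervision that otherwise has to be done by hand can be pictured as a restart loop. The following is a toy, in-process sketch and not the mechanism DC/OS uses: `supervise` and `make_flaky_task` are hypothetical helpers, and a raised exception stands in for a crashed parameter server or worker.

```python
# Toy supervisor: rerun a task when it fails, up to a restart limit.
# Illustrative only; a real scheduler restarts processes on cluster machines.

def supervise(tasks, max_restarts=3):
    """Run each named task, restarting it on failure; return restart counts."""
    restarts = {}
    for name, task in tasks.items():
        attempts = 0
        while True:
            try:
                task()
                break
            except Exception:
                if attempts >= max_restarts:
                    raise  # give up and surface the failure
                attempts += 1
        restarts[name] = attempts
    return restarts


def make_flaky_task(failures_before_success):
    """Create a task that crashes a fixed number of times before succeeding."""
    state = {"remaining": failures_before_success}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated crash")

    return task


tasks = {"worker-0": make_flaky_task(2), "ps-0": make_flaky_task(0)}
restarts = supervise(tasks)
print(restarts)  # {'worker-0': 2, 'ps-0': 0}
```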
Benefits of running Distributed TensorFlow on DC/OS
The new beta release of TensorFlow on DC/OS helps solve each of the problems outlined above and more. Specifically it helps to:
The command to launch TensorFlow with this config would be:
dcos package install beta-tensorflow --options=<path/to/config.json>
The package can also be deployed from the DC/OS service catalog by specifying these parameters in the UI.
Learn More About Distributed TensorFlow on DC/OS
Current DC/OS users can install TensorFlow directly from the Mesosphere DC/OS Service Catalog. If you're not already running DC/OS, download it here.
To learn more about deploying TensorFlow models on DC/OS, please watch Running Distributed TensorFlow on DC/OS from MesosCon Europe and check out the example tutorial.