By Kevin Klues, Sam Pringle and Jörg Schad
Today, we are excited to announce the beta release of TensorFlow in the Mesosphere DC/OS Service Catalog
. Using a single command, you can now deploy distributed TensorFlow on any bare-metal, virtual, or public cloud infrastructure. As with other packages available for DC/OS, the new TensorFlow package also includes the ability to use GPUs to accelerate your machine learning and deep learning applications.
In the race to leverage deep learning capabilities, data scientists specializing in deep learning are highly sought after. An efficient data science infrastructure allows you to attract the best data scientists and get the best work out of them, which gives your business a strategic advantage over competitors. The addition of distributed TensorFlow to DC/OS furthers Mesosphere's commitment to empowering developers, operators and data scientists
In this blog post, we provide a brief introduction to TensorFlow, walk through the challenges of running TensorFlow in a distributed setting, and talk about how our new DC/OS TensorFlow package addresses these challenges in full. Running distributed TensorFlow on DC/OS, even as a beta package under active improvement, provides a simple and easy to use experience for running distributed TensorFlow on the market today.
(Quick) Introduction to TensorFlow
is an extremely popular open source library for machine learning originally developed by the Google Brain team. In fact, TensorFlow was the #1 most forked GitHub project of 2015 and has remained in the top 10 most-forked projects ever since. TensorFlow's popularity stems from its ability to simplify the development and training of deep neural networks using a computational model based on dataflow graphs.
In the example above, the input layer is responsible for finding patterns of local contrast, hidden layer 1 is responsible for finding individual facial features based on those contrasts, and hidden layer 2 is responsible for identifying entire faces based on those facial features. Source: https://www.edureka.co/blog/what-is-deep-learning[/caption]
In general, the lifecycle of deep neural networks go through two distinct phases: training and inference. In our example, the training phase consists of feeding the neural network with thousands of images in order to train it to recognize faces. This training could take hours, days, or even weeks to complete depending on a variety of factors such as the size of the data set, the complexity of the model, and the performance of the hardware. Once this training is complete however, the neural network can be used to "instantaneously" identify a face in an image.
The figure below shows the training and inference processes in more detail:
TensorFlow eases the development of such deep neural networks by providing basic machine learning primitives that you can integrate directly into your code. TensorFlow provides these primitives in the form of a library, with bindings into multiple popular languages (e.g. C/C++, Go, Java, and Python). Additionally, TensorFlow automatically figures out the best processing unit (CPU, GPU, TPU, etc.) to run your code on.
Developing a TensorFlow application in Python that runs on a combination CPUs and GPUs.[/caption]
Single-Node vs. Distributed TensorFlow
Designing and implementing a deep neural network (even with the help of TensorFlow) is no small feat. Data scientists must first build machine learning models that lend themselves to distributed computation, map them onto deep neural networks, and then write the code to power the new model. They also must decide whether it is worth the effort to define and implement their deep neural network in a distributed fashion, or simply design it to run on a single workstation.
Designing a deep neural network for single-node computation is often simpler than designing it for distributed computation, but takes quite a bit longer to train. On the other hand, designing a deep neural network for distributed computation can be much more complex, but the ability to spread work across many machines cuts training time from months to days, hours, or less.
Challenges with deploying distributed TensorFlow
Organizations deploying distributed TensorFlow applications encounter a number of challenges that are solved transparently by running the service on DC/OS.
Running distributed computations in TensorFlow requires understanding the complex interactions between many different components. Parameter Servers feed the most up-to-date values to Workers that perform the computations while the Master coordinates and synchronizes all of this distributed effort.
Developers and data scientists take on the challenging tasks of designing models and writing TensorFlow applications that lend themselves to being distributed in this fashion, yet this is only the beginning. Deploying, running and maintaining distributed TensorFlow code on an actual cluster is a labor-intensive task without the help of DC/OS.
The primitives provided by TensorFlow help distribute work across a large cluster of machines[/caption]
The developer is responsible for defining a unique ClusterSpec for each deployment that consists of a list of IP addresses and ports where different workers and parameter servers must be started. Machines must then be manually provisioned consistent with what's already been defined in the ClusterSpec and finally code can be deployed onto those machines and run. Even in a dynamic cloud-based environment, the ClusterSpec must be manually updated with every infrastructure change.
However, a traditional TensorFlow implementation embeds the ClusterSpec within the deep learning model code. Therefore, configuring and fine-tuning operating parameters requires an all too familiar cycle of repeatedly editing the ClusterSpec and restarting workers to test the modifications one by one. DC/OS automates ClusterSpec updates, alleviating this tedious and error prone burden from the data science team.
In addition, recovering from distributed TensorFlow failures is not graceful. If the master node or any of the many parameter servers or workers goes down for any reason, then there is nothing to bring it back online without manual intervention. DC/OS automates this manual effort, removing the need to touch every machine repeatedly in order to maintain a healthy distributed TensorFlow deployment.
Benefits of running Distributed TensorFlow on DC/OS
The new beta release of TensorFlow on DC/OS helps solve each of the problems outlined above and more. Specifically it helps to:
- Simplify the deployment of distributed TensorFlow: Deploying a distributed TensorFlow cluster with all of its components on any infrastructure, whether it's baremetal, virtual or public cloud is as simple as passing a JSON file to a single CLI command. Updating and tweaking parameters to fine-tune and optimize becomes trivial.
- Share infrastructure across teams: DC/OS allows multiple teams to share the infrastructure and launch multiple TensorFlow jobs while maintaining complete resource isolation. Once a TensorFlow job is done, capacity is released and made available to other teams.
- Deploy different TensorFlow versions on the same cluster: As with many DC/OS services, you can easily deploy multiple instances of a services, each with a different version, on the same cluster. This means that when a new version of TensorFlow is released, one team can take advantage of the latest features and capabilities without running the risk of breaking another team's code.
- Allocate GPUs dynamically: GPUs greatly increase the speed of deep learning models, especially during training. However, GPUs are precious resources that must be efficiently utilized. Since DC/OS automatically detects all GPUs on a cluster, GPU-based scheduling can be used to allow TensorFlow to request all or some of the GPU resources on a per job basis (similar to requesting CPU, memory, and disk resources). Once the job is complete, the GPU resources are released and made available to other jobs.
- Focus on model development, not deployment: DC/OS separates the model development from the cluster configuration by eliminating the need to manually introduce a ClusterSpec in the model code. Instead, the person deploying the TensorFlow package specifies the properties of the various workers and parameter servers they would like their model to run with, and the package generates a unique ClusterSpec for it at deploy-time. Under the hood the package finds a set of machines to run each worker / parameter server on, populates a ClusterSpec with the appropriate values, starts each parameter server and worker task, and passes it the generated ClusterSpec. The developer simply writes his code expecting this object to be populated, and the package takes care of the rest.The figure below shows a JSON snippet that can be used to deploy a TensorFlow package from the DC/OS CLI with a mix of CPU and GPU workers.
- The command to launch TensorFlow with this config would be:
dcos package install beta-tensorflow --options=<path/to/config.json>
- The package can also be deployed from the DC/OS service catalog by specifying these parameters in the UI.
- Automate failure recovery: The TensorFlow package is written using the DC/OS SDK and leverages built-in resiliency features including automatic restart so that failed tasks effectively self-heal.
- Deploy job configuration parameters securely at runtime: The DC/OS secrets service dynamically deploys credentials and confidential configuration options to each TensorFlow instance at runtime. Operators can easily add credentials to access confidential information or specific configuration URLs without exposing them in the model code.
Learn More About Distributed TensorFlow on DC/OS