Tutorial: Deep learning with TensorFlow, Nvidia and Apache Mesos (DC/OS), Part 2
Try out TensorFlow with GPU acceleration on DC/OS.
A video version of this tutorial accompanies this post.
In part 2 of this tutorial, we'll:
- Install the TensorFlow service without GPUs.
- Run a neural network example.
- Install TensorFlow with GPUs.
- Run the same neural network example.
- Run an example that uses multiple GPUs.
Run TensorFlow on DC/OS without GPUs
First, let's see how easy it is to use TensorFlow on DC/OS, even without GPUs.
Prerequisites
- A DC/OS cluster with 1 private agent with 4 CPUs and 1 public agent with 8 CPUs and 8 Nvidia Tesla K80 GPUs.
- The DC/OS CLI installed.
Deploy the TensorFlow service
First, let's get TensorFlow running on your DC/OS cluster.
- Go to the Services tab of the DC/OS UI.
- Click + to add a service.
- Choose Single Container.
- Toggle to the JSON Editor and paste the following application definition into the editor. This application definition specifies no GPUs and the standard TensorFlow Docker image.
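A minimal sketch of such a definition (the id matches the task name used below; the CPU and memory values are illustrative, and port settings are omitted):
{
  "id": "/my-tensorflow-no-gpus",
  "cpus": 4,
  "mem": 4096,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "tensorflow/tensorflow"
    }
  }
}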
- Click Review and Run, then Run Service.
Run a TensorFlow example
- Exec into the TensorFlow container from the DC/OS CLI. This command allows you to execute commands inside the container and stream the output to your local terminal.
dcos task exec -it my-tensorflow-no-gpus bash
- Now, let's get some examples to run. Install git and then clone the TensorFlow-Examples repository.
apt-get update; apt-get install -y git
git clone https://github.com/aymericdamien/TensorFlow-Examples
- Run and time the same example you ran locally in the last tutorial, the convolutional network example.
cd TensorFlow-Examples/examples/3_NeuralNetworks
time python convolutional_network.py
This took my DC/OS cluster 11 minutes.
Run TensorFlow on DC/OS with GPUs
Deploy the TensorFlow service with GPUs
Now that you've got TensorFlow examples running on your cluster, let's see how performance compares when you configure your service to use GPUs.
- Go to the Services tab of the DC/OS UI.
- Click + to add a service.
- Choose Single Container.
- Toggle to the JSON Editor and paste the following application definition into the editor. This application definition requests 4 of the node's GPUs and uses the GPU-enabled TensorFlow Docker image.
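A minimal sketch of such a definition (the CPU and memory values are illustrative, and port settings are omitted). GPU tasks on DC/OS run under the Universal Container Runtime rather than the Docker Engine, hence "type": "MESOS"; and because this cluster's GPUs sit on the public agent, acceptedResourceRoles lets the task be scheduled there:
{
  "id": "/tensorflow-gpus-1",
  "cpus": 4,
  "gpus": 4,
  "mem": 4096,
  "instances": 1,
  "acceptedResourceRoles": ["slave_public"],
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}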
- Click Review and Run, then Run Service.
Verify access to GPUs
You'll recall that we created a cluster with a public agent that has 8 GPUs, but only requested access to 4. Let's verify that the node has 8 GPUs, and that our service has access to only 4 of them.
- First, use dcos task exec to run a command inside the container that returns the public IP address of the agent node the container is running on.
dcos task exec tensorflow-gpus-1 curl -s ifconfig.co
- Now, use that public IP to SSH into the node and run nvidia-smi to verify the number of GPUs the node has.
ssh <public-ip> nvidia-smi
- Run dcos task exec with the bash option to get a shell inside of your service's container.
dcos task exec -it tensorflow-gpus-1 bash
- Set up environment variables so you can run nvidia-smi from within this shell.
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
export PATH=$PATH:/usr/local/nvidia/bin
- Run nvidia-smi to verify that even though the machine has 8 GPUs installed, you only have access to 4 of them inside this container.
nvidia-smi
Run a TensorFlow example with GPUs
Now that you've installed TensorFlow and verified your access to 4 GPUs, let's run the same example as before.
- If you exited the tensorflow-gpus-1 container, reenter it and set up the environment variables by following the steps in the last section.
- Install git and clone the TensorFlow-Examples repository.
apt-get update; apt-get install -y git
git clone https://github.com/aymericdamien/TensorFlow-Examples
- Run and time the same example you ran earlier, the convolutional network example.
cd TensorFlow-Examples/examples/3_NeuralNetworks
time python convolutional_network.py
- Watch the code find the GPUs and execute; to double-check which devices TensorFlow sees, try the sketch below.
This took my DC/OS cluster about 2 minutes, roughly 5 times faster than before!
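To list the devices visible to TensorFlow from inside the container, here is a minimal sketch using the TensorFlow 1.x API that ships in the image (run it in a Python shell):
# List the devices TensorFlow can see; expect 4 GPU entries here.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)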
Launch two TensorFlow instances
You'll recall that we have a cluster with 8 GPUs, but we only requested access to 4 of them. Now, let's launch a second TensorFlow instance that will consume the remaining 4 GPUs in parallel with the first.
Running more than one TensorFlow instance in parallel shows that you can have multiple users on the same cluster with isolated access to the GPUs on it.
- Add a third service to your DC/OS cluster with the following application definition, which is similar to the first application definition with GPUs.
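A minimal sketch, mirroring the previous GPU definition (the CPU and memory values are again illustrative):
{
  "id": "/tensorflow-gpus-2",
  "cpus": 4,
  "gpus": 4,
  "mem": 4096,
  "instances": 1,
  "acceptedResourceRoles": ["slave_public"],
  "container": {
    "type": "MESOS",
    "docker": {
      "image": "tensorflow/tensorflow:latest-gpu"
    }
  }
}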
- Verify that your second TensorFlow instance is running by accessing the Jupyter notebook that runs by default on the TensorFlow Docker image. In the application definition above, the acceptedResourceRoles parameter is set to slave_public, which gives us access to the public IP of the agents where the containers are running.
- Get the public IP of the agent where the task has been launched.
dcos task exec tensorflow-gpus-2 curl -s ifconfig.co
- Go to the STDERR log of the service to get the Jupyter URL: Services > tensorflow-gpus-2 > task-id > paper icon > ERROR (STDERR). You will see a message similar to the following.
Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:10144/?token=d4f3d8f80eb97299e74b5254d1600c480c3f042d548e51f5
- Replace localhost with the public IP you found earlier to see the Jupyter notebook.
- Click the Getting Started notebook and run some commands; the multi-GPU sketch after this list is another thing to try.
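Since this second instance holds the cluster's remaining 4 GPUs, one thing to try in the notebook is a computation spread across more than one of them. A minimal multi-GPU sketch using the TensorFlow 1.x API (device indices and matrix sizes are illustrative):
import tensorflow as tf

# Run one matrix multiplication per GPU, then sum the results on the CPU.
results = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        a = tf.random_normal([1000, 1000])
        b = tf.random_normal([1000, 1000])
        results.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    total = tf.add_n(results)

# log_device_placement prints which device each op actually ran on.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(total))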
Thanks for playing along at home!
The next post in the series will show you how to use DC/OS to dynamically request cluster resources and launch a distributed TensorFlow job across multiple agents. When that job completes, the resources it had used are automatically released back to the cluster and made available to other jobs. This dramatically increases efficiency in comparison to traditional TensorFlow deployment strategies.