Launch aws ec2 p2 instance (use spot instance for lower price).
Note: this guide tested on Ubuntu Server 16.04 LTS (HVM) - ami-6f587e1c, Ireland.

Connect to your instance.

ssh -i aws-spot.pem [email protected]

Note: you need to replace aws-spot.pem and 54.194.157.2 with your values.

Install some nvidia stuff.

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.20/NVIDIA-Linux-x86_64-375.20.run
wget http://developer.download.nvidia.com/compute/redist/cudnn/v5.1/cudnn-8.0-linux-x64-v5.1.tgz
wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda_8.0.44_linux-run

sudo apt-get update && \
sudo apt-get install -y build-essential linux-image-extra-`uname -r`

chmod +x ./NVIDIA-Linux-x86_64-375.20.run
# This command should open interactive menu. You need to accept license and press few times [Enter].
sudo ./NVIDIA-Linux-x86_64-375.20.run

chmod +x cuda_8.0.44_linux-run
./cuda_8.0.44_linux-run --extract=`pwd`/extracts

# another interactive installation
sudo ./extracts/cuda-linux64-rel-8.0.44-21122537.run
# 1. press **q**
# 2. enter **accept**
# 3. use defaul install path
# 4. don't add desktop menu shortcuts
# 5. yes, create a symbolic link

echo -e "export CUDA_HOME=/usr/local/cuda\nexport PATH=\$PATH:\$CUDA_HOME/bin\nexport LD_LIBRARY_PATH=\$LD_LINKER_PATH:\$CUDA_HOME/lib64" >> ~/.bashrc
source ~/.bashrc
tar xf cudnn-8.0-linux-x64-v5.1.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/

Optimizing GPU Settings official aws docs.

sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875

Install python with tensorflow.

sudo apt-get install -y python-pip python-dev
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-0.12.1-cp27-none-linux_x86_64.whl
sudo pip install --upgrade $TF_BINARY_URL

Done! Let’s check simple tensorflow gpu example.

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print sess.run(c)

Create AMI for further fast launching.
That’s all.