Unix & Me: Deep Learning

Showing posts with label Deep Learning. Show all posts

Friday, March 1, 2019

Network Visualization with Intermediate Layer Shapes

Network visualization is very important. Looking at the network graph is much easier than reading through the network description. In this post, I will discuss how to create visual graph of a (Pytorch) network using Netron and include intermediate layer shapes as well.

Netron is an excellent tool for network visualization. At the moment, however, Netron does not support Pytorch natively (experimental feature but not stable). The best thing is to convert Pytorch model to ONNX and then use Netron to graph it.

The following is an example code that graphs ResNet50.
import torch
from torchvision import models
import onnx
from onnx import shape_inference

DEVICE = 'cuda:1'
PATH = 'resnet.onnx'
model = models.resnet50().to(DEVICE)
model.eval()
dummy_input = torch.randn(1,3,224,224).to(DEVICE)
torch.onnx.export(model, dummy_input, PATH, verbose=False)
onnx.save(shape_inference.infer_shapes(onnx.load(PATH)), PATH) # this is required for displaying intermediate shapes

To graph it, simply download Netron and run
$ netron resnet.onnx

That's it!

Tuesday, August 14, 2018

Multi-GPU on Pytorch

After some time, I finally figured out how to run multi-gpu on pytorch. In fact, multi-gpu API is just extremely simple in pytorch; the problem was my system.

Here is a simple test code to try out multi-gpu on pytorch. If this works about of the box, then you are good. However, some people may face problems, as discussed in this forum. As pointed out here, the problem is not about pytorch, but with external factor. In my case, it was ngimel 's comment that saved me. To recap her solution,

1. Test p2pBandwithLatencyTest from CUDA samples and make sure it works fine. If it does not pass this one, then the problem is with CUDA installation, etc, and not with pytorch. To download samples, simply run

$ cuda-install-samples-9.2.sh <target_path>

where you would replace the version above to whatever version you have. Then,

$ cd <target_path>/NVIDIA_CUDA-9.2_Samples/1_Utilities/p2pBandwidthLatencyTest/
$ make
$ ./p2pBandwidthLatencyTest

2. In my case, it was IOMMU that was the culprit. Disable it by editing /etc/default/grub and replace
#GRUB_CMDLINE_LINUX=""
with
GRUB_CMDLINE_LINUX="iommu=soft"

Then update grup
$ sudo update-grup

Then reboot

This is how I solved my problem. I love this open source community forum! Thank you everyone!

Deep Neural Network Tips and Tricks

I just want to scribble down some of the things I have learned from my own experience in training deep neural networks. Hope this helps others too.

1. Optimizer: use SGD with momentum. If momentum is too high, you may experience validation error greater than train error even if it is not overfit. This is because for each epoch, the momentum starts with zero but builds up as more batches are trained, and at the end of epoch, you may experience gradient explosion, which leads to large validation error. Typical value of momentum is 0.9

2. Gradient clipping: always use gradient clipping to prevent gradient explosion. This saves a lot of time because you don't need to manually tune learning rate constantly while training. Typical value is of 10 or lower

3. Learning rate: in theory, as large as it can be, given that it is small enough to prevent gradient explosion. However, this is just too much of work to adjust learning rate during the training, so simply set it high enough and use gradient clipping to prevent gradient explosion. Typical value is 0.001 or lower

4. Input normalization: to facilitate training, normalize the input data to have zero-mean and unit standard deviation.

5. Batch normalization: employ batch normalization layers. These layers are especially very helpful for deep-networks.

6. Drop out: although drop out is not needed when batch norm is employed, one can still employ small dropout (~0.1) for multiple layers. I think this is better than one or two large dropout (~0.5). If data size is small compared to network, and one needs extra measure to prevent overfitting, drop out layers are useful

7. L2 weight decay: not necessary, but still useful as an option. Typical value of 1e-5 should be fine

8. Short-cuts: shortcuts are extremely useful for deep neural networks. Most popular implementation is perhaps residual blocks. Full pre-activation may be the best choice as illustrated here

9. The output size y of convolution given input size x is
y = (x - kernel + padding*2)/stride + 1

10. For 2D convolutions, using small-kernel convolutions many times is more beneficial than using one large-kernel convolution. For example, using 3 convolutions of 3x3 kernel having depth d results in 3^3 * d = 27d parameters, whereas 1 convolution of 7x7 kernel having the same depth d results in 7^2 * d = 49d parameters. Note that both in both cases the receptive size is 7x7, while with the former case, we can employ 3 activation layers, while the latter we can only get 1 activation layer. Therefore, it is usually believed that the former should be more effective in learning. However, for 1D convolution, the former case requires more parameters than the latter

11. Activation layers: typically ReLU layers are used, but ELU may be a good alternative. It is recommended to employ clipping on those unbounded activation layers. i.e., use y = clamp(x, min=0, max=5) in place of ReLU layers to prevent too large values

12. Training history: it is very important to save loss and accuracy history during the training for both training data and validation data. This is because the training history tells us a lot about it. Usually, in the beginning of the training the validation error should be less than training error, since the error is the running average for the training, while the validation error is the error at the end of the training stage in the epoch. However, as time goes by, the training error should be less than validation error, because it will be slightly overfit. That is a good time to either stop the training, to prevent overfitting, or take additional measure. Also, when the improvement flattens, it is a good indicator to lower LR.

I will continue to add more as I gain more experienced.

Wednesday, July 11, 2018

Pytorch Implementation of BatchNorm

Batch Normalization is a really cool trick to speed up training of very deep and complex neural network. Although Pytorch has its own implementation of this in the backend, I wanted to implement it manually just to make sure that I understand this correctly. Below is my implementation on top of Pytorch's dcgan example (BN class starts at line 103)

Although this implementation is very crude, it seems to work well when tested with this example. To run this, type in
$ python main.py --cuda --dataset cifar10 --dataroot .

Thursday, April 26, 2018

Dimensionality Reduction and Scattered Data Visualization with MNIST

We live in 3D world, and we can only view scattered data in 1D, 2D, or 3D. Yet, we deal with data that have very large dimension. Consider MNIST dataset, which is considered to be a toy example in deep learning field, consists of 28 X 28 gray images; that is 784 dimensions.

How would this MNIST data look like in 2D or 3D after dimensionality reduction? Let's figure it out! I am going to write the code in Pytorch. I have to say, Pytorch is so much better than other deep learning libraries, such as Theano or Tensorflow. Of course, it is just my personal opinion, so let's not get into this argument here.

What I want to do is to take Pytorch's MNIST example found here, and make some modifications to reduce the data dimension to 2D and plot scattered data. This will be a very good example that shows how to do all the following in Pytorch:
1. Create a custom network
2. Create a custom layer
3. Transfer learning from an existing model
4. Save and load a model

Here is the plot I get from running the code below.

This code is tested on Pytorch 0.3.1.

Wednesday, November 8, 2017

Setting up Your Web App with Flask

With Python Flask, you can easily setup a web application that is interactive. In this tutorial, we will build a simple deep neural network server that will classify the given image into one of the 1000 categories. When done, your app will look like

Let's create a folder where all your web app files will reside.
$ mkdir ~/web_app
$ cd ~/web_app

Install Flas, Keras and OpenCV modules
$ pip install flask keras opencv-python

Next, create two more folders as below
$ mkdir templates uploads

Create server.py and copy the code below:

Create templates/index.html file and copy the code below:

Create templates/predict.html file and copy the code below:

That's it! Your web app will classify a given image---either uploaded directly from the client or using the web url---using ResNet50 pre-trained network.

To run the server, run
$ python server.py

While the server is running, you can browse to http://SERVER_IP:8888 to view the web app, where of course SERVER_IP must be replaced with the server's actual IP address.

Wednesday, November 1, 2017

Running Keras Deep Neural Network Inference Web Server using WebDNN

Web app is a great tool for simple demo across any platforms. I find it very useful because I can easily show my trained model's output to anyone like a breeze. The catch, however, would be setting up the server, but once the server is up and running, it cannot be any better to boast in your resume and show off to your boss with your trained model.

Here is a tutorial for setting up a simple neural network inference server using WebDNN library. Although its Github page and official documentations are very clear, I still had to spend some time to get it to work. In addition, the official documentation instructs users to compile and install emscripten from source, which will consume quite some time; I will show you how to get around this with install pre-compiled version.

First, one needs to clone WebDNN repository from Github:
$ git clone https://github.com/mil-tokyo/webdnn.git && cd webdnn

Note that WebDNN only supports Python3.6+, so you need to install this unless you already have it on the system. To check the your Python3 version, run
$ python3 --version

Make sure that it is 3.6+. There are plenty of resources that you can search on Google on how to install Python 3.6+. For instance, on Mac OS X, using homebrew is probably the easiest way:
$ brew install python3

Once you have Python 3.6+, it is a good idea to create virtual environment for what you will need to do.
$ virtualenv -p `which python3` python3

Now, activate the environment and install necessary packages
$ source python3/bin/activate
$ pip install tensorflow-gpu keras h5py

The Python environment is now complete. You now need to setup emscripten environment.
$ git clone https://github.com/juj/emsdk.git && cd emsdk
$ ./emsdk install latest
$ ./emsdk activate latest
$ source ./emsdk_env.sh

Now, we need to install eigen library.
$ wget http://bitbucket.org/eigen/eigen/get/3.3.3.tar.bz2
$ tar jxf 3.3.3.tar.bz2
$ export CPLUS_INCLUDE_PATH=$PWD/eigen-eigen-67e894c6cd8f
$ cd ..

Finally, we are ready. Let's first create pre-trained ResNet Keras model that we will use. Run Python and run the following lines
$ python
>>> from keras.applications import resnet50
>>> model = resnet50.ResNet50(include_top=True, weights='imagenet')
>>> model.save("resnet50.h5")

Exit Python and run the following
$ python ./bin/convert_keras.py resnet50.h5 --input_shape '(1,224,224,3)' --out output

After some time, it will generates files in output directory. To run the server, we first need to modify the example/resnet/script.js file. We must make sure to point to the correct directory by changing the weight path. For the version I have, I modified line 39 to point to output directory.
let runner = await WebDNN.load(`/output`, {backendOrder: backend_name});

Finally, we are ready to run the server. To start the server, run the following in webdnn directory
$ python -m http.server

On your web browser, go to address localhost:8000/example/resnet/index.html

You should be able to test ResNet50 model!

Tuesday, October 24, 2017

Using RAM Disk for Expediting Neural Net Training

These days I am constantly experimenting models with different architecture / hyper parameters. Because I am working with lots of images, I realized that it takes quite a bit of time for loading training or validation images every epoch from the disk. Yes, I am using a solid state drive, but it is still slow compared to RAM. To speed up the training, I have been saving images into the RAM in my Python code, which definitely expedites the training.

However, there are two main issues I see. The first is that raw training files are usually in JPEG images, which do not take up much space. However, when I am saving the image data in the Python code, I save the images in the numpy array of bitmap format, which significantly expands the storage space.

The second issue is that I have two GPUs in the server, thus sharing the CPU and RAM. When I am running different models on each GPU, but they share the same set of training data, I would have two copies of the exact same dataset in memory.

So, I was looking for a solution, and I found one that is very easy to implement. On Linux, there is a way to create a RAM disk, which is basically chunk of data saved in RAM but the system treats it as if it is a disk. Basically, I would mount a RAM disk and have the programs access the data from this RAM disk mount location.

Here is how to do it. The instructions below are based on this excellent article. First, create a folder to which you will mount the RAM disk.
$ sudo mkdir /mnt/ramdisk

Next, mount the RAM disk with specified space. For example,
$ sudo mount -t tmpfs -o size=1024m tmpfs /mnt/ramdisk

Finally, copy your training data into this folder and make sure to have your training code point to the new location.
$ cp -r /your/training/data /mnt/ramdisk

Happy training!

** Something to keep in mind **
- This is RAM, so every time your system reboots, the content will be gone
- Make sure that you have enough free RAM so that your data can fit in

Saturday, September 30, 2017

Tensorflow Fundamentals - K-means Cluster Part 2

From the previous post, I have shown how to calculate k-mean cluster using Tensorflow. In this post, I will add a bit more advanced implementations. In particular, I will show you how to implement conditional statement in Tensorflow.

The difference is at guessing the initial set of centroids. In the previous implementation, I simply chose k random points as initial centroids. Here, instead, I am selecting the first centroid to be the point furthest away from the origin. Next ith initial centroid for i = {1,2,...,k} is chosen such that the sum of the distances from previous i-1 centroids is the largest. This way, we can significantly reduce iteration number required to achieve the final state.

Saturday, September 23, 2017

Building Deep Learning Machine Under $2000 with Dual GTX 1080 GPUs

With my experimental model getting larger and larger, training takes too long time. This is especially true as most of my model is vision-based, so it requires a lot of computation and memory. Yes, I could use cloud computing, such as AWS or GCP, so I did some calculation.

The cheapest monthly cost I found for an instance with 2x GTX 1080 Ti GPUs is $500 (AWS or GCP costs much higher). In just four months of using the service, I would spend $2000 on the cloud service.

Instead, I could spend $2000 once building my own system with 2 GPUs, and some $50 or less each month for electric bill and train two models simultaneously. I could even sell the rig later on when I need to upgrade. I am expecting resale value of 1/3 to 1/2 of my system in 2 years.

The answer is quite obvious at this point. I need to build my own rig. After some research, below is the list of parts and justification if necessary.

CPU: AMD Ryzen 1700
This is a 8-core 16-thread processor from AMD. Since most of the computation during training is performed by GPU and not CPU, I did not want to spend more than $300 on CPU. I debated whether to get even cheaper one, 1600, which has 6-cores and 12-threads with higher clock speed. This could be a better option for neural network training. They are both good options. However, at the time of buying, I could not get Ryzen 1600 at its retail price, because 1600 was in such a high demand.

I did not get Intel CPU because it is over-priced at the moment, as the new generation Coffee Lake is imminent. If I could wait a few more months, maybe Coffee Lake processors could be much better candidates than Kaby Lake.

GPU: NVIDIA GTX 1080
This was a tough call. I could get 1060 6G, 1070 8G, 1080 8G, or 1080 Ti 11G. The best bang for the buck would be 1060 6G, but I wanted more VRAM than 6GB. Next up is 1070 8G, but this was too expensive at the time due to high demand, costing around $500. Next up is 1080 8G, which is around $550 with more than 15% boost in performance. Next up is 1080 Ti 11G at $750, but this is too expansive compared to 1080 8G and the performance gain does not justify it. I therefore went with GTX 1080 8G. In fact, I got 2x GTX 1080 8G to train two models simultaneously. If you are willing to spend extra $500, you could go with 2x GTX 1080 Ti 11G.

AMD GPUs were not considered, as most deep learning libraries do not fully support AMD GPUs at the moment. I really hope AMD catches up with GPGPU support for deep learning libraries soon.

Mainboard: ASRock Fatal1ty X370 Gaming K4
This was one of the cheapest mainboards that support AMD Ryzen series CPUs and 2x PCI Express3 8 lanes each. Since I was getting two GPUs, I wanted to make sure that both GPUs get at least PCI Express3 8 lanes.

Yes I could have chosen CPU and mainboard to support dual PCI Express3 16 lanes, but this would sky-rocket my rig cost, and I don't think there will be much performance difference between PCI Express3 8 lanes vs 16 lanes for GTX 1080 graphics cards (source). If you are getting GTX 1080 Ti series, then perhaps you may want to opt for high-end CPU that supports PCI Express 32+ lanes and mainboard to fully support PCI Express3 16 lanes for each GPU, but you would have to spend $3000 or so on the system.

RAM: 2x DDR4 2400 8G
I will get more RAM when the cost goes down a bit. Currently the memory price is just too expansive.

SSD: Samsung Evo 860 500G
Just get a decent SSD with >= 500G. Absolutely no HDD, as this will significantly lower the performance. Samsung's SSDs are renowned for speed and stability.

Power Supply: 850W Gold-rated
Maybe 850W is too much for my config, but it is always better to choose power supply with abundant output. A cheap low-quality low-power output supply can actually destroy the whole system! I roughly estimated 100W for CPU, 200W for each GPU, and 100W for the rest. This is 600W in total, and I wanted 200W margin just to be safe, but 700W+ should have worked just fine. For dual GTX 1080 Ti configuration, you may want to get at least 850W or more.

Case: ATX Mid-Tower
Choose whatever you like as long as it is large enough to fit two GPUs and the motherboard. Most of ATX mid-tower cases should do.

Cooling
Note that you should select a case with lots of fans and ventilation for cooling. I made a mistake of getting a case that wasn't so good in cooling, and the GPU temperature went up to 90C or more, so I had to buy and install additional fans to cool them down. It is very important to keep them cool enough, probably below 85C at full load. With GTX 1080 Ti GPUs, I imagine that cooling will be even more critical.

OK, so the grand total excluding monitor/mouse/keyboards, etc is a bit more than $1900 before tax. With this config, you can train a network that requires up to 16GB of VRAM, since you have 2x GTX 1080 8G, although you will need to make sure to distribute the work load between the two GPUs manually in the code.

I installed Ubuntu 16.04 LTS for now, although I may switch to Cent OS later on. After installing NVIDIA CUDA toolkit, I can successfully detect both GPUs and use them simultaneously with no problem. I did not connect them with SLI though, since it is not needed for my purpose.

Good luck with configuring your system!

Tuesday, September 19, 2017

Tensorflow Fundamentals - K-means Cluster Part 1

Now that we are familiar with Tensorflow, let us actually write code. For this series of posts, we are going to implement K-means clustering algorithm with Tensorflow.

K-means clustering algorithm is to divide a set of points into k-clusters. The simplest algorithm is
1. choose k random points
2. cluster all points into corresponding k groups, where each point in the group is closest to the centroid
3. update the centroids by finding geometric centroids of the clusters
4. repeat steps 2 & 3 until satisfied

Below is my bare-minimum implementation in Tensorflow.

Tensorflow Fundamentals - Interactive Session

As I have discussed in my previous posts, Tensorflow's computation graphs will not evaluate the expression unless one explicitly asks it to do so. One may find this quite annoying during debugging, so Tensorflow provides what is called Interactive Session, which let's you evaluate the expressions as you go, with minimal code.

This is really simple; just call
sess = tf.InteractiveSession()

in the beginning, and this will act like the block
with tf.Session() as sess:

See the code below.

Sunday, September 17, 2017

Tensorflow Fundamentals - Computation Graph Part 3

Next up, we want to now evaluate an expression from given input values. Let us construct a function (graph)

f(x,y) = x + y

where x,y are input values to fed into the graph f. To do this, we need to use tf.placeholder methods to define input nodes, and supply feed_dict parameter to run method as shown below:

The code is easy enough to be self-explanatory. The output of the code shall look similar to

$ python tf_computation_graph_p3.py 2>/dev/null
8 + -6 = 2
-6 + 4 = -2
-10 + -1 = -11
-6 + 0 = -6
5 + 9 = 14
7 + 6 = 13
3 + 8 = 11
3 + 6 = 9
5 + -4 = 1
0 + -3 = -3

Tensorflow Fundamentals - Computation Graph Part 2

Let's continue our journey on Tensorflow's computation graph.

We will now make use of Tensorflow's variables. They are different from constant in that
1. they are mutable during the execution, i.e., they can change their values
2. they will store their state (value), which shall equal to that from the lastest execution

For example, you can define a counter variable that will increment its value on each execution:
counter = counter + 1

Below is the simple demo code for doing this in Tensorflow:

The only catch here is that
1. tf.assign is another operation that will assign a new value to the variable on each execution (run)
2. one must initialize all variables

Running it will output
0
1
2
3
4
5

So far so good!

Tensorflow Fundamentals - Computation Graph Part 1

This post is intended for anyone, including myself, who is having difficulty grasping the very basic concepts of Tensorflow: Computation Graph. This post is heavily based on Tensorflow's official documentation.

The way Tensorflow and Theano operates is just different from all those other ones in that there are two distinct phases. In the first phase one builds the computation graph that defines the computations to be performed. It is like a function where given whatever inputs, it will spit out outputs. For example, let us say you define
f(x,y) = x + y

Then f is the computation graph in Tensroflow. This phase is called construction phase.

In the second phase, one executes the graph by feeding in the inputs. For example,
f(1,2) = 3
f(3,2) = 5

and so on for any x,y pairs you feed in. The graph will output the numerical values given numerical inputs. This phase is called execution phase.

One difference, however, is that not only the mathematical operations, such as additions, subtractions, multiplications, and divisions but also each variables are considered as operation nodes in Tensorflow's computation graphs. Thus, with the example above, we now have three ops:
x
y
+

Here, x y are constant ops, meaning that their values will be some constant directly fed in during the execution phase. The output of the graph f can be fed into a more complex graph.

Let's do a very simple example in code. We will define the computation graph for
f = pi + 1

and compute f for constant pi = 3.14159...

The output of the script shall yield
4.14159

So far so easy. We will progressively construct and execute more complex and useful graphs, so stay with me.

Saturday, September 16, 2017

Examining the Bottleneck between CPU and NVIDIA GPU

I was investigating which part of my computer is the culprit for slowing down neural net training. I first thought it was CPU doing the image preprocessing, as my CPU is Intel's low-end series G4560, which only costs about $90, whereas my GPU is NVIDIA's high-end series GTX 1070 that costs more than whopping $400, thanks to cryptocurrency booming.

To my surprise, it was actually the GPU that was lagging behind this time, at least for the current network that I am training. I would like to share how I found out whether GPU or CPU was lagging. Below is the code, most of which is taken from Patrick Rodriguez's repository keras-multiprocess-image-data-generator.

To run the script, you first need to install necessary modules. Save the following as requirement.txt

cycler

functools32

matplotlib

numpy

nvidia-ml-py

pkg-resources

psutil

pyparsing

python-dateutil

pyt

six

subprocess32

Next, run the command below to automate installing all the necessary modules:

$ pip install -r requirement.txt

Lastly, you also need python-tk module, so install it via
$ sudo apt-get install python-tk

Now, you can run the script
$ python sysmonitor.py

Note that you must have NVIDIA GPU in order for the script to work.

Friday, September 15, 2017

Multithreading in Python

In the previous post, I investigated a way to preprocess images using multiple processes. In this post, I will investigate a way to preprocess images using multiple threads.

The real difference from the multiprocessing code is not much. Instead of using multiprocessing.Pool class, use multiprocessing.pool.ThreadPool class. Below is the code:

The execution time for multithreading is a bit slower than that of multiprocessing, but I am not sure if this is always the case, as the difference is not significant.

single thread elapsed time: 364
2 threads elapsed time: 184
4 threads elapsed time: 115

Multiprocessing with Python

I have been training a simple neural network on my desktop, and I realized that GPU wasn't running at its full capacity, i.e., there must be some bottleneck from, most likely, CPU side. My guess is image preprocessing from CPU is taking longer than GPU computation for each batch. In order to reduce the time for CPU to preprocess the images, I started investigating multiprocessing option in Python.

Below is a simple code for running OpenCV's Canny function across multiple processes using Python's built-in multiprocess module:

Running the script yields approximately linear time reduction for 2 processes and sub-linear for 4 processes, due to other bottleneck, such as disk IO.

single process elapsed time: 364
2 processes elapsed time: 181
4 processes elapsed time: 108

Friday, September 1, 2017

Monitor NVIDIA GPU Status

If you have properly installed NVIDIA driver, then you can easily check your GPU's temperature by running
$ nvidia-smi -q -d temperature

In case you are not sure how to install NVIDIA drivers, refer to this page for excellent answer.

To display the GPU status in general, run
$ nvidia-smi

To watch the GPU status real time, run
$ watch nvidia-smi

Saturday, July 29, 2017

Output Size Calculation for Convolution and Pooling Layers

I keep forgetting the exact equation for calculating the output size of convolution and pooling layers. Below are equations directly from tensorflow's official documentation:

For 'SAME' padding, we have
out_height = ceil(float(in_height) / float(strides[1]))
out_width = ceil(float(in_width) / float(strides[2]))

For 'VALID' padding, we have
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))

Note that in tensorflow, a typical input is 4D tensor with shape [batch_size, height, width, channels]. Thus, strides[1] and strides[2] corresponds to stride_height and stride_width, respectively.

Hopefully this is useful!