Wednesday, July 11, 2018

PyTorch Implementation of BatchNorm

Batch Normalization is a really cool trick to speed up training of very deep and complex neural networks. Although PyTorch has its own implementation of this in the backend, I wanted to implement it manually just to make sure that I understand it correctly. Below is my implementation on top of PyTorch's dcgan example (the BN class starts at line 103).

Although this implementation is very crude, it seems to work well when tested with this example. To run it, type:
$ python --cuda --dataset cifar10 --dataroot .
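
For reference, here is the training-time computation that a manual BN layer has to reproduce, written in the usual textbook notation. The symbols are my own shorthand and do not come from the code above: m is the number of values being normalized for one feature (for a conv layer, one channel over the batch and spatial dimensions), epsilon is a small constant for numerical stability, and gamma and beta are the learnable scale and shift. This is only a sketch of the forward pass; at test time, running averages of the mean and variance are normally used in place of the batch statistics.

```latex
\begin{aligned}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma\,\hat{x}_i + \beta
\end{aligned}
```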

Friday, July 6, 2018

Speeding up Numpy with Parallel Processing

Numpy is a bit strange; by default it utilizes all available cores, but its speed doesn't seem to improve with the number of cores.
For example, try running the code below:

You will probably have to quit it after running it for some time, because it is just TOOOOO slow. Here is my output from a computer equipped with an AMD Ryzen 1700X:

$ python
with 0 procs, elapsed time: 42.59s
with 1 procs, elapsed time: 43.05s
with 2 procs, elapsed time: 658.66s

So, what is the problem? Although I am not sure of the details, it seems that numpy's default multithreading backend is quite horrible. It is supposed to use all cores efficiently, but in fact it creates a bottleneck when there are lots of cores.

Furthermore, when you want to carry out the same task multiple times in parallel, it gets even worse, since all cores are already busy with a single task. Notice that the time with 2 procs has jumped by more than 10x!

BTW, I installed numpy from pip, i.e.,

$ pip install numpy

Maybe it would be better if I compiled numpy manually and linked a better BLAS library, but that is just too painful.

Well, the good news is that there is in fact a very simple solution. Try this:
$ export OMP_NUM_THREADS=1
$ export MKL_NUM_THREADS=1
$ python

with 0 procs, elapsed time: 26.91s
with 1 procs, elapsed time: 26.95s
with 2 procs, elapsed time: 13.53s
with 3 procs, elapsed time: 9.12s
with 4 procs, elapsed time: 6.82s
with 5 procs, elapsed time: 5.42s
with 6 procs, elapsed time: 4.61s
with 7 procs, elapsed time: 3.91s
with 8 procs, elapsed time: 4.52s
with 9 procs, elapsed time: 4.62s
with 10 procs, elapsed time: 3.42s
with 11 procs, elapsed time: 3.92s
with 12 procs, elapsed time: 3.62s
with 13 procs, elapsed time: 3.42s
with 14 procs, elapsed time: 3.32s
with 15 procs, elapsed time: 3.22s
with 16 procs, elapsed time: 3.12s

With the exact same task, I am seeing a speedup of more than 13x compared to the previous result!
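
If exporting variables in the shell is inconvenient, the same limits can probably be applied from inside the script, as long as that happens before numpy is imported for the first time (the BLAS library reads these variables when it is loaded). Below is a minimal sketch of that idea. The helper name limit_blas_threads is hypothetical, and OPENBLAS_NUM_THREADS is an extra variable I am adding on the assumption that the pip build of numpy is linked against OpenBLAS; it is not part of the commands above.

```python
import os

# Environment variables that common BLAS / OpenMP backends read at load time.
# OPENBLAS_NUM_THREADS is an assumption for pip-installed numpy (OpenBLAS).
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_blas_threads(n=1):
    """Cap the BLAS thread count; call this before the first `import numpy`."""
    for var in THREAD_VARS:
        os.environ[var] = str(n)


limit_blas_threads(1)
# import numpy as np  # import numpy only after the limits are in place
```

Child processes started afterwards inherit the environment, so each worker should end up single-threaded as well.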

Simple Tic Toc Alternative in Python

In Matlab, the tic and toc functions provide a very simple way to display elapsed time. We can create a similar mechanism in Python. Note that the source code below has only been tested with Python 3.
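
A minimal sketch of such a helper might look like the following. It keeps the start time in a module-level variable and uses time.perf_counter; the function names mirror Matlab's, while the label argument and the output format are my own illustrative choices.

```python
import time

_start_time = None  # set by tic(), read by toc()


def tic():
    """Start (or restart) the stopwatch."""
    global _start_time
    _start_time = time.perf_counter()


def toc(label="elapsed time"):
    """Print and return the seconds elapsed since the last tic()."""
    if _start_time is None:
        raise RuntimeError("toc() called before tic()")
    elapsed = time.perf_counter() - _start_time
    print(f"{label}: {elapsed:.2f}s")
    return elapsed


if __name__ == "__main__":
    tic()
    time.sleep(0.1)  # stand-in for the work being timed
    toc()
```

Because there is only one shared start time, nested tic/toc pairs overwrite each other; for nested timing, a stack of start times would be needed.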

Very convenient!