Tuesday, August 14, 2018

Multi-GPU on Pytorch

After some time, I finally figured out how to run multi-gpu on pytorch. In fact, the multi-gpu API in pytorch is extremely simple; the problem was my system.

Here is a simple test code to try out multi-gpu on pytorch. If this works out of the box, then you are good. However, some people may face problems, as discussed in this forum. As pointed out here, the problem is not with pytorch, but with external factors. In my case, it was ngimel's comment that saved me. To recap her solution,

1. Run p2pBandwidthLatencyTest from the CUDA samples and make sure it works fine. If it does not pass this test, then the problem is with the CUDA installation, etc., and not with pytorch. To download the samples, simply run

$ cuda-install-samples-9.2.sh <target_path>

where you would replace the version above with whatever version you have. Then,

$ cd <target_path>/NVIDIA_CUDA-9.2_Samples/1_Utilities/p2pBandwidthLatencyTest/
$ make
$ ./p2pBandwidthLatencyTest

2. In my case, it was IOMMU that was the culprit. Disable it by editing /etc/default/grub and replace

Then update grub
$ sudo update-grub

Then reboot

This is how I solved my problem. I love this open source community forum! Thank you everyone!

Deep Neural Network Tips and Tricks

I just want to scribble down some of the things I have learned from my own experience in training deep neural networks. Hope this helps others too.

1. Optimizer: use SGD with momentum. If the momentum is too high, you may see validation error greater than training error even when the model is not overfit. This is because, in each epoch, the momentum starts at zero and builds up as more batches are trained, so by the end of the epoch you may hit gradient explosion, which leads to large validation error. A typical value of momentum is 0.9.

2. Gradient clipping: always use gradient clipping to prevent gradient explosion. This saves a lot of time because you don't need to constantly tune the learning rate by hand while training. A typical value is 10 or lower.

3. Learning rate: in theory, as large as it can be while still small enough to prevent gradient explosion. However, it is just too much work to adjust the learning rate during training, so simply set it high enough and use gradient clipping to prevent gradient explosion. A typical value is 0.001 or lower.

4. Input normalization: to facilitate training, normalize the input data to have zero-mean and unit standard deviation.

5. Batch normalization: employ batch normalization layers. These layers are especially helpful for deep networks.

6. Dropout: although dropout is not needed when batch norm is employed, one can still employ a small dropout (~0.1) on multiple layers. I think this is better than one or two large dropouts (~0.5). If the data size is small compared to the network, and one needs an extra measure to prevent overfitting, dropout layers are useful.

7. L2 weight decay: not necessary, but still useful as an option. A typical value of 1e-5 should be fine.

8. Shortcuts: shortcuts are extremely useful for deep neural networks. The most popular implementation is perhaps the residual block. Full pre-activation may be the best choice, as illustrated here.

9. The output size y of a convolution given input size x is
y = floor((x - kernel + padding*2)/stride) + 1

10. For 2D convolutions, stacking several small-kernel convolutions is more beneficial than using one large-kernel convolution. For example, 3 convolutions with 3x3 kernels and depth d have 3 * 3^2 * d = 27d parameters, whereas 1 convolution with a 7x7 kernel and the same depth d has 7^2 * d = 49d parameters. Note that in both cases the receptive field is 7x7, but the former lets us employ 3 activation layers, while the latter only gets 1. Therefore, it is usually believed that the former should be more effective in learning. However, for 1D convolutions, the former requires more parameters than the latter (3 * 3 * d = 9d versus 7d).

11. Activation layers: typically ReLU layers are used, but ELU may be a good alternative. It is recommended to employ clipping on these unbounded activation layers; for example, use y = clamp(x, min=0, max=5) in place of ReLU layers to prevent overly large values.

12. Training history: it is very important to save the loss and accuracy history during training, for both the training data and the validation data, because the history tells us a lot about how training is going. Usually, at the beginning of training the validation error should be less than the training error, since the training error is a running average over the epoch, while the validation error is measured at the end of the training stage of the epoch. However, as time goes by, the training error should drop below the validation error, because the model will be slightly overfit. That is a good time to either stop the training, to prevent overfitting, or take additional measures. Also, when the improvement flattens, it is a good indicator to lower the learning rate.
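To double-check the arithmetic in tips 9 and 10, here is a small Python sketch. The function names are my own, made up for illustration, and like the tips themselves it leaves out the depth factor d, biases, and dilation. The receptive field helper assumes stride-1 convolutions.

```python
def conv_output_size(x, kernel, padding=0, stride=1):
    """Output length along one axis: floor((x - kernel + 2*padding) / stride) + 1."""
    return (x - kernel + 2 * padding) // stride + 1


def receptive_field(kernels):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    field = 1
    for k in kernels:
        field += k - 1  # each layer widens the field by (kernel - 1)
    return field


def weights_per_channel(kernels, dims):
    """Kernel weights per channel (the depth factor d is left out); dims is 1 or 2."""
    return sum(k ** dims for k in kernels)


print(conv_output_size(32, kernel=3, padding=1, stride=1))   # 32
print(conv_output_size(32, kernel=3, padding=1, stride=2))   # 16
print(receptive_field([3, 3, 3]), receptive_field([7]))      # 7 7
print(weights_per_channel([3, 3, 3], dims=2), weights_per_channel([7], dims=2))  # 27 49
print(weights_per_channel([3, 3, 3], dims=1), weights_per_channel([7], dims=1))  # 9 7
```

The last two lines reproduce the 27d versus 49d comparison for 2D and the 9d versus 7d comparison for 1D.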

I will continue to add more as I gain more experience.
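As a footnote to tips 2 and 11, here is a framework-free Python sketch of what the clipping does. I am assuming clipping by the global L2 norm here (the tip above does not say which kind of clipping), and the function names are hypothetical. In practice you would use whatever clipping utility your framework provides rather than this.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)          # already small enough, leave untouched
    scale = max_norm / total
    return [g * scale for g in grads]


def clipped_relu(x, max_value=5.0):
    """y = clamp(x, min=0, max=max_value), the bounded ReLU from tip 11."""
    return min(max(x, 0.0), max_value)


print(clip_by_global_norm([6.0, 8.0], max_norm=5.0))   # [3.0, 4.0]
print(clip_by_global_norm([0.3, 0.4], max_norm=5.0))   # [0.3, 0.4]
print(clipped_relu(-2.0), clipped_relu(3.0), clipped_relu(12.0))  # 0.0 3.0 5.0
```

Note that the direction of the gradient is preserved; only its length is capped, which is why a fairly high learning rate can be left alone.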

Wednesday, August 8, 2018

Enable S3 Sleep for Thinkpad X1 Carbon 6th Gen on Ubuntu

I recently purchased an X1 C6; I am back to Thinkpad after 8 years of digression to Macbook. I will probably write a post explaining why I switched back, but for now I will focus on my story of getting S3 sleep to work with it under Ubuntu.

One thing I really miss from Macbook is how easy it was to simply close the lid and not worry about draining the battery. Unfortunately, with X1 C6, that is not the case. With Windows 10, closing the lid puts it into the S0i3 sleep state, where some processes can stay on and do stuff in the background. The idea is great, but unfortunately in the real world this new S0i3 sleep state drains the battery so much that I just want to go back to the old S3 state, where none of the processes can be on.

Since I installed Ubuntu on this machine and will mainly use Ubuntu, I searched for methods to enable S3 sleep rather than stupid S0i3 sleep. After spending much time, I finally found a solution, and I want to share it with anyone who needs this.

At first, I tried this post, but it did not work. Then, it was this answer here by Adonis that saved me. Basically, the missing step was to generate the grub.cfg file with grub-mkconfig from /etc/default/grub and then add initrd /boot/acpi_override.

Before this patch, it didn't even last a full day on sleep. Now with this patch, it seems to drain 10% a day in sleep. This is not as good as Macbook, which lasts about 30 days in sleep, so about 3% per day, but I guess this is better than 100% drain per day. With Windows, it was about 30% drain per day in sleep.

I really like this laptop, but I have to admit that I miss Macbook when it comes to this sort of minute convenience.

By the way, I really appreciate that people managed to create this patch and shared it with all of us. I really respect their knowledge and skills. This was something even Lenovo engineers could not do!

Saturday, August 4, 2018

Build OS from Scratch 1

I've always been curious as to how a computer works, all the way from the bottom level to the top level. We use computers every day, so we are familiar with the user level, i.e., the top-most level, but how many people in the world actually know what is going on at the deepest level?

I took a course during my undergrad on how a computer and an OS work, but it has been too long since then, and I don't remember much. Furthermore, when I was taking that course, I lacked much of the necessary knowledge to really absorb the materials; I wasn't even familiar with the most basic shell commands, such as cp, mv, etc.

Now that I think about it, that course was really something I want to learn now; unfortunately, I can't access the course materials any more. Thankfully, there are abundant other resources that are accessible from simple Google search, so I am going to dive into these very low level materials one by one.

I will be starting a series of short blog posts to summarize what I learn in my own words, starting with this one. For this post, I am referring to this excellent document.

The boot process looks for a bootable sector on any available disk or media. A bootable sector is flagged by the magic number 0xAA55 in its last two bytes. The boot sector is the first sector of the media.
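To make the magic number concrete, here is a minimal Python sketch of that check. The function name and the 512-byte sector size constant are my own choices for illustration, and this only mimics what the BIOS looks for rather than how it actually does it. Since x86 is little-endian, the 16-bit value 0xAA55 sits on disk as the byte 0x55 followed by 0xAA.

```python
SECTOR_SIZE = 512  # assumed sector size: 510 bytes of code/padding + 2 signature bytes


def is_bootable(sector: bytes) -> bool:
    """True if the sector is 512 bytes long and ends with 0xAA55 (stored as 55 AA)."""
    signature = (0xAA55).to_bytes(2, "little")
    return len(sector) == SECTOR_SIZE and sector[510:512] == signature


blank = bytes(SECTOR_SIZE)            # all zeros, no signature
signed = bytes(510) + b"\x55\xaa"     # zeros followed by the magic number
print(is_bootable(blank), is_bootable(signed))  # False True
```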

Let's install qemu to emulate a computer, and nasm to assemble the source.
$ sudo apt-get install qemu nasm -y

Next, create a simple boot sector that prints 'Hello' by first creating assembly source file hello.asm
; A simple  boot  sector  that  prints a message  to the  screen  using a BIOS  routine.
mov ah, 0x0e    ; teletype mode (tty)
mov al, 'H'     ; char to write
int 0x10        ; print char on screen
mov al, 'e'
int 0x10
mov al, 'l'
int 0x10
mov al, 'l'
int 0x10
mov al, 'o'
int 0x10
jmp $           ; Jump to the  current  address (i.e.  forever).
; Padding  and  magic  BIOS  number.
times  510-($-$$) db 0  ; Pad  the  boot  sector  out  with  zeros
                        ; $ means address at the beginning of the line
                        ; $$ means address at the beginning of the current section
dw 0xaa55               ; Last  two  bytes  form  the  magic  number ,
; so BIOS  knows  we are a boot  sector.

and compile to binary format
$ nasm hello.asm -f bin -o hello.bin

For more details on int 0x10, refer to here.

To boot this sector, simply run
$ qemu-system-x86_64 hello.bin

To view the boot sector in HEX, run
$ od -t x1 -A n hello.bin

You should see it boot up successfully and print 'Hello'!!
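If you want something to compare the od dump against, here is a Python sketch that builds the bytes I would expect hello.bin to contain. This is hand-assembled, so take it with a grain of salt: it assumes nasm emits 16-bit code (its default for -f bin) and picks the short two-byte form of jmp $. The function name is made up for illustration.

```python
def hello_boot_sector(message="Hello"):
    """Hand-assembled guess at hello.bin: print each char via int 0x10, then hang."""
    code = bytearray([0xB4, 0x0E])            # mov ah, 0x0e
    for ch in message:
        code += bytes([0xB0, ord(ch)])        # mov al, <char>
        code += bytes([0xCD, 0x10])           # int 0x10
    code += bytes([0xEB, 0xFE])               # jmp $ (short jump back to itself)
    code += bytes(510 - len(code))            # times 510-($-$$) db 0
    code += (0xAA55).to_bytes(2, "little")    # dw 0xaa55, stored as 55 aa
    return bytes(code)


sector = hello_boot_sector()
print(len(sector))             # 512
print(sector[:6].hex(" "))     # b4 0e b0 48 cd 10
print(sector[-2:].hex(" "))    # 55 aa
```

If your dump differs right after the last cd 10, the jmp was most likely encoded in its longer form, which is still a valid boot sector.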