I have been training a simple neural network on my desktop, and I noticed that the GPU wasn't running at full capacity, which suggests a bottleneck elsewhere, most likely on the CPU side. My guess is that image preprocessing on the CPU takes longer than the GPU computation for each batch. To reduce the CPU preprocessing time, I started looking into Python's multiprocessing options.
Below is a simple script that runs OpenCV's Canny function across multiple processes using Python's built-in multiprocessing module:
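A minimal sketch of such a script is shown here. Several details are assumptions rather than part of the original setup: the `images` folder of JPEG files, the Canny thresholds of 100 and 200, and the helper names `count_edge_pixels` and `run` are all hypothetical. Each worker returns only an edge-pixel count, so the timing is not dominated by sending full edge maps back to the parent process.

```python
import glob
import os
import time
from multiprocessing import Pool

IMAGE_DIR = "images"   # hypothetical folder of test images (assumption)
LOW, HIGH = 100, 200   # assumed Canny hysteresis thresholds


def count_edge_pixels(path):
    """Load one image, run Canny on it, and return the number of edge pixels."""
    import cv2  # imported here so each worker process loads OpenCV itself

    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # unreadable or non-image file
        return 0
    edges = cv2.Canny(image, LOW, HIGH)
    # Return a small integer instead of the edge map to keep
    # inter-process traffic low.
    return int(cv2.countNonZero(edges))


def run(paths, num_processes):
    """Process all paths; return (results, elapsed seconds)."""
    start = time.perf_counter()
    if num_processes == 1:
        # Baseline: plain loop in the current process, no pool overhead.
        results = [count_edge_pixels(p) for p in paths]
    else:
        with Pool(processes=num_processes) as pool:
            results = pool.map(count_edge_pixels, paths)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    paths = sorted(glob.glob(os.path.join(IMAGE_DIR, "*.jpg")))
    if not paths:
        print(f"no images found in {IMAGE_DIR!r}")
    else:
        for n in (1, 2, 4):
            _, elapsed = run(paths, n)
            print(f"{n} process(es) elapsed time: {elapsed:.1f} s")
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), since each worker re-imports the script.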
Running the script gives a roughly linear speedup with 2 processes (about 2.0x) and a sub-linear speedup with 4 processes (about 3.4x), likely because of other bottlenecks such as disk I/O.
single process elapsed time: 364
2 processes elapsed time: 181
4 processes elapsed time: 108
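To make the scaling easier to read, the timings above can be converted into speedup and parallel efficiency (speedup divided by the number of processes). This is a small sketch; the helper name `speedup_table` is hypothetical, and the numbers are taken directly from the output above in whatever unit the script reported.

```python
# Elapsed times from the run above, keyed by number of processes.
timings = {1: 364, 2: 181, 4: 108}


def speedup_table(timings):
    """Return {n: (speedup, efficiency)} relative to the single-process time."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in timings.items()}


for n, (speedup, efficiency) in speedup_table(timings).items():
    print(f"{n} process(es): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers, 2 processes give a speedup of about 2.01x and 4 processes about 3.37x, i.e. roughly 84% efficiency at 4 processes.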