Numpy is a bit strange; by default it utilizes all the cores available, but its speed doesn't seem to improve with the number of cores.
For example, try running the code below:
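Here is a minimal sketch of such a benchmark. The workload, matrix size, number of tasks, and the helper names work() and run() are illustrative guesses, not the exact original script; the idea is simply to time pool.map over some BLAS-heavy numpy work for a varying number of processes, with 0 meaning "run serially":

import time
import numpy as np
from multiprocessing import Pool

def work(_):
    # Some BLAS-heavy work: repeated matrix products (sizes are a guess).
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    for _ in range(10):
        a = a @ b
    return a.sum()

def run(n_procs, n_tasks=16):
    start = time.time()
    if n_procs == 0:
        # 0 procs: run the tasks serially in the main process.
        [work(i) for i in range(n_tasks)]
    else:
        with Pool(n_procs) as pool:
            pool.map(work, range(n_tasks))
    print("with %d procs, elapsed time: %.2fs" % (n_procs, time.time() - start))

if __name__ == "__main__":
    for n in range(17):
        run(n)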
You will probably have to quit it after running it for some time, because it is just TOOOOO slow. Here is my output from a computer equipped with an AMD Ryzen 1700X:
$ python test_numpy.py
with 0 procs, elapsed time: 42.59s
with 1 procs, elapsed time: 43.05s
with 2 procs, elapsed time: 658.66s
^C
So, what is the problem? Although I am not sure of the details, it seems that numpy's default multithreading (the BLAS backend it ships with) is quite bad at this. It is supposed to use all cores efficiently, but it in fact creates a bottleneck when there are lots of cores.
Furthermore, when you try to run the same task multiple times in parallel, it gets even worse, since all cores are already busy with a single task; each worker process spawns its own full set of BLAS threads, so the CPU ends up oversubscribed. Notice that the time for pool.map with 2 procs jumped by more than 10x!
BTW, I installed numpy from pip, i.e.,
$ pip install numpy
Maybe it would be better if I compiled numpy manually and linked a better BLAS library, but that is just too painful.
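Before going that far, you can at least check which BLAS your pip-installed numpy actually links against (pip wheels usually bundle OpenBLAS):

import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries this numpy build was compiled against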
Well, the good news is that there is in fact a very simple solution. Try this.
$ export OMP_NUM_THREADS=1
$ export NUMEXPR_NUM_THREADS=1
$ export MKL_NUM_THREADS=1
$ python test_numpy.py
with 0 procs, elapsed time: 26.91s
with 1 procs, elapsed time: 26.95s
with 2 procs, elapsed time: 13.53s
with 3 procs, elapsed time: 9.12s
with 4 procs, elapsed time: 6.82s
with 5 procs, elapsed time: 5.42s
with 6 procs, elapsed time: 4.61s
with 7 procs, elapsed time: 3.91s
with 8 procs, elapsed time: 4.52s
with 9 procs, elapsed time: 4.62s
with 10 procs, elapsed time: 3.42s
with 11 procs, elapsed time: 3.92s
with 12 procs, elapsed time: 3.62s
with 13 procs, elapsed time: 3.42s
with 14 procs, elapsed time: 3.32s
with 15 procs, elapsed time: 3.22s
with 16 procs, elapsed time: 3.12s
With the exact same task, I am seeing more than a 13x speed-up compared to the previous result!
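If you don't want to rely on exporting shell variables every time, one option (just a sketch of the same idea, not something the original script does) is to set them at the top of the script, before numpy is imported, since the BLAS thread pool reads them when it initializes:

import os
# Must be set before numpy (and its BLAS backend) is imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # each worker process now keeps its BLAS single-threaded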