I'm running dataset oversampling code in a Python 3 Jupyter notebook:

Snippet

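from imblearn.over_sampling import SVMSMOTE

# X and Y are the features and labels loaded earlier in the notebook
sm = SVMSMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X, Y)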

This takes too long to execute. When I checked the system monitor, it showed only one CPU core in use, at 100% capacity.

So I investigated how to use all available cores.

Machine specs

My machine is fairly powerful, with 6 cores (lscpu output):

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              6
On-line CPU(s) list: 0-5
Thread(s) per core:  1
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
Stepping:            10
CPU MHz:             1186.900
CPU max MHz:         4300,0000
CPU min MHz:         800,0000
BogoMIPS:            7200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            9216K
NUMA node0 CPU(s):   0-5
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d

Jupyter feature: IPCluster

Thankfully, the notebook can use multiple cores through ipyparallel. I set it up and tested it as follows:

In the Ubuntu terminal:

Installing the ipyparallel module

pip3 install --upgrade ipyparallel

Starting the cluster

ipcluster start --n 6 --daemonize

In the Jupyter notebook:

Test Case 1: using sleep

single core

%%time
import time
time.sleep(5)
output
CPU times: user 725 µs, sys: 166 µs, total: 891 µs
Wall time: 5.03 s

multiple cores

import ipyparallel as ipp

rc = ipp.Client()
rc[:]

%%px
%%time
import time
time.sleep(5)
output
[stdout:0] 
CPU times: user 613 µs, sys: 33 µs, total: 646 µs
Wall time: 5 s
[stdout:1] 
CPU times: user 522 µs, sys: 46 µs, total: 568 µs
Wall time: 5.01 s
[stdout:2] 
CPU times: user 498 µs, sys: 29 µs, total: 527 µs
Wall time: 5.01 s
[stdout:3] 
CPU times: user 552 µs, sys: 34 µs, total: 586 µs
Wall time: 5.01 s
[stdout:4] 
CPU times: user 573 µs, sys: 28 µs, total: 601 µs
Wall time: 5 s
[stdout:5] 
CPU times: user 573 µs, sys: 40 µs, total: 613 µs
Wall time: 5 s

Observation

Not much difference with multiple cores: each engine still takes about 5 seconds. So I tried a different test case.
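For context, the `%%px` cell above runs the same five-second sleep on every engine, so no single sleep can finish sooner. A fairer test is several independent tasks that either run back to back or overlap. The following is a minimal sketch using only the standard library, not ipyparallel; the names `nap`, `run_serial` and `run_concurrent` are made up for illustration, and threads are used only because `time.sleep` spends its time waiting rather than computing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def nap(seconds):
    """Stand-in task: block for `seconds`, then return the value."""
    time.sleep(seconds)
    return seconds


def run_serial(n_tasks, seconds):
    """Run the naps one after another and return the elapsed wall time."""
    start = time.perf_counter()
    for _ in range(n_tasks):
        nap(seconds)
    return time.perf_counter() - start


def run_concurrent(n_tasks, seconds):
    """Run the naps at the same time, one worker thread per nap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        list(pool.map(nap, [seconds] * n_tasks))  # wait for all results
    return time.perf_counter() - start


if __name__ == "__main__":
    # Six half-second naps: back to back versus overlapping.
    print(f"serial:     {run_serial(6, 0.5):.2f} s")
    print(f"concurrent: {run_concurrent(6, 0.5):.2f} s")
```

The serial run should take roughly the sum of the naps, while the concurrent run should take roughly the length of one nap. The gain comes from having several tasks to overlap, not from making any single task faster.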

Test Case 2: with loop

single core

%%time
for x in range(1000):
    print(x)
output
...
996
997
998
999
CPU times: user 29 ms, sys: 9.6 ms, total: 38.6 ms
Wall time: 28.9 ms

multiple cores

%%px
%%time
for x in range(1000):
    print(x)
output
[stdout:0] 
1
2
...
996
997
998
999
CPU times: user 10.9 ms, sys: 8.74 ms, total: 19.7 ms
Wall time: 20.1 ms

Observation

Again, not much difference: under 9 milliseconds (28.9 ms vs. 20.1 ms).
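Note that the `%%px` cell above makes every engine run the full 1000-iteration loop, so the work is duplicated rather than divided. A speedup needs the work split into chunks, with one chunk per worker. Below is a minimal sketch of that idea using the standard library's `ProcessPoolExecutor` instead of ipyparallel; `split_range`, `sum_of_squares` and `total` are hypothetical helper names, and the sum of squares is only a stand-in for real CPU-bound work.

```python
from concurrent.futures import ProcessPoolExecutor


def split_range(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal sub-ranges."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks get one additional element.
        stop = start + step + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def sum_of_squares(chunk):
    """CPU-bound stand-in for the real work done on one chunk."""
    return sum(x * x for x in chunk)


def total(n, parts, map_fn=map):
    """Combine per-chunk results; `map_fn` decides where the chunks run."""
    return sum(map_fn(sum_of_squares, split_range(n, parts)))


if __name__ == "__main__":
    n, workers = 2_000_000, 6  # 6 matches the core count listed above
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each worker process handles one chunk of the range.
        print(total(n, workers, map_fn=pool.map))
```

With the default `map_fn=map` the chunks run serially in one process, which makes it easy to compare timings against the pool version. Very small workloads may still run slower in parallel, because starting processes and moving data between them has a cost of its own.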

Questions

  1. Does multicore processing really help speed up execution?
  2. If not, how else can I make my SVMSMOTE fit_resample() snippet run faster?
cappy0704
  • `time.sleep(5)` waits for 5 seconds. What do you want to "speed up" on it? If you want it to be "faster", wait for 4 seconds - it will finish a whole second earlier that way. – tevemadar Mar 28 '19 at 17:16
  • Side note: even when doing meaningful parallel processing, printing anything on the console/screen/etc. usually ruins things: you have only one console, so even if you write on it from 100 independent, lightning fast processors, they have to sort out access to that single console. Parallel processing requires multiple things to work on. – tevemadar Mar 28 '19 at 17:19
