I am attempting to run a parallelized (OpenMPI) program on 48 cores, but am unable to tell without ambiguity whether I am truly running on cores or threads. I am using htop
to try to illuminate core/thread usage, but it's output lacks sufficient description to fully deduce how the program is running.
I have a workstation with 2x Intel Xeon Gold 6248R, each with 24 cores and threaded---so, 48 total cores, 96 total threads. I have a program parallelized using OpenMPI which I want to have run on all 48 processors, using 1 thread per processor. I am employing Slurm on this workstation to schedule jobs.
First, I have configured Slurm to reflect the system architecture. From the bottom of slurm.conf
:
...
# COMPUTE NODES
NodeName=name Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 State=UNKNOWN
PartitionName=primary Nodes=ALL Default=YES MaxTime=INFINITE State=UP
I am submitting jobs to Slurm via the script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=48
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-socket=24
#SBATCH --threads-per-core=1
mpirun -np 48 binary.x
...and indeed, the processor is printing 48 output files (as expected).
So, the question is: how do I know that I am truly running on all 48 processors, rather than 48 threads on 24 processors? Sure, the Slurm script above is rather explicit, but it is something I want to be sure of before batch processing 1000 simulations.
The primary issue that is making me uncertain is using htop
. The following displays a typical/representative snapshot of what is being reported (it jumps around a bit, of course):
representative htop output
The problem is that, for the most part, htop
is saying that most of the activity is occurring on 0-47. So, the next question is: how is htop
printing cores/threads? One would expect they would be printed in pairs (i.e., the two threads for core 1 would be 0-1, the two threads for core 2 would be 2-3, and so on), but I cannot find this documented anywhere. Output from lstopo
(hwloc), below, indicates this could be the case (if we follow PU
numbers...), but I am not sure if these are linked / drawing the same info.
Any hints would be welcome, either with regard to knowing htop
's output, to constructing a better Slurm script, or both.
lstopo
output:
Machine (125GB total)
NUMANode L#0 (P#0 62GB)
Package L#0 + L3 L#0 (36MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#48)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#49)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#50)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#51)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#52)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#53)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#54)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#55)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#56)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#57)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#58)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#59)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#60)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#61)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#62)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#63)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#64)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#65)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#66)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#67)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#68)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#69)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#70)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#71)
HostBridge L#0
PCI 8086:a1d2
PCI 8086:a1bc
PCI 8086:2826
PCIBridge
PCI 8086:1533
Net L#0 "enp2s0"
PCI 8086:15b9
Net L#1 "enp0s31f6"
HostBridge L#2
PCI 8086:201d
HostBridge L#3
PCIBridge
PCI 1000:0017
Block(Disk) L#2 "sdb"
Block(Disk) L#3 "sda"
PCI 8086:201d
HostBridge L#5
PCIBridge
PCI 10de:24b0
GPU L#4 "renderD128"
GPU L#5 "controlD64"
GPU L#6 "card0"
NUMANode L#1 (P#1 63GB)
Package L#1 + L3 L#1 (36MB)
L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#72)
L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#73)
L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#74)
L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#75)
L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
PU L#56 (P#28)
PU L#57 (P#76)
L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
PU L#58 (P#29)
PU L#59 (P#77)
L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
PU L#60 (P#30)
PU L#61 (P#78)
L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
PU L#62 (P#31)
PU L#63 (P#79)
L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
PU L#64 (P#32)
PU L#65 (P#80)
L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
PU L#66 (P#33)
PU L#67 (P#81)
L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
PU L#68 (P#34)
PU L#69 (P#82)
L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
PU L#70 (P#35)
PU L#71 (P#83)
L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
PU L#72 (P#36)
PU L#73 (P#84)
L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
PU L#74 (P#37)
PU L#75 (P#85)
L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#38)
PU L#77 (P#86)
L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#39)
PU L#79 (P#87)
L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
PU L#80 (P#40)
PU L#81 (P#88)
L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
PU L#82 (P#41)
PU L#83 (P#89)
L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
PU L#84 (P#42)
PU L#85 (P#90)
L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
PU L#86 (P#43)
PU L#87 (P#91)
L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
PU L#88 (P#44)
PU L#89 (P#92)
L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
PU L#90 (P#45)
PU L#91 (P#93)
L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
PU L#92 (P#46)
PU L#93 (P#94)
L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
PU L#94 (P#47)
PU L#95 (P#95)
HostBridge L#7
PCI 8086:201d