
Background

I have a computationally intensive program that I am trying to run on a single supercomputer node. Here are the specs of one of the nodes on the supercomputer:

  • OS: Redhat 6 Enterprise 64-bit
  • CPU: Intel 2x 6-core 2.8GHz (12 cores) -- Cache 12MB
  • RAM: 24 GB @ ???? MHz (don't have sudo access to check dmidecode)

I have also been testing this program on an Ubuntu virtual machine running on my MacBook:

  • OS: Ubuntu 13.10 64-bit
  • CPU: Intel 4x 2.30GHz (4 cores) -- Cache 6MB
  • RAM: 3 GB @ 1600 MHz

The program is built with the same version of gcc on both machines. However, for a simplified test run, the real (wall-clock) time on the supercomputer is significantly longer than on my virtual machine.

This didn't make sense to me, and to make it more confusing, when I run gprof on my program, it shows that the supercomputer is indeed faster than my virtual machine. The table below shows the different times I am seeing for my program on each machine (SC = supercomputer, VM = virtual machine):

| Timing                    | SC     | VM     |
|---------------------------|--------|--------|
| Release (-O3) real time   | 15 s   | 3 s    |
| Debug (-g -pg) real time  | 55 s   | 35 s   |
| Debug (-g -pg) gprof time | 6.10 s | 9.24 s |

This happens no matter how many times I run the test, and in the case of the supercomputer I am the only user on the compute node while the program is running (i.e., it should not be competing with other processes).

There is very little I/O involved in my program. It reads a 1.4MB file and outputs an 82 byte file.
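A minimal sketch of how the wall-clock/CPU split and page-fault counts could be measured (assuming GNU time is installed at /usr/bin/time; the program and input names are placeholders):

# GNU time (not the shell builtin) reports elapsed, user, and system
# time plus major/minor page faults in its verbose output.
/usr/bin/time -v ./program input.file

A large gap between elapsed and user+sys time, or a high page-fault count, would point away from pure CPU work.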

Question

What is going on that makes the real-time performance of the supercomputer worse when gprof indicates that its CPU-time performance is better? And what can I do to improve the real-time performance on the supercomputer?

Additional Info

The program spends most of its time generating and solving a system of linear equations. The actual solver is an OpenMP-enabled library that uses one thread per available core on the machine.

I can run a separate test program that uses the same linear solver library: it reads a more complex linear system from a Matrix Market file (690 MB; the "A" matrix is nearly 2 million × 2 million) and solves it independently of the program I wrote. In this case the supercomputer (48 s) is faster than the virtual machine (74 s). This suggests to me both that the problem is not in the linear solver and that it is not I/O-related, since this test is much more I/O-intensive.
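If it helps, the read and solve phases of this test can be timed separately; a rough sketch, assuming the test driver is invoked as ./solver_test and the Matrix Market file is named matrix.mtx (both names are placeholders):

# Time the raw read first (run twice to compare cold vs. warm page cache),
# then the full run; the difference approximates the solve time alone.
time cat matrix.mtx > /dev/null
time ./solver_test matrix.mtx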

CPU Info

SC

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 0
cpu cores   : 6
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 1
cpu cores   : 6
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 2
cpu cores   : 6
apicid      : 4
initial apicid  : 4
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 8
cpu cores   : 6
apicid      : 16
initial apicid  : 16
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 9
cpu cores   : 6
apicid      : 18
initial apicid  : 18
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 10
cpu cores   : 6
apicid      : 20
initial apicid  : 20
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5600.41
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 6
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 0
cpu cores   : 6
apicid      : 32
initial apicid  : 32
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 1
cpu cores   : 6
apicid      : 34
initial apicid  : 34
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 8
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 2
cpu cores   : 6
apicid      : 36
initial apicid  : 36
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 9
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 8
cpu cores   : 6
apicid      : 48
initial apicid  : 48
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 10
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 9
cpu cores   : 6
apicid      : 50
initial apicid  : 50
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 11
vendor_id   : GenuineIntel
cpu family  : 6
model       : 44
model name  : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping    : 2
cpu MHz     : 2800.207
cache size  : 12288 KB
physical id : 1
siblings    : 6
core id     : 10
cpu cores   : 6
apicid      : 52
initial apicid  : 52
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5599.85
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

VM

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping    : 9
microcode   : 0x15
cpu MHz     : 2294.125
cache size  : 6144 KB
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase smep
bogomips    : 4588.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping    : 9
microcode   : 0x15
cpu MHz     : 2294.125
cache size  : 6144 KB
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase smep
bogomips    : 4588.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping    : 9
microcode   : 0x15
cpu MHz     : 2294.125
cache size  : 6144 KB
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase smep
bogomips    : 4588.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping    : 9
microcode   : 0x15
cpu MHz     : 2294.125
cache size  : 6144 KB
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dtherm fsgsbase smep
bogomips    : 4588.25
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Neal Kruis
    relative cache sizes or memory access times? – Digikata Mar 26 '14 at 20:44
  • I've updated the system specs to reflect these...though I can't find out the memory speed of the supercomputer. – Neal Kruis Mar 26 '14 at 21:18
  • I'd put a lot on cache size. – Kevin Mar 26 '14 at 21:22
  • oops...sorry. Just fixed that mistake: The supercomputer does indeed have a larger cache. – Neal Kruis Mar 26 '14 at 21:24
  • And the times you report are "wall clock" times? What does the times command say (system/user) what are the numbers of pagefaults ? Where is the time spent ? (gprof -a, IIRC) – wildplasser Mar 26 '14 at 21:35
  • Can you post the complete output of `cat /proc/cpuinfo` on both machines? The CPUs could be different architectures, making up for the difference in Ghz and cache size. Also, are you the only user on the supercomputer? Maybe it's running other tasks in the background, slowing down your calculation? What load does `uptime` show? – Michał Kosmulski Mar 26 '14 at 21:37
  • `15:41:37 up 3:12, 0 users, load average: 0.02, 0.01, 0.11` – Neal Kruis Mar 26 '14 at 21:42
  • @NealKruis OK, so it's apparently not other processes running in the background. How about the cpuinfo? – Michał Kosmulski Mar 26 '14 at 21:54
  • By the way, the real times shown by `gprof` seem to say that the VM, not SC is faster. – Michał Kosmulski Mar 26 '14 at 22:05
  • What flags (especially architecture) did you use for compiling the binary? I have seen at one point a specific choice of architecture to ruin performance on some machines, so maybe this is the case? – Michał Kosmulski Mar 26 '14 at 22:06
  • @MichałKosmulski, you might be on to something here. I am not specifying any architecture flags. The only flags I am using are `-g -pg` for debug and `-O3 -DNDEBUG` for release. I'm guessing it's defaulting to *generic* (see the compile sketch just after these comments). – Neal Kruis Mar 27 '14 at 15:01
  • The [man page](http://linux.die.net/man/1/gcc) is somewhat vague on what exactly happens when you don't specify `-march` and `-mtune` directly, but it seems the defaults can vary between different `gcc` versions and might depend on current CPU. Also, `-O3` includes some risky optimizations, which may help but may also be detrimental at times. How about an experiment: compile the program on a different machine (e.g. on supercomputer if you compiled on your workstation so far) and see if you get similar results as now? – Michał Kosmulski Mar 27 '14 at 17:02
  • @MichałKosmulski, I don't know what you mean by "different machine". So far I've been compiling on the supercomputer, my MacBook, and my virtual machine (all with the same compile flags). My MacBook and my virtual machine are similar in performance with the supercomputer being much slower. – Neal Kruis Mar 27 '14 at 18:20
  • I thought perhaps you had compiled the program one one machine and distributed the binaries to other machines, but now I know this was not the case. – Michał Kosmulski Mar 27 '14 at 18:39
  • The cluster node is a ccNUMA system. Make sure that all threads are bound to a specific core each, otherwise with the OS scheduler moving threads around the benchmark is going to be badly affected. – Hristo Iliev Mar 28 '14 at 17:21
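To make the architecture-flag idea from the comments concrete, a hedged sketch (the source file name is a placeholder; -march=native has been available since gcc 4.2, and on older compilers an explicit ISA flag such as -msse4.2 is the closest substitute):

# Tune code generation for the CPU of the machine doing the compiling.
gcc -O3 -DNDEBUG -march=native -o program program.c

Note that a binary built this way on the MacBook may use AVX and would then not run on the AVX-less X5660 node.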

2 Answers


Note 3: My latest thinking is that the supercomputer has a faster I/O subsystem but is slower at the matrix operations because its CPU lacks the AVX extensions. The MacBook is slower at reading from disk but computes faster because its CPU has AVX. Perhaps the supercomputer took 33 seconds to load the 690 MB and 15 seconds to compute, while the MacBook took 71 seconds to load it and 3 seconds to compute; that would add up to the observed totals of 48 seconds for the supercomputer and 74 seconds for the MacBook.

Note 2: I have a new theory: when you run the separate test program, both the supercomputer and your MacBook are memory-bandwidth bound. The data set there is 690 MB, which doesn't fit in the processor cache, whereas the data in your production runs is 1.4 MB, which does fit in the CPU cache. The memory controller integrated in the MacBook's CPU is dual-channel, while the X5660 Xeon in the supercomputer has a triple-channel memory controller. So for very large data sets that don't fit in the last-level cache, the supercomputer will be faster because it has more memory bandwidth (3 channels vs. 2). For small working sets that fit in the CPU cache, the MacBook will be faster because the problem becomes CPU bound, and the MacBook's CPU has AVX instructions, which are well suited to linear algebra.
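A crude way to probe the bandwidth side of this theory without root access (a loose, single-stream measurement only, not a substitute for a proper benchmark such as STREAM):

# Streams 16 GiB of zeroes through user-space buffers; dd prints an
# approximate MB/s figure at the end, comparable across the two machines.
dd if=/dev/zero of=/dev/null bs=1M count=16384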

Original Answer

Very likely your linear solver library selects the fastest routine based on runtime detection of the CPU's capabilities. The "supercomputer" may have more cores, larger caches, more memory, and a higher clock frequency, but it does not have the Intel® Advanced Vector Extensions (Intel® AVX) instructions that are available in your MacBook. Here's a discussion on AVX for Linear Algebra.
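This is easy to verify on both machines (assuming a Linux /proc filesystem):

# Print the feature flags of the first CPU; look for "avx" in the list.
grep -m1 flags /proc/cpuinfo

Indeed, the cpuinfo dumps above show avx in the VM's flags but not in the X5660's.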

Some quotes from Intel engineers:

We recently completed a set of tests SSE vs. AVX on Sandy Bridge vs. Ivy Bridge and a range of performance improvement was between ~3x and ~6x (for sqrt operation) and the codes (C/C++) were aggressively optimized by Intel C++ compiler 13.0.0.089 (Initial Release).

And this

Matrix multiplication is an ideal application for demonstration of AVX performance. It depends strongly on tiling for L1 locality, thus the renewed emphasis on performance libraries such as MKL.

http://ark.intel.com/compare/47921,64900

[Image: Intel ARK comparison of the Xeon X5660 and Core i7-3615QM]

amdn
  • As I mentioned in the "Additional Info", I don't think the problem is in the linear solver. I tested the linear solver itself and found that the supercomputer can indeed solve the system faster than my virtual machine. – Neal Kruis Mar 27 '14 at 14:51
  • @NealKruis, Ooops, I missed that... I'll make a note in my answer but leave it in place in case it helps someone else figure out what's going on. – amdn Mar 27 '14 at 17:42
  • @NealKruis update: I think the performance characteristics of this program will change from CPU bound to memory bandwidth limited as the size of the matrix increases - 3 memory channels will beat 2 memory channels for large datasets, and AVX instructions will beat SSE for small datasets... see updated answer above. – amdn Mar 28 '14 at 00:48
  • the 1.4MB in my production runs is the input file I use to inform the creation of the "A" matrix. The matrix itself is stored in memory and is roughly the same size as the 690MB Matrix Market file. In either scenario, the same linear system is solved and it requires the same amount of memory. – Neal Kruis Mar 28 '14 at 14:48
  • So isn't the program I/O bound while it loads the 690MB file? – amdn Mar 28 '14 at 14:52
  • Perhaps it would tell us something if knew how long it took to load 690MB and how long it took to compute. Maybe the supercomputer took 33 seconds to load 690MB and 15 seconds to compute and the MacBook took 71 seconds to load 690MB and 3 seconds to compute. – amdn Mar 28 '14 at 14:58
  • Most of the time (~70%) is spent solving the linear system and not reading the file. Keep in mind that this is the case where I feed the linear system directly to the solver and the supercomputer is indeed out-performing my virtual machine. The point I was trying to make in my question was that my program has very little I/O requirements in contrast to the Matrix Market solution, indicating that the supercomputer is not slowed by I/O operations. – Neal Kruis Mar 28 '14 at 15:05
  • Alright, if it isn't the CPU and it isn't the I/O subsystem and it isn't the memory bandwidth, then there is some effect due to the different environment (kernel/hypervisor on the Virtual Machine). – amdn Mar 28 '14 at 15:12

This is probably not an answer to your question but more of an extended comment. Given that no code is shown, I can only speculate about the nature of the problem. A dual-socket system with Westmere CPUs is a ccNUMA (cache-coherent Non-Uniform Memory Access) platform. On NUMA systems the global memory is divided into areas, some of which are local and the others remote with respect to any given CPU core. Accessing local memory is less expensive in terms of memory cycles and usually delivers higher bandwidth. Your MacBook has a single CPU socket and is not a NUMA system.
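The topology is easy to inspect; a sketch, assuming the numactl package is installed on the node:

# Lists the NUMA nodes, the CPUs belonging to each, and per-node memory;
# on this dual-socket Westmere node, two nodes of ~12 GB each would be expected.
numactl --hardware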

That said, it is very important to enable process and thread binding (or pinning) on ccNUMA systems. The OS scheduler usually tries to keep all CPU cores equally loaded and therefore constantly moves threads (and processes, as collections of threads) around. If a thread allocates memory on one NUMA node and is then moved to another one, its memory accesses will be significantly slowed down. This can be countered with the processor affinity mechanism: one provides the OS with a list of CPUs on which a given thread is allowed to run. The actual act of fixing the affinity mask is called binding or pinning. Binding is also important for cache utilisation, since moving a thread from one core to another on the same socket means reloading the L1 and L2 caches, while a move to a different socket means reloading the L1, L2, and L3 caches.

Process binding is easily done with taskset or numactl. Binding threads is more involved since it depends on the threading mechanism. OpenMP 4.0 standardises the whole process, but most OpenMP implementations available today are from the previous era (i.e. up to version 3.1), so one has to resort to vendor-specific methods. For GCC / libgomp, the way to go is to set the GOMP_CPU_AFFINITY environment variable. For your 12-core cluster node the following should do it:

GOMP_CPU_AFFINITY="0-11" ./executable
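For process-level binding with the tools mentioned above, sketches of the two approaches:

# Pin the process (and all its threads) to cores 0-11:
taskset -c 0-11 ./executable
# Or confine both CPUs and memory allocations to one socket/NUMA node:
numactl --cpunodebind=0 --membind=0 ./executable

numactl is the stronger option on ccNUMA machines because it constrains memory placement as well as CPU placement.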

OpenMP also incurs some overhead, and with small matrices that overhead might be high enough to negate the benefits of threading. The overhead also grows with the number of threads. You should therefore compare your program with the same number of threads in the VM and on the cluster node. Setting OMP_NUM_THREADS should work for well-written OpenMP codes that do not try to fix the number of threads themselves based on some internal logic.

In summary, you should try something like:

GOMP_CPU_AFFINITY="0-3" OMP_NUM_THREADS=4 ./executable

on both systems. This removes the NUMA influence and the difference in OpenMP overheads. Any remaining differences will come from the different L3 cache architectures (Ivy Bridge has the segmented L3 cache that Sandy Bridge introduced), decreased latencies of some instructions in Ivy Bridge, different power management (could the X5660s have Turbo Boost disabled?), and possibly different instruction sets utilised by the solver library, as @amdn mentions.
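Turbo Boost and frequency scaling can be checked without root, assuming the cpufreq sysfs interface is exposed on the node:

# Current frequency governor, and the frequency each core reports.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
grep "cpu MHz" /proc/cpuinfo

(The cpuinfo dumps above already report a steady 2800 MHz on all SC cores.)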

Hristo Iliev