
I have two physical database servers (both Windows Server 2016):

test server (5 years old): DELL PowerEdge R730xd, 1x Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (4C/8T), 192 GB RAM (12x 16GB PC4-17000 - 36ASF2G72PZ-2G1A2) - one NUMA node

production server (half year old): DELL PowerEdge R740xd, 2x Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz (6C/12T), 512 GB RAM (16x 32GB PC4-21300 - 36ASF4G72PZ-2G6E1) - two NUMA nodes (256 GB RAM on each NUMA node)

Both servers have the Performance profile selected in the BIOS.

I am an Oracle DBA and I noticed that my test server is faster in some queries that use "only" RAM - not the storage system. I am really disappointed, because my 5-year-old test server is faster than the new one. I think my problem is related to NUMA, because the test server is a one-NUMA-node system and the production server has two NUMA nodes.

I made a lot of tests in Oracle, but I also made one simple test outside Oracle to confirm my suspicion. A simple PHP script that, in a loop, allocates about 2 GB of memory and frees it again:

<?php
for ($n = 0; $n <= 10000; $n++) {
  $start = microtime(true);
  for ($i = 0; $i < 50000000; ++$i) {
      $arr[] = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
  }
  //echo (memory_get_usage()/1024/1024).PHP_EOL;
  echo (microtime(true) - $start).PHP_EOL;
  unset($arr);
}

On my test server one iteration runs in about 1.7 seconds. On my production server it takes 2.0 to 2.6 seconds; if I set processor affinity to NUMA node 1, it is 1.9 seconds.

I am not a hardware specialist, so could you help me tune my memory subsystem? BIOS settings, node interleaving, Windows tuning etc. come to mind. I can't believe that PC4-21300 is slower than PC4-17000 - could someone explain this behavior to me? I can provide additional information if you want - my current BIOS settings etc.

teo
  • This might be related to the Spectre and Meltdown mitigations, which carry a performance penalty. Considering the respective hardware generations, the oldest one might not have those mitigations implemented (at the BIOS level) while the newer one has. My first recommendation would be to check (and update) the BIOS. – sfk Apr 21 '20 at 13:38

1 Answer

This is obviously a late answer, but I ran across this question. I specialize in doing a lot of this at work, and unfortunately a complete answer on NUMA testing is long and nuanced. Here are some general things to consider:

  1. It's not impossible, but on a modern, NUMA-aware operating system it is a bit unlikely that the OS is simply assigning processes to the incorrect NUMA node. You usually see that problem most when a PCIe device like a GPU or NVMe drive is involved: the device is wired to one processor, but the process is running on the other. If you think this is the problem, you can check numastat. If you are getting NUMA misses you will typically see high (and rising) counts for other_node or numa_foreign, though this does depend on a few things. See this Linux doc for a more in-the-weeds explanation.
[root@r7525 ~]# numastat
                           node0           node1           node2           node3
numa_hit                     460             460          397706          414740
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit                 0               0           10633           10567
local_node                     0               0          226751           76898
other_node                   460             460          170955          337842

                           node4           node5           node6           node7
numa_hit                  423211          295925             460             460
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit             10645           10559               0               0
local_node                256692          247405               0               0
other_node                166519           48520             460             460

                           node8           node9          node10          node11
numa_hit                     460             460          692597          494990
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit                 0               0           10634           10577
local_node                     0               0          283516          274249
other_node                   460             460          409081          220741

                          node12          node13          node14          node15
numa_hit                  269866          227927             460             460
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit             10622           10565               0               0
local_node                103034           87552               0               0
other_node                166832          140374             460             460
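To catch misses as they happen, the counters above can be sampled periodically and diffed. A minimal sketch, assuming numastat is installed (it ships with the numactl package):

```shell
#!/bin/sh
# Sample the system-wide numa_miss total every 5 seconds and print the delta.
# A steadily rising delta under load suggests remote-node allocations.
prev=0
while sleep 5; do
  # Sum every numa_miss value across all node columns (numastat wraps
  # the nodes into several blocks, so the row can appear more than once).
  cur=$(numastat | awk '/^numa_miss/ {for (i = 2; i <= NF; i++) s += $i} END {print s+0}')
  echo "numa_miss delta: $((cur - prev))"
  prev=$cur
done
```

The same idea works for numa_foreign or other_node; just change the pattern in the awk filter.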
  2. You can check the NUMA layout with numactl --hardware. Note: be aware that there is a difference between the physical NUMA nodes and what you will see in the OS. For example, the R7525, with NUMAs per socket set to 4, has 8 physical NUMA nodes (potentially a few more if you enable L3 cache as NUMA). However, what you will see in the OS is this:
[root@r7525 ~]# numactl --hardware
...SNIP...
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  12  12  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  12  12  12  12  12  12  32  32  32  32  32  32  32  32
  2:  12  12  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  12  12  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  12  12  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  12  12  32  32  32  32  32  32  32  32
  6:  12  12  12  12  12  12  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  12  12  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  12  12  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  12  12  12  12  12  12
 10:  32  32  32  32  32  32  32  32  12  12  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  12  12  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  12  12
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  12  12
 14:  32  32  32  32  32  32  32  32  12  12  12  12  12  12  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  12  12  11  10

This is because each physical processor is also running simultaneous multithreading (SMT) and subsequently presents two separate logical processors per physical core, and with them twice as many NUMA nodes. There is a good script here which shows which logical CPUs are sibling threads:

for core in {0..63}; do
  echo -en "$core\t"
  cat /sys/devices/system/cpu/cpu$core/topology/thread_siblings_list
done

The R7525 has 64 physical cores so you see the following:

0       0,64
1       1,65
2       2,66
3       3,67
4       4,68
5       5,69
6       6,70
7       7,71
8       8,72
...SNIP...
59      59,123
60      60,124
61      61,125
62      62,126
63      63,127
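When pinning a workload, you usually want one logical CPU per physical core; the sibling lists above can be folded into a CPU list suitable for numactl or taskset with a little shell. A sketch (the sed keeps only the first sibling from each file):

```shell
#!/bin/sh
# Build a comma-separated list containing the first sibling of each physical
# core, e.g. "0,1,2,..." on the R7525 layout shown above.
primaries=$(for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
              sed 's/[,-].*//' "$f"    # keep only the first CPU id (handles "0,64" and "0-1" forms)
            done | sort -n -u | paste -sd, -)
echo "$primaries"
```

The result can then be fed to something like numactl --physcpubind="$primaries" so the benchmark never lands on a sibling thread.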
  3. If you do have a PCIe card in the mix, you can check the PCIe card's NUMA alignment with `lstopo -v | grep -Ei 'pci|sd|numa'`:
...SNIP...
 PCIBridge L#1 (busid=0000:60:03.1 id=1022:1483 class=0604(PCIBridge) link=7.88GB/s buses=0000:[63-63])
            PCI L#0 (busid=0000:63:00.0 id=14e4:16d6 class=0200(Ethernet) link=7.88GB/s)
            PCI L#1 (busid=0000:63:00.1 id=14e4:16d6 class=0200(Ethernet) link=7.88GB/s)
          PCIBridge L#2 (busid=0000:60:05.2 id=1022:1483 class=0604(PCIBridge) link=0.50GB/s buses=0000:[61-62])
            PCIBridge L#3 (busid=0000:61:00.0 id=1556:be00 class=0604(PCIBridge) link=0.50GB/s buses=0000:[62-62])
              PCI L#2 (busid=0000:62:00.0 id=102b:0536 class=0300(VGA))
        NUMANode L#0 (P#2 local=65175752KB total=65175752KB)
        NUMANode L#1 (P#3 local=66057292KB total=66057292KB)
        NUMANode L#2 (P#4 local=66058316KB total=66058316KB)
        NUMANode L#3 (P#5 local=66045004KB total=66045004KB)
          PCIBridge L#5 (busid=0000:00:01.1 id=1022:1483 class=0604(PCIBridge) link=15.75GB/s buses=0000:[01-01])
            PCI L#3 (busid=0000:01:00.0 id=1000:10e2 class=0104(RAID) link=15.75GB/s PCISlot=0-1)
          PCIBridge L#6 (busid=0000:00:01.2 id=1022:1483 class=0604(PCIBridge) link=1.00GB/s buses=0000:[02-02])
            PCI L#4 (busid=0000:02:00.0 id=1b4b:9230 class=0106(SATA) link=1.00GB/s PCISlot=0-2)
              Block(Disk) L#2 (Size=234431064 SectorSize=512 LinuxDeviceID=8:16 Model=MTFDDAV240TDU Revision=J004 SerialNumber=2151338FC1AF) "sdb"
              Block(Disk) L#3 (Size=234431064 SectorSize=512 LinuxDeviceID=8:0 Model=MTFDDAV240TDU Revision=J004 SerialNumber=2151338FC427) "sda"
          PCIBridge L#8 (busid=0000:e0:05.1 id=1022:1483 class=0604(PCIBridge) link=0.50GB/s buses=0000:[e1-e1])
            PCI L#5 (busid=0000:e1:00.0 id=14e4:165f class=0200(Ethernet) link=0.50GB/s)
            PCI L#6 (busid=0000:e1:00.1 id=14e4:165f class=0200(Ethernet) link=0.50GB/s)
        NUMANode L#4 (P#10 local=66058316KB total=66058316KB)
          PCIBridge L#10 (busid=0000:c0:01.1 id=1022:1483 class=0604(PCIBridge) link=15.75GB/s buses=0000:[c1-c1])
            PCI L#7 (busid=0000:c1:00.0 id=1000:10e2 class=0104(RAID) link=15.75GB/s PCISlot=0-4)
              Block(Disk) L#6 (Size=1875374424 SectorSize=512 LinuxDeviceID=8:48 Vendor=NVMe Model=Dell_Ent_NVMe_v2 Revision=.2.0 SerialNumber=36435330529024130025384100000002) "sdd"
              Block(Disk) L#7 (Size=6250037248 SectorSize=512 LinuxDeviceID=8:64 Vendor=DELL Model=PERC_H755N_Front Revision=5.16 SerialNumber=6f4ee080160bd5002ab7652100a1691a) "sde"
              Block(Disk) L#8 (Size=1875374424 SectorSize=512 LinuxDeviceID=8:32 Vendor=NVMe Model=Dell_Ent_NVMe_v2 Revision=.2.0 SerialNumber=36435330529024120025384100000002) "sdc"
          PCIBridge L#11 (busid=0000:c0:08.3 id=1022:1484 class=0604(PCIBridge) link=31.51GB/s buses=0000:[c4-c4])
            PCI L#8 (busid=0000:c4:00.0 id=1022:7901 class=0106(SATA) link=31.51GB/s)
        NUMANode L#5 (P#11 local=66057292KB total=66057292KB)
        NUMANode L#6 (P#12 local=66058316KB total=66058316KB)
        NUMANode L#7 (P#13 local=66040920KB total=66040920KB)
          PCIBridge L#13 (busid=0000:80:01.2 id=1022:1483 class=0604(PCIBridge) link=2.00GB/s buses=0000:[81-81])
            PCI L#9 (busid=0000:81:00.0 id=10de:1bb1 class=0300(VGA) link=2.00GB/s PCISlot=4)
Special depth -3:   8 NUMANode (type #13)
Special depth -5:   10 PCIDev (type #15)
Special depth -6:   9 OSDev (type #16)
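If lstopo is not available, the same alignment information is exposed per device in sysfs. A quick sketch (numa_node reads -1 when the platform reports no affinity for a device):

```shell
#!/bin/sh
# Print the NUMA node each PCI device is attached to, straight from sysfs.
for dev in /sys/bus/pci/devices/*; do
  printf '%s -> node %s\n' "${dev##*/}" "$(cat "$dev/numa_node" 2>/dev/null || echo '?')"
done
```

Matching a device's node against the node your process runs on is a fast sanity check before reaching for the full topology dump.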

You can also print it to a picture with `lstopo --of png > r7525.png`:

(lstopo topology diagram of the R7525)

Obligatory legal disclaimer: I work for Dell.

Grant Curell