I have an Intel P3700 SSD mounted in a Lenovo x3650 M5 server with two Intel Xeon E5-2630 v3 CPUs. The server is running Ubuntu 14.04 with kernel 4.6.4.
I've been using fio to benchmark the SSD with synchronous sequential reads at a 1MB block size. The bandwidth result is ~1.4GB/s, which is well below the 2.8GB/s maximum the drive should reach; I have seen a P3700 hit that bandwidth with a similar benchmark on a high-end PC.
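A fio invocation of roughly this shape produces the workload described above (the device path, runtime and direct=1 flag are illustrative rather than copied from my exact command line; the job name matches the read_simple process visible in the blktrace output below):

    fio --name=read_simple --filename=/dev/nvme0n1 \
        --rw=read --bs=1M --ioengine=sync --direct=1 \
        --numjobs=1 --cpus_allowed=0 --runtime=60 --time_based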
Using blktrace I can see a fairly large latency of ~425usec between dispatching a request and receiving the data back.
Edit - I do not actually know whether the ~425usec latency is high, since my comparison against the P3700 spec was incorrect. The spec latency of 20usec is for sequential reads at a 4KB block size. Measuring the latency with fio and a 4KB block size on my system, I get an average of ~50usec, which seems pretty decent IMHO.
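The trace below was captured by piping blktrace into blkparse along these lines (the device node /dev/nvme0n1 is inferred from the 259,0 major/minor in the output, so treat it as an assumption):

    blktrace -d /dev/nvme0n1 -o - | blkparse -i -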
259,0 0 40510 1.298997405 21580 Q R 99427328 + 1024 [read_simple]
259,0 0 40511 1.298998348 21580 X R 99427328 / 99427584 [read_simple]
259,0 0 40512 1.298998572 21580 Q R 99427584 + 768 [read_simple]
259,0 0 40513 1.298998775 21580 G R 99427328 + 256 [read_simple]
259,0 0 40514 1.298999664 21580 X R 99427584 / 99427840 [read_simple]
259,0 0 40515 1.298999882 21580 Q R 99427840 + 512 [read_simple]
259,0 0 40516 1.299000060 21580 G R 99427584 + 256 [read_simple]
259,0 0 40517 1.299001737 21580 D RS 99427328 + 256 [read_simple]
259,0 0 40518 1.299002539 21580 X R 99427840 / 99428096 [read_simple]
259,0 0 40519 1.299002738 21580 Q R 99428096 + 256 [read_simple]
259,0 0 40520 1.299002932 21580 G R 99427840 + 256 [read_simple]
259,0 0 40521 1.299004179 21580 D RS 99427584 + 256 [read_simple]
259,0 0 40522 1.299005114 21580 G R 99428096 + 256 [read_simple]
259,0 0 40523 1.299006132 21580 D RS 99427840 + 256 [read_simple]
259,0 0 40524 1.299006563 21580 U N [read_simple] 1
259,0 0 40525 1.299006765 21580 I RS 99428096 + 256 [read_simple]
259,0 0 40526 1.299007810 21580 D RS 99428096 + 256 [read_simple]
259,0 0 40527 1.299433368 0 C RS 99427328 + 256 [0]
259,0 0 40528 1.299457972 0 C RS 99427584 + 256 [0]
259,0 0 40529 1.299499252 0 C RS 99428096 + 256 [0]
259,0 0 40530 1.299509996 0 C RS 99427840 + 256 [0]
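To get a distribution of dispatch-to-completion (D2C) times rather than eyeballing individual D/C pairs like the ones above, btt can summarize the binary trace. A sketch, again assuming the device is /dev/nvme0n1:

    blktrace -d /dev/nvme0n1 -w 30          # capture 30 seconds of per-CPU trace files
    blkparse -i nvme0n1 -d nvme0n1.bin      # merge them into one binary stream for btt
    btt -i nvme0n1.bin                      # the D2C section summarizes device-side latency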
I suspected an issue with interrupt handling: maybe for some reason CPU1 gets the interrupt from the SSD and then forwards it to CPU0, adding extra overhead. But looking at /proc/interrupts (see image below), it seems that all interrupts from nvme0q0 (and, for some reason, nvme0q1) reach core0 only - which is fine, since I run fio on core0 only.
Interrupts table screenshot from /proc/interrupts
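For completeness, the per-queue IRQ affinity can also be inspected (and pinned) directly; the IRQ number below is a placeholder to be looked up in /proc/interrupts, and irqbalance should be stopped before pinning anything by hand:

    grep nvme /proc/interrupts                 # find the IRQ numbers of nvme0q0, nvme0q1, ...
    cat /proc/irq/<IRQ>/smp_affinity_list      # which CPUs this interrupt may be delivered to
    echo 0 > /proc/irq/<IRQ>/smp_affinity_list # pin that queue's interrupt to core 0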
Any other ideas? Debugging advice? Solutions?
Thanks!