
My application writes a large file (~650 GB) to an SSD. The file contains a large number of ~1 KiB messages of variable length, so I maintain in-memory indices for several message-filtering criteria. Each index stores the byte offset of a message within the file.

In one scenario, almost all of the messages belonging to one index end up roughly 63,000 bytes apart on disk. That means after reading one ~1 KiB message, the application has to seek forward ~63,000 bytes to read the next one. With the OS disk cache dropped (`sync; echo 1 > /proc/sys/vm/drop_caches;`), the application can only read around 12,000 messages per second, i.e. about 12 MB/s. Is that the expected behaviour? Please advise. I prepared a small application to simulate the same access pattern.

#include <string.h>
#include <stdlib.h>     // atol()
#include <stdint.h>     // uint64_t
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/time.h>
#include <time.h>
#include <errno.h>
#include <iostream>
#include <unistd.h>

using namespace std;

uint64_t gethrtime_us()
{
        timeval tv;
        gettimeofday(&tv, NULL);
        uint64_t t = tv.tv_sec;
        t = t * 1000 * 1000 + tv.tv_usec;
        return t;
}

int main(int argc, const char** argv)
{
    if (argc < 3)
    {
        cout << "Usage: " << argv[0] << " <gap-bytes> <message-count>" << endl;
        return 1;
    }
    uint64_t gap = atol(argv[1]);       // bytes to skip after each message
    uint64_t messages = atol(argv[2]);  // number of 1 KiB messages to read

    int32_t fd = open("/mnt/OC/data", O_RDONLY); // /mnt/OC/data is the 650 GB binary data file
    if (fd < 0)
    {
        cout << "Open Error" << endl;
        return 1;
    }

    uint64_t startTime = gethrtime_us();
    for (uint64_t i = 0 ; i < messages ; i++)
    {
        char message[1024];
        if (read(fd, message, 1024) != 1024)    // read one 1 KiB message
        {
            cout << "Read Error @" << i << endl;
            break;
        }
        if (lseek(fd, gap, SEEK_CUR) == -1)     // skip ahead to the next message
        {
            cout << "Seek Error @" << i << endl;
            break;
        }
    }
    uint64_t endTime = gethrtime_us();
    cout << "Time taken " << (endTime - startTime) << " micro seconds" << endl;

    return 0;
}

Test results are as follows.

root@comp|centos7|3.10.0-957.5.1.el7:0:bin # g++ -std=c++11 -O3 Main.cpp;
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 0 100000
Time taken 172105 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 1000 100000
Time taken 318472 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 2000 100000
Time taken 446191 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 3000 100000
Time taken 561590 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 4000 100000
Time taken 702702 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 5000 100000
Time taken 874044 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 6000 100000
Time taken 1105384 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 7000 100000
Time taken 1698921 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 8000 100000
Time taken 1165863 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 9000 100000
Time taken 7445158 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 10000 100000
Time taken 8033516 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 15000 100000
Time taken 8950091 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 20000 100000
Time taken 9135020 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 30000 100000
Time taken 9180888 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 40000 100000
Time taken 9123635 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 50000 100000
Time taken 9181108 micro seconds
root@comp|centos7|3.10.0-957.5.1.el7:0:bin # sync; echo 1 > /proc/sys/vm/drop_caches; ./a.out 60000 100000
Time taken 9141465 micro seconds

In the final case, 100,000 messages take about 9 seconds: (100,000 × 1 KiB) / 9 s ≈ 11,111 KiB/s ≈ 11 MB/s.

I'm puzzled that SSD manufacturers claim ~2-3 GB/s sequential read rates (taking advantage of the OS disk cache), yet in this kind of scenario the drive can only do ~12 MB/s. Please advise.

Sujith Gunawardhane
  • Typical writes are under 200 MB/s for small random. Here is some info: [https://www.thessdreview.com/ssd-guides/beginners-guide/the-ssd-manufacturers-bluff/2/](https://www.thessdreview.com/ssd-guides/beginners-guide/the-ssd-manufacturers-bluff/2/) look at the first 4K numbers not the QD32 or threaded 4K. – drescherjm Dec 17 '20 at 14:24
  • Yes correct, but in this case it is absolutely random and far apart. The disk cache is not even involved. On this SSD, the speed goes above 500 MB/s when the data are nearby. – Sujith Gunawardhane Dec 17 '20 at 14:29
  • For small random I/Os, MBPS is meaningless. Measure IOPS. Since you are going through a filesystem, there are more I/Os than the ones explicit in your program. Use something like iostat to measure. – stark Dec 17 '20 at 14:29
  • Are these "in-memory indices" byte offsets or message index offsets? – Moshe Gottlieb Dec 17 '20 at 14:33
  • @MosheGottlieb It is byte offsets. It is just used to seek within the file. – Sujith Gunawardhane Dec 17 '20 at 14:35
  • I don't think there's anything wrong with manufacturers claiming ~2-3 GB/s sequential read rate and then not delivering ~2-3 GB/s in a "random and very much not sequential read" situation. – Brendan Dec 17 '20 at 15:01
  • I'd also recommend trying asynchronous IO if you can. With blocking IO (e.g. `read()`) the SSD probably spends half its time idle waiting for the next read request while the CPU spends the other half of the time waiting for the previous read to complete. For sequential reads this can be "hidden" (in OS and in SSD) by detecting the pattern and prefetching data before it's requested. – Brendan Dec 17 '20 at 15:07 (see the sketch after this comment list)

1 Answer


I think you should use disk benchmark applications to test your SSD performance. See https://www.raymond.cc/blog/measure-actual-hard-disk-perfomance-under-windows/

That said, reading relatively small chunks of data involves a system call and all of the associated operating-system overhead. You would be better off bypassing the filesystem and reading from the raw device. On Linux, with root access, you would just read from /dev/sda or /dev/sdb (which represent the entire disk). With the necessary privileges, you can also do that on Windows.
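
As an illustration of the raw-device idea, here is a minimal sketch that reads one 1 KiB message directly from the block device with pread(). The device name /dev/sda is taken from the paragraph above; the byte offset is a made-up placeholder, it must be run as root, and note that reads from a block device still go through the page cache unless O_DIRECT (with suitably aligned buffers) is used.

#include <fcntl.h>
#include <unistd.h>
#include <iostream>

using namespace std;

int main()
{
    // Whole-disk block device, as suggested above; requires root.
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0)
    {
        cout << "Open Error" << endl;
        return 1;
    }

    char message[1024];
    off_t offset = 1024 * 1024;                 // placeholder byte offset on the disk
    ssize_t n = pread(fd, message, sizeof(message), offset);
    if (n != (ssize_t)sizeof(message))
        cout << "Read Error, got " << n << " bytes" << endl;
    else
        cout << "Read " << n << " bytes from /dev/sda" << endl;

    close(fd);
    return 0;
}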

Tarik