
I have a very strange situation happening with some of my tests regarding parallel I/O. Here is the situation: I have multiple threads, each opening a file handle to the same file and reading a finite number of bytes from a different location of the file (evenly spaced intervals) into an array. All is done with Boost threads. Now, I assume that with an HDD this should be slower due to the random-access seeking, which is why my tests actually target an SSD. It turns out I get almost no speedup when reading the same file from a solid state disk compared to an HDD. What might the problem be? Does that seem surprising just to me? I am posting my code below so you can see exactly what I am doing:

    void readFunctor(std::string pathToFile, size_t filePos, BYTE* buffer, size_t buffPos, size_t dataLn, boost::barrier& barrier) {

        FILE* pFile = fopen(pathToFile.c_str(), "rb");

        // Seek to this thread's slice of the file and read it into its slice of the buffer.
        fseek(pFile, filePos, SEEK_SET);
        fread(buffer + buffPos, sizeof(BYTE), dataLn, pFile);

        fclose(pFile);
        barrier.wait();
    }

    void joinAllThreads(std::vector<boost::shared_ptr<boost::thread> > &threads) {
        for (std::vector<boost::shared_ptr<boost::thread> >::iterator it = threads.begin(); it != threads.end(); ++it) {
            (*it)->join();
        }
    }

    void readDataInParallel(BYTE* buffer, std::string pathToFile, size_t lenOfData, size_t numThreads) {
        std::vector<boost::shared_ptr<boost::thread> > threads;
        boost::barrier barrier(numThreads);
        size_t dataPerThread = lenOfData / numThreads;

        for (size_t var = 0; var < numThreads; ++var) {
            size_t filePos = var * dataPerThread;
            size_t bufferPos = var * dataPerThread;
            size_t dataLenForCurrentThread = dataPerThread;
            // The last thread also picks up the remainder.
            if (var == numThreads - 1) {
                dataLenForCurrentThread = dataLenForCurrentThread + (lenOfData % numThreads);
            }

            boost::shared_ptr<boost::thread> thread(
                    new boost::thread(readFunctor, pathToFile, filePos, buffer, bufferPos, dataLenForCurrentThread, boost::ref(barrier)));
            threads.push_back(thread);
        }

        joinAllThreads(threads);
    }

Now, in my main file I pretty much have:

    clock_t start_s = clock();
    size_t sizeOfData = 2032221073;
    // The buffer comes from malloc, so the shared_ptr needs free() as its deleter.
    boost::shared_ptr<BYTE> buffer((BYTE*) malloc(sizeOfData), free);
    readDataInParallel(buffer.get(), "/home/zahari/Desktop/kernels_big.dat", sizeOfData, 4);
    clock_t stop_s = clock();
    printf("%f %f\n", ((double) start_s / CLOCKS_PER_SEC) * 1000, ((double) stop_s / CLOCKS_PER_SEC) * 1000);

Surprisingly, when reading from the SSD, I do not get any speedup compared to the HDD. Why might that be?

Zahari
  • It might first write to an output buffer? What do you mean? Why would it write to an output buffer, and even if it does, how does that matter in any way? – Zahari Jul 16 '13 at 12:18
  • Consider the limiting factors -- how fragmented is the file on the hard disk? are the random sections going to be on the same (or near) cylinder? what is the peak throughput like (compared to the capacity of the SATA connection)? What else is using the disks? – Rowland Shaw Jul 16 '13 at 12:19
  • The SSD in particular was formatted and that file is really the first one written to it. Furthermore, from some Linux benchmarking (with the built-in disk tool), the average read rate of the drive is 4 times higher than that of the HDD. Yet my code does not run faster. And nothing else is using the SSD. – Zahari Jul 16 '13 at 12:23
  • All operating systems cache files in memory. If you run the test more than once, you are getting the files from memory and not HDD or SSD. – brian beuning Jul 16 '13 at 12:25
  • I have run it more than once, yes. But again, as I said, even on the first run it does not go faster. – Zahari Jul 16 '13 at 12:28
  • What size data files are you using? 1 KB, 1 MB, 1 GB? – brian beuning Jul 16 '13 at 12:36
  • @Zahari Whenever you made the data files, the OS will have put them in the file buffer cache, unless you rebooted or flushed the cache. – brian beuning Jul 16 '13 at 12:38
  • There is no concurrent access to the same drive from multiple threads. The threads will contend for the shared resource, and this will become the major bottleneck. Try synchronizing disk access with a mutex to get cleaner benchmarks. – SChepurin Jul 16 '13 at 12:56
  • So from what I get here, there is no real point in implementing parallel I/O even for SSDs. I am sure that massively randomizing the read positions, as suggested, will show some improvement in SSD reads. However, my purpose was to use this for real I/O, reading from a file where each thread reads a consecutive chunk of the file. Is that even worth it? Would I benefit from it? I thought that, given the nature of the SSD, this would increase my I/O performance a lot, but I guess I have been wrong. – Zahari Jul 16 '13 at 13:03
  • From "Parallel I/O for High Performance Computing" John M. May,2001 - *The main parallel I/O technique is disk striping... A computer writing a large quantity of data can split the data into pieces and write them simultaneously to separate disks in a disk array. The data is generally divided into fixed-size blocks, and the blocks are distributed cyclically to the disks.* Not much changed since for conventional drives. Google uses this technique in its data centers. – SChepurin Jul 16 '13 at 13:40

4 Answers


Your file probably gets cached, so what you're measuring is the CPU overhead and not the I/O. Instead of flushing the entire disk cache, you can call posix_fadvise() on the file prior to reading it, with the POSIX_FADV_DONTNEED flag, to advise the kernel not to cache it. That is, assuming you're on some kind of *nix platform or Mac OS.
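
For illustration, a minimal sketch of that idea (dropCacheFor is a hypothetical helper, not part of the question's code), assuming a Linux-like system where posix_fadvise() is available:

    #include <fcntl.h>
    #include <unistd.h>

    // Hypothetical helper: drop any cached pages for the file before a benchmark run,
    // so that the subsequent reads actually hit the disk instead of the page cache.
    void dropCacheFor(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        // offset 0 and len 0 cover the whole file; POSIX_FADV_DONTNEED asks the
        // kernel to discard cached pages for that range.
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
    }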

aldanor

A possible explanation for this is that you are not running a SATA III setup: the SATA III 6 Gbit/s SSD you are using is attached to an older SATA II 3 Gbit/s controller on the motherboard. In that case, your SSD is throttled down to 3 Gbit/s.

Check your hardware configuration. If it is SATA II, you need to replace the motherboard to let your SSD reach its full performance potential.

Also check whether your HDD is SATA, SATA II, or SATA III.

Make sure you are comparing apples to apples at the hardware interface level.

Kerry Kobashi

Your measurements are dominated by all the boilerplate of setting up four threads, each of which does a single read and then terminates when the last of the four threads executes barrier.wait().

In order to measure the performance, each thread should make thousands of single-byte reads in a loop before terminating.

Here is my suggestion for a change:

    #include <cstdio>
    #include <random>

    void readFunctor(std::string pathToFile, size_t filePos, BYTE* buffer, size_t buffPos, size_t dataLn)
    {
        FILE* pFile = fopen(pathToFile.c_str(), "rb");

        // Initialize random number generation for offsets within this thread's range.
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_int_distribution<size_t> randomizer(0, dataLn - 1);

        // Issue dataLn single-byte reads at random positions instead of one big block read.
        BYTE* dst = buffer + buffPos;
        for (size_t i = 0; i < dataLn; i++)
        {
            fseek(pFile, filePos + randomizer(gen), SEEK_SET);
            fread(dst++, sizeof(BYTE), 1, pFile);
        }

        fclose(pFile);
    }
ogni42
  • Okay, so what do you suggest in terms of changes to my code? And what, in particular, is the difference between reading a thousand bytes at once with fread(buffer, sizeof(BYTE), dataLn, pFile); and executing that a thousand times, reading one byte each time? – Zahari Jul 16 '13 at 12:31
  • The call to `fread` in your code results in reading one large block, which performs well even on an HDD. Performing random single-byte reads in parallel is completely different. I will try to post an idea in a separate answer... – ogni42 Jul 16 '13 at 12:38
  • So putting that in a loop would theoretically degrade performance on an HDD no matter what the physical layout of the file is? – Zahari Jul 16 '13 at 12:39
  • Again, even in a loop the results are: 44190 ms for HDD vs 44760 ms for SSD. Strange, don't you think? – Zahari Jul 16 '13 at 12:57
  • What platform are you running that on? What is the optimization level you are using for the compiler? – ogni42 Jul 16 '13 at 13:30
  • I am running on CentOS 6.3 x64 and am not really tweaking the compiler in any way. – Zahari Jul 16 '13 at 13:36
  • Give it a try running g++ with optimizations, -O2 (at least). – ogni42 Jul 16 '13 at 13:38
  • The poster says *each thread should make thousands of single byte reads*. If each read only fetches a *single byte*, you are more likely to measure API/OS overhead than the actual read rate. You must read larger chunks to allow that overhead to be amortized to an insignificant amount. – Ira Baxter Nov 21 '13 at 16:14

Depending on your data size, whether on SSD or HDD, the OS will cache your file. So you are probably not really accessing your disks, but memory.
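
For illustration, a minimal sketch of how you could clear Linux's page cache between runs so the reads actually hit the disk (dropLinuxCaches is a hypothetical helper, not from this thread; it needs root privileges and is equivalent to `echo 3 > /proc/sys/vm/drop_caches`):

    #include <cstdio>
    #include <unistd.h>

    // Hypothetical helper: ask Linux to drop its page, dentry and inode caches.
    // Must be run as root; call it before each timed run of the benchmark.
    bool dropLinuxCaches() {
        sync();  // flush dirty pages first so nothing is lost when the cache is dropped
        FILE* f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) return false;
        fputs("3\n", f);
        fclose(f);
        return true;
    }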

LS_ᴅᴇᴠ
  • Is there a way to force that not to happen? – Zahari Jul 16 '13 at 12:48
  • I'm not sure. Anyway, that should depend on OS. – LS_ᴅᴇᴠ Jul 16 '13 at 12:57
  • I usually make my test size 2 to 4 times the size of memory. Or, on Linux, tell the OS to drop the cache. You can also usually tell the OS to use direct I/O or unbuffered reads and writes. – drescherjm Jul 16 '13 at 12:58
  • @drescherjm Sorry for my ignorance, but how do you tell the OS to "drop the cache"? – Zahari Jul 16 '13 at 21:52
  • Is it done using `int setvbuf(FILE *stream, char *buf, int mode, size_t size);`? – Zahari Jul 16 '13 at 21:59
  • This is the likely cause. The file is only 2 GB. The OS will keep it in memory, so you're essentially just measuring reading from memory, not from a drive. Use the method drescherjm describes before each test run if you're on Linux. – nos Jan 10 '14 at 19:15