5

My laptop has an SSD with a 512-byte logical sector size and a 4,096-byte physical sector size. I'm working on an ACID database system that has to bypass all OS caches, so I write directly from allocated internal memory (RAM) to the SSD. I also extend the files before I run the tests and don't resize them during the tests.

Now here is my problem: according to SSD benchmarks, random read and write throughput should be in the range of 30 MB/s to 90 MB/s. But here is my (rather horrible) telemetry from my numerous performance tests:

  • 1.2 MB/s when reading random 512-byte blocks (logical sector size)
  • 512 KB/s when writing random 512-byte blocks (logical sector size)
  • 8.5 MB/s when reading random 4,096-byte blocks (physical sector size)
  • 4.9 MB/s when writing random 4,096-byte blocks (physical sector size)

In addition to using asynchronous I/O, I set the FILE_SHARE_READ and FILE_SHARE_WRITE flags to disable all OS buffering; because our database is ACID, I must do this. I also tried FlushFileBuffers(), but that gave me even worse performance. Finally, I wait for each async I/O operation to complete, as is required by some of our code.

Here is my code. Is there a problem with it, or am I stuck with this bad I/O performance?

HANDLE OpenFile(const wchar_t *fileName)
{
    // Set access method
    DWORD desiredAccess = GENERIC_READ | GENERIC_WRITE;

    // Set file flags
    DWORD fileFlags = FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING /*| FILE_FLAG_RANDOM_ACCESS*/;

    //File or device is being opened or created for asynchronous I/O
    fileFlags |= FILE_FLAG_OVERLAPPED;

    // Exclusive use (no share mode)
    DWORD shareMode = 0;

    HANDLE hOutputFile = CreateFile(
        // File name
        fileName,
        // Requested access to the file 
        desiredAccess,
        // Share mode. 0 equals exclusive lock by the process
        shareMode,
        // Pointer to a security attribute structure
        NULL,
        // Action to take on file
        CREATE_NEW,
        // File attributes and flags
        fileFlags,
        // Template file
        NULL
    );
    if (hOutputFile == INVALID_HANDLE_VALUE)
    {
        DWORD lastError = GetLastError();
        std::wcerr << L"Unable to create the file '" << fileName << L"'. [CreateFile] error #" << lastError << L"." << std::endl;
    }

    return hOutputFile;
}

DWORD ReadFromFile(HANDLE hFile, void *outData, _UINT64 bytesToRead, _UINT64 location, OVERLAPPED *overlappedPtr, 
    asyncIoCompletionRoutine_t completionRoutine)
{
    DWORD bytesRead = 0;

    if (overlappedPtr)
    {
        // Windows demands that you split the file byte location into high & low 32-bit values
        overlappedPtr->Offset = (DWORD)_UINT64LO(location);
        overlappedPtr->OffsetHigh = (DWORD)_UINT64HI(location);

        // Should we use a callback function or a manual event
        if (!completionRoutine && !overlappedPtr->hEvent)
        {
            // No manual event supplied, so create one. The caller must reset and close it themselves
            overlappedPtr->hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            if (!overlappedPtr->hEvent)
            {
                DWORD errNumber = GetLastError();
                std::wcerr << L"Could not create a new event. [CreateEvent] error #" << errNumber << L".";
            }
        }
    }

    BOOL result = completionRoutine ? 
        ReadFileEx(hFile, outData, (DWORD)(bytesToRead), overlappedPtr, completionRoutine) : 
        ReadFile(hFile, outData, (DWORD)(bytesToRead), &bytesRead, overlappedPtr);

    if (result == FALSE)
    {
        DWORD errorCode = GetLastError();
        if (errorCode != ERROR_IO_PENDING)
        {
            std::wcerr << L"Can't read sectors from file. [ReadFile] error #" << errorCode << L".";
        }
    }

    return bytesRead;
}

2 Answers

3

Random I/O performance is not measured well in MB/s. It is measured in IOPS. "1.2 MB/s when reading random 512 byte blocks" => about 2,400 IOPS. Not bad. Double the block size and you'll get 199% the MB/s and 99% the IOPS, because it takes almost the same time to read 512 bytes as it does to read 1,024 bytes (almost no time at all). SSDs are not free of seek costs, as is sometimes mistakenly assumed.

So the numbers are not actually bad at all.

SSDs benefit from a high queue depth. Try issuing multiple I/Os at once and keep that number outstanding at all times. The optimal concurrency will be somewhere in the range of 1 to 32.

Because SSDs have internal hardware concurrency, you can expect a small multiple of the single-threaded performance. My SSD has 4 parallel "banks", for example.

Using FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING is all that is needed to achieve direct writes to hardware. If these flags do not work, your hardware does not respect them and you can't do anything about it. All server hardware respects these flags, and I have not seen a consumer disk that doesn't.

The sharing flags are not meaningful in this context.

The code is fine, although I don't see why you use async I/O and then wait on an event for completion. That makes no sense. Either use synchronous I/O (which will perform about the same as async I/O) or use async I/O with completion ports, without waiting.

usr
0

Use hdparm -I /dev/sdx (on Linux) to check your logical and physical block sizes. Most modern SSDs have a 4,096-byte physical block size but also support 512-byte blocks for backward compatibility with older drives and OS software. This is done through "512-byte emulation", a.k.a. 512e. If your drive is one of the ones that does 512-byte emulation, your 512-byte accesses are actually read-modify-write operations. The SSD will try to turn sequential accesses into 4K block writes.

If you can switch to 4K block writes, you will (probably) see much better numbers for IOPS as well as bandwidth, since this means much less work for the SSD. Random 512-byte writes also have a big impact on long-term performance due to increased write amplification.

Alex