
I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.

The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, so that thread0 gets paths 0->61, thread1 gets paths 62->123, and so on, and then runs the remaining files in serial at the end.

I have implemented timers throughout the code to try to see where it bottlenecks: run in serial, each file takes around 3.5s, while 8 files in parallel take around 21s (a reduction from the 28s that 8x3.5s would take in serial, but not by much).

The problematic section of my code is below:

if (DIAG_timers) {readTimer = timerNow();}
for (yindex=0; yindex<ycells; yindex++)
{
    for (xindex=0; xindex<xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}

Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).

convertToNumber takes the value in alphaStringValue, converts it to a double, and stores it in the alphaValue array.

alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.

The files to be read are approximately 40MB each (just over 5120000 lines, each containing a single value between 0 and 1, in most cases exactly 0 or 1). I have 16GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (1 per thread) should be open at once. I am unsure whether mmap would do this better? Several threads on stackoverflow argue about the merits of mmap vs more straightforward read operations, in particular for sequential access, so I don't know whether that would be beneficial.
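For comparison, the "copy the whole file to memory first" idea could be sketched as below. This is only a sketch under my own assumptions, not code from the question: `loadAlphaFile` is a made-up helper, and the values are parsed flat with strtod rather than into the 2D alphaValue array.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: slurp an entire file into one buffer with a single
// bulk read, then parse the numbers out of memory with strtod, avoiding any
// per-line stream objects.
std::vector<double> loadAlphaFile(const std::string& path)
{
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) throw std::runtime_error("cannot open " + path);
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::string buffer(static_cast<size_t>(size), '\0');
    if (std::fread(&buffer[0], 1, buffer.size(), f) != buffer.size())
    {
        std::fclose(f);
        throw std::runtime_error("short read on " + path);
    }
    std::fclose(f);

    std::vector<double> values;
    values.reserve(buffer.size() / 2);      // each line holds one short value
    const char* p = buffer.c_str();         // c_str() guarantees NUL termination
    char* end = nullptr;
    for (double v = std::strtod(p, &end); p != end; v = std::strtod(p, &end))
    {
        values.push_back(v);
        p = end;
    }
    return values;
}
```

The single fread replaces millions of getline calls, so each thread touches the kernel once per file instead of once per line.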

I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files simultaneously was causing slow, vaguely random-access I/O, but that just reduced the process to roughly serial speed.

Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.

Edit:

template<class T> inline void convertToNumber(std::string const& s, T &result)
{
    std::istringstream i(s);
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}

turns out to have been the slow section. I assume this is due to the construction of 5 million istringstreams per file, followed by the testing of 5 million if conditions. Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
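For reference, the fixed loop with TonyD's suggestion can be sketched as below. This is a self-contained version for demonstration: `readAlpha` and the vector-of-vectors storage are made up here (my real code uses a plain 2D array), and a single stream-state check after the loop recovers some error detection without a per-line istringstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Extract each value directly from the stream: no per-line std::string and
// no per-value istringstream are constructed, which is where the old
// getline + convertToNumber version lost its time.
std::vector<std::vector<double>> readAlpha(std::istream& alphaFile,
                                           int xcells, int ycells)
{
    std::vector<std::vector<double>> alphaValue(
        xcells, std::vector<double>(ycells));
    for (int yindex = 0; yindex < ycells; yindex++)
        for (int xindex = 0; xindex < xcells; xindex++)
            alphaFile >> alphaValue[xindex][yindex];
    // One check after the whole loop still catches a malformed file,
    // just without pinpointing the offending line.
    if (alphaFile.fail())
        throw std::runtime_error("failed while extracting alpha values");
    return alphaValue;
}
```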

chrisb2244
  • Did you use `time` (the shell builtin, or `/usr/bin/time`) to benchmark your program (notably the single threaded case). Are you sure it is not I/O or system CPU bound? Are you willing to spend hours of work for a few % improvement? – Basile Starynkevitch Jan 20 '14 at 07:39
  • The timer functions use `gettimeofday` and then just calculate the difference, returning `unsigned long long`s. The rest of the program takes on the order of 0.5s per loop, compared with the 3.5 per thread for this section, so don't think it's CPU bound. I'm less sure about IO, but seems my hdd makes little noise, and `iotop` and `dstat` show sporadic io, but mostly output, when the program later writes (this step takes only ~0.1 seconds and is protected by a mutex lock, since the processes all write to one file) – chrisb2244 Jan 20 '14 at 07:47
  • I wrote about *system CPU* (i.e. CPU doing system calls in the kernel). If you insist on computing time programmatically, use [times(2)](http://man7.org/linux/man-pages/man2/times.2.html) and [clock_gettime(2)](http://man7.org/linux/man-pages/man2/clock_gettime.2.html) and read [time(7)](http://man7.org/linux/man-pages/man7/time.7.html) and [time(1)](http://man7.org/linux/man-pages/man1/time.1.html). Don't forget that the kernel has a good file system and disk cache. – Basile Starynkevitch Jan 20 '14 at 07:47
  • Will read up on system CPU calls and try to time it, also on the links you provided. Thank you. As for few %, the program takes around 30 mins to run and it seems likely that my poor coding is responsible, so hoping to cut that down. – chrisb2244 Jan 20 '14 at 07:52
  • Perhaps try C stdio with the Linux specific `"m"` mode to [fopen(3)](http://man7.org/linux/man-pages/man3/fopen.3.html) -which use `mmap`- or switch directly to [mmap(2)](http://man7.org/linux/man-pages/man2/mmap.2.html) after [stat(2)](http://man7.org/linux/man-pages/man2/stat.2.html). How long does it take to `wc` all your files? – Basile Starynkevitch Jan 20 '14 at 07:55
  • time $(wc */alpha1) returned `real 2m54.235s, user 2m53.176s, sys 0m0.969s` Not sure if the */alpha1 results in extra work but imagine it's a small fraction of the total. – chrisb2244 Jan 20 '14 at 08:03
  • 1
    "Several threads on stackoverflow argue about the merits of mmap vs more straightforward read operations, in particular for sequential access, so I don't know if that would be beneficial." - seriously - if you care, just try it. Nothing anybody says on here can be as reliable - there are too many variables. Separately, if you're able to ensure the threads are processing files from different physical drives, that may help. Also, why use `getline`+`convertToNumber` rather than `alphaFile >> alphaValue[xindex][yindex]`? – Tony Delroy Jan 20 '14 at 08:18
  • Profile with `valgrind` using the `callgrind` toolset. – Johannes S. Jan 20 '14 at 08:23
  • @TonyD looks like changing the getline/convertToNumber to your suggested alphaFile >> alphaValue[][] made it some 20x faster. Clearly my function is for some reason horrifically inefficient, although perhaps due to the way it was reading from getline? I don't know, but this is basically solved provided it hasn't started putting values in weird places - will run and check, but thanks – chrisb2244 Jan 20 '14 at 08:27
  • 1
    @chrisb2244: you're welcome... mmap etc may given another major boost, but if you don't need it, that's all good. :-). (If you want insights re why your code was slow, best post `convertToNumber` source....) – Tony Delroy Jan 20 '14 at 08:57

2 Answers


The files to be read are approximately 40MB each (just a few lines more than 5120000, each containing only one value, between 0 and 1 (in most cases == (0||1) ), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,

Yes. But loading them there will still count towards your process' wall clock time unless they were already read by another process shortly before.

since only 8 (1 per thread) should be open at once.

Any file that was not already in memory before the process started has to be loaded, and that loading counts towards the process' wall clock time, so it does not matter how many are open at once. Any file that is not cached will slow down the process.

I am unsure if mmap would do this better?

No, it wouldn't. mmap is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system call overhead (with read you do a kernel entry for each read call, while with mmap pages brought in by read-ahead won't cause further page faults). But it will not save you the time to read the files from disk if they are not already cached.

mmap does not load anything into memory. The kernel loads data from disk into its internal buffers, the page cache. read copies the data from there to your application buffer, while mmap exposes parts of the page cache directly in your address space. In either case the data are fetched on first access and stay there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, the next process will get them faster. But on first access after a long time, the data will have to be read from disk, and this affects read and mmap exactly the same way.

Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.


You should be able to ask the system how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). That will confirm how much time was spent on I/O.
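A minimal sketch of that measurement, with the caveats that `reportCpuTime` is a made-up helper and that RUSAGE_THREAD (which gives per-thread figures on Linux) requires _GNU_SOURCE, so RUSAGE_SELF is shown here:

```cpp
#include <cassert>
#include <cstdio>
#include <sys/resource.h>
#include <sys/time.h>

// Convert a timeval from getrusage into fractional seconds.
static double secondsOf(const timeval& tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Print CPU time consumed so far; the gap between this and the wall-clock
// timers is time spent blocked, which in this program is mostly I/O waits.
// RUSAGE_THREAD (Linux-specific, needs _GNU_SOURCE) would give per-thread
// figures; RUSAGE_SELF covers the whole process.
void reportCpuTime(const char* label)
{
    rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    std::printf("%s: user %.3fs, system %.3fs\n",
                label,
                secondsOf(usage.ru_utime),
                secondsOf(usage.ru_stime));
}
```

Comparing user+system against the wall-clock timers already in the code shows directly what fraction of the 3.5s per file is spent waiting rather than computing.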

Jan Hudec
  • Thank you for this - I will test it out and consider if further improvements are likely to be found via I/O times, but as you say I won't gain much (anything?) given the files are read only once each at present. – chrisb2244 Jan 21 '14 at 01:25
  • Marking this as an answer since I can't mark comments as answers, and since it does describe features to consider re. `mmap`, along with a way to measure i/o within a program using `getrusage(2)` – chrisb2244 Jan 21 '14 at 01:27

mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.

It does however make the code slightly more complex, since you can't use the usual file I/O functions on an mmap'd region (and the main benefit is partly lost if you use the "m" mode of the stdio functions, as you are then back to at least one copy). From past experiments that I've made, mmap beats all other file reading variants by some amount. How much depends on what proportion of the overall time is spent waiting for the disk, and how much is spent actually processing the file content.
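A minimal POSIX sketch of that approach, assuming Linux and a file of whitespace-separated numbers; `mmapAlphaFile` is a hypothetical helper, not code from the question:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole file and parse it in place: no read() call copies data into
// an application buffer. POSIX-specific; error handling kept minimal.
std::vector<double> mmapAlphaFile(const std::string& path)
{
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("cannot open " + path);
    struct stat st;
    fstat(fd, &st);
    void* m = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping keeps the file referenced
    if (m == MAP_FAILED) throw std::runtime_error("mmap failed on " + path);
    const char* data = static_cast<const char*>(m);

    // Caveat: strtod relies on the kernel zero-filling the final page past
    // EOF; a file whose size is an exact multiple of the page size would
    // need an explicit NUL sentinel instead.
    std::vector<double> values;
    const char* p = data;
    char* end = nullptr;
    for (double v = std::strtod(p, &end); p != end; v = std::strtod(p, &end))
    {
        values.push_back(v);
        p = end;
    }
    munmap(m, st.st_size);
    return values;
}
```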

Mats Petersson