3

I am dealing with a callback function where, based on the data in the callback, I want to write to different files.

For example, one call might write to january.csv, while another call with different data might write to july.csv instead. There is no pre-determined sequence; it could be any month in any given callback, and I have no way of knowing in advance. january.csv (all the months, actually) will get written to multiple times.

These callbacks are happening extremely rapidly so I need this code to be as efficient as possible.

The naive approach I would take would be to use the following code each time:

std::ofstream fout;
fout.open(month_string, std::ios::app); // append, so earlier rows are not overwritten
fout << data_string << std::endl;
fout.close();

The problem is that this doesn't seem very efficient, since I am continuously opening and closing the month.csv file. Is there a faster way where I can, say, keep january.csv, february.csv, etc. open all the time to speed this up?

EDIT: I am writing to /dev/shm on Linux, so I/O delays are not really a problem.

user788171
  • Why not have multiple `std::ofstream` objects? One for each month? Like an array of them? – chrisaycock Mar 19 '13 at 14:39
  • That's possible, but my example above is contrived, in reality, I have more than 12 ofstream objects, the number is closer to 10,000. I would have to loop through an array of 10,000 in each callback which is slower than fout.open() and fout.close(). – user788171 Mar 19 '13 at 14:56
  • Do you really write to 10K files for each callback? Or do you only write to one specific file per callback? If the latter, then how do you decide which file to write to? Any chance you could use a `std::map` or `std::unordered_map` from the input to the `std::ofstream` object? – chrisaycock Mar 19 '13 at 14:59
  • I write to one specific file per callback, but the callback could write to one of 10,000 possible files. The callback passes a string that indicates which file to write to. I could do an ofstream array, but I would need a highly efficient way to find the right index corresponding to each filename string. – user788171 Mar 19 '13 at 15:04
  • @user788171 `std::map` has [`log(size)`](http://www.cplusplus.com/reference/map/map/operator%5B%5D/) access time for any given index. – Mihai Todor Mar 19 '13 at 15:24
  • are these filenames generated? – didierc Mar 19 '13 at 15:54

3 Answers

2

You want to reduce the number of I/O calls and at the same time, make the best use of them when you do call them.

For example, cache the data and write larger chunks to the file. You could have another thread that is responsible for periodically flushing the buffers to the file.
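
For instance, a minimal sketch along those lines, assuming C++11; the class name, the `append`/`flush` interface and the 64 KB threshold are illustrative choices, not something taken from the question:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Sketch: accumulate rows in memory and write them out in one large chunk
// with a single write() call, instead of one operator<< per row.
class ChunkedWriter {
public:
    explicit ChunkedWriter(std::string path) : path_(std::move(path)) {}
    ~ChunkedWriter() { flush(); }                   // don't lose the tail at shutdown

    void append(const std::string& row) {
        buffer_ += row;
        buffer_ += '\n';
        if (buffer_.size() >= kFlushThreshold)      // touch the filesystem only occasionally
            flush();
    }

    void flush() {
        if (buffer_.empty()) return;
        std::ofstream out(path_, std::ios::app);    // open, write one block, close
        out.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        buffer_.clear();
    }

private:
    static constexpr std::size_t kFlushThreshold = 64 * 1024;  // tune for your workload
    std::string buffer_;
    std::string path_;
};
```

If a separate thread is the one calling `flush()` periodically, access to the buffer would of course need to be protected by a mutex.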

The inefficiency is two-fold: waiting for the hard drive to initialize (get up to speed), and locating the file and an empty sector to write to. This overhead occurs regardless of the amount of data you are writing, so the larger the block of data, the more of the time is spent writing efficiently (while the platters are spinning). The same is true for flash/thumb drives, which have their own overhead (unlocking, erasing, etc.). So the objective is to reduce the overhead by writing in large chunks.

You may want to consider using a database: Evaluating the need for database.

Thomas Matthews
  • Forgot to mention, I am writing to memory (/dev/shm) so I/O delays are not really a problem. – user788171 Mar 19 '13 at 15:18
  • Again, you want to cache your data and write large chunks. For example, one `ostream::write` is faster than many `operator<<`. This can be demonstrated by outputting constant text to the console. First, use `operator<<` and observe the time. Now put the text into a `static const char []` and use `ostream::write` with the array. – Thomas Matthews Mar 19 '13 at 15:24
  • Even if you are writing to memory, I still think Thomas Matthews's suggestion is sound design. Does it not fulfill all your requirements? – Ed Rowlett-Barbu Mar 19 '13 at 15:24
0

I doubt most systems will allow you to have ~10K files open at once, which more or less rules out just opening all the files and writing to them as needed.

As such, you probably just about need to create some sort of proxy-ish object to buffer data for each file, and when a buffer exceeds some given size, open the file, write the data to disk, and close it again.

I can see two fairly simple approaches to this. One would be to write most of the code yourself, using a stringstream as the buffer. The client streams to your object, which just passes it through to the stringstream. Then you check whether the stringstream exceeds some length; if so, you write the contents to disk and empty the stringstream.

The other approach would be to write your own file buffer object whose sync opens the file, writes the data, and closes the file again (where a normal file buffer would leave the file open the whole time).
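
A minimal sketch of that second approach, assuming a `std::stringbuf` subclass; the class name and the append-on-flush policy are illustrative, not prescribed by the answer:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Sketch: a stringbuf that, when flushed, appends whatever it has buffered
// to its file and then empties itself; the file is only open during sync().
class LazyFileBuf : public std::stringbuf {
public:
    explicit LazyFileBuf(std::string path) : path_(std::move(path)) {}
    ~LazyFileBuf() { sync(); }                      // flush any leftovers

protected:
    int sync() override {
        std::ofstream out(path_, std::ios::app);
        if (!out) return -1;
        out << str();                               // write everything buffered so far
        str("");                                    // empty the buffer
        return out ? 0 : -1;
    }

private:
    std::string path_;
};

// Usage: LazyFileBuf buf("january.csv"); std::ostream os(&buf);
//        os << data << '\n';  ...  os.flush();     // flush() triggers sync()
```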

Then you'd store those in a std::map (or std::unordered_map) to let you do a lookup from the file name to the matching proxy object.
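
Sketched with the stringstream variant and a `std::unordered_map`, assuming C++11; the names and the threshold are made up, and flushing whatever remains at shutdown is left out:

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Sketch: one stringstream buffer per target file, looked up by name and
// flushed to disk only when it grows past an arbitrary threshold.
std::unordered_map<std::string, std::ostringstream> buffers;

void write_row(const std::string& file_name, const std::string& row) {
    std::ostringstream& buf = buffers[file_name];            // created on first use
    buf << row << '\n';
    if (static_cast<std::streamoff>(buf.tellp()) >= 64 * 1024) {
        std::ofstream out(file_name, std::ios::app);
        out << buf.str();                                    // one big write
        buf.str("");                                         // empty the buffer
        buf.clear();
    }
}
```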

Jerry Coffin
  • On linux, it is possible to increase the file descriptor limit to handle 10k simultaneously open files. – user788171 Mar 19 '13 at 15:47
  • @user788171: In theory, sure. In reality, I suspect there are some fairly good reasons it's normally limited to a lot fewer than that, so while you *could* do it, you probably don't want to. – Jerry Coffin Mar 19 '13 at 15:49
0

I don't think that opening and closing the same file over and over will be that expensive. OSes are usually designed to handle that use case by caching part of the FS metadata in memory. The cost will mostly be context switching for the system call. On the other hand, doing that on 10k files will probably exhaust the OS caching ability.

You could take over a bit of the FS's work on your side by writing all of your output sequentially, each entry annotated with its target file, into a single journal file. Another programme (the FS suppleant, i.e. a stand-in for the filesystem) would then have the task of opening that journal, buffering the write commands (grouping them by file) and flushing them to disk when a buffer reaches a certain threshold. You would have to mark performed commands in the journal as committed, so that if the suppleant breaks and must recover, it knows what is left to be done.
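
A hedged sketch of the journalling side of that idea, assuming a simple `target<TAB>payload` line format appended to one always-open journal; the replay/commit bookkeeping of the separate programme is not shown:

```cpp
#include <fstream>
#include <string>

// Sketch: the callback only ever appends to a single journal file; a separate
// process later groups the entries by their target file and writes them out.
class Journal {
public:
    explicit Journal(const std::string& path)
        : out_(path, std::ios::app) {}              // one file, kept open for the duration

    void record(const std::string& target, const std::string& payload) {
        out_ << target << '\t' << payload << '\n';  // annotate each row with its target
    }

private:
    std::ofstream out_;
};

// In the callback: journal.record(month_string, data_string);
```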


Update:

You can tune the file system to support opening and caching 10,000 files at the same time, and leave it to deal with the issue of scheduling the commands (this is what filesystems are made for).

Your problem is then really to pick the right file system for your use case. I suggest running tests with different filesystems and seeing which one performs best.

The only remaining part would be to have your programme use a std::map to associate the filenames with their descriptors (trivial).

See SO for tuning the Linux max open files limit, or perhaps ask a question on that topic if you cannot find one for your specific FS.
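
If you go that route, a rough sketch of raising the descriptor limit and then keeping every stream open, assuming Linux/POSIX (`RLIMIT_NOFILE`) and purely illustrative function names:

```cpp
#include <sys/resource.h>    // getrlimit / setrlimit (POSIX)

#include <fstream>
#include <map>
#include <string>

// Sketch: raise RLIMIT_NOFILE up to the hard limit, then keep one std::ofstream
// per target file open for the lifetime of the programme.
bool raise_fd_limit(rlim_t wanted) {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
    if (lim.rlim_cur >= wanted) return true;         // already high enough
    lim.rlim_cur = (wanted < lim.rlim_max) ? wanted : lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}

std::map<std::string, std::ofstream> open_files;

void write_line(const std::string& file_name, const std::string& row) {
    std::ofstream& out = open_files[file_name];      // default-constructed on first lookup
    if (!out.is_open())
        out.open(file_name, std::ios::app);          // opened once, never closed
    out << row << '\n';
}
```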

didierc