3

Here is the situation: a C++ program endlessly generates data at a regular rate. The data needs to be written to persistent storage quickly enough that it does not slow down the computation. The amount of data to be stored cannot be known in advance. After reading this and this post, I ended up with the following naive strategy:

  1. Creating one std::ofstream ofs
  2. Opening a new file: ofs.open("path/file", std::ofstream::out | std::ofstream::app)
  3. Adding each std::string using operator<<
  4. Closing the file once writing has terminated: ofs.close()
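
For reference, the four steps above can be sketched roughly as follows. The helper name `append_records` and its parameters are hypothetical and only serve the illustration; the file is assumed to be written one record per line.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: appends each record to the file at `path`, one per line.
// Returns false if the file could not be opened or a write failed.
bool append_records(const std::string& path,
                    const std::vector<std::string>& records) {
    std::ofstream ofs;                                        // step 1
    ofs.open(path, std::ofstream::out | std::ofstream::app);  // step 2
    if (!ofs.is_open()) {
        return false;
    }
    for (const std::string& record : records) {
        // step 3: '\n' does not force a flush, whereas std::endl would
        ofs << record << '\n';
    }
    ofs.close();                                              // step 4: flushes buffered data
    return !ofs.fail();
}
```

Because the file is opened with `std::ofstream::app`, calling the helper repeatedly keeps adding to the end of the same file.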

Nevertheless, I am still confused about the following:

  1. Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?
  2. I understand that flushing should be done automatically by std::ofstream; am I safe to use it as is? Is there any impact on memory I should be aware of? Do I have to tune the std::ofstream in some way (changing its buffer size?)?
  3. Should I be concerned about the file getting bigger and bigger? Should I close it at some point and open a new one?
  4. Does using std::string have any drawbacks? Are there hidden conversions that could be avoided?
  5. Is using std::ofstream::write() more advantageous?

Thanks for your help.

leag
  • Most of these questions depend on what you are inserting in the file... – Claudiordgz Mar 13 '14 at 17:09
  • @Claudiordgz I don't get your point, why and how should I care about the characters in the strings? – leag Mar 13 '14 at 17:23
  • You asked about the benefits of storing in binary mode: that depends on the data and how you search it. Should you be concerned with size? That depends on how you are using the file, what you are writing, and how often. You also asked whether string has drawbacks... Data will be stored as data, and when reading it back you'll have to parse it into something; whether parsing to a string again makes sense depends on what you need the data for, maybe you need doubles for that column. The conclusion is that you should ALWAYS care about the characters; consider that for someone, that information is really valuable. – Claudiordgz Mar 13 '14 at 18:07
  • I agree that data should be correctly formatted depending on its type and future analyses, but in this case I don't care about the format; they are just strings that need to be stored one way or another. – leag Mar 13 '14 at 18:46
  • Well, if that's the case: binary file I/O operations are faster, but since your data are strings you won't see A HELL of an improvement, as you would when storing lots of numbers, just some. Size will be smaller in binary, but binary will make the file unreadable to a human being. Given your specs, this is mostly what applies to you; binary storage has other advantages, but they are more relevant to other types of data such as numbers or objects. – Claudiordgz Mar 13 '14 at 19:26

2 Answers

2

1. Since the data will only be read afterwards, is it possible to use a binary (ios::binary) file storage? Would that be faster?

All data on any storage device is binary anyway, so asking the stream to write in binary mode only results in a more or less optimized way of saving the 0s and 1s. Whether it is faster depends on many things, including how you are going to use and read the data afterwards. Some of these factors are listed in Writing a binary file in C++ very fast. When it comes to storing on a hard drive, the performance of your code is always limited by the speed of that particular drive, which is a well-known fact.
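
As a rough sketch of what binary mode and `write()` look like in practice, the following assumes a hypothetical helper named `append_raw`; the 64 KiB default buffer size is an arbitrary illustration, not a recommendation. The `pubsetbuf` call is optional and its effect is implementation-defined, so treat it as something to measure rather than rely on.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: appends the raw bytes of `data` to `path` in binary mode.
// `buffer_bytes` is an illustrative tuning knob, not a recommended value.
bool append_raw(const std::string& path, const std::string& data,
                std::size_t buffer_bytes = 1 << 16) {
    // Declared before the stream so the buffer outlives it.
    std::vector<char> buffer(buffer_bytes);
    std::ofstream ofs;
    // Optional, implementation-defined: ask the stream to use a larger buffer.
    // Called before open(), since some implementations ignore later calls.
    ofs.rdbuf()->pubsetbuf(buffer.data(),
                           static_cast<std::streamsize>(buffer.size()));
    ofs.open(path, std::ios::out | std::ios::binary | std::ios::app);
    if (!ofs.is_open()) {
        return false;
    }
    // write() copies the bytes as they are: no formatting and,
    // in binary mode, no newline translation.
    ofs.write(data.data(), static_cast<std::streamsize>(data.size()));
    ofs.close();
    return !ofs.fail();
}
```

For plain strings the gain over `operator<<` is likely to be small, since the disk remains the bottleneck; the main practical difference is that the bytes reach the file untranslated.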

Try to make your questions more specific; as written, they are too general to be stated as a single "problem".

EpiGen
1

I'm probably not answering your questions directly, so please excuse me if I take a step back.

If I understand the issue correctly, the concern is that spending too long writing to disk would delay the endless data generation.

Perhaps you can dedicate a thread just to writing, while processing continues on the main thread.

The writer thread could wake up at periodic intervals to write to disk what has been generated so far.

Communication between the two threads can be either:

  1. two buffers (one active where the generation happens, one frozen, ready to be written to disk on the next batch)
  2. or a queue of data, inserted by the producer and removed by the consumer/writer.
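
A minimal sketch of the queue variant, under a few assumptions: the class name `QueuedFileWriter` is hypothetical, records are strings written one per line, and the writer is woken by a condition variable rather than at periodic intervals.

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical class: a background thread drains a queue of strings to a file.
class QueuedFileWriter {
public:
    explicit QueuedFileWriter(const std::string& path)
        : out_(path, std::ios::out | std::ios::app),
          worker_(&QueuedFileWriter::run, this) {}

    ~QueuedFileWriter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        ready_.notify_one();
        worker_.join();  // the worker drains what is left before exiting
    }

    // Called by the producer: takes a short lock, performs no disk I/O.
    void push(std::string record) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(record));
        }
        ready_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) {
                return;  // only reachable once done_ is set
            }
            std::queue<std::string> batch;
            batch.swap(queue_);  // take everything queued so far
            lock.unlock();       // write without holding the lock
            while (!batch.empty()) {
                out_ << batch.front() << '\n';
                batch.pop();
            }
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::ofstream out_;
    std::thread worker_;  // declared last so the other members exist when it starts
};
```

The producer only calls `push()`, so it never waits for the disk. Note that this queue is unbounded: if the producer is consistently faster than the writer, memory use grows.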
jsantander
  • Interesting strategy. Would there be a delay or an increase in memory consumption if the producer inserted into the queue/buffers faster than the writer could drain them? – leag Mar 13 '14 at 18:57
  • Yes, it certainly can happen. This strategy can be used when it is paramount not to introduce delays in the _producer_'s work (e.g. waiting for I/O). An example of this is when you're a receiver. Things should be dimensioned so that the _writer_ is able to complete its work... or you should provide a way of signaling back to slow down when in _overload_. – jsantander Mar 14 '14 at 05:47
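
One possible way to signal overload back to the producer is a bounded queue. The sketch below is only an illustration: the class name `BoundedQueue` is hypothetical, and the capacity is assumed to be greater than zero.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical bounded queue: the capacity caps memory use between
// the producer and the writer. Assumes capacity > 0.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking: returns false when full, so the producer can react
    // to overload (drop, aggregate, or slow down).
    bool try_push(std::string item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) {
            return false;
        }
        items_.push_back(std::move(item));
        not_empty_.notify_one();
        return true;
    }

    // Blocking: the producer waits until the writer frees a slot.
    void push(std::string item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    // Called by the writer: blocks until an item is available.
    std::string pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        std::string item = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;
    std::deque<std::string> items_;
};
```

With `push()` the producer is delayed when the writer falls behind; with `try_push()` it keeps running and decides for itself what to do with data that does not fit.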