
Info: What is the best way to store a position in a txt file, close the file, and later open it at the same position using C++?

I have a large text file that I need to parse in chunks and feed into some system. As of now, I open the file with an `ifstream` and call `getline` until I find the data I need (let's say the data is at position {x}). After this I close the file, process the data, and then I need to continue feeding data from the big file. So I open the file again and call `getline` until I get to position {x+d} this time (d is the offset of the data I read)...

Instead of going through the file once, it is easy to see that I read (1d + 2d + ... + (N-1)d + Nd) ~ d*N^2 lines in total. Now I want to save the position in the file after d, close the file, and then instantly open the file at the same position. What can be used for this?

Sadikov
  • You would use the same thing you would use to store the position in a very small file. The size of the file makes no material difference, whatsoever. – Sam Varshavchik Jul 04 '18 at 15:45
  • @SamVarshavchik: That's not entirely true. For large files (especially once over 2GB) you need to be more careful with the data type you use to store the position. – Ben Voigt Jul 04 '18 at 15:47
  • Looks like that you have the [tag:seekg] tag? – user202729 Jul 04 '18 at 15:47
  • [Possible d](https://stackoverflow.com/questions/10669673/saving-off-the-current-read-position-so-i-can-seek-to-it-later)... that's C. – user202729 Jul 04 '18 at 15:48
  • *I load the file in the ifstream and then getlines* -- As soon as you said this about "getlines", inefficiency rears its head. Read the answer given by @BenVoigt – PaulMcKenzie Jul 04 '18 at 15:54
  • I know, and this was something I could live with until today. I wasn't aware that the files in question would grow in size like so. – Sadikov Jul 04 '18 at 16:00
  • @Sadikov -- First, does the file in question have a fixed line length? If so, you don't need `getline`, as you can simply use the offset to calculate which line / column you're on in the file. – PaulMcKenzie Jul 04 '18 at 16:05
  • @SamVarshavchik I don't think it is really useful to write efficient code and optimizations when processing a very small file. But I edited the question, thanks for the input. – Sadikov Jul 04 '18 at 16:07
  • @PaulMcKenzie The line length is not fixed, and neither is the d-offset mentioned in the question. Both of them vary through the file. – Sadikov Jul 04 '18 at 16:09
  • There's nothing in my comment that suggested that the way to restore the position in a small file is to read it again. I repeat: you would use the same thing you would use to store the position in a very small file. You already have a `seekg` tag on your question, which indicates that you are familiar with how to save and restore file positions, no matter how big or small the files are. So what part of "use seekg" are you unclear about? – Sam Varshavchik Jul 04 '18 at 16:23
  • @SamVarshavchik: `seekg()` will result in the library reading from the front of the file again, if any translations are in effect. (If no translations are in effect then it is a QoI issue, nothing actually requires the standard library to special case the no-translation setup to be fast; a brain-dead implementation could use the generic code path that reads from the beginning for all cases) – Ben Voigt Jul 04 '18 at 16:35

1 Answer


You can't do this with newline translation enabled (what the Standard calls "text mode"), because seeking back to the position requires the standard library to scan through the entire front of the file to find N characters-not-double-counting-newlines. Translations of variable length encodings (e.g. between UTF-8 and UCS) cause a similar problem.

The solution is to turn off newline translation (what the Standard calls "binary mode") and any other translations that involve variable-length encodings, and handle these yourself. With all translations turned off, the "file position" is the number directly used by the OS to perform file I/O, and therefore has the potential to be very efficient (whether it actually is efficient depends on the standard library implementation details).
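A minimal sketch of this approach, assuming the file is opened with `std::ios::binary` and that line endings are handled by hand. The helper name `read_chunk` and its parameters are hypothetical, chosen only for illustration. The position is kept as a `std::streamoff`, which is normally wide enough for files over 2GB, though that depends on the implementation.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: reads up to max_lines lines starting at byte offset
// `start` and returns the byte offset at which the next call should resume.
// Returns -1 if the file cannot be opened.
std::streamoff read_chunk(const std::string& path, std::streamoff start,
                          std::size_t max_lines,
                          std::vector<std::string>& lines) {
    // Binary mode: no newline translation, so positions are plain byte offsets.
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;
    in.seekg(start, std::ios::beg);

    std::string line;
    while (lines.size() < max_lines && std::getline(in, line)) {
        // With translation off, a "\r\n" ending leaves a trailing '\r'.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }

    // tellg() reports failure once the end of file has been hit, so in that
    // case reset the stream state and take the end of the file as the position.
    if (in.eof()) {
        in.clear();
        in.seekg(0, std::ios::end);
    }
    return static_cast<std::streamoff>(in.tellg());
}
```

A caller would keep the returned offset between calls (in a variable, or written out as a number) and pass it back as `start` the next time the file is opened, so each chunk begins where the previous one stopped.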

Ben Voigt
  • Can you elaborate a bit more on the number used by the OS to perform file I/O? How to access this number? If I turn off the standard text mode, will I still be able to use getline function? – Sadikov Jul 04 '18 at 15:57
  • @Sadikov: If you turn off newline conversion, nothing stops you from calling `getline()`, but depending on the software that generated your file, line endings might be \r\n, \r, or \n, instead of just \n, which will affect the parameters to getline. The number used by the OS to perform file I/O is the file position that the standard library passes to the OS file I/O functions, for example on Windows that would be the `Offset` and `OffsetHigh` members of an `OVERLAPPED` structure. When translations are turned off, the logical position used by `tell` and `seek` is the same as the OS position. – Ben Voigt Jul 04 '18 at 16:19
  • So to recap, if the translation is on, the standard library also goes through the file and counts the N-offset? If the translation is off, it jumps to the position immediately. We need to read the file in binary mode like this: `std::ifstream ifs("foo.txt", std::ios::binary);` Thanks Ben! – Sadikov Jul 04 '18 at 16:50
  • @Sadikov: That is the expectation, although it is possible to have a brain-dead library that doesn't implement the "jump immediately" path and just counts from the beginning whether translation is on or off. Simply measuring how long `seekg(large number)` + a small fixed-length read takes should let you easily test whether you're hitting the fast path. – Ben Voigt Jul 04 '18 at 16:59
  • Do you have a reference for this claim? I was under the impression that if you call seek using a position returned by tell, you will restore the file position precisely (unless your multibyte encoding is contextual), regardless of translation. – rici Jul 05 '18 at 15:00
  • @rici: Yes, you will restore the file position precisely, and the library will have to reprocess the entire file from the beginning to find that precise position (a "Schlemiel the Painter's algorithm"). The question excludes doing that, by saying "instantly" and describing in detail problems with quadratic performance. – Ben Voigt Jul 05 '18 at 18:22
  • @Ben: I really don't believe that to be the case, which is why I'd like to see a reference rather than your assertion. For text files, you can only seekg to a position returned by a previous seekpos or seekoff, and that will be a value appropriate for resetting the file position to its former value, without having to call on Schlemiel's services. – rici Jul 05 '18 at 23:51
  • C++ eventually defers to the C library here, so it's worth quoting section 7.21.9.4 para. 2 of the C standard, which describes the return value of ftell: "For a text stream, its file position indicator contains unspecified information, usable by the `fseek` function for returning the file position indicator for the stream to its position at the time of the `ftell` call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read." – rici Jul 05 '18 at 23:56
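The measurement suggested in the comments above (time a `seekg` to a large offset followed by a small fixed-length read) could be sketched roughly as follows. The helper name `time_seek_and_read` is hypothetical, and a single run is noisy because of OS caching, so this is only a starting point for comparing text mode against binary mode on a given standard library, not a proper benchmark.

```cpp
#include <chrono>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: returns the seconds taken to seek to `offset` and read
// 16 bytes, or -1.0 if the file cannot be opened or the read fails.
double time_seek_and_read(const std::string& path, std::streamoff offset,
                          std::ios::openmode mode) {
    std::ifstream in(path, mode);
    if (!in) return -1.0;

    char buf[16];
    auto t0 = std::chrono::steady_clock::now();
    in.seekg(offset, std::ios::beg);   // the operation under test
    in.read(buf, sizeof buf);          // small fixed-length read
    auto t1 = std::chrono::steady_clock::now();

    if (!in) return -1.0;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling it once with `std::ios::in | std::ios::binary` and once with `std::ios::in`, using an offset near the end of a large file, should show whether the seek time grows with the offset on the platform in question.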