When mmap()ing a text file, like so

struct stat sb;
int fd = open("file.txt", O_RDWR);
fstat(fd, &sb);
char *text = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

the file contents are mapped directly into memory, and text will not contain a NUL terminator, so operating on it with normal string functions would not be safe. On Linux (at least) the remaining bytes of the final partial page are zero-filled, so effectively you get a NUL terminator in all cases where the file size isn't a multiple of the page size.

But relying on that feels dirty and other mmap() implementations (e.g., in FreeBSD, I think) don't zero-fill partial pages. Mapping files that are multiples of the page size will also lack the NUL terminator.

Are there reasonable ways to work around this or to add the NUL terminator?

Things I've considered

  1. Using strn*() functions exclusively and tracking distance to the end of the buffer.
    • Pros: No need for NUL terminator
    • Cons: Extra tracking required to know distance to end of file when parsing text; some str*() functions don't have a strn*() counterpart, like strstr.
  2. As another answer suggested, make an anonymous mapping at a fixed address following the mapping of your text file.
    • Pros: Can use regular C str*() functions
    • Cons: Using MAP_FIXED is not thread-safe; Seems like an awful hack anyway
  3. mmap() an extra byte and make the map writeable, and write the NUL terminator. The Open Group's mmap man page says you can make a mapping larger than your object's size, but that accessing data outside of the actual mapped object will generate a SIGBUS.
    • Pros: Can use regular C str*() functions
    • Cons: Requires handling (ignoring?) SIGBUS, which could potentially mean something else happened. I'm not actually sure writing the NUL terminator will work?
  4. Expand files with sizes that are multiples of page size with ftruncate() by one byte.
    • Pros: Can use regular C str*() functions; ftruncate() will write a NUL byte to the newly allocated area for you
    • Cons: Means we have to write to the files, which may not be possible or acceptable in all cases; Doesn't solve problem for mmap() implementations that don't zero-fill partial pages
  5. Just read() the file into some malloc()'d memory and forget about mmap()
    • Pros: Avoids all of these solutions; Easy to malloc() an extra byte for NUL
    • Cons: Different performance characteristics than mmap()
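
For reference, option 5 could look roughly like the sketch below. The helper name slurp() is made up for illustration, and the sketch assumes a regular file whose size fstat() reports correctly; it does not retry on EINTR.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read a whole file into a malloc()ed buffer and
 * append a NUL terminator. Returns NULL on failure. The number of bytes
 * read (not counting the terminator) is stored in *len_out if non-NULL. */
static char *slurp(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)sb.st_size;
    char *buf = malloc(len + 1);        /* one extra byte for the NUL */
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {                    /* read error: give up */
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)                     /* file is shorter than fstat() said */
            break;
        got += (size_t)n;
    }
    close(fd);

    buf[got] = '\0';
    if (len_out != NULL)
        *len_out = got;
    return buf;
}
```

The caller owns the buffer and releases it with free(). Embedded NUL bytes in the file would still cut str*() processing short, so the returned length is worth keeping around.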

Solution #1 seems generally the best, and just requires some extra work on the part of the functions reading the text.
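
To illustrate the kind of extra work option 1 implies for the missing strstr counterpart, here is a minimal sketch of a length-bounded substring search built only on memchr() and memcmp(). The name bounded_find() is hypothetical and not part of any library (glibc and the BSDs offer memmem(), which does a similar job).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find needle (needle_len bytes) within the first
 * hay_len bytes of hay without ever reading past hay + hay_len.
 * Returns a pointer to the first match, or NULL if there is none. */
static const char *bounded_find(const char *hay, size_t hay_len,
                                const char *needle, size_t needle_len)
{
    if (needle_len == 0)
        return hay;                     /* empty needle matches at the start */
    if (needle_len > hay_len)
        return NULL;

    const char *p = hay;
    const char *last = hay + (hay_len - needle_len); /* last possible start */

    while (p <= last) {
        /* Jump to the next occurrence of the needle's first byte. */
        p = memchr(p, needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, needle_len) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

With the mapping from the question, a call would look like bounded_find(text, sb.st_size, "needle", 6), and the remaining length after a match at p is (text + sb.st_size) - p.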

Are there better alternatives, or are these the best solutions? Are there aspects of these solutions I haven't considered that make them more or less attractive?

mattst88
  • My vote is on #5. [KISS](http://en.wikipedia.org/wiki/KISS_principle). – Jonathon Reinhart Nov 24 '14 at 02:16
  • Think about the cons of #5. mmap requires reading the disk, and so does read. Why is that a con? BTW +1 @Johnathon Reinhart – jim mcnamara Nov 24 '14 at 02:40
  • String detail: In C, by definition, a string _always_ has a terminating `'\0'`, else it is not a string. A `char` array might not have a `'\0'`. Does not change your problem much other than nomenclature. Typical text files do not have _any_ strings, but lines of text. – chux - Reinstate Monica Nov 24 '14 at 03:10
  • Note that relying on the "Linux behavior" of getting zeros for the rest of the page *does not work* since the file size might be an exact multiple of the page size. In this case you'll either get a faulting read when you run off the end of the mapping, or you'll start reading whatever's at the beginning of another mapping that happens to be adjacent. – R.. GitHub STOP HELPING ICE Nov 24 '14 at 05:31
  • However you could work around that problem by mapping a whole page past the end of the file, then mapping an anonymous page over top of the last (past-the-end) page of the file mapping using `MAP_FIXED`. – R.. GitHub STOP HELPING ICE Nov 24 '14 at 05:32
  • @R..: Thanks, but I mentioned both of those in my original question. – mattst88 Nov 24 '14 at 06:49
  • A file can legitimately contain 0-bytes, so `str*` functions (including the `strn*` functions) cannot handle the contents anyway. – mafso Nov 24 '14 at 12:05

1 Answer

I would suggest undergoing a paradigm shift here.

You're looking at the entire universe as consisting of '\0'-delimited strings that define your text. Instead of looking at the world this way, why not treat text as a sequence delimited by a beginning and an ending iterator?

You mmap your file, then set the beginning iterator, call it beg_iter, to the start of the mmap-ed segment, and the ending iterator, call it end_iter, to the first byte following the last byte of the mmap-ed segment, i.e. beg_iter+number_of_pages*pagesize. Then repeat the following until either

A) end_iter equals beg_iter, or

B) end_iter[-1] is not a null character:

C) decrement end_iter, and go back to step A.

When you're done, you have a pair of iterators, the beginning iterator value, and the ending iterator value that define your text string.
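
In plain C, the trimming loop could be sketched as below, checking the byte just before the end pointer on each step. The struct and function names are made up for illustration. The sketch assumes mapped_len is the file size rounded up to a whole number of pages and that the partial last page is zero-filled; a file whose real content ends in NUL bytes would have those trimmed as well.

```c
#include <stddef.h>

/* Hypothetical type: a half-open range [beg, end) of text. */
struct text_range {
    const char *beg;
    const char *end;
};

/* Walk the end pointer backwards over trailing NUL padding.
 * map points at the start of the mapping, mapped_len is its
 * page-rounded length in bytes. */
static struct text_range trim_trailing_nuls(const char *map, size_t mapped_len)
{
    struct text_range r = { map, map + mapped_len };

    /* Stop when the range is empty or the last byte is real data. */
    while (r.end != r.beg && r.end[-1] == '\0')
        r.end--;

    return r;
}
```

The length of the text is then r.end - r.beg, and any parser can stop when its cursor reaches r.end instead of looking for a terminator.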

Of course, in this case, your iterators are plain char *, but that's really not very important. What is important is that now you find yourself with a rich set of algorithms and templates from the C++ standard library at your disposal, that let you implement many complicated operations, both mutating (like std::transform) and non-mutating (like std::find).

Null-terminated strings are really a holdover from the days of plain C. With C++, null-terminated strings are somewhat archaic, and mundane. Modern C++ code should use std::string objects, and sequences defined by beginning and ending iterators.

One small footnote: instead of figuring out how much NUL padding you ended up mmap-ing, you might find it easier to fstat() the file and get the file's exact length, in bytes, before mmap-ing it. Then you'll know exactly how much got mmapped, and you don't have to reverse-engineer it by looking at the padding.
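
A minimal C sketch of that footnote, under the assumption that the file is a non-empty regular file: take the length from fstat() and store an end pointer right away. The names mapped_text and map_text() are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: the mapped text as a half-open range [beg, end). */
struct mapped_text {
    const char *beg;
    const char *end;
};

/* Map path read-only and fill in *out. Returns 0 on success, -1 on failure. */
static int map_text(const char *path, struct mapped_text *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size <= 0) {
        close(fd);
        return -1;      /* mmap() rejects a zero length, so empty files fail here */
    }

    void *p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);          /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    out->beg = p;
    out->end = out->beg + sb.st_size;   /* one past the last byte of the file */
    return 0;
}
```

The bytes left from any cursor p are then out->end - p, and the mapping is released with munmap((void *)out->beg, out->end - out->beg).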

Sam Varshavchik
  • Thanks for the answer. I'm really looking for a solution usable in C with C `str*()` functions, but essentially it sounds like what you're suggesting is akin to solution #1. About `fstat()`: definitely -- I'm using it in my example in fact. – mattst88 Nov 24 '14 at 04:17
  • Thinking more about your answer, I think you're definitely on to something about just storing a pointer to the end of the text. That allows you to simply calculate how much you have left with a subtraction. Have an upvote! – mattst88 Nov 24 '14 at 06:47
  • What if you have a huge string mmapped say 16k and you want it to become a valid null terminated string without doing memcpy? There are plenty of reasons one would like to do this – ericcurtin Aug 23 '21 at 19:13
  • This cannot be done, @ericcurtin. The trailing `'\0'` must exist somewhere. There are no mechanisms anywhere that result in the trailing `'\0'` popping into existence, somehow sandwiching itself between two consecutive bytes in memory. Either it needs to be copied into a different null-terminated buffer, or whatever wants to use this string needs to be recoded to define the string in a manner that does not require a null-terminated byte. `std::string_view` comes to mind, and it can do many of the same things that `std::string` can. – Sam Varshavchik Aug 23 '21 at 23:37