
I recently learned (initially from here) how to use mmap to quickly read a file in C, as in this example code:

// main.c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define INPUT_FILE "test.txt"

int main(int argc, char* argv[]) {
  struct stat ss;

  if (stat(INPUT_FILE, &ss)) {
    fprintf(stderr, "stat err: %d (%s)\n", errno, strerror(errno));
    return -1;
  }

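  {
    int fd = open(INPUT_FILE, O_RDONLY);
    if (fd < 0) {
      fprintf(stderr, "open err: %d (%s)\n", errno, strerror(errno));
      return -1;
    }
    char* mapped = mmap(NULL, ss.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) {
      fprintf(stderr, "mmap err: %d (%s)\n", errno, strerror(errno));
      close(fd);
      return -1;
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    fprintf(stdout, "%s\n", mapped);  // "%s" reads until it finds a '\0'
    munmap(mapped, ss.st_size);
  }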

  return 0;
}

My understanding is that this use of mmap returns a pointer to a heap-allocated region of length bytes (here, ss.st_size).
I've tested this on plain text files that are not explicitly null-terminated, e.g. a file containing the 13-byte ASCII string "hello, world!":

$ cat ./test.txt
hello, world!$
$ stat ./test.txt
  File: ./test.txt
  Size: 13              Blocks: 8          IO Block: 4096   regular file
Device: 810h/2064d      Inode: 52441       Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/   user)   Gid: ( 1000/   user)
Access: 2022-10-25 20:30:52.563772200 -0700
Modify: 2022-10-25 20:30:45.623772200 -0700
Change: 2022-10-25 20:30:45.623772200 -0700
 Birth: -

When I run my compiled code, it never segfaults or spews garbage -- the classic symptoms of printing an unterminated C-string.

When I run my executable through gdb, mapped[13] is always '\0'.

Is this undefined behavior?

I can't see how the bytes that are memory-mapped from the input file could be reliably null-terminated.
For a 13-byte string, the "equivalent" that I would have normally done with malloc and read would be to allocate a 14-byte array, read from file to memory, then explicitly set byte 13 (0-based) to '\0'.
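For comparison, here is a rough sketch of the malloc-and-read approach I mean. The helper name read_file_terminated and its len_out parameter are made up for illustration, and error handling is kept minimal (for example, it does not retry on EINTR):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: read an entire file into a freshly malloc'd,
// explicitly null-terminated buffer. Returns NULL on failure; the caller
// frees the result.
char* read_file_terminated(const char* path, size_t* len_out) {
  assert(path != NULL);

  int fd = open(path, O_RDONLY);
  if (fd < 0) return NULL;

  struct stat ss;
  if (fstat(fd, &ss) != 0) { close(fd); return NULL; }

  size_t size = (size_t)ss.st_size;
  char* buf = malloc(size + 1);  // one extra byte for the terminator
  if (buf == NULL) { close(fd); return NULL; }

  size_t total = 0;
  while (total < size) {  // read() may return fewer bytes than requested
    ssize_t n = read(fd, buf + total, size - total);
    if (n < 0) { free(buf); close(fd); return NULL; }
    if (n == 0) break;  // file turned out shorter than fstat reported
    total += (size_t)n;
  }
  close(fd);

  buf[total] = '\0';  // the explicit termination step
  if (len_out != NULL) *len_out = total;
  return buf;
}
```

With the 13-byte test file, this allocates 14 bytes and writes the terminator at index 13 itself, rather than relying on whatever happens to follow the data.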

StoneThrow
    Typical OS behavior is to clear memory before providing it to a new process (for security reasons). And "clear" usually means "set all bytes to 0", but it could just as easily be "set all bytes to 0xAA". So yes, undefined behavior as far as the C standard is concerned. – user3386109 Oct 26 '22 at 04:19

1 Answer


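mmap doesn't go through malloc, so the memory is not heap-allocated. It returns a pointer to whole pages set up by the kernel. Pages are usually 4096 bytes each, so your 13-byte file occupies a single page, and the kernel fills the rest of that last page with zeroes, not with garbage (POSIX requires the partial page at the end of the mapped object to be zero-filled). That is why mapped[13] is '\0'. Don't treat it as a terminator, though: if the file size is an exact multiple of the page size there are no extra bytes, and "%s" will read past the end of the mapping, which can crash.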
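As a rough illustration of the page rounding, the sketch below computes how many padding bytes follow the file data in its last page and checks that they are all zero. The helper names tail_padding and mapped_tail_is_zero are invented for this example, and it assumes a regular file on a POSIX system:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: number of padding bytes between the end of a file
// of file_size bytes and the end of its last mapped page.
size_t tail_padding(size_t file_size, size_t page_size) {
  assert(page_size > 0);
  size_t rem = file_size % page_size;
  return rem == 0 ? 0 : page_size - rem;
}

// Hypothetical helper: map path read-only and report whether every padding
// byte in the last page is zero. Returns 1 if so, 0 if not, -1 on error
// (including an empty file, which cannot be mapped).
int mapped_tail_is_zero(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;

  struct stat ss;
  if (fstat(fd, &ss) != 0 || ss.st_size == 0) { close(fd); return -1; }

  size_t size = (size_t)ss.st_size;
  char* mapped = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  if (mapped == MAP_FAILED) return -1;

  // Only the remainder of the last page is inspected; nothing beyond it.
  size_t pad = tail_padding(size, (size_t)sysconf(_SC_PAGESIZE));
  int all_zero = 1;
  for (size_t i = 0; i < pad; i++) {
    if (mapped[size + i] != '\0') { all_zero = 0; break; }
  }

  munmap(mapped, size);
  return all_zero;
}
```

When tail_padding returns 0 there is no zero byte to stop "%s", so it is safer to print with an explicit length, e.g. printf("%.*s\n", (int)ss.st_size, mapped) or fwrite(mapped, 1, ss.st_size, stdout), or to copy the data into a buffer you terminate yourself.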

user253751