3

Here is my problem: I want to map the file "filename.txt", which basically consists of two pairs of strings per line:

"string1 string2
 string3 string4
 string5 string6..."

and then I wanted to separate the different strings using strtok.

So I map the file like this:

// open file
if ((fdsrc = open("filename.txt", O_RDONLY)) < 0) {
        fprintf(stderr, "src open error");
        exit(1);
    }

// get the size of the file
if (fstat(fdsrc, &statbuf) < 0) {
    fprintf(stderr, "fstat error");
    exit(1);
}

// mmap the file
if ((src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fdsrc, 0)) == (caddr_t) -1) {
    fprintf(stderr, "mmap src");
    exit(1);
}

When I run the line

printf("src: %s \n", src);

it prints the content of the file correctly!

But when I try to separate the words

char* token;
token = strtok(src, " \n");
while (token != NULL) {
    token = strtok(NULL, " \n");
}

the output is Segmentation Fault. Why can't I use the StrTok then?

Inês
  • 587
  • 1
  • 4
  • 10
  • 3
    strtok is *DESTRUCTIVE* - it writes to the string as it tokenizes it. SOLUTIONS: 1) Copy each string to a local buffer before trying "strtok()", or 2) Open the mapping Read/Write (instead of read-only). – paulsm4 Nov 24 '15 at 21:31

3 Answers3

6

strtok() modifies the string it operates on. Assuming you don't want to change the file contents, you need to change your mmap() options.

You are opening and mapping the file read-only:

if ((fdsrc = open("filename.txt", O_RDONLY)) < 0) {
...
if ((src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fdsrc, 0)) == (caddr_t) -1) {
...

Map the file with PROT_READ|PROT_WRITE and MAP_PRIVATE:

src = mmap(0, statbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fdsrc, 0);
if (src == (caddr_t) -1) {

You might need to open the file with O_RDWR instead of O_RDONLY

BEWARE THOUGH:

If the file size exactly matches a multiple of the page size used for the mapping, the file will not be a NUL-terminated string and you will likely get a SIGSEGV when strtok() attempts to read past the end of the mapping.

In that case, you can mmap() a zero-filled page immediately following the file's mapping.

Andrew Henle
  • 32,625
  • 3
  • 24
  • 56
  • 1
    This looks like bad advice. The file will be modified, probably not what the OP intends. – chqrlie Nov 24 '15 at 21:39
  • 1
    @chqrlie Wrong. Per http://man7.org/linux/man-pages/man2/mmap.2.html: **MAP_PRIVATE** *Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, **and are not carried through to the underlying file**. ...* – Andrew Henle Nov 24 '15 at 21:41
  • 2
    My bad, you are correct. But I insist it is still a contorted and way to parse a text file. Specifically, `strtok()` should not be used because the buffer may not be `'\0'` terminated before the end of the mapping, hence invoking undefined behavior, for instance if the file length is an exact multiple of the page size. – chqrlie Nov 24 '15 at 21:45
  • @chqrlie - Agreed, it's contorted. But that's how the OP seems to want to solve his problem. I'll add the part about the size of the file exactly matching the page size. – Andrew Henle Nov 24 '15 at 22:19
  • 1
    What you are suggesting is not guaranteed to work: The BSD man page for `mmap` says: *If offset or len is not a multiple of the pagesize, the mapped region **may** extend past the specified range. Any extension beyond the end of the mapped object will be zero-filled.* So it may work or not work. On systems that have fine grained control for memory protection, the mmapped area may be byte based. – chqrlie Nov 24 '15 at 22:32
  • @chqrlie Thanks. There are definitely platform-dependent aspects to modifying `mmap()`'d files with string-based functions such as `strtok()`. For example, try determining the page size on Solaris, expecially on SPARC hardware, where there are 5 possible page sizes from 8kB up to 2GB. – Andrew Henle Nov 24 '15 at 22:43
  • `sysconf()` should be used to get the pagesize. It is needed if you pass a non-zero offset, which must be a multiple of pagesize. `mmap` to parse input files is not for the faint of heart and definitely not advisable for beginners. It is not a generic solution as it cannot be used for pipes or console input. It is usually overkill as a classic scanner is fast enough in most cases. – chqrlie Nov 24 '15 at 22:49
  • @chqrlie - The problem with `sysconf()` is that it returns **a** pagesize. Solaris and Linux both now support different page sizes at the same time, even in the same process address space. See the **MAP_HUGE_2MB** and **MAP_HUGE_1GB** flags: http://man7.org/linux/man-pages/man2/mmap.2.html So "the" page size isn't sufficient. – Andrew Henle Nov 24 '15 at 23:12
  • 1
    Good point. *not for the faint of heart* was an understatement. – chqrlie Nov 24 '15 at 23:24
3

Your file is mapped read only with PROT_READ. But strtok() modifies its first argument, src, and gets a segmentation fault. You would need to either make a writable copy before you use strtok, or switch to a mechanism that only reads its input. In my opinion, changing the protection of that buffer to PROT_RW seems odd, especially if you intend to use the unmodified contents of that file elsewhere in your program.

For an alternative, I'd recommend using strstr() (or an alternative implementation that doesn't require nul-byte termination) to locate the end-of-line substring, and then starting your next search where you found the last occurrence, plus the length of your substring. See the note below on nul-byte termination. A simplified example:

  const char *delim = "\n";                                                                                                                          
  const char *start = src;                                                                                                                           
  const char *end = NULL;                                                                                                                            
  const int srclen = statbuf.st_size;                                                                                                                
  const int delim_length = strlen(delim);                                                                                                            

  while (start && start < (src + srclen)) {                                                                                                          
    end = strstr(start, delim);                                                                                                                      

    if (NULL == end) {  
      // use of %.* to print at most X chars from string.                                                                                                                             
      printf("Token: %.*s\n", (int) (src + srclen - start), start);                                                                                 
      break;                                                                                                                                         
    } else {                                                                                                                                         
      printf("Token: %.*s\n", (int) (end - start), start);                                                                                           
      start = end + delim_length;                                                                                                                    
    }                                                                                                                                                
  }                     

mmap region may not end in a nul byte (as suggested by comment)

strstr() works on null-terminated strings. Your mmapped region may not end with a null byte. It is likely that the kernel erases the remainder of the last mmapped memory page (past the end-of-file) with \0s to avoid data leaks between processes, but if your file length is exactly a multiple of the page-size, you'll have trouble with using strstr() -- there won't be a nul byte to get your back.

You could roll out your own little string finder strnstr(). Or force another null page to be stamped at the end.

init_js
  • 4,143
  • 2
  • 23
  • 53
  • 1
    You should not use `strstr` because the `mmap`ped buffer is not necessarily `'\0'` terminated. – chqrlie Nov 24 '15 at 21:43
  • thanks @chqrlie. you are correct. this sent me in a little test spiral. `read()` calls are allowed when `seek()`ing past EOF (returns 0B), so I wondered what asking mmap a region bigger than the file would yield. What I ended up finding (on linux at least) was that the last buffer page containing file data will be filled with zeroes if the file size isn't a multiple of the page size. Reading a 10B file into an 8MB mmap buffer (w/ 4K page) for instance will give you zeroes for indices 10 to 4095, but if you read anything between index 4096 and 8MB, you'll get a bus error (SIGBUS). – init_js Nov 25 '15 at 03:58
2

strtok() modifies the char array to which you pass a pointer.

You mmap the file in read only mode, so you get a violation when strtok tried to modify the memory.

It would be a bad idea to mmap the file in read+write mode, the file would be modified and probably corrupted.

strtok is inappropriate for you purpose, write your own matching function that does not modify its argument array and returns offsets and lengths.

Also be aware that the mmapped memory should not be accessed beyond the size of the file and is not necessarily '\0' terminated, hence you should not use string functions to search into it (strchr, strstr, strlen...) nor copy from it (strcpy).

chqrlie
  • 131,814
  • 10
  • 121
  • 189