I am trying to search for a regular expression in a large memory-mapped file using the regexec() function. I discovered that the program crashes when the file size is a multiple of the page size.

Is there a regexec() function that takes the length of the string as an additional argument?

Or:

How can I search for a regex in a memory-mapped file?

Here is a minimal example that ALWAYS crashes (if I run fewer than 3 threads, the program doesn't crash):

ls -la ttt.txt 
-rwx------ 1 bob bob 409600 Jun 14 18:16 ttt.txt

gcc -Wall mal.c -o mal -lpthread -g && ./mal
[1]    11364 segmentation fault (core dumped)  ./mal

And the program is:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <regex.h>

void* f(void*arg) {
  int size = 409600;
  int fd = open("ttt.txt", O_RDONLY);
  char* text = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);

  fd = open("/dev/zero", O_RDONLY);
  char* end = mmap(text + size, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
  close(fd);

  assert(text+size == end);

  regex_t myre;
  regcomp(&myre, "XXXXX", REG_EXTENDED);
  regexec(&myre, text, 0, NULL, 0);
  regfree(&myre);
  return NULL;
}

int main(int argc, char* argv[]) {
  int n = 10;
  int i;
  pthread_t t[n];
  for (i = 0; i < n; ++i) {
    pthread_create(&t[i], NULL, f, NULL);  /* index with i; t[n] is out of bounds */
  }
  for (i = 0; i < n; ++i) {
    pthread_join(t[i], NULL);
  }
  return 0;
}

P.S. This is the output from gdb:

gdb ./mal 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/bob/prog/c/mal...done.
(gdb) r

Starting program: /home/srdjan/prog/c/mal 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff77ff700 (LWP 11817)]
[New Thread 0x7ffff6ffe700 (LWP 11818)]
[New Thread 0x7ffff6799700 (LWP 11819)]
[New Thread 0x7fffeffff700 (LWP 11820)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6799700 (LWP 11819)]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
72  ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
#1  0x00007ffff78df254 in __regexec (preg=0x7ffff6798e80, string=0x7fffef79b000 'a' <repeats 200 times>..., nmatch=<optimized out>, 
pmatch=0x0, eflags=<optimized out>) at regexec.c:245
#2  0x00000000004008e6 in f (arg=0x0) at mal.c:24
#3  0x00007ffff7bc4e9a in start_thread (arg=0x7ffff6799700) at pthread_create.c:308
#4  0x00007ffff78f24bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()
(gdb) 
Srđan Stipić

2 Answers

Celada correctly identifies the problem - the file data does not necessarily include a null terminator.

You could fix the problem by mapping a page of zeroes immediately after the file:

int fd;
char *text;

fd = open("ttt.txt", O_RDONLY);
text = mmap(NULL, 409600, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);

fd = open("/dev/zero", O_RDONLY);
mmap(text + 409600, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
close(fd);

(Note that you can close fd immediately after the mmap(), because mmap() adds a reference to the open file description.)

You should of course add error-checking to the above. Also, many UNIX systems support a MAP_ANONYMOUS flag which you can use instead of opening /dev/zero (but this is not in POSIX).

caf
  • I tried your solution and it worked for less than 3 threads. I changed the code to use 10 threads and now it almost always breaks. I added one more assert to be sure that the null page is allocated just after the mmaped file. – Srđan Stipić Jun 15 '12 at 14:37
  • @user903597: Presumably with multiple threads it is breaking on the assertion. You will need to lock a mutex around the two `mmap()`s, to ensure that the `mmap()`s from two threads don't interleave. – caf Jun 16 '12 at 00:34
  • Won't this approach fail if the page directly after the first `mmap()`'d region happens to already be mapped to something? – Stuntddude May 08 '20 at 04:37
  • The correct approach is to reserve enough address space in the first `mmap` call to put the blank page mapped using `MAP_FIXED` within. – Timothy Baldwin Jul 25 '21 at 19:12
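
The reservation idea from the last comment can be sketched as follows. This is a minimal sketch, not code from any of the posts: the helper name `map_with_terminator` is hypothetical, and it assumes a system that provides `MAP_ANONYMOUS` (Linux and the BSDs do). It first reserves the file's pages plus spare room as zero-filled anonymous memory, then overlays the file with `MAP_FIXED`. Because `MAP_FIXED` only ever replaces pages the thread has just reserved itself, no mutex should be needed and no unrelated mapping can be overwritten.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc; harmless elsewhere */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the original posts): maps `path` read-only so
 * that the file contents are always followed by at least one zero byte.
 * Stores the total mapping length in *maplen (pass it to munmap()).
 * Returns NULL on failure. */
char *map_with_terminator(const char *path, size_t *maplen)
{
    assert(path != NULL && maplen != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (size_t)st.st_size;
    /* Whole pages covering the file plus at least one spare byte:
     * a 409600-byte file with 4096-byte pages gets 101 pages. */
    size_t total = (size / page + 1) * page;

    /* Step 1: reserve the whole range as zero-filled anonymous memory. */
    char *base = mmap(NULL, total, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Step 2: overlay the file on the start of the reservation.
     * MAP_FIXED replaces only pages reserved in step 1. */
    if (size > 0 &&
        mmap(base, size, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0)
            == MAP_FAILED) {
        munmap(base, total);
        close(fd);
        return NULL;
    }

    close(fd); /* the mapping keeps its own reference to the file */
    *maplen = total;
    return base;
}
```

In the thread function from the question, the two `mmap()` calls would then become a single `text = map_with_terminator("ttt.txt", &len);`, followed by `regexec(&myre, text, 0, NULL, 0)` and finally `munmap(text, len)`.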

The problem is that regexec() is used to match a null-terminated string against the precompiled pattern buffer, but an mmaped file is not necessarily (indeed not usually) null-terminated. Thus, it is looking beyond the end of the file to find a NUL character (0 byte).

You would need a version of regexec() that takes a buffer and a size argument instead of a null-terminated string, but there doesn't appear to be one.
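
POSIX itself defines no such variant. As a hedged aside, some implementations offer a non-standard `REG_STARTEND` flag for regexec(): it originated in the BSD regex library and is also defined by glibc's `<regex.h>`. With it, the search range is taken from `pmatch[0].rm_so` and `pmatch[0].rm_eo` instead of being found with strlen(). The sketch below assumes that flag is available (it returns -1 otherwise), and the helper name `buffer_matches` is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere inside
 * buf[0..len), 0 if it does not, and -1 on error or if REG_STARTEND is
 * unavailable. The buffer does not need to be null-terminated. */
int buffer_matches(const char *pattern, const char *buf, size_t len)
{
    assert(pattern != NULL && buf != NULL);
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* With REG_STARTEND, pm[0] is an input: it delimits the bytes to search.
     * regoff_t may be narrower than size_t (it is int on glibc), so very
     * large buffers would need to be searched in pieces. */
    pm[0].rm_so = 0;
    pm[0].rm_eo = (regoff_t)len;

    rc = regexec(&re, buf, 1, pm, REG_STARTEND);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
#else
    (void)len;
    return -1; /* non-standard extension not provided by this libc */
#endif
}
```

For the mapped file in the question, the call would look like `buffer_matches("XXXXX", text, size)`, with no extra zero page after the mapping. Whether the flag exists on a given platform should be checked in its regexec() manual page.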

Celada
  • I was aware of the problem of null-terminated strings, and because of that I posted the question. For me it is strange that other people didn't encounter this problem in C. I was trying to find a solution but without any luck. – Srđan Stipić Jun 14 '12 at 18:04