I have a program that reads a dictionary file (one word per line) and loads it into a hash table as fast as possible. Currently I use mmap() to map the whole file, then parse it with a loop that checks every single character; when the character is a '\n', I store the word read so far in the hash table.

My question: is there any way to make this faster, either by using other functions or by improving my code? I've tried fgets and fscanf, but mmap seems to be the fastest.

Here's the gist of my code:

    int fp = open(file, O_RDONLY, S_IRUSR | S_IWUSR);

    struct stat sb;
    if (fstat(fp, &sb) == -1) {
        perror("couldn't get file size.\n");
    }

    char *text = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fp, 0);

    char word[46];
    int temp_index = 0;

    for (int i = 0; i < sb.st_size; i++) {

        if (text[i] == '\n') {

            temp[temp_index] = '\0';
            int index = hash(word);
            continue;
        }
        temp[temp_index] = text[i];
        temp_index ++;
    }

Sample dictionary.txt

home
house
phone
stack
overflow

Ojou Nii
  • If you want speed (AND simplicity), use `getc()` + a simple state machine. BTW: you fail to commit the last word (if a '\n' is absent) – wildplasser Jul 29 '20 at 21:50
  • If `text` is zero-terminated, then you can use [`strtok`](https://en.cppreference.com/w/c/string/byte/strtok) or [`strchr`](https://en.cppreference.com/w/c/string/byte/strchr). However, it probably isn't zero-terminated. In that case, you can use [`memchr`](https://en.cppreference.com/w/c/string/byte/memchr) instead. – Andreas Wenzel Jul 29 '20 at 22:00
  • @wildplasser: That doesn’t sound like a recipe for speed. Is there a reason it would be fast? – Ry- Jul 29 '20 at 22:08
  • Is there some confusion between `temp` and `word` in this code? When does `temp_index` get reset to 0? Also, how’s the performance so far, and how fast would be fast enough? (Programs that are really optimized for this kind of thing, like ripgrep, don’t do it with particularly simple code.) – Ry- Jul 29 '20 at 22:10
  • @Ry : because I/O bus speed is the bottleneck in any case. Stdio does the buffering for you, and the sequential read (ahead) is recognised, without the need for walking through the entire address space. Also: getc() is a macro, so it is (often) inlined. – wildplasser Jul 29 '20 at 22:50
  • The number of (read()) system calls is basically the same, but, in the case of mmap() these are hidden inside blocking pagefaults. – wildplasser Jul 29 '20 at 22:54
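
For reference, the `memchr` suggestion from the comments could look roughly like the sketch below. It scans a buffer that is not zero-terminated (such as the mmap'ed `text`), starts each word from a fresh position so there is no index to reset, and also commits a final word that has no trailing '\n'. The names `parse_words` and `MAX_WORD` are made up for illustration, and the `hash` shown here is a djb2 stand-in, since the question does not show its own `hash()`. Whether this beats the byte-by-byte loop depends on the libc and the data, so it is worth measuring rather than assuming.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: words are at most 45 characters, matching char word[46] in the question. */
#define MAX_WORD 45

/* Stand-in for the question's hash(), which is not shown; djb2 is used only for illustration. */
static unsigned long hash(const char *word)
{
    unsigned long h = 5381;
    for (; *word != '\0'; word++)
        h = h * 33 + (unsigned char)*word;
    return h;
}

/*
 * Hypothetical helper: split text[0..size) on '\n' with memchr and write the
 * hash of each word into hashes[], storing at most max entries.
 * Empty lines and words longer than MAX_WORD are skipped.
 * Returns the number of hashes stored.
 */
static size_t parse_words(const char *text, size_t size,
                          unsigned long *hashes, size_t max)
{
    const char *p = text;
    const char *end = text + size;
    size_t n = 0;

    while (p < end && n < max) {
        /* Find the next newline; NULL means the last word has no '\n'. */
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *stop = nl ? nl : end;
        size_t len = (size_t)(stop - p);

        if (len > 0 && len <= MAX_WORD) {
            char word[MAX_WORD + 1];
            memcpy(word, p, len);   /* copy the word out of the read-only mapping */
            word[len] = '\0';
            hashes[n++] = hash(word);
        }
        /* Continue after the newline, or finish if there was none. */
        p = nl ? nl + 1 : end;
    }
    return n;
}
```

With the mapping from the question, the call would be something like `parse_words(text, sb.st_size, hashes, max)`, with the table insertion done where the hash is stored.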
