
I have a raw text file weighing in at 70GB, with over 1B lines of varying length; there are no columns, just raw text.

I wish to scan it and simply count how many times each word in a predefined set search_words (size ~100) appears. Currently I'm using read_lines_chunked from the readr package, reading chunks of 100K lines and invoking a callback function f which updates a global counter, like so:

library(tidyverse)

# toy stand-in for the 70GB file
write_lines("cat and dog\r\ndog\r\nowl\r\nowl and cat", "test.txt")

search_words <- c("cat", "dog", "owl") # real size is about 100

# one counter slot per search word
counter <- numeric(length(search_words))

# wrap each word in word-boundary anchors so e.g. "cat" doesn't match "category"
regex_word <- function(w) str_c("\\b", w, "\\b")
search_words <- map_chr(search_words, regex_word)

# total matches of the i-th pattern across all lines of a chunk
count_word <- function(i, chunk) sum(str_count(chunk, search_words[i]))

# chunk callback: add this chunk's counts to the global counter
f <- function(x, pos) {
  counter <<- counter + map_int(seq_along(search_words), count_word, x)
}

read_lines_chunked("test.txt", SideEffectChunkCallback$new(f), chunk_size = 100000)

This works great, and taking less than 24 hours on my 8-core, 16GB RAM Windows 10 laptop isn't too bad for a one-time effort. But time is of the essence. Are there any solutions out there for raw text, as opposed to tabulated CSVs (which tools like data.table's fread handle), that can do this fast on a single laptop? Preferably something with read_lines_chunked's elegance.

Possible solutions I have thought of but couldn't get to work with raw text or with chunking:

  • the ff package
  • the bigmemory package
  • simply invoking the command line through system() and counting with cat file.txt | head -1000000 | grep -o "\bword\b" | wc -l - do I have any reason to believe this would be faster? (a similar idea using ripgrep is sketched further down)
  • parallelizing? Not sure if it's possible on Windows (see the sketch right after this list)
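
On the parallelization point: PSOCK clusters from base R's parallel package do work on Windows (unlike fork-based mclapply), so one option is to keep the chunked read and fan each chunk out to workers. Below is a minimal sketch, assuming the counter, "\b"-wrapped search_words, and chunked-reading setup from above; the choice of 8 workers and whether the per-chunk serialization overhead is worth it are assumptions you would want to benchmark against the serial version.

library(parallel)

cl <- makeCluster(8)                  # PSOCK cluster, works on Windows
clusterEvalQ(cl, library(stringr))    # workers need stringr for str_count()
clusterExport(cl, "search_words")     # ship the precompiled patterns once

# count every pattern within one piece of a chunk (runs on a worker)
count_piece <- function(lines)
  vapply(search_words, function(p) sum(str_count(lines, p)), integer(1))

f_par <- function(x, pos) {
  # split the chunk's lines into 8 roughly equal pieces, one per worker
  pieces <- split(x, cut(seq_along(x), 8, labels = FALSE))
  # fold the per-worker counts back into the global counter
  counter <<- counter + Reduce(`+`, parLapply(cl, pieces, count_piece))
}

read_lines_chunked("test.txt", SideEffectChunkCallback$new(f_par), chunk_size = 100000)
stopCluster(cl)

Whether this actually helps depends on whether the regex counting (CPU-bound) or the disk read dominates; shipping each chunk's lines to the workers on every iteration is not free.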
You could check out the cli tool `ripgrep` which seems crazy fast, see here https://github.com/BurntSushi/ripgrep. You'd have to invoke through `system()` to use in R. – gfgm Feb 27 '19 at 12:35
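
For what it's worth, a minimal sketch of that suggestion (not tested against a 70GB file): it assumes rg is installed and on the PATH, that the search words are plain alphanumeric tokens (so no shell quoting is needed), and that your ripgrep version supports -w (whole-word match, so no hand-built \b patterns) and --count-matches (total matches rather than matching lines); check rg --help if in doubt.

# hypothetical helper: count whole-word matches of one word in one file via ripgrep
rg_count <- function(word, file) {
  out <- system2("rg", c("-w", "--count-matches", word, file), stdout = TRUE)
  if (length(out) == 0) return(0L)     # rg prints nothing (and exits non-zero) when there are no matches
  as.integer(sub("^.*:", "", out[1]))  # strip a possible "path:" prefix from the count
}

words <- c("cat", "dog", "owl")        # the plain words, not the "\\b"-wrapped patterns
counts <- vapply(words, rg_count, integer(1), file = "test.txt")

Note that this runs one full pass over the file per word (about 100 passes for the real word set), so whether it beats a single chunked pass in R depends on how fast ripgrep and your disk are.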
