I have a raw text file of about 70GB with over 1B lines of varying length; there are no columns, just raw text. I want to scan it and count how many times each word in a predefined set search_words (about 100 words) appears. Currently I'm using read_lines_chunked from the readr package, reading chunks of 100K lines and invoking a callback function f that updates a global counter, like so:
library(tidyverse)

# toy input standing in for the 70GB file
write_lines("cat and dog\r\ndog\r\nowl\r\nowl and cat", "test.txt")
search_words <- c("cat", "dog", "owl") # real size is about 100
counter <- numeric(length(search_words))

# wrap each word in word boundaries so "cat" doesn't match "category"
regex_word <- function(w) str_c("\\b", w, "\\b")
search_words <- map_chr(search_words, regex_word)
count_word <- function(i, chunk) sum(str_count(chunk, search_words[i]))

# callback: add this chunk's counts to the global counter
f <- function(x, pos) {
  counter <<- counter + map_int(seq_along(search_words), count_word, x)
}
read_lines_chunked("test.txt", SideEffectChunkCallback$new(f), chunk_size = 100000)
This works great, and at under 24 hours on my 8-core, 16GB RAM Windows 10 laptop it would be acceptable as a one-time effort, but time is of the essence. Are there any solutions for raw text (as opposed to tabular CSVs, which data.table's fread handles) that would do this fast on a single laptop? Preferably something as elegant as read_lines_chunked.
Possible solutions I have thought of but couldn't get to work with raw text or with chunking:
- the ff package
- the bigmemory package
- simply invoking the command line through system() and counting with cat file.txt | head -1000000 | grep -o "\bword\b" | wc -l (do I have any reason to believe this would be faster? a rough sketch of driving this from R is below this list)
- parallelizing? Not sure if possible in Windows (a socket-cluster sketch is also below)
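For the command-line idea, here is a minimal sketch of driving it from R on Windows, assuming GNU grep and wc are available on the PATH (e.g. via Rtools, Git Bash, or WSL); count_with_grep is just an illustrative name, and test.txt plus the three words are the toy example from above. On Windows the pipe has to be interpreted by a shell, hence shell() rather than system(); on a Unix-alike, system() would do.

library(purrr)

# grep -o prints every match on its own line; wc -l counts those lines
count_with_grep <- function(word, path = "test.txt") {
  cmd <- sprintf('grep -o "\\b%s\\b" %s | wc -l', word, path)
  as.integer(shell(cmd, intern = TRUE))
}

counter_grep <- map_int(c("cat", "dog", "owl"), count_with_grep)

Whether this beats the in-R version is something you'd have to benchmark; as written it also rescans the file once per word, whereas a single grep -oE pass over an alternation of all ~100 words, piped through sort | uniq -c, would read it only once.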
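On the parallelization question: forking (mclapply) is not available on Windows, but socket (PSOCK) clusters from the base parallel package are. Below is a minimal sketch under the same toy file and words as above; it keeps the chunked reading in the main process and spreads the per-pattern str_count() calls over the workers (count_in_chunk and f_par are names made up for the example). Note that each chunk is serialized to the workers, so the overhead may eat into the gain.

library(readr)
library(stringr)
library(parallel)

search_patterns <- str_c("\\b", c("cat", "dog", "owl"), "\\b")
counter <- numeric(length(search_patterns))

cl <- makeCluster(detectCores() - 1)   # PSOCK cluster, works on Windows
clusterEvalQ(cl, library(stringr))     # load stringr on each worker

# count occurrences of one pattern in a chunk of lines (runs on the workers)
count_in_chunk <- function(p, chunk) sum(str_count(chunk, p))

# callback: distribute the patterns (~100 in the real case) over the workers
f_par <- function(x, pos) {
  counter <<- counter + parSapply(cl, search_patterns, count_in_chunk, chunk = x)
}

read_lines_chunked("test.txt", SideEffectChunkCallback$new(f_par),
                   chunk_size = 100000)
stopCluster(cl)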