Read text file with separator longer than one character with fread

Question

Is there a way to get data.table's fread to read a text file with a multi-character separator like "|||"?

I have a text file (2GB) that has lines that look like

aaa|||bbb|||random characters !$^!$£"!$ contain single |. |||other cols

If it's not possible to use fread, any other recommendation? I'll get them into data.table in the end.

first `fread()` and then splitting the string with `strsplit()` — jogo, Nov 13 '15 at 14:24
@Heroka, I tried read.table but that gives a similar error `invalid 'sep' value: must be one byte` — jf328, Nov 13 '15 at 14:26
Maybe this post helps http://stackoverflow.com/questions/18186357/importing-csv-file-with-multiple-character-separator-to-r — PhillipD, Nov 13 '15 at 14:28
No native support for this. If on *nix smth like `fread("sed 's/|||/,/g' yourfile")` would be your best bet. — eddi, Nov 13 '15 at 17:50

score 1 · Answer 1 · answered Jun 21 '16 at 06:45

The command read_delim from the package readr supports delimiters with more than one character.
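A minimal sketch of that call, assuming an installed readr version whose read_delim accepts a multi-character delim as described above. The sample file and its contents are made up for illustration.

```r
library(readr)

# Hypothetical sample file using "|||" as the separator;
# the third field of the first row contains a single "|".
tmp <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||x|y",
             "ddd|||eee|||fff"), tmp)

# col_names = FALSE because the sample has no header row.
df <- read_delim(tmp, delim = "|||", col_names = FALSE)
```

The result is a tibble; data.table::setDT(df) or as.data.table(df) should turn it into a data.table afterwards.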

I ran some benchmarks (1.6 million rows, 30 columns, 350 MB text file).

I find that it is approx 40% quicker than a solution using strsplit in the following manner:

do.call(rbind, strsplit(readLines('test.txt'), '~~~', fixed = TRUE))

If you install gawk for Windows and set the appropriate system paths in Windows, you can also do:

fread("sed 's/|||/,/g' yourfile", sep = ',')

as suggested by eddi in the comments. This is about 20% slower than the read_delim solution, because the output of sed has to be written to a temporary file, but it is faster than the base R solution.
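If sed is not available, the same replace-then-read idea can be sketched in R alone. This is a swapped-in variant, not what was benchmarked above: it replaces "|||" with a tab using gsub and hands the lines to fread through its text argument. It assumes the tab character never occurs in the data, and the sample file is made up.

```r
library(data.table)

# Hypothetical sample file; one field contains a single "|" and a comma.
tmp <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||x | y, z|||ccc",
             "ddd|||eee|||p | q|||fff"), tmp)

lines <- readLines(tmp)

# Assumption: tabs never appear in the data, so a tab is a safe replacement.
stopifnot(!any(grepl("\t", lines, fixed = TRUE)))
lines <- gsub("|||", "\t", lines, fixed = TRUE)

# quote = "" keeps any quote characters in the data as literal text.
dt <- fread(text = lines, sep = "\t", header = FALSE, quote = "")
```

A tab is used instead of a comma here because free-text fields are more likely to contain commas. This holds the whole file in memory as a character vector, so it may not suit very large files.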

The fastest solution is to use fread with sep = '|' and remove the duplicated columns yourself. This works best if you know a priori where they are; otherwise their positions can be calculated (presumably at some non-trivial time cost).
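A rough sketch of that approach, assuming no field contains a single "|" (a stray "|" inside a field would be split as well and shift the columns). With "|||" as the separator, each real field is followed by two empty columns, so the real fields sit at positions 1, 4, 7 and so on. The sample file is made up.

```r
library(data.table)

# Hypothetical sample file; no field contains a single "|".
tmp <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||ccc",
             "ddd|||eee|||fff"), tmp)

# Splitting on a single "|" yields two empty columns per "|||".
raw <- fread(tmp, sep = "|", header = FALSE)

# Keep every third column, starting with the first.
keep <- seq(1, ncol(raw), by = 3)
dt <- raw[, keep, with = FALSE]
```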

I could not get fread and tstrsplit to complete for my dataset, but you may have better luck.
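For reference, the fread plus tstrsplit combination might look like the sketch below: read every line as a single character column, then split on the fixed string "|||". It is shown on a made-up two-line file and says nothing about whether it would complete on a large one.

```r
library(data.table)

# Hypothetical sample file; one field contains a single "|".
tmp <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||x | y|||ccc",
             "ddd|||eee|||p | q|||fff"), tmp)

# sep = "\n" makes fread read each line into one character column (V1).
lines <- fread(tmp, sep = "\n", header = FALSE)

# Split each line on the literal string "|||" and transpose into columns.
parts <- lines[, tstrsplit(V1, "|||", fixed = TRUE)]
```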

it may be possible to improve the sed performance by opening a text connection to the output of sed, so that fread does not have to write (and read) a temporary file. — Alex, Jun 21 '16 at 06:46
