Is there a way to get data.table's fread to read a text file with multi-character separators like "|||"?

I have a 2 GB text file whose lines look like this:

aaa|||bbb|||random characters !$^!$£"!$ contain single |. |||other cols

If fread cannot do this, is there another approach you would recommend? The data will end up in a data.table either way.

jf328

1 Answer

The read_delim function from the readr package supports delimiters of more than one character.

I ran some benchmarks (1.6 million rows, 30 columns, 350 MB text file).

read_delim was approximately 40% quicker than a base R solution using strsplit in the following manner (my test file used '~~~' as the delimiter):

do.call(rbind, strsplit(readLines('test.txt'), '~~~', fixed = TRUE))
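
For the "|||" separator from the question, the same base R approach might look like the sketch below. The sample rows and the temporary file are made up for illustration. Passing fixed = TRUE matters here, because | is a regular-expression metacharacter and would otherwise be read as alternation.

```r
# Hypothetical sample file using the "|||" separator from the question.
path <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||x | y|||ddd",
             "eee|||fff|||z|||hhh"), path)

# fixed = TRUE treats "|||" as a literal string, not a regular expression,
# so the lone "|" inside the third field of the first row is left alone.
parts <- strsplit(readLines(path), "|||", fixed = TRUE)

# Stack the per-line character vectors into a character matrix.
m <- do.call(rbind, parts)
```

The result is a character matrix, so it still needs converting (for example with as.data.table) and the columns need retyping. Note that strsplit drops a trailing empty field, so rows ending in "|||" would come out one element short.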

If you install gawk for Windows and set the appropriate system paths (the command below needs sed to be found on the path), you can also do:

fread("sed 's/|||/,/g' yourfile", sep = ',')

as suggested by eddi in the comments. This is about 20% slower than the read_delim solution, because fread has to write the output of sed to a temporary file, but it is still faster than the base R solution.
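
A minimal sketch of the sed route, assuming a Unix-like shell with sed on the PATH and a data.table version recent enough to have the cmd argument of fread (I believe 1.11.6 or later). The sample file is hypothetical, and replacing "|||" with a comma is only safe if the data itself contains no commas.

```r
library(data.table)

# Hypothetical sample file; the real input would be the large text file.
path <- tempfile(fileext = ".txt")
writeLines(c("aaa|||bbb|||ccc", "ddd|||eee|||fff"), path)

# In sed's basic regular expressions a bare | is a literal character,
# so this replaces every "|||" with a single comma.
cmd <- paste("sed 's/|||/,/g'", shQuote(path))

# fread runs the shell command and parses its output.
dt_sed <- fread(cmd = cmd, sep = ",", header = FALSE)
```

If the data may contain commas, a replacement character that never occurs in the file would be a safer choice than ",".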

The fastest solution is to use fread with sep = '|' and remove the duplicated columns yourself. This works best if you know a priori where they are; otherwise their positions can be calculated (presumably at some non-trivial time cost).
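
As a rough sketch of that idea, assuming every real field is separated by exactly "|||" and no field contains a lone "|": reading with sep = "|" puts two empty columns after each real one, so the real fields land in columns 1, 4, 7 and so on. The sample rows and variable names below are made up for illustration.

```r
library(data.table)

# Hypothetical rows; none of the fields contains a lone "|".
rows <- c("aaa|||bbb|||ccc", "ddd|||eee|||fff")

# With sep = "|", each "|||" yields two empty columns between real fields.
raw <- fread(text = rows, sep = "|", header = FALSE)

# Keep every third column, starting from the first.
keep <- seq(1, ncol(raw), by = 3)
dt_pipe <- raw[, keep, with = FALSE]
```

If a field can contain a single "|", as in the sample line in the question, that field is split as well and the column positions shift, so this shortcut only applies to data where that cannot happen.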

I could not get fread and tstrsplit to complete for my dataset, but you may have better luck.

Alex
  • It may be possible to improve the performance of the sed approach by opening a text connection to the output of sed, so that fread does not have to write (and then read) a temporary file. – Alex Jun 21 '16 at 06:46