The read_delim function from the readr package supports delimiters with more than one character.
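A minimal sketch of that call (the file name and contents are invented for illustration; this assumes a readr version where multi-character delimiters are supported, i.e. readr >= 2.0):

```r
library(readr)

# Write a small sample file that uses a three-character delimiter
tmp <- tempfile(fileext = ".txt")
writeLines(c("a~~~b~~~c", "1~~~2~~~3"), tmp)

# delim can be a multi-character string
df <- read_delim(tmp, delim = "~~~", col_names = FALSE, show_col_types = FALSE)
```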
I ran some benchmarks on a 1.6 million row, 30 column, 350 MB .txt file and found it approximately 40% faster than a base R solution using strsplit in the following manner:
do.call(rbind, strsplit(readLines('test.txt'), '~~~', fixed = TRUE))
If you install gawk for Windows and set the appropriate system paths, you can also do:

fread("sed 's/|||/,/g' yourfile", sep = ',')

as suggested by eddi in the comments. This is about 20% slower than the read_delim solution, since fread has to write a temporary file when shelling out to sed, but it is still faster than the base R solution.
The fastest solution is to use fread with sep = '|' and remove the duplicated columns yourself. This works best if you know a priori where they are; otherwise their positions can be calculated, presumably at some non-trivial time cost.
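A minimal sketch of that approach, assuming the delimiter is '|||' as in the sed example above (file name and contents invented for illustration). Splitting on a single '|' leaves two empty columns between each pair of real fields, so the real columns sit at positions 1, 4, 7, and so on:

```r
library(data.table)

tmp <- tempfile(fileext = ".txt")
writeLines(c("a|||b|||c", "1|||2|||3"), tmp)

dt <- fread(tmp, sep = "|", header = FALSE)

# Keep every third column (positions known a priori here);
# otherwise the all-empty columns could be detected and dropped.
dt <- dt[, seq(1, ncol(dt), by = 3), with = FALSE]
```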
I could not get fread and tstrsplit to complete on my dataset, but you may have better luck.
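For reference, that pattern looks roughly like this (a sketch with invented sample data; lines are read whole and then split on the full delimiter with tstrsplit):

```r
library(data.table)

tmp <- tempfile(fileext = ".txt")
writeLines(c("a~~~b~~~c", "1~~~2~~~3"), tmp)

# Read whole lines into one column, then split on the full delimiter
dt <- data.table(raw = readLines(tmp))
dt <- dt[, tstrsplit(raw, "~~~", fixed = TRUE)]
```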