I have a number of pipe-delimited files, a few of them very large (many GB), that are flawed: someone mistakenly left out one field and its delimiter, but only in some of the rows.
I'd like to read each file as a vector of character strings using read_lines, then apply count.fields (is there a tidyverse version of this?) to split the lines into two vectors of strings by the number of tokens. (I must delete the last string in one of the vectors.) After all that manipulation, I want to use read_delim twice to parse the two sets of lines, which have different numbers of delimiters. How can I get read_delim to read each vector of strings without any modifications? Or is my only solution to write new temporary files and then use read_delim on those?
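A minimal sketch of the splitting step described above, using base R's count.fields over a textConnection. The toy lines vector stands in for read_lines() output, and 3 is an assumed "correct" field count:

```r
# Toy stand-in for the output of readr::read_lines()
lines <- c("a|b|c", "d|e", "f|g|h")

# count.fields() is base R (utils); sep = "|" counts pipe-delimited tokens
n_fields <- count.fields(textConnection(lines), sep = "|")

good <- lines[n_fields == 3]  # rows with every field present
bad  <- lines[n_fields == 2]  # rows missing one field and its delimiter
```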
The online help for read_delim says that file can be a literal, but "It must contain at least one new line to be recognised as data (instead of a path)."
"paste" can add newlines (like shown below), but why should I need to do this to fool read_delim
into reading the literal data instead of a path? Why not add a parameter to read_delim
so it an read literal data in the form a vector of strings?
I want to pass the targetSet string vector, but read_delim
needs a newline:
d <- read_delim(paste(targetSet, collapse = "\n"), delim = "|",
                col_types = cols(.default = "c"))
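One hedged aside: newer versions of readr (1.2+; this is the documented route in readr 2.x) recognise input wrapped in I() as literal data, so a vector of lines can be passed without paste()-ing it into one giant string. A sketch, assuming a recent readr is installed:

```r
library(readr)

targetSet <- c("x|y|z", "1|2|3")  # toy stand-in for the cleaned lines

# I() marks the vector as literal data rather than a file path
d <- read_delim(I(targetSet), delim = "|",
                col_names = FALSE,
                col_types = cols(.default = "c"))
```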
This works on files that have 100,000 records, but paste fails on one file of about 160 million records:

Error in paste(targetSet, collapse = "\n") : result would exceed 2^31-1 bytes

I'm using 64-bit R, so I don't understand this message. I should have dozens of GB of memory on the Linux box I'm using.
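For what it's worth, here is a minimal base-R sketch of the temporary-file route mentioned earlier (writeLines and read.delim stand in for their readr counterparts to keep the sketch dependency-free):

```r
lines <- c("a|b|c", "1|2|3")  # toy stand-in for one of the split vectors

# Write the vector to a temp file, then parse it as delimited data;
# this sidesteps building a single huge string with paste()
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)
d <- read.delim(tmp, sep = "|", header = FALSE, colClasses = "character")
unlink(tmp)
```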
Is there a trick to get read_delim to read a modified vector of strings from read_lines? Is there a tidyverse version of textConnection?
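For completeness, the base-R textConnection route the last question alludes to: read.delim (rather than read_delim) consumes the connection directly, with no temp file. A sketch:

```r
lines <- c("a|b|c", "1|2|3")  # toy stand-in for the modified vector

# textConnection() exposes a character vector as a readable connection
d <- read.delim(textConnection(lines), sep = "|", header = FALSE,
                colClasses = "character")
```

Caveat: textConnection() copies its input and is known to be slow on very large vectors, so at 160 million lines the temp-file route may still be the practical choice.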