
I have a number of pipe-delimited files, a few very large (many GB), that are flawed. Someone mistakenly left out one field and a delimiter, but only in some of the rows.

I'd like to read each file as a vector of character strings using read_lines, then apply count.fields (is there a tidyverse version of this?) to split the lines into two vectors of strings by the number of tokens. (I must also delete the last string in one of the vectors.) After all that manipulation, I want to call read_delim twice, once for each set of lines, since the two sets have different numbers of delimiters. How can I get read_delim to read each vector of strings without any modifications? Or is my only solution to write new temporary files and then use read_delim on those?

The online help for read_delim says: "file" can be a literal but "It must contain at least one new line to be recognised as data (instead of a path)."

"paste" can add newlines (as shown below), but why should I need to do this to fool read_delim into reading the literal data instead of a path? Why not add a parameter to read_delim so it can read literal data in the form of a vector of strings?

I want to pass the targetSet string vector, but read_delim needs a newline:

 d <- read_delim(paste(targetSet, collapse="\n"), delim="|",
                  col_types=cols(.default="c"))

This works on files that have 100,000 records, but paste fails with one file of about 160 million records:

Error in paste(targetSet, collapse = "\n") : result would exceed 2^31-1 bytes

I'm using 64-bit R, so I don't understand this message. I should have dozens of GB of memory on the Linux box I'm using.

Is there a trick to get read_delim to read a modified vector of strings from read_lines? Is there a tidyverse version of textConnection?

Amar
Earl F Glynn

1 Answer


Taking a look at the paste source code in paste.c, we find the two lines

if (pwidth > INT_MAX)
error(_("result would exceed 2^31-1 bytes"));

So paste checks whether the length of the pasted character string (in bytes) is <= INT_MAX, and raises the above error if it is not.

R has a LONG_INT_MAX and an INT_MAX. The latter is the largest 32-bit signed integer, so its maximum value is 2^31 - 1 (one bit is used for the sign), which is around 2.1 billion.
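To make the arithmetic concrete, here is a minimal sketch that estimates how many bytes the collapsed string would need before calling paste. The helper name collapsed_bytes and the sample lines are made up for illustration; they are not part of R or readr.

```r
# Largest value a 32-bit signed integer can hold
int_max <- .Machine$integer.max
int_max == 2^31 - 1   # TRUE

# Hypothetical helper: bytes that paste(x, collapse = "\n") would need.
# The per-line counts are converted to double so the sum cannot overflow.
collapsed_bytes <- function(x) {
  sum(as.numeric(nchar(x, type = "bytes"))) + max(length(x) - 1, 0)
}

lines <- c("a|b|c", "d|e|f", "g|h|i")   # made-up sample data
collapsed_bytes(lines)             # 17: three 5-byte lines plus two newlines
collapsed_bytes(lines) > int_max   # FALSE, so paste would succeed here
```

By the same arithmetic, a vector of about 160 million lines crosses 2^31 - 1 bytes once the lines average more than roughly 13 bytes each, newline included.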

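It seems that the result of paste(targetSet, collapse = "\n") exceeds that limit. The limit applies to any single character string in R, so it is hit even in 64-bit R with plenty of memory: a vector can hold many strings, but each individual string must stay under 2^31 - 1 bytes.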
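One way around the limit, which the question itself mentions, is to skip paste and write the lines to a temporary file, so that no single giant string is ever built. The following is only a sketch under that assumption: the sample contents of targetSet are invented, and it has not been tried on a file with 160 million records.

```r
library(readr)

# Made-up stand-in for the filtered vector of lines from the question
targetSet <- c("id|name|value", "1|alpha|10", "2|beta|20")

# Write one element per line; each string stays far below 2^31 - 1 bytes
tmp <- tempfile(fileext = ".txt")
write_lines(targetSet, tmp)

# Parse the temporary file exactly as in the question, all columns as character
d <- read_delim(tmp, delim = "|", col_types = cols(.default = "c"))

# Remove the temporary file once the data has been read
unlink(tmp)
```

This costs extra disk I/O, but memory use stays close to the size of targetSet itself rather than doubling it with a collapsed copy.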

Maurits Evers