
I have a file with large, multiline blocks of text. I would like to read the file into a list of character vectors, one for each block. My reading of the documentation for functions like scan() and read.table() suggests that the end of a line always ends the vector. Is there an option, or some other function, that lets me specify a separator character so that a new vector is not started until that character is encountered?

smci

1 Answer


R's read.csv() follows RFC 4180 for the format of CSV files, so if your file is formatted that way, it will be read correctly. In short, a long text field with embedded line breaks is read as a single field (line breaks included) if it is enclosed in double quotes. What if the text itself contains quotes? That's the rub: each embedded double quote in the text must be replaced by two consecutive double quotes ("").

Here is an example:

> read.csv(stringsAsFactors = FALSE, text = '
+ id, text
+ 1, Hello World
+ 2, "Hello
+ World"
+ 3, "I say ""Hello 
+ World"" often"
+ ')

  id                         text
1  1                  Hello World
2  2                 Hello\nWorld
3  3  I say "Hello \nWorld" often
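A round trip is one way to check the quoting behaviour without typing the CSV by hand. The following is a minimal sketch, assuming the blocks are already in R; the data frame `blocks` and the temporary file are hypothetical names made up for illustration. It relies on write.csv() doubling embedded quotes (its `qmethod` is "double").

```r
# Hypothetical data: two multiline blocks, one containing double quotes.
blocks <- data.frame(
  id = 1:2,
  text = c("Hello\nWorld", "I say \"Hello\nWorld\" often"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file; write.csv() quotes character
# columns and doubles any embedded double quotes.
path <- tempfile(fileext = ".csv")
write.csv(blocks, path, row.names = FALSE)

# Look at the raw file: the quoted fields span several physical lines.
raw_lines <- readLines(path)
print(raw_lines)

# Read it back: each quoted multiline field becomes a single string again.
restored <- read.csv(path, stringsAsFactors = FALSE)
print(identical(restored$text, blocks$text))

unlink(path)
```

The raw lines show the doubled quotes on disk, while `restored$text` holds the original blocks with their line breaks intact.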

Here is the relevant section of the RFC:

  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","b""bb","ccc"
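If the blocks are not yet quoted, the two rules above can be applied by hand before calling read.csv(). This is only a sketch under that assumption: `quote_field` is a hypothetical helper written for this example, not a base R function, and `raw_text` is made-up sample data.

```r
# Hypothetical helper: double every embedded double quote,
# then wrap the whole field in double quotes.
quote_field <- function(x) {
  paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
}

# Made-up sample blocks: plain, multiline, and multiline with quotes.
raw_text <- c(
  "Hello World",
  "Hello\nWorld",
  "I say \"Hello\nWorld\" often"
)

# Assemble CSV text with one record per block.
csv_text <- paste0(
  "id,text\n",
  paste0(seq_along(raw_text), ",", quote_field(raw_text), collapse = "\n"),
  "\n"
)

# Parse the text; each quoted field should come back as one string.
parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(identical(parsed$text, raw_text))
```

Quoting every field, even ones that do not need it, keeps the helper simple; the RFC allows fields to be enclosed in double quotes whether or not they contain special characters.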

James King