3

I have hundreds of large CSV files (sizes vary from 10k lines to 100k lines in each) and some of them have badly formed descriptions with quotes within quotes so they might look something like

ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486

I need to be able to cleanly parse all of these lines in R as CSV. dput()'ing it and reading ...

txt <- c("ID,Description,x",
    "3434,\"abc\"def\",988",
    "2344,\"fred\",3484", 
    "2345,\"fr\"\"ed\",3485",
    "2346,\"joe,fred\",3486")

read.csv(text=txt[1:4], colClasses='character')
    Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
      incomplete final line found by readTableHeader on 'text'

If we change the quoting and do not include the last line with the embedded comma - it works well

read.csv(text=txt[1:4], colClasses='character', quote='')

However, if we change the quoting and include the last line with the embedded comma...

read.csv(text=txt[1:5], colClasses='character', quote='')
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      line 1 did not have 4 elements

EDIT x2: Should have said that unfortunately some of the descriptions have commas in them - code is edited above.

Sean
  • 3,765
  • 3
  • 26
  • 48
  • What if there are 32 columns only some of which are quoted? Should I make this another question? – Sean Apr 05 '13 at 10:28

3 Answers3

5

Change the quote setting:

read.csv(text=txt, colClasses='character',quote = "")

    ID Description    x
1 3434   "abc"def"  988
2 2344      "fred" 3484
3 2345    "fr""ed" 3485
4 2346       "joe" 3486

Edit to deal with errant commas:

  txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484", 
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

txt2 <- readLines(textConnection(txt)) 

txt2 <- strsplit(txt2,",")

txt2 <- lapply(txt2,function(x) c(x[1],paste(x[2:(length(x)-1)],collapse=","),x[length(x)]) )
m <- do.call("rbind",txt2)
df <- as.data.frame(m,stringsAsFactors = FALSE)
names(df) <- df[1,]
df <- df[-1,]

#     ID Description    x
# 2 3434   "abc"def"  988
# 3 2344      "fred" 3484
# 4 2345    "fr""ed" 3485
# 5 2346  "joe,fred" 3486

No idea, if that is sufficiently efficient for your use case.

Roland
  • 127,288
  • 10
  • 191
  • 288
  • That's great does the job nicely - slow is less important than accurate! – Sean Apr 05 '13 at 10:09
  • 1
    You should use this only if "read.csv" gives an error and you could parallelize the `lapply` loop (eg. with `mclapply`). – Roland Apr 05 '13 at 11:02
2

As there is only one quoted column in this set of nasty files, I can do a read.csv() on each side to handle the other unquoted columns left and right of the quoted column, so my current solution based on the info from both @agstudy and @roland

csv.parser <- function(txt) {
    df <- do.call('rbind', regmatches(txt,gregexpr(',"|",',txt),invert=TRUE))
    # remove the header
    df <- df[-1,]
    # parse the left csv
    df1 <- read.csv(text=df[,1], colClasses='character', comment='', header=FALSE)
    # parse the right csv
    df3 <- read.csv(text=df[,3], colClasses='character', comment='', header=FALSE)
    # put them back together
    dfa <- cbind(df1, df[,2], df3)
    # put the header back in
    names(dfa) <- names(read.csv(text=txt[1], header=TRUE))
    dfa
}

# debug(csv.parser)
csv.parser(txt)

so running this on a wider set of data works thankfully.

txt <- c("ID,Description,x,y",
         "3434,\"abc\"def\",988,344",
         "2344,\"fred\",3484,3434", 
         "2345,\"fr\"\"ed\",3485,7347",
         "2346,\"joe,fred\",3486,484")
csv.parser(txt)
    ID Description    x    y
1 3434     abc"def  988  344
2 2344        fred 3484 3434
3 2345      fr""ed 3485 7347
4 2346    joe,fred 3486  484
Sean
  • 3,765
  • 3
  • 26
  • 48
  • 1
    +1! even I think you can do this by one regular expression something like ..`,"|",|[0-9],[0-9]`( I haven't test it). – agstudy Apr 05 '13 at 12:32
1

You can use readLines and extract element using regmatches between ," and ",

ll <- readLines(textConnection(object='ID,Description,x
  3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486'))
ll<- ll[-1]     ## remove the header
ll <- regmatches(ll,gregexpr(',"|",',ll),invert=TRUE)
do.call(rbind,ll)
       [,1]     [,2]       [,3]  
[1,] "  3434" "abc\"def" "988" 
[2,] "2344"   "fred"     "3484"
[3,] "2345"   "fr\"\"ed" "3485"
[4,] "2346"   "joe,fred" "3486"
agstudy
  • 119,832
  • 17
  • 199
  • 261