
My data has around 2.6 million rows, with quoted fields separated by commas. But sometimes there are commas within a field which I do not want to be treated as delimiters.

Also, I think there is an open quote somewhere, so when I simply read.csv, I get only 1.3 million observations. But when I read it with quote="", I get the required number of observations, but the number of columns increases because the address variable ends up getting split on its commas.

Sharvi

1 Answer


Sounds like you need to find and fix your open quote. I suggest you read each line and count the number of " characters in it. If there is indeed an open quote, some rows will contain an odd number of quotes.

library(stringr)

fileName <- "read.csv"
con <- file(fileName, open = "r")
count <- 1
while (length(line <- readLines(con, n = 1)) > 0) {
    s <- str_count(line, '"')  # count the number of " in this line
    if (s %% 2 != 0) {  # an odd count means an unclosed " in this row
        print(paste("Row", count, "has", s, 'occurrences of "'))
    }
    count <- count + 1
}
close(con)  # don't leave the connection open
whalea
  • Thanks a lot. This definitely helped me figure out which rows were creating an issue. But there are way too many of them. I am trying to split the csv into 5-6 files using a csv splitter and then will open them to replace all quotes – Sharvi Apr 11 '18 at 09:58
  • I have another suggestion for you: you could modify the above code to write the data into 2 different files, one containing the valid rows and the other the invalid rows. Then you can just clean up the file with the invalid rows and recombine them. To write line by line, see https://stackoverflow.com/questions/45046184/how-to-write-a-file-line-by-line-in-r – whalea Apr 11 '18 at 11:38
  • Ok, will do and will let you guys know how it goes. Thanks a lot for the help. I wasn't expecting a response so fast. You're the best! – Sharvi Apr 12 '18 at 05:38
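The two-file split suggested in the comments can be sketched as follows, using only base R (no stringr) to count quotes. The function name and the file names are hypothetical placeholders, not from the original answer:

```r
# Sketch: split a CSV line by line into a "valid" file (even number of
# double quotes) and an "invalid" file (odd number, i.e. an open quote).
split_by_quote_parity <- function(in_file, good_file, bad_file) {
  con      <- file(in_file,   open = "r")
  good_con <- file(good_file, open = "w")
  bad_con  <- file(bad_file,  open = "w")
  on.exit({ close(con); close(good_con); close(bad_con) })

  while (length(line <- readLines(con, n = 1)) > 0) {
    # Count '"' characters: length before vs. after deleting them
    n_quotes <- nchar(line) - nchar(gsub('"', "", line, fixed = TRUE))
    if (n_quotes %% 2 == 0) {
      writeLines(line, good_con)  # balanced quotes: keep as valid
    } else {
      writeLines(line, bad_con)   # open quote: needs manual cleanup
    }
  }
}
```

After running something like `split_by_quote_parity("data.csv", "valid.csv", "invalid.csv")`, you can fix `invalid.csv` by hand (or with a search-and-replace), read both files with read.csv, and rbind the results.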