1

I have a CSV file with a few issues.

--Mismatched quotes

--Commas within those mismatched quotes.

Which is making reading the data in a nightmare.

I've read reading badly formed csv in R - mismatched quotes

Reading my file in using

rawData = read.csv(curFile, stringsAsFactors=FALSE, header=TRUE, quote="")

As suggested here: R Programming: "More Columns than Column Names"

I think its because of the unmatched quotes, but using read.csv(quote="") still gives me that error. Removing the quote="" allows me to read the file in (no more more columns than column names error) but it still reads incorrectly.

"@realdonaldtrump","870440000","870442502","Louis  Tonelli","L00byLou26","364","292","",0,0,"Wed Mar 23 03:03:18 +0000 2016","RT @realDonaldTrump: Incompetent Hillary, despite the horrible attack in Brussels today, wants borders to be weak and open-and let the Musl&","7.1247e+17","712474777378820097","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",NA,NA,NA,NA,NA,"7.1247e+17","712473816614772736","Wed Mar 23 02:59:29 +0000 2016","Donald J. Trump","realDonaldTrump","New York, NY","7259400","41","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>" 
Phone</a>"
"@realdonaldtrump","4831200000","4831194209","Chris Mattingly","_chrismattingly","605","194","Missouri, USA",0,0,"Wed Mar 23 03:03:18 +0000 2016","@realDonaldTrump &lt;- Favorite buffoonish reply: ""Be careful, or [insert stock threat]"". How's the ""libel"" suit going? https://twitter.com","7.1247e+17","712474777064181761","<a href=""http://twitter.com/#!/download/ipad"" rel=""nofollow"">Twitter for iPad</a>",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"@realdonaldtrump","4799600000","4799556991","Leann Rehm      Lawrence","rehm_leann","101","295","",0,0,"Tue Mar 22 21:24:15 +0000 2016","RT     @TrumpDynastyUSA: KINDRED SPIRITSBrought to TEARS.
LOVE &amp; HONOR the ""Apple of G-D's EYE!""
Deuteronomy 32:9-10
@ElianaBenador @realDona&","7.1239e+17","712389451679342593","<a     href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for     Android</a>",NA,NA,NA,NA,NA,"7.1238e+17","712384968836718593","Tue Mar 22     21:06:26 +0000 2016","Lionhearted1","TrumpDynastyUSA","United     States","5153","5140","<a href=""http://twitter.com"" rel=""nofollow"">Twitter     Web Client</a>"

For reasons beyond my control, this is how the data appears in the file.

What I should be "reading in" here is 3 rows/observations.

Each line beginning with "@realdonaldtrump" is a new observation.

If anything falls below it, it should be part of the observation above it, but there are embedded \n or \r. That shouldn't matter, though, as its comma delimited.

However that does cause problems when commas fall in between.

The mismatch quotes problem is easy to see with line "2"

Community
  • 1
  • 1
Jibril
  • 967
  • 2
  • 11
  • 29
  • A one-off dataset, or something you're likely to receive mis-formatted over and over again? – rbm Mar 29 '16 at 15:13
  • 1
    Also, have you tried splitting the string by `","` (and then removing the first and last `"`) – rbm Mar 29 '16 at 15:14
  • No, I will give this a shot! And its likely to happen again, I have somewhere in the realm of 2000 files, and this happened right from the first. – Jibril Mar 29 '16 at 15:45
  • If they are all mis-formatted the same way, then i'd pre-process the data with some simple perl script (or see Mekki's answer below for using `fread`) – rbm Mar 29 '16 at 15:51

1 Answers1

2

Try package data.table's fread (See page 31 of help file). It automatically does multi-row comparisons to try to identify mismatched quotes and commas in the manner you have described. It's not perfect, but it tends to work a lot better than read.csv.

It also supports reading in only certain row ranges, so with some trial-and-error, if you can identify the naughty rows, you can skip those with your initial fread and then handle them separately, assuming there aren't too many of them.

Short of pre-processing your data using PERL or PHP to identify and correct your mis-matched quotes before the read, this is probably your best bet from with R.

Mekki MacAulay
  • 1,727
  • 2
  • 12
  • 23