
I am trying to read in a large CSV file (comma-delimited), and I keep getting stuck on rows such as the following (raw file: "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"):

| RIN    | UCR_Group | Loc                                         |
|--------|-----------|---------------------------------------------|
|490658  | Property  | 400 BLOCK 17TH ST, 5TH FL                   |
|506928  | Disorder  | 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS |

As you can see, the delimiter is also used inside the Loc column. Is there a way to specify the delimiter as a regular expression when reading in the file, or would I need to look into using read.fwf, find the max length of each field, and parse the data using that? This is the code I have come up with so far (neither attempt works):

datafile <- "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"
# neither of these works: the separator is taken literally, not as a regex
new <- readr::read_delim(datafile, delim = '[\\S],[\\S]')
new <- read.table(datafile, sep = '[\\S],[\\S]')

Should I be using read.fwf, or trying to pull out the problem column manually? Any help would be appreciated.

Edit:

For bonus points, I'd really like to build a function that detects bad columns in a csv file, or data that looks like it may break the structure of the file, such as in this case. That way I don't have to mess around with a text editor and can find these errors in the file programmatically. Thoughts on how to build something like this?

petergensler
  • Yes, readr::read_csv works, but with errors. I was hoping I could supply a regular expression as the delimiter in R, but I guess not. I'm just wondering if there is a more elegant way, such as read.table or pulling in the data column by column with fread, that may be easier. – petergensler Jan 09 '17 at 03:13
  • If it's even possible to do in pandas, that would be fantastic, but I'd prefer R. I know the structure of my data is bad based on this website: http://csvlint.io/validation/5872bb3e3738390004000005 – petergensler Jan 09 '17 at 03:16
  • @petergensler I think @darshan-baral means the original `read.csv` from base R – Hao Jan 09 '17 at 03:23
  • No, I don't think read.csv works... I get this error: `read.csv(datafile, header = TRUE, sep = '[\\S],[\\S]')` gives `Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, : invalid 'sep' value: must be one byte` – petergensler Jan 09 '17 at 03:27
  • You could use `readLines` to suck the data in and then do some splitting and fixing post-hoc. There's really no easy way to deal with this unfortunately without some semi-manual adjustment. – thelatemail Jan 09 '17 at 03:37
  • `count.fields(file, sep=",")` can also be useful for debugging the rows where you have an excess of commas (see the sketch after these comments). – thelatemail Jan 09 '17 at 03:39
  • @thelatemail how would you go about using `readLines`? Can you provide an MWE? – petergensler Jan 09 '17 at 03:43
  • @petergensler - `readLines("filename")` is about as minimal as it gets :-) – thelatemail Jan 09 '17 at 03:54
  • Have you tried `read.csv(file, fill=TRUE)`? The file seems to read fine (and has 15 columns). – Remko Duursma Jan 09 '17 at 04:18
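
To illustrate the count.fields suggestion from the comments above (a sketch added for clarity, not part of the original thread; datafile is the URL defined in the question), tabulating the field count per row makes the malformed rows easy to spot:

n <- count.fields(datafile, sep = ",", quote = "", comment.char = "")
table(n)             # most rows share one field count; outliers are suspect
which(n != n[1])     # rows that disagree with the header's field count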

4 Answers

2

Use pandas.read_csv with a regex negative lookahead as the separator. The same regex should work in R as well (an R sketch follows below).

import pandas as pd

# split on commas NOT followed by whitespace; a regex separator requires
# pandas' Python parsing engine, so pass engine='python' explicitly
df = pd.read_csv(filename, sep=r',(?!\s)', engine='python')

Filter df for rows in which LOC has a comma, to verify that we've parsed correctly:

df[df.LOC.str.contains(',')]
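
For reference, the same negative lookahead can be applied from R (a sketch added for illustration, not part of the original answer; datafile is the URL from the question). Base R readers only accept a literal single-byte separator, so the regex split has to happen before parsing, e.g. with strsplit:

lines <- readLines(datafile)
# split on commas NOT followed by whitespace (perl = TRUE enables the lookahead);
# assumes every row then splits into the same number of fields
parts <- strsplit(lines, ",(?!\\s)", perl = TRUE)
df <- setNames(as.data.frame(do.call(rbind, parts[-1])), parts[[1]])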


piRSquared
  • How did you design your regular expression? I've been trying to make that for days... Where do you test your regexes? – petergensler Jan 14 '17 at 02:27
2

Replace each comma that is immediately followed by a non-space character with a semicolon, and then read in the result using read.csv2.

(Replace Lines with a readLines(u) call to read the file from the URL u. Also, if there are semicolons in the file, then use a different character instead: specify it both in the second argument to gsub and in the sep= argument to read.csv2, or just use read.csv.)

read.csv2(text = gsub(",(\\S)", ";\\1", Lines))

giving:

     RIN UCR_Group                                         Loc
1 490658  Property                   400 BLOCK 17TH ST, 5TH FL
2 506928  Disorder 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS

Note: We used this as the input Lines:

Lines <- c("RIN,UCR_Group,Loc", 
  "490658,Property,400 BLOCK 17TH ST, 5TH FL", 
  "506928,Disorder,1400 BLOCK W COLORADO AVE, COLORADO SPRINGS")

Update: In the actual file, a space can appear before a valid comma separator and there are one-character fields, so the pattern only checks the character that follows the comma (rather than requiring non-space on both sides). Here is the result for the first 3 lines of the file:

u <- "http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv"
Lines <- readLines(u, 3)

read.csv2(text = gsub(",(\\S)", ";\\1", Lines))

giving:

     RIN               UCR_Group                            UCR_Cat
1 416667 Crimes Against Property Criminal Mischief/Damaged Property
2 416673  Crimes Against Persons              Forcible Sex Offenses
              EXP_TRANSLATION         OCC_DATE OCC_TIME                     LOC
1 CRIMINAL MISCHIEF - MTR VEH 1/1/2010 0:00:00      145  200 BLOCK S ZENOBIA ST
2             SEX ASLT - RAPE 1/1/2010 0:00:00      300 1500 BLOCK S DECATUR ST
  TRANSLATION       PIN               DOB SEX          X          Y     LON
1 VICTIM      235602181  5/6/1979 0:00:00   M 3126041.08 1684996.73 -105.05
2 ARRESTEE    219220590 3/19/1988 0:00:00   M 3134340.56 1676360.06 -105.02
    LAT
1 39.71
2 39.68
G. Grothendieck
  • I don't think this works with the original data file, but it does work with the sample vector, as you stated. – petergensler Jan 10 '17 at 02:10
  • It seems the description in the question is not entirely accurate, as there are valid comma separators that are preceded by a space. Also, there are one-character fields, whereas there were none in the short example. I have modified the code accordingly, so it should work on the file. – G. Grothendieck Jan 10 '17 at 04:05
  • Apologies for that, I should have been clearer. Building on your answer, how would you use your logic to search a file for any delimiter that would throw off parsing it correctly? I'm sure many other people have run into a similar issue. – petergensler Jan 12 '17 at 02:09
  • Are there any resources for learning about using text connections to read files into R? This seems like such a handy feature, but I doubt anyone is familiar with how to leverage it properly (see the short example after these comments). – petergensler Jan 12 '17 at 02:19
  • Building on your approach, this is how I was able to clean up my file: `system.time(genocc <- readr::read_file(datafile)); system.time(test <- str_replace_all(genocc, ",(?!\\s)", ";")); file <- readr::read_delim(file = test, delim = ";")` – petergensler Jan 14 '17 at 03:14
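
On the text-connection question above (a minimal illustration, not from the original thread): read.csv and friends accept either an explicit connection or a text= argument that wraps a string in a connection for you, so in-memory data can be parsed without writing it to disk.

s <- "a,b\n1,2\n3,4"
read.csv(text = s)             # text= creates the connection internally
read.csv(textConnection(s))    # equivalent, with an explicit text connection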
1

You know which field has the unescaped commas, so the known-good separators can be fixed by position:

library(stringi)
library(purrr)

txt <- readr::read_lines("http://daniels-pull.universityofdenv.netdna-cdn.com/assets/GeneralOccurrencesAll.csv")
commas <- stri_locate_all_fixed(txt, ",")

# A clean row has 14 commas (15 fields). Convert the first 6 and the last 7
# of them to semicolons; whatever commas remain in the middle belong to the
# problem fields and are left alone (so LOC and TRANSLATION come out merged).
# Demonstrated on the first 100 lines:
map2_chr(txt[1:100], commas[1:100], function(x, y) {
  len <- nrow(y)
  bits <- c(1:6, (len-6):len)
  for (i in bits) { stri_sub(x, y[i,1], y[i,2]) <- ";" }
  x
}) -> rd

read.table(text=rd, header=TRUE, sep=";", stringsAsFactors=FALSE) %>%
  dplyr::glimpse()
## Observations: 99
## Variables: 14
## $ RIN             <int> 416667, 416673, 416674, 416680, 416684, 416686...
## $ UCR_Group       <chr> "Crimes Against Property", "Crimes Against Per...
## $ UCR_Cat         <chr> "Criminal Mischief/Damaged Property", "Forcibl...
## $ EXP_TRANSLATION <chr> "CRIMINAL MISCHIEF - MTR VEH", "SEX ASLT - RAP...
## $ OCC_DATE        <chr> "1/1/2010 0:00:00", "1/1/2010 0:00:00", "1/1/2...
## $ OCC_TIME        <int> 145, 300, 500, 730, 200, 440, 100, 851, 140, 2...
## $ LOC.TRANSLATION <chr> "200 BLOCK S ZENOBIA ST,VICTIM     ", "1500 BL...
## $ PIN             <int> 235602181, 219220590, 119013720, 174326399, 32...
## $ DOB             <chr> "5/6/1979 0:00:00", "3/19/1988 0:00:00", "5/25...
## $ SEX             <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "...
## $ X               <dbl> 3126041, 3134341, 3134360, 3127695, 3193317, 3...
## $ Y               <dbl> 1684997, 1676360, 1700160, 1682545, 1708673, 1...
## $ LON             <dbl> -105.05, -105.02, -105.02, -105.04, -104.81, -...
## $ LAT             <dbl> 39.71, 39.68, 39.75, 39.70, 39.77, 39.78, 39.7...
hrbrmstr
  • Wow, that is truly wonderful! How would you go about writing a function to detect which fields are bad from within R? Is that even possible, rather than searching manually in a text editor? – petergensler Jan 10 '17 at 00:02
  • Aye, tis possible. Find the common "comma" pattern and then isolate the field with the errant one (see the sketch below). – hrbrmstr Jan 10 '17 at 00:40
  • OK, would you care to advise on how to build a function that could do that? – petergensler Jan 10 '17 at 01:56
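
A minimal sketch of that suggestion (a hypothetical find_bad_rows helper, not from the original thread; u is the file URL from the question): find the common comma count, then flag the rows that deviate from it so the errant field can be isolated.

library(stringi)

# hypothetical helper: rows whose comma count differs from the common pattern
find_bad_rows <- function(txt) {
  n_commas <- stri_count_fixed(txt, ",")
  expected <- as.integer(names(which.max(table(n_commas))))  # the modal count
  which(n_commas != expected)
}

txt <- readr::read_lines(u)
txt[head(find_bad_rows(txt))]   # inspect a few offending lines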
0

Here's a worked example showing that you can use a regex to parse this file, relying on the fact that the commas in the addresses are followed by a space. This of course gets more complicated if that rule doesn't always hold:

txt <- "RIN,UCR_Group,Loc
123456,Property,1 STREET
490658,Property,400 BLOCK 17TH ST, 5TH FL
506928,Disorder,1400 BLOCK W COLORADO AVE, COLORADO SPRINGS"

dat <- readLines(textConnection(txt))
# in a real example:
# dat <- readLines("filename.csv")

# split on commas that have a non-space character on both sides
# (perl=TRUE enables the lookbehind/lookahead)
spl <- strsplit(dat, "(?<=\\S),(?=\\S)", perl=TRUE)
# the first element holds the header; bind the remaining rows into a data.frame
setNames(data.frame(do.call(rbind, spl[-1])), spl[[1]])

#     RIN UCR_Group                                         Loc
#1 123456  Property                                    1 STREET
#2 490658  Property                   400 BLOCK 17TH ST, 5TH FL
#3 506928  Disorder 1400 BLOCK W COLORADO AVE, COLORADO SPRINGS
thelatemail