
I have a comma-delimited text file in which the rows do not all have the same number of fields. The first field is a two-digit code that determines how many fields that row has (it's a code, NOT a count). I'm curious whether there's some standard method for handling this kind of data when getting it into a SQL environment.

File looks like this:

10,abc,20141001,test@test.com,555-555-5555
20,abc,20141001,123 Main St,Springfield,CT,10001
10,xyz,20141001,test2@test.com,111-111-1111
...

Also the file is ~12 GB so I can't open it in a text editor to manipulate it. I originally tried to read it into R and split it into separate files using grep('^10,') or something like that but there appears to be an encoding/delimiter issue I'm missing. I'm assuming others have dealt with a similar piece of data and would appreciate any suggestions.

screechOwl
  • Do the codes 10, 20, 10, etc. have an actual reference to a number of fields, or give any indication of the number of fields that should be in one line? – Rich Scriven Oct 09 '14 at 00:03
  • @RichardScriven: not at all sadly. – screechOwl Oct 09 '14 at 00:07
  • 2
    `read.csv` defaults to `fill = TRUE` which will insert NAs where there are blank entries. You may need to set `col.names` to tell R what the maximum number of columns to expect is. If you want to manipulate a 12 gb file, either read it line by line, or use a tool like `awk`. – Alex Oct 09 '14 at 00:20
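To illustrate the `awk` suggestion above: because the record type is always the first field, you can stream the file once and route each line to a per-code output file, giving you several uniform CSVs that load cleanly. This is a minimal sketch; the file name `big.csv` and the assumption that only codes 10 and 20 occur are hypothetical.

```shell
# Sample rows mirroring the question's layout (hypothetical file name big.csv).
printf '10,abc,20141001,test@test.com,555-555-5555\n20,abc,20141001,123 Main St,Springfield,CT,10001\n10,xyz,20141001,test2@test.com,111-111-1111\n' > big.csv

# Route each line to a per-code file keyed on the first comma-separated field.
# awk streams the input line by line, so a 12 GB file is never held in memory.
awk -F, '{ print > ("type_" $1 ".csv") }' big.csv

# type_10.csv now holds only the 5-field rows, type_20.csv only the 7-field rows.
```

Each `type_NN.csv` then has a fixed column count and can be read with `read.csv`/`fread` or bulk-loaded directly.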

1 Answer


Assuming the codes are consistent (e.g., lines coded 10 are all formatted the same way), you could do something like this:

text <- "10,abc,20141001,test@test.com,555-555-5555
20,abc,20141001,123 Main St,Springfield,CT,10001
10,xyz,20141001,test2@test.com,111-111-1111"

library(data.table)
conn <- textConnection(text)
# read one line at a time and keep only the 10-coded records
result.10 <- do.call(rbind, lapply(1:3, function(i) {
  x <- readLines(conn, n = 1)
  if (grepl("^10,", x)) return(setNames(strsplit(x, ",")[[1]], c("code", "name", "date", "email", "phone")))
  # for the 20-coded records, use instead:
  # if (grepl("^20,", x)) return(setNames(strsplit(x, ",")[[1]], c("code", "name", "date", "address", "city", "state", "zipcode")))
}))
result.10 <- as.data.table(result.10)
result.10[,code:=NULL]
result.10
#    name     date          email        phone
# 1:  abc 20141001  test@test.com 555-555-5555
# 2:  xyz 20141001 test2@test.com 111-111-1111

Then do the same thing for result.20, etc. Finally you would merge the tables into one, probably on name (and perhaps date), with something like:

setkey(result.10,name,date)
setkey(result.20,name,date)
result <- merge(result.10,result.20,all.x=TRUE,all.y=TRUE)
result
#    name     date          email        phone     address        city state zipcode
# 1:  abc 20141001  test@test.com 555-555-5555 123 Main St Springfield    CT   10001
# 2:  xyz 20141001 test2@test.com 111-111-1111          NA          NA    NA      NA

I'm using data.tables rather than data frames because, with such a large file, it's likely to be faster.
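Since the original goal was a SQL environment, here is one hedged way to finish the job: once the file has been split by code into uniform per-type CSVs (whether via the R loop above or an awk pass), SQLite's CSV import can bulk-load each one. The file name `type_10.csv`, database `records.db`, and table `contacts` are all hypothetical.

```shell
# Hypothetical per-type file produced by splitting the big file on the leading code.
printf '10,abc,20141001,test@test.com,555-555-5555\n10,xyz,20141001,test2@test.com,111-111-1111\n' > type_10.csv

# Create a table matching the 10-coded layout, then bulk-load the CSV.
sqlite3 records.db <<'EOF'
CREATE TABLE contacts (code TEXT, name TEXT, date TEXT, email TEXT, phone TEXT);
.mode csv
.import type_10.csv contacts
EOF
```

Repeat with a second table for the 20-coded file, and the merge shown above becomes an ordinary SQL join on name/date.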

jlhoward