4

I am processing the US Weather Service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 file contains several rows with very large freeform text fields that contain embedded NUL characters, in an otherwise plain ASCII dataset. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz.)

R cannot handle the corrupted string data without errors, and this includes base data.frame functions as well as the data.table, stringr, and stringi packages (I tried them all).

I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R markdown type report with embedded code.

Suggestions?

Bill

3 Answers

3

Maybe this could be of help:

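# Read the lines, convert them to ASCII with iconv(), and write a clean copy.
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv", 
                open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"), 
           con = "StormEvents_ascii.csv")
close(in.file)  # release the connection once the clean copy is written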

I was able to read the CSV file without errors with this call to read.table:

options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE, 
                           sep = ",", fill = TRUE, quote = '"')

Obviously you'd need to change the class of several columns, since all are considered character as it is.
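As a rough sketch of that class-changing step, one option is to let base R re-infer each column's type with type.convert(). The toy data frame and its column names below are made up for illustration and are not taken from the Storm Data file.

```r
# Toy data frame standing in for the imported table; column names are hypothetical.
df <- data.frame(event_id = c("101", "102"),
                 state = c("TEXAS", "OHIO"),
                 stringsAsFactors = FALSE)

# Re-infer each column's type; as.is = TRUE keeps text as character, not factor.
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

Columns that need a specific class (dates, for instance) would still have to be converted by hand.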

Dominic Comtois
  • Yes, that works! I had not tried iconv(), which CAN handle those NUL chars in strings, it seems. – Bill Mar 13 '15 at 09:02
1

Just for posterity - you can use binary reads (readBin()) and replace the NULs with anything else - see Removing "NUL" characters (within R)
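A minimal sketch of the binary-read approach, assuming the file fits in memory. The helper name replace_nul() and the choice of a space as the replacement byte are assumptions for illustration; the demonstration uses a small temporary file rather than the real Storm Data file.

```r
# Hypothetical helper: copy a file, replacing every NUL byte with another byte.
replace_nul <- function(in_path, out_path, replacement = as.raw(0x20)) {
  n_bytes <- file.info(in_path)$size            # total size in bytes
  bytes <- readBin(in_path, what = "raw", n = n_bytes)
  bytes[bytes == as.raw(0)] <- replacement      # swap NUL (0x00) for a space
  writeBin(bytes, out_path)
  invisible(out_path)
}

# Demonstration on a temporary CSV with one embedded NUL in a text field.
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,x"), as.raw(0), charToRaw("y\n")), tmp_in)

replace_nul(tmp_in, tmp_out)
cleaned <- read.csv(tmp_out, stringsAsFactors = FALSE)
```

Because the replacement happens on raw bytes, no string ever contains a NUL, so the usual readers can then be applied to the cleaned copy.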

Simon Urbanek
0

An update for May 2020: the tidyverse and data.table both still choke on null characters within files; however, the base::read.*() family and readLines() will gracefully skip them with the skipNul = TRUE option. You can read a file in, skipping over the null characters, and then write it back out again.
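A small sketch of that read-and-rewrite round trip, using a temporary file with one embedded NUL as a stand-in for the real data (the file contents here are made up):

```r
# Build a tiny CSV with an embedded NUL inside a text field.
tmp_in <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,x"), as.raw(0), charToRaw("y\n")), tmp_in)

# Option 1: read the lines while dropping NULs, then write a clean copy.
clean_lines <- readLines(tmp_in, skipNul = TRUE)
tmp_out <- tempfile(fileext = ".csv")
writeLines(clean_lines, tmp_out)
from_copy <- read.csv(tmp_out, stringsAsFactors = FALSE)

# Option 2: skip the intermediate file and pass skipNul straight to read.csv().
direct <- read.csv(tmp_in, skipNul = TRUE, stringsAsFactors = FALSE)
```

Note that skipNul drops the NUL bytes rather than replacing them, so text on either side of a NUL is joined together.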

D3SL