Is there a sed type package in R for removing embedded NULs?

Question

I am processing the US Weather Service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 file contains several rows with very large free-form text fields that contain embedded NUL characters, in an otherwise plain ASCII dataset. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).

R cannot handle these corrupted strings without errors, and this includes base data.frame functions as well as the data.table, stringr, and stringi packages (I tried them all).

I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R Markdown report with embedded code.

Suggestions?

Can you give an example of where such a NUL character occurs, or do you have a hex code for it? – Dominic Comtois, Mar 11 '15 at 07:23

Dominic Comtois · Answer 1 · 2015-03-11T08:49:39.717

3

Maybe this could be of help:

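# Open the (already decompressed) CSV as a text connection
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv", 
                open = "r")
# Read all lines, convert them to ASCII, and write a cleaned copy
writeLines(iconv(readLines(in.file), to = "ASCII"), 
           con = "StormEvents_ascii.csv")
# Release the connection once the cleaned copy is written
close(in.file)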

I was able to read the CSV file without errors with this call to read.table:

options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE, 
                           sep = ",", fill = TRUE, quote = '"')

Obviously you'd need to change the class of several columns, since all are considered character as it is.
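One possible way to do that conversion in bulk is type.convert() from the utils package, which guesses a class for each character column. The sketch below uses a small made-up data frame (the column names and values are placeholders, not taken from the real file) rather than the actual Storm Data.

```r
# Hypothetical stand-in for a data frame whose columns were all read as character
df <- data.frame(EVENT_ID   = c("101", "102"),
                 MAGNITUDE  = c("1.5", "2.75"),
                 EVENT_TYPE = c("Hail", "Tornado"),
                 stringsAsFactors = FALSE)

# Let R guess a class per column; as.is = TRUE keeps text as character, not factor
df[] <- lapply(df, type.convert, as.is = TRUE)

# Inspect the resulting column classes
classes <- vapply(df, class, character(1))
print(classes)
```

Columns that need special handling, such as dates, would still have to be converted explicitly.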

edited Mar 11 '15 at 08:49

answered Mar 11 '15 at 08:27

Dominic Comtois


Yes, that works! I had not tried iconv(), which CAN handle those NUL chars in strings, it seems. – Bill Mar 13 '15 at 09:02

score 1 · Answer 2 · edited May 23 '17 at 11:44

1

Just for posterity - you can use binary reads (readBin()) and replace the NULs with anything else - see Removing "NUL" characters (within R)
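A minimal sketch of that binary approach, using only base R, might look like the following. The helper name strip_nuls and the temporary files are illustrative assumptions, not part of the linked answer, and the input is assumed to be an uncompressed file small enough to fit in memory.

```r
# Hypothetical helper: copy a file while dropping every 0x00 byte
strip_nuls <- function(infile, outfile) {
  # Read the whole file as raw bytes; file.size() gives the byte count
  bytes <- readBin(infile, what = "raw", n = file.size(infile))
  is_nul <- bytes == as.raw(0)
  # Write back everything except the NUL bytes
  writeBin(bytes[!is_nul], outfile)
  # Return the number of NULs removed, invisibly
  invisible(sum(is_nul))
}

# Self-check on a small temporary file (not the real Storm Data)
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,x"), as.raw(0), charToRaw("y\n")), tmp_in)

removed <- strip_nuls(tmp_in, tmp_out)
cleaned <- readLines(tmp_out)
```

Here the NULs are simply deleted; substituting a space instead would mean assigning as.raw(32) to the matching positions before writing.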


answered Dec 11 '15 at 02:42

Simon Urbanek


score 0 · Answer 3 · answered May 04 '20 at 10:04

An update for May 2020: the tidyverse and data.table both still choke on NUL characters within files. However, base R's read.*() family (read.table(), read.csv(), and so on) and readLines() will gracefully skip them when called with skipNul = TRUE. You can read a file in, skipping over the NUL characters, and then write it back out again.
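As a rough illustration of the skipNul route, the sketch below builds a tiny file with one embedded NUL (made-up contents, standing in for the real 1999 file), reads it both ways, and writes a cleaned copy.

```r
# Toy CSV with a single embedded NUL in a text field (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,note\n1,bad"), as.raw(0),
           charToRaw("text\n2,fine\n")), tmp)

# readLines() silently drops the NUL when skipNul = TRUE
lines <- readLines(tmp, skipNul = TRUE)

# read.csv() forwards skipNul to read.table()
df <- read.csv(tmp, skipNul = TRUE, stringsAsFactors = FALSE)

# Write the cleaned lines back out so other readers can use the file
clean <- tempfile(fileext = ".csv")
writeLines(lines, clean)
```

The cleaned copy can then be passed to readers that do not accept NULs.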
