
How can I get fread() to set "" to a NA for all variables including character variables?

I am importing a .csv file where missing values are empty strings (""; no space). I want "" to be interpreted as the missing value NA and tried `na.strings = ""` without success:

data <- fread("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      ""            

On the other hand, when I use read.csv with na.strings = "", the "" are turned into NAs, even for character variables. This is the result I want.

data <- read.csv("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      NA

versions

  • R version 3.6.1 (2019-07-05)
  • data.table_1.12.8
Henrik
Danielle

1 Answer


Well, you can't if your csv file looks like this:

a,b
x,y
"",1

Note that whatever is inside the double quotes is treated as a string literal, because the double quotes are the quoting characters. In that sense, ,"", in a csv file just means an empty string, not a missing value (i.e. ,,). I would consider this a good feature for consistency. This is also described under na.strings in the documentation of fread:

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".

On the other hand, you may notice that if the file looks like this

a,b
1,y
"",1

then the empty string will be converted into NA. However, I don't think this is a bug, because the behaviour is probably a consequence of type coercion by the parser. In the Details section of the same documentation, you can see that

The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character.

So column a is first read as a character column and later converted into an integer one. The empty string is still read as is but coerced into an NA_integer_ in the second step.
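If NA is still wanted for empty strings in character columns, one option is to clean up after reading. The following is only a minimal sketch in base R: the helper name `blank_to_na` is hypothetical, the sample data frame stands in for the imported file, and the code assumes a plain data.frame (for example, the result of calling fread with `data.table = FALSE`).

```r
# Hypothetical helper (not part of data.table): replace "" with NA
# in every character column of a plain data.frame.
blank_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x[!is.na(x) & x == ""] <- NA_character_
    x
  })
  df
}

# Stand-in for the imported data: the quoted empty field arrived as "".
data <- data.frame(
  character_variable = c("abc", "def", ""),
  numeric_variable   = c(1, 2, 3),
  stringsAsFactors   = FALSE
)

data <- blank_to_na(data)
unique(data$character_variable)
# [1] "abc" "def" NA
```

Non-character columns are left untouched, so numeric columns keep whatever NA values the reader already produced.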

Henrik
ekoam
  • So in essence, fread() handles the "" differently than read.csv()? – Danielle Nov 12 '20 at 08:06
  • Yup, I would say so. – ekoam Nov 12 '20 at 08:07
  • Also, in my example the variable was a true character variable unlike c("",1), which in this case I'd want a numeric: c(NA, 1). Does the variable being truly character matter? – Danielle Nov 12 '20 at 08:08
  • Yes. As pointed out by the documentation of `fread`, the character type is the highest type possible, so there is no type coercion left to perform. You will just get an empty string in your resulting data frame. On the other hand, if a lower type is possible, the column will be further coerced into that type. That's why you may see c(NA,1) even if the column is actually something like c("",1). fread silently does a type coercion like `as.numeric(c("",1))` for you. – ekoam Nov 12 '20 at 08:14
  • I am happy to see people reading manuals so well. Yes, it is worth it! – jangorecki Nov 12 '20 at 19:12
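The coercion analogy from the comments can be tried directly in base R. This is only an illustration of the `as.numeric(c("",1))` comparison made above, not a description of how fread is implemented internally; the variable names are made up for the example.

```r
# A column holding "" and "1": converting it to numeric turns "" into NA.
mixed <- c("", "1")
as_number <- as.numeric(mixed)
is.na(as_number)
# [1]  TRUE FALSE

# A genuinely character column has no lower type to fall back to,
# so the empty string survives as an ordinary (non-missing) value.
texts <- c("abc", "def", "")
is.na(texts)
# [1] FALSE FALSE FALSE
```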