Example:

x <- data.frame(X = c("", ""), Y = 1:2, stringsAsFactors = FALSE)
write.csv(x, "/tmp/temp.txt", row.names = FALSE, quote = TRUE)

read.csv("/tmp/temp.txt")
   X Y
1 NA 1
2 NA 2

readr::read_csv("/tmp/temp.txt", col_types = list(readr::col_character(), readr::col_double()))
  X         Y
  <chr> <dbl>
1 NA        1
2 NA        2

I expect the X column to contain empty strings, but read.csv converts it to NA_logical_ even though the fields are quoted in the file (quote=T). I can find no parameter that lets me read the X column as empty strings. The same thing happens with data.table and readr.

Why does this happen?

Edit: I'm mostly looking for an explanation of why this happens, not a solution.

thc
  • I think the short answer is that when the variable type is not specified, R has to guess. An empty string is ambiguous (could mean empty string or missing value), so R defaults to the "lowest" type, which is `logical`. Perhaps someone with a strong grasp of R internals can elaborate. – neilfws Jan 29 '19 at 00:21
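
A minimal sketch of the guessing step described in the comment above, assuming the documented behaviour that read.csv reads every column as character and then passes it through type.convert, by which point the surrounding quotes have already been stripped. The variable names all_blank and mixed are illustrative only.

```r
# Every field is blank, so each value is treated as missing and the
# all-missing vector falls back to logical.
all_blank <- type.convert(c("", ""), as.is = TRUE)
class(all_blank)

# A single non-blank field makes the column character, and the blank
# field is then kept as an empty string rather than NA.
mixed <- type.convert(c("", "a"), as.is = TRUE)
class(mixed)
```

This suggests the NA comes from type conversion of an entirely blank column rather than from the quoting in the file.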

1 Answer

You can alter the colClasses argument to read.csv:

x <- read.csv("/tmp/temp.txt", colClasses = c(X = "character"))
str(x)
#'data.frame':  2 obs. of  2 variables:
# $ X: chr  "" ""
# $ Y: int  1 2
dave-edison
  • It doesn't work with readr: `readr::read_csv("/tmp/temp.txt", col_types = list(col_character(), col_double()))`. Bug? Also, why do I need to do this? What if I don't know the columns ahead of time? – thc Jan 29 '19 at 00:10
  • It looks like `readr::read_csv` gives you `NA_character_` rather than an empty string (even without the column specification). I think this is a feature rather than a bug. If you don't know the columns ahead of time you can always apply the transformations after you read in the data. – dave-edison Jan 29 '19 at 00:35
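
The "transform after reading" idea from the last comment could look roughly like the base R sketch below. It assumes that any column that comes back as all-NA logical was really a column of empty strings, which will not hold for every file. The name all_na_logical and the use of tempfile() in place of /tmp/temp.txt are choices made for this illustration.

```r
# Recreate the file from the question in a temporary location.
tmp <- tempfile(fileext = ".csv")
x <- data.frame(X = c("", ""), Y = 1:2, stringsAsFactors = FALSE)
write.csv(x, tmp, row.names = FALSE, quote = TRUE)

# Read without knowing the column types ahead of time.
df <- read.csv(tmp)

# Find columns that were guessed as logical and contain only NA
# (assumption: these were blank string columns in the file).
all_na_logical <- vapply(
  df,
  function(col) is.logical(col) && all(is.na(col)),
  logical(1)
)

# Replace those columns with empty strings of the same length.
df[all_na_logical] <- lapply(
  df[all_na_logical],
  function(col) rep("", length(col))
)

str(df)
unlink(tmp)
```

If genuinely missing columns need to stay NA, passing colClasses as in the answer is the safer choice, since this heuristic cannot tell the two cases apart.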