can't prevent NAs for empty cells in factor columns using readr

Question

I am trying to read a file with some empty cells and, as expected, I get NA for the empty cells. I have some special columns which can only contain the values '' or '+', so I would like to read these columns as factors by using

read_tsv('file.txt', 
         col_types=list(
             column_with_empty_cells=col_factor(c('','+'))))

But these columns still contain NAs. I could change the global behaviour of read_tsv through its na parameter, but that is not what I want: I want to change this only for specific columns.

Is there a way to convert these NAs directly to ''? I could certainly do this afterwards, but I am wondering whether I am using readr the wrong way.

EDIT Here is a test file

How do I actually upload a file? I could only attach images...

Hm, so `readr::col_factor(c('','+'), na=character())` gives me a `unused argument` error... — drmariod, Nov 14 '16 at 07:06
`col_factor(c('','+'), na=character())` does not give me an `unused argument` error, but it still does not do what you want `df <- read_tsv('file.txt', na = character(), col_types = list( column_with_empty_cells=col_factor(c('','+'))) ) ` — jmuhlenkamp, Nov 23 '16 at 16:35
Btw, my code above gives me this object `Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 28 obs. of 2 variables: $ test : chr "" "" "" "" ... $ column_with_empty_cells: Factor w/ 2 levels "","+": NA NA NA NA NA NA NA NA NA NA ...` From reading your question this still does not sound like what you are looking for. — jmuhlenkamp, Nov 23 '16 at 16:37

aldo_tapia · Accepted Answer · 2016-11-25T15:51:12.123

You can make a new function to solve this issue using lapply and factor:

library(readr)

read_tsv2 <- function(file, na.char = " ") {
  # Read the flag column as character so it can be recoded afterwards
  test <- read_tsv(file = file,
                   col_types = list(column_with_empty_cells = col_character()))
  test <- as.data.frame(test)
  names_tsv <- names(test)
  test <- lapply(test, function(x) {
    if (!all(is.na(x))) {
      # Replace NA with na.char, then convert the column to a factor
      x[is.na(x)] <- na.char
      factor(x, levels = unique(x))
    } else {
      # Leave columns that are entirely NA untouched
      x
    }
  })
  test <- do.call(cbind.data.frame, test)
  names(test) <- names_tsv
  test
}

file <- read_tsv2(file = "~/Downloads/file.txt", na.char = " ")

file

   test column_with_empty_cells
1  <NA>                        
2  <NA>                        
3  <NA>                        
4  <NA>                        
5  <NA>                        
6  <NA>                        
7  <NA>                        
8  <NA>                        
9  <NA>                        
10 <NA>                        
11 <NA>                        
12 <NA>                        
13 <NA>                        
14 <NA>                        
15 <NA>                        
16 <NA>                        
17 <NA>                        
18 <NA>                        
19 <NA>                        
20 <NA>                        
21 <NA>                        
22 <NA>                        
23 <NA>                        
24 <NA>                       +
25 <NA>                       +
26 <NA>                        
27 <NA>                        
28 <NA>                       +
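
If only specific columns should be recoded, a narrower variant might look like the sketch below. It assumes the file has already been read with the default na handling, so the empty cells arrive as NA. The helper name na_to_empty_factor and the small stand-in data frame are made up for illustration and are not part of readr.

```r
# Hypothetical helper (not part of readr): replace NA with "" in the named
# columns only, then convert those columns to factors with fixed levels.
na_to_empty_factor <- function(df, cols, levels = c("", "+")) {
  for (col in cols) {
    x <- as.character(df[[col]])
    x[is.na(x)] <- ""                      # empty cells were read as NA
    df[[col]] <- factor(x, levels = levels)
  }
  df
}

# Stand-in for what read_tsv() returns with its default na = c("", "NA")
df <- data.frame(
  test = c(NA, NA, NA),
  column_with_empty_cells = c(NA, "+", NA),
  stringsAsFactors = FALSE
)

df <- na_to_empty_factor(df, "column_with_empty_cells")
str(df)
```

Fixing the levels to c("", "+") instead of using unique(x) keeps both levels even when a file happens to contain no '+' at all. Columns not named in cols are left as they were.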

score 0 · Answer 2 · answered Nov 24 '16 at 07:55

Based on the documentation for readr, there is no way to pass different na arguments for subsets of columns, only a global specification. I assume this matters most when computational efficiency is necessary. In those cases, it might be worth making multiple calls to read_tsv: read one subset of columns with one na specification and skip all other columns, then repeat the process for the other subset of columns with a different na argument. Lastly, one could cbind the resulting data frames.
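
The two-pass idea described above could be sketched roughly as follows. The temporary file and the column names id and flag are invented for the example. The flag column is read as character rather than with col_factor, because the comments under the question report that the factor column still came back as NA even with na = character(), while the character column kept its empty strings.

```r
library(readr)

# Small stand-in for the test file: 'id' is numeric, 'flag' is '' or '+'
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tflag", "1\t", "NA\t+", "3\t"), path)

# Pass 1: columns that should keep the default NA handling
regular <- read_tsv(path, col_types = cols_only(id = col_integer()))

# Pass 2: columns in which no string should be treated as NA
flags <- read_tsv(path, na = character(),
                  col_types = cols_only(flag = col_character()))

# Combine both passes and convert the flag column to a factor
combined <- cbind(as.data.frame(regular), as.data.frame(flags))
combined$flag <- factor(combined$flag, levels = c("", "+"))
str(combined)
```

If the flag column still contains NA in your readr version, the NAs would need to be replaced with '' before the call to factor. Note that this approach parses the file once per subset of columns.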

This issue has not been raised with the readr developers. If you wish to submit it as an enhancement feel free to do so by generating a new issue at the project's repository: Readr.

score 0 · Answer 3 · answered Nov 27 '16 at 03:48

read_tsv is a special case of read_delim, and so is read_csv. read_tsv is designed specifically for tab-separated files, which is what your test file is. If you are not tied to readr, you can solve your problem easily with base R's read.csv.

In R versions before 4.0.0, read.csv converts character columns to factors by default (stringsAsFactors = TRUE), and it keeps empty cells in character columns as '' rather than NA.

To get the values as factor

read.csv("test.txt", sep = "\t", stringsAsFactors = TRUE)

To get the values as character

read.csv("test.txt", sep = "\t", stringsAsFactors = FALSE)

Sample dataframe read

EDIT 1

If you want " " to be treated as NA only in specific columns, you can use lapply to apply the conversion to just that list of columns. Based on your question, however, it looks like you want only empty cells to be treated as NA, and no other value should be coerced.
