I am trying to read a CSV file (~18,000,000 rows, ~1,000 columns) into arrow (in R) with open_dataset, pre-specifying a schema. In some places the CSV was generated incorrectly and values don't match the intended schema (for example, some cells that should hold the age (int) of an individual contain the individual's name (string) instead). I want any age that can't be parsed as an integer to be set to NA.

The default behaviour of open_dataset is to throw the following error:

 CSV conversion error to int8: invalid value

Is there a way to get a missing value (NA) instead of an error when a value can't be parsed according to the schema?

Here is an example of code that generates the error:

library(tidyverse)
library(arrow)

#Write csv 
tibble(age = c(1,2,"StackOverflow",5)) %>%
  write_csv("example.csv")

#Read the csv
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1) %>% 
  collect()

I know that I can specify null_values inside CsvConvertOptions if I know the offending values in advance, as follows:

arrow::open_dataset("example.csv",  format = "csv", schema = schema(age = int8()), skip = 1,
     convert_options = CsvConvertOptions$create(null_values = "StackOverflow")) %>% 
     collect()

However, this feels pretty inefficient: since I don't know the mistakes a priori, it seems I would need to go through the data twice (once to find the offending values and once to read the data with the schema set correctly).
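For reference, here is a minimal sketch of a single-pass workaround I have been considering. It is not a built-in arrow option: it reads the column as a string so the CSV reader has nothing to reject, blanks out anything that doesn't look like an integer, and only then casts. The regular expression and the use of as.integer() (which gives int32 rather than int8) are my own assumptions.

```r
library(arrow)
library(dplyr)
library(stringr)

# Same toy file as in the example above
writeLines(c("age", "1", "2", "StackOverflow", "5"), "example.csv")

ages <- open_dataset(
  "example.csv",
  format = "csv",
  schema = schema(age = string()),  # read as text so the CSV reader never fails
  skip = 1
) %>%
  mutate(
    # Keep only values that look like integers; everything else becomes NA
    age = if_else(str_detect(age, "^-?[0-9]+$"), age, NA_character_),
    # Cast the cleaned strings (assumption: int32 is acceptable here)
    age = as.integer(age)
  ) %>%
  collect()

ages
```

The filtering and the cast are part of the same lazy query, so the file is scanned once, but with ~1,000 columns the mutate would have to be repeated (or wrapped in across()) for every non-string column, which is why I would still prefer a reader-level option.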

Rodrigo Zepeda
  • RodrigoZepeda, I don't know if this is possible (I'm still learning arrow/parquet), but if it isn't, this is such a fundamental base function that should be present, would you be willing to write a [feature-request](https://github.com/apache/arrow/issues?q=is%3Aissue+is%3Aopen+csv+default+value)? I feel it would be of high value to many. In the interim, have you tried using `mutate(n = cast(n as int8()))` (or similar), as suggested [here](https://github.com/apache/arrow/issues/12469#issuecomment-1046246626)? – r2evans Sep 09 '22 at 16:00
