I am trying to read a csv (~18,000,000 rows, ~1,000 columns) into arrow (in R) with open_dataset, pre-specifying a schema. In some instances the csv was generated incorrectly and some values don't match the intended schema: for example, some cells that were supposed to hold the individual's age (int) instead contain the individual's name (string). My intention is to turn these ages that can't be parsed as integers into NA.
The default behaviour of open_dataset is to throw the following error:
CSV conversion error to int8: invalid value
Is there a way to get a missing value (NA) instead of an error whenever a value can't be parsed according to the schema?
Here is an example of code that generates the error:
library(tidyverse)
library(arrow)
# Write the csv
tibble(age = c(1, 2, "StackOverflow", 5)) %>%
  write_csv("example.csv")
# Read the csv with the pre-specified schema
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1) %>%
  collect()
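To make the desired behaviour concrete, this is the kind of result I'm after (only a sketch: it reads the column as string and coerces it after collecting, which defeats the purpose at this scale but shows the target output):
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = string()), skip = 1) %>%
  collect() %>%
  # base R coercion turns unparseable values into NA (warning suppressed)
  mutate(age = suppressWarnings(as.integer(age)))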
I know that I can specify the null_values inside CsvConvertOptions if I know them in advance, as follows:
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1,
convert_options = CsvConvertOptions$create(null_values = "StackOverflow")) %>%
collect()
However, this feels pretty inefficient: since I don't know the mistakes a priori, it seems I would need to go through the data twice (once to find the offending values and once more to read the file with the schema and those values set as nulls).
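For reference, the two-pass workaround I have in mind would look roughly like the sketch below (bad_values is just an illustrative name, and for simplicity the sketch ignores the default null strings that null_values would otherwise replace):
# Pass 1: read the column as string and collect the values that don't parse as integers
bad_values <- arrow::open_dataset("example.csv", format = "csv",
                                  schema = schema(age = string()), skip = 1) %>%
  collect() %>%
  filter(!is.na(age) & is.na(suppressWarnings(as.integer(age)))) %>%
  distinct(age) %>%
  pull(age)
# Pass 2: re-read with those values declared as nulls
arrow::open_dataset("example.csv", format = "csv", schema = schema(age = int8()), skip = 1,
                    convert_options = CsvConvertOptions$create(null_values = bad_values)) %>%
  collect()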