2

{arrow}s auto-detection of column types is causing me some trouble when opening a large csv file. In particular, it drops leading zeroes for some identifiers and does some other unfortunate stuff. As the dataset is quite wide (a few hundred cols) and I don't want to set all schema values manually, I would like to somehow programatically set it.

A good start would be to convert all columns to character when opening the dataset with arrow::open_dataset. Or correct the existing datase_connection$schema object for particular columns.

However, I was not able to find out how to do so.

Rob G.
  • 395
  • 1
  • 8

1 Answers1

4

When you use arrow::open_dataset() you can manually define a schema which determines the column names and types. I've pasted an example below, which shows the default behaviour of auto-detecting column names types first, and then using a schema to override this and specify your own column names and types. The example here does this programmatically as requested but you can define a schema by hand too.

library(arrow)

write_dataset(mtcars, "mtcars")

# opens the dataset with column detection
dataset <- open_dataset("mtcars")
dataset
#> FileSystemDataset with 1 Parquet file
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> 
#> See $metadata for additional Schema metadata

# define new schema automatically
chosen_schema <- schema(
  purrr::map(names(dataset), ~Field$create(name = .x, type = string()))
)

# now opens the dataset with the chosen schema
open_dataset("mtcars", schema = chosen_schema) 
#> FileSystemDataset with 1 Parquet file
#> mpg: string
#> cyl: string
#> disp: string
#> hp: string
#> drat: string
#> wt: string
#> qsec: string
#> vs: string
#> am: string
#> gear: string
#> carb: string
thisisnic
  • 820
  • 5
  • 10
  • how can I set the col types for all cols to string()? Without having to type all column names? A hacky way would be to first open a connection, get the $schema and perhaps manipulate it somehow by iteration and then open another connection with the new schema. Is there a better way? Also is there a way to specify a default other than null() when auto detection fails? – Rob G. Mar 01 '22 at 08:38
  • I'll update my answer above. Can you expand on what you mean by "a default other than null()", i.e. provide a reprex? – thisisnic Mar 01 '22 at 08:51
  • providing a reprex is not possible in this case. What I meant was that i noticed that in some cases the column type is identified as null() even though it is string or something else. So i assumed that it would take null() in the case the column type cannot be inferred by arrow. – Rob G. Mar 01 '22 at 08:58
  • Ahh, the Field$create method is useful. I did not see it in the function reference, thanks! – Rob G. Mar 01 '22 at 09:06
  • 1
    There will be some specific reason or property of your data which leads to it being defined as null() though, as this is not the usual thing. Perhaps the column is empty or missing in some of the files of your dataset and one of these is the first file being read in and so it's the one the schema is being generated from? Either way, it probably warrants discussion in a new issue. – thisisnic Mar 01 '22 at 09:08
  • Thanks for the hint, I will dig a bit into it and see if I can spot the issue myself before opening a new one. If I open one, I will try to provide a reprex for it. – Rob G. Mar 01 '22 at 09:12