I have multiple .csv files that I am trying to read with arrow::open_dataset(), but it throws an error due to column type inconsistencies across the files.
I found this question, which is mostly related to my problem, but I am trying a slightly different approach.
I want to use arrow's type autodetection on one sample CSV file, since figuring out the types of all the columns by hand is time-consuming. Then I take the detected schema, correct the few columns that cause problems, and use the updated schema to read all the files.
Below is my approach:
data <- read_csv_arrow('data.csv.gz', as_data_frame = FALSE) # has more than 30 columns
sch = data$schema
print(sch)
Schema
trade_id: int64
secid: int64
side: int64
...
nonstd: int64
flags: string
I would like to change the trade_id column type from int64 to string and leave the other columns as they are.
How can I update the schema?
I'm using the R arrow package, but I guess answers using pyarrow could be applicable as well.