I have multiple .csv files that I am trying to read with arrow::open_dataset(), but it throws an error due to column type inconsistencies across the files.
I found this question, which is mostly related to my problem, but I am trying a slightly different approach.
I want to use arrow's type autodetection on one sample CSV file, since figuring out the types of all the columns by hand is time-consuming. Then I take the detected schema, correct the few columns that cause problems, and use the updated schema to read all the files.
Below is my approach:
data <- read_csv_arrow('data.csv.gz', as_data_frame = FALSE) # has more than 30 columns
sch = data$schema
print(sch)
Schema
trade_id: int64
secid: int64
side: int64
...
nonstd: int64
flags: string
I would like to change the trade_id column type from int64 to string and leave the other columns as they are.
How can I update the schema?
I'm using the R arrow package, but I guess answers using pyarrow could be applicable as well.