The original dataset should contain only 28 columns, but arrow finds 29.
The code is as follows:
library(arrow)
library(dplyr)
schema1 <- arrow::schema(Key = int64(),
Sex = string(),
Age = int64(),
Date = date32(),
ED = string(),
N = string(),
I = string(),
AD = date32(),
DD = date32(),
E = string(),
EH = string(),
EH_1 = string(),
EH_2 = binary(),
M = string(),
M_1 = string(),
M_2 = timestamp(),
HN = string(),
W_1 = string(),
W_2 = string(),
W_3 = string(),
Clin_1 = string(),
Clin_2 = string(),
Clin_3 = string(),
D = string(),
P = string(),
A_1 = string(),
A_2 = string(),
A_3 = string())
file.list <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
ds1 <- open_dataset(file.list, schema = schema1, format = "text", skip = 1, delim = ",", unify_schemas = TRUE)
ds <- ds1 %>% filter(A_3 == "Y") %>% collect()
Running this produced the following error:
> ds1 <- open_dataset(file.list,schema = schema1,format ="text",skip = 1,delim = ",",unify_schemas = T)
> ds = ds1 %>% filter(A_3 =='Y') %>% collect
Error in `compute.arrow_dplyr_query()`:
! Invalid: Could not open CSV input source 'C:/Users/xxx.csv': Invalid: CSV parse error: Row #2: Expected 28 columns, got 29: 11,M,22,2000-04-05,N,XY123456,CCC,2011-11-11,2011-11-12,456123789,C - CAT,2011-11-11 20 ...
Run `rlang::last_error()` to see where the error occurred.
Note that the data shown in the error above is made up.
Following the suggestion from shs, I ran:
d <- read.csv(file.list[23], nrows = 10, fileEncoding = "latin1")
Reading the CSV this way gives ncol(d) = 29. However, when I open the same CSV in Excel, there is no 29th column. Is this a bug in Excel?
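One way to check for a column that Excel does not display is to count the delimited fields on each raw line with base R's count.fields(). The sketch below is not from the original post: it uses a hypothetical temporary file with a trailing comma on every row, which is one common cause of an extra, empty column. Whether that applies to the real files is an assumption.

```r
# Hypothetical sample: 3 named columns, but every row ends with a trailing comma
tmp <- tempfile(fileext = ".csv")
writeLines(c("Key,Sex,Age,", "11,M,22,", "12,F,30,"), tmp)

# Number of comma-separated fields on each line (double quotes respected)
n_fields <- count.fields(tmp, sep = ",", quote = "\"")

# Compare with the number of columns the schema expects (3 in this toy example)
expected <- 3L
bad_lines <- which(n_fields != expected)
print(n_fields)
print(bad_lines)
```

For the real data, the same call could be run on file.list[23] with expected set to 28.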
After deleting the empty column from the CSV, the error message changed to a different one:
Error in `compute.arrow_dplyr_query()`:
! Invalid: Could not open CSV input source 'C:/Users/xxx.csv': Invalid: In CSV column #3: Row #2: CSV conversion error to date32[day]: invalid value '5/4/2000'
ℹ If you have supplied a schema and your data contains a header row, you should supply the argument `skip = 1` to prevent the header being read in as data.
However, according to the schema, column 3 should be Age = int64(), not a date.
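The value '5/4/2000' in the second error is not in ISO 8601 form (YYYY-MM-DD), unlike the 2000-04-05 shown in the first error, so the file may have been re-saved with a different date format when it was edited. If arrow numbers CSV columns from zero in this message (an assumption, not confirmed here), column #3 would be the fourth field, Date. As a minimal base R sketch, assuming the values are in day/month/year order, such strings can be converted back to ISO dates:

```r
# Hypothetical values in the format seen in the error message
raw_dates <- c("5/4/2000", "11/11/2011")

# Assumption: day/month/year order; use "%m/%d/%Y" if the files are month-first
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# format() on a Date gives the ISO form
iso <- format(parsed, "%Y-%m-%d")
print(iso)
```

An alternative would be to declare the date fields as string() in the schema and convert them after collect(), which avoids editing the CSV files themselves.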