arrow in R reads more columns than the original data contains

Question

The original dataset should only contain 28 columns but arrow returns 29 columns.

The code is below:

library(arrow)
library(dplyr)

# Expected schema: 28 columns
schema1 <- arrow::schema(Key = int64(),
                         Sex = string(),
                         Age = int64(),
                         Date = date32(),
                         ED = string(),
                         N = string(),
                         I = string(),
                         AD = date32(),
                         DD = date32(),
                         E = string(),
                         EH = string(),
                         EH_1 = string(),
                         EH_2 = binary(),
                         M = string(),
                         M_1 = string(),
                         M_2 = timestamp(),
                         HN = string(),
                         W_1 = string(),
                         W_2 = string(),
                         W_3 = string(),
                         Clin_1 = string(),
                         Clin_2 = string(),
                         Clin_3 = string(),
                         D = string(),
                         P = string(),
                         A_1 = string(),
                         A_2 = string(),
                         A_3 = string())

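# list.files() takes a regular expression, not a glob
file.list <- list.files(pattern = "\\.csv$")

# skip = 1 drops the header row because a schema is supplied
ds1 <- open_dataset(file.list, schema = schema1, format = "text", skip = 1, delim = ",", unify_schemas = TRUE)

ds <- ds1 %>% filter(A_3 == "Y") %>% collect()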

After that, the following error was shown:

> ds1 <- open_dataset(file.list,schema = schema1,format ="text",skip = 1,delim = ",",unify_schemas = T)
> ds = ds1 %>% filter(A_3 =='Y') %>% collect
Error in `compute.arrow_dplyr_query()`:
! Invalid: Could not open CSV input source 'C:/Users/xxx.csv': Invalid: CSV parse error: Row #2: Expected 28 columns, got 29: 11,M,22,2000-04-05,N,XY123456,CCC,2011-11-11,2011-11-12,456123789,C - CAT,2011-11-11 20 ...
Run `rlang::last_error()` to see where the error occurred.

Note that the data shown in the error above is made up, not the real values.

Following the suggestion from shs (see the comments), I read the first rows of one file:

d <- read.csv(file.list[23], nrows = 10, fileEncoding = "latin1")

I found that ncol(d) is 29. However, when I open the CSV in Excel, there is no 29th column. Is this a bug in Excel?
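A minimal sketch of how the extra column could be checked without Excel, assuming the cause is a trailing comma on each line as suggested in the comments. The temporary file and its contents are hypothetical stand-ins for the real CSVs; `count.fields()` and `readLines()` are base R.

```r
# Hypothetical stand-in for one of the real files: every line ends in a comma
tmp <- tempfile(fileext = ".csv")
writeLines(c("Key,Sex,Age,", "11,M,22,"), tmp)

# Number of fields the parser sees on each line
n_fields <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")

# TRUE for each line whose raw text ends in a comma
has_trailing_comma <- grepl(",\\s*$", readLines(tmp))

print(n_fields)
print(has_trailing_comma)
```

Running the same two checks on `file.list[23]` would show whether every line reports one field more than the 28 in the schema, and whether that extra field comes from a trailing comma.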

After I deleted the empty column from the CSV, the error message changed to a different one:

Error in `compute.arrow_dplyr_query()`:
! Invalid: Could not open CSV input source 'C:/Users/xxx.csv': Invalid: In CSV column #3: Row #2: CSV conversion error to date32[day]: invalid value '5/4/2000'
ℹ If you have supplied a schema and your data contains a header row, you should supply the argument `skip = 1` to prevent the header being read in as data.

However, according to the schema, column 3 should be Age = int64(), not a date column.
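One possible reading, offered as an assumption rather than a confirmed diagnosis: if the column index in the message is zero-based, column #3 is the fourth field, Date, which is declared as date32(). The first error showed 2000-04-05 in that position, while the new error shows '5/4/2000', which is not in ISO year-month-day form. The sketch below only illustrates that reading in base R; the day/month order is assumed from the earlier value.

```r
# First five field names, copied in order from the schema above
fields <- c("Key", "Sex", "Age", "Date", "ED")

# Assumption: the index in the error message counts from zero
zero_based_index <- 3
suspect <- fields[zero_based_index + 1]

# Parse the value from the error message as day/month/year (assumed order)
parsed <- as.Date("5/4/2000", format = "%d/%m/%Y")
iso <- format(parsed, "%Y-%m-%d")

print(suspect)
print(iso)
```

If this reading is right, the date columns would have to be back in year-month-day form, or be read as strings and converted afterwards, before the date32() fields in the schema can be parsed.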

Did you write the CSVs yourself? There may be an unescaped comma in a string variable. Since the error message tells you which file and row it is, you could import just the first couple of lines with `readr::read_csv()` and limit the imported rows with the `n_max` argument. Then check how the names of the imported data frame deviate from your expectation. - shs, Mar 29 '23 at 08:05
No, the datasets came from an organization and I did not edit them. I will try to use read_csv() to figure out what happened, thx! - doraemon, Mar 29 '23 at 08:11
There may be trailing commas at the end of each line. These would be read as a variable with only empty values, and in Excel it would look as if the column did not exist. - shs, Mar 29 '23 at 08:25
