I am using arrow to combine two Feather files.

#data collection_from oah

setwd('C:/Users/')

library(readxl)
library(tidyverse)
library(magrittr)
library(arrow)
library(data.table)
library(lubridate)
library(vctrs)
library(stringr)
################################################################################
#arrow::open_dataset

file.list <- list.files(pattern='*.csv')

# the schema must be defined before open_dataset() refers to it
schema1 = arrow::schema(`Key` = int64(),
                        `Sex` = string(),
                        `Age` = int64(),
                        `Date of Birth` = date32(),
                        `Institution` = string(),
                        `Admission Date` = date32(),
                        `Discharge Date` = date32(),
                        `Elderly Home` = string(),
                        `Paycode` = string())

# read every csv except the 23rd and 27th, skipping each file's header row
ds1 <- open_dataset(file.list[-c(23,27)],schema = schema1,format ="csv", skip = 1)

ds = ds1 %>% collect

ds %>% arrow::write_feather('C:/Users/ds.feather')

ds = arrow::read_feather('C:/Users/ds.feather')

ds2 = map(file.list[c(23,27)],\(.x)read.csv(.x))%>% do.call(rbind,.)

names(ds2) = names(ds)

ds3 = map(names(ds2),\(.x)eval(parse(text=paste('as.character(ds2$`',.x,'`)',sep='')))) %>% bind_cols

names(ds3) = names(ds)

ds = map(ds,as.character)

ds = bind_cols(ds)

ds %<>% as.data.frame

ds3 %>% write_feather('C:/Users/ds1.feather')

ds %>% write_feather('C:/Users/ds2.feather')

file.list1 <- list.files(pattern='*.feather')

ds1 <- open_dataset(file.list1,format ="feather")

The reason I read the 23rd and 27th CSV files separately is that their formats are not consistent with the other files, so I have no choice but to read them individually.
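
As an aside, the step that converts every column to character (so that the two pieces share the same column types) can be written without `eval(parse(...))`. The following is only a minimal base-R sketch: the toy data frames and the helper name `to_character` are made up for illustration and are not the real data.

```r
# Toy stand-ins for the two data frames in the question (made-up values).
ds  <- data.frame(Key = 1:2, Age = c(80L, 91L), Sex = c("F", "M"))
ds2 <- data.frame(Key = c("3", "4"), Age = c("77", "unknown"), Sex = c("M", "F"))

# Hypothetical helper: coerce every column to character.
# Assigning into df[] keeps the data.frame structure and the column names.
to_character <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

ds_chr  <- to_character(ds)
ds2_chr <- to_character(ds2)

# The column types now agree, so the two pieces can be stacked.
combined <- rbind(ds_chr, ds2_chr)
str(combined)
```

The same helper could be applied to the real `ds` and `ds2` before writing them out, assuming their column names already match.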

However, when I used open_dataset, an error was shown:

> ds1 <- open_dataset(file.list1,format ="feather")
Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from 'C:/Users/ds1.feather': Could not open IPC input source 'C:/Users/ds1.feather': Not an Arrow file. Is this a 'ipc' file?

I am sure that I saved ds1.feather in Feather format, as I used write_feather to save it. I do not know why open_dataset cannot read the .feather file correctly.
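
One thing that may be worth ruling out (this is a guess, not a confirmed cause): the session info lists the `feather` package as attached in addition to `arrow`, and both packages export a function called `write_feather()`. An unqualified call goes to whichever one comes first on the search path. The sketch below uses `utils::find()` to list who defines a name; the helper name `who_defines` is hypothetical.

```r
# Hypothetical helper: list every attached package/environment that defines
# a function called `fn_name`, in search-path order. The first entry is the
# one an unqualified call would pick up.
who_defines <- function(fn_name) {
  utils::find(fn_name, mode = "function")
}

# In the session from the question, this would show whether write_feather()
# comes from arrow, from feather, or from both (character(0) if neither is attached).
who_defines("write_feather")

# A base function, shown only for comparison.
who_defines("lapply")
```

If more than one package shows up, calling `arrow::write_feather()` explicitly removes the ambiguity.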

Session info:

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
#I deleted due to privacy

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] feather_0.3.5     vctrs_0.5.2       data.table_1.14.8 arrow_11.0.0.2    magrittr_2.0.3    lubridate_1.9.2   forcats_1.0.0     stringr_1.5.0    
 [9] dplyr_1.1.0       purrr_1.0.1       readr_2.1.4       tidyr_1.3.0       tibble_3.1.8      ggplot2_3.4.1     tidyverse_2.0.0   readxl_1.4.2     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10             santoku_0.9.0           lattice_0.20-45         assertthat_0.2.1        digest_0.6.31           utf8_1.2.3             
 [7] R6_2.5.1                cellranger_1.1.0        plyr_1.8.8              repr_1.1.6              extraInserts_0.0.0.9003 backports_1.4.1        
[13] evaluate_0.20           pillar_1.8.1            rlang_1.0.6             rstudioapi_0.14         checkmate_2.1.0         rmarkdown_2.20         
[19] htmlwidgets_1.6.1       bit_4.0.5               munsell_0.5.0           compiler_4.2.2          metR_0.13.0             janitor_2.2.0          
[25] xfun_0.37               base64enc_0.1-3         pkgconfig_2.0.3         lemon_0.4.6             htmltools_0.5.4         tidyselect_1.2.0       
[31] gridExtra_2.3           fansi_1.0.4             viridisLite_0.4.1       withr_2.5.0             tzdb_0.3.0              grid_4.2.2             
[37] jsonlite_1.8.4          gtable_0.3.1            lifecycle_1.0.3         scales_1.2.1            zip_2.2.2               cli_3.6.0              
[43] stringi_1.7.12          cachem_1.0.6            viridis_0.6.2           snakecase_0.11.0        skimr_2.1.5             ellipsis_0.3.2         
[49] generics_0.1.3          openxlsx_4.2.5.2        ggeasy_0.1.3            tools_4.2.2             bit64_4.0.5             glue_1.6.2             
[55] hms_1.1.2               fastmap_1.1.0           yaml_2.3.7              timechange_0.2.0        colorspace_2.1-0        memoise_2.0.1  

The version of arrow that I used:

> arrow::arrow_info()
Arrow package version: 11.0.0.2

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
                       
arrow.use_threads FALSE

Memory:
                  
Allocator mimalloc
Current    8.21 Gb
Max       11.12 Gb

Runtime:
                          
SIMD Level          avx512
Detected SIMD Level avx512

Build:
                                                             
C++ Library Version                                    11.0.0
C++ Compiler                                              GNU
C++ Compiler Version                                   10.3.0
Git ID               58286965ec6974f700ff9fe3f7dcbe56095878d7

The sizes of these Feather files are 4 GB and 1 GB.

doraemon
  • Just to check, if you open the file with `arrow::read_feather()`, what message do you get? – thisisnic Mar 30 '23 at 07:48
  • arrow::read_feather('xxx.feather') Error: file must be a "InputStream" – doraemon Mar 30 '23 at 07:54
  • But the window file system shows that it is a feather file... – doraemon Mar 30 '23 at 07:55
  • ds3 %>% write_feather('C:/Users/xxx.feather') I use this code to generate the feather file. – doraemon Mar 30 '23 at 07:56
  • Can I also double check which version of Arrow you're using? It'll be in `arrow::arrow_info()`. – thisisnic Mar 30 '23 at 07:56
  • I will update the code that I use in r...Please wait a minute. I need some time to delete some confidential information – doraemon Mar 30 '23 at 08:01
  • Updated the complete code – doraemon Mar 30 '23 at 08:12
  • Would you mind try running this? `rf <- arrow::ReadableFile$create(path)` `fr <- arrow::FeatherReader$create(rf)` `fr` (where `path` is the path to one of your files) (apologies, can't get it to format that code properly - that should be 3 separate lines) – thisisnic Mar 30 '23 at 10:09
  • FYI, *"window file system"* means nothing, it is based _entirely_ on the file (name) extension, not on its contents. You can create a file in notepad, then rename it from `file.txt` to `file.feather`, and Windows will happily tell you that the file is of type feather when it is clearly not. Assuming you have Rtools installed, what does `system("file C:/Users/xxx.feather")` return? (using the real path/filename). On my system using a `mt.feather` file I just created, it returns the elusive `mt.feather: data`, but many other files have meaningful results. – r2evans Mar 30 '23 at 12:19
  • Sorry for late reply. I just left the office. I will run the code you provided tomorrow. Thank you so much!! I appreciate your help!! – doraemon Mar 30 '23 at 12:34
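
Following the point in the comments that the extension says nothing about a file's contents, here is a minimal base-R sketch that looks at the first bytes of a file instead. It rests on the assumption that Arrow IPC files (Feather V2) begin with the bytes `ARROW1` and that legacy Feather V1 files begin with `FEA1`. The helper name `feather_kind` is hypothetical, and the demo files are throwaway stand-ins, not real Feather files.

```r
# Hypothetical helper: classify a file by its leading "magic" bytes
# rather than by its extension.
feather_kind <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(magic) {
    m <- charToRaw(magic)
    length(head_bytes) >= length(m) &&
      identical(head_bytes[seq_along(m)], m)
  }
  if (starts_with("ARROW1")) {
    "Arrow IPC file (Feather V2)"
  } else if (starts_with("FEA1")) {
    "Feather V1 file"
  } else {
    "neither Arrow IPC nor Feather V1"
  }
}

# Demo on throwaway files that contain only the leading bytes.
tmp_v2  <- tempfile(fileext = ".feather")
tmp_v1  <- tempfile(fileext = ".feather")
tmp_txt <- tempfile(fileext = ".feather")
writeBin(c(charToRaw("ARROW1"), as.raw(c(0, 0))), tmp_v2)
writeBin(charToRaw("FEA1"), tmp_v1)
writeBin(charToRaw("plain text"), tmp_txt)

kinds <- vapply(c(tmp_v2, tmp_v1, tmp_txt), feather_kind, character(1),
                USE.NAMES = FALSE)
print(kinds)
```

Running `feather_kind()` on the real path of ds1.feather would indicate whether the file on disk is in the format that `open_dataset(format = "feather")` expects.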

0 Answers