I am using arrow to combine two Feather files.
#data collection_from oah
setwd('C:/Users/')
library(readxl)
library(tidyverse)
library(magrittr)
library(arrow)
library(data.table)
library(lubridate)
library(vctrs)
library(stringr)
################################################################################
#arrow::open_dataset
file.list <- list.files(pattern='*.csv')
# define the schema before it is passed to open_dataset()
schema1 = arrow::schema(`Key` = int64(),
                        `Sex` = string(),
                        `Age` = int64(),
                        `Date of Birth` = date32(),
                        `Institution` = string(),
                        `Admission Date` = date32(),
                        `Discharge Date` = date32(),
                        `Elderly Home` = string(),
                        `Paycode` = string())
ds1 <- open_dataset(file.list[-c(23,27)], schema = schema1, format = "csv", skip = 1)
ds = ds1 %>% collect
ds %>% arrow::write_feather('C:/Users/ds.feather')
ds = arrow::read_feather('C:/Users/ds.feather')
ds2 = map(file.list[c(23,27)],\(.x)read.csv(.x))%>% do.call(rbind,.)
names(ds2) = names(ds)
ds3 = map(ds2, as.character) %>% bind_cols
names(ds3) = names(ds)
ds = map(ds,as.character)
ds = bind_cols(ds)
ds %<>% as.data.frame
ds3 %>% write_feather('C:/Users/ds1.feather')
ds %>% write_feather('C:/Users/ds2.feather')
file.list1 <- list.files(pattern='*.feather')
ds1 <- open_dataset(file.list1,format ="feather")
The reason I read the 23rd and 27th CSV files separately is that their formats are not consistent with the other files, so I had no choice but to read them individually.
However, when I called open_dataset, an error was shown:
> ds1 <- open_dataset(file.list1,format ="feather")
Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from 'C:/Users/ds1.feather': Could not open IPC input source 'C:/Users/ds1.feather': Not an Arrow file. Is this a 'ipc' file?
I am sure that I saved ds1.feather in Feather format, as I used write_feather to write it. I do not know why open_dataset cannot read the .feather file correctly...
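One check that might narrow this down is to look at the leading bytes of the file. As far as I understand, an Arrow IPC file (Feather V2) starts with the magic bytes ARROW1, while a legacy Feather V1 file starts with FEA1. The sketch below is only a rough diagnostic: feather_kind is a hypothetical helper I made up for illustration (it is not part of arrow), and the demo runs on throwaway temp files that merely mimic those leading bytes rather than on my real data.

```r
# Hypothetical helper (not an arrow function): guess the on-disk format
# from the file's leading magic bytes.
feather_kind <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 6)
  if (length(magic) >= 6 && identical(magic, charToRaw("ARROW1"))) {
    "arrow_ipc"    # Arrow IPC file, i.e. Feather V2
  } else if (length(magic) >= 4 && identical(magic[1:4], charToRaw("FEA1"))) {
    "feather_v1"   # legacy Feather V1
  } else {
    "unknown"
  }
}

# Demo on temp files that only imitate the magic bytes (assumption: real
# files written by either writer begin the same way).
tmp_v2 <- tempfile(fileext = ".feather")
writeBin(c(charToRaw("ARROW1"), as.raw(c(0, 0))), tmp_v2)
tmp_v1 <- tempfile(fileext = ".feather")
writeBin(charToRaw("FEA1"), tmp_v1)

kind_v2 <- feather_kind(tmp_v2)
kind_v1 <- feather_kind(tmp_v1)
```

Running feather_kind('C:/Users/ds1.feather') on the real file would then hint at which writer produced it. That may matter here because my session has both arrow and feather attached, and both packages export a function named write_feather.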
Session info:
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
#I deleted due to privacy
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] feather_0.3.5 vctrs_0.5.2 data.table_1.14.8 arrow_11.0.0.2 magrittr_2.0.3 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[9] dplyr_1.1.0 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.1 tidyverse_2.0.0 readxl_1.4.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 santoku_0.9.0 lattice_0.20-45 assertthat_0.2.1 digest_0.6.31 utf8_1.2.3
[7] R6_2.5.1 cellranger_1.1.0 plyr_1.8.8 repr_1.1.6 extraInserts_0.0.0.9003 backports_1.4.1
[13] evaluate_0.20 pillar_1.8.1 rlang_1.0.6 rstudioapi_0.14 checkmate_2.1.0 rmarkdown_2.20
[19] htmlwidgets_1.6.1 bit_4.0.5 munsell_0.5.0 compiler_4.2.2 metR_0.13.0 janitor_2.2.0
[25] xfun_0.37 base64enc_0.1-3 pkgconfig_2.0.3 lemon_0.4.6 htmltools_0.5.4 tidyselect_1.2.0
[31] gridExtra_2.3 fansi_1.0.4 viridisLite_0.4.1 withr_2.5.0 tzdb_0.3.0 grid_4.2.2
[37] jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3 scales_1.2.1 zip_2.2.2 cli_3.6.0
[43] stringi_1.7.12 cachem_1.0.6 viridis_0.6.2 snakecase_0.11.0 skimr_2.1.5 ellipsis_0.3.2
[49] generics_0.1.3 openxlsx_4.2.5.2 ggeasy_0.1.3 tools_4.2.2 bit64_4.0.5 glue_1.6.2
[55] hms_1.1.2 fastmap_1.1.0 yaml_2.3.7 timechange_0.2.0 colorspace_2.1-0 memoise_2.0.1
The version of arrow that I used:
> arrow::arrow_info()
Arrow package version: 11.0.0.2
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc FALSE
mimalloc TRUE
Arrow options():
arrow.use_threads FALSE
Memory:
Allocator mimalloc
Current 8.21 Gb
Max 11.12 Gb
Runtime:
SIMD Level avx512
Detected SIMD Level avx512
Build:
C++ Library Version 11.0.0
C++ Compiler GNU
C++ Compiler Version 10.3.0
Git ID 58286965ec6974f700ff9fe3f7dcbe56095878d7
The sizes of these Feather files are 4 GB and 1 GB.