I tried to read the same .csv file using different functions in R (base::read.csv(), readr::read_csv(), data.table::fread(), and arrow::read_csv_arrow()), but the same file leads to very different sizes in memory. See an example below:
library(nycflights13)
library(readr)
library(data.table)
library(arrow)
library(dplyr)
library(lobstr)
# write the flights data to disk so every reader parses the identical file
fl_original = nycflights13::flights
fwrite(fl_original, 'nycflights13_flights.csv')
fl_baseR = read.csv('nycflights13_flights.csv')
fl_readr = readr::read_csv('nycflights13_flights.csv')
fl_data.table = data.table::fread('nycflights13_flights.csv')
fl_arrow = arrow::read_csv_arrow('nycflights13_flights.csv')
# in-memory size reported for each object
lobstr::obj_size(fl_baseR) # 33.12 MB
lobstr::obj_size(fl_readr) # 51.43 MB
lobstr::obj_size(fl_data.table) # 32.57 MB
lobstr::obj_size(fl_arrow) # 21.56 MB
# classes of each object
class(fl_baseR) # "data.frame"
class(fl_readr) # "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(fl_data.table) # "data.table" "data.frame"
class(fl_arrow) # "tbl_df" "tbl" "data.frame"
Reading the exact same file, the object created by arrow::read_csv_arrow() uses only about 42% of the memory of the one created by readr::read_csv(), even though the classes are similar (they all include data.frame). My hunch is that the difference comes from the column types (something like float32 vs. float64) and from metadata attached to the objects, but I'm not clear on the details, and the size of the gap surprised me quite a bit.
Any clues or suggestions for further reading would be greatly appreciated.