
I tried to read the same .csv file using different functions in R (base::read.csv(), readr::read_csv(), data.table::fread(), and arrow::read_csv_arrow()), and the same file produces objects of very different sizes in memory. See the example below:

library(nycflights13)
library(readr)
library(data.table)
library(arrow)
library(dplyr)
library(lobstr)


# write the flights data to disk once, then read it back with the four readers
fl_original = nycflights13::flights
fwrite(fl_original, 'nycflights13_flights.csv')


fl_baseR = read.csv('nycflights13_flights.csv')
fl_readr = readr::read_csv('nycflights13_flights.csv')
fl_data.table = data.table::fread('nycflights13_flights.csv')
fl_arrow = arrow::read_csv_arrow('nycflights13_flights.csv')

lobstr::obj_size(fl_baseR) # 33.12 MB
lobstr::obj_size(fl_readr) # 51.43 MB
lobstr::obj_size(fl_data.table) # 32.57 MB
lobstr::obj_size(fl_arrow) # 21.56 MB

class(fl_baseR) # "data.frame"
class(fl_readr) # "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
class(fl_data.table) # "data.table" "data.frame"
class(fl_arrow) # "tbl_df"     "tbl"        "data.frame"

Reading the exact same file, the memory used by the object from arrow::read_csv_arrow() is only ~42% of the memory used by the object from readr::read_csv(), even though the classes are similar (they all include data.frame as a class). My hunch is that the difference is related to the column types (something like float32 vs. float64) and to metadata attached to the objects, but I'm not very clear on the details. The size of the gap surprised me quite a bit.
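
I suppose a per-column comparison along these lines could narrow it down (a sketch I have not fully chased; it only uses the objects and packages loaded above):

sapply(fl_baseR, typeof)      # storage type of every column, per reader
sapply(fl_readr, typeof)
sapply(fl_data.table, typeof)
sapply(fl_arrow, typeof)

# per-column sizes would show which columns account for the overall gap
sapply(fl_readr, lobstr::obj_size)
sapply(fl_arrow, lobstr::obj_size)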

Any clues or pointers to further reading would be greatly appreciated.

Miao Cai
    *"while the data classes are similar (they all include data.frame as a class"*. The `data.frame` is just a container for the columns. Check out the column classes to make sure they are the same. And the non-vanilla data frames do have more stuff there... if you want to compare apples to apples, convert all of them to base data frames with `as.data.frame()` and see how much things change. – Gregor Thomas Jul 31 '22 at 02:37
  • @GregorThomas Not sure if `as.data.frame()` is the right function to use. I converted all four into data frames, and the object sizes did not change at all:
    fl_baseR_df = as.data.frame(fl_baseR)
    fl_readr_df = as.data.frame(fl_readr)
    fl_data.table_df = as.data.frame(fl_data.table)
    fl_arrow_df = as.data.frame(fl_arrow)
    lobstr::obj_size(fl_baseR_df)      # 33.12 MB
    lobstr::obj_size(fl_readr_df)      # 51.43 MB
    lobstr::obj_size(fl_data.table_df) # 32.57 MB
    lobstr::obj_size(fl_arrow_df)      # 21.56 MB
    – Miao Cai Jul 31 '22 at 03:03
  • Hi @MiaoCai; I'm really not sure what you're asking here. You are comparing apples with oranges. For example, `readr::read_csv` returns a `tibble` with additional column specifications, `data.table::fread` returns a `data.table`, and `arrow::read_csv_arrow` returns a vanilla `tibble`. These are all different objects with different memory footprints. Understanding where those differences come from requires digging into the source code of each of these functions. – Maurits Evers Jul 31 '22 at 04:08
  • @MauritsEvers Hi Maurits, thank you for replying. My question is why seemingly identical data (the nycflights data frame) can lead to vastly different object sizes in R. Even after converting them all into data frames, the object sizes did not change at all. I understand that fully answering this may require digging into the source code, but is there a "big-picture" explanation for the 40% difference? I probably haven't fully grasped the apples-to-oranges comparison, but I'm happy to hear any further discussion. – Miao Cai Jul 31 '22 at 04:37
  • *"why seemingly identical data (a nycflights dataframe) can lead to vastly different object sizes"* I told you why: the functions you use store raw data in different formats (apples vs. oranges: "dressed" `tibble` vs. `data.table` vs. vanilla `tibble`). These "why" questions are notoriously difficult to answer and IMO of limited use: you are asking for insights & design choices that only the corresponding code devs can answer. – Maurits Evers Jul 31 '22 at 05:01
  • @MauritsEvers Ok. Thanks for the information! Do you mind posting this as an answer? It would probably be more informative if you could add some **brief** explanation of the difference between a tibble, a vanilla tibble (I've seen this phrase quite often but honestly haven't seen a good explanation of it), and a data.table. – Miao Cai Jul 31 '22 at 05:11
  • Check the types of each column to see if they are identical; using `str` you can also see whether there are extra attributes, which may sometimes be heavy as well. data.table by default should not create indexes, which are int32 and can be long for high-cardinality columns. Dunno about the other packages. – jangorecki Jul 31 '22 at 17:40
  • Instead of `as.data.frame` you can also `unlist()` each object and see what is left, preferably using `str` (see the sketch after these comments). – jangorecki Jul 31 '22 at 17:42
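
Following jangorecki's suggestion, a sketch for inspecting the extra attributes each reader attaches (the attribute names in the comments are my understanding and may vary across package versions):

str(attributes(fl_baseR))
str(attributes(fl_readr))       # readr tibbles typically carry a `spec` attribute holding the column spec
str(attributes(fl_data.table))  # data.tables carry a `.internal.selfref` pointer
str(attributes(fl_arrow))

# strip per-column attributes and compare sizes again
strip_attrs = function(df) lapply(df, function(col) { attributes(col) = NULL; col })
lobstr::obj_size(strip_attrs(fl_readr))
lobstr::obj_size(strip_attrs(fl_arrow))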
