I have a 2GB CSV file that I read into a pyarrow table with the following:
from pyarrow import csv
tbl = csv.read_csv(path)
When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data is in Arrow memory than as a CSV. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought that if anything it would be smaller due to its columnar nature (I also probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline; see the sketch below). I definitely wasn't expecting an increase of almost 75%. Also, when I convert the Arrow table to a pandas df, the df takes up roughly the same amount of memory as the CSV, which was expected.
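For reference, this is roughly what I mean by ConvertOptions: pinning narrower types instead of the inferred int64s. I haven't applied it here, and int16 is just a guess at a sufficient width for these columns.

import pyarrow as pa
from pyarrow import csv

# Sketch only: override the inferred int64 columns with narrower types.
# int16 is an assumed width, not something verified against the data.
convert_options = csv.ConvertOptions(
    column_types={
        "station_id": pa.int16(),
        "bikes_available": pa.int16(),
        "docks_available": pa.int16(),
    }
)
tbl_small = csv.read_csv(path, convert_options=convert_options)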
Can anyone help explain the difference in memory usage for Arrow tables compared to a CSV / pandas df?
Thanks.
UPDATE
Full code and output below.
In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-shar
...: e\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bike-share\status.csv")
In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [5]: tbl.nbytes
Out[5]: 3419272022
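A per-column breakdown of where those bytes go would look something like this (a sketch, not output from the session above):

# Sketch: report each column's share of tbl.nbytes.
for name in tbl.column_names:
    print(name, tbl.column(name).nbytes)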
In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
 #   Column           Dtype
---  ------           -----
 0   station_id       int64
 1   bikes_available  int64
 2   docks_available  int64
 3   time             object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB
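Note the + in 2.1+ GB: by default info() does not count the Python string objects backing the time column. Measuring those as well would be something like the following, which I haven't run on the full frame:

# Sketch: include per-object string memory in the report (untested here).
tbl.to_pandas().info(memory_usage="deep")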