
I have a 2GB CSV file that I read into a pyarrow table with the following:

from pyarrow import csv

tbl = csv.read_csv(path)

When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data is in Arrow memory than as a CSV. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought that, if anything, it would be smaller due to its columnar nature (I also probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline). I definitely wasn't expecting an increase of almost 75%. Also, when I convert the Arrow table to a pandas DataFrame, the DataFrame takes up roughly the same amount of memory as the CSV, which was expected.

Can anyone help explain the difference in memory usage for Arrow tables compared to a CSV / pandas DataFrame?

Thx.

UPDATE

Full code and output below.

In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-shar
   ...: e\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string

In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bik
   ...: e-share\status.csv")

In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string

In [5]: tbl.nbytes
Out[5]: 3419272022

In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
 #   Column           Dtype
---  ------           -----
 0   station_id       int64
 1   bikes_available  int64
 2   docks_available  int64
 3   time             object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB
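
For reference, a minimal sketch of how to break that Arrow figure down per column (assuming the same tbl as above):

# Sketch: per-column memory footprint of the Arrow table
for name in tbl.schema.names:
    print(name, tbl.column(name).nbytes)  # bytes used by each column's buffers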
matthewmturner
  • You should expect the arrow table to be smaller than the CSV, in particular if your CSV has numeric columns. It may be the case that arrow doesn't guess the schema correctly. What does `tbl.schema` return? – 0x26res Apr 02 '20 at 09:23
  • I updated with full code and details (including schema). – matthewmturner Apr 03 '20 at 04:34
  • I recommend passing memory_usage='deep' to the pandas .info() call to get a more accurate estimate of memory usage. – nachose Jun 16 '20 at 07:58
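
A minimal sketch of the check suggested in the last comment, assuming the DataFrame produced by tbl.to_pandas() above:

# Sketch: 'deep' memory estimate that also counts the Python string objects
df = tbl.to_pandas()
df.info(memory_usage='deep')                # frame-level estimate
print(df['time'].memory_usage(deep=True))   # per-column estimate, in bytes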

1 Answer


There are two problems:

  1. The integer columns are using int64, but int32 would be more appropriate (unless the values are big).
  2. The time column is interpreted as a string. It doesn't help that the input format doesn't follow any standard (%Y/%m/%d %H:%M:%S).

The first problem is easy to solve, using ConvertOptions:

import pyarrow as pa
from pyarrow import csv

tbl = csv.read_csv(
    <path>,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.string()
        }))

The second one is a bit more complicated because as far as I can tell the read_csv API doesn't let you provide a format for the time column, and there's no easy way to convert string columns to datetime in pyarrow. So you have to use pandas instead:

import pandas as pd
import pyarrow as pa

# Parse the strings as datetimes in pandas, then rebuild the Arrow table
series = tbl.column('time').to_pandas()
series_as_datetime = pd.to_datetime(series, format='%Y/%m/%d %H:%M:%S')
tbl2 = pa.table(
    {
        'station_id': tbl.column('station_id'),
        'bikes_available': tbl.column('bikes_available'),
        'docks_available': tbl.column('docks_available'),
        'time': pa.chunked_array([series_as_datetime])
    })
tbl2.nbytes
>>> 1475683759

1475683759 is the number you'd expect, and you can't get much better: each row is 20 bytes (4 + 4 + 4 + 8).
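
As a rough sanity check (a sketch, using the row count from the .info() output in the question):

# Sketch: back-of-the-envelope size estimate
num_rows = 71_984_434
bytes_per_row = 4 + 4 + 4 + 8    # three int32 columns + one 64-bit timestamp
print(num_rows * bytes_per_row)  # ~1.44e9 bytes; the small remainder in tbl2.nbytes
                                 # is presumably validity bitmaps and similar overhead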

0x26res
  • Thanks much for the detailed response - definitely helpful. Separate from downcasting the integers I looked specifically at the memory footprint of the time field (represented as string in pyarrow table and object in pandas) and in pyarrow it takes up almost 3 times as much memory as in pandas. Is that expected? – matthewmturner Apr 03 '20 at 13:37
  • If datetimes are stored as strings they will take a lot more space than they should. That's why in my answer I've suggested using pandas to convert them to datetime and then converting them back into arrow. – 0x26res Apr 07 '20 at 10:46
  • Agree, but on a like-for-like basis (string in arrow and object in pandas) I'm surprised that, with no optimization, pyarrow strings are so much more expensive than pandas objects. I agree with everything you've said to fix the issue, but it's confusing why it's so much worse to begin with. – matthewmturner Apr 07 '20 at 13:14
  • For `object` dtype in pandas you have to use `memory_usage(deep=True)` to get (an estimate of) the memory usage. In this case it's around 5 GB, so arrow is actually more compact than pandas. – 0x26res Apr 07 '20 at 14:52
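
To make that last comparison concrete, a small sketch (assuming the tbl from the answer's code above):

# Sketch: like-for-like footprint of the time column
arrow_bytes = tbl.column('time').nbytes                                # Arrow string column
pandas_deep = tbl.column('time').to_pandas().memory_usage(deep=True)   # Python string objects
print(arrow_bytes, pandas_deep)  # the deep pandas estimate comes out larger (~5 GB here)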