Currently pandas supports 3 dtype backends: numpy, native nullable (extension) dtypes, and pyarrow dtypes. My understanding is that arrow will eventually replace numpy as the pandas backend, even if that is most likely a long-term goal.

Considering the complexity of supporting a new dtype backend, I have a hard time understanding why the native nullable dtypes (e.g., 'Int32') were developed, as they seem similar (I mean also nullable) to the arrow dtypes (maybe arrow support was not yet ready when the native nullable dtypes were developed?). Also in terms of performance, on a simple test, I got:

import numpy as np
import pandas as pd

s_arrow = pd.Series(range(1_000_000), dtype='int32[pyarrow]')
s_extended = pd.Series(range(1_000_000), dtype='Int32')
s_numpy = pd.Series(range(1_000_000), dtype=np.int32)

%timeit s_arrow**2
%timeit s_extended**2
%timeit s_numpy**2

10.6 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.9 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1e+03 µs ± 69.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Also in terms of memory usage:

s_arrow.memory_usage(deep=True), s_extended.memory_usage(deep=True), s_numpy.memory_usage(deep=True)
>>> (4000128, 5000128, 4000128)

the native extension dtype uses noticeably more memory than the other two backends.

Is there a (written) plan or roadmap to improve the performance of the native extension dtypes and arrow dtypes in pandas?

Currently, is there any reason (for production code) to use arrow dtypes instead of native nullable dtypes (assuming we need nullable dtypes)?

user3022222
  • BTW, while I answered, I think this question might be better suited for the pandas development mailing list (https://mail.python.org/pipermail/pandas-dev/) to allow a conversation, if you would like to post it there as well. – joris Mar 23 '23 at 13:51

1 Answer

I have a hard time understanding why the native nullable dtypes (e.g., 'Int32') were developed, as they seem similar (I mean also nullable) to the arrow dtypes (maybe arrow support was not yet ready when the native nullable dtypes were developed?)

That assumption is mostly correct as far as I know: the nullable data types were started before pyarrow had the extensive computational features that it has nowadays. (The first prototype of the nullable integer data type was added in 2018: https://pandas.pydata.org/docs/whatsnew/v0.24.0.html#optional-integer-na-support)

Also in terms of performance

The exact performance comparison will depend a lot on the specific operation and data types you are comparing.

For this specific case, the power operation, numpy is clearly quite a bit faster (note that it is also using an integer dtype for which numpy does not have to care about missing values, because it simply doesn't support them, while the other two do handle missing values). That is probably something that can still be optimized in pyarrow. But you can also find examples where the others are currently faster: for example, for mean() pyarrow seems to be faster, and operations on string data will typically be a lot faster using pyarrow.
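As a rough sketch of how such a comparison can look (a benchmark assuming pandas 2.x with pyarrow installed, run in IPython; exact timings will depend on versions and hardware):

import numpy as np
import pandas as pd

s_arrow = pd.Series(range(1_000_000), dtype='int32[pyarrow]')
s_extended = pd.Series(range(1_000_000), dtype='Int32')
s_numpy = pd.Series(range(1_000_000), dtype=np.int32)

# a reduction, where pyarrow is often competitive or faster
%timeit s_arrow.mean()
%timeit s_extended.mean()
%timeit s_numpy.mean()

# string data, where the pyarrow-backed string dtype typically wins clearly
s_str_object = pd.Series(['some value'] * 1_000_000, dtype=object)
s_str_arrow = pd.Series(['some value'] * 1_000_000, dtype='string[pyarrow]')

%timeit s_str_object.str.upper()
%timeit s_str_arrow.str.upper()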

Also in terms of memory usage ... the native extension dtype uses noticeably more memory than the other two backends.

The nullable extension dtypes (both the native pandas and pyarrow ones) support missing values for all data types, and do that by keeping track of a separate byte or bit mask, respectively (as an additional array next to the data array). So for numeric data types like this, they will by definition use more memory. The reason that in this specific example pyarrow doesn't use more memory compared to numpy is that there are no missing values, and in that case pyarrow can avoid allocating the mask (an optimization that is not yet implemented for the nullable dtypes in pandas).
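To illustrate (a small sketch assuming pandas 2.x with pyarrow installed; the exact byte counts will vary, the point is the mask overhead):

import pandas as pd

# no missing values: the pandas nullable dtype still allocates a byte mask
# (1 byte per value), while pyarrow can skip the validity mask entirely
s_arrow = pd.Series(range(1_000_000), dtype='int32[pyarrow]')
s_extended = pd.Series(range(1_000_000), dtype='Int32')

print(s_arrow.memory_usage(deep=True))     # ~4 MB: 4 bytes per value, no mask
print(s_extended.memory_usage(deep=True))  # ~5 MB: 4 bytes per value + 1 mask byte per value

# once a missing value is present, pyarrow also allocates a validity mask,
# adding roughly 1 bit per value on top of the data buffer
s_arrow_na = pd.Series([None] + list(range(1, 1_000_000)), dtype='int32[pyarrow]')
print(s_arrow_na.memory_usage(deep=True))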

Is there a (written) plan or roadmap to improve the performance of the native extension dtypes and arrow dtypes in pandas?

There is active work going on to improve this, but no, there is no specific or detailed roadmap at the moment (although that is something we as pandas developers should do).

Currently, is there any reason (for production code) to use arrow dtypes instead of native nullable dtypes (assuming we need nullable dtypes)?

In general, I think the numpy-based nullable dtypes are a bit more stable (they have existed for a longer time and are better supported across more functionality throughout the library). At the time of writing, I would not recommend using the pyarrow-based nullable dtypes in production code (it's a first release as an experimental feature for people to test), except for the arrow-backed string data type. One possible reason to use the arrow dtypes is the wider support for other data types (e.g. decimal, nested data, ...).
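For example (a minimal sketch assuming pandas >= 2.0 with pyarrow installed, using pd.ArrowDtype to wrap arbitrary pyarrow types), decimals and nested lists have no native numpy-backed equivalent:

from decimal import Decimal

import pandas as pd
import pyarrow as pa

# arrow-backed strings (the one arrow dtype already recommended above)
s_str = pd.Series(['foo', 'bar', None], dtype='string[pyarrow]')

# decimal data with exact precision and scale
s_dec = pd.Series([Decimal('1.50'), Decimal('2.25'), None],
                  dtype=pd.ArrowDtype(pa.decimal128(10, 2)))

# nested (list) data
s_list = pd.Series([[1, 2], [3], None],
                   dtype=pd.ArrowDtype(pa.list_(pa.int64())))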

joris