Currently pandas supports 3 dtype backends: numpy, native nullable (extension) dtypes, and pyarrow dtypes. My understanding is that arrow will eventually replace numpy as the pandas backend, even if that is most likely a long-term goal.
Considering the complexity of supporting a new dtype backend, I have a hard time understanding why the native nullable dtypes (e.g., 'Int32') were developed, as they seem similar (I mean also nullable) to the arrow dtypes (maybe arrow support was not yet ready when the native nullable dtypes were developed?). Also, in terms of performance, on a simple test I got:
import numpy as np
import pandas as pd

s_arrow = pd.Series(range(1_000_000), dtype='int32[pyarrow]')
s_extended = pd.Series(range(1_000_000), dtype='Int32')
s_numpy = pd.Series(range(1_000_000), dtype=np.int32)
%timeit s_arrow**2
%timeit s_extended**2
%timeit s_numpy**2
10.6 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.9 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1e+03 µs ± 69.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
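(As a side note, a fairer comparison between the two nullable backends might include missing values, which a numpy int32 Series cannot hold anyway. A minimal sketch of how I would test that, not timed here; the 1-in-100 None pattern is just an arbitrary choice:)
# same benchmark, but with missing values present
vals = [i if i % 100 else None for i in range(1_000_000)]
s_arrow_na = pd.Series(vals, dtype='int32[pyarrow]')  # nulls stored natively by arrow
s_extended_na = pd.Series(vals, dtype='Int32')        # nulls tracked via pd.NA
%timeit s_arrow_na**2
%timeit s_extended_na**2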
Also, in terms of memory usage:
s_arrow.memory_usage(deep=True), s_extended.memory_usage(deep=True), s_numpy.memory_usage(deep=True)
>>>
(4000128, 5000128, 4000128)
the nullable extension dtype uses noticeably more memory than the other two backends.
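For what it's worth, the overhead comes out to exactly one extra byte per element, which I assume corresponds to a separate boolean validity mask:
s_extended.memory_usage(deep=True) - s_numpy.memory_usage(deep=True)  # one byte per element; presumably the boolean mask
>>>
1000000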
Is there a (written) plan or roadmap to improve the performance of the nullable extension and arrow dtypes in pandas?
Currently, is there any reason (for production code) to use the arrow dtypes instead of the native nullable dtypes (assuming we need nullable dtypes)?
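For context, this is the nullability behaviour I am assuming throughout: the numpy backend coerces integers with missing values to float64, while both nullable backends keep the integer type.
data = [1, 2, None]
pd.Series(data).dtype                          # float64: numpy backend coerces None to NaN
pd.Series(data, dtype='Int32').dtype           # Int32: missing value kept as pd.NA
pd.Series(data, dtype='int32[pyarrow]').dtype  # int32[pyarrow]: missing value kept as null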