How do I sort an Arrow table in PyArrow?
There does not appear to be a single function that will do this; the closest is sort_indices.
PyArrow has included Table.sort_by since 7.0.0, so there is no need to call the compute functions manually:
import pyarrow as pa

table = pa.table([
    pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]),
    pa.array([15, 20, 3, 4, 5, 6, 10, 1, 14, 123]),
], names=["keys", "values"])
sorted_table = table.sort_by([("values", "ascending")])
Using a PyArrow compute function:
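from typing import List, Union

import pyarrow as pa
import pyarrow.compute as pc

def arrow_sort_values(table: pa.Table, by: Union[str, List[str]]) -> pa.Table:
    """
    Sort an Arrow table in ascending order. Same as sort_values for a DataFrame.
    :param table: Arrow table.
    :param by: Column name(s) to sort by. String or list of strings.
    :return: Sorted Arrow table.
    """
    # Normalize a single column name to a list, since sort_keys expects a list of names.
    sort_keys = [by] if isinstance(by, str) else list(by)
    # Asking for the bottom k=len(table) rows returns indices for a full ascending sort.
    table_sorted_indexes = pc.bottom_k_unstable(table, sort_keys=sort_keys, k=len(table))
    return table.take(table_sorted_indexes)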
Test code:
import pandas as pd

df = pd.DataFrame({"x": [1, 4, 2, 3], "y": [1.1, 4.4, 2.2, 3.3]})
table = pa.Table.from_pandas(df)
table_sorted = arrow_sort_values(table, by=["x"])
df_sorted = table_sorted.to_pandas()
In (unsorted):
x y
1 1.1
4 4.4
2 2.2
3 3.3
Out (sorted):
x y
1 1.1
2 2.2
3 3.3
4 4.4
Tested under Python 3.9 and PyArrow v6.0.1. Use one of the following to install using pip or Anaconda / Miniconda:
pip install pyarrow==6.0.1
conda install -c conda-forge pyarrow=6.0.1 -y
Discussion: PyArrow is designed to have low-level functions that encourage zero-copy operations.