4

How do I sort an Arrow table in PyArrow?

There does not appear to be a single function that will do this, the closest is sort_indices.

Contango
  • 76,540
  • 58
  • 260
  • 305

2 Answers2

3

PyArrow includes Table.sort_by since 7.0.0, no need to manually call the compute functions (reference)

table = pa.table([
      pa.array(["a", "a", "b", "b", "b", "c", "d", "d", "e", "c"]),
      pa.array([15, 20, 3, 4, 5, 6, 10, 1, 14, 123]),
      ], names=["keys", "values"])
sorted_table = table.sort_by([("values", "ascending")])
li.davidm
  • 11,736
  • 4
  • 29
  • 31
2

Using PyArrow function:

def arrow_sort_values(table: pa.lib.Table, by: str or list) -> pa.lib.Table:
    """
    Sort an Arrow table. Same as sort_values for a Dataframe.
    :param table: Arrow table.
    :param by: Column names to sort by. String or array.
    :return: Sorted Arrow table.
    """
    table_sorted_indexes = pa.compute.bottom_k_unstable(table, sort_keys=by, k=len(table))
    table_sorted = table.take(table_sorted_indexes)
    return table_sorted

Test code:

df = pd.DataFrame({"x": [1,4,2,3], "y": [1.1, 4.4, 2.2, 3.3]})
table = pa.Table.from_pandas(df)
table_sorted = arrow_sort_values(table, by=["x"])
df_sorted = table_sorted.to_pandas()

In (unsorted):

x    y
1  1.1
4  4.4
2  2.2
3  3.3

Out (sorted):

x    y
1  1.1
2  2.2
3  3.3
4  4.4

Tested under Python 3.9 and PyArrow v6.0.1. Use one of the following to install using pip or Anaconda / Miniconda:

pip install pyarrow==6.0.1
conda install -c conda-forge pyarrow=6.0.1 -y

Discussion: PyArrow is designed to have low-level functions that encourage zero-copy operations.

Contango
  • 76,540
  • 58
  • 260
  • 305
  • How unstable is "unstable". I generally need to sort data before dropping duplicates and I always want to keep the latest version of each row. I am doing this in pandas currently and then I need to convert back to a pyarrow table – trench Feb 06 '22 at 05:29
  • 1
    @trench If you specify enough sorting columns so that the order is always the same, then the sort order will always be identical between stable and unstable. However, if you omit a column necessary for sorting, then the results of stable vs. unstable may vary from run to run. For example, when sorting a table of [FirstName, LastName], if you sort by both columns then stable is identical to unstable, whereas if you sort only by [FirstName] then stable could differ from unstable from run to run because "Aaron Smith" and "Aaron Jones" could be in different orders from run to run. – Contango Feb 06 '22 at 09:13
  • 1
    thank you for the clarification! – trench Feb 07 '22 at 16:32
  • It doesn't make sense to use `bottom_k_unstable` here instead of `sort_indices`. The latter is stable, unlike the former, and you won't get any performance benefit from bottom-k using a k equal to the entire number of rows (it might actually be slower). – Antoine P. Feb 16 '22 at 16:12
  • @AntoineP. Can also use `sort_indices`, use this in the test code: `by=[("x", "ascending")]`. – Contango Feb 19 '22 at 18:47