I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)
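To see why the left side of the join dictates the row order, here is a minimal plain-Python sketch of a hash-based left join. It is only an illustration, not how Polars implements joins internally. The helper name left_join_rows and the tiny table are made up for this example, and it assumes each key occurs at most once in the right-hand table.

```python
def left_join_rows(left_keys, right_rows, key, other_columns):
    """Return one row per left key, in the order of left_keys.

    Illustrative sketch only: assumes keys are unique in right_rows.
    """
    # Build a lookup from key value to the matching right-hand row.
    lookup = {row[key]: row for row in right_rows}
    joined = []
    # Walk the left keys in their own order; the right table's order is irrelevant.
    for k in left_keys:
        match = lookup.get(k, {})
        row = {key: k}
        for col in other_columns:
            # Columns of an unmatched key are filled with None.
            row[col] = match.get(col)
        joined.append(row)
    return joined


# Hypothetical data: a small "shuffled" table and a list of index values.
shuffled_table = [
    {"c1": 2, "c2": 20},
    {"c1": 0, "c2": 40},
    {"c1": 3, "c2": 10},
    {"c1": 1, "c2": 30},
]
index_values = [3, 0, 2]

rows = left_join_rows(index_values, shuffled_table, key="c1", other_columns=["c2"])
print(rows)
```

Because the loop iterates over the index values rather than over the table, the output follows the index values no matter how the table is shuffled.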
For example, let's start with this shuffled DataFrame.
import polars as pl

nbr_rows = 30_000_000
df = pl.DataFrame({
    'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
    'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
})
df
shape: (30000000, 2)
┌──────────┬──────────┐
│ c1 ┆ c2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪══════════╡
│ 4052015 ┆ 20642741 │
│ 7787054 ┆ 17007051 │
│ 20246150 ┆ 19445431 │
│ 1309992 ┆ 6495751 │
│ ... ┆ ... │
│ 10371090 ┆ 4791782 │
│ 26281644 ┆ 12350777 │
│ 6740626 ┆ 24888572 │
│ 22573405 ┆ 14885989 │
└──────────┴──────────┘
And these index values:
nbr_index_values = 10_000
s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
s1
shape: (10000,)
Series: 'c1' [i64]
[
1754
6716
3485
7058
7216
1040
1832
3921
1639
6734
5560
7596
...
4243
4455
894
7806
9291
1883
9947
3309
2030
7731
4706
8528
8426
]
We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)
import time

start = time.perf_counter()
df2 = (
    s1.to_frame()
    .join(
        df,
        on='c1',
        how='left',
    )
)
print(time.perf_counter() - start)
df2
>>> print(time.perf_counter() - start)
0.8427023889998964
shape: (10000, 2)
┌──────┬──────────┐
│ c1 ┆ c2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════════╡
│ 1754 ┆ 15734441 │
│ 6716 ┆ 20631535 │
│ 3485 ┆ 20199121 │
│ 7058 ┆ 15881128 │
│ ... ┆ ... │
│ 7731 ┆ 19420197 │
│ 4706 ┆ 16918008 │
│ 8528 ┆ 5278904 │
│ 8426 ┆ 18927935 │
└──────┴──────────┘
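Notice how the rows are in the same order as the index values. (This is the behavior of the Polars version used here. As far as I know, newer releases add a maintain_order argument to join, and passing maintain_order='left' should make this ordering explicit rather than implicit.) We can verify this: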
s1.series_equal(df2.get_column('c1'), strict=True)
>>> s1.series_equal(df2.get_column('c1'), strict=True)
True
The performance is quite good as well: on my 32-core system, the join against the 30-million-row DataFrame takes less than a second (about 0.84 seconds in the run above).