Problem
I have a dataframe df whose index is not monotonically increasing over its 4 partitions: every partition is indexed [0..N]. I need to select rows based on an index list [0..M], where M > N.
Using loc would yield inconsistent output, since there are multiple rows indexed by 0 (see the example below). In other words, I need to overcome the difference between Dask's and pandas' reset_index: Dask resets the index per partition, while pandas produces a single global [0..M] index, which would easily solve my issue.
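To illustrate that reset_index difference on toy data (not my real df, the column name is made up):

```python
import pandas as pd
import dask.dataframe as dd

# Toy frame: 8 rows spread over 4 partitions.
toy = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)

# pandas' reset_index would give one global RangeIndex 0..7;
# Dask's reset_index restarts at 0 inside every partition instead.
reset = toy.reset_index(drop=True)
print(reset.index.compute().tolist())  # [0, 1, 0, 1, 0, 1, 0, 1] -- labels still duplicated
```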
Example
print(df.loc[0].compute())
results in:
Unnamed: 0 best_answer thread_id ty_avc ty_ber ty_cjr ty_cpc \
0 0 1 1 1 0.052174 9 18
0 0 1 5284 12 0.039663 34 60
0 0 1 18132 2 0.042254 7 20
0 0 1 44211 4 0.025000 5 5
Possible solutions
- repartition df into a single partition and reset_index; I don't like this, as the dataframe won't fit in memory;
- add a column with [0..M] indexes and use set_index, which is discouraged in the performance tips;
- the solution to this question solves a different problem, as that df has unique indexes;
- split the index list into npartitions parts, apply an offset computation, and use map_partitions (rough sketch below).
I cannot think of other solutions... the last one is probably the most efficient, although I'm not sure it's actually feasible.
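For reference, a rough sketch of what I mean by the last option, assuming the "global index" [0..M] is simply the global row position in partition order; the helper name and the to_delayed/from_delayed round-trip are just one way I can think of, not necessarily the best:

```python
import numpy as np
from dask import delayed
import dask.dataframe as dd

def select_global_positions(df, wanted):
    """Select rows of `df` by global row position (0..M-1 across all partitions)."""
    # one pass over the data to get the length of every partition
    sizes = df.map_partitions(len).compute()
    offsets = np.concatenate([[0], np.cumsum(sizes)])

    wanted = np.sort(np.asarray(wanted))

    # split the global positions into one chunk per partition and shift each
    # chunk by the partition's offset so they become local (per-partition) positions
    local = [
        wanted[(wanted >= offsets[i]) & (wanted < offsets[i + 1])] - offsets[i]
        for i in range(df.npartitions)
    ]

    # select inside each partition with iloc and stitch the pieces back together
    parts = df.to_delayed()
    picked = [delayed(lambda part, pos: part.iloc[pos])(part, pos)
              for part, pos in zip(parts, local)]
    # df._meta is the usual (private-attribute) trick to reuse the existing schema
    return dd.from_delayed(picked, meta=df._meta)

# usage, e.g.: select_global_positions(df, [0, 5, 123]).compute()
```

This keeps everything out-of-core: only the per-partition sizes and the final selection are computed, and each partition is filtered independently.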