
Problem

I have a dataframe df whose index is not monotonically increasing across its 4 partitions: every partition is indexed [0..N]. I need to select rows based on an index list [0..M], where M > N. Using loc yields inconsistent output, since multiple rows are indexed by 0 (see example).

In other words, I'd need to overcome the difference between Dask's and Pandas' reset_index: Pandas assigns a single global [0..M] index, while Dask resets each partition's index independently. The Pandas behaviour would easily solve my issue.

Example

`print df.loc[0].compute()` results in:

   Unnamed: 0  best_answer  thread_id  ty_avc    ty_ber  ty_cjr  ty_cpc  \
0           0            1          1       1  0.052174       9      18   
0           0            1       5284      12  0.039663      34      60   
0           0            1      18132       2  0.042254       7      20   
0           0            1      44211       4  0.025000       5       5   
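The situation can be reproduced with synthetic data (a stand-in, not the actual dataset): build four pandas frames, each carrying its own default RangeIndex, and stitch them into one dask.dataframe.

```python
import pandas as pd
import dask
import dask.dataframe as dd

# Synthetic stand-in for the real data: four pandas frames, each with
# its own 0-based RangeIndex, glued into a single dask.dataframe with
# unknown divisions, so index 0 occurs once per partition.
make = dask.delayed(pd.DataFrame)
parts = [make({'thread_id': list(range(i * 10, (i + 1) * 10))}) for i in range(4)]
meta = pd.DataFrame({'thread_id': pd.Series(dtype='int64')})
df = dd.from_delayed(parts, meta=meta)

print(df.loc[0].compute())  # one row per partition, all labelled 0
```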

Possible solutions

  1. repartition df into a single partition and reset_index; I don't like this, as the result won't fit in memory;
  2. add a column holding [0..M] and use set_index; discouraged in the performance tips;
  3. the solution to this question solves a different problem, as that df has unique indexes;
  4. split the index list into npartitions parts, apply an offset to each, and use map_partitions (sketched below).

I cannot think of other solutions. The last one is probably the most efficient, although I'm not sure it's actually feasible.
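For what it's worth, a minimal sketch of option 4, assuming `df` is the dataframe above and `positions` is a NumPy integer array of global row positions (both names are placeholders): pay one pass over the data to learn the partition lengths, translate global positions into per-partition local positions, then select with `iloc` inside `map_partitions`.

```python
import numpy as np
import dask.dataframe as dd

def select_by_global_position(df, positions):
    # Dask does not track partition lengths, so compute them once.
    lengths = df.map_partitions(len).compute()
    offsets = np.concatenate([[0], np.cumsum(lengths)])

    pieces = []
    for i in range(df.npartitions):
        lo, hi = offsets[i], offsets[i + 1]
        # Global positions that fall in partition i, made local.
        local = positions[(positions >= lo) & (positions < hi)] - lo
        if len(local):
            pieces.append(
                df.get_partition(i).map_partitions(
                    lambda part, idx=local: part.iloc[idx],
                    meta=df._meta,  # iloc on an empty frame would break meta inference
                )
            )
    return dd.concat(pieces)  # assumes positions is non-empty
```

`dd.concat` may warn about unknown divisions, and rows come back in partition order rather than in the order of `positions`.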

w00dy

1 Answer


Generally dask.dataframe does not track the lengths of the pandas dataframes that make up the dask.dataframe. I suspect that your option 4 is best. You might also consider using dask.delayed.

See also http://dask.pydata.org/en/latest/delayed-collections.html
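A rough sketch of what the delayed route could look like, under the same assumptions as the question (`df` with per-partition [0..N] indexes, `positions` a NumPy array of global row positions; both names are placeholders): pull the partitions out as delayed pandas objects, slice each with plain pandas, and reassemble with `from_delayed`.

```python
import numpy as np
import dask
import dask.dataframe as dd

# One cheap pass to learn partition lengths and global offsets.
lengths = df.map_partitions(len).compute()
offsets = np.concatenate([[0], np.cumsum(lengths)])

@dask.delayed
def take(part, local):
    # Plain pandas positional selection within one partition.
    return part.iloc[local]

pieces = [
    take(part, positions[(positions >= lo) & (positions < hi)] - lo)
    for part, lo, hi in zip(df.to_delayed(), offsets, offsets[1:])
]
result = dd.from_delayed(pieces, meta=df._meta)
```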

MRocklin
  • Indeed [dask.delayed](http://dask.pydata.org/en/latest/delayed.html) is a good way to go; it turns out to be fast enough. – w00dy Apr 19 '17 at 09:57