
Problem

I have a dataframe df whose index is not monotonically increasing across its 4 partitions: every partition is indexed [0..N]. I need to select rows based on an index list [0..M], where M > N. Using loc yields inconsistent output, since multiple rows are indexed by 0 (see example).

In other words, I'd need to overcome the difference between Dask's and Pandas' reset_index: Pandas assigns a single global [0..M] index, while Dask resets each partition's index independently. The Pandas behaviour would easily solve my issue.

Example

`print df.loc[0].compute()` results in:

   Unnamed: 0  best_answer  thread_id  ty_avc    ty_ber  ty_cjr  ty_cpc  \
0           0            1          1       1  0.052174       9      18   
0           0            1       5284      12  0.039663      34      60   
0           0            1      18132       2  0.042254       7      20   
0           0            1      44211       4  0.025000       5       5   
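The situation can be reproduced with synthetic data (a stand-in, not the actual dataset): build four pandas frames, each carrying its own default RangeIndex, and stitch them into one dask.dataframe.

```python
import pandas as pd
import dask
import dask.dataframe as dd

# Synthetic stand-in for the real data: four pandas frames, each with
# its own 0-based RangeIndex, glued into a single dask.dataframe with
# unknown divisions, so index 0 occurs once per partition.
make = dask.delayed(pd.DataFrame)
parts = [make({'thread_id': list(range(i * 10, (i + 1) * 10))}) for i in range(4)]
meta = pd.DataFrame({'thread_id': pd.Series(dtype='int64')})
df = dd.from_delayed(parts, meta=meta)

print(df.loc[0].compute())  # one row per partition, all labelled 0
```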

Possible solutions

  1. repartition df into a single partition and reset_index; I don't like this, as the result won't fit in memory;
  2. add a column holding [0..M] and use set_index; discouraged in the performance tips;
  3. the solution to this question solves a different problem, as that df has unique indexes;
  4. split the index list into npartitions parts, apply an offset to each, and use map_partitions (sketched below).

I cannot think of other solutions. The last one is probably the most efficient, although I'm not sure it's actually feasible.
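For what it's worth, a minimal sketch of option 4, assuming `df` is the dataframe above and `positions` is a NumPy integer array of global row positions (both names are placeholders): pay one pass over the data to learn the partition lengths, translate global positions into per-partition local positions, then select with `iloc` inside `map_partitions`.

```python
import numpy as np
import dask.dataframe as dd

def select_by_global_position(df, positions):
    # Dask does not track partition lengths, so compute them once.
    lengths = df.map_partitions(len).compute()
    offsets = np.concatenate([[0], np.cumsum(lengths)])

    pieces = []
    for i in range(df.npartitions):
        lo, hi = offsets[i], offsets[i + 1]
        # Global positions that fall in partition i, made local.
        local = positions[(positions >= lo) & (positions < hi)] - lo
        if len(local):
            pieces.append(
                df.get_partition(i).map_partitions(
                    lambda part, idx=local: part.iloc[idx],
                    meta=df._meta,  # iloc on an empty frame would break meta inference
                )
            )
    return dd.concat(pieces)  # assumes positions is non-empty
```

`dd.concat` may warn about unknown divisions, and rows come back in partition order rather than in the order of `positions`.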

w00dy

1 Answer


Generally dask.dataframe does not track the lengths of the pandas dataframes that make up the dask.dataframe. I suspect that your option 4 is best. You might also consider using dask.delayed.

See also http://dask.pydata.org/en/latest/delayed-collections.html
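A rough sketch of what the delayed route could look like, under the same assumptions as the question (`df` with per-partition [0..N] indexes, `positions` a NumPy array of global row positions; both names are placeholders): pull the partitions out as delayed pandas objects, slice each with plain pandas, and reassemble with `from_delayed`.

```python
import numpy as np
import dask
import dask.dataframe as dd

# One cheap pass to learn partition lengths and global offsets.
lengths = df.map_partitions(len).compute()
offsets = np.concatenate([[0], np.cumsum(lengths)])

@dask.delayed
def take(part, local):
    # Plain pandas positional selection within one partition.
    return part.iloc[local]

pieces = [
    take(part, positions[(positions >= lo) & (positions < hi)] - lo)
    for part, lo, hi in zip(df.to_delayed(), offsets, offsets[1:])
]
result = dd.from_delayed(pieces, meta=df._meta)
```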

MRocklin
  • Indeed [dask.delayed](http://dask.pydata.org/en/latest/delayed.html) is a good way to go; it turns out to be fast enough. – w00dy Apr 19 '17 at 09:57