Basically this is answered for pandas in python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B. In pandas I adopted the solution
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
but I cannot apply the same solution efficiently to dask, since dask doesn't handle sort_values well (a global sort requires an expensive shuffle). I can get the max indices via
max_idx = df.groupby("A")["B"].idxmax().values
But I have to compute the max indices before I can use them as an argument to df.loc, i.e.
df.loc[max_idx.compute()]
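To make the workaround concrete, here is a minimal, self-contained sketch of what I am doing at the moment (the toy data and npartitions=2 are only for illustration):

    import pandas as pd
    import dask.dataframe as dd

    # toy data: duplicate keys in A, keep the row with the largest B per key
    pdf = pd.DataFrame({"A": [1, 1, 2, 2, 3],
                        "B": [10, 30, 20, 5, 7],
                        "C": list("abcde")})
    df = dd.from_pandas(pdf, npartitions=2)

    # index label of the maximal B per group, as a dask array
    max_idx = df.groupby("A")["B"].idxmax().values

    # the early compute() here is the part I would like to avoid
    result = df.loc[max_idx.compute()].compute()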
On an entire dask frame, the method df.nlargest(1, "B") does what I need, but I haven't figured out how to use it together with groupby for my needs.
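What I have tried in that direction is a groupby().apply construction like the one below (continuing the toy example above; the explicit meta and the reset_index at the end are my guesses, not something I have taken from the docs), but I don't know whether this is the efficient or idiomatic way:

    # take the single largest-B row within each A group;
    # meta describes the output schema so dask does not have to guess it
    dedup = df.groupby("A").apply(
        lambda g: g.nlargest(1, "B"),
        meta={"A": "int64", "B": "int64", "C": "object"},
    )
    # drop the group level that apply puts into the index
    dedup = dedup.reset_index(drop=True)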
In my dask-frame based analysis, my current workflow is to use dask for the out-of-memory operations, applying various operations and selections to a dataset until it reaches a manageable size, and then to continue with pandas. My temporary solution is therefore to move the duplicate removal to the pandas part of my analysis, but I'm curious whether there is an efficient and elegant way to do it in dask.
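For completeness, the temporary fallback currently looks roughly like this (assuming the earlier dask operations have shrunk the data enough to fit in memory):

    # once the data is small enough, switch to pandas and reuse the
    # sort_values / drop_duplicates recipe from the pandas question
    small = df.compute()
    small = (small.sort_values("B", ascending=False)
                  .drop_duplicates("A")
                  .sort_index())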