Drop rows based on column values in Dask

Question

I am using Dask to read a large CSV file. I want to drop rows based on the value of one column: if that column is empty for a given row, the entire row should be removed.

I tried using .dropna:

df = df.dropna(subset=['tier1_name'], how='any', axis=0)

However, I got this error:

TypeError: dropna() got an unexpected keyword argument 'axis'

So I used .drop instead:

df.drop(df['tier1_name'].isnull(), axis=0)

But then I got this error:

NotImplementedError: Drop currently only works for axis=1 or when columns is not None

I don't understand what I should use to perform this operation. Help!

score 1 · Answer 1 · answered Dec 30 '21 at 06:14

The key issue here is that, in general, dask will not know the number of rows or their content without evaluation, so row-based operations are not always easy to integrate.

One solution is to use .loc with a boolean mask that selects the rows to keep. This pseudo-code might help:

# Boolean mask: True where tier1_name is not missing
mask = df['tier1_name'].notna()
# Keep only the rows where the mask is True
df_modified = df.loc[mask]
# Note: if you build the mask with .isna() instead, negate it: df.loc[~mask]

SultanOrazbayev

I don't understand how using .loc will help in removing the rows. – krk Dec 30 '21 at 23:08
I tried passing axis=0 to isna(), but it raises an error: isna() got an unexpected keyword argument 'axis' – krk Dec 30 '21 at 23:11
I tried updating dask and pandas, but nothing is working for me – krk Dec 30 '21 at 23:23
Sultan's solution works, I just reproduced it. :) @krk, to answer your follow-up questions -- First, `axis=0` acts on rows which *is not* supported in Dask, for reasons that Sultan mentioned in their first sentence. Next, `.loc` isn't removing rows, it's creating a new dataframe with the rows you want to keep. Hope this helps. – pavithraes Jan 04 '22 at 18:16