
I am using dask to read a large CSV file. I want to drop rows based on the value in one column: if that column is empty for a row, I want to remove the entire row.

I tried using .dropna:

df = df.dropna(subset=['tier1_name'],how = 'any',axis =0)

However, I got this error:

TypeError: dropna() got an unexpected keyword argument 'axis'

So I used .drop instead:

df.drop(df['tier1_name'].isnull(), axis = 0)

But then got this error:

"Drop currently only works for axis=1 or when columns is not None"
NotImplementedError: Drop currently only works for axis=1 or when columns is not None

I don't understand what I should use to perform this operation. Help!

SultanOrazbayev
krk

1 Answer


The key issue here is that, in general, dask will not know the number of rows or their content without evaluation, so row-based operations are not always easy to integrate.

One solution is to use .loc with an appropriate boolean mask. This snippet might help:

mask = df['tier1_name'].notna()
df_modified = df.loc[mask]
# note: if you build the mask with .isna() instead, negate it
# before selecting, i.e. df.loc[~mask]
SultanOrazbayev
  • I don't understand how using .loc will help in removing the rows. – krk Dec 30 '21 at 23:08
  • I tried passing axis=0 to isna(), but it raises an error: isna() got an unexpected keyword argument 'axis' – krk Dec 30 '21 at 23:11
  • I tried updating dask and pandas, but nothing is working for me – krk Dec 30 '21 at 23:23
  • Sultan's solution works, I just reproduced it. :) @krk, to answer your follow-up questions -- First, `axis=0` acts on rows, which *is not* supported in Dask, for reasons that Sultan mentioned in their first sentence. Next, `.loc` isn't removing rows, it's creating a new dataframe with the rows you want to keep. Hope this helps. – pavithraes Jan 04 '22 at 18:16