pandas -> cuDF

I'm converting some Python code written for pandas to run on RAPIDS.

pandas

temp=df_train.copy()
temp['buildingqualitytypeid']=temp['buildingqualitytypeid'].fillna(-1)
temp=temp.groupby("buildingqualitytypeid").filter(lambda x: x.buildingqualitytypeid.size > 3)
temp['buildingqualitytypeid'] = temp['buildingqualitytypeid'].replace(-1,np.nan)
print(temp.buildingqualitytypeid.isnull().sum())
print(temp.shape)

Does anyone know what to use in place of the pandas groupby filter (DataFrameGroupBy.filter) to get the same outcome in cuDF?

gumdropsteve
Comment: Series and groupby `filter` are not currently implemented, but it's likely you could do this example without it. Could you please provide a reproducible example, following https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports or https://stackoverflow.com/help/minimal-reproducible-example? – Nick Becker, Aug 22 '19 at 13:23

1 Answer

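We're still working on filter functionality in cuDF, but for now the following approach covers many filter-like needs. It counts the rows in each group, keeps the groups with more than 3 rows, and merges those keys back onto the original rows: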

import numpy as np
import pandas as pd

df_train = pd.DataFrame({'buildingqualitytypeid': np.random.randint(0, 4, 12), 'value': np.arange(12)})
temp = df_train.copy()
temp['buildingqualitytypeid'] = temp['buildingqualitytypeid'].fillna(-1)
# Count rows per group and keep only the groups with more than 3 rows
gtemp = temp.groupby("buildingqualitytypeid").count()
gtemp = gtemp[gtemp['value'] > 3]
gtemp = gtemp.drop('value', axis=1)
# Join the surviving keys back to the rows, carrying the original index as a column
gtemp = gtemp.merge(temp.reset_index(), on="buildingqualitytypeid")
# Restore the original row order and index
gtemp = gtemp.sort_values('index')
gtemp.index = gtemp['index']
gtemp.index.name = None
gtemp = gtemp.drop('index', axis=1)

This can be completed considerably more simply if you don't need the index values.
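
As a rough sketch of that simpler variant, assuming the original index can be thrown away: count the rows per group, keep the keys of the large groups, and inner-join them back onto the data. It is written against pandas so it runs anywhere. The helper name keep_large_groups and the sample data are made up for illustration, and whether cuDF accepts exactly these calls (groupby().size(), boolean indexing, merge) is an assumption that has not been checked here.

```python
import pandas as pd


def keep_large_groups(df, key, min_rows):
    """Return the rows of df whose `key` group has more than `min_rows` rows.

    Hypothetical helper: the original index is not preserved.
    """
    sizes = df.groupby(key).size()              # rows per group, indexed by key
    big_keys = sizes[sizes > min_rows].index    # keys of the groups to keep
    keys_df = pd.DataFrame({key: big_keys})     # one-column frame of surviving keys
    return keys_df.merge(df, on=key)            # inner join drops the small groups


# Made-up sample data: group 1 has 4 rows, group 2 has 3, group 3 has 5.
df_train = pd.DataFrame({
    "buildingqualitytypeid": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 2],
    "value": list(range(12)),
})
result = keep_large_groups(df_train, "buildingqualitytypeid", 3)
print(result.shape)
```

Null keys are dropped by groupby by default, so if the null rows should be treated as a group of their own, fill them with a sentinel such as -1 first and replace it afterwards, as in the question.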

Thomson Comer