I'm using drop_duplicates to remove duplicates from my dataframe based on a column. The problem is that this column is empty for some entries, and those end up being removed too. Is there a way to make the function ignore the empty values? Here is an example:

    Title                  summary                  
0   TITLE A                summaryA       
1   TITLE A                summaryB  
2                          summaryC       
3                          summaryD

Using this:

data.drop_duplicates(subset="Title",
                     keep='first', inplace=True)

I get a result like this:

    Title                  summary                  
0   TITLE A                summaryA        
2                          summaryC

But since the last two rows are not duplicates, I want to keep them. Is there a way for drop_duplicates to ignore empty values?

Ynjxsjmh

2 Answers

You could fill the missing values with the row index, so that every empty title becomes unique. Maybe not the prettiest way, but it works:

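import pandas as pd

df = pd.DataFrame(
    {'Title': ['TITLE A', 'TITLE A', None, None],
     'summary': ['summaryA', 'summaryB', 'summaryC', 'summaryD']}
)

# Helper column: the row index as a string, unique for every row
df['_id'] = df.index.astype(str)
# Replace each missing Title with that row's unique id
df['Title2'] = df['Title'].fillna(df['_id'])

# Rows whose Title was missing now have unique keys and are kept
df.drop_duplicates(subset='Title2', keep='first')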
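If you would rather not add helper columns, a boolean mask is another option. This is only a sketch: it assumes "empty" means either a missing value (None/NaN) or an empty string, and the names `is_empty`, `is_repeat` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["TITLE A", "TITLE A", None, None],
    "summary": ["summaryA", "summaryB", "summaryC", "summaryD"],
})

# Rows whose Title is missing (None/NaN) or an empty string.
# Assumption: these are the "empty" entries that should never be dropped.
is_empty = df["Title"].isna() | df["Title"].eq("")

# Rows that repeat an earlier Title; the first occurrence is not flagged.
is_repeat = df.duplicated(subset="Title", keep="first")

# Keep every empty-title row, and drop repeats only among non-empty titles.
result = df[is_empty | ~is_repeat]
print(result)
```

On the sample data this keeps rows 0, 2 and 3, without modifying `df` itself.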
jon

You can keep the last occurrence instead of the first:

data.drop_duplicates(subset="Title",
                     keep='last', inplace=True)
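To see what the `keep` argument changes, here is a quick check on the sample data from the question. It assumes the empty titles are stored as None/NaN, which drop_duplicates treats as equal to each other; `kept_first` and `kept_last` are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["TITLE A", "TITLE A", None, None],
    "summary": ["summaryA", "summaryB", "summaryC", "summaryD"],
})

# keep only selects WHICH row of each duplicate group survives.
kept_first = df.drop_duplicates(subset="Title", keep="first")
kept_last = df.drop_duplicates(subset="Title", keep="last")

print(kept_first.index.tolist())  # [0, 2]
print(kept_last.index.tolist())   # [1, 3]
```

Either way, only one of the two empty-title rows remains, so `keep` on its own may not be enough if both have to be preserved.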
Jorge Luis