I have this dataframe:

[image]

The stage column has 4 values:

[image]

I have duplicate rows in this dataframe and I want to drop them. For example:

[image]

I want to keep row #8015

There are no two rows with both the same stage and the same tweet_id. For example:

[image]

I tried this solution:

twitter_archive = twitter_archive.sort_values(by='stage', ascending=False).drop_duplicates(subset='tweet_id', keep='first').sort_index().reset_index(drop=True)

which I found in this solution, but then I lost 10 doggo rows, even though I sorted the values and kept the first occurrence.

[image]
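
One possible explanation, which the screenshots cannot confirm, is that some tweet_id values carry two different stages. With ascending=False, any stage that sorts alphabetically after doggo comes first, so keep='first' would keep that row and drop the doggo row. A minimal sketch to check this, using made-up tweet_id values and assuming a second stage named pupper purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: tweet 2 is tagged with two different stages.
twitter_archive = pd.DataFrame({
    'tweet_id': [1, 1, 2, 2, 3],
    'stage': [np.nan, 'doggo', 'doggo', 'pupper', np.nan],
})

# Count distinct non-null stages per tweet_id (nunique ignores NaN by default).
stages_per_tweet = twitter_archive.groupby('tweet_id')['stage'].nunique()
multi_stage_ids = stages_per_tweet[stages_per_tweet > 1].index.tolist()

# The same sort / drop_duplicates chain as in the question.
deduped = (twitter_archive
           .sort_values(by='stage', ascending=False)
           .drop_duplicates(subset='tweet_id', keep='first')
           .sort_index()
           .reset_index(drop=True))

# Compare how many doggo rows exist before and after deduplication.
doggo_before = (twitter_archive['stage'] == 'doggo').sum()
doggo_after = (deduped['stage'] == 'doggo').sum()
```

If multi_stage_ids is non-empty on the real data, those tweets are the likely source of the missing rows, and a single stage column cannot keep both values for them.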

Asma

1 Answer

Is this something you're looking for?

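import numpy as np
import pandas as pd
# Sample data: Tilly appears twice, once without a stage and once with 'Doggo'
df = pd.DataFrame([{'tweet_id': 89324938479283648628, 'name': 'Phineas', 'stage': np.nan},
                   {'tweet_id': 8932493847987465848628, 'name': 'Tilly', 'stage': np.nan},
                   {'tweet_id': 8932493847987465848628, 'name': 'Tilly', 'stage': 'Doggo'}])
# Collect all stage values of each (tweet_id, name) pair into one list
df = df.groupby(['tweet_id', 'name']).agg(list).reset_index()
# Remove the missing values from each list
df['stage'] = df['stage'].apply(lambda x: [i for i in x if pd.notna(i)])
# Keep the first remaining stage, or NaN when nothing is left
df['stage'] = df['stage'].apply(lambda x: np.nan if len(x) == 0 else x[0])
df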
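
A shorter variant of the same idea, as a sketch rather than a drop-in replacement: GroupBy.first() skips missing values by default, so it returns the first non-null stage of each group and leaves the value missing when the group has none. The short integer ids below are hypothetical and only keep the example readable.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with short ids; Tilly has one row without a stage.
df = pd.DataFrame([
    {'tweet_id': 1, 'name': 'Phineas', 'stage': np.nan},
    {'tweet_id': 2, 'name': 'Tilly', 'stage': np.nan},
    {'tweet_id': 2, 'name': 'Tilly', 'stage': 'Doggo'},
])

# first() takes the first non-null stage in each (tweet_id, name) group.
result = df.groupby(['tweet_id', 'name'])['stage'].first().reset_index()
```

Note that groupby drops rows whose key columns are missing by default, so rows without a name would disappear unless dropna=False is passed (available since pandas 1.1). Like the code above, this keeps only one stage per tweet.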
Aditya