python pandas: duplicated rows using sort_values and drop_duplicates

Question

I have this dataframe:

In the column stage I have 4 values:

I have duplicate rows in this dataframe and I want to drop them, for example:

I want to keep row #8015

I don't have any two rows with the same stage and the same tweet_id, for example:

I tried this solution:

twitter_archive = twitter_archive.sort_values(by='stage', ascending=False).drop_duplicates(subset='tweet_id', keep='first').sort_index().reset_index(drop=True)

which I found in this solution. But then I lost 10 doggo rows, although I sorted my values and kept the first occurrence.
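To make the behaviour concrete, here is a minimal sketch of that chain on made-up data. The small ids and the 'pupper' value are assumptions for illustration, not rows from the real archive. It shows one possible way a doggo row can disappear: when a tweet_id carries two different stages, the descending sort puts the alphabetically later one first and drop_duplicates keeps only that row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: tweet 1 has a NaN row and a 'doggo' row,
# tweet 2 has both a 'doggo' row and a 'pupper' row, tweet 3 has no stage.
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 1, 2, 2, 3],
    "stage": [np.nan, "doggo", "doggo", "pupper", np.nan],
})

deduped = (
    twitter_archive
    .sort_values(by="stage", ascending=False)          # NaN is placed last by default
    .drop_duplicates(subset="tweet_id", keep="first")  # one row per tweet_id survives
    .sort_index()
    .reset_index(drop=True)
)

# Count 'doggo' rows before and after the de-duplication
doggo_before = (twitter_archive["stage"] == "doggo").sum()
doggo_after = (deduped["stage"] == "doggo").sum()
print(deduped)
print(doggo_before, doggo_after)
```

Here tweet 2 keeps its 'pupper' row and loses its 'doggo' row, so the doggo count drops even though no NaN row was preferred over a labelled one.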

Please add sample data and sample code so that it's easier for people to help instead of images. — Aditya, Sep 11 '21 at 15:32

score 0 · Answer 1 · answered Sep 11 '21 at 15:32

Is this something you're looking for?

import numpy as np
import pandas as pd

df = pd.DataFrame([{'tweet_id': 89324938479283648628, 'name': 'Phineas', 'stage': np.nan},
                   {'tweet_id': 8932493847987465848628, 'name': 'Tilly', 'stage': np.nan},
                   {'tweet_id': 8932493847987465848628, 'name': 'Tilly', 'stage': 'Doggo'}])
# Collect every stage value for each (tweet_id, name) pair into a list
df = df.groupby(['tweet_id', 'name']).agg(tuple).applymap(list).reset_index()
# Drop the NaN entries from each list
df['stage'] = df['stage'].apply(lambda x: [i for i in x if str(i) != 'nan'])
# Keep the first remaining stage, or NaN if none is left
df['stage'] = df['stage'].apply(lambda x: np.nan if len(x) == 0 else x[0])
df
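If the goal is only to drop the NaN-stage row when the same tweet_id also has a labelled row, a boolean mask may be enough. The following is a rough sketch of that alternative, not the method above; the small ids are hypothetical stand-ins for the real tweet ids, and it assumes every labelled row should be kept, even when one tweet_id has several different stages.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the shape of the data above
df = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "name": ["Phineas", "Tilly", "Tilly"],
    "stage": [np.nan, np.nan, "Doggo"],
})

# True for rows whose tweet_id has at least one non-null stage
has_stage = df.groupby("tweet_id")["stage"].transform("count") > 0

# Drop a NaN-stage row only when its tweet_id also has a labelled row
result = df[~(df["stage"].isna() & has_stage)].reset_index(drop=True)
print(result)
```

Unlike sorting and keeping the first row per tweet_id, this never discards a labelled row, so the per-stage counts stay the same.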
