Pandas drop_duplicates() without empty rows

Question

I have 2 identical columns in a pandas DataFrame, so both columns contain the same duplicates.

I want to delete the duplicates only from column B, so that the result looks like the following:

I cloned the column B in a new DataFrame and used drop duplicates. The new dataframe with only the column B after drop_duplicates() looks like:

But when I put it back into the original DataFrame, it looks like this:

My Code:

df[['A','B']]
df1=df['B']
df1=df1.sort_values()
df1.drop_duplicates(keep='first', inplace=True)
df1.to_numpy()
df['B']=df1

score 2 · Answer 1 · answered Mar 03 '23 at 08:33

You can drop_duplicates, then reindex your output with set_axis to force index alignment on the first rows:

s = df['B'].drop_duplicates()
#s = df.drop_duplicates()['B'] # alternative if you want to consider A+B

df['B'] = s.set_axis(df.index[:len(s)])

NB. This solution works with any original index of df, not only with a range index.
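To see why the relabelling step matters, here is a minimal sketch. The sample data and the letter index are assumptions chosen for illustration, and the names `naive` and `fixed` are only illustrative. Assigning a Series to a column aligns on index labels, so the de-duplicated values stay at their original labels unless they are first moved onto the first rows:

```python
import pandas as pd

# Assumed sample data with a non-range (letter) index.
df = pd.DataFrame(
    {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]},
    index=list("ABCDEFG"),
)

s = df["B"].drop_duplicates()  # keeps the first occurrences: labels A, C, D, F

# Plain assignment aligns on labels, so the gaps stay where the duplicates were.
naive = df.copy()
naive["B"] = s

# Relabel the unique values onto the first len(s) labels before assigning.
fixed = df.copy()
fixed["B"] = s.set_axis(df.index[:len(s)])

print(naive)
print(fixed)
```

With this data, `naive` has NaN in rows B, E and G, while `fixed` holds 1 to 4 in rows A to D followed by NaN.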

Output:

Ynjxsjmh · Answer 2 · 2023-03-03T11:10:57.893

You can do

df['B'] = df['B'].drop_duplicates().reset_index(drop=True)
# or with DataFrame.drop_duplicates, which can take an ignore_index parameter
df['B'] = df[['B']].drop_duplicates(ignore_index=True)

print(df)

   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN

edited Mar 03 '23 at 11:10

answered Mar 03 '23 at 08:35

Ynjxsjmh

Maybe you can explain why you can use `ignore_index` for a dataframe and not for a series. Note: it works only because the OP uses the default `RangeIndex`. – Corralien Mar 03 '23 at 08:44
It's a nice short way, but note that this would fail if the index is not a range (e.g., `df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 4, 4], 'B': [1, 1, 2, 3, 3, 4, 4]}, index=list('ABCDEFG'))`) ;) – mozway Mar 03 '23 at 08:47
@Corralien I see `ignore_index` for a dataframe in doc but there is no such parameter for series, can't tell much about the design logic. – Ynjxsjmh Mar 03 '23 at 09:39
@mozway Thanks for pointing this out, your answer is just good for this. – Ynjxsjmh Mar 03 '23 at 09:40
@Ynjxsjmh. That's why you should explain that. The same method name doesn't take the same arguments because one is `Series.drop_duplicates` and the other is `DataFrame.drop_duplicates`. For a Series, you have to reset the index manually. – Corralien Mar 03 '23 at 10:19
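
The caveat raised in the comments can be checked with a short sketch. It reuses the letter-indexed frame from mozway's comment; the names `df_range` and `df_letters` are only illustrative. Because column assignment aligns on index labels, a Series relabelled 0..n-1 only lines up with a frame whose own labels are 0..n-1:

```python
import pandas as pd

data = {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]}

# Default RangeIndex: labels 0..3 of the reset Series match the first four rows.
df_range = pd.DataFrame(data)
df_range["B"] = df_range["B"].drop_duplicates().reset_index(drop=True)

# Letter index: labels 0..3 match nothing in 'A'..'G', so every value becomes NaN.
df_letters = pd.DataFrame(data, index=list("ABCDEFG"))
df_letters["B"] = df_letters["B"].drop_duplicates().reset_index(drop=True)

print(df_range)
print(df_letters)
```

In the second frame column B ends up entirely NaN, which is the failure mode described above.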

jezrael · Answer 3 · 2023-03-03T08:45:29.613

If the DataFrame has the default index (a RangeIndex), recreate the column from a list:

df['B'] = pd.Series(df['B'].drop_duplicates().tolist())
#alternative
#df['B'] = pd.Series(pd.unique(df['B']).tolist())
print (df)
   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN

For any index, also select the first index labels according to the length of the list:

L = df['B'].drop_duplicates().tolist()
#L = pd.unique(df['B']).tolist()
df['B'] = pd.Series(L, index=df.index[:len(L)])

Or:

import numpy as np
a = pd.unique(df['B'])
# pad the unique values with NaN up to the length of df
df['B'] = np.hstack([a, np.full((len(df) - len(a), ), np.nan)])
print (df)
   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN
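
The variants above could be wrapped in one small helper. This is only a sketch: the name `compact_unique` is hypothetical (it is not a pandas function), the sample frame is assumed for illustration, and the helper assumes a numeric column. It pads the unique values with NaN using NumPy, so it does not depend on the index at all:

```python
import numpy as np
import pandas as pd


def compact_unique(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with the unique values of `column` moved to the top.

    Hypothetical helper, not part of pandas. Assumes a numeric column; the
    remaining rows are filled with NaN, so the column becomes float.
    """
    unique_values = pd.unique(df[column]).astype(float)
    padding = np.full(len(df) - len(unique_values), np.nan)
    out = df.copy()
    # Assigning a NumPy array is positional, so no index alignment takes place.
    out[column] = np.hstack([unique_values, padding])
    return out


# Assumed sample data with a non-range index.
df = pd.DataFrame(
    {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]},
    index=list("ABCDEFG"),
)
result = compact_unique(df, "B")
print(result)
```

Returning a copy leaves the original frame untouched, which makes it easy to compare the result with the input.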
