I have two identical columns in a pandas DataFrame. Both columns contain the same duplicates.

A B
1 1
1 1
2 2
3 3
3 3
4 4
4 4

I want to drop the duplicates from column B only, so that the result looks like this:

A B
1 1
1 2
2 3
3 4
3 
4 
4 

I copied column B into a separate Series and called drop_duplicates() on it. The result looks like:

B
1
2
3
4

But when I assigned it back to the original DataFrame, it looked like this:

A B
1 1
1 
2 2
3 3
3 
4 4
4 

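My code:

df[['A','B']]  # no effect: the result is not assigned
df1 = df['B']
df1 = df1.sort_values()
df1.drop_duplicates(keep='first', inplace=True)
df1.to_numpy()  # no effect: the result is discarded
df['B'] = df1  # assignment aligns on the index, so values return to their original rows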
David95

3 Answers


You can use drop_duplicates, then relabel the result with set_axis so that it aligns with the first rows of df:

s = df['B'].drop_duplicates()
#s = df.drop_duplicates()['B'] # alternative if you want to consider A+B

df['B'] = s.set_axis(df.index[:len(s)])

NB: this solution works with any index on df, not only a range index.

Output:

   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN
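
To check the claim about arbitrary indexes, here is a minimal, self-contained sketch. The sample values are those from the question; the letter index is an assumption chosen only for illustration.

```python
import pandas as pd

# Data from the question, but with a non-range (letter) index.
df = pd.DataFrame(
    {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]},
    index=list("ABCDEFG"),
)

# Unique values of B, still labelled with the rows they came from (A, C, D, F).
s = df["B"].drop_duplicates()

# Relabel them with the first len(s) labels of df, so that the
# index-aligned assignment places them in the top rows.
df["B"] = s.set_axis(df.index[: len(s)])

print(df)
```

The unique values land in rows A to D and the remaining rows become NaN, which is why the column is upcast to float.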
mozway

You can do

df['B'] = df['B'].drop_duplicates().reset_index(drop=True)
# or with DataFrame.drop_duplicates, which accepts an ignore_index parameter
df['B'] = df[['B']].drop_duplicates(ignore_index=True)
print(df)

   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN
Ynjxsjmh
  • Maybe you can explain why you can use `ignore_index` for a dataframe and not for a series. Note: it works only because the OP uses the default `RangeIndex`. – Corralien Mar 03 '23 at 08:44
  • It's a nice short way, but note that this would fail if the index is not a range (e.g., `df = pd.DataFrame({'A': [1, 1, 2, 3, 3, 4, 4], 'B': [1, 1, 2, 3, 3, 4, 4]}, index=list('ABCDEFG'))`) ;) – mozway Mar 03 '23 at 08:47
  • @Corralien I see `ignore_index` for a dataframe in doc but there is no such parameter for series, can't tell much about the design logic. – Ynjxsjmh Mar 03 '23 at 09:39
  • @mozway Thanks for pointing this out, your answer is well suited to this case. – Ynjxsjmh Mar 03 '23 at 09:40
  • @Ynjxsjmh That's why you should explain it. The same method name doesn't take the same arguments because one is `Series.drop_duplicates` and the other is `DataFrame.drop_duplicates`. For a Series, you have to reset the index manually. – Corralien Mar 03 '23 at 10:19
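
The limitation raised in the comments can be reproduced with a short sketch. It reuses the letter-indexed frame from mozway's comment; the variable names unique_b and result are illustrative assumptions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]},
    index=list("ABCDEFG"),
)

# After reset_index(drop=True) the unique values are labelled 0..3.
unique_b = df["B"].drop_duplicates().reset_index(drop=True)

# Column assignment aligns on labels. None of 0..3 exists in the
# index A..G, so every row of the new column ends up missing.
result = df.assign(B=unique_b)

print(result)
```

With the default RangeIndex the labels 0..3 do exist in df, which is the only reason the reset_index approach works for the data in the question.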

If df has the default index, recreate the column from a list:

df['B'] = pd.Series(df['B'].drop_duplicates().tolist())
#alternative
#df['B'] = pd.Series(pd.unique(df['B']).tolist())
print (df)
   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN

For any index, also take the first index labels, as many as the length of the list:

L = df['B'].drop_duplicates().tolist()
#L = pd.unique(df['B']).tolist()
df['B'] = pd.Series(L, index=df.index[:len(L)])

Or:

import numpy as np

a = pd.unique(df['B'])
# pad the unique values with NaN up to the length of df
df['B'] = np.hstack([a, np.full((len(df) - len(a), ), np.nan)])
print(df)
   A    B
0  1  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  3  NaN
5  4  NaN
6  4  NaN
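
The padding approach can also be wrapped in a small helper. This is only a sketch: the function name compact_unique is hypothetical, and it assumes a numeric column, since the padded array is created as float.

```python
import numpy as np
import pandas as pd


def compact_unique(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df whose `column` holds its unique values in the
    top rows (in order of first appearance), padded with NaN below."""
    unique_values = pd.unique(df[column])
    # Start from an all-NaN float array as long as the frame.
    padded = np.full(len(df), np.nan, dtype=float)
    # Fill the first positions with the unique values.
    padded[: len(unique_values)] = unique_values
    # A NumPy array is assigned by position, so the index of df is irrelevant.
    return df.assign(**{column: padded})


df = pd.DataFrame(
    {"A": [1, 1, 2, 3, 3, 4, 4], "B": [1, 1, 2, 3, 3, 4, 4]},
    index=list("ABCDEFG"),
)
out = compact_unique(df, "B")
print(out)
```

Because the array is assigned positionally, this variant does not depend on the index at all, and the original df is left unchanged.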
jezrael