
Hi, I have a section of my pandas DataFrame that has duplicates, but the differences between them are minor.

The only differentiator is a period at the end.

Header A
First
First.

I just want to drop the duplicate row that does not have a period.

2 Answers


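First sort by Header A, then strip the trailing . and keep only the last row of each duplicated value with Series.duplicated. After sorting, a value with a period comes after the same value without one, so keep='last' keeps the version with the period: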

print (df)
  Header A
0   First.
1    First
2   First.
3  Second.
4   Second
5    Third
6    Third


df1 = df.sort_values('Header A')
df1 = df1[~df1['Header A'].str.rstrip('.').duplicated(keep='last')]
print (df1)
  Header A
2   First.
3  Second.
6    Third

If you need to prioritize the values without the . instead, use the default keep='first':

df1 = df.sort_values('Header A')
df2 = df1[~df1['Header A'].str.rstrip('.').duplicated()]
print (df2)
  Header A
1    First
4   Second
5    Third
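
As a variation on the approach above, here is a minimal sketch that makes the "has a period" priority explicit instead of relying on how a trailing period sorts. It rebuilds the sample frame from the printed output; the names key, has_period, keep_period and keep_plain are illustrative assumptions, not part of the answer.

```python
import pandas as pd

# Sample data reconstructed from the printed frame above.
df = pd.DataFrame(
    {"Header A": ["First.", "First", "First.", "Second.", "Second", "Third", "Third"]}
)

# Comparison key: the text with any trailing periods stripped.
key = df["Header A"].str.rstrip(".")
# True where the original value ends with a period.
has_period = df["Header A"].str.endswith(".")

# Within each key, rows without a period sort first (False < True).
ordered = df.assign(key=key, has_period=has_period).sort_values(["key", "has_period"])

# Keep the last row per key: prefers the version with a period.
keep_period = df.loc[ordered.drop_duplicates("key", keep="last").index]

# Keep the first row per key: prefers the version without a period.
keep_plain = df.loc[ordered.drop_duplicates("key", keep="first").index]

print(keep_period)
print(keep_plain)
```

Both results contain one row per key. Note that rstrip('.') removes every trailing period, so "First.." would also be treated as a duplicate of "First".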
jezrael

Or try loc with a boolean mask. This selects the duplicated rows that do not have a period:

>>> x = df['Header A'].str.split('.', expand=True)
>>> df.loc[x[0].duplicated(keep=False) & x[1].isna()]
  Header A
0    First
>>> 
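
The loc expression above returns the rows without a period rather than removing them. A minimal sketch of the removal step follows, assuming the goal is to drop a period-less row only when a version with a period exists. It is a variation, not the answerer's code, and the names base, has_period, group_has_period and redundant are illustrative.

```python
import pandas as pd

# Sample data assumed for illustration (same values as the first answer's frame).
df = pd.DataFrame(
    {"Header A": ["First.", "First", "First.", "Second.", "Second", "Third", "Third"]}
)

# Text with trailing periods stripped, used to group near-duplicates.
base = df["Header A"].str.rstrip(".")
# True where the original value ends with a period.
has_period = df["Header A"].str.endswith(".")

# True for every row whose group contains at least one value with a period.
group_has_period = has_period.groupby(base).transform("any")

# A row is redundant when it lacks a period but its group has a version with one.
redundant = ~has_period & group_has_period

# Invert the mask to drop the redundant rows.
result = df.loc[~redundant]
print(result)
```

This only removes the period-less variants. Exact duplicates such as the two "Third" rows or the two "First." rows remain, so follow it with drop_duplicates() if those should go as well.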
U13-Forward