Python Pandas Drop Consecutive Data Frames but Period (.) at the End is the Differentiator

Question

Hi, I have a section of my pandas DataFrame that contains near-duplicate rows.

The only differentiator is a period at the end.

Header A
First
First.

I just want to drop the row that has a duplicate that does not have a period.

I added a new data sample to my answer; what is the expected output? – jezrael Aug 26 '21 at 06:37

jezrael · Accepted Answer · 2021-08-26T06:47:01.733

First sort by Header A, then strip the trailing . and keep only the last row of each group of duplicates with Series.duplicated:

print (df)
  Header A
0   First.
1    First
2   First.
3  Second.
4   Second
5    Third
6    Third


df1 = df.sort_values('Header A')
df1 = df1[~df1['Header A'].str.rstrip('.').duplicated(keep='last')]
print (df1)
  Header A
2   First.
3  Second.
6    Third

If you need to prioritize the values without the trailing . instead:

df1 = df.sort_values('Header A')
df2 = df1[~df1['Header A'].str.rstrip('.').duplicated()]
print (df2)
  Header A
1    First
4   Second
5    Third
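
For anyone who wants to run both variants end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the printed output above; `kind="mergesort"` is an addition that is not in the original answer, used here on the assumption that you want ties to keep their original order, and the names `sorted_df`, `key`, `prefer_period` and `prefer_plain` are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the printed frame in the answer
df = pd.DataFrame(
    {"Header A": ["First.", "First", "First.", "Second.", "Second", "Third", "Third"]}
)

# A stable sort (assumption, not in the original answer) keeps equal
# values in their original relative order, so the result is deterministic
sorted_df = df.sort_values("Header A", kind="mergesort")

# Comparison key: the value with any trailing periods stripped
key = sorted_df["Header A"].str.rstrip(".")

# "First" sorts before "First.", so keeping the last duplicate
# keeps the variant with the period when one exists
prefer_period = sorted_df[~key.duplicated(keep="last")]

# Keeping the first duplicate keeps the variant without the period
prefer_plain = sorted_df[~key.duplicated(keep="first")]

print(prefer_period)
print(prefer_plain)
```

Note that both variants also collapse exact duplicates (the two `Third` rows here), which may or may not be what you want.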

score 0 · Answer 2 · answered Aug 26 '21 at 06:35

Or try loc:

>>> x = df['Header A'].str.split('.', expand=True)
>>> df.loc[x[0].duplicated(keep=False) & x[1].isna()]
  Header A
0    First
>>>
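
The mask above selects the period-less rows that have a duplicate. To actually drop those rows while leaving the original row order alone, one possible sketch is below. The sample data and the names `has_period`, `key`, `keys_with_period` and `to_drop` are assumptions for illustration, and it treats "duplicate" as "the same text also appears with a trailing period".

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({"Header A": ["First", "First.", "Second", "Third."]})

col = df["Header A"]
has_period = col.str.endswith(".")   # True for rows ending in "."
key = col.str.rstrip(".")            # value with trailing periods stripped

# Keys that occur at least once with a trailing period
keys_with_period = set(key[has_period])

# Drop a row only if it has no period and a period-ending twin exists
to_drop = ~has_period & key.isin(keys_with_period)
result = df.loc[~to_drop]

print(result)
```

Unlike the sort-based approach, this leaves exact duplicates untouched and keeps a period-less value that has no period-ending counterpart (`Second` in this sample).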

U13-Forward

But the OP needs: `I just want to drop the row that has a duplicate that does not have a period.` – jezrael Aug 26 '21 at 06:35
@jezrael Yes this does that :) – U13-Forward Aug 26 '21 at 06:36