Data-frame to remove consecutive duplicated rows with same contents across columns

Question

Data-frame below and I want to drop the consecutive duplicated rows, when 'People', 'Year' and 'Project' are the same.

If the original data-frame like below, rows with the same 'People','Year','Project' when consecutive, are to be removed.

data = {'People' : ["David","David","David","David","John","John","John"],
'Year': ["2016","2016","2017","2016","2016","2017","2017",],
'Project' : ["TN","TN","TN","TN","DJ","DM","DM"],
'Earning' : [878,682,767,620,964,610,772]}

I tried this but it doesn't work:

df_1 = df.loc[(df['People', 'Year', 'Project'].shift() != df['People', 'Year', 'Project'])]

attempt - this line removes the non-consecutive "David, 2016, TN, 620"

df_1 = df.drop_duplicates(subset=['People','Year','Project'])

when changed to this, it keeps all the rows:

df_1 = df.drop_duplicates(subset=['People','Year','Project', 'Earning'])

What's the right way to do it? Thank you!

Use `df = df.drop_duplicates()` or `df = df.drop_duplicates(subset=['People', 'Year', 'Project'])` — jezrael, Feb 07 '20 at 10:08
@jezrael, thank you, sir! I will delete this question, if no more question after re-think! wish you a good weekend! — Mark K, Feb 07 '20 at 10:19
@jezrael, good morning sir! would you mind I edit the question? there's a scenario I want to explore. thank you. — Mark K, Feb 10 '20 at 02:05

score 1 · Accepted Answer · answered Feb 10 '20 at 06:19

You can compare DataFrame.shifted values for not equal and then test at least one True per rows by DataFrame.any with boolean indexing:

cols = ['People','Year','Project']
df_1 = df[df[cols].ne(df[cols].shift()).any(axis=1)]
print (df_1)
  People  Year Project  Earning
0  David  2016      TN      878
2  David  2017      TN      767
3  David  2016      TN      620
4   John  2016      DJ      964
5   John  2017      DM      610

Details:

print (df[cols].ne(df[cols].shift()))
   People   Year  Project
0    True   True     True
1   False  False    False
2   False   True    False
3   False   True    False
4    True  False     True
5   False   True     True
6   False  False    False

print (df[cols].ne(df[cols].shift()).any(axis=1))
0     True
1    False
2     True
3     True
4     True
5     True
6    False
dtype: bool

thank you, sir! you demo the pandas magic again! much appreciate your knowledge sharing and the help! — Mark K, Feb 10 '20 at 06:37

Data-frame to remove consecutive duplicated rows with same contents across columns

1 Answers1