I'm working on a student data analysis project and I want to find all rows in a data frame that are duplicates of each other, except that one specific column may differ, e.g.
Id | Name | Surname | Job | Wage |
---|---|---|---|---|
1 | John | Black | Artist | 1200 |
2 | Adam | Smith | Artist | 1400 |
3 | John | Black | Artist | 1900 |
4 | John | Black | Driver | 1200 |
5 | Adam | Smith | Artist | 1400 |
6 | Adam | Black | Driver | 1200 |
Now I'd like to get the people who share the same name, surname and job, whether their wage is different or the same. The result should look like this:
Id | Name | Surname | Job | Wage |
---|---|---|---|---|
1 | John | Black | Artist | 1200 |
3 | John | Black | Artist | 1900 |
2 | Adam | Smith | Artist | 1400 |
5 | Adam | Smith | Artist | 1400 |
(This is only sample data; the real data has many more rows and columns.) How can I get this? I've tried code like this:
names = df['Name'].value_counts()
surnames = df['Surname'].value_counts()
jobs = df['Job'].value_counts()
wages = df['Wage'].value_counts()
for i in names:
    for j in surnames:
        for k in jobs:
            if (df['Name'] == i and df['Surname'] == j and df['Job'] == k):
                print("something")
but I still get an error:
f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I've also tried a lambda expression:
for i in names:
    for j in surnames:
        for k in jobs:
            persons = df.apply(lambda x: print(x) if x['Name'] == i and x['Surname'] == j and x['Job'] == k else False, axis=1)
            print(persons)
But this only gives me pairs of an id and a true or false value. How can I fix this, or what should I do instead? Thank you in advance.
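
For reference, here is a minimal sketch of one possible approach, assuming the data is in a pandas DataFrame with the columns shown in the sample table. The names `key_columns`, `mask`, `order`, `result` and `john_artist` are illustrative, not from the original code. It uses `DataFrame.duplicated` with `keep=False` to flag every row whose name, surname and job appear more than once, ignoring `Wage`. It also shows that the `ValueError` above comes from using Python's `and` on whole Series; element-wise comparisons need `&` with parentheses around each condition.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["John", "Adam", "John", "John", "Adam", "Adam"],
    "Surname": ["Black", "Smith", "Black", "Black", "Smith", "Black"],
    "Job": ["Artist", "Artist", "Artist", "Driver", "Artist", "Driver"],
    "Wage": [1200, 1400, 1900, 1200, 1400, 1200],
})

# Columns that must match for two rows to count as "the same person and job".
key_columns = ["Name", "Surname", "Job"]

# keep=False marks every member of a repeated group, not only the later copies.
mask = df.duplicated(subset=key_columns, keep=False)
result = df[mask]

# Optional: place rows of the same group next to each other, with groups
# ordered by first appearance (sort=False), as in the expected output.
order = result.groupby(key_columns, sort=False).ngroup()
result = result.loc[order.sort_values(kind="stable").index]

print(result)

# Element-wise filtering for a single combination: use & (not `and`)
# and wrap each comparison in parentheses.
john_artist = df[(df["Name"] == "John")
                 & (df["Surname"] == "Black")
                 & (df["Job"] == "Artist")]
```

If the grouped ordering is not needed, `df[df.duplicated(subset=key_columns, keep=False)]` on its own returns the same rows in their original order.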