1

I'm developing a student project about data analysis and I want to find all of the duplicates in the data frame, but with one specific cell changed e.g.

Id Name Surname Job Wage
1 John Black Artist 1200
2 Adam Smith Artist 1400
3 John Black Artist 1900
4 John Black Driver 1200
5 Adam Smith Artist 1400
6 Adam Black Driver 1200

and now I'd like to receive person with the same name, surname and job but with different salary or the same. It should look like this:

Id Name Surname Job Wage
1 John Black Artist 1200
3 John Black Artist 1900
2 Adam Smith Artist 1400
5 Adam Smith Artist 1400

(It's only simple data, I've got much, much more rows and columns). How could I get this? I've tried with code like this:

names=df['Name'].value_counts()
surnames=df['Surname'].value_counts()
jobs=df['Job'].value_counts()
wages=df['Wage'].value_counts()
for i in names:
    for j in surnames:
       for k in jobs:
            if (df['Name'] == i and df['Surname'] == j and df['Job'] == k):
                  print ("something")

but I still have an error:

f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

also I've tried with lambda expression:

for i in names:
    for j in surnames:
       for k in jobs:
           persons= df.apply(lambda x: print (x) if x['Name'] == i and x['Surname'] == j and x['Job'] == l else False, axis=1)

print(persons)

But I get pairs of id and value true or false. How could I repair it? Or what should I do? Thank you in advice

Ismail Durmaz
  • 2,521
  • 1
  • 6
  • 19
alqueen
  • 241
  • 1
  • 9
  • What if 3rd row was John Black Driver 1200 and 4th row as John Black Artist 1900. Do you want John Black Driver 1200 to be part of the result set and exclude John Black Artist 1900 ? – Joe Ferndz Jan 10 '21 at 02:54
  • 2
    Does this answer your question? [How to select duplicate rows with pandas?](https://stackoverflow.com/questions/41042996/how-to-select-duplicate-rows-with-pandas). This was already asked. Please review the link for more details and options. – Joe Ferndz Jan 10 '21 at 05:32

2 Answers2

2

You can use only pandas functions to filter dublicated records

# group by columns
df_group = df.groupby(['Name', 'Surname', 'Job'])

# get group index size
df_group_size = df_group.size()

# filter on groups, get only duplications
df_group_more = df_group_size[df_group_size > 1]

# filter on main dataframe
filtered = df[df.apply(lambda r:(r['Name'], r['Surname'], r['Job']) in df_group_more.index, axis=1)]

print(filtered)

Output

   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
4   5  Adam   Smith  Artist  1400
Ismail Durmaz
  • 2,521
  • 1
  • 6
  • 19
2

To get all the duplicate records and non-duplicate records, you can use Series.duplicated with parameter keep=False. It will result in a boolean index. You can then use the result to select rows you need.

The single line answer for your question is:

df[(df[['Name','Surname','Job']].duplicated(keep=False))]

The output is:

   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
4   5  Adam   Smith  Artist  1400

Here's how it works:

For a dataframe as shown below:

>>> df
   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
3   4  John   Black  Driver  1200
4   5  Adam   Smith  Artist  1400
5   6  Adam   Black  Driver  1200

Searching for duplicates of Name, Surname, and Job will result in:

>>> df[['Name','Surname','Job']].duplicated(keep=False)
0     True
1     True
2     True
3    False
4     True
5    False
dtype: bool

With this boolean index, you can get all records that are True and False.

To get all the duplicates, you can give this command:

>>> df[(df[['Name','Surname','Job']].duplicated(keep=False))]
   Id  Name Surname     Job  Wage
0   1  John   Black  Artist  1200
1   2  Adam   Smith  Artist  1400
2   3  John   Black  Artist  1900
4   5  Adam   Smith  Artist  1400

To get all the non-duplicates, you can give this command. The ~ will negate and give you all the values that do not meet this criteria.

>>> df[~(df[['Name','Surname','Job']].duplicated(keep=False))]
   Id  Name Surname     Job  Wage
3   4  John   Black  Driver  1200
5   6  Adam   Black  Driver  1200
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33