
I have a large dataframe from a csv file with a few dozen columns. I concatenated a second csv file to the original. The second file has exactly the same structure, but a particular column may have incorrect values. I want to delete the duplicate rows that have the wrong value in this one column. For example, in the data below, the last row should be removed. (The names of the specimens (Albert, etc.) are unique.) I have been struggling to find a way to delete only the rows with the wrong value, without risking deleting the correct ones.

0      Albert       alive
1      Newton       alive
2      Galileo      alive
3      Copernicus   dead
4      Galileo      dead
...

Any help would be greatly appreciated!

darzan
  • how are you defining it as being 'wrong'? – Emi OB May 23 '22 at 15:13
  • @Emi OB If the first file says that Galileo is alive (for example) and the second says he's dead, delete the row that says he's dead. If the newly added rows don't agree with the old ones, they should not make it into the final file. – darzan May 23 '22 at 15:19
  • If the names are unique, doesn't simply `df.drop_duplicates('name_column')` do? – fsimonjetz May 23 '22 at 15:23
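
A minimal sketch of that `drop_duplicates` suggestion, assuming the original file's rows come first after concatenation and assuming the columns are named `name` and `status` (the question doesn't show headers):

import pandas as pd

# Concatenated frame; the original file's rows come first, so the
# conflicting "Galileo dead" row sits below "Galileo alive".
df = pd.DataFrame({
    "name": ["Albert", "Newton", "Galileo", "Copernicus", "Galileo"],
    "status": ["alive", "alive", "alive", "dead", "dead"],
})

# keep='first' (the default) retains the earlier, original-file row
# for each name and drops the later, possibly wrong one.
deduped = df.drop_duplicates("name", keep="first")

This relies on the concatenation order: because the trusted file was loaded first, keeping the first occurrence always keeps the trusted value.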

1 Answer


You could use this to determine whether a name appears more than once:

df['RN'] = df.groupby(['Name']).cumcount() + 1

You can also add more columns to the `groupby` if you want rows to count as duplicates only when several columns match:

df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1

The advantage I like with this is that it gives you more control over the selection: `df.loc[df['RN'] == 1]` keeps only the first occurrence of each name, and you can just as easily inspect heavier duplication with `df.loc[df['RN'] > 2]`.
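
Putting it together on the question's example (a sketch; the column names `name` and `status` are assumed, since the question doesn't show headers):

import pandas as pd

df = pd.DataFrame({
    "name": ["Albert", "Newton", "Galileo", "Copernicus", "Galileo"],
    "status": ["alive", "alive", "alive", "dead", "dead"],
})

# Running count per name: 1 for the first occurrence, 2 for the second, ...
df["RN"] = df.groupby(["name"]).cumcount() + 1

# Keep only each name's first occurrence (the original file's row, since
# it was concatenated first), then drop the helper column.
result = df.loc[df["RN"] == 1].drop(columns="RN")
print(result)

This keeps Albert, Newton, Copernicus, and the first (alive) Galileo row, and drops the conflicting second Galileo row.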

ArchAngelPwn