In pandas how to use drop_duplicates with one exception?

Question

In Python 3 with pandas, I need to remove rows from a DataFrame that have duplicate values in one column. For this I used:

consolidado = df_processos.drop_duplicates(['numero_unico'], keep='last')

The column "numero_unico" has codes in string format like 0029126-45.2019.1.00.0000, 0026497-98.2019.1.00.0000, 0027274-83.2019.1.00.0000...

So the above command keeps only the last occurrence of each code.

But the column does not always contain such codes. Several rows contain the text "Sem número único" ("No unique number") instead.

I want to keep every row that has this value, but with the above command the resulting DataFrame keeps only the last occurrence of "Sem número único".

Does anyone know how to use drop_duplicates with this one exception?
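
To make the problem concrete, here is a small reproduction. The sample rows and the "andamento" column are invented for illustration; only the column name "numero_unico" and the drop_duplicates call come from the question.

```python
import pandas as pd

# Hypothetical sample data; only "numero_unico" is taken from the question.
df_processos = pd.DataFrame({
    "numero_unico": [
        "0029126-45.2019.1.00.0000",
        "Sem número único",
        "0029126-45.2019.1.00.0000",
        "Sem número único",
        "0026497-98.2019.1.00.0000",
    ],
    "andamento": ["a", "b", "c", "d", "e"],
})

# The call from the question: every repeated value collapses to its last row,
# including the "Sem número único" placeholder.
consolidado = df_processos.drop_duplicates(["numero_unico"], keep="last")

# Number of placeholder rows that survive (two were present originally).
restantes = (consolidado["numero_unico"] == "Sem número único").sum()
print(consolidado)
print(restantes)
```

With this data, only one of the two "Sem número único" rows remains, which is the behaviour I want to avoid.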

Remove the rows that have "Sem número único", deduplicate the remaining rows, then concatenate the "Sem número único" rows back in – Gabriel, Oct 09 '19 at 19:22

score 2 · Accepted Answer · answered Oct 09 '19 at 19:27

Example from my comment on the question:

import pandas

df = pandas.DataFrame({
    'a': ['snu', 'snu', '002', '002', '003', '003'],
    'b': [1, 2, 2, 1, 5, 6]
})
# Keep every 'snu' row; deduplicate only the remaining rows.
df_dedupe = pandas.concat([
    df[df['a'] == 'snu'],
    df[df['a'] != 'snu'].drop_duplicates(['a'], keep='last')
])
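
One thing to be aware of: concat stacks the exception rows first and the deduplicated rows after them, so the original row order is lost when the exception rows are interleaved with the others. If the order matters, chaining sort_index() onto the result should restore it, assuming the index is still the default increasing one. A minimal sketch with made-up data:

```python
import pandas

# Made-up data where the 'snu' rows are interleaved with the other rows.
df = pandas.DataFrame({
    'a': ['002', 'snu', '002', 'snu', '003', '003'],
    'b': [1, 2, 3, 4, 5, 6],
})

is_exception = df['a'] == 'snu'

# Same split-and-concat idea as above.
unsorted_result = pandas.concat([
    df[is_exception],
    df[~is_exception].drop_duplicates(['a'], keep='last'),
])

# Sorting by the index puts the rows back in their original relative order.
df_dedupe = unsorted_result.sort_index()
print(df_dedupe)
```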

Gabriel

Good stuff thanks! I'd been trying to get this to work for ages – foakesm Sep 24 '20 at 12:06

score 2 · Answer 2 · answered Oct 09 '19 at 19:44

Similar to the other answers, but in one multi-line command using the duplicated() method:

# Parentheses are required: | binds more tightly than ==.
consolidado = df_processos[
    (df_processos['numero_unico'] == "Sem número único") |
    ~df_processos.duplicated(subset='numero_unico', keep='last')
]

score 1 · Answer 3 · answered Oct 09 '19 at 19:26

There's no parameter in pandas' drop_duplicates for this, but you can work around it by splitting the DataFrame into two parts (rows with "Sem número único" and rows without), deduplicating only the second part, and then concatenating the two back together. Like so:

import pandas as pd

tmp_df1 = df_processos[df_processos['numero_unico'] == "Sem número único"]
tmp_df2 = df_processos[df_processos['numero_unico'] != "Sem número único"]
tmp_df2 = tmp_df2.drop_duplicates(['numero_unico'], keep='last')
new_df = pd.concat([tmp_df1, tmp_df2])
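
If more than one placeholder value needs to be exempt, the same idea can be wrapped in a small function. This is only a sketch: drop_duplicates_except is a hypothetical helper name, and the "N/A" placeholder and sample rows are invented. It filters with a boolean mask instead of splitting and concatenating, so the rows stay in their original order.

```python
import pandas as pd


def drop_duplicates_except(df, column, exceptions, keep="last"):
    """Hypothetical helper: deduplicate on `column`, but keep every row
    whose value in `column` is listed in `exceptions`."""
    is_exception = df[column].isin(exceptions)
    is_duplicate = df.duplicated(subset=column, keep=keep)
    # A row survives if it is exempt or if it is not flagged as a duplicate.
    return df[is_exception | ~is_duplicate]


# Invented sample data for illustration.
df_processos = pd.DataFrame({
    "numero_unico": [
        "0029126-45.2019.1.00.0000",
        "Sem número único",
        "0029126-45.2019.1.00.0000",
        "N/A",
        "Sem número único",
        "N/A",
        "0026497-98.2019.1.00.0000",
    ],
})

new_df = drop_duplicates_except(
    df_processos, "numero_unico", ["Sem número único", "N/A"]
)
print(new_df)
```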
