In Python 3 with pandas, I need to remove duplicate rows from a DataFrame based on repeated values in one column. For this I used:

consolidado = df_processos.drop_duplicates(['numero_unico'], keep='last')

The column "numero_unico" has codes in string format like 0029126-45.2019.1.00.0000, 0026497-98.2019.1.00.0000, 0027274-83.2019.1.00.0000...

So the command above keeps only the last occurrence of each code.

But the column does not always contain codes. Several rows contain the text "Sem número único" ("No unique number").

I want to keep all the rows where this exception appears, but with the command above the resulting DataFrame keeps only the last occurrence of "Sem número único".

Does anyone know how to use drop_duplicates with one exception?
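
A minimal reproduction of the behavior described above. The sample data and the "valor" column are made up for illustration; only "numero_unico" and the drop_duplicates call come from the question.

```python
import pandas as pd

# Hypothetical sample data; the real df_processos is not shown in the question.
df_processos = pd.DataFrame({
    "numero_unico": [
        "0029126-45.2019.1.00.0000",
        "Sem número único",
        "0029126-45.2019.1.00.0000",
        "Sem número único",
    ],
    "valor": [1, 2, 3, 4],  # assumed extra column, for illustration only
})

consolidado = df_processos.drop_duplicates(["numero_unico"], keep="last")

# Both "Sem número único" rows are wanted, but only the last one survives.
print(consolidado)
```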

Reinaldo Chaves
  • Remove the rows that have "Sem número único", deduplicate the rest of the original df, then concatenate the "Sem número único" rows back in. – Gabriel Oct 09 '19 at 19:22

3 Answers

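Example from my comment on the question:

import pandas

# 'snu' stands in for "Sem número único"
df = pandas.DataFrame({
    'a': ['snu', 'snu', '002', '002', '003', '003'],
    'b': [1, 2, 2, 1, 5, 6]
})
# Keep every 'snu' row; deduplicate only the remaining rows
df_dedupe = pandas.concat([
    df[df['a']=='snu'],
    df[df['a']!='snu'].drop_duplicates(['a'], keep='last')
])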
Gabriel

Similar to the other answers, but in one multi-line command using the duplicated() method:

consolidado = df_processos[
    # Parentheses are required: | binds more tightly than ==
    (df_processos['numero_unico'] == "Sem número único")
    | ~df_processos.duplicated(subset='numero_unico', keep='last')
]
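
To see how the boolean mask behaves, here is a small sketch on made-up data. The codes "A" and "B" and the "valor" column are placeholders, not values from the question. Note that this approach keeps the rows in their original order.

```python
import pandas as pd

# Hypothetical data: "A" and "B" stand in for real process codes.
df = pd.DataFrame({
    "numero_unico": ["Sem número único", "A", "Sem número único", "A", "B"],
    "valor": [1, 2, 3, 4, 5],
})

# True for every row that must always be kept
is_exception = df["numero_unico"] == "Sem número único"
# True for every row that has a later row with the same numero_unico
is_repeat = df.duplicated(subset="numero_unico", keep="last")

# Keep a row if it is an exception OR it is the last occurrence of its code
resultado = df[is_exception | ~is_repeat]
print(resultado)
```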

ymzkala

There's no parameter in pandas' drop_duplicates for this, but you can work around it by splitting the DataFrame into two parts (rows with "Sem número único" and rows without), deduplicating the second part, and concatenating the two back together. Like so:

import pandas as pd

tmp_df1 = df_processos[df_processos['numero_unico'] == "Sem número único"]
tmp_df2 = df_processos[df_processos['numero_unico'] != "Sem número único"]
tmp_df2 = tmp_df2.drop_duplicates(['numero_unico'], keep='last')
new_df = pd.concat([tmp_df1, tmp_df2])
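
If this is needed in more than one place, the same split-and-concat idea could be wrapped in a small helper. This is only a sketch: the function name drop_duplicates_except is hypothetical (it is not part of pandas), the sample data is made up, and the final sort_index() assumes the DataFrame has a unique index in ascending order (such as the default one), so that sorting restores the original row order.

```python
import pandas as pd

def drop_duplicates_except(df, column, exception, keep="last"):
    """Deduplicate on `column`, but keep every row whose value equals `exception`.

    Hypothetical helper, not a pandas API.
    """
    is_exception = df[column] == exception
    deduped = df[~is_exception].drop_duplicates([column], keep=keep)
    # concat puts the exception rows first; sort_index() puts rows back in
    # index order (assumes a unique, ascending index)
    return pd.concat([df[is_exception], deduped]).sort_index()

# Made-up data: "X" stands in for a real process code.
df = pd.DataFrame({
    "numero_unico": ["X", "Sem número único", "X", "Sem número único"],
    "valor": [1, 2, 3, 4],
})
resultado = drop_duplicates_except(df, "numero_unico", "Sem número único")
print(resultado)
```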
Ian