1

I am adding entries to an existing dataframe, where they can be new or updates to existing in the dataframe. Older and outdated entries will be deleted from the dataframe by using Pandas drop_duplicates, which worked as expected in Jupyter notebook:

df.drop_duplicates(keep = 'last', inplace = True, subset = ['col_a', 'col_b', ...])

However, when I transferred the codes over to a python script and ran it via the CLI, the duplicated/outdated entries were not deleted.

So I created another python script to load this dataframe with the duplicated/outdated entries, and perform the drop_duplicates code above. In this case, the code worked.

Any idea what's going on here? Thanks!

Update with MVCE:

I think I know what went wrong. This is a sample code to illustrate:

from datetime import datetime
import time
import pandas as pd

# sample dataframe
a = pd.DataFrame({
    'date': ['2019-12-01', '2019W48', '2019-12-01', '2019-12-02'],
    'value': [1, 2, 2, 3]
})

print('original df:')
print(a)

print('data type of dates of original df')
for i in range(0, len(a)):
    print('{} - type = {}'.format(a.iloc[i,:]['date'], type(a.iloc[i,:]['date'])))

new_entry = [datetime.strptime("2019-12-02", '%Y-%m-%d').date(), 4]
a.loc[len(a)] = new_entry
print('updated df')
print(a)

print('data type of dates of updated df')
for i in range(0, len(a)):
    print('{} - type = {}'.format(a.iloc[i,:]['date'], type(a.iloc[i,:]['date'])))

print('try to drop duplicates again')
b = a.drop_duplicates(keep = 'last', subset = ['date'], inplace = False)
print(b)

Results from CLI:

original df:
         date  value
0  2019-12-01      1
1     2019W48      2
2  2019-12-01      2
3  2019-12-02      3

data type of dates of original df
2019-12-01 - type = <class 'str'>
2019W48 - type = <class 'str'>
2019-12-01 - type = <class 'str'>
2019-12-02 - type = <class 'str'>

updated df
         date  value
0  2019-12-01      1
1     2019W48      2
2  2019-12-01      2
3  2019-12-02      3
4  2019-12-02      4
data type of dates of updated df
2019-12-01 - type = <class 'str'>
2019W48 - type = <class 'str'>
2019-12-01 - type = <class 'str'>
2019-12-02 - type = <class 'str'>
2019-12-02 - type = <class 'datetime.date'>

try to drop duplicates again
         date  value
1     2019W48      2
2  2019-12-01      2
3  2019-12-02      3
4  2019-12-02      4

Seems that date type needs to be the same in order for drop_duplicates to work properly. So the next step is to convert the 'date' column to string right?

griffinleow
  • 93
  • 12
  • 1
    Double check if you have the same version of pandas in your environments – Paul Lo Dec 11 '19 at 03:45
  • Please show us a [mcve](https://stackoverflow.com/help/minimal-reproducible-example) of what you ran via the CLI. – Kent Shikama Dec 11 '19 at 03:46
  • @PaulLo same version of pandas used in both environments. – griffinleow Dec 11 '19 at 06:12
  • @KentShikama I have included a mcve. But still doesn't explain why I could drop duplicated rows when I reload the same dataframe into another script... – griffinleow Dec 11 '19 at 06:26
  • How are you reloading it in both cases? Are you using something like `pd.read_csv` to read from a file? – Kent Shikama Dec 11 '19 at 06:30
  • @KentShikama yeah pd.read_csv. something like this: `df = pd.read_csv(path, header = 0)` `df = df.drop_duplicates(keep = 'last', inplace = False, subset = [some columns])` `df.to_csv(path, index = False)` Yup always loading and overwriting to the same .csv file – griffinleow Dec 11 '19 at 06:37
  • And you used the same file? – Kent Shikama Dec 11 '19 at 06:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204028/discussion-between-griffinleow-and-kent-shikama). – griffinleow Dec 11 '19 at 06:44

1 Answers1

1

Thanks to Kent Shikama for a brief dicussion.

Based on my findings, it appears that drop_duplicates did not work because of mixed data types in one of the columns, which in this case, was a 'date' column. After a calculation, there are both string and datetime entries in the column. Hence, drop_duplicates did not work immediately on it.

Upon reloading the same dataframe, Python seems to does a coercion by converting datetime entries to string automatically (can someone confirm this?). Thus, this was why drop_duplicates worked after a reloading.

griffinleow
  • 93
  • 12