I am adding entries to an existing dataframe, where they can be new or updates to existing in the dataframe. Older and outdated entries will be deleted from the dataframe by using Pandas drop_duplicates, which worked as expected in Jupyter notebook:
df.drop_duplicates(keep = 'last', inplace = True, subset = ['col_a', 'col_b', ...])
However, when I transferred the codes over to a python script and ran it via the CLI, the duplicated/outdated entries were not deleted.
So I created another python script to load this dataframe with the duplicated/outdated entries, and perform the drop_duplicates code above. In this case, the code worked.
Any idea what's going on here? Thanks!
Update with MVCE:
I think I know what went wrong. This is a sample code to illustrate:
from datetime import datetime
import time
import pandas as pd
# sample dataframe
a = pd.DataFrame({
'date': ['2019-12-01', '2019W48', '2019-12-01', '2019-12-02'],
'value': [1, 2, 2, 3]
})
print('original df:')
print(a)
print('data type of dates of original df')
for i in range(0, len(a)):
print('{} - type = {}'.format(a.iloc[i,:]['date'], type(a.iloc[i,:]['date'])))
new_entry = [datetime.strptime("2019-12-02", '%Y-%m-%d').date(), 4]
a.loc[len(a)] = new_entry
print('updated df')
print(a)
print('data type of dates of updated df')
for i in range(0, len(a)):
print('{} - type = {}'.format(a.iloc[i,:]['date'], type(a.iloc[i,:]['date'])))
print('try to drop duplicates again')
b = a.drop_duplicates(keep = 'last', subset = ['date'], inplace = False)
print(b)
Results from CLI:
original df:
date value
0 2019-12-01 1
1 2019W48 2
2 2019-12-01 2
3 2019-12-02 3
data type of dates of original df
2019-12-01 - type = <class 'str'>
2019W48 - type = <class 'str'>
2019-12-01 - type = <class 'str'>
2019-12-02 - type = <class 'str'>
updated df
date value
0 2019-12-01 1
1 2019W48 2
2 2019-12-01 2
3 2019-12-02 3
4 2019-12-02 4
data type of dates of updated df
2019-12-01 - type = <class 'str'>
2019W48 - type = <class 'str'>
2019-12-01 - type = <class 'str'>
2019-12-02 - type = <class 'str'>
2019-12-02 - type = <class 'datetime.date'>
try to drop duplicates again
date value
1 2019W48 2
2 2019-12-01 2
3 2019-12-02 3
4 2019-12-02 4
Seems that date type needs to be the same in order for drop_duplicates to work properly. So the next step is to convert the 'date' column to string right?