If I had the following data:
+---------------+---------------------+---------------------+----------+--------------+
| email | date_opened | order_date | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ... | ... | ... | ... | ... |
+---------------+---------------------+---------------------+----------+--------------+
How would I keep a single order_date, order_id and product_name against the earliest (minimum) date_opened per email, and replace all the other duplicated order_date, order_id and product_name values with NaNs?
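For reference, here is the sample above as a copy-pasteable frame (values taken straight from the table):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'email': ['abc@email.com'] * 2 + ['def@email.com'] * 4 + ['ghi@email.com'] * 3,
    'date_opened': pd.to_datetime([
        '2019-01-01 10:20:12', '2019-01-01 10:45:09',
        '2019-02-11 08:13:46', '2019-02-11 08:15:20',
        '2019-02-11 08:24:43', '2019-02-12 00:39:21',
        '2018-08-09 01:24:54', '2018-08-10 15:22:34', '2018-08-10 00:12:14',
    ]),
    'order_date': pd.to_datetime(
        ['2019-01-03 09:21:43'] * 2 + [None] * 4 + ['2018-08-10 11:12:14'] * 3),
    'order_id': [1234, 1234] + [np.nan] * 4 + [5678, 5678, 5678],
    'product_name': ['xyz'] * 2 + [None] * 4 + ['zyx'] * 3,
})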
Code:
import pandas as pd
import numpy as np
import psycopg2
import pyodbc

# Pull the two source tables, both indexed on email
dwh_conn = psycopg2.connect(...)
dm_query = ...
dm = pd.read_sql(dm_query, dwh_conn, parse_dates='date_opened', index_col='email')

dfdev_conn = pyodbc.connect(...)
bkgs_query = ...
bkgs = pd.read_sql(bkgs_query, dfdev_conn, parse_dates='order_date', index_col='email')
# Left-join bkgs onto dm via the shared email index
dm_bkgs = pd.merge(dm, bkgs, how='left', left_index=True, right_index=True)

# Days between date_opened and order_date
dm_bkgs['diff_days'] = (dm_bkgs['date_opened'] - dm_bkgs['order_date']) / np.timedelta64(1, 'D')

dm_bkgs.index.name = 'email'
dm_bkgs.sort_values(by=['email', 'diff_days'], inplace=True)

# Keep the first occurrence of each order_id and blank out the rest
dup = dm_bkgs.duplicated('order_id')
dm_bkgs['order_date'] = np.where(dup, np.nan, dm_bkgs['order_date'])
dm_bkgs['product_name'] = np.where(dup, np.nan, dm_bkgs['product_name'])
dm_bkgs['diff_days'] = np.where(dup, np.nan, dm_bkgs['diff_days'])
dm_bkgs['order_id'] = np.where(dup, np.nan, dm_bkgs['order_id'])
My code somewhat works, but I've noticed that the dm dataframe has 1433 rows, and after the merge/join the count goes up to 1448. Not sure why that is, as the bkgs dataframe alone has no duplicates...
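Could the extra rows come from repeated email keys rather than fully duplicated rows? A toy demonstration with hypothetical data of how a merge multiplies rows when a key repeats on both sides:
import pandas as pd

left = pd.DataFrame({'v': [1, 2]}, index=['a@email.com', 'a@email.com'])
right = pd.DataFrame({'w': [10, 20]}, index=['a@email.com', 'a@email.com'])

# Each left row matches every right row sharing the key: 2 x 2 = 4 rows
merged = pd.merge(left, right, how='left', left_index=True, right_index=True)
print(len(merged))  # 4

# A quick check on the real data: False would explain the row increase
print(bkgs.index.is_unique)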
Also, the code feels a bit messy; a tidier sketch of the masking step follows the expected output below...
Expecting:
+---------------+---------------------+---------------------+----------+--------------+
| email | date_opened | order_date | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234 | xyz |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678 | zyx |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | NaN | NaN | NaN |
+---------------+---------------------+---------------------+----------+--------------+
| ... | ... | ... | ... | ... |
+---------------+---------------------+---------------------+----------+--------------+
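As for the messy masking step, one possible tidier sketch (untested against the real data; it assumes the dm_bkgs frame built above, sorts on date_opened to match the "earliest date_opened" requirement, and blanks duplicates only within the same email):
import numpy as np

cols = ['order_date', 'order_id', 'product_name', 'diff_days']

# Earliest date_opened first within each email
dm_bkgs = dm_bkgs.sort_values(by=['email', 'date_opened'])

# True for every row after the first occurrence of an (email, order_id) pair
dup = dm_bkgs.reset_index().duplicated(['email', 'order_id']).to_numpy()

# Blank all four columns in one assignment
dm_bkgs.loc[dup, cols] = np.nan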