
If I had the following data:

+---------------+---------------------+---------------------+----------+--------------+
| email         | date_opened         | order_date          | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ...           | ...                 | ...                 | ...      | ...          |
+---------------+---------------------+---------------------+----------+--------------+

How would I keep a single order_date, order_id, and product_name against the earliest (minimum) date_opened per email, and replace the order_date, order_id, and product_name on every other row for that email with NaN?
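For reproducibility, here is a minimal snippet that rebuilds the sample data above as a DataFrame (values copied straight from the table):

import pandas as pd
import numpy as np

# Sample data from the table above; None/NaN mark the rows with no order.
df = pd.DataFrame({
    'email': ['abc@email.com'] * 2 + ['def@email.com'] * 4 + ['ghi@email.com'] * 3,
    'date_opened': pd.to_datetime([
        '2019-01-01 10:20:12', '2019-01-01 10:45:09',
        '2019-02-11 08:13:46', '2019-02-11 08:15:20',
        '2019-02-11 08:24:43', '2019-02-12 00:39:21',
        '2018-08-09 01:24:54', '2018-08-10 15:22:34', '2018-08-10 00:12:14',
    ]),
    'order_date': pd.to_datetime(['2019-01-03 09:21:43'] * 2 + [None] * 4
                                 + ['2018-08-10 11:12:14'] * 3),
    'order_id': [1234, 1234] + [np.nan] * 4 + [5678, 5678, 5678],
    'product_name': ['xyz', 'xyz'] + [np.nan] * 4 + ['zyx', 'zyx', 'zyx'],
})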

Code:

import pandas as pd
import numpy as np
import psycopg2
import pyodbc

# Email opens: parse date_opened as datetime, index by email.
dwh_conn = psycopg2.connect(...)
dm_query = ...
dm = pd.read_sql(dm_query, dwh_conn, parse_dates=['date_opened'], index_col='email')

# Orders: parse order_date as datetime, index by email.
dfdev_conn = pyodbc.connect(...)
bkgs_query = ...
bkgs = pd.read_sql(bkgs_query, dfdev_conn, parse_dates=['order_date'], index_col='email')

# Left-join the orders onto the opens by email, then compute the gap
# between open and order in (fractional) days.
dm_bkgs = pd.merge(dm, bkgs, how='left', left_index=True, right_index=True)
dm_bkgs['diff_days'] = (dm_bkgs['date_opened'] - dm_bkgs['order_date']) / np.timedelta64(1, 'D')

dm_bkgs.index.name = 'email'
dm_bkgs.sort_values(by=['email', 'diff_days'], inplace=True)

# Blank the order columns on every row whose order_id has already appeared.
# order_id must be overwritten last, since the duplicated() mask is built from it.
dm_bkgs['order_date'] = np.where(dm_bkgs.duplicated('order_id'), np.nan, dm_bkgs['order_date'])
dm_bkgs['product_name'] = np.where(dm_bkgs.duplicated('order_id'), np.nan, dm_bkgs['product_name'])
dm_bkgs['diff_days'] = np.where(dm_bkgs.duplicated('order_id'), np.nan, dm_bkgs['diff_days'])
dm_bkgs['order_id'] = np.where(dm_bkgs.duplicated('order_id'), np.nan, dm_bkgs['order_id'])

My code somewhat works, but I've noticed that the dm dataframe has 1433 rows, and after the merge/join the row count goes up to 1448. I'm not sure why, as the bkgs dataframe alone has no duplicate rows...
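A left join grows whenever the right side repeats a join key: each dm row is matched against every bkgs row sharing its email, so bkgs can be free of duplicate rows and still fan the join out if two distinct orders share an email. To check whether that is happening here, a quick sketch (assuming both frames are indexed by email, as above):

# How many bkgs index values (emails) occur more than once?
print(bkgs.index.duplicated().sum())

# Which emails repeat, and how often?
print(bkgs.index.value_counts().head())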

I also feel as though the code is a bit messy...

Expecting:

+---------------+---------------------+---------------------+----------+--------------+
| email         | date_opened         | order_date          | order_id | product_name |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:20:12 | 2019-01-03 09:21:43 | 1234     | xyz          |
+---------------+---------------------+---------------------+----------+--------------+
| abc@email.com | 2019-01-01 10:45:09 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:13:46 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:15:20 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-11 08:24:43 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| def@email.com | 2019-02-12 00:39:21 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-09 01:24:54 | 2018-08-10 11:12:14 | 5678     | zyx          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 15:22:34 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ghi@email.com | 2018-08-10 00:12:14 | NaN                 | NaN      | NaN          |
+---------------+---------------------+---------------------+----------+--------------+
| ...           | ...                 | ...                 | ...      | ...          |
+---------------+---------------------+---------------------+----------+--------------+
AK91
  • Not sure what you're trying to achieve exactly. Could you add an example dataframe, just like the data you already posted, but then with the expected output? – Erfan May 08 '19 at 15:32

1 Answer


How about:

# Mask of every row whose order_id has already appeared.
duplicated = dm_bkgs.duplicated('order_id')

# Blank out all three order columns on those rows in a single assignment.
dm_bkgs.loc[duplicated, ['order_date', 'order_id', 'product_name']] = np.nan

It's essentially what you did, in generic form.
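One caveat: duplicated() keeps the first occurrence it encounters, so for the "earliest date_opened per email" requirement the frame should be sorted before masking. A sketch under that assumption (email still the index, as in the question's code, and order_ids unique to one email):

# Put the earliest open per email first; duplicated() then keeps exactly
# that row's order columns. Rows whose order_id is already NaN get masked
# too, which is harmless since their order columns are already NaN.
dm_bkgs = dm_bkgs.sort_values(by=['email', 'date_opened'])

duplicated = dm_bkgs.duplicated('order_id')
dm_bkgs.loc[duplicated, ['order_date', 'order_id', 'product_name']] = np.nan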

Quang Hoang