0

The aim of below code is to find the number of hours a customer id spends online on a #particular date. I am unable to get the online hours for all of the #dates(i.e. 01/06/2017 to 21/06/2017). I have attached the images of o/p of current code, csv file of customer data and pings data:

import datetime
import math

final_train_df  = pd.DataFrame(columns=['id', 'date', 'hours'])
excluded_cust_ids = []
final_set = {}
train__df = pd.DataFrame()
for index, i in enumerate(customer_data['id'][:10].values.tolist()):
    print('Started for CustID '+ str(i) + " count is "+ str(index))
    train__df['timestamp'] = pings_data.loc[pings_data['id']==i]["timestamp"]
    train__df['date'] = pings_data.loc[pings_data['id']==i]["timestamp"].apply(lambda arr: 
    datetime.datetime.fromtimestamp(arr).strftime('%Y-%m-%d'))
    train__df['time'] = pings_data.loc[pings_data['id']==i]["timestamp"].apply(lambda arr: 
    datetime.datetime.fromtimestamp(arr).strftime('%H:%M:%S'))
    unique_dates = pd.unique(train__df['date']).tolist()
    print(unique_dates)
    if math.isnan(train__df['timestamp'].values.tolist()[0]):
        excluded_cust_ids.append(i)
    else:        
        for j in unique_dates:
            time_stamp = train__df.loc[train__df['date']==j].sort_values(by='time')
            end_time = time_stamp['time'].iloc[len(time_stamp)-1]
            start_time = time_stamp['time'].iloc[0]
            t1 = pd.to_datetime(start_time)
            t2 = pd.to_datetime(end_time)
            diff_time = round((pd.Timedelta(t2 - t1).seconds/(3600)),1)
            val = {'id':i, 'date':j, 'hours':diff_time}
            final_train_df = final_train_df.append({'id':i, 'date':j, 'hours':diff_time}, True)

print(final_train_df)
print(excluded_cust_ids)

final_train_df.to_excel("Final_train_df_excluding_cust_ids.xlsx")
print("completed for all customers except for non-active customers")
Parfait
  • 104,375
  • 17
  • 94
  • 125

2 Answers2

1

Please try this method. If you need date and timestamp in two different columns, we can do that as well.

df_cust = pd.read_excel("./cust.xlsx")
df_pings = pd.read_excel("./pings.xlsx")
df_cust.set_index("id")
df_pings.set_index("id")
new_df = pd.merge(df_cust, df_pings)
new_df['time'] = new_df['timestamp'].apply(lambda x: 
str(datetime.datetime.fromtimestamp(x) ))

O/p for the above code.

id      gender  age number_of_kids  timestamp   time
21      MALE    26  2               1496278800  2017-06-01 06:30:00
23      FEMALE  21  1               1496278800  2017-06-01 06:30:00
22      MALE    33  1               1496278800  2017-06-01 06:30:00
24      MALE    35  1               1496278815  2017-06-01 06:30:15
25      FEMALE  40  2               1496278800  2017-06-01 06:30:00
26      MALE    50  2               1496278900  2017-06-01 06:31:40
geek_bs
  • 19
  • 1
0

new_df['time'] = new_df['timestamp'].apply(lambda x: str(datetime.datetime.fromtimestamp(x) ))

new_df['time'] = new_df['time'].apply(lambda x: x.split(' ')[1].split(":")[0])

geek_bs
  • 19
  • 1
  • With changed dates also I am able to get the output without NAN. Please share the entire xls so that we can look at it why the column data is becoming NAN. – geek_bs May 30 '21 at 07:20
  • I'm pasting the link to the excel sheet.Please look into it. https://drive.google.com/file/d/1eee6kcYUecMR_u_dPsBA6CvXEWglnUC3/view?usp=sharing – ashwinpatil1484 Jun 01 '21 at 03:12