-1

I have a table with individual records and another which holds historical information about the individuals in the former.

I want to extract information about the individuals from the second table. Both tables have timestamp. It is very important that the historical information happened before the record in the first table.

               Date_Time              name
0    2021-09-06 10:46:00       Leg It Liam
1    2021-09-06 10:46:00  Hollyhill Island
2    2021-09-06 10:46:00    Shani El Bolsa
3    2021-09-06 10:46:00     Kilbride Fifi
4    2021-09-06 10:46:00                Go
2100 2021-10-06 11:05:00   Slaneyside Babs
2101 2021-10-06 11:05:00      Hillview Joe
2102 2021-10-06 11:05:00     Fairway Flyer
2103 2021-10-06 11:05:00  Whiteys Surprise
2104 2021-10-06 11:05:00       Astons Lucy

The name is the variable by which you connect the two tables:

             Date_Time              name    cc
13 2021-09-15 12:16:00  Hollyhill Island  6.00
14 2021-09-06 10:46:00  Hollyhill Island  4.50
15 2021-05-30 18:28:00  Hollyhill Island  3.50
16 2021-05-25 10:46:00  Hollyhill Island  2.50
17 2021-05-18 12:46:00  Hollyhill Island  2.38
18 2021-04-05 12:31:00  Hollyhill Island  3.50
19 2021-04-28 12:16:00  Hollyhill Island  3.75

I want to add aggregated data from this table to the first. Such as adding the cc mean and count.

               Date_Time              name
1    2021-09-06 10:46:00  Hollyhill Island

This line I would add 5 for cc count and 3.126 for the cc mean. Remember the historical records need to be before the date time of the individual records.

I am a bit confused how to do this efficiently. I know I need to groupby the historical data.

Also the individual records are usually in groups of Date_Time, if that makes it any easier.

Borut Flis
  • 15,715
  • 30
  • 92
  • 119

1 Answers1

1

IIUC:

try:

out=df1.merge(df2,on='name',suffixes=('','_y'))
#merging both df's on name
out=out.mask(out['Date_Time']<=out['Date_Time_y']).dropna()
#filtering results
out=out.groupby(['Date_Time','name'])['cc'].agg(['count','mean']).reset_index()
#aggregrating values

output of out:

    Date_Time               name             count  mean
0   2021-09-06 10:46:00     Hollyhill Island    5   3.126
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
  • Thanks. I want aggregate data only for the "name", than your solution could be right. But the problem is that I need to do this for every record in the first data-frame. I need historical data from df2 for every record in df1, so I would have to do this calculation so many times, it might not be very efficient. – Borut Flis Jun 21 '21 at 12:20
  • Let's say there is another example for Hollyhill Island in one hour you would have to calculate again as there might be new examples in the historical data. – Borut Flis Jun 21 '21 at 12:24
  • @BorutFlis sorry sir if I don't understand...but we are merging both df's on name so i don't think there is need of repeating the process maybe untill data is dynamic – Anurag Dabas Jun 21 '21 at 15:21
  • Because rows in df1 progress through time. Hollyhill Island in the beginning of the df1 has less historic records that fit the condition in df2 than Hollyhill Island in the end of the df2. I know it is a difficult thing to explain. And my sample data does not show this example, I will try and edit it. – Borut Flis Jun 21 '21 at 20:51