
My question is about using Pandas time series.

I have one file (Spots) that contains a pandas time series covering a month of data at 7.5-second intervals. Example:

2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4

The other file (Target) has just time information.

Example:

2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25

I want to check which spot each target datetime belongs to. Output for the above example:

2016-11-01 00:00:00,0 '\t' count of targets in this spot = 2
2016-11-01 00:00:07.500000,1 '\t' count of targets in this spot = 0
2016-11-01 00:00:15,2 '\t' count of targets in this spot = 0
2016-11-01 00:00:22.500000,3 '\t' count of targets in this spot = 0
2016-11-01 00:00:30,4 '\t' count of targets in this spot = 2
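
For reference, a minimal sketch of loading the two files into pandas (the file names spots.csv and target.csv are just placeholders):

import pandas as pd

spots = pd.read_csv('spots.csv', header=None, names=['Date', 'Value'], parse_dates=['Date'])
target = pd.read_csv('target.csv', header=None, names=['Date'], parse_dates=['Date'])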

Thank you so much in advance. Kindly let me know if this is clear; otherwise I can try to explain more.

    How does the first row have 2 targets in that spot? It looks like it should be 0. Unless you're rounding down the Target to the closest Spot? A little more clarity on the logic behind the result would be helpful. – elPastor May 03 '17 at 02:53

4 Answers


Here's my suggestion. First, add another column to the target frame. This will make it possible to recognize the targets after a future merge:

target['T'] = 1

Concatenate the targets and the spots and sort them by time:

both = pd.concat([spots,target]).sort_values(0)
#                        0    1    T
#0 2016-11-01 00:00:00.000  0.0  NaN
#0 2016-11-01 00:00:05.000  NaN  1.0
#1 2016-11-01 00:00:07.000  NaN  1.0
#1 2016-11-01 00:00:07.500  1.0  NaN
#2 2016-11-01 00:00:15.000  2.0  NaN
#3 2016-11-01 00:00:22.500  3.0  NaN
#2 2016-11-01 00:00:23.000  NaN  1.0
#3 2016-11-01 00:00:25.000  NaN  1.0
#4 2016-11-01 00:00:30.000  4.0  NaN

Forward-fill the spot IDs:

both[1] = both[1].fillna(method='ffill').astype(int)
#                        0  1    T
#0 2016-11-01 00:00:00.000  0  NaN
#0 2016-11-01 00:00:05.000  0  1.0
#1 2016-11-01 00:00:07.000  0  1.0
#1 2016-11-01 00:00:07.500  1  NaN
#2 2016-11-01 00:00:15.000  2  NaN
#3 2016-11-01 00:00:22.500  3  NaN
#2 2016-11-01 00:00:23.000  3  1.0
#3 2016-11-01 00:00:25.000  3  1.0
#4 2016-11-01 00:00:30.000  4  NaN

Select the original target rows and columns:

both[both['T']==1][[0,1]]
#                    0  1
#0 2016-11-01 00:00:05  0
#1 2016-11-01 00:00:07  0
#2 2016-11-01 00:00:23  3
#3 2016-11-01 00:00:25  3

If you want to count targets in spots, use groupby():

both.groupby(1).count()['T']
#1
#0    2
#1    0
#2    0
#3    2
#4    0
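
Putting it together, a minimal end-to-end sketch (the spots/target frames below are reconstructed from the example data above, with the same integer column labels 0 and 1):

import pandas as pd

# reconstructed inputs: column 0 = timestamp, column 1 = spot id
spots = pd.DataFrame({
    0: pd.to_datetime(['2016-11-01 00:00:00', '2016-11-01 00:00:07.5',
                       '2016-11-01 00:00:15', '2016-11-01 00:00:22.5',
                       '2016-11-01 00:00:30']),
    1: [0, 1, 2, 3, 4]})
target = pd.DataFrame({
    0: pd.to_datetime(['2016-11-01 00:00:05', '2016-11-01 00:00:07',
                       '2016-11-01 00:00:23', '2016-11-01 00:00:25'])})

target['T'] = 1                                        # mark target rows
both = pd.concat([spots, target]).sort_values(0)       # interleave by time
both[1] = both[1].fillna(method='ffill').astype(int)   # propagate spot ids forward
print(both.groupby(1).count()['T'])                    # targets per spot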
DYZ

Use a combination of np.searchsorted and pd.value_counts, plus a map to build the new column.

idx = Spots.index.to_series()
i = idx.values
t = Target.Date.values
m = pd.value_counts(i[i.searchsorted(t) - 1]).to_dict()
Spots.assign(TargetCount=idx.map(lambda x: m.get(x, 0)))

                         Value  TargetCount
Date                                       
2016-11-01 00:00:00.000      0            2
2016-11-01 00:00:07.500      1            0
2016-11-01 00:00:15.000      2            0
2016-11-01 00:00:22.500      3            2
2016-11-01 00:00:30.000      4            0

How It Works

  • idx is the index of Spots turned into a pd.Series, because I want to use pd.Series.map later.
  • i is the underlying numpy array of idx that I'll perform the searchsorted operation on.
  • t is the underlying numpy array of Target.Date, the values being binned by searchsorted.
  • searchsorted finds, for each element of t, the position at which it would have to be inserted into the sorted array i. That position tells us which "bin" (spot) the element belongs to; I then subtract one to align with the appropriate spot indices (see the short demonstration after this list).
  • pd.value_counts then counts how many targets fall into each spot.
  • map builds the new column from those counts.
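
A tiny standalone demonstration of the subtract-one trick (the plain numbers below are illustrative, not the actual data):

import numpy as np

bins = np.array([0.0, 7.5, 15.0, 22.5, 30.0])   # spot start times, in seconds
times = np.array([5.0, 7.0, 23.0, 25.0])        # target times, in seconds

pos = bins.searchsorted(times)   # insertion positions: [1 1 4 4]
print(pos - 1)                   # spot each target falls into: [0 0 3 3]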

Setup

from io import StringIO
import pandas as pd

tx1 = """2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4"""

tx2 = """2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25"""

Spots = pd.read_csv(StringIO(tx1), parse_dates=[0], index_col=0, names=['Date', 'Value'])

Target = pd.read_csv(StringIO(tx2), parse_dates=[0], names=['Date'])
piRSquared

Let's use merge_ordered, fillna, and groupby:

Inputs:

df_spots

                     Date  Value
0 2016-11-01 00:00:00.000      0
1 2016-11-01 00:00:07.500      1
2 2016-11-01 00:00:15.000      2
3 2016-11-01 00:00:22.500      3
4 2016-11-01 00:00:30.000      4

df_target

                 Date
0 2016-11-01 00:00:05
1 2016-11-01 00:00:07
2 2016-11-01 00:00:23
3 2016-11-01 00:00:25

Code:

merged_df = pd.merge_ordered(df_spots, df_target, on = 'Date')
df_out = (merged_df.groupby(by=merged_df['Value']
               .fillna(method='ffill'), as_index=False)
               .agg({'Date':'first',
                     'Value':{'first':'first','count':lambda x:len(x)-1}}))

Outputs:

df_out

                     Date Value      
                    first first count
0 2016-11-01 00:00:00.000   0.0   2.0
1 2016-11-01 00:00:07.500   1.0   0.0
2 2016-11-01 00:00:15.000   2.0   0.0
3 2016-11-01 00:00:22.500   3.0   2.0
4 2016-11-01 00:00:30.000   4.0   0.0
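
In case the nested-dict agg above is rejected by a newer pandas (it was deprecated after this answer was written), a rough sketch of an equivalent using named aggregation (assumes pandas >= 0.25):

grp = merged_df['Value'].ffill()
df_out = merged_df.groupby(grp).agg(
    Date=('Date', 'first'),
    Value=('Value', 'first'),
    count=('Date', lambda x: len(x) - 1))   # minus 1 for the spot row itself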
Scott Boston
  • Wow! Thank you all so much! I didn't know of so many nice and elegant ways to use pandas for this. I will try all of these tonight and let you all know how it goes. Thank you so much! – Himanshu Sharma May 03 '17 at 12:59
  • Hi @Scott Boston, I tried your method but I am not getting the counts; rather it is just matching the first column, which I think is the row number. First, I had to change merge_ordered to ordered_merge, but even then I get the same result. I noticed that in the merged_df there is no 'Value' from the second data frame, so would the groupby on 'Value' still work? Can you help me understand what I might be doing wrong? Also, when I print my df_out, I get first count first instead of first first count. – Himanshu Sharma May 04 '17 at 01:42
  • What version of pandas do you have? I am on the latest 0.19.2; I think you might be on an earlier version. – Scott Boston May 04 '17 at 02:01
  • Yes, I thought about that and upgraded pandas and ran it and it worked beautifully. Thank you all for the help – Himanshu Sharma May 04 '17 at 15:50

Using pandas merge_asof (note: all time values must be ordered, so you may have to sort first):

Setup

import pandas as pd
import numpy as np

# make date_range with 1 sec interval (fake targets)
rng = pd.date_range('2016-11-01', periods=100, freq='S')

# resample to make 7.5 sec intervals (fake spot bins)
ts = pd.Series(np.arange(100), index=rng)
ts_vals = ts.resample('7500L').asfreq().index

df_spots = pd.DataFrame({'spot': np.arange(len(ts_vals)), 'bin': ts_vals})
df_spots.head()
                      bin  spot
0 2016-11-01 00:00:00.000  0   
1 2016-11-01 00:00:07.500  1   
2 2016-11-01 00:00:15.000  2   
3 2016-11-01 00:00:22.500  3   
4 2016-11-01 00:00:30.000  4 

df_targets = pd.DataFrame(rng, columns=['tgt'])
df_targets.head()

                  tgt
0 2016-11-01 00:00:00
1 2016-11-01 00:00:01
2 2016-11-01 00:00:02
3 2016-11-01 00:00:03
4 2016-11-01 00:00:04

Solution

# this will produce spot membership for targets
df = pd.merge_asof(df_targets, df_spots, left_on='tgt', right_on='bin')
df.head()
                  tgt                     bin  spot
0 2016-11-01 00:00:00 2016-11-01 00:00:00.000  0   
1 2016-11-01 00:00:01 2016-11-01 00:00:00.000  0   
2 2016-11-01 00:00:02 2016-11-01 00:00:00.000  0   
3 2016-11-01 00:00:03 2016-11-01 00:00:00.000  0   
4 2016-11-01 00:00:04 2016-11-01 00:00:00.000  0   
5 2016-11-01 00:00:05 2016-11-01 00:00:00.000  0   
6 2016-11-01 00:00:06 2016-11-01 00:00:00.000  0   
7 2016-11-01 00:00:07 2016-11-01 00:00:00.000  0   
8 2016-11-01 00:00:08 2016-11-01 00:00:07.500  1   
9 2016-11-01 00:00:09 2016-11-01 00:00:07.500  1   

# for spot counts...
df_counts = pd.DataFrame(df.groupby('bin')['spot'].count())
df_counts.head()
                         spot
bin                          
2016-11-01 00:00:00.000  8   
2016-11-01 00:00:07.500  7   
2016-11-01 00:00:15.000  8   
2016-11-01 00:00:22.500  7   
2016-11-01 00:00:30.000  8   
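
The same idea applied to the frames from the question (Spots/Target below are reconstructed from the question's example data; both are already sorted by time):

import pandas as pd

Spots = pd.DataFrame({
    'Date': pd.to_datetime(['2016-11-01 00:00:00', '2016-11-01 00:00:07.5',
                            '2016-11-01 00:00:15', '2016-11-01 00:00:22.5',
                            '2016-11-01 00:00:30']),
    'Value': [0, 1, 2, 3, 4]})
Target = pd.DataFrame({
    'Date': pd.to_datetime(['2016-11-01 00:00:05', '2016-11-01 00:00:07',
                            '2016-11-01 00:00:23', '2016-11-01 00:00:25'])})

# attach each target to the most recent spot at or before it
matched = pd.merge_asof(Target, Spots, on='Date')

# count targets per spot, keeping spots that received none
counts = matched.groupby('Value').size().reindex(Spots['Value'], fill_value=0)
print(counts)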
b2002