I simply want to merge two dataframes on timestamps that fall within a ±1 minute interval of each other.

Here is name_df with a sample data set:

Name    Date
A       2/19/2019 17:16:15
B       2/19/2019 17:19:46
C       2/19/2019 17:23:03

And another dataframe, job_df:

Job         Datestamp
Engineer    2/19/2019  17:15:56 
Dancer      2/19/2019  17:19:27 
Singer      2/19/2019  17:22:44 

Here is what I tried. This method misses rows whose timestamps land on opposite sides of a rounding boundary (under 30 seconds rounds down, over 30 seconds rounds up), even when they are less than a minute apart:

name_df['Date&Time'] = name_df['Date'].dt.round('1min')
job_df['Date&Time'] = job_df['Datestamp'].dt.round('1min')

merged_df = pd.merge(name_df, job_df, on='Date&Time')

Any help on this is greatly appreciated!

Malik Asad
LearningNoob
  • Provide expected output for above data – Sociopath Feb 20 '19 at 10:24
  • rounding to the nearest minute means Dancer: 17:19 in job_df, and 17:20 in name_df. So rounding to nearest hour could be your solution. why would you not round to hours? – MEdwin Feb 20 '19 at 10:49

1 Answer

Here is a potential method to accomplish what you describe:

  1. Create a new data frame (e.g. time_df) that contains the date and timestamps from name_df and job_df
  2. Sort time_df in ascending order
  3. Create an additional column in time_df to hold a unique group number for each set of timestamps that fall within the same ±1 minute window (120 seconds wide)
  4. Populate this new column by computing the difference in seconds between the first timestamp and each one after it, giving them all the same group number as long as the difference is 120 seconds or less.
  5. Once the difference exceeds 120 seconds, increment the group number and repeat the process, starting from that first ungrouped row.
  6. After assigning every row a group number, merge time_df back into name_df and into job_df on the timestamp column, so that each original data frame gains the group number for its own rows
  7. Finally, perform a merge using the group number column from each of the two data frames
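
The steps above can be sketched roughly as follows. This is only an illustration built on the sample data from the question: the helper assign_groups, its window_seconds parameter, and the Timestamp and Group column names are made up for the example, and the 120-second window is taken from steps 4 and 5.

```python
import pandas as pd

fmt = "%m/%d/%Y %H:%M:%S"
name_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Date": pd.to_datetime(
        ["2/19/2019 17:16:15", "2/19/2019 17:19:46", "2/19/2019 17:23:03"],
        format=fmt),
})
job_df = pd.DataFrame({
    "Job": ["Engineer", "Dancer", "Singer"],
    "Datestamp": pd.to_datetime(
        ["2/19/2019 17:15:56", "2/19/2019 17:19:27", "2/19/2019 17:22:44"],
        format=fmt),
})


def assign_groups(times, window_seconds=120):
    """Hypothetical helper: label sorted timestamps with a group number.

    A new group starts when a timestamp lies more than window_seconds
    after the first timestamp of the current group (steps 3 to 5).
    """
    groups, group_id, start = [], 0, None
    for t in times:
        if start is None:
            start = t
        elif (t - start).total_seconds() > window_seconds:
            group_id += 1
            start = t
        groups.append(group_id)
    return groups


# Steps 1 and 2: collect every timestamp and sort in ascending order
time_df = (
    pd.concat([name_df["Date"], job_df["Datestamp"]], ignore_index=True)
    .drop_duplicates()
    .sort_values()
    .to_frame("Timestamp")
    .reset_index(drop=True)
)

# Steps 3 to 5: assign the group numbers
time_df["Group"] = assign_groups(time_df["Timestamp"])

# Step 6: attach the group number to each original data frame
name_groups = name_df.merge(time_df, left_on="Date", right_on="Timestamp")
job_groups = job_df.merge(time_df, left_on="Datestamp", right_on="Timestamp")

# Step 7: merge on the group number
merged_df = name_groups.drop(columns="Timestamp").merge(
    job_groups.drop(columns="Timestamp"), on="Group")
print(merged_df[["Name", "Job", "Date", "Datestamp"]])
```

On the sample data this pairs A with Engineer, B with Dancer, and C with Singer. If several names or jobs land in the same group, the final merge produces every combination within that group, so some extra filtering may be needed.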

Ultimately, grouping timestamps within a +/- 1 minute interval can be done in more than one way, depending on whether you start from the earliest or the latest date and time. If you have a series of successively increasing times that are each less than 120 seconds apart, they can be split into +/- 1 minute brackets in multiple ways. The above method does this systematically and should prevent some of the problems you describe, but it may ultimately be impossible to avoid them altogether.

With that being said, it might be easier to try a different rounding method, such as np.floor or np.ceil or their pandas equivalents (Series.dt.floor and Series.dt.ceil). It seems somewhat arbitrary to decide when to round up or down on a case-by-case basis. Finally, once you have determined that two timestamps are within a minute of each other, you may need to decide which timestamp to associate with them both. Using the average time could be a good solution.
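
As a rough illustration of the flooring and averaging ideas, the sketch below floors both timestamp columns to the minute with Series.dt.floor, merges on the floored value, and then computes the midpoint of each matched pair. The Minute and Midpoint column names are made up for the example, and the data is the sample from the question.

```python
import pandas as pd

fmt = "%m/%d/%Y %H:%M:%S"
name_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Date": pd.to_datetime(
        ["2/19/2019 17:16:15", "2/19/2019 17:19:46", "2/19/2019 17:23:03"],
        format=fmt),
})
job_df = pd.DataFrame({
    "Job": ["Engineer", "Dancer", "Singer"],
    "Datestamp": pd.to_datetime(
        ["2/19/2019 17:15:56", "2/19/2019 17:19:27", "2/19/2019 17:22:44"],
        format=fmt),
})

# Floor (always round down) instead of rounding to the nearest minute
name_df["Minute"] = name_df["Date"].dt.floor("1min")
job_df["Minute"] = job_df["Datestamp"].dt.floor("1min")
floored = name_df.merge(job_df, on="Minute")

# Average of the two matched timestamps, used as a shared time
floored["Midpoint"] = (
    floored["Datestamp"] + (floored["Date"] - floored["Datestamp"]) / 2
)
print(floored[["Name", "Job", "Midpoint"]])
```

On this particular sample, flooring pairs only B with Dancer (midpoint 17:19:36.5), because the other two pairs straddle a minute boundary. Any fixed rounding rule moves the boundary rather than removing it, so it is worth checking against your real data.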

Nathaniel