7

I have a dataframe with two columns: Category and Datetime

I want to create a new column that shows the difference between the Datetime of the current row vs the previous row, restarting at each category.

What I have:

Category    Datetime
A           2018-02-01 01:51:04
A           2018-02-01 02:04:04
B           2018-02-01 02:28:34 
B           2018-02-01 02:41:34
B           2018-02-01 02:45:34    

What I want:

Category    Datetime               Difference
A           2018-02-01 01:51:04    NaT
A           2018-02-01 02:04:04    00:13:00
B           2018-02-01 02:28:34    NaT
B           2018-02-01 02:41:34    00:13:00
B           2018-02-01 02:45:34    00:04:00

EDIT:

@sacul I tried your solution of doing df['Difference'] = list(by_group.apply(lambda x: x['Datetime']-x['Datetime'].shift())) but it's giving me weird results...here's the actual data I'm working with:

Category    Datetime        Difference
A           2/1/18 1:51     NaT
A           2/1/18 2:04     1 days 02:52:00
B           2/1/18 2:28     NaT
C           2/1/18 2:41     NaT
D           2/1/18 6:31     0 days 00:10:30
E           2/1/18 8:26     3 days 23:19:30
F           2/1/18 10:03    0 days 00:21:00
G           2/1/18 11:11    NaT
G           2/1/18 11:11    NaT
G           2/1/18 11:11    0 days 00:00:02
G           2/1/18 11:11    0 days 00:02:30
H           2/1/18 11:12    0 days 00:00:02
H           2/1/18 11:22    0 days 00:02:28
I           2/1/18 15:26    0 days 00:00:02
I           2/1/18 16:01    0 days 00:08:26
I           2/1/18 17:26    0 days 00:00:01
J           2/1/18 17:42    0 days 00:01:31
J           2/1/18 17:42    NaT
Josephine M. Ho
  • 543
  • 1
  • 6
  • 8

2 Answers2

8

alternative solution

import pandas as pd
import numpy as np
df.DateTime = pd.to_datetime(df.DateTime)


df['Difference'] = np.where(df.Category == df.Category.shift(), df.DateTime - df.DateTime.shift(), np.nan)

note: this only works if your data is presorted

usernamenotfound
  • 1,540
  • 2
  • 11
  • 18
1

Assuming your data is in a dataframe called df:

# In case Datetime is not a Datetime object yet (skip if it is):
df.Datetime = pd.to_datetime(df.Datetime)

by_group = df.groupby(df.Category)

df['Difference'] = list(by_group.apply(lambda x: x['Datetime']-x['Datetime'].shift()))

>>> df
  Category            Datetime Difference
0        A 2018-02-01 01:51:04        NaT
1        A 2018-02-01 02:04:04   00:13:00
2        B 2018-02-01 02:28:34        NaT
3        B 2018-02-01 02:41:34   00:13:00
4        B 2018-02-01 02:45:34   00:04:00

This groups it by category, and then subtracts the datetime object in each row from the row below in each group.

EDIT:

This seems to work with your new data as well, when starting with a Datetime column of strings in the form 2/1/18 1:51, and modifying that via pd.to_datetime(df.Datetime):

>>> df1
   Category            Datetime Difference
0         A 2018-02-01 01:51:00        NaT
1         A 2018-02-01 02:04:00   00:13:00
2         B 2018-02-01 02:28:00        NaT
3         C 2018-02-01 02:41:00        NaT
4         D 2018-02-01 06:31:00        NaT
5         E 2018-02-01 08:26:00        NaT
6         F 2018-02-01 10:03:00        NaT
7         G 2018-02-01 11:11:00        NaT
8         G 2018-02-01 11:11:00   00:00:00
9         G 2018-02-01 11:11:00   00:00:00
10        G 2018-02-01 11:11:00   00:00:00
11        H 2018-02-01 11:12:00        NaT
12        H 2018-02-01 11:22:00   00:10:00
13        I 2018-02-01 15:26:00        NaT
14        I 2018-02-01 16:01:00   00:35:00
15        I 2018-02-01 17:26:00   01:25:00
16        J 2018-02-01 17:42:00        NaT
17        J 2018-02-01 17:42:00   00:00:00
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • Hi @sacul! I tried your solution, please see my edit above. `df.Datetime` is in datetime format ( – Josephine M. Ho Feb 26 '18 at 23:52
  • That's strange, it seems to work for me; I'm updating my answer with your new data: – sacuL Feb 27 '18 at 00:08
  • I had to reprocess your data in order to get it into a dataframe, what I ended up with was a column named `Datetime` with *strings* in the form `2/1/18 1:51`, then called `pd.to_datetime(df.Datetime)` on that column, went through the `groupby` process, and I got the results posted. Try changing your 'Datetime' column to strings, then doing as I posted. You can use `df['Datetime'] = df['Datetime'].to_string()` – sacuL Feb 27 '18 at 00:15
  • I see, it's something with my datetime format (this data was exported from Splunk). When I did `df['Datetime'].to_string()` and then `pd.to_datetime(df['Datetime'])`, I got **ValueError: Unknown string format**. Apparently, the `to_string` command turned it into something super weird (it's long so I'll paste it in another comment). Going to manually reformat df.Datetime into a string and give your solution a try. – Josephine M. Ho Feb 27 '18 at 01:37
  • '41 2018-02-01 01:51:04\n72 2018-02-01 02:04:04\n73 2018-02-01 02:28:34\n74 2018-02-01 02:41:34\n82 2018-02-01 06:31:04\n71 2018-02-01 08:26:04\n209 2018-02-01 10:03:04\n203 2018-02-01 11:11:04\n204 2018-02-01 11:11:04\n205 2018-02-01 11:11:05\n206 2018-02-01 11:11:06\n207 2018-02-01 11:12:04\n208 2018-02-01 11:22:34\n202 2018-02-01 15:26:04\n201 2018-02-01 16:01:04\n58 2018-02-01 17:26:04\n59 2018-02-01 17:42:04\n60 2018-02-01 17:42:05\n62 2018-02-01 17:42:34\n61 2018-02-01 17:42:34\n63 2018-02-01 17:50:04\n64 ... – Josephine M. Ho Feb 27 '18 at 01:39
  • You can try changing it with `df.Datetime = df.Datetime.astype('str')`, it actually makes more sense than the `.to_string()` method (sorry, I should have tested that before my last comment!) – sacuL Feb 27 '18 at 01:48
  • So weird, but still didn't work (getting those weird "0 days" results). However, @Usernamenotfound's solution worked like a charm - thanks guys!! – Josephine M. Ho Feb 27 '18 at 02:21