
I have a dataframe with some dates, and data associated with each date, that I am reading in from a csv file (the file is relatively small, on the order of 10,000s of rows and ~10 columns):

memid   date        a   b
10000   7/3/2017    221 143
10001   7/4/2017    442 144
10002   7/6/2017    132 145
10003   7/8/2017    742 146
10004   7/10/2017   149 147

I want to add a column, "date_diff", to this dataframe that calculates the number of days between each date and the most recent preceding date (the rows are always sorted by date):

memid   date        a   b    date_diff
10000   7/3/2017    221 143  NaN
10001   7/4/2017    442 144  1
10002   7/6/2017    132 145  2
10003   7/8/2017    742 146  2
10004   7/10/2017   149 147  2

I am having trouble figuring out a good way to create this "date_diff" column, since iterating row by row tends to be frowned upon with pandas/numpy. Is there an easy way to create this column in python/pandas/numpy, or is this job better done before the csv is read into my script?

Thanks!

EDIT: Thanks to jpp and Tai for their answers. They cover the original question, but I have a follow-up:

What if my dataset has multiple rows for each date? Is there a way to easily check the difference between each group of dates to produce an output like the example below? Is it easier if there are a set number of rows for each date?

 memid  date        a   b   date_diff
 10000  7/3/2017    221 143 NaN
 10001  7/3/2017    442 144 NaN
 10002  7/4/2017    132 145 1
 10003  7/4/2017    742 146 1
 10004  7/6/2017    149 147 2
 10005  7/6/2017    457 148 2
Nizag
  • Use timedelta imported from datetime; read [this](https://stackoverflow.com/questions/12448592/how-to-add-delta-to-python-datetime-time) – manandearth Mar 02 '18 at 15:35

1 Answer


Edit to answer OP's new question: what if there are duplicates in the date column?
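
For reference, the follow-up example data can be reconstructed as a self-contained frame (values taken from the question's second table), so the steps below can be run as-is:

import pandas as pd

df = pd.DataFrame({
    "memid": [10000, 10001, 10002, 10003, 10004, 10005],
    "date":  ["7/3/2017", "7/3/2017", "7/4/2017", "7/4/2017", "7/6/2017", "7/6/2017"],
    "a":     [221, 442, 132, 742, 149, 457],
    "b":     [143, 144, 145, 146, 147, 148],
})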

Set up: create a frame that does not contain duplicate dates

df.date = pd.to_datetime(df.date, infer_datetime_format=True)  # parse the date strings
df_no_dup = df.drop_duplicates("date").copy()                  # keep one row per distinct date
df_no_dup["diff"] = df_no_dup["date"].diff().dt.days           # day gap between distinct dates

Method 1: merge

df.merge(df_no_dup[["date", "diff"]], left_on="date", right_on="date", how="left")

    memid   date        a   b   diff
0   10000   2017-07-03  221 143 NaN
1   10001   2017-07-03  442 144 NaN
2   10002   2017-07-04  132 145 1.0
3   10003   2017-07-04  742 146 1.0
4   10004   2017-07-06  149 147 2.0
5   10005   2017-07-06  457 148 2.0
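
Since the key column has the same name in both frames, the merge can equivalently be written with the shorter on= form (same result):

df.merge(df_no_dup[["date", "diff"]], on="date", how="left")  # equivalent to the left_on/right_on call above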

Method 2: map

df["diff"] = df["date"].map(df_no_dup.set_index("date")["diff"])

Original answer: try this.

df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df.date.diff()

0      NaT
1   1 days
2   2 days
3   2 days
4   2 days
Name: date, dtype: timedelta64[ns]

To convert to a number of days (float, since the first value is NaN):

import numpy as np

df['diff'] = df['date'].diff() / np.timedelta64(1, 'D')

#    memid       date    a    b  diff
# 0  10000 2017-07-03  221  143   NaN
# 1  10001 2017-07-04  442  144   1.0
# 2  10002 2017-07-06  132  145   2.0
# 3  10003 2017-07-08  742  146   2.0
# 4  10004 2017-07-10  149  147   2.0
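
Equivalently, the .dt.days accessor used in the duplicate-handling section above gives the same day counts without the explicit timedelta division:

df['diff'] = df['date'].diff().dt.days  # NaN on the first row, 1.0/2.0/... after
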
Tai
  • In a similar vein, what if my data has multiple values at each date? I would still want to check the difference between dates, but would want each row for a given date to have the same date difference value. For example, in the df I provided, if rows [0,1,2] were one date and rows [3,4] were another, I would want the value of the "diff" column to be NaN for rows [0,1,2] and, for rows [3,4], the difference between the dates of the second group of rows and the first group of rows. – Nizag Mar 03 '18 at 15:47