
A sample of my data frame has the following format:

**place**  **time**
delhi        11/5/2015
delhi        17/5/2015
chennai      13/4/2015
kolkata      21/3/2015
chennai      29/6/2014
bhopal       17/4/2015
delhi        16/4/2015
kolkata      23/6/2014
chennai      11/6/2015
punjab       7/7/2014
chennai      9/11/2015
delhi        13/12/2015
delhi        6/7/2014
kolkata      25/4/2015
bhopal       11/3/2015
punjab       19/6/2015
bhopal       21/5/2015

The place column is in factor format, while the time column is of class character.

Now for each unique place I would like to find the maximum & minimum dates reported. After that I would like to find the duration (difference in time) for each place, which I would do by subtracting the min date from the max date.

So my objective is to create a data frame in the format below:

**place**  **time**       **min_date**   **max_date**  **duration(in minute)**
delhi        11/5/2015      6/7/2014       13/12/2015    max_date-min_date
delhi        17/5/2015      6/7/2014       13/12/2015    max_date-min_date
chennai      13/4/2015      13/4/2015      9/11/2015     max_date-min_date
kolkata      21/3/2015      23/6/2014      25/4/2015     max_date-min_date
chennai      29/6/2014      13/4/2015      9/11/2015     max_date-min_date
bhopal       17/4/2015      11/3/2015      21/5/2015     max_date-min_date
delhi        16/4/2015      6/7/2014       13/12/2015    max_date-min_date
kolkata      23/6/2014      23/6/2014      25/4/2015     max_date-min_date
chennai      11/6/2015      13/4/2015      9/11/2015     max_date-min_date
punjab       7/7/2014       7/7/2014       19/6/2015     max_date-min_date
chennai      9/11/2015      13/4/2015      9/11/2015     max_date-min_date
delhi        13/12/2015     6/7/2014       13/12/2015    max_date-min_date
delhi        6/7/2014       6/7/2014       13/12/2015    max_date-min_date
kolkata      25/4/2015      23/6/2014      25/4/2015     max_date-min_date
bhopal       11/3/2015      11/3/2015      21/5/2015     max_date-min_date
punjab       19/6/2015      7/7/2014       19/6/2015     max_date-min_date
bhopal       21/5/2015      11/3/2015      21/5/2015     max_date-min_date

I am using Python 3 & pandas. I know that if my data frame is named df, then to find the minimum & maximum dates I have to write df.time.min() & df.time.max(), but this won't serve my purpose.
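
For reproducibility, the sample above can be rebuilt roughly like this (the variable name df and the category dtype for place are just my assumptions):

import pandas as pd

df = pd.DataFrame({
    'place': ['delhi', 'delhi', 'chennai', 'kolkata', 'chennai', 'bhopal',
              'delhi', 'kolkata', 'chennai', 'punjab', 'chennai', 'delhi',
              'delhi', 'kolkata', 'bhopal', 'punjab', 'bhopal'],
    'time': ['11/5/2015', '17/5/2015', '13/4/2015', '21/3/2015', '29/6/2014',
             '17/4/2015', '16/4/2015', '23/6/2014', '11/6/2015', '7/7/2014',
             '9/11/2015', '13/12/2015', '6/7/2014', '25/4/2015', '11/3/2015',
             '19/6/2015', '21/5/2015']})
# place as a categorical ("factor") column, time left as plain strings
df['place'] = df['place'].astype('category')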

Can you please guide me? Thanks in advance, as always.

Hindol Ganguly

1 Answer


You can first convert the **time** column with to_datetime, then use groupby with transform, get the difference with sub, convert the Timedelta to seconds with total_seconds, and finally divide by 60 for minutes:

df['time'] = pd.to_datetime(df['time'], dayfirst=True)
g = df.groupby('place')['time']
df['min_date'] = g.transform('min')
df['max_date'] = g.transform('max')
df['duration(in minute)'] = (df['max_date'].sub(df['min_date'])
                                           .dt.total_seconds()
                                           .div(60))
print (df)
       place       time     min_date     max_date       duration(in minute)
0      delhi 2015-05-11   2014-07-06   2015-12-13                  756000.0
1      delhi 2015-05-17   2014-07-06   2015-12-13                  756000.0
2    chennai 2015-04-13   2014-06-29   2015-11-09                  717120.0
3    kolkata 2015-03-21   2014-06-23   2015-04-25                  440640.0
4    chennai 2014-06-29   2014-06-29   2015-11-09                  717120.0
5     bhopal 2015-04-17   2015-03-11   2015-05-21                  102240.0
6      delhi 2015-04-16   2014-07-06   2015-12-13                  756000.0
7    kolkata 2014-06-23   2014-06-23   2015-04-25                  440640.0
8    chennai 2015-06-11   2014-06-29   2015-11-09                  717120.0
9     punjab 2014-07-07   2014-07-07   2015-06-19                  499680.0
10   chennai 2015-11-09   2014-06-29   2015-11-09                  717120.0
11     delhi 2015-12-13   2014-07-06   2015-12-13                  756000.0
12     delhi 2014-07-06   2014-07-06   2015-12-13                  756000.0
13   kolkata 2015-04-25   2014-06-23   2015-04-25                  440640.0
14    bhopal 2015-03-11   2015-03-11   2015-05-21                  102240.0
15    punjab 2015-06-19   2014-07-07   2015-06-19                  499680.0
16    bhopal 2015-05-21   2015-03-11   2015-05-21                  102240.0
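
As a quick sanity check on the delhi rows, 2014-07-06 to 2015-12-13 is 525 days, which is exactly the 756000.0 minutes shown above:

# 525 days * 24 hours * 60 minutes = 756000 minutes
print ((pd.Timestamp('2015-12-13') - pd.Timestamp('2014-07-06')).days * 24 * 60)
756000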

If you need a similar day-first format, use dt.strftime:

df['time'] = df['time'].dt.strftime('%d/%m/%Y')
df['min_date'] = df['min_date'].dt.strftime('%d/%m/%Y')
df['max_date'] = df['max_date'].dt.strftime('%d/%m/%Y')
print (df)
       place        time     min_date     max_date       duration(in minute)
0      delhi  11/05/2015   06/07/2014   13/12/2015                  756000.0
1      delhi  17/05/2015   06/07/2014   13/12/2015                  756000.0
2    chennai  13/04/2015   29/06/2014   09/11/2015                  717120.0
3    kolkata  21/03/2015   23/06/2014   25/04/2015                  440640.0
4    chennai  29/06/2014   29/06/2014   09/11/2015                  717120.0
5     bhopal  17/04/2015   11/03/2015   21/05/2015                  102240.0
6      delhi  16/04/2015   06/07/2014   13/12/2015                  756000.0
7    kolkata  23/06/2014   23/06/2014   25/04/2015                  440640.0
8    chennai  11/06/2015   29/06/2014   09/11/2015                  717120.0
9     punjab  07/07/2014   07/07/2014   19/06/2015                  499680.0
10   chennai  09/11/2015   29/06/2014   09/11/2015                  717120.0
11     delhi  13/12/2015   06/07/2014   13/12/2015                  756000.0
12     delhi  06/07/2014   06/07/2014   13/12/2015                  756000.0
13   kolkata  25/04/2015   23/06/2014   25/04/2015                  440640.0
14    bhopal  11/03/2015   11/03/2015   21/05/2015                  102240.0
15    punjab  19/06/2015   07/07/2014   19/06/2015                  499680.0
16    bhopal  21/05/2015   11/03/2015   21/05/2015                  102240.0

A solution without transform is to first aggregate min and max and then use map:

df['time'] = pd.to_datetime(df['time'], dayfirst=True)
g = df.groupby('place')['time']

min_ser = g.min()
print (min_ser)
place
bhopal    2015-03-11
chennai   2014-06-29
delhi     2014-07-06
kolkata   2014-06-23
punjab    2014-07-07
Name: time, dtype: datetime64[ns]

max_ser = g.max()
print (max_ser)
place
bhopal    2015-05-21
chennai   2015-11-09
delhi     2015-12-13
kolkata   2015-04-25
punjab    2015-06-19
Name: time, dtype: datetime64[ns]

df['min_date'] = df['place'].map(min_ser)
df['max_date'] = df['place'].map(max_ser)
print (df)
       place       time     min_date     max_date
0      delhi 2015-05-11   2014-07-06   2015-12-13
1      delhi 2015-05-17   2014-07-06   2015-12-13
2    chennai 2015-04-13   2014-06-29   2015-11-09
3    kolkata 2015-03-21   2014-06-23   2015-04-25
4    chennai 2014-06-29   2014-06-29   2015-11-09
5     bhopal 2015-04-17   2015-03-11   2015-05-21
6      delhi 2015-04-16   2014-07-06   2015-12-13
7    kolkata 2014-06-23   2014-06-23   2015-04-25
8    chennai 2015-06-11   2014-06-29   2015-11-09
9     punjab 2014-07-07   2014-07-07   2015-06-19
10   chennai 2015-11-09   2014-06-29   2015-11-09
11     delhi 2015-12-13   2014-07-06   2015-12-13
12     delhi 2014-07-06   2014-07-06   2015-12-13
13   kolkata 2015-04-25   2014-06-23   2015-04-25
14    bhopal 2015-03-11   2015-03-11   2015-05-21
15    punjab 2015-06-19   2014-07-07   2015-06-19
16    bhopal 2015-05-21   2015-03-11   2015-05-21
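
The duration column can then be added in the same way as in the transform solution above:

df['duration(in minute)'] = (df['max_date'].sub(df['min_date'])
                                           .dt.total_seconds()
                                           .div(60))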
jezrael
  • oops, the `to_datetime` call needs the parameter `dayfirst=True`; with it the output is correct. – jezrael Dec 05 '16 at 07:50
  • @jezrael, if I am using pyspark 2.0.1 & the DataFrame-based API, the transform function does not work. Can you please guide me on how to proceed? What edits need to be made to the two lines `df['min_date'] = g.transform('min')` and `df['max_date'] = g.transform('max')` to get a similar result in pyspark? – Hindol Ganguly Jan 24 '17 at 10:15
  • Is it possible to use the `join` function? – jezrael Jan 24 '17 at 10:17
  • Would the `join` function be able to find the minimum & maximum dates for each entry of the 'place' column, as transform did earlier? – Hindol Ganguly Jan 24 '17 at 10:20
  • Sorry I took quite long. Thanks for your effort; but as I said, I am using pyspark 2.0.1 & the DataFrame-based API, so I am trying to avoid pandas operations. As the time column is a string, I first need to change it to a date-time format. I progressed as below: `from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max`; `from_pattern = 'M/dd/yyyy'`; `to_pattern = 'dd-M-yyyy'`; `df.withColumn('format_time', from_unixtime(unix_timestamp(df['time'], from_pattern), to_pattern))`; `g = df.groupBy("place", "format_time")` ...ToBeContinued.. – Hindol Ganguly Jan 25 '17 at 06:55
  • Here `g.min()` & `g.max()` are not working. I need to find the min/max dates, and after that I will calculate the duration with a time_delta function: `def time_delta(y, x): from datetime import datetime; max = datetime.strptime(y, '%d-%m-%Y'); min = datetime.strptime(x, '%d-%m-%Y'); delta = (max - min).total_seconds(); return delta`. Please guide me from here. – Hindol Ganguly Jan 25 '17 at 06:55
  • Sorry, but `pyspark` is new technology for me; I have never worked with it. I found a solution for aggregating in pyspark [here](http://stackoverflow.com/a/34955432/2901002), but I don't know if it works. The best option is to create a new question, and I hope some pyspark guru helps you. Try adding the code you already have in pyspark, the input, and the desired output. – jezrael Jan 25 '17 at 07:01
  • @jezrael, ok, no issue. I would like to offer you the crown of 'Python guru', if possible. :) – Hindol Ganguly Jan 25 '17 at 07:12
  • Unfortunately that is not true; `python` has many libraries and I only really know `pandas`. I know plain `python` a bit, but other things like `django`, `pyspark` and many others are totally unknown to me. – jezrael Jan 25 '17 at 07:14
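
For the pyspark follow-up in the comments, here is a rough, untested sketch (not part of the answer above) of how the windowed min/max might look with the Spark DataFrame API. It assumes a Spark DataFrame sdf with a string column time in day-first format and a place column; the format string mirrors dayfirst=True from the pandas solution:

from pyspark.sql import Window
from pyspark.sql import functions as F

# parse the day-first string dates into a timestamp column (assumed layout d/M/yyyy)
sdf = sdf.withColumn('time_ts',
                     F.unix_timestamp('time', 'd/M/yyyy').cast('timestamp'))

# per-place window, analogous to groupby('place')['time'].transform(...)
w = Window.partitionBy('place')
result = (sdf
          .withColumn('min_date', F.min('time_ts').over(w))
          .withColumn('max_date', F.max('time_ts').over(w))
          # difference in epoch seconds divided by 60 gives minutes
          .withColumn('duration(in minute)',
                      (F.col('max_date').cast('long')
                       - F.col('min_date').cast('long')) / 60))
result.show()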