Resampling and interpolate

Question

Stuck on an assignment in understanding how to resample the data into an hourly rate instead of the 4 hour blocks they are now. Here is what has been asked to do.

Differences The ENTRIES and EXITS fields hold raw counts that do not reset to zero each week. We would like to know how many entries and exits there are in the 4-hour periods. To calculate this, we need to calculate the difference between neighboring rows that have the same (UNIT, C/A, SCP) key. Create NUM_ENTRIES and NUM_EXITS columns that store these numbers.

Hints:

The shift method will be useful. It will be easier to use groupby when doing the shift as it will respect boundaries between subunits. The level argument will help define the subunits. Most of the counters count up, but there are some that count down. How should you handle those cases? Fix this for extra credit.

One problem with the numbers from the previous question is that they are sampled at different times. Resample the ENTRIES and EXITS columns to an hourly rate and interpolate it to fill in the missing values. Use the "pchip" interpolation method as it will preserve monotonicity. Again, this should be done in groups using groupby, but the apply function will allow the use of arbitrary interpolate methods. Now, recompute the NUM_ENTRIES and NUM_EXITS columns from Part 2. Hints: Use reset_index to clear the UNIT, C/A, and SCP levels of the index as this makes the resample and interpolate methods used in the apply function more straightforward. Add the index back after performing the interpolation via set_index.

here is code I have up to the part it is asked to resample

df = pd.read_csv("turnstile_161126.txt")
timestamp =pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
df.insert(3, 'TIMESTAMP', timestamp)
df.columns = df.columns.str.strip()
df = df.set_index(['UNIT','C/A','SCP','TIMESTAMP'])
df['NUM_ENTRIES'] = df.ENTRIES - df.ENTRIES.shift(1)
df['NUM_EXITS'] = df.EXITS - df.EXITS.shift(1)

                STATION LINENAME    DIVISION    DATE    TIME    DESC    ENTRIES EXITS   NUM_ENTRIES NUM_EXITS
UNIT    C/A SCP TIMESTAMP                                       
R051    A002    02-00-00    2016-11-19 03:00:00 59 ST   NQR456W BMT 11/19/2016  03:00:00    REGULAR 5924658 2007780 NaN NaN
2016-11-19 07:00:00 59 ST   NQR456W BMT 11/19/2016  07:00:00    REGULAR 5924672 2007802 14.0    22.0
2016-11-19 11:00:00 59 ST   NQR456W BMT 11/19/2016  11:00:00    REGULAR 5924738 2007908 66.0    106.0
2016-11-19 15:00:00 59 ST   NQR456W BMT 11/19/2016  15:00:00    REGULAR 5924979 2007980 241.0   72.0
2016-11-19 19:00:00 59 ST   NQR456W BMT 11/19/2016  19:00:00    REGULAR 5925389 2008056 410.0   76.0
2016-11-19 23:00:00 59 ST   NQR456W BMT 11/19/2016  23:00:00    REGULAR 5925614 2008081 225.0   25.0
2016-11-20 03:00:00 59 ST   NQR456W BMT 11/20/2016  03:00:00    REGULAR 5925684 2008096 70.0    15.0
2016-11-20 07:00:00 59 ST   NQR456W BMT 11/20/2016  07:00:00    REGULAR 5925688 2008113 4.0 17.0
2016-11-20 11:00:00 59 ST   NQR456W BMT 11/20/2016  11:00:00    REGULAR 5925755 2008191 67.0    78.0
2016-11-20 15:00:00 59 ST   NQR456W BMT 11/20/2016  15:00:00    REGULAR 5925937 2008260 182.0   69.0
2016-11-20 19:00:00 59 ST   NQR456W BMT 11/20/2016  19:00:00    REGULAR 5926232 2008332 295.0   72.0
2016-11-20 23:00:00 59 ST   NQR456W BMT 11/20/2016  23:00:00    REGULAR 5926394 2008367 162.0   35.0
2016-11-21 03:00:00 59 ST   NQR456W BMT 11/21/2016  03:00:00    REGULAR 5926425 2008378 31.0    11.0
2016-11-21 07:00:00 59 ST   NQR456W BMT 11/21/2016  07:00:00    REGULAR 5926440 2008420 15.0    42.0
2016-11-21 11:00:00 59 ST   NQR456W BMT 11/21/2016  11:00:00    REGULAR 5926622 2008741 182.0   321.0
2016-11-21 15:00:00 59 ST   NQR456W BMT 11/21/2016  15:00:00    REGULAR 5926872 2008851 250.0   110.0
2016-11-21 19:00:00 59 ST   NQR456W BMT 11/21/2016  19:00:00    REGULAR 5927775 2008927 903.0   76.0

https://i.stack.imgur.com/M0595.png

full data set

Can you post the dataframe you have as a picture sample? It is hard to see what columns / indexes you have. — NickBraunagel, Dec 14 '16 at 15:58
just added image at the end of the post @NickBraunagel. Thanks for helping out. I tried running your code and playing around with it but I think you did something different then Im trying to do and I couldn't figure out how to alter it. — , Dec 14 '16 at 18:43
Thanks for the image. Is the above question a direct copy/paste from the homework problem? The English is not good and is a bit confusing on what it wants. If you could re-paste the corrected English that would be great. — NickBraunagel, Dec 15 '16 at 15:15
Also, can you specifically call out the question for the problem? I THINK the assignment wants you to find the number of rider entries and exits at each group ("UNIT","C/A","SCP") within a 1 hour period. Is this correct? Any additional info on the problem would really help me out. — NickBraunagel, Dec 15 '16 at 15:25
you are right that he wants us to find the number of rider entries and exits at each("UNIT","C/A","SCP") where he wants us to use groupby, The deadline has passed for the assignment so no rush, but I would really appreciate help for future work. — , Dec 16 '16 at 01:15

NickBraunagel · Accepted Answer · 2016-12-16T14:39:22.853

UPDATED ANSWER:

import pandas as pd
import random
from random import randint
from dateutil.relativedelta import relativedelta
import datetime

start = datetime.datetime(2016,1,1)
r = range(0,2000)
units = ['UnitA','UnitB','UnitC']
CA = ['A002','B002','C002']
CSP = ['02-00-00','03-00-00','04-00-00']

DF = pd.DataFrame()

# create dummy dataframe    
for unit in units:

    date = [start + relativedelta(hours=4*i) for i in r]
    Entries = [random.uniform(5000000, 6000000) for i in r]
    Exits = [random.uniform(1000000, 2000000) for i in r]

    df = pd.DataFrame(data={
            'date': date,
            'unit': unit,
            'Entries': Entries,
            'Exits': Exits,
            'CA': CA[randint(0,2)],
            'CSP': CSP[randint(0,2)],
        })

    DF = pd.concat([DF,df])

# set index based upon desired groupby PLUS date column
DF.set_index(keys=['unit','CA','CSP','date'], inplace=True)

# show results
print DF[['Entries','Exits']].unstack(['unit','CA','CSP']).resample('1H').interpolate(method='pchip')      # <--if desired, add .unstack() at end to re-arrange DF

produces:

I actually need to group "UNIT","C/A","SCP","TIMESTAMP" – Dec 14 '16 at 02:38 — , Dec 14 '16 at 02:38

Resampling and interpolate

1 Answers1