counting records on per hour, per day and create multindex DataFrame as output

Question

Sample DataFrame :

process_id | app_path | start_time

the desired output data frame should be multi-Indexed based on the date and time value in start_time column with unique dates as first level of index and one hour range as second level of index the count of records in each time slot should be calculated

def activity(self):
    # find unique dates from db file
    columns = self.df['start_time'].map(lambda x: x.date()).unique()

    result = pandas.DataFrame(np.zeros((1,len(columns))), columns = columns)
    for i in range(len(self.df)):
        col = self.df.iloc[i]['start_time'].date()
        result[col][0] = result.get_value(0, col) + 1

    return result

I have tried the above code which gives the output as :

15-07-2014 16-7-2014 17-07-2014 18-07-2014 3217 2114 1027 3016 I want to count records on per hour basis as well

score 8 · Accepted Answer · answered Nov 07 '14 at 13:19

It would be helpful to start your question with some sample data. Since you didn't, I assumed the following is representative of your data (looks like app_path was not being used):

rng = pd.date_range('1/1/2011', periods=10000, freq='1Min')
df = pd.DataFrame(randint(size=len(rng), low=100, high = 500), index=rng)
df.columns = ['process_id']

It looks like you could benefit from exploring the groupby method in Pandas data frames. Using groupby, your example above become a simple one-liner:

df.groupby( [df.index.year, df.index.month, df.index.day] ).count()

and grouping by hour means simply adding hour to the group:

df.groupby( [df.index.year, df.index.month, df.index.day, df.index.hour] ).count()

Don't recreate the wheel in Pandas, use the methods provided for much more readable, as well as faster, code.

Thanks, but I guess I'll have to make start_time as my index for dataframe. This gives me count for each column but I want only a single column representing the count — Pooja Mulchandani, Nov 10 '14 at 05:49
No, you don't have to use the index. But that is a typical use case for time series data. If the start_time is in a different column then use that column name instead of index. — JD Long, Nov 10 '14 at 14:36

counting records on per hour, per day and create multindex DataFrame as output

1 Answers1

Linked