
I have a dataframe that has a datetime Purchase Date column, a float64 Price column, and an object Location column. I have over 1000 locations, and almost all locations do not have daily data.

I need to find the daily average price and standard deviation of price for each location, using null or NaN when there are no values to average.

My raw data looks like this:

| Lane  | Book Date | Purchase Price |
| ----- | --------- | -------------- |
| FL-AB | 1/1/2020  | 123.00 |
| MA-CA | 1/1/2020  | 300.00 |
| MA-CA | 1/1/2020  | 279.50 |
| MA-CA | 1/2/2020  | 268.00 |
| FL-NY | 1/3/2020  | 179.51 |

I'm very new to programming, and I'm not sure where to start.

I converted my datetime to date with pd.to_datetime(df['Purchase Date']).dt.date.

Any help from here, as well as how to aggregate the same data at a weekly level, would be greatly appreciated.

1 Answer



pandas.Grouper is a way to do it; the pandas docs list the kinds of frequency you can use with time series data. For the first part of the problem, you can aggregate the same data at a daily or weekly level; it wasn't entirely clear whether the price needs to be an average value per day or per week too, but it can be done like this:
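
A minimal sketch of the idea (assuming a DataFrame df with a datetime column 'Book date' and a float column 'Purchase Price', like the example built below):

import pandas as pd

# pd.Grouper buckets the rows by calendar day (freq='D'); mean() averages each bucket
daily_mean = df.groupby(pd.Grouper(key='Book date', freq='D'))['Purchase Price'].mean()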


Since the author provided some clarification, I think the table/dataframe above can be created like this:

df = pd.DataFrame(data={'Lane': ['FL-AB', 'MA-CA', 'MA-CA', 'MA-CA', 'FL-NY'],
                        'Book date': pd.to_datetime(["2020/01/01", "2020/01/01", "2020/01/01", "2020/01/02", "2020/01/03"]),
                        'Purchase Price': [123.0, 300.0, 279.5, 268.0, 179.51],
                        })

This results in the following dataframe:

>>> df
    Lane  Book date  Purchase Price
0  FL-AB 2020-01-01          123.00
1  MA-CA 2020-01-01          300.00
2  MA-CA 2020-01-01          279.50
3  MA-CA 2020-01-02          268.00
4  FL-NY 2020-01-03          179.51

If your dataframe looks like that, you can get the average and standard deviation (std) from the DataFrame.describe() function. Just remember to group by day before calling it:

df.groupby([pd.Grouper(key='Book date', freq='D')])['Purchase Price'].describe()

It will result in something like this:

            count        mean        std     min     25%     50%     75%     max
Book date
2020-01-01    3.0  234.166667  96.817268  123.00  201.25  279.50  289.75  300.00
2020-01-02    1.0  268.000000        NaN  268.00  268.00  268.00  268.00  268.00
2020-01-03    1.0  179.510000        NaN  179.51  179.51  179.51  179.51  179.51
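
Since you need these statistics for each location and not just for each day, you can also put the Lane column in the groupby. A sketch, assuming the same df as above; .unstack('Lane') pivots the lanes into columns, so a day where a lane has no rows shows up as NaN:

# per-lane daily mean and std; lanes become columns, missing (lane, day) pairs become NaN
per_lane_daily = (df.groupby(['Lane', pd.Grouper(key='Book date', freq='D')])['Purchase Price']
                    .agg(['mean', 'std'])
                    .unstack('Lane'))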

If needed, you can use the transpose method (.T) to flip the table, as shown by:

df.groupby([pd.Grouper(key='Book date', freq='D')])['Purchase Price'].describe().T

Which gives you:

Book date  2020-01-01  2020-01-02  2020-01-03
count        3.000000         1.0        1.00
mean       234.166667       268.0      179.51
std         96.817268         NaN         NaN
min        123.000000       268.0      179.51
25%        201.250000       268.0      179.51
50%        279.500000       268.0      179.51
75%        289.750000       268.0      179.51
max        300.000000       268.0      179.51

Hopefully this gets you your daily price average and std; after that, you can group by week!

For the second part of the problem, with your original column names:

your_df.groupby([pd.Grouper(key='Purchase Date', freq='W')])['Price'].mean()

In this example there is only the first week, so:

df.groupby([pd.Grouper(key='Book date', freq='W')])['Purchase Price'].describe().T

Book date  2020-01-05
count        5.000000
mean       230.002000
std         75.485909
min        123.000000
25%        179.510000
50%        268.000000
75%        279.500000
max        300.000000
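
And the same per-location idea works at the weekly level; a sketch, again assuming the example df from above:

# per-lane weekly mean and std, NaN where a lane has no purchases in a week
per_lane_weekly = (df.groupby(['Lane', pd.Grouper(key='Book date', freq='W')])['Purchase Price']
                     .agg(['mean', 'std'])
                     .unstack('Lane'))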