1

For a project I am working on, I need to calculate the average price of products for shops. Every time a shop changes the price of a product, a new entry is added to the dataset. If a shop stops selling a product (temporarily or permanently), an entry is made with the timestamp and a price value of -1. Example:

          timestamp     shop  product    price

2014-01-01 10:07:32        E        4    19.99
2014-01-01 10:07:32        F        5    54.00
2014-01-02 14:41:12        A        1    28.00
2014-01-02 14:41:12        D        3   249.99
2014-01-02 15:12:38        C        1    29.99
2014-01-03 14:05:12        B        2    43.00
2014-01-05 12:21:57        F        5    49.99
2014-01-06 23:55:32        F        5    -1
2014-01-07 03:05:12        B        2    39.99
2014-01-07 11:24:49        D        3    -1
2014-01-08 11:35:33        C        2    40.99
2014-01-08 16:28:07        F        5    65.00
2014-01-12 21:41:04        E        3   199.00

Test cases:

  • Shop A, which has no price entry for product 1 within the period to calculate
  • Shop B, which switches prices for product 2 within the period
  • Shop C, which starts selling product 2 during the period, and sells product 1 all the way through
  • Shop D, which stops selling product 3 during the period
  • Shop E, which starts selling product 3 after the period and sells product 4 throughout
  • Shop F, which, for product 5, changes the price, then stops selling, then starts again at a new price, all within the period

The period to find the averages for is from 2014-01-05 00:00:00 to 2014-01-10 23:59:59.

What I need to do is calculate the average price within a certain period for a certain shop, and overall. The average is time-weighted: 3 days at a price of 3 and 1 day at a price of 1 gives an average of 2.5 for those 4 days, not the plain mean of 2. I have two problems:

  • The starting value can be missing. The last price change most likely happened before the beginning of the time period being calculated, so I need a way to fill it in so that it is used in the average. In fact, it is possible that this is the only price in the whole period.
  • Calculating with -1 will give the wrong results. The value should be ignored, and the overall time delta should be reduced by the time the product is no longer available.

The expected output for the data given above is (prices rounded to the nearest cent):

shop   product    price
   A         1    28.00
   B         2    41.06
   C         1    29.99
   C         2    40.99
   D         3   249.99
   E         4    19.99
   F         5    53.81

I have tried using numpy.ma to mask out the -1 values, but I have been unsuccessful: neither isnan nor masked_less handles this.

Any idea as to how I can achieve this?
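To make the weighting concrete, here is a plain-Python sketch of the logic I am after (the function name, the list-of-tuples input, and picking shop B as the example are just for illustration, not working code I already have):

```python
from datetime import datetime

def time_weighted_avg(changes, start, end):
    # changes: sorted (timestamp, price) tuples for one shop/product;
    # price == -1 means "not for sale", and that span is excluded
    # from both the numerator and the denominator.
    price = None
    events = []
    for ts, p in changes:
        if ts <= start:
            price = p                     # last price set before the period
        else:
            events.append((min(ts, end), p))
    total = seconds = 0.0
    cursor = start
    for ts, p in events + [(end, None)]:  # sentinel closes the last span
        if price is not None and price != -1:
            dt = (ts - cursor).total_seconds()
            total += price * dt
            seconds += dt
        cursor, price = ts, p
    return total / seconds if seconds else float('nan')

# Shop B, product 2: 43.00 set before the period, 39.99 from Jan 7
changes = [(datetime(2014, 1, 3, 14, 5, 12), 43.00),
           (datetime(2014, 1, 7, 3, 5, 12), 39.99)]
print(round(time_weighted_avg(changes,
                              datetime(2014, 1, 5),
                              datetime(2014, 1, 10, 23, 59, 59)), 2))  # 41.06
```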

Edit: Edited test data and expected results to more clearly reflect the problem.

kfirbreger

2 Answers

0

AFAIR, pandas doesn't handle masked values the numpy.ma way. However, it does handle NaNs when computing the mean. The simplest solution is to replace the -1.00 prices in your DataFrame with np.nan, with something like:

price = dataframe['price']
price[price == -1] = np.nan
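(In more recent pandas versions that assignment can raise a SettingWithCopyWarning; a copy-safe sketch of the same idea, with made-up sample prices:)

```python
import numpy as np
import pandas as pd

# Made-up sample prices; -1 marks "not for sale"
df = pd.DataFrame({'price': [54.00, 49.99, -1.00, 65.00]})
df['price'] = df['price'].replace(-1, np.nan)  # -1 -> NaN
print(round(df['price'].mean(), 2))            # NaN is skipped: 56.33
```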
Pierre GM
  • Thank you for your answer. I have tried this to handle my second problem. However pandas's mean is not weighted. I need time weighted average. [See this question](http://stackoverflow.com/questions/10839701/time-weighted-average-with-pandas) – kfirbreger Jan 14 '14 at 14:51
0

It's not clear to me exactly what you need. It seems like a simple groupby will do the trick:

from io import StringIO

import numpy as np
import pandas

datatext = StringIO("""\
          timestamp                              shop  product   price
2014-01-02 14:41:12  3075774DFBB0014CC1257822003CE79B  1135972   28.00
2014-01-03 14:05:12  3075774DFBB0014CC1257822003CE79B  1129779   43.00
2014-01-03 20:49:12  FDB140FB8F5FB469C125713B0035474F  1643094  145.00
2014-01-04 07:26:12  3075774DFBB0014CC1257822003CE79B  1080521   27.00
2014-01-05 10:44:12  949BC26900E5E48BC125773D00520641  1149009   59.99
2014-01-06 13:21:53  FDB140FB8F5FB469C125713B0035474F  1644526   -1.00
2014-01-06 15:28:53  FDB140FB8F5FB469C125713B0035474F  1643094   -1.00
2014-01-06 13:21:53  00000000000000000000000001262068  1742831   -1.00
2014-01-07 12:00:10  9D973A188A017E3DC1256C220034A923  1067842   31.99
2014-01-09 12:25:54  78F58D53EA47E73AC12575F4004A42AA   232979  129.00
""")
df = pandas.read_table(datatext, index_col=[0], parse_dates=True, sep=r'\s\s+', engine='python')

# mask out the negative values (use .loc to avoid chained assignment)
df.loc[df['price'] < 0, 'price'] = np.nan

# group by the shop and product and compute the mean
df.groupby(by=['shop', 'product']).mean()

This gives me:

                                           price
shop                             product        
00000000000000000000000001262068 1742831     NaN
3075774DFBB0014CC1257822003CE79B 1080521   27.00
3075774DFBB0014CC1257822003CE79B 1129779   43.00
3075774DFBB0014CC1257822003CE79B 1135972   28.00
78F58D53EA47E73AC12575F4004A42AA 232979   129.00
949BC26900E5E48BC125773D00520641 1149009   59.99
9D973A188A017E3DC1256C220034A923 1067842   31.99
FDB140FB8F5FB469C125713B0035474F 1643094  145.00
FDB140FB8F5FB469C125713B0035474F 1644526     NaN
Paul H
  • This would work if the average were not time-weighted, and if -1 were not reversible within the period. See for instance shop F in my example data. – kfirbreger Jan 16 '14 at 08:39