1

For a project I am working on, I need to calculate the average price of products for shops. Every time a shop changes the price of a product, a new entry is added to the dataset. If a shop stops selling a product (temporarily or permanently), an entry is made with the timestamp and a price value of -1. Example:

          timestamp     shop  product    price

2014-01-01 10:07:32        E        4    19.99
2014-01-01 10:07:32        F        5    54.00
2014-01-02 14:41:12        A        1    28.00
2014-01-02 14:41:12        D        3   249.99
2014-01-02 15:12:38        C        1    29.99
2014-01-03 14:05:12        B        2    43.00
2014-01-05 12:21:57        F        5    49.99
2014-01-06 23:55:32        F        5    -1
2014-01-07 03:05:12        B        2    39.99
2014-01-07 11:24:49        D        3    -1
2014-01-08 11:35:33        C        2    40.99
2014-01-08 16:28:07        F        5    65.00
2014-01-12 21:41:04        E        3   199.00

Test cases:

  • Shop A, which has no price entry for product 1 within the period to calculate
  • Shop B, which switches prices for product 2 within the period
  • Shop C, which starts selling product 2 during the period, and sells product 1 all the way through
  • Shop D, which stops selling product 3 during the period
  • Shop E, which starts selling product 3 after the period and sells product 4 throughout
  • Shop F, which, for product 5, changes the price, then stops selling, then starts again at a new price, all within the period

The period to find the averages for is from 2014-01-05 00:00:00 to 2014-01-10 23:59:59.

What I need to do is calculate the average price within a certain period for a certain shop, and overall. The average is time-weighted: 3 days at a price of 3 and 1 day at a price of 1 gives an average of 2.5 for those 4 days, not the plain mean of 2. I have two problems:

  • The starting value can be missing. The last price change most likely happened before the beginning of the time period being calculated, so I need a way to fill it in so that it is used in the average. In fact, it is possible that this is the only price in the whole period.
  • Calculating with -1 will give the wrong results. The value should be ignored, and the overall time delta should be reduced by the time the product is no longer available.

The expected output for the data given above is (prices rounded to the nearest cent):

shop   product    price
   A         1    28.00
   B         2    41.06
   C         1    29.99
   C         2    40.99
   D         3   249.99
   E         4    19.99
   F         5    53.81

I have tried using numpy.ma to mask out the -1 values, but I have been unsuccessful: neither isnan nor masked_less handles this.

Any idea as to how I can achieve this?
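To make the weighting concrete, here is a plain-Python sketch of the logic I am after (the function name, the list-of-tuples input, and picking shop B as the example are just for illustration, not working code I already have):

```python
from datetime import datetime

def time_weighted_avg(changes, start, end):
    # changes: sorted (timestamp, price) tuples for one shop/product;
    # price == -1 means "not for sale", and that span is excluded
    # from both the numerator and the denominator.
    price = None
    events = []
    for ts, p in changes:
        if ts <= start:
            price = p                     # last price set before the period
        else:
            events.append((min(ts, end), p))
    total = seconds = 0.0
    cursor = start
    for ts, p in events + [(end, None)]:  # sentinel closes the last span
        if price is not None and price != -1:
            dt = (ts - cursor).total_seconds()
            total += price * dt
            seconds += dt
        cursor, price = ts, p
    return total / seconds if seconds else float('nan')

# Shop B, product 2: 43.00 set before the period, 39.99 from Jan 7
changes = [(datetime(2014, 1, 3, 14, 5, 12), 43.00),
           (datetime(2014, 1, 7, 3, 5, 12), 39.99)]
print(round(time_weighted_avg(changes,
                              datetime(2014, 1, 5),
                              datetime(2014, 1, 10, 23, 59, 59)), 2))  # 41.06
```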

Edit: Edited test data and expected results to more clearly reflect the problem.

kfirbreger

2 Answers

0

AFAIR, pandas doesn't handle masked values the numpy.ma way. However, it does handle NaNs when computing the mean. The simplest solution is to replace the -1.00 prices in your DataFrame with np.nan, with something like:

price = dataframe['price']
price[price == -1] = np.nan
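(In more recent pandas versions that assignment can raise a SettingWithCopyWarning; a copy-safe sketch of the same idea, with made-up sample prices:)

```python
import numpy as np
import pandas as pd

# Made-up sample prices; -1 marks "not for sale"
df = pd.DataFrame({'price': [54.00, 49.99, -1.00, 65.00]})
df['price'] = df['price'].replace(-1, np.nan)  # -1 -> NaN
print(round(df['price'].mean(), 2))            # NaN is skipped: 56.33
```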
Pierre GM
  • Thank you for your answer. I have tried this to handle my second problem. However pandas's mean is not weighted. I need time weighted average. [See this question](http://stackoverflow.com/questions/10839701/time-weighted-average-with-pandas) – kfirbreger Jan 14 '14 at 14:51
0

It's not clear to me exactly what you need. It seems like a simple groupby will do the trick:

from io import StringIO

import numpy as np
import pandas

datatext = StringIO("""\
          timestamp                              shop  product   price
2014-01-02 14:41:12  3075774DFBB0014CC1257822003CE79B  1135972   28.00
2014-01-03 14:05:12  3075774DFBB0014CC1257822003CE79B  1129779   43.00
2014-01-03 20:49:12  FDB140FB8F5FB469C125713B0035474F  1643094  145.00
2014-01-04 07:26:12  3075774DFBB0014CC1257822003CE79B  1080521   27.00
2014-01-05 10:44:12  949BC26900E5E48BC125773D00520641  1149009   59.99
2014-01-06 13:21:53  FDB140FB8F5FB469C125713B0035474F  1644526   -1.00
2014-01-06 15:28:53  FDB140FB8F5FB469C125713B0035474F  1643094   -1.00
2014-01-06 13:21:53  00000000000000000000000001262068  1742831   -1.00
2014-01-07 12:00:10  9D973A188A017E3DC1256C220034A923  1067842   31.99
2014-01-09 12:25:54  78F58D53EA47E73AC12575F4004A42AA   232979  129.00
""")
df = pandas.read_table(datatext, index_col=[0], parse_dates=True, sep=r'\s\s+', engine='python')

# mask out the negative values (use .loc to avoid chained assignment)
df.loc[df['price'] < 0, 'price'] = np.nan

# group by the shop and product and compute the mean
df.groupby(by=['shop', 'product']).mean()

This gives me:

                                           price
shop                             product        
00000000000000000000000001262068 1742831     NaN
3075774DFBB0014CC1257822003CE79B 1080521   27.00
3075774DFBB0014CC1257822003CE79B 1129779   43.00
3075774DFBB0014CC1257822003CE79B 1135972   28.00
78F58D53EA47E73AC12575F4004A42AA 232979   129.00
949BC26900E5E48BC125773D00520641 1149009   59.99
9D973A188A017E3DC1256C220034A923 1067842   31.99
FDB140FB8F5FB469C125713B0035474F 1643094  145.00
FDB140FB8F5FB469C125713B0035474F 1644526     NaN
Paul H
  • This would work if the average were not time-weighted, and if -1 were not reversible within the period. See for instance shop F in my example data. – kfirbreger Jan 16 '14 at 08:39