
I have a data frame with items and their prices, something like this:

╔══════╦═════╦═══════╗
║ Item ║ Day ║ Price ║
╠══════╬═════╬═══════╣
║ A    ║ 1   ║ 10    ║
║ B    ║ 1   ║ 20    ║
║ C    ║ 1   ║ 30    ║
║ D    ║ 1   ║ 40    ║
║ A    ║ 2   ║ 100   ║
║ B    ║ 2   ║ 20    ║
║ C    ║ 2   ║ 30    ║
║ D    ║ 2   ║ 40    ║
║ A    ║ 3   ║ 500   ║
║ B    ║ 3   ║ 25    ║
║ C    ║ 3   ║ 35    ║
║ D    ║ 3   ║ 1000  ║
╚══════╩═════╩═══════╝

I want to exclude all rows from this df where the item has an average price over 200. So the filtered df should look like this:

╔══════╦═════╦═══════╗
║ Item ║ Day ║ Price ║
╠══════╬═════╬═══════╣
║ B    ║ 1   ║ 20    ║
║ C    ║ 1   ║ 30    ║
║ B    ║ 2   ║ 20    ║
║ C    ║ 2   ║ 30    ║
║ B    ║ 3   ║ 25    ║
║ C    ║ 3   ║ 35    ║
╚══════╩═════╩═══════╝

I'm new to python and pandas, but as a first step I was thinking of something like this to get a new df of average prices: avg_prices_df = df.groupby('ItemID').Price.mean().reset_index() — and then I'm not sure how to proceed from there. I'm not even sure that first step is correct.
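For reference, that first step is nearly right (reset_index just needs to be called), and it can be followed up with isin. A minimal pandas sketch, rebuilding the sample table above with an Item column (the real data uses ItemID):

```python
import pandas as pd

# sample frame matching the table above
df = pd.DataFrame({
    "Item":  ["A", "B", "C", "D"] * 3,
    "Day":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Price": [10, 20, 30, 40, 100, 20, 30, 40, 500, 25, 35, 1000],
})

# step 1: average price per item
avg_prices_df = df.groupby("Item")["Price"].mean().reset_index()

# step 2: items whose average price is under the threshold
items_in_range = avg_prices_df.loc[avg_prices_df["Price"] < 200, "Item"]

# step 3: keep only rows for those items
filtered_df = df[df["Item"].isin(items_in_range)]
print(filtered_df)  # only B and C rows remain
```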

To further complicate the matter, I am using vaex to read the data in hdf5 form, as I have over 400 million rows.

Many thanks in advance for any advice.

EDIT: So I got the following code working, though I am sure it is not optimised:

# create dataframe of ItemIDs and their average prices
df_item_avg_price = df.groupby(df.ItemID, agg=[vaex.agg.count('ItemID'), vaex.agg.mean('Price')])

# filter this new dataframe by the average price threshold
df_item_avg_price = df_item_avg_price[df_item_avg_price["Price_mean"] <= 50000000]

# create list of ItemIDs which have an average price under the threshold
items_in_price_range = df_item_avg_price['ItemID'].tolist()

# filter the original dataframe to include only rows with items in the price range
filtered_df = df[df.ItemID.isin(items_in_price_range)]

Any better way to do this?

dalayr
  • Please include your data in a format that is both readable and easy for others to use. – AMC Dec 08 '19 at 10:54
  • Yeah, I tried. How do I get tables to show up properly here? @jezrael's reply was so neat and tidy. I tried using a table-to-ascii converter and then pasting the ascii within the code quotes. – dalayr Dec 08 '19 at 12:55
  • I like the tabulate library for readability, CSV is great in terms of ease of use, and can be made quite readable. – AMC Dec 08 '19 at 15:31

2 Answers


Use GroupBy.transform to compute the per-group mean broadcast back to the original shape, so you can filter with boolean indexing to keep only the groups whose mean is under 200:

avg_prices_df = df[df.groupby('Item')['Price'].transform('mean') < 200]

Another solution with DataFrameGroupBy.filter:

avg_prices_df = df.groupby('Item').filter(lambda x: x['Price'].mean() < 200)
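Both approaches keep exactly the same rows with the original index; a quick sanity check (the frame here is rebuilt from the question's sample table):

```python
import pandas as pd

df = pd.DataFrame({
    "Item":  ["A", "B", "C", "D"] * 3,
    "Day":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Price": [10, 20, 30, 40, 100, 20, 30, 40, 500, 25, 35, 1000],
})

# transform broadcasts the per-group mean back to every row
a = df[df.groupby('Item')['Price'].transform('mean') < 200]

# filter drops whole groups whose mean fails the predicate
b = df.groupby('Item').filter(lambda x: x['Price'].mean() < 200)

assert a.equals(b)  # identical rows, identical index
```

transform is usually much faster on many groups, since filter calls a Python lambda once per group.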

print (avg_prices_df)
   Item  Day  Price
1     B    1     20
2     C    1     30
5     B    2     20
6     C    2     30
9     B    3     25
10    C    3     35

print (df.groupby('Item')['Price'].transform('mean'))
0     203.333333
1      21.666667
2      31.666667
3     360.000000
4     203.333333
5      21.666667
6      31.666667
7     360.000000
8     203.333333
9      21.666667
10     31.666667
11    360.000000
Name: Price, dtype: float64

Solution for vaex:

df_item_avg_price = df.groupby(df.ItemID).agg({'Price' : 'mean'})
df_item_avg_price = (df_item_avg_price[df_item_avg_price["Price"] <= 200])

df = df_item_avg_price.drop(['Price']).join(df, on='ItemID')
print (df)
  ItemID  Day  Price
0      B    1     20
1      B    2     20
2      B    3     25
3      C    1     30
4      C    2     30
5      C    3     35
jezrael

Let me show you how I would do it (vaex main author), although jezrael's answer above is almost there.

Let's first produce the dataframe:

data = [["A", 1,  10],
["A", 1,  10],
["B", 1,  20],
["C", 1,  30],
["D", 1,  40],
["A", 2, 100],
["B", 2,  20],
["C", 2,  30],
["D", 2,  40],
["A", 3, 500],
["B", 3,  25],
["C", 3,  35],
["D", 3, 1000]]

import vaex
df = vaex.from_arrays(Item=[k[0] for k in data], Day=[k[1] for k in data], Price=[k[2] for k in data])
print(df)
#    Item    Day    Price
0    A       1      10
1    A       1      10
2    B       1      20
3    C       1      30
4    D       1      40
...  ...     ...    ...
8    D       2      40
9    A       3      500
10   B       3      25
11   C       3      35
12   D       3      1000

And generate a dataframe with the average prices:

df_avg_price = df.groupby(df.Item, agg={"Price": 'mean'})
print(df_avg_price)
#  Item       Price
0  A       155
1  B        21.6667
2  C        31.6667
3  D       360

The latest release of vaex does not like to join back, because of the duplicate name (Item) in both dataframes, despite that being the column joined on. This is fixed in master (https://github.com/vaexio/vaex), but we can work around it by using a prefix for the right dataframe.

df2 = df.join(df_avg_price, on='Item', rprefix='avg')
df2 = df2[df2.avgPrice < 200]  # notice the prefix

If you want to get rid of the other columns:

df2 = df2[['Item', 'Day', 'Price']]  # keep only the original columns
print(df2)
#  Item      Day    Price
0  A           1       10
1  A           1       10
2  B           1       20
3  C           1       30
4  A           2      100
5  B           2       20
6  C           2       30
7  A           3      500
8  B           3       25
9  C           3       35
Maarten Breddels