I have a data frame with items and their prices, something like this:
╔══════╦═════╦═══════╗
║ Item ║ Day ║ Price ║
╠══════╬═════╬═══════╣
║ A    ║ 1   ║ 10    ║
║ B    ║ 1   ║ 20    ║
║ C    ║ 1   ║ 30    ║
║ D    ║ 1   ║ 40    ║
║ A    ║ 2   ║ 100   ║
║ B    ║ 2   ║ 20    ║
║ C    ║ 2   ║ 30    ║
║ D    ║ 2   ║ 40    ║
║ A    ║ 3   ║ 500   ║
║ B    ║ 3   ║ 25    ║
║ C    ║ 3   ║ 35    ║
║ D    ║ 3   ║ 1000  ║
╚══════╩═════╩═══════╝
I want to exclude all rows from this df where the item has an average price over 200. In this example, A averages about 203.3 and D averages 360 across the three days, so every A and D row should be dropped, while B and C average about 21.7 and 31.7 and stay. So the filtered df should look like this:
╔══════╦═════╦═══════╗
║ Item ║ Day ║ Price ║
╠══════╬═════╬═══════╣
║ B    ║ 1   ║ 20    ║
║ C    ║ 1   ║ 30    ║
║ B    ║ 2   ║ 20    ║
║ C    ║ 2   ║ 30    ║
║ B    ║ 3   ║ 25    ║
║ C    ║ 3   ║ 35    ║
╚══════╩═════╩═══════╝
I'm new to Python and pandas, but as a first step I was thinking of something like this to get a new df of average prices: `avg_prices_df = df.groupby('ItemID').Price.mean().reset_index()`, and then I'm not sure how to proceed from there. I'm not even sure that first step is correct.
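On the small example above, I think the plain pandas version would be something like the sketch below (using the Item/Price column names from the example rather than my real ItemID column), but I don't know how well this translates to vaex:

```python
import pandas as pd

# the toy data from the example above
df = pd.DataFrame({
    'Item':  ['A', 'B', 'C', 'D'] * 3,
    'Day':   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Price': [10, 20, 30, 40, 100, 20, 30, 40, 500, 25, 35, 1000],
})

# average price of each row's item, broadcast back onto every row
avg_per_item = df.groupby('Item')['Price'].transform('mean')

# keep only rows whose item averages 200 or less
filtered_df = df[avg_per_item <= 200]
print(filtered_df)
```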
To further complicate matters, I am using vaex to read the data in HDF5 form, as I have over 400 million rows.
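For reference, I'm opening the file roughly like this (the file name is just a placeholder, not my real file); as I understand it, vaex memory-maps the HDF5 file, so the 400 million rows never have to fit in RAM:

```python
import vaex

# memory-mapped open of the HDF5 file (path is a placeholder)
df = vaex.open('items.hdf5')
```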
Many thanks in advance for any advice.
EDIT: So I got the following code working, though I am sure it is not optimised:

```python
# create a dataframe of ItemIDs and their average prices
df_item_avg_price = df.groupby(df.ItemID, agg=[vaex.agg.count('ItemID'), vaex.agg.mean('Price')])

# filter this new dataframe by the average price threshold
df_item_avg_price = df_item_avg_price[df_item_avg_price["Price_mean"] <= 50000000]

# create a list of ItemIDs whose average price is at or below the threshold
items_in_price_range = df_item_avg_price['ItemID'].tolist()

# filter the original dataframe to keep only rows whose items are in the price range
filtered_df = df[df.ItemID.isin(items_in_price_range)]
```

Any better way to do this?