1

I want to use VAEX for lazy work wih my dataframe. After quick start with export big csv and some simple filters and extract() I have initial df for my work with 3 main columns: cid1, cid2, cval1. Each combitations of cid1 and cid2 is a workset with some rows where is cval1 is different. My df contents only valid cid1 and cid2. I want to save in df only rows with minimun cval1 and drop other. cval1 is float, cid1 and cid2 is int.

I try one filter:

df = df.filter(df.cid1 == 36 & df.cid2 == 182 & df.cval1 == df.min(df.cval1))

I must to receive in result df with only one row. But it not work properly, it's result: enter image description here

It's a first problem. But next I must to find minimum cval1 for each valid combination of cid1 and cid2.

I have list of tuples with each values cid1 and cid2:

cart_prod=[(2, 5), (3, 9), ...]

I think I try:

df_finally = vaex.DataFrame()
for x in cart_prod:
   df2 = df.filter(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1))
   df_finally = vaex.concat([df_finally, df2])

But the filter not valid, and VAEX can not concat with error that DataFrame have not attribute concat.. But I try really vaex.concat(list_of_dataframes).

I think may be I can use:

df.select(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1), name = "selection")

But I can't to make that df this selection take and use..

df = df.filter((df.cid1, df.cid2) in cart_prod)

This code have not result too..

Hmmm.. Help me please!

How to choose minimum df.cval1 for each combinations of df.cid1 and df.cid2 and save result to dataframe in VAEX?

Maybe goupby? But I don't understand how it works..

Jahspear
  • 151
  • 11

1 Answers1

0

I've not used VAEX but the documentation says its groupby syntax is virtually same as pandas. So, here is what I would do in Pandas:

import pandas as pd
import numpy as np


df["min_cid3"] = df.groupby(['cid1', 'cid2'])['cid3'].transform(np.min)

Then filter your df wherever cid3==min_cid3.

EDIT: As per OP's comment, above pandas solution is working but fails for VAEX. So, based on VAEX docs, I believe this would work there:

df.groupby(by=['cid1', 'cid2']).agg({'min_cid3': 'min'})

PS: I haven't installed VAEX, so if this doesn't work and you figure out the change needed, feel free to suggest edit.

Zircoz
  • 504
  • 3
  • 14
  • 1
    If I try it with vaex df I get error: "TypeError: 'GroupBy' object is not subscriptable." But if I transform df to pandas df, it's working properly. Thank you! It's a variant if I can't make it with vaex. – Jahspear Dec 16 '22 at 19:30
  • @Jahspear Try the new edit! – Zircoz Dec 16 '22 at 19:37
  • 1
    Thank you! It's working properly. But how I add to output another column from df? I see only cids1-3 – Jahspear Dec 16 '22 at 20:04
  • 1
    I see {'min_cid3': 'min'} is a dict but when I add column name as a key and as value I take an error. I think it bacause a value must be a function? I don't catch it, sorry) – Jahspear Dec 16 '22 at 20:15
  • 1
    I can: .agg({'cid4': 'max', 'cid3': 'min'}) but I can't put cid4 as is.. – Jahspear Dec 16 '22 at 20:30
  • @Jahspear there was no mention of cid4 in your question but i think i should stay as it was after aggregation. If not, can you edit the question and add what you want to do on cid4 and what it is currently ? I'll according add any code/explanation in my answer! – Zircoz Dec 17 '22 at 10:50
  • 1
    My df has not only cid1, cid2, cid3. It has many columns with statistical data for next calculating. How I can leave it without changes for rows where cid3 is minimal value? When I use your code, I can view only cids1-3 values. – Jahspear Dec 17 '22 at 11:56
  • @Jahspear Join the columns back but multicol joins [aren't available](https://github.com/vaexio/vaex/issues/852) so make a new col in both with string concat of the three columns and use `join` to retain only the ones with minimum cid3. This should be a different question so I am not adding the code for join in above answer. – Zircoz Dec 17 '22 at 14:47
  • 1
    Big thanx! In this stage I choose combined way between vaex that help me lazy work with first filters for big init df, and pandas for findig minimal values and filter, like in your first comment. May be next I'll can use only vaex... – Jahspear Dec 17 '22 at 15:13