I want to efficiently calculate the row-wise maximum (axis=1) of several columns in a very large dataset. The code I use now is: df["ia_timestamp"] = df[labels].values.max(axis=1)
Here df is a Vaex DataFrame.
I think the "values" step, which converts the data to a numpy.array, is time-consuming. Is there a better method?

Channing
1 Answer
The max method provided by vaex computes the maximum of a whole column, whereas in your case you want the maximum of each row.
To compute this you can use the apply method. Here is an example with vaex 3.0.0:
import vaex
import pandas as pd
df = pd.DataFrame(
{
"c1": [1, 2, 3, 4],
"c2": [2, 3, 4, 1]
}
)
df_vaex = vaex.from_pandas(df)
df_vaex.apply(lambda *x: max(x), arguments=["c1", "c2"])
And it gives you the expected output:
Expression = lambda_function_3(c1, c2)
Length: 4 dtype: int64 (expression)
-----------------------------------
0 2
1 3
2 4
3 4
Note: I used * before the x to make it usable for any number of columns. If you have a fixed number of columns you can use the following:
df_vaex.apply(lambda c1, c2: max(c1, c2), arguments=["c1", "c2"])
In your case you will have to use:
df["ia_timestamp"] = df.apply(lambda *x: max(x), arguments=labels)
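If calling a Python lambda for every row turns out to be slow on a very large dataset, the same row-wise maximum can also be written as a chain of pairwise element-wise maxima. The sketch below is only an illustration: it uses plain NumPy arrays as stand-ins for the columns (the names c1, c2 and columns are assumptions reused from the example above), and I have not checked here whether vaex expressions accept np.maximum directly, so treat that part as something to verify against your vaex version.

```python
from functools import reduce

import numpy as np

# Plain NumPy arrays standing in for the columns (illustrative data only).
c1 = np.array([1, 2, 3, 4])
c2 = np.array([2, 3, 4, 1])
columns = [c1, c2]

# Fold the element-wise maximum over the columns pairwise, so the
# columns are never stacked into one 2-D array first.
row_max = reduce(np.maximum, columns)

print(row_max.tolist())  # [2, 3, 4, 4]
```

This gives the same values as the lambda *x: max(x) version on the example data, and works for any number of columns in the list.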

M. Perier--Dulhoste