
I want to efficiently calculate the row-wise maximum (axis=1) of several columns in a very large dataset. The code I use now is: df["ia_timestamp"] = df[labels].values.max(axis=1). Here df is a Vaex DataFrame.
I suspect that the "values" step, which converts the data to a numpy.array, is the time-consuming part. Is there a better method?

Channing

1 Answer


The max method provided by vaex computes the maximum of a column, whereas in your case you want the maximum of each row.
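To make the column-versus-row distinction concrete, here is a small sketch in plain NumPy (not vaex) on a toy array I made up for illustration; it only shows what the two axis choices return:

```python
import numpy as np

# Toy data (illustrative only): two columns, four rows.
data = np.array([[1, 2],
                 [2, 3],
                 [3, 4],
                 [4, 1]])

col_max = data.max(axis=0)  # one value per column
row_max = data.max(axis=1)  # one value per row

print(col_max.tolist())  # [4, 4]
print(row_max.tolist())  # [2, 3, 4, 4]
```

The second result, one maximum per row, is the kind of output the question is after.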

To compute this you can use the apply method. Here is an example with vaex 3.0.0:

import vaex
import pandas as pd

df = pd.DataFrame(
    {
        "c1": [1, 2, 3, 4],
        "c2": [2, 3, 4, 1]
    }
)

df_vaex = vaex.from_pandas(df)

df_vaex.apply(lambda *x: max(x), arguments=["c1", "c2"])

And it gives you the expected output:

Expression = lambda_function_3(c1, c2)
Length: 4 dtype: int64 (expression)
-----------------------------------
0  2
1  3
2  4
3  4

Note: I used * before x so that the lambda accepts any number of columns. If you have a fixed number of columns, you can use the following:

df_vaex.apply(lambda c1, c2: max(c1, c2), arguments=["c1", "c2"])
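As a rough illustration of the difference between the two lambda forms, here is a plain-Python sketch that does not use vaex. It assumes the function is called once per row with one scalar per column, which is what the output above suggests; the names row_max_any, row_max_two and rows are made up for this example:

```python
# Variadic form: x is a tuple holding however many arguments are passed.
row_max_any = lambda *x: max(x)

# Fixed form: accepts exactly two arguments.
row_max_two = lambda c1, c2: max(c1, c2)

# Hypothetical rows, each holding the (c1, c2) values of one row.
rows = [(1, 2), (2, 3), (3, 4), (4, 1)]

variadic = [row_max_any(*r) for r in rows]
fixed = [row_max_two(*r) for r in rows]

print(variadic)  # [2, 3, 4, 4]
print(fixed)     # [2, 3, 4, 4]

# Only the variadic form keeps working when a third column is added.
three = row_max_any(5, 9, 7)
print(three)     # 9
```

Both forms agree for two columns; the variadic one is the safer choice when the list of column names is built at run time.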

In your case you will have to use:

df["ia_timestamp"] = df.apply(lambda *x: max(x), arguments=labels)