
Starting from a DataFrame with the columns A, B, D, and P:

import numba
import numpy as np
import pandas as pd
import vaex

d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)

df['C'] = 0

df

Exporting to CSV and converting to HDF5 with vaex:

df.to_csv("v_df", index=False)
    
df = vaex.from_csv("v_df", convert=True, chunk_size=5_000_000)

I need to calculate column 'C' row by row; the desired result is as follows:

'C': [0, 1.666666667, 0.552380952, 2.333630952, 0.962202381, 6.38155722, 5.714890553]

To reach this result I need to iterate over the rows, starting at row 1, and calculate something like:

%%time
@numba.njit
def func(B, D, C, b_0=0):
    n = len(B)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
    return b
df['C'] = func(df['B'].to_numpy(), df['D'].to_numpy(), df['C'].to_numpy())
df

But it doesn't work.

The result is only correct for the first and second rows:

'C': [0, 1.666666667, -1.114286, 1.781250, -1.371429, 5.419355, -0.666667]

I've also tried 'shift' (sammywemmy's suggestion in the comments). It works fine on this example DataFrame before 'vaex.from_csv', but when I apply the same concept to the big DataFrame via vaex, 'shift' does not work (a sketch of the pandas version is shown below).
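
For reference, here is roughly what the shift-based version looks like on the plain pandas DataFrame (i.e. before the vaex conversion); the names 'pdf' and 'C_shift' are mine, just for illustration:

# shift/cumsum version of the recurrence, following sammywemmy's comment below
pdf = pd.DataFrame(data=d)                 # the original pandas DataFrame
pdf['C_shift'] = (
    pdf.B.shift(-1).sub(pdf.B).div(pdf.B)  # (B[i+1] - B[i]) / B[i]
       .mul(pdf.D.shift(-1))               # ... * D[i+1]
       .cumsum()                           # accumulate the per-row terms
       .shift(fill_value=0)                # shift down one row, starting at 0
)
# pdf['C_shift'] matches the desired values:
# [0, 1.666667, 0.552381, 2.333631, 0.962202, 6.381557, 5.714891]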

So, the question is: is there a 'good practice' way to perform this calculation (iterating over the rows so that each value in column 'C' depends on the previous values of 'C') with a large data source? I'm using vaex to import from a CSV with about 1 million rows.

Thanks in advance.

  • try this : ``df.B.shift(-1).sub(df.B).div(df.B).mul(df.D.shift(-1)).cumsum().shift(fill_value=0)`` – sammywemmy Apr 11 '21 at 23:29
  • sammywemmy, thanks for your fast answer and help... It works perfectly for my example created with a simple pandas DataFrame! Thanks! But... when I try to apply the concept to the big DataFrame from vaex, it is not working. It shows the message: 'AttributeError: 'Expression' object has no attribute 'shift''. Thanks for your help... – juliothomaz Apr 11 '21 at 23:39
  • I am not familiar with vaex; you could have a look at their API and see if there is alternative to `shift` – sammywemmy Apr 11 '21 at 23:40
  • Once again, I would like to thank you for your help and kindness. Thanks! – juliothomaz Apr 11 '21 at 23:55

1 Answer


I am not sure it is the optimal solution, but at least it works: you can use the apply method with vectorize set to True.

Here is the full snippet:

import numba
import numpy as np
import pandas as pd
import vaex

d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)

df['C'] = 0

df

# I removed the b_0 for simplicity
def my_func(B, D, C):
    n = len(B)
    b = np.full(n, 0, dtype=np.float64)
    for i in range(1, n):
        b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]

    return b

df_vaex = vaex.from_pandas(df)

df_vaex.apply(my_func, arguments=[df_vaex["B"], df_vaex["D"], df_vaex["C"]], vectorize=True)

It gives the expected output:

0          0
1    1.66667
2   -1.11429
3    1.78125
4   -1.37143
5    5.41935
6  -0.666667

Basically, the apply method lets you apply a function row by row, and if you set vectorize to True the full arrays are passed instead of the individual row values. In your case that is mandatory, since you need the values from the previous rows to compute the C value for a given row.
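
If you also want to keep the result, you should be able to assign the expression returned by apply to a new column, as with any other vaex expression; a minimal sketch, where 'C_new' is an arbitrary column name chosen for illustration:

# store the result of apply as a (virtual) column of the vaex DataFrame
df_vaex['C_new'] = df_vaex.apply(
    my_func,
    arguments=[df_vaex['B'], df_vaex['D'], df_vaex['C']],
    vectorize=True,
)
print(df_vaex)  # 'C_new' is evaluated when the DataFrame is displayed or exported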

  • Perfect! I am very happy to achieve results without need of 'shift' and to learn a bit more! Thank you very much! – juliothomaz Apr 14 '21 at 01:52
  • Perfect! Another suggestion is to implement the shift method for vaex; here you can find the answer related to this task: https://stackoverflow.com/a/61075267/8056572 – M. Perier--Dulhoste Apr 14 '21 at 16:51
  • Hi, I tried to understand your solution, and I think I am lost. At no point in the code you have posted do you store the result in column 'C'. So how/when in the calculation is the result from the previous row, C[i-1], retrieved to be used for the next row, C[i]? Thanks for clarifying. – pierre_j Nov 15 '21 at 20:10