I have a huge file, around 35 GB, stored as HDF5. I need to do calculations on some specific columns and insert the results as new columns. I know I can assign a new column directly as

df['new_column'] = 0 (or some other value). However, some of my calculations need the value from the previous row. In pandas, I can use iloc to get the value at the previous index, but pandas cannot handle a file this large; I frequently get a memory error when I try.

So how can I implement a function in Dask that uses the value from the previous row in its calculation? In other words, how can I implement an alternative to the iloc method? I already know how to use df.apply.

Code with an implementation would be appreciated. Thank you.

Urvish
  • I don't know Dask; I am going straight to Spark. This sounds hard in Spark too, but I bet someone has figured it out already. – Chad Bernier Aug 02 '18 at 02:08

1 Answer

Dask.dataframe does not implement iloc.

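You might be interested in rolling instead. Note that the window has to span two rows for the applied function to see the previous row's value as well as the current one:

df.rolling(window=2).apply(...)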
MRocklin