I am working with Pandas code that reads many files and then performs various operations on each file inside a loop (which iterates over a file list).
I am trying to convert this from a Pandas-based approach to a Dask-based one and have the following attempt so far. I am new to Dask and would like to know whether this is a reasonable approach.
Here is what the input data looks like:
     A        X1        X2        X3  A_d  S_d
0  1.0  0.475220  0.839753  0.872468    1    1
1  2.0  0.318410  0.940817  0.526758    2    2
2  3.0  0.053959  0.056407  0.169253    3    3
3  4.0  0.900777  0.307995  0.689259    4    4
4  5.0  0.670465  0.939116  0.037865    5    5
Here is the code:
import dask.dataframe as dd
import numpy as np
import pandas as pd

def my_func(df, r):
    # perform representative calculations
    q = df.columns.tolist()
    # normalize every column except 'A' by its column sum
    df2 = df.loc[:, q[1:]] / df.loc[:, q[1:]].sum()
    df2['A'] = df['A']
    # keep only rows whose 'A' value lies inside the range r
    df2 = df2[(df2['A'] >= r[0]) & (df2['A'] <= r[1])]
    c = q[1:-2]
    A = df2.loc[:, c].sum()
    tx = df2.loc[:, c].min() * df2.loc[:, c].max()
    return A - tx

list_1 = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    N = my_func(df, [751.7, 790.4])  # perform calculations
    out = ['X' + str(j) + '_2', df['A'].min()] + N.compute().tolist()
    list_1.append(out)
df_f = pd.DataFrame(list_1)
my_func returns a Dask Series N. Currently, I must .compute() the Dask Series before I can convert it into a list. I am having trouble overcoming this.
- Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? E.g., in Pandas, I tend to do this: df_N = pd.DataFrame() would go outside the for loop and then something like df_N = pd.concat([df_N,N],axis=0). This would allow a Dask DF to be built up in the for loop. After that (outside the loop), I could easily just horizontally concatenate the built-up Dask DF to pd.DataFrame(list_1).
- Another approach is to create a single-row Dask DF from the Dask Series N. Then, vertically concatenate this single-row DF to a blank Dask DF (that was created outside the loop). Is it possible in Dask to create a single-row Dask DataFrame from a Series?
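To make the target shape concrete, here is a minimal Pandas-only sketch of the row-wise build-up I have in mind. It is not Dask code. The helper append_rows, the stand-in Series, and the label rows are hypothetical, and it collects one-row frames in a list and concatenates once instead of calling pd.concat inside the loop:

```python
import pandas as pd

def append_rows(series_list):
    """Stack each Series as one row of a single DataFrame."""
    # to_frame().T turns a Series (index = column labels) into a one-row frame
    rows = [s.to_frame().T for s in series_list]
    # concatenate once at the end instead of growing a frame inside the loop
    return pd.concat(rows, axis=0, ignore_index=True)

# hypothetical stand-ins for the per-file results N (already computed)
results = [
    pd.Series({'X1': 1.0, 'X2': 2.0, 'X3': 3.0}),
    pd.Series({'X1': 4.0, 'X2': 5.0, 'X3': 6.0}),
]
df_N = append_rows(results)

# hypothetical stand-in for pd.DataFrame(list_1): a label and min(A) per file
labels = pd.DataFrame([['X1_2', 1.0], ['X2_2', 1.0]])

# horizontal concatenation of the labels with the built-up rows
df_f = pd.concat([labels, df_N], axis=1)
```

What I am looking for is the Dask equivalent of df_N, ideally without calling .compute() on every iteration.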
Additional Information (if needed):
- In my real code, I am reading from a *.csv file inside a loop. For this reason, when I generated a sample dataset, I wrote it to a *.csv file in order to use dd.read_csv() inside the loop.
- df2['A'] = df['A'] - this line is needed since the line above it omits column A (during a normalization of each column to its sum) and produces a new DataFrame. df2['A'] = df['A'] adds column A back to the new DataFrame.
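For anyone who wants to reproduce this, a sample file with the same column layout could be generated roughly as follows. This is only a sketch: the helper make_sample, the seed, and the random values are assumptions, not the exact data shown above.

```python
import numpy as np
import pandas as pd

def make_sample(n_rows=5, seed=0):
    """Build a small frame with columns A, X1, X2, X3, A_d, S_d."""
    rng = np.random.default_rng(seed)
    # three columns of uniform random values in [0, 1)
    df = pd.DataFrame(rng.random((n_rows, 3)), columns=['X1', 'X2', 'X3'])
    # 'A' runs 1.0, 2.0, ... and goes first, as in the sample input
    df.insert(0, 'A', np.arange(1, n_rows + 1, dtype=float))
    # integer columns that mirror the row number
    df['A_d'] = np.arange(1, n_rows + 1)
    df['S_d'] = np.arange(1, n_rows + 1)
    return df

sample = make_sample()
# Uncomment to write the file that the loop reads with dd.read_csv():
# sample.to_csv('Test_file.csv', index=False)
```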