
I am working with Pandas code that involves reading a lot of files and then performing various operations on each file inside a loop (which iterates over a file list).

I am trying to convert this from a Pandas-based approach to a Dask-based approach and have the following attempt so far - I am new to Dask and would like to ask whether this is a reasonable approach.

Here is what the input data looks like:

     A        X1        X2        X3  A_d  S_d
0  1.0  0.475220  0.839753  0.872468    1    1
1  2.0  0.318410  0.940817  0.526758    2    2
2  3.0  0.053959  0.056407  0.169253    3    3
3  4.0  0.900777  0.307995  0.689259    4    4
4  5.0  0.670465  0.939116  0.037865    5    5

Here is the code:

import dask.dataframe as dd
import numpy as np
import pandas as pd

def my_func(df, r):  # perform representative calculations
    q = df.columns.tolist()

    # normalize every column except A by its column sum, then add A back
    df2 = df.loc[:, q[1:]] / df.loc[:, q[1:]].sum()
    df2['A'] = df['A']

    # keep only the rows whose A value lies inside the range r
    df2 = df2[(df2['A'] >= r[0]) & (df2['A'] <= r[1])]

    # reduce the X columns to a single summary Series
    c = q[1:-2]
    A = df2.loc[:, c].sum()

    tx = df2.loc[:, c].min() * df2.loc[:, c].max()

    return A - tx

list_1 = []
for j in range(1,13):
    df = dd.read_csv('Test_file.csv')
    N = my_func(df,[751.7,790.4]) # perform calculations
    out = ['X'+str(j)+'_2', df['A'].min()] + N.compute().tolist()
    list_1.append(out)
df_f = pd.DataFrame(list_1)

my_func returns a Dask Series N. Currently, I have to call .compute() on this Dask Series before I can convert it into a list, and I am having trouble getting around this.

  1. Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? e.g. in Pandas I tend to do this: df_N = pd.DataFrame() goes outside the for loop and then something like df_N = pd.concat([df_N, N], axis=0) goes inside it. This would allow a Dask DF to be built up in the for loop. After that (outside the loop), I could easily horizontally concatenate the built-up Dask DF to pd.DataFrame(list_1).
  2. Another approach is to create a single-row Dask DF from the Dask Series N and then vertically concatenate this single-row DF to a blank Dask DF (created outside the loop). Is it possible in Dask to create a single-row Dask DataFrame from a Series? (A rough Pandas sketch of both ideas is shown after this list.)
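For reference, here is a rough Pandas sketch of the pattern I have in mind (the small Series here is just a made-up stand-in for N):

import pandas as pd

df_N = pd.DataFrame()  # blank DataFrame created outside the loop
for j in range(3):
    # stand-in for the Series returned by my_func
    N = pd.Series([0.1 * j, 0.2 * j, 0.3 * j], index=['X1', 'X2', 'X3'])
    row = N.to_frame().T                   # single-row DataFrame built from the Series
    df_N = pd.concat([df_N, row], axis=0)  # vertically append the row

My question is whether the same pattern (or something equivalent) is possible when N is a Dask Series.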

Additional Information (if needed):

  • In my real code, I am reading from a *.csv file inside a loop. For this reason, when I generated a sample dataset, I wrote it to a *.csv file so that dd.read_csv() could be used inside the loop (a sketch of this is shown after this list).
  • df2['A'] = df['A'] - this line is needed since the line above it omits column A (during the normalization of each column by its sum) and produces a new DataFrame. df2['A'] = df['A'] adds column A back to the new DataFrame.
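For completeness, the sample dataset above was generated and written to a *.csv file along these lines (a rough sketch; the random values will obviously differ from the table shown above):

import numpy as np
import pandas as pd

n = 5
df_sample = pd.DataFrame({
    'A': np.arange(1, n + 1, dtype=float),  # 1.0 .. 5.0
    'X1': np.random.rand(n),
    'X2': np.random.rand(n),
    'X3': np.random.rand(n),
    'A_d': np.arange(1, n + 1),
    'S_d': np.arange(1, n + 1),
})
df_sample.to_csv('Test_file.csv', index=False)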
edesz
  • I suspect that you will receive a better answer more quickly if you are able to reduce your problem to smaller examples. You may want to read https://stackoverflow.com/help/mcve – MRocklin May 03 '17 at 17:22
  • Alright, how's that? I've removed Pandas-related material and truncated the remaining Dask-based code to keep it to a minimum. – edesz May 03 '17 at 17:56
  • And any thoughts about how to handle the series `N`? – edesz May 03 '17 at 18:36
  • 1
  • Here is a good example of a minimal question: http://stackoverflow.com/questions/43416809/on-dask-dataframe-apply-receiving-n-rows-of-value-1-before-actual-rows-proces . A Stack Overflow user can understand that code in around 20 seconds, which is a good number to shoot for. – MRocklin May 03 '17 at 23:19

1 Answer


Is it possible to vertically append N (which is a Dask Series) as a row to a blank Dask DF? e.g. in Pandas, I tend to do this: df_N = pd.DataFrame() would go outside the for loop and then something like df_N = pd.concat([df_N, N], axis=0). This would allow a Dask DF to be built up in the for loop. After that (outside the loop), I could easily horizontally concatenate the built-up Dask DF to pd.DataFrame(list_1).

You should never append rows one at a time to either a Pandas dataframe or a Dask dataframe; this is very inefficient. Instead, it is better to collect many Pandas/Dask dataframes together and then call the pd.concat or dd.concat function once at the end.
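For example, something along these lines (a rough sketch with made-up values; the same idea works with dd.concat for Dask dataframes):

import pandas as pd

pieces = []
for j in range(12):
    # stand-in for whatever each iteration actually produces
    s = pd.Series({'X1': 0.1 * j, 'X2': 0.2 * j, 'X3': 0.3 * j})
    pieces.append(s.to_frame().T)  # single-row DataFrame from a Series

# one concat at the end instead of growing a dataframe row by row
result = pd.concat(pieces, axis=0, ignore_index=True)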

Also, I note that you are calling compute within your for loop. It is recommended to call compute only after you have set up your entire computation, if possible; otherwise you are probably not getting much parallelism.
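Schematically, for your loop that might look something like this (an untested sketch that reuses your my_func and, for brevity, leaves out the df['A'].min() term):

import dask
import dask.dataframe as dd
import pandas as pd

lazy_results = []
for j in range(1, 13):
    df = dd.read_csv('Test_file.csv')
    lazy_results.append(my_func(df, [751.7, 790.4]))  # still lazy, no compute here

# a single compute call evaluates all the results at once
computed = dask.compute(*lazy_results)

list_1 = [['X' + str(j) + '_2'] + N.tolist()
          for j, N in enumerate(computed, start=1)]
df_f = pd.DataFrame(list_1)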

Note: I haven't actually gone through the trouble of understanding your code. I'm just responding to the questions at the end. Hopefully someone else comes along with a more comprehensive answer.

MRocklin
  • Is there a way to get a Dask series of dataframes into `dd.concat`? It keeps insisting that the dfs must be a list, so a Series is not accepted? – CMCDragonkai Oct 22 '19 at 06:20