Dask DataFrame calculate mean within multi-column groupings

Question

I have a data frame as shown in Image, what I want to do is to take the mean along the column 'trial'. It for every subject, condition and sample (when all these three columns has value one), take average of data along column trial (100 rows).

what I have done in pandas is as following

sub_erp_pd= pd.DataFrame()
for j in range(1,4):
    sub_c=subp[subp['condition']==j]
    for i in range(1,3073):
        sub_erp_pd=sub_erp_pd.append(sub_c[sub_c['sample']==i].mean(),ignore_index=True)

But this take alot of time.. So i am thinking to use dask instead of Pandas. But in dask i am having issue in creating an empty data frame. Like we create an empty data frame in pandas and append data to it.

image of data frame

as suggested by @edesz I made changes in my approach
EDIT

%%time
sub_erp=pd.DataFrame()
for subno in progressbar.progressbar(range(1,82)):
    try:
        sub=pd.read_csv('../input/data/{}.csv'.format(subno,subno),header=None)
    except:
        sub=pd.read_csv('../input/data/{}.csv'.format(subno,subno),header=None)    
    sub_erp=sub_erp.append(sub.groupby(['condition','sample'], as_index=False).mean())

Reading a file using pandas take 13.6 seconds while reading a file using dask take 61.3 ms. But in dask, I am having trouble in appending.

NOTE - The original question was titled Create an empty dask dataframe and append values to it.

This question asks about how to calculate a mean value within a grouping across multiple columns of a Dask DataFrame. One attempt at doing this was to access each group of values in each column separately, compute the mean and then append the mean to an empty Dask DataFrame in Python. In the accepted answer ([below](https://stackoverflow.com/a/55666676/4057186)), this is done with a multi-column *GROUP BY* across the relevant columns of the Dask DataFrame and then computes the mean. The accepted answer does not address how to create an empty Dask DataFrame and iteratively append values to it. — edesz, May 31 '21 at 17:49

edesz · Accepted Answer · 2021-05-31T18:26:46.387

If I understand correctly, you need to

use groupby (read more here) in order to group the subject, condition and sample columns
- this will gather all rows, which have the same value in each of these three columns, into a single group
take the average using .mean()
- this will give you the mean within each group

Generate some dummy data

df = df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)),
                        columns=['trial','condition','sample'])
df.insert(0,'subject',[1]*10 + [2]*30 + [5]*60)

print(df.head())
   subject  trial  condition  sample
0        1     71         96      34
1        1      2         89      66
2        1     90         90      81
3        1     93         43      18
4        1     29         82      32

Pandas approach

Aggregate and take mean

df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()

print(df_grouped.head(15))
    subject  condition  sample  trial
0         1         18      24     89
1         1         43      18     93
2         1         67      47     81
3         1         82      32     29
4         1         85      28     97
5         1         88      13     48
6         1         89      59     23
7         1         89      66      2
8         1         90      81     90
9         1         96      34     71
10        2          0      81     19
11        2          2      39     58
12        2          2      59     94
13        2          5      42     13
14        2          9      42      4

Dask approach

Step 1. Imports

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas

ddf = dd.from_pandas(df, npartitions=2)

Step 3. Aggregate and take mean

ddf_grouped = (
    ddf.groupby(['subject','condition','sample'])['trial']
        .mean()
        .reset_index(drop=False)
            )

with ProgressBar():
    df_grouped = ddf_grouped.compute()
[                                        ] | 0% Completed |  0.0s
[########################################] | 100% Completed |  0.1s

print(df_grouped.head(15))
    subject  condition  sample  trial
0         1         18      24     89
1         1         43      18     93
2         1         67      47     81
3         1         82      32     29
4         1         85      28     97
5         1         88      13     48
6         1         89      59     23
7         1         89      66      2
8         1         90      81     90
9         1         96      34     71
10        2          0      81     19
11        2          2      39     58
12        2          2      59     94
13        2          5      42     13
14        2          9      42      4

IMPORTANT NOTE: The approach in this answer does not use the approach of creating an empty Dask DataFrame and append values to it in order to calculate a mean within groupings of subject, condition and trial. Instead, this answer provides an alternate approach (using GROUP BY) to obtaining the desired end result (of calculating the mean within groupings of subject, condition and trial).

thanks @edesz it help me to decrease me execution time from days to 17 minute. I think if i can create an empty dask data frame, this can be improved more. btw thanks — Talha Anwar, Apr 14 '19 at 17:02
@TalhaAnwar, you are welcome. I have updated the answer to include an approach using `dask`. Also, if you find this answer useful, please mark this answer as the Accepted Answer. Thanks. — edesz, Apr 14 '19 at 17:51
thanks a lot, time as now reduced to 10 minute from 17. Though the output index is bit complicated in the form of pair (subject,condition). Btw thanks a lot — Talha Anwar, Apr 14 '19 at 20:24
Just use `.reset_index(drop=False)` - this should reset the index and then you'll get `subject` and `condition` as separate columns. — edesz, Apr 14 '19 at 20:36
When accepting answers, which don't answer the original question, please amend the title of the question, otherwise this just creates noise when people need information on "appending new rows to Dask dataframe". — Limeran, May 31 '21 at 17:28

Dask DataFrame calculate mean within multi-column groupings

1 Answers1