Pandas: doing quantile normalization on a subset of a flattened dataset with replicates

Question

I'm new to pandas, so please pardon my inexperience. It is hard to explain exactly what my question is off the bat, so instead I will walk you through an m.w.e. My question will be highlighted with the > markdown (i.e. yellow)

I have a flattened dataframe that looks like this:

import pandas as pd
df = pd.DataFrame({
        'label': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],                                        
        'ind_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],        
        'out_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],                                    
        'out_2': [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],                                    
        'ind_2': ['z', 'z','y', 'y', 'z', 'z', 'y', 'y', 'z', 'z', 'y', 'y', 'z', 'z', 'y', 'y']
        #, and a whole lot of other columns         
    })
    df.head()

    ind_1   ind_2   label   out_1   out_2
0   a   z   1   1   16
1   a   z   1   2   15
2   a   y   1   3   14
3   a   y   1   4   13
4   b   z   1   5   12

where ind_1 and ind_2 are two independent variables, and out_1 and out_2 are two measurements made in response to altering the independent variables.

In addition, there are technical replicates (in this case 2) for each combination of the independent variables.

I want to quantile normalize across ind_1 i.e. quantile normalize the data for ind_1 == 'a' and then again for ind_1 == 'b'.

So my road map for tackling the problem is as follows:

get the subset of the data to be quantile normalized across
take the mean of the technical replicates (in this case, we will take the median values of out_1 and out_2 grouped by ind_2)
quantile normalize this matrix
merge with other meta data (the other columns not shown for simplicity)
repeat 1-4 for all values of ind_1

1 get subset of data

sdf = df[df.ind_1 == 'a']
sdf.head()

    ind_1   ind_2   label   out_1   out_2
0   a   z   1   1   16
1   a   z   1   2   15
2   a   y   1   3   14
3   a   y   1   4   13
8   a   z   2   9   8

not so bad.

2 take mean of replicates

ind_2_means = sdf.groupby(['label', 'ind_2']).mean()


                out_1   out_2
label   ind_2       
1   y           3.5     13.5
    z           1.5     15.5
2   y           11.5    5.5
    z           9.5     7.5


intermediate = ind_2_means.reset_index()



    label   ind_2   out_1_mean  out_2_mean
0   1   y   3.5     13.5
1   1   z   1.5     15.5
2   2   y   11.5    5.5
3   2   z   9.5     7.5



piv = intermediate.pivot('label', 'ind_2')


        out_1_mean      out_2_mean
ind_2   y       z       y       z
label               
1       3.5     1.5     13.5    15.5
2       11.5    9.5     5.5     7.5

my quantile normalization function requires this to be without hierarchy i.e.:

ind_2   out_1_mean_y    out_1_mean_z    out_2_mean_y    out_2_mean_z
label               
1       3.5     1.5     13.5    15.5
2       11.5    9.5     5.5     7.5

Question 1: how can I convert piv to the above layout? (lets call this variable piv_flattened)

3 Quantile Normalization

qn = quantile_normalize(piv_flattened, ['out_1_mean_y', 'out_1_mean_z', 'out_2_mean_y', 'out_2_mean_z'])

# has same layout as piv_flattened, just quantile normalized

4 merge with original data

how can I

1.) return from this layout to that of intermediate (e.g. split up the keys out_1_mean_z to out_1_mean and z) and

2.) merge this back with the original data?

This likely has to modify the answer provided here

5 Repeat for other subsets

I will encapsulate 1-4 into a function

Other approaches are welcome and any optimizations (and explanations as to why they work) would be appreciated.

Expected Outcome

this is the form of the expected output, there is redundancy as you have to broadcast the means, but this preserves the data in the other columns

    ind_1   ind_2   label   out_1   out_2 out_1_mean out_2_mean    ... <all-non-shown-columns>
0   a   z   1   1   16  <a_z_out_1_mean_qn>     <a_z_out_2_mean_qn>
1   a   z   1   2   15  <a_z_out_1_mean_qn>     <a_z_out_2_mean_qn>
2   a   y   1   3   14  <a_y_out_1_mean_qn>     <a_y_out_2_mean_qn>
3   a   y   1   4   13  <a_y_out_1_mean_qn>     <a_y_out_2_mean_qn>
4   b   z   1   5   12  <b_z_out_1_mean_qn>     <b_z_out_2_mean_qn>

score 1 · Accepted Answer · answered Jan 15 '18 at 16:56

1

For you 1st question

piv.columns=piv.columns.map('_'.join)
piv
Out[431]: 
       out_1_y  out_1_z  out_2_y  out_2_z
label                                    
1          3.5      1.5     13.5     15.5
2         11.5      9.5      5.5      7.5

answered Jan 15 '18 at 16:56

BENY

317,841
20
164
234