I'm new to pandas, so please pardon my inexperience. It is hard to explain exactly what my question is off the bat, so instead I will walk you through an m.w.e. My question will be highlighted with the >
markdown (i.e. yellow)
I have a flattened dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({
'label': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
'ind_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'out_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
'out_2': [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
'ind_2': ['z', 'z','y', 'y', 'z', 'z', 'y', 'y', 'z', 'z', 'y', 'y', 'z', 'z', 'y', 'y']
#, and a whole lot of other columns
})
df.head()
ind_1 ind_2 label out_1 out_2
0 a z 1 1 16
1 a z 1 2 15
2 a y 1 3 14
3 a y 1 4 13
4 b z 1 5 12
where ind_1
and ind_2
are two independent variables, and out_1
and out_2
are two measurements made in response to altering the independent variables.
In addition, there are technical replicates (in this case 2) for each combination of the independent variables.
I want to quantile normalize across ind_1
i.e. quantile normalize the data for ind_1 == 'a'
and then again for ind_1 == 'b'
.
So my road map for tackling the problem is as follows:
- get the subset of the data to be quantile normalized across
- take the mean of the technical replicates (in this case, we will take the median values of
out_1
andout_2
grouped byind_2
) - quantile normalize this matrix
- merge with other meta data (the other columns not shown for simplicity)
- repeat 1-4 for all values of
ind_1
1 get subset of data
sdf = df[df.ind_1 == 'a']
sdf.head()
ind_1 ind_2 label out_1 out_2
0 a z 1 1 16
1 a z 1 2 15
2 a y 1 3 14
3 a y 1 4 13
8 a z 2 9 8
not so bad.
2 take mean of replicates
ind_2_means = sdf.groupby(['label', 'ind_2']).mean()
out_1 out_2
label ind_2
1 y 3.5 13.5
z 1.5 15.5
2 y 11.5 5.5
z 9.5 7.5
intermediate = ind_2_means.reset_index()
label ind_2 out_1_mean out_2_mean
0 1 y 3.5 13.5
1 1 z 1.5 15.5
2 2 y 11.5 5.5
3 2 z 9.5 7.5
piv = intermediate.pivot('label', 'ind_2')
out_1_mean out_2_mean
ind_2 y z y z
label
1 3.5 1.5 13.5 15.5
2 11.5 9.5 5.5 7.5
my quantile normalization function requires this to be without hierarchy i.e.:
ind_2 out_1_mean_y out_1_mean_z out_2_mean_y out_2_mean_z
label
1 3.5 1.5 13.5 15.5
2 11.5 9.5 5.5 7.5
Question 1: how can I convert
piv
to the above layout? (lets call this variablepiv_flattened
)
3 Quantile Normalization
qn = quantile_normalize(piv_flattened, ['out_1_mean_y', 'out_1_mean_z', 'out_2_mean_y', 'out_2_mean_z'])
# has same layout as piv_flattened, just quantile normalized
4 merge with original data
how can I
1.) return from this layout to that of
intermediate
(e.g. split up the keysout_1_mean_z
toout_1_mean
andz
) and2.) merge this back with the original data?
This likely has to modify the answer provided here
5 Repeat for other subsets
I will encapsulate 1-4 into a function
Other approaches are welcome and any optimizations (and explanations as to why they work) would be appreciated.
Expected Outcome
this is the form of the expected output, there is redundancy as you have to broadcast the means, but this preserves the data in the other columns
ind_1 ind_2 label out_1 out_2 out_1_mean out_2_mean ... <all-non-shown-columns>
0 a z 1 1 16 <a_z_out_1_mean_qn> <a_z_out_2_mean_qn>
1 a z 1 2 15 <a_z_out_1_mean_qn> <a_z_out_2_mean_qn>
2 a y 1 3 14 <a_y_out_1_mean_qn> <a_y_out_2_mean_qn>
3 a y 1 4 13 <a_y_out_1_mean_qn> <a_y_out_2_mean_qn>
4 b z 1 5 12 <b_z_out_1_mean_qn> <b_z_out_2_mean_qn>