I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with date and experiment conditions as well as additional non-numerical conditions such as filenames.
I would like to
- use the split-apply-combine paradigm
- normalize within groups, using aggregate statistics of subgroups
- use different normalizations (e.g. divide-by-control-mean, Z-score)
- apply this to all numerical columns (observables)
- finally, generate an augmented data table which has the same structure as the original, but with additional columns, e.g. for column Observable1 a column normalized_Observable1 should be added
A simplified data table that has this structure can be generated with this code snippet::
import numpy as np
import pandas as pd
df = pd.DataFrame({
'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
'date': ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
'observation1': [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
'observation2': [3.1, 2.2, 2.1, 1.2, 2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
'observation3': [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
'rawsource': ["1.tif", "2.tif", "3.tif", "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)
and would look like this
condition date observation1 observation2 observation3 rawsource
0 ctrl 20170131 1.2 3.10 2.00 1.tif
1 abc 20170131 2.2 2.20 1.20 2.tif
2 ctrl 20170131 1.3 2.10 1.20 3.tif
3 abc 20170131 1.1 1.20 2.01 4.tif
4 def 20170131 2.3 2.40 2.55 5.tif
5 ctlr 20170606 2.3 1.20 2.05 6.tif
6 ctlr 20170606 4.2 1.50 1.66 7.tif
7 asdasd 20170606 3.3 1.33 3.20 8.tif
8 afff 20170606 5.1 1.50 3.21 9.tif
9 afff 20170606 3.3 1.60 3.04 10.tif
10 gr1 20170404 3.4 1.40 8.01 11.tif
11 gr2 20170404 5.5 1.30 9.10 12.tif
12 gr2 20170404 9.9 0.90 7.06 13.tif
13 ctrl 20170404 3.2 0.78 8.10 14.tif
14 ctrl 20170404 1.1 1.20 7.90 15.tif
15 kjkj 20161212 3.3 4.00 5.12 16.tif
16 asht 20161212 1.2 5.00 5.23 17.tif
17 ctrl 20161212 5.4 6.00 5.15 18.tif
Now, for each experiment date I have different experimental conditions, but I always have condition named ctrl
. One of normalizations I'd like to perform is to calculate the (for each numerical column) the mean of the control experiment for that date and then divide all observables from that date by their corresponding mean.
I can quickly calculate some of the per-date, per-condition summary statiscs using:
grsummary = df.groupby(["date","condition"]).agg((min, max, np.nanmean, np.nanstd))
Then I would like to apply these summary statistics in a normalization for each experiment date:
grdate = df.groupby("date")
and apply the normalization in a fashion like this:
def normalize_by_ctrlmean(grp_frame, summarystats):
# the following is only pseudo-code as I don't know how to do this
grp_frame/ summarystats(nanmean)
grdate.apply(normalize_by_cntrlmean, summarystats= grsummary)
The last step is only pseudo-code. This is what I'm struggling with. I could do the normalization using nested for-loops over dates, conditions, and column names of numerical columns but I'm new to the split-apply-combine paradigm and I think there must be a simple solution ? Any help greatly appreciated.