4

I would like to average certain column values depending on whether a condition is met in another column. Specifically, if column 1 in the below dataframe is < 1700, I want to include the corresponding value in that row from column 51 in my average calculation. And if column 2 < 1700, I want to also include the value in that row from column 52 in my average calculation.

So, for row 0, the new calculated column for that row would be 64 (average of 65 & 63). For row 1, the average would be just 80 (column 51 value) since neither columns 2 nor 3 were less than 1700 and hence not included in the average calculation.

This is a simplified example as my actual dataframe has about 10 columns for conditions with 10 corresponding columns of values to average.

As a potential complexity, the column headers are numbers rather than traditional text labels and do not refer to the order of that column in the dataframe since I've excluded certain columns when I imported the csv file. In other words, column 51 isn't the 51st column in the dataframe.

When I run the below code I'm getting the following error:

ValueError: ("No axis named 1 for object type ", 'occurred at index 0')

Is there a more efficient way to code this and avoid this error? Thanks for your help!

import pandas as pd
import numpy as np

test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})

test_df

     1     2     3   51  52  53
0  1600  1500  2000  65  63  83
1  1600  2000  2000  80  82  80
2  1600  1400  2000  75  85  75
3  1700  1500  2000  80  85  76
4  1800  2000  2000  75  75  78


def calc_mean_based_on_conditions(row):

        list_of_columns_to_average = []
        for i in range(1,4):
            if row[i] < 1700:
                list_of_columns_to_average.append(i+50)

        if not list_of_columns_to_average:
            return np.nan
        else:
            return row[(list_of_columns_to_average)].mean(axis=1)

test_df['MeanValue'] = test_df.apply(calc_mean_based_on_conditions, axis=1)
benvc
  • 14,448
  • 4
  • 33
  • 54
newcoder
  • 65
  • 7
  • Does this answer your question? [Calculating mean of a specific column by specific rows](https://stackoverflow.com/questions/43306199/calculating-mean-of-a-specific-column-by-specific-rows) – Gonçalo Peres Jul 22 '20 at 16:54

3 Answers3

3

Something very relevant (supporting int as column names)- https://github.com/theislab/anndata/issues/31

Due to this bug/issue, I converted the column names to type string:

test_df = pd.DataFrame({'1':[1600,1600,1600,1700,1800],'2':[1500,2000,1400,1500,2000],
'3':[2000,2000,2000,2000,2000],'51':[65,80,75,80,75],'52':[63,82,85,85,75],'53': 
[83,80,75,76,78]})

Created a new dataframe - new_df to meet out requirements

new_df = test_df[['1', '2', '3']].where(test_df[['1','2','3']]<1700).notnull()

new_df now looks like this

       1      2      3
0   True   True  False
1   True  False  False
2   True   True  False
3  False   True  False
4  False  False  False

Then simply rename the column and check using 'where'

new_df = new_df.rename(columns={"1": "51", "2":"52", "3":"53"})
test_df['mean_value'] = test_df[['51', '52', '53']].where(new_df).mean(axis=1)

This should give you the desired output -

    1     2     3  51  52  53  mean_value
0  1600  1500  2000  65  63  83        64.0
1  1600  2000  2000  80  82  80        80.0
2  1600  1400  2000  75  85  75        80.0
3  1700  1500  2000  80  85  76        85.0
4  1800  2000  2000  75  75  78         NaN
hkr
  • 270
  • 1
  • 11
  • 1
    Using Pandas' `where` was the approach I was looking for to use with a boolean dataframe, instead of the typical series mask. Good answer! – m13op22 Oct 03 '19 at 14:19
1

I deleted my other answer because it was going down the wrong path. What you want to do is generate a mask of your conditional columns, then use that mask to apply a function to other columns. In this case, 1 corresponds to 51, 2 to 52, etc.

import pandas as pd
import numpy as np

test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})

test_df

     1     2     3   51  52  53
0  1600  1500  2000  65  63  83
1  1600  2000  2000  80  82  80
2  1600  1400  2000  75  85  75
3  1700  1500  2000  80  85  76
4  1800  2000  2000  75  75  78



# create dictionary to map columns to one another
l1=list(range(1,4))
l2=list(range(50,54))
d = {k:v for k,v in zip(l1,l2)}

d
{1: 51, 2: 52, 3: 53}

temp=test_df[l1] > 1700 # Subset initial dataframe, generate mask
for _, row in temp.iterrows(): #iterate through subsetted data
    list_of_columns_for_mean=list() # list of columns for later computation
    for k, v in d.items(): #iterate through each k:v and evaluate conditional for each row
        if row[k]:
            list_of_columns_for_mean.append(v)
            # the rest should be pretty easy to figure out

This is not an elegant solution, but it is a solution. Unfortunately, I've run out of time to dedicate to it, but hopefully this gets you pointed in a better direction.

HelpfulHound
  • 326
  • 2
  • 9
1

There is probably a better, vectorized way to do this, but you could do it without the function

import numpy as np
import pandas as pd
from collections import defaultdict

test_df = pd.DataFrame({1:[1600,1600,1600,1700,1800],2:[1500,2000,1400,1500,2000],
3:[2000,2000,2000,2000,2000],51:[65,80,75,80,75],52:[63,82,85,85,75],53:[83,80,75,76,78]})

# List of columns that you're applying the condition to
condition_cols = list(range(1,4))

# Get row and column indices where this condition is true
condition = np.where(test_df[condition_cols].lt(1700))

# make a dictionary mapping row to true columns
cond_map = defaultdict(list)
for r,c in zip(*condition):
    cond_map[r].append(c)

# Get the means of true columns
means = []
for row in range(len(test_df)):
    if row in cond_map:
        temp = []
        for col in cond_map[row]:
            # Needs 51 because of Python indexing starting at zero + 50
            temp.append(test_df.loc[row, col+51])
        means.append(temp)
    else:
        # If the row has no true columns (i.e row 4)
        means.append(np.nan)

test_df['Means'] = [np.mean(l) for l in means]   

The issue is indexing true rows and columns in a vectorized way.

m13op22
  • 2,168
  • 2
  • 16
  • 35