
I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:

[Image: DataFrame]

I would like to create a new column based on a function that takes into account all the columns of the rows that have the same string in the "Template" column.

For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
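In case the image does not come through, here is a hypothetical sketch of the structure I mean (made-up numbers; the real values are in the image):

import pandas as pd

# hypothetical numbers -- the real values are in the image above
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})
print(df)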

How would you best approach this problem?

PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!

Here is the formula with which I solved this in Excel:

[Image: Excel chi-squared test formula]

Roberto Bertinetti
  • What is your desired output for the new column? You're only saying 'based on'; is that the sum of the other columns, or what? – Deusdeorum Mar 11 '16 at 09:44
  • First you need to groupby using Template and then do the mathematical part on the columns; you can use map/vectorize for this – Dark Matter Mar 11 '16 at 10:04
  • @HugoHonorem the output of the new column is a chi-squared test; I added an example of how I deal with this in Excel, maybe that makes it clearer! – Roberto Bertinetti Mar 11 '16 at 10:39
  • @DarkMatter I tried this solution with groupby or pivoting, but I am stuck on the second part, since I do not only need the sum of the 2 rows. Maybe the Excel solution I pasted might help you. Thanks again! – Roberto Bertinetti Mar 11 '16 at 10:42

2 Answers

3

This might be overly general but I would use some sort of function map if different things should be done depending on the template name:

import pandas as pd
import numpy as np
import collections

template_column = ['are|off', 'are|off', 'comp', 'comp', 'comp|city']
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n), columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column

# Use a defaultdict so that you can define a default value if a template is
# not defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)

# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
    'are|off': lambda df: df.sum().sum(),
    'comp': lambda df: df.mean().mean(),
    'something else': lambda df: df.mean().max()
})

# The lambda functions are just placeholders.  You could do whatever you want in these functions... for example:

def do_special_stuff(df):
    """Do something that uses rows and columns... 
    you could also do looping or whatever you want as long 
    as the result is a scalar, or a sequence with the same 
    number of rows as the template group it receives
    """
    crazy_stuff = np.prod(np.sum(df.values,axis=1)[:,None] + 2*df.values, axis=1)
    return crazy_stuff

function_map['comp'] = do_special_stuff

def wrap(f):
    """Wrap a function so that it returns an updated dataframe"""

    def wrapped(df):
        df = df.copy()
        new_column_data = f(df.drop('template', axis=1))
        df['new_column'] = new_column_data
        return df

    return wrapped

# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}

# throw everything back together
new_df = pd.concat([series_function_map[label](group)
                    for label, group in df.groupby('template')],
                   ignore_index=True)

# print your shiny new dataframe
print(new_df)

The result is then something like:

     Clicks  Impressions  Conversions   template  new_column
0  0.959765     0.111648     0.769329    are|off    4.030594
1  0.809917     0.696348     0.683587    are|off    4.030594
2  0.265642     0.656780     0.182373       comp    0.502015
3  0.753788     0.175305     0.978205       comp    0.502015
4  0.269434     0.966951     0.478056  comp|city         NaN

Hope it helps!

  • Hi Davis, from what I understand, this solution relies on aggregated data for each template, which is different from the way I need to calculate my output. The function I need should work as follows: suppose I have Template A in 2 rows (A.1 and A.2); I would need to fill the new column based both on the individual numbers in each of the 2 rows (A.1 and A.2) and on their sum (A.1 + A.2) (a sketch of this is included after these comments). Please feel free to let me know if this is not clear so I can improve my description :) – Roberto Bertinetti Mar 11 '16 at 17:07
  • @RobertoBertinetti, I added a `do_special_stuff(df)` function to demonstrate that you don't have to use a complete aggregate function like in the lambdas above it. This way you have pretty much total freedom. – Davis Kirkendall Mar 12 '16 at 02:52
  • thanks again for the explanation and clarification, will try this out and let you know! – Roberto Bertinetti Mar 15 '16 at 13:21
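A minimal sketch of that per-row idea, assuming the columns from the question and made-up numbers: each row of a Template group gets a value computed from its own numbers and from the group totals. The placeholder formula below only stands in for the actual chi-squared calculation.

import pandas as pd

# hypothetical numbers -- the real values come from the DataFrame in the question
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})

def per_row_stat(group):
    """Each row's value uses its own numbers AND the totals of its group."""
    group = group.copy()
    totals = group[["Clicks", "Impressions", "Conversions"]].sum()  # A.1 + A.2
    # placeholder formula -- swap in the real chi-squared computation here
    group["new_column"] = group["Clicks"] / totals["Clicks"] * totals["Conversions"]
    return group

new_df = df.groupby("Template", group_keys=False).apply(per_row_stat)
print(new_df)

groupby(...).apply(...) hands each Template group to the function as its own DataFrame, so both the individual rows and their sums are available at the same time.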
2

OK, so after the groupby you need to apply this formula, and you can do this in pandas as well...

import numpy as np

t = df.groupby("Template")  # this groups the rows by Template

def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)

# "b5", "b6", "c5", "c6" stand for the columns/cells shown in the image;
# assign the result onto the DataFrame itself, not onto the GroupBy object
df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])

Here b5, b6, ... are the column names for the cells shown in the image.

This should work for you, or you may need to make some minor changes to the maths there.
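One way to tie the groupby to that formula is transform('sum'), which broadcasts the per-Template totals back onto every row. Since b5/b6/c5/c6 are only visible in the image, mapping them onto Clicks and Conversions here is just an assumption (made-up numbers again):

import pandas as pd

# hypothetical numbers -- the real values are in the question's DataFrame
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})

# per-Template totals, broadcast back onto every row of the group
totals = df.groupby("Template")[["Clicks", "Conversions"]].transform("sum")

# same shape as b5/(b5+b6)*(c5+c6), but written against whole columns:
# row value / group total of Clicks * group total of Conversions
df["result"] = df["Clicks"] / totals["Clicks"] * totals["Conversions"]
print(df)

Because the arithmetic works on whole columns, no explicit loop or np.vectorize is needed here.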

Dark Matter