
I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:

[Image: DataFrame]

I would like to create a new column based on a function that takes into account all the columns of the rows that have the same string in the "Template" column.

For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
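In case the image does not come through, here is a hypothetical sketch of the structure I mean (made-up numbers; the real values are in the image):

import pandas as pd

# hypothetical numbers -- the real values are in the image above
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})
print(df)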

How would you best approach this problem?

PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!

Here is the formula with which I solved this in Excel:

[Image: Excel chi-squared test formula]

Roberto Bertinetti
  • What is your desired output for the new column? You're only saying 'based on'; is that the sum of the other columns, or what? – Deusdeorum Mar 11 '16 at 09:44
  • First you need to groupby using Template and then do the mathematical part on the columns; you can use map/vectorize for this – Dark Matter Mar 11 '16 at 10:04
  • @HugoHonorem the output of the new column is a chi-squared test; I added an example of how I deal with this in Excel, maybe that makes it clearer! – Roberto Bertinetti Mar 11 '16 at 10:39
  • @DarkMatter I tried this solution with groupby or pivoting, but I am stuck on the second part, since I do not only need the sum of the 2 rows. Maybe the Excel solution I pasted might help you. Thanks again! – Roberto Bertinetti Mar 11 '16 at 10:42

2 Answers

3

This might be overly general but I would use some sort of function map if different things should be done depending on the template name:

import pandas as pd
import numpy as np
import collections

template_column = ['are|off', 'are|off', 'comp', 'comp', 'comp|city']
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n), columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column

# Use a defaultdict so that you can define a default value if a template is
# not defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)

# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
    'are|off': lambda df: df.sum().sum(),
    'comp': lambda df: df.mean().mean(),
    'something else': lambda df: df.mean().max()
})

# The lambda functions are just placeholders.  You could do whatever you want in these functions... for example:

def do_special_stuff(df):
    """Do something that uses rows and columns... 
    you could also do looping or whatever you want as long 
    as the result is a scalar, or a sequence with the same 
    number of rows as the template group it receives
    """
    crazy_stuff = np.prod(np.sum(df.values,axis=1)[:,None] + 2*df.values, axis=1)
    return crazy_stuff

function_map['comp'] = do_special_stuff

def wrap(f):
    """Wrap a function so that it returns an updated dataframe"""

    def wrapped(df):
        df = df.copy()
        new_column_data = f(df.drop('template', axis=1))
        df['new_column'] = new_column_data
        return df

    return wrapped

# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}

# throw everything back together
new_df = pd.concat([series_function_map[label](group)
                    for label, group in df.groupby('template')],
                   ignore_index=True)

# print your shiny new dataframe
print(new_df)

The result is then something like:

     Clicks  Impressions  Conversions   template  new_column
0  0.959765     0.111648     0.769329    are|off    4.030594
1  0.809917     0.696348     0.683587    are|off    4.030594
2  0.265642     0.656780     0.182373       comp    0.502015
3  0.753788     0.175305     0.978205       comp    0.502015
4  0.269434     0.966951     0.478056  comp|city         NaN

Hope it helps!

  • Hi Davis, from what I understand, this solution relies on aggregated data for each template, which is different from the way I need to calculate my output. The function I need should work as follows: suppose I have Template A in 2 rows (A.1 and A.2); I would need to fill the new column based both on the individual numbers in each of the 2 rows (A.1 and A.2) and on their sum (A.1 + A.2) (a sketch of this is included after these comments). Please feel free to let me know if this is not clear so I can improve my description :) – Roberto Bertinetti Mar 11 '16 at 17:07
  • @RobertoBertinetti, I added a `do_special_stuff(df)` function to demonstrate that you don't have to use a complete aggregate function like in the lambdas above it. This way you have pretty much total freedom. – Davis Kirkendall Mar 12 '16 at 02:52
  • thanks again for the explanation and clarification, will try this out and let you know! – Roberto Bertinetti Mar 15 '16 at 13:21
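A minimal sketch of that per-row idea, assuming the columns from the question and made-up numbers: each row of a Template group gets a value computed from its own numbers and from the group totals. The placeholder formula below only stands in for the actual chi-squared calculation.

import pandas as pd

# hypothetical numbers -- the real values come from the DataFrame in the question
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})

def per_row_stat(group):
    """Each row's value uses its own numbers AND the totals of its group."""
    group = group.copy()
    totals = group[["Clicks", "Impressions", "Conversions"]].sum()  # A.1 + A.2
    # placeholder formula -- swap in the real chi-squared computation here
    group["new_column"] = group["Clicks"] / totals["Clicks"] * totals["Conversions"]
    return group

new_df = df.groupby("Template", group_keys=False).apply(per_row_stat)
print(new_df)

groupby(...).apply(...) hands each Template group to the function as its own DataFrame, so both the individual rows and their sums are available at the same time.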
2

OK, so after the groupby you need to apply this formula, and you can do this in pandas as well...

import numpy as np

t = df.groupby("Template")  # this groups the rows by Template

def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)

# "b5", "b6", "c5", "c6" stand for the columns/cells shown in the image;
# assign the result onto the DataFrame itself, not onto the GroupBy object
df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])

Here b5, b6, ... are the column names for the cells shown in the image.

This should work for you, or you may need to make some minor changes to the maths there.
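One way to tie the groupby to that formula is transform('sum'), which broadcasts the per-Template totals back onto every row. Since b5/b6/c5/c6 are only visible in the image, mapping them onto Clicks and Conversions here is just an assumption (made-up numbers again):

import pandas as pd

# hypothetical numbers -- the real values are in the question's DataFrame
df = pd.DataFrame({
    "Template":    ["[Are|Off]", "[Are|Off]", "[Comp]", "[Comp]"],
    "Clicks":      [100, 150, 80, 60],
    "Impressions": [1000, 1200, 900, 700],
    "Conversions": [10, 12, 8, 5],
})

# per-Template totals, broadcast back onto every row of the group
totals = df.groupby("Template")[["Clicks", "Conversions"]].transform("sum")

# same shape as b5/(b5+b6)*(c5+c6), but written against whole columns:
# row value / group total of Clicks * group total of Conversions
df["result"] = df["Clicks"] / totals["Clicks"] * totals["Conversions"]
print(df)

Because the arithmetic works on whole columns, no explicit loop or np.vectorize is needed here.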

Dark Matter