Calculating the correlation of each row with one in Group in Python

Question

I need some help to calculate 4 metrics for my table. What is the correct way to constantly refer to the same string within the department?

I have been trying to solve this problem for a very long time, but I can not come to a logical solution. I have tried Partition from SQLalchemy, but my skills wasn't enough. I tried the def function, but functions are my weak point.

Therefore, I ask you for help with a solution or advice. Thank you!!!

my real task is not about salaries)))) don't worry about it!

EXAMPLE:

import pandas as pd

df2=pd.DataFrame({
    'head_department': [ 'Anna','Anna', 'Anna','Anna','Anna',
                         'John', 'John','John',
                         'Denis', 'Denis',
                         'Sarah',
                         'Greg', 'Greg','Greg','Greg', 'Greg','Greg'],
    'serial_number': [1,2,3,4,5,1,2,3,1,2,1,1,2,3,4,5,6],
    'Worker_in_dpt': ['Anna','Anna-s_worker_2','Anna-s_worker_3','Anna-s_worker_4','Anna-s_worker_5',
                      'John','John-s worker_2','John-s_worker_3',
                      'Denis','Denis-s_worker_2',
                      'Sarah',
                      'Greg','Greg-s_worker_2', 'Greg-s_worker_3', 'Greg-s_worker_4','Greg-s_worker_5', 'Greg-s_worker_6'],
   'jan_salary': [1000, 600, 600, 500, 900, None, 600, 500, 1200, 800, 1400, None, 700, 600, 600, 450, 700],
   'feb_salary': [1100, 700, 700, 700, 800, None, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
   'mar_salary': [1200, 800, 800, 900, 700, 1300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, 700],
   'apr_salary': [1600, 900, 900, 700, 700, 2300, 500, 400, 1800, 900, 1100, 1900, 200, 900, 500, 150, 700],
   'may_salary': [1100, 700, 700, 700, 800, 2300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 700],
   'jun_salary': [1200, 800, 800, 900, 700, 1300, 500, 400, 1800, 900, 1000, 1400, 900, 800, 500, 450, 800],
   'jul_salary': [1000, 600, 600, 500, 900, 1300, 600, 500, 1200, 800, 1400, 1200, 700, 600, 600, 450, 700],
   'aug_salary': [1100, 700, 700, 700, 800, 2300, 600, 500, 1800, 600, 1100, 1600, 400, 700, 600, 250, None]
})


df2['serial_number'] = df2['serial_number'].astype('str') # else it computed as int, when i use numeric_only=True
df2[ 'mean_salary_per_period'] = df2.mean(numeric_only=True, axis=1)
df2

enter image description here

i'm trying calculate columns:

df2['corr_with_head_in_dpt'] =
df2['mean _when_1_in_NAN'] =
df2['mean _when_1_not_in_NAN'] =
df2['sum_all_dpt_in_jan'] =

WHERE:

df2['corr with head in dpt'] - Calculation of the correlation of the salary of each worker for the entire period within the department with the head of this department (1), the flag of the head of the department is always = 1!
df2['mean _when_1_in _NAN'] - the average salary of each worker for the period while the head of department is NAN. if the head of department has no NAN, then the average for the all period.
df2['mean _when_1_in_NAN'] - the average salary of each worker for the period starting from the first month, when the salary appeared at the head of the department. if the head of department had no NA then the average for the all period.
df2['sum all dpt in Jan'] - the sum of all workers department salaries in January, including the head of department's, even it's NAN too.

Thank you!!!

score 1 · Answer 1 · answered Jul 10 '23 at 17:28

The following will give you the desired output, using groupby.apply() with a function:

df = df2.loc[:, ~df2.columns.isin(
    ["Worker_in_dpt", "mean_salary_per_period"])] \
    .set_index(['head_department', 'serial_number']).stack(dropna=False)

def func(frame):
    # unstack for serial number as columns
    frame = frame.unstack(level=["serial_number"])
    # correlations and take first row
    corr = frame.corr().iloc[0]
    # find months that are nan for serial number 1 (first column)
    nan = frame.iloc[:, 0].isna()
    # mean of non-nans
    mean_not_nan = frame.loc[~nan, :].mean(axis=0)
    # if nan, mean for these, else same as non-nans
    if nan.sum() > 0:    
        mean_nan = frame.loc[nan, :].mean(axis=0)
    else:
        mean_nan = mean_not_nan.copy()
    # sum for january
    sum_jan = pd.Series(data=frame.loc[
        frame.index.get_level_values(-1).str.contains("jan")].sum(axis=1)[0],
        index=frame.columns)
    
    # concat all series and name columns
    return pd.concat([
        corr, mean_nan, mean_not_nan, sum_jan], axis=1) \
        .set_axis(
            ['corr with head in dpt',
             'mean _when_1_in _NAN',
             'mean _when_1_in_NAN',
             'sum all dpt in Jan'], axis=1)

# groupby head_department and apply func
additional_cols = df.groupby(level="head_department").apply(func)

# merge df2 with grouped outputs
out = pd.merge(df2, additional_cols,
               left_on=['head_department', 'serial_number'],
               right_index=True)

Calculating the correlation of each row with one in Group in Python

enter image description here

1 Answers1