1

Let's assume I have the following data (in reality I have millions of entries):

import pandas as pd
import numpy as np
import datetime as dt

df_data = pd.DataFrame([
    [dt.date(2023, 5, 8),  'Firm A', 'AS', 250.0, -1069.1],
    [dt.date(2023, 5, 8),  'Firm A', 'JM', 255.0, -1045.5],
    [dt.date(2023, 5, 8),  'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 11), 'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 8),  'Firm B', 'AS',  31.9,  -317.9],
    [dt.date(2023, 5, 8),  'Firm B', 'JM',  33.5,  -310.7],
    [dt.date(2023, 5, 8),  'Firm B', 'WC',  34.5,  -305.9],
    [dt.date(2023, 5, 11), 'Firm B', 'AS',  33.0,  -313.1],
    [dt.date(2023, 5, 11), 'Firm B', 'JM',  33.5,  -310.7],
    [dt.date(2023, 5, 11), 'Firm B', 'WC',  35.0,  -303.5],
    [dt.date(2023, 5, 10), 'Firm C', 'BC', 167.0,   301.0],
    [dt.date(2023, 5, 9),  'Firm D', 'BA', 791.9,  1025.0],
    [dt.date(2023, 5, 9),  'Firm D', 'CT', 783.8,  1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'BA', 783.8,  1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'CT', 767.9,   950.0]],
    columns=['Date', 'Name', 'Source', 'Value1', 'Value2'])

Now for each Name I want to find its first & last available Date and compute the mean for the columns Value1 & Value2 for each of the dates. And ultimately, I want to compute the change in both values between the first & last date.

The problem is that not all names have data on the same dates, and some names only have data on 1 date.

The following approach works:

def compute_entry(df: pd.DataFrame) -> dict:
    dt_min  = df.Date.min()
    dt_max  = df.Date.max()
    idx_min = df.Date == dt_min
    idx_max = df.Date == dt_max
    data = {
        'Min Date':        dt_min,
        'AvgValue1 (Min)': df[idx_min].Value1.mean(),
        'AvgValue2 (Min)': df[idx_min].Value2.mean(),
        '#Sources (Min)':  df[idx_min].Value2.count(),
        'Max Date':        dt_max,
        'AvgValue1 (Max)': df[idx_max].Value1.mean(),
        'AvgValue2 (Max)': df[idx_max].Value2.mean(),
        '#Sources (Max)':  df[idx_max].Value2.count(),
        'Value1 Change':   df[idx_max].Value1.mean() - df[idx_min].Value1.mean(),
        'Value2 Change':   df[idx_max].Value2.mean() - df[idx_min].Value2.mean()
    }
    return data

df_pivot = pd.DataFrame.from_dict({sn_id: compute_entry(df_sub)
                                   for sn_id, df_sub in df_data.groupby('Name')}, orient='index')

And gives the desired format: Pivot Table However, this approach is very slow for many entries.

So instead I tried using pd.pivot_table which is much faster:

pd.pivot_table(df_data,
               index=['Name', 'Date'],
               aggfunc={'Value1': np.mean, 'Value2': np.mean, 'Source': len})

But the output is not quite in the right format, and I find it difficult to convert the pivot table into the same desired format as above.

Is there a good way to use pandas built-in (vectorised) functions to achieve the desired format?

Phil-ZXX
  • 2,359
  • 2
  • 28
  • 40

2 Answers2

1

Use GroupBy.transform for get rows with minimal and maximal dates, then use named aggreagtion in GroupBy.agg and last create columns in DataFrame.assign:

g = df_data.groupby('Name')['Date']

df1 = df_data[df_data['Date'].eq(g.transform('min'))]
df2 = df_data[df_data['Date'].eq(g.transform('max'))]

df = (pd.concat([df1.groupby('Name').agg(**{'Min Date': ('Date','first'),
                                          'AvgValue1 (Min)': ('Value1','mean'),
                                          'AvgValue2 (Min)': ('Value2','mean'),
                                          '#Sources (Min)':('Value2','count')}),
                df2.groupby('Name').agg(**{'Max Date': ('Date','first'),
                                          'AvgValue1 (Max)': ('Value1','mean'),
                                          'AvgValue2 (Max)': ('Value2','mean'),
                                          '#Sources (Max)':('Value2','count')})], axis=1)
     .assign(**{'Value1 Change': lambda x: x['AvgValue1 (Max)'].sub(x['AvgValue1 (Min)']),
                'Value2 Change': lambda x: x['AvgValue2 (Max)'].sub(x['AvgValue2 (Min)'])}))

print (df)
          Min Date  AvgValue1 (Min)  AvgValue2 (Min)  #Sources (Min)  \
Name                                                                   
Firm A  2023-05-08       251.666667     -1061.133333               3   
Firm B  2023-05-08        33.300000      -311.500000               3   
Firm C  2023-05-10       167.000000       301.000000               1   
Firm D  2023-05-09       787.850000      1012.500000               2   

          Max Date  AvgValue1 (Max)  AvgValue2 (Max)  #Sources (Max)  \
Name                                                                   
Firm A  2023-05-11       250.000000          -1068.8               1   
Firm B  2023-05-11        33.833333           -309.1               3   
Firm C  2023-05-10       167.000000            301.0               1   
Firm D  2023-05-11       775.850000            975.0               2   

        Value1 Change  Value2 Change  
Name                                  
Firm A      -1.666667      -7.666667  
Firm B       0.533333       2.400000  
Firm C       0.000000       0.000000  
Firm D     -12.000000     -37.500000  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could do this by finding min and max dates for each firm, merging those with the grouped means for those days, and then merging both sets of data together. Then you can compute your value change columns from the merged data:

dates = df_data.groupby('Name', as_index=False)['Date']
g = df_data.groupby(['Name', 'Date'], as_index=False).agg(
    AvgValue1=('Value1', 'mean'),
    AvgValue2=('Value2', 'mean'),
    Sources=('Source',len))

min_data = dates.min().merge(g, on=['Name', 'Date'])
max_data = dates.max().merge(g, on=['Name', 'Date'])

out = min_data.merge(max_data, on=['Name'], suffixes=(' (Min)', ' (Max)'))
out['Value1 Change'] = out['AvgValue1 (Max)'] - out['AvgValue1 (Min)']
out['Value2 Change'] = out['AvgValue2 (Max)'] - out['AvgValue2 (Min)'] 

Output:

     Name  Date (Min)  AvgValue1 (Min)  AvgValue2 (Min)  Sources (Min)
0  Firm A  2023-05-08       251.666667     -1061.133333              3  \
1  Firm B  2023-05-08        33.300000      -311.500000              3
2  Firm C  2023-05-10       167.000000       301.000000              1
3  Firm D  2023-05-09       787.850000      1012.500000              2

   Date (Max)  AvgValue1 (Max)  AvgValue2 (Max)  Sources (Max)  Value1 Change
0  2023-05-11       250.000000          -1068.8              1      -1.666667  \
1  2023-05-11        33.833333           -309.1              3       0.533333
2  2023-05-10       167.000000            301.0              1       0.000000
3  2023-05-11       775.850000            975.0              2     -12.000000

   Value2 Change
0      -7.666667
1       2.400000
2       0.000000
3     -37.500000
Nick
  • 138,499
  • 22
  • 57
  • 95