Column Pair-wise aggregation and reorganization in Pandas

Question

I am importing a csv file into a pandas dataframe such as:

df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )

0	1	2	3	4	5	6	7	8	9
ID	Net Cost	Charge Description	Charge Amount	Charge Description	Charge Amount	Charge Description	Charge Amount	Charge Description	Charge Amount
1	30	Surcharge A	9.5	Discount X	-11.5	Discount Y	-3.25	Surcharge B	2.5
2	40	Discount X	-12.5
3	50	Discount X	-11.5
4	35	Discount X	-5.5	Surcharge B	3.5
5	45	Surcharge A	9.5	Discount X	-10.5	Surcharge B	4.5

The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.

Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:

ID	Net Cost	Surcharge A	Surcharge B	Discount X	Discount Y
1	30	9.5	2.5	-11.5	-3.25
2	40	0	0	-12.5	0
3	50	0	0	-11.5	0
4	35	0	3.5	-5.5	0
5	45	9.5	4.5	-10.5	0

This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.

Are the column headers REALLY supposed to be the first row of your dataframe? Or is that a "bug" in your code initialising the example dataframe? (If you read it from a csv file you should tell pandas to use the first row as headers...) — MatBailie, Feb 15 '23 at 19:36
@MatBailie I initially did that but then python will change the column names to Charge Description.1, Charge Description.2, Charge Description.3 etc. Just wanted to show that they're actually all named the same. — sebwallat, Feb 15 '23 at 19:44
Care to share the source file? Might be easier to reshape if read in as a Multiindex — sammywemmy, Feb 15 '23 at 19:53
https://drive.google.com/file/d/1gFw1pD9ivlvyovlg5wY2ZIn_Rvk2f_sc/view?usp=sharing @sammywemmy — sebwallat, Feb 15 '23 at 20:01
You state you want the columns sorted alphabetically, yet you have columns starting with `S` coming before columns starting with `D`? — MatBailie, Feb 15 '23 at 21:08

sammywemmy · Answer 1 · 2023-02-15T22:04:05.193

I used the file you shared, and edited the columns with the initial dataframe df shared (Pandas automatically adds suffixes to columns to make them unique) to keep the non uniqueness:

invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount', 
                   'Charge Description', 'Charge Amount']
print(invoice)
   ID  Net Cost Charge Description  Charge Amount  ... Charge Description  Charge Amount Charge Description  Charge Amount
0   1        30        Surcharge A            9.5  ...         Discount Y          -3.25        Surcharge B            2.5
1   2        40         Discount X          -12.5  ...                NaN            NaN                NaN            NaN
2   3        50         Discount X          -11.5  ...                NaN            NaN                NaN            NaN
3   4        35         Discount X           -5.5  ...                NaN            NaN                NaN            NaN
4   5        45        Surcharge A            9.5  ...        Surcharge B           4.50                NaN            NaN

First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.

# pip install pyjanitor
import pandas as pd
import janitor

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
.pivot_longer(
    index = index, 
    names_to = arr, 
    names_pattern = arr, 
    dropna=True)
.pivot_wider(
    index=index,
    names_from='Charge Description', 
    values_from='Charge Amount')
.fillna(0)
)

   ID  Net Cost  Discount X  Discount Y  Surcharge A  Surcharge B
0   1        30       -11.5       -3.25          9.5          2.5
1   2        40       -12.5        0.00          0.0          0.0
2   3        50       -11.5        0.00          0.0          0.0
3   4        35        -5.5        0.00          0.0          3.5
4   5        45       -10.5        0.00          9.5          4.5

Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get -> 
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)

   ID  Net Cost Discount X Discount Y Surcharge A Surcharge B
0   1        30      -11.5      -3.25         9.5         2.5
1   2        40      -12.5          0           0           0
2   3        50      -11.5          0           0           0
3   4        35       -5.5          0           0         3.5
4   5        45      -10.5          0         9.5         4.5

You can convert the data dtypes for Discount to numeric

score 0 · Answer 2 · answered Feb 15 '23 at 20:02

You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:

# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))

m = out['variable'] == 'Charge Description'

# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))

Output:

>>> out
  ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0  1       30      -11.5      -3.25         9.5         2.5
1  2       40      -12.5          0           0           0
2  3       50      -11.5          0           0           0
3  4       35       -5.5          0           0         3.5
4  5       45      -10.5          0         9.5         4.5

score 0 · Answer 3 · answered Feb 15 '23 at 20:03

You can use pivot_table after concatenating pair-wise:

import pandas as pd

df = pd.DataFrame.from_dict(
    {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})

# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)

df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)

print(df[df['Charge Description']!=0]
    .pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
    .fillna(0))

Output:

Charge Description  Discount X  Discount Y  Surcharge A  Surcharge B
ID Net Cost                                                         
1  30                    -11.5       -3.25          9.5          2.5
2  40                    -12.5        0.00          0.0          0.0
3  50                    -11.5        0.00          0.0          0.0
4  35                     -5.5        0.00          0.0          3.5
5  45                    -10.5        0.00          9.5          4.5

score 0 · Answer 4 · answered Feb 15 '23 at 20:32

I would use melt to stack the identically named columns, then pivot to create the outcome you want.

# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]

# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']

# Clean up a little, rename the added 'value' column from df1. 
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')

Result of df1:

Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost                                                     
1  30                   -11.5      -3.25         9.5         2.5
2  40                   -12.5        NaN         NaN         NaN
3  50                   -11.5        NaN         NaN         NaN
4  35                    -5.5        NaN         NaN         3.5
5  45                   -10.5        NaN         9.5         4.5`

I believe that `dropna()` won't work here as the values are `''` rather than `NaN`. — MatBailie, Feb 15 '23 at 20:56

MatBailie · Answer 5 · 2023-02-15T21:20:54.750

My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.

For your example, that would make...

[
  {
    'ID': '1',
    'Net Cost': '30',
    'Discount X': '-11.5',
    'Discount Y': '-3.25',
    'Surcharge A': '9.5',
    'Surcharge B': '2.5',
  },
  {
    'ID': '2',
    'Net Cost': '40',
    'Discount X': '-12.5',
  },
  {
    'ID': '3',
    'Net Cost': '50',
    'Discount X': '-11.5',
  },
  {
    'ID': '4',
    'Net Cost': '35',
    'Discount X': '-5.5',
    'Surcharge B': '3.5',
  },
  {
    'ID': '5',
    'Net Cost': '45',
    'Discount X': '-10.5',
    'Surcharge A': '9.5',
    'Surcharge B': '4.5',
  },
]

For the SMALL sample dataset, using comprehensions appears to be quite quick for that...

import pandas as pd
from itertools import chain

rows = [
  {
    name: value
      for name, value in chain(
        [
          ("ID", row[0]),
          ("Net Cost", row[1]),
        ],
        zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
      )
      if name
  }
  for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)

Demo (including timings of this and three other answers):

https://trinket.io/python3/555f860855

EDIT:

To sort the column names, add the following...

df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]

Column Pair-wise aggregation and reorganization in Pandas

5 Answers5