Let's say I have a table of sales transactions, some of which are individual SKUs and some of which are bundle SKUs.

Date, Product, Qty
1 Jan 2017, A, 10
2 Jan 2017, Bundle X, 5
3 Jan 2017, B, 10
4 Jan 2017, Bundle Y, 5

And a separate table containing the Bundle-component combination:

ParentSKU, ComponentSKU, Quantity
Bundle X, A, 3
Bundle X, B, 5
Bundle X, C, 10
Bundle Y, P, 5
Bundle Y, Q, 7
Bundle Y, R, 12
Bundle Y, S, 3

How do I define a function to apply across the sales transaction table (or use a for-loop) so that rows with Bundle SKUs are broken down into one row per component SKU, with quantities multiplied through? For example, the "2 Jan 2017, Bundle X, 5" row should become three rows: A with Qty 5 × 3 = 15, B with 5 × 5 = 25, and C with 5 × 10 = 50. The end result should look like:

Date, Product, Qty
1 Jan 2017, A, 10
2 Jan 2017, A, 15
2 Jan 2017, B, 25
2 Jan 2017, C, 50
3 Jan 2017, B, 10
4 Jan 2017, P, 25
4 Jan 2017, Q, 35
4 Jan 2017, R, 60
4 Jan 2017, S, 15

Thank you!

johnconnor92

2 Answers

This is one way, using numpy and itertools on top of pandas.

Setup

import pandas as pd, numpy as np
from itertools import chain

# SETUP

df1 = pd.DataFrame({'Date': ['Jan 2017', 'Jan 2017', 'Jan 2017', 'Jan 2017'],
                    'Product': ['A', 'Bundle X', 'B', 'Bundle Y'],
                    'Qty': [10 , 5, 10, 5]})

df2 = pd.DataFrame({'ParentSKU': ['Bundle X', 'Bundle X' ,'Bundle X', 'Bundle Y',
                                  'Bundle Y', 'Bundle Y', 'Bundle Y'],
                    'ComponentSKU': ['A', 'B', 'C', 'P', 'Q', 'R', 'S'],
                    'Quantity': [3, 5, 10, 5, 7, 12, 3]})

Solution

# Perform groupby on bundles
bundles = df2.groupby('ParentSKU')['ComponentSKU'].apply(list)
bundles_q = df2.groupby('ParentSKU')['Quantity'].apply(list)
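# (bundles / bundles_q now map each ParentSKU to lists, e.g.
# bundles['Bundle X'] -> ['A', 'B', 'C'] and bundles_q['Bundle X'] -> [3, 5, 10].)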

# Map bundles to df1
df1['Product_Decomposed'] = df1['Product'].map(bundles).fillna(df1['Product'].apply(list))
df1['Quantity_Decomposed'] = df1.apply(lambda x: [x['Qty']*i for i in bundles_q.get(x['Product'], [1])], axis=1)
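# (Each row now holds lists: the Bundle X row has Product_Decomposed
# ['A', 'B', 'C'] and Quantity_Decomposed [5*3, 5*5, 5*10] = [15, 25, 50],
# while non-bundle rows hold single-element lists such as ['A'] and [10].)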

# Get lengths of each bundle
lens = list(map(len, df1['Product_Decomposed']))

# Create dataframe by repeating and chaining data
res = pd.DataFrame({'Date': np.repeat(df1['Date'], lens),
                    'Product': list(chain.from_iterable(df1['Product_Decomposed'])),
                    'Qty': list(chain.from_iterable(df1['Quantity_Decomposed']))})

Result

print(res)

       Date Product  Qty
0  Jan 2017       A   10
1  Jan 2017       A   15
1  Jan 2017       B   25
1  Jan 2017       C   50
2  Jan 2017       B   10
3  Jan 2017       P   25
3  Jan 2017       Q   35
3  Jan 2017       R   60
3  Jan 2017       S   15
jpp
  • Thank you! But is there a way for me to use only numpy and pandas? – johnconnor92 May 09 '18 at 11:23
  • If you are using `numpy` / `pandas`, you should *absolutely* be open to `itertools`. The first 2 are 3rd party libraries, while `itertools` is part of the standard library (shipped with every installation). Is there a specific issue you have using `itertools`? – jpp May 09 '18 at 11:24
  • Not really, I just haven't learned itertools yet. Thank you! I'll check it out. – johnconnor92 May 09 '18 at 11:26
  • I am getting this error with your calculation of df1['Quantity_decomposed']. "could not broadcast input array from shape (3) into shape (11)". I think I am having the problem mentioned in this page but I'm not sure how to solve it? https://stackoverflow.com/questions/43977463/valueerror-could-not-broadcast-input-array-from-shape-224-224-3-into-shape-2?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – johnconnor92 May 31 '18 at 05:17
  • I think shape (11) refers to the shape of my actual df1 dataframe (it contains 11 columns). but I am not sure where the "shape (3)" is coming from? – johnconnor92 May 31 '18 at 06:13
  • I have attached the files I am working with here: https://www.dropbox.com/s/7z0ahksoargbd1u/Sample%20Variant%20Report.csv?dl=0 and https://www.dropbox.com/s/zisyxjqs63hwmwa/Sample%20Bundle%20Master.csv?dl=0 – johnconnor92 May 31 '18 at 08:34
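Following up on the "only numpy and pandas" question in the comments: on newer pandas versions the list columns built above can be flattened without itertools by using DataFrame.explode. A minimal sketch, assuming pandas >= 1.3 (required to explode two columns at once):

# Sketch: pandas-only flattening of the list columns via explode.
res = (df1[['Date', 'Product_Decomposed', 'Quantity_Decomposed']]
       .explode(['Product_Decomposed', 'Quantity_Decomposed'])
       .rename(columns={'Product_Decomposed': 'Product',
                        'Quantity_Decomposed': 'Qty'}))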

One way would be to use merge (using jpp's handy setup for df1 and df2):

# Split df1 into the ones we need to unbundle
by_bundling = dict(list(df1.groupby(df1.Product.str.startswith("Bundle"))))
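# (groupby here yields (key, sub-DataFrame) pairs, so the dict ends up with
# keys False -> non-bundle rows and True -> bundle rows.)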

# Select the ones we want to unbundle, and make the index a column
unbundled = by_bundling[True].reset_index()
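# (Holding on to the original index is what lets sort_index slot the
# unbundled rows back into their original positions at the end.)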

# Merge this with our second table
unbundled = unbundled.merge(df2, left_on="Product", right_on="ParentSKU")

# Multiply the quantities
unbundled["Qty"] *= unbundled["Quantity"]

# Reduce to the columns of interest and rename
unbundled = unbundled.set_index("index")[["Date", "ComponentSKU", "Qty"]]
unbundled = unbundled.rename(columns={"ComponentSKU": "Product"})

# Recombine and sort
final = pd.concat([by_bundling[False], unbundled]).sort_index()

which gives me

In [57]: final
Out[57]: 
       Date Product  Qty
0  Jan 2017       A   10
1  Jan 2017       A   15
1  Jan 2017       B   25
1  Jan 2017       C   50
2  Jan 2017       B   10
3  Jan 2017       P   25
3  Jan 2017       Q   35
3  Jan 2017       R   60
3  Jan 2017       S   15

Really the only interesting thing here is the merge:

In [59]: unbundled.merge(df2, left_on="Product", right_on="ParentSKU")
Out[59]: 
   index      Date   Product  Qty ComponentSKU ParentSKU  Quantity
0      1  Jan 2017  Bundle X    5            A  Bundle X         3
1      1  Jan 2017  Bundle X    5            B  Bundle X         5
2      1  Jan 2017  Bundle X    5            C  Bundle X        10
3      3  Jan 2017  Bundle Y    5            P  Bundle Y         5
4      3  Jan 2017  Bundle Y    5            Q  Bundle Y         7
5      3  Jan 2017  Bundle Y    5            R  Bundle Y        12
6      3  Jan 2017  Bundle Y    5            S  Bundle Y         3

The rest is just rearrangement and arithmetic.
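One caveat: the str.startswith("Bundle") test leans on the naming convention in the sample data. If your real bundle SKUs aren't named that way, a membership test against the bundle table is safer. A minimal sketch of that variation:

# Sketch: flag bundle rows by membership in df2's ParentSKU column rather
# than by the "Bundle ..." naming convention.
is_bundle = df1['Product'].isin(df2['ParentSKU'])
by_bundling = dict(list(df1.groupby(is_bundle)))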

Don't look down on manual ways of doing things, though -- they're sometimes the simplest, and clean code you can follow is much better than "clever" code you can't.

DSM