
For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.

[image: the long-format input dataframe]

I would like to generate a pivoted dataframe as follows: [image: the desired wide-format dummy-coded dataframe]

To get the desired result shown above, I have tried various ways of combining the groupby/pivot_table functions with a call to pandas' get_dummies() function. Example: data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()

However, none of my attempts have worked thus far.

Can somebody please help me generate the desired result?

veg2020

2 Answers


One way of doing this is to get the cross tabulation and then force all counts greater than 0 to become 1, while keeping all 0s as they are:

TL;DR

out = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0)
    .astype(int)
    .reset_index()
)

Breaking it all down:

Create Data

import numpy as np
import pandas as pd

# No seed is set, so the randomly chosen items will differ from run to run
df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})

print(df)
   group1  group2 item
0       a       1   cd
1       a       2   ef
2       a       3   gh
3       a       1   ef
4       b       2   zx
5       b       3   ab
6       b       1   ab
7       b       2   gh
8       c       3   gh
9       c       1   cd
10      c       2   ef
11      c       3   gh

Cross Tabulation

This returns a frequency table indicating how often each of the categories are observed together:

crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])

print(crosstab)
item           ab  cd  ef  gh  zx
group1 group2
a      1        0   1   1   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0
b      1        1   0   0   0   0
       2        0   0   0   1   1
       3        1   0   0   0   0
c      1        0   1   0   0   0
       2        0   0   1   0   0
       3        0   0   0   2   0

Coerce Counts to Dummy Codes

Since we want dummy codes rather than counts of how often the categories co-occur, we can use a quick trick: force every value greater than 0 to become 1 with gt(0) followed by astype(int):

dummies = crosstab.gt(0).astype(int)

print(dummies)

item           ab  cd  ef  gh  zx
group1 group2
a      1        0   1   1   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0
b      1        1   0   0   0   0
       2        0   0   0   1   1
       3        1   0   0   0   0
c      1        0   1   0   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0
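Putting the two steps together: a minimal runnable sketch on a small fixed dataset (deterministic, unlike the random example above), showing a raw count of 2 collapsing to a dummy code of 1:

```python
import pandas as pd

# Small fixed dataset; ("a", 1, "cd") appears twice on purpose
df = pd.DataFrame({
    "group1": ["a", "a", "a", "b"],
    "group2": [1, 1, 2, 1],
    "item":   ["cd", "cd", "ef", "ab"],
})

# Frequency table of how often each (group1, group2) pair bought each item
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])

# Coerce counts to 1/0 dummy codes
dummies = crosstab.gt(0).astype(int)

print(crosstab.loc[("a", 1), "cd"])  # 2 -- raw co-occurrence count
print(dummies.loc[("a", 1), "cd"])   # 1 -- dummy coded
```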
Cameron Riddell
  • Although this works for the example here, for a very large dataset it seems to result in fewer cust_ids than would be expected from the long-format dataframe. Any idea how to prevent this? I get the same issue with the other alternatives listed below as well. – veg2020 Jan 24 '22 at 21:36

You can do this in one line via any of several tricks.

1) Unique index count in conjunction with casting as bool

This works in any case, even when you have no columns other than the index, columns, and values. It counts the unique indices in each index-column intersection and returns 1 if the count is greater than 0, else 0.

df.reset_index().pivot_table(index=['cust_id','purchase_qtr'], 
                             columns='purchase_item',
                             values='index',
                             aggfunc='nunique', fill_value=0)\
                .astype(bool).astype(int)

2) Checking whether another column is not null

If you have other columns besides the index, columns, and values AND want to use one of them, like purchase_date in your case, this is more intuitive because you can "read" it: per customer, per quarter, check whether the item's purchase date is not null, and parse the result as an integer.

df.pivot_table(index=['cust_id','purchase_qtr'],
               columns='purchase_item',values='purchase_date',
               aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
  .astype(int)

3) Taking the len of the elements in each index-column intersection

This takes the len of the elements falling in each index-column intersection and returns 1 if it is greater than 0, else 0. Same intuitive approach:

df.pivot_table(index=['cust_id','purchase_qtr'], 
               columns='purchase_item',
               values='purchase_date',
               aggfunc=len, fill_value=0)\
  .astype(bool).astype(int)

All three return the desired dataframe:

[image: the desired wide-format dummy-coded dataframe]
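A runnable sketch with hypothetical sample data (column names taken from the question; the values are made up) confirming that all three variants produce the same frame:

```python
import pandas as pd

# Hypothetical long-format purchase data (column names from the question)
df = pd.DataFrame({
    "cust_id": [1, 1, 2],
    "purchase_qtr": ["Q1", "Q1", "Q2"],
    "purchase_item": ["apples", "pears", "apples"],
    "purchase_date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-04-01"]),
})

# 1) unique-index count, cast to bool then int
v1 = df.reset_index().pivot_table(index=["cust_id", "purchase_qtr"],
                                  columns="purchase_item", values="index",
                                  aggfunc="nunique", fill_value=0)\
       .astype(bool).astype(int)

# 2) check that purchase_date is not null
v2 = df.pivot_table(index=["cust_id", "purchase_qtr"],
                    columns="purchase_item", values="purchase_date",
                    aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
       .astype(int)

# 3) len of the elements in each intersection
v3 = df.pivot_table(index=["cust_id", "purchase_qtr"],
                    columns="purchase_item", values="purchase_date",
                    aggfunc=len, fill_value=0)\
       .astype(bool).astype(int)

print((v1.values == v2.values).all() and (v2.values == v3.values).all())  # True
```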

Note that you should only use crosstab when you don't already have a dataframe, since crosstab calls pivot_table internally.
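To illustrate that relationship, a sketch on made-up data showing that crosstab matches an equivalent pivot_table call:

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "item": ["x", "y", "x"]})

# crosstab from the raw Series vs. an equivalent pivot_table on the frame
ct = pd.crosstab(df["g"], df["item"])
pt = df.pivot_table(index="g", columns="item", aggfunc="size", fill_value=0)

print((ct.values == pt.values).all())  # True
```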

Hamza
  • Although this works for the example here, for a very large dataset it seems to result in fewer cust_ids than would be expected from the long-format dataframe. Any idea how to prevent this? I get the same issue with the other alternatives listed above as well. – veg2020 Jan 24 '22 at 21:36