Pandas: Wide to first, second, third, identified categories

Question

I am wondering if anyone knows a quick way in pandas to pivot a dataframe to achieve the desired transformation below. It is sort of a wide-to-long pivot, but not quite.

Input Dataframe structure (needs to be able to support N number of categories, not just 3 as case below)

+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| id   | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0001 | 1            | 1        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0002 | 0            | 0        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0003 | 1            | 0        | 0        | 1         | 1            | 0        | 0        | 1         | 0            | 0        | 0        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0004 | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 0        | 0        | 1         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0005 | 0            | 0        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0         |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+

Output Transformed Dataframe structure: (needs to support N number of categories, not just 3 as example shows)

+------+------+-------+------+-------+------+-------+
| id   | cat1 | sent1 | cat2 | sent2 | cat3 | sent3 |
+------+------+-------+------+-------+------+-------+
| 0001 | catA | pos   | catC | neg   | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0002 | catB | pos   | catC | pos   | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0003 | catA | ntrl  | catB | ntrl  | NULL | NULL  |
+------+------+-------+------+-------+------+-------+
| 0004 | catA | pos   | catB | pos   | catC | ntrl  |
+------+------+-------+------+-------+------+-------+
| 0005 | catC | neg   | NULL | NULL  | NULL | NULL  |
+------+------+-------+------+-------+------+-------+

score 1 · Answer 1 · answered Dec 01 '20 at 15:13

I don't think it's a pivot at all.. However, anything is possible so here we go:

import io
import itertools
import pandas

# Your data
data = io.StringIO(
"""
id   | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl
0001 | 1            | 1        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0
0002 | 0            | 0        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0
0003 | 1            | 0        | 0        | 1         | 1            | 0        | 0        | 1         | 0            | 0        | 0        | 0
0004 | 1            | 1        | 0        | 0         | 1            | 1        | 0        | 0         | 1            | 0        | 0        | 1
0005 | 0            | 0        | 0        | 0         | 0            | 0        | 0        | 0         | 1            | 0        | 1        | 0
"""
)
df = pandas.read_table(data, sep="\s*\|\s*")


def get_sentiment(row: pandas.Series) -> pandas.Series:
    if row["cat_pos"] == 1:
        return "pos"
    elif row["cat_neg"] == 1:
        return "neg"
    elif row["cat_ntrl"] == 1:
        return "ntrl"
    else:
        return None


# Initialize a dict that will hold an entry for every index in the dataframe, with a list of categories and sentiments
categories_per_index = {index: [] for index in df.index}

# Extract a list of unique names of all possible categories
categories = set([column[3] for column in df.columns if column.startswith("cat")])

# Loop over the unique categories
for key in categories:

    # Select only the columns for a particular category, and where that category is present
    group = df.loc[df[f"cat{key}_present"] == 1, [f"cat{key}_present", f"cat{key}_pos", f"cat{key}_neg", f"cat{key}_ntrl"]]

    # Change the column names for generic processing
    group.columns = ["cat_present", "cat_pos", "cat_neg", "cat_ntrl"]

    # Figure out the sentiment for every line
    group["sentiment"] = group.apply(get_sentiment, axis=1)

    # Loop the rows in the group and add the sentiment for this category to the indices
    for index, row in group.iterrows():

        # Add the name of the category and the sentiment to the index
        categories_per_index[index].append(f"cat{key}")
        categories_per_index[index].append(row["sentiment"])


# Reconstruct the dataframe from the dictionary
df = pandas.DataFrame.from_dict(categories_per_index, orient="index", columns=list(itertools.chain.from_iterable([ [f"cat{i}", f"sent{i}"] for i in range(len(categories)) ])))

Output:

print(df)
   cat0 sent0  cat1 sent1  cat2 sent2
0  catA   pos  catC   neg  None  None
1  catB   pos  catC   pos  None  None
2  catB  ntrl  catA  ntrl  None  None
3  catB   pos  catA   pos  catC  ntrl
4  catC   neg  None  None  None  None

Pandas: Wide to first, second, third, identified categories

1 Answers1