2

I want to transform the structure of a pandas dataframe by unpivoting the column name into rows if the row has an "x" value.

Here is a small extraction of the excel file. There is a combination of categorical variables/dummy variables such as Line of business and four dummy variables indicating which data categories are used in a given process - such as Customer.

| Process name | Line of business | Customer | Potential customer | Employee | Vendor |
|--------------|------------------|----------|--------------------|----------|--------|
| Ad campaign  | Marketing        | x        | x                  |          | x      |
| Payroll      | HR               |          |                    | x        | x      |
df = pd.DataFrame({'Process name': {0: 'Ad campaign', 1: 'Payroll'},
                   'Line of business': {0: "Marketing", 1: "HR"},
                   'Customer': {0: "x", 1: ""},
                   'Potential customer': {0: "x", 1: ""},
                   'Employee': {0: "", 1: "x"},
                   'Vendor': {0: "x", 1: "x"}})

The desired output would look like this. This is only a small extraction of the data, so I would like a solution that does not need to reference the column names, and rather by indexing. I also need to do the unpivot operation on a few other groups of columns.

| Process name | Line of business | Data category      |
|--------------|------------------|--------------------|
| Ad campaign  | Marketing        | Customer           |
| Ad campaign  | Marketing        | Potential customer |
| Ad campaign  | Marketing        | Vendor             |
| Payroll      | HR               | Employee           |
| Payroll      | HR               | Vendor             |

I have tried using pd.melt and pd.pivot, but I cant seem to figure it out.

Solution: Thanks for pointing me in the right direction @jezrael

Here is how I implemented it:

##pivot selected columns
df = df.melt(id_vars=['Process name', 'Line of business'],
       value_vars=('Customer', 'Potential customer', 'Employee', 'Vendor'),
       var_name='Data category')
##remove rows that does not contain x
df = df[df['value'].str.contains("x")]
##sort alphabetically
df = df.sort_values("Process name")
torkestativ
  • 352
  • 2
  • 15

0 Answers0