0

I have a survey dataset, a part of which I need to expand as columns having textual responses to rating questions asked. The dataset is large, what is the best way of doing this?

import pandas as pd
  pd.DataFrame({'S.No': {0: 63.0,
  1: nan,
  2: nan,
  3: nan,
  4: 204.0,
  5: nan,
  6: nan,
  7: nan,
  8: 238.0,
  9: nan,
  10: nan,
  11: nan,
  12: 292.0,
  13: nan,
  14: nan,
  15: nan,
  16: 332.0,
  17: nan,
  18: nan,
  19: nan},
 'ID': {0: 251,
  1: 251,
  2: 251,
  3: 251,
  4: 252,
  5: 252,
  6: 252,
  7: 252,
  8: 253,
  9: 253,
  10: 253,
  11: 253,
  12: 254,
  13: 254,
  14: 254,
  15: 254,
  16: 255,
  17: 255,
  18: 255,
  19: 255},
 'Name': {0: 'Bob',
  1: 'Bob',
  2: 'Bob',
  3: 'Bob',
  4: 'Foo',
  5: 'Foo',
  6: 'Foo',
  7: 'Foo',
  8: 'Mike',
  9: 'Mike',
  10: 'Mike',
  11: 'Mike',
  12: 'Mary',
  13: 'Mary',
  14: 'Mary',
  15: 'Mary',
  16: 'Bar',
  17: 'Bar',
  18: 'Bar',
  19: 'Bar'},
 'User Function': {0: 'Sales',
  1: 'Sales',
  2: 'Sales',
  3: 'Sales',
  4: 'Mktg',
  5: 'Mktg',
  6: 'Mktg',
  7: 'Mktg',
  8: 'Finance',
  9: 'Finance',
  10: 'Finance',
  11: 'Finance',
  12: 'Sales',
  13: 'Sales',
  14: 'Sales',
  15: 'Sales',
  16: 'Mktg',
  17: 'Mktg',
  18: 'Mktg',
  19: 'Mktg'},
 'Business Unit': {0: 'BU1',
  1: 'BU1',
  2: 'BU1',
  3: 'BU1',
  4: 'BU2',
  5: 'BU2',
  6: 'BU2',
  7: 'BU2',
  8: 'BU3',
  9: 'BU3',
  10: 'BU3',
  11: 'BU3',
  12: 'BU1',
  13: 'BU1',
  14: 'BU1',
  15: 'BU1',
  16: 'BU2',
  17: 'BU2',
  18: 'BU2',
  19: 'BU2'},
 'Gender': {0: 'Male',
  1: 'Male',
  2: 'Male',
  3: 'Male',
  4: 'Male',
  5: 'Male',
  6: 'Male',
  7: 'Male',
  8: 'Male',
  9: 'Male',
  10: 'Male',
  11: 'Male',
  12: 'Female',
  13: 'Female',
  14: 'Female',
  15: 'Female',
  16: 'Male',
  17: 'Male',
  18: 'Male',
  19: 'Male'},
 'Primary Exit Reason': {0: 'Policy',
  1: 'Policy',
  2: 'Policy',
  3: 'Policy',
  4: 'Team',
  5: 'Team',
  6: 'Team',
  7: 'Team',
  8: 'Navigation',
  9: 'Navigation',
  10: 'Navigation',
  11: 'Navigation',
  12: 'Others',
  13: 'Others',
  14: 'Others',
  15: 'Others',
  16: 'Policy',
  17: 'Policy',
  18: 'Policy',
  19: 'Policy'},
 'Primary Question': {0: 'A',
  1: 'B',
  2: 'C',
  3: 'D',
  4: 'E',
  5: 'F',
  6: 'G',
  7: 'H',
  8: 'I',
  9: 'J',
  10: 'K',
  11: 'L',
  12: 'M',
  13: 'N',
  14: 'O',
  15: 'P',
  16: 'A',
  17: 'B',
  18: 'C',
  19: 'D'},
 'Primary Response': {0: nan,
  1: nan,
  2: nan,
  3: nan,
  4: 'Agree',
  5: 'Agree',
  6: 'No',
  7: nan,
  8: 'Agree',
  9: 'Agree',
  10: 'No',
  11: nan,
  12: nan,
  13: nan,
  14: nan,
  15: nan,
  16: nan,
  17: nan,
  18: nan,
  19: nan},
 'Secondary Exit Reason': {0: 'Policy',
  1: 'Policy',
  2: 'Policy',
  3: 'Policy',
  4: 'Others',
  5: 'Others',
  6: 'Others',
  7: 'Others',
  8: 'Transport',
  9: 'Transport',
  10: 'Transport',
  11: 'Transport',
  12: 'Policy',
  13: 'Policy',
  14: 'Policy',
  15: 'Policy',
  16: 'Policy',
  17: 'Policy',
  18: 'Policy',
  19: 'Policy'},
 'Secondary Question': {0: 'A',
  1: 'B',
  2: 'C',
  3: 'D',
  4: 'M',
  5: 'N',
  6: 'O',
  7: 'P',
  8: 'Q',
  9: 'R',
  10: 'S',
  11: 'T',
  12: 'A',
  13: 'B',
  14: 'C',
  15: 'D',
  16: 'A',
  17: 'B',
  18: 'C',
  19: 'D'},
 'Secondary Response': {0: 'Agree',
  1: 'Agree',
  2: 'Yes',
  3: nan,
  4: nan,
  5: nan,
  6: nan,
  7: nan,
  8: nan,
  9: nan,
  10: nan,
  11: nan,
  12: 'Agree',
  13: 'Agree',
  14: 'No',
  15: nan,
  16: 'Highly Agree',
  17: 'Agree',
  18: "I'm unaware",
  19: nan},
 'Feedback Question': {0: 'A',
  1: 'B',
  2: 'C',
  3: 'D',
  4: 'A',
  5: 'B',
  6: 'C',
  7: 'D',
  8: 'A',
  9: 'B',
  10: 'C',
  11: 'D',
  12: 'A',
  13: 'B',
  14: 'C',
  15: 'D',
  16: 'A',
  17: 'B',
  18: 'C',
  19: 'D'},
 'Feedback Reason': {0: '4',
  1: '8',
  2: nan,
  3: nan,
  4: '4',
  5: '7',
  6: 'NO ALL GOOD',
  7: 'NO\n',
  8: '4',
  9: '6',
  10: 'No',
  11: 'No',
  12: '5',
  13: '6',
  14: 'No',
  15: 'No',
  16: '5',
  17: '10',
  18: 'YES GOOD XP',
  19: 'Yes'}})

Above is a reproducible dataset. It is required to expand Primary Question, Secondary Question and Feedback Question into columns, having Primary Response, Secondary Response and Feedback Response as the values to those columns. Rest all variables don't change so remain in a single row such as 'S.No', 'ID', 'Name', 'User Function', 'Business Unit', 'Gender', 'Issues' . Each ID consists of 4 rows

I tried using groupby() ID and then unstack() -ing the required columns but that did not work. Also tried melt and pivot.

df.melt(id_vars=['S.No','ID','Name','User Function','Business Unit','Gender','Primary Exit Reason'], 
              var_name='var')

I'm not sure how to break the columns each into separate columns

Output expected: enter image description here

Ayan
  • 145
  • 2
  • 16
  • I was talking a look at this, however, could I request you show an example of what you want the final output to look like? – dimButTries May 27 '23 at 20:38
  • @dimButTries have added to the question – Ayan May 27 '23 at 21:05
  • One way of doing this could be: 1. drop S.No 2. create a subset of the dataset basis the columns which dont change and those which do. 3. Using pivot_table on 3 different data sets for Primary Question, Secondary Question, Feedback Question and then 4. Merge all together – Ayan May 28 '23 at 07:04

1 Answers1

0

Here is the output as your expected, please note I only transformed from wide to long Feedback Reasons, Secondary Responses and Primary Response, as they contained information I could easily identify.

The steps are reproducible for you to add the additional columns.

df = pd.DataFrame(data)

# Step 2: Group by Name
grouped_df = df.groupby('Name')

# Step 3: Transform Feedback Reason from long to wide format
feedback_reason_df = grouped_df['Feedback Reason'].apply(lambda x: pd.Series(x.values)).unstack().add_prefix('Feedback Reason ')

# Combine the original grouped DataFrame with the feedback_reason_df
result_df = grouped_df.first().join(feedback_reason_df)

# Remove the duplicate column
result_df = result_df.drop(columns='Feedback Reason')

# Step 4: Transform Secondary Response from long to wide format
secondary_response_df = grouped_df['Secondary Response'].apply(lambda x: pd.Series(x.values)).unstack().add_prefix('Secondary Response ')

# Add the secondary_response_df to the result_df
result_df = result_df.join(secondary_response_df)

# Step 5: Transform Primary Response from long to wide format
primary_response_df = grouped_df['Primary Response'].apply(lambda x: pd.Series(x.values)).unstack().add_prefix('Primary Response ')

# Add the primary_response_df to the result_df
result_df = result_df.join(primary_response_df)

result_df = result_df.fillna('')

result_df

Additional Here is a function to make the code more readable and reproducible.

# Function to transform a column from long to wide format and merge with result_df
def transform_column(column_name: str) -> pd.DataFrame:
    """
    Transform the specified column from long to wide format and merge with result_df.
    
    Args:
        column_name: The name of the column to transform.
    
    Returns:
        The result_df DataFrame with the specified column transformed and merged.
    """
    column_df = grouped_df[column_name].apply(lambda x: pd.Series(x.values)).unstack().add_prefix(column_name + ' ')
    return result_df.join(column_df)

You can use the function by result_df = transform_column('Feedback Reason')

dimButTries
  • 661
  • 7
  • 15