Data Preparation Advanced Pandas Melt Multiple Column Sets. Merge some Columns to make them one

Question

I have a dataset like:

Mileage	Operator 1	16 Seater	29-seater	Operator 2	16-seater.1	29-seater.1
80	BusRus	£199	£299	Wheels	£190	£290

I want it in the following format to build a price prediction model with Operator and Seats as Features

Mileage	Operator	Seats	Price
80	BusRus	16 Seater	£199
80	BusRus	29 Seater	£299
80	Wheels	16 Seater	£190
80	Wheels	29 Seater	£290

Removed columns for simplicity

Data Set Has Nulls as sometimes you only get quotes for a particular seater vehicle or less Operators

I've tried the following code aswell:

import pandas as pd

# create the original dataframe
df = pd.DataFrame({
    'Mileage': [80],
    'Operator 1': ['BusRus'],
    '16-Seater': ['£199'],
    '29-Seater': ['£299'],
    'Operator 2': ['Wheels'],
    '16-Seater.1': ['£190'],
    '29-Seater.1': ['£290']
})

# melt the dataframe from wide to long format
df_melt = pd.melt(df, id_vars=['Mileage', 'Operator 1', 'Operator 2'],
                  var_name='Seats', value_name='Price')

# split the 'Seats' column into two columns - one for the number of seats and one for the type of vehicle
df_melt[['NumSeats', 'VehicleType']] = df_melt['Seats'].str.split('-', expand=True)

# determine the correct operator for each row based on which 'Seats' column the price belongs to
df_melt['Operator'] = df_melt.apply(lambda x: x['Operator 1'] if '1' in x['Seats'] else x['Operator 2'], axis=1)

# drop the original 'Seats' column and rename the remaining columns
df_melt = df_melt.drop('Seats', axis=1).rename(columns={'Operator 1': 'Dummy1', 'Operator 2': 'Dummy2'})

# drop the 'Dummy1' and 'Dummy2' columns and reorder the columns
df_melt = df_melt.drop(['Dummy1', 'Dummy2'], axis=1)[['Mileage', 'Operator', 'NumSeats', 'Price']]

# output the final dataframe
print(df_melt)

This gets me close to what I need but the Operators are incorrect

index	Mileage	Operator	NumSeats	Price
0	80	BusRus	16	£199
1	80	Wheels	29	£299
2	80	BusRus	16	£190
3	80	BusRus	29	£290

Arronjeet Kooner · Answer 1 · 2023-04-11T16:37:21.163

    import pandas as pd

# create the original dataframe
df = pd.DataFrame({
    'Mileage': [80],
    'Operator 1': ['BusRus'],
    '16.1-Seater': ['£199'],
    '29.1-Seater': ['£299'],
    'Operator 2': ['Wheels'],
    '16.2-Seater.1': ['£190'],
    '29.2-Seater.1': ['£290'],
    'Operator 3': ['Taxi'],
    '16.3-Seater.2': ['£200'],
    '29.3-Seater.2': ['£300']
})

# melt the dataframe from wide to long format
df_melt = pd.melt(df, id_vars=['Mileage', 'Operator 1', 'Operator 2', 'Operator 3'],
                  var_name='Seats', value_name='Price')

# split the 'Seats' column into two columns - one for the number of seats and one for the type of vehicle
df_melt[['NumSeats', 'VehicleType']] = df_melt['Seats'].str.split('-', expand=True)

# create a dictionary to map seat numbers to operator names
operator_map = {
    '16.1': 'Operator 1',
    '29.1': 'Operator 1',
    '16.2': 'Operator 2',
    '29.2': 'Operator 2',
    '16.3': 'Operator 3',
    '29.3': 'Operator 3'
}

# look up the correct operator based on the seat number and add it as a column
df_melt['Operator'] = df_melt.apply(lambda x: x[operator_map[x['NumSeats']]], axis=1)

# drop the original 'Seats' column and rename the remaining columns
df_melt = df_melt.drop('Seats', axis=1).rename(columns={'Operator 1': 'Dummy1', 'Operator 2': 'Dummy2', 'Operator 3': 'Dummy3'})

# drop the 'Dummy1', 'Dummy2', and 'Dummy3' columns and reorder the columns
df_melt = df_melt.drop(['Dummy1', 'Dummy2', 'Dummy3'], axis=1)[['Mileage', 'Operator', 'NumSeats', 'Price']]

# output the final dataframe
print(df_melt)

Not Ideal But I've found a work around by changing the Value of Seats for Each Operator. Seats.Operator_ID.

E.g.

# create a dictionary to map seat numbers to operator names
operator_map = {
    '16.1': 'Operator 1',
    '29.1': 'Operator 1',
    '16.2': 'Operator 2',
    '29.2': 'Operator 2',
    '16.3': 'Operator 3',
    '29.3': 'Operator 3'
}

Not The best Solution but will do. Hopefully it can help people in a similar problem. Although I'd be interested in better solutions.

I can strip the Decimals of the seats after

score 1 · Accepted Answer · answered Apr 12 '23 at 03:07

One option is to reorder the columns before flipping with a combination of stack and melt:

temp = df.set_index('Mileage')
regex = r"(?P<num1>\d+)\.(?P<num2>\d)(?P<text>-Seater).*"
repl = lambda g: f"{g['num2']}_{g['num1']}{g['text']}"
temp.columns = temp.columns.str.replace(regex, repl, regex=True)
temp.columns = ["_".join(ent.split()[::-1]) 
               if ent.startswith('Operator') 
               else ent 
               for ent in temp]
temp.columns = temp.columns.str.split("_", expand=True)
(temp
.stack(level=0)
.melt(id_vars='Operator', value_name = 'Price', var_name = 'Seats')
)

  Operator      Seats Price
0   BusRus  16-Seater  £199
1   Wheels  16-Seater  £190
2     Taxi  16-Seater  £200
3   BusRus  29-Seater  £299
4   Wheels  29-Seater  £290
5     Taxi  29-Seater  £300

Great Solution How would one add the Filler Columns e.g. Mileage and any other Columns such as Fuel and Hours. As there will be multiple rows in the data set — Arronjeet Kooner, Apr 12 '23 at 10:27
Expand your initial question with the filler columns. Alternatively you can create a new question — sammywemmy, Apr 12 '23 at 11:03

Data Preparation Advanced Pandas Melt Multiple Column Sets. Merge some Columns to make them one

2 Answers2