1

I have a dataset like:

Mileage Operator 1 16 Seater 29-seater Operator 2 16-seater.1 29-seater.1
80 BusRus £199 £299 Wheels £190 £290

I want it in the following format to build a price prediction model with Operator and Seats as Features

Mileage Operator Seats Price
80 BusRus 16 Seater £199
80 BusRus 29 Seater £299
80 Wheels 16 Seater £190
80 Wheels 29 Seater £290

Removed columns for simplicity

Data Set Has Nulls as sometimes you only get quotes for a particular seater vehicle or less Operators

I've tried the following code aswell:

import pandas as pd

# create the original dataframe
df = pd.DataFrame({
    'Mileage': [80],
    'Operator 1': ['BusRus'],
    '16-Seater': ['£199'],
    '29-Seater': ['£299'],
    'Operator 2': ['Wheels'],
    '16-Seater.1': ['£190'],
    '29-Seater.1': ['£290']
})

# melt the dataframe from wide to long format
df_melt = pd.melt(df, id_vars=['Mileage', 'Operator 1', 'Operator 2'],
                  var_name='Seats', value_name='Price')

# split the 'Seats' column into two columns - one for the number of seats and one for the type of vehicle
df_melt[['NumSeats', 'VehicleType']] = df_melt['Seats'].str.split('-', expand=True)

# determine the correct operator for each row based on which 'Seats' column the price belongs to
df_melt['Operator'] = df_melt.apply(lambda x: x['Operator 1'] if '1' in x['Seats'] else x['Operator 2'], axis=1)

# drop the original 'Seats' column and rename the remaining columns
df_melt = df_melt.drop('Seats', axis=1).rename(columns={'Operator 1': 'Dummy1', 'Operator 2': 'Dummy2'})

# drop the 'Dummy1' and 'Dummy2' columns and reorder the columns
df_melt = df_melt.drop(['Dummy1', 'Dummy2'], axis=1)[['Mileage', 'Operator', 'NumSeats', 'Price']]

# output the final dataframe
print(df_melt)

This gets me close to what I need but the Operators are incorrect

index Mileage Operator NumSeats Price
0 80 BusRus 16 £199
1 80 Wheels 29 £299
2 80 BusRus 16 £190
3 80 BusRus 29 £290
desertnaut
  • 57,590
  • 26
  • 140
  • 166

2 Answers2

1
    import pandas as pd

# create the original dataframe
df = pd.DataFrame({
    'Mileage': [80],
    'Operator 1': ['BusRus'],
    '16.1-Seater': ['£199'],
    '29.1-Seater': ['£299'],
    'Operator 2': ['Wheels'],
    '16.2-Seater.1': ['£190'],
    '29.2-Seater.1': ['£290'],
    'Operator 3': ['Taxi'],
    '16.3-Seater.2': ['£200'],
    '29.3-Seater.2': ['£300']
})

# melt the dataframe from wide to long format
df_melt = pd.melt(df, id_vars=['Mileage', 'Operator 1', 'Operator 2', 'Operator 3'],
                  var_name='Seats', value_name='Price')

# split the 'Seats' column into two columns - one for the number of seats and one for the type of vehicle
df_melt[['NumSeats', 'VehicleType']] = df_melt['Seats'].str.split('-', expand=True)

# create a dictionary to map seat numbers to operator names
operator_map = {
    '16.1': 'Operator 1',
    '29.1': 'Operator 1',
    '16.2': 'Operator 2',
    '29.2': 'Operator 2',
    '16.3': 'Operator 3',
    '29.3': 'Operator 3'
}

# look up the correct operator based on the seat number and add it as a column
df_melt['Operator'] = df_melt.apply(lambda x: x[operator_map[x['NumSeats']]], axis=1)

# drop the original 'Seats' column and rename the remaining columns
df_melt = df_melt.drop('Seats', axis=1).rename(columns={'Operator 1': 'Dummy1', 'Operator 2': 'Dummy2', 'Operator 3': 'Dummy3'})

# drop the 'Dummy1', 'Dummy2', and 'Dummy3' columns and reorder the columns
df_melt = df_melt.drop(['Dummy1', 'Dummy2', 'Dummy3'], axis=1)[['Mileage', 'Operator', 'NumSeats', 'Price']]

# output the final dataframe
print(df_melt)

Not Ideal But I've found a work around by changing the Value of Seats for Each Operator. Seats.Operator_ID.

E.g.

# create a dictionary to map seat numbers to operator names
operator_map = {
    '16.1': 'Operator 1',
    '29.1': 'Operator 1',
    '16.2': 'Operator 2',
    '29.2': 'Operator 2',
    '16.3': 'Operator 3',
    '29.3': 'Operator 3'
}

Not The best Solution but will do. Hopefully it can help people in a similar problem. Although I'd be interested in better solutions.

I can strip the Decimals of the seats after

1

One option is to reorder the columns before flipping with a combination of stack and melt:

temp = df.set_index('Mileage')
regex = r"(?P<num1>\d+)\.(?P<num2>\d)(?P<text>-Seater).*"
repl = lambda g: f"{g['num2']}_{g['num1']}{g['text']}"
temp.columns = temp.columns.str.replace(regex, repl, regex=True)
temp.columns = ["_".join(ent.split()[::-1]) 
               if ent.startswith('Operator') 
               else ent 
               for ent in temp]
temp.columns = temp.columns.str.split("_", expand=True)
(temp
.stack(level=0)
.melt(id_vars='Operator', value_name = 'Price', var_name = 'Seats')
)

  Operator      Seats Price
0   BusRus  16-Seater  £199
1   Wheels  16-Seater  £190
2     Taxi  16-Seater  £200
3   BusRus  29-Seater  £299
4   Wheels  29-Seater  £290
5     Taxi  29-Seater  £300

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Great Solution How would one add the Filler Columns e.g. Mileage and any other Columns such as Fuel and Hours. As there will be multiple rows in the data set – Arronjeet Kooner Apr 12 '23 at 10:27
  • Expand your initial question with the filler columns. Alternatively you can create a new question – sammywemmy Apr 12 '23 at 11:03