
I've started learning Python and machine learning very recently. I have been working through a basic Decision Tree Regressor example involving house prices. I have trained the model and found the best number of leaf nodes, but how do I use it on new data?

I have the columns below, and my target value is 'SalePrice':

['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

Obviously, for the original data I already have the SalePrice, so I can compare the values. How would I go about predicting the price if I only have the columns above?

Full code below

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)
#Simplify data to remove useless info
SimpleTable=home_data[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd','SalePrice']]
# Create target object and call it y # input target value
y = home_data.SalePrice 
# Create X input columns names to be analysed
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0, test_size=0.8, train_size=0.2)


# Specify Model
iowa_model = DecisionTreeRegressor(random_state=0)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)

# Arguments are (y_true, y_pred); MAE is symmetric, but this matches the documented order
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

# to find best number of leaves
candidate_max_leaf_nodes = [10, 20, 50, 100, 200, 400] # candidate tree sizes to compare
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae=get_mae(max_leaf_nodes,train_X,val_X,train_y,val_y)
    print("MAX leaf nodes: %d \t\t Mean Absolute Error:%d" %(max_leaf_nodes,my_mae))




scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}

best_tree_size = min(scores, key=scores.get)
print(best_tree_size)


# Refit on all data with the best tree size, then put the predictions back into a data frame
final_model=DecisionTreeRegressor(max_leaf_nodes=best_tree_size,random_state=0)
final_model.fit(X,y)

final_predictions = final_model.predict(X)
finaltableinput = {'Predicted_Price':final_predictions}
finaltable = pd.DataFrame(finaltableinput)
SimpleTable.head()

jointable = SimpleTable.join(finaltable)

#export data with predicted values to csv
jointable.to_csv('newdata4.csv')




Thanks in Advance

ARH94

2 Answers


If you want to predict the price (y) from the independent variables (X) with an already trained model, use the predict() method. The model applies what it learned during training to the new feature values to predict the SalePrice. I see you've already used .predict() in your code.

Start by selecting the same feature columns from the new data, for example:

X_new = df_new[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']] # df_new is assumed to be a pandas DataFrame holding the new houses
new_sale_price = final_model.predict(X_new) # Returns an array with one prediction per row
df_new['SalePrice'] = new_sale_price # The array has as many entries as df_new has rows, so the assignment lines up

You can do this in one line as well:

df_new['SalePrice'] = final_model.predict(X_new) 
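For reference, here is a minimal end-to-end sketch of the same idea. It assumes synthetic, randomly generated data in place of train.csv, a made-up price formula, and an arbitrary max_leaf_nodes of 50, so the numbers mean nothing; only the fit-then-predict pattern matters. The names home_data, df_new and final_model mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Synthetic stand-in for train.csv (random values, for illustration only)
rng = np.random.default_rng(0)
n = 200
home_data = pd.DataFrame({
    'LotArea': rng.integers(2000, 20000, n),
    'YearBuilt': rng.integers(1900, 2010, n),
    '1stFlrSF': rng.integers(500, 2500, n),
    '2ndFlrSF': rng.integers(0, 1500, n),
    'FullBath': rng.integers(1, 4, n),
    'BedroomAbvGr': rng.integers(1, 6, n),
    'TotRmsAbvGrd': rng.integers(3, 12, n),
})
# Made-up price formula so the model has something to learn
home_data['SalePrice'] = (50 * home_data['1stFlrSF']
                          + 40 * home_data['2ndFlrSF']
                          + 5 * home_data['LotArea'])

# Train on the labelled data (max_leaf_nodes=50 is an arbitrary choice here)
final_model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)
final_model.fit(home_data[features], home_data['SalePrice'])

# New houses: same feature columns, but no SalePrice column
df_new = pd.DataFrame({
    'LotArea': [9000, 12000, 5000],
    'YearBuilt': [1995, 2005, 1950],
    '1stFlrSF': [1000, 1500, 800],
    '2ndFlrSF': [0, 700, 300],
    'FullBath': [1, 2, 1],
    'BedroomAbvGr': [3, 4, 2],
    'TotRmsAbvGrd': [6, 8, 5],
})

# Predict and store the result as a new column
df_new['SalePrice'] = final_model.predict(df_new[features])
print(df_new)
```

Selecting df_new[features] rather than passing df_new directly keeps the columns in the same order the model was trained on, even if the new file has extra columns.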

Of course, since you don't know the real SalePrice for those values of X, you can't do a performance check. This is what happens in the real world whenever you want to predict or forecast prices from a group of variables: you train your model to achieve its peak performance, and then you make the predictions with it. Feel free to leave any questions in the comments if you have doubts.
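The same pattern works for a single house. The sketch below is only an illustration: the four training rows, the prices and the new_house values are illustrative, and model is a stand-in for whatever fitted model you already have. The one detail worth copying is building a one-row DataFrame and selecting the feature list, so the columns arrive in the order used for training.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Tiny illustrative training set, only to obtain a fitted model for the demo
train = pd.DataFrame(
    [[8450, 2003, 856, 854, 2, 3, 8],
     [9600, 1976, 1262, 0, 2, 3, 6],
     [11250, 2001, 920, 866, 2, 3, 6],
     [9550, 1915, 961, 756, 1, 3, 7]],
    columns=features,
)
prices = pd.Series([208500, 181500, 223500, 140000], name='SalePrice')

model = DecisionTreeRegressor(random_state=0).fit(train, prices)

# One new house; the keys are deliberately in a different order
new_house = {
    'YearBuilt': 1999, 'LotArea': 9000, 'FullBath': 2,
    '1stFlrSF': 900, '2ndFlrSF': 800,
    'TotRmsAbvGrd': 7, 'BedroomAbvGr': 3,
}

# Build a one-row DataFrame and reorder the columns to match training
X_one = pd.DataFrame([new_house])[features]

# predict() returns an array, so take the first (and only) element
predicted_price = model.predict(X_one)[0]
print("Predicted SalePrice: {:,.0f}".format(predicted_price))
```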

Celius Stingher
  • Glad to be of help! Feel free to accept my answer, that will mark the question as answered and help me with the reputation too :) – Celius Stingher Jan 27 '20 at 22:27

The Decision Tree algorithm is a supervised learning model, which means that in order to train it you must supply the model with the feature data as well as the target ('SalePrice' in your case).
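A small sketch of what that means in scikit-learn terms, using made-up one-feature data: fit() needs both the features and the target, while predict() on the fitted model needs only the features.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up labelled data: one feature, one target value per row
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])

model = DecisionTreeRegressor(random_state=0)

# Training without a target is not possible: y is a required argument of fit()
try:
    model.fit(X_train)
    fit_without_target_failed = False
except TypeError:
    fit_without_target_failed = True

# Training with features and target works
model.fit(X_train, y_train)

# Once trained, the model predicts from features alone
X_unlabelled = np.array([[2.0], [3.0]])
predictions = model.predict(X_unlabelled)
print(fit_without_target_failed, predictions)
```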

If you want to apply machine learning in a case where you have no target data at all, not even for training, you have to use an unsupervised model.

A very basic introduction to these different kinds of learning models can be found here.

João Paludo
  • Thanks for the help but I meant using the analysis of the training data to predict new data. This has now been answered. – ARH94 Jan 27 '20 at 20:18