How do I predict future results with scikit-learn and pandas in Python using RandomForestRegressor?

Question

Hello, I came across this tutorial on using Python with a few libraries to predict future NCAAB games using the sportsreference library. I will post the code as well as the article. This seems to work well, but I think it is only testing against games in the past. How would I use it to predict future games between specific teams? For example, what will the score be between Team A and Team B on a given date?

The problem I see is that some of the data used can only be known after the game is finished. This known data is what is being used in the program to predict the score.

First experiment: I tried filling in only the data that I knew about a game before it happened and filling in the remaining data with zeros using fillna(0). Here is what the CSV would look like:

date_team,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,away_field_goals,away_free_throw_attempt_rate,away_free_throw_attempts,away_free_throw_percentage,away_free_throws,away_losses,away_minutes_played,away_offensive_rating,away_offensive_rebound_percentage,away_offensive_rebounds,away_personal_fouls,away_points,away_steal_percentage,away_steals,away_three_point_attempt_rate,away_three_point_field_goal_attempts,away_three_point_field_goal_percentage,away_three_point_field_goals,away_total_rebound_percentage,away_total_rebounds,away_true_shooting_percentage,away_turnover_percentage,away_turnovers,away_two_point_field_goal_attempts,away_two_point_field_goal_percentage,away_two_point_field_goals,away_win_percentage,away_wins,home_assist_percentage,home_assists,home_block_percentage,home_blocks,home_defensive_rating,home_defensive_rebound_percentage,home_defensive_rebounds,home_effective_field_goal_percentage,home_field_goal_attempts,home_field_goal_percentage,home_field_goals,home_free_throw_attempt_rate,home_free_throw_attempts,home_free_throw_percentage,home_free_throws,home_losses,home_minutes_played,home_offensive_rating,home_offensive_rebound_percentage,home_offensive_rebounds,home_personal_fouls,home_points,home_steal_percentage,home_steals,home_three_point_attempt_rate,home_three_point_field_goal_attempts,home_three_point_field_goal_percentage,home_three_point_field_goals,home_total_rebound_percentage,home_total_rebounds,home_true_shooting_percentage,home_turnover_percentage,home_turnovers,home_two_point_field_goal_attempts,home_two_point_field_goal_percentage,home_two_point_field_goals,home_win_percentage,home_wins,pace 
0,0,0,0,0,0,0,0,0,59,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.7,7,0,0,0,0,0,0,0,0,0,0,42,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.1,1,0

The final line of code is changed to:

print(model.predict(final_trim).astype(int), y_test)

Here, "final_trim" is the new CSV that is being predicted.

The results were not accurate at all. What am I missing?

Here is the original code:

import pandas as pd
from sportsreference.ncaab.teams import Teams
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

FIELDS_TO_DROP = ['away_points', 'home_points', 'date', 'location',
                  'losing_abbr', 'losing_name', 'winner', 'winning_abbr',
                  'winning_name', 'home_ranking', 'away_ranking']

dataset = pd.DataFrame()
teams = Teams()
for team in teams:
    dataset = pd.concat([dataset, team.schedule.dataframe_extended])
X = dataset.drop(FIELDS_TO_DROP, axis=1).dropna().drop_duplicates()
y = dataset[['home_points', 'away_points']].values
X_train, X_test, y_train, y_test = train_test_split(X, y)
parameters = {'bootstrap': False,
              'min_samples_leaf': 3,
              'n_estimators': 50,
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6}
model = RandomForestRegressor(**parameters)
model.fit(X_train, y_train)
print(model.predict(X_test).astype(int), y_test)

And here is the post I got it from: https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894

Thank you!

Welcome to stack overflow! Your question is a bit confusing, because `train_test_split` only has to do with model training and evaluation, and nothing to do with predicting on unseen data. There are numerous resources to tell you how to pass in new data to a trained model, please [edit] your question to show what you've tried based on your own research, and what went wrong with your attempt — G. Anderson, Dec 12 '19 at 22:50
@G.Anderson I have updated the question to include an experiment that I have since tried. Thanks — Ryan Record, Dec 20 '19 at 14:53

score 1 · Answer 1 · answered Dec 12 '19 at 22:54

Think of it this way: to test your model's goodness of fit, you must know the real result in advance so that you can measure the distance between the model's output and the actual outcome, then tune the model to improve its overall performance.
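As a rough sketch of that measurement, the held-out rows can be scored with scikit-learn's mean_absolute_error. The data below is synthetic and the feature names are made up for illustration; they are not columns from the NCAAB dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: two features, two targets.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "feature_1": rng.uniform(0, 1, 200),
    "feature_2": rng.uniform(0, 1, 200),
})
y = pd.DataFrame({
    "home_points": 60 + 20 * X["feature_1"] + rng.normal(0, 2, 200),
    "away_points": 60 + 20 * X["feature_2"] + rng.normal(0, 2, 200),
})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Compare predictions with the known outcomes of the held-out rows.
predictions = model.predict(X_test)
# multioutput="raw_values" returns one error per target column.
mae_per_target = mean_absolute_error(y_test, predictions, multioutput="raw_values")
print(dict(zip(y.columns, mae_per_target)))
```

This kind of check is only possible on past games, because it needs y_test, the real scores.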

Once you have trained your model, to predict future values (and without knowing much about your data) you should feed the model the same features it was trained on, but with the values for the case you want to predict. Here is a very basic example using two variables to predict the scores of two teams (A and B):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = {'Temperature': [10, 20, 30, 25], 'Humidity': [40, 50, 80, 65],
        'Score_A': [1, 2, 3, 2], 'Score_B': [6, 3, 1, 2]}
df = pd.DataFrame(data)
print(df)
X = df[['Temperature', 'Humidity']]
Y = df[['Score_A', 'Score_B']]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

Here I've trained my model, so to make a future prediction I need to pass the same features I used in training (temperature and humidity), but with the values I want to predict for. Let's say our friend the meteorologist says that the temperature and humidity for their next match will be 35 and 70 respectively. So I need to call .predict() with those values:

print(model.predict([[35,70]]))

Which returns an output of:

[[2.6 1.4]]

If you wish to make it fancier:

prediction = model.predict([[35,70]])
print("Team A will score: ",prediction[0][0])
print("Team B will score: ",prediction[0][1])

Returning:

Team A will score:  2.6
Team B will score:  1.4
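For the NCAAB case in the question, most box-score columns are not known before tip-off, so they cannot be supplied for a future game. One possible approach, which is an assumption here and not something the tutorial does, is to train on features that are available before each game, such as each team's average points over its earlier games, and to build the row for the upcoming game the same way. The game log, team names, column names, and the average_before helper below are all hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical game log: one row per finished game, in date order.
games = pd.DataFrame({
    "home_team":   ["A", "B", "A", "C", "B", "C", "A", "B"],
    "away_team":   ["B", "C", "C", "A", "A", "B", "B", "C"],
    "home_points": [70, 65, 80, 60, 72, 68, 75, 66],
    "away_points": [60, 70, 62, 77, 74, 64, 69, 71],
})

def average_before(games, team, game_index):
    """Mean points `team` scored in games strictly before `game_index`."""
    past = games.iloc[:game_index]
    home = past.loc[past["home_team"] == team, "home_points"]
    away = past.loc[past["away_team"] == team, "away_points"]
    n = len(home) + len(away)
    # NaN when the team has no earlier games.
    return (home.sum() + away.sum()) / n if n else float("nan")

# Build one pre-game feature row per finished game.
rows = []
for i in range(len(games)):
    rows.append({
        "home_avg_points": average_before(games, games.loc[i, "home_team"], i),
        "away_avg_points": average_before(games, games.loc[i, "away_team"], i),
    })
features = pd.DataFrame(rows)

# Keep only games where both teams already have a history.
mask = features.notna().all(axis=1)
X = features[mask]
y = games.loc[mask, ["home_points", "away_points"]]

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# Upcoming game: team A at home against team C, using every finished game.
upcoming = pd.DataFrame([{
    "home_avg_points": average_before(games, "A", len(games)),
    "away_avg_points": average_before(games, "C", len(games)),
}])
# Same columns, in the same order, as in training.
upcoming = upcoming[X.columns]
prediction = model.predict(upcoming)
print(prediction)
```

The point of the sketch is that a column goes into X only if its value is known before the game. Filling post-game statistics with zeros instead gives the model inputs unlike anything it saw in training, which is a likely reason the zero-filled experiment was inaccurate.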
