Hello, I came across a tutorial on using Python with the sportsreference library to predict NCAAB game scores. I have posted the code and a link to the article below. It seems to work well, but I think it only tests against games that have already been played. How would I use it to predict future games between specific teams? For example, what will the score be between Team A and Team B on a given date?
The problem I see is that some of the input data can only be known after the game has finished, and the program uses exactly this post-game data to predict the score.
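To make that concrete, here is a small sketch that sorts a few of the column names from the CSV header below into what can be known before tip-off and what cannot. Which columns count as known in advance is my assumption (season record only), and `known_before_game` is just an illustrative helper, not part of sportsreference.

```python
# Assumption: only the season-record columns are known before the game is played.
PRE_GAME_STATS = ("wins", "losses", "win_percentage")


def known_before_game(column: str) -> bool:
    """Return True if the column is (by assumption) available before tip-off."""
    # Strip the 'home_' / 'away_' prefix and compare the remaining stat name.
    _, _, stat = column.partition("_")
    return stat in PRE_GAME_STATS


# A handful of column names copied from the CSV header.
sample_columns = [
    "away_assists", "away_field_goal_attempts", "away_losses",
    "away_win_percentage", "away_wins", "home_losses", "home_points", "pace",
]

pre_game = [c for c in sample_columns if known_before_game(c)]
post_game = [c for c in sample_columns if not known_before_game(c)]

print("known before the game:", pre_game)
print("only known afterwards:", post_game)
```

Under that assumption, most of the columns the model was trained on are box-score values that do not exist yet for a future game.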
First experiment: I tried filling in only the values I knew about a game before it happened and filling the remaining columns with zeros using fillna(0). Here is what the CSV looks like:
date_team,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,away_field_goals,away_free_throw_attempt_rate,away_free_throw_attempts,away_free_throw_percentage,away_free_throws,away_losses,away_minutes_played,away_offensive_rating,away_offensive_rebound_percentage,away_offensive_rebounds,away_personal_fouls,away_points,away_steal_percentage,away_steals,away_three_point_attempt_rate,away_three_point_field_goal_attempts,away_three_point_field_goal_percentage,away_three_point_field_goals,away_total_rebound_percentage,away_total_rebounds,away_true_shooting_percentage,away_turnover_percentage,away_turnovers,away_two_point_field_goal_attempts,away_two_point_field_goal_percentage,away_two_point_field_goals,away_win_percentage,away_wins,home_assist_percentage,home_assists,home_block_percentage,home_blocks,home_defensive_rating,home_defensive_rebound_percentage,home_defensive_rebounds,home_effective_field_goal_percentage,home_field_goal_attempts,home_field_goal_percentage,home_field_goals,home_free_throw_attempt_rate,home_free_throw_attempts,home_free_throw_percentage,home_free_throws,home_losses,home_minutes_played,home_offensive_rating,home_offensive_rebound_percentage,home_offensive_rebounds,home_personal_fouls,home_points,home_steal_percentage,home_steals,home_three_point_attempt_rate,home_three_point_field_goal_attempts,home_three_point_field_goal_percentage,home_three_point_field_goals,home_total_rebound_percentage,home_total_rebounds,home_true_shooting_percentage,home_turnover_percentage,home_turnovers,home_two_point_field_goal_attempts,home_two_point_field_goal_percentage,home_two_point_field_goals,home_win_percentage,home_wins,pace 
0,0,0,0,0,0,0,0,0,59,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.7,7,0,0,0,0,0,0,0,0,0,0,42,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.1,1,0
The final line of code is changed to:
print(model.predict(final_trim).astype(int), y_test)
Here, "final_trim" is the new CSV data that I want predictions for.
The results were not accurate at all. What am I missing?
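The experiment above can be sketched roughly like this. It is a minimal, self-contained illustration on synthetic data, not the real sportsreference dataset; the column subset, the made-up values, and the reindex step are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the training features (a few of the real column names).
feature_columns = ["away_field_goal_attempts", "away_losses", "away_wins",
                   "home_field_goal_attempts", "home_losses", "home_wins", "pace"]
X_train = pd.DataFrame(rng.integers(0, 80, size=(200, len(feature_columns))),
                       columns=feature_columns)
# Two targets per game, like [home_points, away_points] in the original code.
y_train = rng.integers(50, 100, size=(200, 2))

model = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0)
model.fit(X_train, y_train)

# Only the values known before the game (made-up numbers); the rest is missing.
known = {"away_losses": 8, "away_wins": 7, "home_wins": 1}

# Reindex to the training columns so the order matches, then fill the gaps with 0.
final_trim = pd.DataFrame([known]).reindex(columns=X_train.columns).fillna(0)

prediction = model.predict(final_trim).astype(int)
print(prediction)  # one row: [home_points, away_points]
```

The model still returns a score here, but every zero-filled column is a value it never saw during training, such as a game with zero field goal attempts.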
Here is the original code:
import pandas as pd
from sportsreference.ncaab.teams import Teams
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
FIELDS_TO_DROP = ['away_points', 'home_points', 'date', 'location',
                  'losing_abbr', 'losing_name', 'winner', 'winning_abbr',
                  'winning_name', 'home_ranking', 'away_ranking']
dataset = pd.DataFrame()
teams = Teams()
for team in teams:
    dataset = pd.concat([dataset, team.schedule.dataframe_extended])
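# Drop the unused columns but keep the targets for now, then clean the rows
# once so that X and y stay the same length.
TARGETS = ['home_points', 'away_points']
unused = [field for field in FIELDS_TO_DROP if field not in TARGETS]
dataset = dataset.drop(columns=unused).dropna().drop_duplicates()
X = dataset.drop(columns=TARGETS)
y = dataset[TARGETS].values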
X_train, X_test, y_train, y_test = train_test_split(X, y)
parameters = {'bootstrap': False,
              'min_samples_leaf': 3,
              'n_estimators': 50,
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6}
model = RandomForestRegressor(**parameters)
model.fit(X_train, y_train)
print(model.predict(X_test).astype(int), y_test)
And here is the post I got it from: https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894
Thank you!