I have a large horse racing database that I am trying to create a model for. Currently I am training the model based on the whole database - however, in horse racing, I need to train the model in the context of a race - using historic race results.

How can I train a model with the data grouped by race? That is, I need it to predict the performance of a horse in a race compared to the other horses in that race, not every other horse in the database.

re0603
  • What exactly is your question? Is the problem how to build a model for this, or how to prepare the data? – rmesteves Jul 01 '20 at 06:11
  • The problem is how to build the model. The data is already in one dataset, but I am confused as to how to train the model in the context of a race. i.e. it needs to predict the probability of a horse winning a particular race, based on the different attributes of that race, and the horse's performances in previous races with similar attributes. Currently it seems I am only able to create a very general model not related to any specific race. – re0603 Jul 01 '20 at 09:06
  • Which parameters would you like to consider in the model? – rmesteves Jul 01 '20 at 10:00
  • I'd like to consider each horse's run time in previous races at the same venue and distance. Each horse has an id and lots of other attributes, but I want to keep it simple for the first model. – re0603 Jul 01 '20 at 10:11
  • Isn't it enough to select the parameters you need and use a linear regression? From what I understand, your problem is more about how to filter the data than how to create the model. – rmesteves Jul 01 '20 at 10:28
  • I have tried doing this, however it predicts times for horses much worse or much better than they have ever run previously. So I don't believe it is considering the horse's actual capabilities from its previous results. – re0603 Jul 01 '20 at 10:31
  • The same for logistic regression? – rmesteves Jul 01 '20 at 11:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217014/discussion-between-re0603-and-rmesteves). – re0603 Jul 01 '20 at 12:25

1 Answer

The CREATE MODEL statement accepts a standard SQL query, so you can do anything you like in that query (for example, filtering by certain horses or races in your case). The model is trained on the rows returned by the query's SELECT statement.

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#query_statement

For example (this is a nonsensical model, but it shows that you can use any SQL you like in the CREATE MODEL statement):

```sql
#standardSQL
CREATE MODEL
  `another_test.sample_model` OPTIONS(model_type='logistic_reg') AS
SELECT
  SUM(views) AS label,
  year,
  month,
  day,
  wikimedia_project,
  language
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki1M`
WHERE
  title LIKE '%melbourne%'
GROUP BY
  2,
  3,
  4,
  5,
  6
```
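
Applied to the race-grouping question, one possible approach is to compute each horse's features relative to the other runners in the same race with window functions, and train on those. The following is only a sketch under assumptions: the table `my_dataset.race_results`, the model name, and the columns `race_id`, `horse_id`, `race_date`, `venue`, `distance_m`, `finish_time_secs` and `finish_position` are all hypothetical and would need to be mapped to the real schema.

```sql
#standardSQL
-- Hypothetical schema: one row per horse per race in `my_dataset.race_results`.
CREATE OR REPLACE MODEL
  `my_dataset.race_win_model`
  OPTIONS(model_type='logistic_reg', input_label_cols=['won']) AS
WITH history AS (
  SELECT
    race_id,
    -- Binary label: 1 if the horse won this race, otherwise 0.
    IF(finish_position = 1, 1, 0) AS won,
    -- Average time over the horse's EARLIER runs at the same venue and
    -- distance. The frame stops one row before the current race, so the
    -- result of the race being predicted never leaks into its own features.
    AVG(finish_time_secs) OVER (
      PARTITION BY horse_id, venue, distance_m
      ORDER BY race_date
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS prev_avg_time
  FROM
    `my_dataset.race_results`
)
SELECT
  won,
  -- How the horse's past average compares with the rest of the field.
  prev_avg_time - AVG(prev_avg_time) OVER (PARTITION BY race_id) AS time_vs_field,
  -- Rank within the race by past average time (1 = fastest history).
  RANK() OVER (PARTITION BY race_id ORDER BY prev_avg_time) AS rank_in_field
FROM
  history
WHERE
  -- Drop runners with no earlier run at this venue and distance; the
  -- per-race windows above are then computed over the remaining runners.
  prev_avg_time IS NOT NULL
```

Because `time_vs_field` and `rank_in_field` are partitioned by `race_id`, every training row describes a horse relative to its opponents in that race rather than relative to the whole database. `race_id` itself is deliberately left out of the final SELECT so that it is not treated as a feature.
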
Graham Polley
  • Thanks Graham, appreciate your help. I just need to work out how to train the model based on the horses' past performances etc., which may be a bit trickier than most examples I've found online... will give it a shot though! – re0603 Jun 30 '20 at 12:56
  • You can also prep your data first in BigQuery using SQL and then feed that into the model if you want. – Graham Polley Jun 30 '20 at 23:17
  • Hey @re0603, did you ever find a technique or model for doing this? I am having a similar problem (also similar to this [other post](https://stackoverflow.com/questions/59978022/how-do-i-classify-instances-based-on-other-instances-within-a-group)). Thanks – Jones Apr 29 '21 at 14:08