I'm a huge football (soccer) fan and interested in machine learning too. As a project for my ML course I'm trying to build a model that predicts the home team's chance of winning, given the names of the home and away teams. (I query my dataset and create datapoints based on previous matches between those two teams.)

I have data for several seasons for all teams, but I have the following issue that I would like some advice on. The EPL (English Premier League) has 20 teams, which play each other home and away (380 games in total per season). Thus, each season, any two teams play each other only twice.

I have data for the past 10+ years, resulting in 2*10=20 datapoints for the two teams. However, I do not want to go back more than 3 years, since I believe teams change quite considerably over time (ManCity, Liverpool) and older data would only introduce more error into the system.

So this results in only around 6-8 data points for each pair of teams. However, I do have several features (20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form, etc.

However, having only 6-8 datapoints to train with seems wrong to me. Any thoughts on how I could counter this problem (if it is a problem in the first place)?

Thanks!

EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).

keithxm23

3 Answers

That's an interesting problem which I don't think has a unique solution. However, there are a couple of things that I would try if I were in your position.

I share your concern about 6-8 points per class being too little data to build a reliable model, so I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away), and I would add two features: one for the home team and another for the away team. In that setup, you can still predict which team would win given whether it is playing home or away, and your problem has more data to produce a result.
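A minimal sketch of that setup, under stated assumptions: the results below are made up, the team names are placeholders, and the tiny gradient-ascent logistic regression only stands in for whatever classifier you would actually use. Each match becomes one row with a one-hot home-team feature, a one-hot away-team feature, and a binary "home win" label.

```python
import math

# Made-up results for illustration: (home_team, away_team, home_goals, away_goals)
matches = [
    ("Arsenal", "Wigan", 3, 0),
    ("Wigan", "Arsenal", 0, 2),
    ("Arsenal", "Everton", 2, 1),
    ("Everton", "Wigan", 1, 0),
    ("Wigan", "Everton", 1, 1),
    ("Everton", "Arsenal", 0, 1),
]

teams = sorted({team for match in matches for team in match[:2]})
index = {team: i for i, team in enumerate(teams)}


def encode(home, away):
    """One-hot home team, one-hot away team, plus a constant bias term."""
    row = [0.0] * (2 * len(teams) + 1)
    row[index[home]] = 1.0
    row[len(teams) + index[away]] = 1.0
    row[-1] = 1.0
    return row


# Every match in the league is a training row, not just one pair's fixtures.
X = [encode(home, away) for home, away, _, _ in matches]
y = [1.0 if hg > ag else 0.0 for _, _, hg, ag in matches]  # 1 = home win


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(X, y, lr=0.1, epochs=500):
    """Plain batch gradient ascent on the logistic log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            err = label - sigmoid(sum(wi * xi for wi, xi in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w


weights = fit(X, y)


def home_win_probability(home, away):
    """P(home win | home team, away team) under the fitted toy model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(weights, encode(home, away))))
```

Calling `home_win_probability("Arsenal", "Wigan")` and `home_win_probability("Wigan", "Arsenal")` gives different values, because the team identities are inputs to the model rather than something it ignores.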

Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model, and you could benefit from the additional data (assuming that those features are valid in other leagues).

Pedrom
  • Hey, thanks for the input, Pedrom. Yeah, I thought about modelling my data this way. It would give me around 380 datapoints each season, and thus I could have thousands of datapoints to work with. However, the problem that this would solve is just "the chance of the home team winning a game". It would return the same value for, say, a 3rd Division team playing a Champions League team, or for any team playing any other team at all. – keithxm23 Mar 20 '13 at 18:38
  • @keithxm23 Hey, good to hear back from you. "The chance of the home team winning a game": not necessarily. Given that your features include the home team and the away team (and if you include each team's division as an additional feature, even better), the output would read "the chance of the home team winning a game *given* that the home team is A and the away team is B". Does that make sense? – Pedrom Mar 20 '13 at 18:44
  • Oh! So you mean: for the home team, predict the chance of winning given its recent home form, then for the away team, predict the chance of winning given its recent away form, and then compare these two values and make a prediction. That's a very good idea. After this, I was thinking about how to also include knowledge of previous matches between the two teams (A and B), as I'm sure that would have immense value too. So an idea I had was, after calculating the 'home form' and 'away form' for both teams at a point in time, to also calculate how each team fared.. – keithxm23 Mar 20 '13 at 23:11
  • ..how each team fared against the other (i.e. A-vs-B and B-vs-A) in those matches that contributed to 'home form' and 'away form'. Does that make sense to you? If it does, do you think this is a good idea, or can you think of a better way of adding knowledge about matches specific to teams A and B? – keithxm23 Mar 20 '13 at 23:12
  • I think you are on the right track :) It definitely makes sense to me. I might have some other ideas, but not necessarily better ones; it's a matter of trying and seeing how it goes. – Pedrom Mar 20 '13 at 23:39

I have a similar system. A good source of data is football-data.co.uk. I used the last N seasons for each league to build a model (believe me, more than 3 years is a must!). How you build your own prediction model depends on your criterion function: best fit or maximum profit.

One very good thing to know is that each league is different. Bookmakers also give different home-win odds on the favorite in Belgium than in the 5th English league, where you can find real value odds, for instance.

From that you can build an interesting model, for example one that produces betting tips to beat the bookmakers on specific matches by finding value bets. Or you can try to chase as many winning tips as you can, though that possibly earns less (draws earn a lot of money even though fewer draw bets win).
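To make "value bet" concrete, here is a small sketch under the usual reading of the term: a bet has value when your model's probability times the bookmaker's decimal odds exceeds 1. The function names and the use of decimal odds are assumptions for illustration, not something specified above.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_profit(model_probability, decimal_odds, stake=1.0):
    """Expected profit of a bet if the model's probability is correct."""
    return stake * (model_probability * decimal_odds - 1.0)


def is_value_bet(model_probability, decimal_odds):
    """A bet has value when its expected profit is positive."""
    return expected_profit(model_probability, decimal_odds) > 0
```

For example, if the model gives a draw a 30% chance and the bookmaker offers 3.6, then 0.30 * 3.6 = 1.08 > 1, so the bet has positive expected profit even though it loses most of the time. That is the trade-off between maximum profit and the number of winning tips.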

Hopefully I gave you some ideas, for more feel free to ask.

kovomaster

Don't know if this is still helpful, but features like Full-time goals, half time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.

I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which team is the home team (already mentioned), goals scored in the last few matches home and away, etc.
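A rough sketch of how such pre-match features could be built, assuming the matches are sorted chronologically. The data, the field names, and the "points from the last few games" definition of form are all made up for illustration. The key point is that each row's features use only results from BEFORE that match, so the same features can be computed for a match that has not been played yet.

```python
from collections import defaultdict, deque

# Made-up results, sorted by date: (date, home, away, home_goals, away_goals)
matches = [
    ("2013-01-01", "A", "B", 2, 0),
    ("2013-01-08", "B", "C", 1, 1),
    ("2013-01-15", "C", "A", 0, 3),
    ("2013-01-22", "A", "B", 1, 2),
]


def points(goals_for, goals_against):
    """League points earned from one match: 3 win, 1 draw, 0 loss."""
    if goals_for > goals_against:
        return 3
    return 1 if goals_for == goals_against else 0


def build_rows(matches, window=5):
    """Build one training row per match from strictly earlier results."""
    recent = defaultdict(lambda: deque(maxlen=window))  # team -> last points
    rows = []
    for date, home, away, hg, ag in matches:
        # Features first: they may only see matches played before this one.
        rows.append({
            "date": date,
            "home": home,
            "away": away,
            "home_form": sum(recent[home]),
            "away_form": sum(recent[away]),
            "outcome": "1" if hg > ag else "X" if hg == ag else "2",
        })
        # Only now record this result, so it affects later matches only.
        recent[home].append(points(hg, ag))
        recent[away].append(points(ag, hg))
    return rows


rows = build_rows(matches)
```

The `outcome` column is the 1/X/2 label; the other columns are things you would also know before kick-off of a new match.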

Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.

tomas