Soccer is the most popular sport in the world, with an estimated 240 million registered players worldwide, and there is undoubtedly a lot of data present (Terrell, 2022).
Sports forecasting has been dominated by traditional statistical methods, but machine learning methods have emerged. Recent models used for sports forecasting are neural networks (Horvat & Job, 2020), such as Hubáček et. al. who use a convolutional neural network on an NBA dataset (Hubáček et al., 2019), although other researchers use regression models (e.g. linear, logistic), or support vector machines (Wilkens, 2021).
We will be using the European Soccer Database by Hugo Mathien from Kaggle. We are interested in the following tables:
The sports betting industry is valued at 81 billion US dollars in the year 2022, and it is projected to double that by 2030. A critical piece of this industry is the spreads and odds that are bet on. For example, if the sportsbook which controls the betting odds expects Team A to beat team B by 3.5 points, if you bet on team A and they win by 4+ points then you win, else you lose, even if they win by 2 points. These spreads are essential to making money on betting, so we will use machine learning to create our own soccer game spreads, and identify which teams are overvalued and undervalued.
Researchers have used many different machine learning algorithms to predict sports outcomes. Logistic regression models not only can be used for win/loss classification, but also give confidence in the model’s decision. A random forest approach achieved an impressive 83% accuracy, and other random forest or decision tree models also outperformed regression based models (Wilkens, 2021). Lastly, artificial neural networks have shown strong results and require less preprocessing on inputs than other models (Hubáček et al., 2019).
Our core model applies the Estimator Score Method to rank teams and predict matches. Supplemental metrics like success factors from scoring parameters and regression techniques will refine predictions. We target 60-70% accuracy for rankings and winners, with incremental gains over time. But with inherent unpredictability in sports, interpretability and utility outweigh perfect accuracy. Out-of-sample testing on 2017 data will show model portability to new seasons, not just overfitting history. In summary, we optimize for insight and usability over precision alone when forecasting match outcomes and rankings via Estimator Scores augmented by additional metrics.
See link above
Name | Contribution |
---|---|
Daniel | Introduction |
Elijah | Problem Definition |
Sabina | Timeline, slides for video |
Xander | Potential Results and Discussion |
Matthew | Methods |
Horvat, T., & Job, J. (2020). The use of machine learning in sport outcome prediction: A review. WIREs Data Mining and Knowledge Discovery, 10(5). https://doi.org/10.1002/widm.1380
Hubáček, O., Šourek, G., & Železný, F. (2019). Exploiting sports-betting market using machine learning. International Journal of Forecasting, 35(2), 783–796. https://doi.org/10.1016/j.ijforecast.2019.01.001
Terrell, E. (2022). Research Guides: Sports Industry: A Research Guide: Soccer. Guides.loc.gov. https://guides.loc.gov/sports-industry/soccer
Wilkens, S. (2021). Sports prediction and betting models in the machine learning age: The case of tennis. Journal of Sports Analytics, 7(2), 1–19. https://doi.org/10.3233/jsa-200463