ml-project

Proposal Report

Introduction

Soccer is the most popular sport in the world, with an estimated 240 million registered players worldwide, and there is undoubtedly a lot of data present (Terrell, 2022).

Sports forecasting has been dominated by traditional statistical methods, but machine learning methods have emerged. Recent models used for sports forecasting are neural networks (Horvat & Job, 2020), such as Hubáček et. al. who use a convolutional neural network on an NBA dataset (Hubáček et al., 2019), although other researchers use regression models (e.g. linear, logistic), or support vector machines (Wilkens, 2021).

We will be using the European Soccer Database by Hugo Mathien from Kaggle. We are interested in the following tables:

Match: This table contains features from over 25,000 matches played in Europe from 2008 to 2016. This dataset contains match statistics (e.g. number of cards, lineups, etc.) and betting odds from 10 providers. Betting odds will be important in our models when validating results.
Player_attributes: This table contains features such as player’s ratings in different attributes from the FIFA video game series, as well as height, weight, preferred foot, and age.
Team_attributes: This table contains features that rate teams in different aspects of the game, such as passing, dribbling, etc.

Problem Definition

The sports betting industry is valued at 81 billion US dollars in the year 2022, and it is projected to double that by 2030. A critical piece of this industry is the spreads and odds that are bet on. For example, if the sportsbook which controls the betting odds expects Team A to beat team B by 3.5 points, if you bet on team A and they win by 4+ points then you win, else you lose, even if they win by 2 points. These spreads are essential to making money on betting, so we will use machine learning to create our own soccer game spreads, and identify which teams are overvalued and undervalued.

Methods

Researchers have used many different machine learning algorithms to predict sports outcomes. Logistic regression models not only can be used for win/loss classification, but also give confidence in the model’s decision. A random forest approach achieved an impressive 83% accuracy, and other random forest or decision tree models also outperformed regression based models (Wilkens, 2021). Lastly, artificial neural networks have shown strong results and require less preprocessing on inputs than other models (Hubáček et al., 2019).

Potential Results and Discussion

Our core model applies the Estimator Score Method to rank teams and predict matches. Supplemental metrics like success factors from scoring parameters and regression techniques will refine predictions. We target 60-70% accuracy for rankings and winners, with incremental gains over time. But with inherent unpredictability in sports, interpretability and utility outweigh perfect accuracy. Out-of-sample testing on 2017 data will show model portability to new seasons, not just overfitting history. In summary, we optimize for insight and usability over precision alone when forecasting match outcomes and rankings via Estimator Scores augmented by additional metrics.

Timeline

See link above

Contributors

Name	Contribution
Daniel	Introduction
Elijah	Problem Definition
Sabina	Timeline, slides for video
Xander	Potential Results and Discussion
Matthew	Methods

References

Horvat, T., & Job, J. (2020). The use of machine learning in sport outcome prediction: A review. WIREs Data Mining and Knowledge Discovery, 10(5). https://doi.org/10.1002/widm.1380

Hubáček, O., Šourek, G., & Železný, F. (2019). Exploiting sports-betting market using machine learning. International Journal of Forecasting, 35(2), 783–796. https://doi.org/10.1016/j.ijforecast.2019.01.001

Terrell, E. (2022). Research Guides: Sports Industry: A Research Guide: Soccer. Guides.loc.gov. https://guides.loc.gov/sports-industry/soccer

Wilkens, S. (2021). Sports prediction and betting models in the machine learning age: The case of tennis. Journal of Sports Analytics, 7(2), 1–19. https://doi.org/10.3233/jsa-200463