MLS Predictions | Project Results

Headline Results

Regression Outcome Accuracy

55%

from predicted goals -> win/draw/loss

Classification Ensemble Accuracy

42%

direct multiclass prediction

Regression Recall (Wins / Losses / Draws)

71% / 76% / 4%

draw prediction remained hardest

Classification Recall (Wins / Losses / Draws)

63% / 29% / 20%

class imbalance impacts draws + losses
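
The regression headline numbers depend on how predicted goals are turned into an outcome label, and the source does not say which rule was used. The sketch below is one plausible version under stated assumptions: the `draw_margin` threshold, the function names, and the toy data are hypothetical, not taken from the project. It also shows how per-class recall, the metric reported above, is computed.

```python
from collections import Counter

def outcome_from_goals(goals_for, goals_against, draw_margin=0.25):
    """Map predicted goals to 'W', 'D' or 'L' from the first team's perspective.

    draw_margin is a hypothetical threshold: predictions closer than this
    are called a draw. The project's actual rule is not stated in the source.
    """
    diff = goals_for - goals_against
    if abs(diff) <= draw_margin:
        return "D"
    return "W" if diff > 0 else "L"

def per_class_recall(y_true, y_pred, labels=("W", "L", "D")):
    """Recall per class: correctly predicted rows of a class / actual rows of that class."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        lab: (correct[lab] / actual[lab] if actual[lab] else float("nan"))
        for lab in labels
    }

# Toy data (made up for illustration): (predicted goals for, predicted goals against)
predicted = [(1.8, 0.9), (1.1, 1.0), (0.7, 1.6), (1.4, 1.3), (2.0, 1.1)]
actual = ["W", "D", "L", "W", "D"]

pred_labels = [outcome_from_goals(gf, ga) for gf, ga in predicted]
recall = per_class_recall(actual, pred_labels)
accuracy = sum(t == p for t, p in zip(actual, pred_labels)) / len(actual)
```

A narrow margin makes draws rare in the predictions, which would be consistent with a very low draw recall; widening it trades win and loss recall for draw recall.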

Pipeline

Scrape: team match + shooting/passing/creation/defense stats from FBref.
Clean: normalize teams, parse scores, add time + venue features.
Engineer: 10-match rolling stats and opponent-aware features.
Train: time-aware CV with Random Forest and XGBoost.
Evaluate: confusion matrices, classification reports, feature importances.
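
The Engineer and Train steps can be sketched as follows. This is a minimal standard-library illustration, not the project's code: the field names (`team`, `xg`), the helper names, and the fold layout are assumptions. The key point in both helpers is that only past matches inform a prediction, which is what "time-aware" is taken to mean here.

```python
from collections import defaultdict, deque

def add_rolling_mean(matches, stat, window=10):
    """Add f"{stat}_roll": the mean of a team's previous `window` matches.

    `matches` must be sorted by date. The current match is excluded from
    its own rolling value, so the feature does not leak the result.
    Rows with no prior history get None.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for m in matches:
        past = history[m["team"]]
        row = dict(m)
        row[f"{stat}_roll"] = sum(past) / len(past) if past else None
        out.append(row)
        past.append(m[stat])  # update only after the feature is computed
    return out

def expanding_time_splits(n_rows, n_splits=3):
    """Yield (train_idx, test_idx) pairs where all training rows precede all test rows."""
    fold = n_rows // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        test_end = n_rows if k == n_splits else fold * (k + 1)
        yield list(range(train_end)), list(range(train_end, test_end))

# Toy data (made up for illustration), already in date order
matches = [
    {"date": "2024-03-01", "team": "A", "xg": 1.0},
    {"date": "2024-03-02", "team": "B", "xg": 0.5},
    {"date": "2024-03-08", "team": "A", "xg": 2.0},
    {"date": "2024-03-15", "team": "A", "xg": 3.0},
    {"date": "2024-03-22", "team": "A", "xg": 0.0},
]
rolled = add_rolling_mean(matches, "xg", window=2)
splits = list(expanding_time_splits(8, n_splits=3))
```

With pandas and scikit-learn, the same ideas are usually expressed with a grouped `shift(1)` followed by `rolling(10).mean()`, and with `sklearn.model_selection.TimeSeriesSplit`; whether the project used exactly those calls is not stated.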

Why The Results Are Hard

The data itself shows structural difficulty: draws make up only 24.8% of rows (1,564 of 6,312), and home-perspective outcomes are skewed toward wins (3,096 wins vs. 1,652 losses). In this setup, direct classification tends to overpredict wins and underperform on draws and losses.
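
The class counts quoted above are enough to work out the class shares, the accuracy of a trivial "always predict win" baseline, and a set of inverse-frequency class weights. The sketch below does that arithmetic. The weighting formula is the common `n_samples / (n_classes * class_count)` heuristic (the same one behind scikit-learn's `class_weight="balanced"`); whether the project applied any class weighting is not stated, so treat that part as an assumption.

```python
# Outcome counts as reported in the text
counts = {"win": 3096, "loss": 1652, "draw": 1564}
total = sum(counts.values())  # 6312 rows

# Share of each outcome in the data
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most common class
majority_baseline = max(shares.values())

# Inverse-frequency weights: rarer classes get weights above 1
class_weights = {
    label: total / (len(counts) * n) for label, n in counts.items()
}

for label in counts:
    print(f"{label:5s} share={shares[label]:.1%} weight={class_weights[label]:.3f}")
print(f"majority baseline accuracy: {majority_baseline:.1%}")
```

On these counts the shares are roughly 49.0% wins, 26.2% losses and 24.8% draws, so the always-win baseline sits near 49%. If the evaluation split has a similar class mix, which the source does not confirm, the 55% regression accuracy is above that baseline and the 42% classification accuracy is below it.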

MLS has also changed quickly during this window (2018-2024): the number of teams in the data rises from 23 to 29, the league's roster mechanisms are unusual and still evolving (Designated Players, General and Targeted Allocation Money, the U22 Initiative), and competition formats have changed. Those shifts create non-stationarity, which makes historical patterns less reliable for ML models.

Interactive Team Explorer

Pick a team and season to inspect per-match performance and compare it to that season's league average.


Visual Results

Figure: Regression model selection

Figure: Regression feature importance, team 1

Figure: Regression feature importance, team 2

Figure: Random Forest + XGBoost classification report

Figure: Feature importance, Random Forest

Figure: Feature importance, XGBoost