
Methodology · NHL Goal Predictor

Eight ways to guess who scores tonight

Predicting goal scorers is a question almost every model gets partly right and none gets fully right. This site runs eight of them in parallel — a sportsbook baseline, a hand-tuned linear formula, a Monte Carlo simulation, a tree-based xG model, a lineup-aware variant, two neural networks, and a stacked meta-ensemble that ties them together. Here is what each one actually sees, what it computes, and where it tends to fail.

The problem, stated honestly

On any given NHL game day there are roughly 12 to 24 forwards and defensemen with a non-trivial chance of scoring a goal. The marginal pick — the difference between the player ranked 25th and the player ranked 60th in tonight's slate — usually comes down to a half-percent of probability. That is well below the noise floor of any stats-only model. The interesting question is not "who will score" in absolute terms; it is "whose probability is higher than the market thinks it is, and by how much."

That framing matters because it shapes every modeling choice on this site. I am not trying to outpredict a sportsbook on the obvious favorites — Connor McDavid is going to be near the top of every model on every list every night. I am trying to find the depth-line player whose recent shooting profile, opponent matchup, and lineup role disagree productively with the consensus.

To do that I run eight different models, each one biased toward a different signal:

  1. Market Odds v1 (market) — bookmaker-implied probabilities
  2. Weighted Linear v1 (baseline) — hand-tuned linear formula
  3. Monte Carlo v2 (simulation) — 10,000-sim Poisson assignment
  4. xG XGBoost v3 (tree) — per-shot xG from real play-by-play
  5. Lineup TOI v1 (tree) — xG v3 scaled by recent ice time
  6. Neural MLP v1 (re-ranker) — sklearn MLP over MC features
  7. Neural Embed v2 (deep) — PyTorch MLP with player embeddings
  8. Meta Ensemble v1 (stacked) — LightGBM + isotonic calibration

Inputs and ground truth

Every model on the site shares the same upstream data scaffolding. Every morning the pipeline:

Ground truth comes from the NHL boxscore endpoint after games finish: data/results/{date}.json records who actually scored. That file is the spine of everything backtested on this site — every claim about model accuracy, every calibration plot, every "yesterday's top 10 vs reality" panel on the dashboard reads from it.


1. Market Odds v1 — the wisdom of the books

The Market Odds model is the simplest one on the site by code size, and structurally the hardest one to beat. Sportsbooks publish "anytime goal scorer" props for nearly every skater in every game. Each price implies a probability — convert American odds to implied probability with the standard formula, average across bookmakers (the script pulls from the US and US2 regions, which covers most major books), and you have a per-player number.

implied_prob = 100 / (odds + 100)             if odds > 0
implied_prob = abs(odds) / (abs(odds) + 100)  if odds < 0
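
In code, the conversion and the cross-book average look roughly like this (a minimal sketch; the function names are illustrative, not the production script's):

def american_to_implied(odds: int) -> float:
    # Convert an American price to an implied probability (vig still included).
    if odds > 0:
        return 100 / (odds + 100)
    return abs(odds) / (abs(odds) + 100)

def consensus_prob(prices: list[int]) -> float:
    # Average the implied probabilities quoted by each book.
    return sum(american_to_implied(p) for p in prices) / len(prices)

# A player priced +120, +110 and -105 across three books:
print(round(consensus_prob([120, 110, -105]), 3))   # ~0.481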

Two things make this model strong. First, sportsbook lines embed information no public stats model can see: confirmed lineups (who is actually playing top-line tonight), starting goalies (which changes every team's scoring environment), late injury news, and sharp money that has already moved the line. Second, the lines are implicitly trained on every previous bet that was ever settled, by people whose job is to be right about exactly this question.

Two things make it weak. First, anytime-scorer prices include a vig — bookmakers' margin — so the raw averaged probabilities are inflated by a few percent across the board. The script considered several devigging schemes (per-game normalization, power-method, flat 5% shrinkage) and ended up using the raw averaged probability because the ranking is what matters more than the absolute level. Second, the market is sometimes slow on rookies and in-season call-ups whose shooting talent is genuinely mispriced.

Where this model fails

Bookmaker odds aren't published for every game (mid-week games, especially early in the week, often have thin markets). When there are no odds, this model produces no prediction for that game and the meta-ensemble has to fall back on the other features.

2. Weighted Linear v1 — the dumb baseline that refuses to die

Every machine-learning project needs a stupidly simple baseline whose only job is to embarrass the fancy models when they don't beat it. This is mine.

The formula is hand-tuned and computed per player, then scaled against the night's top raw score so the best player caps at 0.55, with every probability clipped to a floor of 0.02:

raw  = 0.40 × season_goals_per_game + 0.08 × season_shots_per_game + 0.05 × last_5_games_goals + 0.15 × season_shooting_pct
prob = clip(raw / max(raw_today) × 0.55, 0.02, 0.60)

That is it. No opponent adjustment, no home-ice factor, no goalie quality, no power-play split. The four coefficients were chosen by inspection — they roughly reflect that goals-per-game is the strongest single signal, shooting percentage matters quite a bit, and shot volume and recent form contribute smaller corrections.
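
As a sketch, the whole model fits in a few lines of Python. The field names below are illustrative, but the coefficients and clipping follow the formula above:

import numpy as np

def baseline_probs(players: list[dict]) -> list[float]:
    # Hand-tuned linear score per player, then scaled against tonight's best.
    raw = np.array([
        0.40 * p["season_goals_per_game"]
        + 0.08 * p["season_shots_per_game"]
        + 0.05 * p["last_5_games_goals"]
        + 0.15 * p["season_shooting_pct"]
        for p in players
    ])
    return np.clip(raw / raw.max() * 0.55, 0.02, 0.60).tolist()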

This model has no business being competitive, and yet it consistently lands in the middle of the pack on hit rate. That tells me something useful: most of the predictive signal in this problem is just "who scores a lot." Models that win on top of this baseline are the ones that find the small but real edge from matchup, deployment, and form. Models that lose to it are usually overfitting.

3. Monte Carlo v2 — simulating the game

The Monte Carlo model treats each game as a stochastic process and simulates it 10,000 times. Two pieces fit together.

Team scoring rate. First it computes how many goals each team is expected to score in this matchup:

team_xG = league_avg × (team_GF_per_game / league_avg) × (opp_GA_per_game / league_avg) × home_factor
home_factor = 1.026 if home else 0.974

League average sits at 3.07 goals per team per game (from the 2024–25 season). Home teams score about 2.6% more than the league average and away teams 2.6% less, which is a smaller home-ice advantage than older studies suggested.

Per-player goal share. Then for each player on each team it computes a weight that determines how likely that player is to take any given goal:

weight = (season_GPG ^ 1.8) × form_boost × (1 + min(shots_per_game × 0.02, 0.15)) × (1 + (pp_goals / season_goals) × 0.10)

Two design choices matter here. The exponent of 1.8 is the most important parameter on the entire site. NHL goal scoring is heavily skewed — a 40-goal scorer is not "10× as likely" as a 4-goal scorer, but the linear baseline implicitly treats them that way. Raising goals-per-game to a power greater than 1 pulls the distribution back toward what you actually observe in nightly box scores: stars score much more disproportionately than their season averages alone suggest.

The form boost is an asymmetric multiplier: hot players (last-five GPG > season GPG × 1.5) get a small bonus capped at +30%; cold players get pulled toward season average by 0.6 + 0.4 × form_ratio. Players with no season goals but recent goals get a flat 1.3× boost so call-ups aren't permanently zeroed.

The simulation. For each of the 10,000 sims:

  1. Draw the team's total goals from Poisson(team_xG).
  2. For each goal, randomly pick a scorer using the normalized weight vector as a categorical distribution.
  3. Mark which players got at least one goal in this sim.

A player's predicted probability is just the fraction of sims in which they scored at least once. The strength of this model is that it captures squad-level uncertainty: a high-weight star on a low-xG team gets penalized correctly, and a low-weight depth player on a high-xG night gets the lift they should get.
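
A condensed sketch of that loop, assuming the team xG and the per-player weights described above are already computed (names and structure are illustrative, not the production script's):

import numpy as np

rng = np.random.default_rng(0)
LEAGUE_AVG = 3.07  # goals per team per game, 2024-25

def team_xg(gf_per_game: float, opp_ga_per_game: float, home: bool) -> float:
    # Expected team goals for this matchup, per the formula above.
    home_factor = 1.026 if home else 0.974
    return LEAGUE_AVG * (gf_per_game / LEAGUE_AVG) * (opp_ga_per_game / LEAGUE_AVG) * home_factor

def scorer_probs(xg: float, weights: np.ndarray, n_sims: int = 10_000) -> np.ndarray:
    # P(player scores at least once), estimated over n_sims simulated games.
    # `weights` are the per-player goal-share weights (GPG^1.8, form, shot,
    # power-play terms) computed upstream; here they are taken as given.
    p = weights / weights.sum()
    scored = np.zeros(len(weights))
    for _ in range(n_sims):
        goals = rng.poisson(xg)                        # 1. draw team total
        if goals == 0:
            continue
        scorers = rng.choice(len(p), size=goals, p=p)  # 2. assign each goal
        scored[np.unique(scorers)] += 1                # 3. at-least-once flag
    return scored / n_sims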

4. xG XGBoost v3 — a per-shot model

The xG model answers a different question: given the geometry and context of a specific shot, what is the probability it goes in?

It is an XGBoost classifier trained on the play-by-play shot table accumulated nightly into data/xg_training/shots.csv. The hyperparameters are unremarkable — 200 trees, max depth 5, learning rate 0.1, subsample 0.8 — chosen so the model has enough capacity for shot-context interactions without wandering into overfitting territory on the sub-100k-row training set.

The interesting part is the feature set:

The model outputs an xG per shot. To turn that into a per-game scorer probability, the prediction script:

  1. Estimates the player's expected shot count tonight from their season shots-per-game, scaled by an opponent factor (opp_GA / league_avg), a home/away factor, and a recent-form factor.
  2. Pulls up to the player's 60 most recent real shots from data/player_shots/{pid}.json (the aggregator caps storage at 60 newest-first), scores each through the XGBoost model, and takes the mean to get a per-shot xG estimate.
  3. Computes total xG = mean per-shot xG × expected shots.
  4. Converts to a goal probability assuming Poisson scoring: P(≥1 goal) = 1 − e^(−total_xG).
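
Put together, steps 1 through 4 look roughly like this. It is a sketch that assumes the trained classifier and the player's cached shot features are already loaded; the parameter names are illustrative:

import numpy as np

def game_scoring_prob(model, recent_shots: np.ndarray, season_shots_pg: float,
                      opp_ga: float, league_avg_ga: float,
                      home_factor: float, form_factor: float) -> float:
    # 1. Expected shot count tonight (opponent, venue, recent form).
    expected_shots = season_shots_pg * (opp_ga / league_avg_ga) * home_factor * form_factor
    # 2. Mean xG over the player's cached recent shots (up to 60 of them).
    per_shot_xg = model.predict_proba(recent_shots)[:, 1].mean()
    # 3. Total xG for the game, then 4. Poisson conversion to P(at least one goal).
    total_xg = per_shot_xg * expected_shots
    return float(1.0 - np.exp(-total_xg))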

For players with no recent shots in the database — rookies, just-promoted call-ups — the model falls back to a position prior (forwards have a different baseline shot profile than defensemen). If even that is missing, it generates 20 synthetic shots from the player's estimated profile, scores those, and tags the prediction's xg_source field so anyone reading the JSON can see the prediction came from a fallback rather than real data.

5. Lineup TOI v1 — the same xG, told who is actually playing

One of the longest-standing weaknesses of every public goal-scorer model is that it ranks players on their season-long ice time and production even when the team has scratched them, demoted them to the fourth line, or just called up a different forward. Pure "season averages" are blind to tonight's deployment.

Lineup TOI v1 is the same script as xG XGBoost v3 (same training, same features, same Poisson conversion) with one extra step. After computing expected shots, it multiplies by a TOI factor:

toi_factor = clip(recent_TOI / season_TOI, 0.05, 1.5)

A healthy scratch with effectively zero recent TOI gets driven to a 0.05 multiplier — effectively dropped. A bottom-six guy bumped to a top-line role for the night gets boosted up to 1.5×. The clamp matters: outside that range you are mostly looking at noise (TOI estimates are noisy at the day level), so the factor refuses to multiply or divide by extreme values.
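
In code the adjustment is a one-liner applied to the expected-shot estimate from xG v3 (a sketch; names are illustrative):

import numpy as np

def toi_adjusted_shots(expected_shots: float, recent_toi: float, season_toi: float) -> float:
    # Clamp the ratio so day-level TOI noise can't swing the estimate too far.
    toi_factor = float(np.clip(recent_toi / season_toi, 0.05, 1.5))
    return expected_shots * toi_factor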

This is the model that catches the situations where xG XGBoost v3 quietly embarrasses itself by ranking a guy who was scratched yesterday at 11% to score tonight.

Advertisement

6. Neural MLP v1 — a re-ranker over Monte Carlo

The first neural model on the site is intentionally modest. It is a scikit-learn MLPClassifier with a StandardScaler in front of it, trained on the same player-level rows the Monte Carlo model already produces, with the actual game outcome as the binary target.

Eleven features, in this order (the scaler is positional, so order matters):

At prediction time, the script reads Monte Carlo v2's latest.json, builds the feature matrix in the trained order, scales it, and rescores every player. The output preserves the original Monte Carlo probability as mc_probability so the dashboard can show the delta between the simulation and the neural re-rank.
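
A minimal sketch of that train-and-rescore shape, assuming the eleven-column feature matrix is already built. The hidden-layer sizes here are illustrative assumptions, not the production values:

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler + MLPClassifier over the eleven Monte Carlo features.
reranker = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)

def fit_and_rescore(X_train: np.ndarray, y_train: np.ndarray,
                    X_tonight: np.ndarray) -> np.ndarray:
    # X matrices must use the same eleven-column order the scaler was fit on;
    # y is 1 if the player scored in that historical game, else 0.
    reranker.fit(X_train, y_train)
    return reranker.predict_proba(X_tonight)[:, 1]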

This model is useful precisely because it shares the Monte Carlo player pool. You learn whether the neural network is consistently moving probabilities up or down for certain feature combinations, and that disagreement is itself a signal — captured later by the meta-ensemble's model_disagreement feature.

7. Neural Embed v2 — letting the network learn each player

Neural Embed v2 is the most ambitious model on the site, and structurally the most different from everything else. Instead of operating on aggregated player stats, it works at the level of individual shots, with a learnable per-player embedding so the network can capture player-specific shooting tendencies that are invisible to the xG model's hand-crafted features.

The architecture, in PyTorch, pairs a learnable per-player embedding table with a small feed-forward network over the per-shot features.
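
A minimal sketch of that shape; the embedding dimension, hidden width, dropout rate, and feature count are illustrative assumptions, not the production values:

import torch
import torch.nn as nn

class ShotEmbedNet(nn.Module):
    def __init__(self, n_players: int, n_shot_features: int,
                 embed_dim: int = 16, hidden: int = 64, dropout: float = 0.2):
        super().__init__()
        # Index 0 is reserved for players not seen during training.
        self.embed = nn.Embedding(n_players, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + n_shot_features, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, player_ids: torch.Tensor, shot_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate the player's learned embedding with the shot features,
        # return P(this shot becomes a goal).
        x = torch.cat([self.embed(player_ids), shot_feats], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)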

Training: Adam optimizer at 3e-4, batch size 256, max 50 epochs with early stopping at patience 5, 10% validation holdout. It is trained on the same shot CSV that trains xG XGBoost v3. Dropout is set to 0 at inference time.

At prediction time, for each player on tonight's slate the script pulls their cached recent shots, runs each one through the network with their embedding ID, averages the per-shot probabilities, and converts to a per-game probability via the same 1 − e^(−total_xG) Poisson assumption used by xG v3. Players not seen during training get embedding index 0 plus a positional prior on the shot side.

The point of the embedding is to let the model represent things like "this defenseman tends to shoot from way out but his shots go in more often than the geometry alone predicts" — a pure shot-level xG model cannot represent that, because it does not know whose shot it is looking at.

8. Meta Ensemble v1 — stacking and calibration

The Meta Ensemble is what closes the loop. It is a LightGBM classifier with isotonic calibration on top, trained on historical predictions from the base models alongside the actual outcomes.

For each player on each historical night, the training row contains:

The target is binary: did this player score at least one goal in the actual game? LightGBM is trained with a time-series split — earlier data trains the trees, the most recent 20% of the training window is held back for calibration.

Calibration matters more than people think. Even a model with great ranking can have miscalibrated probabilities — e.g. its "20% picks" might actually score 14% of the time. That gap looks small but compounds badly in any decision built on top of the probability. The ensemble fits an IsotonicRegression on the held-out 20% mapping raw predictions to observed frequencies, then saves both the LightGBM model and the calibrator so inference is just cal(model.predict(X)).
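
A sketch of the stack-then-calibrate step, assuming the historical feature rows are already in time order (hyperparameters omitted; names are illustrative):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.isotonic import IsotonicRegression

def fit_stacked(X: np.ndarray, y: np.ndarray):
    # Time-ordered rows: earlier 80% trains the trees, latest 20% fits the calibrator.
    split = int(len(X) * 0.8)
    model = LGBMClassifier().fit(X[:split], y[:split])
    raw = model.predict_proba(X[split:])[:, 1]
    cal = IsotonicRegression(out_of_bounds="clip").fit(raw, y[split:])
    return model, cal

def predict_calibrated(model, cal, X_tonight: np.ndarray) -> np.ndarray:
    # Inference: raw LightGBM probability mapped through the isotonic calibrator.
    return cal.predict(model.predict_proba(X_tonight)[:, 1])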

A note on the "disagreement" feature

One of the more interesting things the meta-model has learned is that high disagreement across the stats-based models is itself information. When all three stats models agree closely, the consensus pick is usually right. When they disagree sharply, the market is often a better tiebreaker than any of them individually — and the ensemble has implicitly learned to weight the market more heavily in those cases.

Backtesting and validation

None of the above is worth anything without continuous backtesting. The fetch_results.py script runs after every game day and writes that night's actual goal scorers to data/results/{date}.json. The validate_predictions.py script then joins predictions to outcomes and computes per-model metrics: top-N hit rate, Brier score, log-loss, calibration curves, and the most useful diagnostic — "in the last 30 days, how often did the meta-ensemble's top pick actually score?"
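
The two workhorse metrics are simple enough to sketch directly (names are illustrative):

import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    # Mean squared error between predicted probability and the 0/1 outcome.
    return float(np.mean((probs - outcomes) ** 2))

def top_n_hit_rate(probs: np.ndarray, outcomes: np.ndarray, n: int = 10) -> float:
    # Fraction of the night's top-N picks who actually scored.
    top = np.argsort(probs)[::-1][:n]
    return float(outcomes[top].mean())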

The dashboard surfaces a daily health report at data/health_report.json that flags any model whose recent calibration has drifted past acceptable bounds. There is a hard top-probability ceiling (currently 0.95) that catches cases where a model produces an obviously wrong >95% probability — a known failure mode in stacked models when one base predictor returns extreme values.

What this site is not doing

A few honest limits:

What the site is doing is publishing the probabilities, the model disagreement, and the historical accuracy openly so anyone can reproduce, criticize, or improve on them. The full source is on GitHub; tonight's predictions are on the dashboard.

Last updated 2026-05-03. Models, features, and calibration thresholds drift; this page describes the current production pipeline, not its historical state.