About the project

Eight models in a Quebec apartment

This project began as a private spreadsheet for the Tim Hortons Hockey Challenge and ended up as a daily-publishing prediction stack with eight models, a stacked meta-ensemble, and a calibration validator that tells me when my own models are lying. This page is the story of how it got there, what runs every morning, and what I have not yet built.

Why this exists

The Tim Hortons Hockey Challenge presents three groups of five NHL skaters per night and asks you to pick one player from each group, aiming for picks whose total goals + assists for the night beat those of the players you passed over. It is mostly a casual game. It is also, structurally, a forced-choice ranking task — exactly the shape of problem statistics actually gives you traction on.

The first version of this project lived in a Google Sheet. I was averaging a few features by hand, scoring the player pool, and submitting picks. That worked badly enough — and was tedious enough — that I rewrote it as a Python script. The script worked well enough that I added another model. Then the second model disagreed with the first model often enough that I added a third to break ties. Then I noticed the first model was actually better than the third on certain matchups, and the only way to learn which matchups was to keep both models and measure. Roughly that pattern, repeated, is how I ended up with eight.

The whole point of running this many models is that no single approach to this problem dominates. A market-implied probability is great information but is silent on a third of the night's games where the books haven't posted. A pure xG model is sharp on shot quality but blind to whether a guy was a healthy scratch yesterday. A Monte Carlo simulation captures squad-level uncertainty but is mediocre at picking between two depth-line wingers. Each model fills in where another is weak.

The eight models, one paragraph each

Market Odds v1 averages anytime-goal-scorer odds across major US sportsbooks and converts them to probabilities. Strongest single model on most nights; missing entirely on others.
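The odds-to-probability step works roughly like this. A minimal sketch: `american_to_prob` is the standard American-odds conversion, and `consensus_prob` is a hypothetical aggregation helper (the site's actual averaging and de-vigging details aren't spelled out here).

```python
def american_to_prob(odds: int) -> float:
    """Convert American odds to an implied probability (still includes the book's vig)."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

def consensus_prob(book_odds: list[int]) -> float:
    """Hypothetical aggregation: average the implied probabilities across books."""
    return sum(american_to_prob(o) for o in book_odds) / len(book_odds)

# The same player priced +150 at one book and +140 at another
p = consensus_prob([150, 140])  # ~0.408
```

Averaging probabilities rather than odds keeps one outlier price from dominating the consensus.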

Weighted Linear v1 is a hand-tuned formula on season averages — the dumb baseline whose job is to embarrass the fancy models when they don't beat it. It often does.

Monte Carlo v2 simulates each game 10,000 times: Poisson-draw the team's goals, then assign each goal to a player weighted by season scoring rate raised to the 1.8 power. The exponent is the trick: it captures how disproportionately stars score relative to what their averages alone suggest.
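The core loop is small enough to sketch in full. This is a stdlib-only illustration of the scheme described above, not the site's actual code; the player names, the seed, and the team goal rate are made up.

```python
import math
import random

def poisson_draw(lam: float, rng: random.Random) -> int:
    """Knuth's algorithm for a Poisson-distributed integer with mean lam."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_game(team_lambda, scoring_rates, n_sims=10_000, exponent=1.8, seed=42):
    """Estimate P(each player scores >= 1 goal) over n_sims simulated games."""
    rng = random.Random(seed)
    players = list(scoring_rates)
    # Season goals-per-game raised to the 1.8 power: stars absorb more than
    # their linear share of the team's goals.
    weights = [scoring_rates[p] ** exponent for p in players]
    scored_in = {p: 0 for p in players}
    for _ in range(n_sims):
        goals = poisson_draw(team_lambda, rng)
        scorers = set(rng.choices(players, weights=weights, k=goals)) if goals else set()
        for p in scorers:
            scored_in[p] += 1
    return {p: scored_in[p] / n_sims for p in players}

probs = simulate_game(3.0, {"star": 0.6, "middle": 0.3, "depth": 0.1})
```

With the exponent at 1.8, a player scoring 0.6 goals per game gets well over twice the per-goal weight of a 0.3 player, which is the whole point.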

xG XGBoost v3 is a per-shot expected-goals model trained on play-by-play data and converted to a per-game probability via the player's expected shot count. The conversion is a Poisson assumption: P(≥1 goal) = 1 − e^(−total_xG).
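The Poisson conversion at the end is one line. A sketch under the stated assumption (per-shot xG values summed into a total, then exponentiated); the example shot values are invented.

```python
import math

def prob_at_least_one_goal(per_shot_xg: list[float]) -> float:
    """Poisson assumption: P(>=1 goal) = 1 - exp(-total_xG)."""
    total_xg = sum(per_shot_xg)
    return 1.0 - math.exp(-total_xg)

# Five expected shots at ~0.06 xG each -> total_xG = 0.30
p = prob_at_least_one_goal([0.06] * 5)  # ~0.259
```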

Lineup TOI v1 is the same model as xG XGBoost v3, with one extra step that scales expected shots by the ratio of the player's recent ice time to their season average. It catches the cases where everyone else is still ranking a player who got benched yesterday.
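The ice-time adjustment can be sketched as a single scaling factor applied before the Poisson conversion. The function name and the numbers are illustrative, not taken from the site's code.

```python
import math

def toi_adjusted_prob(total_xg: float, recent_toi: float, season_toi: float) -> float:
    """Scale expected shot volume (and hence total xG) by the recent/season
    ice-time ratio, then apply the usual Poisson conversion."""
    ratio = recent_toi / season_toi
    return 1.0 - math.exp(-total_xg * ratio)

# A player demoted from 18 to 12 minutes a night: the probability drops,
# even though every season-average feature still looks the same.
p_season = 1.0 - math.exp(-0.30)
p_recent = toi_adjusted_prob(0.30, recent_toi=12.0, season_toi=18.0)
```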

Neural MLP v1 is a small scikit-learn neural network that re-ranks the Monte Carlo output using player-level aggregate features. It learns where Monte Carlo is consistently too generous or too stingy.

Neural Embed v2 is a PyTorch MLP with learnable 32-dimensional player embeddings on top of shot-level features. It's the only model on the site that can represent "this defenseman's shots go in more often than the geometry alone predicts."

Meta Ensemble v1 stacks four base models (linear, MC, xG, market) plus disagreement, goalie, and matchup features into a LightGBM classifier with isotonic calibration on top. This is the model the site treats as canonical.
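The isotonic-calibration layer on top of the classifier is worth a sketch. This is the classic pool-adjacent-violators algorithm in plain Python, standing in for whatever calibration implementation the site actually uses: it fits a nondecreasing map from raw meta-model scores to calibrated probabilities.

```python
def isotonic_fit(scores, outcomes):
    """Pool Adjacent Violators: fit a nondecreasing mapping from raw scores
    to calibrated probabilities. Returns (sorted_scores, fitted_values)."""
    pairs = sorted(zip(scores, outcomes))
    xs = [x for x, _ in pairs]
    # Each block holds [running mean of outcomes, weight].
    merged = []
    for _, y in pairs:
        merged.append([y, 1])
        # Merge backwards while monotonicity is violated.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            b2, b1 = merged.pop(), merged.pop()
            w = b1[1] + b2[1]
            merged.append([(b1[0] * b1[1] + b2[0] * b2[1]) / w, w])
    fitted = []
    for mean, w in merged:
        fitted.extend([mean] * w)
    return xs, fitted

# Toy data: raw scores vs. did-the-player-score outcomes
xs, fitted = isotonic_fit([0.1, 0.4, 0.2, 0.3], [0, 1, 1, 0])
```

In production you would fit this on a held-out set of meta-model scores and interpolate between the fitted points at prediction time.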

The technical writeup of every model — features, formulas, hyperparameters, failure modes — lives in methodology.

What the daily pipeline actually does

Every morning, a chain of GitHub Actions cron jobs runs, scheduled around when the day's matchups stabilize and lineup signals start to firm up. Times below are UTC; the schedule is timed to morning Eastern.

06:00 · Results pipeline writes yesterday's actual goal scorers to data/results/ from the NHL boxscore endpoint.
14:00 · Monte Carlo v2 runs first; it builds the canonical player pool that downstream models reuse.
14:05 · xG XGBoost v3 + Lineup TOI v1 run from the shared shot pipeline.
14:30 · Market Odds v1 fetches anytime-goal-scorer prices from The Odds API.
15:00 · Weighted Linear v1 runs (legacy schedule, kept independent so it's never gated by the others).
15:20 · Meta Ensemble v1 reads all base predictions and produces the calibrated final ranking.
16:30 · Health Check validates every model's outputs and writes data/health_report.json.
Monthly · Neural v2 trainer retrains the player-embedding model on accumulated shot data.

Each step writes its output as JSON into data/predictions/{model_name}/ and commits to the repo. The dashboard you land on at the predictor page is plain static HTML that fetches those JSON files at load time. There is no backend; there is no database; there is a pipeline of small Python scripts and a folder of JSON files.
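Each step's write looks roughly like this. A minimal sketch of the JSON-to-folder convention described above; the `latest.json` filename is my assumption, not something the site documents.

```python
import json
from pathlib import Path

def write_predictions(model_name: str, predictions: list[dict], root: str = "data") -> Path:
    """Write one model's daily predictions as JSON under the
    data/predictions/{model_name}/ layout. Filename is hypothetical."""
    out_dir = Path(root) / "predictions" / model_name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "latest.json"
    out_file.write_text(json.dumps(predictions, indent=2))
    return out_file
```

The static dashboard then just `fetch()`es these files; with no backend, committing the JSON to the repo is the whole deployment step.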

What running this for a season has taught me

A few things that were not obvious to me before:

The market is the strongest single feature, but it isn't a ceiling. The Meta Ensemble outperforms the market alone often enough to justify the rest of the stack — most of the lift comes from cases where the market hasn't priced a lineup change or a goalie matchup that the other models see clearly.

Calibration drifts. Models trained months ago against a different scoring environment quietly become overconfident — they keep predicting "20% to score" with the same calibration as before, but the league's overall scoring rate has shifted, or the opposing-goalie quality distribution has changed. The validator now flags this. The most recent calibration tweak, raising the hard top-probability ceiling from 0.85 to 0.95, was driven by exactly this kind of drift.
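The ceiling change mentioned above is just a clamp on the final probabilities. A sketch; the 0.01 floor is my assumption, only the 0.85 → 0.95 ceiling change comes from the text.

```python
def clamp_probability(p: float, floor: float = 0.01, ceiling: float = 0.95) -> float:
    """Clamp a model's output probability into [floor, ceiling].
    The ceiling was raised from 0.85 to 0.95 after the validator
    flagged drift; the floor value here is hypothetical."""
    return max(floor, min(ceiling, p))
```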

API throttling will eat you alive. The early version of this pipeline fetched every player's stats from scratch in every model. With six models running, that meant six separate hits on the same NHL endpoint per player per day. The fix was a per-day shared roster cache — predictable, boring, and the single biggest reliability improvement of the project.
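A per-day file cache like the one described can be sketched in a dozen lines. The function signature and filename scheme are assumptions; the point is that every model funnels through one fetch per team per day.

```python
import json
from datetime import date
from pathlib import Path

def get_roster(team: str, fetch, cache_dir: str = "data/cache") -> dict:
    """Fetch a team roster at most once per day; later calls read the cached
    file. `fetch` is whatever function actually hits the NHL endpoint."""
    cache = Path(cache_dir) / f"{date.today().isoformat()}_{team}.json"
    if cache.exists():
        return json.loads(cache.read_text())
    roster = fetch(team)
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(roster))
    return roster
```

Because the cache key includes the date, stale rosters expire on their own with no invalidation logic at all.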

Real shots beat synthetic shots by a lot. The xG model originally generated 20 fake shots per player from a position-based profile and scored those. Replacing that with the player's real recent shots from play-by-play data was the biggest predictive lift the site has had — and the simplest technical change. The fanciest model on this site is not the neural network with embeddings; it's the one that finally got to look at real data.

What I haven't built yet

Three upgrades are deferred, in roughly the order I expect to tackle them. None of them are urgent. The pipeline as it stands runs reliably, the meta-ensemble is calibrated, and the daily output is good enough that I find it interesting most nights. The deferred list exists so that when I do come back to expand the project, I am not picking from feel.

Where to read more

Last updated 2026-05-03. Model schedule and feature lists drift as the project evolves; this page reflects the current pipeline, not its historical state.