NFL Outcome Predictor — self-grading track record

How to read this page

This page shows how our prediction model did across the entire 2025 NFL season. Before each game, every model named a favorite and a win chance (e.g. “Eagles 65%”), using only information known beforehand. We then grade those guesses against what actually happened — and against the Vegas odds, the gold standard.

The lines / rows you'll see:

Our model (QBElo): this is our prediction. A power rating like the Elo ratings used in chess, adjusted when a backup quarterback starts.
Vegas: the betting odds converted to a clean win %. The benchmark we measure ourselves against (almost nobody beats it consistently).
Elo and ML: two simpler or alternative versions we built for comparison.
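
One common way to turn betting odds into a "clean" win % is to convert each side's moneyline to an implied probability and then rescale the pair so they sum to 1, which strips out the bookmaker's margin. The sketch below assumes American moneyline odds and simple proportional rescaling; the page does not say which method the project uses, and the example odds are made up.

```python
def implied_prob(moneyline: int) -> float:
    """Raw implied probability from an American moneyline (margin still included)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def no_vig_probs(favorite_line: int, underdog_line: int):
    """Rescale the two raw probabilities so they sum to exactly 1."""
    p_fav = implied_prob(favorite_line)
    p_dog = implied_prob(underdog_line)
    total = p_fav + p_dog  # exceeds 1 by the bookmaker's margin
    return p_fav / total, p_dog / total


# Hypothetical odds: favorite at -180, underdog at +155.
fav, dog = no_vig_probs(-180, 155)
print(f"favorite {fav:.3f}, underdog {dog:.3f}")
```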

The grades:

Accuracy — how often it picked the right winner. Higher is better.
Brier / Log loss — how good the percentages were (saying “90%” and being wrong is penalized hard). Lower is better.
Calibration (the chart below) — when a model says “70%”, do those teams really win about 70% of the time? On the diagonal line = honest.
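
For readers who want the grades spelled out, here is a minimal sketch of the three headline metrics using their standard definitions. Each probability is the model's chance for one side (say, the home team) and each outcome is 1 if that side won, else 0. The four example games are made up and are not results from the tracker.

```python
import math


def brier(probs, outcomes):
    """Mean squared gap between the stated probability and what happened."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; confident misses are punished heavily."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def accuracy(probs, outcomes):
    """Share of games where the side given more than 50% actually won."""
    hits = sum((p > 0.5) == (y == 1) for p, y in zip(probs, outcomes))
    return hits / len(probs)


# Made-up example: four games, one confident miss (0.90 on a loser).
probs = [0.65, 0.90, 0.55, 0.30]
outcomes = [1, 0, 1, 0]
print(brier(probs, outcomes), log_loss(probs, outcomes), accuracy(probs, outcomes))
```

The single 90% miss contributes 0.81 of the 1.225 total squared error, which is why overconfidence costs so much on these grades.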

Bottom line: the homemade models get close to Vegas but don’t beat it — the honest, expected result.

How it works

1. The model learns in a loop

Every team has a power rating → Predict each game's win % → Games are played → Grade the guess → Nudge ratings ↻

Beat a strong team and your rating jumps; lose to a weak one and it drops. Repeat every week and the ratings sort the league into a power order — no human opinions needed.
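
The loop above can be sketched in a few lines. This is a minimal sketch of a standard Elo update, not the project's actual code: the step size `K`, the 400-point scale, and the team names are assumptions, and the home-field bonus and backup-quarterback adjustment are left out.

```python
K = 20.0  # assumed step size: how hard one result nudges the ratings


def expected_score(rating_a, rating_b):
    """Predicted chance that team A beats team B (standard Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser):
    """Grade the pre-game guess, then move both ratings by the surprise."""
    p_win = expected_score(ratings[winner], ratings[loser])
    shift = K * (1.0 - p_win)  # small if the win was expected, large if not
    ratings[winner] += shift
    ratings[loser] -= shift
    return p_win


start = {"Strong": 1600.0, "Weak": 1400.0}  # hypothetical teams

upset = dict(start)
update(upset, winner="Weak", loser="Strong")

chalk = dict(start)
update(chalk, winner="Strong", loser="Weak")

print("gain from upset:", upset["Weak"] - start["Weak"])
print("gain from expected win:", chalk["Strong"] - start["Strong"])
```

The underdog's gain from the upset is roughly three times the favorite's gain from the expected win, which is the "beat a strong team and your rating jumps" behaviour described above.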

2. Turning two ratings into a win chance

The core trick: take the gap between two teams' ratings and bend it into a probability. On the live page, a slider sets how much stronger the better team is in rating points (100 by default) and shows the stronger team's win chance. Home-field advantage alone is worth about 48 rating points.
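
As an illustration of the "bend", the sketch below uses the standard Elo logistic curve with a 400-point scale. The scale is an assumption (the page does not state it); the 48-point home bonus is the approximate figure quoted on this page. Under that assumption, a 100-point gap works out to roughly a 64% win chance.

```python
HOME_ADVANTAGE = 48.0  # approximate home bonus quoted on this page
SCALE = 400.0          # assumed: the conventional Elo scale


def win_chance(rating_gap):
    """Map (stronger rating - weaker rating) to a win probability."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / SCALE))


def home_win_chance(home_rating, away_rating):
    """Same curve, with the home bonus added to the home side's rating."""
    return win_chance(home_rating - away_rating + HOME_ADVANTAGE)


for gap in (0, 48, 100, 200):
    print(f"gap {gap:>3} pts -> {win_chance(gap):.1%}")
```

The curve flattens as the gap grows, so each extra rating point buys less additional win probability.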

3. A real game from 2025

…

4. Watch a team's power rating over the season

It rises after wins (especially big ones over good teams) and falls after losses. 1500 = average; higher = stronger.

Season totals

Model	Brier ↓	Log loss ↓	Accuracy ↑	Games

Brier and log loss grade the percentages (lower = better); accuracy is just how often the winner was picked. Market = the Vegas odds (the ceiling); baseline = always guess the home team.

Calibration curve

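When a model says “70%,” do those teams really win about 70% of the time? Points on the diagonal are honest. For predictions above 50%, a point below the line means the model was overconfident (its teams won less often than it claimed), and a point above means it was underconfident.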
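
A calibration curve like this is usually built by grouping predictions into probability bins and comparing each bin's average prediction with its actual win rate. The sketch below assumes ten equal-width bins; the bin count and the example data are assumptions, not the tracker's settings.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean predicted %, actual win rate, game count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, win_rate, len(members)))
    return rows


# Made-up example: three picks in the 70-80% bin, one in the 30-40% bin.
rows = calibration_table([0.72, 0.71, 0.74, 0.31], [1, 0, 1, 0])
for mean_pred, win_rate, n in rows:
    print(f"said {mean_pred:.0%}, won {win_rate:.0%} ({n} games)")
```

Plotting `mean_pred` on the x-axis against `win_rate` on the y-axis gives the points that are compared with the diagonal.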

Track record through the season

Each line is a model's running grade on its percentages (Brier — lower is better). Vegas (red) stays the best; the homemade models bunch up just above it.
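
A running grade of this kind can be computed as a cumulative average of per-game squared errors. This is a sketch of that idea on made-up inputs; whether the page updates the line per game or per week is not stated, so per-game updating is an assumption here.

```python
from itertools import accumulate


def running_brier(probs, outcomes):
    """Brier score over the first 1, 2, 3, ... games, in playing order."""
    errors = [(p - y) ** 2 for p, y in zip(probs, outcomes)]
    return [total / n for n, total in enumerate(accumulate(errors), start=1)]


# Made-up example: a coin-flip pick that won, then a certain pick that won.
curve = running_brier([0.5, 1.0], [1, 1])
print(curve)
```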

Weekly breakdown

Click a week to see every game's prediction.

How it was built

The point of this project isn't a secret formula — it's the process: build progressively smarter models and measure each one honestly, never peeking at the future, always benchmarked against Vegas. See the full pipeline & data flow →

The climb — each model, scored on games it had never seen (2019–2024)

Win-pick accuracy (50% = a coin flip). Each step adds sophistication and gets closer to Vegas — none beats it.

The pipeline

Free NFL data (nflverse) → Build features (Elo, QB value, rolling EPA) → Train + back-test (walk-forward, no leakage) → Score vs Vegas → Live tracker (this page)
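
The "rolling EPA" feature in the pipeline is a moving average of a team's recent expected points added. The important detail is that each game's feature uses only earlier games. The sketch below shows one way to do that; the window of four games and the sample values are assumptions for illustration, not the project's settings.

```python
from collections import deque


def rolling_prior_mean(values, window=4):
    """For each game, the mean of up to `window` PREVIOUS values.

    The current game's own value is appended only after its feature is
    computed, so the feature never contains the result it helps predict.
    """
    history = deque(maxlen=window)
    features = []
    for value in values:
        features.append(sum(history) / len(history) if history else None)
        history.append(value)
    return features


# Made-up per-game EPA values for one team, in playing order.
feats = rolling_prior_mean([1.0, 3.0, 5.0], window=2)
print(feats)
```

The first game has no history, so its feature is `None`; a real pipeline would need some rule for that case, such as carrying over the previous season's value.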

Build decisions that matter

No peeking at the future. Every prediction uses only games played before it. Leaking later results into a prediction is the #1 way projects like this accidentally cheat, and the easiest mistake to make.
Tried machine learning, reported the truth. Gradient-boosted and logistic models on EPA / rest features didn't beat the simple QB-aware rating. Shown honestly, not buried.
Benchmarked against the closing Vegas line — the sharpest, hardest number to beat.
Built from scratch. The rating system, the quarterback-value formula, and the calibration step are all hand-coded — no off-the-shelf prediction library.
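
The "no peeking" rule can be enforced structurally with a walk-forward back-test: for each test season, fit only on earlier seasons, then predict that season. The sketch below is a generic illustration, not the project's code. Splitting by whole season, the `fit` and `predict` callables, and the toy home-win-rate model are all assumptions.

```python
def walk_forward(games, fit, predict, first_test_season):
    """Score each season with a model fitted only on seasons before it."""
    results = []
    for season in sorted({g["season"] for g in games}):
        if season < first_test_season:
            continue
        train = [g for g in games if g["season"] < season]   # past only
        test = [g for g in games if g["season"] == season]
        model = fit(train)
        results.extend((g, predict(model, g)) for g in test)
    return results


# Toy model (hypothetical): predict the historical home-win rate.
def fit_home_rate(train):
    return sum(g["home_win"] for g in train) / len(train)


def predict_home_rate(model, game):
    return model


games = [
    {"season": 2018, "home_win": 1},
    {"season": 2018, "home_win": 0},
    {"season": 2019, "home_win": 1},
    {"season": 2020, "home_win": 1},
]
results = walk_forward(games, fit_home_rate, predict_home_rate, 2019)
print([round(p, 3) for _, p in results])
```

Because the training slice is rebuilt from strictly earlier seasons on every pass, a result from the season being scored cannot reach the model that scores it.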

Full source code, the commit history (the build step by step), and the methodology write-up live in the project repo.