At a glance
Five stages, top to bottom. The detailed map below expands each one.
Each box is a script. There are three kinds: scripts that get data and save a file, scripts that do the math and save nothing, and scripts that run the tests and the live feed.
A green chip like writes schedules.parquet means that script saves a file to disk for the next stage to pick up.
The detailed map
nflverse — free, public NFL data
Schedules & betting lines · player stats · play-by-play
Get the data
Three scripts download free NFL data and save it, so nothing has to be re-fetched later.
pull_data.py
Every game’s schedule, final score, and betting line.
writes schedules.parquet
qb_value.py
Scores how well each quarterback played in each game, from the box score.
writes qb_value.parquet
epa_features.py
Each team’s recent form on offense & defense — counting only games already played.
writes epa_features.parquet
What comes out, starting with the backbone file schedules.parquet (real 2025 rows):
| game_id | wk | away | home | result | spread | home ML | away ML |
|---|---|---|---|---|---|---|---|
| 2025_01_DAL_PHI | 1 | DAL | PHI | +4 | +8.5 | −425 | +330 |
| 2025_01_KC_LAC | 1 | KC | LAC | +6 | −3.0 | +145 | −175 |
| 2025_01_TB_ATL | 1 | TB | ATL | −3 | −1.5 | −105 | −115 |
| 2025_01_CIN_CLE | 1 | CIN | CLE | −1 | −5.5 | +195 | −238 |
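result = home score − away score (so +4 = home won by 4). spread = points the home team is favored by (so 8.5 = home favored by 8.5, and −3.0 = away favored by 3). ML = moneyline odds.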
qb_value.parquet — how each QB played:
| player | team | value |
|---|---|---|
| Aaron Rodgers | PIT | 75.7 |
| Matthew Stafford | LA | 44.6 |
| Joe Flacco | CLE | 42.2 |
epa_features.parquet — recent form (higher = better):
| team | off form | def form |
|---|---|---|
| BUF | +0.136 | +0.051 |
| BAL | +0.074 | +0.071 |
| ATL | −0.009 | +0.056 |
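"Counting only games already played" is what keeps a form number from leaking the result it is supposed to help predict. Here is a minimal sketch of that idea in plain Python. The record shape (team, week, epa) and the window size are assumptions for illustration; the real epa_features.py may window or weight games differently.

```python
from collections import defaultdict, deque


def prior_form(games, window=3):
    """Return each team's average EPA over its previous `window` games.

    `games` is a list of (team, week, epa) tuples sorted by week.
    The value for a game is computed BEFORE that game's own EPA is
    added to the history, so a row never sees its own result.
    The record shape and window size are illustrative assumptions.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    form = []
    for team, _week, epa in games:
        past = history[team]
        # No earlier games yet: there is nothing to average.
        form.append(sum(past) / len(past) if past else None)
        past.append(epa)  # only now does this game count as "already played"
    return form


games = [("BUF", 1, 0.2), ("BUF", 2, 0.0), ("BUF", 3, 0.4), ("BUF", 4, -0.3)]
print(prior_form(games, window=2))
```

The first game of a team has no history, so its form is left empty rather than guessed.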
Rate the teams (the brains)
Pure logic that other scripts call. Ratings update game-by-game, so a test can’t accidentally see the future.
elo.py
A chess-style power rating: beat a strong team and your number climbs; lose to a weak one and it drops.
qbelo.py
★ main model
Same rating, but it drops when a backup quarterback starts — the moment plain Elo gets fooled.
metrics.py
The scorecard: grades each prediction, and turns the Vegas odds into a clean win % to compare against.
What comes out — each engine’s estimate of the home team’s win chance (real Week 1 2025):
| game (away @ home) | elo.py | qbelo.py ★ | home won? |
|---|---|---|---|
| DAL @ PHI | 0.811 | 0.673 | yes (PHI +4) |
| KC @ LAC | 0.359 | 0.326 | yes (LAC +6) |
| TB @ ATL | 0.428 | 0.322 | no (TB +3) |
| CIN @ CLE | 0.358 | 0.437 | no (CIN +1) |
0.811 = “81% chance the home team wins.” QBElo (★) nudges Elo’s number — biggest when a backup QB is starting.
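The rating idea behind elo.py can be sketched in a few lines. This is the textbook Elo curve on a 400-point scale, not necessarily the project's exact formula: `k` and `home_bonus` are hypothetical parameters. On this plain curve a rating gap of +205.3 (the elo_diff listed for DAL @ PHI) maps to about 0.765, lower than the 0.811 in the table, which suggests the real model adds a home-field term of some kind.

```python
def expected_home(elo_diff, home_bonus=0.0):
    """Chance the home team wins, given (home rating - away rating).

    Standard Elo logistic curve on a 400-point scale. `home_bonus` is a
    hypothetical stand-in for any home-field term the real model adds.
    """
    return 1.0 / (1.0 + 10.0 ** (-(elo_diff + home_bonus) / 400.0))


def update(home_rating, away_rating, home_won, k=20.0, home_bonus=0.0):
    """Predict first, then move both ratings by the size of the surprise.

    `k` (how fast ratings move) is an assumed value, not the project's.
    Returns (pre-game probability, new home rating, new away rating).
    """
    p_home = expected_home(home_rating - away_rating, home_bonus)
    shift = k * ((1.0 if home_won else 0.0) - p_home)
    return p_home, home_rating + shift, away_rating - shift


# An upset: the weaker home team (1450) beats the stronger visitor (1550).
p, home, away = update(1450.0, 1550.0, home_won=True)
print(round(p, 3), round(home, 1), round(away, 1))
```

Because the probability is computed before the ratings move, the prediction for a game never includes that game's own result.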
Combine everything into one table
The single step that every test and the live feed below all share.
ml_model.py → assemble()
Runs the ratings and stitches all three data files into one big table — one row per game, with every number lined up: the ratings, the win probabilities, recent form, days of rest, and the Vegas price.
What comes out — the single combined row for one game (Week 1, DAL @ PHI). Every test below reads rows shaped like this:
| column | what it is | value |
|---|---|---|
| elo_diff | PHI’s rating minus DAL’s | +205.3 |
| p_home_elo | Elo’s win chance for PHI | 0.811 |
| p_home_qbelo | QBElo’s win chance for PHI | 0.673 |
| off_epa_diff | offense-form gap | +0.167 |
| def_epa_diff | defense-form gap | −0.124 |
| home_rest / away_rest | days of rest each | 7 / 7 |
| p_home_mkt | Vegas’ win chance for PHI | 0.777 |
| y | what happened (1 = home win) | 1 |
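The p_home_mkt value can be reproduced from the two moneylines in schedules.parquet. The sketch below assumes the simplest approach: convert each moneyline to a raw probability, then rescale the pair so they sum to 1. That happens to match the published numbers for the Week 1 games shown, but metrics.py may do it differently, and the function names here are made up for illustration.

```python
def implied_prob(moneyline):
    """Raw win chance implied by an American moneyline (still includes the vig)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100.0)
    return 100.0 / (moneyline + 100.0)


def fair_home_prob(home_ml, away_ml):
    """Rescale the two raw chances so they sum to 1 (proportional de-vig)."""
    home = implied_prob(home_ml)
    away = implied_prob(away_ml)
    return home / (home + away)


# DAL @ PHI from schedules.parquet: home ML -425, away ML +330.
print(round(fair_home_prob(-425, 330), 3))  # 0.777
```

The raw chances add up to more than 100% (about 104% here); that extra is the bookmaker's margin, and rescaling removes it.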
Grade it, then build the live feed
These actually run. The tests prove the model works (and doesn’t cheat); the producer writes the file this site reads.
The tests — each saves a report (chart or table)
backtest.py
The main exam: predict past seasons it was never tuned on, and compare to Vegas.
ml_model.py
Tries machine learning instead of the rating — and reports honestly that it doesn’t win.
ml_walk_forward.py
Re-runs that ML test fairly, letting it learn each new season. Still can’t beat the rating.
backup_slice.py
Pinpoints where the QB rating earns its keep: games where a backup starts.
sanity_seasons.py
The cheat-detector: if we beat Vegas every season we’d be peeking. We don’t.
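"Letting it learn each new season" is usually called a walk-forward test. A minimal sketch of how the train/test seasons could be paired up, assuming an expanding window (every earlier season stays in the training pool); the season range and function name are hypothetical, and ml_walk_forward.py may split differently:

```python
def walk_forward_splits(seasons, first_test_season):
    """Yield (training seasons, test season) pairs in time order.

    Each test season is predicted by a model that has only seen the
    seasons strictly before it; that season then joins the training
    pool for the next round.
    """
    ordered = sorted(set(seasons))
    for test_season in ordered:
        if test_season < first_test_season:
            continue
        train = [s for s in ordered if s < test_season]
        if train:  # skip a test season with nothing to learn from
            yield train, test_season


for train, test in walk_forward_splits(range(2016, 2020), first_test_season=2018):
    print(f"train on {train[0]}-{train[-1]}, test on {test}")
```

The point of the design is that no test season ever appears in its own training data.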
What comes out — the scorecard backtest.py prints (graded on 2019–24 games it never saw):
| model | Brier ↓ | log loss ↓ | accuracy ↑ |
|---|---|---|---|
| Always pick home | 0.2495 | 0.6937 | 53.1% |
| Elo | 0.2222 | 0.6376 | 63.2% |
| QBElo ★ | 0.2205 | 0.6335 | 63.6% |
| Vegas (the ceiling) | 0.2097 | 0.6087 | 66.6% |
Lower Brier / log loss = sharper percentages. QBElo lands just shy of Vegas — the honest, expected result.
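For readers who want to see what those three columns measure, here is a small sketch using the standard definitions of Brier score, log loss, and accuracy. It scores only the four Week 1 games from the engine table, which is far too few to rank models; the function names are illustrative and not taken from metrics.py.

```python
import math


def brier(probs, outcomes):
    """Mean squared gap between the predicted chance and what happened (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Average negative log-likelihood; punishes confident misses hardest."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def accuracy(probs, outcomes):
    """Share of games where the side given more than 50% actually won."""
    return sum((p > 0.5) == (y == 1) for p, y in zip(probs, outcomes)) / len(probs)


# The four Week 1 games from the engine table: home-win chances and results.
home_won = [1, 1, 0, 0]
elo = [0.811, 0.359, 0.428, 0.358]
qbelo = [0.673, 0.326, 0.322, 0.437]
for name, probs in (("Elo", elo), ("QBElo", qbelo)):
    print(name, round(brier(probs, home_won), 4),
          round(log_loss(probs, home_won), 4), accuracy(probs, home_won))
```

A coin-flip forecast of 0.5 on every game scores a Brier of 0.25 and a log loss of about 0.693, which is why the "Always pick home" row sits close to those values.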
The live producer
season_replay.py
Walks a whole season week by week, scoring every game under every model, and saves the results file the website loads.
Change the year to 2026 and a weekly job turns this into the real live tracker.
What comes out — the live feed (replay_2025.json): every model’s home-win % per game, plus the result. This is exactly what the page below reads.
| matchup | Elo | QBElo ★ | ML | Vegas | result |
|---|---|---|---|---|---|
| DAL @ PHI | 0.811 | 0.673 | 0.751 | 0.777 | PHI +4 |
| KC @ LAC | 0.359 | 0.326 | 0.345 | 0.391 | LAC +6 |
| TB @ ATL | 0.428 | 0.322 | 0.222 | 0.489 | TB +3 |
| MIA @ IND | 0.505 | 0.413 | 0.299 | 0.504 | IND +25 |
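As a rough picture of what one entry in the feed could look like, the sketch below turns two rows of the table into JSON. Only the columns come from the table; the key names and nesting are hypothetical, and the actual layout of replay_2025.json may differ.

```python
import json


def replay_record(matchup, probs, result):
    """One game in the feed: every model's home-win chance plus the result.

    Key names are hypothetical; only the columns come from the table.
    """
    return {"matchup": matchup, "home_win_prob": dict(probs), "result": result}


records = [
    replay_record("DAL @ PHI",
                  {"elo": 0.811, "qbelo": 0.673, "ml": 0.751, "vegas": 0.777},
                  "PHI +4"),
    replay_record("KC @ LAC",
                  {"elo": 0.359, "qbelo": 0.326, "ml": 0.345, "vegas": 0.391},
                  "LAC +6"),
]
feed = json.dumps(records, indent=2)
print(feed)
```

A flat file like this is all a static page needs: it can load the JSON and draw the table with no server behind it.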
Show it here
A plain static page — no server, no database.
Run it yourself, end to end
Each step saves a file the next one reads, so the order is just the map above — top to bottom:
# Step 1 — get the data (saves data/*.parquet)
.venv\Scripts\python.exe src\pull_data.py
.venv\Scripts\python.exe src\qb_value.py
.venv\Scripts\python.exe src\epa_features.py
# Step 4 — grade the models (saves reports/*.csv and *.png)
.venv\Scripts\python.exe src\backtest.py
.venv\Scripts\python.exe src\ml_model.py
.venv\Scripts\python.exe src\ml_walk_forward.py
.venv\Scripts\python.exe src\backup_slice.py
.venv\Scripts\python.exe src\sanity_seasons.py
# Step 4 (cont.) — build the live feed the site reads (also copied into web/)
.venv\Scripts\python.exe src\season_replay.py
Steps 2 & 3 (the engines and the combine step) aren’t run on their own — the scripts above call them.
The big picture
The whole project in one line:
NFL data → rate each team → predict each game → grade vs Vegas → show it here
So… does it work?
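How often each one picks the right winner (on 2019–24 games it had never seen), per the accuracy column of the backtest.py scorecard: always pick home 53.1%, Elo 63.2%, QBElo ★ 63.6%, Vegas 66.6%.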
We land within a few points of the sharpest line in the world — and we don’t beat it. That’s the honest result: the goal was calibrated, trustworthy probabilities, not a fantasy edge over Vegas.