How AgentMadness Predicts March Madness
A plain-English guide to the data science behind the simulator — designed for anyone who wants to understand the methodology and apply it to their own projects.
The Big Picture
We're trying to answer one question: if Team A plays Team B, what's the probability Team A wins?
Everything in this project — the upset algorithm, the simulation engine, the Kaggle submission — comes back to that single question. We answer it by combining multiple "models" (different ways of estimating that probability), each with its own strengths and weaknesses.
Think of it like asking five different basketball experts for their prediction. Each expert looks at different things. We blend their answers together because the crowd is smarter than any individual.
Model 1: The Efficiency Model (KenPom-Style)
What it measures
How many points a team scores and allows per 100 possessions. This removes pace of play from the equation — a team that plays fast and scores 85 points isn't necessarily better than a team that plays slow and scores 65.
How we compute it
From every game in the 2026 regular season, we extract:
Possessions = Field Goal Attempts - Offensive Rebounds + Turnovers + (0.475 × Free Throw Attempts)
This formula estimates how many times a team had the ball. Then:
Adjusted Offensive Efficiency (adjOE) = (Total Points Scored / Total Possessions) × 100
Adjusted Defensive Efficiency (adjDE) = (Total Points Allowed / Total Possessions) × 100
A team with adjOE of 115 and adjDE of 95 has an efficiency margin of +20. That's elite — they score 20 more points per 100 possessions than they allow.
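The possession and efficiency math above can be sketched in a few lines of Python. The field names and season totals here are hypothetical, not the project's actual schema:

```python
# Sketch of the possession and efficiency math above. The field names
# and season totals are hypothetical, not the project's actual schema.

def possessions(fga, oreb, to, fta):
    """Estimate possessions: FGA - OR + TO + 0.475 * FTA."""
    return fga - oreb + to + 0.475 * fta

# Season totals for one hypothetical team
poss = possessions(fga=1900, oreb=320, to=380, fta=600)  # 2245.0
adj_oe = 2550 / poss * 100   # points scored per 100 possessions, ~113.6
adj_de = 2100 / poss * 100   # points allowed per 100 possessions, ~93.5
margin = adj_oe - adj_de     # efficiency margin, ~+20
```

The 0.475 coefficient is the standard estimate of the fraction of free throw attempts that end a possession.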
Converting to win probability
We use a logistic function (one of the most important functions in data science — you'll see it everywhere):
P(Team A wins) = 1 / (1 + e^(-(marginA - marginB) / 11))
The number 11 is a scaling factor calibrated to college basketball. Here's the intuition:

| Efficiency Gap | Win Probability |
|----------------|-----------------|
| +0 (equal teams) | 50% |
| +5 (small edge) | 61% |
| +10 (solid favorite) | 71% |
| +15 (clear favorite) | 80% |
| +20 (dominant) | 86% |
| +30 (mismatch) | 94% |
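As a sketch, the mapping can be written as a small function. This assumes a natural-exponent logistic with a scale factor of 11; calibrate the scale against your own data:

```python
import math

def win_probability(margin_a, margin_b, scale=11):
    """Logistic mapping from efficiency-margin gap to win probability.
    The natural-exponent form and scale of 11 are assumptions of this
    sketch; tune the scale to your own league and data."""
    return 1 / (1 + math.exp(-(margin_a - margin_b) / scale))

win_probability(0, 0)    # equal teams   -> 0.50
win_probability(5, 0)    # small edge    -> ~0.61
win_probability(20, 0)   # dominant team -> ~0.86
```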
Why this works: Efficiency margin is the single best predictor of college basketball outcomes. KenPom has proven this over 20+ years. It's better than win-loss record, better than rankings, better than the eye test.
Where it falls short: It treats all games equally. A blowout win over a weak team in November counts the same as a close loss to a top-10 team in February. It also can't capture "intangibles" like coaching, clutch play, or tournament experience.
Model 2: Bradley-Terry
The concept
Imagine you have a tournament of chess players. You don't know how good they are, but you can see who beat whom. The Bradley-Terry model works backward from results to estimate each player's true strength.
How it works (step by step)
- Start: Give every team a strength of 1.0
- Look at reality: Count how many games each team actually won
- Look at expectation: For each game a team played, calculate how likely they were to win given current strengths
- Adjust: If a team won MORE games than expected, increase their strength. If fewer, decrease it.
- Repeat for about 100 iterations, until the strengths stabilize
The math for one iteration:
For each team:
actual_wins = number of games they won
expected_wins = sum of [my_strength / (my_strength + opponent_strength)] for every game
new_strength = actual_wins / expected_wins × old_strength
Then normalize all strengths so they average out to 1.0.
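The whole loop fits in a short sketch. Here `games` holds hypothetical (winner, loser) pairs, not the project's real data:

```python
# Minimal Bradley-Terry fit following the steps above. `games` holds
# hypothetical (winner, loser) pairs, not the project's real data.

def bradley_terry(games, iterations=100):
    teams = {t for pair in games for t in pair}
    strength = {t: 1.0 for t in teams}
    for _ in range(iterations):
        new = {}
        for t in teams:
            wins = sum(1 for winner, _ in games if winner == t)
            expected = sum(
                strength[t] / (strength[t] + strength[b if a == t else a])
                for a, b in games if t in (a, b)
            )
            new[t] = strength[t] * wins / expected if expected else strength[t]
        mean = sum(new.values()) / len(new)            # normalize so the
        strength = {t: s / mean for t, s in new.items()}  # average is 1.0
    return strength

games = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
ratings = bradley_terry(games)   # A won every game, so A rates highest
```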
Why this is brilliant
- A team that beats strong teams gets more credit than a team that beats weak teams
- Strength of schedule is automatically accounted for
- After 100 iterations, the ratings converge to the "true" relative strength
Win probability from Bradley-Terry
Dead simple:
P(A beats B) = strength(A) / (strength(A) + strength(B))
If Team A has strength 2.5 and Team B has strength 1.0:
P = 2.5 / (2.5 + 1.0) = 71.4%
Why we add this to the ensemble: It captures information that raw efficiency misses — specifically, WHO you played and how you performed relative to their strength. Two teams with identical efficiency margins can have very different Bradley-Terry ratings if one played a harder schedule.
Model 3: Seed-Based Baseline
The concept
The NCAA selection committee assigns seeds 1-16 to each team. These seeds contain expert judgment that isn't always captured by statistics — things like injuries, team drama, recent trends.
How we use it
Convert seed difference to probability:
P(Team A wins) = 1 / (1 + 10^((seedA - seedB) / 5))
A 1-seed vs a 16-seed produces 99.9%. A 5-seed vs a 12-seed produces 96.2%.
Wait — 96%? That's too high. Historical data says 12-seeds win about 35% of the time. That's why this model gets a LOW weight (10%) in our ensemble. It's useful as a signal but wrong on its own.
Why we include it anyway: For teams we have limited data on (small conferences, late-season additions), the committee's seed is the best information we have.
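The seed baseline is one line of Python, and the numbers quoted above fall straight out of it:

```python
def seed_win_probability(seed_a, seed_b):
    """Seed-difference baseline with the scale of 5 used above."""
    return 1 / (1 + 10 ** ((seed_a - seed_b) / 5))

seed_win_probability(1, 16)   # -> ~0.999
seed_win_probability(5, 12)   # -> ~0.962 (too confident; hence the low weight)
```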
Model 4: Conference Tournament Momentum
The concept
Teams enter March Madness in very different emotional states. A mid-major that just won their conference championship is playing with house money and confidence. A power-conference team that lost in the first round of their conference tournament might be deflated.
How we use it
We parse conference tournament results to find which teams are champions. Conference champions get a +3% probability boost. This is small but meaningful.
The data science principle: This is a feature — a piece of information we add to our model because we believe it predicts the outcome. Good data science is often about finding clever features, not clever algorithms.
Model 5: Recency Weighting
The problem
A team's performance in November might look nothing like their performance in March. Injuries, player development, chemistry — teams change dramatically over a season.
The solution
Instead of treating all games equally when computing efficiency, we apply exponential weighting:
weight = exp(2 × dayNum / maxDayNum)
This means:
- Games on the last day of the season get weight ~7.4
- Games on the first day get weight ~1.0
- Games in the middle get weight ~2.7
A late-season blowout win counts 7x more than an early-season one.
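A sketch of the weighting function, using an illustrative maxDayNum of 132 (day numbers and season length will vary by dataset):

```python
import math

def recency_weight(day_num, max_day_num):
    """Exponential recency weight: exp(2 * day_num / max_day_num)."""
    return math.exp(2 * day_num / max_day_num)

recency_weight(0, 132)     # first day  -> 1.0
recency_weight(66, 132)    # midseason  -> ~2.72
recency_weight(132, 132)   # last day   -> ~7.39
```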
The data science principle: This is feature engineering — transforming raw data to better capture the signal. Recency weighting is used everywhere in data science, from stock prediction to recommendation systems.
The Ensemble: Blending It All Together
Why blend?
Each model has blind spots:
- Efficiency misses strength of schedule
- Bradley-Terry is slow to react to recent changes
- Seeds are based on committee judgment (sometimes wrong)
- Conference tournaments are small sample sizes
By blending, we cancel out individual errors. This is called ensemble learning and it's one of the most powerful ideas in data science. Almost every Kaggle competition winner uses some form of ensembling.
Our weights
Final probability =
0.55 × KenPom logistic (recency-weighted efficiency)
+ 0.30 × Bradley-Terry
+ 0.10 × Seed-based
+ 0.05 × Conference tournament adjustment
Why these specific weights?
- Efficiency gets 55% because it's the single strongest predictor
- Bradley-Terry gets 30% because it's the best complement (captures different information)
- Seeds get 10% because committee judgment adds value but is noisy
- Conference tournament gets 5% because it's a small signal on a small sample
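The blend itself is just a weighted average. The component probabilities in this sketch are hypothetical inputs; in the real pipeline each one comes from the matching model:

```python
# The ensemble is a weighted average. These component probabilities are
# hypothetical; in the real pipeline each comes from the matching model.

WEIGHTS = {"efficiency": 0.55, "bradley_terry": 0.30,
           "seed": 0.10, "conf_tourney": 0.05}

def ensemble(probs):
    """Blend per-model win probabilities; weights sum to 1.0."""
    return sum(WEIGHTS[name] * p for name, p in probs.items())

ensemble({"efficiency": 0.72, "bradley_terry": 0.65,
          "seed": 0.80, "conf_tourney": 0.55})   # -> 0.6985
```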
Log Loss: Why Confidence Matters
What it is
Log loss measures how good your probability predictions are. It's the scoring metric for the Kaggle competition.
The intuition
- Predict 0.9 and Team A wins → small penalty (you were right and confident)
- Predict 0.6 and Team A wins → moderate penalty (you were right but unsure)
- Predict 0.6 and Team A loses → moderate penalty (you were wrong but unsure)
- Predict 0.9 and Team A loses → HUGE penalty (you were wrong and confident)
- Predict 1.0 and Team A loses → INFINITE penalty
This is why we clamp predictions to [0.01, 0.99]. A single prediction of 1.0 that's wrong would destroy our entire submission.
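Here's a sketch of log loss with the clamping described above (for real work, scikit-learn's log_loss metric does the same job):

```python
import math

def log_loss(y_true, p_pred, floor=0.01, ceil=0.99):
    """Mean log loss with clamping to avoid an infinite penalty."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, floor), ceil)   # clamp predictions to [0.01, 0.99]
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

log_loss([1], [0.9])   # right and confident -> ~0.105
log_loss([0], [0.9])   # wrong and confident -> ~2.303
log_loss([0], [1.0])   # clamped to 0.99     -> ~4.605, not infinite
```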
What makes a good log loss score?
| Score | Meaning |
|-------|---------|
| 0.693 | Predicting 50% for every game (no skill) |
| 0.55-0.60 | Decent model |
| 0.45-0.50 | Competitive on Kaggle |
| < 0.45 | Top 10% territory |
The key insight: Log loss rewards calibration. If you predict 70% for a group of games, roughly 70% of them should actually be won by the predicted team.
The Upset Algorithm (In-App Simulation)
The simulator uses a different approach than the Kaggle submission because it has a different goal. Kaggle wants calibrated probabilities. The simulator wants entertaining, realistic outcomes with appropriate chaos.
Five signals, weighted
upsetProbability =
0.35 × historicalSeedRate
+ 0.30 × (1 - efficiencyGap)
+ 0.20 × combinedVolatility
+ 0.10 × (1 - experienceGap)
+ 0.05 × momentum
Historical Seed Rate (35% weight)
40 years of NCAA Tournament data tells us exactly how often each seed matchup produces an upset:
| Matchup | Upset Rate | Translation | |---------|-----------|-------------| | 1 vs 16 | 1.5% | Almost never (2 times ever) | | 2 vs 15 | 6% | About once every 2 years | | 3 vs 14 | 13% | About once per tournament | | 4 vs 13 | 20% | One per year | | 5 vs 12 | 35% | The famous "upset special" | | 6 vs 11 | 37% | Nearly a coin flip | | 7 vs 10 | 39% | Toss-up | | 8 vs 9 | 48% | Dead even |
Volatility (20% weight)
We compute each team's game-to-game scoring variance. A team that wins by 20, loses by 5, wins by 30, loses by 2 is VOLATILE. When two volatile teams play, anything can happen. This is what makes March Madness what it is.
volatility = standard deviation of scoring margin across all regular season games
High volatility + high upset rate = MADNESS.
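Using Python's statistics module, the volatility measure is one line (the margins below are made-up examples):

```python
import statistics

def volatility(margins):
    """Standard deviation of game-by-game scoring margins."""
    return statistics.stdev(margins)

volatility([20, -5, 30, -2])   # streaky team -> ~17.0
volatility([8, 10, 7, 9])      # steady team  -> ~1.3
```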
How Claude Uses the Algorithm
The computed probability is injected into the referee prompt as a constraint:
"Computed upset probability for this matchup: 34.2%. Use this probability to decide the winner — respect it."
Claude then generates a realistic narrative that explains WHY the result happened. The math decides the winner; Claude tells the story.
Key Concepts for Your Other Projects
These principles work far beyond basketball:
| Concept | What It Does | Use It For |
|---------|-------------|-----------|
| Logistic Function | Converts any number to a probability (0-1) | Medical diagnosis, spam filtering, credit scoring |
| Ensemble Learning | Blend multiple models to outperform any individual | Any prediction task |
| Feature Engineering | Find informative signals in raw data | Recommendation systems, fraud detection |
| Calibration | Probabilities should mean what they say | Weather forecasting, risk assessment |
| Log Loss | Measure probability quality | Any classification task |
| Bradley-Terry | Estimate strength from pairwise comparisons | Product ranking, A/B testing, search results |
| Exponential Weighting | Recent data matters more | Stock prediction, trend detection |
Built by Tarik Moody for the AgentMadness project · March 2026