Every NBA season produces, somewhere around game ten, a piece arguing that a certain player is having a “career year” based on numbers that, if you stop to do the arithmetic, span fewer than 200 field-goal attempts. The piece gets shared. The argument enters the discourse. By March, the player is back at his career averages, the piece has never been updated, and the writer has moved on to a different ten-game sample about a different player. The cycle repeats. The vocabulary survives. The discipline does not.
This is the single most reliable failure mode in sports analytics coverage. Small samples produce dramatic-looking numbers. Dramatic-looking numbers produce headlines. Headlines survive the regression that quietly arrives six weeks later. The result is an entire genre of writing built on data that, by the standards of the field’s own methodology, should not have been argued from yet.
The piece below is the working version of the discipline: what “small sample” actually means, where it gets dangerous, how to tell when a stretch is large enough to mean something, and the short checklist we run before quoting a number from a hot stretch.
Quick read: small samples in 60 seconds
- What “small sample” means: Any stretch of performance too short to separate skill from variance. Threshold depends on the metric and the sport.
- Why it goes wrong: Random clustering inside small windows produces results that look like trends but are mostly noise.
- The five worst cases: 10-game NBA shooting, 5-match soccer xG, 3-game NFL turnover differential, 30-at-bat rookie evaluations, 200-minute lineup splits.
- When the sample becomes meaningful: See the stabilization table below — varies from ~150 plays to ~2,000 possessions depending on the metric.
- How to use small samples honestly: As a hypothesis, not a conclusion. Name the sample size out loud and write the regression risk into the piece.
What “small sample” actually means
The phrase gets thrown around as if it were a feeling. It is not. Every sports performance metric has a sample size at which the number starts behaving like a reliable estimator of underlying skill. Below that threshold, the metric is mostly capturing noise — opponent quality, lineup fluctuations, hot streaks, luck on bounces. Above it, the metric stabilizes around the player or team’s true rate.
The boundary is empirical, not theoretical. It is established by looking at how strongly a metric correlates with itself across two halves of a season, or across one full season and the next. A metric that reaches a given correlation at 200 attempts has stabilized at 200. A metric that requires 2,000 attempts to correlate at the same level has stabilized at 2,000. The smaller the stabilization point, the safer the metric is to quote in short windows.
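The split-half idea can be sketched with a small simulation. This is a rough illustration under stated assumptions, not a reproduction of any published study: the skill spread (true rates centered on 36% with a 3-point standard deviation), the player count, and the function names are all hypothetical.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(attempts_per_half, n_players=300, seed=0):
    """Simulate shooters and correlate their first-half and second-half rates."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        # Assumed skill spread: true rates centered on 36%, SD of 3 points.
        true_rate = min(max(rng.gauss(0.36, 0.03), 0.0), 1.0)
        for half in (first, second):
            makes = sum(rng.random() < true_rate for _ in range(attempts_per_half))
            half.append(makes / attempts_per_half)
    return pearson(first, second)

# The correlation should climb as each half contains more attempts.
for n in (50, 200, 800, 1600):
    print(n, round(split_half_correlation(n), 2))
```

With few attempts per half, the two halves barely agree because noise swamps the skill differences; as attempts grow, the correlation rises toward 1. The attempt count at which it crosses whatever bar you choose is the stabilization point for that simulated metric.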
Most public sources publish these numbers if you look. Basketball Reference notes stabilization points in its glossary. FBref does the same for soccer metrics. The information is available. It is just not part of the standard coverage workflow.
Where small samples mislead most reliably
The table below is the version we keep open when reviewing any piece that leans on a hot-stretch claim. Each row maps a common metric to the sample where it actually starts meaning something, and to the typical small-sample misuse that surfaces in coverage.
| Metric | Sport | Stabilizes around | Common small-sample misuse |
|---|---|---|---|
| Three-point percentage | NBA / WNBA | ~750 attempts (full career-style) | Declaring a shooter “improved” after 50 makes |
| True shooting percentage | NBA / WNBA | ~300 attempts | “Career year” claims at game 8 |
| Five-man lineup net rating | NBA | ~2,000 possessions | “This lineup is dominant” from 220 minutes |
| On/off splits | NBA / WNBA | ~2,000 possessions per condition | “He makes his team better” from 8 games |
| xG per match (team) | Soccer | ~20 matches | “They’re playing the best football in the league” from 4 matches |
| Goals vs xG (finishing) | Soccer | ~30+ matches for strikers | “Hot finisher” pieces in October |
| EPA per play (offense) | NFL | ~150 plays | 4-game “offensive identity” claims |
| Turnover differential | NFL | Most of a season | “They’re forcing turnovers now” from 3 games |
| BABIP | MLB | ~600 balls in play | “He’s seeing the ball better” from one hot week |
| Pressure rate (defense) | NFL | Most of a season | “Improved pass rush” from a single elite-opponent game |
The stabilization numbers come from public research that, in most cases, has been settled for at least a decade. They are not controversial. They are mostly absent from coverage because the coverage cycle moves faster than the math.
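One common way to put a stabilization number to work is to treat it as ballast: blend the observed rate with the league rate, weighting the league rate by the stabilization sample. The sketch below assumes that reading (the stabilization point as the sample where observed and true rates correlate at roughly 0.5), which the table itself does not state. The function name, the 36% league rate, and the 60-attempt hot start are hypothetical.

```python
def regressed_rate(observed_rate, attempts, league_rate, stabilization_attempts):
    """Blend an observed rate with the league rate, weighting by sample size.

    Assumes the stabilization point is the sample at which observed and
    true rates correlate at roughly 0.5, so it can act as "ballast".
    """
    weight = attempts / (attempts + stabilization_attempts)
    return weight * observed_rate + (1 - weight) * league_rate

# Hypothetical hot start: 47% on 60 three-point attempts, assumed 36% league rate,
# using the ~750-attempt stabilization figure from the table.
estimate = regressed_rate(0.47, 60, 0.36, 750)
print(f"{estimate:.3f}")
```

Under these assumptions the 47% stretch regresses to an estimate of about 36.8%, which is why a hot fortnight moves the honest projection by less than a percentage point.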
Five small-sample arguments that should be retired
The arguments below are not strawmen. They are common formats that produce a particular kind of bad analysis, and each is a candidate for retirement.
The “hot start” piece in early November. Eight to ten NBA games in. A player is shooting 47% from three. The piece argues that an offseason change (shot mechanics, hand placement, finally embracing analytics) has produced the breakout. The same player in February is at 36%. The follow-up piece never runs. The offseason change was never the explanation. The sample was. We unpacked the underlying math in our regression to the mean piece; small-sample hot starts are the canonical input.
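To see how easily variance alone produces a stretch like this, the exact binomial tail is enough. The sketch below is illustrative: the 60-attempt volume is an assumption about what ten games might contain, and `prob_at_least` is a hypothetical helper name.

```python
from math import comb

def prob_at_least(makes, attempts, true_rate):
    """Exact binomial probability of `makes` or more in `attempts` tries."""
    return sum(
        comb(attempts, k) * true_rate**k * (1 - true_rate) ** (attempts - k)
        for k in range(makes, attempts + 1)
    )

attempts = 60   # assumed volume for a ten-game stretch
makes = 28      # 28 of 60 is about 47%
p = prob_at_least(makes, attempts, 0.36)
print(f"P(a 36% shooter makes {makes}+ of {attempts}) = {p:.3f}")
```

The probability for any single player is small but not negligible, and a league contains many shooters. Some of them will post a stretch like this every season with no change in underlying skill.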
The “this lineup is the answer” column from 200 NBA minutes. Five-man lineup net rating is enormously volatile across small samples. A unit that posts +18 over 200 minutes is, statistically, not particularly distinguishable from a unit at +4 over the same minutes. The chart looks dramatic because of the sample size, not because the unit is special. A useful version of the same column waits until the lineup has played 800-1,000 minutes together.
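A back-of-envelope standard error makes the +18 versus +4 comparison concrete. Everything numeric here is an assumption for illustration: roughly two possessions per minute for the lineup, a per-possession scoring standard deviation of about 1.1 points, and independence between possessions. The function name is hypothetical.

```python
from math import sqrt

def net_rating_se(possessions, points_sd_per_possession=1.1):
    """Approximate standard error of net rating (points per 100 possessions).

    Treats each offensive and defensive possession as an independent draw
    with the given standard deviation; both figures are rough assumptions.
    """
    # Net rating is an offensive average minus a defensive average,
    # so the two sampling variances add.
    se_per_possession = points_sd_per_possession * sqrt(2 / possessions)
    return 100 * se_per_possession

possessions = 400              # assumed: about 200 minutes at two possessions per minute
se = net_rating_se(possessions)
gap_se = se * sqrt(2)          # standard error of the gap between two such lineups
z = (18 - 4) / gap_se
print(f"SE per lineup: {se:.1f}, z for +18 vs +4: {z:.2f}")
```

Under these assumptions each lineup's net rating carries a standard error near 8 points per 100 possessions, and the 14-point gap works out to a z-score of about 1.3, short of the conventional 1.96 bar. The same function shows the error shrinking only with the square root of possessions, which is why the wait is so long.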
The five-match Premier League “title contender” argument. Team-level xG stabilizes across roughly twenty matches. A team that produces a great xG profile across the first five fixtures has produced an encouraging signal. It has not produced a settled forecast. Writing the September piece declaring the title race over is the genre at its worst. The careful version writes the same observations and explicitly names how many matches it would take to be confident.
The “rookie is a steal” valuation after thirty pro at-bats. Baseball’s BABIP needs hundreds of balls in play to stabilize, and even the quicker-settling walk and strikeout rates need far more than thirty plate appearances. A rookie hitting .380 across his first thirty at-bats has produced a small-sample story. The same rookie at 300 at-bats is the data point worth arguing from. Calling the franchise “set at the position” before 100 plate appearances is the standard early-summer mistake.
The “this team is forcing turnovers” piece after three NFL games. Turnover differential is one of the most unstable team-level metrics in football. Across three games, a defense forcing eight turnovers tells you almost nothing about the rest of the season. League-wide turnover rates regress hard. The piece written in week four about a team’s “turnover identity” almost always reads as embarrassing by week ten.
A decision framework: sample size by intent
The table below maps the kind of claim you want to make against the sample size you actually need. The framework is the version we run before publishing any piece that leans on recent performance.
| Type of claim | Sample threshold | What to do if you have less |
|---|---|---|
| “This player is having a career year” | ~30+ games (NBA), ~25+ matches (soccer) | Write “this stretch is unusually strong; the underlying numbers will tell us more by midseason” |
| “This team is a contender” | ~20+ games (NBA), ~15+ matches (soccer) | Write “the early indicators are encouraging; the season will price in opponent quality” |
| “This lineup works” | ~2,000 possessions (NBA) | Write “the early read is positive; we’ll know more once they’ve played more high-leverage minutes” |
| “This coaching change is paying off” | ~15+ games of stable rotation | Write “early signs suggest the change is working; watch the process metrics, not the wins” |
| “This rookie is a star” | ~50+ games (NBA), full season (NFL) | Write “the early flashes are real; the role and opponent quality will determine the trajectory” |
| “This shooter has improved” | ~500+ attempts | Write “the stretch is encouraging; the percentage needs more attempts to be argued from” |
| “This defense is elite” | ~Half a season | Write “the underlying defensive metrics are strong; the rest of the schedule will test it” |
The pattern is that the careful versions of these claims are not dramatically less interesting than the lazy versions. They are slightly more cautious. They age much better.
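The framework is simple enough to keep as a lookup next to the draft. The sketch below copies a few thresholds from the table; the dictionary keys, the function name, and the two status labels are hypothetical, and the table's thresholds are approximate rather than hard cutoffs.

```python
# Thresholds copied from a few rows of the table above; keys and units are illustrative.
THRESHOLDS = {
    "career_year_nba_games": 30,
    "contender_nba_games": 20,
    "rookie_star_nba_games": 50,
    "shooter_improved_attempts": 500,
}

def claim_status(claim, sample_size):
    """Return 'conclusion' if the sample meets the threshold, else 'hypothesis'."""
    threshold = THRESHOLDS[claim]
    return "conclusion" if sample_size >= threshold else "hypothesis"

print(claim_status("career_year_nba_games", 8))        # hypothesis
print(claim_status("shooter_improved_attempts", 620))  # conclusion
```

A "hypothesis" result is the cue to reach for the hedged wording in the right-hand column rather than to drop the observation.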
When small samples are actually useful
The framework above is not an argument against ever using a small sample. It is an argument against treating small samples as conclusions when they are hypotheses. There are several cases where a small sample is the appropriate unit of analysis.
The first is scouting at the player-action level. A pitcher’s release point, a shooter’s mechanics, a quarterback’s footwork: these are observable in fewer than ten plays. A scout watching twenty pitches has data the analytics community would not call “stabilized,” but the data is qualitative, not statistical. Confusing the two cuts both ways: film gets dismissed because the sample is small, or a handful of plays gets quoted as if it were a stabilized rate.
The second is scheme identification. A team’s pick-and-roll coverage, a defense’s base front, a soccer side’s pressing trigger — these tend to be visible across a small number of games. The pattern is repeated by design. The small sample reveals the system, not the variance around it.
The third is injury recovery and early-season scheme installation. A player coming off a torn ACL in November is meaningfully different in March; a team that installed a new scheme in September is meaningfully different by January. The small sample at the start of the recovery or scheme period describes a different version of the player or team than the later one. Treating the early data as predictive of the later state is the error. Treating it as descriptive of the recovery process is fine.
The vocabulary behind these distinctions is collected in the working glossary of our sports analytics field guide, which covers how concepts like stabilization, regression, and per-possession adjustment fit together.
Frequently asked questions
If small samples are unreliable, why do analytics writers still cite them?
Because the alternative is publishing nothing for the first month of every season, and that is not a viable business. The honest move is not to refuse small-sample claims but to name the sample inside the piece. “This is a 12-game sample” is the difference between analysis and assertion. Writers who omit the qualifier are signalling either inexperience or laziness.
How can I tell if a piece I’m reading is making a small-sample mistake?
Check whether the sample size appears anywhere in the article. If it does not, the piece is leaning on a number whose context the writer either does not know or did not want to share. Good analytical writing tends to surface the sample explicitly: “across 14 games,” “in 380 lineup minutes,” “from 9 starts.” The number is part of the argument. Hiding it is a tell.
Are some sports more vulnerable to small-sample errors than others?
Yes. The NFL, with its 17-game regular season, has the worst sample-size problem of any major American sport: almost every claim is built on what would be a small sample in basketball or baseball. Soccer, with its 38-match top league season, sits in the middle. NBA and MLB, with 82 and 162 games respectively, have the most room for samples to stabilize. WNBA, with around 40 games, sits closer to soccer in this respect.
Is “small sample” the same as “early in the season”?
Not exactly. Early in the season the sample is small by definition. But small samples also appear mid-season — a player just returned from injury, a coaching change that produced 8 games of new data, a midseason trade that gave a player a new role. Each of these creates a fresh small sample inside a larger season. The same caution applies. The trap is treating the new sample with the same confidence as a full season’s worth of stable data.
The takeaway, in one paragraph
Small samples produce dramatic numbers because variance has not yet been smoothed out by sample size. The disciplined response is not to ignore them but to name them, write the regression risk into the piece, and revisit the claim once the sample has grown enough to mean something. The good versions of analytical writing do this routinely. The lazy versions do not. For the broader vocabulary this concept sits inside, our field guide to sports analytics terms is the natural companion read, and the regression to the mean piece covers the mechanism that does most of the smoothing.



