F1 Score

Intermediate · Machine Learning

Last updated June 14, 2026

What is F1 Score in simple terms?

In simple terms, the F1 score blends two separate scores — how trustworthy a model's "yes" answers are, and how many real cases it catches — into one fair number that stays low if either half is weak.

What is F1 Score?

The F1 score is a single number that combines a classification model's precision and recall into one balanced figure. It is designed so that a model scores well only when both are high, which means keeping false alarms and misses low together.

When you judge a flagging model, precision (how many of its flags were correct) and recall (how many real cases it caught) usually pull against each other, and reporting two numbers can be awkward when you just want to compare models or pick a winner. The F1 score solves that by squeezing both into a single figure between 0 and 1, where higher is better. Its defining feature is *how* it combines them: not a plain average, but one designed to punish imbalance. If a model scores 1.0 on precision but a dismal 0.1 on recall, a plain average would flatter it to 0.55 — but the F1 score lands near 0.18, much closer to the weak half. The only way to earn a high F1 is to be good at both at once.
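The comparison in the paragraph above can be checked with a few lines of Python. This is a minimal sketch: the function names `arithmetic_mean` and `f1_score` are illustrative helpers written for this example, not part of any library, and returning 0 when both inputs are 0 is a common convention assumed here.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average of the two scores."""
    return (precision + recall) / 2


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are 0 (assumed convention, avoids dividing by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.0, 0.1  # the lopsided model from the text
print(f"plain average: {arithmetic_mean(precision, recall):.2f}")  # 0.55
print(f"F1 score:      {f1_score(precision, recall):.2f}")         # 0.18
```

Swapping in two similar values, such as 0.8 and 0.8, gives the same result from both functions, which shows that the gap between them only opens up when the two scores diverge.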

This anti-cheat property is exactly why the F1 score exists. A plain average can be gamed: a model that flags almost nothing can post near-perfect precision and, blended naively, look respectable despite missing nearly everything. The F1 score won't let that pass, because it pulls hard toward whichever number is lower (it's the *harmonic mean* of the two, a kind of average that's dominated by the smaller value). The practical upshot is simple and useful: a high F1 score is a genuine promise that the model is neither raising too many false alarms nor missing too many real cases — it's competent on both fronts, not lopsided.

Two caveats keep it honest. First, the F1 score weighs precision and recall *equally*, which isn't always what you want. For cancer screening you might care far more about catching every case (recall) than about the odd false alarm, in which case a weighted variant that tilts the balance (the F-beta score) is more appropriate. Second, because it collapses two numbers into one, F1 hides the very trade-off precision and recall were reporting: two quite different models can share an F1 of 0.8 while failing in opposite ways. So the F1 score is excellent for ranking or summarizing models at a glance, especially when one category is rare and plain accuracy misleads, but it's a headline, not the full story, and serious evaluation still looks at precision and recall underneath it.
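Both caveats can be illustrated together. The sketch below assumes the usual F-beta form of the weighted variant, where a `beta` above 1 favors recall and a `beta` below 1 favors precision. The two models and their scores are invented for illustration, and `f_beta` is a hypothetical helper, not a library function.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the plain F1 score; beta > 1 weights recall more heavily.
    Returns 0.0 when the denominator is 0 (assumed convention).
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Two hypothetical models that fail in opposite ways.
models = {
    "cautious": (1.0, 2 / 3),  # never raises a false alarm, misses a third of cases
    "eager": (2 / 3, 1.0),     # catches every case, a third of its flags are wrong
}

for name, (precision, recall) in models.items():
    f1 = f_beta(precision, recall)            # 0.800 for both models
    f2 = f_beta(precision, recall, beta=2.0)  # 0.714 (cautious), 0.909 (eager)
    print(f"{name:8s} F1={f1:.3f} F2={f2:.3f}")
```

Both models share an F1 of 0.8, so the single number cannot tell them apart. Weighting recall more heavily separates them, which is the behavior a screening task would likely want.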

Real-world example of F1 Score

Imagine a hiring team using an automated screener to shortlist résumés for a "strong match." You can rate the screener two ways: of the résumés it shortlisted, how many were genuinely strong (precision), and of all the genuinely strong candidates in the pile, how many it shortlisted (recall). Suppose it's very precise — almost everyone it picks really is strong, 95% — but it's timid and only surfaces a third of the strong candidates, so recall is 33%. Quote precision alone and the screener sounds excellent; quote recall and it sounds useless. The F1 score refuses to take sides: it works out to about 49%, dragged down toward the weak recall, and that single honest number tells the team at a glance that the tool is badly lopsided and quietly discarding most of the good people. That's the warning a plain average would have buried.
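To see where the 49% comes from, it can help to work from raw counts. The counts below are hypothetical, chosen only so that precision and recall match the 95% and roughly 33% in the example. The source does not give the size of the résumé pile.

```python
# Hypothetical counts chosen to reproduce the percentages in the example.
true_positives = 95    # strong candidates the screener shortlisted
false_positives = 5    # weak candidates it shortlisted by mistake
false_negatives = 190  # strong candidates it failed to shortlist

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# F1 written directly in terms of counts; equivalent to the harmonic mean
# of the precision and recall computed above.
f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)

print(f"precision: {precision:.2f}")  # 0.95
print(f"recall:    {recall:.2f}")     # 0.33
print(f"F1 score:  {f1:.2f}")         # 0.49
```

Under these assumed counts the screener passes over 190 of the 285 strong candidates, which is the lopsidedness the F1 score surfaces.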

Frequently asked questions about F1 Score

What is the difference between the F1 score and accuracy?

Accuracy is the share of *all* predictions a model got right, counting every category. The F1 score focuses on one target category and balances how trustworthy its flags are (precision) against how many real cases it catches (recall). The crucial difference shows up when a category is rare: if the rare case makes up only 1% of the data, a model can score 99% accuracy by simply never flagging it, while its F1 score, which demands both precision and recall, would expose that it's catching nothing. For imbalanced problems, F1 is the more honest measure; for balanced ones, accuracy is a fine, simpler summary.
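The rare-category case is easy to reproduce. The sketch below uses made-up data (10 positives among 1,000 examples) and a model that never flags anything. The helper names are illustrative, and treating F1 as 0 when there are no true positives, false positives, or false negatives is an assumed convention.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_positive_class(y_true: list[int], y_pred: list[int]) -> float:
    """F1 score for the class labeled 1, computed from raw counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # assumed convention when nothing is flagged and nothing exists
    return 2 * tp / denominator


# Made-up imbalanced data: 1% positives, and a model that always says "no".
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")               # 0.99
print(f"F1 score: {f1_for_positive_class(y_true, y_pred):.2f}")  # 0.00
```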

How does the F1 score work?

It combines precision and recall using their harmonic mean, a type of average that leans toward the smaller of the two numbers rather than treating them evenly like a normal average. Concretely, it multiplies precision and recall, divides by their sum, and doubles the result, giving a figure from 0 to 1. Because the harmonic mean is dragged down by whichever value is lower, a model only achieves a high F1 when both precision and recall are high together; being excellent at one while poor at the other yields a mediocre score. That built-in penalty for imbalance is the whole design.
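Written as a formula, with $P$ for precision and $R$ for recall (shorthand introduced here, not used elsewhere in this entry), the recipe described above is:

```latex
F_1 \;=\; 2 \cdot \frac{P \cdot R}{P + R} \;=\; \frac{2}{\dfrac{1}{P} + \dfrac{1}{R}}
```

The right-hand form shows why it is a harmonic mean: it averages the reciprocals of the two scores, so a small $P$ or $R$ produces a large reciprocal that dominates the result. The expression is undefined when $P + R = 0$, and in that case the score is commonly taken to be 0.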

What is the F1 score used for?

It's the standard single-number summary for classification tasks where the categories are imbalanced and accuracy would mislead — fraud detection, disease screening, spam filtering, defect spotting, and many language tasks like extracting names from text. Teams use it to compare models on a level footing and to pick the best one when both false alarms and misses matter. When the two errors aren't equally costly, a weighted variant tips the balance toward precision or recall, but the plain F1 score remains the common default for a quick, fair comparison.