Google’s new “Forest vs Tree” study shows the usual 3–5 raters per item make safety and toxicity benchmarks brittle—and offers a roadmap for better N/K trade-offs.

Benchmarking AI with three annotators per item isn’t cutting it. Google Research just published “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” showing that the standard practice of collecting 1–5 human ratings per example leaves benchmarks brittle and irreproducible—especially for subjective tasks like toxicity or safety.

The forest vs. tree metaphor

  • Forest (breadth): Rate lots of items with very few humans each. You get coverage but shallow insight.
  • Tree (depth): Fewer items, many raters per item. You capture disagreement and nuance but sacrifice scale.

Most ML teams choose the forest because budgets are limited. The paper argues that this strategy quietly discards natural human disagreement, making it easy for two teams to run "the same" benchmark and reach conflicting conclusions.

How they tested it

Researchers built a simulator using real datasets: Jigsaw’s 100k-comment toxicity set, DICES chatbot safety scores, the cross-cultural D3code offensiveness corpus, and job-related tweet annotations. For each dataset they varied:

  1. N — total number of items annotated (100 to 50,000)
  2. K — number of raters per item (1 to 500)

They then asked: for a fixed annotation budget, which combination makes model comparisons reproducible, meaning reruns reach the same statistically significant conclusion (p < 0.05)? They also stress-tested skewed datasets (e.g., 99% spam) and multi-label categories to see whether the trade-off shifts.
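The intuition behind that experiment can be sketched in a few lines. The toy simulation below (not the paper's released simulator; the item pool and probabilities are invented for illustration) holds the budget N × K fixed and measures how much a benchmark's majority-vote score wobbles across repeated annotation runs for different K:

```python
import random
import statistics

def simulate_score(n_items, k_raters, item_probs, rng):
    """One simulated annotation run: majority vote per item, then mean over items."""
    items = rng.choices(item_probs, k=n_items)  # sample items with replacement
    votes = []
    for p in items:
        positives = sum(rng.random() < p for _ in range(k_raters))
        votes.append(1 if positives * 2 > k_raters else 0)
    return sum(votes) / n_items

def run_stability(budget, k_raters, item_probs, runs=200, seed=0):
    """Mean and spread of the benchmark score across repeated runs
    at a fixed annotation budget of N * K = budget."""
    rng = random.Random(seed)
    n_items = budget // k_raters
    scores = [simulate_score(n_items, k_raters, item_probs, rng)
              for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical item pool: mostly clear-cut items plus a contested minority,
# where p is the chance a random rater labels the item positive.
pool = [0.05] * 700 + [0.5] * 200 + [0.95] * 100

for k in (1, 3, 5, 11):
    mean, sd = run_stability(budget=3000, k_raters=k, item_probs=pool)
    print(f"K={k:>2}  N={3000 // k:>4}  score={mean:.3f} +/- {sd:.3f}")
```

The spread column is the point: for the same total spend, different (N, K) splits give visibly different run-to-run stability, which is the reproducibility question the paper formalizes.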

Key findings

  • 3–5 raters is often inadequate. That setup captures neither breadth nor the full distribution of opinions. More than 10 raters per item are frequently required for reliable nuance.
  • Metric dictates strategy. If you care only about majority vote accuracy, add more items. If you care about opinion variance (e.g., “mildly offensive” vs “toxic”), invest in more raters per item.
  • You don’t need infinite budget. With the right N/K balance, ~1,000 total annotations can yield reproducible results. The wrong balance wastes money without improving trust.

Why this matters for benchmark owners

Benchmarks guide product decisions, model launches, and marketing claims. If they’re irreproducible, you’re building on sand. The paper makes a case for treating disagreement as signal, not noise:

  • Report the distribution of ratings, not just majority vote.
  • Document rater demographics; cross-cultural disagreement is real.
  • Use more raters when measuring nuanced safety or fairness metrics.
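"Report the distribution, not just the majority vote" is straightforward to operationalize. A minimal sketch (function name and label set are invented for illustration) that emits the label distribution, the majority label, and an entropy score that flags contested items:

```python
import math
from collections import Counter

def rating_report(ratings):
    """Summarize per-item rater labels: full distribution, majority vote,
    and Shannon entropy in bits (0 = unanimous, higher = more contested)."""
    counts = Counter(ratings)
    total = len(ratings)
    dist = {label: c / total for label, c in counts.items()}
    majority = counts.most_common(1)[0][0]
    entropy = -sum(p * math.log2(p) for p in dist.values())
    return {"distribution": dist, "majority": majority,
            "entropy_bits": round(entropy, 3)}

# Seven hypothetical raters on one chatbot response:
print(rating_report(["safe", "safe", "unsafe", "safe",
                     "unsafe", "safe", "unsafe"]))
```

A 4-to-3 split and a 7-to-0 split both produce the same majority label, but only the distribution and entropy reveal that the first item is genuinely contested.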

Implementation checklist

  1. A/B your budgets. Run small pilots varying N and K to see where your metrics stabilize.
  2. Adopt the open-source simulator. Google released code to emulate different annotation strategies before you spend real money.
  3. Store raw rater-level data. Don’t throw away disagreement—you might need it to debug future model regressions.
  4. Weight raters intelligently. Calibrate each annotator or group to avoid biasing results toward a single demographic.
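Item 4, rater calibration, can start very simply. The sketch below (an assumed additive-bias model, not the paper's method) removes each rater's severity offset so a consistently harsh or lenient annotator doesn't skew item averages:

```python
import statistics

def calibrate_raters(scores_by_rater):
    """Subtract each rater's severity offset (their mean minus the global
    mean) from their scores. Assumes a simple additive-bias model and
    that raters scored comparable item sets."""
    all_scores = [s for scores in scores_by_rater.values() for s in scores]
    global_mean = statistics.mean(all_scores)
    return {
        rater: [s - (statistics.mean(scores) - global_mean) for s in scores]
        for rater, scores in scores_by_rater.items()
    }

# Hypothetical 1-5 severity ratings from two raters on the same items;
# rater_b scores everything systematically lower.
raw = {"rater_a": [4, 5, 3, 4], "rater_b": [2, 3, 1, 2]}
calibrated = calibrate_raters(raw)
for rater, scores in calibrated.items():
    print(rater, scores)
```

After calibration the two raters' item-level judgments line up exactly, because their disagreement here was pure severity offset, not substance. Real pipelines would also model per-rater scale and per-group effects.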

Enterprise questions

  • How many raters per item did a vendor use, and why?
  • Did they measure calibration across raters (some are consistently harsher)?
  • Can they reproduce their numbers when you swap in a new crowd or new geography?

Case study: safety evaluations

For DICES, a chatbot safety benchmark with 16 dimensions, increasing the number of raters per item surfaced disagreements around "coercion" that single votes hid. With shallow ratings, a model that slightly reduces worst-case harm can look identical to the baseline; with deeper ratings, the improvement became statistically significant.
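One standard way to test whether such an improvement is real is a bootstrap confidence interval over items, where each item's score is already averaged over its K raters (so deeper K shrinks per-item noise before the test runs). A generic sketch, not the paper's specific test:

```python
import random

def bootstrap_ci(model_scores, baseline_scores, iters=2000, seed=0):
    """95% bootstrap confidence interval for the mean per-item score
    difference between two systems, resampling items with replacement.
    If the interval excludes 0, the difference is significant."""
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(iters)
    )
    return means[int(0.025 * iters)], means[int(0.975 * iters)]

# Hypothetical per-item safety scores, already averaged over raters:
rng = random.Random(1)
baseline = [rng.random() for _ in range(300)]
model = [min(1.0, s + 0.03) for s in baseline]  # small, consistent improvement
low, high = bootstrap_ci(model, baseline)
print(f"95% CI for improvement: [{low:.4f}, {high:.4f}]")
```

With single votes per item, the per-item scores are far noisier and the same interval typically straddles zero, which is exactly the "improvement looks identical to baseline" failure mode described above.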

Why disagreement is a feature

Tasks like content moderation, chatbot safety, or job status inference aren’t deterministic. A Brazilian teenager and a U.S. hiring manager will rate the same tweet differently. Benchmarks that collapse those opinions into a single label hide cultural blind spots and create incentives to optimize for the majority demographic. Capturing the variance tells you when your model is polarizing, not just when it’s “wrong.”

Monitoring over time

Even after you pick an N/K strategy, revisit it. As models improve, the residual errors might cluster in edge cases that demand more depth. Track benchmark drift: if disagreement suddenly spikes for a category, you may need to rebalance raters or refresh items.
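A minimal drift alarm along these lines can be a few lines of code. The sketch below (function name, categories, and threshold are invented for illustration) flags any category whose latest disagreement rate jumps well above its own running baseline:

```python
import statistics

def flag_disagreement_spikes(history, threshold=2.0):
    """Flag categories whose latest disagreement rate sits more than
    `threshold` standard deviations above that category's baseline."""
    alerts = []
    for category, series in history.items():
        baseline, latest = series[:-1], series[-1]
        mu = statistics.mean(baseline)
        sd = statistics.pstdev(baseline) or 1e-9  # guard flat baselines
        if (latest - mu) / sd > threshold:
            alerts.append(category)
    return alerts

# Hypothetical weekly disagreement rates (fraction of items with split votes):
history = {
    "coercion": [0.10, 0.11, 0.09, 0.10, 0.24],   # sudden spike
    "profanity": [0.05, 0.06, 0.05, 0.06, 0.06],  # stable
}
print(flag_disagreement_spikes(history))
```

A flagged category is a cue to rebalance raters toward depth, or to refresh the items, before trusting the next benchmark run.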

Bottom line

Reliable AI evaluation is less about bigger datasets and more about smarter annotation portfolios. Treat the N/K trade-off as a budget knob, not an afterthought.

Think of benchmark design as infrastructure. The earlier you budget for disagreement, the less time you’ll spend explaining why yesterday’s model “improvement” vanished when another team reran the test.

Source: “Building better AI benchmarks: How many raters are enough?” Google Research Blog, March 31, 2026.

END of transmission