Published, not promised

Grading accuracy & transparency report

A grade is only worth trusting if its accuracy is measured and shown. This page reports, platform-wide, how closely GradeThread's AI grades match expert human reviewers, how confident the model is, and how often buyers dispute a graded item — alongside the eval gate and model changelog that keep the standard improving over time.

Last updated 7/19/2026, 3:27:05 AM

How accurate the grades are

Whenever a human expert reviews a grade, we compare their score to the AI's. These figures cover every reviewed grade across the platform.

AI-vs-human agreement

Not enough data yet

Grades within half a point of the human reviewer

Mean error vs. reviewers

0.00

Average points of difference (lower is better)

Average model confidence

62.6%

Across recent grades

Routed to human review

13.6%

Low-confidence grades checked before finalizing

Intentional-design misread rate

0.0%

Design (e.g. distressing) mistaken for damage — lower is better

Buyer dispute rate

Not enough data yet

Of opted-in graded sales

Items graded

Expert reviews

Graded sales tracked
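
As a rough illustration of the two headline figures above, the sketch below computes an agreement rate (share of grades within half a point of the reviewer) and a mean absolute error from pairs of AI and human scores. The function name, the sample scores, and the choice to count a difference of exactly half a point as agreement are assumptions made for this example, not a description of GradeThread's actual pipeline.

```python
from typing import Optional, Sequence, Tuple


def agreement_and_mae(
    pairs: Sequence[Tuple[float, float]],
    tolerance: float = 0.5,  # "within half a point"; inclusive bound is an assumption
) -> Optional[Tuple[float, float]]:
    """Return (agreement_rate, mean_absolute_error) for (ai, human) score pairs.

    Returns None when there are no reviewed grades, which mirrors a
    "not enough data yet" state rather than reporting a misleading zero.
    """
    if not pairs:
        return None
    diffs = [abs(ai - human) for ai, human in pairs]
    agreement = sum(d <= tolerance for d in diffs) / len(diffs)
    mae = sum(diffs) / len(diffs)
    return agreement, mae


# Hypothetical reviewed grades: (AI score, human reviewer score)
sample = [(8.0, 8.5), (7.0, 6.0), (9.0, 9.0), (5.5, 5.0)]
print(agreement_and_mae(sample))  # (0.75, 0.5)
```

Returning None for an empty input keeps an unmeasured figure distinct from a measured error of zero.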

Accuracy by garment category

Some categories are harder to grade than others: distressed denim is easy to misread, a plain tee is not. We break the same AI-vs-human comparison down by category so that a single average cannot hide where grading is weakest. A category appears once it has enough reviewed grades to report a reliable number.

Per-category accuracy appears here once individual garment categories have enough reviewed grades to report a meaningful number.
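
One way to picture the per-category breakdown is a grouping step that withholds any category with too few reviewed grades. This is a minimal sketch under stated assumptions: the minimum of 30 reviews, the function name, and the record layout are hypothetical, since the actual threshold is not published on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MIN_REVIEWS = 30  # hypothetical threshold; the real cutoff is not stated


def per_category_accuracy(
    reviews: Iterable[Tuple[str, float, float]],
    min_reviews: int = MIN_REVIEWS,
    tolerance: float = 0.5,
) -> Dict[str, Dict[str, float]]:
    """Group (category, ai_score, human_score) records and report only
    the categories that have at least `min_reviews` reviewed grades."""
    diffs_by_category = defaultdict(list)
    for category, ai, human in reviews:
        diffs_by_category[category].append(abs(ai - human))

    report = {}
    for category, diffs in diffs_by_category.items():
        if len(diffs) < min_reviews:
            continue  # too few reviews to report a meaningful number
        report[category] = {
            "reviews": len(diffs),
            "agreement": sum(d <= tolerance for d in diffs) / len(diffs),
            "mean_error": sum(diffs) / len(diffs),
        }
    return report


# Hypothetical records, with a low threshold so the example stays small.
records = [("denim", 7.0, 8.0), ("denim", 6.0, 6.0), ("tee", 9.0, 9.0)]
print(per_category_accuracy(records, min_reviews=2))
```

With the sample records, only "denim" is reported; "tee" has a single review and is held back.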

How good is “good”? The expert baseline

The fairest bar for an AI grader is how often two human experts agree with each other. We measure that in blind multi-rater reliability studies and publish the baseline here once a study is complete and reviewed.

Baseline pending — a published reliability study will appear here with expert-vs-expert agreement, mean error, and how the AI compares.

How the standard improves over time

Accuracy isn't a one-time claim — it's maintained by a closed loop.

1. Every grade is attributed to a model version
So accuracy can be measured per version, per factor, and per garment category, not as a single vague average.
2. Human reviewers correct and the loop learns
Reviewer corrections (including cases where design was mistaken for damage) and post-sale buyer disputes are fed back as a signal of where grading drifts.
3. New models must clear a published eval gate
A candidate version cannot grade live items until it stays within a maximum error of 1.0 and reaches at least 70% agreement on a golden set of expert-graded garments.
4. An automated monitor watches for regressions
On a schedule, the live grader is re-checked against the golden set and against production reviews and disputes. If quality drifts below threshold, the team is alerted before it slips further.
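
The eval gate in step 3 can be sketched as a single pass/fail check over the golden set. The two thresholds (1.0 and 70%) come from this page; everything else is an assumption for illustration, including the function name, reading "maximum error" as a cap on mean absolute error, and treating both bounds as inclusive.

```python
from typing import Sequence, Tuple

MAX_ERROR = 1.0       # from the published gate
MIN_AGREEMENT = 0.70  # from the published gate


def passes_eval_gate(
    golden: Sequence[Tuple[float, float]],
    max_error: float = MAX_ERROR,
    min_agreement: float = MIN_AGREEMENT,
    tolerance: float = 0.5,
) -> bool:
    """Check (candidate_score, expert_score) pairs against both gate thresholds.

    Assumption: "maximum error" is interpreted as mean absolute error.
    An empty golden set fails, since nothing has been proven.
    """
    if not golden:
        return False
    diffs = [abs(candidate - expert) for candidate, expert in golden]
    mean_error = sum(diffs) / len(diffs)
    agreement = sum(d <= tolerance for d in diffs) / len(diffs)
    return mean_error <= max_error and agreement >= min_agreement


# Hypothetical golden-set results for two candidate versions.
candidate_a = [(8.0, 8.0), (7.0, 7.5), (6.0, 6.5), (5.0, 7.0)]  # 75% agreement
candidate_b = [(8.0, 8.0), (7.0, 7.5), (6.0, 8.0)]              # about 67% agreement
print(passes_eval_gate(candidate_a))  # True
print(passes_eval_gate(candidate_b))  # False
```

The same check could, in principle, be re-run on a schedule against the live grader to implement the regression monitor in step 4, with a failing result triggering an alert instead of blocking a release.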

Model changelog

Grading model versions that have cleared the eval gate, newest first. Each row is a version proven against the golden set before it went live.

Eval-gated model releases will appear here as new grading versions are promoted.

For the full rubric and weighting behind every grade, see the grading standard.

Transparency FAQ

How accurate is GradeThread's AI grading?

We publish it. Every grade a human reviewer checks is compared to the AI's grade, and we report the agreement rate (share within half a point) and mean absolute error against expert reviewers on this page, updated continuously as more grades are reviewed.

How does GradeThread improve over time?

Reviewer corrections and post-sale buyer disputes feed an accuracy loop, and every new grading model version must clear a fixed eval gate (a maximum error and minimum agreement against a golden set of expert-graded garments) before it can grade live items. The model changelog on this page lists versions that passed.

What stops a grading model from getting worse?

An automated monitor re-checks the live grader on a schedule against the same golden set and against production reviews and disputes. If accuracy drifts below threshold, the team is alerted before quality slips further.

Do buyers have to trust a black box?

No. The rubric and weights are published, every grade carries a confidence score, low-confidence grades are routed to human review, and these platform-wide accuracy figures are public, so the standard is verifiable, not opaque.
