The evaluation framework
How AI summarizers are evaluated on this site
The sources, the rubric, the affiliate disclosure. Published in full.
Updated when the framework changes.
§ I. What this site is, and is not
This site is an editorial comparison reference for AI summarization tools. Verdicts and recommendations are grounded in vendor-published documentation, public benchmark literature, pricing verification, and side-by-side comparison of tool output on representative passages.
This site is not an in-house benchmark suite. The numbers presented in tool comparison tables are an editorial synthesis informed by the sources below, not the output of a calibrated proprietary benchmark. Readers who need calibrated, reproducible scoring should consult the academic benchmarks cited in section III, or apply the rubric in section IV to their own documents.
§ II. Sources used for verdicts
The inputs that inform every recommendation on this site.
Vendor documentation
Each tool's pricing page, feature list, supported file types, context-window limits, security and certification claims (SOC 2, HIPAA BAA availability), and any vendor-published evaluation results.
Public benchmarks
Where the literature provides relevant evaluations, references include ROUGE and BERTScore for summarization quality, HELM for general LLM capability, and peer-reviewed studies on hallucination and conditionality preservation in LLM summarization.
Pricing verification
Every price quoted on this site is checked against the vendor's public pricing page. Pages are updated when prices change.
Side-by-side excerpt comparison
Each category page presents the same source passage alongside summaries reproduced from each tool, so readers can judge fidelity, compression, and nuance preservation directly.
Reader-applicable framework
The rubric below is published so readers can apply it to their own documents on whichever tools they are considering. The honest test of any summarizer is how it handles the document in front of you, not a synthetic benchmark.
§ III. Public benchmarks worth knowing
For readers who want calibrated, reproducible numbers rather than editorial assessment, the following are the most relevant public benchmarks for summarization quality:
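- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures word-level overlap (shared n-grams and longest common subsequences) between a generated summary and human-written reference summaries. Standard in summarization research, but less informative for abstractive summaries, where a faithful paraphrase can share few exact words with the reference.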
- BERTScore measures semantic similarity between generated and reference summaries using contextual embeddings. Better than ROUGE at capturing paraphrasing and abstractive quality.
- HELM (Holistic Evaluation of Language Models, Stanford CRFM) is a broad multi-metric framework that includes summarization sub-tasks alongside reasoning, calibration, robustness, and bias.
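- SUPERT and QAEval are alternatives to overlap-based scoring. SUPERT is reference-free: it compares a summary against pseudo-references built from salient sentences in the source. QAEval is question-answering-based: it probes whether a summary preserves the answers to questions derived from a reference summary.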
These benchmarks generally evaluate base model performance rather than the consumer-facing tools (Otter, NotebookLM, Scholarcy) that wrap them. That gap is part of why side-by-side comparison on representative document types still matters for buyers.
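To make the overlap idea behind ROUGE concrete, here is a minimal sketch of a ROUGE-1 style score (unigram overlap) in plain Python. It is an illustration only: the function names `tokenize` and `rouge1` are hypothetical, and the sketch omits the stemming, ROUGE-2, and ROUGE-L variants that published implementations provide, so its numbers will not match official ROUGE scores.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Unigram overlap between a candidate summary and one reference summary.

    Simplified sketch: no stemming, no stopword handling, single reference.
    """
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Counter intersection keeps the minimum count of each shared word,
    # so a word repeated in the candidate is not over-credited.
    overlap = sum((cand_counts & ref_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
near_copy = rouge1("the cat lay on the mat", reference)
paraphrase = rouge1("a feline rested on a rug", reference)
print(near_copy)
print(paraphrase)
```

The near-copy shares five of the six reference words, while the paraphrase shares only one, even though a human reader might accept both. That is the weakness with abstractive summaries noted in the ROUGE entry above.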
§ IV. A 10-point rubric for your own documents
Apply this rubric to any tool on the document in front of you. What matters is whether the tool works for your content, not whether it scored well on someone else's.
Key findings captured
Does the summary correctly identify the primary conclusions, results, or main argument of the source? Deduct points for missing major findings or mischaracterizing conclusions.
Methods described accurately
For research papers, is the approach, dataset, or analytical method correctly described? For meeting notes, the decisions and their rationale. For legal documents, the operative clauses and their conditions.
Figures, tables, and data referenced
Are the key quantitative findings, charts, or data tables in the source reflected in the summary? Full points require that at least 70% of the significant data points be mentioned.
Nuance and conditionality preserved
The single most important criterion for high-stakes content. Deduct points for inverting conditional statements (omitting 'unless', 'except', 'subject to'), collapsing caveats into certainty, or removing hedge language where the hedge is material.
Disciplinary terminology accuracy
Are specialist terms used correctly? For academic papers, no substitution of incorrect synonyms. For legal, correct clause-type labels. For medical, correct anatomical and pharmacological terminology.
Output format usefulness
Is the output format appropriate for the use case? Meeting summaries should have action items and owners. Research papers should distinguish findings from methods. Contracts should flag clauses by type.
How to apply: pick a representative document you actually need summarized. Read the source first. Then read the AI summary and score it against each criterion. The only score that matters is the one for the document in front of you.
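For readers who prefer to keep a score sheet, the following is a rough sketch of how the rubric could be tallied in Python. The criterion names come from the rubric above. Everything else is an assumption made for illustration: the 0 to 10 scale per criterion, the unweighted average, and the helper names `data_point_coverage`, `dropped_markers`, and `overall_score`. The two helpers are crude aids for the data and conditionality criteria and do not replace reading the source.

```python
import re

# Criterion names are taken from the rubric; the scoring scale is an assumption.
CRITERIA = [
    "Key findings captured",
    "Methods described accurately",
    "Figures, tables, and data referenced",
    "Nuance and conditionality preserved",
    "Disciplinary terminology accuracy",
    "Output format usefulness",
]

# The conditional markers named in the rubric.
CONDITIONAL_MARKERS = ("unless", "except", "subject to")


def data_point_coverage(significant_points, summary):
    """Fraction of reader-chosen data points that appear verbatim in the summary.

    Naive substring match: "12%" would also match inside "112%".
    """
    if not significant_points:
        return 1.0
    found = sum(1 for point in significant_points if point in summary)
    return found / len(significant_points)


def dropped_markers(source, summary):
    """Conditional markers that occur in the source but not in the summary."""
    src, summ = source.lower(), summary.lower()
    dropped = []
    for marker in CONDITIONAL_MARKERS:
        pattern = rf"\b{re.escape(marker)}\b"
        if re.search(pattern, src) and not re.search(pattern, summ):
            dropped.append(marker)
    return dropped


def overall_score(scores):
    """Unweighted mean of per-criterion scores (assumed 0 to 10 each)."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"Missing scores for: {missing}")
    return sum(scores[name] for name in CRITERIA) / len(CRITERIA)


source = "The fee is refundable unless the order has shipped. Revenue rose 12% to $4.1M."
summary = "The fee is refundable. Revenue rose 12%."

coverage = data_point_coverage(["12%", "$4.1M"], summary)
print(f"Data point coverage: {coverage:.0%} (rubric threshold for full points: 70%)")
print("Conditional markers dropped:", dropped_markers(source, summary))
print("Overall:", overall_score({name: 8 for name in CRITERIA}))
```

In this made-up example the summary mentions one of two data points and drops the "unless" condition, which are the kinds of loss the data and conditionality criteria are meant to catch. A dropped marker is only a prompt to reread that sentence: a summary can preserve a condition in different words, or keep the marker and still invert the meaning.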
§ V. Affiliate disclosure
Some links on this site are affiliate links. This means that if a reader clicks through and purchases a subscription, this site may receive a commission from the vendor. Affiliate relationships exist with: QuillBot (via Impact Radius), Jasper (via Impact), Otter.ai (via Impact), Fireflies.ai (via Impact), Scribbr (via Impact and direct), and Paperpal (direct).
Verdicts are not influenced by commission rates. NotebookLM is recommended on every page where it is the honest winner despite paying $0 in affiliate revenue.
No payment is accepted for positive reviews, sponsored placements in results grids, or paid inclusion in any recommendation. The editorial process is independent.
§ VI. Conflicts of interest
This site is operated independently. There are no ownership interests in any of the tools reviewed, and no employment, contracted, or other commercial relationship with QuillBot, Otter, Fireflies, NotebookLM, Google, Adobe, Spellbook, Harvey, CoCounsel, Scholarcy, SciSummary, Paperpal, Scribbr, Blinkist, Shortform, Eightify, NoteGPT, Jasper, or any other tool reviewed here beyond the affiliate commission relationships disclosed above.
§ VII. Corrections policy
If you find an error in a price, feature description, or evaluation on this site, please get in touch. Confirmed errors are corrected within 48 hours. Negative reviews are not removed and verdicts are not altered in response to vendor requests, but factual errors are corrected promptly. Corrections are noted inline with a timestamp.