§ Methodology
How the AI Discovery Score is computed.
The score measures whether modern LLMs surface a specific book for its genre when readers ask for recommendations. It’s a measurement, not a marketing claim. This page explains exactly how we compute it, what we don’t measure, and the known limitations.
What actually drives LLM book retrieval
When a reader asks ChatGPT, Perplexity, Claude, or Gemini for “best [genre] books” or “books like X”, two layers determine which titles surface:
Layer 1: training-data citation graph. Wikipedia, Wikidata, Goodreads (especially shelves and lists), major review outlets (NYT / Guardian / LRB / Kirkus), Reddit (which most LLMs were trained on via Common Crawl and Reddit’s licensed corpus), and Substack and Medium niche roundups. Once a model is trained, this layer is locked in until the next retrain.
Layer 2: retrieval-augmented (live web) signals. Listicle rankings (“best [genre] books 2026”), bookstore staff picks, podcast transcripts, genre-vertical sites (Lesbrary, Crimereads, Romance.io, ALLi, SLJ). When LLMs do live web search, this is what they read.
The six axes we score
The diagnostic extracts a structured signal table from its searches and computes the score deterministically: same inputs, same number, every time. The model (Claude Haiku 4.5) doesn’t pick the score; the rubric below does.
Retrievability — 0 to 10
Does the Amazon page surface for direct title-and-author search? Almost every published book passes this; a zero here is a giveaway that something’s badly wrong (broken metadata, taken-down listing).
Structured-data depth — 0 to 20
Goodreads page (scaled by ratings count), Wikipedia article on the book, Wikidata entry. Goodreads is the single biggest input to LLM citation graphs for fiction and trade non-fiction. Wikipedia + Wikidata are the semantic backbone that lets LLMs traverse author → book → genre.
Listicle / peer-set presence — 0 to 25 (heaviest weight)
Inclusion in “best [genre] books” round-ups, peer-recommendation chains, curated lists. This is the axis that most directly answers the question “would an LLM recommend this book?” A book with zero hits here is, by rubric definition, not in the recommendation chain, and the score caps accordingly.
Institutional authority — 0 to 20 (genre-specific)
Genre-specific endorsements that disproportionately shape LLM citation. Examples by genre:
- Health / self-help / clinical: NHS Reading Well, ABCT Self-Help Seal, ADAA, NICE guidance
- Literary fiction: Booker / Women’s Prize / Pulitzer / Costa long & shortlists, NYT/Guardian/LRB review, Kirkus starred
- Crime & thriller: CWA Dagger awards, Crimereads roundups, BookRiot crime lists
- Children’s & YA: ALA awards, Carnegie / Greenaway, SLJ starred reviews
- Romance: Romance Writers awards, Smart Bitches features, Goodreads romance shelves with 100+ shelf hits
- Academic / non-fiction: Google Scholar 50+ citations, peer-reviewed press, syllabus inclusions
- Indie / self-published: ALLi top picks, Reedsy showcase, IndieReader awards, Storygraph community feature
Author graph — 0 to 15
Author Wikipedia article, bylines in credible publications, podcast appearance trail, expert credentials in the genre. A strong author footprint multiplies a book’s citation likelihood — when LLMs see a book by a cited author, they’re more willing to recommend it.
Cross-source citation — 0 to 10
Reddit discussions, Substack pieces, niche enthusiast forums. Reddit weighs disproportionately because its corpus was licensed by major model providers and recurs in retrieval queries.
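As a rough illustration of how the six axes combine, the sketch below sums the axis scores after clamping each one to the ceiling listed above. The dictionary keys, the function name, and the example values are assumptions made for illustration; only the ceilings come from the rubric.

```python
# Axis ceilings as listed in the rubric; the key names are illustrative.
AXIS_MAX = {
    "retrievability": 10,
    "structured_data": 20,
    "listicle_presence": 25,
    "institutional_authority": 20,
    "author_graph": 15,
    "cross_source": 10,
}


def raw_score(axes: dict[str, float]) -> float:
    """Sum the six axis scores, clamping each to its [0, ceiling] range."""
    total = 0.0
    for name, ceiling in AXIS_MAX.items():
        value = axes.get(name, 0.0)  # assumption: a missing axis scores zero
        total += min(max(value, 0.0), ceiling)
    return total


# Hypothetical axis values for one book.
example = {
    "retrievability": 10,
    "structured_data": 14,
    "listicle_presence": 12,
    "institutional_authority": 5,
    "author_graph": 6,
    "cross_source": 4,
}
print(raw_score(example))  # 51.0
```

The ceilings sum to 100, so the raw total is already on a 0 to 100 scale before any caps apply.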
How signals are verified
The signal table feeding the score has two sources, in this order of authority:
- Direct API verification for boolean structural facts that have authoritative sources. We hit the Wikipedia api.php endpoint, the Wikidata wbsearchentities endpoint, and the Goodreads search page in parallel. The book either has a Wikipedia article or it doesn’t; we don’t guess. These verifications run on every score and override anything the model reported.
- Live LLM web search for the harder-to-verify signals: which genre listicles include the book, which institutional bodies cite it, which podcasts the author has appeared on, which Reddit threads discuss it. The model (Claude Haiku 4.5) runs three live searches and reports its findings as a structured table; the runtime computes the score from that table using fixed weights, so the model doesn’t pick the number.
Direct-source verification eliminates the “model didn’t happen to surface that signal in three searches” failure mode for the boolean facts. For the harder signals, multi-source corroboration is on the roadmap.
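A minimal sketch of the direct-verification step, under stated assumptions: the query parameters are standard MediaWiki API parameters, but the exact requests the diagnostic sends are not documented here, and the function names and signal keys are hypothetical. The helpers only build URLs and interpret already-fetched JSON, so nothing in the block touches the network.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikipedia_query_url(title: str) -> str:
    """URL asking Wikipedia whether a page with this exact title exists."""
    params = {"action": "query", "titles": title,
              "format": "json", "formatversion": "2"}
    return f"{WIKIPEDIA_API}?{urlencode(params)}"


def wikidata_search_url(title: str) -> str:
    """URL searching Wikidata entities by label."""
    params = {"action": "wbsearchentities", "search": title,
              "language": "en", "format": "json"}
    return f"{WIKIDATA_API}?{urlencode(params)}"


def has_wikipedia_article(response: dict) -> bool:
    # With formatversion=2, a page that does not exist carries "missing": true.
    pages = response.get("query", {}).get("pages", [])
    return any(not page.get("missing", False) for page in pages)


def has_wikidata_entry(response: dict) -> bool:
    # Simplification: any label hit counts; a real check would confirm
    # that the matched entity is actually the book.
    return len(response.get("search", [])) > 0


def merge_signals(model_reported: dict, verified: dict) -> dict:
    """Verified facts override whatever the model reported for the same key."""
    return {**model_reported, **verified}
```

Putting the verified dictionary last in the merge is what gives the direct checks the final say over the model's report.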
The empirical recommendation test
Every score now includes the result of an empirical test: we make a separate LLM call with one constrained web search asking the actual question a reader would ask, “best [genre] books”, and check whether the user’s book appears in the top results.
The test returns a binary: recommended or not recommended, plus the top books that did surface in the search. Both are shown verbatim on the public score page. Anyone can re-run the same query in ChatGPT, Perplexity, or Gemini and verify the answer in under 60 seconds.
When the test returns not recommended, the runtime forces the listicle-presence axis to zero regardless of any inclusions Haiku separately reported — the test is more authoritative because it asks the recommendation question directly. This is what stops a book from being scored as “well-indexed” when an LLM-with-retrieval doesn’t actually surface it for its genre.
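The override described above can be sketched as follows. The `RecommendationTest` type, the `listicle_presence` key, and the exact-title matching are illustrative assumptions; how the real runtime matches a book against the search results is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class RecommendationTest:
    recommended: bool
    top_books: list[str] = field(default_factory=list)


def book_in_results(title: str, top_books: list[str]) -> bool:
    """Case-insensitive exact match; real matching would likely be fuzzier."""
    wanted = title.casefold().strip()
    return any(wanted == book.casefold().strip() for book in top_books)


def apply_recommendation_test(axes: dict, test: RecommendationTest) -> dict:
    """Return a copy of the axes with the listicle axis zeroed on a failed test."""
    adjusted = dict(axes)
    if not test.recommended:
        # The empirical test outranks any inclusions the model reported.
        adjusted["listicle_presence"] = 0
    return adjusted


top = ["The Hypothetical Novel", "Another Title"]
result = RecommendationTest(recommended=book_in_results("My Book", top), top_books=top)
print(apply_recommendation_test({"listicle_presence": 15, "author_graph": 8}, result))
# {'listicle_presence': 0, 'author_graph': 8}
```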
Hard caps
Three caps override the raw axis sum. They exist because the rubric refuses to label a book “well-indexed” when its primary genre-retrieval signal is missing — no matter how good the other axes look.
- Zero listicle inclusions → max 50. The book cannot reach the “partially visible” band, which starts at 60.
- External-signal floor not met (listicle + institutional + cross-source < 10) → max 35. A book with no genre-context signal at all is genuinely invisible to LLM retrieval, full stop.
- No Goodreads page → max 60. Goodreads absence almost always means the book is too new or obscure for citation graphs to have absorbed it.
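The three caps can be expressed as a small function. This is a sketch, not the production code: the function and parameter names are hypothetical, and it assumes that when several caps trigger at once the lowest one wins, which the text above does not state explicitly.

```python
def apply_hard_caps(raw: float, listicle: float, institutional: float,
                    cross_source: float, has_goodreads: bool) -> float:
    """Limit the raw axis sum by every cap whose condition is met."""
    caps = []
    if listicle == 0:
        caps.append(50)  # zero listicle inclusions
    if listicle + institutional + cross_source < 10:
        caps.append(35)  # external-signal floor not met
    if not has_goodreads:
        caps.append(60)  # no Goodreads page
    # Assumption: the most restrictive triggered cap applies.
    return min([raw, *caps])


print(apply_hard_caps(72, listicle=0, institutional=15, cross_source=5, has_goodreads=True))    # 50
print(apply_hard_caps(72, listicle=0, institutional=4, cross_source=3, has_goodreads=True))     # 35
print(apply_hard_caps(72, listicle=12, institutional=10, cross_source=5, has_goodreads=False))  # 60
```

A raw score already below every triggered cap passes through unchanged.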
Bands
- 80+ Well-indexed. The book appears in genre listicles and has structural depth: Goodreads, third-party citations, often institutional listings. ChatGPT, Perplexity, and Gemini surface it when readers ask category questions. The fixes that remain are about lifting from “cited” to “canonical”.
- 60–79 Partially visible. The book is findable by direct title-and-author search, but absent from the genre listicles and peer-recommendation chains LLMs draw on for category-level queries. Two or three high-leverage moves usually close the gap.
- Under 60 Effectively invisible. The book exists on Amazon but doesn’t surface in LLM recommendations for its genre. Author footprint, structural signals, or both need work before the book is discoverable.
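Mapping a final score to its band is a pair of threshold checks. The function name is an assumption; the thresholds and labels are the ones listed above.

```python
def band(score: float) -> str:
    """Return the band label for a final (post-cap) score."""
    if score >= 80:
        return "Well-indexed"
    if score >= 60:
        return "Partially visible"
    return "Effectively invisible"


for value in (85, 60, 59):
    print(value, band(value))
# 85 Well-indexed
# 60 Partially visible
# 59 Effectively invisible
```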
What we deliberately do not measure
- Sales rank, reader sentiment, or critical merit. The score is about retrievability, not quality.
- Amazon-specific placement (KDP categories, A+ content, BSR). That’s a different optimisation surface and dedicated tools (Publisher Rocket) handle it well.
- Whether the book is “in” an LLM’s training data — model providers don’t disclose that, and it isn’t the bottleneck for retrieval anyway.
- Anything that requires touching the LLM training pipeline, prompt-injection, or other gaming. The diagnostic only operates on author-controlled public surfaces.
Anti-gaming notes
The fastest way to ruin a metric is to make it gameable. We weight the signals deliberately:
- Authentic third-party mentions (trade press, indie blogs, podcast transcripts) carry the most weight because they’re hardest to fake.
- Goodreads list inclusion alone moves the needle modestly — it’s easy to vote books onto lists.
- Author-controlled signals (Wikidata setup, schema.org markup) count, but they cap at the structural-signal ceiling so a book can’t score well without any external presence.
- Every (book, author) pair in the competitive set is verified against Google Books before delivery — hallucinated peer titles get dropped.
Known limitations
- Per-book runs use a finite number of live searches. Edge cases — niche subgenres, books with unusual titles, recent releases — produce more variance than mainstream books.
- The score reflects what was retrievable on the day of the run. LLM behaviours change as models retrain and retrieval indexes refresh — this is why we auto-rerun every 90 days.
- Cross-LLM behaviour differs (ChatGPT vs Perplexity vs Gemini cite different sources). We weight toward sources that recur across engines.
- The diagnostic is genre-relative. A book scoring 90 in a niche subgenre is not directly comparable to a book scoring 90 in a crowded mass-market category.
Refresh cycle
Every public score is automatically re-run every 90 days. The author is emailed the change. The badge SVG updates to show the latest figure and date — no action required from authors who’ve embedded it. Refresh count is visible on the public score page.
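For readers embedding the badge, the refresh date follows directly from the last run date. The sketch below assumes the 90 days are counted as calendar days from the previous run; the function names are hypothetical.

```python
from datetime import date, timedelta

REFRESH_INTERVAL = timedelta(days=90)


def next_refresh(last_run: date) -> date:
    """Date on which the next automatic re-run falls due."""
    return last_run + REFRESH_INTERVAL


def is_due(last_run: date, today: date) -> bool:
    return today >= next_refresh(last_run)


print(next_refresh(date(2026, 1, 1)))  # 2026-04-01
```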
Run it on your own book.
Free, ungated. Takes about 90 seconds. You’ll get the score, the verdict, the competitive set, the three fastest fixes — plus an embeddable badge if you want to publish the score.