Anvil's scoring rubric is grounded in the Princeton GEO-bench study and validated against empirical citation data across hundreds of thousands of domains. We measure infrastructure (the four weighted channels you see in your scorecard) and outcomes (Share of Model, the actual citation rate across ChatGPT, Perplexity, Claude, and Google AI Overviews). Every weight in this rubric maps to evidence we can cite. Last updated June 11, 2026.
Every Anvil scorecard reports a 0-100 score across four weighted channels. The channel weights are intentional: they reflect what actually moves AI citations, not what merely correlates with traditional search rankings.
| Channel | Weight | What it measures |
|---|---|---|
| Technical SEO | 30% | Title, meta, canonical, Open Graph, sitemap, soft-404 handling, H1. |
| AI Search | 30% | Schema markup, AI crawler access, content restructuring (statistics, BLUF, attributed quotes, citations), heading hierarchy, plus live Share of Model testing across golden prompts. |
| Content Quality | 20% | Internal linking, homepage word count, image alt text, blog/news section presence. |
| Platform Health | 20% | CMS detection, security headers, HTTPS, JS library hygiene, page load, HTML size, server header exposure. |
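As a rough illustration of how the four weights combine, the sketch below assumes each channel is scored on its own 0-100 scale and the overall score is a plain weighted average. The function name `overall_score` and the example channel scores are hypothetical; only the weights come from the table.

```python
# Channel weights from the table above, expressed as fractions of the overall score.
CHANNEL_WEIGHTS = {
    "Technical SEO": 0.30,
    "AI Search": 0.30,
    "Content Quality": 0.20,
    "Platform Health": 0.20,
}


def overall_score(channel_scores: dict[str, float]) -> float:
    """Weighted average of per-channel scores (assumed to be 0-100 each)."""
    missing = set(CHANNEL_WEIGHTS) - set(channel_scores)
    if missing:
        raise ValueError(f"missing channel scores: {sorted(missing)}")
    return sum(CHANNEL_WEIGHTS[name] * channel_scores[name] for name in CHANNEL_WEIGHTS)


# Hypothetical site: solid technical SEO, weak AI Search readiness.
example = {
    "Technical SEO": 80,
    "AI Search": 40,
    "Content Quality": 60,
    "Platform Health": 70,
}
print(round(overall_score(example), 1))  # 62.0
```

Under this assumption, a 10-point gain in AI Search or Technical SEO moves the overall score by 3 points, while the same gain in Content Quality or Platform Health moves it by 2.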
Citation behavior across ChatGPT, Perplexity, Claude, and Google AI Overviews has been studied extensively. The strongest signals, the ones we weight most heavily inside the AI Search channel, are the ones with empirical evidence behind them, not the ones that sound good in a marketing deck.
Four findings from 2026 research now shape how we weight and sequence work:
Within the AI Search channel (30% of the overall score), we weight the individual checks by the strength of their evidence base. The full breakdown:
| Check | Points (of 100) | Why this weight |
|---|---|---|
| JSON-LD Schema | 30 | Highest-impact lever per current research. We score richness (basic schema vs. domain-specific @types) and depth (multiple types beat a single type). |
| AI Crawler Access | 20 | robots.txt rules for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended. If these user agents are blocked, the corresponding AI engines cannot crawl or use your content. |
| Statistics Density | 10 | +30-41% AI visibility per Princeton GEO-bench and 2026 follow-up studies. Counts distinct numeric data points (percentages, dollar amounts, multipliers) on the homepage. |
| Authoritative Citations | 8 | Outbound links to .edu, .gov, journals, and recognized authorities. Drives the co-citation patterns LLMs use to weight credibility. |
| Heading Hierarchy | 7 | H1 > H2 > H3 structure. Useful as a structural signal but low impact in isolation. |
| Attributed Quotations | 5 | +28-32% AI visibility per Princeton GEO-bench and 2026 follow-up studies. Counts blockquotes and attributed text patterns (quote, name, role). |
| BLUF Architecture | 5 | "Bottom Line Up Front." AI engines extract from the first 40-100 words of a page, and self-contained passages of ~134-167 words that fully answer one query are ~4.2x more likely to be cited. Soft narrative openers ("Welcome to...", "For over X years...") get penalized. |
| llms.txt | 3 | Forward-looking proposed standard. We score it because it's a free win, but at minimal weight. See below. |
| AI Citation Testing (live) | 12 | Live testing on ChatGPT, Perplexity, Claude, and Google AI Overviews across 5-10 golden prompts, run as part of the 24-hour scorecard process. Deeper multi-engine tracking continues in paid engagements. |
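The AI Crawler Access check in the table can be approximated with Python's standard `urllib.robotparser`. This is a minimal sketch, not Anvil's scorer: the helper name `ai_crawler_access`, the sample robots.txt, and the `https://example.com/` URL are assumptions for illustration, and how the result maps onto the 20 points is not specified here.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the AI Crawler Access row.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def ai_crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Report, per AI user agent, whether robots.txt allows fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory text, no network call
    return {agent: parser.can_fetch(agent, url) for agent in AI_USER_AGENTS}


# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(ai_crawler_access(sample))
```

For this sample, GPTBot is reported as blocked and the other three agents as allowed, because they fall through to the wildcard rule.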
The flip side of an evidence-based rubric: we know what doesn't work, and we don't weight it.
The proposed /llms.txt standard has gotten significant attention in agency content over the last 18 months, often positioned as the "key" to AI search visibility. The evidence does not support that framing.
We score llms.txt at 3 of 100 points within the AI Search channel: present so we can confirm whether you have it, weighted accurately so its absence doesn't dominate your score. We recommend deploying it because it's free and useful for AI agents, never because it's a visibility lever.
Traditional Domain Authority scores (the Moz/Ahrefs metric) correlate with AI citation frequency at r=0.18, meaningful but weak. Only 12% of links AI engines cite rank in the traditional Google top 10. AI search and traditional search are diverging quickly; what gets you to position 1 in Google does not necessarily get you cited in ChatGPT.
Keyword optimization for ranking does not translate to citation likelihood in generative engines. Retrieval-augmented generation (RAG) systems match on semantic meaning, not n-gram frequency. The fix is content restructuring (statistics, quotations, BLUF), not keyword stuffing.
The four-channel score measures infrastructure: how citable your content is. Share of Model measures the outcome: how often AI engines actually cite you when buyers ask category-relevant questions.
For every paid engagement we design 5-20 "golden prompts" that represent real buyer queries in your category and run them across ChatGPT, Perplexity, Claude, and Google AI Overviews, three times each, monthly. The formula:
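One plausible reading of Share of Model, assumed here for illustration rather than taken from Anvil's own definition, is the percentage of prompt runs in which the brand is cited, pooled across engines, prompts, and repeats. The `PromptRun` type, the `share_of_model` function, and the sample data below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRun:
    engine: str        # e.g. "ChatGPT"
    prompt: str        # one golden prompt
    brand_cited: bool  # did the answer cite the brand on this run?


def share_of_model(runs: list[PromptRun]) -> float:
    """Assumed definition: cited runs / total runs, as a percentage."""
    if not runs:
        raise ValueError("at least one prompt run is required")
    cited = sum(run.brand_cited for run in runs)
    return 100.0 * cited / len(runs)


ENGINES = ["ChatGPT", "Perplexity", "Claude", "Google AI Overviews"]
prompts = [f"golden prompt {i}" for i in range(1, 6)]  # 5 hypothetical prompts

# Hypothetical month: 4 engines x 5 prompts x 3 repeats = 60 runs.
# The brand is cited only on the first Perplexity run of each prompt.
runs = [
    PromptRun(engine, prompt, brand_cited=(engine == "Perplexity" and attempt == 0))
    for engine in ENGINES
    for prompt in prompts
    for attempt in range(3)
]
print(f"{share_of_model(runs):.1f}%")  # 8.3%
```

Running each prompt three times per engine matters because generative answers vary between runs; pooling the repeats smooths that variance before month-over-month comparison.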
A scorecard that goes from 40 to 75 means the infrastructure is now built for AI visibility. A SoM that goes from 8% to 24% means buyers are actually finding you when they ask. We track both. We optimize for the second.
The rubric is built so two scorers assessing the same site arrive at the same number within 2-3 points. Every criterion is binary, tiered, or countable. No subjective assessments. If we cannot measure it or verify it from the live site or a standard tool, it does not count toward the score.
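The three criterion types can each be written as a deterministic function, which is what lets two scorers land on the same number. The sketch below is illustrative only: the function names, the tier values, and the per-item points are hypothetical, and only the maximum points (3, 30, and 10) come from the AI Search table.

```python
def score_binary(passed: bool, max_points: float) -> float:
    """Binary criterion: full points if the check passes, otherwise zero."""
    return max_points if passed else 0.0


def score_tiered(tier: int, tier_points: list[float]) -> float:
    """Tiered criterion: `tier` indexes a fixed points ladder (0 = lowest)."""
    return tier_points[tier]


def score_countable(count: int, points_per_item: float, max_points: float) -> float:
    """Countable criterion: points per observed item, capped at the maximum."""
    return min(max_points, count * points_per_item)


# Hypothetical site assessed on three AI Search checks.
llms_txt = score_binary(passed=True, max_points=3)                # 3.0
schema = score_tiered(tier=1, tier_points=[0.0, 15.0, 30.0])      # 15.0 (assumed ladder)
statistics = score_countable(7, points_per_item=2.0, max_points=10)  # 10.0 (assumed rate)
print(llms_txt + schema + statistics)  # 28.0
```

Because every input is an observable fact about the live site (a file exists, a tier is met, a count is taken), disagreement between scorers should come only from counting errors, not from judgment.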
We update the rubric when (a) a criterion proves unreliable across 5+ independent scorings, (b) a new AI search platform becomes commercially relevant, (c) data shows a criterion has no correlation with real visibility improvement, or (d) market standards shift materially. Every change is dated and documented in our internal rubric file. Old criteria are archived as "deprecated" rather than deleted, so historical scores remain interpretable.
This is also why scores can shift when we update the rubric, even on a site that hasn't changed. When we recalibrate, we re-run baselines and disclose the methodology change. We do not silently restate prior scores.