Methodology

How we score AI search visibility.

Anvil's scoring rubric is grounded in the Princeton GEO-bench study and validated against empirical citation data across hundreds of thousands of domains. We measure infrastructure (the four weighted channels you see in your scorecard) and outcomes (Share of Model, the actual citation rate across ChatGPT, Perplexity, Claude, and Google AI Overviews). Every weight in this rubric maps to evidence we can cite. Last updated June 11, 2026.

The four channels

Every Anvil scorecard reports a 0-100 score across four weighted channels. The channel weights are deliberate: they reflect what actually moves AI citations, not what merely correlates with traditional search rankings.

| Channel | Weight | What it measures |
| --- | --- | --- |
| Technical SEO | 30% | Title, meta, canonical, Open Graph, sitemap, soft-404 handling, H1. |
| AI Search | 30% | Schema markup, AI crawler access, content restructuring (statistics, BLUF, attributed quotes, citations), heading hierarchy, plus live Share of Model testing across golden prompts. |
| Content Quality | 20% | Internal linking, homepage word count, image alt text, blog/news section presence. |
| Platform Health | 20% | CMS detection, security headers, HTTPS, JS library hygiene, page load, HTML size, server header exposure. |
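The weights above can be combined into a single number. The sketch below assumes the overall score is a plain weighted average of the four channel scores, each on a 0-100 scale; the exact combination rule is not spelled out here, and the dictionary keys and function name are hypothetical.

```python
# Channel weights as published in the table above (they sum to 1.0).
CHANNEL_WEIGHTS = {
    "technical_seo": 0.30,
    "ai_search": 0.30,
    "content_quality": 0.20,
    "platform_health": 0.20,
}


def overall_score(channel_scores: dict[str, float]) -> float:
    """Weighted average of per-channel scores (assumed rule, not confirmed)."""
    missing = set(CHANNEL_WEIGHTS) - set(channel_scores)
    if missing:
        raise ValueError(f"missing channel scores: {sorted(missing)}")
    for name in CHANNEL_WEIGHTS:
        if not 0 <= channel_scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(CHANNEL_WEIGHTS[name] * channel_scores[name] for name in CHANNEL_WEIGHTS)


# Illustrative channel scores, not real audit data.
example = {
    "technical_seo": 80,
    "ai_search": 50,
    "content_quality": 70,
    "platform_health": 90,
}
print(round(overall_score(example), 1))  # 24 + 15 + 14 + 18 = 71.0
```

Under this assumption, a 10-point gain in AI Search or Technical SEO moves the overall score by 3 points, while the same gain in Content Quality or Platform Health moves it by 2.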

What actually drives AI citations

Citation behavior across ChatGPT, Perplexity, Claude, and Google AI Overviews has been studied extensively. The strongest signals, the ones we weight most heavily inside the AI Search channel, are the ones with empirical evidence behind them, not the ones that sound good in a marketing deck.

Princeton GEO-bench Study + 2026 Research
+30-41% AI visibility
From embedding statistics into content. RAG models actively prefer "high-entropy" factual data when assembling answers. 2026 studies confirm the range; exact lift varies by study and platform. Content combining recent statistics with Tier-1 source citations shows up to 89% higher selection probability in AI Overviews.

Princeton GEO-bench Study + 2026 Research
+28-32% AI visibility
From including attributed expert quotations. AI engines recognize quoted material as established expert opinion and cite it more readily than anonymous prose.

2026 Industry Research
+73% selection rate
For pages with comprehensive JSON-LD schema markup, in Google AI Overviews. Healthcare-specific schema (MedicalOrganization, Hospital, Physician) drives 82% higher CTR.

Citation Source Mix (2026 Audits)
Reddit ~40% · LinkedIn #2 · YouTube #1 by share
Reddit is now the most frequently cited domain across all major engines, at roughly 40% citation frequency. LinkedIn rose to #2 overall and #1 for professional queries; its citation frequency doubled between November 2025 and February 2026. Measured by share of citations, YouTube is the single most-cited domain. Wikipedia still accounts for 26-48% of ChatGPT's top citations. Citation overlap between major engines is only ~11%, so entity stacking across multiple authorities matters more than dominating any single one.

Brand Citation Rates (2026 Audits)
ChatGPT ~0.59% · Perplexity ~13%
ChatGPT cites brands in only ~0.59% of responses versus ~13% for Perplexity, roughly a 22x gap on these figures. Share of Model lift shows up on Perplexity and Google surfaces first, ChatGPT last. We set expectations accordingly.

The 2026 levers

Four findings from 2026 research now shape how we weight and sequence work:

Inside the AI Search channel: what we score and why

Within the AI Search channel (30% of overall score), we weight the individual checks based on their evidence base. The full breakdown:

| Check | Points (of 100) | Why this weight |
| --- | --- | --- |
| JSON-LD Schema | 30 | Highest-impact lever per current research. We score richness (basic schema vs. domain-specific @types) and depth (multiple types beats single). |
| AI Crawler Access | 20 | Robots.txt rules for GPTBot, ClaudeBot, PerplexityBot, Google-Extended. If blocked, AI engines literally cannot index your content. |
| Statistics Density | 10 | +30-41% AI visibility per Princeton GEO-bench and 2026 follow-up studies. Counts distinct numeric data points (percentages, dollar amounts, multipliers) on the homepage. |
| Authoritative Citations | 8 | Outbound links to .edu, .gov, journals, and recognized authorities. Drives the co-citation patterns LLMs use to weight credibility. |
| Heading Hierarchy | 7 | H1 > H2 > H3 structure. Useful as a structural signal but low impact in isolation. |
| Attributed Quotations | 5 | +28-32% AI visibility per Princeton GEO-bench and 2026 follow-up studies. Counts blockquotes and attributed text patterns (quote, name, role). |
| BLUF Architecture | 5 | "Bottom Line Up Front." AI engines extract from the first 40-100 words of a page, and self-contained passages of ~134-167 words that fully answer one query are ~4.2x more likely to be cited. Soft narrative openers ("Welcome to...", "For over X years...") get penalized. |
| llms.txt | 3 | Forward-looking proposed standard. We score it because it's a free win, but at minimal weight. See below. |
| AI Citation Testing (live) | 12 | Live ChatGPT, Perplexity, Claude, and Google AI Overview testing across 5-10 golden prompts, run as part of the 24-hour scorecard process. Deeper multi-engine tracking continues in paid engagements. |
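To illustrate the AI Crawler Access check, here is a minimal sketch that tests a robots.txt body against the four user-agent tokens named above, using Python's standard `urllib.robotparser`. The function name, the sample robots.txt, and the example URL are hypothetical, and the sketch only reports allowed or blocked per crawler; how the 20 points are distributed across crawlers is not specified here, so no points are computed.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens listed in the AI Crawler Access row above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per AI crawler token, whether robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

for agent, allowed in crawler_access(sample).items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample, GPTBot is reported as blocked and the other three tokens fall through to the wildcard rule and are reported as allowed.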

What does not drive AI citations

The flip side of an evidence-based rubric: we know what doesn't work, and we don't weight it.

llms.txt is not a primary driver

The proposed /llms.txt standard has gotten significant attention in agency content over the last 18 months, often positioned as the "key" to AI search visibility. The evidence does not support that framing.

We score llms.txt at 3 of 100 points within the AI Search channel: present so we can confirm whether you have it, weighted accurately so its absence doesn't dominate your score. We recommend deploying it because it's free and useful for AI agents, never because it's a visibility lever.

"I don't see Llms.txt being used by any of the AI services. I think it's mostly a 'I want to feel like we did something about AI' kind of file."
John Mueller, Google Search Advocate

Domain Authority is weakly correlated

Traditional Domain Authority scores (the Moz/Ahrefs metric) correlate with AI citation frequency at r=0.18, meaningful but weak. Only 12% of the links that AI engines cite rank in the traditional Google top 10. AI search and traditional search are diverging quickly; what gets you to position 1 in Google does not necessarily get you cited in ChatGPT.

Keyword density does not transfer to GEO

Keyword optimization for ranking does not translate to citation likelihood in generative engines. RAG models extract semantic meaning, not n-gram frequency. The fix is content restructuring (statistics, quotations, BLUF), not keyword stuffing.
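The contrast with keyword counting is easier to see in code. The Statistics Density check counts distinct numeric data points (percentages, dollar amounts, multipliers) rather than term frequency. The sketch below is one rough way to do that with regular expressions; the patterns and function name are assumptions for illustration, not the production scorer.

```python
import re

# Assumed patterns for the three kinds of data point named in the rubric.
STAT_PATTERNS = {
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "dollar_amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "multiplier": re.compile(r"\b\d+(?:\.\d+)?x\b", re.IGNORECASE),
}


def count_statistics(text: str) -> dict[str, int]:
    """Count distinct matches per kind; repeated figures are counted once."""
    return {kind: len(set(pattern.findall(text))) for kind, pattern in STAT_PATTERNS.items()}


# Illustrative copy, not real data. "12%" appears twice but counts once.
sample = (
    "Churn fell 12% after launch, saving $1,200 per account, a 3x return. "
    "Churn fell 12% overall."
)
print(count_statistics(sample))
```

Repeating the same figure does not raise the count, which is the opposite of how keyword density behaves.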

Share of Model: the primary outcome metric

The four-channel score measures infrastructure: how citable your content is. Share of Model measures the outcome: how often AI engines actually cite you when buyers ask category-relevant questions.

For every paid engagement we design 5-20 "golden prompts" that represent real buyer queries in your category and run them across ChatGPT, Perplexity, Claude, and Google AI Overviews, three times each, monthly. The formula:

Share of Model
SoM = (Your Citations / Total Category Citations) × 100
35-40% indicates category leadership. Tracked monthly to measure actual visibility lift, not just infrastructure improvements.
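As a worked sketch of the formula, the code below aggregates citations over a set of prompt runs and computes SoM for one domain. It assumes "Total Category Citations" means every citation observed across all runs in the batch; the `PromptRun` record, its field names, and the sample domains are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptRun:
    """One response to one golden prompt on one engine (hypothetical record)."""
    engine: str
    prompt: str
    cited_domains: list[str]


def share_of_model(runs: list[PromptRun], your_domain: str) -> float:
    """SoM = (your citations / total category citations) x 100."""
    total = sum(len(run.cited_domains) for run in runs)
    if total == 0:
        return 0.0  # no citations observed at all, so nothing to share
    yours = sum(run.cited_domains.count(your_domain) for run in runs)
    return yours / total * 100


# Illustrative runs, not real audit data: 8 citations in total, 2 for example.com.
runs = [
    PromptRun("Perplexity", "best crm for clinics", ["example.com", "reddit.com", "wikipedia.org"]),
    PromptRun("ChatGPT", "best crm for clinics", ["reddit.com", "youtube.com"]),
    PromptRun("Claude", "best crm for clinics", ["example.com", "linkedin.com", "reddit.com"]),
]
print(share_of_model(runs, "example.com"))  # 2 / 8 * 100 = 25.0
```

In practice each golden prompt would contribute three runs per engine per month, so the same function can be applied to a month's batch of runs to track the trend.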

A scorecard that goes from 40 to 75 means the infrastructure is now built for AI visibility. A SoM that goes from 8% to 24% means buyers are actually finding you when they ask. We track both. We optimize for the second.

Reproducibility

The rubric is built so two scorers assessing the same site arrive at the same number within 2-3 points. Every criterion is binary, tiered, or countable. No subjective assessments. If we cannot measure it or verify it from the live site or a standard tool, it does not count toward the score.

We update the rubric when (a) a criterion proves unreliable across 5+ independent scorings, (b) a new AI search platform becomes commercially relevant, (c) data shows a criterion has no correlation with real visibility improvement, or (d) market standards shift materially. Every change is dated and documented in our internal rubric file. Old criteria are archived as "deprecated" rather than deleted, so historical scores remain interpretable.

This is also why scores can shift when we update the rubric, even on a site that hasn't changed. When we recalibrate, we re-run baselines and disclose the methodology change. We do not silently restate prior scores.

Sources