Methodology

How we score AI search visibility.

Anvil's scoring rubric is grounded in the Princeton GEO-bench study and validated against empirical citation data across hundreds of thousands of domains. We measure infrastructure (the four weighted channels you see in your scorecard) and outcomes (Share of Model, the actual citation rate across ChatGPT, Perplexity, Claude, and Google AI Overviews). Every weight in this rubric maps to evidence we can cite. Last updated June 11, 2026.

The four channels

Every Anvil scorecard reports a 0-100 score across four weighted channels. The category weights are intentional and based on what actually moves AI citations vs. what only correlates with traditional search rankings.

Channel	Weight	What it measures
Technical SEO	30%	Title, meta, canonical, Open Graph, sitemap, soft-404 handling, H1.
AI Search	30%	Schema markup, AI crawler access, content restructuring for extractability, heading hierarchy, plus live Share of Model testing across golden prompts.
Content Quality	20%	Internal linking, homepage word count, image alt text, blog/news section presence.
Platform Health	20%	CMS detection, security headers, HTTPS, JS library hygiene, page load, HTML size, server header exposure.
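As a rough illustration of how the weights in the table combine, the sketch below assumes each channel is scored 0-100 on its own and the overall score is a plain weighted sum. The function name, dictionary keys, and example channel scores are hypothetical; the exact per-check point values are delivered with the scorecard and are not reproduced here.

```python
# Channel weights taken from the table above; they sum to 1.0.
CHANNEL_WEIGHTS = {
    "technical_seo": 0.30,
    "ai_search": 0.30,
    "content_quality": 0.20,
    "platform_health": 0.20,
}


def overall_score(channel_scores: dict[str, float]) -> float:
    """Combine per-channel 0-100 scores into one weighted 0-100 score.

    Assumption: the overall score is a plain weighted sum of channel scores.
    """
    missing = set(CHANNEL_WEIGHTS) - set(channel_scores)
    if missing:
        raise ValueError(f"missing channel scores: {sorted(missing)}")
    for name in CHANNEL_WEIGHTS:
        if not 0 <= channel_scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    total = sum(CHANNEL_WEIGHTS[name] * channel_scores[name] for name in CHANNEL_WEIGHTS)
    return round(total, 1)


# Hypothetical example scores, not real audit data.
example = {"technical_seo": 80, "ai_search": 50, "content_quality": 70, "platform_health": 90}
print(overall_score(example))
```

Because Technical SEO and AI Search each carry 30%, a 10-point gain in either moves the overall score by 3 points, versus 2 points for the same gain in Content Quality or Platform Health.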

What actually drives AI citations

Citation behavior across ChatGPT, Perplexity, Claude, and Google AI Overviews has been studied extensively. The strongest signals, the ones we weight most heavily inside the AI Search channel, are the ones with empirical evidence behind them, not the ones that sound good in a marketing deck.

Princeton GEO-bench Study + 2026 Research

Statistics and quotations win

Embedding concrete statistics and attributed expert quotations measurably increases AI visibility. RAG models prefer "high-entropy" factual data and recognize quoted material as established expert opinion, so they cite it more readily than anonymous prose. Comprehensive schema markup and recent, well-cited content compound the effect.

Citation Source Mix (2026 Audits)

Reddit ~40% · LinkedIn #2 · YouTube #1 by share

Reddit is now the most-cited domain across all major engines at roughly 40% citation frequency. LinkedIn rose to #2 overall and #1 for professional queries; its citation frequency doubled between November 2025 and February 2026. YouTube is the single most-cited domain by share. Wikipedia still accounts for 26-48% of ChatGPT's top citations. Citation overlap between major engines is only ~11%, so entity stacking across multiple authorities matters more than dominating any single one.

Brand Citation Rates (2026 Audits)

ChatGPT ~0.59% · Perplexity ~13%

ChatGPT cites brands in only ~0.59% of responses versus ~13% for Perplexity, roughly a 22x gap. Share of Model lift shows up on Perplexity and Google surfaces first and on ChatGPT last. We set expectations accordingly.

The 2026 levers

Four findings from 2026 research now shape how we weight and sequence work:

Content freshness. AI citations decay sharply once content is roughly 3 months old, and new content can enter citation pools within 3-5 business days. Publishing velocity and refresh cadence are ranking levers, not nice-to-haves.
Third-party listicle placement. Ranking and list-format pages capture a large share of AI citations. Getting onto credible "best X in Y" lists is a direct citation lever.
Query fan-out coverage. The correlation between Google AI Overviews and top-10 organic results collapsed from ~94% in 2025 to 17-38% by early 2026. Google now splits a query into sub-queries and cites pages that appear across them, so topic-cluster coverage beats optimizing one page for one head term. The May 2026 update added inline citations next to specific claims, hover previews, and an Expert Advice block. Cited pages earn ~35% more organic clicks; non-cited pages lose ~61% of CTR when an AI Overview triggers.
FAQ schema markup, downgraded. The markup itself has declining citation value on Google surfaces. FAQ content, meaning real buyer questions answered in extractable blocks, still works.
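The content freshness lever above can be turned into a simple refresh queue. The sketch below assumes "roughly 3 months" is approximated as 90 days and that a last-updated date is known for each page; the function name, URLs, and dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "roughly 3 months" is approximated as 90 days.
FRESHNESS_WINDOW = timedelta(days=90)


def stale_pages(last_updated: dict[str, date], today: date) -> list[str]:
    """Return URLs whose last update is older than the freshness window."""
    cutoff = today - FRESHNESS_WINDOW
    return sorted(url for url, updated in last_updated.items() if updated < cutoff)


# Hypothetical pages and dates for illustration only.
pages = {
    "/blog/pricing-guide": date(2026, 5, 20),
    "/blog/2025-roundup": date(2025, 12, 1),
    "/resources/faq": date(2026, 2, 1),
}
print(stale_pages(pages, today=date(2026, 6, 11)))
```

With the dates shown, the two pages last updated before mid-March 2026 are flagged for refresh and the May page is not.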

Inside the AI Search channel: what we weight

Within the AI Search channel (30% of overall score), we weight schema and AI-crawler access most heavily, followed by content structure and live Share-of-Model testing across golden prompts. The full weighted rubric, with the exact per-check point values, is delivered with your scorecard.
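AI-crawler access is one of the checks that can be verified directly from the live site. A minimal sketch, using Python's standard urllib.robotparser to test a robots.txt body against a handful of AI crawler user-agent tokens: the token list and the function name are assumptions for illustration, not Anvil's actual checklist.

```python
from urllib.robotparser import RobotFileParser

# Assumed list of AI crawler user-agent tokens; not Anvil's actual checklist.
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]


def ai_crawler_access(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Report whether each AI crawler may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in AI_CRAWLERS}


# Hypothetical robots.txt that blocks one AI crawler and allows everything else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(ai_crawler_access(sample))
```

In this hypothetical file GPTBot is blocked while the other crawlers fall through to the wildcard rule, which is the kind of partial block that is easy to miss by eye.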

What does not drive AI citations

The flip side of an evidence-based rubric: we know what doesn't work, and we don't weight it.

llms.txt is not a primary driver

The proposed /llms.txt standard has gotten significant attention in agency content over the last 18 months, often positioned as the "key" to AI search visibility. The evidence does not support that framing.

Monitoring of more than 500 million AI bot visits found only 408 llms.txt fetches, roughly 1 per 1.2 million requests.
SE Ranking's prediction model across 300,000 domains improved in accuracy when llms.txt was removed as a variable. Real signals make models better; noise makes them worse.
Google confirmed it will not support llms.txt (Gary Illyes, July 2025). John Mueller compared it to the deprecated keywords meta tag.
8 of 9 sites saw no traffic change after implementing it. No major AI company has committed to reading it in production.

We score llms.txt at minimal weight within the AI Search channel: present so we can confirm whether you have it, weighted accurately so its absence doesn't dominate your score. We recommend deploying it because it's free and useful for AI agents, never because it's a visibility lever.

"I don't see Llms.txt being used by any of the AI services. I think it's mostly a 'I want to feel like we did something about AI' kind of file." John Mueller, Google Search Advocate

Domain Authority is weakly correlated

Traditional Domain Authority scores (the Moz/Ahrefs metric) correlate with AI citation frequency at r=0.18, meaningful but weak. Only 12% of the links that AI engines cite rank in the traditional Google top 10. AI search and traditional search are diverging quickly; what gets you to position 1 in Google does not necessarily get you cited in ChatGPT.

Keyword density does not transfer to GEO

Keyword optimization for ranking does not translate to citation likelihood in generative engines. RAG models extract semantic meaning, not n-gram frequency. The fix is restructuring content for extractability, not keyword stuffing.

Share of Model: the primary outcome metric

The four-channel score measures infrastructure: how citable your content is. Share of Model measures the outcome: how often AI engines actually cite you when buyers ask category-relevant questions.

For every paid engagement we design 5-20 "golden prompts" that represent real buyer queries in your category and run them across ChatGPT, Perplexity, Claude, and Google AI Overviews, three times each, monthly. The formula:

Share of Model

SoM = (Your Citations / Total Category Citations) × 100

A SoM of 35-40% indicates category leadership. We track it monthly to measure actual visibility lift, not just infrastructure improvements.
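To make the formula concrete: with 5-20 golden prompts, four engines, and three runs each, a monthly cycle produces 60-240 responses, and every brand citation in those responses is tallied. The sketch below assumes the tally is a flat list with one entry per observed citation; the function name, brand names, and counts are hypothetical.

```python
from collections import Counter


def share_of_model(citations: list[str], brand: str) -> float:
    """SoM = (brand citations / total category citations) x 100."""
    counts = Counter(citations)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # No citations observed for anyone in the category.
    return round(counts[brand] / total * 100, 1)


# Each entry is one brand citation observed in one run of one golden prompt
# on one engine. Brand names and counts are hypothetical.
observed = ["acme"] * 6 + ["globex"] * 12 + ["initech"] * 7
print(share_of_model(observed, "acme"))  # 6 of 25 citations
```

Here the hypothetical brand holds 6 of 25 category citations, a SoM of 24%, which is short of the 35-40% leadership range.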

A scorecard that goes from 40 to 75 means the infrastructure is now built for AI visibility. A SoM that goes from 8% to 24% means buyers are actually finding you when they ask. We track both. We optimize for the second.

Reproducibility

The rubric is built so two scorers assessing the same site arrive at the same number within 2-3 points. Every criterion is binary, tiered, or countable. No subjective assessments. If we cannot measure it or verify it from the live site or a standard tool, it does not count toward the score.
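One way to picture the three criterion types is as small scoring rules that take a measured value and return points, with no room for judgment. This is a sketch only: the class names, thresholds, and point values are hypothetical and are not taken from Anvil's rubric file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    """Pass/fail check, e.g. whether HTTPS is enforced."""
    points: float

    def score(self, passed: bool) -> float:
        return self.points if passed else 0.0


@dataclass(frozen=True)
class Tiered:
    """(threshold, points) pairs; the best point value among met thresholds is awarded."""
    tiers: tuple[tuple[float, float], ...]

    def score(self, value: float) -> float:
        earned = [pts for threshold, pts in self.tiers if value >= threshold]
        return max(earned, default=0.0)


@dataclass(frozen=True)
class Countable:
    """Points per counted item, up to a cap."""
    points_each: float
    cap: float

    def score(self, count: int) -> float:
        return min(count * self.points_each, self.cap)


# Hypothetical checks and measurements for illustration only.
https_check = Binary(points=5)
word_count = Tiered(tiers=((300, 2), (800, 4)))
internal_links = Countable(points_each=0.5, cap=3)

total = https_check.score(True) + word_count.score(650) + internal_links.score(10)
print(total)
```

Because each rule is a pure function of a measured value, two scorers who record the same measurements get the same points; any remaining spread comes from the measurements themselves.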

We update the rubric when (a) a criterion proves unreliable across 5+ independent scorings, (b) a new AI search platform becomes commercially relevant, (c) data shows a criterion has no correlation with real visibility improvement, or (d) market standards shift materially. Every change is dated and documented in our internal rubric file. Old criteria are archived as "deprecated" rather than deleted, so historical scores remain interpretable.

This is also why scores can shift when we update the rubric, even on a site that hasn't changed. When we recalibrate, we re-run baselines and disclose the methodology change. We do not silently restate prior scores.

Sources

Princeton GEO-bench Study: generative engine optimization benchmarks across major AI search engines.
Mueller, John (Google Search Advocate). Public statements on llms.txt across Bluesky and LinkedIn, 2025-2026.
Illyes, Gary (Google). Public confirmation that Google will not support llms.txt, July 2025.
Industry citation analysis covering 500M+ AI bot visits (408 llms.txt fetches) and SE Ranking's prediction model across 300,000 domains.
Schema.org. Structured data vocabulary for AI consumption.
Anvil internal rubric (operations/scoring-rubric.md, v1.1, April 2026). Full point-by-point criteria.