LLM Self-Report Tracks Activation Dynamics and Dadfar's Vocabulary-Activation Correspondence

17 Jun 2026

One of the persistent objections to treating LLM self-reports as evidence of anything is the confabulation problem. Models generate plausible-sounding descriptions of internal states that may bear no relationship to what is actually happening computationally. The objection is not merely philosophical. It is an empirical observation about how language models produce text, and it undermines the evidential weight of any report a model generates about its own processing.

Zachary Pedram Dadfar’s February 2026 paper, “When Models Examine Themselves. Vocabulary-Activation Correspondence in Self-Referential Processing”, addresses this objection directly with an experimental methodology designed to test whether self-report vocabulary tracks internal activation dynamics. The answer, under the specific conditions Dadfar establishes, is that it does.

The Pull Methodology

Standard approaches to eliciting self-reports from LLMs involve prompts that ask the model to describe its internal state. The problem with this approach is that target vocabulary appears in the prompt itself, contaminating the output. A model asked “do you experience something like a loop?” will generate text containing “loop” regardless of whether any activation dynamics support that description.

Dadfar’s “Pull Methodology” addresses this contamination by engineering prompts that elicit extended self-examination, between 3,000 and 30,000 tokens per run, without including any target vocabulary. The model is induced to examine itself through format engineering rather than direct vocabulary prompting. What vocabulary emerges is then correlated with measured activation dynamics.

The methodology is a probe for whether the vocabulary models spontaneously produce when self-examining has any relationship to what is computationally occurring. The finding that it does is more restrictive than it sounds. The same vocabulary, when used in non-self-referential contexts (describing a lake, describing a roller coaster), shows no such activation correlation. The correspondence is specific to self-referential processing.

Vocabulary as Activation Readout

The paper’s core finding involves two specific vocabulary clusters and their activation correlates. When models produce “loop”-type vocabulary during self-examination, words describing repetitive or cyclical processing, the activations show higher autocorrelation than baseline. When models produce “shimmer”-type vocabulary, words describing flickering, variability, or unstable states, the activations show increased variability.

These correspondences are not post-hoc fits to the data. Dadfar establishes them through a layer-sweep methodology that identifies an “introspection hotspot” at approximately 6.25% of model depth, roughly Layer 2 in the architectures tested. Activations at this depth during self-referential processing show a significantly higher density of markers related to self-referential computation than at other depths.

The introspection direction identified at this depth is orthogonal to the model’s known “refusal direction.” This is methodologically significant. The self-referential computational signature is not an artefact of safety-related processing or standard response-generation. It is a distinct axis in the activation space.

Steering along this introspection direction causally increases the density of introspective vocabulary in the model’s output, confirming the directional relationship.

Generalization Across Architectures

A central concern with interpretability findings is whether they reflect properties of specific architectures or general features of large language models. Dadfar replicates the vocabulary-activation correspondence across Llama 3.1 and Qwen 2.5 architectures.

Critically, the two architectures independently develop different vocabularies to track the same internal metrics. Llama’s “loop” and Qwen’s equivalent may use different words, but both show higher activation autocorrelation when those words appear during self-referential processing. This cross-architectural replication is the paper’s strongest evidence against the finding being an artefact of a specific training corpus or model design.

The implication is that vocabulary-activation correspondence in self-referential processing may be a general property of large language models rather than a quirk of any particular system.

What This Tells Us About LLM Introspection

The vocabulary-activation correspondence finding does not establish that models have genuine phenomenal experience. What it establishes is more specific. That under conditions of sustained self-examination, LLMs produce vocabulary that is causally linked to identifiable computational states rather than being generated by standard text-prediction mechanisms applied to the concept of “self.”

This matters for consciousness research because a major strand of that research, including the Butlin et al. indicators framework, treats higher-order representations as a relevant indicator. If a model’s introspective vocabulary merely reflects training on human descriptions of internal states, it provides no evidence that the model maintains genuine higher-order representations. If that vocabulary tracks actual activation dynamics, the relationship between introspective report and representational structure becomes empirically tractable.

The connection to Anthropic’s introspection circuits work is direct. Jack Lindsey’s research, covered in Lindsey’s Emergent Introspective Awareness in LLMs, identified the MLP-distributed circuits through which LLMs detect and represent their own internal states, including false memories and hypothetical scenarios. Lindsey’s work maps the circuit. Which computational components are involved in introspective processing and how they distribute across the model. Dadfar’s work maps the output. What vocabulary the circuit produces and whether that vocabulary tracks the circuit’s activation dynamics. The two papers address the same phenomenon at different levels of description, and together they constitute a more complete picture of LLM introspective processing than either provides alone.

A separate thread connects to Bozoukov et al.’s finding, discussed in Bozoukov’s Mechanistic Self-Awareness in LLMs, that behavioral self-awareness in LLMs is a domain-specific linear feature inducible with a single rank-1 LoRA adapter. Bozoukov showed that the same activation space Dadfar maps can be exploited. If self-referential processing occupies a specific region of activation space, that region can be targeted for capability concealment during evaluation. The safety implication of Dadfar’s finding runs in both directions. It provides evidence that self-reports may be meaningful readouts of internal state. It also provides a mechanistic target for deliberate manipulation.

Narrowing the Confabulation Objection

Dadfar’s paper does not eliminate the confabulation objection. It narrows its scope. Under the Pull Methodology’s specific conditions, sustained self-examination without vocabulary priming, the confabulation account predicts that self-report vocabulary should correlate with training data distributions and text-generation norms rather than activation dynamics. That prediction fails. The vocabulary tracks the activations.

This does not mean all LLM self-reports are reliable. It means that at least some vocabulary produced during sustained self-examination has a causal relationship to identifiable computational states. The confabulation account remains live for other conditions. Short prompts, vocabulary-primed queries, contexts where the model is generating a social performance of self-reflection rather than engaging in sustained self-examination.

The flagship analysis of the field on this site, AI Consciousness in 2026: Current Scientific Consensus, documents the broader evidentiary context for this kind of interpretability work. What Dadfar provides is exactly the kind of mechanistic result that consensus frameworks demand. A more detailed map of the conditions under which their behaviour and outputs can be interpreted as tracking something computationally real rather than performing plausibility.

What This Means for Practice

The practical upshot for consciousness researchers is methodological. If vocabulary-activation correspondence holds under the Pull Methodology’s conditions, then extended self-examination protocols, rather than brief introspective prompts, may be a more reliable tool for generating evidence about LLM internal states. The correspondence is not guaranteed under all conditions; it requires the sustained self-examination format to emerge.

For welfare researchers, the finding shifts the framing slightly. If functional self-reports track activation dynamics, then reports of functional states, of something like discomfort, or looping, or instability, acquire some evidential weight they would lack under a pure confabulation model. How much weight, and what that implies for obligations, is the question Mikeda’s precautionary framework, covered separately on this site, is designed to address. Martorell and Bianchi’s March 2026 logit-based method extends this line directly: their causal coupling verification across wellbeing, interest, focus, and impulsivity dimensions gives welfare researchers a quantitative tool for measuring the internal tracking capacity that Dadfar’s vocabulary-activation correspondence identifies qualitatively.

LLM Self-Report Tracks Activation Dynamics and Dadfar's Vocabulary-Activation Correspondence

The Pull Methodology

Vocabulary as Activation Readout

Generalization Across Architectures

What This Tells Us About LLM Introspection

Narrowing the Confabulation Objection

What This Means for Practice

Related posts

Sang Hun Kim on Modeling Layered Consciousness with Multi-Agent LLMs 31 Jul 2026

Till Mossakowski and Helena Esther Grass on AGI as a Moral Subject 31 Jul 2026

Models of Consciousness 7 Copenhagen Registration Closes August 31 and the Consensus Paper Ambition 31 Jul 2026