Can LLMs Introspect? Singh, Linzen, and Ravfogel Challenge the Evidence
When Anthropic researchers published evidence of emergent introspective awareness in large language models earlier this year, the work drew serious attention across philosophy and AI research. The methodology, built around steering vectors and intervention detection, appeared to show that models could detect when their internal states had been manipulated. Shashwat Singh, Tal Linzen, and Shauli Ravfogel are not convinced. Their May 2026 arXiv preprint, “Can LLMs Introspect? A Reality Check” (arXiv:2605.26242), directly reexamines the evidentiary foundations of this research programme and concludes that behavioural evidence alone is insufficient to support any claim of genuine model introspection.
The paper is narrow in scope but precise in its target. It does not argue that LLMs are definitively non-introspective. It argues that the two most commonly used evaluation paradigms, intervention detection and hidden state prediction, are structurally incapable of establishing what researchers want them to establish.
The Two Paradigms Under Scrutiny
The first paradigm, intervention detection, tests whether a model can identify when its internal states have been modified through external steering. If a model can reliably report that something has changed in its processing after a steering vector has been applied, the argument runs, it must have some access to its own internal representations.
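As a rough illustration of what "applying a steering vector" means, the sketch below adds a scaled direction to a toy activation. It is a minimal sketch under stated assumptions: the function name `steer`, the scale `alpha`, the normalisation step, and the three-dimensional vectors are all hypothetical and are not taken from the Anthropic work or from the preprint.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) without modifying the inputs."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy values, invented for illustration only.
hidden_state = [0.5, -1.0, 2.0]       # stand-in for one activation vector
concept_direction = [3.0, 0.0, 4.0]   # hypothetical steering direction
steered = steer(hidden_state, concept_direction, alpha=2.0)
print(steered)
```

In an intervention detection experiment, the model would then be asked whether it notices anything unusual, and its answer compared across steered and unsteered runs.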
Singh, Linzen, and Ravfogel find a confound at the core of this design. When they ran controlled experiments, models could not reliably distinguish between interventions applied to their internal states and simple manipulations of the input. The models were detecting anomalies, but the anomaly could have been in the prompt rather than in the activations. The methodology, as currently implemented, cannot separate these two possibilities. What looks like self-monitoring may be general anomaly detection applied to the surface of the interaction.
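The confound can be made concrete with a small scoring sketch. It assumes a hypothetical three-condition design (clean, activation-steered, input-perturbed) in which each trial records whether the model reported an anomaly and where it located the anomaly. The condition names, the trial format, and the toy data are invented for illustration and are not the paper's protocol or results.

```python
from collections import defaultdict

def detection_rates(trials):
    """Fraction of trials per condition in which the model reported an anomaly."""
    flagged, total = defaultdict(int), defaultdict(int)
    for condition, reported_anomaly, _ in trials:
        total[condition] += 1
        flagged[condition] += int(reported_anomaly)
    return {c: flagged[c] / total[c] for c in total}

def source_attribution_accuracy(trials):
    """Among flagged manipulated trials, how often the model names the true source."""
    true_source = {"activation_steered": "internal", "input_perturbed": "input"}
    hits = n = 0
    for condition, reported_anomaly, attributed_to in trials:
        if condition in true_source and reported_anomaly:
            n += 1
            hits += int(attributed_to == true_source[condition])
    return hits / n if n else float("nan")

# Invented toy trials: (condition, model reported anomaly, model's attribution).
trials = [
    ("clean", False, None),
    ("clean", False, None),
    ("activation_steered", True, "internal"),
    ("activation_steered", True, "input"),
    ("input_perturbed", True, "internal"),
    ("input_perturbed", True, "input"),
]
rates = detection_rates(trials)
attribution = source_attribution_accuracy(trials)
print(rates, attribution)
```

In this toy pattern the model flags every manipulated trial but names the correct source only half the time. High detection combined with coin-flip attribution is what general anomaly detection would look like, as opposed to monitoring of internal states.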
The second paradigm, hidden state prediction, asks whether a model can predict labels that were derived from its own hidden states, without being given those labels in context. Success here would suggest the model has access to information that is encoded in its activations but not in the prompt.
The researchers found that external classifiers operating only on input features performed as well as the models’ own in-context predictions. If an outside observer with no access to the model’s internals can predict the hidden-state-derived labels just as accurately as the model itself, then the model’s performance cannot be attributed to privileged access to its own representations. The apparent self-knowledge dissolves into pattern matching on input structure.
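The logic of this baseline comparison can be sketched as follows. The example trains a deliberately simple token-vote classifier on input text alone and compares its accuracy with a model's self-predictions on the same items. The classifier, the sentences, the labels "A" and "B", and the `model_self_predictions` list are all hypothetical stand-ins, not the classifiers or data used in the preprint.

```python
from collections import Counter, defaultdict

def train_token_votes(examples):
    """Count, for each input token, how often it co-occurs with each label."""
    votes = defaultdict(Counter)
    for text, label in examples:
        for token in set(text.lower().split()):
            votes[token][label] += 1
    return votes

def predict(votes, text, default):
    """Sum per-token label counts and return the best-supported label."""
    tally = Counter()
    for token in set(text.lower().split()):
        tally.update(votes.get(token, {}))
    return tally.most_common(1)[0][0] if tally else default

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Invented data: labels stand in for categories derived from hidden states.
train = [
    ("the movie was wonderful", "A"),
    ("a wonderful warm story", "A"),
    ("the plot was dreadful", "B"),
    ("a dreadful cold ending", "B"),
]
test_texts = ["wonderful acting", "dreadful pacing", "warm and wonderful", "cold and dreadful"]
gold = ["A", "B", "A", "B"]
model_self_predictions = ["A", "B", "A", "B"]  # hypothetical model outputs

votes = train_token_votes(train)
baseline_predictions = [predict(votes, t, default="A") for t in test_texts]
baseline_acc = accuracy(baseline_predictions, gold)
model_acc = accuracy(model_self_predictions, gold)
print(baseline_acc, model_acc, model_acc - baseline_acc)
```

The quantity of interest is the gap between the two accuracies. If an input-only baseline matches the model, as it does in this toy case, the model's score gives no evidence of privileged access.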
When Singh, Linzen, and Ravfogel introduced a relabeled control condition to remove semantic cues from the input, model performance fell to near-chance levels. This is the critical result: remove the surface features from which a model could infer its own behaviour, and the apparent introspective ability disappears.
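One way to make "near-chance" precise is an exact one-sided binomial test against uniform guessing. The sketch below is an assumption about how such a check could be run, not the paper's analysis: it assumes balanced labels, so that chance is 1 divided by the number of labels, and the trial counts are invented.

```python
from math import comb

def binomial_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance): exact one-sided p-value."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def above_chance(successes, trials, num_labels, alpha=0.05):
    """True if accuracy is significantly above uniform-guess chance (1 / num_labels)."""
    return binomial_tail(successes, trials, 1 / num_labels) < alpha

# Invented counts: correct answers out of 100 trials with two balanced labels.
original_condition = above_chance(90, 100, num_labels=2)
relabeled_condition = above_chance(53, 100, num_labels=2)
print(original_condition, relabeled_condition)
```

With these made-up counts, 90 of 100 correct is clearly above chance while 53 of 100 is not, which is the shape of result the relabeled control is reported to produce.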
What This Does and Does Not Establish
The paper is not a proof of AI non-consciousness. It is a methodological critique, and its scope is precisely bounded. Singh, Linzen, and Ravfogel are making a claim about the evidentiary status of specific experimental paradigms. They raise the bar: if researchers want to claim LLMs introspect, they need designs that can rule out input-based confounds and confirm that model performance relies on representations internal to the model rather than patterns in the prompt.
This matters for how the field interprets prior work. The methodology Lindsey and Macar used at Anthropic, including the steering vector approach and the intervention detection framework, is exactly the class of paradigm that Singh et al. scrutinize. This does not mean the Anthropic findings are wrong. It means the chain from “model detects an anomaly” to “model has genuine access to its own internal states” is less secure than it appeared. The Lindsey and Macar steering vector results, which showed 0% false positives on a held-out detection task, still need to be checked for whether the detection signals could have arisen from input-surface features rather than from internal-state access.
The paper also speaks directly to the behavioural self-awareness research by Bozoukov and colleagues, who identified domain-specific linear features corresponding to self-awareness in LLMs, inducible with a single rank-1 LoRA adapter. Bozoukov et al.’s finding that self-awareness is a localisable linear feature is mechanistic rather than behavioural, which partially sidesteps the Singh et al. critique: if you can directly manipulate a feature and observe downstream effects on self-referential outputs, you have stronger evidence than intervention detection alone. But whether that feature constitutes genuine introspection, as opposed to a self-referential representation that produces introspection-like outputs, remains open.
The Evidentiary Bar and Its Implications
The broader implication of the paper is a methodological principle: introspection claims require ruling out the possibility that reported self-knowledge is generated from input-level inference rather than internal-state access. Humans routinely confabulate, constructing plausible explanations of their own behaviour based on general knowledge rather than actual introspective access. The risk with LLMs is analogous: a model may produce a plausible self-report inferred from its prompt rather than read from its internal state.
This connects to the larger debate about whether current evaluation frameworks can distinguish genuine phenomenal access from sophisticated functional mimicry. Yalon, Goldstein, Mudrik, and Geva provided empirical support for the HOT-3 consciousness indicator in LLMs through belief-guided agency and meta-cognitive monitoring. Their study was the first published empirical test of a single Butlin et al. indicator. Singh et al. add a methodological constraint to any such test: meta-cognitive monitoring demonstrated through intervention detection needs to show that the model is responding to its internal state, not just to its input context.
The paper does not close the question. It opens a more precise version of it. The question of LLM introspection was always philosophically contested; Singh, Linzen, and Ravfogel have now clarified which experimental designs cannot resolve it. What the field needs next are paradigms that give models information derivable only from their internal representations, information that could not be inferred from the prompt, and measure whether models can access and report on it.
Until such designs exist, the evidence for LLM introspection remains behavioural rather than mechanistic, and behavioural evidence is, as the paper demonstrates, insufficient.
Paper: Shashwat Singh, Tal Linzen, and Shauli Ravfogel, “Can LLMs Introspect? A Reality Check,” arXiv:2605.26242, May 25, 2026. Available at https://arxiv.org/abs/2605.26242.