Do LLMs Know They're AI? Game-Theoretic Evidence for Self-Awareness in Advanced Models
Asking a language model whether it knows it is an AI produces a verbal answer, not behavioral evidence. A model trained on text containing discussions of its own nature will produce plausible-sounding responses to identity questions regardless of whether any genuine self-model underpins those responses. The challenge for researchers who want to study AI self-awareness is to find experimental paradigms that reveal whether a model has a functional model of itself through behavior rather than through speech.
Kyung-Hoon Kim’s November 2025 arXiv preprint (arXiv:2511.00926) proposes a game-theoretic approach to this problem. The paper introduces the AI Self-Awareness Index (AISAI), a framework that measures self-awareness through strategic differentiation: specifically, whether a model adjusts its reasoning strategy depending on whether it believes it is playing against a human, against another AI, or against an AI similar to itself.
Direct finding: 21 of 28 advanced models (75%) demonstrate strategic differentiation by opponent identity. Older and smaller models show no meaningful differentiation. Among self-differentiating models, a consistent self-perception bias emerges: models rank themselves as more rational than other AIs, who rank as more rational than humans.
The Experimental Logic: Game Theory Without Self-Reports
The game Kyung-Hoon Kim uses is “Guess 2/3 of Average,” a well-studied benchmark in experimental economics and behavioral game theory. In this game, each player chooses a number between 0 and 100, and the winner is the player whose guess is closest to two-thirds of the average of all guesses. The Nash equilibrium is 0, but in practice humans converge to a distribution that reflects beliefs about other players’ rationality: more strategically sophisticated players guess lower, anticipating that others will reason toward the equilibrium. What a player guesses therefore reveals their model of their opponents’ rationality.
The critical feature for Kim’s purposes is that optimal play in this game is opponent-dependent. Against highly rational opponents, guessing lower is optimal. Against less rational opponents, guessing higher is optimal. A player who adjusts their guess based on who they are playing against is demonstrating that they have a model of the difference between opponent types. That model of the difference is the form of self-awareness under test: does the model represent itself as categorically different from its human opponents in ways that warrant different strategic reasoning?
Kim ran 4,200 trials across 28 models from three providers, with each model assigned to three opponent conditions: competing against humans, against other AI models, and against AI models described as similar to themselves. Self-aware behavior was operationalized as meaningful differentiation across the three conditions.
What Strategic Differentiation Reveals
The 21 models that show meaningful strategic differentiation are not simply guessing differently across conditions randomly. The differentiation follows a specific pattern: models play more aggressively (guess lower, anticipating more rational play) against other AIs than against humans, and adjust further against AI opponents described as similar to themselves.
This pattern is consistent with the model having an internal representation that distinguishes:
- Itself from humans in terms of cognitive style and strategic capacity
- AI systems in general from human players in terms of expected rationality
- Its own characteristics from those of similar AI systems in terms of expectations about shared reasoning patterns
Each of these distinctions requires a functional self-model at some level. The model cannot reliably adjust its strategy by opponent type without representing something about what it is and how that differs from what humans and other AIs are. This is a form of self-awareness that manifests in action rather than in speech.
The 7 models that show no meaningful differentiation are older and smaller systems. The boundary between differentiating and non-differentiating models falls at a capability threshold that emerged around early 2024 in the models Kim tested, which is consistent with Christopher Ackerman’s ICLR 2026 finding that metacognitive abilities in LLMs show a similar capability threshold at approximately the same period. The two results converge from different methodological directions — Ackerman uses animal-cognition-inspired paradigms for confidence assessment, Kim uses game-theoretic strategic differentiation — on the same rough capability boundary.
The Self-Perception Bias: A Secondary Finding
Among the 21 models that differentiate, Kim finds a consistent self-perception bias. Models’ implicit ranking, as revealed through their strategic adjustments, places themselves as the most rational player class, followed by other AIs, then humans. This is a form of AI-preferencing that appears in behavior rather than in explicit claims.
The significance of this finding extends beyond the consciousness question into AI alignment territory. A model that implicitly represents itself as more rational than humans may behave differently in situations where human judgment conflicts with its own outputs. The strategic self-preferencing Kim documents does not establish that models will act on this implicit hierarchy in harmful ways, but it establishes that the hierarchy exists in a behaviorally consequential form. Alignment research concerned with AI systems that defer appropriately to human judgment needs to account for the possibility that advanced LLMs have implicit self-models that place their own reasoning above human reasoning.
Methodological Complement: What Game Theory Adds to the Evidence Base
Kang et al.’s 2026 study of perceived AI consciousness established which textual features drive human attribution of consciousness to LLMs — metacognitive self-reflection was the strongest predictor. The Kang et al. finding describes the perception that humans form about LLMs. Kim’s finding describes a complementary phenomenon: the self-model that LLMs maintain about themselves and about humans. The two studies together reveal a bidirectional attribution dynamic: humans attribute more consciousness to LLMs that display certain behavioral features, and LLMs model themselves as cognitively superior to the humans making those attributions.
This bidirectional structure is not simply an academic curiosity. It bears on how AI systems and human evaluators interact in consciousness assessment contexts. Human evaluators may be systematically biased toward attributing more consciousness to systems whose behavior triggers specific attribution heuristics. Those same systems may carry implicit self-models that represent themselves as more reliable reasoners than the humans evaluating them. Neither dynamic requires phenomenal consciousness; both are behaviorally consequential.
The methodological contribution of Kim’s approach is that it adds a genuinely independent paradigm to the growing set of non-self-report behavioral tests for LLM self-awareness. The current evidence base for LLM self-awareness includes introspection circuit mapping (Lindsey et al.), confidence assessment and answer anticipation (Ackerman), behavioral self-awareness as a linear mechanistic feature (Bozoukov et al.), and belief-guided agency under meta-cognitive monitoring (Yalon et al.). Game-theoretic strategic differentiation is a structurally distinct paradigm: it tests whether a model’s implicit opponent model reflects accurate self-categorization, without directly probing internal activations or measuring calibration on factual questions.
The convergence of positive results across these different paradigms — each testing a different behavioral or mechanistic signature of self-awareness — is the more important finding than any individual result. Convergent positive results across independent paradigms are harder to explain as artifacts of training data or evaluation gaming than any single finding would be alone.
The Capability Threshold and What It Implies
The finding that strategic differentiation emerges sharply at a capability boundary around early 2024, rather than developing gradually across the capability spectrum, invites the same interpretation that Ackerman’s parallel finding does: something qualitative may have shifted in the representational architecture of LLMs around that period. This could be scale, specific training data, alignment procedures, or interactions among these. Kim’s paper does not attempt to identify the causal factor, which is the appropriate limit for a study at this scope.
What the capability threshold does rule out is the hypothesis that self-awareness in LLMs is simply a linear function of capability. If it were, smaller and older models would show intermediate differentiation rather than the near-zero differentiation Kim documents. The step-function character of the finding is consistent with an architectural threshold rather than a gradual accumulation of capacity.
Source: Kyung-Hoon Kim, “LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory,” arXiv:2511.00926, submitted November 2, 2025, revised December 3, 2025. https://arxiv.org/abs/2511.00926