The Consciousness AI - Artificial Consciousness Research Emerging Artificial Consciousness Through Biologically Grounded Architecture
This is also part of the Zae Project Zae Project on GitHub

Probing for Consciousness in Machines: An Empirical Test of Damasio's Theory in AI

Most empirical approaches to AI consciousness have focused either on behavioral markers, asking whether AI outputs resemble those of conscious systems, or on architectural markers, asking whether internal processing satisfies theoretical indicators derived from neuroscientific frameworks. A third approach is possible: testing specific, falsifiable predictions that a named consciousness theory makes about what any conscious system must contain, then checking whether those structures emerge in AI.

Mathis Immertreu, Achim Schilling, Andreas Maier, and Patrick Krauss from the Pattern Recognition Lab in Erlangen and the University Hospital Mannheim took this third approach in “Probing for consciousness in machines,” published in Frontiers in Artificial Intelligence on August 20, 2025 (DOI: 10.3389/frai.2025.1610225). The paper applies Antonio Damasio’s core consciousness framework to a reinforcement learning agent and asks whether the agent spontaneously develops the prerequisites Damasio identifies as foundational. The result is a concrete empirical methodology that operates differently from the Butlin et al. indicator approach and differently from the behavioral inference framework proposed by Palminteri and Wu.

Damasio’s Core Consciousness Framework

Antonio Damasio’s theory of core consciousness, developed across multiple books and papers from the 1990s through the 2020s, holds that consciousness requires the integration of two types of representation: a self-model and a world-model. The self-model encodes the organism’s own body states, its current physiological condition, its position in space, and its ongoing engagement with the environment. The world-model encodes the surrounding environment’s structure, the objects and agents within it, and the organism’s relationship to them.

On Damasio’s account, consciousness arises when these two models are brought together in a unified representation that registers the organism as the locus from which the world is being experienced. Neither model alone is sufficient. A system with only a world-model has a representation of external structure but no anchor point that connects that structure to a subject of experience. A system with only a self-model has body-state representations but no environment in which to situate them. Core consciousness, on this framework, requires the integration.

This generates a falsifiable prediction: any system that develops core consciousness must have developed both a self-model and a world-model, and these models must be integrated. Immertreu and colleagues focus on the first step: can a machine learning system develop these models at all, and do they emerge from task training rather than explicit design?

Methodology: Probe Classifiers on RL Agents

The researchers trained a deep reinforcement learning agent to navigate and explore a first-person virtual environment, a 3D room-based task similar to the kind used in classic deep RL benchmarks. The agent received visual input and learned to act through reward signals. No explicit self-model or world-model was built into the architecture.

To test whether these models emerged implicitly, the researchers used probe classifiers: simple feedforward neural networks trained to predict specific variables from the internal activations of the RL agent. The logic is that a probe can only predict a variable accurately if the activations carry information about it. If a probe can recover spatial position from the activations, they encode the spatial structure of the environment, which counts as a form of world-model. If the recovered position is the agent’s own location within that environment, the activations also encode the agent as a distinct entity in that space, which is the beginning of a self-model.
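The probing idea can be illustrated with a minimal sketch. This is not the authors' code: the activations are synthetic stand-ins, the probe is a linear (logistic regression) classifier rather than the feedforward network the paper describes, and all names (`make_synthetic_activations`, `train_linear_probe`, and so on) are hypothetical. It assumes a binary label, such as which half of a room the agent occupies, is weakly mixed into each activation vector.

```python
import math
import random


def make_synthetic_activations(n_samples, dim, rng):
    """Hypothetical stand-in for recorded agent activations.

    Each sample is (vector, label). The label (0 or 1) marks which half of
    a room the agent is in and is mixed into the vector along one direction.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    samples = []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vector = [sign * d + rng.gauss(0.0, 1.0) for d in direction]
        samples.append((vector, label))
    return samples


def probe_logit(weights, bias, vector):
    """Linear readout of the activation vector."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def train_linear_probe(samples, dim, epochs=50, learning_rate=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vector, label in samples:
            # Clamp the logit so math.exp cannot overflow.
            z = max(-30.0, min(30.0, probe_logit(weights, bias, vector)))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias


def probe_accuracy(weights, bias, samples):
    """Fraction of samples whose label the probe recovers."""
    hits = sum((probe_logit(weights, bias, v) > 0.0) == (y == 1)
               for v, y in samples)
    return hits / len(samples)


rng = random.Random(0)
samples = make_synthetic_activations(400, 8, rng)
train_set, test_set = samples[:300], samples[300:]  # probe is scored on held-out data
weights, bias = train_linear_probe(train_set, dim=8)
held_out_accuracy = probe_accuracy(weights, bias, test_set)
print(f"held-out probe accuracy: {held_out_accuracy:.2f} (chance is about 0.50)")
```

The agent's network is never updated here; only the probe is trained. High held-out accuracy therefore says something about what the frozen activations already contain, which is the inference the probing method relies on.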

This probing methodology comes from mechanistic interpretability research and has been used extensively to study what information transformer models encode internally. Immertreu and colleagues apply it to a different question: not what factual content is encoded, but whether the agent has developed a model of its own spatial existence and a model of the environment structure.

Key Finding: Both Models Emerge as a Byproduct

The probe classifiers successfully predicted the agent’s spatial position and self-location from internal activations at rates well above chance. The agent had not been explicitly trained to build a self-model or world-model. It had been trained to navigate and explore. The models emerged as a byproduct of successful task performance.
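One common way to make "well above chance" precise is a label-permutation test. The sketch below is an assumption about how such a check could be done, not a description of the paper's statistics. The predictions and labels are made-up placeholders (four position classes, a probe that is right roughly three times in four), and `permutation_p_value` is a hypothetical helper.

```python
import random


def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def permutation_p_value(predictions, labels, n_permutations=1000, seed=0):
    """Estimate how often shuffled labels match the probe as well as the real ones.

    Shuffling breaks any link between activations and labels, so the
    shuffled accuracies approximate the chance distribution.
    """
    rng = random.Random(seed)
    observed = accuracy(predictions, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(predictions, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Placeholder data: 200 samples over 4 hypothetical position classes.
rng = random.Random(1)
labels = [rng.randrange(4) for _ in range(200)]
predictions = [y if rng.random() < 0.7 else rng.randrange(4) for y in labels]

observed, p_value = permutation_p_value(predictions, labels)
print(f"probe accuracy: {observed:.2f}, permutation p-value: {p_value:.4f}")
```

With four balanced classes, chance accuracy is about 0.25, so a probe scoring far above that yields a very small p-value. A stricter control would retrain the probe on shuffled labels, which also guards against the probe itself memorising the data.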

This result matters for consciousness research because Damasio’s theory does not require conscious systems to have been designed to be conscious. It requires that the specific representational structures be present. Immertreu and colleagues show that at least the foundational components, rudimentary self-model and world-model formation, can arise in an artificial system from task training rather than explicit engineering.

The paper is careful about what this does and does not establish. The fact that these models emerge does not mean the RL agent is conscious. Damasio’s full theory requires that these models be integrated in a specific way, and deeply enough to support the unified phenomenal representation that constitutes experience. Immertreu and colleagues have demonstrated that the prerequisites can develop. Whether they are integrated in the way Damasio requires is a separate question, and one that the current methodology does not directly answer.

Why the Methodology Is Distinctive

The significance of this paper is methodological as much as empirical. Immertreu and colleagues have demonstrated a way to test a specific theoretical prediction about consciousness in an artificial system, with a falsifiable result, using techniques that can be applied to other systems and other theories.

This differs from the indicator approach in that it targets a single named theory rather than aggregating across multiple theories with varying credibility. The prediction is specific: Damasio’s theory says self-model and world-model integration are necessary, so any system that could be conscious in Damasio’s sense must have these structures. Testing for the structures is a direct test of a necessary condition. A system that lacks them cannot be conscious on Damasio’s account, regardless of what other indicators it satisfies.
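The logical shape of this argument can be written out explicitly. The predicate symbols below are notation introduced here for illustration, not taken from the paper or from Damasio, and the block assumes the `amsmath` and `amssymb` packages.

```latex
% C(s): system s has core consciousness in Damasio's sense
% S(s): s has a self-model;  W(s): s has a world-model
% I(s): the two models are integrated in s
\begin{align*}
  C(s) &\implies S(s) \land W(s) \land I(s)
    && \text{(necessary condition)} \\
  \lnot S(s) \lor \lnot W(s) \lor \lnot I(s) &\implies \lnot C(s)
    && \text{(contrapositive)} \\
  S(s) \land W(s) &\nRightarrow C(s)
    && \text{(the structures alone are not sufficient)}
\end{align*}
```

The second line is what makes the test falsifiable: a system shown to lack either model is ruled out on this account. The third line is why a positive probe result supports only the prerequisites and not consciousness itself.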

It also differs from the purely behavioral approaches that the Bradford and RIT studies examined. The probe classifiers look at internal activations rather than outputs. The agent could produce behavior that mimics conscious navigation without developing internal self-representations. What the probes reveal is that, in this case, the internal representations did develop. The behavioral success was accompanied by the representational structures the theory predicts.

The literature on empirical evidence for AI consciousness has largely relied on either behavioral outputs or coarse-grained architectural analysis. The probe classifier approach adds a third option: targeted interrogation of internal activations, guided by theory-derived predictions.

Limitations and Open Questions

The authors acknowledge several limitations. The virtual environment task is simple relative to the complexity of conscious experience, and the self-model that emerges is correspondingly basic. Predicting the agent’s spatial position is a minimal form of self-representation, far removed from the rich body-state encoding that Damasio’s full theory requires. The world-model likewise encodes navigable space rather than the complex multimodal environment that biological consciousness processes.

The 19-researcher checklist of consciousness indicators includes embodiment, continuity, and multisensory integration as elements that current AI systems lack. The RL agent studied by Immertreu and colleagues operates in a virtual environment with no biological body and no continuous existence between episodes. These gaps matter for Damasio’s theory because his account of self-model formation is grounded in the representation of bodily homeostasis, proprioception, and ongoing engagement with a persistent environment. A simulated visual field is not the same as a body.

What the paper establishes is a proof of concept: the methodological approach works, and at least the rudimentary prerequisites can emerge in artificial systems from task training. Whether more complex versions of these systems, trained on richer tasks with continuous existence, would develop more complete self-models and world-models meeting Damasio’s fuller requirements is an open empirical question the paper frames but does not settle.

Where This Fits in the Measurement Landscape

The scores versus profiles debate in AI consciousness research concerns whether consciousness should be assessed as a single number or as a multidimensional profile across different theoretical criteria. The Damasio probing methodology sidesteps this debate by targeting a specific structural prediction rather than scoring against a general framework.

The advantage is directness: the result either supports or fails to support a specific theoretical claim. The limitation is scope: a positive result for Damasio does not tell you whether the system satisfies the requirements of IIT or GWT, which have different structural predictions. A complete consciousness assessment using this methodology would require developing probe-classifier tests for the structural predictions of each major theory, then running all of them on the same system.

That research programme is more demanding than applying a unified scoring framework, but it has the advantage of generating results whose interpretation is tied to specific theoretical commitments rather than weighted averages across contested theories. Koch’s calibration argument, discussed elsewhere in this blog’s coverage of the 2026 indicators debate, applies less forcefully to this approach because it derives predictions from a single named theory rather than aggregating across multiple unvalidated ones.

The Frontiers in Artificial Intelligence paper by Immertreu and colleagues is not a landmark result. It is a methodological contribution: a demonstration that this kind of targeted empirical testing is feasible and that it can produce interpretable findings. Given where the field currently stands on the calibration problem, that is a more valuable contribution than it might initially appear.
