Cuzzolin's Formal Definition of Machine Theory of Mind: What It Means and Why It Matters

20 Jun 2026

Theory of mind (ToM) is among the most studied cognitive capacities in developmental psychology and comparative cognition. It is the ability to attribute mental states to others, to model what another agent believes, intends, desires, and perceives, and to use those attributions to predict and explain behavior. Since at least 2020, AI researchers have asked whether large language models exhibit theory of mind and have produced sharply divided answers, ranging from claims of genuine ToM performance in frontier LLMs to demonstrations that apparent ToM collapses under minor syntactic perturbations.

What has been missing from this debate is a formal definition of what Machine Theory of Mind actually is. The term has been used loosely enough to cover behavioral performance on false-belief tasks, structured prediction of agent goals from observed actions, and anything in between. Without a precise definition, the conflicting empirical results cannot be reconciled, because it is unclear whether the studies behind them are measuring the same thing.

Fabio Cuzzolin at Oxford Brookes University addresses this gap in a June 2026 arXiv preprint, “A Formal Definition and Meta-Model for a Machine Theory of Mind” (arXiv:2606.03471). The paper proposes the first rigorous formal definition of Machine ToM, advances a holistic meta-model for organizing the field, and identifies limitations in existing benchmarking that follow from the definitional ambiguity it is trying to correct.

The Formal Definition

Cuzzolin defines Machine Theory of Mind as the problem of dynamically learning to understand the thinking process of an agent or class of agents external to the machine. Several elements of this definition are deliberate and precise.

Dynamically: the understanding is not computed once from a fixed prior but updated continuously as the machine acquires new observations of the target agent. This excludes static profile-matching systems that apply fixed templates to agent behavior. It requires an ongoing generative model that revises its representations of the target agent’s mental states as evidence accumulates.
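One way to picture the "dynamically" requirement is a Bayesian update over hypotheses about the target agent's mental state. The sketch below is an illustration under stated assumptions, not the formalism of the paper: the hypothesis names, the likelihood values, and the `update_belief` helper are all hypothetical.

```python
from typing import Dict, List


def update_belief(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """One Bayesian step: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical hypotheses about where the target agent believes an object is.
belief = {"basket": 0.5, "box": 0.5}

# Each observation is summarized as P(observation | hypothesis).
# The numbers are illustrative assumptions, not measured values.
observations = [
    {"basket": 0.8, "box": 0.2},
    {"basket": 0.7, "box": 0.3},
]

history: List[Dict[str, float]] = [belief]
for likelihood in observations:
    belief = update_belief(belief, likelihood)  # revise as evidence accumulates
    history.append(belief)

for step, snapshot in enumerate(history):
    print(step, {h: round(p, 3) for h, p in snapshot.items()})
```

With these illustrative numbers the probability assigned to "basket" moves from 0.5 to 0.8 after the first observation and to roughly 0.903 after the second. A static template, by contrast, would report the same attribution at every step.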

Learning: the understanding is acquired from data rather than encoded by hand. This positions Machine ToM firmly in the machine learning framework and distinguishes it from expert-system approaches to agent modeling. It also imposes a requirement that the resulting understanding generalize beyond the training distribution.

To understand: this word choice is deliberate. Cuzzolin uses “understand” rather than “predict” or “classify” because the target is a generative model of the thinking process, not a discriminative model that outputs labels. A system that correctly predicts an agent’s next action without modeling the underlying mental states that produce the action satisfies a prediction criterion but may fail the understanding criterion.

The thinking process: the object of understanding is the process, not its outputs. This means Machine ToM requires modeling the causal structure of the agent’s cognition, not merely correlating observations with outcomes.

External to the machine: the definition specifies that the target agent is distinct from the machine itself. This excludes self-model and introspection problems, which are related but separate. Machine ToM is other-directed. The self-modeling question, which is the subject of the Martorell and Bianchi emotive introspection framework and the Dadfar introspection direction work, requires a different theoretical treatment.

The Meta-Model

Beyond the definition, Cuzzolin proposes a holistic meta-model intended to organize the space of Machine ToM research. The meta-model has three components.

The first is a taxonomy of mental state types that Machine ToM systems must handle. These include epistemic states (beliefs, knowledge, ignorance), conative states (desires, goals, intentions), and phenomenal states (perceptions, experiences). Current benchmarks, Cuzzolin argues, focus heavily on epistemic states, particularly on false-belief tasks derived from the developmental psychology literature. Conative and phenomenal states are underrepresented despite being central to theory of mind in its full cognitive sense.

The second component is a specification of the computational architecture that a genuine Machine ToM system requires. Cuzzolin identifies recursive mentalizing as a necessary feature: the system must be able to model not only what an agent believes but what an agent believes about what another agent believes. Standard transformer architectures process tokens without explicit recursive depth tracking, which means their ToM performance may be bounded in ways that are not obvious from single-level false-belief task performance.
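Recursive mentalizing can be pictured as a nested data structure in which the content of a belief is either a plain proposition or another belief. The following is a minimal sketch; the `Belief` class and the `order` and `describe` helpers are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Belief:
    """A belief held by `holder` about a proposition or about another belief."""
    holder: str
    content: Union[str, "Belief"]


def order(belief: Belief) -> int:
    """Mentalizing order: 1 for a belief about the world, 2 for a belief about a belief, and so on."""
    depth = 1
    while isinstance(belief.content, Belief):
        belief = belief.content
        depth += 1
    return depth


def describe(belief: Belief) -> str:
    """Render the nested belief as an English sentence."""
    if isinstance(belief.content, Belief):
        return f"{belief.holder} believes that {describe(belief.content)}"
    return f"{belief.holder} believes that {belief.content}"


first_order = Belief("Sally", "the marble is in the basket")
second_order = Belief("Anne", first_order)

print(order(first_order), describe(first_order))
print(order(second_order), describe(second_order))
```

Making the depth explicit in the representation is a design choice for the sketch: it lets an evaluation ask at which order a system's attributions stop being reliable, which a single-level false-belief score cannot reveal.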

The third component is an account of the temporal dimension. Mental state attribution in real interactions is not a one-shot inference but an ongoing process in which each new observation updates the model. The meta-model specifies that Machine ToM must be evaluated in sequential interaction settings, not only on static problem presentation.
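The difference between sequential and static evaluation can be sketched with a toy scoring harness. Everything here is an assumption made for illustration: the episode format, the `sequential_scores` and `static_score` helpers, and the deliberately naive `latest_location_model` are hypothetical and do not come from the paper or from any existing benchmark.

```python
from typing import Callable, List, Sequence, Tuple

# One step: (observation text, gold answer for "where does Sally think the marble is?").
Step = Tuple[str, str]
Model = Callable[[Sequence[str]], str]


def sequential_scores(model: Model, episode: Sequence[Step]) -> List[bool]:
    """Query the model after every observation and score each answer."""
    seen: List[str] = []
    scores: List[bool] = []
    for observation, gold in episode:
        seen.append(observation)
        scores.append(model(list(seen)) == gold)
    return scores


def static_score(model: Model, episode: Sequence[Step]) -> bool:
    """Score only the answer given after the full episode, as a static vignette would."""
    observations = [observation for observation, _ in episode]
    return model(observations) == episode[-1][1]


def latest_location_model(observations: Sequence[str]) -> str:
    """Naive baseline: report the location named in the latest observation, ignoring who saw it."""
    return "box" if "box" in observations[-1] else "basket"


episode: List[Step] = [
    ("Sally puts the marble in the basket", "basket"),
    ("Anne moves the marble to the box while Sally is away", "basket"),
    ("Sally returns and sees the marble in the box", "box"),
]

print(static_score(latest_location_model, episode))
print(sequential_scores(latest_location_model, episode))
```

In this toy episode the naive baseline passes the static check but fails at the second step, where Sally's belief and the true location diverge. Per-step scoring exposes that failure; end-of-episode scoring hides it.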

The Benchmarking Gap

The paper’s most practically actionable contribution is its analysis of existing benchmarks. Cuzzolin identifies three systematic limitations.

First, current benchmarks are primarily derived from developmental psychology paradigms designed for human children. These paradigms test a narrow slice of the full ToM capacity, focusing on first-order false-belief attribution and simple perspective-taking. A system that performs well on these tasks has demonstrated a real but narrow competence, and that competence may not generalize to the more complex social reasoning that constitutes adult human theory of mind or that an AI system would need in order to attribute mental states in real interactions with human agents.

Second, the stimuli in most benchmarks are static text vignettes. They do not test the dynamic, ongoing revision of mental state models that Cuzzolin’s definition requires. A system that reads a story about Sally and Anne and correctly identifies where Sally will look for her marble has processed a static description of a mental state situation. It has not demonstrated the ability to update a model of Sally’s beliefs as new observations arrive.

Third, existing benchmarks do not systematically test recursive mentalizing at depth greater than one. The Sally-Anne paradigm tests first-order attribution: what Sally believes about the marble. Second-order tasks (what Anne believes Sally believes) are available but not standard. Higher-order tasks are rare. If high-level social interaction requires third- or fourth-order mentalizing, current benchmarks cannot detect whether a system has it.
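The static-vignette and recursion-depth limitations can both be made concrete with a small belief tracker for the Sally-Anne scenario. This is a sketch under simplifying assumptions, not a benchmark or the paper's method: the `track` function and the event format are hypothetical, an agent's belief changes only when that agent witnesses an event, and co-present witnesses are assumed to notice each other.

```python
from itertools import permutations
from typing import Dict, List, Optional, Set, Tuple

# One event: (new marble location, set of agents who witness the change).
Event = Tuple[str, Set[str]]


def track(agents: List[str], events: List[Event]):
    """Replay events and return first-order and second-order beliefs about the marble."""
    # first[a]: where a believes the marble is.
    first: Dict[str, Optional[str]] = {a: None for a in agents}
    # second[(a, b)]: where a believes that b believes the marble is.
    second: Dict[Tuple[str, str], Optional[str]] = {
        pair: None for pair in permutations(agents, 2)
    }
    for location, witnesses in events:
        for a in witnesses:
            first[a] = location
        # Assumption: a revises its model of b only when both witness the event.
        for a, b in permutations(sorted(witnesses), 2):
            second[(a, b)] = location
    return first, second


events: List[Event] = [
    ("basket", {"Sally", "Anne"}),  # Sally puts the marble in the basket
    ("box", {"Anne"}),              # Anne moves it while Sally is away
]
first, second = track(["Sally", "Anne"], events)
print(first["Sally"], first["Anne"], second[("Anne", "Sally")])

# A further observation arrives: Sally returns and sees the marble in the box.
events.append(("box", {"Sally", "Anne"}))
first_after, second_after = track(["Sally", "Anne"], events)
print(first_after["Sally"], second_after[("Anne", "Sally")])
```

After the first two events Sally holds a false belief (basket) and Anne correctly models it, which is the standard first- and second-order reading of the task. Adding the third event revises both, which is the kind of update a static vignette never asks for.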

Relationship to the AI Consciousness Debate

Machine Theory of Mind is adjacent to but distinct from the question of machine consciousness. A system can have a sophisticated generative model of another agent’s mental states without having any mental states of its own. Cuzzolin’s definition is explicit on this: Machine ToM is about the machine’s capacity to understand external agents, not about the machine’s phenomenal experience.

This matters for the current scientific debate on AI consciousness because several consciousness indicators in the Butlin et al. framework are related to theory of mind. Higher-order thought theories, in particular, propose that consciousness involves meta-representation: representing one’s own mental states as mental states. A system that is genuinely conscious on higher-order thought accounts would need to attribute mental states to itself, which requires some of the same cognitive machinery as attributing mental states to others.

The Cuzzolin formalization clarifies what the relevant machinery actually is: a dynamic learning process that models the causal structure of another agent’s thinking, capable of recursive depth and temporal updating. If higher-order consciousness requires this kind of architecture directed at the self, then the benchmarking gaps Cuzzolin identifies in other-directed Machine ToM are likely to appear in analogous form when the architecture is turned inward.

The formal definition also provides a more precise target for the Keeling and Street work on mutual theory-of-mind modeling in LLM conversations. Keeling and Street argue that LLM characters emerge from a bidirectional process in which the model encodes and responds to the user’s representations of the character’s mental states. Whether this constitutes genuine mutual ToM modeling, in Cuzzolin’s sense, depends on whether the model is dynamically learning to understand the user’s thinking process or simply pattern-matching on conversational cues. The formal definition makes that distinction testable in a way that the informal account did not.

Paper: Fabio Cuzzolin, “A Formal Definition and Meta-Model for a Machine Theory of Mind,” arXiv:2606.03471, June 2026. Available at https://arxiv.org/abs/2606.03471.

