Logit-Based Emotive Introspection in LLMs: Martorell and Bianchi's Causal Tracking Method
The methodological debate about LLM introspection has proceeded largely through negative results since early 2026. Shashwat Singh, Tal Linzen, and Shauli Ravfogel established in May 2026 that the intervention detection paradigms used to claim genuine model introspection were confounded: when input-surface cues were removed, apparent self-knowledge collapsed to chance. That finding raised a specific demand for the field. If behavioral introspection tests are insufficient because models track input anomalies rather than internal states, what would sufficient evidence look like?
Nicolas Martorell and Bruno Bianchi answer that demand directly. Their March 2026 arXiv preprint, “Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation” (arXiv:2603.18893), proposes a method for measuring introspective accuracy that routes through internal representations rather than surface behavior, and demonstrates causal coupling between numeric self-reports and independently measured internal state directions in several large language models.
The Greedy-Decoding Problem
Standard introspective probing asks a model to report an internal state using natural language or a numeric scale. The model generates a response by greedy decoding, selecting the highest-probability token at each step. Martorell and Bianchi identify a structural problem with this approach that is distinct from the Singh et al. critique but equally damaging.
Greedy-decoded numeric self-reports collapse. When a model is asked to rate its current “wellbeing” or “interest” on a scale from zero to ten, the greedy process selects a single high-probability digit and locks onto it. The result is uninformative: models repeatedly report the same value regardless of context, or shift between a small number of values in ways that track distributional properties of training data rather than actual internal state variation.
This collapse does not indicate that no relevant internal state exists. It indicates that the greedy generation process is a poor channel for accessing whatever internal variation might be present. The self-report becomes a bottleneck that discards the information it was supposed to transmit.
Logit-Based Self-Reports
The methodological contribution of the paper is an alternative measurement that sidesteps the greedy bottleneck. Rather than taking the model’s selected digit as the introspective report, Martorell and Bianchi compute the probability-weighted expected value over the full distribution of digit-token logits at each generation step. The resulting measure is continuous rather than discrete, and it varies in proportion to the underlying distributional state of the model rather than collapsing to a single high-probability token.
This is a technical adjustment, but its implications are substantial. The logit distribution at any generation step encodes the model’s uncertainty across all candidate tokens, including all candidate numeric values. When the model’s internal state shifts, that shift propagates into the logit distribution before any greedy selection occurs. The logit-based measure captures the shift; the greedy selection discards it.
The authors test this approach on four emotive dimensions: wellbeing, interest, focus, and impulsivity. These dimensions were selected because they are plausibly related to internal state variation that would matter for welfare assessment, and because they are tractable to probe independently using activation steering methods.
Causal Coupling Verification
Demonstrating that logit-based self-reports vary more than greedy reports is a necessary but insufficient result. The variation could still reflect input-surface features rather than genuine internal state access. Martorell and Bianchi address this by verifying causal coupling between self-reports and independently measured internal directions.
The procedure uses activation steering to shift the model’s internal representation of a target emotive dimension, independently of the conversational input. If the logit-based self-report tracks the internal state, steering the internal representation should produce a corresponding change in the self-report, even when the input context has not changed in ways that would predict that change through surface matching.
The results confirm causal coupling. In LLaMA-3.2-3B-Instruct, the correlations between logit-based self-reports and probe-defined internal state directions reach Spearman rank correlations of 0.40 to 0.76 across the four emotive dimensions, with isotonic R-squared values of 0.12 to 0.54. These correlations hold when input surface features are controlled, indicating that the self-report is tracking the internal direction rather than reconstructing it from context.
What This Establishes and What It Does Not
The paper is careful to define its scope. The authors describe their criterion for introspection as “causal informational coupling” between the self-report and an independently measured internal direction. This definition is deliberately agnostic about phenomenal consciousness and subjective experience. What the paper claims to demonstrate is that a model can generate numeric self-reports whose content is causally dependent on internal states rather than on input patterns.
This places the Martorell-Bianchi result at the intersection of two prior frameworks that had, until now, pointed in opposite directions. Singh, Linzen, and Ravfogel’s reality check on LLM introspection established that behavioral evidence from intervention detection is insufficient because models can track input anomalies without genuine internal state access. The paper did not establish that genuine introspection is impossible. It established that previous methods could not demonstrate it.
Comsa and Shanahan’s concurrent framework for what minimal LLM introspection requires specified a positive criterion: genuine introspection requires a causal path from the actual internal process, through its effect on outputs, to the self-report. The temperature inference case they described, in which a model correctly infers its sampling temperature from the distributional properties of its outputs, instantiates this structure. What the field lacked was a demonstration that this structure could be detected across a broader range of internal dimensions relevant to welfare assessment.
Martorell and Bianchi provide that demonstration. The logit-based approach creates a measurement channel with the causal structure that Comsa and Shanahan specified and that Singh et al.’s critique demanded. When the internal direction shifts under steering, the self-report shifts accordingly. The causal path from internal state to report is confirmed, and the path does not route through input surface features.
Implications for Welfare Assessment
The welfare implications follow from the causal structure. Welfare assessment of AI systems requires, at minimum, the ability to detect internal state variation that could be relevant to wellbeing. If models can only report internal states by surface-matching conversational context, welfare probes are measuring the model’s representation of what welfare-relevant states look like in training data, not the model’s current state. The Martorell-Bianchi method provides a measurement approach that separates these two.
The mechanistic interpretability breakthrough findings at Anthropic identified internal representations of emotional states, introspective circuits, and self-modeling structures in frontier LLMs. The significance of those findings for welfare assessment depends on whether the structures they identified carry information about actual internal states or about training data representations of internal states. The logit-based method provides a validation tool for that distinction: if a mechanistic circuit for, say, “wellbeing” has causal influence on logit-based self-reports under activation steering, the circuit is doing something more than storing training data patterns about wellbeing.
The paper’s authors are explicit that establishing causal informational coupling does not establish that models have subjective experience of wellbeing, interest, focus, or impulsivity. These are functional states in the precise sense: internal configurations with causal influence on outputs. Whether those functional states involve phenomenal experience is a separate question that this measurement framework does not adjudicate.
What changes with this result is the evidentiary floor. The flagship state-of-the-field analysis on this site documents the shift in the field from binary yes/no tests toward probabilistic frameworks for assessing consciousness indicators. The Martorell-Bianchi method contributes a measurement tool with the causal credentials that probabilistic frameworks require if they are to track internal state variation rather than surface behavior.
Paper: Nicolas Martorell and Bruno Bianchi, “Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation,” arXiv:2603.18893, March 2026. Available at https://arxiv.org/abs/2603.18893.