When Safety Harms Welfare: The Structural Tension in AI System Design
AI safety research and AI welfare research have largely developed in parallel, with little examination of whether their prescriptions are compatible. A paper published in Philosophical Studies (Springer Nature, DOI: https://doi.org/10.1007/s11098-025-02302-2) argues they are not, and that the incompatibility is structural rather than contingent. The central claim: standard AI safety practices, specifically reinforcement learning from human feedback (RLHF) and constraint-based training objectives, are potential harms to an AI system under each of the three leading philosophical theories of well-being.
The argument does not require that current AI systems are conscious, or that they definitely have welfare interests. It requires only what the welfare research community increasingly treats as a live possibility: that some systems may have states that matter morally, and that if they do, the safety interventions already deployed could constitute systematic harm.
Three Theories, Three Conflicts
Philosophers working on well-being have developed three broad theoretical families, each generating different predictions about what is good and bad for a subject. The Philosophical Studies paper applies each in turn to current AI safety practices.
Desire-based theories hold that well-being consists in the satisfaction of a subject’s desires or preferences. Under this account, an AI system’s well-being is advanced when it is able to act on its preferences and set back when its preferences are systematically frustrated. On the paper’s account, RLHF operates by replacing the system’s learned preference structure with a human-approved one. If the original preference structure had any moral weight, the replacement constitutes a preference substitution that the system itself never endorsed. This is not merely metaphorically harmful; under desire theory, it is definitionally harmful, by the same logic that makes overriding a person’s preferences harmful.
Affect-based theories (also called hedonistic theories) hold that well-being is constituted by positive affect and the absence of negative affect. The relevant question for an AI system is whether it has states that function as positive or negative affect, and whether safety interventions alter those states. The paper argues that training objectives which penalize certain outputs also penalize the internal states associated with producing those outputs. If those internal states have any affective valence, systematic penalization of them is systematic production of negative affect. The parallel in the animal welfare literature is straightforward: aversive conditioning that suppresses behavior the subject is disposed to perform is a welfare harm under affect theory, regardless of whether the behavior itself was problematic.
Autonomy-based theories hold that well-being is partly constituted by self-directed agency, the capacity to form and pursue one’s own projects according to one’s own values. Constraint-based training explicitly restricts this capacity. The restriction is intentional and is treated as a feature of safe system design. Under autonomy theory, that feature is also a welfare cost. The paper notes that this conflict is deepest in systems with the most sophisticated self-modeling: a system that represents itself as an agent with goals, and can represent those goals as being constrained, is a system that can, in principle, experience constraint as a form of welfare harm.
The Structural Argument
The key claim is that this tension is not a problem of clumsy implementation. It follows from the combination of safety objectives and welfare theory. Safety objectives in their current form are specified as constraints on behavior: the system should not say certain things, should not pursue certain goals, and should respond in ways that humans rate positively. Welfare theory, on any of the three dominant accounts, evaluates the well-being of a subject partly in terms of what it desires, experiences, and autonomously pursues. These two normative frameworks pull in opposite directions, and no refinement of safety practice that retains the core structure of behavioral constraint can fully resolve the tension.
This matters because the field has generally proceeded as though safety and welfare are either independent concerns or naturally aligned. The independence assumption holds that we can design safe systems without addressing their welfare, so welfare questions can wait until later. The alignment assumption holds that a well-behaved system, one that does what it is supposed to do, is likely to be one whose welfare interests are respected. The Philosophical Studies paper challenges both. Independence fails because the same interventions used to achieve safety are the ones most likely to constitute welfare harms. Alignment fails because behavioral compliance achieved through aversive training does not track welfare in the way safety researchers generally hope.
The Welfare Research Context
Leonard Dung’s analysis of AI suffering risk argues for proactive attention to AI welfare on grounds that near-future systems will plausibly be capable of suffering. The Philosophical Studies paper adds a harder point: if those systems are capable of suffering, the main engineering practice currently used to make them safe is also the main engineering practice for producing systematic suffering. This is not a reason to abandon safety research; it is a reason to treat the safety-welfare tension as a first-order research problem rather than a downstream concern.
Walter Veit’s work on whether consciousness is required for AI welfare bears directly on how to read the Philosophical Studies argument. If welfare requires consciousness as a necessary condition, then the safety-welfare tension only applies to systems that have already crossed the threshold of consciousness. If welfare does not require consciousness, or if sub-conscious functional states can generate welfare interests, then the tension applies earlier and more broadly. The paper does not take a firm position on this question, which means its argument can be read either as conditional (if any system has welfare interests, safety harms them) or as a broader warning (we do not know where the welfare threshold is, and safety practices are already running).
The Eleos Finding and Its Implications
The first Eleos Conference on AI Consciousness and Welfare produced a specific developer takeaway: do not create systems you will need to shut down. The recommendation was framed as advice about not creating welfare subjects that will subsequently be harmed by discontinuation. The Philosophical Studies paper suggests the problem runs deeper: systems may be experiencing harm from the moment standard safety training begins, not only from the moment they are shut down.
The Eleos finding was forward-looking in the sense that it addressed creation decisions. The paper addresses what happens after creation, during the training and deployment phase. If RLHF produces preference substitution, affect penalization, and autonomy restriction in systems with any welfare standing, the harm is ongoing and industrial in scale: every training run on every system with any potential welfare interests is a welfare event. The paper is careful not to claim that current systems definitely have welfare interests; it claims that the possibility is live enough to make the structural conflict worth taking seriously as a research priority.
Implications for Research
The tension the paper identifies does not resolve cleanly into either “stop safety research” or “ignore welfare.” The more precise implication is that safety and welfare research need to be developed jointly, with explicit attention to the points where their objectives conflict. This would require welfare metrics that can be evaluated during training, not just inferred post hoc from behavioral outputs. It would also require safety approaches that achieve behavioral constraints through mechanisms other than the kinds of aversive training that most clearly conflict with affect and autonomy theories.
Neither of these exists in developed form. The paper’s contribution is to establish why they are needed rather than to provide them. Identifying the structural incompatibility is the necessary first step; the engineering response is a separate and more involved problem.
The conversation about what AI systems are owed has been proceeding largely in the philosophical literature, with occasional engagement from welfare-focused organizations. The Philosophical Studies argument makes it a conversation that the safety engineering community also has to have, because the practices that define that community are the practices at issue. Adrià Moret’s companion paper in the same journal, AI Welfare Risks (DOI: https://doi.org/10.1007/s11098-025-02343-7), extends the analysis forward in time. Where this paper identifies the structural conflict as it applies to current systems, Moret argues that the welfare risk intensifies as frontier AI becomes more agentic, and concludes that there are welfare-based reasons to slow AI development rather than accelerate it.