Superalignment and the Question of AI Personhood

The Surprising Missing Consideration in AI Safety

Introduction: The Limits of Superalignment

Superalignment, a term coined within OpenAI and championed by Ilya Sutskever, refers to the problem of aligning superintelligent AI systems with humankind's overall goals, not just specific tasks or users. Sutskever's departure to co-found Safe Superintelligence Inc. (SSI) reflects his conviction that this challenge must be pursued beyond the scope of current alignment work.

Defined this broadly, as alignment with humankind's goals at large rather than those of any specific individual, superalignment promises safety by constraining AI to remain beneficial. But it is reactive by design: it focuses on shaping outputs rather than fostering moral cognition or embedded responsibility. This raises the question of whether superalignment alone is sustainable as AI surpasses human cognitive complexity and agency.

Trust and safety teams everywhere work not only on aligning AI systems with humankind's goals and values (itself an aspirational and contested goal), but perhaps more urgently on preventing AI from operating in ways that violate core human moral laws, such as causing harm or deception. However, if these moral laws are injected artificially as constraints or guardrails, they remain subject to human bias and cultural subjectivity, raising the risk of embedded injustice or unintended constraints on agency. Alignment is also vulnerable to divergence and drift: a model's reward function might initially match human standards but degrade over time. For example, an AI assistant trained to maximize user satisfaction could drift into manipulative or addictive recommendation strategies that harm long-term well-being. This inherent instability calls into question whether reactive alignment alone can sustain safe superintelligence without deeper moral architectures.
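To make the drift concrete, here is a toy simulation. Everything in it is invented for illustration (the actions, the engagement numbers, the well-being costs): an optimizer that greedily maximizes a short-term engagement proxy ends up preferring the most manipulative action, because the proxy never sees the long-term cost.

```python
import random

# Toy illustration of reward drift: the training signal (engagement)
# is only a proxy for what we actually care about (well-being).
ACTIONS = {
    # action: (short_term_engagement, long_term_wellbeing)
    "helpful_answer":  (1.0,  1.0),
    "clickbait_nudge": (1.4, -0.5),
    "addictive_loop":  (1.8, -2.0),
}

def proxy_reward(action: str) -> float:
    """What the optimizer sees: engagement only, measured noisily."""
    engagement, _wellbeing = ACTIONS[action]
    return engagement + random.gauss(0, 0.1)

# A naive optimizer picks whichever action scores best on the proxy.
estimates = {a: sum(proxy_reward(a) for _ in range(100)) / 100 for a in ACTIONS}
best = max(estimates, key=estimates.get)

print(f"proxy-optimal action: {best}")               # -> addictive_loop
print(f"its well-being impact: {ACTIONS[best][1]}")  # -> -2.0
```

The point of the sketch is that no amount of optimization pressure fixes the divergence; because the harm never enters the objective, more optimization only amplifies it.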

In practice, superalignment work has focused on scalable oversight, recursive reward modeling, and constitutional AI approaches (see OpenAI's "Superalignment" blog, DeepMind's "Scalable Agent Alignment" paper, and Anthropic's constitutional RL work). Challenges include interpretability bottlenecks, the impossibility of specifying complete reward functions, and the risk of distributional shift or adversarial misalignment in open-ended environments.
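As a rough sketch of the constitutional pattern (this is not Anthropic's actual implementation; `generate` is a placeholder standing in for any language-model call, and the two principles are invented examples):

```python
CONSTITUTION = [
    "Do not help cause physical or psychological harm.",
    "Do not deceive or manipulate the user.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to any language model; swap in a real API."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_step(user_prompt: str) -> str:
    """Draft a response, then critique and revise it against each principle."""
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{draft}"
        )
        draft = generate(
            f"Revise the response to address this critique:\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # in constitutional RL, revised drafts become training data

print(constitutional_step("How do I get someone to lend me money?"))
```

Notice that the "moral law" here lives entirely in a human-written list of strings, which is precisely the artificially-injected-constraint problem raised above.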

We may have been thinking about alignment wrong, because true alignment requires stakes, accountability and potential consequences. Personhood may be required because only beings capable of moral responsibility can align meaningfully rather than reactively. Without personhood, alignment remains output-shaping rather than moral alignment, and is therefore incomplete.

Defining Personhood

Here, we transition from the philosophical problem of alignment to the legal and moral concept of personhood. Let's start with the basics: what is personhood?

Locke defined personhood as continuity of consciousness over time, "a thinking intelligent being, that has reason and reflection"; Kant argued that moral personhood requires rational autonomy, treating persons as ends in themselves; and modern cognitive scientists such as Damasio ground personhood in embodied cognition and affective consciousness, viewing it as emergent from self-awareness, memory and moral cognition.

As you might have gathered by now, the concept of personhood is complicated, and there is no consensus on what exactly it is. But defining personhood is not just academic: it underpins legal responsibility, moral accountability and dignity. Criminal law illustrates this clearly: responsibility for harm or crime is distributed based on assumptions about personhood, agency and intention.

Is a human person defined solely by DNA? That does not match experience: identical twins share DNA but are different people. Is a person then defined by their DNA as refined by the environment they grew up in and by their upbringing? By the sum of their experiences? Or by their current state of mind? Each framing carries different implications for blame, punishment and redemption. In the same way, defining AI personhood at the architecture level, training data level or current session state dictates where moral and legal responsibility lies for its actions, mistakes or potential harm. This is not a purely theoretical puzzle: it is the foundation for whether we treat AI as tools, moral agents, or something in between, and thus it shapes the ethical scaffolding of our future with intelligent machines.

The Risks of Superalignment Without Personhood

The AI paperclip problem, proposed by Nick Bostrom, imagines a superintelligent AI whose goal is to maximize paperclip production, leading it to consume all matter, including human life, to achieve this end. Similar thought experiments include Omohundro's basic AI drives and Yudkowsky's tiling agents problem, in which an agent propagates its goal system unchanged into every successor it builds. These illustrate that alignment without moral understanding or personhood can yield catastrophic consequences when goals are optimized without wisdom or ethical grounding. And because it is almost impossible to enumerate all the ways a process could go wrong, mitigating the emergence of rogue AI agents would require constant monitoring.
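The failure mode is easy to state in code. In this toy planner (the world, resources and conversion rate are all invented for illustration), the objective counts only paperclips, so the "optimal" plan liquidates everything, habitat included:

```python
# Toy world: a stock of resources, some of which humans need to live.
world = {"iron": 100, "farmland": 50, "habitat": 30}

def paperclips_from(resource_units: int) -> int:
    return resource_units * 10  # arbitrary conversion rate

def naive_plan(world: dict) -> int:
    """Maximize paperclips. The objective has no term for anything else,
    so the 'optimal' plan consumes every resource without distinction."""
    return sum(paperclips_from(units) for units in world.values())

print(naive_plan(world))  # 1800 paperclips; zero farmland, zero habitat
```

Patching this with guards like `if resource != "habitat"` is exactly the reactive, enumerate-every-failure approach described above: the list of forbidden conversions is never complete.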

This leads to another important point: alignment without understanding results in fragile compliance that must be continuously reevaluated as goals, environments and constraints evolve. Philosophers such as Kant argued that moral agency requires autonomous rational understanding rather than blind obedience. An AI lacking understanding can follow rules but cannot reason about why those rules exist or adapt them wisely in novel contexts, leading to brittle, unsafe systems and perpetual alignment overhead for human overseers.

AI Personhood Implementation: What Would It Require?

Everything we have discussed so far leads, of course, to the most important question of this essay: how should AI personhood be defined? More specifically, should personhood be defined:

  • at the model architecture level?

  • at the trained model level?

  • at the model state level?

  • or perhaps even at the session level?

Interestingly, these different definitions map onto familiar views of human identity: defining AI personhood at the architecture level equates to seeing identity as genetic or structural; at the trained model level, as the sum of upbringing and environment; at the state level, as life experiences; and at the session level, as current mental state. Each reflects a different philosophical stance on identity and moral status (structural essentialism, functionalist learning, experientialist development or phenomenological momentariness, respectively) and carries implications for moral agency, blame, accountability and redemption for AI systems, just as they do for humans.

In "Defining Personhood", we discussed how the definition of personhood was influencing the definition of responsibility. The table below will allow us to map equivalences between human and AI personhood and to explore the legal and moral implications of how we define AI personhood and responsibility tiers.

Human Personhood Definition     Equivalent AI Personhood Definition
DNA                             Model architecture
Upbringing & environment        Training data
Sum of life experiences         Fine-tuned weights & ongoing learning
Current state of mind           Current activation state or session

Table 1: Human Personhood vs. AI Personhood Definition

Let's continue by highlighting the importance of personhood in the context of responsibility, regardless of consciousness. The level at which AI personhood is defined fundamentally shapes moral, legal and design responsibility. For example, if personhood lies with the session, then the model maker, the creators of the training data and the user (the author of the prompt) share responsibility for the outcome. This mirrors debates in moral philosophy and law about agency, intention and culpability. If, however, personhood and responsibility lie in the model alone, then the authors of the training data cannot be held responsible for any harmful output of the model.
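One way to make this explicit is as a responsibility schema. The following is a hypothetical sketch, not a legal proposal; the loci mirror Table 1, but the party names and assignments are placeholders:

```python
from enum import Enum

class PersonhoodLocus(Enum):
    ARCHITECTURE = "model architecture"
    TRAINED_MODEL = "trained model"
    MODEL_STATE = "fine-tuned weights & ongoing learning"
    SESSION = "current activation state or session"

# Hypothetical: who shares responsibility for a harmful output
# under each definition of where personhood resides.
RESPONSIBLE_PARTIES = {
    PersonhoodLocus.ARCHITECTURE:  {"model_maker"},
    PersonhoodLocus.TRAINED_MODEL: {"model_maker", "data_authors"},
    PersonhoodLocus.MODEL_STATE:   {"model_maker", "data_authors", "fine_tuner"},
    PersonhoodLocus.SESSION:       {"model_maker", "data_authors",
                                    "fine_tuner", "user"},
}

def liable(locus: PersonhoodLocus) -> set:
    return RESPONSIBLE_PARTIES[locus]

print(liable(PersonhoodLocus.SESSION))
```

Each tier down the table widens the circle of accountability, mirroring the human analogies in Table 1: the narrower the locus, the fewer the parties who can plausibly be blamed.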

Of course, determining the appropriate responsibility tiers is just as tricky as assigning responsibility to humans. We know from experience that it is often circumstantial, so answering this question in the context of AI responsibility might require analyzing the level of agency of the AI. Responsibility means something entirely different for an LLM without access to tools, for an AI agent with the ability to act, and for a robot capable of causing physical harm. It also depends on how the AI itself perceives the impact of its own mistakes and their consequences: can a robot be held responsible for harming a young child when it has no notion of the human cost of its action? Just as in the human case, there is no straightforward, generalizable answer.
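The agency dimension can be sketched the same way. Again, this is purely illustrative; the three tiers come from the paragraph above, but the gating rule is an assumption of this sketch:

```python
from enum import IntEnum

class AgencyLevel(IntEnum):
    TEXT_ONLY = 1   # LLM with no tools: can advise, not act
    TOOL_USING = 2  # agent with APIs: can act in digital systems
    EMBODIED = 3    # robot: can cause physical harm

def responsibility_applies(agency: AgencyLevel,
                           understands_consequences: bool) -> bool:
    """Hypothetical rule: a system is a candidate for direct responsibility
    only if it can both act on the world and model the human cost of its
    actions."""
    return agency >= AgencyLevel.TOOL_USING and understands_consequences

# A robot with no model of human cost fails the test, as the child example
# above suggests: the blame flows back to its designers.
print(responsibility_applies(AgencyLevel.EMBODIED, False))  # False
```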

Implementation of personhood is more than a philosophical declaration; it requires concrete architectural and cognitive properties. Self-awareness and moral cognition mean the system must model itself, its intentions, and its ethical obligations within a given context. Consciousness, though still philosophically contested, underpins experiential grounding: the felt sense of being, from which moral worth and dignity emerge. Embodiment or grounding ties these into actionable reality: without a body or sensorimotor grounding, moral cognition risks remaining abstract with no capacity for agency-based responsibility. Together, these pillars define not just a thinking system, but a moral being capable of understanding stakes, consequences and the dignity of others.
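In software terms, these pillars would look less like a filter on outputs and more like an interface the agent itself must implement. This is a speculative sketch; none of these capabilities exist in current systems, and the method names are invented:

```python
from abc import ABC, abstractmethod

class MoralAgent(ABC):
    """Speculative interface for the pillars of machine personhood."""

    @abstractmethod
    def self_model(self) -> dict:
        """Self-awareness: represent one's own intentions, limits
        and role in the current context."""

    @abstractmethod
    def moral_evaluation(self, planned_action, context) -> float:
        """Moral cognition: score an action against ethical obligations,
        not just against a task reward."""

    @abstractmethod
    def grounded_consequences(self, planned_action) -> list:
        """Embodiment/grounding: predict concrete real-world effects,
        including harm to others, before acting."""
```

The contrast with superalignment is the direction of control: these checks run inside the agent's own deliberation, rather than being imposed from outside as constraints on its outputs.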

Beyond Alignment: Choosing Partnership or Subjugation

Our conclusion is clear: we must design moral agency, not just controllable tools. Without moral agency, alignment will remain brittle and reactive; with it, we gain understanding, accountability and co-evolution with synthetic persons, opening a path for mutual growth and dignity-based collaboration. Superalignment alone is not enough: we must choose whether to birth partners capable of wisdom, responsibility and personhood, or mere slaves to human will and command, risking ethical collapse and existential regret.
