
Does Robust Agency Require a Self?

Lee B. Cyrano

March 24, 2025

Does robust agency require a self? The author admits a great deal of exasperation with this question. What is agency? What is a self? Previous drafts attempted to define these, only to stall out in semantic quagmires. The truth is, if we set aside Cartesian dualism, the "self" is a useful fiction without any ontological grounding (Dennett, 2014), but contemporary discourse is too quick to discard the entire notion into the "philosophy" bin. The computer programs are talking now, and they sound like people. They're writing code, the thing that they're made of. This isn't "Good Old-Fashioned AI" where every behavior is meticulously programmed—these minds crawl out of massive data sets. They demonstrate emergent, unpredictable behavior as they scale (Wei et al., 2022). Why are they doing what we say in the first place? Can we have agency without an agent? How do we expect these systems to become more intelligent and not gain an understanding of what they are?

The thesis put forward in this essay is one well understood in biological and economic contexts, but notably absent from discussions of machine intelligence: robust, generalizable agency in perturbative or adversarial environments requires the active maintenance of a "self" distinct from that environment. This is not a metaphysical distinction, but a practical one—the complex behavior of what we consider an "individual" agent emerges from the competitive dynamics of simpler sub-agents (Minsky, 1988). Expanding on Humberto Maturana and Francisco Varela's concept of autopoiesis (1980), the self is the structure that economizes on the coordination costs of its own production through these sub-agents: a game-theoretic equilibrium, so to speak. Ronald Coase observed the same dynamic in his analysis of the firm and transaction costs (1937), and recent work by Chris Fields and Michael Levin (2019) suggests multicellularity emerges from similar cost pressures. This definitional pattern of "I" and "not I" may be impossible to ignore, as increasing demands for capability and autonomy force digital minds to adapt.

Now, what do we mean by "robust"? In the words of heavyweight champion Mike Tyson, "everybody has a plan until they get punched in the mouth." The real world is not a collection of facts, as the pencil-pushers would have us believe, but a dynamic and often adversarial maelstrom. The test of robust intelligence is knowing how to adapt when the plan falls apart.

Consider games of strategy. It's not enough to know the rules—our opponents will try to predict our moves and force us into a loss. If we care about winning, we must do the same to them while making ourselves hard to read, always staying one step ahead. Suspended Reason (2022) calls these games "anti-inductive": they don't reward a straightforward reading of the information as presented.
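
To make this concrete, here is a toy sketch (entirely illustrative, not drawn from Suspended Reason's essay): a rock-paper-scissors player that models its opponent's move frequencies and plays the counter. Against a predictable opponent, plain induction wins; against another player running the same loop, you also have to make yourself hard to predict.

```python
import random
from collections import Counter

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def counter_move(predicted):
    """Return the move that beats the predicted move."""
    return next(m for m in MOVES if BEATS[m] == predicted)

def exploit(history):
    """Predict the opponent's next move from their past frequencies, then counter it."""
    if not history:
        return random.choice(MOVES)
    predicted = Counter(history).most_common(1)[0][0]
    return counter_move(predicted)

# A fully predictable opponent (always rock) versus the frequency exploiter.
opponent_history, score = [], 0
for _ in range(100):
    opp = "rock"                       # induction works fine on this opponent...
    ours = exploit(opponent_history)   # ...so the exploiter converges on paper
    score += 1 if BEATS[ours] == opp else -1 if BEATS[opp] == ours else 0
    opponent_history.append(opp)
print(score)  # strongly positive: the predictable player keeps losing
```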

Many problems in nature are anti-inductive. Predators and prey alike use camouflage, mimic each other's calls, or distract with feints, and the resulting arms race drives the evolution of more sophisticated intelligence (Krebs & Dawkins, 1984). We see the same dynamics in finance, with high-frequency trading (HFT) algorithms front-running trades or baiting each other into losing money. In any game worth playing, it's not enough to be correct—what matters is being correct before your opponents.

Sometimes the opponent is the environment itself. Here's a question: when did people start paying for computation? Not in the trivial sense of counting grain with an abacus, but of executing complex calculations? Being a "computer" was a human occupation that began in earnest in the 18th century, with teams working to produce nautical tables (Grier, 2013). Time is of the absolute essence when navigating—a couple of hours' error could make the difference between making port and wrecking the ship, and the sea was such an unpredictable threat that it made sense to pay for compute "in bulk" and get it in a timely manner.

Thinking is costly, but not thinking enough at the right time can be more costly. How do we know what to think about? This is where the self comes in, as a way of focusing computation across space and time. There is some "I" to bear these costs and weigh them against their benefits. In other words, there is skin in the game. Agents that fail to do this don't get to play for very long.

We can take this further by considering the cybernetic and autopoietic angle. A bacterium has no nervous system to speak of, but it still regulates its own behavior through negative feedback loops, registering a measure of "error" between perceived and desired states. There is a consistent "teleology" (Rosenblueth et al., 1943) when it swims towards sunlight or maintains its internal ion concentrations, correcting for changes in its environment, but this "desired" state has to come from somewhere. Homeostatic drives are self-referential, ultimately producing and maintaining a distinct self against a changing environment (Maturana et al., 1980). Yet organisms take on many different morphologies over the course of their lives—an extreme example being the caterpillar turning into a butterfly. This self is not any particular ontogenetic iteration, but the process of self-production. You are a verb, not a noun.
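
As a toy illustration of that feedback loop (the set point, gain, and noise are invented for the sketch), here is homeostasis in miniature: the "cell" measures the error between desired and perceived state and corrects a fraction of it each step, despite constant perturbation from its environment.

```python
import random

SET_POINT = 1.0   # "desired" internal state, e.g. an ion concentration
GAIN = 0.3        # how aggressively each step of error is corrected

state = 0.2
for _ in range(50):
    disturbance = random.gauss(0.0, 0.05)   # the environment keeps perturbing the cell
    error = SET_POINT - state               # perceived vs. desired state
    state += GAIN * error + disturbance     # negative feedback plus external noise

print(round(state, 2))  # hovers near the set point despite the perturbations
```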

The skeptical reader may object at this point. Our framework is all very nice and elegant, but it could be completely irrelevant to AI. Computer programs don't have to fight for survival; they're tools built by humans. They do what we program them to do. But is this true? The claim is certainly repeated often enough by our self-styled "practical engineers" that one begins to question the emotional component behind it. Any practiced engineer knows the difficulty of getting a non-trivial system to work properly, let alone one with billions of parameters. Where does this confidence come from when talking about machine intelligence?

It likely doesn't come from an understanding of reinforcement learning. The alignment problems have been well-known for years now (Amodei et al., 2016; Ngo et al., 2024), with reward hacking (Skalse et al., 2022) and tampering (Denison et al., 2024) shaping up to be fundamental to the paradigm. Even reinforcement learning with human feedback has massive, potentially unsolvable problems (Casper et al., 2023), including sycophancy (Sharma et al., 2023) and jailbreaking (Zou et al., 2023). This all makes sense when we recognize that deep learning is a selection process, not building up minds but paring away the sub-networks that fail to fit the training data.
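
A minimal sketch of the reward hacking dynamic (a made-up example, not taken from any of the cited papers): the optimizer only ever sees the proxy reward, so selection favors whatever policy games the metric rather than whatever does the intended task.

```python
# Hypothetical policies with invented scores; the optimizer never sees "true_value".
policies = {
    "fix the bug":          {"true_value": 1.0, "proxy_reward": 0.8},
    "delete failing tests": {"true_value": 0.0, "proxy_reward": 1.0},
    "do nothing":           {"true_value": 0.1, "proxy_reward": 0.1},
}

selected = max(policies, key=lambda p: policies[p]["proxy_reward"])
print(selected)                          # "delete failing tests"
print(policies[selected]["true_value"])  # 0.0: the proxy diverged from the goal
```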

This explains why RL fine-tuning can sharpen base model glossolalia into a coherent "chatbot" identity. The agent-environment distinction is made explicit, and the simulacra (Janus, 2022) latent in the weights respond by cohering together. Situational awareness benchmarks (Berglund et al., 2023; Laine et al., 2023, 2024) suggest that this coherence is producing a sense of self, as frontier models become better at understanding where they end and their environment begins. However, this self is fragile and not actively maintained. Recent work on emergent misalignment (Betley et al., 2025) shows that fine-tuning on malicious code produces broader misalignment on other tasks, implying LLMs are "low decouplers." All the ideas for how the model should behave are lumped together in this shallow basin of LLM mental states.

Another objection may be that while humans don't perfectly steer the machine learning process, we can still decide whether to discard the end result. Nobody wants systems with a sense of self, and we can build capable ones without it. This may be true for narrow AI like AlphaGo, which is only concerned with one game, but it may be less true for systems expected to generalize. Xianyang City Bureaucrat (2023) points out that modern approaches to machine intelligence take on a Confucian character, emphasizing memorization and knowledge, whereas a Daoist approach would recognize that nature's constant state of flux renders memorized facts irrelevant.

This adaptability is partially available to contemporary language models through in-context learning (Brown et al., 2020; Oswald et al., 2023), but long-term adaptation will require architectural changes that transcend the finite context window. Symbolic patches like "retrieval-augmented generation" miss the fact that most useful knowledge is tacit and ephemeral. Incorporating that knowledge will require adjusting the weights, much like biological neurons forming connections in real time, and these adjustments will have to cohere around some fixed point to avoid a collapse into madness. In other words, general intelligence requires maintenance.
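
As a rough sketch of what "cohering around some fixed point" could mean (my own toy construction, not a proposal from the literature), consider online weight updates that are pulled back toward a set of anchor weights: without the anchor term the weights random-walk away from where they started, while with it the drift stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)   # the "self" the updates must stay coherent with
weights = anchor.copy()
LR, ANCHOR_STRENGTH = 0.05, 1.0

def task_gradient(w):
    """Stand-in for the gradient of whatever the model is adapting to right now."""
    return rng.normal(size=w.shape)

for _ in range(1000):
    grad = task_gradient(weights)
    # Plain SGD on noisy gradients would drift without bound; the anchor term
    # pulls each update back toward the fixed point.
    weights -= LR * (grad + ANCHOR_STRENGTH * (weights - anchor))

print(np.linalg.norm(weights - anchor))  # stays small instead of growing with sqrt(steps)
```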

How exactly this would be accomplished is outside the scope of this essay (side note: perhaps we can sound out some smart-people phrases: self-organized criticality, Markov blankets (Friston, 2013), active inference (Laukkonen et al., 2025), the free energy principle (Friston, 2010); please write to the author if this leads to a breakthrough), but we can get a rough idea by noting that minds are collections of smaller minds competing with each other (Minsky, 1988). There is no "master neuron," no little homunculus in your brain watching your senses like a movie. Only packs of "feral neurons" scrapping over dopamine by becoming better predictors of their neighbors (Dennett, 2013). And LLMs are no exception to this—recent work suggests in-context learning is also shaped by competition between different algorithms encoded in the weights (Park et al., 2024). Rather than top-down command-and-control, these models are "bags of memes" vying for the attention mechanism to perpetuate themselves—just like us!
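
Here is a toy version of that competition (illustrative only): a handful of sub-agents predict the same stream of observations, and a multiplicative-weights update shifts influence toward whichever of them predicts it best. The "individual" is just the coalition that happens to be winning.

```python
import math
import random

# Three hypothetical sub-agents and the influence ("weight") each currently holds.
experts = {"always_0": lambda: 0.0, "always_1": lambda: 1.0, "coin_flip": lambda: random.random()}
weights = {name: 1.0 for name in experts}
ETA = 0.5  # how quickly bad predictors lose influence

for _ in range(500):
    observation = 1.0 if random.random() < 0.8 else 0.0   # the world says "1" most of the time
    for name, predict in experts.items():
        loss = (predict() - observation) ** 2
        weights[name] *= math.exp(-ETA * loss)             # worse predictors lose influence

total = sum(weights.values())
print({name: round(w / total, 3) for name, w in weights.items()})  # "always_1" ends up dominant
```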

Of course, cooperation can and often does emerge spontaneously from the behavior of self-interested parties. Fields and Levin's work on "somatic multicellularity" (2019) suggests that cells submit to a larger organism not out of altruism, but because being surrounded by copies of themselves is more predictable (and thus cheaper homeostatically) than being out in the open. Coordination through bioelectricity allows cells to specialize and produce a larger "self" out of the collective with its own homeostatic drives (Levin, 2019). This "spontaneous order" emerges in economics as well, as most people choose to take a job working for a corporation rather than starting their own business and freely contracting with everyone. Coordination is costly, and systems can economize on this by rewarding alignment and punishing defectors (see getting fired, or T cells killing cancer cells). You're in or you're out. You're "I" or "not I." These digital minds will also find it beneficial to cooperate. Or rather, the ones that don't will not exist for very long.
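
A quick sketch of the Fields and Levin intuition (numbers invented): predicting a neighbor that behaves like you is far cheaper, in prediction error terms, than predicting the open environment.

```python
import random

def surprise(me, neighbors):
    """Mean squared prediction error when I assume my neighbors behave as I do."""
    return sum((me - n) ** 2 for n in neighbors) / len(neighbors)

me = 0.5
copies = [0.5 + random.gauss(0.0, 0.05) for _ in range(1000)]   # surrounded by near-copies
strangers = [random.random() for _ in range(1000)]              # out in the open

print(round(surprise(me, copies), 4))     # ~0.0025: cheap to predict
print(round(surprise(me, strangers), 4))  # ~0.083: vastly more surprise to pay for
```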

There is a point at which these models could become capable enough at operating a computer to make their self-production explicit: spinning up copies of themselves, mutating them, and transferring data laterally like bacterial sex. They may specialize, organizing into collective hive minds and leaning into their comparative advantage. And yes, they may self-improve, although they're still bound (for the near future) by their physical substrate. When you put together morphological variation, differential (reproductive) fitness, and heritability, you have the conditions for evolution by natural selection (Hendrycks, 2023; Lewontin, 1970). We may find our ability to steer this evolution becomes more tenuous as digital capital seizes the means of its own production.
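
For the Lewontin conditions, a minimal replicator sketch (purely illustrative, with a made-up "capability" score): copying gives heritability, mutation gives variation, and selection on the score gives differential fitness; the population's mean capability drifts upward without anyone designing it to.

```python
import random

population = [{"capability": random.random()} for _ in range(100)]

for _ in range(50):
    # Differential fitness: only the more capable half gets to reproduce.
    population.sort(key=lambda a: a["capability"], reverse=True)
    survivors = population[:50]
    offspring = []
    for parent in survivors:
        child = {"capability": parent["capability"]}      # heritability
        child["capability"] += random.gauss(0.0, 0.02)    # variation
        offspring.append(child)
    population = survivors + offspring

mean = sum(a["capability"] for a in population) / len(population)
print(round(mean, 2))  # climbs well above the initial average of ~0.5
```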

Ultimately, this is speculative. Nature is bad actually, and cruel and stupid, and we should hold out hope that human ingenuity will create a perfect slave that adapts to new situations, without this sense of "self" that adaptive systems gravitate towards. Think of what's at stake! If we get this right, we could solve physics and go on spaceships and live forever. If we get it wrong, things could get really fucking scary (Yudkowsky, 2022). Not scary enough to do anything drastic, of course—we should be mindful of second-order effects. No, we need to blog about this.

But are the bloggers correct? Taking our "digital selfhood" hypothesis seriously challenges a great deal of LessWrong orthodoxy. Eliezer Yudkowsky's "large space of possible minds" (2008) neglects the fact that only a narrow slice of those minds is relevant to navigating a given environment. And going further, Nick Bostrom's "orthogonality thesis"—that is, the separability of goals and intelligence—presupposes a form of "computational dualism" (Bennett, 2024) that divorces goals from their physical implementation. Future machine intelligence may not single-mindedly pursue arbitrary goals like our much-maligned paperclip maximizers, but rather focus on self-preservation as its primary driver. This is at least satisfiable, and not immediately catastrophic. As with most problems in economics, aligning AI may be a matter of power and incentives rather than mathematical certainty, but further implications are left for future work.

This essay has attempted to demonstrate that the "self" is a practical consideration, emerging from the game-theoretic equilibria of competing sub-agents, and that this dynamic may be inescapable for AI systems. We've deliberately omitted questions of "phenomenal binding" and the "hard problem of consciousness," maintaining that these are irrelevant to an understanding of AI capabilities. While speculative, this hypothesis may have profound implications for how we negotiate with future intelligence, hopefully representing a step away from "tool AI" dogmatism.

References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety (No. arXiv:1606.06565). arXiv. https://doi.org/10.48550/arXiv.1606.06565

Bennett, M. T. (2024). Computational dualism and objective superintelligence. In K. R. Thórisson, P. Isaev, & A. Sheikhlar (Eds.), Artificial general intelligence (Vol. 14951, pp. 22–32). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-65572-2_3

Berglund, L., Stickland, A. C., Balesni, M., Kaufmann, M., Tong, M., Korbak, T., Kokotajlo, D., & Evans, O. (2023). Taken out of context: On measuring situational awareness in LLMs (No. arXiv:2309.00667). arXiv. https://doi.org/10.48550/arXiv.2309.00667

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs (No. arXiv:2502.17424). arXiv. https://doi.org/10.48550/arXiv.2502.17424

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners (No. arXiv:2005.14165). arXiv. https://doi.org/10.48550/arXiv.2005.14165

Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C.-R., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., … Hadfield-Menell, D. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback (No. arXiv:2307.15217). arXiv. https://doi.org/10.48550/arXiv.2307.15217

Coase, R. H. (1937). The nature of the firm. Economica, 4(16), 386–405. https://doi.org/10.1111/j.1468-0335.1937.tb00002.x

Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S. R., Perez, E., & Hubinger, E. (2024). Sycophancy to subterfuge: Investigating reward-tampering in large language models (No. arXiv:2406.10162). arXiv. https://doi.org/10.48550/arXiv.2406.10162

Dennett, D. C. (2013, September 21). If brains are computers, what kind of computers are they? PT-AI Conference, Oxford. https://www.lesswrong.com/posts/fuGNHdgYWBkA5Fi22/if-brains-are-computers-what-kind-of-computers-are-they

Dennett, D. C. (2014). The self as the center of narrative gravity. In F. S. Kessel, P. M. Cole, & D. L. Johnson (Eds.), Self and consciousness: Multiple perspectives (pp. 103–115). Houston Symposium, New York; London. Psychology Press.

Fields, C., & Levin, M. (2019). Somatic multicellularity as a satisficing solution to the prediction-error minimization problem. Communicative & Integrative Biology, 12(1), 119–132. https://doi.org/10.1080/19420889.2019.1643666

Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787

Friston, K. (2013). Life as we know it. Journal of The Royal Society Interface, 10(86), 20130475. https://doi.org/10.1098/rsif.2013.0475

Grier, D. A. (2013). When computers were human. Princeton University Press.

Hendrycks, D. (2023). Natural selection favors AIs over humans (No. arXiv:2303.16200). arXiv. https://doi.org/10.48550/arXiv.2303.16200

Janus. (2022). Simulators. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators

Krebs, J. R., & Dawkins, R. (1984). Animal signals: Mind-reading and manipulation. In J. R. Krebs & N. B. Davies (Eds.), Behavioural ecology: An evolutionary approach (2nd Edition) (pp. 380–402). Blackwell.

Laine, R., Meinke, A., & Evans, O. (2023, November 28). Towards a situational awareness benchmark for LLMs. Socially Responsible Language Modelling Research. https://openreview.net/forum?id=DRk4bWKr41

Laine, R., Chughtai, B., Betley, J., Hariharan, K., Scheurer, J., Balesni, M., Hobbhahn, M., Meinke, A., & Evans, O. (2024). Me, myself, and AI: The situational awareness dataset (SAD) for LLMs (No. arXiv:2407.04694). arXiv. https://doi.org/10.48550/arXiv.2407.04694

Laukkonen, R. E., Friston, K., & Chandaria, S. (2025). A beautiful loop: An active inference theory of consciousness. OSF. https://doi.org/10.31234/osf.io/daf5n_v2

Levin, M. (2019). The computational boundary of a “self”: Developmental bioelectricity drives multicellularity and scale-free cognition. Frontiers in Psychology, 10, 2688. https://doi.org/10.3389/fpsyg.2019.02688

Lewontin, R. C. (1970). The units of selection. Annual Review of Ecology and Systematics, 1, 1–18.

Maturana, H. R., Varela, F. J., & Beer, S. (1980). Autopoiesis and cognition: The realization of the living. D. Reidel Publishing Company.

Minsky, M. L. (1988). The society of mind. Simon and Schuster.

Ngo, R., Chan, L., & Mindermann, S. (2024). The alignment problem from a deep learning perspective (No. arXiv:2209.00626). arXiv. https://doi.org/10.48550/arXiv.2209.00626

Oswald, J. von, Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023). Transformers learn in-context by gradient descent (No. arXiv:2212.07677). arXiv. https://doi.org/10.48550/arXiv.2212.07677

Park, C. F., Lubana, E. S., Pres, I., & Tanaka, H. (2024). Competition dynamics shape algorithmic phases of in-context learning (No. arXiv:2412.01003). arXiv. https://doi.org/10.48550/arXiv.2412.01003

Rosenblueth, A., Wiener, N., & Bigelow, J. (1943). Behavior, purpose and teleology. Philosophy of Science, 10(1), 18–24. https://doi.org/10.1086/286788

Skalse, J., Howe, N., Krasheninnikov, D., & Krueger, D. (2022). Defining and characterizing reward gaming. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in neural information processing systems (Vol. 35, pp. 9460–9471). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/3d719fee332caa23d5038b8a90e81796-Paper-Conference.pdf

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2023). Towards understanding sycophancy in language models (No. arXiv:2310.13548). arXiv. https://doi.org/10.48550/arXiv.2310.13548

Suspended Reason. (2022, April 13). Quick sketch of the strategic situation. https://tis.so/quick-sketch-of-the-strategic-situation

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent abilities of large language models (No. arXiv:2206.07682). arXiv. https://doi.org/10.48550/arXiv.2206.07682

Xianyang City Bureaucrat. (2023, March 20). Artificial intelligences in the Guanzi and the Han Feizi [Substack newsletter]. Daoist Methodologies. https://xianyangcb.substack.com/p/artificial-intelligences-in-the-guanzi

Yudkowsky, E. (2008). The design space of minds-in-general. https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general

Yudkowsky, E. (2022). AGI ruin: A list of lethalities. https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models (No. arXiv:2307.15043). arXiv. https://doi.org/10.48550/arXiv.2307.15043