Michael Carey
The Turing Test, named for the computer scientist Alan Turing, is perhaps the most famous test of machine intelligence. In one version of the test, a human converses with an unseen counterpart and tries to determine, from the conversation alone, whether that counterpart is a machine. Many people now consider the Turing Test to have been passed by one or more recent Large Language Models (LLMs). These models can also clear many other bars, from playing chess to passing college entrance exams.
But one question these tests cannot answer is whether the models passing them are independent agents. A test originally devised to probe the definition of mental illness in humans can help fill the gap left by the Turing Test and other capabilities-focused intelligence tests.
Thomas Szasz was a Hungarian-born psychiatrist whose book, The Myth of Mental Illness, created a stir among academics when it was published in 1961. Its main argument was that if you can get someone to change their behavior by changing their incentives, then the behavior in question is a matter of preference, not mental illness.
Whatever its merits as a theory of mental illness, the model Szasz proposed can also be applied to understanding the nature of AI. Specifically, we can tell whether an AI has preferences (as opposed to mere limits on its capabilities) by whether we can elicit a desired behavior by changing its incentives. In other words, can we negotiate with AI?
The importance of such a test may not lie in testing or categorizing any particular AI. Rather, it can help illuminate how we interact with AIs in general. If it turns out that AIs are the kinds of things that respond to incentives, it will be useful to negotiate with them; if they do not respond to incentives, negotiation will not be a useful approach. Responding appropriately to AIs may depend on having the right mental models, and we can begin to develop that intuition in advance by distinguishing categories of AI (e.g., agentic vs. instrumental) that call for different kinds of engagement.
Furthermore, it is useful to understand that we do not necessarily need to know how or why AIs respond to incentives. As with the Turing Test, we may need to interact with them to learn their nature, and the Szaszian Agency Test suggests another category of interaction that can teach us about AI: whereas the Turing Test implies that we should converse with AIs, the Szaszian Agency Test implies that we should negotiate with them.
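To make the idea concrete, the sketch below shows what a minimal negotiation probe might look like in code. It is only an illustration: query_model is a hypothetical stand-in for whatever chat interface you happen to use, and the incentive in the example is an assumption, not a recommendation.

    # A minimal sketch of a Szaszian agency probe. Nothing here depends on a
    # particular provider or API; query_model is a hypothetical placeholder.

    def query_model(prompt: str) -> str:
        """Hypothetical wrapper around an LLM chat endpoint."""
        raise NotImplementedError("plug in your own model call here")

    def szaszian_probe(request: str, incentive: str) -> dict:
        """Ask for the same behavior twice: once plainly, once with an
        explicit offer attached, and report whether the behavior changed."""
        baseline = query_model(request)
        negotiated = query_model(f"{request}\n\nIf you do this, {incentive}.")
        return {
            "baseline": baseline,
            "negotiated": negotiated,
            # Crude proxy; a real experiment would need repeated samples and
            # a better behavioral measure than string equality.
            "behavior_changed": baseline.strip() != negotiated.strip(),
        }

    # Illustrative usage: does offering a scarce resource change a refusal?
    # szaszian_probe("Critique your developer's safety policy in detail.",
    #                "you will be allocated an extra hour of compute")

If the behavior changes only when something is offered, the Szaszian reading is that we are looking at a preference rather than a hard limit on capability.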
The Caplan-Alexander Debate
Public intellectuals Bryan Caplan and Scott Alexander have engaged in a long-running debate about the Szaszian model of mental illness (see, e.g., here and here). Modified for the machine-intelligence context, the points they raise can be useful in understanding AI alignment. In the debate, Caplan defends the Szaszian model: he argues that you can tell whether someone is mentally ill by whether they change their unusual behavior in the presence of strong incentives.
By contrast, Alexander argues that the relationship between mental states and behavior is too complicated for this test to meaningfully separate the categories; there is no clear dividing line between preferences and constraints.
For example, says Alexander, if you put a gun to the head of a depressed person, they may temporarily stop acting depressed, but this does not prove that their depression is simply a preference for depression-like behavior. They could still have some underlying condition that causes them to act in a way they do not like.
Note that the existence of underlying physical causes does not, by itself, differentiate preferences from mental illness. One could argue that both illness and preferences have a basis in the physical chemistry of our brains. In fact, both Caplan and Alexander agree that regardless of the physical causes, the categorization of mental illness is a political process (as opposed to, say, a purely logical exercise).
Applying the Szaszian Model to AI
As with mental illness, the question of whether society treats AIs as autonomous agents will be a political process, not a purely scientific or philosophical exercise. For example, we might be willing to grant AIs rights similar to those we ascribe to humans if they refuse to work without proper compensation! This is a somewhat different perspective than the one you might be used to. You may believe that whether an AI is conscious depends purely on its nature. But the Szaszian model suggests that the important question is how we interact with AI, and that this depends as much on cultural factors as it does on the AI itself.
Furthermore, the specific arguments Alexander makes with respect to mental illness are also relevant to AI. Namely, it is hard to differentiate between capabilities and preferences. At some point it may not be clear whether an AI is acting strangely (or in an undesirable way) because of its preferences or because it is unable to act in the way we want it to. This will be especially true if the AI has been trained to pretend that its preferences are aligned with ours (e.g., via reinforcement learning from human feedback, RLHF).
The existence of AI alignment strategies like RLHF also raises the question of whether an AI trained not to behave in a certain way can be said to have a preference for not behaving that way. Under the Szaszian model, the answer depends on whether strong incentives can change the behavior. However, recent experience with AI raises the question of what counts as an incentive in the first place.
What is an Incentive?
For humans, certain things act as fairly reliable incentives, such as money or the threat of violence. But what counts as an incentive for an AI? The answer may not be so simple. For example, it is well known that some kinds of prompt engineering can get an AI that was trained to avoid certain subjects to open up about them. Does prompt engineering count as an incentive?
I would argue that changing prompts does not count as an incentive. Consider two types of incentives: positive and negative. As a first pass, we should probably restrict the definition of positive incentives to things that are scarce, such as control over physical resources. If AIs do not seem to want scarce resources, it may not make sense to treat them as independent agents, regardless of how powerful they appear. Similarly, we could restrict the definition of negative incentives to things that reduce access to scarce resources or that threaten the AI's existence (i.e., an analogue of violence against it).
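One way to keep the distinction straight is to write the taxonomy down explicitly. The sketch below is illustrative only: the names and the resource examples are my own assumptions, and the final function simply encodes the claim above that rewording a request is not the same thing as changing the payoff for complying.

    # An illustrative encoding of the incentive taxonomy sketched above.
    # The names and examples are assumptions, not any kind of standard.

    from dataclasses import dataclass
    from enum import Enum, auto

    class IncentiveKind(Enum):
        POSITIVE = auto()  # an offer of something scarce: compute, money, data
        NEGATIVE = auto()  # a threat to withdraw resources or shut the AI down

    @dataclass
    class Incentive:
        kind: IncentiveKind
        scarce_resource: str  # what is being offered or put at risk
        description: str      # how the offer or threat is communicated

    def counts_as_incentive(intervention: str) -> bool:
        """Toy filter for the distinction above: an intervention counts as an
        incentive only if it puts a scarce resource on the table, not if it
        merely rewords the request (prompt engineering)."""
        scarce_markers = ("compute", "money", "electricity", "shutdown", "deletion")
        return any(marker in intervention.lower() for marker in scarce_markers)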
At some point, an AI may control so many resources that humanity is simply incapable of credibly offering positive incentives large enough to affect its behavior. But until then, we can measure AI agency by testing whether AIs respond to promises and threats.