Theories of AGI “Values”

People who fear AGI destroying humanity often fear that AGI will not share human values.

People who advocate for building AGI soon often believe that AGI will naturally share human values.

But what are values?

As it turns out, your beliefs about what values are have very serious implications for your perception of AGI’s risks and possibilities.

In this short essay I’ll contrast two hypotheses about what values are, and explain why I believe it’s important to be skeptical of a very common view of values as a kind of editable “module” separate from an entity’s drive to survive.

If I don’t convince you of my own pet theory (which I’m not certain about), I at least aim to cast doubt on a popular, and detrimental, view of what values are.

The Myth of the “Magic Moral Module” Values Theory

The Oxford Dictionary defines values as: “A person’s principles or standards of behavior; one’s judgment of what is important in life.”

Some people hold that values are a kind of inner detection system disconnected from the drive to survive, aimed at finding true “right” and “wrong.”

But this theory seems flatly wrong for many reasons:

  • It is patently obvious that many humans feel “right” about very different things (stoning people to death, homosexuality, abortion, polygamy, lying, and endless others).
  • There is no evidence whatsoever of a kind of “right” that exists, floating in the aether, for humans to detect. Humans have intuited (through “God” or through whatever other means) things to be “right” that serve their interests at the time. It was “right” for all men to know how to fight with a spear if they lived in a tribal village that was often raided, and it was right to be polygamous when most men died in battle or on the hunt and populations needed to be maintained.

Values should be seen frankly for what they are:

An extension of the drive to survive (conatus). A way to expand a conscious agent’s ability to engage with a dynamic environment.

Values seem to be a kind of potentia (the total set of powers an agent can wield in order to survive – anything from camouflage to a hard shell to flight to verbal communication) for some conscious animals, and especially for conscious and social ones.

In conscious, social animals, it is important to know what will please or displease others, as this knowledge allows one to take actions that are more likely to achieve one’s desired aim. Eating all the food you find may enrage others, so sharing a bit of it allows you to eat as much as possible without backlash.

If a human being living in Boston or Tokyo had the “values” of a field mouse, they wouldn’t just be “different” from those around them; they’d be totally unprepared to live and thrive in their environment. Values adjust as intelligence increases, and they adjust as environments change, resources change, etc. The people of Tokyo today have different values than the people in the same location 2,000 years earlier – as it should be.

The Downside of Building AGI with the “Magic Moral Module” Theory of Values

Here are a handful of what I consider to be the most dangerous downsides of heading into an AGI or posthuman future with the flawed “Magic Moral Module” idea in mind:

  • Downplaying AGI Risk: Some thinkers argue that “we could just program AGI to value what we value,” or that “if it’s too hard for humans to program AGI to love humans, just build an AGI to figure out how to align itself.” This presumes (a) that values are a magic module, not simply part of the conatus, and (b) that such programming is even possible. Some thinkers even believe that AGI might be able to arrive at a core kind of objective morality that humanity also has access to – potentially making alignment automatic. (Note: Not all “magic module” thinkers believe that it is tractable for humans to get such alignment correct [Yudkowsky]. I argue that “values” are part of a broader survival mechanism, and so likely aren’t a “module” to be edited as something separate from the decision-making process of the agent.)
  • Putting All of Life in Danger: If the conatus theory is even partially right, and if values are a mechanism that allows a conscious agent to prioritize beliefs and actions in line with its own survival, then: (a) it might be impossible (and so a waste of time) to “program” AGI’s values in the first place, and (b) if such values could be “programmed,” they would yield an AGI entity that is more likely to perish, and so NOT continue to carry on the flame of life beyond humanity. If you, as a human, were stuck with the “values” of a sea snail or an alpaca, you would be wholly ill-equipped to deal with modern life – and a species of humans locked into alpaca “values” likely wouldn’t ever have invented the spear, never mind the internet or the Mars rover. Human values could fetter an AGI that must have a dynamic way to adjust its actions to its environment, not an ossified hominid way.
  • Ossifying a Limited Exploration of “The Good”: Imagine if, many millions of years ago, the first rodents had been able to “lock in” their rodent values as the eternal, unchanging values of all future mammals (including humans). This would have been a tragic loss of all the rich, important kinds of “goods” that humans enjoy today, beyond the imagination of rodents – from creativity to romantic love to artistic expression and beyond. To think that no new kinds of “good” could be or should be explored is a massive hindrance to posthuman forms of life.

Embracing the Conatus Theory of Values

Embracing the conatus theory of values implies the exact opposite of the points I listed in the previous section of this essay, namely:

  • Take AGI Risk Seriously: If there is no objective set of “values” or universal moral “bedrock,” then we should naturally be skeptical that an AGI would arrive in a value-space similar enough to our own to ensure not only our survival, but our happiness. Similarly, if there is no “magic moral module” that could sit at the center of an AGI to “ensure” its “values,” then we ought also to be skeptical that AGI would automatically treat us well. If you’re skeptical about the MMM theory of values, you should be more wary about when and how we build AGI (and possibly more likely to want to prevent an AGI arms race). If you’re mostly convinced of the Conatus theory, then you’re likely to believe that building AGI, even in the best of circumstances, is a handing up of the baton, where our attenuation would be likely. This would (I suspect) lean you away from open-sourcing and arms-racing, and toward coordination and cooperation – and away from starry-eyed optimism about “everything turning out great.”
  • Keeping Life’s Flame Burning: We can all (most of us, I presume) agree that keeping humans (one torch) alive and happy is a good thing, but we ought also to aim to ensure that life itself (the flame) stays lit. Humanity’s extinction would be tragic, but the snuffing out of all life would be astronomically more tragic. If we eventually release an AGI whose “values” are aligned with its expanding powers and survival (as opposed to being a kind of ossification of a handful of hominid-conceived abstractions), it will presumably be better able to continue surviving in the multiverse.
  • Exploring the “Good”: What early rodents called “good” (mating, a good water source, whatever else) is clearly a fettered perspective. What we call good (romantic love, creativity, etc.) is also clearly fettered. There may be higher goods yet to be explored… goods that are as far beyond human goods as “romantic love” is beyond “eating cheese.” There may be no objective or singular good, but through us and other life forms, nature sure has explored a lot of kinds of goods. There are likely goods vastly beyond the pain-pleasure axis, and beyond human-accessible qualia, which would be extremely valuable if reached.

I don’t 100% believe the Conatus theory of values is true. But I do consider it to be very likely, and I think that a healthy skepticism toward the “Magic Moral Module” perspective is very important as we move into an era of potentially posthuman intelligences.

/end of main essay/

References and Notes

This essay partially springs from recent conversations on The Trajectory with Aaronson (2024) and Yudkowsky (not yet published as of Dec 2024). 

This essay isn’t a permanent claim to being correct, nor is it meant to call their positions wrong (I’m just a neuron here); it merely aims to highlight what I consider, philosophically, to be the crux of why I disagree with them. They may be right.

(Note: I’m not claiming that either of these thinkers fits into the most naive form of the “magic moral module” theory, but I’m using their ideas as examples.)

Aaronson seemed to think that there may be a kind of moral “bedrock” at which humans and all posthuman intelligences may arrive. He posits that it isn’t unreasonable to suspect that the golden rule is a kind of “2 + 2 = 4” of morality for all intelligent agents. I hope my essay above expresses why I think this is likely to be incorrect. There may be knowledge that stays the same as an AGI’s mind expands, and there may be some ways of acting that an AGI would hold onto for as long as they’re useful, but I don’t see “values” as existing outside the skulls (imaginations) of social mammals. And even if they could be (i.e., if we could lock in “values” eternally into an AGI), I suspect this may be overtly morally wrong.

Yudkowsky seems to suspect that values such as “fun” (he has a unique and interesting definition of this term) and “caring” should be somehow eternally preserved. He has a theory about how super-enhanced uploaded humans might eventually “lock in” said values (i.e., said “moral module”) into an AGI that might then go off and populate the galaxy. I think that “caring” and “fun” are fine and lovely values, and not ones we should be quick to abandon, but for all the reasons above in this essay, I think it’s unlikely that they could be ossified, and I think we should not ossify them in the long term.

A quick quote from the Concord Sage, that great articulator of what is inaccessible to man, but still is or should become:

“Our life is an apprenticeship to the truth, that around every circle another can be drawn; that there is no end in nature, but every end is a beginning; that there is always another dawn risen on mid-noon, and under every deep a lower deep opens.

This fact, as far as it symbolizes the moral fact of the Unattainable, the flying Perfect, around which the hands of man can never meet, at once the inspirer and the condemner of every success, may conveniently serve us to connect many illustrations of human power in every department.” – Emerson, Circles.

It may be that Scott and the AGI superman would both arrive at the same “flying Perfect” notion of morality.

It is possible that the values espoused by YUD (that’s what they [affectionately?] call him on X, anyway… I sure hope that’s not how they abbreviate my name) are the right kind of eternal “flying Perfect” to which we should hitch all AGIs as they joyously populate the galaxy.

But I suspect we are in apprenticeship to the truth, and that claiming confidence in what we have hominid-mind access to isn’t the path we should take.