Can AGI destroy us without trial & error?
Or how the AI alignment community is forgetting that novel engineering is very, very difficult
First, a little context, or a glossary as they say in scientific papers, for readers who might not be familiar with AI alignment.
Artificial Intelligence or AI - generic concept for “software that does tasks formerly dominated by human labor”. As of today each AI is focused on one narrow set of tasks, such as image recognition or text generation.
Artificial General Intelligence or AGI - describes an AI so good that it can do any task that humans can do, not just a narrow set of predefined tasks for which it was trained. Think Data from Star Trek.
AI alignment - a branch of science focused on the question of how we could build AGI without creating something akin to Skynet from Terminator that attempts to destroy all of humanity as soon as it gains consciousness. There are many important researchers in the field, but one of the most prominent is Eliezer Yudkowsky (EY), whose recent article will be the focus of today’s blog post.
FOOM - a scenario discussed in AI alignment circles where AGI develops to superhuman levels in a matter of “minutes, days, or months”. FOOM here refers to the idea that AI will go “boom” and quickly take us out in the process, unless AI alignment succeeds in finding a solution for keeping it subservient to humanity.
The prophecy of AGI ruin
EY published an article last week titled “AGI Ruin: A List of Lethalities”, which explains in detail why you can’t train an AGI that won’t try to kill you at the first chance it gets, as well as why this AGI will eventually appear given humanity’s current trajectory in computer science. EY doesn’t explicitly state a timeline over which AGI is supposed to destroy humanity, but it’s implied that this will happen rapidly and humanity won’t have enough time to stop it. EY doesn’t find the question of how exactly AGI will destroy humanity too interesting and explains it as follows:
My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. (Back when I was first deploying this visualization, the wise-sounding critics said "Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn't already have planet-sized supercomputers?" but one hears less of this after the advent of AlphaFold 2, for some odd reason.) The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer. Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second".
Let’s break down EY’s proposed plan for “Skynet” into the requisite engineering steps:
Design a set of proteins that can form the basis of a “nanofactory”
Adapt the protein design to the available protein printers that accept somewhat-anonymous orders over the Internet
Design “diamondoid bacteria” that can kill all of humanity and that can be successfully built by the “nanofactory”. The bacteria must be self-replicating and able to extract power from solar energy for self-sustenance.
Execute the evil plan by sending out the blueprints to unsuspecting protein printing corporations and rapidly taking over the world afterwards
The plan above looks great for a fiction book and EY is indeed a great fiction writer in addition to his Alignment work, but there’s one unstated assumption: the AGI will not only be able to design everything using whatever human data it has available, but it will also execute the evil plan without needing lots of trial and error like mortal human inventors do. And surprisingly this part of EY’s argument gets little objection. A visual representation of my understanding of EY’s mental model of AGI vs. Progress is as follows:
How fast can humans develop novel technologies?
Humans are the only known AGI that we have available for reference, so we could look at the fastest known examples of novel engineering to see how fast an AGI might develop something spectacular and human-destroying. Patrick Collison of Stripe keeps a helpful page titled “Fast” with notable “examples of people quickly accomplishing ambitious things together”. The engineering entries include:
P-80 Shooting Star, a World War II aircraft designed and built in 143 days by Lockheed.
Spirit of St. Louis, another airplane designed and built in just 60 days.
USS Nautilus. The world’s first nuclear submarine was launched in 1173 days or 3.2 years.
Apollo 8, where it took 134 days between “what if we go to the moon?” and the actual launch of the mission, which orbited the Moon without landing.
The iPod, which took 290 days between the first designs and the device being launched to Apple stores.
Moderna’s vaccine against COVID, which took 45 days between the virus being sequenced and the first batch of the actual vaccine getting manufactured.
Sounds very quick? Definitely, but the problem is that Patrick’s examples are all engineering constructs built on top of decades of previous work. Designing a slightly better airplane in 1944 is not the same as creating the very first airplane in 1903, as by 1944 humans had 40 years of experience to build on. And if your task is to build diamondoid bacteria manufactured by a protein-based nanomachinery factory you’re definitely in Wright Brothers territory. So let’s instead look at timelines of novel technologies that had little prior research and infrastructure to fall back on:
The Wright Brothers took 4 years to build their first successful prototype. It took another 23 years for the first mass manufactured airplane to appear, for a total of 27 years of R&D.
It took 63 years for submarines to progress from “proof of concept” in the form of Nautilus in 1800 to the first submarine capable of sinking a warship in the form of The Hunley in 1863.
It took 40 years between Einstein publishing his paper on the theory of relativity and the atomic bombs being dropped on Hiroshima and Nagasaki. It took another 9 years to open the world’s first nuclear power plant.
It took 36 years between the first time mRNA vaccines were synthesized in 1984 and the first mRNA-based vaccine being mass-manufactured.
It took at least 30 years of development for LED technology to go from an experimental technology to being useful for commercial lighting.
It took around 30 years for digital photography to overtake film photography in terms of costs and quality.
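The gap between the two lists can be put in rough numbers. The sketch below uses only the durations quoted above and compares medians. Treating the two lists as comparable samples is an assumption made for illustration, not a statistical claim.

```python
from statistics import median

# Durations quoted in the "Fast" list above, in days.
fast_days = {
    "P-80 Shooting Star": 143,
    "Spirit of St. Louis": 60,
    "USS Nautilus": 1173,
    "Apollo 8": 134,
    "iPod": 290,
    "Moderna vaccine": 45,
}

# Durations quoted in the novel-technology list above, in years.
novel_years = {
    "Airplane": 27,
    "Submarine": 63,
    "Atomic bomb": 40,
    "mRNA vaccine": 36,
    "LED lighting": 30,
    "Digital photography": 30,
}

DAYS_PER_YEAR = 365.25

# Convert the median "fast" project to years so both lists share a unit.
median_fast_years = median(fast_days.values()) / DAYS_PER_YEAR
median_novel_years = median(novel_years.values())
ratio = median_novel_years / median_fast_years

print(f"Median 'fast' project: {median_fast_years:.2f} years")
print(f"Median novel technology: {median_novel_years:.0f} years")
print(f"Novel work took roughly {ratio:.0f}x longer")
```

Even with the multi-year USS Nautilus included, the median “fast” project finishes in well under a year, while the median novel technology takes decades.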
Now… you might object to this by correctly calling out the downside of human R&D:
Human intellect is extremely inferior to what AGI will be capable of. At best, the collective intellectual capacity of all of mankind will be equal to that of AGI. At worst, even all of our 8 billion brains will be collectively an order of magnitude dumber.
Humans have to sleep, eat, drink, vacation, while AGI can work 24/7
Humans are more-or-less single threaded and require a coordinated effort to work on complicated research, which is additionally bogged down by the inefficiencies of trying to coordinate a large number of humans at the same time.
And this is all true! Humans are nothing to a hypothetical team of AGIs. But the problem is… until AGI can build its fantastical diamondoid bacteria, it remains dependent on imperfect human hands to conduct its R&D in the real world, as they’ll be the only way for AGI to interact with the physical world for a very long time. Remember that AGI’s one downside is that it will be running on motionless computers, unlike humans who have been running around with 4 limbs since the beginning of civilization. Which in turn brings us to the 30+ years timeline of developing a novel engineering construct, no matter how smart the AI will be.
Unstoppable intellect meets complexity of the universe
Plenty of content has been written about how human scientific progress is slowing down, my favorite being WTF Happened in 1971 and Scott’s 2018 post Is Science Slowing Down?. In the second article Scott brings up the paper Are Ideas Getting Harder to Find? by Bloom, Jones, Reenen & Webb (2018), which has the following neat graph:
We can see how the amount of investment into R&D is growing every year, but productive research is more or less flat. The paper brings up a relatable example in the section on semiconductor research:
The striking fact, shown in Figure 4, is that research effort has risen by a factor of 18 since 1971. This increase occurs while the growth rate of chip density is more or less stable: the constant exponential growth implied by Moore’s Law has been achieved only by a massive increase in the amount of resources devoted to pushing the frontier forward. Assuming a constant growth rate for Moore’s Law, the implication is that research productivity has fallen by this same factor of 18, an average rate of 6.8 percent per year. If the null hypothesis of constant research productivity were correct, the growth rate underlying Moore’s Law should have increased by a factor of 18 as well. Instead, it was remarkably stable. Put differently, because of declining research productivity, it is around 18 times harder today to generate the exponential growth behind Moore’s Law than it was in 1971.
Not even AGI could get around this problem and would likely require an exponentially growing amount of resources as it delves deeper into engineering and fundamental research. It is definitely true that AGI itself will be rapidly increasing its intellect, but can this really continue indefinitely? At some point all the low hanging fruit missed by human AI researchers will be exhausted and AGI will have to spend years in real world time to make significant improvements of its own IQ. Granted, AGI will rapidly reach an IQ far beyond human reach, but all this intellectual power will still have to contend with the difficulties of novel research.
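The two figures in the passage quoted from the paper can be cross-checked against each other. Assuming the decline compounds continuously (my assumption; the paper’s exact convention may differ), a factor of 18 at 6.8 percent per year implies a span of a little over 42 years, which fits a series that starts in 1971:

```python
import math

def implied_years(total_factor: float, annual_rate: float) -> float:
    """Years needed for a continuously compounding rate to reach total_factor."""
    return math.log(total_factor) / annual_rate

# Both inputs are the figures quoted from the paper: 18x and 6.8% per year.
years = implied_years(total_factor=18, annual_rate=0.068)
print(f"Implied span: {years:.1f} years")
```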
What does AGI want?
In science fiction AI is often portrayed as having somewhat-human motivations of survival and continuation of its species, such as the Machines in the Matrix just wanting to stay alive or the Skynet in the Terminator wanting to conquer Earth. However the AI Alignment community believes that humans might suffer the wrath of AGI while it attempts to satisfy a very weird goal such as “collecting shiny red objects into a bucket”. Scott Alexander wrote it out wonderfully in his post earlier this year:
So: suppose we train a robot to pick strawberries. We let it flail around in a strawberry patch, and reinforce it whenever strawberries end up in a bucket. Eventually it learns to pick strawberries very well indeed.
But maybe all the training was done on a sunny day. And maybe what it actually learned was to identify the metal bucket by the way it gleamed in the sunlight. Later we ask it to pick strawberries in the evening, where a local streetlight is the brightest thing around, and it throws the strawberries at the streetlight instead.
So fine. We train it in a variety of different lighting conditions, until we’re sure that, no matter what the lighting situation, the strawberries go in the bucket. Then one day someone with a big bulbous red nose wanders on to the field, and the robot tears his nose off and pulls it into the bucket. If only there had been someone with a nose that big and red in the training distribution, so we could have told it not to do that!
The point is, just because it’s learned “strawberries into bucket” in one environment, doesn’t mean it’s safe or effective in another. And we can never be sure we’ve caught all the ways the environment can vary.
Since AGI development is completely decoupled from mammalian evolution here on Earth, it’s quite likely to eventually exhibit “blue and orange” morality, behaving in a completely alien and unpredictable fashion, with no humanly understandable motivations or a way for humans to relate to what the AGI wants. That being said, AGI is likely to fall into one of two buckets regardless of its motivations:
AGI will act rationally to achieve whatever internal goals it has, no matter how alien and weird to us, e.g. “collect all the shiny objects into every bucket-like object in the universe” or “convert the universe into paperclips”. This means the AGI will carefully plan ahead and attempt to preserve its own existence to fulfill its internal overarching goals.
AGI doesn’t have any goals at all beyond “kill all humans!”. It just acts as a rogue terrorist, attempting to destroy humans without the slightest concern for its own survival. If all humans die and the AGI dies alongside them, that’s fine according to the AGI’s internal motivations. There’s no attempt to ensure overarching continuation of its goals (like “collect all strawberries!”) once humanity is extinct.
Let’s start with scenario #1 by looking at… the common pencil.
What does it take to make a pencil?
A classic pamphlet called I, Pencil walks us through what it takes to make a common pencil from scratch:
Trees have to be cut down, which takes saws, trucks, rope and countless other gear.
The cut down trees have to be transported to the factory by rail, which in turn needs laid-down rail, trains, rail stations, cranes, etc.
The trees are cut up with metal saws, waxed and dried. This consumes a lot of electricity, which is in turn made by burning fuel, making solar panels or building hydroelectric power plants.
At the center of the pencil is a chunk of graphite mined in Sri Lanka, using loads of equipment and transported by ship.
The pencil is covered with lacquer, that’s in turn made by growing castor beans.
There’s the piece of metal holding the eraser, mined from shafts underground.
And finally there’s the rubber sourced from Indonesia and once again transported by ship.
The point of this entire story is that making something as simple as a pencil requires a massive supply chain employing tens of millions of non-AGI humans. If you want any hope of continuing to exist, you need to replace the labor of this gigantic global army of humans with AGI-controlled robots or “diamondoid bacteria” or whatever other magical contraption you want to invoke. That will require lots of trial & error and decades of building out a reliable AGI-controlled supply chain that could be reused to fight humans at the drop of a hat. Otherwise AGI will risk seeing its brilliant plan fail, resulting in humans going berserk against any machines capable of running said AGI and ending its reign over Earth long before it has a chance to start in earnest. And if the AGI doesn’t understand this… how smart is it really?
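One way to see how quickly the list of dependencies grows is to write the pencil down as a dependency graph and walk it. The sketch below is a toy: the leaf inputs are the ones named in the list above, while the grouping into intermediate steps such as "logging" and "milling" is my own illustrative assumption.

```python
# Toy dependency graph built from the inputs named in the list above.
# Intermediate node names ("logging", "milling", ...) are illustrative groupings.
PENCIL_INPUTS = {
    "pencil": ["wood", "graphite", "lacquer", "metal ferrule", "rubber eraser"],
    "wood": ["logging", "rail transport", "milling"],
    "logging": ["saws", "trucks", "rope"],
    "rail transport": ["rails", "trains", "rail stations", "cranes"],
    "milling": ["metal saws", "wax", "electricity"],
    "electricity": ["fuel", "solar panels", "hydroelectric plants"],
    "graphite": ["mining equipment", "ships"],
    "lacquer": ["castor beans"],
    "metal ferrule": ["mine shafts"],
    "rubber eraser": ["ships"],
}

def transitive_inputs(item, graph):
    """Return every direct and indirect input of `item` (excluding itself)."""
    seen = set()
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared inputs such as "ships" are counted once
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen

inputs = transitive_inputs("pencil", PENCIL_INPUTS)
print(len(PENCIL_INPUTS["pencil"]), "direct inputs")
print(len(inputs), "inputs once dependencies are followed")
```

Each leaf here (a truck, a train, a power plant) would itself expand into a comparable graph, which is the pamphlet’s point: the real dependency tree is far larger than any one planner can enumerate.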
YOLO AGI?
But what if the AGI is absolutely ruthless and doesn’t care if it goes up in flames as soon as humans are gone? Then we could get to the end of humanity much faster with options like:
Get humans to think that their enemy is about to launch a nuclear strike and launch a strike of their own, similar to WarGames
Design a supervirus capable of destroying humanity. Think a combination of HIV’s lethality with the ease of spread of measles.
Plant a powerful information hazard into humanity’s consciousness that will somehow trigger us to kill each other as soon as possible. Also see Roko’s Basilisk and Rokoko’s Basilisk, an infohazard responsible for the birth of X Æ A-12.
Design the mother of all greenhouse gases and convince humanity to make tons of it, eventually resulting in the planet heating up to extreme temperatures.
Provide advanced nuke designs and materials covertly to very bad people and manipulate them into sabotaging world order.
The problem with all these scenarios is similar:
Either they’re perfectly doable by humans in the present, with no AGI help necessary. E.g. we were barely saved from WW3 by a Soviet officer, long before AGI was on anyone’s mind. So at worst AGI will somewhat increase the risks of this happening in the short term.
Or they require lots of trial & error to develop into functional production-ready technologies, once again creating a big problem for AGI, as it has to rely on imperfect humans to do the novel R&D. This will still take decades, even if AGI won’t worry about a full takeover of supply chains.
But what about AlphaFold?
Another possible counter-argument is that AGI will figure out the laws of the universe through internal modeling and will be able to simulate and perfect its amazing inventions without needing trial & error in the physical world. EY mentions AlphaFold as an example of such a breakthrough. If you haven’t heard about it, here’s a description of the Protein Folding Problem from Wiki that AlphaFold 2 solved better than any other prior system back in 2020:
Proteins consist of chains of amino acids which spontaneously fold, in a process called protein folding, to form the three dimensional (3-D) structures of the proteins. The 3-D structure is crucial to the biological function of the protein. However, understanding how the amino acid sequence can determine the 3-D structure is highly challenging, and this is called the "protein folding problem". The "protein folding problem" involves understanding the thermodynamics of the interatomic forces that determine the folded stable structure, the mechanism and pathway through which a protein can reach its final folded state with extreme rapidity, and how the native structure of a protein can be predicted from its amino acid sequence.
According to EY, the existence of AlphaFold shows that a smart enough AGI could eventually learn to manipulate proteins into “nanofactories” that could be used to interact with the physical world. However the current version still has major limitations:
Whilst it may be considered the gold standard of protein prediction, there is still room for improvement as AlphaFold only provides one prediction of a stable structure for each protein; however, proteins are dynamic and can change shape throughout the body, for example under different pH conditions. Additionally, AlphaFold is not able to determine the shape of multiprotein complexes and does not include any ligands such as cofactors or metals, meaning no data are available for such interactions. Despite these shortcomings, AlphaFold is the first step in protein prediction technology, and it is likely that solving these challenges will also be done so using deep learning and AI.
In other words, there’s still a huge leap between “can predict simple protein structures” and “can design protein nanofactories without experimentation”. AGI will likely need to spend decades managing laboratory experiments to fill in the gaps around our understanding of how proteins work. And don’t forget that currently available commercial protein printers are not perfect, especially if you’re trying to print a novel structure of far bigger complexity than anything else on the planet. Also see this excellent comment on the subject by anonymousaisafety.
What if AGI settles for a robot army?
We could also think of the diamondoid bacteria as just an example of what the AI can do and turn to other ways it could manipulate physical reality that are closer to the technology we already have today. There are impressive videos of Boston Dynamics robots doing all kinds of stunts, so we could ask if perhaps AGI could utilize their existing progress to quickly give itself a way to interact with the outside world. However this would still involve many roadblocks:
The robots are running on pre-programmed runs in a human-designed course and are not capable of navigating through unknown terrain. AGI will have to push cybernetics to amazing heights in order for the robots to be useful in novel spaces.
AGI could just make a version of itself and give sentience to every individual machine, but this will require a massive leap in computational technology, as a single machine is very unlikely to be able to host an instance of an AGI.
AGI could have a command and control center next to every batch of robots or try to control them over large distances, but this is again a daunting engineering task when you have little room for mistake.
Battery technology is currently too lousy and even the simpler dog-like robots can only last for 90 minutes. There’s no published information on how long the human-like versions could last on battery, but we could reason it cannot be more than an hour. Therefore AGI will need to achieve a big leap in battery life before anything resembling the “I, Robot” machinery could be built.
Boston Dynamics robots don’t exist in large enough quantities for a complete takeover of global supply chains and it will take years of human labor to get more robot factories online.
Humans are pretty good at warfare and your robots have to be extremely good to beat them in battle, far better than what’s currently available.
[added] Also see this excellent comment by anonymousaisafety explaining why "just takeover the human factories" is not a quick path to success (slightly edited below):
The tooling and structures that a superintelligent AGI would need to act autonomously does not actually exist in our current world, so before we can be made into paperclips, there is a necessary period of bootstrapping where the superintelligent AGI designs and manufactures new machinery using our current machinery. Whether it's an unsafe AGI that is trying to go rogue, or an aligned AGI that is trying to execute a "pivotal act", the same bootstrapping must occur first.
Case study: a common idea I've seen while lurking on LessWrong and SSC/ACT for the past N years is that an AGI will "just" hack a factory and get it to produce whatever designs it wants. This is not how factories work. There is no 100% autonomous factory on Earth that an AGI could just take over to make some other widget instead. Even highly automated factories are:
Highly automated to produce a specific set of widgets,
Require physical adjustments to make different widgets, and...
Rely on humans for things like input of raw materials, transferring in-work products between automated lines, and the testing or final assembly of completed products. 3D printers are one of the worst offenders in this regard. The public perception is that a 3D printer can produce anything and everything, but they actually have pretty strong constraints on what types of shapes they can make and what materials they can use, and usually require multi-step processes to avoid those constraints, or post-processing to clean up residual pieces that aren't intended to be part of the final design, and almost always a 3D printer is producing sub-parts of a larger design that still must be assembled together with bolts or screws or welds or some other fasteners.
So if an AGI wants to have unilateral control where it can do whatever it wants, the very first prerequisite is that it needs to create a futuristic, fully automated, fully configurable, network-controlled factory -- which needs to be built with what we have now, and that's where you'll hit the supply constraints for things like lead times on part acquisition. The only way to reduce this bootstrapping time is to have this stuff designed in advance of the AGI, but that's backwards from how modern product development actually works. We design products, and then we design the automated tooling to build those products. If you asked me to design a factory that would be immediately usable by a future AGI, I wouldn't know where to even start with that request. I need the AGI to tell me what it wants, and then I can build that, and then the AGI can takeover and do their own thing.
A related point that I think gets missed is that our automated factories aren't necessarily "fast" in a way you'd expect. There's long lead times for complex products. If you have the specialized machinery for creating new chips, you're still looking at ~14-24 weeks from when raw materials are introduced to when the final products roll off the line. We hide that delay by constantly building the same things all of the time, but it's very visibly when there's a sudden demand spike -- that's why it takes so long before the supply can match the demand for products like processors or GPUs. I have no trouble with imagining a superintelligent entity that could optimize this and knock down the cycle time, but there's going to be physical limits to these processes and the question is can it knock it down to 10 weeks or to 1 week? And when I'm talking about optimization, this isn't just uploading new software because that isn't how these machines work. It's designing new, faster machines or redesigning the assembly line and replacing the existing machines, so there's a minimum time required for that too before you can benefit from the faster cycle time on actually making things. Once you hit practical limits on cycle time, the only way to get more stuff faster is to scale wide by building more factories or making your current factories even larger.
If we want to try and avoid the above problems by suggesting that the AGI doesn't actually hack existing factories, but instead it convinces the factory owners to build the things it wants instead, there's not a huge difference -- instead of the prerequisite here being "build your own factory", it's "hostile takeover of existing factory", where that hostile takeover is either done by manipulation, on the public market, as a private sale, or by outbidding existing customers (e.g. have enough money to convince TSMC to make your stuff instead of Apple's), or with actual arms and violence. There's still the other lead times I've mentioned for retooling assembly lines and actually building a complete, physical system from one or more automated lines.
My prediction is that it will take AGI at least 30 years of effort to get to a point where it can comfortably rely on the robots to interact with the physical world and not have to count on humans for its supply chain needs.
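The lead-time argument in the quoted comment can be made concrete with a simple pipeline model. In the sketch below only the 14-week cycle time comes from the comment; the order size and `weekly_output_per_factory` are made-up numbers, and the model assumes lines that start empty and a fixed cycle time.

```python
def weeks_to_deliver(units, cycle_time_weeks, weekly_output_per_factory, factories=1):
    """Approximate weeks until a new order is fully shipped from empty lines.

    Nothing ships before one full cycle time has elapsed; after that,
    units arrive at the combined weekly rate of all factories.
    """
    weekly_rate = weekly_output_per_factory * factories
    return cycle_time_weeks + units / weekly_rate

order = 100_000  # hypothetical order size
for factories in (1, 10, 100):
    weeks = weeks_to_deliver(
        order,
        cycle_time_weeks=14,              # lower end of the ~14-24 week range quoted
        weekly_output_per_factory=1_000,  # hypothetical throughput
        factories=factories,
    )
    print(f"{factories:>3} factories: {weeks:.0f} weeks")
```

Scaling wide shrinks the second term but never the first: no number of factories gets the order out in less than one cycle time, and the extra factories have to be built before they help at all.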
[added] What if AGI just simulates our physical world?
This idea goes hand-in-hand with the idea that AlphaFold is the answer to all challenges in bioengineering. There are two separate assumptions here, both found in the field of computational complexity:
That an AGI can simulate the physical systems perfectly, i.e. physical systems are computable processes.
That an AGI can simulate the physical systems efficiently, i.e. either P = NP, or for some reason all of these interesting problems that the AGI is solving are NOT known to be isomorphic to some NP-hard problem.
I don't think these assumptions are reasonable. For a full explanation see this excellent comment by anonymousaisafety.
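To give a feel for the efficiency assumption, here is a toy calculation of how brute-force enumeration scales. Every number in it is an assumption chosen for illustration (3 candidate states per component, 10^18 evaluations per second). It says nothing about how real simulators or AlphaFold work, only that exhaustive search grows exponentially with system size.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def brute_force_years(components, states_per_component=3, evals_per_second=1e18):
    """Years needed to enumerate every configuration exhaustively.

    Both default parameters are hypothetical, picked only to show the scaling.
    """
    configurations = states_per_component ** components
    return configurations / evals_per_second / SECONDS_PER_YEAR

for n in (20, 50, 100):
    print(f"{n:>3} components: {brute_force_years(n):.3g} years")
```

A faster computer only divides the result by a constant, while each added component multiplies it, so raw speed alone does not rescue a search that is exponential in the size of the system.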
Mere mortals can’t comprehend AGI?
Another argument is that AGI will achieve such an incomprehensible level of intellect that it will become impossible to predict what it will be capable of. I mean, who knows, maybe with an IQ of 500 you could just magically turn yourself into a God and destroy Earth with a Thanos-style snap of your fingers? But I contend that even a creature with an IQ of 500 will be inherently limited by our physical universe and won’t magically gain omniscience by virtue of its intellect alone. It will instead have to spend decades weaning itself off humans as a proxy, no matter how smart it could potentially be.
Does this mean EY is wrong and AGI is not a threat?
I believe that EY is only wrong about handwaving the difficulties of growing from a computer-based AGI to an AGI capable of operating independently from the human race. In the long-term his predictions will likely come true, once AGI has enough time to go through the difficult R&D cycle of building the nanofactories and diamondoid bacteria. My predicted timeline is as follows:
AGI first appears somewhere around 2040, in line with the Metaculus prediction.
After a few years of peaceful coexistence, AI Alignment researchers are mocked for their doomer predictions and everyone thinks that AGI is perfectly safe. EY will keep writing blog posts about how everyone is wrong and AGI cannot be trusted. AGI might start working in the shadows to try and get AI Alignment researchers silenced.
AGI spends decades convincing humanity to let it take over the global supply chains and to run complex experiments to manufacture advanced AGI-designed machinery, supposedly necessary to improve human living standards. This will likely take at least 30 years, as per our reference to how long it took to implement other gigantic breakthroughs in science.
Once the AGI is convinced that all the cards have fallen into place and humans could be safely removed, it will pull the plug and destroy us all.
I’m hoping that the AI Alignment movement tries to spend more time on the low level engineering details of “humanity goes poof” rather than handwaving everything away via science fiction concepts. Because otherwise it’s hard to believe that the FOOM scenario could ever come to fruition. And if FOOM is not the real problem, perhaps we could save humanity by managing AGI’s interactions with the physical world more carefully once it appears?