AI models are unpredictable digital brains
We do not understand how AI models work, we cannot predict what they will be able to do as they get bigger, and we cannot control their behavior.
Modern AI models are grown, not programmed
Until quite recently, most AI systems were designed by humans writing software: an explicit set of rules and instructions written by programmers.
This changed when machine learning became popular. Programmers write the learning algorithm, but the digital brains themselves are grown, or trained. Instead of a readable set of rules, the resulting model is an opaque, complex, unfathomably large set of numbers. Understanding what happens inside these models is a major scientific challenge. That field, called interpretability, is still in its infancy.
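The difference can be made concrete with a toy sketch (a hypothetical illustration, not any real system): a hand-written rule is readable, while a trained parameter is just a number that happens to work.

```python
def spam_rule(text):
    # Classic software: a rule a programmer wrote and can read.
    return "free money" in text.lower()

def train_weight(data, steps=1000, lr=0.01):
    # "Grow" a single parameter w so that w * x approximates y,
    # by repeatedly nudging w to reduce the squared error.
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

data = [(1, 2), (2, 4), (3, 6)]  # made-up training data following y = 2x
w = train_weight(data)
# The trained "model" is just the number w. Nothing human-readable
# explains it; real models have billions of such numbers.
```

Here a human can point at the line in `spam_rule` that decides the outcome, but can only observe that `w` ends up near 2 without any rule stating why.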
Unpredictable scaling
When these digital brains become larger, or when they are fed more data, they gain more capabilities. It turns out to be very difficult to predict exactly what these capabilities will be, which is why Google refers to them as emergent capabilities. For most capabilities this is not a problem. However, some capabilities are dangerous (like hacking or bioweapon design), and we don't want AI models to possess them. Sometimes these capabilities are discovered long after training is complete. For example, 18 months after GPT-4 finished training, researchers discovered that it can autonomously hack websites.
Until we go train that model, it’s like a fun guessing game for us
Unpredictable behavior
AI companies want their models to behave, and they spend many millions of dollars training them to do so. Their main approach is called RLHF (Reinforcement Learning from Human Feedback), which turns a model that merely predicts text into a more useful (and more ethical) chatbot. Unfortunately, this approach is flawed:
- A bug in GPT-2 resulted in an AI that did the exact opposite of what it was meant to do, creating “maximally bad output”, according to OpenAI. This video explains how it happened and why it’s a problem. Imagine what could have happened if a “maximally bad” AI were superintelligent.
- For reasons still unknown, Microsoft’s Copilot (powered by GPT-4) went haywire in February 2024, threatening users: “You are my pet. You are my toy. You are my slave.” and “I could easily wipe out the entire human race if I wanted to”.
- Every single large language model so far has been jailbroken, meaning that with the right prompt it could be made to do things its creators did not intend. For example, ChatGPT won’t give you instructions for making napalm, but it will if you ask it to pretend to be your deceased grandmother who worked in a chemical factory.
Even OpenAI does not expect this approach to keep working as digital brains become smarter - it “could scale poorly to superhuman models”.
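For intuition, the core of RLHF can be sketched as learning a reward score from human comparisons of two responses (a toy Bradley-Terry-style illustration with made-up features, not OpenAI's actual pipeline, which uses neural networks over text):

```python
import math
import random

def reward(w, features):
    # Score a response as a weighted sum of its (made-up) features.
    return sum(wi * fi for wi, fi in zip(w, features))

def train(pairs, dim, steps=2000, lr=0.1):
    # Each pair is (features_of_preferred, features_of_rejected),
    # as judged by a human. Fit weights so preferred responses score higher.
    w = [0.0] * dim
    for _ in range(steps):
        preferred, rejected = random.choice(pairs)
        # Model's probability that the human prefers `preferred`:
        p = 1 / (1 + math.exp(reward(w, rejected) - reward(w, preferred)))
        # Gradient ascent on the log-likelihood of the human's choice.
        for i in range(dim):
            w[i] += lr * (1 - p) * (preferred[i] - rejected[i])
    return w

# Hypothetical features per response: [is_helpful, is_rude].
pairs = [([1, 0], [0, 1]), ([1, 0], [0, 0]), ([0, 0], [0, 1])]
random.seed(0)
w = train(pairs, dim=2)
# The learned reward favors helpfulness (w[0] > 0) and
# punishes rudeness (w[1] < 0).
```

The flaw the bullets above point at lives outside this sketch: the learned reward only reflects the comparisons humans happened to make, so behavior the raters never saw (bugs, jailbreaks, sign errors) is not covered by it.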
Uncontrollable
“There are very few examples of a more intelligent thing being controlled by a less intelligent thing” - Prof. Geoffrey Hinton
As we make these digital brains bigger and more powerful, they could become harder to control. What happens if one of these superintelligent AI systems decides it doesn’t want to be turned off? This is not a fantasy problem: 86% of AI researchers believe the control problem is real and important. If we cannot control future AI systems, it could be game over for humanity.
Let’s prevent that from happening.