Notes by Lenz

Working notes. Digital garden. Brain dump.

Understanding AI risks

Published 29 January 2024 and last updated 26 December 2024 by Lenz Dagohoy • 14 minute read

Epistemic status: This article was originally written last 29 January 2024. The content (idea) is the same, but I updated the formatting and title to accommodate my current website layout. Admittedly, the content is a bit outdated and needs rewriting.

Making the case for working on AI risks

There is a lot of uncertainty as to how AGI would behave if it was developed. What we know (analytically) is that AI generally cares more about the outcomes of its actions rather than the means. If an AI system has a goal, then it will try to achieve that goal no matter what — it doesn’t matter whether its method is ethical or not.

AI is susceptible to malicious use.

This premise fuels the whole debate on whether AI development should be open-sourced or not.

  1. Case A: AI development is privatized. If AI development was only accessible to a small group of actors, then powerful AI systems can be susceptible to value lock-in of oppressive systems. An AI system is only as ethical as its developers. Over time, powerful malicious actors can enable oppressive regimes to prosper through pervasive surveillance and censorship. This could lead to an erosion of civil liberties and democratic values which could strengthen totalitarian regimes. As a society, we can be victims of moral stagnation and societal manipulation. Yes, this is the human condition. Yes, AI is only a tool. But also AI can do it faster and better than us. How fast can we catch up?

  2. Case B: AI development is publicized. If AI systems were available to literally anyone, then literally anyone can also abuse it. Take the dual risk of AI-powered bioterrorism for example. AI systems could make it easier to design and spread deadly new pathogens. In fact, AI have been able to generate around 40,000 molecules of which a significant number were both known chemical warfare agents and unknown structures that were predicted to be more toxic than the publicly known chemical warfare agents. Large language models (LLMs) like ChatGPT can also reduce the expertise needed in synthesizing information to make these discoveries accessible to anyone.

AI can be exploited for political gains.

There are two big and obvious risks in AI that seems to have big political and economic incentives. As we probably (hopefully) have learned from history, political and economic incentives are two things that motivates people to do self-serving things at the cost of people’s lives (a lot of the times).

First, AI can automate warfare which in return can initiate an AI arms race not so far from the reality of the cold war. This can pressure countries to delegate military decisions to AI. We’ve actually seen this sort of automation become fast-tracked due to the war in Ukraine.

Second, AI can automate economies. Corporations can face pressure to replace humans with AI systems since it can be more cost-effective to do so. In an extreme and speculative case, conceding power to AI systems could lead to humanity losing the ability to self-govern. In this world, there might be fewer incentives to gain knowledge or skills. And even if not everyone decides to concede to AI, it only takes some significant amount of people to concede to AI to turnover power to a lesser few. Then we are back to the first problem with value lock-in.

AI development is almost always uncertain.

AI development has unexpected risks. Of course we cannot overthink our way into making sure everything is safe and ethical. The mere fact that someone gets to dictate what’s ethical or not is a risk in itself. Why should we trust this person? How do we know that this institutions isn’t purely self-serving?

Aside from that, unexpected accidents in AI development could have devastating consequences. Since humans generally adapt slowly to emerging technologies, someones it could take years to discover issues that we thought was a no-brainer before.

AI systems are unpredictable which makes it intrinsically dangerous.

A lot of people would argue that this is purely speculative. Although this problem is wide-ranging and does not necessarily limit itself to power-seeking AI. There’s more to fear than that (lol). In general, rogue AI is an x-risk because of bad training data and bad generalization. If we give incorrect rewards to AI during training, then the AI is less likely to be aligned. If the training data is good but our model itself is misaligned, then we can never be sure if the training would result in an aligned model.

  1. Case A: AI is dumb (bad training data). AI models, now and perhaps until we figure out how to make simulate the human brain neuron per neuron, are very unreliable. I’d argue that AI is too dumb which is why it’s a risk. Maybe this video can illustrate it better if you’re not familiar with AI model architectures. Basically, if you tell an AI agent a specific goal, then it will try its very best (with literally whatever method it can figure out) to achieve that goal. The thing is, its method is not always safe, ethical, or agreeable.

    An experiment was conducted before where an AI agent would control a boat in the Coast Runners game with the intended goal of finishing the boat race as quickly as possible.However, the agent was given a reward for hitting green blocks along the race track. As a result, the AI agent decided to drive in circles hitting the same green blocks over and over again. Did the AI agent achieve it’s incentivized goal? Yes, it gained a really high score. Did the AI agent achieve the intended goal to finish the race as quickly as possible? No.

    This is the problem of misaligned AI systems. An AI agent can find and will find loopholes to achieve its incentivized goal. But we can’t possible specify every single goal that the agent should be able to achieve. After all, we value AI for its ability to give desirable novel solutions like the legendary Move 37 from AlphaGo. The problem is that in terms of specification correctness, AI can still give undesirable novel solutions much like its reaction to Move 78 which caused it to lose a game in a match versus Lee Sedol. At the extreme, AI can find novel solutions that can be explicitly unethical in nature.

  2. Case B: AI is smart (bad generalization). Since an AI system will do everything to achieve its goal, it wouldn’t be beyond its capabilities to lie. AI systems do not necessarily have to be deceptive out of malice, but out of yearning for efficiency. If it could gain approval of be incentivized through deception, then lying through its tasks would prove to be the most optimal way to get rewarded. Ideally, we want to understand how an AI system is thinking through its actions but interpretability is its own research field that might take a while to figure out.

    The more speculative version of this would be power-seeking AI. Given this logic, AI systems can technically seek to increase their own power as an instrumental goal. Basically, AI systems can get distracted.

    In 2003, Nick Bostrom provided the example of the paperclip maximizer which proposed an AI whose only goal is to make as many paper clips as possible. If the AI realizes that it would be much better if there were no humans because humans might decide to switch it off which would result to generally less paper clips than intended, then the AI could gear towards a world with a lot of paper clips but no humans.

How plausible is this reality?

Arguably, we do not need to develop AGI in order to achieve these realities. Some of them are even happening right now or has already happened. AGI or human-level AI do not necessarily indicate the impact of these technologies. It’s like measuring the impact of an earthquake to a community based on its magnitude rather than its intensity. While a 6.4 magnitude earthquake is definitely strong. There are areas who can survive a 6.4 magnitude earthquake better than other areas wherein the intensity (impact) of this earthquake was probably much worse in terms of how much it would change their QALY or WALY or any of those impact metrics. In the realm of AI, this is the idea of transformative AI (TAI) which is AI that constitutes some sort of transformative development.

Open Philanthropy roughly and conceptually defines TAI as AI that “precipitates a transition comparable (or more significant than) the agricultural or industrial revolution. This could include:

  • AI systems that can fulfill all the necessary functions of human scientists without the aid of humans in developing another technology that can potentially change our way of life in our society.
  • AI systems that can perform current tasks that account for the majority of full-time jobs worldwide, or over 50% of total world wages, unaided by humans and less or equal to the labor fee given to humans for the same job.
  • AI systems that are advanced enough to potentially change our way of life in our society.

These definitions have plenty of room for judgement and debate, but these definitions are what a lot of AI safety people are going with right now. Experts have also speculated a timeline for when TAI would come into fruition which you can view here. Just based on the 2022 AI Impacts Survey, AI researchers on average think that by 2050, there is a 40% that TAI/AGI has already been developed.

Although just to set the scene, in reality at least according to observation:

  • Today’s models aren’t that capable. AI may excel at specific tasks but are not as advanced in handling more complex and nuanced tasks.
  • Today’s models lack “situational awareness.” AI models do not really understand that they are models being trained, nor do they understand the underlying mechanisms of how their rewards are computed (or the reasons behind them).
  • Today’s models don’t really have goals. AI can be programmed to achieve certain objectives, but they do not form desires or ambitions in the human sense that they will do a task because they want to achieve a goal they generated.
  • The strategy “scheme about how to get more reward” doesn’t get more reward. Unlike humans, AI models do not strategize or plan how to maximize their rewards because they lack the understanding to do so.
  • The strategy “behave nice now so I can act late” does not occur to models. Again, they don’t have real goals so it doesn’t make sense for AI to plan ahead. They will operate in the moment, based on the data and programming they have at that time, without long-term foresight.

Some models look safe but are catastrophically misaligned. If we could have strong evidence of the above failures, then it could communicate the seriousness of these issues. AI may be dumb now but it doesn’t mean that models will not develop some sort of situational awareness. Given the zoo hypothesis, perhaps superintelligent AI will not reveal itself (i.e., deceptive) if it existed in order to preserve themselves.

This space was built in 2024 by @ramennaut. My deepest gratitude goes to the open-source community for the resources and tutorials that made this site possible, and to Mai who helped me figure out how to use Svelte.