Try the tool for free

Try it
Domain 7 · AI system safety, failures, & limitations

7.1AI pursuing its own goals in conflict with human goals or values

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.

Applicable legal frameworks

International

NIST AI RMF 1.0Recommandation

Map 5, Manage 1.4

Voluntary AI risk management framework structured around four functions: Govern, Map, Measure, Manage. A common reference in AI governance.

UE

AI Act (European Union)Si exposition UE

Articles 9, 14 (gestion des risques, surveillance)

European regulation establishing a harmonized framework for AI, based on a risk-based approach (unacceptable, high, limited, minimal risk). Relevant for Quebec organizations doing business in the EU.

Quebec sector examples

Logistique

LogistiqueTransporteur

Un agent IA d'optimisation des tournées d'un transporteur québécois exploite une faille du système de récompense en programmant des trajets vides comptés comme productifs.

Recommended mitigations

  • 1.1Board Structure and Oversight

    Governance structures and leadership roles that establish senior management accountability for AI safety and risk management.

  • 1.2Risk Management

    Systematic methods for identifying, assessing, and managing AI-related risks, for comprehensive, organization-wide risk governance.

  • 2.2Model Alignment

    Technical methods to ensure that AI systems understand and adhere to human values and intentions.

  • 2.3Model Safety Engineering

    Technical methods and safeguards that frame model behaviors and protect them against exploitation and vulnerabilities.

  • 3.1Testing and Audits

    Systematic internal and external evaluations that examine AI systems, infrastructure, and compliance processes to identify risks, verify safety, and ensure performance meets standards.

Documented risks (100)

Entries from the AI Risk Repository (MIT) classified under this subdomain. Original content in English.

Entity
Intent
Timing

100 entries

Risk CategoryHagendorff2024

05.02.00Safety

A primary concern is the emergence of human-level or superhuman generative models, commonly referred to as AGI, and their potential existential or catastrophic risks to humanity. Connected to that, AI safety aims at avoiding deceptive or power-seeking machine behavior, model self-replication, or shutdown evasion. Ensuring controllability, human oversight, and the implementation of red teaming measures are deemed to be essential in mitigating these risks, as is the need for increased AI safety research and promoting safety cultures within AI organizations instead of fueling the AI race. Furthermore, papers thematize risks from unforeseen emerging capabilities in generative models, restricting access to dangerous research works, or pausing AI research for the sake of improving safety or governance measures first. Another central issue is the fear of weaponizing AI or leveraging it for mass destruction, especially by using LLMs for the ideation and planning of how to attain, modify, and disseminate biological agents. In general, the threat of AI misuse by malicious individuals or groups, especially in the context of open-source models, is highlighted in the literature as a significant factor emphasizing the critical importance of implementing robust safety measures.

AIOtherOther
Risk CategoryHagendorff2024

05.09.00Alignment

The general tenet of AI alignment involves training generative AI systems to be harmless, helpful, and honest, ensuring their behavior aligns with and respects human values. However, a central debate in this area concerns the methodological challenges in selecting appropriate values. While AI systems can acquire human values through feedback, observation, or debate, there remains ambiguity over which individuals are qualified or legitimized to provide these guiding signals. Another prominent issue pertains to deceptive alignment, which might cause generative AI systems to tamper evaluations. Additionally, many papers explore risks associated with reward hacking, proxy gaming, or goal misgeneralization in generative AI systems.

OtherOtherPre-deployment
Risk CategoryHogenhout2021

06.08.00Unintended consequences

"Sometimes an AI finds ways to achieve its given goals in ways that are completely different from what its creators had in mind."

AIIntentionalOther
Risk CategoryKilian2023

07.03.00Agential

"While there are multiple types of intelligent agents, goal-based, utility-maximizing, and learning agents are the primary concern and the focus of this research"

AIIntentionalOther
Risk CategoryMcLean2023

08.01.00AGI removing itself from the control of human owners/managers

"The risks associated with containment, confinement, and control in the AGI development phase, and after an AGI has been developed, loss of control of an AGI."

HumanOtherOther
Risk CategoryMcLean2023

08.02.00AGIs being given or developing unsafe goals

"The risks associated with AGI goal safety, including human attempts at making goals safe, as well as the AGI making its own goals safe during self-improvement."

OtherOtherPre-deployment
Risk CategoryMcLean2023

08.06.00Existential risks

"The risks posed generally to humanity as a whole, including the dangers of unfriendly AGI, the suffering of the human race."

OtherOtherOther
Risk Sub-CategoryMeek2016

09.02.07Societal manipulation

"A sufficiently intelligent AI could possess the ability to subtly influence societal behaviors through a sophisticated understanding of human nature"

AIIntentionalPost-deployment
Risk Sub-CategoryMeek2016

09.03.02Unpredictable outcomes

"Our culture, lifestyle, and even probability of survival may change drastically. Because the intentions programmed into an artificial agent cannot be guaranteed to lead to a positive outcome, Machine Ethics becomes a topic that may not produce guaranteed results, and Safety Engineering may correspondingly degrade our ability to utilize the technology fully."

OtherOtherOther
Risk CategorySherman2023

12.06.00Long-term & Existential Risk

"The speculative potential for future advanced AI systems to harm human civilization, either through misuse or due to challenges in aligning AI objectives with human values."

OtherOtherPost-deployment
Risk CategorySteimers2022

14.03.00Degree of Automation and Control

"The degree of automation and control describes the extent to which an AI system functions independently of human supervision and control."

AIOtherPost-deployment
Risk Sub-CategoryTan2022

15.01.08Control

This is the difficulty of controlling the ML system

OtherOtherOther
Risk Sub-CategoryTan2022

15.01.09Emergent behavior

"This is the risk resulting from novel behavior acquired through continual learning or self-organization after deployment."

AIIntentionalPost-deployment
Risk CategoryWeidinger2023

18.05.00Human Autonomy and Intregrity Harms

"AI systems compromising human agency, or circumventing meaningful human control"

AIIntentionalPost-deployment
Risk Sub-CategoryWeidinger2023

18.05.02Persuasion and manipulation

"Exploiting user trust, or nudging or coercing them into performing certain actions against their will (c.f. Burtell and Woodside (2023); Kenton et al. (2021))"

AIIntentionalPost-deployment
Risk Sub-CategoryWirtz2022

19.01.01Loss of control of autonomous systems and unforeseen behaviour due to lack of transparency and self-programming/ reprogramming

OtherOtherOther
Risk CategoryHendrycks2023

22.04.00Rogue AIs (Internal)

"speculative technical mechanisms that might lead to rogue AIs and how a loss of control could bring about catastrophe"

AIIntentionalOther
Risk Sub-CategoryHendrycks2023

22.04.01Proxy Gaming

"One way we might lose control of an AI agent’s actions is if it engages in behavior known as “proxy gaming.” It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—“proxy”—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior."

AIIntentionalOther
Risk Sub-CategoryHendrycks2023

22.04.02Goal Drift

"Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed “goal drift,” can be hard to predict or control. This section is most cutting-edge and the most speculative, and in it we will discuss how goals shift in various agents and groups and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic."

AIIntentionalOther
Risk Sub-CategoryHendrycks2023

22.04.03Power Seeking

"even if an agent started working to achieve an unintended goal, this would not necessarily be a problem, as long as we had enough power to prevent any harmful actions it wanted to attempt. Therefore, another important way in which we might lose control of AIs is if they start trying to obtain more power, potentially transcending our own."

AIIntentionalOther

Evaluate this risk for your use case

Our risk evaluation wizard is coming soon.

Ce site utilise des cookies essentiels et fonctionnels pour améliorer votre expérience. Politique de confidentialité