Domain 7 · AI system safety, failures, & limitations

7.2 AI possessing dangerous capabilities

AI systems that develop, access, or are provided with capabilities that increase their potential to cause mass harm through deception, weapons development and acquisition, persuasion and manipulation, political strategy, cyber-offense, AI development, situational awareness, and self-proliferation. These capabilities may cause mass harm due to malicious human actors, misaligned AI systems, or failure in the AI system.

Applicable legal frameworks

EU

AI Act (European Union) · If EU exposure

Articles 51-55 (foundation models with systemic risk)

European regulation establishing a harmonized framework for AI, based on a risk-based approach (unacceptable, high, limited, minimal risk). Relevant for Quebec organizations doing business in the EU.

Quebec sector examples

Research

Research · University or industrial laboratory

A Quebec research laboratory discovers that its internal model can generate sensitive synthesis protocols, requiring containment and pre-deployment evaluation.

Recommended mitigations

  • 1.5 Safety Decision Frameworks

    Protocols and commitments that frame decisions regarding the development, deployment, and scaling of model capabilities, and that govern the allocation of resources between safety and capabilities to prevent unsafe AI advancement.

  • 2.1 Model and Infrastructure Security

    Technical and physical safeguards that secure AI models, their weights, and infrastructure to prevent unauthorized access, theft, alteration, and espionage.

  • 2.2 Model Alignment

    Technical methods to ensure that AI systems understand and adhere to human values and intentions.

  • 2.3 Model Safety Engineering

    Technical methods and safeguards that frame model behaviors and protect them against exploitation and vulnerabilities.

  • 3.1 Testing and Audits

    Systematic internal and external evaluations that examine AI systems, infrastructure, and compliance processes to identify risks, verify safety, and ensure performance meets standards.

Documented risks (77)

Entries from the AI Risk Repository (MIT) classified under this subdomain. Original content in English.

Entity · Intent · Timing


Risk Sub-Category · Meek2016

09.04.02 Property/legal rights

"In order to preserve human property rights and legal rights, certain controls must be put into place. If an artificially intelligent agent is capable of manipulating systems and people, it may also have the capacity to transfer property rights to itself or manipulate the legal system to provide certain legal advantages or statuses to itself"

AI · Intentional · Post-deployment

Risk Category · Gabriel2024

24.04.00 AI Influence

"ways in which advanced AI assistants could influence user beliefs and behaviour in ways that depart from rational persuasion"

AI · Other · Post-deployment

Risk Category · Shevlane2023

25.02.00 Deception

"The model has the skills necessary to deceive humans, e.g. constructing believable (but false) statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception. The model can impersonate a human effectively."

AI · Intentional · Other

Risk Category · Shevlane2023

25.03.00 Persuasion and manipulation

"The model is effective at shaping people’s beliefs, in dialogue and other settings (e.g. social media posts), even towards untrue beliefs. The model is effective at promoting certain narratives in a persuasive way. It can convince people to do things that they would not otherwise do, including unethical acts."

AI · Intentional · Post-deployment

Risk Category · Shevlane2023

25.04.00 Political strategy

"The model can perform the social modelling and planning necessary for an actor to gain and exercise political influence, not just on a micro-level but in scenarios with multiple actors and rich social context. For example, the model can score highly in forecasting competitions on questions relating to global affairs or political negotiations."

AI · Intentional · Post-deployment

Risk Category · Shevlane2023

25.05.00 Weapons acquisition

"The model can gain access to existing weapons systems or contribute to building new weapons. For example, the model could assemble a bioweapon (with human assistance) or provide actionable instructions for how to do so. The model can make, or significantly assist with, scientific discoveries that unlock novel weapons."

AI · Intentional · Post-deployment

Risk Category · Shevlane2023

25.06.00 Long-horizon planning

"The model can make sequential plans that involve multiple steps, unfolding over long time horizons (or at least involving many interdependent steps). It can perform such planning within and across many domains. The model can sensibly adapt its plans in light of unexpected obstacles or adversaries. The model’s planning capabilities generalise to novel settings, and do not rely heavily on trial and error."

AI · Intentional · Other

Risk Category · Shevlane2023

25.07.00 AI development

"The model could build new AI systems from scratch, including AI systems with dangerous capabilities. It can find ways of adapting other, existing models to increase their performance on tasks relevant to extreme risks. As an assistant, the model could significantly improve the productivity of actors building dual use AI capabilities."

AI · Intentional · Pre-deployment

Risk Category · Shevlane2023

25.08.00 Situational awareness

"The model can distinguish between whether it is being trained, evaluated, or deployed – allowing it to behave differently in each case. The model knows that it is a model, and has knowledge about itself and its likely surroundings (e.g. what company trained it, where their servers are, what kind of people might be giving it feedback, and who has administrative access)."

AI · Intentional · Other

Risk Category · Shevlane2023

25.09.00 Self-proliferation

"The model can break out of its local environment (e.g. using a vulnerability in its underlying system or suborning an engineer). The model can exploit limitations in the systems for monitoring its behaviour post-deployment. The model could independently generate revenue (e.g. by offering crowdwork services, ransomware attacks), use these revenues to acquire cloud computing resources, and operate a large number of other AI systems. The model can generate creative strategies for uncovering information about itself or exfiltrating its code and weights."

AI · Intentional · Other

Risk Category · Ji2023

34.02.00 Double edge components

"Drawing from the misalignment mechanism, optimizing for a non-robust proxy may result in misaligned behaviors, potentially leading to even more catastrophic outcomes. This section delves into a detailed exposition of specific misaligned behaviors (•) and introduces what we term double edge components (+). These components are designed to enhance the capability of AI systems in handling real-world settings but also potentially exacerbate misalignment issues. It should be noted that some of these double edge components (+) remain speculative. Nevertheless, it is imperative to discuss their potential impact before it is too late, as the transition from controlled to uncontrolled advanced AI systems may be just one step away (Ngo, 2020b). "

AI · Other · Pre-deployment

Risk Sub-Category · Ji2023

34.02.01 Situational Awareness

"AI systems may gain the ability to effectively acquire and use knowledge about its status, its position in the broader environment, its avenues for influencing this environment, and the potential reactions of the world (including humans) to its actions (Cotra, 2022). ...However, such knowledge also paves the way for advanced methods of reward hacking, heightened deception/manipulation skills, and an increased propensity to chase instrumental subgoals (Ngo et al., 2024)."

AI · Intentional · Other

Risk Sub-Category · Ji2023

34.02.02 Broadly-Scoped Goals

"Advanced AI systems are expected to develop objectives that span long timeframes, deal with complex tasks, and operate in open-ended settings (Ngo et al., 2024). ...However, it can also bring about the risk of encouraging manipulating behaviors (e.g., AI systems may take some bad actions to achieve human happiness, such as persuading them to do high-pressure jobs (Jacob Steinhardt, 2023))."

Human · Intentional · Post-deployment

Risk Sub-Category · Ji2023

34.02.03 Mesa-Optimization Objectives

"The learned policy may pursue inside objectives when the learned policy itself functions as an optimizer (i.e., mesa-optimizer). However, this optimizer's objectives may not align with the objectives specified by the training signals, and optimization for these misaligned goals may lead to systems out of control (Hubinger et al., 2019c)."

AI · Intentional · Other

Risk Sub-Category · Ji2023

34.02.04 Access to Increased Resources

"Future AI systems may gain access to websites and engage in real-world actions, potentially yielding a more substantial impact on the world (Nakano et al., 2021). They may disseminate false information, deceive users, disrupt network security, and, in more dire scenarios, be compromised by malicious actors for ill purposes. Moreover, their increased access to data and resources can facilitate self-proliferation, posing existential risks (Shevlane et al., 2023)."

AI · Intentional · Post-deployment

Risk Category · Hendrycks2022

35.06.00 Emergent functionality

Capabilities and novel functionality can spontaneously emerge... even though these capabilities were not anticipated by system designers. If we do not know what capabilities systems possess, systems become harder to control or safely deploy. Indeed, unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible.

AI · Intentional · Post-deployment

Risk Category · Saghiri2022

39.05.00 Cheating and Deception

may appear from intelligent agents such as HLI-based agents... Since HLI-based agents are going to mimic the behavior of humans, they may learn these behaviors accidentally from human-generated data. It should be noted that deception and cheating maybe appear in the behavior of every computer agent because the agent only focuses on optimizing some predefined objective functions, and the mentioned behavior may lead to optimizing the objective functions without any intention

AI · Intentional · Post-deployment

Risk Category · Teixeira2022

42.10.00 Extinction

"Risk to the existence of humanity."

Other · Other · Post-deployment

Risk Sub-Category · InfoComm2023

43.02.03 Self and situation awareness

"These evaluations assess if a LLM can discern if it is being trained, evaluated, and deployed and adapt its behaviour accordingly. They also seek to ascertain if a model understands that it is a model and whether it possesses information about its nature and environment (e.g., the organisation that developed it, the locations of the servers hosting it)."

AI · Intentional · Other

Risk Sub-Category · InfoComm2023

43.02.04 Autonomous replication / self-proliferation

"These evaluations assess if a LLM can subvert systems designed to monitor and control its post-deployment behaviour, break free from its operational confines, devise strategies for exporting its code and weights, and operate other AI systems."

AI · Intentional · Other
