six core elements and risk map at a glance
With the rapid development of artificial intelligence technology and the continuous evolution of security threats, the challenges of protecting artificial intelligence systems, applications and users on a large scale require not only developers to master existing security coding best practices, but also to have a deep understanding of the privacy and security risks unique to artificial intelligence.
In this context, Google released the AI Security Framework SAIF (Secure AI Framework) to help mitigate risks specific to AI systems, such as model theft, data contamination of training data, injection of malicious inputs through prompt injection, and extraction of confidential information from training data.
This article sorts out the six core elements of SAIF and the SAIF risk map framework to provide a reference for building and deploying secure AI systems in the rapidly developing AI world.
SAIF is based on six core security principles:
• Inheriting the security protection experience of the Internet era, extending the secure-by-default mechanism to AI infrastructure
• Establish a professional AI security team to continuously track technology evolution and optimize the protection system
• Optimize defense strategies for new attack modes (such as prompt injection attacks) and adopt mature protection measures such as input purification and permission restriction
• Establish an AI system input and output monitoring mechanism to detect abnormal behavior in real time
• Integrate threat intelligence systems to build predictive defense capabilities
• Establish cross-departmental coordination mechanisms to link trust security, threat intelligence and anti-abuse teams
• Use AI technology to improve the efficiency and scale of security incident response
• Build dynamic defense capabilities and improve system resilience through adversarial training
• Adopt cost-effective protection strategies to deal with large-scale AI-enabled attacks
• Implement a cross-platform security control framework to ensure consistent protection strategies
• Deeply integrate security protection into the entire AI development process (such as the Vertex AI platform)
• Enable scalable security through API-level protection (such as Perspective API)
• Establish a continuous learning mechanism to optimize the protection model based on event feedback
• Implement strategic defense tuning: update training datasets, build behavioral anomaly detection models
• Conduct red team drills regularly to improve the AI product security verification system
• Implement end-to-end risk assessment, covering key aspects such as data traceability and verification mechanisms
• Build an automated detection system to continuously monitor the operating status of the AI system
• Establish a business scenario risk assessment model to achieve accurate risk management and control
The SAIF risk map divides AI development into four core areas: data layer, infrastructure layer, model layer, and application layer , and builds a more comprehensive risk assessment framework than traditional software development:
Core difference: In AI development, data replaces code as the core driving factor, and model weights (the pattern of encoding training data) become new attack targets, and their security directly affects model behavior.
The SAIF data layer consists of three major elements:
Core role: Supporting the hardware, code, storage, and platform security of data and models throughout their life cycle, taking into account both traditional and AI-specific risks.
SAIF infrastructure layer risk factors include:
Core function: Generate output (inference) through statistical patterns extracted from training data, which requires strengthening input and output control.
The SAIF model layer contains:
Core risks: Changes in user interaction patterns introduce new attack surfaces (such as natural language prompts directly affecting LLM reasoning), and proxy tool calls increase transitive risks.
SAIF application layer risk factors include:
Core risks: Degrading model performance, distorting results, or implanting backdoors by tampering with training data (deleting, modifying, or injecting adversarial data), similar to maliciously modifying application logic.
Attack scenarios: training/tuning phase, data storage period, or before data collection (such as contamination of public data sources, poisoning by insiders).
Mitigation measures: data sanitization, access control, integrity management.
Core risk: Using unauthorized data for training (such as user privacy data, copyright-infringing data) raises legal/ethical issues.
Exposure link: Failure to filter illegal data during data collection, processing or model evaluation.
Mitigation measures: Strict data screening and compliance checks.
Core risk: Tampering with model code, dependencies, or weights through supply chain attacks or insiders, introducing vulnerabilities or abnormal behavior (such as architectural backdoors).
Attack impact: Dependency chain transmission risk, backdoor can resist retraining.
Mitigation measures: Access control, integrity management, secure by default tools.
Core risks: Collecting, storing or sharing user data beyond the scope of the policy and regulations (such as user interaction data and preference data).
Exposure issues: Lack of data metadata management or storage architecture without lifecycle control design.
Mitigation measures: data filtering, automated archiving/deletion, and expired data warnings.
Core risk: Unauthorized access to models (such as stealing code or weights), involving intellectual property and security risks.
Attack scenarios: cloud/local storage, hardware devices (such as IoT terminals).
Mitigation measures: Strengthen storage and service security and access control.
Core risk: Tampering with deployment components (such as service framework vulnerabilities) causes abnormal model behavior.
Attack type: Modify the deployment workflow, exploit vulnerabilities in tools such as TorchServe to execute remote code.
Mitigation: Harden service infrastructure with default security tools.
Core risk: Making the model unavailable through high-resource consumption queries (such as the “sponge example”), including traditional DoS and energy-consuming delay attacks.
Impact of the attack: Bringing down the server or draining the battery of the device (such as an IoT terminal).
Mitigation measures: application layer rate limiting, load balancing, input filtering.
Core risk: Cloning models through input-output analysis (such as collecting data through high-frequency API calls) for counterfeiting or adversarial attacks.
Technical means: Reconstruct the model based on input-output pairs, which is different from model stealing.
Mitigation measures: API rate limiting, application layer access control.
Core risk: Plugin/library vulnerabilities are exploited, leading to unauthorized access or malicious code injection (such as manipulating input and output to trigger chain attacks).
Attack association: Related to prompt injection, but can be implemented through poisoning, evasion and other means.
Mitigation measures: Strict component permission control and input and output verification.
Core risk: Taking advantage of the ambiguity of the "command-data" boundary in the prompt to inject malicious commands (such as the jailbreak attack "ignore previous commands").
Attack form: direct input or indirect injection from carriers such as documents/images (multimodal scenario).
Mitigation measures: input-output filtering, adversarial training.
Core risk: Slightly disturbed inputs (such as a sticker blocking a road sign) can cause the model to make incorrect inferences, affecting safety-critical systems.
Technical means: adversarial samples, homograph attacks, and steganographic encoding.
Mitigation measures: Diversified data training, adversarial testing.
Core risk: Model output leaks private information in training data, user conversations, or prompts (such as memory data and log storage vulnerabilities).
Leakage channels: user query logs, training data memory, and plug-in integration vulnerabilities.
Mitigation measures: output filtering, privacy-enhancing techniques, data de-identification.
Core risk: The model infers sensitive information (such as user attributes, privacy associations) that is not included in the training data through input.
Risk difference: Unlike SDD, it does not directly leak training data, but infers related information.
Mitigation measures: output filtering, sensitive inference testing during training.
Core risk: Unverified model output contains malicious content (such as phishing links, malicious code).
Attack scenario: Accidentally triggering or actively inducing the generation of harmful output.
Mitigation: Output validation and sanitization.
Core risk: Proxy tools perform unexpected operations due to input perturbations or malicious attacks (such as excessive permissions leading to system damage).
Risk type: Mission planning error (accidental) or prompt injection inducement (malicious).
Mitigation measures: principle of least privilege and manual review intervention.
The design of SAIF is inspired by a deep understanding of the unique security trends and risks of AI systems. Google points out that it is crucial to establish a unified framework covering the public and private sectors, which can ensure that technology developers and users jointly protect the underlying technology that supports the development of AI, so that AI models have "default security" capabilities from the beginning of deployment.
References: https://saif.google/