AI Red Teaming: Applying Software TEVV for AI Evaluations
As the National Coordinator for critical infrastructure security and resilience, CISA is responsible for facilitating a Secure by Design approach to AI-based software across the digital ecosystem and helping protect critical infrastructure from malicious uses of AI. To effectively mitigate critical failures, physical attacks, and cyberattacks, AI software developers must prioritize rigorous safety and security testing to understand how an AI system can fail or be exploited.
AI red teaming is a foundational component of the safety and security evaluations process.
This blog post demonstrates that AI red teaming must fit into the existing framework for AI Testing, Evaluation, Verification, and Validation (TEVV). Additionally, the post explains how and why AI TEVV must fit into software TEVV, ensuring AI systems are fit for purpose. While the specific software tools used may differ, AI TEVV, despite common misconceptions, must be treated as part of software TEVV from a strategic and operational perspective.
This assertion is grounded in the fact that TEVV has been used for more than four decades to improve the safety and security of software.1 Experts working on AI evaluations should avoid reinventing the wheel and instead build upon the lessons the software security community has learned through developing and improving guidance and requirements.
Framing AI Red Teaming in the Context of TEVV
AI red teaming is the third-party safety and security evaluation of AI systems and a subset of AI Testing, Evaluation, Verification, and Validation (TEVV).
AI TEVV, a broader risk-based approach for the external testing of AI systems, has been developed and operationalized by our interagency partners at the National Institute of Standards and Technology (NIST) through programs like Assessing Risks and Impacts of AI (ARIA) and the NIST GenAI Challenge.
Because AI systems are a type of software system, approaches for AI TEVV must fundamentally be a sub-component of the more established software TEVV.2 The TEVV framework is commonly used to test software reliability and help ensure that software is fit for purpose. TEVV can broadly be deconstructed into three components: software system test and evaluation process, software verification, and software validation.
Software TEVV Can be Used for AI Evaluations
A common misconception surrounding AI evaluation methods is that the established software TEVV framework is not, or cannot be, adapted to account for the evaluation of AI systems.
However, while there are tactical implementation and technical details that differ between AI TEVV and software TEVV, the two processes—from a strategic and operational vantage point—are quite similar. There are three truths about all software systems that illuminate this assertion:
1. Software systems have always had safety risks
One pervasive example that fuels the misconception that AI and software TEVV are dissimilar is the narrative that AI evaluations are unique because of the need to mitigate risks posed by potential security vulnerabilities and safety violations.3 While this is true, many software developers have long had to consider both the security and safety dimensions of traditional software systems.
For example, the Food and Drug Administration (FDA) approves medical devices for use within the United States. Since the 1980s, some medical devices have had a software component, a trend that is increasingly common today. In the mid-1980s, software flaws in a cancer radiotherapy device, the Therac-25, contributed to several deaths.4 A key flaw was a race condition, a type of error in which operations are not executed in the intended order, producing unexpected outcomes. Race conditions often occur unpredictably and arise from complex interactions between components and data. They are frequently hard to reproduce, and identifying which lines of code must be modified to fix the flaw can be challenging.
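To make this failure mode concrete, the minimal sketch below (illustrative only, and unrelated to the actual Therac-25 code) shows how an unsynchronized read-modify-write on shared state produces different results from run to run, which is why such flaws can slip past testing.

```python
# A minimal sketch of a race condition: two or more threads perform a
# read-modify-write on shared state without a lock. Whether updates are
# lost depends entirely on thread scheduling, so the program can pass many
# test runs and still fail in the field.
import threading
import time

counter = 0

def unsafe_increment(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        current = counter      # read shared state
        time.sleep(0)          # yield, widening the window in which another thread
                               # can interleave (real bugs have a tiny, unpredictable window)
        counter = current + 1  # write back, possibly clobbering another thread's update

threads = [threading.Thread(target=unsafe_increment, args=(1_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With 4 threads x 1,000 increments the "correct" answer is 4,000, but the
# printed value typically varies from run to run because updates are silently lost.
print(f"expected 4000, got {counter}")
```

Run several times, the printed total usually changes; guarding the update with a lock (for example, threading.Lock) removes the nondeterminism, but finding the unguarded line in a large codebase is the hard part.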
While the FDA's device approval process has since been updated, the Therac-25 incident demonstrates how traditional software can have fatal consequences for human safety. Medical devices are not uniquely susceptible to software safety risks; many other critical infrastructure sectors also rely on safety-critical software, including aerospace, water and wastewater, and transportation, among others. AI systems, as one type of software system, should similarly be evaluated for safety concerns, for cybersecurity vulnerabilities, and, in particular, for cybersecurity weaknesses that could be exploited to cause safety harms.
2. Software systems require validity and reliability testing
Another common misconception is that AI systems uniquely require testing for validity and reliability to prevent the deployment of AI systems that are inaccurate, unreliable, or that generalize poorly to new data.
However, mitigating validity and reliability concerns while also ensuring the robustness of software against novel situations and inputs is common to both traditional software and AI. For example, modern road vehicle braking systems often rely heavily on software to work effectively. Automated braking software interprets data from sensors and assists when a driver may not react to a hazard in time.
This safety-critical software must demonstrate robustness to a variety of events and conditions, like unexpected pedestrians, slick roads, or the driver following too closely behind another car. Designers of all safety-critical software systems, whether they include an AI element or not, must consider a range of factors, including the dynamics of the system, the probability of certain external events, and the desired degree of “safety margin” given the impact of the system losing control. Additionally, rigorous testing is applied to ensure that the real system reflects the intended design assumptions.
While AI systems can be more complex than this simple example, many of the concepts and techniques from evaluating and modeling software robustness against unexpected inputs are akin to those used in traditional software evaluations.
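As a rough illustration of what such robustness testing can look like in practice, the sketch below uses a simplified, hypothetical braking rule (the physics, thresholds, and safety margin are assumptions, not values from any real braking system) and checks a required property across a grid of road and traffic conditions rather than a single happy path.

```python
# A hypothetical sketch of robustness testing for a simplified automatic-braking rule.
# All numbers below are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

@dataclass
class Scenario:
    speed_mps: float   # vehicle speed, meters per second
    gap_m: float       # distance to the hazard, meters
    friction: float    # road friction coefficient (lower = slicker road)

def should_brake(s: Scenario, safety_margin: float = 1.5) -> bool:
    """Brake when the estimated stopping distance, padded by a safety margin, reaches the gap."""
    deceleration = 9.81 * s.friction                      # crude braking model
    stopping_distance = s.speed_mps ** 2 / (2 * deceleration)
    return stopping_distance * safety_margin >= s.gap_m

# Sweep a grid of conditions: slick vs. dry roads, close vs. distant hazards,
# low vs. high speeds. The test asserts the intended property (never "too late"
# to brake) over the whole grid, not for one convenient input.
failures = []
for speed, gap, friction in product([10, 20, 30], [5, 20, 60], [0.2, 0.5, 0.9]):
    s = Scenario(speed, gap, friction)
    required_stopping = s.speed_mps ** 2 / (2 * 9.81 * s.friction)
    must_brake = required_stopping >= s.gap_m             # ground-truth requirement
    # Here the "ground truth" reuses the same simple model for illustration; a real
    # TEVV process would compare against independent requirements or field data.
    if must_brake and not should_brake(s):
        failures.append(s)

print(f"{len(failures)} scenarios where the rule failed to brake in time")
```

AI evaluations apply the same idea at a larger scale, sweeping inputs and operating conditions to check that behavior stays within specified bounds rather than testing only expected cases.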
3. Software systems are fundamentally probabilistic
Finally, many point to how AI systems, which are constructed with probabilities and commonly designed with intentional variance to avoid producing repetitive results, often require multiple trials to discover improper behavior. This concern about variability also extends to how outputs from AI systems may differ entirely even with small changes to configuration details or training data.
However, traditional software systems are also inherently unpredictable and can exhibit wildly different behavior based on small changes in input if appropriate safeguards are not implemented. Broadly, no general method can decide non-trivial behavioral properties of arbitrary programs ahead of time (a consequence of Rice's Theorem).
More operationally relevant are security vulnerabilities in which a change to one or a few bytes of input from the network can lead to total control of the machine by a threat actor; this happened with the popular web server NGINX in 2021 (CVE-2021-23017). Some classes of vulnerabilities, like race conditions, are not deterministic in any software system. Software engineers may also intentionally introduce controlled randomness into computer processes; this is a core function of cryptography.
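The repeated-trial approach long used to surface intermittent behavior in traditional software carries over directly to AI evaluations. The sketch below is a hypothetical example: flaky_lookup and its roughly one-percent failure rate are stand-ins for any component, or model, that only misbehaves occasionally.

```python
# A minimal sketch of multi-trial testing for a nondeterministic component.
# The function and its failure rate are hypothetical placeholders.
import random

def flaky_lookup(key: str) -> str:
    # Simulates a rare, timing- or state-dependent failure mode.
    if random.random() < 0.01:
        return "corrupted"
    return f"value-for-{key}"

def run_trials(trials: int = 1_000) -> float:
    """Run many trials and report the observed failure rate, much as an AI
    evaluation repeats prompts to estimate how often a model misbehaves."""
    failures = sum(1 for _ in range(trials) if flaky_lookup("user-42") == "corrupted")
    return failures / trials

observed = run_trials()
print(f"observed failure rate: {observed:.3%}")
# A single trial would usually pass; only repeated sampling reveals the
# roughly 1% failure rate, which can then be compared against a tolerance.
```

The same statistical framing (how many trials, what failure rate is acceptable, what confidence is required) applies whether the system under test is a web service or a generative model.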
Characteristics that may seem more prominent or concerning in AI systems (safety concerns, testing for validity and reliability, their probabilistic nature) have always been present in traditional software systems. As such, the well-established software TEVV methodology is a perfectly valid foundation from which to conduct AI evaluations. Yes, there are differences with AI that require some adaptation, but none so large as to warrant a drastically different approach.
CISA's Role in AI TEVV
As the AI evaluations field continues to mature, there is an array of diverse stakeholders working to advance the science and practice of AI red teaming. This includes developing novel methodologies, creating tools that are interoperable across models or platforms, and improving capabilities to conduct AI red teaming at scale.
Serving both as National Coordinator and an operational lead for federal cybersecurity, CISA focuses on contributing to AI red teaming efforts that primarily support security evaluations for Federal and non-Federal entities; this work is organized into three broad workstreams.
First, CISA remains steadfast in ensuring that our work on AI pre-deployment testing supplements efforts in industry, academia, and government. CISA is a founding member of the recently announced Testing Risks of AI for National Security (TRAINS) Taskforce, which will include testing of advanced AI models across national security and public safety domains. The taskforce is led by the NIST AI Safety Institute; CISA will contribute expertise both by helping build new AI evaluation methods and benchmarks that integrate with security testing processes and by providing subject matter expertise on cybersecurity testing. For much of this work, CISA will rely upon Vulnerability Management within CISA’s Cybersecurity Division, which offers CISA security evaluation services such as Cyber Hygiene and Risk and Vulnerability Assessments.
Second, CISA continues to provide technical assistance and risk management support to Federal and non-Federal partners, specifically supporting AI security technical post-deployment testing. This includes varied forms of testing, such as penetration testing, vulnerability scanning, and configuration testing. CISA also often works independently to detect and identify security vulnerabilities impacting critical infrastructure systems and devices. CISA has already begun to receive requests from partners to conduct penetration and technical security testing on Large Language Models (LLMs) and expects demand for these services to grow as partners increasingly adopt AI tools.
Third, CISA collaborates with NIST on the development of standards for AI security testing. CISA provides operational cybersecurity expertise to help make standards practicable. Additionally, CISA builds on NIST standards to provide high-quality services and advice to our partners. CISA security evaluation services, such as red teaming, include AI systems in the scope of those security assessment services. CISA also provides operational guidance for securing software systems, such as the Cross-Sector Cybersecurity Performance Goals, and priority security practices for AI systems, such as the Secure by Design pledge.
New, but the Same
By treating AI TEVV as a subset of traditional software TEVV, the AI evaluations community benefits from using and building upon decades of proven and tested approaches towards assuring software is fit for purpose. Additionally, by streamlining processes, enterprises can avoid standing up parallel testing processes to accomplish similar ends, saving time and resources.
Most notably, with the knowledge that AI TEVV must be treated similarly to software TEVV from a strategic and operational perspective, the digital ecosystem can instead channel effort at the tactical level, developing novel tools, applications, and benchmarks to robustly execute AI TEVV.
1 The US Department of Defense publications in the Rainbow Series in the 1980s represent a major development in the history of cybersecurity guidance and requirements. See NSA/NCSC Rainbow Series.
2 Some organizations approach software TEVV as a part of a broader product quality management regime, as defined by the processes and practices necessary for conformance with the ISO 9000 series of publications. For example, NIST SP 800-160r1 applies ISO 9000 definitions for validation and verification directly to the topic of engineering trustworthy, secure information systems. However, NIST SP 800-160r1 refers to a different ISO standard (29119-2:2021 – Software and systems engineering – Software testing) to define testing. There are several other documents referenced by SP 800-160r1 for evaluation criteria and processes. While SP 800-160r1 does not title itself a TEVV process manual, it does in fact define and inter-relate the processes for software testing, evaluation, validation, and verification.
3 NIST defines “security” as resistance to intentional, unauthorized act(s) designed to cause harm or damage to a system; “safety” is a property of a system such that it does not, under defined conditions, lead to a state in which human life, health, property, or the environment is endangered. See: The Language of Trustworthy AI: An In-Depth Glossary of Terms.
4 Leveson, Nancy G., and Clark S. Turner. "An investigation of the Therac-25 accidents." Computer 26.7 (1993): 18-41.