Over the next four days, more than 3,000 hackers will descend upon a conference hall at DEF CON and try to break into leading generative artificial intelligence systems. Attendees of the annual hacking conference in Las Vegas will have 50 minutes each at one of 156 laptops to deceive, probe and steal information from AI chatbots, in the largest-ever public exercise aimed at discovering the security weaknesses of large language models.
At a time when interest in deploying generative AI is skyrocketing and the vulnerabilities of these systems are only beginning to be understood, the red-teaming exercise at DEF CON’s AI Village aims to enlist the talents of America’s leading hackers to discover security flaws and biases encoded in large language models to better understand how they might harm society.
The popularity of LLMs and the viral phenomenon of ChatGPT has caused a boom in the AI industry, putting AI tools in the hands of consumers and hackers alike. Hackers have already found ways to circumvent their security controls, and prompt injections — instructions that cause LLMs to ignore their guardrails — targeting mainstream models have received widespread attention. But the organizers of the red-team event hope that the exercise will allow participants to examine the potential harms and vulnerabilities of generative AI more broadly.
“Most of the harmful things that will occur will happen in the everyday use of large language models,” said Rumman Chowdhury, an AI ethicist and researcher and one of the organizers of the events. What Chowdhury refers to as “embedded harms” can include disinformation, racial bias, inconsistent responses and the use of everyday language to make the model say something it shouldn’t.
Allowing hackers to poke and prod at the AI systems of leading labs — including Anthropic, Google, Hugging Face, Microsoft, Meta, NVIDIA, OpenAI and Stability AI — in an open setting “demonstrates that it’s possible to create AI governance solutions that are independent, inclusive and informed by but not beholden to AI companies,” Chowdhury said at a media briefing this week with the organizers of the event.
Broadening the community of people involved in AI security is more important than ever, the event’s organizers argue, because AI policy is being written while key scientific questions remain unanswered. “Congress is grappling with AI governance and they’re searching for guidance,” said Kellee Wicker, the director of the Science and Technology Innovation Program at the Wilson Center, a Washington think tank. As AI policy is being written, “wider inclusion of stakeholders in these governance discussions is absolutely essential,” Wicker argues, adding that the red-team event is a chance to diversify “both who’s talking about AI security and who is directly involved with AI security.”
Participants in the event will sit down at a laptop, be randomly assigned a model from one of the participating firms and provided with a list of challenges from which they can choose. There are five categories of challenges — prompt hacking, security, information integrity, internal consistency and societal harm — and participants will submit any problematic material to judges for grading.
The winners of the event are expected to be announced Sunday at the conclusion of the conference, but the full result of the red-teaming exercise are not expected to be released until February.
Policymakers have seized on red-teaming as a key tool in better understanding AI systems, and a recent set of voluntary security commitments from leading AI companies secured by the White House included a pledge to subject their products to external security testing. But even as AI models are being deployed in the wild, it’s not clear that the discipline of AI safety is sufficiently mature and has the tools to evaluate the risks posed by large language models whose internal workings scientists are often at a loss to explain.
“Evaluating the capability and safety characteristics of LLMs is really complex, and it’s sort of an open area of scientific inquiry,” Michael Sellitto, a policy executive at Anthropic, said during this week’s briefing. Inviting a huge number of hackers to attack models from his company and others is a chance to identify “areas in the risk surface that we maybe haven’t touched yet,” Sellitto added.
In a paper released last year, researchers at Anthropic described the results of an internal red-teaming exercise involving 324 crowd-sourced workers recruited to prompt an AI assistant into saying harmful things. The researchers found that larger models trained via human feedback to be more harmless were generally more difficult to red-team. Having been trained to have stronger guardrails, Anthropic’s researchers found it more difficult to prompt the models to engage in harmful behavior, but the firm noted that its data was limited and the approach expensive.
The paper notes that a minority of prolific red-teamers generated most of the data in the set, with about 80% of attacks coming from about 50 of the workers. Opening models to attack at DEF CON will provide a relatively inexpensive, larger data set from a broader, potentially more expert group of red-teamers.
Chris Rohlf, a security engineer at Meta, said that recruiting a larger group of workers with diverse perspectives to red-team AI systems is “something that’s hard to recreate internally” or by “hiring third-party experts.” By opening Meta’s AI models to attack at DEF CON, Rohlf said he hopes it will help “us find more issues, which is going to lead us to more robust and resilient models in the future.”
Carrying out a generative AI red teaming event at a conference like DEF CON also represents a melding of disciplines — between cybersecurity and AI safety.
“We’re bringing ideas from security, like using a capture-the-flag system that has been used in many, many security competitions to machine learning ethics and machine learning safety,” said Sven Catell, who founded the DEF CON AI Village. These aspects of AI safety don’t fit neatly within the cybersecurity discipline, which is principally concerned about security vulnerabilities in code and hardware. “But security is about managing risk,” Catell said, and that means the security community should work to address the risk of rapidly proliferating AI.
As AI developers place greater focus on security, bringing together these disciplines faces significant hurdles, but the hope of this weekend’s red-team exercise is that the hard-fought lessons from trying — and failing to secure — computer systems in recent decades might applied to AI systems at an early enough stage to mitigate major harms to society.
“There is this sort of space of AI and data science and security — and they’re not strictly the same,” Daniel Rohrer, NVIDIA’s vice president of product security architecture and research, told CyberScoop in an interview. “Merging those disciplines, I think is really important.” Over the course of the past 30 years, the computer security profession has learned a great deal about how to secure systems, and “a lot of those can be applied and implemented slightly differently in AI contexts,” Rohrer said.