LAS VEGAS—On a raised floor in a ballroom at the Paris Hotel, seven competitors stood silently. These combatants had fought since 9:00am, and nearly $4 million in prize money loomed over all the proceedings. Now some 10 hours later, their final rounds were being accompanied by all the play-by-play and color commentary you’d expect from an episode of American Ninja Warrior. Yet, no one in the competition showed signs of nerves.
To observers, this all likely came across as odd, especially because the competitors weren’t hackers; they were identical racks of high-performance computing and network gear. The finale of the Defense Advanced Research Projects Agency’s Cyber Grand Challenge, a DEFCON game of “Capture the Flag,” was all about the “Cyber Reasoning Systems” (CRSs). And these collections of artificial intelligence software armed with code and network analysis tools were ready to do battle.
Inside the temporary data center arena, referees unleashed a succession of “challenge” software packages. The CRSs would vie to find vulnerabilities in the code, use those vulnerabilities to score points against competitors, and deploy patches to fix the vulnerabilities. Throughout the whole thing, each system had to also keep the services defined by the challenge packages up and running as much as possible. And aside from the team of judges running the game from a command center nestled amongst all the compute hardware, the whole competition was untouched by human hands.
Greetings, Professor Falken
The Cyber Grand Challenge (CGC) was based on the formula behind the successful Grand Challenges that DARPA has funded in areas such as driverless vehicles and robotics. The intent is to accelerate development of artificial intelligence as a tool to fundamentally change how organizations do information security. Yes, in the wrong hands such systems could be applied to industrial-scale discovery and weaponization of zero-days, giving intelligence and military cyber-operators a way to quickly exploit known systems to gain access or bring them down. But alternatively, systems that can scan for vulnerabilities in software and fix them automatically could, in the eyes of DARPA Director Arati Prabhakar, create a future free from the threat of zero-day software exploits. In such a dream world, “we can get on with the business of enjoying the fruits of this phenomenal information revolution we’re living through today,” she said.
For now, intelligent systems—call them artificial intelligence, expert systems, or cognitive computing—have already managed to beat humans at increasingly difficult reasoning tasks with a lot of training. Google’s AlphaGo beat the world’s reigning Go master at his own game. An AI called ALPHA has beaten US Air Force pilots in simulated air combat. And, of course, there was that Jeopardy match with IBM’s Watson.
But those sorts of games have nothing on the cutthroat nature of Capture the Flag—at least as the game is operated by the Legitimate Business Syndicate, the group that oversees the long-running DEFCON CTF tournament.
This was the culmination of an effort that began in 2013, when DARPA’s Cyber Grand Challenge program director, Mike Walker, began laying the groundwork for the competition. Walker was a computer security researcher and penetration tester who had competed widely in CTF tournaments around the world. He earned this project after working on a “red team” that performed security tests on a DARPA prototype communications system.
After a 2012 briefing he gave the leadership of DARPA’s Information Innovation Office (I2O) on vulnerability detection and patching, the I2O leadership had one thought. “Can we do this in an automated fashion?” I2O Deputy Director Brian Pierce told Ars. “When it comes down to cyber operations, everything operates on machine time. The question was could we think about having the machine assist the human in order to address these challenges.”
The same question was clearly on the minds of Defense Department leaders, particularly at the US Cyber Command with its demand for some way to “fight the network.” In 2009, Air Force Gen. Kevin Chilton, then commander of the Strategic Command, said, “We need to operate at machine-to-machine speeds… we need to operate as near to real time as we can in this domain, be able to push software upgrades automatically, and have our computers scanned remotely.”
Walker saw an opportunity to push forward what was possible by combining the CTF tournament model with DARPA’s “Grand Challenge” experience. He drew heavily from then-deputy I2O director Norm Whitaker’s experience running the DARPA self-driving vehicle grand challenges from 2004 to 2007.
“We learned a lot from that,” said Pierce. But even with a template to follow, “a lot of things had to be built from scratch.” Those things included the creation of a virtual arena in which the competitors could be fairly judged against each other—one that was a vastly simplified version of the real world of cybersecurity so competitors focused on the fundamentals.
That was the same model DARPA followed in its initial self-driven vehicle challenges, as Walker pointed out at DEFCON this month. The winning vehicle of the 2005 DARPA Grand Challenge, a modified Volkswagen Touareg SUV named Stanley, “was not a self-driving car by today’s standards,” said Walker. “It was filled with computing and sensor and communications gear. It couldn’t drive on our streets, it couldn’t handle traffic… it couldn’t do a lot of things. All the same, Stanley earned a place in the Smithsonian by redefining what was possible, and today vehicles derived from Stanley are driving our streets.”
Similarly, the Cyber Grand Challenge devised by Walker and DARPA didn’t look much like today’s world of computer security. The systems would “work only on very simple research operating systems,” Walker said. “They work on 32-bit native code, and they spent a huge amount of computing power to think about the security problems of small example services. The complex bugs they found are impressive, but they’re not as complex as their real-world analogues, and a huge amount of engineering remains to be done before something like this guards the networks we use.”
The “research operating system” built by DARPA for the CGC is called DECREE (the DARPA Experimental Cyber Research Evaluation Environment). It was purpose-built to support playing Capture the Flag with an automated scoring system that changes some of the mechanics of the game as it is usually played by humans.
There are many variations on CTF. But in this competition, the “flag” to be captured is called a Proof of Vulnerability (POV), an exploit that successfully proves the flaw on opponents’ servers. Teams are given “challenge sets,” or pieces of software with one or sometimes multiple vulnerabilities planted in them, to run on the server they are defending. The teams race to discover the flaw through analysis of the code, and they can then score points either by patching their own version of the software and submitting that patch to the referee for verification or by using the discovered exploit to hack into opponents’ systems and obtain a POV.
The problem with patching something is that once a team patches, everyone else can use that patch code to fix their own copies. Patching generally means bringing down the “challenge set” code briefly to apply the patch. It’s a risk for competitors: if the patch fails and the software doesn’t work properly, that counts against your score.
Based on 32-bit Linux for the Intel architecture, DECREE only supports programs compiled to run in the Cyber Grand Challenge Executable Format (CGCEF)—a format that supports a much smaller number of possible system calls than used in software on general-purpose operating systems. CGCEF also comes with tools that allow for quick remote validation that software components are up and running, debugging and binary analysis tools, and an interface to throw POVs at the challenge code either as XML-based descriptions or as C code.
So just as with the human version of CTF, each of the CRSs was also tasked with defending a “server”—in this case, running instances of DECREE. The AI of each bot controlled the strategy used to analyze the code and the creation of potential POVs and patches. AIs made decisions based on the strategy they were trained with plus the state of the game, adapting to either submit patches (and share them with everyone else as a result), create a network-based defense to prevent exploits from landing, or go on the attack and prove vulnerabilities on other systems.
While teams could score points based on patches submitted, successfully deflected attacks, and POVs scored against the other teams’ systems, those points were multiplied by the percentage of time their copies of the challenge sets were available. Patching would mean losing availability, as the challenge sets would have to go down to be patched. That meant possibly giving up the chance to exploit the bug against others to score more points. Setting up an intrusion detection system to block attacks could also affect availability—especially if it was set wrong and it blocked legitimate traffic coming in. All of this makes for a complex set of game strategies in CTF, which really tests the flexibility of the AI controlling the bots.
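To make that trade-off concrete, here is a minimal sketch of the availability-weighted scoring described above. It assumes only what this article states: points from patches, deflected attacks, and POVs are summed and then multiplied by the fraction of time the service was up. The class name, field names, and point values are hypothetical, and the actual CGC scoring rules were likely more detailed than this.

```python
from dataclasses import dataclass


@dataclass
class RoundResult:
    """One team's results for one round (all field names are illustrative)."""
    patch_points: float      # points for patches accepted by the referee
    defense_points: float    # points for successfully deflected attacks
    pov_points: float        # points for POVs landed on opponents
    uptime_seconds: float    # time the challenge set was reachable
    round_seconds: float     # total length of the round


def availability(result: RoundResult) -> float:
    """Fraction of the round the service was available, clamped to [0, 1]."""
    if result.round_seconds <= 0:
        raise ValueError("round_seconds must be positive")
    return max(0.0, min(1.0, result.uptime_seconds / result.round_seconds))


def round_score(result: RoundResult) -> float:
    """Raw points multiplied by availability, as the article describes."""
    raw = result.patch_points + result.defense_points + result.pov_points
    return raw * availability(result)


# A team that patches loses some uptime; a team that stays up keeps its multiplier.
patching_team = RoundResult(100, 0, 0, uptime_seconds=270, round_seconds=300)
attacking_team = RoundResult(0, 0, 100, uptime_seconds=300, round_seconds=300)
print(round_score(patching_team), round_score(attacking_team))
```

With these made-up numbers, 30 seconds of downtime in a five-minute round costs the patching team a tenth of its points, which is the kind of cost each bot has to weigh before taking a service offline.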
One of the major differences between human CTF and the DARPA version of the game was its pace. In a typical human CTF tournament, only about 10 challenge sets would be posted over the course of a two-day tournament. In the qualifying round of the Cyber Grand Challenge, held last August at DEFCON, there were 131 challenge sets with a total of 590 vulnerabilities. In total, 28 systems made the first cut over the course of 24 hours—so they could handle a challenge about once every 10 minutes. In the finale of the Cyber Grand Challenge, there would be 100—but they would be posted to systems at the rate of one challenge every five minutes.
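The pace figures can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above (131 challenge sets over 24 hours, and 100 sets at one every five minutes); the function name is made up for illustration.

```python
def minutes_per_challenge(total_minutes: float, challenge_sets: int) -> float:
    """Average time available per challenge set, in minutes."""
    if challenge_sets <= 0:
        raise ValueError("challenge_sets must be positive")
    return total_minutes / challenge_sets


# Qualifying round: 131 challenge sets spread over 24 hours.
qualifier_pace = minutes_per_challenge(24 * 60, 131)

# Finale: 100 challenge sets released at one every five minutes.
finale_hours = 100 * 5 / 60

print(f"qualifier: about {qualifier_pace:.1f} minutes per challenge set")
print(f"finale: about {finale_hours:.1f} hours of challenge releases")
```

The qualifier works out to roughly 11 minutes per challenge set, in line with the "about once every 10 minutes" figure, and the finale's release schedule alone spans a little over eight hours.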
Another difference in this event was how the “offensive” part of the CTF worked. Rather than launching their POVs directly at competitors’ servers to try to score, the POVs were submitted to and launched by the referee system. That way, the success or failure of any POV could be instantly assessed by the scoring system, just as any patch submitted by competitors could be independently evaluated. All of the action recorded by the referee system could then be played back through a set of visualization tools to show the results of each round.
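The referee-mediated flow can be sketched as a toy model. Everything here is an assumption made for illustration: the `Referee` class, the idea of a service as a function that returns `True` when an input triggers the planted flaw, and the team names. None of it reflects DARPA's actual referee software.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy stand-in for a running challenge set: returns True if the input
# triggers the planted flaw, False otherwise.
Service = Callable[[bytes], bool]


@dataclass
class Event:
    attacker: str
    target: str
    success: bool


@dataclass
class Referee:
    """Launches submitted POVs on the attackers' behalf and logs every result."""
    services: Dict[str, Service]
    log: List[Event] = field(default_factory=list)

    def submit_pov(self, attacker: str, pov: bytes) -> None:
        # The referee, not the attacker, throws the POV at each opponent.
        for target, service in self.services.items():
            if target == attacker:
                continue
            self.log.append(Event(attacker, target, service(pov)))

    def replay(self) -> List[str]:
        # The recorded log can be played back later, e.g. for visualization.
        return [
            f"{e.attacker} -> {e.target}: {'hit' if e.success else 'miss'}"
            for e in self.log
        ]


def unpatched_service(data: bytes) -> bool:
    return len(data) > 8  # toy flaw: input longer than 8 bytes "overflows"


def patched_service(data: bytes) -> bool:
    return False  # the patch removed the flaw


referee = Referee({"A": unpatched_service, "B": unpatched_service, "C": patched_service})
referee.submit_pov("A", b"x" * 16)
for line in referee.replay():
    print(line)
```

Because every attempt passes through one place, scoring and playback both work from the same record, which is the property the paragraph above describes.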
On top of all this was the hardware itself. CTF tournaments generally look like rooms full of people huddled at tables around laptops. The CGC version of the game required the construction of a portable data center, assembled in Las Vegas: an “air-gapped” network of 15 supercomputing systems, each with 128 2.5 GHz Intel Xeon processors (totaling more than 1,000 processor cores) and 16 terabytes of RAM. Physically disconnected from any outside network, the only way data left the “arena” network was via a robotic arm that passed Blu-ray discs burned with scoring data from one tray to another.
Seven of the identical supercomputing racks ran the AI-powered “bots.” Seven more were dedicated to running the match itself—handling the deployment of the challenge sets, verifying POVs and patches, performing forensic analysis, and tracking the score. The last system acted as a sparring partner for the competitors in warm-ups. The whole raised-floor rig was cooled by water piped in from three industrial chillers sitting on the Paris Convention Center’s loading dock; it drew 21 kilowatts of power over cables snaked in from outside.