Challenges in Data-Driven Security


DEFCON 22 was a great learning experience for me. My goal was to soak up as much information security knowledge as possible to complement my existing data science experience. With each talk I picked up more security domain knowledge, and my excitement grew. But as Alex Pinto began his talk, that excitement turned to terror.

I knew exactly where he was going with this. And I also knew that any of those marketing blurbs about behavioral analysis, mathematical models, and anomalous activity could easily have been from Endgame. I had visions of being named, pointed out, and subsequently laughed out of the room. None of that happened, of course. Between Alex’s talk and a quick Google search, I determined that none of those blurbs were from my company. But that wasn’t really the point. They could have been.

That’s because we at Endgame are facing the same challenges that Alex describes in that talk. We are building products that use machine learning and statistical models to help solve security problems. Anyone doing that is entering a field littered with past failures. To try to avoid the same fate, we’ve made sure to educate ourselves about what has and hasn’t worked in the past.

Alex’s talk at DEFCON was part of that education. He talked about the curse of dimensionality, adversaries gaming any statistical solution, and algorithms detecting operational rather than security concerns. This paper by Robin Sommer and Vern Paxson is another great resource that enumerates the problems that past attempts have run up against. It discusses the general challenges facing unsupervised anomaly detection, the high cost of false-positive and false-negative misclassifications, the extreme diversity of network traffic data, and the lack of open and complete data sets to train on. Another paper critiques the frequent use of an old DARPA dataset for testing intrusion detection systems, and in doing so reveals many of the challenges facing machine learning researchers looking for data to train on.
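To make the cost of false positives concrete, here is a minimal sketch of the base-rate arithmetic behind it. The numbers are assumptions chosen purely for illustration (they do not come from Alex’s talk or from either paper), and alert_precision is a hypothetical helper name.

```python
def alert_precision(base_rate, true_positive_rate, false_positive_rate):
    """Fraction of raised alerts that correspond to real attacks (Bayes' rule)."""
    true_alerts = base_rate * true_positive_rate
    false_alerts = (1.0 - base_rate) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Assumed, illustrative numbers: 1 in 10,000 events is malicious, and the
# detector catches 99% of attacks while misfiring on 1% of benign events.
precision = alert_precision(
    base_rate=1e-4,
    true_positive_rate=0.99,
    false_positive_rate=0.01,
)
print(f"Share of alerts that are real attacks: {precision:.2%}")
```

Under these assumed rates, fewer than one alert in a hundred points at a real attack, even though the detector sounds accurate on paper. That is one way to see why misclassification costs weigh so heavily on anomaly detection in security.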

Despite all that pessimism, there have been successes using data science techniques to solve security problems. For years here at Endgame, we’ve successfully clustered content found on the web, provided data exploration tools for vulnerability researchers, and used large-scale computing resources to analyze malware. We’ve been able to do this by engaging our customers in a conversation about both the opportunities and the limitations presented by data science for security. The customers tell us what problems they have, and we tell them what data science techniques can and cannot do for them. This very rarely results in an algorithm that will immediately identify attackers or point out the exact anomalies you’d like it to. But it does help us create tools that enable analysts to do their jobs better.

There is a trove of other success stories in this blog post by Jason Trost. One of the papers it covers describes Polonium, a graph algorithm that classifies files as malware or not based on the reputations of the systems they are found on. This system avoids many of the pitfalls mentioned above. Trustworthy labeled malware data from Symantec allows the system to bootstrap its training. The large-scale, reputation-based algorithm leaves attackers few ways to game the system beyond file obfuscation.
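The general idea of scoring files by the reputation of the machines they appear on can be sketched in a few lines. To be clear, this is not Polonium’s actual algorithm: it swaps in a simple iterative averaging scheme over a machine-file graph, and the function name, the 0-to-1 score scale, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_reputation(sightings, known_labels, iterations=20):
    """Toy reputation propagation on a bipartite machine-file graph.

    sightings:    iterable of (machine, file) pairs
    known_labels: file -> 1.0 (known good) or 0.0 (known malicious)
    Scores closer to 1.0 mean "more trustworthy" (an assumed convention).
    """
    files_on = defaultdict(set)
    machines_with = defaultdict(set)
    for machine, file in sightings:
        files_on[machine].add(file)
        machines_with[file].add(machine)

    # Unlabeled files and all machines start at a neutral 0.5.
    file_score = {f: known_labels.get(f, 0.5) for f in machines_with}
    machine_score = {m: 0.5 for m in files_on}

    for _ in range(iterations):
        # A machine is as reputable as the files seen on it.
        for m, files in files_on.items():
            machine_score[m] = sum(file_score[f] for f in files) / len(files)
        # An unlabeled file inherits the reputation of its machines;
        # labeled files stay fixed and anchor the propagation.
        for f, machines in machines_with.items():
            if f not in known_labels:
                file_score[f] = sum(machine_score[m] for m in machines) / len(machines)
    return file_score, machine_score


# Hypothetical sample data: two machines, two labeled files, two unknowns.
sightings = [
    ("clean_pc", "office.exe"),
    ("clean_pc", "unknown_a.exe"),
    ("infected_pc", "trojan.exe"),
    ("infected_pc", "unknown_b.exe"),
]
known = {"office.exe": 1.0, "trojan.exe": 0.0}
file_score, machine_score = propagate_reputation(sightings, known)
print(file_score)
```

In this toy run, the unknown file that only appears alongside a known-good file drifts toward a high score, and the one that only appears next to a known trojan drifts toward a low one. The labeled files play the bootstrapping role that trustworthy labels play in the real system.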

The existence of success stories like these proves that data-driven approaches can help solve information security problems. When developing those solutions, it’s important to understand the challenges that have tested past approaches and always be cognizant of how your approach will avoid them.

We’ll use this blog over the next few months to share some of the successes and failures we here at Endgame have had in this area. Our next post will focus on our application of unsupervised clustering for visualizing large, high-dimensional data sets. Stay tuned!