Prove it!: A 2018 Wave in Information Security Machine Learning

Over the last several years, various waves of machine learning (ML) adoption have disrupted information security products and services.  Based on my limited retention of electromagnetic wave propagation theory from undergraduate studies, the idiomatic use of waves seems wholly appropriate.  The market reaction to ML has been not unlike a transverse wave that displaces particles in orthogonal directions even while delivering energy to its target.  Indeed, ML in infosec has delivered on its promise in many areas, but some combination of excitement, mystique, and hype has unnecessarily obscured the fundamental strengths of what ML can achieve, muddying its impact.  A few headlines from 2017 provide evidence of the market’s growing veiled skepticism, caveated celebration, and muted enthusiasm:

A.I. faces hype, skepticism at RSA cybersecurity show

Artificial Intelligence Will Revolutionize Cybersecurity: But Security Leaders Must View All Vendor Claims With Skepticism

Do you need AI? Maybe. But you definitely need technology that works

It is within this landscape that I make a not-so-audacious prediction for infosec machine learning in 2018: this year will mark the swell of the “Prove It!” wave for Machine Learning, which I hope will adjust the equilibrium of clarity, trust, and honesty for the better.  I’ll outline some of the forcing functions along with the boundary conditions and challenging scattering surfaces that are channeling this wave in 2018.  In other words, I’ll define the drivers behind “Prove It!” and the obstacles and policies that will ultimately shape it.


Forcing Functions

There is a growing demand for transparency in creating machine learning models, including explicit means to challenge model predictions, access to and understanding of the data driving the models, and buzzword-free clarity in marketing the solutions that leverage them.  This triad is driving a “Prove It!” wave in 2018 that is also shaped by societal and political forces. First, self-correction within the ML research community has instigated a move away from ad hoc model-building to a seek-to-understand approach that may include (gasp!) proofs and guarantees. Next, government regulations such as the EU’s General Data Protection Regulation aim to protect users from the consequences of unchallenged “black box” decisions. Finally, infosec customers are exhausted by blind reliance on vendor claims and incomplete information about ML capabilities, and just want protection and usability, no matter how it is built. I’ll address each of these in more detail below.


Provable and Reproducible ML Research

Ali Rahimi issued a clarion, if somewhat controversial, call for renewed emphasis on reproducible research at the NIPS machine learning conference last year.  He compared some trends, particularly in deep learning, to the medieval practice of Alchemy. “Alchemy ‘worked’,” Ali admitted.  “Alchemists invented metallurgy, ways to dye textiles, our modern glass-making processes, and medications.  Then again, Alchemists also believed they could cure diseases with leeches, and turn base metals into gold.”  

Adapting Ali’s metaphor to security, we believe machine learning is a powerful tool for detecting malicious tools and behavior. And for the record, we also believe in the appropriate use of deep learning. However, we don’t believe that machine learning will soon displace effective rules or, especially, hardworking infosec professionals.  Furthermore, when machine learning is powering, say, a photo-sharing app, then an Alchemist-like “let’s see if this works?” approach is acceptable.  But, this is infosec. We’re protecting customer networks, their endpoints, and their data.  Our customers deserve the reassurance that technology--machine learning or otherwise--is built on a bedrock of thorough knowledge, verifiability, and rigor.

To be clear, much of machine learning (including deep learning) is already built on that solid bedrock. For example, at Endgame, we carefully evaluate machine learning models for malware detection based on detection rate, memory, and CPU footprint.  We even attack our own machine learning models to understand their weaknesses and ensure that they are robust against worst-case adaptation by an adversary.

My call for 2018 is to continue to address what is still particularly needed in ML infosec research: more cross-pollination between academia and industry, more open community engagement from security vendors, and more open datasets for reproducible research.  By doing this, we’ll continue to move ML in infosec from the dark arts of Alchemy to rigorous Science.


Transparency in Algorithmic Decision-Making

The European Union’s forthcoming General Data Protection Regulation (GDPR) introduces comprehensive rules about the collection, storage, and processing of personal information.  In addition to implementing corporate accountability for responses to breaches, a hearty majority of the law grants citizens specific rights related to personal data, including full transparency about how data is used, access to the data at any time, the right to object to certain uses of the data, and the “right to be forgotten”.

In Article 22, the GDPR also addresses algorithmic decision-making to formulate “safeguards for the rights and freedoms of the data subject” for data processing that “produces legal effects” or “similarly significantly affects” the citizen.  Among the phrases in the GDPR is a “right to an explanation”.  The legal bounds and scope of applicability of “right to an explanation” may be debated, but I believe this is the whitecap of a broader swell: the call for more transparency in “black box” ML.  It is welcome and contributes to the mounting “Prove It!” pressure in 2018.  If asked, can you explain how your ML model arrived at its decision?  It will no longer be acceptable to blame algorithms for unintended consequences.

Importantly, ML models are not all created equal, and each may require a different technique to describe its decision-making.  For example, a nearest-neighbor classifier naturally justifies its decision using case-based reasoning: your file was predicted “malicious” because it is similar to *this* known malicious file.  Decision trees provide a human-interpretable algorithm for justifying decisions, though perhaps an awkwardly verbose one.  Ensembles like random forests and gradient-boosted decision trees blur this simplicity because decisions are made by a committee of such trees, which is harder to summarize concisely.  Instead, one often resorts to listing the features most commonly queried in deriving the result, delivering information similar to feature importances in a linear model.  The greatest burden for clarity still likely lies with deep learning models, which today rely on sensitivity analysis, saliency maps, relevance propagation, or visualized attention mechanisms--techniques that unfortunately may amount to mere blobography for all but the data scientist shepherding the model.  Still other methods, like LIME, are model-agnostic ways to implement explainable AI or interpretable ML.
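The contrast between a self-explaining model and an ensemble that can only summarize its reasoning can be sketched in a few lines of scikit-learn.  This is an illustrative toy, not a real malware detector: the data is synthetic and the feature names are hypothetical stand-ins for the kinds of file properties a detector might use.

```python
# Hedged sketch: interpretable vs. aggregate explanations (toy data).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
# Hypothetical feature names, purely for illustration.
feature_names = ["entropy", "import_count", "section_count", "file_size"]

# A shallow decision tree yields a human-readable rule set outright...
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))

# ...while a forest of such trees can only be summarized in aggregate,
# e.g. by feature importances, much like weights in a linear model.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```

The tree prints an if/then rule path a human can audit directly; the forest’s importances only say which features mattered on average, which is exactly the loss of per-decision clarity described above.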

The point is that although some models may reach impressive predictive performance, it may not be clear what information in the data directly determines the decisions.  Ironically, machine learning is such that even with full access to the source code and data, it may still be very difficult to determine *why* a model made a particular decision.  In some cases, there may be a trade-off: would you rather have a predictor be right 99% of the time, but not know why, or be right 97% of the time, but with a satisfactory explanation?  For some applications, such as medical diagnosis, “black box” decision-making may be considered irresponsible.  The use of explainable features and human-interpretable ML models is a foundation for providing guarantees about a diagnosis. Beyond that, interpretability enables verification.  Understanding the model enables model improvement.  And, in some cases, it may empower an infosec customer with important context about an incident.

One purpose of the GDPR is to protect users from the consequences of algorithmic decisions made with their data, with the “right to an explanation” as one safeguard.  Some would argue that “explainable AI” holds models and algorithms to a higher standard than humans.  As Marvin Minsky, one of the fathers of AI, noted, “No computer has ever been designed that is ever aware of what it’s doing.  But most of the time, we aren’t either.”  Still, at the very least, public policy and regulations are pushing us gently away from “black box” and toward “glass box” algorithmic decision-making.  My call for 2018 is to ride this early swell in infosec.  If the physicist’s mantra is Feynman’s “What I cannot create, I do not understand,” then the infosec data scientist should adopt, “What cannot be understood, should be deployed with care.”


Show me the money!

The final “Prove It!” pressure is rooted in industry fatigue.  In 2018, “because it uses ML” will hardly be an acceptable answer to a question about whether one product provides better protection than another. At the end of the day, whether a customer’s infrastructure is protected far outweighs how it is protected.  

The frenzy of ML is clearly not limited to information security, as it has come to symbolize the leading edge of technological innovation.  Indeed, China’s projected outpacing of the United States in AI research has been called a “Sputnik moment”, with some fearing a widening research gap.  Information security is sure to benefit from any unified energy behind an AI “space race” in the long run.  However, the stark reality remains that customers are being breached today and are hungry for the best solution to the problem at hand, regardless of whether it is imminently headed to space.

Fortunately, there are technique-agnostic methods to compare solutions. First, we have previously argued that AV can be compared apples-to-apples with ML by examining both false positive and true positive rates; “accuracy”, by contrast, is wholly inadequate and may hide all manner of sins. Customers are increasingly demanding answers to these and other “right questions”.  Second, where data are non-existent or in-house evaluation is impractical, customers can turn to agnostic third-party testing. In the endpoint security space, vendors are beginning to offer holistic breach tests rather than AV-only tests, which help customers evaluate a broader protection landscape.
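A tiny worked example shows why “accuracy” can hide all manner of sins on the imbalanced data typical of malware detection.  The numbers below are illustrative only, not drawn from any real product evaluation.

```python
# Hedged sketch: accuracy vs. detection rate on imbalanced data.
# Suppose 1,000 files: 10 malicious, 990 benign, and a "detector"
# that simply labels everything benign.
tp, fn = 0, 10    # it misses every malicious file...
tn, fp = 990, 0   # ...but gets every benign file "right".

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)   # true positive rate: the detection rate
fpr = fp / (fp + tn)   # false positive rate

print(f"accuracy = {accuracy:.1%}")  # 99.0%
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")  # 0.0% detection
```

A do-nothing detector scores 99% accuracy while catching zero malware; reporting the true positive and false positive rates together exposes this immediately, which is why they form the apples-to-apples basis for comparison.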

My call for 2018 is for companies to finally move beyond selling “because it uses ML” to address what really matters: “because it has been shown to protect customers (in third-party tests, even!)”.


A self-fulfilling prediction?

I am advocating a “Prove It!” trend in infosec ML as much as I am predicting it.  And I’m certainly not alone: data scientists I talk to throughout the infosec community welcome the rigor, the transparency, and the honesty. For data scientists, this means bringing more attention to “process” in what has been a maniacal drive for “outcome”. Let’s make our research reproducible.  Let’s do our best to understand our models and provide explanations to users when appropriate, legal pressures or not.  As that culture changes, consumers can, conversely, invest confidently in successful outcomes, rather than shiny processes. Now that ML is a mature staple of information security, let’s let it prove itself, and allow customers to demand results, no matter how it’s built.