New Open Source Repositories for Data Scientists in Infosec

Over the past few years, we have published numerous posts on the benefits and challenges of machine learning in infosec in an effort to help fellow practitioners and customers separate the hype from the reality. We also believe contributing to the larger open source community is an essential component of this outreach. In conjunction with Black Hat, DefCon and BSidesLV, we have released two GitHub repositories, each a playground for data scientists in information security.


gym-malware: An OpenAI Gym for Malware Manipulation

First, our research team last week released gym-malware, an open source OpenAI Gym environment for manipulating Windows PE binaries to evade next-gen AV models. The “gym” allows data scientists in information security to simulate realistic black-box evasion attacks against their own machine learning model by training a reinforcement learning agent to compete against it. In contrast to other approaches for attacking machine learning models, this approach is agnostic to the architecture of the model under attack and requires only API access to the model. The reinforcement learning agent can probe the model to retrieve a malicious or benign label for any query. By learning through tens of thousands of competitive rounds, the reinforcement learning agent can begin to produce, with modest success, functional malware that evades the model under attack.
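
The black-box loop described above can be sketched in a few lines. This is a toy illustration only, not the gym-malware API: the mutation names, the integer scoring rule inside black_box_label, and the crude epsilon-greedy value update are all assumptions made for this sketch, and the "samples" are small dictionaries of counts rather than real PE files. The agent sees nothing but the malicious or benign label.

```python
import random

# Hypothetical mutation names; they act on a toy dictionary, not on real binaries.
MUTATIONS = ["append_overlay", "add_section", "rename_section", "add_import"]
MAX_TURNS = 5


def black_box_label(sample):
    """Toy stand-in for the model under attack; only the label is exposed."""
    score = 10 - 2 * sample["sections"] - sample["imports"]
    return 1 if score > 0 else 0  # 1 = malicious, 0 = benign


def mutate(sample, action):
    """Apply one toy mutation and return a new sample (the input is not modified)."""
    mutated = dict(sample)
    if action == "add_section":
        mutated["sections"] += 1
    elif action == "add_import":
        mutated["imports"] += 1
    elif action == "append_overlay":
        mutated["overlay_bytes"] += 1024
    # "rename_section" changes nothing the toy model looks at.
    return mutated


def run_episode(values, epsilon, rng):
    """Play one round: mutate until the label flips to benign or turns run out."""
    sample = {"sections": 1, "imports": 2, "overlay_bytes": 0}
    taken = []
    for _ in range(MAX_TURNS):
        if rng.random() < epsilon:
            action = rng.choice(MUTATIONS)  # explore
        else:
            best = max(values.values())  # exploit, breaking ties at random
            action = rng.choice([a for a in MUTATIONS if values[a] == best])
        taken.append(action)
        sample = mutate(sample, action)
        if black_box_label(sample) == 0:
            return taken, 1.0  # evaded the model
    return taken, 0.0


def train(rounds=2000, epsilon=0.2, lr=0.05, seed=0):
    """Crude epsilon-greedy learner: credit every action taken with the episode reward."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in MUTATIONS}
    for _ in range(rounds):
        taken, reward = run_episode(values, epsilon, rng)
        for action in taken:
            values[action] += lr * (reward - values[action])
    return values
```

Calling train() returns one learned value per toy mutation, which gives a rough view of which mutations the agent credits with evasion. A real agent would condition on features of the binary and use a far more capable learning algorithm than this bandit-style update.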

Data scientists may use and modify this framework to answer questions such as:

  1. How sensitive is my model to evasion attacks for ransomware (or another category)?
  2. What mutations tend to evade my model the most?
  3. How can I create a killer reinforcement learning agent to bypass my model?

The repository contains a toy machine learning malware model and some preliminary agents that data scientists can use as a starting point to improve and optimize (but bring your own malware!).
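
Questions 1 and 2 above largely come down to bookkeeping over attack attempts. A minimal sketch follows, assuming each attempt is logged as a dictionary. The field names (category, last_mutation, evaded) and the four records are made up purely for illustration and are not output from gym-malware.

```python
from collections import defaultdict


def evasion_rate_by(records, key):
    """Group attack records by `key` and return the fraction that evaded the model."""
    attempts = defaultdict(int)
    evasions = defaultdict(int)
    for record in records:
        group = record[key]
        attempts[group] += 1
        evasions[group] += 1 if record["evaded"] else 0
    return {group: evasions[group] / attempts[group] for group in attempts}


# Made-up records for illustration; real ones would come from logged attack runs.
records = [
    {"category": "ransomware", "last_mutation": "add_section", "evaded": True},
    {"category": "ransomware", "last_mutation": "add_import", "evaded": False},
    {"category": "trojan", "last_mutation": "add_section", "evaded": True},
    {"category": "trojan", "last_mutation": "append_overlay", "evaded": False},
]

by_category = evasion_rate_by(records, "category")       # sensitivity per malware family
by_mutation = evasion_rate_by(records, "last_mutation")  # which mutation tends to evade
```

Attributing an evasion to the last mutation applied is a simplification; a sequence of mutations may matter more than any single one.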


You Are Special, But Your Model Probably Isn’t

On a lighter note, at BSidesLV I presented “Your model isn’t that special: zero to malware model in not much code, and where the real work lies”.  A GitHub repo accompanies this talk, and contains a series of Jupyter notebooks that demonstrate building deep neural networks for Windows PE malware classification. The playground includes code (bring your own data!) for creating:

  1. A multilayer perceptron using hand-crafted features (feature extraction code included);
  2. An end-to-end convolutional deep learning network for malware detection;
  3. A slightly silly re-work of ResNet for malware, which I’ve named MalwaResNet, for even deeper end-to-end convolutional deep learning for malware detection.

The talk and the notebooks aim to demonstrate that one cannot always simply port sophisticated deep learning models from computer vision domains and expect them to work immediately for malware classification. Architectures developed to identify cats in images may not be optimally designed for finding malicious content in raw bytes. Deep learning models do require work, and training them can be a challenge. These notebooks point practitioners in the right direction, but the toy demonstration also highlights some of the shortcomings. For example, in the notebooks intended for consumption on a modest computer, models are trained on far too little data, for too few epochs, with untuned optimization parameters. In fact, the simple multilayer perceptron with hand-crafted features and careful attention to the data (bring your own!) can actually produce a decent Windows PE malware machine learning model. Each of the deep learning model architectures in the repository can be instructive to those who are interested in getting started with feature-based and end-to-end deep learning models in infosec.
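
To give a flavor of what “hand-crafted features” can mean for raw bytes, here is a minimal sketch that turns a file’s contents into a fixed-length vector: a normalized byte histogram, its Shannon entropy, and a log-scaled size. This is an assumption chosen for illustration, not the feature extraction code from the repository, which may use different and richer features such as parsed PE header fields.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Return a 256-bin normalized histogram of byte values."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(histogram):
    """Entropy in bits per byte, computed from a normalized histogram."""
    return -sum(p * math.log2(p) for p in histogram if p > 0)


def extract_features(data):
    """Concatenate the histogram, its entropy, and a log-scaled size into one vector."""
    hist = byte_histogram(data)
    return hist + [shannon_entropy(hist), math.log1p(len(data))]


# Usage on in-memory bytes; for a file, read it with open(path, "rb").read().
features = extract_features(b"MZ" + bytes(62))
```

Every input produces a vector of the same length (258 values here), which is what a multilayer perceptron needs as input regardless of file size.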


A Deeper Look at Machine Learning in Infosec

Machine learning has become an important tool in security for detecting and preventing unknown threats, in large part because of its ability to generalize.  However, all machine learning models have blind spots that present an attack surface for motivated and sophisticated adversaries.  These open source packages help demystify machine learning for malware, and allow others in security to understand, attack, and harden their own machine learning models.  Especially in security, a rising tide lifts all boats. At Endgame, we continuously work to improve our models for malware and other threat detection and prevention, and share our insights and lessons learned to support others in the community.