Examining Malware with Python
Before I came to Endgame, I had participated in a couple of data science competitions hosted by Kaggle. I didn’t treat them as competitions so much as learning opportunities. Like most things in the data science community, these competitions felt very new. But now that I work for a security company, I’ve learned about the long history of CTF competitions meant to test and add to a security researcher’s skills. When the Microsoft Malware Challenge came along, I thought this would be a great opportunity to learn about new ways of applying machine learning to better understand malware. Also, as I’ve talked about before, the lack of open and labeled datasets is a huge obstacle to developing machine learning models to solve security problems. Here was an opportunity to work with an already prepared large labeled dataset of malware samples.
I gave a talk at the SciPy conference this year that describes how I used the scientific computing tools available in Python to participate in the competition. You can check out my slides or watch the video from that talk here. I tried to drive home two main points in this talk: first, that Python tools for text classification can be easily adopted for malware classification, and second, that details of your disassembler and analysis passes are very important for generalizing any results. I’ll summarize those points here, but take a look at the video and slides for more details and code snippets.
My final solution to the classification challenge was mainly based on counting combinations of bytes and instructions called ngrams. This method is based on counting the frequency that a byte or an instruction occurs in a malware sample. When n is greater than one, I count the frequency of combinations of two, three, or more combinations of bytes or instructions. Because the number of possible combinations climbs very quickly, a hashing vectorizer must be used to keep the size of the feature space manageable.
Figure 1: Example byte 2grams from the binglide documentation
Figure 2: Byte 2grams from a malware sample included in the competition
At first, I was only using byte ngrams and I was very surprised that feeding these simple features to a model could provide such good classifications. In order to explore this, I used binglide to better understand what the bytes inside an executable look like. Figure 1 and Figure 2 show the results of this exploration. Figure 1 shows example output from binglide’s documentation and Figure 2 shows the output when I ran the tool on a sample from the competition. In all the images, the entropy of a binary is displayed on the strip to the left and a histogram of the 2gram frequency is shown on the right. For that frequency histogram, each axis contains 256 possible values for a byte and a pixel turns blue as that combination of bytes occurs more frequently.
You can see that the first 2gram pattern in Figure 2 generally looks like the first 2gram pattern in Figure 1. The .text section is usually used for executable code so this match to example x86 code is reassuring. The second 2gram pattern in Figure 2 is very distinctive and doesn’t really match any of the examples from the binglide documentation. Machine learning algorithms are well suited to picking out unique patterns like this if they are reused throughout a class. Finding this gave me more confidence that the classification potential of the byte ngram features was real and not due to any mistake on my part.
I also used instruction ngrams in a similar way. In this case, instructions refer to the first part of the assembly code after it’s been disassembled from the machine code. I wrote some Python code to extract the instructions from the IDA disassembly files that were provided by those running the competition. Again, feature hashing was necessary to restrain the size of the feature space. To me, it’s very easy to see why instruction ngrams could provide good classifications. Developing software is hard, and malware authors are going to want to reuse code in order to not waste effort. That repeated code should produce similar patterns in the instruction ngram space across families of malware.
Using machine learning algorithms to classify text is a mature field with existing software tools. Building word and character ngrams from text is a very similar problem to building the byte and instruction ngrams that I was interested in. In the slides from my SciPy talk I show some snippets of code where I adapted the existing text classification tools in the scikit-learn library to the task of malware classification. Those tools were a couple of different vectorizers, pipelines for cross validating multiple steps together, and a variety of models to try out.
All throughout this process, I was aware that the disassembly provided in the competition would not be available in a large, distributed malware processing engine. IDA Pro is the leading program for reverse engineering and disassembling binaries. It is also restrictively licensed and intended to be run interactively. I’m more interested in extracting features from disassembly automatically in batch and providing some insight to the files generated by a statistical model. I spent a lot of time during and after the competition searching for open source tools that could automatically generate the disassembly provided by the competition.
I found Capstone to be a very easy to use open source disassembler. I used it to generate instruction ngrams and tested the classification performance of models based on those ngrams to the same models based on IDA instructions. They both performed well in that there were very few misclassifications. The competition was judged on a multi-class logarithmic loss metric, though, and this metric was always better when using the IDA instructions.
After talking to some security experts at Endgame, I’ve learned that this could be due to the analysis passes that IDA does before disassembling. Capstone will just execute one sweep over the binary and disassemble anything it finds as it goes. IDA will more intelligently decode the binary looking for entry points, where functions and subroutines begin and end, and what sections actually contain code, data, or imports. I was able to relate this to my machine learning experience in that I viewed IDA’s disassembly as a more intelligent feature engineering pipeline. The result is that I’m still working on finding or building the best performing distributable disassembler.
This Kaggle competition was a great example of how data science can be applied to solve specific security problems. Data science has been described as a combination of skills in software, math, and statistics, along with domain expertise. While I didn’t have the domain expertise when I first joined Endgame, working closely with our security experts has expanded my breadth of knowledge while giving me a new opportunity to explore how data science techniques can be used to solve security challenges.