Lessons from a Bake Off: A Data Intelligence Conference Readout


Capital One recently hosted the excellent Data Intelligence conference in northern Virginia. As a data scientist working in infosec, I enjoyed meeting so many new people and old friends, all interested in applying machine learning to diverse fields. I presented an overview of our early research into malware classification, titled “Which Model Came Hot and Fresh Out the Kitchen in our Malware Classifier Bake Off?” We had previously documented this research on our technical blog, which detailed our path toward choosing a machine learning model for our endpoint malware classifier and offered tips for others evaluating machine learning models. The results of the bake off gave us a great foundation for building what eventually became MalwareScore™. In this talk, I added context about data science in security in general, the benefits and drawbacks of running a model bake off, and more information about our conclusions.

In our bake off, the Endgame data science team evaluated many machine learning models to see how well they could power a malware detection capability on customer endpoints. We focused not only on classification performance but also on model size and query-time execution, since the model had to fit within tight memory and CPU constraints. Our data science team members brought their expertise in many models to this bake off, including nearest neighbors, decision-tree-based models, and even a deep learning model.
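To make the three criteria concrete, here is a minimal sketch of a bake off harness that records accuracy, stored model size, and per-sample query time for each candidate. It is an illustration under stated assumptions, not the harness we used: the two toy classifiers, the synthetic data, and every name in it (NearestNeighbor, DecisionStump, make_data, bake_off) are hypothetical stand-ins, and it uses only the Python standard library.

```python
import pickle
import random
import time


class NearestNeighbor:
    """Toy 1-nearest-neighbor classifier: it keeps every training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def closest_label(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, t)) for t in self.X]
            return self.y[dists.index(min(dists))]
        return [closest_label(row) for row in X]


class DecisionStump:
    """Toy one-split decision tree: it keeps one feature index and threshold."""

    def fit(self, X, y):
        best = (-1, 0, 0.0)  # (training samples correct, feature, threshold)
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                correct = sum((row[j] > t) == bool(label)
                              for row, label in zip(X, y))
                best = max(best, (correct, j, t))
        _, self.feature, self.threshold = best
        return self

    def predict(self, X):
        return [int(row[self.feature] > self.threshold) for row in X]


def make_data(n, n_features=5, seed=0):
    """Synthetic two-class data (an assumption): class 1 is shifted in feature 0."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 4.0 * label
        X.append(row)
        y.append(label)
    return X, y


def bake_off(candidates, X_train, y_train, X_test, y_test):
    """Fit each candidate and record accuracy, stored size, and query time."""
    results = []
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        start = time.perf_counter()
        predictions = model.predict(X_test)
        elapsed = time.perf_counter() - start
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results.append({
            "name": name,
            "accuracy": correct / len(y_test),
            # Size of the learned state only, serialized with pickle.
            "size_bytes": len(pickle.dumps(vars(model))),
            "query_ms": 1000 * elapsed / len(X_test),
        })
    return results


X_train, y_train = make_data(200, seed=1)
X_test, y_test = make_data(100, seed=2)
candidates = {
    "nearest neighbor": NearestNeighbor(),
    "decision stump": DecisionStump(),
}
results = bake_off(candidates, X_train, y_train, X_test, y_test)
for r in results:
    print(f'{r["name"]:>16}: accuracy={r["accuracy"]:.2f} '
          f'size={r["size_bytes"]} B query={r["query_ms"]:.4f} ms/sample')
```

Even at this toy scale the trade-off shows up: the nearest-neighbor model has to store and scan its whole training set, while the stump stores two numbers. Reporting all three columns side by side is what lets a team weigh detection against memory and CPU limits instead of ranking on accuracy alone.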

In addition, this project allowed our data science team to share knowledge and insights. Having used a Support Vector Machine (SVM) with a Radial Basis Function kernel many years ago to find neutrinos from other galaxies during my graduate research, I thought I knew everything there was to know about them. But during this project, I learned that it’s best to train SVMs differently when your feature count is as high as it was here (>2000 features). This is just one example of how our data science team gained additional knowledge throughout the bake off process, and as a result expanded our own skillsets.

After my talk, most questions focused on the many things we could have done to keep improving the performance of the models across the board. At one point I had to laugh when my answer was once again going to be “no, we didn’t try that,” and I explained that at some point the purpose of the bake off had been accomplished. Once we had learned from each other and seen some early performance results, it was important for the team to decide on a model and iterate on delivering an actual product. In our case, it was clear that gradient boosted decision trees offered the best combination of detection, size, and performance. Many of the audience’s suggestions (identifying problem executables and improving on them, searching for better features, Bayesian hyperparameter optimization) are techniques we have since used to improve and optimize MalwareScore™ performance, something we do on a nearly continual basis.

If you’re considering using machine learning to solve a problem at your company or organization, a bake off is a great way to determine which direction you should take or to challenge existing assumptions about the best path. As you design a bake off, make sure to clearly define what you hope to learn from it. Machine learning can be applied to many different domains, so an unexpected model could turn out to be appropriate for your problem area. At the same time, focusing too many resources on a bake off could distract your team from all the other tasks required before shipping a data product. By defining which questions you need a bake off to answer, you can reduce the chances of it becoming too large a project.

In the process of recalling our earlier work in order to build this talk, I was reminded how this bake off really served as the ignition for our efforts towards building MalwareScore™. We may remix the effort in the coming months to include what our team has learned about training deep learning architectures and will share results if and when they’re available.