Going “Deep” with Artemis 3.0


Over two years ago we announced Artemis, Endgame’s natural language interface to facilitate and expedite detection and response. Since then, we’ve learned how security workers employ the technology and identified some areas for improvement. When Artemis was first released, it exclusively supported querying events on the Windows operating system. Those were simpler times, when file names had extensions and process IDs were usually multiples of 4. In 2018, we opened up the Artemis interface to events generated by Endgame Linux and OSX sensors. It was a seamless transition for our users, but in the Endgame spirit of continuous improvement, we started seeing ways to make Artemis better.

For example, if a user were to query “Search process data for apache on linux endpoints”, the language model within Artemis would struggle to understand what apache was supposed to be. To a security worker, it is obvious the user meant apache to be a process, but a machine has little to go on when making a confident decision. The string apache could be a username or a process name, because prior to the inclusion of Linux/OSX, Artemis was trained to associate extensions with file/process names. Without more specification (e.g. “user apache”), the best Artemis can do is lean toward guessing process name.

We found that training on the old architecture explained here could produce better recognition of extensionless files. Unfortunately, any gains were offset by the cost in model size, training time, and showstopper misses, making the new model infeasible to deploy.

We needed a way to perform an apples-to-apples comparison of potential solutions. This meant building a platform to train, evaluate, and test against a common set of data. We developed BotInspector, a pipeline tool for creating NLU models. BotInspector allows us to quickly train new models using different features or architectures and provides summary statistics. Moreover, it provides a useful comparison view that highlights performance differences between model versions.

Model Comparison

The Winner Is...
Ultimately, a deep learning approach won out. Specifically, a Bidirectional Long Short-Term Memory Conditional Random Field (thankfully referred to as BiLSTM-CRF for short) significantly outperformed our original, standard CRF model. Not only was performance better, the model itself saw a 50x reduction in size (see table below), which enables us to push regular updates to the language model via our cloud services to its home on the Endgame Platform. The main reason for the difference in overall performance was the features being passed to the CRF.

In our original model, the lack of a trained embedding layer forced a ballooning in model size, because the variety and size of the features in the model grew with the variety of vocabulary in the training data. This meant that whatever handcrafted features we tried, e.g. the last 3 letters of the word, the model would basically save those features to help out the later classification. It would essentially perform a lookup of a given feature in order to get a number, which would then be placed into the word vector and fed to the CRF. This limited us to smaller handcrafted features so as not to balloon our model. We also needed to augment our handcrafted features with part-of-speech tagging, e.g. adding to our feature vector that a certain word was a verb or plural noun. This added cost, since the part-of-speech tagger was in fact just another large model. All of this added to the bloat in our deployments.
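To make the size problem concrete, here is a minimal sketch of handcrafted per-word features of the kind described above. It is not Artemis's actual feature set: the function names, the feature names, and the Penn Treebank style tags ("VB", "IN", "NN") are assumptions chosen for illustration, and the part-of-speech tags are simply passed in rather than produced by a real tagger.

```python
def word_features(tokens, pos_tags, i):
    """Build a handcrafted feature dict for the token at position i (hypothetical features)."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),  # the "last 3 letters" feature
        "has_dot": "." in word,        # crude hint that an extension is present
        "pos": pos_tags[i],            # assumed to come from a separate POS tagger
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True         # beginning of sentence
    return features


def feature_vocabulary(sentences):
    """Collect every distinct (name, value) pair seen in the training data.

    A lookup-based model has to store something for each of these pairs,
    so the set grows as the training vocabulary grows.
    """
    vocab = set()
    for tokens, pos_tags in sentences:
        for i in range(len(tokens)):
            vocab.update(word_features(tokens, pos_tags, i).items())
    return vocab


# Illustrative data only.
sentences = [
    (["Search", "for", "calc.exe"], ["VB", "IN", "NN"]),
    (["Search", "for", "apache"], ["VB", "IN", "NN"]),
]
calc_features = word_features(*sentences[0], 2)
vocab_one = feature_vocabulary(sentences[:1])
vocab_two = feature_vocabulary(sentences)
```

Adding the second sentence enlarges the feature vocabulary even though only one word changed, which is the growth pattern that pushed us toward a learned embedding.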

Our new BiLSTM-CRF model still has a CRF as its final step, but it works better because we train our own embedding layer. The saved model stores learned similarities between words rather than a straight vectorization of the features themselves. This network acts as a function that turns a tokenized sentence into a per-word array of tag probabilities. In our case these tags are Inside-Outside-Beginning (IOB) tags. The CRF takes these per-word tag probabilities and uses the Viterbi algorithm to produce the most likely path of tags, for example “Search for calc.exe” -> “O O B-ENT-FILE”, which Artemis can then send to the EQL-powered search functionality in the platform.
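The decoding step can be sketched in a few lines of plain Python. This is a generic Viterbi decoder, not the production code: the per-word scores and the transition scores below are made-up numbers, and working in log space is an assumption made for numerical convenience.

```python
import math


def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   one dict per word, mapping tag -> log-score
    transitions: dict mapping (previous_tag, tag) -> log-score
    """
    # best[t][tag] is the score of the best path that ends in `tag` at word t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] is the previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-ENT-FILE", "I-ENT-FILE"]

# Hypothetical transition scores: everything is neutral except that an
# I- tag directly after O is heavily penalized, as IOB tagging requires.
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I-ENT-FILE")] = -10.0

# Hypothetical per-word probabilities for "Search for calc.exe".
probabilities = [
    {"O": 0.90, "B-ENT-FILE": 0.05, "I-ENT-FILE": 0.05},
    {"O": 0.80, "B-ENT-FILE": 0.10, "I-ENT-FILE": 0.10},
    {"O": 0.10, "B-ENT-FILE": 0.40, "I-ENT-FILE": 0.50},
]
emissions = [{tag: math.log(p) for tag, p in word.items()} for word in probabilities]

greedy = [max(word, key=word.get) for word in probabilities]
decoded = viterbi(emissions, transitions, tags)
print(greedy)   # ['O', 'O', 'I-ENT-FILE']
print(decoded)  # ['O', 'O', 'B-ENT-FILE']
```

Picking the best tag for each word independently yields an invalid sequence here, because I-ENT-FILE cannot follow O. Scoring whole paths lets the transition penalty steer the last word to B-ENT-FILE instead.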


*Weighted avg. of the F1-score as reported by scikit-learn's classification_report
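For readers unfamiliar with the metric, the sketch below shows how a support-weighted F1 average is computed: each tag's F1 score is weighted by how often that tag appears in the true labels. It is a plain-Python illustration rather than the scikit-learn implementation, and the tag sequences are invented for the example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-label F1 scores, weighting each label by its true count."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / count
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * count
    return total / len(y_true)


# Invented per-token tags for illustration.
y_true = ["O", "O", "B-ENT-FILE", "O", "B-ENT-FILE"]
y_pred = ["O", "O", "B-ENT-FILE", "B-ENT-FILE", "B-ENT-FILE"]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow tag frequency, common tags such as O dominate the average, so a high weighted score can still hide weak performance on rare entity tags.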

Moving Forward
Security software typically just sucks to use. At Endgame, we are driven to learn from the consumer side of technology and implement trends and tools that make our customers’ lives easier. Our vision is to make Artemis your virtual security assistant, always ready with the information you need at your fingertips. Join us on this journey by trying our newest release, version 3.8!