Understanding Crawl Data at Scale (Part 2)


Effective analysis of cyber security data requires understanding the composition of networks and the ability to profile the hosts within them according to the large variety of features they possess. Cyber-infrastructure profiling can generate many useful insights. These can include: identification of general groups of similar hosts, identification of unusual host behavior, vulnerability prediction, and development of a chronicle of technology adoption by hosts. But cyber-infrastructure profiling also presents many challenges because of the volume, variety, and velocity of data. There are roughly one billion Internet hosts in existence today. Hosts may vary from each other so much that we need hundreds of features to describe them. The speed of technology changes can also be astonishing. We need a technique to address these rapid changes and enormous features sets that will save analysts and security operators time and provide them with useful information faster. In this post and the next, I will demonstrate some techniques in clustering and visualization that we have been using for cyber security analytics.

To deal with these challenges, the data scientists at Endgame leverage the power of clustering. Clustering is one of the most important analytic methodologies used to boil down a big data set into groups of smaller sets in a meaningful way. Analysts can then gain further insights using the smaller data sets.

I will continue the use case given in Understanding Crawl Data at Scale (Part 1): the crawled data of hosts. At Endgame, we crawl a large, global set of websites and extract summary statistics from each. These statistics include technical information like the average number of javascript links or image files per page. We aggregate all statistics by domain and then index these into our local Elasticsearch cluster for browsing through the results. The crawled data is structured into hundreds of features including both categorical features and numerical features. For the purpose of illustration, I will only use 82 numerical features in this post. The total number of data points is 6668.

First, I’ll cover how we use visualization to reduce the number of features. In a later post, I’ll talk about clustering and the visualization of clustering results.

Before we actually start clustering, we first should try to reduce the dimensionality of the data. The very basic EDA (Exploratory Data Analysis) method of numerical features is to plot them on a scatter matrix graph, as shown in Figure 1. It is an 82 by 82 plot matrix. Each cell in the matrix, except the ones on the diagonal line, is a two-variable scatter plot, and the plots on the diagonal are the histograms of each variable. Given the large number of features, we can hardly see anything from this busy graph. An analyst could spend hours trying to decipher this and derive useful insights:

Figure 1. Scatter Matrix of 82 Features

Of course, we can try to break up the 82 variables into smaller sets and develop a scattered matrix for each set. However, there is a better visualization technique available for handling the high dimensional data called a Self-Organizing Map (SOM).

The basic idea of a SOM is to place similar data points closely on a (usually) two dimensional map by training the weight vector of each cell on the map with the given data set. A SOM can also be applied to generate a heat map for each of the variables, like in Figure 2. In that case, a one-variable data set is used for creating each subplot in the component plane.

Figure 2. SOM Component Plane of 82 Features

By color-coding the magnitude of a variable, as shown in Figure 2, we can vividly identify those variables whose plots are covered by mostly blue. These variables have low entropy values, which, in information theory, implies that the amount of information is low. We can safely remove those variables and only keep the ones whose heat maps are more colorful. The component plane can also be used to identify similar or linearly correlated variables, such as the image at cell (2,5) and the one at cell (2,6). These cells represent the internal HTML pages count and HTML files count variables, respectively.

Based on Figure 2, 29 variables stood out as potential high information variables. This is a data-driven heuristic for distilling the data, without needing to know anything about information gains, entropy, or standard deviation.

However, 29 variables may still be too many, as we can see that some of them are pretty similar. It would be great to sort the 29 variables based on their similarities, and that can be done with a SOM. Figure 3 is an ordered SOM component plane of the 29 variables, in which similar features are placed close to each other. Again, the benefit of creating this sorted component plane is that any analyst, without the requirement of strong statistical training, can safely look at the graph and hand pick similar features out of each feature group.

Figure 3. Ordered SOM Component Plane

So far, I demonstrated how to use visualization, specifically a SOM, to help reduce the dimensionality of the data set. Please note that dimensionality reduction is another very rich research topic (besides clustering) in data science. Here I only mentioned an extremely small tip of the iceberg, using a SOM component plane to visually select a subset of features. One more important point about the SOM is that it not only helps reduce the number of features, but also brings down the number of data points for analysis by generating a set of codebook data points that summarize the original larger data set according to some criteria.

In Part 3 of this series on Understanding Crawl Data at Scale, I’ll show how we use codebook data to visualize clustering results.