Understanding Crawl Data at Scale (Part 3)


In Understanding Crawl Data at Scale (Part 2), I demonstrated using SOM to visualize a high-dimensional dataset and use the technique to help reduce the dimensionality. As you may remember, this technique is a time-saver for analysts who are dealing with large data sets consisting of hundreds of features. In this section, I will briefly show the process of clustering and the visualization of its results using a few classical clustering methods. As we know, it is difficult for humans to visually digest any information with more than three dimensions, so I would like to start with illustrating the clustering process using a 2-D data set. The two sort-of-arbitrarily-chosen features come from the data set used before, namely the minimum number of image files and the total number of HTML files.

The 2-D data set can be easily drawn as a scatter plot, as shown in Figure 1. Most of the data are located in a small region at the lower-left corner, while sparser points are stretched across far away in both dimensions. Intuitively we may come up with 3 clusters in our mind, something like the shaded areas below, by just looking at the scatter plot. How true is that? We can use a SOM to get a better idea.

Figure 1. Scatter Plot of the 2-D Data Set

A SOM technique places similar data points close to each other, or even together in the same cell, on a given map. The cells with data populated are part of a codebook data set, which is viewed as a representation of the original data set but with a much smaller number of data points. After the placement is done, the extent of dissimilarity (or distance) between the cells can be computed, and the results can be plotted as a unified distance matrix plot (or U-Mat plot).

Figure 2 shows the same U-Mat plot, but with two possible splits of clusters on the SOM. Darker regions indicate lower distance values and bright red color usually indicates a separation of clusters. Legitimate guesses of the number of clusters might be three or four on the given SOM U-Mat plot, and we are confident that it won’t be more than that.

Figure 2. U-Mat Plot with Possible Separation of Clusters

Now that we have a good idea of how many clusters we would like to try, we can use the K-Means method to group the data points into 3 or 4 clusters (K = 3 or 4 respectively). It is also always a good practice to normalize the data in each dimension before clustering takes place. Here I normalized the data into the range of [0, 1] in both dimensions so that they are comparable.

Figure 3 and Figure 4 show the color-coded clustering results with 3 and 4 clusters. With 3 clusters, 90.5% of data points are assigned to cluster 2, 9% to cluster 1, and 0.5% to cluster 3. Apparently the data points in cluster 3 are outliers in this data set.

Figure 3. Three-Cluster Split Using K-Means

The four-cluster split is a bit different. Cluster 1 now takes 17.5% of the data points, cluster 2, 3.4%, cluster 3, 78.6%, cluster 4, 0.4%.

The contours in both Figure 3 and Figure 4 indicate the areas where data points may have the same level of membership likelihood. In the area where data points are dense, the contours change much more rapidly than those in sparse areas because the clustering is sensitive to the distance of data points to the cluster centers.

Figure 4. Four-Cluster Split Using K-Means

Thanks to the low dimensionality of the hypothetical data set, the split in each case is clear-cut. We can visualize the two different labeling systems using the codebook data placed on a SOM map, as in Figure 5.

The U-Mat plot on the left side of Figure 5 is the same as those in Figure 2, only drawn on a map of slightly different sizes. On the right side of Figure 5, the codebook data are plotted with 3-cluster or 4-cluster labeling. Both of them seem to make sense, and the choice of which to use is really up to the analyst.

Figure 5. Labeled Codebook Data with K-Means(3) and K-Means(4), 2-D Dataset

The world of high dimensionality would be much more blurry. After showing that we can get some satisfactory clustering result with 2-D data, let’s move up to the high-dimensional space, with the same data set but containing 29 variables.

The SOM U-Mat of the 29-feature data set is shown on the left side of Figure 6. The separation of clusters is much less obvious than that in 2-D space. Although the 29 features include the 2 features we used in the 2-D data set, there are many other features adding noises on the once clear split of the data. The consequence is somewhat mixed-up labels as shown in the labeled codebook plots on the right side of Figure 6.

Figure 6. Labeled Codebook Data with K-Means(3) and K-Means(4), 29-D Dataset

We also may want to use the same technique to visually compare the results from different clustering methods. Even with more rigorous measurements of clustering evaluation available, visualization remains a very powerful way for analysts (who may not necessarily be statisticians or data scientists) to gauge the performance of a variety of clustering methods. That being said, a rigorous validation of clustering is always encouraged whenever it is possible, but we won’t be going into the details of this today.

Figure 7 shows the results from two other clustering methods, KMedoid with K=4 and Fuzzy C-Means with C = 4.

Figure 7. Labeled Codebook Data with K-Medoid(4) and Fuzzy C-Means(4), 29-D Dataset

Lastly, I’d like to close this post with hierarchical clustering using the codebook data. When dealing with a very large amount of data, directly clustering might not be a feasible solution. In that case, vector quantization (VQ) will be a handy tool to reduce the data set. SOM’s are one kind of such VQ method. By training the weight vector of each cell in a map, some or all of the cells will resemble a portion of the original data set. The weight vectors associated with those cells are the codebook data. Clustering on the codebook data becomes much less computationally expensive because of the dramatic reduction in data set size.

Figure 8 shows the K-Means clustering (K=10) on the codebook data. K=10 is a sort of arbitrarily chosen large number. With the K-Means clustering result, we can do agglomerative hierarchical clustering.

Figure 8. Codebook Data Grouped into 10 Clusters Created with K-Means

Figure 9 shows the dendrograms of two hierarchical clustering results. The difference lies in the choice of how the distance between two clusters is computed. The x-axis of the dendrogram is the clusters being agglomerated, and the y-axis is the distance measure between two merged clusters. By choosing a threshold of cluster distance, one can cut off the linkage and identify a number of separate clusters. More clusters are generated as the threshold decreases.

Figure 9. Dendrograms of Hierarchical Clustering with Single Linkage (top) and Complete Linkage (bottom)

In summary, this post only highlights some of the ways for visualizing high-dimensional data and the clustering results. It certainly cannot cover everything related to multivariate clustering and visualization. We didn’t even mention projection methods, such as PCA (Principal Component Analysis), MDS (Multi-Dimensional Scaling), and Sammon Mapping. However, I hope that this post provides some interesting ideas for data enthusiasts on clustering and visualization, two of the techniques that are extremely useful in data science. Although the example data is taken from a cybersecurity context, the same practice can be successfully applied to other industries, such as credit risk, customer segmentation, biology, finance, and more.