Hunting for Honeypot Attackers: A Data Scientist’s Adventure


The U.S. Office of Personnel Management (known as OPM) won the “Most Epic Fail” award at the 2015 Black Hat Conference for the worst known data breach in U.S. government history, with more than 22 million employee profiles compromised. Joining OPM as contenders for this award were other victims of high-profile cyber attacks, including Poland's Plus Bank. The truth is, hardly a day goes by without news of cyber intrusions. As an example, just in recent months PNI Digital Media and many retailers such as Wal-Mart and Rite-Aid had their photo services compromised, UCLA Health’s network was breached, and the information of 4.5 million people may have been exposed. Criminals and nation-state actors break into systems for many reasons, with catastrophic and often irremediable consequences for the victims.

Traditionally, security experts have been the main force for investigating cyber threats and breaches. Their expertise in computers and network communication gives them an advantage in identifying suspicious activities. However, as more data is collected, not only in quantity but also in variety, data scientists are beginning to play a more significant role in the adventure of hunting malicious attackers. At Endgame, the data science team works closely with the security and malware experts to monitor, track and identify cyber threats, and applies a wide range of data science tools to provide our customers with intelligence and insights. In this post, I’ll explain how we analyze attack data collected from a honeypot network, which provides insight into the locations of the attackers behind those activities. The analysis picks out organized attacks from a vast number of seemingly separate attempts.

This post is divided into three sections. The first section describes the context of the analysis and provides an overview of the hacking activities. The second section focuses on investigating the files that the attackers implanted into the breached systems. Finally, the third section demonstrates how I identified similar attacks through uncovering behavioral characteristics. All of this demonstrates one way that data science can be applied to the security domain. (My previous post explained another application of data science to security.)


Cyber attackers are constantly looking for targets on the Internet. Much like a lion pursuing its prey, an attacker usually conducts a sequence of actions, known as the cyber kill chain, including identifying the footprints of a victim system, scanning the open ports of the system, and probing the holes trying to find an entrance into the system. Professional attackers might be doing this all day long until they find a weak system.

All of this would be bad news for any weak system the attacker finds, unless that weak system is a honeypot. A honeypot is a trap set up on the Internet with minimal security settings so that an attacker can easily break into it without knowing that his/her activities are being monitored and tracked. Though honeypots have been used widely by researchers to study the methods of attackers, they can also be very useful to defenders. Compared to sophisticated anomaly detection techniques, honeypots provide intrusion alerts with low false positive rates because no legitimate user should be accessing them. Honeypots set up by a company might also be used to confuse attackers and slow down attacks against its network. New techniques are on the way to make setting up and managing honeypots easier and more efficient, so honeypots may play an increasingly prominent role in future cyber defense.

A network of honeypots is called a honeynet. The honeynet I have data for logged activities showing that attackers enumerated pairs of common user names and passwords to enter the system, downloaded malicious files from their own hosting servers, changed the permissions on the files, and then executed them. From March 2015 through the end of June 2015, more than 21,000 attacker IP addresses were detected and about 36 million SSH attempts were logged. Attackers tried 34,000 unique user names and almost 1 million unique passwords to break into those honeypots. That’s a lot of effort to break into the system. Over time, the honeynet has identified about 500 malicious domains and more than 1,000 unique malware samples.

The IP addresses that were owned by the attackers and used to host malware are geographically dispersed. Figure 1 shows that the recorded attacks mostly came from China, the U.S., the Middle East and Europe. While geographic origin doesn’t tell us everything, it still gives us a general idea of potential attacker locations.

Figure 1. Attacks came from all around the world, color-coded by attack count. The darker the color, the greater the number of attacks originating from that country.

The frequency of attacks varies daily, as shown in Figure 2, but the trend shows that more attacks were observed on workdays than on weekends, and peaks often appear on Wednesday or Thursday. This seems to support the suspicion that humans (rather than bots) were behind the scenes, and that professionals rather than amateur hobbyists conducted the attacks.

Figure 2. Daily Attack Counts.
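As a rough sketch of how the weekday pattern could be checked, the snippet below averages daily attack counts separately for weekdays and weekends. The function name and the sample dates are made up for illustration, and days with no attacks at all are simply absent from the averages, which a real analysis would need to account for.

```python
from collections import Counter
from datetime import date


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def weekday_weekend_averages(attack_dates):
    """Return (average attacks per weekday, average attacks per weekend day)."""
    daily = Counter(attack_dates)  # number of attacks per calendar day
    # date.weekday(): Monday is 0 ... Sunday is 6
    weekday_counts = [n for day, n in daily.items() if day.weekday() < 5]
    weekend_counts = [n for day, n in daily.items() if day.weekday() >= 5]
    return mean(weekday_counts), mean(weekend_counts)


# Hypothetical attack log: one entry per observed attack.
attack_dates = [
    date(2015, 3, 25), date(2015, 3, 25), date(2015, 3, 25),  # Wednesday
    date(2015, 3, 26), date(2015, 3, 26),                     # Thursday
    date(2015, 3, 28),                                        # Saturday
]
weekday_avg, weekend_avg = weekday_weekend_averages(attack_dates)
```

With real data, a weekday average clearly above the weekend average would match the trend visible in Figure 2.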

Now that we understand where and when those attacks were orchestrated, we want to understand whether any of the attacks were organized. In other words, were they carried out by the same person or the same group of people over and over again?

Attackers change IP addresses from attack to attack, so looking at the IP addresses alone won’t provide us with much information. To find the answer to the question above, we need to use the knowledge about the files left by the attackers. 

File Similarity

Malware to an attacker is like a hammer and level to a carpenter. We expect that an attacker would use his/her set of malware repeatedly in different attacks, even though the files might appear under different names or as variants. Therefore, the similarity across the downloaded malware files may provide informative links between associated attacks.

One extreme case is a group of 17 different IPs (shown in Figure 3), used on a variety of days, that hosted exactly the same files and folders organized in exactly the same structure. That finding immediately portrayed a lazy hacker who used the same folder time and time again. However, we would imagine that most attackers are more diligent. For example, file structures on the hosting server may differ, folders could be rearranged, and the content of a malicious binary file may be tweaked. Therefore, a more robust method is needed to calculate the level of similarity across the files, and then use that information to associate similar attacks.

Figure 3. 17 IPs have exactly the same file structure.

How can we quantitatively and algorithmically do this?

The first step is to find, for each file we are interested in, the files that are similar to it. The collected files include different types, such as images, HTML pages, text files, compressed tarballs, and binary files, but we are probably only interested in binary files and tarballs, which are riskier. This reduces the number of files to work on, but the same approach can be applied to all file types.

File similarity computation has been researched extensively in the past two decades but still remains a rich field for new methods. Some mature algorithms to compute file similarities include block-based hashing, Context-Triggered Piecewise (CTP) hashing (also known as fuzzy hashing), and Bloom filter hashing. Endgame uses more advanced file similarity techniques based on file structural and behavioral attributes. However, for this investigation I used fuzzy hashing to compute file similarities for simplicity and since open source code is widely available.

I took each unique file, as identified by its fuzzy hash string, and computed its similarity to all the other files. The result is a large symmetric similarity matrix for all files, which we can visualize to check whether there are any apparent structures in the similarity data. The way I visualize the matrix is to connect two similar files with a line. I chose an arbitrary threshold of 80, which means that two files are connected if they are more than 80% similar. The visualization of the file similarity matrix is shown in Figure 4.

Figure 4. Graph of files based on similarity.
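The thresholding step described above can be sketched roughly as follows. This is not the code behind Figure 4. To stay self-contained it swaps in `difflib.SequenceMatcher` on the digest strings as a stand-in for a real fuzzy-hash comparison, where a CTPH tool such as ssdeep would normally supply the 0 to 100 score. The function names and the digest strings are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(digest_a, digest_b):
    """Stand-in score in 0..100. NOT a real fuzzy-hash comparison."""
    return round(100 * SequenceMatcher(None, digest_a, digest_b).ratio())


def similarity_matrix(digests):
    """Build the symmetric file-by-file similarity matrix."""
    n = len(digests)
    matrix = [[100 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = similarity(digests[i], digests[j])
    return matrix


def similar_pairs(matrix, threshold=80):
    """Return index pairs of files that are more than `threshold` percent similar."""
    n = len(matrix)
    return [(i, j) for i, j in combinations(range(n), 2) if matrix[i][j] > threshold]


# Made-up digests: the first two differ in a single character.
digests = [
    "96:abcDEFghiJKLmnoPQRstuVWXyz012345",
    "96:abcDEFghiJKLmnoPQRstuVWXyz012346",
    "48:ZZZZyyyyXXXXwwwwVVVVuuuuTTTT",
]
matrix = similarity_matrix(digests)
edges = similar_pairs(matrix)  # each pair becomes one line in the graph
```

Each pair in `edges` corresponds to one line drawn between two files. Raising or lowering `threshold` trades off how tightly the groups are defined against how many files end up connected.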

It is visually clear that the files are indeed partitioned into a number of groups. Let’s zoom into one group and see the details in Figure 5. The five files, represented by their fuzzy hash strings, are connected to each other, with mutual similarity of over 90%. If we look at them very carefully, the strings differ in only one or two letters, even though the files have totally different names and MD5 hashes. VirusTotal recognizes four of the five files as malware, and the scan reports indicate that they are Linux Trojans.

Figure 5. One group of similar files.

Identifying Similar Attacks

Now that we have identified the groups of similar files, it’s time to identify the attacks that used similar malware. If I treat each attack as a document, and the malware used in an attack as words, I can construct a document-term matrix to encapsulate all the attack information. To incorporate the malware similarity information into the matrix, I tweaked the matrix a bit. If a piece of malware was not used in a specific attack but is similar to some malware that was used, its cell for that attack takes the value of that similarity level. For example, if malware M1 was not used in attack A1, but M1 is most similar to malware M2, which was used in attack A1, and the similarity level is 90%, then the element at cell (A1, M1) will be 0.9, while (A1, M2) will be 1.0.
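To make the construction concrete, here is a minimal sketch of building such a matrix. It assumes one reading of the rule above: a malware sample that was not used in an attack gets its highest similarity to any sample that was used, and 0.0 if it resembles none of them. The function name, data layout, and toy values are hypothetical.

```python
def attack_malware_matrix(attacks, malware_ids, file_similarity):
    """
    attacks:         dict mapping attack id -> set of malware ids used in it
    malware_ids:     ordered list of all malware ids (the matrix columns)
    file_similarity: dict mapping frozenset({m1, m2}) -> similarity in 0..1
    Returns a dict mapping attack id -> row of values, one per malware id.
    """
    def sim(a, b):
        return file_similarity.get(frozenset((a, b)), 0.0)

    matrix = {}
    for attack_id, used in attacks.items():
        row = []
        for malware in malware_ids:
            if malware in used:
                row.append(1.0)
            else:
                # Assumption: inherit the best similarity to any malware used.
                row.append(max((sim(malware, u) for u in used), default=0.0))
        matrix[attack_id] = row
    return matrix


# Toy data mirroring the M1/M2 example in the text.
attacks = {"A1": {"M2"}, "A2": {"M3"}}
malware_ids = ["M1", "M2", "M3"]
file_similarity = {frozenset(("M1", "M2")): 0.9}
matrix = attack_malware_matrix(attacks, malware_ids, file_similarity)
```

Here the row for A1 carries 0.9 in the M1 column and 1.0 in the M2 column, as in the example. A real pipeline might also zero out similarities below some cutoff, which this sketch does not do.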

For readers who are familiar with NLP (Natural Language Processing) and text mining, the matrix I’ve described above is similar to a document-term matrix, except the values are not computed from TF-IDFs (Term Frequency-Inverse Document Frequency). More on applications of NLP to malware analysis can be found in a post published by my fellow Endgamer Bobby Filar. The essence of such a matrix is to reflect the relationship between data records and features. In this case, data records are attacks and features are malware, while for NLP they are documents and words.

The resulting matrix is an attack-malware matrix, which has more than 400 columns representing malware hashes. To get a quick idea of how the attacks (the rows) are dispersed in such a high dimensional space, I plotted the data using the t-SNE (t-Distributed Stochastic Neighbor Embedding) technique and colored the points according to the results of K-means (K=10) clustering. I chose K=10 arbitrarily to illustrate the spatial segmentation of the attacks. The t-SNE graph is shown in Figure 6, and each color represents a cluster labeled by the K-means clustering. t-SNE tries to preserve the topology when projecting data points from a high dimensional space to a much lower dimensional space, and it is widely used for visualizing the clusters within a data set.

Figure 6 shows that K-means did a decent job of spatially grouping close data points into clusters, but it fell short of providing a quantitative measurement of similarity between any two data points. It is also quite difficult to choose the optimum value for K, the number of clusters. To overcome the challenges that K-means faces, I use Latent Semantic Indexing (LSI) to compute the similarity level for attack pairs, build a graph that connects similar attacks, and eventually apply social network analytics to determine the clusters of similar attacks.

Figure 6. t-SNE projection of Attack-Malware matrix to 2-D space.

LSI is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a document-term matrix. SVD projects the original n-dimensional space (with n words in columns) onto a k-dimensional space, where k is much smaller than n. The projection transforms a document’s vector in n-dimensional space into a vector in the reduced k-dimensional space, under the requirement that the Euclidean (Frobenius) distance between the original matrix and its rank-k approximation is minimized.

SVD decomposes the attack-malware matrix into three matrices, one of which defines the new dimensions in order of significance. We call the new dimensions principal components, and they are ordered by the amount of variance they explain in the original data. Expressing each attack in terms of these components gives what I’ll call the attack-component matrix. At the risk of losing some information, we can plot the attack data points in 2-D space using the first and second components, just to illustrate the differences between data points, as shown in Figure 7. Vectors pointing in perpendicular directions are the most different from each other.

Figure 7. Attack data projected to the first and second principal components.

The similarity between attacks can be computed from the results of LSI, more specifically, by calculating the dot product between pairs of attack vectors (rows) in the attack-component matrix.
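The LSI step can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the production code: it keeps the top `k` components (a hypothetical parameter), scales the left singular vectors by the singular values to form the attack-component matrix, and normalizes each row so that the dot product becomes a cosine similarity. The toy matrix is made up.

```python
import numpy as np


def lsi_attack_similarity(matrix, k=2):
    """Cosine similarity between attacks in a k-dimensional LSI space."""
    a = np.asarray(matrix, dtype=float)
    # Thin SVD: a = u @ diag(s) @ vt, with singular values in descending order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Attack-component matrix: each row is one attack in the reduced space.
    attack_component = u[:, :k] * s[:k]
    # Assumption: normalize rows so dot products are cosine similarities.
    norms = np.linalg.norm(attack_component, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = attack_component / norms
    return unit @ unit.T


# Toy attack-malware matrix: attacks 0 and 1 share malware, attack 2 does not.
toy = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
]
sim = lsi_attack_similarity(toy, k=2)
```

In this toy case attacks 0 and 1 come out nearly identical while attack 2 is close to perpendicular to both, which is the kind of contrast Figure 7 illustrates.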

Table 1. Attacks Similar to Attack from on 2015-03-23.

I connect two attacks if their similarity is above a certain threshold, e.g. 90%, and come up with a graph of connected attacks, shown in Figure 8.


Figure 8. Visualization of attacks connected by similarity.

There are a few large connected components (subgraphs) in the graph. Each component represents a group of attacks that closely resemble each other. We can examine each of them in terms of what malware was deployed in the given attack group, what IP addresses were used, and how frequently the attacks were conducted.
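One simple way to extract such groups is to take the connected components of the thresholded similarity graph, sketched below with a breadth-first search. The post does not say which network analysis algorithm was used, so treat this as the most basic option. The function name and the toy similarity values are made up.

```python
from collections import defaultdict, deque


def attack_groups(similarity, threshold=0.9):
    """Link attacks whose similarity exceeds `threshold`; return the components."""
    n = len(similarity)
    neighbors = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            group.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    # Largest groups first, since those are the most interesting to inspect.
    return sorted(groups, key=len, reverse=True)


# Toy attack-to-attack similarity matrix.
similarity = [
    [1.00, 0.95, 0.10, 0.92],
    [0.95, 1.00, 0.20, 0.50],
    [0.10, 0.20, 1.00, 0.30],
    [0.92, 0.50, 0.30, 1.00],
]
groups = attack_groups(similarity)
```

Note that attacks 1 and 3 land in the same group even though they are not directly similar, because both are linked to attack 0. Stricter community detection methods would be an alternative if such chaining is undesirable.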

I plotted the daily attack counts for the two largest attack groups in Figure 9 and Figure 10. Both show that attacks happened more often on weekdays than on weekends. These attacks may have targeted honeypots in different geographic locations and could be viewed as a widely expanded search for victims.

Figure 9. Daily counts of attack in one group.

Figure 10. Daily counts of attack in another group.

We can easily find out where those attackers’ IPs were located (latitude and longitude), as well as the WHOIS data associated with the IPs. But it’s much more difficult to fully investigate the true identity of the attackers.


In this post, I explained how to apply data science techniques to identify honeypot attackers. Mathematically, I framed the problem as an Attack-Malware matrix, and used fuzzy hashing to represent files and compute the similarity between files. I then employed latent semantic indexing methods to calculate the similarity between attacks based on file similarity values. Finally, I constructed a network graph where similar attacks are linked so that I could apply social network analytics to cluster the attacks.

As with my last blog post, this post demonstrates that data science can provide a rich set of tools that help security experts make sense of the vast amount of data often seen in cyber security and discover relevant information. Our data science team at Endgame is constantly researching and developing more effective approaches to help our customers defend themselves – because the hunt for attackers never ends.