Data Science for Security: Using Passive DNS Query Data to Analyze Malware
Most of the time, DNS services—which produce the human-friendly, easy-to-remember domain names that map to numerical IP addresses—are used for legitimate purposes. But they are also heavily used by hackers to route malicious software (or malware) to victim computers and build botnets to attack targets. In this post, I’ll demonstrate how data science techniques can be applied to passive DNS query data in order to identify and analyze malware.
A botnet is a network of hosts infected by malware and used to conduct nefarious activities, usually without the awareness of their owners. A command-and-control host hidden in the network communicates with the infected computers to give instructions and receive results. In such a botnet topology, the command-and-control host becomes a single point of failure. Once its IP address is identified, it can easily be blocked, and all communication with the botnet is lost. Therefore, hackers are more likely to use a domain name to identify the command-and-control host, and employ techniques like fast flux to switch the IP addresses mapped to a single domain name.
As data scientists at Endgame, we leverage data sets in large variety and volume to tackle botnets. While the data we analyze daily is often proprietary and confidential, there is a publicly available data set provided by Georgia Tech that documents DNS queries issued by malware across the years 2011 - 2014. The malware were contained in a controlled environment and had limited Internet access. Each and every domain name query was recorded, and if a domain name could be resolved, the corresponding IP address was also recorded.
This malware passive DNS data alone would not provide sufficient information to conduct a fully-fledged botnet analysis, but it does possess rich and valuable insights about malware behaviors in terms of DNS queries. I’ll explain how to identify malware based on this data set, using some of the methods the Endgame data science team employs daily.
Graphical Representation of DNS Queries
Here is the data set I’ll examine. Each row is a record of a DNS query, including the date, the MD5 of the malware file, the domain name being queried, and the IP address if the query resolved.
What approach might enable the grouping of malware or suspicious programs based on specific domain names? As we have no information about the malware, the conventional static analysis of malware focusing on investigating binary files would not be helpful here. Clustering using machine learning may work only if each domain name is treated as a feature, but the feature space will be very sparse. That would result in expensive computation.
Instead, we can represent the DNS queries as a graph showing which domain names each malware program is interested in, as displayed in Figure 1. Each malware program is labeled by an MD5 string. While Figure 1 only shows a very small part of the network, the entire data set could be transformed into one huge network.
Figure 1. A small DNS query network
There are numerous advantages to expressing the queries as a graph. First, it expedites querying complex relationships. A modern graph database, such as Neo4j, OrientDB, or Titan, can efficiently store a large graph and run join-heavy queries that are normally computationally expensive for relational databases such as MS SQL Server, Oracle, or MySQL. Second, network analytic methods from a diverse range of scientific fields can be employed to analyze the data set and gain additional insights.
Graph Analysis on the Malware Network
The entire passive DNS data set covers several years, so I randomly picked a day during the data collection period and will present the analysis on the reduced data set. A graph was created out of a day’s worth of data, and the nodes include both domain names and malware MD5 strings. In other words, a node in the graph can either be an MD5 string or a domain name, and an edge (or a connection) links an MD5 and a domain if the MD5 queries that domain name. The total number of nodes is 17,629, and the number of edges is 54,939, or about 3 edges per node. Because each edge touches two nodes, the average degree is about 6.
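The graph construction described above can be sketched with networkx. This is only an illustration under assumptions: the record tuples, the MD5 placeholders, the IP addresses, and the helper name build_query_graph are all hypothetical, not taken from the Georgia Tech data set.

```python
import networkx as nx

# Hypothetical sample records: (date, malware MD5, queried domain, resolved IP or None).
records = [
    ("2012-09-29", "md5_a", "xudunux.info", "192.0.2.10"),
    ("2012-09-29", "md5_a", "rylicap.info", None),
    ("2012-09-29", "md5_b", "rylicap.info", None),
    ("2012-09-29", "md5_c", "ns1.musicmixa.net", "192.0.2.20"),
]


def build_query_graph(rows):
    """Return an undirected graph linking each MD5 to the domains it queried."""
    graph = nx.Graph()
    for _date, md5, domain, _ip in rows:
        # Tag node names with their type so an MD5 and a domain can never collide.
        malware_node = ("malware", md5)
        domain_node = ("domain", domain)
        graph.add_node(malware_node, kind="malware")
        graph.add_node(domain_node, kind="domain")
        graph.add_edge(malware_node, domain_node)
    return graph


graph = build_query_graph(records)
n_nodes = graph.number_of_nodes()
n_edges = graph.number_of_edges()
# Each edge contributes to the degree of two nodes.
avg_degree = 2 * n_edges / n_nodes
print(n_nodes, n_edges, round(avg_degree, 2))
```

Repeated queries for the same MD5 and domain collapse into a single edge here, which matches the "queried or not" meaning of an edge in the text.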
In my graph representation of DNS queries, there are two distinct sets of nodes: domain names and malware. A node in one set only connects with a node in the other set, and not one in its own set. Graph theory defines such a network as a bipartite graph, as shown in Figure 2. I wanted to split the graph into two graphs, one containing all the nodes of domain names, and the other containing only malware programs. This can be done by projecting the large graph onto the two sets of nodes, which creates two graphs. In each graph, two nodes are connected by an edge if they have connections to the same node of the other type. For example, domains xudunux.info and rylicap.info would be connected in the domain graph because both of them have connections with the same malware in the larger graph.
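As a rough sketch of the projection step, networkx offers bipartite.projected_graph, which links two nodes of one set whenever they share a neighbor in the other set. The edge list and MD5 placeholders below are made up for illustration.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy edges: (malware MD5, queried domain).
edges = [
    ("md5_a", "xudunux.info"),
    ("md5_a", "rylicap.info"),
    ("md5_b", "rylicap.info"),
    ("md5_c", "ns1.musicmixa.net"),
]

malware_nodes = {md5 for md5, _ in edges}
domain_nodes = {domain for _, domain in edges}

graph = nx.Graph()
graph.add_nodes_from(malware_nodes, bipartite=0)
graph.add_nodes_from(domain_nodes, bipartite=1)
graph.add_edges_from(edges)

# Two malware nodes are linked if they share at least one queried domain.
malware_graph = bipartite.projected_graph(graph, malware_nodes)
# Two domains are linked if at least one malware program queried both.
domain_graph = bipartite.projected_graph(graph, domain_nodes)

print(sorted(malware_graph.edges()))
print(sorted(domain_graph.edges()))
```

In this toy data, xudunux.info and rylicap.info end up connected in the domain graph because md5_a queried both, mirroring the example in the text.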
Figure 2. Bipartite graph showing two distinct types of nodes
Let’s look at the graph of malware first. For the day 2012-09-29 alone, there are 9,876 unique malware programs recorded in the data set. First, I would like to know the topological layout of these malware programs and find out how many connected components exist in the malware graph.
A connected component is a subset of nodes in which any two nodes are connected to each other by one or more paths. We can view connected components (or just components) as islands with no bridges between them.
The Python programming language has an excellent network analysis package called networkx. It has a function, named number_connected_components, that computes the number of connected components of a graph. Running it shows there are 2,114 components in the 9,876-node graph, 1,619 of which are one-node components. Eleven components have more than 100 nodes within them. I will analyze those large components because the malware inside may be variants of the same program.
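A minimal sketch of the component count follows, using a tiny hypothetical malware graph in place of the real 9,876-node one. The node names and the size threshold are illustrative only; the post uses a threshold of 100 nodes.

```python
import networkx as nx

# Hypothetical toy malware graph standing in for the projected graph.
malware_graph = nx.Graph()
malware_graph.add_edges_from([("m1", "m2"), ("m2", "m3"), ("m4", "m5")])
malware_graph.add_node("m6")  # shares no domain with any other malware

n_components = nx.number_connected_components(malware_graph)

# Component sizes, largest first.
sizes = sorted(
    (len(nodes) for nodes in nx.connected_components(malware_graph)),
    reverse=True,
)
singletons = sum(1 for size in sizes if size == 1)

SIZE_THRESHOLD = 2  # illustrative; the analysis above keeps components over 100 nodes
large_components = [size for size in sizes if size > SIZE_THRESHOLD]

print(n_components, sizes, singletons, large_components)
```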
Figure 3 shows four components of the malware graph. The nodes in each component are densely connected to each other but not to any other components. That means the malware assigned to a component clearly possess some similar characteristics that are not shared by the malware from other components.
Figure 3. Four out of eleven components in the malware graph
Component 1 contains 201 nodes. I computed the betweenness centrality of the nodes in the graph, and the values are all zero, while the closeness centrality values of the nodes are all one. This indicates that each node has a direct connection with every other node in the component, meaning that every pair of malware programs in it queried at least one domain name in common. This is a strong indication that all 201 malware are variants of a certain type of malicious executable.
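The centrality pattern can be reproduced on a small complete graph. This sketch assumes a 5-node stand-in for the 201-node component, and adds a 3-node path graph purely as a contrast case.

```python
import networkx as nx

# A complete graph stands in for a component where every node links to every other.
component = nx.complete_graph(5)

betweenness = nx.betweenness_centrality(component)
closeness = nx.closeness_centrality(component)

# No shortest path needs an intermediate node, so betweenness is zero everywhere,
# and every node is one hop from all others, so closeness is one everywhere.
all_zero_betweenness = all(value == 0 for value in betweenness.values())
all_one_closeness = all(abs(value - 1.0) < 1e-9 for value in closeness.values())

# Contrast: in a path a-b-c, the middle node sits on the only a-c shortest path.
path = nx.path_graph(3)
path_betweenness = nx.betweenness_centrality(path)

print(all_zero_betweenness, all_one_closeness, path_betweenness[1] > 0)
```

Checking that nx.density(component) equals 1 is a cheaper way to test for the same fully connected structure. A further check on the bipartite graph, comparing each program's set of queried domains, would confirm how closely the domain sets match.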
Let’s return to the large DNS query graph to find out what domains the malware targeted. Using a graph database like Neo4j or OrientDB, or a graph analytic tool like networkx, the search is easy. The result shows that the malware in component 1 were only interested in three domain names: ns1.musicmixa.net, ns1.musicmixa.org, and ns1.musiczipz.com.
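With networkx, that lookup amounts to collecting the neighbors of the component's malware nodes in the bipartite query graph. The sketch below uses placeholder MD5 names and a hypothetical extra domain (other.example) for a malware program outside the component.

```python
import networkx as nx

# Hypothetical bipartite query graph: malware MD5 -> queried domain.
graph = nx.Graph()
graph.add_edges_from([
    ("md5_a", "ns1.musicmixa.net"),
    ("md5_a", "ns1.musicmixa.org"),
    ("md5_b", "ns1.musicmixa.net"),
    ("md5_b", "ns1.musiczipz.com"),
    ("md5_z", "other.example"),  # not part of the component of interest
])

component = {"md5_a", "md5_b"}  # malware nodes of one component

# Union of all domains queried by any malware program in the component.
queried_domains = set()
for md5 in component:
    queried_domains.update(graph.neighbors(md5))

print(sorted(queried_domains))
```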
I queried VirusTotal for each of the 201 malware in component 1. VirusTotal submits the MD5 to a list of scanning engines and returns the reports from those engines. Each report includes the engine’s determination of the MD5 as either positive or negative. If it’s positive, the report provides more information about what kind of malware the MD5 is, based on the signature that the scanning engine uses.
I assigned a score to each malware by computing the ratio of the number of positive results to the total number of results. The distribution of the scores is shown in Figure 4. The scanning reports imply that the malware is a Win32 Trojan.
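The score is a simple ratio, sketched below. The verdict lists are invented for illustration and do not come from VirusTotal, and the function name detection_ratio is hypothetical; fetching real reports is outside the scope of this sketch.

```python
def detection_ratio(verdicts):
    """Return positives / total for a list of boolean engine verdicts.

    Returns None when there are no reports, e.g. an MD5 unknown to the service.
    """
    if not verdicts:
        return None
    positives = sum(1 for verdict in verdicts if verdict)
    return positives / len(verdicts)


# Hypothetical per-engine verdicts (True = flagged as malicious).
reports = {
    "md5_a": [True, True, True, False],
    "md5_b": [True, False, False, False],
    "md5_c": [],  # no report available
}

scores = {md5: detection_ratio(verdicts) for md5, verdicts in reports.items()}
print(scores)
```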
Figure 4. Histogram of VirusTotal score of malware in Component 1
Using Social Network Analytics to Understand Unknowns
When I look at each of the components, not all of them have such a high level of homophily as component 1 does. A different component has 2,722 malware nodes and 681,060 edges. Of the 2,722 malware in this component, 309 were not known to VirusTotal, while the remaining 2,413 had reports on the website. We need a way to analyze those unknown malware.
Social network analytic (SNA) methods provide insights into unknown malware by identifying known malware that are similar to the unknowns. The first step is to try to break the large component into communities. The concept of community is easy to understand in the context of a social network. Connections within a community are usually much denser than those outside a community. Members of a community tend to share some common trait, such as mission, geo-location, or profession. In this analysis, malware were connected if they queried the same domain, which could be interpreted as two malware exhibiting a common interest in a domain name. Therefore, we can expect that malware programs that have queried similar domains represent a community. Communities exist inside a connected component and differ from components in that communities still have connections between each other.
Community detection is a particular kind of data clustering within the domain of machine learning. There is a wide variety of methods for community detection in a graph. The Louvain method is a well-known and well-performing one; it tries to optimize modularity by partitioning a graph into groups of densely connected nodes. By applying the Louvain method to the big component with 2,722 nodes, I identified 15 communities; the number of nodes within each community is shown in Figure 5.
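One way to try the Louvain method is the louvain_communities function that ships with recent networkx releases (2.8 and later, as far as I know); the post does not say which implementation was used, so treat this as an assumption. The toy graph below, two tightly knit groups joined by a single edge, is hypothetical.

```python
import itertools

import networkx as nx
from networkx.algorithms import community

# Hypothetical toy graph: two densely connected groups joined by one edge.
group_a = ["a1", "a2", "a3", "a4", "a5"]
group_b = ["b1", "b2", "b3", "b4", "b5"]

graph = nx.Graph()
graph.add_edges_from(itertools.combinations(group_a, 2))
graph.add_edges_from(itertools.combinations(group_b, 2))
graph.add_edge("a5", "b1")  # communities may still be connected to each other

# A fixed seed makes the randomized node ordering reproducible.
communities = community.louvain_communities(graph, seed=42)
modularity = community.modularity(graph, communities)

community_sizes = sorted(len(members) for members in communities)
print(community_sizes, round(modularity, 3))
```

Unlike connected components, the two groups here remain linked by the a5-b1 edge, yet the method still separates them because connections inside each group are much denser.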
Figure 5. Number of nodes in each community
Let’s take a specific malware as an example. The MD5 of this malware is 0398eff0ced2fe28d93daeec484feea6, and the search of it on VirusTotal found no result, as shown in Figure 6.
Figure 6. Malware not found on VirusTotal
I want to know what malware programs have the most similar behavior in terms of DNS queries to this unknown malware. By looking into the similar malware that we do have knowledge about, we could gain insights into the unknown one.
I found malware 0398eff0ced2fe28d93daeec484feea6 in Community 4, which has 256 malware within it. To find the most similar malware programs, we need a quantitative definition of similarity. I chose to use Jaccard index to compute just how similar two sets of queried domains are.
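Suppose malware M1 queried a set of domains D1, and malware M2 queried another set of domains D2. The Jaccard index of sets D1 and D2 is the number of domains the two sets share divided by the number of distinct domains in either set:
J(D1, D2) = |D1 ∩ D2| / |D1 ∪ D2|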
The Jaccard index goes from 0 to 1, with 1 indicating an exact match.
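Taking the Jaccard index as the size of the intersection of two sets divided by the size of their union, a small sketch looks like this. The domain names are placeholders, and returning 0.0 for two empty sets is a convention chosen here, not something the post specifies.

```python
def jaccard_index(domains_1, domains_2):
    """Return |D1 & D2| / |D1 | D2| for two collections of domain names."""
    set_1, set_2 = set(domains_1), set(domains_2)
    union = set_1 | set_2
    if not union:
        # Assumption: two malware with no queries at all are treated as dissimilar.
        return 0.0
    return len(set_1 & set_2) / len(union)


d1 = {"a.example", "b.example", "c.example"}
d2 = {"b.example", "c.example", "d.example"}

partial = jaccard_index(d1, d2)          # 2 shared out of 4 distinct domains
identical = jaccard_index(d1, d1)        # exact match
disjoint = jaccard_index(d1, {"z.example"})
print(partial, identical, disjoint)
```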
Out of the total 2,722 nodes in this large component, 100 malware programs have exactly the same domain queries as malware 0398eff. That means their Jaccard indices against malware 0398eff are 1. However, only 9 of them are known to VirusTotal. The 9 malware are shown below.
Each of the 100 malware programs, including the 9 known ones, that have the same domain queries as malware 0398eff appears in community 4. The histogram of Jaccard indices is shown in Figure 7.
Figure 7. Histogram of Jaccard index for nodes in community 4
We can tell from the histogram that the malware programs in community 4 can generally be split into two sets. One set contains the 100 malware that have exactly the same domain queries as malware 0398eff, and the other set has nodes that are much less similar to it. The graph visualization in Figure 8 demonstrates the split. By this analysis, we have found that the 91 previously unknown malware programs behave similarly to some known malware.
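The split can be sketched by ranking every other malware program in a community by its Jaccard index against the target and separating the exact matches from the rest. All names and domain sets below are hypothetical stand-ins for community 4.

```python
def jaccard_index(set_1, set_2):
    """Size of the intersection divided by the size of the union."""
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0.0


def rank_by_similarity(target, queries):
    """Return (md5, score) pairs for all other malware, most similar first."""
    target_domains = queries[target]
    scored = [
        (md5, jaccard_index(target_domains, domains))
        for md5, domains in queries.items()
        if md5 != target
    ]
    # Sort by descending score, then by name so ties are ordered deterministically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical domain sets per malware program.
queries = {
    "unknown": {"a.example", "b.example"},
    "known_1": {"a.example", "b.example"},
    "known_2": {"a.example", "b.example"},
    "other_1": {"a.example", "c.example", "d.example"},
    "other_2": {"e.example"},
}

ranked = rank_by_similarity("unknown", queries)
exact_matches = [md5 for md5, score in ranked if score == 1.0]
less_similar = [md5 for md5, score in ranked if score < 1.0]
print(ranked)
```

Any known malware among the exact matches can then be looked up to suggest what the unknown program might be.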
This blog post demonstrates how I used DNS query data to conduct graph-based network analysis of malware. A similar analysis can be done with the domain names to identify groups of domains that tend to be queried together by a malware program. This can help identify potentially malicious domains that were previously unknown.
Given the vast quantities of data those of us in the security world handle daily, data science techniques are an increasingly efficient and informative way to identify malware and targeted domains. While machine learning and clustering tend to dominate these kinds of analyses, graph methods based on social network analysis should increasingly become another tool in the data science toolbox for malware detection. Through the identification of communities, betweenness, and similarity scores, network analysis helps show not only connectivity, but also logical groupings and outliers within the network. Viewing malware and domains as a network provides another, more intuitive approach for wrangling the big data security environment. Given the limited features available in the passive DNS query data, graph analytic approaches supplement traditional static and dynamic approaches and elevate capabilities in malware analytics.