Hunting on Networks, Part 2: Higher-Order Patterns
In the first part of the Hunting on the Cheap series, I discussed the importance of passive DNS in an adversary hunting toolkit. I detailed how an organization can set up sensors to collect passive DNS data, as well as some of the options for handling this data. With that foundation in place, the next step is to examine the collected data for patterns and signals of maliciousness that, with a relatively low false positive rate, give the hunter starting points for digging deeper and identifying unknown threats. A focus on these outliers and other patterns is important because adversaries can easily change their attack infrastructure and render most network IOCs useless.
In this second post in our Hunting on the Cheap series, I will go through some of these signals and discuss how they can be applied to passive DNS data to hunt for unknown malicious adversaries in your network.
Fast flux is a technique used most frequently for malicious purposes by botnets. Normally, a fully qualified domain name (FQDN) resolves to the same address space for a relatively long period of time. With fast flux, an FQDN serving as a command and control server resolves to a large number of IPs over time, with addresses swapping in and out at high frequency. This adds resilience against IP-based block lists, because blocking a given IP is only effective for the very short window during which the FQDN resolves to that IP.
This pattern isn’t malicious by itself, since it has benign uses as well: a domain that receives a large amount of traffic may also resolve to a large number of IPs. Typically, however, benign domains resolve to an IP space that is homogeneous in ownership, address block, or geography, while malicious domains show greater heterogeneity in each of these. This is the first higher-order pattern: ‘domains that resolve to a large number of IPs and those IPs are diverse both by ownership and geography’. For example, looking for this pattern in our sample data revealed a number of domains that VirusTotal confirms are indeed malicious.
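As a rough illustration of this first pattern, the sketch below counts distinct IPs, owners, and countries per FQDN. It assumes each passive DNS record has already been enriched into a `(fqdn, ip, asn, country)` tuple; that schema, the function name `flux_candidates`, and the thresholds are all hypothetical and would need tuning against your own data.

```python
from collections import defaultdict

def flux_candidates(records, min_ips=10, min_asns=3, min_countries=3):
    """Return FQDNs whose resolved IPs are both numerous and diverse.

    records: iterable of (fqdn, ip, asn, country) tuples. This schema is an
    assumption; ASN stands in for "ownership" and country for "geography".
    """
    ips = defaultdict(set)
    asns = defaultdict(set)
    countries = defaultdict(set)
    for fqdn, ip, asn, country in records:
        ips[fqdn].add(ip)
        asns[fqdn].add(asn)
        countries[fqdn].add(country)
    # Keep only domains that clear every threshold at once.
    return sorted(
        fqdn for fqdn in ips
        if len(ips[fqdn]) >= min_ips
        and len(asns[fqdn]) >= min_asns
        and len(countries[fqdn]) >= min_countries
    )
```

Requiring diversity in addition to volume is what keeps a busy but benign domain, whose many IPs usually sit with one owner, out of the results.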
Domain Generation Algorithm
Domain Generation Algorithm (DGA) malware uses an algorithm to randomly generate thousands of domains daily and attempts to connect to them to receive communications from a controller. Botnet masters register a (usually small) subset of those domains per day to keep the botnet going, knowing that the malware will eventually attempt to resolve a registered domain. A well-known and effective way to stop DGA malware is to predict and register all possible domains before the botnet controller registers them. This requires reverse engineering many malware samples, which can be tedious. It is also difficult to remain current given the new families of malware and their constantly updated versions. So how do we determine whether a given domain in passive DNS data was generated by a DGA without generating a complete list of possible DGA domains for all possible algorithms, which would be an extremely difficult task?
Fortunately, algorithmically generated domains have structural properties that are different from benign domains. Benign domains are generally chosen because they are easy to remember or reflect common words across a variety of languages. That is our next higher-order pattern: ‘domains with abnormal lexicographical structure’.
One fairly accurate approach to detecting DGA domains is to extract features such as consonant-to-vowel ratio, longest consonant sequence, entropy, and n-grams shared with dictionary words, and then feed them to a random forest classifier.
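To make some of these features concrete, here is a minimal sketch of the feature extraction step only. It is not the classifier referenced in this post: the function names are hypothetical, it assumes the input is a registered domain such as `example.com` (only the leftmost label is scored), and the dictionary n-gram feature is left out.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(label):
    """Character-level Shannon entropy of a string, in bits."""
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def longest_consonant_run(label):
    """Length of the longest run of consecutive consonant letters."""
    longest = current = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def lexical_features(domain):
    """Feature dict for the leftmost label of a domain (an assumed choice)."""
    label = domain.lower().split(".")[0]
    letters = [ch for ch in label if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    consonants = len(letters) - vowels
    return {
        "length": len(label),
        "entropy": shannon_entropy(label) if label else 0.0,
        # max(..., 1) avoids dividing by zero for vowel-free labels.
        "consonant_vowel_ratio": consonants / max(vowels, 1),
        "longest_consonant_run": longest_consonant_run(label),
    }
```

Feature dictionaries like these, computed over labeled benign and DGA domains, are the kind of input a random forest would be trained on.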
This data science approach to DGA detection is non-trivial to implement. We have provided code which can be used for detection. This specific classifier detects lexicographical structures that are abnormal relative to common English words. Similar approaches can be followed to cover other languages and reduce the false positive rate.
While block lists are appropriate for hunting a given fast flux botnet, they are not the appropriate technique for hunting DGA domains in general. Because of the sheer number of domains per day per malware family, in addition to the rapidly changing malware samples, static analysis is inefficient and less effective for hunting DGA domains. However, a number of data science techniques, such as random forest classification, are very well suited to hunting DGA domains.
DGA domains sometimes include English words to fool DGA classifiers that rely on the lexicographical properties of the domain. An example of such a DGA family is Nivdort. However, DGA malware leaves behind another signal that is much harder to conceal. Since the malware generates thousands of domains and only a few of them resolve to actual hosts, the majority of its DNS queries return an error (RCODE 3) indicating a non-existent domain, or NXDOMAIN. Normally, we see NXDOMAINs due to typos, copy-paste errors, browser prefetch of malformed HTML, etc. at a rate of less than 5% of DNS queries. On machines infected by DGA malware, this rate surges. ‘A higher than normal rate of NXDOMAIN errors’ is the next higher-order pattern. Estimating the percentage of NXDOMAINs is really powerful since it catches all sorts of DGA malware families even if they evade our DGA classifier.
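A per-client NXDOMAIN rate can be sketched in a few lines. The example assumes the passive DNS responses are available as `(client_ip, rcode)` tuples, which is a hypothetical schema; the 5% threshold comes from the baseline mentioned above, and `min_queries` is an assumed guard against flagging clients with too few queries to judge.

```python
from collections import Counter

NXDOMAIN = 3  # DNS response code for a non-existent domain

def nxdomain_rates(responses, min_queries=50):
    """Map each client to its share of responses that were NXDOMAIN.

    responses: iterable of (client_ip, rcode) tuples (assumed schema).
    Clients with fewer than min_queries responses are skipped.
    """
    totals = Counter()
    failures = Counter()
    for client, rcode in responses:
        totals[client] += 1
        if rcode == NXDOMAIN:
            failures[client] += 1
    return {c: failures[c] / n for c, n in totals.items() if n >= min_queries}

def suspicious_clients(responses, threshold=0.05, min_queries=50):
    """Clients whose NXDOMAIN rate exceeds the threshold."""
    rates = nxdomain_rates(responses, min_queries)
    return sorted(c for c, rate in rates.items() if rate > threshold)
```

In practice the threshold is better derived from your own network's baseline than taken as a fixed number.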
Recent phishing campaigns often rely on a small typo of a domain, or use a brand name to make a domain look genuine. In the first case, the phishing domains are slightly modified versions of the real domain that still retain some resemblance to it. This is the next higher-order pattern to hunt: ‘DNS queries for domains that are slightly modified versions of a popular domain’. Edit distance, or Levenshtein distance, can help measure the level of modification between two domains. The edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. Each DNS query can be analyzed for its edit distance from popular domains. A potential phishing attempt may exhibit a low edit distance from a popular domain, especially when the two domains have different registrants.
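The edit-distance check can be sketched with the standard dynamic programming formulation of Levenshtein distance. The helper `lookalikes`, its `max_distance` default, and the idea of passing in a list of popular domains are assumptions for illustration; the registrant comparison is not covered here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def lookalikes(query, popular_domains, max_distance=2):
    """Popular domains that the queried domain nearly, but not exactly, matches."""
    hits = []
    for popular in popular_domains:
        distance = edit_distance(query, popular)
        if 0 < distance <= max_distance:
            hits.append((popular, distance))
    return hits
```

Excluding a distance of zero matters: an exact match is the popular domain itself, not an imitation of it.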
In other cases, a phishing domain contains a familiar brand name to appear genuine. This is another higher-order pattern: ‘DNS queries for domains that contain a popular brand name’. A suffix tree of popular domains and brand names can perform at scale, matching the longest common substring against each DNS query. Once identified, it is important to validate such outliers by checking the WHOIS records.
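For small brand lists, the idea can be illustrated without a suffix tree. The sketch below swaps in a plain substring scan, which is simpler but will not scale the way a suffix tree does; the function name and the notion of a `legitimate_domains` allow list are assumptions added for the example.

```python
def brand_hits(query, brands, legitimate_domains):
    """Brand names embedded in a queried domain the brand does not own.

    brands: brand strings such as "paypal" (assumed input).
    legitimate_domains: domains known to belong to those brands; queries for
    them or their subdomains are skipped.
    """
    name = query.lower().rstrip(".")
    for legit in legitimate_domains:
        if name == legit or name.endswith("." + legit):
            return []
    # Naive substring scan; a suffix tree would replace this at scale.
    return sorted(brand for brand in brands if brand in name)
```

Any hit is only a lead: short brand names will match unrelated domains, which is why WHOIS validation remains a necessary follow-up.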
DIY Outlier Detection
There are many additional patterns which could be noted in passive DNS data to drive your hunt. A fundamental principle of hunting is looking across your dataset and identifying outliers in that data. Doing this over time is an effective way to perform outlier analysis for network data. This boils down to creating additional hunting techniques using the following steps:
- Select one or more features or characteristics of DNS traffic.
- Discover the normal range or set of values for that particular feature(s).
- Find records where the feature deviates considerably from the normal.
Let’s take query type as an example. First, we discover the distribution of queries by query type. Say we observe that 93% of queries are for A records, 6% for NS records, and 1% for MX records. If we suddenly observe a much higher rate of MX queries, we have an outlier that we should investigate. This is the last higher-order pattern: ‘features or characteristics that deviate from the normal distribution of the data’. In this case, it could indicate a malware infection that sends spam. Similarly, if we take a distribution of queries by TLD and find a large number of queries to a TLD outside of that distribution, we have an outlier that warrants further analysis.
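The three steps can be sketched generically: build a baseline distribution for a feature, build the same distribution for a later window, and flag values whose share has grown noticeably. The function names and the `min_increase` threshold are assumptions; any categorical feature (query type, TLD, and so on) can be passed in.

```python
from collections import Counter

def distribution(values):
    """Share of each distinct value, e.g. of each DNS query type."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def outliers(baseline, observed, min_increase=0.05):
    """Values whose observed share exceeds the baseline by more than min_increase.

    Returns {value: (baseline_share, observed_share)}. Values never seen in
    the baseline are treated as having a baseline share of zero.
    """
    flagged = {}
    for value, share in observed.items():
        base = baseline.get(value, 0.0)
        if share - base > min_increase:
            flagged[value] = (base, share)
    return flagged
```

With the 93/6/1 baseline from the example, a window in which MX queries jump to a quarter of all traffic would flag only `MX`. An absolute-difference threshold is the simplest choice; a ratio or a statistical test may suit rare values better.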
Note that, as with any outlier detection, you will have false positives. Part of the hunting process is understanding what is normal for your organization and incorporating that information into your analytic process.
Adversary hunting using passive DNS can be a rewarding experience, both in terms of understanding the network and its peculiarities, and for finding targeted threats such as APTs that evade the usual IOC-based searches. There are many known patterns and signals that indicate malicious behavior. These higher-order patterns provide a heuristic for anyone new to hunting on networks. The following are great places to start:
- Domains that resolve to a large number of IPs and those IPs are diverse both by ownership and geography
- Domains with abnormal lexicographical structure
- A higher than normal rate of NXDOMAIN errors
- DNS queries for domains that are slightly modified versions of a popular domain
- DNS queries for domains that contain a popular brand name
- Features or characteristics that deviate from the normal distribution of the data
Other patterns exist as well. More manual, quantitative analysis can also identify outliers based on known normal behavior, such as the usual mix of query types. Together, these are solid, open source first steps to begin hunting within networks.
The network is not the only place to hunt. In fact, a richer set of data is available on your endpoints to feed hunting operations. In our subsequent and final post on hunting on the cheap, we’ll address hunting on hosts.