# MinHash vs. Bitwise Set Hashing: Jaccard Similarity Showdown

As demonstrated in an earlier post, establishing relationships (between files, executable behaviors, or network packets, for example) is a key objective of researchers when automating the hunt. But the scale of information security data presents a challenge when pairwise similarity is measured naïvely. Let’s take a look at two prominent methods used in information security to estimate Jaccard similarity at scale, and compare their strengths and weaknesses. Everyone loves a good head-to-head matchup, right?

Jaccard distance is a metric^{1} that measures the dissimilarity of two sets, *A* and *B*, by

$$J_d(A,B) = 1 - J_s(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|},$$

where *J*_{s} denotes the Jaccard similarity, bounded on the interval [0,1]. Jaccard similarity has proven useful in applications such as malware nearest-neighbor search, clustering, and code reuse detection. In such cases, the sets *A* and *B* might contain imported functions, byte or mnemonic n-grams, or behavioral properties observed in dynamic analysis of each file.
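For reference, here is a minimal sketch of the exact (naïve) computation that the estimators in this post try to avoid. The function names are mine, and treating two empty sets as identical is an assumption rather than something the definition dictates.

```go
package main

// JaccardSimilarity returns |A ∩ B| / |A ∪ B| for two string sets.
// Duplicate entries in the input slices are ignored.
func JaccardSimilarity(a, b []string) float64 {
	setA := make(map[string]struct{}, len(a))
	for _, s := range a {
		setA[s] = struct{}{}
	}
	setB := make(map[string]struct{}, len(b))
	for _, s := range b {
		setB[s] = struct{}{}
	}
	intersection := 0
	for s := range setA {
		if _, ok := setB[s]; ok {
			intersection++
		}
	}
	union := len(setA) + len(setB) - intersection
	if union == 0 {
		return 1.0 // assumption: two empty sets are treated as identical
	}
	return float64(intersection) / float64(union)
}

// JaccardDistance returns 1 - JaccardSimilarity.
func JaccardDistance(a, b []string) float64 {
	return 1.0 - JaccardSimilarity(a, b)
}
```

For two import lists that share two of four distinct function names, `JaccardSimilarity` returns 0.5. The cost is linear in the set sizes per pair, which is what becomes painful across many samples and many feature sets.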

Since each datapoint (e.g., malware sample) often consists of many feature sets (e.g., imports, exports, strings, etc.) and each set can itself contain many elements, naïve computation of Jaccard similarity can be computationally expensive. Instead, it’s customary to leverage efficient descriptions of the sets *A* and *B* together with a fast comparison mechanism to compute *J*_{d}(*A*,*B*) or *J*_{s}(*A*,*B*). Minwise Hashing (MinHash) and bitwise set hashing are two methods to estimate Jaccard similarity. Bitwise set hashing will be referred to in this blog post as BitShred since it is used as the core similarity estimator in the BitShred system proposed for large-scale malware triage and similarity detection.

First, the key ideas behind MinHash and BitShred will be reviewed, with a few observations about each estimator. Then, these two methods will be compared experimentally on supervised and unsupervised machine learning tasks in information security.

*MinHash*

MinHash approximates a set with a random sampling (with replacement) of its elements. A hash function *h*(**a**) is used to map any element **a** from set *A* to a distinct integer, which mimics (but, with consistency) a draw from a uniform distribution. For any two sets *A* and *B*, Jaccard similarity can be expressed in terms of the probability of hash collisions:

$$J_s(A,B) = \Pr\left[\min_{\mathbf{a} \in A} h(\mathbf{a}) = \min_{\mathbf{b} \in B} h(\mathbf{b})\right],$$

where the min operator acts as the random sampling mechanism. Approximating the probability by a single MinHash comparison of *A* and *B* yields an unbiased estimator, but one with quite large variance: the value is either exactly 1 or 0. To reduce the variance, MinHash averages over *m* trials to produce an unbiased estimator with variance *O*(1/*m*).
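The averaging over *m* trials can be sketched as follows, before any 1-bit trick is applied. This is only an illustration: `seededHash` is my own stand-in for a family of hash functions (FNV-1a plus a mixing step, indexed by a seed), and the other names are hypothetical as well.

```go
package main

import (
	"hash/fnv"
	"math"
)

// seededHash is an assumed stand-in for the seeded hash h(a): FNV-1a of the
// string, combined with the seed and scrambled by a 64-bit mixing step.
func seededHash(s string, seed uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64() ^ (seed * 0x9e3779b97f4a7c15)
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// MinHashSignature returns m minimum hash values, one per seeded hash function.
func MinHashSignature(set []string, m int) []uint64 {
	sig := make([]uint64, m)
	for i := range sig {
		sig[i] = math.MaxUint64
		for _, s := range set {
			if v := seededHash(s, uint64(i)); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// EstimateJaccard averages the m single-trial collision indicators
// (1 if the two minima agree, 0 otherwise). Both signatures must have length m > 0.
func EstimateJaccard(sigA, sigB []uint64) float64 {
	matches := 0
	for i := range sigA {
		if sigA[i] == sigB[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(sigA))
}
```

Each trial is a coin flip that lands on 1 with probability *J*_{s}, so the average of *m* of them has variance *J*_{s}(1-*J*_{s})/*m*, at the cost of storing a full 64-bit value per trial.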

Estimating Jaccard similarity via MinHash is particularly efficient if one approximates *h*(**a**) using only its least significant bit (LSB). This, of course, introduces collisions between distinct elements since the LSB of *h*(**a**) is 1 with probability 0.5, but the approximation has been shown to be effective if one uses many bits in the code. Overloading notation a bit, let **a** (respectively, **b**) be the bit string of *m* 1-bit MinHashes for set *A* (respectively, *B*). Then Jaccard similarity can be approximated via a CPU-efficient Hamming distance computation (xor and popcount instructions):

$$J_s(A,B) \approx 1 - \frac{2}{m}\,\mathrm{popcount}(\mathbf{a} \oplus \mathbf{b}).$$
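Where the factor of 2 comes from can be seen with a short calculation, sketched here under the idealized assumption that the LSBs of two *different* minimum elements agree with probability exactly 1/2. Bit *i* of the two codes agrees either because the minima coincide (probability *J*_{s}) or by chance:

$$
\Pr[a_i = b_i] = J_s + \tfrac{1}{2}\,(1 - J_s) = \tfrac{1}{2}\,(1 + J_s)
\quad\Longrightarrow\quad
J_s = 2\Pr[a_i = b_i] - 1 .
$$

Replacing the probability by the observed fraction of agreeing bits, $1 - d_H(\mathbf{a},\mathbf{b})/m$ with $d_H(\mathbf{a},\mathbf{b}) = \mathrm{popcount}(\mathbf{a} \oplus \mathbf{b})$, gives

$$
J_s \approx 2\left(1 - \frac{d_H(\mathbf{a},\mathbf{b})}{m}\right) - 1 = 1 - \frac{2\, d_H(\mathbf{a},\mathbf{b})}{m} .
$$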

It has been shown that the variance of 1-bit MinHash is 2(1-*J*_{s})/*m* when using *m* total bits, and indeed *any* summary-based Jaccard estimator has variance at least 1/*m*. Interestingly, the variance of *b*-bit MinHash does not decrease if one uses more than *b*=1 bits to describe each hash output *h*(**a**) while retaining the same number of bits in the overall description. With a little arithmetic, one can see that to achieve an estimation error of at most *ε J*_{s} with probability exceeding 1/2, one requires, up to a small constant factor, *m* > (1-*J*_{s})/(*ε J*_{s})^{2} bits of 1-bit MinHash, by Chebyshev’s inequality.

Code (golang) to generate a 1-bit MinHash code and approximate Jaccard similarity from two codes is shown below.

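    import (
        "hash/fnv"
        "math"
        "math/bits"
    )

    // Hash64 is a seeded 64-bit string hash. It is left unspecified in this post;
    // FNV-1a followed by a 64-bit mixing step is one possible stand-in.
    func Hash64(s string, seed uint64) uint64 {
        h := fnv.New64a()
        h.Write([]byte(s))
        x := h.Sum64() ^ (seed * 0x9e3779b97f4a7c15)
        x ^= x >> 33
        x *= 0xff51afd7ed558ccd
        x ^= x >> 33
        x *= 0xc4ceb9fe1a85ec53
        x ^= x >> 33
        return x
    }

    // PopCountUint64 returns the number of set bits in x.
    func PopCountUint64(x uint64) int { return bits.OnesCount64(x) }

    // OneBitMinHash builds an nBits-long code (nBits must be a multiple of 64):
    // bit i is the least significant bit of the minimum of Hash64(s, i) over the set.
    func OneBitMinHash(set []string, nBits int) []uint64 {
        code := make([]uint64, nBits/64)
        for bitnum := 0; bitnum < nBits; bitnum++ {
            minhash := uint64(math.MaxUint64)
            for _, s := range set {
                if v := Hash64(s, uint64(bitnum)); v < minhash { // bitnum as seed
                    minhash = v
                }
            }
            if minhash&1 == 1 { // keep only the LSB of the minimum
                code[bitnum/64] |= 1 << uint(bitnum%64)
            }
        }
        return code
    }

    // JaccardSim_OneBitMinHash estimates Jaccard similarity from two codes of
    // equal length as 1 - 2*hamming/nBits.
    func JaccardSim_OneBitMinHash(codeA []uint64, codeB []uint64) float64 {
        var hamming int
        nBits := len(codeA) * 64
        for i, a := range codeA {
            hamming += PopCountUint64(a ^ codeB[i])
        }
        return 1.0 - 2.0*float64(hamming)/float64(nBits)
    }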
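To get a feel for how many bits the Chebyshev argument asks for, here is a small sizing helper. It is my own sketch, not part of the original experiments: it plugs the variance figure 2(1-*J*_{s})/*m* quoted earlier into Chebyshev's inequality and solves for *m*. The constant it produces depends on the failure probability `delta` that one is willing to tolerate, so treat the result as an order-of-magnitude guide.

```go
package main

import "math"

// BitsNeeded returns the smallest integer m for which Chebyshev's inequality,
// applied with the variance bound 2(1-js)/m, guarantees an error of at most
// eps*js with failure probability at most delta:
//
//	2(1-js) / (m * (eps*js)^2) <= delta
func BitsNeeded(js, eps, delta float64) int {
	varianceTimesM := 2.0 * (1.0 - js) // variance bound multiplied by m
	tol := eps * js                    // absolute error tolerance
	return int(math.Ceil(varianceTimesM / (delta * tol * tol)))
}
```

For example, `BitsNeeded(0.5, 0.1, 0.5)` asks for roughly 800 bits to pin a similarity of 0.5 down to within 10%, while pairs with similarity near 1 need far fewer, which matches the shrinking error bars at high similarity.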

*BitShred: Bitwise Set Hashing*

Feature hashing is a space-efficient method to encode feature-value pairs as a sparse vector. This is useful when the number of features is *a priori* unknown or when otherwise constructing a feature vector on the fly. To create an *m*-dimensional vector from an arbitrary number of feature/value pairs, one simply applies a hash function and modulo operator to each feature name to retrieve a column index, then updates that column in the vector with the provided value. Column collisions are a natural consequence in the typical use case where the size of the feature space *n* is much larger than *m*.
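A minimal sketch of that recipe is below. Two details are my assumptions rather than anything prescribed above: FNV-1a as the hash, and "updating" a column by adding the value to it (so colliding features simply sum).

```go
package main

import "hash/fnv"

// HashFeatures folds name/value pairs into an m-dimensional vector: each name
// is hashed, reduced modulo m to a column index, and its value is added to
// that column. Colliding names therefore accumulate in the same column.
func HashFeatures(features map[string]float64, m int) []float64 {
	vec := make([]float64, m)
	for name, value := range features {
		h := fnv.New64a() // FNV-1a is an illustrative choice of hash
		h.Write([]byte(name))
		col := h.Sum64() % uint64(m)
		vec[col] += value
	}
	return vec
}
```

No dictionary of feature names is ever stored, which is the point: the vector length *m* is fixed up front no matter how many distinct names show up later.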

BitShred uses an adaptation of feature hashing in which each element of a set is encoded as a single bit in a bit string. Since *m* << *n*, the many-to-one mapping between set elements and bit locations introduces collisions. A concise bit description of set *A* is created by setting the bit at [*h*(**a**) mod *m*] for all elements **a** in *A*. Overloading notation again, let **a** (respectively, **b**) be the BitShred description of set *A* (respectively, *B*). Then Jaccard similarity is estimated efficiently by replacing set operators with bitwise operators:

$$J_s(A,B) \approx \frac{\mathrm{popcount}(\mathbf{a} \wedge \mathbf{b})}{\mathrm{popcount}(\mathbf{a} \vee \mathbf{b})}.$$

To make sense of this estimator, let random variable *C*_{i} denote the event that one or more elements from *each* set *A* and *B* both map to the *i*th bit. Similarly, let random variable *U*_{i} denote that one or more elements from *either* set *A* or *B* (or both) map to the *i*th bit. Then, the BitShred similarity estimator of *J*_{s} can be analyzed by considering the ratio

$$\frac{\sum_{i=1}^{m} C_i}{\sum_{i=1}^{m} U_i},$$

which is simply the (noisy, with collisions) sum of the intersections divided by the sum of the union. Estimating the bias of the ratio of random variables will not be detailed here. But note that due to the many-to-one mapping, the numerator generally *overestimates* the true cardinality of the set intersection, while the denominator *underestimates* the true cardinality of the set union. So, without cranking laboriously through any math, it’s easy to see from the ratio of “too big” to “too small” that this estimator is biased^{2}, and generally overestimates the true Jaccard similarity.
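A tiny hand-built case makes the "too big over too small" effect concrete. In this sketch the hash is replaced by an explicit, hypothetical element-to-bit table so that a collision can be placed on purpose; the function names are mine.

```go
package main

import "math/bits"

// encode sets one bit per element, using a caller-supplied (hypothetical)
// element-to-bit table in place of h(a) mod m so collisions can be placed by hand.
func encode(set []string, bitOf map[string]uint) uint64 {
	var code uint64
	for _, s := range set {
		code |= 1 << bitOf[s]
	}
	return code
}

// bitShredEstimate is popcount(a AND b) / popcount(a OR b) for single-word codes.
func bitShredEstimate(a, b uint64) float64 {
	return float64(bits.OnesCount64(a&b)) / float64(bits.OnesCount64(a|b))
}
```

Take *A* = {x, y} and *B* = {x, z}, whose true Jaccard similarity is 1/3. If x, y and z land on three different bits, the estimate is exactly 1/3. If y and z happen to share a bit, the two codes become identical and the estimate jumps to 1: the collision added a phantom element to the intersection and removed one from the union at the same time.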

Code (golang) to generate a BitShred code and estimate Jaccard similarity from two BitShred codes is shown below.

    // Hash64 and PopCountUint64 are the same helpers used in the MinHash listing above.

    // BitShred builds an nBits-long code (nBits must be a multiple of 64) by
    // setting the bit at Hash64(s, 0) mod nBits for every element of the set.
    func BitShred(set []string, nBits uint16) []uint64 {
        code := make([]uint64, nBits/64)
        for _, s := range set {
            bitnum := Hash64(s, 0) % uint64(nBits)
            code[bitnum/64] |= 1 << (bitnum % 64)
        }
        return code
    }

    // JaccardSim_BitShred estimates Jaccard similarity from two codes of equal
    // length as popcount(AND) / popcount(OR).
    func JaccardSim_BitShred(codeA []uint64, codeB []uint64) float64 {
        var numerator, denominator int
        for i, a := range codeA {
            numerator += PopCountUint64(a & codeB[i])
            denominator += PopCountUint64(a | codeB[i])
        }
        if denominator == 0 {
            return 0 // both codes are empty; the ratio is undefined, so 0 is returned here
        }
        return float64(numerator) / float64(denominator)
    }

*Estimator Quality*

The math is over; let’s look at some plots.

This plot shows the estimated vs. true Jaccard similarity for MinHash and BitShred, for the contrived case where sets *A* and *B* consist of randomly generated alphanumeric strings, |*A*|=|*B*|=64, and the number of bits *m*=128. The mean and one-standard-deviation error bars are plotted from 250 trials for each point on the similarity graph. The *y*=*x* identity line (dotted) is also plotted for reference.

A few things are evident. As expected, MinHash shows its unbiasedness with modest variance. BitShred is grossly biased, but has low variance. Note, however, that the variance of both estimators vanishes as similarity approaches unity. In many applications such as approximate nearest-neighbor search, it’s the consistent rank-order of similarities that matters, rather than the actual similarity values. In this regard, one is concerned about the variance and strict monotonicity of this kind of curve only on the right-hand side, where *J*_{s} approaches 1. The extent to which the bias and variance near *J*_{s}=1 play a role in applications will be explored next.
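For anyone who wants to reproduce this kind of contrived experiment, one way to build a pair of equal-size random sets with a prescribed true similarity is sketched below. The construction (share exactly `shared` strings, fill the rest with fresh ones) and all names are my assumptions; the post does not say how its sets were generated.

```go
package main

import "math/rand"

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// randomString returns a random alphanumeric string of the given length.
func randomString(rng *rand.Rand, length int) string {
	buf := make([]byte, length)
	for i := range buf {
		buf[i] = alphabet[rng.Intn(len(alphabet))]
	}
	return string(buf)
}

// MakeSetPair returns two sets of n distinct strings each that have exactly
// `shared` elements in common, so their true Jaccard similarity is
// shared / (2n - shared). The seed makes the pair reproducible.
func MakeSetPair(seed int64, n, shared int) (a, b []string) {
	rng := rand.New(rand.NewSource(seed))
	seen := make(map[string]bool)
	fresh := func() string {
		for {
			s := randomString(rng, 16)
			if !seen[s] {
				seen[s] = true
				return s
			}
		}
	}
	for i := 0; i < shared; i++ {
		s := fresh()
		a = append(a, s)
		b = append(b, s)
	}
	for i := shared; i < n; i++ {
		a = append(a, fresh())
		b = append(b, fresh())
	}
	return a, b
}

// TrueJaccard is the exact similarity of a pair built by MakeSetPair.
func TrueJaccard(n, shared int) float64 {
	return float64(shared) / float64(2*n-shared)
}
```

With *n*=64, sweeping `shared` from 0 to 64 walks the true similarity from 0 to 1; repeating each setting with different seeds and feeding the pairs to either estimator gives the mean and spread at each point.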

*Nearest Neighbor Search*

So, what about nearest-neighbor search? Let’s compare *k*-NN recall.

As a function of neighborhood size *k*, we measure the recall of true nearest neighbors: what fraction of the true *k* neighbors did we capture in our *k*-NN query? The plot above shows recall vs. *k* averaged over 250 trials with one-standard-deviation error bars for MinHash vs. BitShred. The same contrived case is used as before, in which sets *A* and *B* consist of randomly generated alphanumeric strings, |*A*|=|*B*|=64, and the number of bits *m*=128. While it’s mostly a wash for small *k*, one observes that the lower-variance BitShred estimator generally provides better recall.

Note that in this toy dataset, the neighborhood size increases linearly with similarity; but in real datasets the monotonic relationship is far from linear. For example, the first 3 nearest neighbors may enjoy Jaccard similarity greater than 0.9, while the 4^{th} neighbor may be very dissimilar (e.g., Jaccard similarity < 0.5).
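The recall measurement itself is simple to state in code. In this sketch (names are mine), each candidate has a true similarity to the query and an estimated one, and recall is the overlap between the two top-*k* lists; breaking ties by lower index is an arbitrary choice.

```go
package main

import "sort"

// topK returns the indices of the k largest similarities (ties broken by lower index).
func topK(sims []float64, k int) []int {
	idx := make([]int, len(sims))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(x, y int) bool { return sims[idx[x]] > sims[idx[y]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

// RecallAtK is the fraction of the true k nearest neighbors that also appear
// among the k nearest neighbors ranked by the estimated similarities (k >= 1).
func RecallAtK(trueSims, estSims []float64, k int) float64 {
	truth := make(map[int]bool, k)
	for _, i := range topK(trueSims, k) {
		truth[i] = true
	}
	hits := 0
	for _, i := range topK(estSims, k) {
		if truth[i] {
			hits++
		}
	}
	return float64(hits) / float64(len(truth))
}
```

Note that only the ordering of the estimates matters here: a biased estimator that preserves rank order scores perfect recall, which is why BitShred's bias alone does not hurt it in this test.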

*Applications: Malware Visualization and Classification*

Let’s take a look at an application. In what follows, we form a symmetric nearest neighbor graph of 250 samples from each of five commodity malware families plus a benign set, with k=5 nearest neighbors retrieved via Jaccard similarity (MinHash or BitShred). For each sample, codes are generated by concatenating five 128-bit codes (640 bits per sample) consisting of a single 128-bit code for each of the following feature sets extracted from publicly available VirusTotal reports:

- PE file section names;
- language resources (English, simplified Chinese, etc.);
- statically-declared imports;
- runtime modification to the hosts file (Cuckoo sandbox); and
- IP addresses used at runtime (Cuckoo sandbox).
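One consequence of concatenating equal-length codes is worth spelling out for the MinHash case. The sketch below is my reading of "concatenating", with hypothetical helper names: because the 1-bit MinHash estimator is linear in the Hamming distance, applying it to the concatenated code gives the plain average of the per-feature-set similarity estimates.

```go
package main

import "math/bits"

// Concat joins per-feature-set codes (e.g., five 128-bit codes) into one code.
func Concat(codes ...[]uint64) []uint64 {
	var out []uint64
	for _, c := range codes {
		out = append(out, c...)
	}
	return out
}

// MinHashSim applies the 1-bit MinHash estimator, 1 - 2*hamming/bits,
// to two codes of equal length.
func MinHashSim(a, b []uint64) float64 {
	hamming := 0
	for i := range a {
		hamming += bits.OnesCount64(a[i] ^ b[i])
	}
	return 1.0 - 2.0*float64(hamming)/float64(64*len(a))
}
```

So each of the five feature sets contributes equally to the MinHash similarity, regardless of how many elements it contains. The same is not true of the BitShred ratio on a concatenated code, where popcounts from all segments are pooled in the numerator and denominator, so feature sets that set more bits carry more weight.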

t-SNE plots of the data (which aim to respect local similarity) for MinHash and BitShred are shown below. (I use the same random initialization for both plots.)

**Figure 1:** MinHash similarity from k=5 symmetric similarity matrix

**Figure 2:** BitShred similarity from k=5 symmetric similarity matrix

The effects of BitShred’s positive bias can be observed when comparing to the MinHash plot. It’s evident that BitShred is merging clusters that are distinct in the MinHash plot. This turns out to be good for Allaple, but very bad for Ramnit, Sality and benign, which exhibit cleaner separation in the MinHash plot. Very small, tight clusters of Soltern and Vflooder appear to be purer in the BitShred visualization. Embeddings produced from graphs with higher connectivity (e.g., k=50) show qualitatively similar findings.

For a quantitative comparison, we show results for simple k-NN classification with k=5 neighbors, and measure classification performance. For MinHash the confusion matrix and a classification summary are:

And for BitShred:

In this contrived experiment, the numbers agree with our intuition derived from the visualization: BitShred confuses Ramnit, Sality and benign, but shows marginal improvements for Soltern and Vflooder.
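The classification step behind these numbers is plain majority voting over the retrieved neighbors. Here is a minimal sketch with hypothetical helper names; breaking ties in favor of the label seen first in neighbor order is my choice, not something stated in the experiment.

```go
package main

// MajorityLabel returns the most frequent label among a sample's neighbors.
// On a tie, the label that appears earliest in neighbor order wins.
func MajorityLabel(neighborLabels []string) string {
	counts := make(map[string]int)
	for _, l := range neighborLabels {
		counts[l]++
	}
	best, bestCount := "", 0
	for _, l := range neighborLabels { // neighbor order keeps tie-breaking deterministic
		if counts[l] > bestCount {
			best, bestCount = l, counts[l]
		}
	}
	return best
}

// ConfusionMatrix counts (true label, predicted label) pairs:
// cm[t][p] is the number of samples of class t predicted as class p.
func ConfusionMatrix(truth, predicted []string) map[string]map[string]int {
	cm := make(map[string]map[string]int)
	for i, t := range truth {
		if cm[t] == nil {
			cm[t] = make(map[string]int)
		}
		cm[t][predicted[i]]++
	}
	return cm
}
```

An inflated similarity estimate shows up directly here: if BitShred ranks a Sality sample among a Ramnit sample's five nearest neighbors often enough, the vote flips and the off-diagonal cell `cm["Ramnit"]["Sality"]` grows.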

*Summary*

MinHash and BitShred are two useful methods to approximate Jaccard similarity between sets with low memory and computational footprints. MinHash is unbiased, while BitShred has lower variance with nonnegative bias. In non-extensive experiments, we verified the intuition that BitShred overestimates Jaccard similarity, which can introduce errors for visual clustering and nearest-neighbor classification. In our contrived experiments (and this also plays out in practice), this caused confusion/merging of distinct malware families.

The bias issue of BitShred could be partially ameliorated by using neighbors that fall within a ball of small radius *r*, where the BitShred bias is small. (This is in contrast to *k-*NN approaches in which similarities in the “local” neighborhood can range from 0 to 1, with associated bias/variance.)
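A radius query of this kind might look like the sketch below, which keeps only candidates whose estimated Jaccard distance from the query is at most *r*. The function names and the linear scan are illustrative assumptions; a real system would pair this with an index.

```go
package main

import "math/bits"

// bitShredSim is popcount(a AND b) / popcount(a OR b) for codes of equal length.
func bitShredSim(a, b []uint64) float64 {
	var inter, union int
	for i := range a {
		inter += bits.OnesCount64(a[i] & b[i])
		union += bits.OnesCount64(a[i] | b[i])
	}
	if union == 0 {
		return 0 // both codes empty; treated as dissimilar here by assumption
	}
	return float64(inter) / float64(union)
}

// NeighborsWithinRadius returns the indices of all codes whose estimated
// Jaccard distance (1 - similarity) from the query is at most r.
func NeighborsWithinRadius(query []uint64, codes [][]uint64, r float64) []int {
	var out []int
	for i, c := range codes {
		if 1.0-bitShredSim(query, c) <= r {
			out = append(out, i)
		}
	}
	return out
}
```

Unlike a *k*-NN query, this can return fewer than *k* neighbors (or none at all), which is the intended behavior: a sample with no close relatives is not forced to borrow labels from distant, unreliable matches.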

Finally, the Jaccard metric represents a useful measure of similarity. There are many others based on common or custom similarity measures, which may also be approximated by Hamming distance on compact binary codes. These, together with efficient search strategies (also not detailed in this blog post) can be employed for powerful large-scale classification, clustering and visualization.

^{1}How can one show that Jaccard distance is really a metric? Nonnegativity, the coincidence axiom, and symmetry? Check, check and check. But the triangle inequality? Tricky! Alternatively, one can start with a known metric (the size of the symmetric set difference between *A* and *B*), then rely on the Steinhaus transform to crank through the necessary arithmetic and arrive at Jaccard distance.

^{2}One may reduce the bias of BitShred by employing similar tricks to those used in feature hashing. For example, a second hash function may be employed to determine whether to xor the current bit with a 1 or a 0. This reduces bias at the expense of variance. For brevity, I do not include this approach in comparisons.
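For the curious, the arithmetic mentioned in footnote 1 is short. The Steinhaus transform takes a metric $d$ and a fixed reference point $c$ and produces a new metric $\delta$. Choosing $d(A,B) = |A \,\triangle\, B|$ and the empty set as the reference point (my choice of reference; it gives $d(A,\emptyset) = |A|$):

$$
\delta(A,B) = \frac{2\,d(A,B)}{d(A,\emptyset) + d(B,\emptyset) + d(A,B)}
= \frac{2\,|A \,\triangle\, B|}{|A| + |B| + |A \,\triangle\, B|} .
$$

Since $|A| + |B| = |A \cup B| + |A \cap B|$ and $|A \,\triangle\, B| = |A \cup B| - |A \cap B|$, the denominator is $2\,|A \cup B|$, so

$$
\delta(A,B) = \frac{|A \,\triangle\, B|}{|A \cup B|} = 1 - \frac{|A \cap B|}{|A \cup B|} = J_d(A,B) .
$$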


**Follow Hyrum on Twitter @drhyrum.**