Understanding Crawl Data at Scale (Part 1)


A couple of years ago, in an effort to better understand technology trends, we initiated a project to identify typical web site characteristics for various geographic regions. We wanted to build a simple queryable interface that would allow our analysts to interact with crawl data and identify non-obvious trends in technology and web design choices. At first this may sound like a pretty typical analysis problem, but we have faced numerous challenges and gained some pretty interesting insights over the years.

Aside from the many challenges of crawling the Internet and processing that data, at the end of the day we end up with hundreds of millions of records, each with hundreds of features. Identifying “normal trends” over such a large feature set can be a daunting task. Traditional statistical methods break down at this point: they work well for one or two variables but are rendered pretty useless once you hit more than 10. This is why we have chosen to use cluster analysis in our approach to the problem.

Machine learning algorithms, the Swiss army knife of a data scientist’s toolbox, break down into three classifications: supervised learning, unsupervised learning, and reinforcement learning. Although mixed approaches are common, each of the three lends itself to different tasks. Supervised learning is great for classification problems where you have a lot of labeled training data and you want to identify appropriate labels for new data points. Unsupervised techniques help to determine the shape of your data, categorizing data points into groups by mathematical similarity. Reinforcement learning includes a set of behavioral models for agent-based decision-making in environments where the rewards (and penalties) are only given out on occasion (like candy!). Cluster analysis fits well within the realm of unsupervised learning but can take advantage of supervised learning (making it semi-supervised learning) in a lot of scenarios, too.

So what is cluster analysis and why do we care? Consider web sites and features of those sites. Some sites will be large, others small. Some will have lots of images; others will have lots of words. Some will have lots of outbound links, and others will have lots of internal links. Some web sites will use Angular; others will prefer React. If you look at each feature individually, you may find that the average web site has 11 pages, 4 images, and 347 words. But what does that get you? Not a whole lot. Instead, let’s sit back and think about why some sites may have more images than others or choose one JavaScript library over another. Each web site was built for a purpose, be it to disseminate news, create a community forum, or blog about food. The goals of the web site designer will often guide his or her design decisions. Cluster analysis applies #math to a wide range of features and attempts to cluster web sites into groups that reflect similar design decisions.
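To make this concrete, here is a minimal sketch of the idea using k-means (one common clustering algorithm, though not necessarily the one we use) from scikit-learn. The feature vectors below are entirely made-up values for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one (hypothetical) site: [pages, images, words, outbound links]
sites = np.array([
    [11,  4,    347,   12],   # small brochure-style site
    [9,   2,    290,   8],
    [450, 1200, 52000, 30],   # image-heavy gallery
    [380, 980,  47000, 25],
    [60,  10,   40000, 400],  # link-heavy news/blog site
    [75,  15,   45000, 520],
])

# Standardize so large word counts don't dominate small page counts
X = StandardScaler().fit_transform(sites)

# Ask for three groups; each site gets a cluster label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```

Sites built with similar design decisions end up with the same label, without anyone ever telling the algorithm what a “gallery” or a “blog” is.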

Once you have your groups, generated by #math, you’ve just made your life a whole lot simpler. A few minutes (or hours) ago you had potentially thousands or millions of items to compare across hundreds of fields. Now you’ve got tens of groups that you can compare in aggregate. Additionally, you now know what makes each group a group and how it distinguishes itself from one or more other groups. Instead of looking at each website or field individually, now you’re looking at everything holistically. Your job just got a whole lot easier!
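Comparing groups in aggregate can be as simple as summarizing each cluster’s features. A sketch with pandas, using hypothetical features and labels:

```python
import pandas as pd

# Hypothetical per-site features plus the cluster label assigned to each site
df = pd.DataFrame({
    "pages":   [11, 9, 450, 380, 60, 75],
    "images":  [4, 2, 1200, 980, 10, 15],
    "words":   [347, 290, 52000, 47000, 40000, 45000],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Millions of rows collapse into one summary row per group
summary = df.groupby("cluster").mean()
print(summary)
```

Now the differences between groups (image-heavy vs. word-heavy, large vs. small) are visible at a glance in a handful of rows.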

Cluster analysis gives you some additional bonus wins. Now that you have normal groups of web sites, you can identify outliers within the set: sites that are substantially dissimilar from the bulk of their assigned group. You can also use these clusters as labels in a classifier and determine in which group of sites a new one fits best.
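One simple way to flag outliers (a sketch, not necessarily our method) is to measure each site’s distance from its cluster’s centroid and flag the sites that sit unusually far out. Using made-up standardized feature vectors:

```python
import numpy as np

# Hypothetical standardized feature vectors for sites in one cluster
points = np.array([
    [0.1,  0.2],
    [0.0,  0.1],
    [0.2,  0.0],
    [-0.1, 0.1],
    [0.1, -0.1],
    [4.0,  4.0],   # far from everything else: a candidate outlier
])

# Distance of each site from the group's centroid
centroid = points.mean(axis=0)
dists = np.linalg.norm(points - centroid, axis=1)

# Flag sites more than, say, two standard deviations above the mean distance
threshold = dists.mean() + 2 * dists.std()
outliers = np.where(dists > threshold)[0]
print(outliers)
```

The same cluster centroids can double as a crude classifier: assign a brand-new site to whichever centroid it lands closest to.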

In coming posts, we will go into more detail about how we cluster and visualize web crawl data. Stay tuned!