Here's How We Do The Numbers
I spoke to a few IT leaders around the HIMSS conference last week. All of them expressed both a knowledge of the ATT&CK matrix and recent evaluations, and most of them also confessed to confusion about what really matters. Although they said their teams used the evaluations as part of vendor shortlisting or diligence on their current vendor, few were able to spend time understanding how to interpret the test result data. Even fewer had taken a swipe at analyzing the data published by the ATT&CK team.
So, Jamie Butler (Endgame CTO) and I got together to come up with a few simple but important questions for anyone looking at the MITRE ATT&CK evaluation data. Most importantly, we wanted to make sure that the questions can be answered by the data.
Rather than asking a one-dimensional “Who’s best” – which doesn’t tell you everything, no matter what anyone tells you – we focused on questions that would be key when implementing or operationalizing a new EDR tool. We settled on:
- Who missed the most?
- When you miss something, can the product find it later?
- When you detect, how useful is the data the product gives me?
For full transparency and in case you want to play along with the numbers, I’ve posted the scripts to GitHub here. I’ll add the command to generate the output for every chart as the figure description.
One final grateful shout to Forrester’s Josh Zelonis for publishing his analysis scripts back in December, because he saved me a lot of work. Yay for open source and transparency.
Who missed the most?
Figure 1: “python3 total_misses.py”
This is self-explanatory, and of course no vendor is perfect.
It should be noted that there is no severity rating for the missed TTPs. Some may be more severe than others, and some may be mitigated through operational best practices or other security controls.
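To make the idea of a “miss” concrete, here is a minimal sketch of the counting. It is not the posted total_misses.py. The vendor names, step IDs and the `RESULTS` layout are invented for illustration, and it assumes that a step counts as missed when every category recorded for it is "None".

```python
# Hypothetical, hand-made results (not taken from the published evaluation):
# vendor -> step id -> detection categories recorded for that step.
RESULTS = {
    "VendorA": {"1.A.1": ["Telemetry"], "2.A.1": ["None"], "3.A.1": ["None"]},
    "VendorB": {
        "1.A.1": ["Specific Behavior", "Telemetry"],
        "2.A.1": ["None"],
        "3.A.1": ["Telemetry"],
    },
}


def count_misses(results):
    """Return {vendor: number of steps where every recorded category is "None"}.

    Assumption: a step with no recorded categories at all also counts as a miss.
    """
    return {
        vendor: sum(all(c == "None" for c in cats) for cats in steps.values())
        for vendor, steps in results.items()
    }


print(count_misses(RESULTS))
```

As with the chart, the count says nothing about how severe each missed step is.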
When you miss something, can you find it later?
CrowdStrike recently published their intelligence report and introduced a new metric they are calling “Breakout” speed – the time from an adversary’s initial successful compromise to the next activity; for example, privilege escalation or lateral movement.
I like to think this relates to the comparable blue team metric “Time to Containment”, so let’s pull some data to look at which APT3 actions were picked up in near-real time, and which were delayed and introduced a “breakout” window.
Figure 2: “python3 delayed.py”
Obviously, it’s better to detect everything as close to real-time as possible, but let’s be honest – that's not possible; things will get through undetected for a while. That’s where threat hunting comes in.
How the ATT&CK evaluation treats delayed detections is worth a close look, because in the assessment a delayed detection meant that the detection data was available but was not immediately raised as an alert. The delay could be due to over-reliance on cloud analysis, or because the activity was only discovered hours later by a managed service or by skilled threat hunters.
Essentially, the more delays here, the more work it takes to get to a detection alert and the bigger the risk of a “breakout” window. A large number of delays can leave you with all of the uncertainty of a poorly defined managed service, but none of the benefits of an actual managed service.
Does that mean that a score with ZERO delayed detections is better?
In a word, no. Given that we already understand it may not be possible to prevent or detect everything, you should expect to see some detections that come later, from the data that was collected. In other words, zero delays is only good news if that vendor also detected everything in real-time (go look at total_misses.py again).
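A small sketch shows why the delayed count has to be read next to the misses. Everything here is illustrative: the two vendors and their detections are made up, each detection is assumed to be a (category, modifiers) pair, and “real_time” simply means “not missed and not marked Delayed”, which may not match how the posted scripts define it.

```python
from collections import Counter


def timing_profile(detections):
    """Bucket (category, modifiers) pairs into missed / delayed / real_time."""
    profile = Counter()
    for category, modifiers in detections:
        if category == "None":
            profile["missed"] += 1
        elif "Delayed" in modifiers:
            profile["delayed"] += 1
        else:
            profile["real_time"] += 1
    return profile


# Hypothetical vendor with zero delayed detections, but several misses.
vendor_x = [
    ("None", []), ("None", []), ("None", []),
    ("General Behavior", []), ("Telemetry", []),
]
# Hypothetical vendor with some delayed detections and fewer misses.
vendor_y = [
    ("None", []),
    ("Telemetry", ["Delayed"]), ("General Behavior", ["Delayed"]),
    ("Specific Behavior", []), ("Telemetry", []),
]

for name, detections in (("vendor_x", vendor_x), ("vendor_y", vendor_y)):
    print(name, dict(timing_profile(detections)))
```

In this made-up example both vendors raise the same number of real-time detections, but the one with zero delays only gets there by missing more.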
Organizations that want to reduce exposure want as many detections as possible to come as close to real-time as possible. Back to the ATT&CK evaluation dataset, let’s pull the number of real-time alerts generated in the evaluation.
Figure 3: “python3 real_time.py”
I don’t think anyone can disagree with real-time alerts being one of the important metrics.
When you detect, how useful is the data you give me?
The hardest part of a buying process is where organizations have to figure out how endpoint protection vendors can actually help them make faster decisions, take faster containment actions, and reduce the risk of damage and loss. You know, the workflow side of things.
How much expertise do I need, and how much work do I have to do, to scope, triage and respond with zero disruption?
If a product is throwing up lots of detections but fails to provide the associated events that would help triage and decide on the initial containment step, it’s almost certainly going to be a hard-to-use product.
Let’s hit the data again to pull the number of detection events that had little context associated.
Figure 4: “python3 no_context.py”
I’ll quote Jamie Butler, our CTO, here:
“These contextless, basic telemetry detections only give you a bigger pile of hay/data, so what can you do with it? Can you find the bad thing in a list of thousands of process creation events?”
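One way to put a number on that pile of hay is the share of a product’s detections that are telemetry only. The sketch below is an assumption-laden illustration, not the posted no_context.py: it treats the "Telemetry" category as the context-less one, ignores misses, and runs on an invented sample list.

```python
def telemetry_only_share(categories):
    """Fraction of non-missed detections whose category is plain "Telemetry"."""
    detected = [c for c in categories if c != "None"]
    if not detected:
        return 0.0  # nothing detected, so there is nothing to triage either
    return sum(c == "Telemetry" for c in detected) / len(detected)


# Hypothetical categories for one product across five steps.
sample = ["Telemetry", "Telemetry", "Specific Behavior", "None", "Telemetry"]
print(f"{telemetry_only_share(sample):.0%} of detections are telemetry only")
```

A high share means an analyst has to supply most of the context themselves.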
Can this tell you what really matters?
Yes. Well, kind of. We already know what matters most in the world of EPP:
- Detect more things, faster – You need a good balance of as-real-time-as-possible detections, with low misses.
- Find the hard stuff – Be wary of a low real-time detection rate combined with a low (or zero) number of delayed detections. (Are they collecting all of the event data that an EDR tool requires?)
- Respond confidently – If you have a lot of delayed detections, you need strong contextual events. When a detection comes in late, you want to act on it fast. Security UX is critical.
We know that without satisfying all three of those, a detection and response/Security Operations function will be very difficult to implement and operationalize successfully – if at all. We can use these three things to make a final pull from the data.
Figure 5: “python3 what_matters.py”
Bar charts. Yay.