Defensible Options for MITRE Coverage

Inspired by Allyn Stott's talk in Vegas this summer, and coinciding with ATT&CKcon, metrics are back in the discourse, and coverage remains as mysterious an entity as ever.

In a business sense, the security team is in a tough spot. The executive team would love nothing more than to see a return on investment. Your customers are paying that extra bit for goods and services, and surely they are only really happy when that premium reflects the degree of trust they place in your product over another. Without a concrete return on investment, the fundamentally right thing for both of these groups to do is defer that trust and spend elsewhere: say, on something like marketing or a competitor's product, respectively.

Perhaps if there were some metric, a number with a physical analogy, that looks good even in the time between security events when nothing visibly bad is happening, both of these groups would be impressed? It seems that this is where coverage as a metric is born.

Roughly MITRE ATT&CK

MITRE ATT&CK is undeniably a fantastic resource. It builds a shared vocabulary, a shared understanding, of what the components of an attack look like and how a threat actor might operate.

It is also ground zero for concerns around coverage as a metric. While parties like Forrester and MITRE themselves would rather you did not use it as a bingo card, at the time of publication the first five or so pages of search results are either claiming the coverage of certain products or explaining how to measure it yourself so you can get as close to full coverage as possible.

There could be nothing more natural than considering how to check off your worries around the box labelled Phishing, or even the bigger, brighter box labelled Initial Access - so why are there so few satisfying answers?

Firstly, if we look at something like NIST, investigation begins with an indicator coming in from some kind of tool deployed earlier or, if you're unlucky, a call from a government agency. We don't yet know whether the indicator is real, so it goes into a funnel. With that funnel typically composed of various stages of analysts, we reach the first contradiction of coverage.

Coverage implies top of the funnel visibility, but is measured much lower

Consider whether your analysis pipeline can identify a compromised asset. It may sound simple, but this is the first hurdle. Given some arbitrary hunch, if your analysts can't tell the difference between a compromised asset and an uncompromised one, your coverage is significantly reduced.

Imagine one side of a Venn diagram as the security tool's understanding of Phishing and the other side as your analyst's understanding; the middle is necessarily smaller than either. We know the tool's and the analyst's visibility are different - as David J Bianco writes, less than 10% of malware samples are known to more than one organization. If we're talking numbers around 10%, or even less if we're comparing overlap in threat intelligence feeds, then front-line analysts having even a slightly different picture can reduce that 10% down to 5% or lower. The coverage considered back when the tool was purchased has now been halved.
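
To make that arithmetic explicit, here is a toy calculation; the 10% figure is Bianco's, while the analyst-overlap fraction is purely an assumed placeholder.

```python
# Toy arithmetic only: effective coverage is what the tool can see, multiplied
# by how much of that picture the front-line analysts actually share.
tool_visibility = 0.10   # ~10% of samples known to more than one organization (Bianco)
analyst_overlap = 0.50   # assumed placeholder: analysts recognize half of what the tool flags

effective_coverage = tool_visibility * analyst_overlap
print(f"effective coverage: {effective_coverage:.0%}")  # -> 5%
```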

Coverage should tell you if it's a tool issue or an analysis capability issue

Now for some good news: there are loads of great resources that will let you take an arbitrary asset and, in 20 minutes or so, determine whether it is compromised: Patrick Wardle on Mac, SANS on Windows, SandFly Security on Linux, Chris Sanders on Everything. There's no need to wash multiple parties' concepts of precision against each other; we have a nice, repeatable, relatively objective look at compromise. So how can we measure where we are against "good enough" - is the gap down to not detecting anything, or not knowing how to look?

One approach is to take a baseline from a system we know doesn't work very well: drawing assets out of a hat at random.

Building on the principle of assume breach, we can at least believe there is a single threat actor somewhere already within our environment. Using reasoning analogous to the famous Birthday Problem, applied to the security space, we can calculate how this random-number-generating security tool performs. For an organization with 10,000 assets, every 200 alerts or so should reveal a compromise. Assuming this isn't an organization where the security engineers know what has been compromised and are deliberately hiding it, the difference between the compromises actually found and this number reflects the quality of alerting combined with the quality of analysis.
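
As a rough sketch of that baseline, the snippet below computes the chance that drawing assets out of a hat touches a compromise, assuming the intruder sits on some number of assets and the hunts are uniform draws without replacement. The footprint sizes are hypothetical, and the post's figure of roughly 200 alerts per compromise for 10,000 assets comes from the Birthday Hunting chart and its own assumptions rather than this exact calculation.

```python
# Illustrative only: how often does purely random hunting stumble onto a compromise?
def p_random_hunt_hits(assets: int, compromised: int, hunted: int) -> float:
    """Probability that hunting `hunted` assets, drawn uniformly at random
    without replacement, touches at least one of `compromised` assets in an
    estate of `assets` machines."""
    p_miss = 1.0
    for i in range(hunted):
        p_miss *= (assets - compromised - i) / (assets - i)
    return 1.0 - p_miss

# Hypothetical intruder footprints for a 10,000-asset estate and 200 random hunts.
for footprint in (1, 10, 35):
    p = p_random_hunt_hits(assets=10_000, compromised=footprint, hunted=200)
    print(f"footprint of {footprint:2d} assets: P(at least one hit in 200 hunts) = {p:.1%}")
```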

The first bar for any kind of meaningful coverage is to be performing many times better than random.

The numbers above assume there is only one threat actor in the universe and that they are on only one asset. Performing X times better than random and Y times better than last year is a concrete story: it demonstrates what's possible with the given resources and highlights the effectiveness of efforts from one period to the next.

Going past this hurdle we can start to peel off a layer and look more closely at the components we're interested in.

Coverage should consider what is worth looking at, and what it costs to cover

Threat actors are not so different from you and me: they have a boss and, assuming they aren't with certain organizations that consistently fail their financial audits by trillions of dollars, a budget as well. Some techniques are easier to learn than others, and in aggregate those are going to be more popular.

Analyst attention, after all, is one of the most valuable resources of security operations, if not the limiting one. There is also a cost in building or maintaining the tools. Having these people build and operate the tools directly is a great way of aligning incentives with the analyst-attention budget, but how do we let the data tell the story?

One technique uses a statistical tool called the G-test. Pioneered in the context of natural language processing by Ted Dunning, the G-test needs just four numbers: how many times the thing covering our category of interest has fired (Fi), how many of those firings were interesting (FiInt), how many times anything else interesting was found (TotInt), and how many times anything else fired (Tot).
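
For anyone who would rather compute the statistic locally than lean on an online calculator, here is a minimal sketch of the G-test itself, assuming Tot and TotInt count everything other than the detection in question; the calculator referenced below may lay the table out differently, so its exact scores can differ.

```python
import math

def g_test(fi: int, fi_int: int, tot: int, tot_int: int) -> float:
    """G statistic, 2 * sum(O * ln(O / E)), over a 2x2 table built from the four
    counts: this detection vs everything else, interesting vs not interesting."""
    observed = [
        [fi_int, fi - fi_int],      # this detection: interesting / not interesting
        [tot_int, tot - tot_int],   # everything else: interesting / not interesting
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    g = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if obs > 0:  # empty cells contribute nothing
                g += obs * math.log(obs / expected)
    return 2 * g
```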

Now we can consider a few scenarios we'd be interested in and, after throwing the numbers into a handy G-test calculator, determine whether stack-ranking coverage this way makes sense.

Fires Frequently, Good Precision

Fi: 100, FiInt: 80, Tot: 200, TotInt: 100

G-Test Result: 25 🎉

A great result. To be balanced, too many of these can be an issue as well - indicating an uncontrolled vulnerability, or precision at the expense of recall. Those can crank up the false negatives, which represent orders of magnitude more expense to the business than burning 20 minutes on what turns out to be nothing.

Fires Rarely, Good Precision

Fi: 10, FiInt: 8, Tot: 200, TotInt: 100

G-Test Result: 2.4 👍

Not bad; it's not going to come up every week, but when it does, it does so with authority. There are, however, hundreds if not thousands of tools, alerts, and hunts in this category, so it's a great point to start culling them down. This is where we want most of the work: choosing from the best of decent options and pulling our attention budget out of the tails. This is the sort of coverage that, fed through an effective security operations team, really raises the bar for threat actors attempting to take actions on objectives.

Fires Rarely, Bad Precision

Fi: 10, FiInt: 1, Tot: 200, TotInt: 100

G-Test Result: -5.1 👎

We're taking a small liberty with the scoring here. The negative score reflects the propensity for literally any other detection and is there for compatibility with the demonstration G-test calculator linked above. The method used in Ted's paper considers only "interesting" or "not relevant" rather than a choice between two samples. Both work out the same in terms of stack ranking. If you're looking to productionize this, reach out and we'll set you up with some dedicated tooling.

Fires Often, Bad Precision

Fi: 80, FiInt: 18, Tot: 200, TotInt: 100

G-Test Result: -41 💀

Get this thing out of here; almost anything could replace it. Considering the Birthday Hunting numbers above, it is very unlikely that this is any better than random.
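
Pulling the four scenarios together, the sketch below stack-ranks them using SciPy's off-the-shelf log-likelihood (G-test) option as a cross-check on the hand-rolled version above, applying a negative sign when the detection's precision trails the baseline, as described earlier. The exact scores depend on how the table is laid out, so they will not match the calculator's numbers above, but the ordering of the four scenarios comes out the same.

```python
# Illustrative stack ranking of the four scenarios above. Assumes Tot/TotInt
# count everything other than the detection in question.
from scipy.stats import chi2_contingency

scenarios = {
    "fires frequently, good precision": dict(fi=100, fi_int=80, tot=200, tot_int=100),
    "fires rarely, good precision":     dict(fi=10,  fi_int=8,  tot=200, tot_int=100),
    "fires rarely, bad precision":      dict(fi=10,  fi_int=1,  tot=200, tot_int=100),
    "fires often, bad precision":       dict(fi=80,  fi_int=18, tot=200, tot_int=100),
}

def signed_g(fi, fi_int, tot, tot_int):
    table = [[fi_int, fi - fi_int],      # this detection: interesting / not
             [tot_int, tot - tot_int]]   # everything else: interesting / not
    g = chi2_contingency(table, correction=False, lambda_="log-likelihood")[0]
    # Negative when this detection is less precise than everything else combined.
    return g if fi_int / fi >= tot_int / tot else -g

for name, counts in sorted(scenarios.items(), key=lambda kv: -signed_g(**kv[1])):
    print(f"{name:34s} G = {signed_g(**counts):7.1f}")
```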

More Complicated Approaches

There are a couple of other lenses we can use here. A notable one, which we'll go into in more depth in a future post, is game-theory based: we consider payoffs for both the attacker and the defender for compromising an asset or not, and for choosing to investigate it or not. Given that we know a few corners of the confusion matrix, giving us false positive rates, this sort of reasoning can let us estimate the false negative rate - how many compromised assets we likely have not found. The extra complexity allows us to take this number head-on rather than leaving it implied by the precision-recall trade-off.
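
As a taste of that lens, and purely as an illustrative toy with made-up payoffs rather than the method for that future post, here is a minimal inspection-game sketch: the attacker chooses to compromise an asset or hold off, the defender chooses to investigate or ignore, and the mixed-strategy equilibrium pins down how often each side should act.

```python
# Illustrative toy only, with hypothetical payoffs.
# Rows: attacker compromises / holds off. Columns: defender investigates / ignores.
attacker_payoff = [[-5.0, 10.0],   # compromise: caught if investigated, wins if ignored
                   [ 0.0,  0.0]]   # hold off: nothing happens either way
defender_payoff = [[ 4.0, -20.0],  # compromise: catch it, or eat an expensive miss
                   [-1.0,   0.0]]  # hold off: investigation still burns analyst time

# The defender investigates just often enough (p) that the attacker is
# indifferent between compromising and holding off...
A = attacker_payoff
p_investigate = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])

# ...and the attacker compromises just often enough (q) that the defender is
# indifferent between investigating and ignoring.
D = defender_payoff
q_compromise = (D[1][1] - D[1][0]) / (D[0][0] - D[1][0] - D[0][1] + D[1][1])

print(f"defender investigates {p_investigate:.0%} of the time, "
      f"attacker compromises {q_compromise:.0%} of the time")
```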

Birthday Hunting Performance and Ranked Unit Performance together tell a defensible tale of coverage and performance

Do we have enough alerts? Which alerts do we need more of? Where should we focus our procurement and engineering efforts next?

With the use of one chart from the Birthday Hunting paper and the judicious application of a well-established statistical test, we can produce concrete answers: to get the budget you need, deliver the return on investment executives want, and build the trust customers deserve.
