
Trend Micro’s Top Ten MITRE Evaluation Considerations

By Trend Micro

The introduction of the MITRE ATT&CK evaluations is a welcome addition to the third-party testing arena. The ATT&CK framework, and the evaluations in particular, have gone a long way in helping advance the security industry as a whole, and the individual security products serving the market.

The insight garnered from these evaluations is incredibly useful. But let’s admit it: for everyone except those steeped in the analysis, it can be hard to understand. The information is valuable, but dense. There are multiple ways to look at the data and even more ways to interpret and present the results (as no doubt you’ve already come to realize after reading all the vendor blogs and industry articles!). We have been looking at the data for the past week since it was published, and still have more to examine over the coming days and weeks.

The more we assess the information, the clearer the story becomes, so we wanted to share with you Trend Micro’s 10 key takeaways for our results:

1. Looking at the results of the first run of the evaluation is important:

  • Trend Micro ranked first in initial overall detection. We are the leader in detections based on initial product configurations. This evaluation enabled vendors to make product adjustments after a first run of the test to boost detection rates on a re-test. The MITRE results show the final results after all product changes. If you assess what the product could detect as originally provided, we had the best detection coverage among the pool of 21 vendors.
  • This is important to consider because product adjustments can vary in significance and may or may not be immediately available in vendors’ current products. We also believe it is easier to do better once you know what the attacker was doing – in the real world, customers don’t get a second try against an attack.
  • Having said that, we too took advantage of the retest opportunity since it allowed us to identify product improvements; but our overall detections were so high that, even after removing those associated with a configuration change, we still ranked first overall.

  • And so no one thinks we are just spinning: without making any exclusions to the data at all, and taking the MITRE results in their entirety, Trend Micro had the second-highest detection rate, with over 91% detection coverage.

2. There is a hierarchy in the type of main detections – Techniques is most significant

  • There is a natural hierarchy in the value of the different types of main detections.
    • A general detection indicates that something was deemed suspicious but it was not assigned to a specific tactic or technique.
    • A detection on tactic means the detection can be attributed to a tactical goal (e.g. credential access).
    • Finally, a detection on technique means the detection can be attributed to a specific adversarial action (e.g. credential dumping).
  • We have strong detection on techniques, which is a better detection measure. With the individual MITRE technique identified, the associated tactic can be determined, as typically, there are only a handful of tactics that would apply to a specific technique. When comparing results, you can see that vendors had lower tactic detections on the whole, demonstrating a general acknowledgement of where the priority should lie.
  • Likewise, the fact that we had lower general detections compared to technique detections is a positive. General detections are typically associated with a signature; as such, this proves that we have a low reliance on AV.
  • It is also important to note that we did well in telemetry which gives security analysts access to the type and depth of visibility they need when looking into detailed attacker activity across assets.


https://attackevals.mitre.org/APT29/detection-categories.html 
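The hierarchy above can be sketched in a few lines of code. This is purely illustrative (it is not MITRE’s scoring logic, and the helper name `best_detection` is my own); it just encodes the idea that a Technique detection outranks a Tactic detection, which outranks a General detection:

```python
# Illustrative sketch only: rank the main detection categories by
# specificity, most specific (and most valuable) first.
DETECTION_RANK = {
    "Technique": 3,  # attributed to a specific adversarial action
    "Tactic": 2,     # attributed to a tactical goal
    "General": 1,    # flagged as suspicious, but not attributed
    "None": 0,       # step missed entirely
}

def best_detection(detections):
    """Return the most specific detection category observed for a test step."""
    return max(detections, key=lambda d: DETECTION_RANK.get(d, 0), default="None")

print(best_detection(["General", "Technique"]))  # -> Technique
```

Under this reading, a product with many Technique detections and comparatively few General detections is showing specificity rather than signature noise, which is the point made above.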

3. More alerts does not equal better alerting – quite the opposite

  • At first glance, some may expect a product to have the same number of alerts as detections. But not all detections are created equal, and not everything should trigger an alert (remember, these detections are for low-level attack steps, not for separate attacks).
  • Too many alerts can lead to alert fatigue and add to the difficulty of sorting through the noise to what is most important.
  • When you consider the alerts associated with our higher-fidelity detections (e.g. detection on technique), you can see that the results show that Trend Micro did very well at reducing the noise of all of the detections into a minimal volume of meaningful/actionable alerts.

4. Managed Service detections are not exclusive

  • Our MDR analysts contributed to the “delayed detection” category. This is where the detection involved human action and may not have been initiated automatically.
  • Our results show the strength of our MDR service as one avenue for detection and enrichment. If an MDR service was included in this evaluation, we believe you would want to see it provide good coverage, as it demonstrates that the team is able to detect based on the telemetry collected.
  • What is important to note though is that the numbers for the delayed detection don’t necessarily mean it was the only way a detection was/could be made; the same detection could be identified by other means. There are overlaps between detection categories.
  • Our detection coverage results would have remained strong without this human involvement – approximately 86% detection coverage (with MDR, it boosted it up to 91%).

5. Let’s not forget about the effectiveness and need for blocking!

  • This MITRE evaluation did not test a product’s ability to block/protect against an attack; rather, it looked exclusively at how effective a product is at detecting an event that has happened, so there is no measure of prevention efficacy included.
  • This is significant for Trend, as our philosophy is to block and prevent as much as you can so customers have less to clean up/mitigate.

6. We need to look through more than the Windows

  • This evaluation looked at Windows endpoints and servers only; it did not look at Linux for example, where of course Trend has a great deal of strength in capability.
  • We look forward to the expansion of the operating systems in scope. MITRE has already announced that the next round will include a Linux system.

7. The evaluation shows where our product is going

  • We believe the first priority for this evaluation is the main detections (for example, detecting on techniques as discussed above). Correlation falls into the modifier detection category, which looks at what happens above and beyond an initial detection.
  • We are happy with our main detections, and see great opportunity to boost our correlation capabilities with Trend Micro XDR, which we have been investing in heavily and is at the core of the capabilities we will be delivering in product to customers as of late June 2020.
  • This evaluation did not assess our correlation across email security, so there is correlation value we can deliver to customers beyond what is represented here.

8. This evaluation is helping us make our product better

  • The insight this evaluation has provided has been invaluable; it has helped us identify areas for improvement, and we have initiated product updates as a result.
  • As well, having a product with a “detection only” mode option helps augment SOC intel, so our participation in this evaluation has enabled us to make our product even more flexible to configure, and therefore a more powerful tool for the SOC.
  • While some vendors try to use it against us, our extra detections after config change show that we can adapt to the changing threat landscape quickly when needed.

9. MITRE is more than the evaluation

  • While the evaluation is important, it is equally important to recognize MITRE ATT&CK as a knowledge base that the security industry can both align with and contribute to.
  • Having a common language and framework to better explain how adversaries behave, what they are trying to do, and how they are trying to do it, makes the entire industry more powerful.
  • Among the many things we do with or around MITRE, Trend has contributed, and continues to contribute, new techniques to the framework matrices, and is leveraging ATT&CK within our products as a common language for alerts and detection descriptions, and for search parameters.

10. It is hard not to get confused by the FUD!

  • MITRE does not score, rank or provide side by side comparison of products, so unlike other tests or industry analyst reports, there is no set of “leaders” identified.
  • As this evaluation assesses multiple factors, there are many different ways to view, interpret and present the results (as we did here in this blog).
  • It is important that individual organizations understand the framework, the evaluation, and most importantly what their own priorities and needs are, as this is the only way to map the results to the individual use cases.
  • Look to your vendors to help explain the results, in the context that makes sense for you. It should be our responsibility to help educate, not exploit.


Getting ATT&CKed By A Cozy Bear And Being Really Happy About It: What MITRE Evaluations Are, and How To Read Them

By Greg Young (Vice President for Cybersecurity)

Full disclosure: I am a security product testing nerd*.

 

I’ve been following the MITRE ATT&CK Framework for a while, and this week the results were released of the most recent evaluation using APT29, otherwise known as COZY BEAR.

First, here’s a snapshot of the Trend eval results as I understand them (rounded down):

91.79% on overall detection.  That’s in the top 2 of 21.

91.04% without config changes.  The test allows for config changes after the start – that wasn’t required to achieve the high overall results.

107 Telemetry.  That’s very high.  Capturing events is good.  Not capturing them is not-good.

28 Alerts.  That’s in the middle, where it should be.  Not too noisy, not too quiet.  Telemetry I feel is critical, whereas alerting is configurable; you can only alert on what you detect and capture.
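The two headline percentages above line up neatly if you assume (my assumption, consistent with the step counts discussed later in this post) that they are detected sub-steps over the evaluation’s 134 total sub-steps:

```python
# Back-of-envelope check: treating the headline percentages as
# detected sub-steps over the 134 total sub-steps in the APT29 eval.
TOTAL_STEPS = 134

def coverage(detected, total=TOTAL_STEPS):
    """Detection coverage as a percentage, rounded to two decimals."""
    return round(100 * detected / total, 2)

print(coverage(123))  # 91.79 -> matches the overall detection figure
print(coverage(122))  # 91.04 -> matches the no-config-change figure
```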

 

So our Apex One product ran into a mean and ruthless bear and came away healthy.  But that summary is a simplification and doesn’t capture all the nuance to the testing.  Below are my takeaways for you of what the MITRE ATT&CK Framework is, and how to go about interpreting the results.

 

Takeaway #1 – ATT&CK is Scenario Based

The MITRE ATT&CK Framework is intriguing to me as it mixes real-world attack methods by specific adversaries with a model for detection for use by SOCs and product makers.  The ATT&CK Framework Evaluations do this, but in a lab environment, to assess how security products would likely handle an attack by that adversary and their usual methods.  There had always been a clear divide between pen testing and lab testing, and ATT&CK is kind of mixing both.  COZY BEAR is super interesting because those attacks were widely known for being quite sophisticated and state-sponsored, and for targeting the White House and US Democratic Party.  COZY BEAR and its family of derivatives use backdoors, droppers, obfuscation, and careful exfiltration.

 

Takeaway #2 – Look At All The Threat Group Evals For The Best Picture

I see the tradeoffs as ATT&CK evals are only looking at that one scenario, but that scenario is very reality-based, and with enough evals across enough scenarios there is a narrative to better understand a product.  Trend did great on the most recently released APT29/COZY BEAR evaluation, but my point is that a product is only as good as all its evaluations.  I always advised Magic Quadrant or NSS Value Map readers to look at older versions in order to paint a picture over time of what trajectory a product had.

 

Takeaway #3 – It’s Detection Focused (Only)

The APT29 test, like most ATT&CK evals, is testing detection, not prevention or other parts of products (e.g. support).  The downside is that a product’s ability to block the attacks isn’t evaluated, at least not yet.  In fact, blocking functions have to be disabled for parts of the test to be done.  I get that – you can’t test the upstairs alarm with the attack dog roaming the downstairs.  Starting with poor detection never ends well, so the test methodology seems to be founded on “if you can detect it you can block it”.  Some pen tests are criticized that a specific scenario isn’t realistic because A would stop it before B could ever occur.  IPS signature writers everywhere should nod in agreement on that one.  I support MITRE on how they constructed the methodology, because there have to be limitations and scope on every lab test, but readers too need to understand those limitations and scopes.  I believe that the next round of tests will include protection (blocking) as well, so that is cool.

 

Takeaway #4 – Choose Your Own Weather Forecast

ATT&CK is no magazine-style review.  There is no final grade or comparison of products.  To fully embrace ATT&CK, imagine being provided dozens of very sound yet complex meteorological measurements and being left to decide what the weather will be.  Or have vendors carpet-bomb you with press releases of their interpretations.  I’ve been deep into the numbers of the latest eval scores, and when looking at some of the blogs and press releases out there, they almost had me convinced they did well even when the data at hand showed they didn’t.  I guess a less jaded view is that the results can be interpreted in many ways, some of them quite creative.  It brings to mind the great quote from the Lockpicking Lawyer review: “the threat model does not include an attacker with a screwdriver”.

 

Josh Zelonis at Forrester provides a great example of the level of work required to parse the test outcomes, and he offers extended analysis on GitHub here that is easier on the eyes than the above.  Even that great work product requires the context of what the categories mean.  I understand that MITRE is taking the stance of “we do the tests, you interpret the data” in order to pick fewer fights and accommodate different use cases and SOC workflows, but that is a lot to put on buyers.  I repeat: there’s a lot of nuance in the terms and test report categories.

 

In the absence of Josh’s work, if I have to pick one metric, Detection Rate is likely the best one.  Note that Detection Rate isn’t 100% for any product in the APT29 test, because of the meaning of that metric.  The best secondary metrics I like are Techniques and Telemetry.  Tactics sounds like a good thing, but in the framework it is lesser than Techniques, as Tactics are generalized bad things (“Something moving outside!”) and Techniques are more specific detections (“Healthy adult male lion seen outside door”), so a higher score in Techniques combined with a low score in Tactics is a good thing.  Alert scoring is, to me, best right in the middle.  Not too many alerts (noisy/fatiguing) and not too few (“about that lion I saw 5 minutes ago”).

 

Here’s an example of the interpretations that are valuable to me.  Looking at the Trend Micro eval source page here, I get info on detections in the steps, or how many of the 134 total steps in the test were detected.  I’ll start by excluding any human involvement: I exclude the MSSP detections and look at unassisted detections only.  But the numbers are spread across all 20 test steps, so I’ll use Josh’s spreadsheet, which shows 115 of 134 steps visible, or 85.82%.  I do some averaging on the visibility scores across all the products evaluated and that is 66.63%, which is almost 30% less.  Besides the lesson that the data needs gathering and interpretation, it highlights that no product spotted 100% across all steps and the spread was wide.  I’ll now look at the impact of human involvement and add in the MSSP detections: the Trend number goes to 91%.  Much clinking of glasses heard from the endpoint dev team.  But if I’m not using an MSSP service that… you see my point about context/use-case/workflow.  There’s effectively some double counting of the MSSP factor (i.e. a penalty, so that removing MSSP inordinately drops the detection rate), but I’ll leave that to a future post.  There’s no shortage of fodder for security testing nerds.
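The arithmetic in that walkthrough is easy to reproduce.  A small sketch (the 115 and 134 figures come from the analysis above; the other vendor values in the list are placeholders, not the real spreadsheet data):

```python
# Reproducing the visibility arithmetic: steps visible out of the
# 134 total sub-steps in the APT29 evaluation.
TOTAL_STEPS = 134

def visibility_pct(steps_visible):
    """Percentage of test sub-steps visible, rounded to two decimals."""
    return round(100 * steps_visible / TOTAL_STEPS, 2)

print(visibility_pct(115))  # 85.82 -> unassisted, MSSP detections excluded

# Averaging visibility across vendors would look like this
# (values below are hypothetical placeholders, except Trend's 115):
vendor_steps = [115, 90, 80]
average_visibility = sum(map(visibility_pct, vendor_steps)) / len(vendor_steps)
```

The same helper makes the point about spread: with real per-vendor step counts plugged in, the gap between the best product and the average falls straight out of the data.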

 

Takeaway #5 – Data Is Always Good

Security test nerdery aside, this eval is a great thing and the data from it is very valuable.  Having this kind of evaluation makes security products and the uses we put them to better.  So dig into ATT&CK and read it considering not just product evaluations but how your organization’s framework for detecting and processing attacks maps to the various threat campaigns. We’ll no doubt have more posts on APT29 and upcoming evals.

 

*I was a Common Criteria tester in a place that also ran a FIPS 140-2 lab.  Did you know that at Level 4 of FIPS a freezer is used as an exploit attempt?  I even dipped my toe into the arcane area of Formal Methods using the GYPSY methodology and ran from it screaming “X just equals X!  We don’t need to prove that!”.  The deepest testing rathole I can recall was doing a portability test of the Orange Book B1 rating for MVS RACF when using logical partitions.  I’m never getting those months of my life back.  I’ve been pretty active in interacting with most security testing labs, like NSS and ICSA, and their schemes (that’s not a pejorative, but testing nerds like to use British usages to sound more learned) for decades, because I thought it was important to understand the scope and limits of testing before accepting it in any product buying decisions.  If you want to make Common Criteria nerds laugh, point out something bad that has happened and just say “that’s not bad, it was just mistakenly put in scope”; that will then upset the FIPS testers, because a crypto boundary is a very real thing and not something real testers joke about.  And yes, Common Criteria is the MySpace of tests.

