Evaluating security desperately needs innovation!

Years ago, I read an article about self-deception and self-inflation, and here is one interesting (and funny) stat mentioned in it: “if you ask professors (talking about academics) whether they are in the top half of their profession, 94 percent say they are.” We are all self-deceivers; we evolved that way (apparently). That stuck in my mind, and it helps me keep an open mind. A lot of what we think and believe is just not true! How much? It varies widely, I’d say.

Now, to our subject of the day: testing security. It is one of the most vital and challenging issues in cybersecurity, and we’re only just starting to get it right! It has been mainly manual, or semi-automated at best. Pen testing and/or red teaming have long been part of testing, and still are, but they generally answer the question ‘can an attacker get in?’ The more interesting question, ‘to what extent do any of my controls work?’, is a little different, and a bit more complicated to answer as well. In recent years there have been new trends in testing, with BAS (Breach and Attack Simulation) or testing based on MITRE ATT&CK, which attempt to better answer that question. So the focus here is on ‘testing security controls’, which is one aspect of testing security as a whole.

The ‘old’ way

Since the emergence of the AV industry in the 90s, testing vendors have also been making their appearance. Their methods have evolved over time, but their assessments (and certifications) usually involve a set of in-the-wild malware samples and, in some cases, simulated samples. They then report on the usual metrics: true positives/false negatives (detections/misses) and false positives (flagging clean files as malicious). This is the most basic kind of test, and most testing vendors are also expanding their coverage to include ‘enhanced and real-world advanced threats’, which usually means ‘we’re testing against the ATT&CK matrix’.
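To make those metrics concrete, here is a minimal sketch of how a single test run reduces to a detection rate and a false-positive rate; the verdicts in it are invented for illustration and don’t come from any real test:

```python
# Minimal sketch: reducing a malware test run to the usual metrics.
# The (is_malicious, was_flagged) verdicts below are invented for illustration.

def test_metrics(results):
    """results: list of (is_malicious, was_flagged) pairs, one per sample."""
    tp = sum(1 for mal, flagged in results if mal and flagged)      # detections
    fn = sum(1 for mal, flagged in results if mal and not flagged)  # misses
    fp = sum(1 for mal, flagged in results if not mal and flagged)  # clean files flagged
    tn = sum(1 for mal, flagged in results if not mal and not flagged)
    detection_rate = tp / (tp + fn) if (tp + fn) else None
    false_positive_rate = fp / (fp + tn) if (fp + tn) else None
    return detection_rate, false_positive_rate

# Hypothetical run: 4 malware samples (one missed) and 2 clean files (one flagged).
run = [(True, True), (True, True), (True, True), (True, False),
       (False, False), (False, True)]
print(test_metrics(run))  # (0.75, 0.5)
```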

So I have looked at public data produced by a number of these testing vendors, and this is what the protection rates look like. The interval represents the minimum and maximum recorded over the years (i.e., across all of their tests involving that specific security vendor).
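As a point of reference for how the intervals in the table below are meant to be read, here is a small sketch (with made-up product names and scores) of collapsing a product’s per-test protection rates into a [min, max] pair:

```python
# Sketch of the aggregation behind the table below: for each product, keep the
# minimum and maximum protection rate observed across a testing vendor's tests.
# Product names and scores here are made up for illustration.
from collections import defaultdict

scores = [
    ("Product A", 0.93), ("Product A", 0.99), ("Product A", 1.0),
    ("Product B", 0.97), ("Product B", 0.99),
]

intervals = defaultdict(lambda: [1.0, 0.0])
for product, rate in scores:
    lo, hi = intervals[product]
    intervals[product] = [min(lo, rate), max(hi, rate)]

for product, (lo, hi) in intervals.items():
    print(f"{product} [{lo}, {hi}]")  # e.g. Product A [0.93, 1.0]
```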

Testing vendor: Products and (historical) protection rates

av-comparatives
Avast Business Antivirus Pro Plus [.93, 1]
AVG CloudCare [.88, 1]
Avira Endpoint Security [.95, 1]
Bitdefender Endpoint Security Elite (GravityZone Elite HD) [.9, 1]
Cisco AMP for Endpoints [.97, .99]
CrowdStrike Falcon Endpoint Protection [.95, .99]
Emsisoft Anti-Malware [.92, .99]
Endgame Protection Platform [.97, .99]
eScan 360 [.89, 1]
ESET Endpoint Protection Advanced Cloud [.94, .99]
F-Secure Safe [.93, 1]
FireEye Endpoint Security [.86, .99]
Fortinet FortiClient [.9, 1]
G DATA Business Security [.94, 1]
K7 Enterprise Security [.9, .99]
Kaspersky Endpoint Security [.96, 1]
McAfee Endpoint Security [.86, 1]
Panda Endpoint Protection Plus [.93, 1]
Saint Security Max [.76, .91]
Seqrite Endpoint Security [.98, .98]
Sophos Intercept X Advanced [.9, .99]
SparkCognition DeepArmor Endpoint Protection Platform [.98, .99]
Symantec Endpoint Protection [.97, 1]
Trend Micro Office Scan XG [.92, 1]
VIPRE Endpoint Security Cloud [.98, 1]
Webroot SecureAnywhere [.86, .97]

Note: some products have been tested for many more years than others. These scores come from multiple types of tests, including what they call real-world tests (in reality, payloads generated using known frameworks such as Metasploit, Empire, Unicorn, etc., mainly fileless and PowerShell based) and file-based malware threats.
av-test
Avast Business Antivirus Pro Plus [.97, 1]
Bitdefender Endpoint Security [.98, 1]
Check Point Endpoint Security [1, 1]
ESET Endpoint Security [.97, 1]
F-Secure PSB Computer Protection [1, 1] 
G Data AntiVirus Business [.97, 1]
Kaspersky Endpoint Security [1, 1]
McAfee Endpoint Security [.96, 1]
Microsoft Windows Defender Antivirus [.96, 1]
Seqrite Endpoint Security [.95, 1]
Sophos Intercept X Advanced [.95, 1]
Symantec Endpoint Protection [1, 1]
Trend Micro Apex One [.98, 1]
VMware Carbon Black Cloud [.95, 1]

Note: av-test gives the industry average for each test, which can be useful as a point of reference for non-tested products. Check Point, Kaspersky, and Symantec all have a perfect score in every test going back two years!
SE Labs
Kaspersky Endpoint Security [.93, 1]
Symantec Endpoint Security [.98, 1]
Sophos Intercept X [.96, 1]
ESET Endpoint Security [.95, .97]
Trend Micro OfficeScan, Intrusion Defense Firewall [.93, .98]
CrowdStrike Falcon [.85, .96]
McAfee Endpoint Security [.9, 1]
SentinelOne Endpoint Protection [.95, .95] 
Microsoft Windows Defender Enterprise [.96, .98]
Bitdefender Gravity Zone Endpoint Security [.92, .97]
VIPRE Endpoint Security [.91, .92]
Webroot SecureAnywhere Endpoint Protection [.79, .79]

Note: SE Labs runs quarterly tests for endpoint solutions, and not every vendor is included in every test (which is common across most testing vendors). Not all results are available for free; only the endpoint security reports are available on the website (email, network, etc. are not). They also use publicly available tools for what they call targeted attacks, adding that ‘it is often possible to evade detection using various common techniques’, without much additional information.
NSS Labs
Not available. Requires a subscription.
MRG Effitas
Avast Business Antivirus [.92, 1]
Avira Antivirus Pro [.5, .78]
Bitdefender Endpoint Security [.99, 1]
CrowdStrike Falcon Protect [.88, .91]
ESET Endpoint Security [.89, 1]
F-Secure Computer Protection Premium [.78, 1]
Kaspersky Small Office Security [.95, 1]
McAfee Endpoint Security [.96, 1]
Microsoft Windows Defender [.88, .95]
Sophos Intercept X [.81, 1]
Symantec Endpoint Protection Cloud [.94, 1]
Trend Micro Worry-Free Business Security [.75, 1]

Note: they provide more details on their test cases. The results are a bit more nuanced, e.g., auto block, behavioral, signature, blocked within 24h.

What you have certainly noticed is that the overall average is in the mid to high 90s, and for some well-known vendors a perfect score across all tests is quite common. How realistic is this? I’m not questioning the validity of these results, just how useful they are in reflecting the reality of things. I believe they are neither realistic nor necessarily helpful. The question is: how many unknown-unknowns, or even novel unknown-knowns, are used in there?

I know most heuristics and machine learning models deployed by security vendors are designed to detect variants and similar threats (those exhibiting similar patterns and/or behavior). But I don’t believe those protection rates reflect a real effort to mimic the threat landscape and the evasion efforts out there (even using available tools, something like venom for instance). Across all the tests I’ve come across, I have seen evasion or obfuscation mentioned maybe once, and even then it wasn’t actually used. This, I believe, is a huge gap in this area.
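To illustrate why tests built on known samples say little about evasion, consider the crudest signature of all, a file hash: a single appended byte is enough to break the match, and that is exactly the kind of trivial variation heuristics and models are supposed to absorb. A toy sketch, using placeholder bytes and not representative of how any real product detects things:

```python
# Toy illustration: exact-match signatures say nothing about variants.
# The "sample" is just placeholder bytes, not real malware.
import hashlib

sample = b"pretend this is a malware sample"
variant = sample + b"\x00"  # trivially modified "variant"

known_hashes = {hashlib.sha256(sample).hexdigest()}  # stand-in signature database

def hash_match(data: bytes) -> bool:
    return hashlib.sha256(data).hexdigest() in known_hashes

print(hash_match(sample))   # True  - the exact sample is "detected"
print(hash_match(variant))  # False - one byte of change and the signature is blind
```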

Note that since mid-2018 an organization called AMTSO (the Anti-Malware Testing Standards Organization) has adopted a testing protocol standard that provides a testing framework and behavior expectations for testers and vendors in cybersecurity. This is supposed to increase the transparency of these tests.

A new trend

Since its introduction, MITRE ATT&CK has been widely adopted by cybersecurity teams to understand adversary behavior and to advance and test defensive capabilities. It is particularly helpful for visualizing defensive coverage and assessing the detection of documented techniques, and it has been shaping the way testing is done.
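For instance, a crude coverage view can be produced from nothing more than a mapping of your detections to technique IDs; a minimal sketch, where both the technique list and the ‘covered’ set are hypothetical and far from complete:

```python
# Sketch: a crude ATT&CK coverage view from a (hypothetical) mapping of
# deployed detections to technique IDs. The lists are illustrative, not complete.
techniques_in_scope = {
    "T1059": "Command and Scripting Interpreter",
    "T1566": "Phishing",
    "T1055": "Process Injection",
    "T1003": "OS Credential Dumping",
}

covered = {"T1059", "T1566"}  # hypothetical: techniques your analytics claim to detect

for tid, name in techniques_in_scope.items():
    status = "covered" if tid in covered else "gap"
    print(f"{tid} {name}: {status}")

coverage = len(covered & techniques_in_scope.keys()) / len(techniques_in_scope)
print(f"coverage: {coverage:.0%}")  # 50%
```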

An interesting observation here is the gap between the evaluations mentioned above (mainly based on malware samples) and the few MITRE ATT&CK based evaluations, such as those reported by MITRE and Mandiant, where the detected/prevented portion is far below the rates reported above (and not by a small amount!). In Mandiant’s report, for instance, false negatives averaged 56%, meaning controls failed to either detect or prevent more than half of the attacks and techniques being tested. Compared to the earlier evaluations, this certainly points to a major issue in generating reliable measures and evidence of the effectiveness of controls.
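For clarity on how such an aggregate figure falls out of per-technique outcomes, here is a rough sketch; the outcomes are invented, and real reports distinguish more categories (e.g. alerted vs. telemetry-only detections):

```python
# Sketch: an aggregate miss (false-negative) rate from per-technique outcomes
# in an ATT&CK-style evaluation. The outcomes below are invented.
from collections import Counter

outcomes = ["prevented", "detected", "missed", "missed", "detected",
            "missed", "prevented", "missed", "detected"]

counts = Counter(outcomes)
miss_rate = counts["missed"] / len(outcomes)
print(dict(counts))                   # {'prevented': 2, 'detected': 3, 'missed': 4}
print(f"miss rate: {miss_rate:.0%}")  # 44%
```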

On the BAS side, I unfortunately could not find public evaluation data. But this is certainly another important emerging trend in testing security controls, mainly based on simulated tests it seems (a toy sketch of the underlying idea follows the vendor list below). This, again, raises the question of the effectiveness of the simulation itself: how realistic are those simulations? What if the simulated part doesn’t overlap with what the control is looking for in order to detect/prevent? Which, at the end of the day, comes back to how reliably these simulations assess the effectiveness of controls (let alone how effective they are at reliably simulating a threat). I took a quick look at some vendors:

Picus Security
Provides security scores (similar to the above scores from testing vendors) and remediation/mitigation suggestions.
AttackIQ
Provides overall control effectiveness, and per-category effectiveness, based on a set of simulated tests. A good feature here seems to be the ability to write new tests.
Cymulate
Seems to separate testing by attack vector (email, web) or across the full kill chain. SaaS-based; also provides security scores and remediation/mitigation suggestions.
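At their core, these tools run a benign stand-in for a technique and then check whether the control under test produced any detection or block for it. A toy sketch of that loop; run_benign_simulation and query_alerts are hypothetical stubs, not the API of any real BAS or EDR product:

```python
# Toy sketch of a BAS-style check: execute a benign stand-in for a technique,
# then ask the control's alert log whether anything fired for it.
import time

def run_benign_simulation(technique_id: str) -> str:
    """Hypothetical stub: pretend to run a harmless simulation, return a correlation ID."""
    return f"sim-{technique_id}"

def query_alerts(marker: str) -> list:
    """Hypothetical stub: pretend to query the control's alerts for that correlation ID."""
    return []  # with this stub, the control never fires

def assess_technique(technique_id: str, timeout_s: int = 10) -> str:
    marker = run_benign_simulation(technique_id)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        alerts = query_alerts(marker)
        if any(a.get("action") == "blocked" for a in alerts):
            return "prevented"
        if alerts:
            return "detected"
        time.sleep(1)
    # Nothing fired within the window: a coverage gap for this simulation,
    # which is not the same thing as proof the control is ineffective.
    return "missed"

print(assess_technique("T1059"))  # "missed" with these stubs
```

The catch raised above is visible right here: the verdict is only as meaningful as the simulation and the alert query are faithful to a real attack and a real detection pipeline.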

Finally, there are (and will always be) many unknowns about how controls work. It is also hard to interpret security controls’ performance in general, especially when it comes to behavior-based protection and the variety of detection types (which cannot be reliably interpreted outside the context of an environment). A better understanding of the methods and technologies powering controls may also be necessary, but that isn’t trivial.

All of these aspects make this area very challenging. And while MITRE ATT&CK based evaluations are a step in the right direction, there are still gaps, especially when it comes to leveraging evasion techniques. Testing based on existing samples (or even variants) has already become obsolete IMHO; learning from the threat landscape and from manual testing, and maybe generating machine learning models for attack campaigns, should be the future. That would allow us to both model and reflect the uncertainty around controls and their performance.
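One way to ‘model and reflect the uncertainty’ rather than publish a single protection rate would be to treat each detection or miss as evidence and keep a distribution over a control’s effectiveness. A minimal Beta-Binomial sketch, my own choice of illustration rather than anything the evaluations above actually publish, with invented counts:

```python
# Sketch: represent a control's effectiveness as a Beta(a, b) posterior updated
# by observed detections and misses, instead of a single protection-rate number.
# Prior and counts are invented for illustration.
import math

def beta_posterior(detections: int, misses: int, prior_a: float = 1.0, prior_b: float = 1.0):
    a, b = prior_a + detections, prior_b + misses
    mean = a / (a + b)                                     # point estimate
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # how sure we are about it
    return round(mean, 3), round(std, 3)

# Similar point estimates, very different amounts of evidence behind them:
print(beta_posterior(95, 5))  # (0.941, 0.023) - 100 test cases
print(beta_posterior(19, 1))  # (0.909, 0.06)  - only 20 test cases
```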

1 Comment

  1. Alejandro

    Hi Lamine, the above also reminds me of Goodhart’s law (https://miro.medium.com/max/875/0*Cd1A6cffoYPKHjE_.jpg). Sec. vendors do a lot of prep around AVtest, MITRE, etc., which do not have concept drift in mind… which IMO should be the real challenge to tackle and measure here. Personally, I would like to see multidimensional approaches when evaluating sec. products that would not only be based on scoring the ability to detect past/known attacks and variants, but also (just to give an example) red team activity that may not necessarily fit within MITRE, or modelling risky user behaviour separately.

    Another topic of discussion is whether humans in the detection loop should be allowed when performing these evaluations. Theoretically they are part of the security product, but they can provide an artificial edge in a test which may not correlate with in-field efficacy.
