The pros, cons and limitations of AI and machine learning in antivirus software
When it comes to antivirus software, some vendors are hailing machine learning as the silver bullet to malware — but how much truth is there to these claims?
In today’s post, we’re going to take a look at how machine learning is used in antivirus software and whether it really is the perfect security solution.
How does machine learning work?
In the antivirus industry, machine learning is typically used to improve a product’s detection capabilities. Whereas conventional detection technology relies on coding rules for detecting malicious patterns, machine learning algorithms build a mathematical model based on sample data to predict whether a file is “good” or “bad”.
In simple terms, this involves using an algorithm to analyze the observable data points of two manually created data sets: one that includes only malicious files, and one that includes only non-malicious files.
The algorithm then develops rules that allow it to distinguish the good files from the bad, without being given any direction about what kinds of patterns or data points to look for. A data point is any unit of information related to a file, including the internal structure of a file, the compiler that was used, text resources compiled into the file and much more.
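To make the idea of a data point more concrete, here is a minimal sketch of turning a file's raw bytes into a handful of numeric features. The function name extract_data_points and the four features chosen here are assumptions made for illustration; real products extract far more, and far more sophisticated, data points.

```python
import math
import re
from collections import Counter


def extract_data_points(data: bytes) -> dict:
    """Derive a few simple, illustrative data points from a file's bytes."""
    counts = Counter(data)
    # Shannon entropy in bits per byte; packed or encrypted files tend to score high.
    entropy = (
        -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())
        if data
        else 0.0
    )
    # Runs of four or more printable ASCII characters, similar to the `strings` tool.
    strings = re.findall(rb"[\x20-\x7e]{4,}", data)
    return {
        "size": len(data),
        "entropy": entropy,
        "starts_with_mz": data[:2] == b"MZ",  # header bytes of Windows executables
        "string_count": len(strings),
    }


sample = b"MZ" + b"\x00" * 10 + b"hello world" + b"\x01\x02"
print(extract_data_points(sample))
```

Each file becomes one row of such values, and the learning algorithm only ever sees these rows, never the file itself.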
The algorithm continues to calculate and optimize its model until it ends up with a precise detection system that (ideally) doesn’t classify any good programs as bad or any bad programs as good. It develops its model by changing the weight or importance of each data point. With each iteration, the model gets slightly better at accurately detecting malicious and non-malicious files.
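The weight-adjusting loop described above can be sketched with a tiny logistic regression trained by gradient descent. This is only one of many possible algorithms, chosen here for brevity; the three features and the eight sample files are invented for illustration and do not come from any real product.

```python
import math

# Hypothetical data points per file:
# [is_packed, has_suspicious_imports, has_valid_signature]
MALICIOUS = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
CLEAN = [[0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]


def predict(weights, bias, features):
    """Return the model's estimated probability that a file is malicious."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train(samples, labels, iterations=2000, learning_rate=0.5):
    """Nudge each weight a little on every pass to reduce the prediction error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(iterations):
        for features, label in zip(samples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(MALICIOUS + CLEAN, [1] * len(MALICIOUS) + [0] * len(CLEAN))
print("learned weights:", weights, "bias:", bias)
```

Nobody tells the model that a valid signature is a good sign or that packing is a bad one. The weights simply drift in whichever direction reduces the error on the two data sets.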
Machine learning can help detect new malware
Machine learning helps antivirus software detect new threats without relying on signatures. In the past, antivirus software relied largely on fingerprinting, which works by cross-referencing files against a huge database of known malware.
The major flaw here is that signature checkers can only detect malware that has been seen before. That’s a rather large blind spot, given that hundreds of thousands of new malware variants are created every single day.
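The blind spot is easy to see in a deliberately simplified sketch of fingerprinting that uses whole-file SHA-256 hashes. Real signature engines also match byte patterns and other traits, so treat this as an assumption-laden illustration; the names SIGNATURES and is_known_malware are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Compute a SHA-256 fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for a previously analyzed malware sample.
known_sample = b"pretend this is a known malware binary"

# Hypothetical signature database: fingerprints of malware seen before.
SIGNATURES = {fingerprint(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its exact fingerprint is already in the database."""
    return fingerprint(data) in SIGNATURES


# A new variant that differs from the known sample by a single byte.
variant = known_sample + b"\x00"

print(is_known_malware(known_sample))
print(is_known_malware(variant))
```

Changing a single byte produces a completely different fingerprint, so the variant goes undetected until someone adds it to the database.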
Machine learning, on the other hand, can be trained to recognize the signs of good and bad files, enabling it to identify malicious patterns and detect malware regardless of whether it has been seen before.
The limitations of machine learning
While machine learning can be a very effective tool, the technology does have its limitations.
Potential for exploitation
One of the key weaknesses of machine learning is that it doesn’t understand the implications of the model it creates. It simply uses the most mathematically efficient method it can find to process data and make decisions.
As noted earlier, the algorithm is fed with millions of data points but without anyone specifically telling it which data points are indicators for malware. That’s up to the machine learning model to discover on its own.
The upshot of this is that no human can ever really know which data points might, according to the machine learning model, indicate a threat. It could be a single data point, or a specific combination of 20 data points. A motivated attacker could potentially discover how the model uses these parameters to identify a threat and use that knowledge to their advantage. Changing one specific, seemingly irrelevant data point in a malicious file could be enough to trick the model into classifying malware as safe and undermine the whole model.
To rectify the issue, the vendor would have to add the manipulated file to the data set and recalculate the entire model, which could take days or weeks. Unfortunately, this still wouldn’t fix the underlying problem – even after the model was rebuilt, it would just be a matter of time until the attacker found another data point or combination of data points that could be used to fool the machine learning system.
That’s exactly what happened in July 2019, when researchers at Skylight Cyber discovered that a popular AI-based security product had whitelisted certain files to avoid triggering false positives. The strings of code in these whitelisted files were given a lot of weight in the algorithm’s scoring system, which meant they were almost guaranteed to override the algorithm’s natural decision-making process. When the model encountered the code contained in the whitelisted files, it flagged the file as safe – even if it was embedded in an otherwise malicious file. As a result, the researchers were able to undermine the algorithm by simply taking strings of code from a non-malicious whitelisted gaming file and attaching them to a malicious file.
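The mechanics of this kind of bypass can be illustrated with a toy scoring model. The weights, strings and threshold below are invented for the example and are not taken from the product the researchers tested; the sketch only shows how one heavily weighted "safe" feature can outvote several suspicious ones.

```python
# Hypothetical weights a trained model might have learned.
# Positive values push a file towards "malicious", negative towards "safe".
WEIGHTS = {
    "CreateRemoteThread": 2.5,
    "encrypt_user_files": 3.0,
    "GameEngine v2 licensed build": -10.0,  # heavily weighted whitelisted string
}

# Files scoring at or above this value are flagged as malicious.
THRESHOLD = 1.0


def score(file_strings):
    """Sum the weights of every distinct known string found in the file."""
    return sum(WEIGHTS.get(s, 0.0) for s in set(file_strings))


def is_flagged(file_strings):
    return score(file_strings) >= THRESHOLD


malware = ["CreateRemoteThread", "encrypt_user_files"]

# The attacker appends a string copied from a whitelisted game file.
padded_malware = malware + ["GameEngine v2 licensed build"]

print(is_flagged(malware))
print(is_flagged(padded_malware))
```

The malicious behavior of the padded file is unchanged; only its score is.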
As the researchers noted, this type of attack would not have been possible if the product used additional protection technologies such as a signature scanner, which doesn’t rely on a machine learning model, or heuristics, which detect threats based on behavior rather than a file’s parameters.
Requires a large, well-labeled dataset
Machine learning systems are only as good as the data they are supplied with. Training an effective model requires an enormous number of data inputs, each of which needs to be correctly labeled. These labels help the model understand certain characteristics about the data (e.g. whether a file is clean, malicious or potentially unwanted).
However, the model’s ability to learn effectively depends on the dataset being perfectly labeled, which can be difficult and resource-intensive to achieve. A single mislabeled input among millions of perfectly labeled data points may not sound like a big deal, but if the model uses the mislabeled input to form a decision, it can result in errors that are then used as the basis for future learning. This creates a snowball effect that can have significant repercussions further down the line.
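A small sketch shows how one bad label can turn into a wrong verdict. It assumes a nearest-neighbor classifier, chosen because it makes the effect easy to see, and uses made-up two-dimensional feature values. How much damage a single labeling error does in practice depends on the model and the data.

```python
def nearest_label(training, point):
    """Return the label of the training sample closest to `point`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    closest = min(training, key=lambda item: squared_distance(item[0], point))
    return closest[1]


# Hypothetical training set: (feature vector, label).
correct_labels = [
    ((0.1, 0.2), "clean"),
    ((0.2, 0.1), "clean"),
    ((0.9, 0.9), "malicious"),
    ((0.8, 1.0), "malicious"),
]

# The same data, except one malicious sample was labeled "clean" by mistake.
one_mislabeled = [
    ((0.1, 0.2), "clean"),
    ((0.2, 0.1), "clean"),
    ((0.9, 0.9), "clean"),
    ((0.8, 1.0), "malicious"),
]

# A new file that closely resembles the mislabeled sample.
new_file = (0.92, 0.88)

print(nearest_label(correct_labels, new_file))
print(nearest_label(one_mislabeled, new_file))
```

If the wrong verdict is then fed back into the training data as a new "clean" example, the error spreads to every future file that resembles it.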
A layered approach to cybersecurity
Machine learning is a powerful technology that may play an increasingly important role in the cybersecurity world in the years ahead. However, as mentioned above, it does have its flaws and limitations. Relying on antivirus software that is powered exclusively by AI or machine learning may leave you vulnerable to malware and other threats.
Solutions that use a combination of protection technologies will likely provide better security than a product that is entirely AI-based. For example, Emsisoft leverages the power of AI and machine learning as well as other protection technologies such as behavioral analysis and signature checkers. These systems work in synergy to double- and triple-check each other’s results in order to provide you with the best malware protection possible.
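As a rough illustration of layering, the sketch below combines three independent checks and reports which of them raised an alarm. It is not a description of how Emsisoft or any other vendor implements its product; the function layered_verdict, the threshold and the behavior names are all hypothetical.

```python
def layered_verdict(
    file_hash,
    ml_score,
    behaviors,
    known_bad_hashes,
    ml_threshold=0.8,
    bad_behaviors=frozenset({"mass_file_encryption", "disable_backups"}),
):
    """Return the list of protection layers that consider the file malicious."""
    reasons = []
    if file_hash in known_bad_hashes:  # signature layer
        reasons.append("signature")
    if ml_score >= ml_threshold:  # machine learning layer
        reasons.append("machine_learning")
    if bad_behaviors & set(behaviors):  # behavioral layer
        reasons.append("behavior")
    return reasons


# A file that fooled the model (low score) and has no known signature,
# but gives itself away once it starts encrypting files.
verdict = layered_verdict(
    file_hash="abc123",
    ml_score=0.1,
    behaviors=["open_window", "mass_file_encryption"],
    known_bad_hashes={"def456"},
)
print(verdict)
```

A file is blocked if any layer objects, so an attacker has to defeat every layer at once rather than just one.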
Taking a multi-layered approach to security allows you to avoid putting all your eggs in one basket and maximizes your chances of stopping malware before it can infect your system.