How deep learning can be used to detect malware using 2D images
NEW DELHI: Manipulating images to hide malware is common. Once the image is opened on a system, the malware loader starts the decryption process. The decrypted file is then loaded into the device's memory, triggering a malware attack.
Now, Microsoft and Intel have found a way to use images to detect malware attacks.
Intel Labs and Microsoft Threat Protection Intelligence are collaborating on a project named Static Malware-as-Image Network Analysis (STAMINA), which will turn any malicious code into images and use deep learning models to study them.
Classical malware detection approaches involve extracting binary signatures or fingerprints of the malware. However, with the growing number of malware samples and signatures, signature matching has become challenging.
The other approaches include static and dynamic analysis. The former analyses the malware without executing it, but its performance can suffer from code obfuscation. The latter executes the malware in a sandbox to analyse it. It is effective but can be more time-consuming.
That is why the researchers turned to an image-based transfer learning approach for static malware classification, using a real-world dataset. They used a Microsoft dataset of 2.2 million hashes of malware binaries and 10 columns of data.
A combination of known malware, potentially unwanted applications and unknown binaries with no known history were taken and converted into a stream of raw pixel data.
This one-dimensional pixel stream was then converted into a two-dimensional or 2D image to allow image analysis algorithms to work on it. The width and height were determined from the file size after conversion to the pixel stream, following an empirically validated table.
Image height is calculated as the number of pixels divided by the width. After reshaping, the images were resized for transfer learning techniques.
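The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the width table below holds placeholder values (the article does not reproduce the empirically validated table), trailing bytes that do not fill a full row are simply dropped, and nearest-neighbour sampling stands in for whatever resizing method the project actually used. The function names are hypothetical.

```python
# Placeholder table of (maximum file size in bytes, image width).
# These values are illustrative assumptions, NOT the table used in the study.
ILLUSTRATIVE_WIDTHS = [
    (10 * 1024, 32),
    (100 * 1024, 256),
    (float("inf"), 1024),
]


def pick_width(num_bytes, table=ILLUSTRATIVE_WIDTHS):
    """Look up the image width for a file of num_bytes bytes."""
    for max_bytes, width in table:
        if num_bytes <= max_bytes:
            return width
    return table[-1][1]


def bytes_to_image(data, width):
    """Reshape a byte stream into rows of pixels (one byte = one 0-255 pixel).

    Height is the number of pixels divided by the width. Bytes left over
    after the last full row are dropped (an assumption made for this sketch).
    """
    height = len(data) // width
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]


def resize_nearest(image, new_height, new_width):
    """Resize a 2D pixel grid with nearest-neighbour sampling (assumed method)."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


# Usage: a 512-byte stand-in for a binary becomes a 16 x 32 image,
# which is then resized to 4 x 4.
data = bytes(range(256)) * 2
image = bytes_to_image(data, pick_width(len(data)))
small = resize_nearest(image, 4, 4)
```

In practice the resized image would be handed to a pretrained image model, which is what makes a fixed input size necessary.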
Resizing does not adversely impact the classification result, since the system trains a very deep neural network to extract the deep-represented features, the researchers pointed out.
The 2D images were then fed into a deep neural network (DNN) that was trained using 60% of known malware samples. The DNN would scan and identify the image as clean or infected.
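A 60% training split of this kind could look like the sketch below. It is a minimal illustration, assuming a simple random shuffle with a fixed seed; the article does not say how the samples were selected or how the remaining 40% was used, so the `split_samples` helper and its parameters are hypothetical.

```python
import random


def split_samples(samples, train_fraction=0.6, seed=0):
    """Shuffle the samples and return (training set, held-out set).

    The random shuffle and the fixed seed are assumptions for this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Usage: 10 stand-in sample IDs give 6 for training and 4 held out.
train, held_out = split_samples(range(10))
```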
According to the researchers, the image-based technique, applied to x86 program binaries, achieved 99.07% accuracy with a 2.58% false positive rate.
The study further showed that working from the samples themselves allowed all characteristics of the malware to be captured during training. However, for larger applications, STAMINA may not be fully effective, as the software cannot convert billions of pixels into JPEG images and then resize them.
That is where metadata-based methods can be more reliable than sample-based models.
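For readers unfamiliar with the two metrics, the sketch below shows how accuracy and false positive rate are conventionally computed from a classifier's results. The counts are made-up numbers chosen for easy arithmetic and are not taken from the study.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Share of clean samples that were wrongly flagged as malware."""
    return fp / (fp + tn)


# Hypothetical counts: 100 malware samples and 100 clean samples.
tp, fn = 90, 10   # malware caught / malware missed
tn, fp = 95, 5    # clean passed / clean wrongly flagged

acc = accuracy(tp, tn, fp, fn)      # 185 / 200
fpr = false_positive_rate(fp, tn)   # 5 / 100
```

A low false positive rate matters in practice because every clean file flagged as malware costs an analyst or user time.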