Ensemble Malware Classification Using Neural Networks




Abstract. This work presents an experimental study of malware classification using the Microsoft Malware Classification Challenge 2015 dataset. We combine the approach of the winning solution to the Microsoft Malware Classification Challenge with the neural network approach. Combining n-gram features for both assembly (asm) and byte code enables us to improve the result significantly. By mixing multiple approaches, we obtain the best log-loss result so far, 0.0025. This result comes mostly from the classical XGBoost method with n-gram contributions from the binary and assembly code. However, our understanding of this result is still incomplete. The standard neural network approaches alone (even with LSTM) give poorer results than XGBoost, which is based mostly on n-grams. It is not clear why adding 6-grams to the binary code analysis does not improve results. Many more options remain to be tested in the future, in particular other network architectures.

Keywords: Malware detection · Microsoft Malware Classification Challenge · Malware neural networks
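The two quantities the abstract relies on, n-gram counts over raw bytes and the multiclass log-loss, can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names `byte_ngrams` and `multiclass_log_loss` are hypothetical, and the probability clipping constant `eps` is an assumption chosen for numerical safety.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every contiguous n-byte window in `data` (hypothetical helper)."""
    # An input shorter than n yields an empty range and thus an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class labels.
    probs:  list of per-sample probability rows, indexed by class label.
    eps:    clipping bound (an assumption) that keeps log() finite.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


if __name__ == "__main__":
    # Toy byte string, not real malware data.
    print(byte_ngrams(b"\x00\x01\x00\x01", 2))
    # Two samples, two classes, uniform predictions.
    print(multiclass_log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]]))
```

Under this definition, a mean log-loss of 0.0025 corresponds to a geometric-mean probability of about exp(-0.0025) ≈ 0.9975 on the correct class. Counts such as those returned by `byte_ngrams` would typically be turned into a fixed-length feature vector before being passed to a classifier such as XGBoost; that step is omitted here.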

1 Introduction

Machine learning has a clear advantage over the signature-based methods still used in malware detection. Constantly changing malware signatures and the use of obfuscation methods require effective and fast detection and classification methods.

1.1 Machine Learning-Based Malware Detection

Different studies have demonstrated the proficiency of machine learning for the detection and classification of malware files. Further, the accuracy of these machine learning models can be improved by using feature selection algorithms to select the most essential features and by reducing the size of the dataset, which also decreases computational overhead.

P. Wyrwinski et al.
Supported by PUT statutory funds. One of the authors (CJ) acknowledges the NVIDIA GPU Grant of Quadro P6000 card.
© Springer Nature Switzerland AG 2020. A. Dziech et al. (Eds.): MCSS 2020, CCIS 1284, pp. 125–138, 2020. https://doi.org/10.1007/978-3-030-59000-0_10

In general, there are two major approaches to malware classification. The first is the classical method based on hand-crafted feature selection. The other is a neural network approach. The customary thinking is that the neural approach, where progress in recent years has been tremendous, gives better results for very large systems, independent of the domain. For example, for Question Answering on SQuAD2.0¹, the F-measure increased from 70.3% in 2017 to 93.011% in 2020. One would therefore expect that attention neural networks [16], BERT [5], or CNN+LSTM-based networks would give better results.

The objective of this work is to test many neural network approaches, together with an ensemble method, to verify whether richer neural architectures lead to improvement. We would also like to establish the relative importance of binary versus assembly language (asm) data. Initially, our work followed the convolutional neural network (CNN) approach to bytecode, which originated in Gilbert's thesis [6], and the black-box approach of [11]. We make comparisons to