Software Similarity and Classification

Software similarity and classification is an emerging topic with wide applications. It is applicable to the areas of malware detection, software theft detection, plagiarism detection, and software clone detection. Extracting program features, processing t

  • PDF / 702,422 Bytes
  • 9 Pages / 439.37 x 666.142 pts Page_size
  • 12 Downloads / 212 Views

DOWNLOAD

REPORT


Static Analysis of Binaries

Abstract Static binary analysis is more difficult than if source code is available. In many cases, the analyses are unsound and behaviours are omitted to make problems feasible. Heuristics may be required to separate code and data in a disassembly or pointer behaviour may be weakly modelled to make statically analysing programs feasible. Nevertheless, static analysis of binaries is an important area of research with a number of practical applications including the detection of software theft and the classification and detection of malware. This chapter examines static analysis of binaries with the intent that properties and features of binary programs can be extracted to create useful birthmarks for software similarity and classification. Keywords Disassembly Decompilation

 Intermediate language  Control flow reconstruction 

5.1 Disassembly Disassembly is the process of translating machine code to assembly language [1]. This is typically the first stage of a static analysis. Static disassembly parses the entire binary image statically without execution. In static disassembly, there are two main algorithms. In the Linear Sweep algorithm, the instructions are disassembled one instruction after another, starting from the beginning of code. The disadvantage of this method is that data introduced into instruction stream may be erroneously disassembled (Fig. 5.1). The other main algorithm to perform disassembly is the Recursive Traversal algorithm. This algorithm decodes each instruction following the order of the control flow. This resolves the issue of embedded data, but may miss decoding instructions that are the target of indirect jumps or other situations when it is hard to resolve control flow statically (Fig. 5.2). S. Cesare and Y. Xiang, Software Similarity and Classification, SpringerBriefs in Computer Science, DOI: 10.1007/978-1-4471-2909-7_5,  The Author(s) 2012

41

42

5 Static Analysis of Binaries

Fig. 5.1 Linear sweep disassembly

Fig. 5.2 Recursive traversal disassembly

Speculative Disassembly attempts to remedy the problems of the Recursive Traversal algorithm problem by first performing the Recursive Traversal, and then performing a Linear Sweep in regions that are not decoded (Fig. 5.3). Disassembly results in the following data. disassembly ¼ faddress; opcode; operand1 ; . . .; operandn g

5.2 Intermediate Code Generation A simple approach to transforming assembly into an intermediate language is to translate each instruction without maintaining intermediate state. This approach has been used successfully in the Reverse Engineering Intermediate Language

5.2 Intermediate Code Generation

43

Fig. 5.3 Speculative disassembly

Fig. 5.4 Procedure disassembly

(REIL) [2]. Other popular intermediate languages are Vex as used in the Valgrind binary instrumentation framework [3] and Vine as used in the BitBlaze project [4]. An example to translate native assembly into three address code is shown below. native assembly instruction ! ðTAC1 ; . . .TACn Þ

5.3 Procedure Ide