FinFisher is heavily obfuscated in many ways, including through the use of spaghetti code in its functions in an effort to confuse disassembly programs. It also uses a custom XOR algorithm to obfuscate code.[136][137]
Ursnif has used an XOR-based algorithm to encrypt Tor clients dropped to disk.[387] Ursnif droppers have also been delivered as password-protected zip files that execute base64 encoded PowerShell commands.[388]
Typically, suspicious code is identified due to anomalous behavior of a computer system. Human experts then manually analyze the suspicious code to identify invariant code portions (syntactic analysis) or code portions that are regularly executed (semantic analysis). Such analysis leads to the generation of unique signatures for use by AVSs when scanning network packets, user files or memory. Before such signatures can be released, they must be checked against non-malware to ensure that the number of false positives is kept acceptably low. For instance, signatures based only on malware encryption/decryption information are likely to lead to unacceptably high false positives due to the large proportion of normal Internet traffic that also carries encryption/decryption information for integrity (e.g. hash algorithms) and authentication (e.g. certified public keys). But relying on human expertise alone to provide manually extracted signatures is becoming increasingly difficult with the growing volume of malware. As a result, interest continues to grow in methods to improve automatic signature extraction. Semantic approaches [4] [5], in addition to standard dynamic and execution behavior analysis [6] [7], now include methods such as control flow analysis [8] [9], behavior model checking [10] [11], executable graph mining [12] and formal semantic models of analysis [13]. The main problem with a semantic approach is that an infection must occur to produce anomalous behavior. Several execution traces may be required before signatures can be extracted manually, and there is always the risk that such signatures may not be effective for different execution paths of the same viral code.
Syntactic or static approaches [14] [15] [16], on the other hand, while possibly preferable because of their ability to extract signatures that may apply to different variants of the same malware family and to generate signatures irrespective of differences in execution paths, have not managed to keep pace with the latest polymorphic and metamorphic techniques used by virus writers to obfuscate their malware [17] [18]. Static signature extraction methods must also disassemble or reverse engineer executable code so that structural analysis of the source code is possible. Such analysis includes: statistical analysis of parameter values and searching for repeating strings [19] [20]; code feature selection [21]; feature extraction [22]; and n-grams analysis [23] [24] [25]. The mapping of executable code to a suitable level of program representation that allows such structural analysis is problematic, however, due to such code being deliberately constructed to hide its functionality, such as through the use of redundant control instructions and variable assignments.
Predicting future metamorphic and polymorphic viral forms to prepare AVSs for as yet unknown variants has remained a distant research goal for both semantic and syntactic techniques. The key to a successful syntactic approach would appear to lie in analyzing malware code directly and without execution, and so removing the need for reverse engineering. By comparing different structural variants of the same virus, a successful structural/static approach may be able to identify common code patterns despite attempts to obfuscate through polymorphism because, if the virus is to perform its designated payload or function and remain a variant of a virus family, a common code must be present even if it is deliberately obscured. A purely syntactic approach, such as the one proposed in this paper, should detect new polymorphic viral variants independently of semantic knowledge based on execution traces, command and control channels, deduplication and propagation vectors. That is, a purely syntactic approach to new variants should not require prior infection by those variants.
In this study, we focus on a sequence-based automatic signature extraction method for identifying polymorphic malware using syntactic analysis of hex code. Theoretically, malware with polymorphism changes its code and keeps the functions intact, whereas malware with metamorphism changes sub-functionality and code while preserving overall functionality [26]. The implications of this theoretical division are unknown for automatic signature extraction. It is not even known if any metamorphic malware actually exists [27]. For that reason, we confine our approach to polymorphic malware capable of mutating into a potentially infinite number of functionally equivalent but structurally different variants (details below).
Because the same viral function can appear in many different physical code forms, it has been posited that only semantic analysis will reveal commonalities among variants of the same virus for effective signature generation. As a result, syntactic techniques for signature extraction based on structural detection of malware are relatively unexplored in comparison to semantic techniques, and so there is very little in the way of related literature. What literature there is is discussed in Section 3. In order to understand syntactic-based polymorphism detection techniques, it is useful to consider a simple example of linguistic signature extraction. Consider the following structurally-related sentences, where the first sentence is the original sentence, and the other three are polymorphic versions of it:
A sequence-based approach to signature extraction was previously proposed and demonstrated using the Smith-Waterman algorithm (SWA) without gap penalties [29]. SWA is used extensively in bioinformatics for sequence alignment (finding common subsequences or consensuses among a set of variable length sequences), and previous work demonstrated the feasibility of using such consensuses in viral hex code as signatures. The approach was further refined [30] by adopting SWA with six different substitution matrices. Results showed that it was possible to extract signatures/meta-signatures after applying data mining rule-extraction techniques to the extracted signatures. Such signatures/meta-signatures can, in turn, be employed as rule-based string templates for creating more specific, variant-oriented polymorphic malware signatures for detecting known variants belonging to the same virus family. In other words, previous work has shown how to progress syntactically (i.e. without execution traces) from viral code consensus identification for a set of variants of the same virus family (training set) to generation of signatures in either a regular expression or rule format for identification of other known variants of the same virus family (test set).
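As an illustration of the alignment idea described above, the following is a minimal sketch of Smith-Waterman local alignment between two symbol strings. It is not the implementation used in the cited work. The function name `smith_waterman`, the match/mismatch scores and the linear `gap` parameter are assumptions made for this example; `gap=0` is one possible reading of "without gap penalties".

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=0):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions; 'gap' is a linear penalty
    subtracted for every inserted or deleted symbol.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] - gap,    # gap in b
                              score[i][j - 1] - gap)    # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The positions at which the two aligned strings agree form a candidate consensus. Extending this to a set of variants would require repeated pairwise or multiple alignment, which this sketch does not attempt.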
What has improved considerably since the historical view that only semantic analysis will reveal viral signatures is the growth in our knowledge of sequence-based syntactic and structural search algorithms in bioinformatics. Such algorithms do not just search for the presence or absence of characters in certain positions but also use pre-loaded substitution matrices that give substitution probabilities and/or allow such substitution matrices to be generated using probabilistic techniques. Of greater importance to this paper is that such algorithms manipulate (shift) the strings/sequences to allow for insertion and deletion of characters to maximize the number of matching characters. Previous work [32] showed that such string manipulation algorithms from bioinformatics work best with biologically represented strings (amino acids, nucleotide bases) rather than arbitrary character sets. This is due to the possible inclusion of heuristic biological information in the algorithms that determines to some extent the matching process (e.g. built-in information concerning mutation rates between amino acids or nucleotide bases). The implications of rewriting already well-understood and publicly available sequence-based bioinformatics algorithms to work on hex code (numeric data) are not known. For these reasons and to allow comparison with previous work, conversion of hex code to an appropriate biological representation is required before sequence matching, with conversion back to hex code for signature generation. We used a simple identity (ID) substitution matrix for our alignment experiments instead of other well-known biological substitution/mutation matrices, such as BLOSUM (Block Substitution Matrix) and PAM (Point Accepted Mutation). ID provides the most parsimonious method in that no assumptions are made as to how symbols may be related to each other. 
Also, the use of ID allows the effects of gap opening and closing to be accurately assessed without being compromised by probabilistic substitution matrices.
Our previous work [29] [30] [31] has shown that sequence alignment techniques supplemented with the Smith-Waterman algorithm lead to signatures that generalize successfully to unseen but previously known variants of polymorphic viruses. This prior work adopted a fixed combination of gap open and gap extend penalties for the automatic generation of virus signatures. However, it is not known how well this method generalizes to new, unknown variants, or what effect the gap penalties have. In this paper, we use ten different combinations of gap open and gap extend penalties to determine whether changes in these penalty parameters can help to identify signatures for known as well as unknown polymorphic variants, which we generate in the laboratory, thereby extending the ability of future AVSs to identify variants not previously encountered.
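To make the role of the two penalty parameters concrete, the sketch below computes a local alignment score under affine gap penalties (a Gotoh-style recurrence) with identity scoring, and evaluates it for several penalty combinations. The function names, the scoring values and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` are assumptions for this example. The penalty pairs in the usage note are hypothetical and are not the ten combinations used in the experiments.

```python
def local_affine_score(a, b, gap_open, gap_extend, match=1, mismatch=-1):
    """Best local alignment score with affine gaps and identity scoring.

    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def sweep_gap_penalties(a, b, combos, match=1, mismatch=-1):
    """Map each (gap_open, gap_extend) pair to the resulting local score."""
    return {(o, e): local_affine_score(a, b, o, e, match, mismatch)
            for o, e in combos}
```

For example, `sweep_gap_penalties(seq1, seq2, [(3, 1), (10, 1)])` shows how a high gap open penalty can make the aligner prefer a shorter ungapped match over a longer gapped one, which in turn changes the consensus from which a signature would be derived.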
Step-2 (Converting the viral code into a form acceptable for sequence alignment): In this step, the 18 extracted hex dump sequences belonging to the three polymorphic malware families were converted into amino acid sequences. Conversion of hexadecimal into amino acid sequences for input to JAligner [67] was performed using the rules shown in Table 2. A short example of the conversion of hexadecimal code into 16 amino acid characters is shown below: 2ff7e9595c
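The conversion in this step can be sketched as a one-to-one symbol substitution. Table 2 is not reproduced in this section, so the mapping `HEX_TO_AA` below is a placeholder assumption rather than the mapping actually used. It simply pairs the 16 hex digits with 16 of the 20 standard one-letter amino acid codes. The function names are likewise hypothetical.

```python
# Placeholder mapping (assumption): NOT the rules from Table 2.
# Each hex digit is paired with one distinct amino acid letter.
HEX_TO_AA = dict(zip("0123456789abcdef", "ACDEFGHIKLMNPQRS"))
AA_TO_HEX = {aa: h for h, aa in HEX_TO_AA.items()}


def hex_to_amino(hex_string):
    """Convert a hex dump string into an amino acid letter sequence."""
    return "".join(HEX_TO_AA[ch] for ch in hex_string.lower())


def amino_to_hex(aa_string):
    """Convert an amino acid letter sequence back into hex digits."""
    return "".join(AA_TO_HEX[ch] for ch in aa_string.upper())
```

Because the mapping is a bijection over the 16 hex digits, `amino_to_hex(hex_to_amino(x))` recovers the original hex string (in lower case), which is what allows consensuses found by the aligner to be converted back into hex signatures.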