Strings found in programs can be of great importance for quickly classifying and triaging (unpacked) samples. Thus, a lot of attention has been paid to string extraction in Malcat.
Malcat can identify strings using different algorithms, depending on the file type. Algorithms range from a simple regular-expression-based linear sweep to more advanced file-format-aware parsers, or even disassembly-based parsers. Multiple algorithms can also be used at the same time. We will detail some of these concepts in this chapter.
By default, strings are extracted from the file using a naive linear-sweep algorithm. This algorithm discovers printable ASCII and UTF-16-LE strings using regular expressions. Code-referenced strings are labelled as USER strings, while all others get the SCANNED tag. This is the least precise algorithm, but more specific algorithms can be used depending on the file type.
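To illustrate the idea, here is a minimal sketch of a regular-expression-based linear sweep over raw bytes. It is not Malcat's implementation: the minimum length of 4 characters, the exact character class and the function name linear_sweep are assumptions made for the example.

```python
import re

MIN_LEN = 4  # assumed minimum string length; Malcat's real threshold may differ

# Runs of printable ASCII bytes (space through tilde).
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# The same characters encoded as UTF-16-LE: each byte is followed by a NUL byte.
UTF16LE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def linear_sweep(data: bytes):
    """Return (offset, encoding, text) tuples for every string found, sorted by offset."""
    found = []
    for m in ASCII_RE.finditer(data):
        found.append((m.start(), "ascii", m.group().decode("ascii")))
    for m in UTF16LE_RE.finditer(data):
        found.append((m.start(), "utf16-le", m.group().decode("utf-16-le")))
    return sorted(found)
```

Such a sweep knows nothing about the file format, which is why it also reports byte runs that merely happen to look like text.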
In addition to the linear-sweep algorithm, strings constructed on the stack or in global memory (your typical shellcode stack strings) can be recovered using an in-house heuristic; see Dynamic strings.
For .NET programs, strings are extracted from the #Strings stream (META) and the #US stream (USER).
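As a rough sketch of what reading these two .NET metadata heaps involves: #Strings holds NUL-terminated UTF-8 names, while each #US entry is a length-prefixed blob (a compressed integer, UTF-16-LE data, then one trailing flag byte), as laid out in ECMA-335. The function names below are made up for the example, and the code assumes the raw heap bytes have already been located in the file.

```python
def read_compressed_uint(heap: bytes, pos: int):
    """Decode an ECMA-335 compressed unsigned integer; return (value, next_pos)."""
    first = heap[pos]
    if (first & 0x80) == 0:        # 0xxxxxxx: 1-byte encoding
        return first, pos + 1
    if (first & 0xC0) == 0x80:     # 10xxxxxx: 2-byte encoding
        return ((first & 0x3F) << 8) | heap[pos + 1], pos + 2
    if (first & 0xE0) == 0xC0:     # 110xxxxx: 4-byte encoding
        value = ((first & 0x1F) << 24) | (heap[pos + 1] << 16)
        value |= (heap[pos + 2] << 8) | heap[pos + 3]
        return value, pos + 4
    raise ValueError(f"invalid compressed integer at offset {pos}")


def iter_user_strings(heap: bytes):
    """Yield (offset, text) for each non-empty entry of a #US heap."""
    pos = 0
    while pos < len(heap):
        start = pos
        size, pos = read_compressed_uint(heap, pos)
        if size == 0:              # empty entry or trailing padding
            continue
        blob = heap[pos:pos + size]
        pos += size
        # The last byte of the blob is a flag, everything before it is UTF-16-LE.
        yield start, blob[:-1].decode("utf-16-le", errors="replace")


def iter_meta_strings(heap: bytes):
    """Yield (offset, text) for each NUL-terminated UTF-8 entry of a #Strings heap."""
    pos = 0
    while pos < len(heap):
        end = heap.find(b"\x00", pos)
        if end == -1:
            end = len(heap)
        if end > pos:
            yield pos, heap[pos:end].decode("utf-8", errors="replace")
        pos = end + 1
```

Because both heaps carry explicit lengths or terminators, this kind of parser does not have to guess where a string starts or ends.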
Golang strings are tricky to extract because no start of string/end of string information is present. Malcat scans the disassembled code looking for typical string access patterns in order to identify the start and size of accessed strings.
This is merely a heuristic and suffers from false positives and false negatives, but it still gives better results than the default linear sweep.
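The sketch below shows why the (start, size) pairs matter. Go stores string data back to back without terminators, so a linear sweep would report one long merged string. The example assumes the pairs have already been recovered from the disassembly; the pattern matching itself is not shown, and slice_go_strings is a hypothetical helper, not a Malcat function.

```python
def slice_go_strings(blob: bytes, refs):
    """Cut individual strings out of a Go string blob.

    blob: raw bytes of the region holding the string data
    refs: iterable of (offset, length) pairs, assumed to come from code analysis
    """
    strings = []
    for offset, length in refs:
        # Out-of-range candidates are most likely false positives of the heuristic.
        if offset < 0 or length <= 0 or offset + length > len(blob):
            continue
        strings.append(blob[offset:offset + length].decode("utf-8", errors="replace"))
    return strings


# Three strings packed together, with no separator between them.
blob = b"errorhttp://example.comopen"
refs = [(0, 5), (5, 18), (23, 4), (25, 10)]  # the last pair is bogus on purpose
strings = slice_go_strings(blob, refs)
```

A wrong pair that still falls inside the blob would silently produce a truncated or overlapping string, which is one source of the false positives mentioned above.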
Strings are extracted from the serialized pyc structure. Strings present in the “Constants” pool of methods are USER strings (constants used in code); all other strings are META strings (source code paths, class and function names, etc.).
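The same split can be reproduced with the standard library on a live code object, as sketched below. This only mirrors the idea: compile() stands in for loading a .pyc file, the helper names are invented, and the mapping of co_consts to USER and of names and paths to META is an assumption about how the two categories line up.

```python
import types


def _strings_in(const):
    """Yield str constants, looking inside tuples and frozensets as well."""
    if isinstance(const, str):
        yield const
    elif isinstance(const, (tuple, frozenset)):
        for item in const:
            yield from _strings_in(item)


def classify_code_strings(code: types.CodeType):
    """Return (user, meta) sets of strings for a code object and its nested code objects."""
    user, meta = set(), set()
    stack = [code]
    while stack:
        co = stack.pop()
        # Source path plus function, global and local names: metadata.
        meta.update((co.co_filename, co.co_name))
        meta.update(co.co_names)
        meta.update(co.co_varnames)
        for const in co.co_consts:
            if isinstance(const, types.CodeType):
                stack.append(const)              # nested function or class body
            else:
                user.update(_strings_in(const))  # literals used by the code
    return user, meta


source = 'def greet(name):\n    return "hello " + name\n'
user, meta = classify_code_strings(compile(source, "demo.py", "exec"))
```

Depending on the Python version, some names (for instance a function's qualified name) may also appear in the constants pool, so the two sets can overlap.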
Strings are extracted from the “Strings” region. Special characters in NSIS strings (variable references or hard-coded paths) are expanded when needed.
Strings are extracted from the BIFF opcodes when it makes sense. SST and STRING opcodes will be considered as USER strings, while other records containing a string will use the META tag.
Malcat scans the disassembled code looking for LitVarStr opcodes. Arguments of these opcodes are used as entry-points to USER strings.
In Malcat, strings have a score, which is used to sort them inside the Strings list. The string score is a value between 0 and 255 (255 being the highest) which is given to every extracted string by an internal algorithm. The goal is to give a higher value to the strings which are more interesting to human analysts, like crypto content, IOCs or code literals. Since this is merely a heuristic, don’t expect it to do miracles: interesting strings are more likely to be listed at the top, but it’s not a guarantee.
The exact algorithm won’t be described here, since it is a draft likely to change in the future. Its inputs are fixed though, and are the following:
Entropy of the string: the higher, the better
Number of printable characters: the more, the better
Number of characters: the more, the better
Type of the string: DYN > USER > META > SCAN
Encoding of the string: ASCII strings get negative points
Whether the string has a specific tag: tagged strings get extra points
Number of incoming references: 1 is the best, 2 or more references the second best, 0 the worst
Whether the string is a known constant (see Known patterns identification): this lowers the score
Whether the string is part of a Yara pattern (see Yara signatures): this increases the score
Again, any constructive feedback is welcome.
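Since the real algorithm is not published, the sketch below only shows how inputs of this kind could be combined into a 0 to 255 value. Every weight, cap and name in it (toy_score, TYPE_RANK and so on) is invented for illustration; only the Shannon entropy formula is standard.

```python
import math
import string
from collections import Counter

TYPE_RANK = {"SCAN": 0, "META": 1, "USER": 2, "DYN": 3}  # DYN > USER > META > SCAN
PRINTABLE = set(string.printable)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())


def toy_score(text, kind, encoding="ascii", refs=0,
              tagged=False, known_constant=False, in_yara=False):
    """Combine the documented inputs with made-up weights, clamped to 0..255."""
    score = 10 * shannon_entropy(text)                      # entropy
    score += min(sum(c in PRINTABLE for c in text), 40)     # printable characters
    score += min(len(text), 40)                             # length
    score += 15 * TYPE_RANK[kind]                           # string type
    score -= 10 if encoding == "ascii" else 0               # ASCII penalty
    score += 20 if tagged else 0                            # tag bonus
    score += {0: 0, 1: 20}.get(refs, 10)                    # 1 ref > 2+ refs > 0 refs
    score -= 30 if known_constant else 0                    # known constant penalty
    score += 30 if in_yara else 0                           # Yara pattern bonus
    return max(0, min(255, round(score)))


url_score = toy_score("http://evil.example/payload", "USER", refs=1, tagged=True)
junk_score = toy_score("aaaa", "SCAN")
```

With these arbitrary weights, a referenced and tagged URL ends up well above a short run of repeated characters, which is the ordering such a score is meant to produce.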
In Malcat, some strings may have a tag, which is nothing more than a category for the string. The tag is computed via regular expressions over the content of the string, independently of the string type or encoding. For instance, strings looking like IP addresses will get the IP tag. Strings cannot have more than one tag.
String tags are mostly used by Python scripts and the Anomaly scanner to filter relevant strings.
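A regular-expression tagger with at most one tag per string can be sketched as an ordered list of rules where the first match wins. Only the IP tag is taken from the text above; the URL and EMAIL rules, the patterns themselves and the name tag_of are assumptions made for the example.

```python
import re

_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

# Ordered rules: the first matching pattern decides the (single) tag.
TAG_RULES = [
    ("IP", re.compile(rf"^(?:{_OCTET}\.){{3}}{_OCTET}$")),
    ("URL", re.compile(r"^[a-z][a-z0-9+.-]*://\S+$", re.IGNORECASE)),
    ("EMAIL", re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")),
]


def tag_of(text: str):
    """Return the tag of the string, or None if no rule matches."""
    for tag, pattern in TAG_RULES:
        if pattern.search(text):
            return tag
    return None


# The tagger only looks at the decoded content, not at the string type or encoding.
iocs = [s for s in ("192.168.1.10", "hello world", "http://example.com/a")
        if tag_of(s) is not None]
```

Filtering on the tag, as in the last lines, is the kind of use a script would typically make of it.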
Still subject to change.