File parsers

Malcat uses its 50+ file parsers to identify the type of the current file and highlight all of its internal structures. The file parsers are also used to carve known file types out of the current file at any location (e.g. embedded archives or images inside a PE file).

How parsing is done

All of Malcat’s parsers are written in python. Relying on python reduces the attack surface for bad parsing (file formats are tricky) and significantly speeds up the development of new file parsers (you can test everything live, just hit Ctrl+R). This comes with a small performance penalty of course, but in practice it is negligible, provided some basic guidelines are followed.

All parsers are located in data/filetypes or in the filetypes sub-directory of your User data directory. When a new file needs to be analyzed in Malcat, the file parser is the first to be called. It will be responsible for several things:

  • Identify all of the file’s structures (the ones you can see in the Structure/text view)

  • Describe the file’s section mapping if relevant

  • Collect any metadata (date of creation, authors, etc.)

  • List all Virtual files if applicable (e.g. for an archive)

  • Populate the initial Symbols list if any (e.g. with imports and exports for a PE file)

Once done, the Analysis engine (doc in progress) can work its magic and start the analysis of the file.

File carving

File parsers are also used in Malcat’s carving algorithm in order to detect any file embedded inside the currently analyzed file. This feature is not only useful for DFIR (e.g. when analysing memory dumps), but also for malware analysis, for instance when malware embeds its next-stage payload inside a data section.

All parsers located in data/filetypes (or in your User data directory) will be used in the carving process. Every time the magic regexp of a parser matches, the parser’s FileTypeAnalyzer.parse() method will be called. During file carving, the parser does not need to run to completion: it may perform an early exit (for performance reasons) if both of the following conditions are met:

  • The parser has confirmed the file type by calling the malcat.FileParser.confirm() method

  • The parser has specified the end of the file by calling the malcat.FileParser.set_eof() method
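The carving loop described above can be pictured with a small standalone sketch. It is not Malcat’s implementation: it uses python’s re module instead of pcre2, handles a single format, and carve and parse_png_extent are hypothetical helpers which only mimic the "confirm the type, find the end of file, exit early" behaviour.

```python
import re
import struct

# Same magic as the PNG parser's regexp, written as a bytes pattern for re
PNG_MAGIC = re.compile(rb"\x89PNG\r\n\x1a\n")

def parse_png_extent(data, start):
    """Walk PNG chunks from `start`; return the end offset, or None if invalid."""
    pos = start + 8                      # skip the 8-byte signature
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        pos += 12 + length               # length + type + data + crc
        if pos > len(data):
            return None                  # chunk runs past the buffer
        if ctype == b"IEND":
            return pos                   # end of file known: early exit
    return None

def carve(data):
    """Return (start, end) pairs for every magic hit that parses cleanly."""
    found = []
    for match in PNG_MAGIC.finditer(data):
        end = parse_png_extent(data, match.start())
        if end is not None:
            found.append((match.start(), end))
    return found
```

A magic hit which does not lead to a well-formed chunk list is simply dropped, which is the carving equivalent of a false positive.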

The list of carved files is reported inside the Carved files tab in the UI.

Supported file formats

Malcat focuses mainly on file formats used directly or indirectly by malware authors. Adding a new file type is easy. Please refer to Writing new parsers. If you wish to make your new file format official, please refer to Contributing.

Programs

Here you can find the current list of supported executable formats:

Name

Structure parsing

Debug infos

Resources

Notes

AutoIT

3.26+ only

Scripts can be decompiled using F4

COFF

Yes

symbols and cv13

relocations, symbols, imports

ELF

Yes

symbols, no DWARF

relocations, symbols, imports, big and little endian

InnoSetup

Yes

Yes

setup script can be disassembled

LNK

Yes

while not a program format per se, it can be used to run commands

MDMP

Partial

No

Windows minidumps, partial support

NSIS

Yes

Yes

setup script can be disassembled, most sections parsed

OLE

Yes

VBA macros can be displayed using F4

PE/PE+

Yes

debug dir, no PDB

Yes

exports, imports (+ bound/delay), relocations, tls, debug, load config, certificates, version information

PE::DotNet

Yes

Yes

Yes

types, methods, resources, exceptions, strings

PE::Golang

Yes

pcln + file tables

PE::VB

Yes

types and events

VB forms

native and PCode support, project infos, objects array, forms and events

PYC

Yes

Yes

support for python 2.7+ and 3.6+, can handle PY2EXE and PYINST scripts

VBE

Yes

Malcat supports unpacking the original VBS script

XLS

Yes

The /Workbook stream inside OLE containers. Cell information (including formulas) can be recovered using F4

XLSB

Yes

The .bin files inside OpenXML .xlsb containers. Cell information (including formulas) can be recovered using F4

Structures parsing:

if the file format parser identifies (most of) the binary structures of the file format

Debug information:

if debug information is parsed

Resources:

if the program embeds resources, whether Malcat can identify and extract them

Archives / file systems / databases

While Malcat has no pretension of being a full-fledged archive opener, it supports most archive types used by malware. Some file format parsers are more advanced than others and even allow the user to open archive members directly inside Malcat. Here is a list of supported file formats:

Name

Structure parsing

In-app unpacking

Summary

Notes

7Z

EncodedHeader only

No

No

ACE

Yes

Yes

Yes

AR

Yes

Yes

Yes

Used in static libraries (.lib, .a)

AutoIt

3.26+ only

3.26+ only

Yes

Scripts can be decompiled using F4

CAB

Yes

zlib encoding only

Yes

CFB/OLE2

Yes

Yes

Yes

VBA macros can be displayed using F4

FAT12/16/32

Yes

Yes

Yes

Currently limited to small to medium file trees

GZIP

Yes

Yes

Yes

InnoSetup

Yes

Yes

Yes

Support from InnoSetup 4.1.0 and onward, encryption support

ISO

Yes

Yes

Yes

No Joliet/RockRidge extensions support

JFFS2

Yes

lzo/lzma/rtime/zlib only

Yes

MSI

Yes

Yes

Yes

MSI tables can be displayed using F4

NSIS

Yes

zlib and lzma, no bz2

Yes

PYINST

Yes

Yes

Yes

Extracted python scripts get their python header restored

PYZ

Yes

Yes

Yes

Extracted python scripts get their python header restored

RAR4

Yes

No

Yes

Archive comments are shown for easy SFX analysis

RAR5

Yes

No

Yes

Archive comments are shown for easy SFX analysis

SquashFS

Yes

lzo/lzma/xz only

Yes

Can also display meta streams

Sqlite

Partial

Not yet

No

Work in progress

TAR

Yes

Yes

Yes

UDF

Yes

Yes

Yes

UImage

Yes

lzo/lzma/gzip/bzip2 only

Yes

VHD

Yes

Yes

Yes

Support for dynamic disks

ZIP

Yes

Yes

Yes

ZLIB stream

Yes

Yes

Yes

Structures parsing:

if the file format parser identifies (most of) the binary structures of the file format

In-application unpacking:

if the file format parser can directly extract and open archive members. Inside Malcat, one can then open a file by double-clicking it inside the Virtual File System tab.

Summary:

if Malcat displays a summary report in the Summary view

Multimedia / documents

Identifying documents and pictures is very useful for malware analysis: a lot of obfuscators love to disguise their payloads as multimedia files, or to hide them inside a multimedia file, in some unused space.

Name

Structure parsing

Metadata

Notes

BMP

Yes

Both BMP and DIB (i.e. BMP without FileHeader) are supported

DOC

Partial (FCB)

Yes

The /WordDocument stream inside OLE containers

EMF

Yes

Yes

Used in office documents

GIF

Yes

Yes

ICO

Yes

JPEG

Yes

Tiff

ONE

Yes

No

Microsoft OneNote files. You can list and open embedded file objects

OOXML

No

No

Well it’s a ZIP, so you can browse it inside Malcat

PDF

Minimal

No

Very minimal support since not really a binary format

PNG

Yes

Yes

Pixel information can be extracted using scripts

WAV

Basic

No

XLS

Yes

Yes

Cell content + formula can be displayed using F4

XLSB

Yes

Yes

Cell content + formula can be displayed using F4

Structures parsing:

if the file format parser identifies (most of) the binary structures of the file format

Metadata:

if most metadata (author, comments, time, etc.) are extracted

Writing new parsers

Adding a new file parser in Malcat is just a matter of adding a new python file in the filetypes sub-directory of your User data directory and creating a class which inherits from FileTypeAnalyzer. Your parser class should define at least the following items:

  • a class attribute name: a (unique) short identifier/name for the file type, e.g. “PE” or “ISO” (see FileTypeAnalyzer.name)

  • a class attribute regexp: a regular expression using pcre2 syntax which has to match somewhere in the file for your parser to be even called (see FileTypeAnalyzer.regexp)

  • a class attribute category: the category of the file type. This tells Malcat which icon to use to represent the file (see FileTypeAnalyzer.category)

  • a method parse, which will be called by Malcat and will do all the validating/parsing of the file (cf. FileTypeAnalyzer.parse())

If you want to test your parser, no need to restart Malcat: just hit Ctrl+R and it will be reloaded and reapplied to the current file (if the regexp matches). As an example, we can have a look at the PNG parser located at data/filetypes/PNG.py:

from filetypes.base import *
import malcat

class PNGAnalyzer(FileTypeAnalyzer):
    category = malcat.FileType.IMAGE
    name = "PNG"
    regexp = r"\x89PNG\r\n\x1A\n"

    def parse(self, hint=""):
        yield Bytes(8, name="Signature", category=Type.HEADER)
        # ...

Note

By default, the parsing starts at the location where the magic regexp was found. If the magic regexp does not match at the start of the file, please see Custom parsing start and context.

Now all you have to do is to write the parse method. The parse method will be responsible for:

  • Parsing/identifying all of the file’s structures (the ones you can see in the Structure/text view)

  • Describing the file layout

  • Collecting any metadata (date of creation, authors, etc.)

  • Listing all Virtual files (e.g. for an archive)

  • Populating the initial Symbols list if any (e.g. with imports and exports for a PE file)

Identifying at least one field/structure is mandatory. The rest is optional and may or may not be relevant depending on the file type you are parsing. If during parsing you notice that this is not a valid file (e.g. the magic regexp hit was a false positive), you can abort the parsing at any time by raising a FatalError (cf. Exceptions and error handling). Each item will be discussed below.

Note

While this tutorial should be the first documentation you read when writing your own parser, we encourage you to also have a look at the existing file parsers located in data/filetypes.

Parsing

The main goal of the parser is to identify known structures/fields inside the parsed file. This is done from within the FileTypeAnalyzer.parse() method by using the python keyword yield. A typical parser will FileTypeAnalyzer.jump() to various offsets in the file and yield a description of the different fields which are located there. Fields are named objects (the names you see in the Structure/text view for instance) and can be:

  • an atomic field : an integer, a string, a timestamp, etc

  • a structure field : like a typical C structure, a sequence of named fields

  • an array field : like a typical C array, a sequence of identical (unnamed) fields

  • a bitfield : a sequence of Bit fields

Whenever you yield a field/array/bitfield/structure from your parser, you will get the value of the field back, which will be:

  • the corresponding python value for atomic fields, e.g. a datetime.datetime for Timestamp fields, a str for StringUtf8 fields or an int for UInt64 fields.

  • a malcat.FieldAccess instance for structures, arrays and bitfields. The malcat.FieldAccess object allows you to navigate into these aggregate fields in a pythonic way. The interface is the same as for the well-documented malcat.StructAccess object described in detail in the scripting documentation.

Yielding a field/structure also moves the field/parsing pointer at the end of the field/structure, so that the next yield will be located right after the previous field/structure. Let us have a look at the start of the PE parser for instance:

class MZ(Struct):
    def parse(self):
        #...

class PE(Struct):
    def parse(self):
        #...

def parse(self, hint=""):
    mzheader = yield MZ(category=Type.HEADER)
    # you can also access the structure by name after it has been yielded: mzheader = self["MZ"]
    self.jump(mzheader["AddressOfPE"])          # jump to PE header offset
    pe = yield PE(category=Type.HEADER)         # yield pe header structure
    magic, = struct.unpack("<H", self.look_ahead(2))
    self.is64 = magic == 0x020B
    if self.is64:
        optheader = yield OptionalHeader64(name="OptionalHeader", category=Type.HEADER) # located right after the PE struct
        self.set_architecture(malcat.Architecture.X64)
    else:
        optheader = yield OptionalHeader32(name="OptionalHeader", category=Type.HEADER) # located right after the PE struct
        self.set_architecture(malcat.Architecture.X86)
    self.set_imagebase(optheader["ImageBase"])

Here you can see the use of FileTypeAnalyzer.jump() and the yield keyword to identify three structures. The script also uses the function malcat.FileTypeAnalyzer.look_ahead() to read the content of the file at the location pointed to by the field pointer. Please note that fields/structures cannot overlap (this would throw an exception) and a field/structure cannot be located past the end of the file.
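If the way yield hands a value back to the parser feels unusual, the following toy driver may help. It is only a sketch of the generator protocol, not Malcat’s engine: Field and run are hypothetical names, fields are plain struct formats, and there is no support for jumping around.

```python
import struct

class Field:
    """Toy field descriptor: a name plus a struct format (hypothetical, not Malcat's)."""
    def __init__(self, name, fmt):
        self.name, self.fmt = name, fmt
        self.size = struct.calcsize(fmt)

def run(parse, data):
    """Drive a parse generator: decode each yielded field and send the value back."""
    fields, pos = {}, 0
    gen = parse()
    try:
        field = next(gen)                      # first yielded field
        while True:
            if pos + field.size > len(data):
                raise ValueError("field past end of file")
            value, = struct.unpack_from(field.fmt, data, pos)
            fields[field.name] = (pos, value)
            pos += field.size                  # pointer moves past the field
            field = gen.send(value)            # value becomes the result of `yield`
    except StopIteration:
        return fields

def parse():
    # the value of the Count field decides how many items are yielded next
    count = yield Field("Count", "<H")
    for i in range(count):
        yield Field(f"Item{i}", "<I")
```

Calling run(parse, data) returns a dictionary mapping each field name to its offset and decoded value, with each field located right after the previous one.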

You will also notice that all field types accept (at least) the following keyword arguments:

  • name: the name of the field, that will be displayed in the Structure/text view

  • comment: a comment describing the field (optional). This will be displayed in the Structure/text view when the user hovers the field

  • parent: an already parsed malcat.FieldAccess instance which should be considered as the parent of the yielded field (optional). This will make the currently yielded field appear as a child of the parent field in the data tab. This is purely visual and just there to help sort out all the structures of the file:

../_images/parent.png

Use of the parent keyword argument in fields

Additionally, for level-0 fields/structures, you may specify a category keyword argument that will tell Malcat which category to put the structure in. This will also help users to fine-tune the Highlighting. The category can be one of:

  • Type.HEADER: For structures/fields used to describe the file format

  • Type.CODE: For structures/fields containing executable code

  • Type.DATA: For structures/fields containing arbitrary/user-defined data

  • Type.DEBUG: For structures/fields containing debug information

  • Type.METADATA: For structures/fields containing metadata information (comments, times, etc)

  • Type.FIXUP: For structures/fields containing fixup information

  • Type.RESOURCE: For structures/fields containing images, sounds, etc data

That’s it for the basics of parsing. Since parsers are python classes, you may use whatever python code you need in addition to the methods defined in the malcat.FileTypeAnalyzer class, which makes them pretty flexible. The only thing you need to know now is which fields can be yielded; they are described below.

Note

If you prefer to look at code, all the available types are defined in <malcat install dir>/bindings/filetypes/types.py.

Numeric types

Numeric fields describe a numerical value. They are non-aggregate data types that you may yield either from the FileTypeAnalyzer.parse() method or from the Struct.parse() method. The most important ones are listed below, but you can get the complete list by looking at bindings/filetypes/types.py.

The value returned when yielding a numeric field is either an int or a float. Like most fields, they accept a name and comment keyword argument. Integer fields also accept a values parameter for enums. The values parameter is simply a list of tuples (enum name, enum value) which describes all the possible values that the field can take. In the Structure/text view, such fields appear as enums and can be edited via a combobox. A small example taken from data/filetypes/ACE.py:

class UnknownHeader(Struct):
    def parse(self):
        yield UInt16(name="Crc", comment="crc of data")
        size = yield UInt16(name="Size", comment="size of data")
        yield UInt8(name="Type", values=[
            ( "MAIN", 0),
            ( "FILE32", 1),
            ( "RECOVERY32", 2),
            ( "FILE64", 3),
            ( "RECOVERY64A", 4),
            ( "RECOVERY64B", 5),
            ])

Class

Size

Equivalent C type

Python type

Parameters

Description

UInt8

1

uint8_t

int

name, comment, values

A byte

Int8

1

int8_t

int

name, comment, values

A signed byte

UInt16

2

uint16_t

int

name, comment, values

A word

Int16

2

int16_t

int

name, comment, values

A signed word

UInt16BE

2

int

name, comment, values

A word, big endian

Int16BE

2

int

name, comment, values

A signed word, big endian

UInt24

3

int

name, comment, values

A 3-byte int

UInt24BE

3

int

name, comment, values

A 3-byte int, big endian

UInt32

4

uint32_t

int

name, comment, values

A dword

Int32

4

int32_t

int

name, comment, values

A signed dword

UInt32BE

4

int

name, comment, values

A dword, big endian

Int32BE

4

int

name, comment, values

A signed dword, big endian

UInt64

8

uint64_t

int

name, comment, values

A qword

Int64

8

int64_t

int

name, comment, values

A signed qword

UInt64BE

8

int

name, comment, values

A qword, big endian

Int64BE

8

int

name, comment, values

A signed qword, big endian

PrefixedVarUInt64

1-9

int

name, comment, values

variable uint64 where first byte tells how many bytes there are (always big-endian)

VarUInt64

1-10

int

name, comment, values

variable uint64 where first bit (0x80) of every byte tells if there is more data to come

VarUInt64BE

1-10

int

name, comment, values

variable uint64 where first bit (0x80) of every byte tells if there is more data to come, big endian

Float

4

float

float

name, comment

A 32bits float

FloatBE

4

float

name, comment

A 32bits float, big endian

Double

8

double

float

name, comment

A 64 bits double-precision float

DoubleBE

8

float

name, comment

A 64 bits double-precision float, big endian
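Most of the types above map directly onto python’s struct formats, but the 3-byte and variable-length integers do not. The sketch below shows how such values can be decoded by hand. It is an illustration, not the code Malcat uses: the function names are hypothetical, and for the continuation-bit encoding it assumes that the 7-bit groups are stored least-significant first (LEB128 style).

```python
def read_uint24(data, pos=0, big_endian=False):
    """Decode a 3-byte unsigned integer."""
    return int.from_bytes(data[pos:pos + 3], "big" if big_endian else "little")

def read_varuint(data, pos=0, max_bytes=10):
    """Decode a continuation-bit integer; returns (value, bytes_consumed).

    Assumption: 7-bit groups are stored least-significant first (LEB128 style).
    """
    value = 0
    for i in range(max_bytes):
        if pos + i >= len(data):
            raise ValueError("truncated integer")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:               # high bit clear: last byte
            return value, i + 1
    raise ValueError("integer longer than max_bytes")
```

Returning the number of consumed bytes alongside the value matters here, since the size of such a field is only known once it has been decoded.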

Strings

Strings are fields which contain character data. The value returned when yielding a string field is either a str, or a malcat.FieldAccess for prefixed strings, which are composed of a size and the string data. Like most fields, they accept a name and comment keyword argument. There are three types of strings:

  • fixed-size strings (the String* fields) whose size is known in advance

  • null-terminated dynamic-size strings (CString*) whose size will be inferred by Malcat by reading the file content

  • prefixed strings (Pascal*, UnicodeString) which are actually structures with two fields: a Size field, and a String field:

Class

Size

Equivalent C type

Python type

Parameters

Description

String

N

char[]

str

N, name, comment, zero_terminated

A fixed-size ascii string

StringUtf8

N

char[]

str

N, name, comment, zero_terminated

A fixed-size UTF8 string

StringUtf16le

N

char16[]

str

N, name, comment, zero_terminated

A fixed-size UTF16-le string

StringUtf16be

N

char16[]

str

N, name, comment, zero_terminated

A fixed-size UTF16-be string

CString

?

char*

str

name, comment, max_size=512

A null-terminated Ascii string, Malcat will truncate the string if bigger than max_size bytes

CStringUtf8

?

char*

str

name, comment, max_size=512

A null-terminated UTF8 string, Malcat will truncate the string if bigger than max_size bytes

CStringUtf16le

?

char16*

str

name, comment, max_size=512

A null-terminated UTF16-le string, Malcat will truncate the string if bigger than max_size bytes

CStringUtf16be

?

char16*

str

name, comment, max_size=512

A null-terminated UTF16-be string, Malcat will truncate the string if bigger than max_size bytes

PascalString

Prefixed

malcat.FieldAccess

name, comment

A structure composed of the size of the string (dword), followed by the actual Ascii string

PascalString8

Prefixed

malcat.FieldAccess

name, comment

A structure composed of the size of the string (byte), followed by the actual Ascii string

UnicodeString

Prefixed

UNICODE_STRING

malcat.FieldAccess

name, comment

A structure composed of the size of the string (dword), followed by the actual UTF16-le string
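To make the difference between null-terminated and prefixed strings concrete, here is a minimal standalone sketch. The helper names are hypothetical and the code is not taken from Malcat; it only covers single-byte terminators (so not the UTF16 variants) and assumes that the dword size prefix is little-endian.

```python
import struct

def read_cstring(data, pos=0, max_size=512, encoding="ascii"):
    """Read a null-terminated string, truncated at max_size bytes."""
    window = data[pos:pos + max_size]
    end = window.find(b"\x00")
    raw = window if end < 0 else window[:end]
    return raw.decode(encoding, errors="replace")

def read_pascal_string(data, pos=0):
    """Read a dword length followed by that many ASCII bytes.

    Assumption: the length prefix is little-endian.
    Returns (size, text), mirroring the Size and String sub-fields.
    """
    size, = struct.unpack_from("<I", data, pos)
    raw = data[pos + 4:pos + 4 + size]
    if len(raw) != size:
        raise ValueError("string runs past end of data")
    return size, raw.decode("ascii", errors="replace")
```

The prefixed variant returns two values because, as in the table above, it is really a small structure rather than a single atomic field.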

Time

Times and dates can be expressed in various formats, and the following field types cover the most common ones. The value returned when yielding a time field is a datetime.datetime instance (or a datetime.date or datetime.time instance for DosDate and DosTime respectively). Like most fields, they accept a name and comment keyword argument.

Class

Size

Equivalent C type

Python type

Parameters

Description

Filetime

8

FILETIME

datetime.datetime

name, comment

Time format used in Windows

Timestamp

4

time_t

datetime.datetime

name, comment

Unix timestamp

TimestampBE

4

datetime.datetime

name, comment

Unix timestamp, big endian

Timestamp2000

4

datetime.datetime

name, comment

Like a Unix timestamp, but base date is 2000-01-01

DosDateTime

4

datetime.datetime

name, comment

Date-time format used in DOS

DosDate

2

datetime.date

name, comment

Date format used in DOS

DosTime

2

datetime.time

name, comment

Time format used in DOS
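For reference, the encodings behind some of these types can be decoded with a few lines of standard python. This is a sketch with hypothetical function names, not Malcat’s code; in particular, returning timezone-aware UTC values for the first two functions is a choice made here, and Malcat may behave differently.

```python
from datetime import datetime, timedelta, timezone

def filetime_to_datetime(value):
    """FILETIME: count of 100-nanosecond ticks since 1601-01-01 UTC."""
    return datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=value // 10)

def timestamp2000_to_datetime(seconds):
    """Seconds since 2000-01-01 (UTC assumed here)."""
    return datetime(2000, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=seconds)

def dos_to_datetime(date, time):
    """Decode the 16-bit DOS date and time words (2-second resolution)."""
    return datetime(
        1980 + (date >> 9), (date >> 5) & 0x0F, date & 0x1F,
        time >> 11, (time >> 5) & 0x3F, (time & 0x1F) * 2,
    )
```

Note that dos_to_datetime raises a ValueError on impossible values such as month 0, which is a cheap way to spot a field that is not really a DOS date.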

Pointers

Malcat also supports pointer fields. These are numerical fields which point to other data, and they are displayed as clickable fields in the Structure/text view. The value returned when yielding a pointer field is an int. Like most fields, they accept a name and comment keyword argument, but they also accept the following additional arguments:

  • hint: the field type at the end of the pointer. Malcat may use this information to display the pointee field (especially for strings)

  • zero_is_invalid: if True, a value of zero is considered to be invalid, and the pointer will be displayed in the dim color inside the Structure/text view

  • base: only for Offset* instances: the offset is actually a delta offset relative to this base offset

Class

Size

Equivalent C type

Python type

Parameters

Description

Offset16

2

uint16_t

int

name, comment, hint, zero_is_invalid, base

A 16 bits file offset

Offset32

4

uint32_t

int

name, comment, hint, zero_is_invalid, base

A 32 bits file offset

Offset32BE

4

int

name, comment, hint, zero_is_invalid, base

A 32 bits file offset, big endian

Offset64

8

uint64_t

int

name, comment, hint, zero_is_invalid, base

A 64 bits file offset

Offset64BE

8

int

name, comment, hint, zero_is_invalid, base

A 64 bits file offset, big endian

Rva

4

uint32_t

int

name, comment, hint, zero_is_invalid

A relative virtual address, aka a 32-bits displacement relative to FileTypeAnalyzer.imagebase

Va32

4

uint32_t

int

name, comment, hint, zero_is_invalid

A 32bits memory address

Va32BE

4

int

name, comment, hint, zero_is_invalid

A 32bits memory address, big endian

Va64

8

uint64_t

int

name, comment, hint, zero_is_invalid

A 64bits memory address

Va64BE

8

int

name, comment, hint, zero_is_invalid

A 64bits memory address, big endian
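The arithmetic behind the base and zero_is_invalid arguments, and behind relative virtual addresses, is simple enough to write down. The two functions below are hypothetical helpers and only illustrate one possible reading of these options (in particular, that the zero check applies to the stored value before the base is added); they are not part of the Malcat API.

```python
def resolve_offset(value, base=0, zero_is_invalid=False):
    """Turn a stored offset into a file offset; None if flagged invalid."""
    # Assumption: the zero check is done on the raw stored value
    if zero_is_invalid and value == 0:
        return None
    return base + value

def rva_to_va(rva, imagebase):
    """A relative virtual address is a displacement from the image base."""
    return imagebase + rva
```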

Other

Here are a few additional non-aggregate field types which do not really fit in other categories:

Class

Size

Equivalent C type

Python type

Parameters

Description

GUID

16

str

name, comment, microsoft_order

A 16-bytes GUID field. The yielded python value is the string representation of the GUID, e.g. "00020906-0000-0000-C000-000000000046"

Bytes

N

uint8_t[]

bytes

N, name, comment

N raw bytes

Unused

N

void*

bytes

N, name, comment

N unused/reserved/padding bytes, will be displayed in a dimmed color

Align

depends

None

N, name, comment

yield Align(N) will yield an Unused field big enough so that the parsing pointer is aligned on an N-byte boundary afterwards. Yields nothing if the parsing pointer is already aligned

StructAlign

depends

None

N, name, comment

yield StructAlign(N) is only valid within a structure’s parse method. It will yield an Unused field big enough so that the structure size is aligned on an N-byte boundary. Yields nothing if the structure size is already aligned
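Two of the types above hide a small computation: the size of the padding produced by Align, and the byte order used to format a GUID. The sketch below shows both with standard python only. The function names are hypothetical, and the meaning given to microsoft_order (first three groups stored little-endian) is an assumption based on the usual Windows layout.

```python
import uuid

def padding_for(pos, n):
    """Bytes needed to move pos up to the next n-byte boundary (0 if aligned)."""
    return -pos % n

def guid_to_str(raw, microsoft_order=True):
    """Format 16 raw bytes as a GUID string.

    Assumption: microsoft_order means the first three groups are little-endian.
    """
    if len(raw) != 16:
        raise ValueError("a GUID is exactly 16 bytes")
    guid = uuid.UUID(bytes_le=raw) if microsoft_order else uuid.UUID(bytes=raw)
    return str(guid).upper()
```

With padding_for, a parsing pointer at offset 13 and an alignment of 8 gives 3 padding bytes, while an already aligned pointer gives 0, which corresponds to the "yields nothing" case.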

Structures/records

Structures (aka records) are aggregate fields, meaning they are composed of one or more adjacent sub-fields. They are more expressive than C-style structures though, as every types.Struct has its own parse() method.

Contrary to atomic fields, there are very few predefined structure fields in Malcat: you’ll likely have to define your own. Defining a new structure is very simple though:

  • Create a python class inheriting from types.Struct

  • Define a parse(self) method in the structure, which works like the FileTypeAnalyzer.parse() method we have seen before (except that you cannot FileTypeAnalyzer.jump() around since structure fields need to be adjacent)

Note

If you define your structure at the global scope of your parser python file (and if its types.Struct.parse() method does not need to access the types.Struct.parser attribute), it will be available to the user as “python type” in the Apply a custom type dialog. Try it out!

For instance, have a look at the following extract of the structure OptionalHeader32’s parse method:

class OptionalHeader32(Struct):

    def parse(self):
        magic = yield UInt16(name="Magic", comment="magic", values=[
            ("PE32", 0x10b),
            ("ROM", 0x107),
            ("PE32+", 0x20b),
            ])
        if magic != 0x010B and magic != 0x0107:
            raise FatalError("Invalid magic value: {:x}".format(magic))
        yield UInt8(name="MajorLinkerVersion", comment="linker version (major)")
        yield UInt8(name="MinorLinkerVersion", comment="linker version (minor)")
        # ...

Once this structure has been defined, it can be used and yielded like any other field type, e.g.:

optheader = yield OptionalHeader32(name="OptionalHeader", category=Type.HEADER)
linker_version = "{}.{}".format(optheader["MajorLinkerVersion"], optheader["MinorLinkerVersion"])

The value you get back from yielding a structure is a malcat.FieldAccess instance which allows you to access the structure and its fields easily. And that’s it! To help you further, Struct objects also define a couple of attributes/methods that you may access from within your overridden types.Struct.parse() method:

class types.Struct

This class is the base class of custom structure fields.

parser: malcat.FileTypeAnalyzer

a pointer to the current file type parser. Note that this field can only be accessed during the file parsing process, i.e. from within the parse() method.

parse()

You will have to override this method with your parse code

look_ahead(size=1)

Read file data at the current parsing offset. Equivalent to self.parser.read(self.parser.tell(), size). Note that this method can only be called during the file parsing process, i.e. from within the parse() method.

Parameters:

size (int) – how many bytes to read

Return type:

bytes

__len__()

Returns the current length of the parsed structure (i.e. the sum of the sizes of all fields that have been yielded from parse() at this point)

Return type:

int

Arrays

Arrays are aggregate fields too, like structures, with the difference that all their sub-fields have the same type. Malcat supports three types of arrays, depending on whether the array size is known in advance and whether every cell needs to be parsed. The three types of array fields are discussed below:

Class

Size

Equivalent C type

Python type

Parameters

Description

Array

N

Field[N]

malcat.FieldAccess

N, Field, name, comment

A fixed-size array of N elements of type Field. Only the first cell is parsed, all other cells are assumed to have the same size

DynamicArray

?

Field[?]

malcat.FieldAccess

fn_terminator, N, Field, name, comment

An array, but the size of the array is given by the predicate: fn_terminator(cell) == True. Only the first cell is parsed, all other cells are assumed to have the same size

VariableArray

N

malcat.FieldAccess

N, FieldClass, name, comment

An array, but its elements can have different sizes. In practice, this is a structure with N fields, and each field will be parsed. Note that you need to give the field class as parameter for this one.

The simplest array type you can use is the Array field. This is the one you should use when the size of the array is known and all cells are identical. The first two parameters you should give the Array constructor are:

  • The number of cells in the array

  • The field type of the cells

Note

If the array cell type is a structure, the parse method will only be called for the first cell of the array, and Malcat will assume that all subsequent cells have the exact same size. This allows for efficient parsing for even large arrays. If your array cells may have varying size, you should use VariableArray instead.

A simple example, taken from the PE parser:

expdir = yield ExportDirectory(name="ExportDirectory", category=Type.HEADER)
# ...
if nametable_off:
    self.jump(nametable_off)
    names = yield Array(expdir["NameTableEntries"], Rva(hint=String(0, True)), name="ExportNameTable", category=Type.HEADER, parent=expdir)
if ordinaltable_off:
    self.jump(ordinaltable_off)
    ordinals = yield Array(expdir["NameTableEntries"], UInt16(), name="OrdinalNameTable", category=Type.HEADER, parent=expdir)
if addresstable_off:
    self.jump(addresstable_off)
    addresses = yield Array(expdir["AddressTableEntries"], Rva(), name="ExportAddressTable", category=Type.HEADER, parent=expdir)

Sometimes the size of the array is not known in advance, and the array is instead terminated by a special terminator cell. In this case, you can use the DynamicArray field type. For this array type, the first constructor argument is not the size, but a python function that returns True when the last cell (aka terminator cell) is reached. The size of this array type is always at least 1.

Note

While all cells still need to have the same size, using a DynamicArray is a bit less performant than the previous Array type since all cells need to be evaluated by the fn_terminator function.

An example is given below:

def parse_debug(self, dbg_foff, dbg_size):
    self.jump(dbg_foff)

    def is_last_debug_entry(current_cell, current_array_size):
        if current_cell.offset + current_cell.size >= dbg_foff + dbg_size:
            return True
        if current_cell["PointerToRawData"] == 0:
            return True
        if current_array_size > 1024:
            return True # avoid parsing most-likely invalid debug directories
        return False

    dds = yield DynamicArray(is_last_debug_entry, DebugDirectoryEntry(), name="DebugDirectories", category=Type.DEBUG)
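Stripped of everything Malcat-specific, the terminator logic boils down to the loop sketched below. It is a standalone illustration with a hypothetical read_dynamic_array helper: cells are plain struct tuples, the terminator cell is kept in the result (our reading of "always at least 1"), and the count passed to the predicate includes the current cell, which is an assumption.

```python
import struct

def read_dynamic_array(data, pos, cell_fmt, is_last):
    """Read fixed-size cells until is_last(cell, count) returns True.

    The terminator cell is included in the result. Reading also stops
    when the next cell would not fit in the remaining data.
    """
    size = struct.calcsize(cell_fmt)
    cells = []
    while pos + size <= len(data):
        cell = struct.unpack_from(cell_fmt, data, pos)
        cells.append(cell)
        pos += size
        # Assumption: the count given to the predicate includes this cell
        if is_last(cell, len(cells)):
            break
    return cells
```

Since every cell has to be decoded and passed to the predicate, the cost grows with the number of cells, which is why a hard limit such as the 1024 entries of the example above is a sensible safeguard.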

Finally, the last array type that you can use is the VariableArray. This array type supports cells of different sizes. It means that if your cell type is a structure, every cell’s types.Struct.parse() method will be called. Basically, this array is just a structure with N unnamed fields. You can see in the example below that the structure PascalVariable does not have a fixed size: the last field (ExportName) is only present if the structure’s flag Exported is set. As a consequence, you have to use a VariableArray if you want to yield an array of this cell type, since cell sizes may vary:

class PascalVariable(Struct):

    def parse(self):
        yield UInt32(name="TypeIndex")
        flags = yield BitsField(
            Bit(name="Exported", comment="Variable is exported"),
            name="Flags", comment="variable characteristics")
        if flags["Exported"]:
            yield PascalString(name="ExportName")

vars = yield VariableArray(varcount, PascalVariable, name="VariablesArray")
# you get a malcat.FieldAccess back
for var in vars:
    print(var["TypeIndex"])

Note

For this array type, you should give as parameter the class of the cell type to use, and not an instance!
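The reason every cell has to be parsed becomes obvious when the same layout is decoded by hand. The sketch below mirrors the PascalVariable example with plain struct calls. It is not Malcat code: both helpers are hypothetical, the flags are assumed to occupy a single byte with the Exported bit in the lowest position, and the string prefix is assumed to be a little-endian dword.

```python
import struct

def parse_variable(data, pos):
    """Parse one cell: type index, flags byte, optional length-prefixed name."""
    type_index, flags = struct.unpack_from("<IB", data, pos)
    end = pos + 5
    name = None
    if flags & 1:                                   # hypothetical "Exported" bit
        size, = struct.unpack_from("<I", data, end)
        name = data[end + 4:end + 4 + size].decode("ascii")
        end += 4 + size
    return (type_index, name), end

def read_variable_array(data, pos, count, parse_cell):
    """Call parse_cell once per cell, since no two cells need share a size."""
    cells = []
    for _ in range(count):
        cell, pos = parse_cell(data, pos)
        cells.append(cell)
    return cells, pos
```

The offset of cell i is only known once cells 0 to i-1 have been parsed, so there is no shortcut similar to the one used by Array.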

Bitsfields

The last aggregate field type that you can use in your parser is the BitsField type. Bits fields are arrays of single bits with a few twists:

  • Each bit has a name

  • Number of bits has to be a multiple of 8 (if it is not, Malcat will round it up to the next multiple of 8)

The constructor for the BitsField type is also a bit different. In addition to the usual name and comment keyword parameters, it takes as positional arguments all the Bit() fields composing the bits field, in order. If one or more bits are not used, which happens a lot in specifications, you can use the special field NullBits(<number of unspecified bits>) in your bits list. Here is an example taken from the PE parser:

process_heap_flags = yield BitsField(
    Bit(name="HEAP_NO_SERIALIZE", comment="serialized access will not be used for this allocation"),
    NullBits(1),
    Bit(name="HEAP_GENERATE_EXCEPTIONS", comment="system will raise an exception to indicate a function failure, such as an out-of-memory condition, instead of returning NULL"),
    Bit(name="HEAP_ZERO_MEMORY", comment="allocated memory will be initialized to zero"),
    NullBits(14),
    Bit(name="HEAP_CREATE_ENABLE_EXECUTE", comment="blocks that are allocated from this heap allow code execution"),
    NullBits(13),
    name="ProcessHeapFlags", comment="heap flags that correspond to the first argument of the HeapCreate function.")
# you get a malcat.FieldAccess back
if process_heap_flags["HEAP_NO_SERIALIZE"]:
    pass  # ...

This code will produce the following field view in the Structure/text view:

../_images/bitsfield.png

A Bitsfield

And that’s it. Note that when you yield bitfields, you also get a malcat.FieldAccess instance back that lets you access individual bits easily.
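As a rough illustration of what such a declaration means at the byte level, the standalone sketch below decodes a 32-bit value with the same layout as ProcessHeapFlags. It is not Malcat code: LAYOUT and decode_bits are hypothetical names, and the sketch assumes that bits are listed from least significant to most significant (which matches the Windows HEAP_* constant values) and that the field is stored little-endian.

```python
# Hypothetical layout description: a name for a used bit, an int for a run of unused bits
LAYOUT = ["HEAP_NO_SERIALIZE", 1, "HEAP_GENERATE_EXCEPTIONS", "HEAP_ZERO_MEMORY",
          14, "HEAP_CREATE_ENABLE_EXECUTE", 13]

def decode_bits(raw, layout):
    """Map each named bit to a bool, assuming the first entry is the least significant bit."""
    total_bits = sum(1 if isinstance(e, str) else e for e in layout)
    size = (total_bits + 7) // 8          # round up to a whole number of bytes
    value = int.from_bytes(raw[:size], "little")
    result, position = {}, 0
    for entry in layout:
        if isinstance(entry, str):
            result[entry] = bool((value >> position) & 1)
            position += 1
        else:
            position += entry             # skip unused bits
    return size, result

size, flags = decode_bits((0x00040001).to_bytes(4, "little"), LAYOUT)
print(size, [name for name, is_set in flags.items() if is_set])
```

The rounding step mirrors the rule stated above: a layout of three bits would still occupy one full byte.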

Exceptions and error handling

Parsers in Malcat are responsible for both parsing and validating their file type. Any exception thrown before the FileTypeAnalyzer.confirm() method has been called will abort the whole parsing altogether, and the file won’t be considered to be of the parser’s supported type. Once FileTypeAnalyzer.confirm() has been called from within the parser though, the file type will be set for good, and any subsequent exception will simply be reported in the analysis log (you can see the log in the Console window).

Now that’s all nice and good, but which type of exceptions are available to your parser? Malcat defines the base type ParsingError, which is the only type of exception which can be safely thrown from your parser.

Warning

Any exception not inheriting from ParsingError that is raised from your parser will be considered a bug by Malcat: the exception will be displayed in red in the Console window and the analysis will be considered unsuccessful. Please handle all the edge cases properly in your parsing code.

Four additional exceptions inherit from the class ParsingError. You may use and/or encounter them:

FatalError(ParsingError):

You should raise a FatalError exception when you encounter invalid data or structure not respecting the specifications and want to abort the parsing.

OutOfBoundError(ParsingError):

This exception will be raised by Malcat if you try to:

  • Read outside of the file boundaries using any of the read_* methods

  • Yield a structure outside of the file boundaries

SuperposingError(ParsingError):

This exception will be raised by Malcat if you yield a structure/field that overlaps an already-defined structure/field.

InvalidPassword(ParsingError):

You can raise this error from an unpack method to notify the user that the file requires a different password to be opened (cf. The Virtual File System).

You can of course define your own parsing exceptions, just make sure they inherit from ParsingError.
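The rules above can be sketched with a small standalone simulation. The classes below are stand-ins with the same names as Malcat's exceptions (the real ones come from Malcat and are not redefined by your parser), BadChecksumError is a hypothetical custom exception, and run_parser is a made-up helper that only mimics the "before confirm / after confirm" behaviour described in this section.

```python
class ParsingError(Exception):
    """Stand-in for Malcat's ParsingError base class (not the real one)."""

class FatalError(ParsingError):
    """Stand-in for Malcat's FatalError."""

class BadChecksumError(ParsingError):
    """Hypothetical custom exception: inherits from ParsingError as recommended."""

def run_parser(steps):
    """Emulate the rules above; each step is 'confirm' or an exception to raise."""
    confirmed, log = False, []
    try:
        for step in steps:
            if step == "confirm":
                confirmed = True
            else:
                raise step
    except ParsingError as error:
        if not confirmed:
            return "rejected", log          # file is not of this type
        log.append(str(error))              # type stays set, error is only logged
    return ("confirmed" if confirmed else "unconfirmed"), log

print(run_parser([FatalError("bad magic")]))
print(run_parser(["confirm", BadChecksumError("checksum mismatch")]))
```

An exception that does not derive from ParsingError would escape the except clause in this sketch, which loosely corresponds to the "bug" case described in the warning above.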

File layout / regions

Another responsibility of the parser is reconstructing the layout of the analysed file. It means slicing the file into different regions with a meaningful name and permissions. In the Summary view, the layout is displayed visually:

../_images/summary_layout.png

The file layout as displayed in the summary view

For most file types, this is an optional task: it just helps the user to know what the file is made of. For programs though, it is a crucial step: it will tell Malcat which parts of the file are loaded in memory, and at which address. This is vital information for many analyses, such as the disassembler, function recovery, debug information parsing, string scanner, etc.

Note

If you don’t define any section, Malcat will automatically create one for you named "unparsed" and covering the whole file. On the other hand, if you add one or more sections but it/they don’t cover all the file, Malcat will automatically fill the gaps with sections named:

  • "header", for the section gap at the beginning of the file

  • "gap", for all the gaps in the middle of the file

  • "overlay", for the section gap between the last defined section and the end of the file
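The gap-filling rule in this note can be sketched as follows. This is a hypothetical re-implementation for illustration only, not Malcat's actual code: fill_gaps is a made-up helper working on (name, offset, size) tuples, and it assumes that the declared sections do not overlap.

```python
def fill_gaps(sections, file_size):
    """sections: list of (name, offset, size) tuples; returns the completed layout."""
    if not sections:
        return [("unparsed", 0, file_size)]
    layout, cursor = [], 0
    for name, offset, size in sorted(sections, key=lambda s: s[1]):
        if offset > cursor:
            layout.append(("header" if cursor == 0 else "gap", cursor, offset - cursor))
        layout.append((name, offset, size))
        cursor = max(cursor, offset + size)
    if cursor < file_size:
        layout.append(("overlay", cursor, file_size - cursor))
    return layout

for region in fill_gaps([(".text", 0x400, 0x200), (".data", 0x800, 0x100)], 0x1000):
    print(region)
```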

In order to inform Malcat about the layout of the file, you can use the following three functions:

  • FileTypeAnalyzer.set_imagebase(): This tells Malcat the base memory address at which the file will be loaded. It only makes sense for programs loaded in memory.

  • FileTypeAnalyzer.set_eof(): Sets the real/effective size of the current file (i.e. the size as specified in the file format headers).

  • FileTypeAnalyzer.add_section(): Describe a new section of the file. Example:

    self.add_section(sname, r.foff, r.fsize, max(0, self.imagebase + r.rva), r.vsize,
                r = r.section["Characteristics"]["MemRead"],
                w = r.section["Characteristics"]["MemWrite"],
                x = r.section["Characteristics"]["MemExecute"],
                discardable = r.section["Characteristics"]["MemDiscardable"],
            )
    

Telling Malcat the real size of the current file, via FileTypeAnalyzer.set_eof(), is also recommended. During the carving process, if FileTypeAnalyzer.confirm() and FileTypeAnalyzer.set_eof() have both been called, the parsing can be interrupted earlier which improves performance.

Note

If you never call FileTypeAnalyzer.set_eof(), the last byte of the file is assumed to be either the last byte of the last yielded field/structure, or the last byte of the last section, whichever is greater.

Metadata

Parsers can also gather some metadata during the parsing process. What we call metadata is any piece of information which gives some context about the file, such as dates, authors, versions, copyright strings, paths, etc. These metadata can be very valuable for analysts when making clean/malware decisions.

Metadata in Malcat are simple key-value pairs, both keys and values being arbitrary strings. You can also group metadata in categories, e.g. all debug-related metadata, or all export-related metadata, which improves their presentation in the Summary view. To add a metadata entry, simply call the FileTypeAnalyzer.add_metadata() function, for instance:

if int(expdir["TimeDateStamp"].timestamp()) not in (0, 0xffffffff):
   self.add_metadata("Exports date", expdir["TimeDateStamp"].strftime("%Y-%m-%d %H:%M:%S"), category="Exports")
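The same filtering idea can be tried outside of Malcat with plain integers. In the sketch below, format_timestamp is a hypothetical helper and the metadata dictionary merely stands in for calls to add_metadata(); it assumes the raw value is a number of seconds since the Unix epoch, in UTC, as is the case for PE TimeDateStamp fields.

```python
from datetime import datetime, timezone

def format_timestamp(raw):
    """Return a printable date, or None for the 0 / 0xffffffff placeholder values."""
    if raw in (0, 0xFFFFFFFF):
        return None
    return datetime.fromtimestamp(raw, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

metadata = {}   # stand-in for the parser's metadata store
for key, raw in (("Exports date", 86400), ("Debug date", 0)):
    text = format_timestamp(raw)
    if text is not None:
        metadata[key] = text
print(metadata)
```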

The Virtual File System

Some file formats like archives or disk images embed sub-files that the user may want to extract for further analysis. Since recovering the list of files usually involves parsing the internal structures of the file format, the parser is also put in charge of this task.

From within the FileTypeAnalyzer.parse() method, you can notify Malcat of the existence of embedded files via the FileTypeAnalyzer.add_file() method. This method takes up to five parameters, but only the first three are mandatory:

  • vpath: the virtual path of the file, e.g. /directory/entry.txt

  • size: the size of the file once unpacked (or an estimate)

  • unpack_method_name: the method name to call (in this parser’s instance) to unpack the file.

Let us have a quick look at the ZIP parser for instance:

def parse(self, hint):
    self.filesystem = {}    # used internally to map a vfile path to the corresponding LocalFile ZIP structure
    files_seen = set()
    # ...
    if tag == b"PK\x03\x04":
        lfh = yield LocalFile(category=Type.HEADER)
        compressed_size = lfh["CompressedSize"]
        uncompressed_size = lfh["UncompressedSize"]
        # ...
        if "FileName" in lfh:
            fn = lfh["FileName"]
        if compressed_size and uncompressed_size:
            if fn and fn not in files_seen:
                files_seen.add(fn)
                self.add_file(fn, uncompressed_size, "open")    # tells Malcat about the file
                self.filesystem[fn] = (lfh, compressed_size)    # for us internally

Here the parser tells Malcat, via the FileTypeAnalyzer.add_file() method, about all the files listed in the LocalFile structures of the ZIP archive. As a result, these files are listed in the Virtual Files tab:

../_images/newtab.png

The virtual files tab

Now that’s all well and good, but what happens when the user double-clicks the virtual file? Well, that’s when the third parameter of FileTypeAnalyzer.add_file() comes into play: it tells Malcat which method of this parser to call in order to unpack the file. If we look into the code of the ZIP parser, we can indeed see that it defines an open method:

def open(self, vfile, password=None):
   lfh, compressed_size = self.filesystem.get(vfile.path, (None, None))    # recover the info we saved in parse()
   if lfh is None:
       raise KeyError("Unknown file path {}".format(vfile.path))
   # ...
   return self.unpack_manual(lfh, buf, pwd)    # do the unpacking, return a bytes object

Note

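You don’t have to name this method “open”, you can choose whatever name you see fit. Different files can also have different unpack/open methods.

As we can see, this open method takes two input parameters:

  • vfile: the virtual file to unpack; its path attribute is the virtual path that was given to FileTypeAnalyzer.add_file()

  • password: the password to use for unpacking, or None if no password has been provided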

The method should then proceed to unpack the file and return its content as a bytes object. Since they are always called after the file parsing took place, unpack methods have access to all the structures parsed in FileTypeAnalyzer.parse().

And that’s pretty much it. If the file requires a password and the one passed as parameter is wrong, the method should raise an InvalidPassword exception: this will trigger a dialog box in the UI asking the user to provide another password. Any other exception thrown from these methods is displayed to the user in a simple MessageBox.
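The overall pattern (remember what you need during parsing, then use it in the unpack method) can be sketched with a toy, standalone example. Nothing here is Malcat's API: ToyArchive is a hypothetical class, InvalidPassword is a local stand-in for Malcat's exception, and for brevity the unpack method receives the virtual path directly instead of a virtual file object.

```python
import zlib

class InvalidPassword(Exception):
    """Stand-in for Malcat's InvalidPassword exception (not the real class)."""

class ToyArchive:
    """Hypothetical archive: maps a virtual path to (password, zlib-compressed data)."""

    def __init__(self, entries):
        self.filesystem = dict(entries)     # filled during parsing, reused when unpacking
        self.files = [(path, "open") for path in self.filesystem]  # what add_file() would record

    def open(self, path, password=None):
        expected, packed = self.filesystem[path]
        if expected is not None and password != expected:
            raise InvalidPassword(path)     # the UI would now ask for another password
        return zlib.decompress(packed)      # unpack methods return a bytes object

archive = ToyArchive({"/docs/readme.txt": ("secret", zlib.compress(b"hello"))})
try:
    archive.open("/docs/readme.txt")
except InvalidPassword:
    print("password required")
print(archive.open("/docs/readme.txt", password="secret"))
```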

Adding symbols

File symbols, such as PE imports or ELF exported symbols, are valuable information for analysts. It is also the responsibility of the parser to tell Malcat which symbols are defined in the analysed file. This is done by using the method FileTypeAnalyzer.add_symbol(), which helps Malcat associate a virtual address with a symbol name. This should also be done within the FileTypeAnalyzer.parse() method. Example:

ordinal = ordinals[i]
address = addresses[ordinal]
# add symbol
self.add_symbol(self.imagebase + address, name, malcat.FileSymbol.EXPORT)

Different types of symbols can be added; you can find the list here: malcat.FileSymbol.Type.
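For context, the snippet above comes from resolving a PE export table, where a name index leads to an ordinal, which in turn indexes the address table. The standalone sketch below shows that lookup with plain lists. resolve_exports is a hypothetical helper, not part of Malcat, and the returned pairs correspond to what would be passed to add_symbol().

```python
def resolve_exports(imagebase, names, ordinals, addresses):
    """Return (virtual_address, name) pairs, skipping out-of-range ordinals."""
    symbols = []
    for name, ordinal in zip(names, ordinals):
        if ordinal >= len(addresses):
            continue                        # malformed entry: ignore instead of crashing
        symbols.append((imagebase + addresses[ordinal], name))
    return symbols

symbols = resolve_exports(0x400000, ["Init", "Run", "Broken"], [1, 0, 7], [0x1000, 0x2000])
for address, name in symbols:
    print(hex(address), name)
```

Checking the ordinal before using it avoids an IndexError, which would not inherit from ParsingError.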

Advanced topics

If you want to go further, you will find here a few key facts to keep in mind when designing your parser:

A note on performances

While writing parsers in python has a lot of advantages, it can also have an impact on performance. You may start to notice it in particular if your parser tries to yield a large number of structures. Indeed, for each structure that is yielded, the following actions will happen:

  • a python object will be allocated

  • the structure’s types.Struct.parse() method will be called

  • for every sub-field yielded by types.Struct.parse(), python objects will be allocated

  • every object will be converted to C++, which will also lead to more allocations

This cost comes from the expressiveness of the types.Struct class and its types.Struct.parse() method. But sometimes, you don’t need to be expressive, like when you just want to yield simple C-style structures. This is typically the case when the structure’s content is not dependent on the actual data, i.e. the structure will always have the exact same fields.

If you happen to have such a structure, you can make your structure inherit from types.StaticStruct instead of types.Struct. A types.StaticStruct is a structure whose types.StaticStruct.parse() method is called only once: the first time it is yielded. Every subsequent yield of the structure will give back the exact same structure. As a consequence, the types.StaticStruct.parse() method is now a class method, which has several implications:

  • when you yield sub-fields from within types.StaticStruct.parse(), None is always returned (remember, the structure’s content should not depend on the file’s data)

  • you cannot call types.Struct.look_ahead()

But the performance advantages are huge, in particular if you yield a large number of instances of the same structure during your parsing. For instance, the DirectoryEntry structure found in FAT32 volumes is repeated a lot, once for each file in the file system. And since all the fields are static, inheriting from types.StaticStruct makes sense:

class DirectoryEntry(StaticStruct):

    @classmethod
    def parse(cls):
        filename = yield String(8, name="ShortFileName")
        yield String(3, name="FileExtension") # fixed-size strings are ok, C-strings or prefixed strings are not since file data would need to be read
        yield BitsField(
            Bit(name="ReadOnly"),
            Bit(name="Hidden"),
            Bit(name="System"),
            Bit(name="VolumeLabel"),
            Bit(name="Directory"),
            Bit(name="Archive"),
            Bit(name="Device"),
            name="FileAttributes")
        yield UInt8(name="ExtraAttributes")
        yield UInt8(name="CreationTimeFine")
        yield DosDateTime(name="CreationTime")
        yield DosDate(name="LastAccess")
        yield UInt16(name="FirstClusterHigh")
        yield DosDateTime(name="ModificationTime")
        yield UInt16(name="FirstClusterLow")
        size = yield UInt32(name="FileSize")
        # ^-- size would be None in this case, don't read yielded values
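The "describe once, reuse many times" idea behind static structures can be illustrated without Malcat. In the sketch below, CachedLayout is a hypothetical class that builds its field description on first use only and then decodes every record with the cached layout. The field sizes are assumptions made for the example (4 bytes for each DosDateTime, 2 bytes for DosDate), chosen so that the record is 32 bytes long like a FAT directory entry.

```python
import struct

class CachedLayout:
    """Hypothetical stand-in: the field list is built once per class, then reused."""
    _layout = None
    build_count = 0

    @classmethod
    def fields(cls):
        # name/format pairs for a 32-byte FAT directory entry (sizes are assumptions)
        return [("ShortFileName", "8s"), ("FileExtension", "3s"), ("FileAttributes", "B"),
                ("ExtraAttributes", "B"), ("CreationTimeFine", "B"), ("CreationTime", "I"),
                ("LastAccess", "H"), ("FirstClusterHigh", "H"), ("ModificationTime", "I"),
                ("FirstClusterLow", "H"), ("FileSize", "I")]

    @classmethod
    def layout(cls):
        if cls._layout is None:             # only the first use pays the description cost
            cls.build_count += 1
            names, formats = zip(*cls.fields())
            cls._layout = (names, struct.Struct("<" + "".join(formats)))
        return cls._layout

    @classmethod
    def read(cls, data, offset):
        names, packer = cls.layout()
        return dict(zip(names, packer.unpack_from(data, offset)))

volume = bytes(32 * 1000)                   # 1000 zeroed directory entries
entries = [CachedLayout.read(volume, i * 32) for i in range(1000)]
print(len(entries), CachedLayout.build_count, CachedLayout.layout()[1].size)
```

The description is built a single time even though a thousand records are decoded, which is the kind of saving a static structure aims for.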

On another topic, the magic regular expression of your parser can also have an impact on performance if poorly chosen. Indeed, we have seen that the File carving algorithm will call the parser’s FileTypeAnalyzer.parse() method every time that the magic regular expression matches. If your magic regexp is not specific enough, this could lead to a huge number of parsing attempts, which leads to a huge number of C++ -> python transitions, which are rather slow by nature. In order to improve performance, you have two (complementary) solutions:

  • Choose a more precise regular expression as magic. If your file format does not have one at the beginning, you can always choose a rare one found somewhere in the middle of the file and then tell Malcat where the actual start of the file is (cf. Custom parsing start and context). This is what we do for the ISO parser for instance, since the signature is not located at the beginning of the ISO image. This will reduce the number of times your parser’s parse method is called unnecessarily.

  • Call FileTypeAnalyzer.set_eof() and FileTypeAnalyzer.confirm() when you know where the file ends and you are almost sure this is a valid file. When these two functions are called, the File carving algorithm will stop the parsing earlier, saving precious CPU time. For instance, you don’t need to parse all structures and imports to tell if a PE file is a somewhat valid PE file. Parsing a few key structures and the section table is enough to know the file size and validate the file.
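To get a feeling for how much the magic's specificity matters, the standalone sketch below counts candidate matches for a loose and a strict pattern over the same buffer. It uses Python's re module as a stand-in for the pcre2 engine used by Malcat, and the buffer is synthetic data made up for the example.

```python
import re

# synthetic buffer: many stray "PK" byte pairs around a single ZIP local file header magic
data = b"PK" * 50 + b"PK\x03\x04" + b"payload" + b"PK" * 50

loose = re.compile(rb"PK")               # matches every stray "PK"
strict = re.compile(rb"PK\x03\x04")      # matches only the full 4-byte magic

loose_hits = len(loose.findall(data))
strict_hits = len(strict.findall(data))
print(loose_hits, strict_hits)
```

Each hit would translate into one parsing attempt during carving, so the strict pattern avoids a hundred needless calls on this small buffer.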

Custom parsing start and context

Sometimes, locating the start of a file is not as simple as looking for a magic signature. Sometimes the magic signature is not located at the beginning of the file. Or sometimes, you need to get some information from the parent file. Take for example the InnoSetup archive: while the archive is located in the overlay of the PE file, a few key offsets are stored in an RCDATA resource named 11111 of the parent PE installer.

In order to handle these corner cases, Malcat lets you override a special class method named FileTypeAnalyzer.locate(). This class method is called before the actual parsing takes place to locate the real start of the file and give some extra context information via the hint return value. In this method, you should return either None to abort the parsing, or return a pair of:

  • the real start of the file

  • a hint that should be given later as parameter to your parser’s parse() method

Let us look at a first example. The ISO format has its magic signature located at offset 0x8000 (second track). The first 0x8000 bytes are reserved and not specified. Having a regular expression to match at the start of the file is thus impossible. But we can use a marker found at offset 0x8000: \x01CD001\x01:

class ISOAnalyzer(FileTypeAnalyzer):
    category = malcat.FileType.FILESYSTEM
    name = "ISO"
    regexp = r"(?<=\x01)CD001\x01"  # this regexp is located at offset 0x8000

    @classmethod
    def locate(cls, curfile, offset_magic, parent_parser):
        if offset_magic < 0x8001:  # Note: 0x8000 + 1 because the first byte (\x01) is a look-back in the regexp (that's to improve the regexp performances)
            return None # if there are not at least 0x8000 bytes preceding the magic signature, then it can't be a valid ISO file: abort
        return offset_magic - 0x8001, ""    # real start of ISO is regexp match offset - (0x8000 + 1)
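The offset arithmetic of this locate() method can be checked outside of Malcat with Python's re module (used here as a stand-in for pcre2). In this sketch, locate_iso is a hypothetical helper and the images are synthetic buffers, not real ISO files.

```python
import re

MAGIC = re.compile(rb"(?<=\x01)CD001\x01")   # same pattern as above, applied to bytes

def locate_iso(data):
    """Return the offset of the ISO start for the first valid match, or None."""
    for match in MAGIC.finditer(data):
        if match.start() >= 0x8001:          # need 0x8000 reserved bytes + the \x01 look-behind
            return match.start() - 0x8001
    return None

image = b"\x00" * 0x8000 + b"\x01CD001\x01" + b"\x00" * 16
print(locate_iso(image))                     # image starts at offset 0
print(locate_iso(b"junk" + image))           # image embedded after 4 bytes of junk
print(locate_iso(b"\x01CD001\x01"))          # marker too close to the start: rejected
```

Because of the look-behind, the match starts one byte after the \x01 byte stored at offset 0x8000, hence the 0x8001 constant.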

Other times, you need some context from the parent file type. For instance, for the InnoSetup parser:

class InnoSetupAnalyzer(FileTypeAnalyzer):
    category = malcat.FileType.ARCHIVE
    name = "InnoSetup"
    regexp = r"Inno Setup Setup Data \(.{25}\x00{16}.{13}\x5d\x00"  # this regular expression can appear anywhere in the archive

    @classmethod
    def locate(cls, curfile, offset_magic, parent_parser):
        if parent_parser is not None and parent_parser.name == "PE":    # InnoSetup archives are stored in the overlay of the installer, so the parent file is ALWAYS a PE file
            if "Resources.RCDATA.11111.unk.Data" in parent_parser:  # the 11111 resource contains useful offsets
                try:
                    d = parent_parser["Resources.RCDATA.11111.unk.Data"]
                    offsets = TSetupOffsets.deserialize(d)
                    # the archive is split into three parts: the uninstaller, the list of files + script and the actual file data.
                    # the start of the archive is the first of these 3 offsets
                    base = min(offsets.exe_offset, offsets.setup0_offset, offsets.setup1_offset)
                    return base, d.hex()    # <-- we'll give the content of the resource as hex string to our parse's hint parameter
                except Exception:
                    return None
        return None

    def parse(self, hint):  # hint contains the 11111's resource content (hex-encoded)
        offsets = TSetupOffsets.deserialize(bytes.fromhex(hint))    # get back all the offsets

# this class describes the information found in the installer's 11111 RCDATA resource
class TSetupOffsets:

    def __init__(self, id, version, total_size, exe_offset, exe_uncompressed_size, exe_crc, setup0_offset, setup1_offset, offsets_crc):
        self.id = id
        self.total_size = total_size
        self.exe_offset = exe_offset
        self.exe_uncompressed_size = exe_uncompressed_size
        self.exe_crc = exe_crc
        self.setup0_offset = setup0_offset
        self.setup1_offset = setup1_offset

    @staticmethod
    def deserialize(data):
        return TSetupOffsets(*struct.unpack("<12s8I", data))
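Since the hint has to be a string, binary context such as this resource is passed as a hex string and decoded again in parse(). The standalone sketch below checks that round trip for the "<12s8I" format with Python's struct module. The identifier and the integer values are placeholders made up for the example, not real InnoSetup data.

```python
import struct

FORMAT = "<12s8I"   # 12-byte identifier followed by eight little-endian 32-bit integers

# placeholder values, for illustration only
original = (b"ID-PLACEHOLD", 1, 2, 3, 4, 5, 6, 7, 8)
resource = struct.pack(FORMAT, *original)

hint = resource.hex()                                   # what locate() would return as hint
decoded = struct.unpack(FORMAT, bytes.fromhex(hint))    # what parse() would recover

print(struct.calcsize(FORMAT), decoded == original)
```

A resource of any other length makes struct.unpack raise struct.error, which is why the locate() method above wraps the deserialization in a try block.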

And that’s it. If you have more questions regarding these complex topics, don’t hesitate to contact us on discord!

The parser object

class malcat.FileTypeAnalyzer

This is the base class of all Malcat’s file parsers. You have to inherit from it to define a new parser.

Class attributes

The following three attributes are class attributes required and used by Malcat’s parsing and carving algorithms:

name: str

A (unique) short identifier/name for the file type, e.g. “PE” or “ISO”.

category: malcat.FileType.Category

The category of the file type. This tells Malcat which icon to use to represent the file.

regexp: str

A regular expression using pcre2 syntax which has to match for your parser to be even called.

Parsing

parse(hint='')

You have to override this function. It is the function which is called to perform the actual file parsing.

Parameters:

hint (str) – extra free-form context information forwarded from the parent parser or the locate() method.

confirm()

Call this method from within the parse() function to validate the file type. After calling this method, Malcat will fix the current file type to the analysed file forever, even if exceptions are later thrown by the parser.

is_confirmed()

Returns True iff confirm() has been called.

Return type:

bool

size()

The effective size of the current file

Return type:

int

jump(offset)

Move the parsing pointer to another offset so that the next yield XXX puts the field/structure at this address. Only works from within parse().

Parameters:

offset (int) – the new offset

tell()

Returns the current parsing pointer.

Return type:

int

eof()

Returns True iff the current parsing pointer reached the last byte of the file

Return type:

bool

remaining()

Returns the number of bytes left between the current parsing pointer and the end of the file (i.e. how many bytes the parser can still consume)

Return type:

int

__iter__()

Iterate over the list of fields/structures which have been parsed until now (sorted by offset) at the global level. Example:

for field in self:
    print(field.name, field.offset)

Return type:

Iterable[malcat.FieldAccess]

__getitem__(key)

Returns the value of the last field/structure named key (if key is a string) or the ith field/structure (if key is an int) which has been parsed at the global level.

print(self["MZ"]["AddressOfPE"])

Parameters:

key (Union[str, int]) – the name or position of the field

Return type:

FieldAccess for aggregate fields, the python type for atomic fields

at(key)

Returns an accessor to the last field/structure named key (if key is a string) or the ith field/structure (if key is an int) which has been parsed at the global level.

if self.at(0).name != "MZ":
    raise ValueError
print(self.at(0)["AddressOfPE"])

Parameters:

key (Union[str, int]) – the name or position of the field

Return type:

malcat.FieldAccess

__contains__(name)

Returns True iff a field named name has been parsed at the global level.

if "MZ" not in self:
    raise ValueError

Parameters:

name (str) – the name of the field

Return type:

bool

classmethod locate(class, file_object, offset_magic, parent_parser)

This class method is called before the actual parsing takes place to locate the real start of the file and give some extra context information via the hint return value. Override this method for complex file formats having a non-standard start of file or needing extra information from the parent file format.

In this method, you should return either None to abort the parsing, or return a pair of:

  • the real start of the file

  • a hint that should be given as parameter to your parser’s parse() method

Parameters:
  • class – this parser’s class

  • file_object (malcat.File) – the file object being parsed

  • offset_magic (int) – the offset in file_object where this parser’s regexp has been found

  • parent_parser (malcat.FileTypeAnalyzer) – the parent file type analyser, if this parser’s instance has been invoked from the File carving process.

Returns:

(real start of file, hint) or None

Return type:

int, str

Layout

add_section(name, offset, size, va=None, vsize=None, r=True, w=False, x=False, discardable=False)

Describe a new section of the file.

Parameters:
  • name (str) – the name of the section, e.g. “.text” or “FileAllocationTable”

  • offset (int) – file offset of the start of the section

  • size (int) – size of the section on disk. Can be zero.

  • va (int) – address of the start of the section in memory. If None, will be assumed to be offset.

  • vsize (int) – size of the section in memory. Can be zero. If None, will be assumed to be size

  • r (bool) – True if the section has READ rights

  • w (bool) – True if the section has WRITE rights

  • x (bool) – True if the section has EXEC rights

  • discardable (bool) – True if the section will be discarded from memory after the file is loaded (e.g. .rsrc section or headers)

sections: Tuple[malcat.VirtualFile]

the current list of defined sections

set_eof(size)

Call this method from within the parse() function to set the effective size of the current file (i.e. the size as specified in the file format headers). During the carving process, if confirm() and set_eof() have both been called, the parsing can be interrupted earlier which improves performance.

Parameters:

size (int) – the actual size of this file (aka offset of the last byte + 1)

Note

If you never call this function, the last byte of the file is assumed to be either the last byte of the last yielded field/structure, or the last byte of the last section, whichever is greater.

set_imagebase(va)

Define the memory address at which the binary will be loaded. Impacts other methods such as malcat.Analysis.r2a() and malcat.Analysis.a2r().

Parameters:

va (int) – the memory address

imagebase: int

the currently defined imagebase

Metadata

add_metadata(key, value, category='')

Describe a new metadata found inside the file. Metadata are arbitrary string values that will be displayed in the Summary view to give some context to the analyst.

Parameters:
  • key (str) – the name of the metadata, e.g. “Creation date”

  • value (str) – the content of the metadata, e.g. “2024-05-10”

  • category (str) – a category for the metadata (optional, e.g. “All dates”)

set_architecture(architecture)

Tells Malcat which CPU architecture should be used for the disassembly

Parameters:

architecture (malcat.Architecture) – the CPU architecture

architecture: malcat.Architecture

The current CPU architecture

Virtual File System

add_file(vpath, size=0, unpack_method_name='', type='', hint='')

Add a new file to the virtual file system.

Parameters:
  • vpath (str) – the virtual path of the file, e.g. /directory/entry.txt

  • size (int) – the size of the file (just for information, a rough estimate is also ok)

  • unpack_method_name (str) – the method name to call (in this parser’s instance) to unpack the file.

  • type (str) – you can force a given parser type to be applied by setting this parameter to the parser’s name (e.g. “PNG”)

  • hint (str) – a hint to give to the virtual file’s parser parse() method

files: List[malcat.VirtualFile]

Added files

Symbols

add_symbol(memory_address, name, type=malcat.FileSymbol.EXPORT)

Add a new symbol

Parameters:
  • memory_address (int) – the memory address (aka VA) of the symbol

  • name (str) – the name of the symbol

  • type (malcat.FileSymbol.Type) – type of the symbol

symbols: Iterable[malcat.FileSymbol]

An iterator over the list of defined symbols

I/O

A few helper functions to help you with parsing

read(where=None, size=1)

Read file data at the given offset

Parameters:
  • where (int) – the file offset where to read. If None, tell() will be used.

  • size (int) – how many bytes to read

Return type:

bytes

look_ahead(size=1)

Read file data at the current parsing offset. Equivalent to self.read(self.tell(), size).

Parameters:

size (int) – how many bytes to read

Return type:

bytes

read_cstring_ascii(where=None, max_bytes=512)

Read a null-terminated C string at the given offset

Parameters:
  • where (int) – the file offset where to read the string. If None, tell() will be used.

  • max_bytes (int) – maximal size of the string to return

Return type:

str

read_cstring_utf8(where=None, max_bytes=512)

Read a null-terminated UTF8-encoded string at the given offset

Parameters:
  • where (int) – the file offset where to read the string. If None, tell() will be used.

  • max_bytes (int) – maximal number of bytes to read

Return type:

str

read_cstring_utf16le(where=None, max_bytes=512)

Read a null-terminated utf16le-encoded string at the given offset

Parameters:
  • where (int) – the file offset where to read the string. If None, tell() will be used.

  • max_bytes (int) – maximal number of bytes to read

Return type:

str

read_cstring_utf16be(where=None, max_bytes=512)

Read a null-terminated utf16be-encoded string at the given offset

Parameters:
  • where (int) – the file offset where to read the string. If None, tell() will be used.

  • max_bytes (int) – maximal number of bytes to read

Return type:

str
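As a rough mental model of what the read_cstring_* helpers do, here is a standalone sketch working on a bytes object. It is not Malcat's implementation: read_cstring is a hypothetical function, and its behaviour when no terminator is found within max_bytes (returning what has been read so far) and on undecodable bytes (replacement characters) are assumptions made for the example.

```python
def read_cstring(data, where, max_bytes=512, encoding="ascii", unit=1):
    """Read up to max_bytes at `where`, stopping at the first null code unit."""
    window = data[where:where + max_bytes]
    end = len(window) - len(window) % unit      # ignore a trailing partial code unit
    for i in range(0, end, unit):
        if window[i:i + unit] == b"\x00" * unit:
            end = i
            break
    return window[:end].decode(encoding, errors="replace")

blob = b"abc\x00def" + "hi".encode("utf-16le") + b"\x00\x00zz"
print(read_cstring(blob, 0))                                 # ASCII string at offset 0
print(read_cstring(blob, 7, encoding="utf-16le", unit=2))    # UTF-16LE string at offset 7
print(read_cstring(blob, 0, max_bytes=2))                    # truncated by max_bytes
```

The unit parameter reflects the fact that a UTF-16 terminator is two zero bytes aligned on a code unit, not a single zero byte.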

Field Access

Within the FileTypeAnalyzer.parse() method, yielding structures/records, arrays or bitfields gives you back a malcat.FieldAccess instance. This accessor class allows you to inspect the field, its name, address, value and, for aggregate fields, access all the contained sub-fields.

Note

malcat.FieldAccess instances have the same interface as malcat.StructAccess, with the only difference being that they have no address field (because the file layout is still unknown at this stage):

class malcat.FieldAccess

Has exactly the same interface as malcat.StructAccess, with the only difference being that there is no address field

Attributes

All types of fields have the following attributes and methods:

value: depends on the field type

the value of the field. For aggregate fields, this would return itself (aka a FieldAccess instance) since aggregate fields have no value per se. For atomic fields, the returned type depends on the field type: int, str, datetime, etc.

name: str

the name of the field. Example:

print(self["Directories"][0].name)
>> Directories[0]
print(self["Directories"][0].StreamSize.name)
>> StreamSize

offset: int

the physical address of the field. Fields can only be defined on file-backed memory, so they always have a valid physical address.

size: int

how many bytes the field takes on disk

__len__()

return the field’s size

Return type:

int

bytes: bytes

raw bytes of field on disk. Returned bytes have a size of size

has_enum()

some atomic fields have a fixed set of values. If so, has_enum will be true

Return type:

bool

enum: str

the textual representation of the field’s value if the field has an enum defined (i.e. has_enum() is True). If the field is not an enum, the empty string is returned. Example:

print(self["PE"].Machine.value)
>> 332
print(self["PE"].Machine.enum)
>> IMAGE_FILE_MACHINE_I386

Aggregate fields

For structures/records, arrays and bitfields, you have access to the following additional methods:

count: int

number of sub-fields/members of the aggregate. For atomic fields this attribute is still defined, but it will always be 1

__iter__()

Iterate over all the aggregate’s members, i.e. all the bits of a bitfield, all the rows of an array or all members of a record field

sections = self["Sections"]
for s in sections:
    print("{}: #{:x}".format(s["Name"], analysis.map.a2p(s["PointerToRawData"])))

Returns:

iterator over the list of field members

Return type:

iterator over FieldAccess instances

Raises:

Error for atomic fields

__getitem__(interval)

Iterate from the ith to the jth sub-element of the array/record/bitfield

for s in self["Sections"][1:]:
    print("#{:x}: {}".format(s.offset, s.name))

Parameters:

interval (slice) – index interval

Return type:

iterator over the list of members (FieldAccess)

__getitem__(i)

return the value of the ith aggregate member of the field

is_executable = self["PE"]["Characteristics"][1]

Parameters:

i (int) – position of the aggregate member to query

Return type:

FieldAccess for aggregate fields, the python type for atomic fields

Raises:

KeyError if the aggregate field has fewer than i members

__getitem__(name)

return the value of the first aggregate member named name. Note that this method is not valid for arrays, since array cells don’t have names.

is_executable = self["PE"]["Characteristics"]["ExecutableImage"]

Parameters:

name (str) – name of the member

Return type:

FieldAccess for aggregate fields, the python type for atomic fields

Raises:

KeyError if no member named name can be found

at(name)

return an accessor to the first aggregate member named name. Note that this method is not valid for arrays, since array cells don’t have names.

is_executable = self["PE"]["Characteristics"].at('ExecutableImage').value

Parameters:

name (str) – name of the member

Return type:

FieldAccess

Raises:

KeyError if no member named name can be found

at(i)

returns an accessor to the ith aggregate member

is_executable = self["PE"]["Characteristics"].at(1).value

Parameters:

i (int) – position of the aggregate member to query

Return type:

FieldAccess

Raises:

KeyError if the aggregate field has fewer than i members

__getattr__(name)

return an accessor to the first aggregate member named name. Note that this method is not valid for arrays, since array cells don’t have names.

is_executable = self["PE"]["Characteristics"].ExecutableImage.value
# equivalent
is_executable = self["PE"]["Characteristics"].at('ExecutableImage').value
# equivalent
is_executable = self["PE"]["Characteristics"]['ExecutableImage']

Parameters:

name (str) – name of the member

Return type:

FieldAccess

Raises:

KeyError if no member named name can be found

Symbols

class malcat.FileSymbol

A symbol is a name attached to a given memory address

address: int

The memory address (aka VA) of the symbol

name: str

Name of the symbol

type: malcat.FileSymbol.Type

Which kind of symbol it is

class malcat.FileSymbol.Type

EXPORT

An exported function. Will be used by Malcat’s CFG reconstruction algorithm as an entry point

IMPORT

An imported function / variable

ENTRY

Entry point of a program. Note that there can be multiple entry points in one program (e.g. TLS callbacks)

FUNCTION

An internal, non-exported function

DATA

A variable