Tuesday, October 16, 2007

General Binary File Parser in Python

The traditional way of parsing (reading) a binary file is to represent each binary structure as a C structure and then write a read function for each structure.

Let's take the Java class file as an example:
A class file consists of a single ClassFile structure:
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
......
}


Traditional C++ Code:
class ClassFile {
    u4 magic;
    u2 minor_version;
    u2 major_version;
    u2 constant_pool_count;
    ConstantPool* constant_pool;

    void ReadClassFile(FILE *f)
    {
        magic = read_u4(f);
        minor_version = read_u2(f);
        major_version = read_u2(f);
        constant_pool_count = read_u2(f);
        constant_pool = new ConstantPool[constant_pool_count];
        for (int i = 0; i < constant_pool_count; i++)
        ......
    }
};

Problems:

  • Information about the Java class file is duplicated: once in the fields of ClassFile and again in the ReadClassFile function. Duplication is evil; it is a source of software maintenance cost.
  • The structure of ClassFile is hidden by the noise of the source code.

To avoid the duplication, we can use a function to represent the binary structure (ClassFile) instead of the C++ class above.

    We can define the following Python function:


    def ClassFile():
        magic = u4()
        minor_version = u2()
        major_version = u2()
        constant_pool_count = u2()
        constant_pool = [cp_info() for i in range(constant_pool_count-1)]
        # other fields not implemented yet
        return locals()

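    u4 is a function that reads 4 bytes from the binary file and returns the integer value; u2 does the same for 2 bytes.
    cp_info is a function that reads one constant pool entry.
    return locals() returns the local variables as a dictionary, so each field name maps to its parsed value.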
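The post does not show how u2, u4 and cp_info are written, so here is a minimal sketch of one way they could look. It assumes the input is held in a module-level stream (the names _stream, _read, u1 and parse are hypothetical helpers, not from the original code) and that cp_info only needs to handle two entry kinds, CONSTANT_Utf8 (tag 1) and CONSTANT_Class (tag 7). Class files are big-endian, hence the ">" in the struct formats.

```python
import io
import struct

# Hypothetical module-level input stream; the post does not say
# where u2()/u4() read from, so this is an assumption.
_stream = None

def _read(fmt, size):
    """Read `size` bytes and unpack them as one big-endian integer."""
    data = _stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return struct.unpack(fmt, data)[0]

def u1():
    return _read(">B", 1)

def u2():
    return _read(">H", 2)

def u4():
    return _read(">I", 4)

def cp_info():
    # Simplified: only two of the constant pool entry kinds are handled.
    tag = u1()
    if tag == 1:    # CONSTANT_Utf8: u2 length followed by that many bytes
        length = u2()
        value = _stream.read(length)
        return locals()
    if tag == 7:    # CONSTANT_Class: u2 index of the class name
        name_index = u2()
        return locals()
    raise NotImplementedError("constant pool tag %d" % tag)

def ClassFile():
    magic = u4()
    minor_version = u2()
    major_version = u2()
    constant_pool_count = u2()
    constant_pool = [cp_info() for i in range(constant_pool_count - 1)]
    # other fields not implemented yet
    return locals()

def parse(data):
    """Point the shared stream at `data` and parse a ClassFile from it."""
    global _stream
    _stream = io.BytesIO(data)
    return ClassFile()

# A hand-built header with a two-entry constant pool, for illustration only.
sample = (struct.pack(">IHHH", 0xCAFEBABE, 0, 50, 3)
          + struct.pack(">BH", 1, 3) + b"Foo"   # Utf8 "Foo"
          + struct.pack(">BH", 7, 1))           # Class -> entry 1
info = parse(sample)
print(hex(info["magic"]), info["major_version"], len(info["constant_pool"]))
```

A shared stream keeps the format functions free of arguments, which is what lets ClassFile read like the specification. The cost is that only one file can be parsed at a time.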
    Pros:

    • The format is specified in a well-known programming language.

    • The code is very close to the binary file format specification (the same high level).

    • The code is better than the specification in the sense that it is executable.



    Cons:
    • Parsing but not editing. This method is useful when you want to parse (read) a binary file, but it will not work if you want to edit (write) one.

    • For general parsing and editing of binary files, Google is your friend. See http://www.gigamonkeys.com/book/practical-parsing-binary-files.html

    • In practice, it is very common that you only need to parse a binary file, so why make things complex with functionality that you don't need?