
Tuesday, October 16, 2007

General Binary File Parser in Python

The traditional way of parsing (reading) a binary file is to represent each binary structure as a C structure, then write a read function for each structure.

Let's take the Java class file as an example:
A class file consists of a single ClassFile structure:
ClassFile {
    u4 magic;
    u2 minor_version;
    u2 major_version;
    u2 constant_pool_count;
    cp_info constant_pool[constant_pool_count-1];
    ......
}


Traditional C++ code:
class ClassFile {
    u4 magic;
    u2 minor_version;
    u2 major_version;
    u2 constant_pool_count;
    ConstantPool* constant_pool;

    void ReadClassFile(FILE *f)
    {
        magic = read_u4(f);
        minor_version = read_u2(f);
        major_version = read_u2(f);
        constant_pool_count = read_u2(f);
        constant_pool = new ConstantPool[constant_pool_count];
        for (int i = 0; i < constant_pool_count; i++)
            ......
    }
};

Problems:

  • The layout of the Java class file is stated twice: once in the fields of ClassFile and again in the ReadClassFile function. Duplication is evil; it is a source of software maintenance cost.
  • The information about ClassFile is buried in the noise of the source code.

To avoid the duplication, we can use a function to represent the binary structure (ClassFile) instead of the C++ class above.

    We can define the following Python function:


    def ClassFile():
        # Each call reads the next field from the file, in declaration order.
        magic = u4()
        minor_version = u2()
        major_version = u2()
        constant_pool_count = u2()
        constant_pool = [cp_info() for i in range(constant_pool_count-1)]
        # other fields not implemented yet
        return locals()

    u4 is a function that reads 4 bytes from the binary file and returns their integer value; u2 does the same for 2 bytes.
    cp_info is a function that parses one entry of the constant pool.
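    The helper functions are not shown above, so here is a minimal sketch of how they could look. Everything beyond the names u2, u4, cp_info and ClassFile is an assumption made for illustration: the module-level `_stream`, the `_read`, `u1` and `parse` helpers, and the shape of the dictionary returned by cp_info. Class files are big-endian, hence the `>` in the struct formats. This cp_info only handles a few constant pool tags (Utf8, Integer, Class, String) and gives up on the rest.

```python
import io
import struct

# Hypothetical module-level input: u2()/u4() take no arguments in the post,
# so they have to read from some shared stream.
_stream = None


def _read(fmt):
    """Read and decode one big-endian value described by a struct format."""
    size = struct.calcsize(fmt)
    data = _stream.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def u1():
    return _read(">B")


def u2():
    return _read(">H")


def u4():
    return _read(">I")


def cp_info():
    """Parse one constant pool entry (only a subset of tags is handled)."""
    tag = u1()
    if tag == 1:  # CONSTANT_Utf8: u2 length followed by that many bytes
        length = u2()
        value = _stream.read(length)
    elif tag == 3:  # CONSTANT_Integer: u4 bytes
        value = u4()
    elif tag in (7, 8):  # CONSTANT_Class / CONSTANT_String: one u2 index
        value = u2()
    else:
        raise NotImplementedError("constant pool tag %d" % tag)
    return {"tag": tag, "value": value}


def ClassFile():
    magic = u4()
    minor_version = u2()
    major_version = u2()
    constant_pool_count = u2()
    constant_pool = [cp_info() for i in range(constant_pool_count - 1)]
    return locals()


def parse(data):
    """Point the shared stream at `data` and parse it as a class file."""
    global _stream
    _stream = io.BytesIO(data)
    return ClassFile()


# Hand-built sample: header plus two constant pool entries.
sample = struct.pack(">IHHH", 0xCAFEBABE, 0, 49, 3)
sample += struct.pack(">BH", 1, 5) + b"Hello"  # Utf8 entry
sample += struct.pack(">BH", 7, 1)             # Class entry referring to index 1
parsed = parse(sample)
print(hex(parsed["magic"]), parsed["major_version"], parsed["constant_pool"])
```

    For a real file, `parse(open(path, "rb").read())` would do the same job, although it would stop with NotImplementedError at the first unsupported tag.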
    Pros:

    • The format is specified in a well-known programming language.

    • The code is very close to the binary file format specification (same level of abstraction).

    • The code is better than the specification in the sense that it is executable.



    Cons:
    • Parsing but not editing. This method is useful when you want to parse (read) a binary file, but it will not work if you want to edit (write) one.

    • For general parsing and editing of binary files, Google is your friend. See http://www.gigamonkeys.com/book/practical-parsing-binary-files.html

    • In practice, it is very common to only need to parse a binary file, so why make things complex with functionality that you don't need?
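    To make the "parsing but not editing" limitation concrete: the dictionary returned by ClassFile() holds the values but says nothing about field order or widths, so writing the data back means stating the layout a second time. The sketch below only covers the fixed-size header; `HEADER_FORMAT` and `write_header` are hypothetical names, not part of the method described above.

```python
import struct

# Hypothetical: the layout has to be restated for the write side.
# Big-endian u4 magic, u2 minor_version, u2 major_version, u2 constant_pool_count.
HEADER_FORMAT = ">IHHH"


def write_header(fields):
    """Serialize the fixed-size prefix of a parsed class file back to bytes."""
    return struct.pack(
        HEADER_FORMAT,
        fields["magic"],
        fields["minor_version"],
        fields["major_version"],
        fields["constant_pool_count"],
    )


# A dictionary shaped like the one ClassFile() returns, reduced to the header.
fields = {
    "magic": 0xCAFEBABE,
    "minor_version": 0,
    "major_version": 49,
    "constant_pool_count": 1,
}
fields["major_version"] = 50  # the "edit"
header = write_header(fields)
print(header.hex())
```

    The duplication that the parsing function removed comes straight back here, which is why this approach suits read-only tools best.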