C struct like Parsing of Binary Data with Java

Binary Formats:

Oh, the wonders of binary data.   Something too many programmers no little about, even though computers are binary processors.  Ironic. It seems odd that there are few facilities to deal with binary data elegantly in “modern” programming languages.  This is a shame since it is more natural for a computer to read binary forms. This should make coding easier for reading binary forms than other forms of data like text or xml. It is easier in lower level programming languages such as C.  Also, it should be inherently more efficient since it uses the smallest amount of natural space for storing information. 

To this end, C does have a wonderful idiom for parsing simple non-variable length binary data directly into C structs.

#pragma pack(1)
struct MyStruct
{
unsigned int PD_MSG_TYPE:8;
unsigned int NUM_REQUESTS:4;
unsigned int NUM_RESPONSES:4;

.... [ snip happens] ....

unsigned int GPS_ACQ_CAP:12;
unsigned int LOC_CALC_CAP:12;
};

You simply allocate a memory space for this struct and read straight into this allocated structure.  It cannot be simpler or faster.  Each field of the struct is fully packed into bytes and the alignment issues of those bits into bytes are dealt with by the C compiler.

There are places where this starts to break down.  If you have to store long arrays, string or other variable length data, you have to create a more dynamic data structures and split up you reading based upon lengths or tags, etc.  But, for simple stuff (and there is a lot of simple stuff out there) this works great.

For further study of C bit packing see:

Other advantages of bit fields are:

  • code is self documenting
  • bits are packed for efficient use of the memory or bandwidth
  • no bit arithmetic by the programmer and therefore less error prone
  • speed – it is very fast since there are no checks on the buffer from the network [1]
  • real programmers use this technique – well used to. [2]

There are inefficiencies with the above structure for certain types of high-performance, computational based algorithms.  The odd alignment in memory causes native compilers to produce code awkward for some processors to deal with at top speed.  Memory data alignment is the subject for others to discuss.  Here are focused on programmer performance and not machine performance.  It is important to get a job done reliably and quickly.

It is often not necessary to deal with data in this form.  The average programmer rarely runs into issues of this kind, but when you do run into it having the right tool/pattern to address it makes all the difference.

The Problem:

Now that I have hopefully sold you on the merits on binary format, and the C idiom to read or write them, I can get to the problem.

Java does not support this C idiom directly.  My searches have not lead me to anything nearly as elegant as the C construct above.  In java, you end up with a series of gets from some IO library:

class MyStruct
{
    int i;
    long l;
    // possibly 10's or 100's of other fields
}

And then in the read routine you might have

int i=stream.getInt();
long l = stream.getLong();
// possibly 10's or 100's of other get calls

etc, etc, etc, until the entire field set is read.  You have to hand code things twice.  This breaks the rule of “don’t repeat yourself” which I buy religiously.

Early during the .Net beta I wanted to post something similar for C# in that you could do single call read for entire structures.  This idea required the use of .Net Annotations.  This was something java did not have at that time.  Now java does.  So, I want to show a way to build this functionality almost directly into the syntax of java itself.  The end result is a simple Annotation and a utility [3] that uses reflection to safely read the network binary blob and automatically populate the class field values.  This is similar to what hibernate, toplink, and now JPA supports.  However, the goal here is not to build the most elegant framework but to show the beginnings of one.

Update: I recently stumbled upon this struct class that seems to address this idiom and problem. It appears to be a complete solution but does not address all types of binary data such as big versus little endian. The solution I present does allow for extensions to the many varied types of binary data, but it is not complete – just the start of a solution – just an idea. In addition, this struct class does involve creating objects containing other objects which might be a concern for memory efficiency.

The class definition for a C like struct could work something like this in java:

final public class TestStruct
{
    public TestStruct() {} // default constructor
    @Bits(len=8) int PD_MSG_TYPE;
    @Bits(len=4) int NUM_REQUESTS;
    @Bits(len=4) int NUM_RESPONSES;
  ...  snip happens  ...
    @Bits(len=12) int GPS_ACQ_CAP;
    @Bits(len=12) int LOC_CALC_CAP;
}

Similar to the C struct above, this class definition shows the bit-allocations of each field in the class as it would appear in the binary bit stream.

The next step is to provide a single routine that reads from a binary buffer into the a class instance:

    public static int parseBin(byte[] buff, Object obj, int bitpos)

The use of this routine would be simply:

    TestStruct ts = new TestStruct();
    BinaryParser.parseBin(buff, ts, 0);

How does one implement parseBin?

There are probably several ways to implement this – everything from byte-code manipulation for extremely high-speed parsing to simple code that extracts the bits and sets the members using reflection.  I will show the much simpler way here:

    public static int parseBin(byte[] buff, Object obj, int bitpos) throws IllegalArgumentException, IllegalAccessException
    {
        Class c = obj.getClass();
        byte curr=0;
        for (Field f : c.getDeclaredFields())
        {
            Annotation[] tations = f.getAnnotations();
            if ( tations.length!=1  )
                throw new RuntimeException("Class: " + c.getName() + " has the wrong number of annotations of " + tations.length + " on field: " + f.getName());

            Annotation functor = tations[0];

            if ( functor instanceof Bits )
            {
                Bits bits = (Bits)functor;
                int len = bits.len();
                long l = bit(buff, bitpos, bitpos+len-1);
                bitpos+=len;

                setValue(obj, f, l);

            }
... snip happened

The entire routine can be found at the end of this blog, but this bit of code points out the key tricks.  We walk the field members of the class in order in which they occurr using java reflection.  As we go we are checking for the Bits annotation.  When we find the Bits annotation, we use that annotation to tell us where to get the next thing.  The order of the fields is important here just as it is in the C struct.  The bit routine does the magic of finding the bits in the buffer and converts them to a single unsigned integral value.  That value is then used to set the value of that field on that object.

Here’s the setValue routine [it is like the routine above in that it is not fully baked - it does compile/run for a limited]:

    public static void setValue(Object o, Field f, long l) throws IllegalArgumentException, IllegalAccessException
    {
        // might add domain checks

        // order here is optimized slightly to those types most commonly encountered
        if( f.getType() == int.class )
        {
            f.setInt(o, (int)l);
        }
        else if ( f.getType() == long.class )
        {
            f.setLong(o, l);
        }
        else if( f.getType() == short.class )
        {
            if ( l>0 )
                f.setBoolean(o, true);
            else
                f.setBoolean(o, false);
        }
        else if( f.getType() == byte.class )
... snip happened

Based upon the field where the data is to land, we convert the base type here to the destination type.

The Bits Annotation is just a simple data hold that gets set aside in the class information for relection to use:

@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface Bits {
    public int len();
}

The key things to note is that this annotation is associated with a field and available at runtime.

That’s about it!  These ideas could certainly be extended to include reading a richer type set.  Such things as variable len field, strings of ascii or ebcdic type, etc.  Other more complex type information could be encoded in various annotation types.

How fast is it?

Well, this initial code here is far from optimal at 40,000 fields/sec on a Core2Duo.   There are probably many things that could be done to increase this speed.  For one, some type of preprocessing of the class’s field set and creating a series of anonymous class calls might make things faster as it would by pass all but the first set of reflection calls.

Cheers!
Steve

Notes:

[1] In the more dynamic formats, it is possible to take advantage of buffer overruns in code on the receiving end or reading end.  Many viruses/worms spread using this technique, but for statically allocated structures this is much less likely if not impossible.  Further, newer processors and OSs eliminate this problem by taking advantage of non-executable stack segments or pages.  See buffer overrun

[2]Protocols such as tcp/ip relying on these techniques to minimize overhead.  Old-school office-suite file formats also use these and dynamic techniques to read and write files.  Now they are using xml, but are compelled to compress the files it takes to store a single document into a zip file.  [just try unzipping a xlsx file sometime]  I believe this was done since it give the appearance of making the file more accessible to lesser or non-programmers.  The standards [such as the ISO standards] should certainly make it more accessible by all.  Being able to view the document in it “raw” xml form in a ordinary editor is always an advantage.

[3]This technique is at least partially inspired by perl and C.  It just seems to be that reading data from binary data using java can and should easier to do.  It should be just as easy in java as it is in C or perl where the level of abstraction is higher, but it seems to have been a neglected subject until NIO, but I would hardly call NIO an elegant/precise way to do this same thing. 

Here’s the example source code.

The next idiom to look at is perl unpack in java….

Tags: , , , ,

9 Responses to “C struct like Parsing of Binary Data with Java”

  1. [...] codify.flansite.com » Blog Archive » C struct like Parsing of … [...]

  2. [...] codify.flansite.com » Blog Archive » C struct like Parsing of … internet advertising   [...]

  3. KrisBelucci says:

    Great post! Just wanted to let you know you have a new subscriber- me!

  4. Kelly Brown says:

    Hi, gr8 post thanks for posting. Information is useful!

  5. Hi, gr8 post thanks for posting. Information is useful!

  6. GarykPatton says:

    You know so many interesting infomation. You might be very wise. I like such people. Don’t top writing.

  7. imeldahar says:

    It is remarkable, the helpful information

  8. Farko says:

    I added your blog to Google Reader.

  9. Electronics says:

    Great post, thanks for sharing this with me :)

    I look forward to reading your future posts!

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>