Why writing your own Writables is the best thing you can do
Hadoop is a very nice tool that has, over the last few years, become the workhorse of the maturing data science industry. It does a lot of things for you, mostly shuffling your data to the computer where it is needed for computation. To do this it has a transport mechanism that carries messages across nodes. Each message is made of two parts: a routing address, which states where it should end up, and a payload, which is what you are transporting. This is as old as the world, and it also applies to Hadoop.
In this post I will focus on the payload, as Writables are effectively the entry point for making your application more reliable and maintainable.
The basic Writable interface consists of two very simple methods: a write method, in which it is the developer's responsibility to write every field that needs to travel to a DataOutput object, and a readFields method, in which every field should be read back into the object state from a DataInput. The nice thing about the DataInput and DataOutput interfaces is that they provide a read and a write method for each primitive type and for Strings, freeing you from most of the hassle of casting and converting raw bytes into something usable.
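To make the two methods concrete, here is a minimal sketch. The Writable interface below is a local stand-in that mirrors the two methods of Hadoop's org.apache.hadoop.io.Writable, so the example compiles without Hadoop on the classpath. PageVisit, its fields and the roundTrip helper are hypothetical names invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Assumption: local stand-in mirroring org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical payload with one String field and one int field.
class PageVisit implements Writable {
    String url = "";
    int hits;

    @Override
    public void write(DataOutput out) throws IOException {
        // Fields are written in a fixed order...
        out.writeUTF(url);
        out.writeInt(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and must be read back in exactly the same order.
        url = in.readUTF();
        hits = in.readInt();
    }

    // Serialises the object to a byte array and reads it into a fresh instance.
    static PageVisit roundTrip(PageVisit original) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));
        PageVisit copy = new PageVisit();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

The only contract between the two methods is the order of the fields: whatever write puts on the wire, readFields has to consume in the same sequence.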
There are, however, a few idiosyncrasies. At first sight, relying on the end-of-file exception (EOFException) that is thrown once the input has been consumed sounds like a good idea. You just read your fields in a while(true) loop and you are a happy bunny. Until you make part of your state a collection of Writables. Then you are in a pickle forever. Why? Very simple: the contained type will catch the EOFException, react to it as expected and not forward it upstream since, in all fairness, the exception has been handled. So now you have to change the behaviour of a Writable depending on its position in the encapsulation. Is it the encapsulated object or the encapsulating one? If it is encapsulated, it has to forward the exception up the call chain; if it encapsulates, it just consumes it. This is a textbook example of why it's a bad idea to rely on exceptions for program flow.
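The problem can be reproduced in a few lines. The following is only a sketch under stated assumptions: Writable is again a local stand-in for Hadoop's interface, and EofIntBag and EofDemo are hypothetical names. Two bags are written back to back, which is what an enclosing Writable would do with a collection of them, and then read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumption: local stand-in mirroring org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical Writable that reads ints until the input runs out.
class EofIntBag implements Writable {
    final List<Integer> values = new ArrayList<>();

    @Override
    public void write(DataOutput out) throws IOException {
        for (int value : values) {
            out.writeInt(value);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        values.clear();
        try {
            while (true) {
                values.add(in.readInt());
            }
        } catch (EOFException endOfInput) {
            // Swallowed here: the caller never learns the input is exhausted,
            // and nothing marked where this bag was supposed to stop.
        }
    }
}

class EofDemo {
    // Writes a bag of 2 ints and a bag of 3 ints, then reads two bags back.
    // Returns the sizes of the two bags as they come out of readFields.
    static int[] readTwoBags() throws IOException {
        EofIntBag first = new EofIntBag();
        first.values.addAll(Arrays.asList(1, 2));
        EofIntBag second = new EofIntBag();
        second.values.addAll(Arrays.asList(3, 4, 5));

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        first.write(out);
        second.write(out);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        EofIntBag readFirst = new EofIntBag();
        readFirst.readFields(in);
        EofIntBag readSecond = new EofIntBag();
        readSecond.readFields(in);
        return new int[] {readFirst.values.size(), readSecond.values.size()};
    }
}
```

In this sketch the first bag comes back with all five values and the second one comes back empty. An enclosing while(true) loop waiting for its own EOFException would never receive one, because every nested readFields has already swallowed it.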
When storing collections in a DataOutputStream, the simplest and most straightforward approach is to prepend the size of the collection to its elements. That way the reader knows exactly how many elements to expect and never has to loop until the input runs out.
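A minimal sketch of the size-prefix approach, under the same assumptions as before: Writable is a local stand-in for Hadoop's interface, and CountedIntBag and CountedDemo are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Assumption: local stand-in mirroring org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical Writable that writes its element count before its elements.
class CountedIntBag implements Writable {
    final List<Integer> values = new ArrayList<>();

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(values.size());   // size prefix
        for (int value : values) {
            out.writeInt(value);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        values.clear();
        int size = in.readInt();       // read exactly this many elements
        for (int i = 0; i < size; i++) {
            values.add(in.readInt());
        }
    }
}

class CountedDemo {
    // Writes a bag of 2 ints and a bag of 3 ints, then reads two bags back.
    static int[] readTwoBags() throws IOException {
        CountedIntBag first = new CountedIntBag();
        first.values.addAll(Arrays.asList(1, 2));
        CountedIntBag second = new CountedIntBag();
        second.values.addAll(Arrays.asList(3, 4, 5));

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        first.write(out);
        second.write(out);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        CountedIntBag readFirst = new CountedIntBag();
        readFirst.readFields(in);
        CountedIntBag readSecond = new CountedIntBag();
        readSecond.readFields(in);
        return new int[] {readFirst.values.size(), readSecond.values.size()};
    }
}
```

Each bag now behaves the same whether it is read on its own or nested inside another Writable, and an EOFException goes back to meaning that the data really is truncated.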
As usual, the code is over at GitHub; this time I decided to use Maven as the build mechanism. Once you have cloned the repository, a simple mvn test will execute all the unit tests associated with the code. I hope you learn from this code, which clearly shows how not to do things, and that you consider doing them properly.