So far distributed apps seem to be the best use case for why it is important to be able to walk through code and do it by hand with a pencil and paper. I have been writing analytics code that runs on a cluster for close to a year and still have not been able to figure out a good way of debugging the code that runs on it.
Being able to bring it down to a testcase works well in some situations but I have found it ineffective in the long run. When you are dealing with a few terabytes of data it can be difficult to cut the problem down to a few megs of input data that you can reason about. While the framework I am using lets me run code on a single process so I can debug it, it is again not always possible to bring the input data down small enough that it is manageable on a single process. On top of that, the behavior you are trying to debug may not show up on such a simple case. It's pretty impossible to use a traditional debugger if the bug only shows up in a few megs of data when you are dealing with a terabyte, how can you step through a 6 hour run if it only shows up on a small portion of that data? How do you predict what machine that bad data will run on if you don't even know what the bad data is?
I have so far been using three methods to debug a distributed app.
- If I have some idea what the bad data looks like, I'll look for some approximation of it in the input data and throw an exception with the debugging info I'm interested in. Exceptions make their way back to the main console of where the jobs are run from so I can see them easy enough.
- Use standard IO. The annoying part is the framework I use only logs this out to local files on the machine it ran on, so I have to wrap the grep in an ssh call to each machine. It's also sometimes difficult to know what to even print out. Too much printing can also significantly reduce performance and even fill up the disk with logging data. Logging data as big as the input data is not any easy to comb through.
- Just reading through the code by hand and working it out with pencil and paper. This has only really worked if I can get a good mental image of what the input is like. If I have a terabyte of input from untrusted sources it can be quite freeform and difficult to reason about as I walk through the code by hand.
Right now I am experiencing a bug that I have been unsuccessful in tracking down. The problem is it only shows up when I use it on huge amounts of data. I have two ways of doing the same thing. One of them takes way too long and the other takes a very short time, and the short way is not producing the same numbers as the long way. It is off by only about 500 million in 5 billion rows of input and it is being a very frustrating bug to track down. The main problem is that I cannot produce a smaller set of input files to cause the issue. Runs take over 6 hours with the input file I need, so a single test can take almost my entire work day. If anyone has any suggestions, please let me know.