Thursday, October 22, 2009

Suggestions On Debugging Distributed Apps?


So far distributed apps seem to be the best use case for why it is important to be able to walk through code and do it by hand with a pencil and paper. I have been writing analytics code that runs on a cluster for close to a year and still have not been able to figure out a good way of debugging the code that runs on it.




Being able to bring it down to a testcase works well in some situations but I have found it ineffective in the long run. When you are dealing with a few terabytes of data it can be difficult to cut the problem down to a few megs of input data that you can reason about. While the framework I am using lets me run code on a single process so I can debug it, it is again not always possible to bring the input data down small enough that it is manageable on a single process. On top of that, the behavior you are trying to debug may not show up on such a simple case. It's pretty impossible to use a traditional debugger if the bug only shows up in a few megs of data when you are dealing with a terabyte, how can you step through a 6 hour run if it only shows up on a small portion of that data? How do you predict what machine that bad data will run on if you don't even know what the bad data is?




I have so far been using three methods to debug a distributed app.


  • If I have some idea what the bad data looks like, I'll look for some approximation of it in the input data and throw an exception with the debugging info I'm interested in. Exceptions make their way back to the main console of where the jobs are run from so I can see them easy enough.

  • Use standard IO. The annoying part is the framework I use only logs this out to local files on the machine it ran on, so I have to wrap the grep in an ssh call to each machine. It's also sometimes difficult to know what to even print out. Too much printing can also significantly reduce performance and even fill up the disk with logging data. Logging data as big as the input data is not any easy to comb through.

  • Just reading through the code by hand and working it out with pencil and paper. This has only really worked if I can get a good mental image of what the input is like. If I have a terabyte of input from untrusted sources it can be quite freeform and difficult to reason about as I walk through the code by hand.





Right now I am experiencing a bug that I have been unsuccessful in tracking down. The problem is it only shows up when I use it on huge amounts of data. I have two ways of doing the same thing. One of them takes way too long and the other takes a very short time, and the short way is not producing the same numbers as the long way. It is off by only about 500 million in 5 billion rows of input and it is being a very frustrating bug to track down. The main problem is that I cannot produce a smaller set of input files to cause the issue. Runs take over 6 hours with the input file I need, so a single test can take almost my entire work day. If anyone has any suggestions, please let me know.

Sunday, October 18, 2009

Sneaking Complexity - Being Asynchronous


I went out to lunch with some coworkers last week and we got to talking about the project they are working on. They had done a great job of implementing the app which pretty much makes up the backbone of our company. They rewrote it in Java and have gotten orders of magnitude better performance. The app needs access to various pieces of persistent data quickly and they have created several data-clusters that perform amazingly. And the code is very clean and easy to read. The problem they are running into though is the client code for the data-clusters.




One of the main reasons that Java was selected as a language to write in is the simplicity of reading and writing it. It is fairly easy to jump into a method and, with a good IDE, figure out what it is doing. It is simple. The problems they are running into, though, is when doing things Teh Java Way isn't working out for the volume and latency they are trying to reach. Specifically, they are hitting issues when they query the data-cluster. Generally when they decide they need to hit the data-cluster they do a query and if they don't get a response back in some timeout period they go on as if they have a new record. In this case, being correct about the data all the time is not very important. But they are getting over the number of timed out queries they feel comfortable with (there is also a concern that the timeouts on the library they are using are not working properly). What they would like to do is handle hitting the data-cluster asynchronously. When they think they are going to need data, they send their query in, go do some more processing, then check to see if the data is there and continue on as if a new record if the timeout is reached. The problem is, this really sucks so far in Java. Being asynchronous is hard if you haven't designed your application around it in the first place. You pretty much have to fire up a new thread for every one of these queries you want to do. When considering a single case, this doesn't sound too bad, but the JVM uses OS threads so if you are already pushing your app to the limit, in the worst case doubling or tripling the number of threads it needs is not going to help. You also have increased the complexity of your application. Most of the threading in this application doesn't really need to share any data, each request is off doing its own thing and rarely hitting shared objects. But in this case, sharing the result of the query will need to share some data. It may not be much but it is added complexity. On top of that, there might be a performance hit in terms of GC. I'm not an expert in the JVM GC but shared memory means you might have to walk the entire heap in order to clean up the objects created by the query thread.




This brings me to something that is so great about Erlang. Accomplishing this is trivial. Since messages in Erlang are asynchronous, you simply send your message to query the data-cluster at the beginning of your block, then do whatever work you'd like, and receive it when you are good and ready. A receive block can take a timeout so you don't have any issues there. Doing things asynchronously is how Erlang works pretty much from the get-go. Erlang processes also have their own heap so cleaning up after one is fairly light on the GC, just throw away that process's heap.




To be fair, I am certain that had they implemented this in Erlang they would have run into their own fair share of issues. No language is perfect and Erlang certainly is no exception. But the application in question is basically what Erlang was designed for and is good at. There are also other issues that we talked that they would benefit from had they used Erlang too that I did not talk about. But this is a pattern that seems common in my short career as a developer. People look at the project they are solving, then look at whatever tools solve all the easy problems fast but in the end leave them in a worse state when it comes to solving the harder problems. Those tools that get you 70% of the way there have forced you to design your app so that it even harder to solve that 30%. This happened at my previous employer. A framework was chosen that implemented all the simple things they wanted to solve but they had now inherited this framework whose design was not made to solve the more complex problems. In the end they had to rewrite a bunch of the framework to get what they wanted. I'm sure that my coworkers are going to be able to solve this problem in Java (they have no choice) and perhaps there is a Java-way that this should have been done and I am sure that had they implemented this in Erlang there would still be problems being discussed over lunch. But I feel confident that they would be frustration Erlang records, or syntax, or strings, not with the sneaking complexity of trying to get it to do things asynchronously (and don't even get me started on fault-tolerance).

Lorenz Attractor In R


I spent much of this weekend trying to figure out how to graph Chua's Circuit for a homework assignment. I ended up using R since I don't have or know MATLAB and I don't really want to learn Octave. It took me longer than it should have mostly because I don't know much of anything about differential equations. The implementation required two extra R packages, deSolve and scatterplot3d. The fun thing about working on this project is the whole point is to graph it so you get to see some neat visual output.



After finishing the circuit I decided to do a little extra and graph the Lorenz Attractor, which comes up quite often in class. This is a pretty famous structure in the realm of chaos theory. The term 'butterfly effect' comes from look of the attractor.



My solution uses the deSolve package in R. The function ode is used to solve the three equations that make up the attractor. With some minor understanding of R it should be pretty easy to play with. The settings I have in there give this pretty image:







library(deSolve)
library(scatterplot3d)

##
# Parameters for the solver
pars <- c(alpha = 10,
beta = 8/3,
c = 25.58)

##
# In initial state
yini <- c(x = 0.01, y = 0.0, z = 0.0)


lorenz <- function(Time, State, Pars) {
with(as.list(c(State, Pars)), {
xdot <- alpha * (y - x)
ydot <- x * (c - z) - y
zdot <- x*y - beta*z
return(list(c(xdot, ydot, zdot)))
})
}

runIt <- function(times) {
out <- as.data.frame(ode(func = lorenz, y = yini, parms = pars, times = times))

scatterplot3d(x=out[,2],
y=out[,3],
z=out[,4],
type="l",
box=FALSE,
highlight.3d=TRUE,
xlab="x",
ylab="y",
zlab="z",
main="3d Plot of Lorenz Oscillator")

}

##
# Run a default value
runAll <- function() {
runIt(seq(0, 100, by=0.01))
}