Wednesday, April 27, 2011

[JC] Efficient Parallel Programming in Poly/ML and Isabelle/ML

Authors: David C.J. Matthews and Makarius Wenzel
Year: 2010

I had not heard of Poly/ML or Isabelle/ML prior to reading this paper. I tried to do a bit of background research to get an idea for it, but I may have gotten some details wrong. If I made any mistakes below I will be happy to correct them.

The Paper

Poly/ML is a full implementation of Standard ML originally developed by David Matthews. Larry Paulson was an early adopter and implemented the Isabelle theorem prover. Isabelle/ML is the implementation of Isabelle on Poly/ML. Poly/ML was originally implemented with a single threaded run time system (RTS). With the ubiquity of multicore machines the RTS was modified to support threading, the garbage collector was modified and threading APIs were introduced. Finally, the modifications were tested in Isabelle. This paper is relevant to the frequent debates that occur on the Ocaml mailing list. Ocaml currently has no support for parallelism and this deficiency is frequently brought up as a negative.

The primary motivation for adding parallelism to Poly/ML is Isabelle/ML, where proofs can take hours to days to execute. Poly/ML and Isabelle/ML have been developed for many years which restricted the modifications to not break existing code and not grossly negatively effect single threaded applications. The authors decided to start with a fairly low level, pthreads-like, implementation and build layers on top of it, each more abstract than the previous. The majority of the changes took place in the RTS, Poly/ML base library, and Isabelle/ML library. In the end, no user's Isabelle code need be modified to take advantage of the performance benefits.

As stated before, Poly/ML is an implementation of Standard ML. The original definition of SML included an asynchronous exception called Interrupt. This was removed from the 1997 definition however Poly/ML has kept in for interrupting threads. This exception can be triggered by one thread in another thread to stop a computation. Interrupt is a useful mechanism to get a threads attention.

The RTS is written in C++, provides memory management, access to the underlying OS and required the most modifications. The RTS provides a one-to-one relationship between OS threads and ML threads. The OS is in charge of scheduling threads on cores. Synchronization primitives are not implement as a direct call to the underlying OS's, however. Each thread is given a single condition variable which the RTS can signal for the unlocking of a mutex, or signaling of a condition variable, in the ML code. A single threaded stop-the-world GC is used. While it is known that this is a likely bottleneck as the number of cores increase, it is shown that it not an issue up to 8 cores.

The base libraries were hardly altered, beyond adding the necessary threading APIs. The low-level APIs provided are:

fork: (unit -> unit) * attribute list -> thread
interrupt: thread -> unit
setAttributes: attribute list -> unit
getAttributes: unit -> attribute list
mutex: unit -> mutex
lock: mutex -> unit
unlock: mutex -> unit
Condition Variables:
condVar: unit -> condVar
wait: condVar * mutex -> unit
signal: condVar -> unit
boradcast: condVar -> unit

The only one that might need explaining is interrupt. This raises the asynchronous exception Interrupt in the provided thread. A few other locations of the standard library were modified, such as the I/O module which had locks added to each operation. The authors do point out:
... this overhead can be almost entirely avoided by using the functional IO layer of the ML library in which reading a character returns the character and a new stream since in this case a lock is only needed when the buffer is empty and needs to be refilled.

Abstractions over these were added to the Isabelle/ML library. The first is a synchronized variable with the following definition:
type 'a var
val var: 'a -> 'a var
val value: 'a var -> 'a
val guarded_access: 'a var -> ('a -> ('b * 'a) option) -> 'b

guarded_access takes a synchronized variable, and a function that takes the value of the synchronized variable as input and returns an option containing a tuple of the new value to put in the synchronized variable and the value to return. The combinator combines the idea of a mutex and condition variable. In the code guarded_access v f, f will be applied to the value stored in v. If f returns NONE then guarded_access waits for the next signal on that variable and applies f again. If f returns Some (a, b), then the value b will be put back into the synchronized variable and the value a will be returned.

This simple construct can be used to implement variables that can be safely shared, and updated, between threads. mvar's, variables where a thread waits for a value if none is there when getting and waits for the value to be removed when putting, can be easily implemented with this:
type 'a mvar = 'a option var
fun mvar () = var NONE
fun take v = guarded_access v 
               (fn NONE => NONE | Some x => Some (x, NONE))
fun put v x = guarded_access v 
               (fn SOME _ => NONE | NONE => (SOME ((), SOME x)))

The other abstraction presented is futures, which are a way to represent the result of a computation before it is completed. The futures interfaces looks a lot like a typical lazy evaluation interface, except the value will always be computed unless it is canceled. Futures are an attractive way to represent parallel computations. Consider this contrived example:
val x = future some_expensive_computation
val y = future some_other_expensive_computation
val z = join x + join y

We create two futures, x and y which evaluate to integers. Then we create z which will be the sum of x and y. Assuming you have enough cores, x and y will be computed in parallel. The join function waits for the future and evaluates to the value of the future. Here is the interface for futures in Isabelle/ML:
type 'a future
val future: (unit -> 'a) -> 'a future
val join: 'a future -> 'a
val cancel: 'a future -> unit

If a future is cancelled or produces an exception, that exception will be reraised when join is called. Isabelle/ML also contains a few future combinators, such as future groups which create a hierarchy of execution that kills all futures in the hierarchy if one fails.

The implied implementation of futures is a thread pool where the future function queues work for the thread pool. A scheduler will then decide which futures get work or are cancelled.

That is the end of additions to Poly/ML and Isabelle/ML. How well does it work? While subjective, performance improvements were more than acceptable with tests plateauing at around 8 cores. The tests were performed using large Isabelle/Isar proofs available in the standard distribution. Isabelle/Isar is non computational language and has a modular nature that makes it possible to exploit implicit parallelism. Four Isabelle applications were used for testing, all of which are considered reasonably large. The results showed a maximum of 3.2x speedup for 4 cores and 6.5x on 8 cores. Past 8 cores the performance increase plateaus. Two bottlenecks were investigated further. The first is garbage collection. As stated earlier Poly/ML has a single threaded stop-the-world GC. At 16 cores GC becomes about 15-30% of the run time. The second bottleneck is insufficient parallelization. Of the applications chosen, there was an insufficient amount of work to keep all of the cores busy.

While there are many implementations for functional languages out there, Poly/ML remains one of the few that supports true multicore threads in a stable state and used in realistic applications. The modifications to Poly/ML took about 1 person year of work. Future work includes parallelizing the garbage collector.

The Discussion

Every few months the question of if Ocaml will get multicore support comes up. The answer is generally "no" (although it's coming closer to "yes" thanks to OcamlPro) with the reason being it is too hard to provide a proper implementation, garbage collection being given as the primary reason. I think that the results of Poly/ML show that even a single threaded stop-the-world GC can provide a sufficient performance advantage while also providing more powerful abstractions for parallelism. A parallel garbage collector can be worked in later, if needed, without affecting existing code.

I liked that the authors chose to add low-level threading support and build layers on top. It gives developers options if the provided abstractions are not meeting their needs. Futures appear to be a very appealing abstraction but I worry somewhat that they are insufficient. In order to maximize parallelism, should every function consume and produce futures? That seems a bit overkill and the overhead would likely be enormous. But it is easy to think of situations where either decision hurts you, either because of the overhead or because you aren't utilizing all of the cores effectively. I do like futures as an option though and hope to see them if/when Ocaml gets multicore support.

This is my first exposure to Poly/ML and the authors have done excellent work. I hope it provides motivation for bringing Ocaml into the multicore world.

No comments:

Post a Comment