I had not heard of Poly/ML or Isabelle/ML prior to reading this paper. I tried to do a bit of background research to get a feel for them, but I may have gotten some details wrong. If I made any mistakes below I will be happy to correct them.
Poly/ML is a full implementation of Standard ML originally developed by David Matthews. Larry Paulson was an early adopter and used it to implement the Isabelle theorem prover; Isabelle/ML is the implementation of Isabelle on Poly/ML. Poly/ML was originally implemented with a single-threaded run time system (RTS). With the ubiquity of multicore machines, the RTS was modified to support threading, the garbage collector was adapted, and threading APIs were introduced. Finally, the modifications were tested in Isabelle. This paper is relevant to the frequent debates on the Ocaml mailing list: Ocaml currently has no support for parallelism, and this deficiency is frequently brought up as a negative.
The primary motivation for adding parallelism to Poly/ML is Isabelle/ML, where proofs can take hours to days to execute. Poly/ML and Isabelle/ML have been developed for many years, which constrained the modifications: they could not break existing code or grossly degrade single-threaded performance. The authors decided to start with a fairly low-level, pthreads-like implementation and build layers on top of it, each more abstract than the previous. The majority of the changes took place in the RTS, the Poly/ML base library, and the Isabelle/ML library. In the end, no user's Isabelle code need be modified to take advantage of the performance benefits.
As stated before, Poly/ML is an implementation of Standard ML. The original definition of SML included an asynchronous exception called Interrupt. It was removed from the 1997 definition, but Poly/ML has kept it for interrupting threads: the exception can be raised by one thread in another thread to stop a computation. Interrupt is a useful mechanism to get a thread's attention.
The RTS is written in C++, provides memory management and access to the underlying OS, and required the most modifications. The RTS provides a one-to-one relationship between OS threads and ML threads; the OS is in charge of scheduling threads on cores. Synchronization primitives are not implemented as direct calls to the underlying OS's, however. Each thread is given a single OS condition variable, which the RTS signals when a mutex is unlocked, or a condition variable signaled, in the ML code. A single-threaded stop-the-world GC is used. While this is a likely bottleneck as the number of cores increases, the paper shows it is not an issue up to 8 cores.
The base libraries were hardly altered, beyond adding the necessary threading APIs. The low-level APIs provided are:
```sml
fork: (unit -> unit) * attribute list -> thread
interrupt: thread -> unit
setAttributes: attribute list -> unit
getAttributes: unit -> attribute list

mutex: unit -> mutex
lock: mutex -> unit
unlock: mutex -> unit

condVar: unit -> condVar
wait: condVar * mutex -> unit
signal: condVar -> unit
broadcast: condVar -> unit
```
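As a quick sketch of how these low-level primitives fit together, here is a one-shot latch where one thread signals and another waits. The structure paths (`Thread.Thread`, `Thread.Mutex`, `Thread.ConditionVar`) are from the Poly/ML Thread library as I understand it, so treat the exact names as assumptions:

```sml
(* A one-shot latch: the main thread blocks until the forked
   thread sets the flag and signals the condition variable. *)
open Thread

val m = Mutex.mutex ()
val c = ConditionVar.condVar ()
val ready = ref false

val _ = Thread.fork (fn () =>
    (Mutex.lock m;
     ready := true;
     ConditionVar.signal c;
     Mutex.unlock m), [])

(* Wait in a loop to guard against spurious wakeups. *)
val () =
  (Mutex.lock m;
   while not (!ready) do ConditionVar.wait (c, m);
   Mutex.unlock m)
```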
The only one that might need explaining is interrupt, which raises the asynchronous exception Interrupt in the provided thread. A few other locations in the standard library were modified, such as the I/O module, which had locks added to each operation. The authors do point out:
... this overhead can be almost entirely avoided by using the functional IO layer of the ML library in which reading a character returns the character and a new stream since in this case a lock is only needed when the buffer is empty and needs to be refilled.
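Going back to interrupt: a minimal sketch of stopping a runaway thread might look like this. I am assuming Poly/ML's Thread structure, that the top-level Interrupt exception is available, and that the thread runs with asynchronous interrupts enabled:

```sml
(* Fork a worker that spins forever, then interrupt it; the
   asynchronously delivered Interrupt exception unwinds the
   loop into the handler. *)
open Thread

val worker =
  Thread.fork (fn () =>
      (let fun loop () = loop () in loop () end)
        handle Interrupt => print "worker interrupted\n",
    [Thread.InterruptState Thread.InterruptAsynch])

val () = Thread.interrupt worker
```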
Abstractions over these were added to the Isabelle/ML library. The first is a synchronized variable with the following definition:
```sml
type 'a var
val var: 'a -> 'a var
val value: 'a var -> 'a
val guarded_access: 'a var -> ('a -> ('b * 'a) option) -> 'b
```
guarded_access takes a synchronized variable and a function that maps the current value of the variable to an option containing a tuple of the value to return and the new value to store. The combinator combines the idea of a mutex and a condition variable. In the code guarded_access v f, f will be applied to the value stored in v. If f returns NONE, guarded_access waits for the next signal on that variable and applies f again. If f returns SOME (a, b), then the value b will be put back into the synchronized variable and the value a will be returned.
This simple construct can be used to implement variables that can be safely shared, and updated, between threads. An mvar, a variable where a thread taking a value blocks until one is present and a thread putting a value blocks until the slot is empty, can be easily implemented with it:

```sml
type 'a mvar = 'a option var

fun mvar () = var NONE

fun take v =
  guarded_access v (fn NONE => NONE | SOME x => SOME (x, NONE))

fun put v x =
  guarded_access v (fn SOME _ => NONE | NONE => SOME ((), SOME x))
```
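A usage sketch of the mvar above, with a hypothetical box variable and a made-up value, just for illustration:

```sml
(* One thread puts a value; take in the main thread blocks
   until the forked put has run. *)
val box : int mvar = mvar ()

val _ = Thread.Thread.fork (fn () => put box 42, [])

val result = take box  (* 42, once the forked thread has run *)
```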
The other abstraction presented is futures, which are a way to represent the result of a computation before it is completed. The futures interface looks a lot like a typical lazy evaluation interface, except the value will always be computed unless the future is canceled. Futures are an attractive way to represent parallel computations. Consider this contrived example:
```sml
val x = future some_expensive_computation
val y = future some_other_expensive_computation
val z = join x + join y
```
We create two futures, x and y, which evaluate to integers. Then we create z, which will be the sum of x and y. Assuming you have enough cores, x and y will be computed in parallel. The join function waits for the future to finish and evaluates to its value. Here is the interface for futures in Isabelle/ML:
```sml
type 'a future
val future: (unit -> 'a) -> 'a future
val join: 'a future -> 'a
val cancel: 'a future -> unit
```
If a future is cancelled or its computation produces an exception, that exception will be reraised when join is called. Isabelle/ML also contains a few future combinators, such as future groups, which create a hierarchy of execution that kills all futures in the hierarchy if one fails.
The implied implementation of futures is a thread pool, where the future function queues work for the pool. A scheduler then decides which futures get worked on or cancelled.
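To make the semantics concrete, here is a toy future built from the synchronized variables described earlier. Unlike the real implementation, it forks a fresh thread per future rather than using a pool, and it omits cancellation; it is my sketch, not Isabelle/ML's actual code:

```sml
(* A future is a synchronized variable that will eventually hold
   either a value or the exception the computation raised. *)
datatype 'a result = Value of 'a | Error of exn

type 'a future = 'a result option var

fun future f =
  let
    val v = var NONE
    val _ = Thread.Thread.fork (fn () =>
        let val r = Value (f ()) handle e => Error e
        in guarded_access v (fn _ => SOME ((), SOME r)) end, [])
  in v end

(* join blocks until the result is available, leaves it in place so
   repeated joins work, and reraises any stored exception. *)
fun join v =
  case guarded_access v (fn NONE => NONE | SOME r => SOME (r, SOME r)) of
      Value x => x
    | Error e => raise e
```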
That is the end of the additions to Poly/ML and Isabelle/ML. How well does it work? While subjective, the performance improvements were more than acceptable, with tests plateauing at around 8 cores. The tests were performed using large Isabelle/Isar proofs available in the standard distribution. Isabelle/Isar is a non-computational language whose modular nature makes it possible to exploit implicit parallelism. Four Isabelle applications were used for testing, all of which are considered reasonably large. The results showed a maximum speedup of 3.2x on 4 cores and 6.5x on 8 cores; past 8 cores the performance increase plateaus. Two bottlenecks were investigated further. The first is garbage collection: as stated earlier, Poly/ML has a single-threaded stop-the-world GC, and at 16 cores GC accounts for about 15-30% of the run time. The second bottleneck is insufficient parallelization: the applications chosen simply did not have enough work to keep all of the cores busy.
While there are many implementations of functional languages out there, Poly/ML remains one of the few that supports true multicore threading, is stable, and is used in realistic applications. The modifications to Poly/ML took about 1 person-year of work. Future work includes parallelizing the garbage collector.
The Discussion

Every few months the question of whether Ocaml will get multicore support comes up. The answer is generally "no" (although it's coming closer to "yes" thanks to OcamlPro), the usual justification being that a proper implementation is too hard, with garbage collection as the primary difficulty. I think the results of Poly/ML show that even a single-threaded stop-the-world GC can provide a significant performance advantage while also enabling more powerful abstractions for parallelism. A parallel garbage collector can be worked in later, if needed, without affecting existing code.
I liked that the authors chose to add low-level threading support and build layers on top. It gives developers options if the provided abstractions do not meet their needs. Futures appear to be a very appealing abstraction, but I worry somewhat that they are insufficient. In order to maximize parallelism, should every function consume and produce futures? That seems like overkill, and the overhead would likely be enormous. But it is easy to think of situations where either decision hurts you, whether because of the overhead or because you aren't utilizing all of the cores effectively. I do like futures as an option, though, and hope to see them if/when Ocaml gets multicore support.
This is my first exposure to Poly/ML and the authors have done excellent work. I hope it provides motivation for bringing Ocaml into the multicore world.