Saturday, October 29, 2011

What is Covariance and Contravariance Anyways?

If you have been following the talk around Google's latest language, Dart, then you might have heard things like "covariant arrays are widely regarded as a mistake" (src). But what is covariance? Why are covariant arrays considered a mistake? Who has covariant arrays? Why?

What is Covariance and Contravariance All About?

class X:
    pass

class Y(X):
    def zoom(self):
        print 'Y.zoom'

class A:
    # Returns a value of type Y
    def foo(self):
        return Y()

class B(A):
    # Returns a value of type X
    def foo(self):
        return X()


# Takes a value of type A
def bar(a):
    y = a.foo()
    y.zoom()


# Call bar with a B
bar(B())
Is this code safe? Often, in an OO language, inheritance is used to create a type that looks just like another type, and pass values of this new type to functions that expect the other type and without breaking anything. So one might expect, looking at this code, that passing a B to bar would be safe. But it isn't. The problem is B.foo returns an X where A.foo returns a Y. bar then goes on to try to call a member function that only exists in Y, so the code will fail at runtime. B has violated the interface that A established. Python doesn't give us any tools to discover this error without executing the code, but statically typed languages like C++ and Java do. These languages can define the types that member functions of a subclass can take, and return, in order to be valid through covariant and contravariant properties. The concepts are larger than member functions but it's a good place to start.

Definitions

Before we can talk about covariance and contravariance we need to know what subtyping is. I'll restrict this to classes in an object-orientated language because that's what my knowledge is limited to, but these concepts go beyond that. An example of a subtype in an OO language would be a class that extends or inherits from another class, for example in C++ you would do class S : public T for S to extend T. The common way type theorists write this is S <: T. This can be read in a few ways: "any term of type S can safely be used in a context where a term of type T is expected" or "every value described by S is also described by T" (Pierce, 182). Thus, S is a more specific type than T, and conversely T is a more generic type than S. Given this, covariance and contravariance define when a more specific or more generic type is acceptable in a particular context.

The definition from the wiki article:

covariance and contravariance refers to the ordering of types from narrower to wider and their interchangeability or equivalence in certain situations
And their definitions:
  • Covariant - Given the types S and T specified above, covariance is when a more specific type, S, can be used when a more generic type, T, is specified. This applies to functions, a function that returns S can be used in the same context as a function that returns T.
  • Contravariant - When the more generic type, T, can be used where the more specific type, S, is specified. A function that takes a T can be used in the same context as a function that takes a S.
  • Invariant - The type specified is the only type that can be used.
My method for remembering the distinction is contravariant is larger than covariant, so it means narrower to wider.

Covariant Example

Consider this example from C++:

class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual Y *foo() { return new Y(); }
};

class B : public A {
public:
  virtual Z *foo() { return new Z(); }
};
Here we have three classes X, Y, and Z which we will return from a virtual function in classes A and B. This code is valid because B::foo is returning a narrower type than A::foo because Z is a subtype of Y. But what happens if we make B::foo return a wider type?
class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual Y *foo() { return new Y(); }
};

class B : public A {
public:
  virtual X *foo() { return new X(); }
};
$ g++ -W -Wall -ansi -pedantic -c foo.cc foo.cc:12: error: invalid covariant return type for ‘virtual X* B::foo()' foo.cc:7: error: overriding ‘virtual Y* A::foo()’

Contravariant Example

As we saw above, one can be covariant on return types, but what about on parameters to a member function? One has to be the opposite actually: contravariant. Unfortunately, gcc (or the version I'm using at least) doesn't give us an error about contravariance, instead we have to add a little bit of code using B::foo to see the problem:

class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual void foo(Y &y) { }
};

class B : public A {
public:
  virtual void foo(X &x) { }
};


int main() {
  B b;
  Y y;
  Z z;
  
  b.foo(y);
  b.foo(z);
  return 0;
}
This behaves as we would expect. We can call B::foo with an X, Y, or Z, since B::foo takes the superclass to all of them (X). But what if we modify it to take Z:
class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual void foo(Y &y) { }
};

class B : public A {
public:
  virtual void foo(Z &z) { }
};


int main() {
  B b;
  Y y;
  Z z;
  
  b.foo(y);
  b.foo(z);
  return 0;
}
$ g++ -W -Wall -ansi -pedantic foo.cc foo.cc: In function ‘int main()’: foo.cc:24: error: no matching function for call to ‘B::foo(Y&)’ foo.cc:14: note: candidates are: virtual void B::foo(Z&)
The error we get is about B::foo(Y&) not existing. Because Z is a narrower type than Y, we have broken the interface of A, so B is no longer a valid substitute for A.

But You Already Knew This

Intuitively, this is the way that makes the most sense. The use case for inheritance is generally when a function takes a certain type as input but you want to pass your own type. In order to ensure that the function, bar, can successfully use the type you pass it, all of the methods that bar calls on your type need to take types that bar can pass to it and return types that bar knows how to use. Take this code:

class X {};
class Y : public X { /* ... */ };
class Z : public Y { /* ... */ };

class A {
public:
  virtual Y *foo(Y &y) { /* ... */ }
};

void bar(A &a) {
  Y y;
  Y *y_ptr = a.foo(y);
}
If you were to pass your own type, B, which is a subtype of A (that is, B <: A) the only way we can implement B::foo such that our function bar still works is if B::foo takes an object as input that is a supertype to Y and returns a type that is a subtype of Y. Consider if it were valid for B::foo to want a Z. B::foo could try to call a member function that exists only in Z. But since bar is passing an object of type X, which we can't guarantee has member function that exists only in Z, our code would be unsafe. A similar argument can be made with the return type of Y and X.

What About Arrays?

Now that we get what these terms mean, how do they apply when it comes to arrays of objects? In C++, arrays are not covariant, however in Java they are:

public class Test {
    public static void foo() {
        Test[] tests = new Test[10];
        Object[] objects = tests;
    }
}
This code looks innocent enough, but there is a problem: what if I modify objects? For example:
public class Test {
    public static void foo() {
        Test[] tests = new Test[10];
        Object[] objects = tests;
        objects[0] = new Integer(1);
    }
}
This code compiles fine, but executing it gives the following error (changed foo to main in order to run it):
Exception in thread "main" java.lang.ArrayStoreException: java.lang.Integer at Test.main(Test.java:5)
This is what people mean when they refer to covariant arrays being a mistake. A compile-time check has been moved to runtime, which means you don't know if your code is safe without running it. Note, though, that what makes covariant arrays a problem in Java is modifying the array, not reading it. If we could somehow ensure that the objects array was immutable, covariant arrays in Java would be safe. Why does Java have covariant arrays then? Originally Java did not have parametric polymorphism (generics), without covariant arrays it would be impossible to write a generic function that copies one array to another. Instead one would have to write a copy function for every type of array they wish to copy (Pierce, 188). While the behavior is not ideal, at the least the JVM can provide some level of safety by performing type checks at runtime.

Unfortunately, I cannot shed any light on why Dart has covariant arrays. It has done everything from befuddle to enrage type theorists (null pointers, in 2011, why?!). Dart appears to support parametric polymorphism, the lack of which lead to Java having covariant arrays. It is a shame that Dart seems to be repeating, what some, would classify as mistakes for no obvious benefit. Dart is not the only one sinning though, C# also has covariant arrays and seems to have simply copied them from Java.

Conclusion

I'm no type theorist, this post went through several iterations with comments and corrections from friends smarter than I, so this post only scratches the surface of covariance and contravariance. Hopefully it is enough to understand what people are referring to when the terms come up in casual Reddit conversation. If you're interested in more, Benjamin Pierce's book, Types and Programming Languages, has an entire chapter dedicated to subtyping. While I haven't finished the book yet, what I have read is excellent. Even if you skip the math that looks complicated you will come out better.

Thanks

Special thanks to @dklee. Much of this post is based off of emails we have exchanged. Thanks to @j2labs, and @apgwoz, as well, for help in making this post.

Sunday, October 2, 2011

Your Favorite Language is Probably Terrible at Concurrency too

The internet has been ablaze with posts on NodeJS, to some people's joy and to others chagrin. Some have claimed that Node solves a long standing problem in concurrency, saying:

People are starting to build more on Node.js because it’s superior and it solves these problems that have always existed. I/O has been done wrong for the last 30 years

In my opinion, Node is bad at concurrency, and guess what? Your language probably isn't any better. But let's make sure we're on the same page first.

  • Language/Framework - Most languages do not have concurrency as a first class citizen. So when I say "your language is bad at concurrency", what I really mean is "the options available for doing concurrent things in your language are bad". The former just rolls off your tongue better.
  • Concurrency - What do I mean by concurrency? I mean a model by which you can define actions that can happen at the same time. That could mean running multiple pieces of code in parallel or interleaving them. Specifically in this post I am concerned with solving problems where the number of things you want to do concurrently is significantly larger than the number of cores you have.

There are a lot of options for concurrency out there. You may have heard of things like Pi calculus, Join calculus, Communicating Sequential Processes, Event-loops and Coroutines. Your language probably has an implementation of one of these, or a conceptual subset. NodeJS and Twisted implement an event-loop. Coroutines is the path Python's Gevent has taken, as well as libraries for Ruby, C, and C++. Go has chosen Communicating Sequential Processes. But all these distinctions aren't important unless I can say what I consider a good solution to concurrency.

Ideally, a good solution should have the following properties:
  • Scaling - If you are writing concurrent software you've already decided handling one thing at a time is not a scalable solution, so now you want to handle multiple things at a time. An ideal solution should scale to the limits of the machine. That means making use of multiple cores, if available.
  • Reasoning - It should be easy for a reader of your code to reason about what it does. Edge cases and gotcha's should be limited. Preferably one shouldn't even be aware of the concurrent aspects of the code unless they need to be.
  • Debugging - Debugging should not be painful. Standard tools like stacktraces should be meaningful. Tracing the path a piece of code takes shouldn't be harder than launching the space shuttle.

My claim is that very few concurrent solutions meet these criteria. But let me be clear, I'm not saying this is the only way you should judge selecting a solution. There is a Python library that does basically everything you think you need and it will be really hard to re-implement that functionality in another language? Well, maybe dealing with Python's concurrency shortcomings is less work than rewriting the library.

Scaling

Most languages were built for writing serial code. Memory is accessible by any piece of code in the process and it is assumed that nothing interesting happens between two function calls. But modern computers are not fast enough to do all the work programmers want them to do in serial and these languages have a lot of momentum behind them. For valid reasons, it is challenging to just move to another solution. Instead, we duct tape concurrency on top of these serial languages. One problem is that some of these languages can't even run code in parallel (that is, have two functions running at the same exact time) even if they wanted to. Python and Ocaml have a global lock that restricts this. In other languages it's just too much coordination to do safely. In C and C++ it can be too hard and time consuming to coordinate distributing concurrent work over multiple threads. For this reason, many mainstream solutions to concurrency are limited to running on a single core. It's insane, right? I can buy a laptop with, ostensibly, 8 cores now, yet a program written in most mainstream languages cannot make use of more than one.

For this reason, most solutions fail to be scalable. For example, NodeJS, Twisted, Ocaml/Lwt, and Gevent: from the point-of-view of a user of these frameworks, their code not only cannot run on multiple cores, but it depends on it. Consider some Twisted code that downloads N web pages and appends the result to a list:

def downloadUrls(urls):
    d = defer.Deferred()
    ret = []
    def _returnWhenDone(_):
        if len(ret) == len(urls):
            d.callback(ret)
    for url in urls:
        downloadDefer = downloadUrlAsString(url)
        downloadDefer.addCallback(lambda s : ret.append(s))
        downloadDefer.addCallback(_returnWhenDone)
    return d

Ignoring my failure to handle failures, this code is acceptable Twisted, and it could not work if Python suddenly got the ability to run code on multiple cores and Twisted used it. The reason being, there is no coordination around the ret.append(s) line. What if two threads were to try to append to ret at the same time? NodeJS and Gevent have the same idea in mind. Almost no data access is surrounded by a mechanism to coordinate multiple pieces of code accessing it at the same time. The result is, none of the code using these frameworks can be run on multiple cores. If CPython or V8 got multicore support it would take a rewrite of all of the code to make use of it.

But, you say, who cares? "I can just spin up N instances of my program, where N is the number of cores on my machine. I can easily scale that way". You can't even get concurrency right and now you want to move into distributed programming? Who are you fooling? But seriously, the problem is your code now needs to be "location aware". If you want to do something with object X, you have to be aware of where object X lives. This adds another layer of complexity to your system. Without a good way of communicating between instances you are limited to solving embarrassingly parallel problems or pushing the concurrency to another software layer. Either way, you aren't actually solving the problem with your framework. Luckily, a lot of what people want concurrency for is serving webpages, which requires almost no interprocess communication right now.

Reasoning

No matter how you slice it, writing concurrent code is hard. When it comes to serial code, looking at it and knowing what it does is as simple as understanding how each function operates given the current state of the program. But with concurrent code, the state of the program is changing while a function runs. Understanding a concurrent program involves understanding how the concurrent components are interacting with each other. Some solutions make this easier than others.

Take the following piece of example NodeJS code:

var db = require('somedatabaseprovider');
app.get('/price', function(req, res) {
  db.openConnection('host', 12345, function(err, conn) {
    conn.query('select * from products where id=?', [req.param('product')], function(err, results) {
      conn.close();
      res.send(results[0]);
    });
  });
});

The amount of syntax is enormous. There is a huge amount of line noise for what should look, at worst, like this:

var db = require('somedatabaseprovider');
app.get('/price', function(req, res) {
    var conn = db.openConnection('host', 12345)
    var result = conn.query('select * from products where id=?', [req.param('product')])
    conn.close();
    res.send(results[0]);
});

If you want to add proper error handling, the situation gets worse with callback code. Twisted has attempted to solve this by encapsulating code flow in an object called a Deferred, but the problem remains: a unit of work in callback-based code is not a function, like one is used to in serial code, it is work to do between events. Like the above example code showed, there isn't a function that connects to a db, does a query, and returns the result. There is a function to open a db connection, another function for when that is done and to do the db query, and another function to handle the result. You have defined three functions where you previously needed one. More importantly, you have to define functions not because it makes your code clearer but because the framework requires it.

Given how negatively this affects code, there are a lot of attempted solutions. Twisted, for example, allows one to use the defer.inlineCallbacks decorator so a function can use generators to express asynchronous code. Our previous NodeJS code might look like this:

@defer.inlineCallbacks
def handlePrice(req, res):
    conn = yield db.openConnection('host', 12345)
    result = yield conn.query('select * from products where id=?', [req.param('product')])
    yield conn.close()
    res.send(results[0])

app.get('/price', handlePrice)

In many ways this is an improvement but it does have its limitations.

The NodeJS community has been at work solving this problem for themselves too. One person added coroutines to V8, and gave it a C#-like syntax. OKCupid gave us TameJS. Both of these solutions have their problems which are deal breakers for many.

There are also, less complete, solutions like Step. But library solutions, like Step, only give you access to a subset of functionality you would get from the sequential code you really want to write. To do that you need a full CPS transformation (which is what TameJS gives you, at a cost of debugging). This is actually how the syntax extensions for Ocaml/Lwt work. The previous NodeJS code might look like this in Ocaml/Lwt (the relevant part is that lwt causes a CPS transformation to turn the code into the appropriate callback-based code):

let handle_price req res =
  lwt conn = DB.open_connection "host" 12345 in
  lwt result = DB.query conn (SQL.sprintf "select * from products where id=?" (req#param "product")) in
  DB.close conn;
  res#send results.[0]

App.get "/price" handle_price

This is one reason for Gevent/Eventlet's popularity in Python. Gevent uses coroutines to give you asynchronous code that looks sequential. The trick is, underneath the hood, some function calls actually result in all of the state for your current function call being saved, another one switched to, executed, rinse, repeat. Gevent has a cooperative scheduler that tries to intelligently decide which function to switch to.

Say you want to write the earlier NodeJS code in sequential Python, you might get:

def handlePrice(req, res):
    conn = db.openConnection('host', 12345)
    result = conn.query('select * from products where id=?', [req.param('product')])
    conn.close()
    res.send(results[0])

app.get('/price', handlePrice)

How would this look in Gevent? Exactly the same. The openConnection and query functions have an I/O call which actually jumps back to the Gevent scheduler so it can do something else while the I/O happens.

But Gevent is not without its cost when it comes to reasoning about code. Consider this:

def foo(data):
    print data.bar
    do_something()
    print data.bar

Looking at this code, will the same value be printed twice? The answer is: no idea. Even though do_something does not take data as input, it could do something that causes Gevent to context switch to another function, another function which also has access to data and modifies it. There is no way to tell, simply by looking at the code, if it will context switch or not.

Debugging

The previous Gevent code is printing out two different values for data.bar and you don't want this, how do you fix it? The first thing you might try, from your serial programming days, is a debugger. But that might not work very well. Why? You're in concurrent-land now, multiple things are happening at once! That means timing is important. If you set a break point somewhere, you've disrupted the time things happen and your program could take a completely different path, not the one you want to debug.

If you're smart and you control access to data.bar through function calls, you can do some printf debugging. Perhaps print out a stacktrace when one modifies it. But let's say, even those prints are causing the timing of your program to change, so now data.bar is coming out as the same value at each print. What do you do?!

The point is, debugging concurrent code can be very hard. Event-loop code adds another problem to debugging: your code doesn't have a linear path. If you could visualize sequential code, it would be a line. You start at point A, you do the things in order to get to point B, at any point if you have an error your callstack represents the path you took to get there. Event-loop code always needs to hit the event-loop for a blocking call though. The callstack you see is always limited to the path from the last event you got. A callstack in the code handling a database query may not contain the how you got there. If that query is part of a piece of fairly generic code you don't have many leads to go on to track it down.

Who got it right then?

Three languages come to mind: Erlang, Oz, Haskell. There are more out there but I'm not omnipotent. In my opinion, these languages are capable of the three properties I previously mentioned. Right now you are probably rolling your eyes and saying "I should have known, one of THOSE guys". But my argument is conservative: based on the properties that I believe are important for concurrent solution to be good, these languages excel (or are capable of it) at them. Real world problems contain more than just concurrency issues though, so this does not mean you're wrong to use a language that doesn't meet my criteria, but it does mean you are sacrificing something. Perhaps that sacrifice is acceptable. But don't fool yourself into thinking your language is not terrible at concurrency, because it probably is.

Sunday, July 31, 2011

C Gotcha Of The Day: Pointers aren't integers

The C standard is clear that pointers are not required to be convertible to or from an integer.

Section 6.3.2.3.5-6 in the C99 draft

An integer may be converted to any pointer type. The result is implementation-defined, might not be properly aligned, and might not point to an entity of the referenced type.)

Any pointer type may be converted to an integer type; the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.

Basically, you can't depend on the following code doing anything useful:

int *p = (int*)0xff;

The C standard does not define a machine with a flat memory model. Old Intel systems are an example of a non flat memory model, they had a segmented memory model where a pointer needed a segment and offset.

Conversion to a string is also implementation defined in C, from 7.19.6.1 fprintf, the section on the %p format specifier:

The argument shall be a pointer to void. The value of the pointer is converted to a sequence of printable characters, in an implementation-defined manner.

And from fscanf:

Matches an implementation-defined set of sequences, which should be the same as the set of sequences that may be produced by the %p conversion of the fprintf function. The corresponding argument shall be a pointer to a pointer to void. The interpretation of the input item is implementation-defined. If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to that value; otherwise the behavior of the %p conversion is undefined.

It does appear that even though the result of %p is implementation defined, it is guaranteed that you can fprintf and fscanf pointers and get back the same result inside the same program execution.

Friday, July 29, 2011

C Gotcha Of The Day: ptrdiff_t

Excerpt from C99 Draft (AFAIK this has not changed):

The size of the result is implementation-defined, and its type (a signed integer type) is ptrdiff_t defined in the <stddef.h> header. If the result is not representable in an object of that type, the behavior is undefined. In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P) - (Q) has the value i−j provided the value fits in an object of type ptrdiff_t.

That means, while the type size_t is capable of expressing the size of any object, you cannot guarantee that the subtraction of two pointers inside your object will result in defined behavior. That is because ptrdiff_t is signed (so it can give you the direction of the difference) and size_t is unsigned. You can use the macros PTRDIFF_MAX and SIZE_MAX to determine if your subtraction is safe though.

Tuesday, May 3, 2011

Sometimes I forget the full degree in which Python's "scoping" is broken.

>>> [s for s in [1, 2,3]]
[1, 2, 3]
>>> s
3

Wednesday, April 27, 2011

[JC] Efficient Parallel Programming in Poly/ML and Isabelle/ML

Authors: David C.J. Matthews and Makarius Wenzel
URL: http://www4.in.tum.de/~wenzelm/papers/parallel-ml.pdf
Year: 2010

I had not heard of Poly/ML or Isabelle/ML prior to reading this paper. I tried to do a bit of background research to get an idea for it, but I may have gotten some details wrong. If I made any mistakes below I will be happy to correct them.

The Paper


Poly/ML is a full implementation of Standard ML originally developed by David Matthews. Larry Paulson was an early adopter and implemented the Isabelle theorem prover. Isabelle/ML is the implementation of Isabelle on Poly/ML. Poly/ML was originally implemented with a single threaded run time system (RTS). With the ubiquity of multicore machines the RTS was modified to support threading, the garbage collector was modified and threading APIs were introduced. Finally, the modifications were tested in Isabelle. This paper is relevant to the frequent debates that occur on the Ocaml mailing list. Ocaml currently has no support for parallelism and this deficiency is frequently brought up as a negative.

The primary motivation for adding parallelism to Poly/ML is Isabelle/ML, where proofs can take hours to days to execute. Poly/ML and Isabelle/ML have been developed for many years which restricted the modifications to not break existing code and not grossly negatively effect single threaded applications. The authors decided to start with a fairly low level, pthreads-like, implementation and build layers on top of it, each more abstract than the previous. The majority of the changes took place in the RTS, Poly/ML base library, and Isabelle/ML library. In the end, no user's Isabelle code need be modified to take advantage of the performance benefits.

As stated before, Poly/ML is an implementation of Standard ML. The original definition of SML included an asynchronous exception called Interrupt. This was removed from the 1997 definition however Poly/ML has kept in for interrupting threads. This exception can be triggered by one thread in another thread to stop a computation. Interrupt is a useful mechanism to get a threads attention.

The RTS is written in C++, provides memory management, access to the underlying OS and required the most modifications. The RTS provides a one-to-one relationship between OS threads and ML threads. The OS is in charge of scheduling threads on cores. Synchronization primitives are not implement as a direct call to the underlying OS's, however. Each thread is given a single condition variable which the RTS can signal for the unlocking of a mutex, or signaling of a condition variable, in the ML code. A single threaded stop-the-world GC is used. While it is known that this is a likely bottleneck as the number of cores increase, it is shown that it not an issue up to 8 cores.

The base libraries were hardly altered, beyond adding the necessary threading APIs. The low-level APIs provided are:

Threads:
fork: (unit -> unit) * attribute list -> thread
interrupt: thread -> unit
setAttributes: attribute list -> unit
getAttributes: unit -> attribute list
Mutexes:
mutex: unit -> mutex
lock: mutex -> unit
unlock: mutex -> unit
Condition Variables:
condVar: unit -> condVar
wait: condVar * mutex -> unit
signal: condVar -> unit
boradcast: condVar -> unit

The only one that might need explaining is interrupt. This raises the asynchronous exception Interrupt in the provided thread. A few other locations of the standard library were modified, such as the I/O module which had locks added to each operation. The authors do point out:
... this overhead can be almost entirely avoided by using the functional IO layer of the ML library in which reading a character returns the character and a new stream since in this case a lock is only needed when the buffer is empty and needs to be refilled.

Abstractions over these were added to the Isabelle/ML library. The first is a synchronized variable with the following definition:
type 'a var
val var: 'a -> 'a var
val value: 'a var -> 'a
val guarded_access: 'a var -> ('a -> ('b * 'a) option) -> 'b

guarded_access takes a synchronized variable, and a function that takes the value of the synchronized variable as input and returns an option containing a tuple of the new value to put in the synchronized variable and the value to return. The combinator combines the idea of a mutex and condition variable. In the code guarded_access v f, f will be applied to the value stored in v. If f returns NONE then guarded_access waits for the next signal on that variable and applies f again. If f returns Some (a, b), then the value b will be put back into the synchronized variable and the value a will be returned.

This simple construct can be used to implement variables that can be safely shared, and updated, between threads. mvar's, variables where a thread waits for a value if none is there when getting and waits for the value to be removed when putting, can be easily implemented with this:
type 'a mvar = 'a option var
fun mvar () = var NONE
fun take v = guarded_access v 
               (fn NONE => NONE | Some x => Some (x, NONE))
fun put v x = guarded_access v 
               (fn SOME _ => NONE | NONE => (SOME ((), SOME x)))

The other abstraction presented is futures, which are a way to represent the result of a computation before it is completed. The futures interfaces looks a lot like a typical lazy evaluation interface, except the value will always be computed unless it is canceled. Futures are an attractive way to represent parallel computations. Consider this contrived example:
val x = future some_expensive_computation
val y = future some_other_expensive_computation
val z = join x + join y

We create two futures, x and y which evaluate to integers. Then we create z which will be the sum of x and y. Assuming you have enough cores, x and y will be computed in parallel. The join function waits for the future and evaluates to the value of the future. Here is the interface for futures in Isabelle/ML:
type 'a future
val future: (unit -> 'a) -> 'a future
val join: 'a future -> 'a
val cancel: 'a future -> unit

If a future is cancelled or produces an exception, that exception will be reraised when join is called. Isabelle/ML also contains a few future combinators, such as future groups which create a hierarchy of execution that kills all futures in the hierarchy if one fails.

The implied implementation of futures is a thread pool where the future function queues work for the thread pool. A scheduler will then decide which futures get work or are cancelled.

That is the end of additions to Poly/ML and Isabelle/ML. How well does it work? While subjective, performance improvements were more than acceptable with tests plateauing at around 8 cores. The tests were performed using large Isabelle/Isar proofs available in the standard distribution. Isabelle/Isar is non computational language and has a modular nature that makes it possible to exploit implicit parallelism. Four Isabelle applications were used for testing, all of which are considered reasonably large. The results showed a maximum of 3.2x speedup for 4 cores and 6.5x on 8 cores. Past 8 cores the performance increase plateaus. Two bottlenecks were investigated further. The first is garbage collection. As stated earlier Poly/ML has a single threaded stop-the-world GC. At 16 cores GC becomes about 15-30% of the run time. The second bottleneck is insufficient parallelization. Of the applications chosen, there was an insufficient amount of work to keep all of the cores busy.

While there are many implementations for functional languages out there, Poly/ML remains one of the few that supports true multicore threads in a stable state and used in realistic applications. The modifications to Poly/ML took about 1 person year of work. Future work includes parallelizing the garbage collector.

The Discussion

Every few months the question of if Ocaml will get multicore support comes up. The answer is generally "no" (although it's coming closer to "yes" thanks to OcamlPro) with the reason being it is too hard to provide a proper implementation, garbage collection being given as the primary reason. I think that the results of Poly/ML show that even a single threaded stop-the-world GC can provide a sufficient performance advantage while also providing more powerful abstractions for parallelism. A parallel garbage collector can be worked in later, if needed, without affecting existing code.

I liked that the authors chose to add low-level threading support and build layers on top. It gives developers options if the provided abstractions are not meeting their needs. Futures appear to be a very appealing abstraction but I worry somewhat that they are insufficient. In order to maximize parallelism, should every function consume and produce futures? That seems a bit overkill and the overhead would likely be enormous. But it is easy to think of situations where either decision hurts you, either because of the overhead or because you aren't utilizing all of the cores effectively. I do like futures as an option though and hope to see them if/when Ocaml gets multicore support.

This is my first exposure to Poly/ML and the authors have done excellent work. I hope it provides motivation for bringing Ocaml into the multicore world.

Thursday, April 14, 2011

A little gotcha in overloading comparison methods in Python

Python supports chaining relational operators so you can express 2 < 3 < 4 and get true. It looks like this is implemented as a little compiler trick that actually creates the expression "2 < 3 and 3 < 4". This can be confusing if you try to do something insane like implement Haskell's (>>=) in Python using (>=). Something like:
Just(10) >= (lambda x : Just(x + 1)) >= (lambda x : Just(x / 2))
Will actually become:
Just(10) >= (lambda x : Just(x + 1)) and \
    (lambda x : Just(x + 1)) >= (lambda x : Just(x / 2))
This only seems to only apply to the relational operators, choosing to implement (>>=) with (>>) in Python seems to work fine. Here is a Gist showing the problem from @apgwoz: https://gist.github.com/916132.