Saturday, October 29, 2011

What is Covariance and Contravariance Anyways?

If you have been following the talk around Google's latest language, Dart, then you might have heard things like "covariant arrays are widely regarded as a mistake" (src). But what is covariance? Why are covariant arrays considered a mistake? Who has covariant arrays? Why?

What is Covariance and Contravariance All About?

class X:
    pass

class Y(X):
    def zoom(self):
        print 'Y.zoom'

class A:
    # Returns a value of type Y
    def foo(self):
        return Y()

class B(A):
    # Returns a value of type X
    def foo(self):
        return X()


# Takes a value of type A
def bar(a):
    y = a.foo()
    y.zoom()


# Call bar with a B
bar(B())
Is this code safe? Often, in an OO language, inheritance is used to create a type that looks just like another type, and pass values of this new type to functions that expect the other type and without breaking anything. So one might expect, looking at this code, that passing a B to bar would be safe. But it isn't. The problem is B.foo returns an X where A.foo returns a Y. bar then goes on to try to call a member function that only exists in Y, so the code will fail at runtime. B has violated the interface that A established. Python doesn't give us any tools to discover this error without executing the code, but statically typed languages like C++ and Java do. These languages can define the types that member functions of a subclass can take, and return, in order to be valid through covariant and contravariant properties. The concepts are larger than member functions but it's a good place to start.

Definitions

Before we can talk about covariance and contravariance we need to know what subtyping is. I'll restrict this to classes in an object-orientated language because that's what my knowledge is limited to, but these concepts go beyond that. An example of a subtype in an OO language would be a class that extends or inherits from another class, for example in C++ you would do class S : public T for S to extend T. The common way type theorists write this is S <: T. This can be read in a few ways: "any term of type S can safely be used in a context where a term of type T is expected" or "every value described by S is also described by T" (Pierce, 182). Thus, S is a more specific type than T, and conversely T is a more generic type than S. Given this, covariance and contravariance define when a more specific or more generic type is acceptable in a particular context.

The definition from the wiki article:

covariance and contravariance refers to the ordering of types from narrower to wider and their interchangeability or equivalence in certain situations
And their definitions:
  • Covariant - Given the types S and T specified above, covariance is when a more specific type, S, can be used when a more generic type, T, is specified. This applies to functions, a function that returns S can be used in the same context as a function that returns T.
  • Contravariant - When the more generic type, T, can be used where the more specific type, S, is specified. A function that takes a T can be used in the same context as a function that takes a S.
  • Invariant - The type specified is the only type that can be used.
My method for remembering the distinction is contravariant is larger than covariant, so it means narrower to wider.

Covariant Example

Consider this example from C++:

class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual Y *foo() { return new Y(); }
};

class B : public A {
public:
  virtual Z *foo() { return new Z(); }
};
Here we have three classes X, Y, and Z which we will return from a virtual function in classes A and B. This code is valid because B::foo is returning a narrower type than A::foo because Z is a subtype of Y. But what happens if we make B::foo return a wider type?
class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual Y *foo() { return new Y(); }
};

class B : public A {
public:
  virtual X *foo() { return new X(); }
};
$ g++ -W -Wall -ansi -pedantic -c foo.cc foo.cc:12: error: invalid covariant return type for ‘virtual X* B::foo()' foo.cc:7: error: overriding ‘virtual Y* A::foo()’

Contravariant Example

As we saw above, one can be covariant on return types, but what about on parameters to a member function? One has to be the opposite actually: contravariant. Unfortunately, gcc (or the version I'm using at least) doesn't give us an error about contravariance, instead we have to add a little bit of code using B::foo to see the problem:

class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual void foo(Y &y) { }
};

class B : public A {
public:
  virtual void foo(X &x) { }
};


int main() {
  B b;
  Y y;
  Z z;
  
  b.foo(y);
  b.foo(z);
  return 0;
}
This behaves as we would expect. We can call B::foo with an X, Y, or Z, since B::foo takes the superclass to all of them (X). But what if we modify it to take Z:
class X {};
class Y : public X {};
class Z : public Y {};

class A {
public:
  virtual void foo(Y &y) { }
};

class B : public A {
public:
  virtual void foo(Z &z) { }
};


int main() {
  B b;
  Y y;
  Z z;
  
  b.foo(y);
  b.foo(z);
  return 0;
}
$ g++ -W -Wall -ansi -pedantic foo.cc foo.cc: In function ‘int main()’: foo.cc:24: error: no matching function for call to ‘B::foo(Y&)’ foo.cc:14: note: candidates are: virtual void B::foo(Z&)
The error we get is about B::foo(Y&) not existing. Because Z is a narrower type than Y, we have broken the interface of A, so B is no longer a valid substitute for A.

But You Already Knew This

Intuitively, this is the way that makes the most sense. The use case for inheritance is generally when a function takes a certain type as input but you want to pass your own type. In order to ensure that the function, bar, can successfully use the type you pass it, all of the methods that bar calls on your type need to take types that bar can pass to it and return types that bar knows how to use. Take this code:

class X {};
class Y : public X { /* ... */ };
class Z : public Y { /* ... */ };

class A {
public:
  virtual Y *foo(Y &y) { /* ... */ }
};

void bar(A &a) {
  Y y;
  Y *y_ptr = a.foo(y);
}
If you were to pass your own type, B, which is a subtype of A (that is, B <: A) the only way we can implement B::foo such that our function bar still works is if B::foo takes an object as input that is a supertype to Y and returns a type that is a subtype of Y. Consider if it were valid for B::foo to want a Z. B::foo could try to call a member function that exists only in Z. But since bar is passing an object of type X, which we can't guarantee has member function that exists only in Z, our code would be unsafe. A similar argument can be made with the return type of Y and X.

What About Arrays?

Now that we get what these terms mean, how do they apply when it comes to arrays of objects? In C++, arrays are not covariant, however in Java they are:

public class Test {
    public static void foo() {
        Test[] tests = new Test[10];
        Object[] objects = tests;
    }
}
This code looks innocent enough, but there is a problem: what if I modify objects? For example:
public class Test {
    public static void foo() {
        Test[] tests = new Test[10];
        Object[] objects = tests;
        objects[0] = new Integer(1);
    }
}
This code compiles fine, but executing it gives the following error (changed foo to main in order to run it):
Exception in thread "main" java.lang.ArrayStoreException: java.lang.Integer at Test.main(Test.java:5)
This is what people mean when they refer to covariant arrays being a mistake. A compile-time check has been moved to runtime, which means you don't know if your code is safe without running it. Note, though, that what makes covariant arrays a problem in Java is modifying the array, not reading it. If we could somehow ensure that the objects array was immutable, covariant arrays in Java would be safe. Why does Java have covariant arrays then? Originally Java did not have parametric polymorphism (generics), without covariant arrays it would be impossible to write a generic function that copies one array to another. Instead one would have to write a copy function for every type of array they wish to copy (Pierce, 188). While the behavior is not ideal, at the least the JVM can provide some level of safety by performing type checks at runtime.

Unfortunately, I cannot shed any light on why Dart has covariant arrays. It has done everything from befuddle to enrage type theorists (null pointers, in 2011, why?!). Dart appears to support parametric polymorphism, the lack of which lead to Java having covariant arrays. It is a shame that Dart seems to be repeating, what some, would classify as mistakes for no obvious benefit. Dart is not the only one sinning though, C# also has covariant arrays and seems to have simply copied them from Java.

Conclusion

I'm no type theorist, this post went through several iterations with comments and corrections from friends smarter than I, so this post only scratches the surface of covariance and contravariance. Hopefully it is enough to understand what people are referring to when the terms come up in casual Reddit conversation. If you're interested in more, Benjamin Pierce's book, Types and Programming Languages, has an entire chapter dedicated to subtyping. While I haven't finished the book yet, what I have read is excellent. Even if you skip the math that looks complicated you will come out better.

Thanks

Special thanks to @dklee. Much of this post is based off of emails we have exchanged. Thanks to @j2labs, and @apgwoz, as well, for help in making this post.

Sunday, October 2, 2011

Your Favorite Language is Probably Terrible at Concurrency too

The internet has been ablaze with posts on NodeJS, to some people's joy and to others chagrin. Some have claimed that Node solves a long standing problem in concurrency, saying:

People are starting to build more on Node.js because it’s superior and it solves these problems that have always existed. I/O has been done wrong for the last 30 years

In my opinion, Node is bad at concurrency, and guess what? Your language probably isn't any better. But let's make sure we're on the same page first.

  • Language/Framework - Most languages do not have concurrency as a first class citizen. So when I say "your language is bad at concurrency", what I really mean is "the options available for doing concurrent things in your language are bad". The former just rolls off your tongue better.
  • Concurrency - What do I mean by concurrency? I mean a model by which you can define actions that can happen at the same time. That could mean running multiple pieces of code in parallel or interleaving them. Specifically in this post I am concerned with solving problems where the number of things you want to do concurrently is significantly larger than the number of cores you have.

There are a lot of options for concurrency out there. You may have heard of things like Pi calculus, Join calculus, Communicating Sequential Processes, Event-loops and Coroutines. Your language probably has an implementation of one of these, or a conceptual subset. NodeJS and Twisted implement an event-loop. Coroutines is the path Python's Gevent has taken, as well as libraries for Ruby, C, and C++. Go has chosen Communicating Sequential Processes. But all these distinctions aren't important unless I can say what I consider a good solution to concurrency.

Ideally, a good solution should have the following properties:
  • Scaling - If you are writing concurrent software you've already decided handling one thing at a time is not a scalable solution, so now you want to handle multiple things at a time. An ideal solution should scale to the limits of the machine. That means making use of multiple cores, if available.
  • Reasoning - It should be easy for a reader of your code to reason about what it does. Edge cases and gotcha's should be limited. Preferably one shouldn't even be aware of the concurrent aspects of the code unless they need to be.
  • Debugging - Debugging should not be painful. Standard tools like stacktraces should be meaningful. Tracing the path a piece of code takes shouldn't be harder than launching the space shuttle.

My claim is that very few concurrent solutions meet these criteria. But let me be clear, I'm not saying this is the only way you should judge selecting a solution. There is a Python library that does basically everything you think you need and it will be really hard to re-implement that functionality in another language? Well, maybe dealing with Python's concurrency shortcomings is less work than rewriting the library.

Scaling

Most languages were built for writing serial code. Memory is accessible by any piece of code in the process and it is assumed that nothing interesting happens between two function calls. But modern computers are not fast enough to do all the work programmers want them to do in serial and these languages have a lot of momentum behind them. For valid reasons, it is challenging to just move to another solution. Instead, we duct tape concurrency on top of these serial languages. One problem is that some of these languages can't even run code in parallel (that is, have two functions running at the same exact time) even if they wanted to. Python and Ocaml have a global lock that restricts this. In other languages it's just too much coordination to do safely. In C and C++ it can be too hard and time consuming to coordinate distributing concurrent work over multiple threads. For this reason, many mainstream solutions to concurrency are limited to running on a single core. It's insane, right? I can buy a laptop with, ostensibly, 8 cores now, yet a program written in most mainstream languages cannot make use of more than one.

For this reason, most solutions fail to be scalable. For example, NodeJS, Twisted, Ocaml/Lwt, and Gevent: from the point-of-view of a user of these frameworks, their code not only cannot run on multiple cores, but it depends on it. Consider some Twisted code that downloads N web pages and appends the result to a list:

def downloadUrls(urls):
    d = defer.Deferred()
    ret = []
    def _returnWhenDone(_):
        if len(ret) == len(urls):
            d.callback(ret)
    for url in urls:
        downloadDefer = downloadUrlAsString(url)
        downloadDefer.addCallback(lambda s : ret.append(s))
        downloadDefer.addCallback(_returnWhenDone)
    return d

Ignoring my failure to handle failures, this code is acceptable Twisted, and it could not work if Python suddenly got the ability to run code on multiple cores and Twisted used it. The reason being, there is no coordination around the ret.append(s) line. What if two threads were to try to append to ret at the same time? NodeJS and Gevent have the same idea in mind. Almost no data access is surrounded by a mechanism to coordinate multiple pieces of code accessing it at the same time. The result is, none of the code using these frameworks can be run on multiple cores. If CPython or V8 got multicore support it would take a rewrite of all of the code to make use of it.

But, you say, who cares? "I can just spin up N instances of my program, where N is the number of cores on my machine. I can easily scale that way". You can't even get concurrency right and now you want to move into distributed programming? Who are you fooling? But seriously, the problem is your code now needs to be "location aware". If you want to do something with object X, you have to be aware of where object X lives. This adds another layer of complexity to your system. Without a good way of communicating between instances you are limited to solving embarrassingly parallel problems or pushing the concurrency to another software layer. Either way, you aren't actually solving the problem with your framework. Luckily, a lot of what people want concurrency for is serving webpages, which requires almost no interprocess communication right now.

Reasoning

No matter how you slice it, writing concurrent code is hard. When it comes to serial code, looking at it and knowing what it does is as simple as understanding how each function operates given the current state of the program. But with concurrent code, the state of the program is changing while a function runs. Understanding a concurrent program involves understanding how the concurrent components are interacting with each other. Some solutions make this easier than others.

Take the following piece of example NodeJS code:

var db = require('somedatabaseprovider');
app.get('/price', function(req, res) {
  db.openConnection('host', 12345, function(err, conn) {
    conn.query('select * from products where id=?', [req.param('product')], function(err, results) {
      conn.close();
      res.send(results[0]);
    });
  });
});

The amount of syntax is enormous. There is a huge amount of line noise for what should look, at worst, like this:

var db = require('somedatabaseprovider');
app.get('/price', function(req, res) {
    var conn = db.openConnection('host', 12345)
    var result = conn.query('select * from products where id=?', [req.param('product')])
    conn.close();
    res.send(results[0]);
});

If you want to add proper error handling, the situation gets worse with callback code. Twisted has attempted to solve this by encapsulating code flow in an object called a Deferred, but the problem remains: a unit of work in callback-based code is not a function, like one is used to in serial code, it is work to do between events. Like the above example code showed, there isn't a function that connects to a db, does a query, and returns the result. There is a function to open a db connection, another function for when that is done and to do the db query, and another function to handle the result. You have defined three functions where you previously needed one. More importantly, you have to define functions not because it makes your code clearer but because the framework requires it.

Given how negatively this affects code, there are a lot of attempted solutions. Twisted, for example, allows one to use the defer.inlineCallbacks decorator so a function can use generators to express asynchronous code. Our previous NodeJS code might look like this:

@defer.inlineCallbacks
def handlePrice(req, res):
    conn = yield db.openConnection('host', 12345)
    result = yield conn.query('select * from products where id=?', [req.param('product')])
    yield conn.close()
    res.send(results[0])

app.get('/price', handlePrice)

In many ways this is an improvement but it does have its limitations.

The NodeJS community has been at work solving this problem for themselves too. One person added coroutines to V8, and gave it a C#-like syntax. OKCupid gave us TameJS. Both of these solutions have their problems which are deal breakers for many.

There are also, less complete, solutions like Step. But library solutions, like Step, only give you access to a subset of functionality you would get from the sequential code you really want to write. To do that you need a full CPS transformation (which is what TameJS gives you, at a cost of debugging). This is actually how the syntax extensions for Ocaml/Lwt work. The previous NodeJS code might look like this in Ocaml/Lwt (the relevant part is that lwt causes a CPS transformation to turn the code into the appropriate callback-based code):

let handle_price req res =
  lwt conn = DB.open_connection "host" 12345 in
  lwt result = DB.query conn (SQL.sprintf "select * from products where id=?" (req#param "product")) in
  DB.close conn;
  res#send results.[0]

App.get "/price" handle_price

This is one reason for Gevent/Eventlet's popularity in Python. Gevent uses coroutines to give you asynchronous code that looks sequential. The trick is, underneath the hood, some function calls actually result in all of the state for your current function call being saved, another one switched to, executed, rinse, repeat. Gevent has a cooperative scheduler that tries to intelligently decide which function to switch to.

Say you want to write the earlier NodeJS code in sequential Python, you might get:

def handlePrice(req, res):
    conn = db.openConnection('host', 12345)
    result = conn.query('select * from products where id=?', [req.param('product')])
    conn.close()
    res.send(results[0])

app.get('/price', handlePrice)

How would this look in Gevent? Exactly the same. The openConnection and query functions have an I/O call which actually jumps back to the Gevent scheduler so it can do something else while the I/O happens.

But Gevent is not without its cost when it comes to reasoning about code. Consider this:

def foo(data):
    print data.bar
    do_something()
    print data.bar

Looking at this code, will the same value be printed twice? The answer is: no idea. Even though do_something does not take data as input, it could do something that causes Gevent to context switch to another function, another function which also has access to data and modifies it. There is no way to tell, simply by looking at the code, if it will context switch or not.

Debugging

The previous Gevent code is printing out two different values for data.bar and you don't want this, how do you fix it? The first thing you might try, from your serial programming days, is a debugger. But that might not work very well. Why? You're in concurrent-land now, multiple things are happening at once! That means timing is important. If you set a break point somewhere, you've disrupted the time things happen and your program could take a completely different path, not the one you want to debug.

If you're smart and you control access to data.bar through function calls, you can do some printf debugging. Perhaps print out a stacktrace when one modifies it. But let's say, even those prints are causing the timing of your program to change, so now data.bar is coming out as the same value at each print. What do you do?!

The point is, debugging concurrent code can be very hard. Event-loop code adds another problem to debugging: your code doesn't have a linear path. If you could visualize sequential code, it would be a line. You start at point A, you do the things in order to get to point B, at any point if you have an error your callstack represents the path you took to get there. Event-loop code always needs to hit the event-loop for a blocking call though. The callstack you see is always limited to the path from the last event you got. A callstack in the code handling a database query may not contain the how you got there. If that query is part of a piece of fairly generic code you don't have many leads to go on to track it down.

Who got it right then?

Three languages come to mind: Erlang, Oz, Haskell. There are more out there but I'm not omnipotent. In my opinion, these languages are capable of the three properties I previously mentioned. Right now you are probably rolling your eyes and saying "I should have known, one of THOSE guys". But my argument is conservative: based on the properties that I believe are important for concurrent solution to be good, these languages excel (or are capable of it) at them. Real world problems contain more than just concurrency issues though, so this does not mean you're wrong to use a language that doesn't meet my criteria, but it does mean you are sacrificing something. Perhaps that sacrifice is acceptable. But don't fool yourself into thinking your language is not terrible at concurrency, because it probably is.