Wednesday, April 2, 2008

Scaling webapps

DISCLAIMER: I am not an expert in web apps and simply have been doing some research on them. Given that, my claims and concerns may be completely unfounded and simply a result of my ignorance of web app design, if so, please let me know.

I have been doing a little bit of research on webapps and I'm not sure how I feel about the standard model. The standard model, as I understand it, works by every incoming http request results in some number of DB requests where the DB is the holder of all state information. The web app itself, whether it be in PHP, or Rails, etc is stateless. I certainly don't doubt that for a large majority of applications this works fine. Take something like the yellow pages which is a Rails application according to this. The standard model seems quite reasonable for this. You have a page that is fairly static, people are mostly doing requests for data which requires walking through some large database and the results are being displayed and not much is happening again until the user initiates another page request.

Web apps are changing though. Today, we expect our web apps to look and feel like native applications, which means snappy responses and updates of it even without the user explicitly reloading the page. We have things like AJAX and Comet-style serverpush. In the end, this means more requests are going to the http server even when the user is not doing anything. The Comet style is pretty neat but from what I can see, polling with AJAX is the most widely used way. My concern is how well the standard model is going to scale as web apps are required to act more and more like native applications. The reason for my concern is because this method is simply polling the DB for state changes on every request and in every other situation polling quite clearly has not scaled. This is the exact reason Comet exists.

For the sake of simplicity, let's imagine that the web app is AJAX with polling. Some portion of the page is going to be asking for updates every 5 seconds. The standard model would have us querying the DB every 5 seconds for that request to see if the state has changed. If you have a couple thousand people on this page that is a lot of work. If the component of the page is changing often for a majority of the people then it's not too big of a deal but if it's not changing often for a majority we are doing a lot of worthless DB calls. I'm sure we would design our database so that this call is hopefully very lightweight but how well does that really scale in the end?

Now, the reason we want our wep app (such as something in Rails) to be stateless is because for each request we can be pushed around to a separate VM. So if we hold onto session state information there is a good chance we might not even get a chance to use it on the next request. On top of that, if you are load balancing between several hosts, you may not even be in a good place to share state information between VMs.

I'm sure almost all of this is not news to anyone who has made it this far in the post. What can be done to make a webapp scale better? I'm not sure but here is my suggestion and hopefully someone with more experience than myself can come back and say if it's a horrible idea or not. Let's say I want to make a web app that will be handling something like instant messaging such as google talk or meebo (which are both using Comet).

For starters, I want this to be fairly real time, when I send a message to someone, I want it to get to them ASAP. Secondly, there will be a lot of interaction between users. Clearly, page-by-page viewing is not going to work here. People can't be refreshing their page every few seconds to see if there is an update. AJAX with polling or Comet are clearly your two choices currently. How should this look after the HTTP request comes in though?

Here is how I am suggesting it should:
Each request should have some sort of session ID. Each session ID will be associated with a process. By process here, it could be an Erlang process or some OS process written in some other language, whatever. The point here is, for a particular session ID, its request will get forwarded to the same process every time for the life time of its session. This way, the process can store all the state information. We don't have to do worry about sharing state information in something like memcache. The process would have some sort of timeout value so they die off eventually. At worst case, we are back to the standard model where if a user does multiple requests that happen to have a pause between them longer than the timeout we now have to initiate a new process and it has to do the DB calls to initiate itself. In the best case though, we are consistently getting sent to the same process which is holding onto our data. The upside to this method is the process can also do things specific for that user such as opening up other connections. For instance, in the instant messaging example, a user logs in, they get a session ID and a process created for them somewhere that will be mapped to this session ID. The process opens up a connection to an event server for that user so it can listen for IMs and push IM's out. We now have an application that is quite event based on all sides of it. We won't be hitting the DB too often. Given that we don't really care where this process lives, we can also scale it out to multiple machines and not have to worry about replicating the same data over many machines because we don't know where the users request will go to.

Downsides?
Certainly. Clearly if we have 2000 people using our IM application at once, we need to have 2000 processes a live. If we wrote this in Erlang where each process for a session ID maps to an Erlang process (sorry about the terminology here all)? That's childsplay. We could host this application on one machine! But not everyone wants to write things in Erlang, so what about them? If I were in this situation, I would probably have some machines each running some amount of applications that will be doing some I/O multiplexing to handle this. So you would have some amount of OS processes and each one can handle some amount of session ID's. Pretty standard for something that isn't Erlang but needs to handle a bunch of things at once. If you are into Python, Twisted comes to mind. I'm sure other languages have their own way of doing this.

Another edge case here, and I think this is probably not too hard to deal with, is what if an event comes in (such as an IM) and the process times out due to lack of activity? You could have each event have a timeout and if it is not ACK'd in that time it gets saved to a DB and the next process that is created for that user picks it up as part of its initiation.


To reiterate, I do not have much experience in webapps, I simply have been doing some research and this is the impression that I have gotten. Are my concerns valid? Is the system I described what people are already going to or is it broken? How would someone write Meebo or google talk? Let me know.

10 comments:

  1. Your concerns are valid, and are a major part of every web architect's job.

    The technique you're describing is known as "sticky sessions", in which all requests in a session* are directed back to the same application server process that handled the first request.

    (*Defining a session turns out to be pretty tricky, since you can never be quite sure that the request you're processing is the user's last one. Hence sessions always have a timeout. This annoys users unless you call it a security feature. Then they don't seem to mind.)

    Sessions and session management are a big issue, and constitute one of the three legs of a site's capacity. (The other two being bandwidth and concurrent connection ceiling.)

    <shamelessplug>I deal with session management and techniques for managing sessions on high capacity web sites in my book Release It!. There's a whole section on how to use sessions without abusing them.</shamelessplug>

    ReplyDelete
  2. Are there any popular web frameworks out there that do use sticky sessions? Is this the direction people think web apps will go in? What are some alternative?

    ReplyDelete
  3. Pretty much everything based on J2EE or servlets will use sticky sessions. (It's baked into the standard.)

    Every ASP.NET app I've looked at uses sticky sessions.

    I actually think web apps need to get away from so much reliance on sessions. They are helpful in the middle range of scalability, but they quickly become a problem as traffic increases.

    When traffic gets heavy and app servers start to reach their capacity ceilings, you'd rather devote memory devoted to serving more pages instead of holding objects in sessions.

    Second, sessions always have to be backed up or transferred to another server. Otherwise, when one app server fails, work for all users on that server gets lost. Whatever the framework's specific session backup strategy is, I guarantee it consumes memory and bandwidth.

    I still see sessions being used to hold things like the entire set of search results from a search engine (100's of results) in order to paginate the results. What a waste of RAM! It scales much better to hold only the query parameters and the index of the first result to display.

    I think you're going to see more of a move to distributed caching rather than increased reliance on sessions. Check out highscalability.com for a lot of details about highly scalable web architectures.

    ReplyDelete
  4. I suggest you read about REST principles if you haven't already. As soon as a form of session state is kept on the server between requests, it's hard to scale. Though it's not really applicable to IM (nothing much cacheable), then maybe it's not HTTP you want but XMPP.

    ReplyDelete
  5. I was part of a team that developed a web messenger like meebo. We used ejabberd as the backend and attached the web messenger via the http binding module for ejabberd.

    I did the JS communication library that implemented the http binding, a mechanism which has evolved into BOSH. Please google for the two XEPs containing BOSH and have a look at the examples.
    It works pretty well.

    I am not sure which was first, BOSH or Comet long polling, but it seems to be the same idea. If there is no activity (no messages from server to client and from client to server) it is a polling with a long period T_0, given by some timeout to prevent some component closing the connection due to no activity. If there are new messages all the time, you poll with a short T_1 given by some DOS value, like no consecutive trivial messages (just an empty body) within 2s. We found it quite efficient.

    If you look at the examples you will note that the messages arrive at the server out of order, but can put back into order, by using the message ids.

    By the way, of course you don't have to share the state only via the database.

    You can put state into parts of the URL (be it REST or CGI/GET variables - but beware of total size limitations) or put it into the HTTP message body (CGI/POST).

    You can put state into the HTTP header section (Cookies).

    Passing the state into the request and returning some tupel (new_state, ..) is actually more functional style. Using the database is similiar to using a global variable. :-)

    ReplyDelete
  6. The way I envisioned it would be to have session ID's purely be used to map to a process, but that process is responsible for pushing changes tot hat user back to perm storage so it can easily die and a new one be created that will pick up where it left off. I'm not sure how well that works.

    What about an application that is mostly read-only? For instance a shared calendar application, where changes to the calendar are limited to the speed humans can change them. In computer time, this means changes will be quite rare but it also means polling the DB and webserver often, to give the feel of a native application, is wasting a lot.

    At the same time though, how can one determine if their projected usage of an application will scale in a standard model or if a more progressive model is required? Plus any fudge factor for the unknown usage.

    ReplyDelete
  7. Comet/Bayeux is earlier than BOSH, which can be figured out from BOSH specification.

    A large part of scalability of web applications is achieved by caching. For long process/workflow, session id is, but not the only way to maintain state. In HTTP, connection is also a way to maintain state.

    Erlang encourage messaging between process, while REST applications communicate through unified interface. Which one is more scalable? Erlang to me is still close to distributed objects model.

    ReplyDelete
  8. I'd say this can result in a similar behaviour and programming model as continuation based servers / app frameworks, see for example seaside: http://www.seaside.st/

    ReplyDelete
  9. We would like to nominate you for a FreedomToBlog.com Best of Blog Entry.

    Please visit our site at: http://www.freedomtoblog.com to submit your blog entry!

    Benefits include:
    1) Permanent Backlinks to your blog
    2) Fully SEO optimized for maximum exposure
    3) A chance to be published into a top 500 best of blog book"

    ReplyDelete
  10. The Haskell web framework called Happstack (http://happstack.com) is designed around ideas similar to what you describe. Happstack has its own web server, as well as an in-memory solution for persistent data storage. If Haskell isn't too scary, I think it might interest you.

    ReplyDelete