Debugging a race condition with client-side load balancing

While working at StudyRoom, the social learning network for online students, I recently faced a particularly difficult-to-debug problem that turned out to be caused by a race condition between different parts of a distributed system. In this article I'll take you through the steps I used to isolate, debug, and ultimately solve the issue. I'll start off with some context on our system architecture.

The Architecture

At the core of StudyRoom are "rooms" that students connect to in order to see, chat, and collaborate with their peers. Each room is managed by a "room server", which accepts WebSocket connections from each client connecting to the room, and routes messages among them as needed.

Each room server can handle multiple rooms, but it is critical that a room be managed by one and only one room server, or the room's clients won't be able to see and communicate with each other. To handle this, we also have a "master server", which connects to a Redis database that stores mappings of room IDs to the room servers that handle them. Before connecting to a room, the client tells the master server the room ID, and it replies with the IP address of the room server that handles that room. The client then initiates a new connection to the given room server. This is a form of client-side load balancing because the connection is initiated from the client directly to the selected server, not routed through the master server or other server-side load balancer.
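To make that flow concrete, here's a minimal sketch of what a client does, assuming the master server's /connect/<room_id> endpoint (which also appears in the test script later in this article) simply returns the assigned room server's address. The library choice, hostname, and WebSocket URL format below are illustrative, not our production client code.

import requests                           # assumption: plain HTTP request to the master server
from websocket import create_connection   # pip install websocket-client

def join_room(room_id):
    # Step 1: ask the master server which room server owns this room.
    resp = requests.get('http://master.example.com/connect/%s' % room_id)
    room_server = resp.text.strip()
    # Step 2: connect directly to that room server over WebSocket; no
    # server-side proxy or load balancer sits in between. (URL scheme is illustrative.)
    return create_connection('ws://%s/room/%s' % (room_server, room_id))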

The Problem

A few weeks ago, I received the first report of what our users called "ghost rooms". Users were not able to see other users that they knew were in the room at the time. In some cases, when they sat down at a table in a room (tables are load-balanced as separate "rooms" from the main room), they could see users there who were not in the main room.

Our server status report confirmed that the same room ID was active on two different servers, both with a significant number of users connected. In the Redis database, the room ID was assigned to only one of the two servers. This raised the question of why any clients connected to the other server for that room ID.

Early Hypotheses

My first hypothesis was that the room-to-server mapping was expiring. The Redis keys that store these mappings are set to expire every 10 minutes to keep the database clean and to ensure that an assigned server going down doesn't make a room permanently inaccessible. Each room server refreshes the time-to-live of every room currently active on it once a minute to prevent the keys from expiring; if a refresh failed and a key expired, the room could be reassigned to another server, producing exactly this kind of split. But upon examining the TTL values of the relevant keys during a ghosting incident, it was clear that the server was refreshing the key as expected.
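For reference, the refresh logic amounts to something like the following sketch. The function and variable names are mine, not the production code's, but the key pattern and timings match what's described above.

import redis

r = redis.StrictRedis()
ROOM_KEY_TTL = 600  # mappings expire after 10 minutes without a refresh

def refresh_room_ttls(active_rooms):
    # Run once a minute on each room server, for every room it currently
    # hosts, so a healthy server keeps its mappings alive indefinitely.
    for room in active_rooms:
        r.expire('room-%s' % room, ROOM_KEY_TTL)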

My second hypothesis was a race condition between clients requesting a server assignment for a room that doesn't have one yet. If two such requests arrive at nearly the same time, both could fail to find an existing assignment, and each could generate and store a new one, and the two assignments could differ. However, I had already thought of and accounted for this possibility when writing the original assignment code. Here's a simplified version:

# Return the existing assignment for this room's key, if there is one.
server = r.get(key)
if server is not None:
    return server
else:
    # No assignment yet: pick a server and try to claim the key atomically.
    server = _get_best_server()
    if r.setnx(key, server):
        return server
    else: # The key was created in the meantime
        return r.get(key)

The key here is the Redis SETNX command. It sets a key if and only if it does not yet exist, and is guaranteed to be atomic. Thus, in the scenario described above, only one server assignment would be inserted into the database; the other SETNX would fail, and the master server would get and return the server assigned by the other request.
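As a quick illustration of that guarantee (using redis-py and a throwaway key name of my own): of two racing writers, exactly one wins, and both subsequently read the winner's value.

import redis

r = redis.StrictRedis(decode_responses=True)

print(r.setnx('room-demo', 'server-X'))  # True: no key existed, so 'server-X' is stored
print(r.setnx('room-demo', 'server-Y'))  # False: the key already exists, nothing is stored
print(r.get('room-demo'))                # 'server-X', no matter which writer asks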

Nevertheless, I decided to stress-test this behavior to make sure it really worked, since it seemed like a perfect explanation for the issue. I wrote a script that would hammer the master server with nearly simultaneous requests for a random room ID and check that all requests received the same server:

from datetime import timedelta
from tornado.httpclient import AsyncHTTPClient
import tornado.ioloop
import random, sys

def run_test():
    # Request a server assignment for a random room ID from the master server.
    PATH = 'http://localhost:9001/connect/testsession%d' % random.randint(1, 100000)
    l = []
    def get(url, callback):
        c = AsyncHTTPClient()
        c.fetch(url, callback)
    def callback(data):
        l.append(data.body)
        # Once both responses have arrived, they must name the same server.
        if len(l) == 2:
            if l[0] == l[1]:
                print 'OK', l
            else:
                print 'FAIL', l
                sys.exit(0)
    ioloop = tornado.ioloop.IOLoop.instance()
    # Fire two nearly-simultaneous requests, each delayed by a small random amount.
    ioloop.add_timeout(timedelta(seconds=0.00001 * random.randint(1, 100)), lambda: get(PATH, callback))
    ioloop.add_timeout(timedelta(seconds=0.00001 * random.randint(1, 100)), lambda: get(PATH, callback))
    # Schedule the next iteration, so a fresh pair of requests runs 100 times per second.
    ioloop.add_timeout(timedelta(seconds=0.01), run_test)

run_test()
tornado.ioloop.IOLoop.instance().start()

This script uses the Tornado AsyncHTTPClient to launch two requests at the master server, 100 times per second. In the callback, I check that when both requests finish, they return the same server assignment. I ran the test against my development server and our staging server, both with zero time spacing between requests and with a small random delay between them (hoping to catch the server at a particular stage of processing), as seen in the code snippet. After hundreds of thousands of requests, no failures occurred. This particular race condition prevention measure was airtight.

The Workaround

I wasn't getting anywhere with debugging the issue, but I had to come up with a short-term solution. Cases of ghost rooms would arise every few days, and I dealt with them by resetting the "second" server, the one that had the room active but was not the assigned server. This was far from ideal, since each server reset carried collateral damage, disrupting all other rooms that happened to share that server. Even worse, this often caused a second ghost room to arise immediately. Only resetting both servers the room was active on at the same time would completely clear up the issue.

I eventually chose to shut down all but one of the room servers. This made it physically impossible for rooms to be duplicated, but of course left a single server to handle the entire load. It coped fairly well, but it was clear that I had to come up with a real solution soon.

The Breakthrough

While doing some routine development one day, I accidentally reproduced the issue on my local machine. This was a significant breakthrough; until then the issue had only surfaced on production machines, where many variables complicate debugging. The circumstances were quite simple: I had a room loaded in two windows and refreshed them at the same time. They ended up assigned to different server processes on the development server.

I had already ruled out the race condition of multiple clients requesting a server assignment at the same time and receiving different assignments, but clearly some other sort of race condition was at play. I realized there was one important difference between this situation and my earlier race condition test: here, the two clients were already connected to the room and were disconnecting and then reconnecting, while in the test both clients were connecting fresh.

I zeroed in on the room disconnection logic, performing a careful code review of the room server's behavior when clients disconnect. I found the following code snippet, run when a room is "destroyed", i.e. when all clients disconnect:

def destroy_room(room):
    'Called by room server when all clients have left a room'
    r.delete('room-%s' % room)
    r.delete('room-%s-members' % room)
    r.publish('status-room-destroy', room)

Aha: the room server deletes the key room-<roomname> whenever all clients disconnect from a room. This is the same key that is used to store the room-to-server mappings. This allows the following race condition (replayed in a code sketch after the list):

  • Client A requests to join room 1. He is assigned server X, and the mapping 1 -> X is stored. He connects to server X.
  • Client B requests to join room 1. She is also assigned server X, since the mapping 1 -> X is present.
  • Before client B can actually connect to server X, client A disconnects from server X. Since room 1 is now empty, server X destroys the room and deletes the mapping 1 -> X.
  • Client B connects to server X. Server X re-creates room 1.
  • Client A again requests to join room 1. No mapping for this server exists, so he is assigned a random server: server Y. The mapping 1 -> Y is stored. He connects to server Y.
  • Room 1 is now simultaneously present on both servers X and Y.
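To double-check this interleaving, here's a minimal simulation that replays the sequence above against an in-memory dict standing in for Redis. The helper names (assign, connect, disconnect) are mine and purely illustrative; the point is just that a single delete by an emptied server is enough to split the room across two servers.

mapping = {}                         # room ID -> assigned server (the room-<id> key)
rooms = {'X': set(), 'Y': set()}     # clients currently connected on each server

def assign(room, best_server):
    # Master server: return the stored assignment, or claim it (SETNX-style).
    return mapping.setdefault(room, best_server)

def connect(room, server, client):
    rooms[server].add(client)

def disconnect(room, server, client):
    rooms[server].discard(client)
    if not rooms[server]:            # room is now empty on this server...
        mapping.pop(room, None)      # ...so destroy_room deletes the mapping

connect(1, assign(1, 'X'), 'A')      # A is assigned X; mapping 1 -> X is stored
b_server = assign(1, 'X')            # B is also told to use X
disconnect(1, 'X', 'A')              # A leaves before B connects; mapping deleted
connect(1, b_server, 'B')            # B connects to X, which re-creates room 1
connect(1, assign(1, 'Y'), 'A')      # A rejoins; no mapping, so a new server Y wins

print(rooms)                         # {'X': {'B'}, 'Y': {'A'}}: room 1 on two servers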

When I first came up with this explanation, my co-founders pointed out that it doesn't fully explain the issues reported, since it can explain only one person being assigned to a wrong room. This is because it relies on a room being completely empty and thus destroyed before the race condition, and it is highly improbable that the upwards of 10 people we've seen in ghost rooms would all have triggered this condition simultaneously. I agreed, until I realized the further implications of this situation.

Consider the state of the room right after the condition is triggered, with one client each on servers X and Y, and the official mapping pointing to server Y. More clients join the room, until there are a total of ten on server Y. Eventually, the lone client on server X will realize nobody is joining him and leave. This will again trigger server X to destroy the room and delete the mapping 1 -> Y, even though it doesn't point to server X. Now, new clients will be assigned to yet another server, say server Z, and the ten clients on server Y become the ghost room. This situation will perpetuate until, by pure chance, the server new clients are assigned to happens to be the one containing the ghost room. It also explains why terminating the server containing the ghost room doesn't help: that server is programmed to catch SIGTERM and cleanly destroy all of its rooms, which deletes the live mapping on its way down and immediately spawns a new ghost room elsewhere.

The Fix

The ultimate fix for this issue is trivial: just remove the r.delete('room-%s' % room) line from the room destroy method. Since the room keys expire after 10 minutes without refreshes anyway, this doesn't significantly affect the operation of the infrastructure.
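With that one line removed, the destroy method from earlier would look like this; the mapping key is now left to expire on its own once the assigned server stops refreshing it:

def destroy_room(room):
    'Called by room server when all clients have left a room'
    # The room-to-server mapping ('room-<room>') is no longer deleted here;
    # it simply expires once the assigned server stops refreshing its TTL.
    r.delete('room-%s-members' % room)
    r.publish('status-room-destroy', room)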

Some of the lessons I learned throughout this process are:

  • Be extremely careful to properly synchronize access to shared state in distributed systems. The core of this problem was both the master server and the room servers mutating the server mappings, which caused the race condition. Even though the state is stored in a remote database with atomic operations, these kinds of problems can still arise.
  • Make sure the development environment mirrors the production environment. If the development environment hadn't used multiple servers just like production, I would never have triggered the issue in development and realized the problem.
  • Automated test scripts are great for attempting to reproduce issues, especially race conditions. Ideally I would have written these earlier and made them part of our main test suite.
  • Ultimately, code review is extremely important as the final step for finding these issues, so make sure code flow is easy to follow.