John Fremlin's blog: How to scale a consumer web product launch (load shedding)

Posted 2014-11-01 20:46:00 GMT

The situation: you've come up with a cool new app, have beta-tested it for a while and are ready to open it up to the public. How much server capacity will you need? There's no way to estimate the loadspike - one hit post on reddit or a partner promotion could get you 100k uniques over a few hours and if you've got the killer app you think you have that might just be the start of the onslaught. To control the load, you can have an invite flow where you gather customer contact information and then notify them that they can use the product, but this is likely to cause more of a drop-off than immediately wowing people.

There are three engineering problems of scale here: the obvious one is making the software you run efficient - modern servers can handle 10k+ rps per core - for efficient requests. This requires monitoring and measurement. The second obvious one is ensuring a good asymptotic response to additional hardware resources - adding servers should make it possible to serve more customers, but if you have a single database shard that is authoritative then adding more API servers might not help at all. Again, monitoring and measurement are essential to verify that the characterization of the system fits the model you have of it under the real load it experiences.

The third problem is making sure that in the event of a failure of the scale plan, that the user experience is still professional. For example, if things are overloaded or broken, disable marketing campaigns and notifications drawing people to the product. This may seem obvious but its easy to forget about Adwords when product is down.

The minimal example of a load shedding UX is the famous Twitter fail whale - a static landing page (served perhaps from separate reliable infrastructure) that you can redirect people to. Shifting load can actually work well for highly motivated users: if someone really wants your product then they will come back tomorrow, so downtime mostly loses the chance to convert people who were just curious into customers. To try to minimize the loss, the landing page can ask for contact information so that you can notify people when the service is fixed. Having a count of customer sign-ups might help turn negative publicity around downtime into a story about popularity.

Depending on the product, it may be possible to implement a less debilitating load shedding: request fewer results from the server, do less ranking, depend on client-side caches and reduce polling frequency. Plan for this by changing client code to not retry aggressively and have server-side timeouts on queued requests: if a request is in a queue for longer than the customer is likely to wait for it, just fail it rather than spending compute resources.

Even with a good plan, and a well characterized system whose response under load is well understood, an influx of publicity will bring a new class of semi-malicious or malicious use. People will poke around at your API and use unexpected classes of devices to access the product. This is one example demonstrating that auto-scaling can be a really bad idea: if someone is DDoSing you, then adding scale to handle that just makes the DDoS more expensive. Some mischievous attackers specialize in finding poisonous requests that will consume many server resources while very few client ones. Be ready to react with reliable dashboards to give you the information you need, and do not depend on automatic systems. A load spike can turn into a delightful opportunity to exercise your disaster recovery plan when servers overheat and lock up :)

Post a comment