John Fremlin's blog: Remote Procedure Calls at scale

Posted 2022-04-01 22:00:00 GMT

Large scale distributed systems can feel like chemical plant. Issues can slowly build up like rising pressure and then take a while to clean up like a chemical spill. My experience with several systems using hundreds of thousands of machines or millions of cores is that you can't just treat remote procedure call APIs as if each function call is fully independent.

Imagine just one server with a broken disk that keeps returning errors. It's load will be low so it can take a disproportionate amount of client traffic. Imagine a client that fails to read the response, and retries. It's like a denial of service attack.

Healthcheck and heartbeat. Best practice for servers is to have a healthcheck method that the load balancer checks with before letting clients call the server. This method shouldn't be left as an empty return OK. It should confirm that all expected dependencies (e.g. downstream databases) are accessible and all configuration is present and error free. At scale, where there is always a weird machine with a system package going haywire on a broken kernel version, you also need to sanity check basic things, like that there is roughly adequate disk and CPU bandwidth.

Beyond the healthcheck communication to the load balancer or cluster manager, it's important to have a heartbeat with the client. The heartbeat should be called periodically by the client, and it's beneficial if on first use of the server-client pair. The heartbeat has multiple objectives

It's really important to have strong integration test guarantees that clients respond well to the sleep, go to another server, and flow control update commands. Software inventory management at scale is very hard. Some large service might suddenly be restarted with an old version that boots up very aggressively. To contain this explosion of load, the servers activate the heartbeat update to moderate the clients, targeted closely at the misbehaving ones. This means the heartbeat has to have enough information to semantically isolate problematic categories of client, i.e. full details of which application and location and so on. If you can only filter on user account or network location it may be hard to differentiate between clients with big impact or experimental clients and batch systems that are resilient to downtime. When the server is unhealthy, it needs to tell clients to go away.

Call pacing. At scale an API is also like a flow. The system as a whole may aspire to be elastically scalable but any one single server machine or network is not. After successful and unsuccessful calls, it's really useful if the server can set a guideline for how long to wait for the next call or retry (in general, having the ability to update any heartbeat setting in the call response metadata). Allowing the server to guide the clients use of connection pools and buffers also helps contain overload. The best place to contain overload is at the source. Throttling after something is sent by the client is more expensive than not generating the request. And sometimes the server may find it hard to guess how much work will be needed for a request. If the client can set a short deadline and indicate the call should fail if it needs buffers above a certain size, the server can prioritize and schedule better.

Observability. If it's a retry or a hedged request, the server should be told that. Otherwise, a single failure can look like many. Ideally the server should be able to log enough information to connect to client logs and tracing, and to differentiate between classes of requests by the same client. It's generally useful to combine some fixed agreed naming (e.g. LOW_PRIORITY or ADS_HIGH_PRIORITY) with unstructured tags (e.g. "experment_job=try_123"). Both are useful. If the client logs timing information for the previous call when making the next call, the gap between server and client measurements can be closed without additional RPCs or other logging destinations.

The general reasoning: the server is easier to reconfigure than clients and has more system context. If there is just one error a client might as well retry immediately. If the system is overloaded and cannot process requests, low priority clients should wait for a long time to retry. The client doesn't have this context. The above mechanisms help make the system manageable. They could theoretically be integrated into the RPC layer but normally there are additional application settings useful in the heartbeats and metadata.

It's best to have clients control load before sending it. However, a common failure situation is a large client deployment starting up or restarting in rapidly in a loop. Then even the first API calls may overwhelm. Other load control mechanisms and defense in depth is important. Ideally both servers and clients are able to share capacity and load expectations, so for example, the server doesn't try sending many huge responses to a client that can't expect to process them immediately.

Hope this helps! Big systems are awesome and don't need to be scary!

Post a comment