How to succeed when designing real-time systems involving synchronous calls.
Synchronous calls are best avoided where you can. They decrease availability, increase latency, and have other drawbacks. An article in the Amazon Builders' Library describes the difficulty of operating synchronous systems:
At the… most difficult end of the spectrum, we have hard real-time distributed systems. These are often called request/reply services. (emphasis mine)
Yet sometimes, there are systems that have to make synchronous calls. This article gives tips for both service providers and clients on how to succeed in this situation.
Design for failure
Realize that service errors will cascade
Make your clients’ lives easy
- Provide an SDK/library (that you vet in your functional tests) consumers can include in their project and use to call your service
- Publish your endpoints to service discovery (for example, Parameter Store or Eureka), and have your tests resolve the URL from there and call it
Prioritize requests
- If you have health checks (Kubernetes apps, or apps behind an ELB), prioritize them
- Reject expired requests — check the propagated expiration header for the original request timeout. If it is past the timeout, don’t process it.
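As a minimal sketch of the expired-request check described above: the header name `X-Request-Deadline`, its format (Unix epoch seconds), and the `handle` wrapper are all assumptions made for illustration, since the article does not name a specific header or framework.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical header name and format (Unix epoch seconds); not from the article.
DEADLINE_HEADER = "X-Request-Deadline"


def is_expired(headers: Mapping[str, str], now: Optional[float] = None) -> bool:
    """Return True if the propagated deadline has already passed."""
    now = time.time() if now is None else now
    raw = headers.get(DEADLINE_HEADER)
    if raw is None:
        return False  # no deadline was propagated: process as usual
    try:
        deadline = float(raw)
    except ValueError:
        return False  # malformed header: this sketch assumes failing open
    return now >= deadline


def handle(headers: Mapping[str, str],
           process: Callable[[], str],
           now: Optional[float] = None) -> Tuple[int, str]:
    """Reject expired work before spending any resources on it."""
    if is_expired(headers, now):
        return 504, "request deadline already passed"
    return 200, process()
```

A plain dict lookup is case-sensitive, so a real handler would probably normalize header names first. Comparing absolute timestamps also assumes the client and server clocks are reasonably in sync.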
Make debugging easy
Share your service status
- Publish service status (via status pages) and allow clients to subscribe to outage alerts (you already have CloudWatch alarms for rollbacks)
Generate a client
Your generated client should include the following features (I'll have another article dedicated to just this):
Don’t give up after the first attempt (give yourself more than one chance to succeed)
- Use the SDK provided by the service provider, which will automatically retry (it has other benefits too)
- Put sync calls in their own Lambda function triggered by SQS. This decouples you from failures and latency increases, and gives you retries with exponential backoff (in addition to those provided by the client)
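To illustrate what retrying with exponential backoff looks like when you have to do it yourself, here is a rough sketch. The function names, default values, the choice of retryable exceptions, and the use of full jitter are assumptions for this example; a provider's SDK or SQS will have its own policy.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(attempts - 1)]


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call `call`, retrying on the listed exceptions with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential bound.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Retrying is only safe when the call is idempotent, and the total retry time should stay inside the original request's timeout.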
Be a considerate consumer
- Surround sync calls with the circuit breaker pattern. If you are using Lambda, I'd highly suggest following this pattern, as it prevents the Lambda from being triggered at all, rather than just sending failed messages to the DLQ
- Only send requests where the original request has not yet timed out, and propagate the timeout header
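For readers who have not used the circuit breaker pattern, the sketch below shows the basic idea as a generic in-process wrapper. It is not the Lambda-specific variant mentioned above (which stops the function from being triggered at all); the class name, thresholds, and half-open behavior are assumptions chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling service.
                raise CircuitOpenError("circuit open; skipping call")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(lambda: client.get_order(order_id))`, where `client.get_order` stands in for whatever sync call you are protecting. This sketch is not thread-safe; a shared breaker would need a lock around its state.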
Make debugging easier
- Periodically exercise calls to the endpoint. I prefer real traffic with CloudWatch Synthetics, but you can execute connectivity checks (NOT health checks) if needed
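One way to read "connectivity check" is a plain TCP connect that only confirms the endpoint is reachable from where the client runs, without asking the service to report on its own health. The sketch below takes that interpretation; the function name and default timeout are assumptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

Run from a scheduled job in the client's own network, a failing check points at routing, DNS, or firewall problems rather than at the service itself.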
Don’t depend purely on server-side monitoring
- Monitor your end-to-end latency, since server-side metrics don’t include network latency
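A small sketch of measuring latency on the client side, so that the number includes network time as well as server processing. The `timed` helper and its `record` callback are hypothetical; in practice `record` might publish to whatever metrics system you use.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(record: Callable[[float], None],
          clock: Callable[[], float] = time.perf_counter) -> Iterator[None]:
    """Measure wall-clock time around a block and report it in milliseconds."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even when the call raises, so failures show up in latency too.
        record((clock() - start) * 1000.0)
```

Wrapping the whole client call (including retries) in `with timed(samples.append): ...` gives the latency your own callers experience, which is the figure to compare against the provider's server-side metrics.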
We’ve seen a variety of ways to increase the odds of success when using synchronous calls. These responsibilities fall on both the service provider and the service consumer. On the provider side, they include making services idempotent, using canary deploys, and generating clients; on the consumer side, they include retrying requests and using circuit breaker patterns.
Do you have any other tips for ensuring the reliability of synchronous systems? If so, please leave feedback.