Successful Microservice Architecture | by Connor Butch

How to succeed when designing real-time systems involving synchronous calls.

Image: a pair of lemons with the caption “when life gives you lemons, make lemonade”

Synchronous calls are something you should avoid when you can. They decrease availability, increase latency, and let failures cascade between services. The Amazon Builders’ Library has an article that describes the difficulty of operating synchronous systems:

At the… most difficult end of the spectrum, we have hard real-time distributed systems. These are often called request/reply services. (emphasis mine)

Yet some systems have to make synchronous calls. This article gives tips for both service providers and clients on how to succeed in this situation.

Design for failure

Realize that service errors will cascade

Make your clients’ lives easy

Provide an SDK or library, vetted in your functional tests, that consumers can include in their projects and use to call your service
Publish your endpoints to service discovery (Parameter Store or Eureka), then resolve the URL from there and use it in your own tests

Prioritize requests

If you have health checks (Kubernetes apps, or apps behind an ELB), prioritize them
Reject expired requests: check the propagated expiration header, which carries the original request’s timeout. If that timeout has passed, don’t process the request.
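Here is a minimal sketch of the expiry check above. It assumes the caller propagates an absolute deadline as epoch milliseconds; the header name `X-Request-Deadline-Ms`, the function names, and the choice to accept requests with a missing or malformed header are all my assumptions for illustration, not part of any standard.

```python
import time
from typing import Mapping, Optional

# Hypothetical header name; pick whatever your services agree on.
DEADLINE_HEADER = "X-Request-Deadline-Ms"


def is_expired(headers: Mapping[str, str], now_ms: Optional[int] = None) -> bool:
    """Return True when the propagated deadline has already passed."""
    raw = headers.get(DEADLINE_HEADER)  # note: this lookup is case-sensitive
    if raw is None:
        return False  # no deadline propagated: process as usual (assumption)
    try:
        deadline_ms = int(raw)
    except ValueError:
        return False  # malformed header: fail open (assumption)
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms >= deadline_ms


def handle(headers: Mapping[str, str], now_ms: Optional[int] = None) -> int:
    """Return an HTTP-style status code, rejecting expired work before doing any."""
    if is_expired(headers, now_ms):
        return 504  # the original caller has already given up
    return 200  # placeholder for the real processing
```

The point of checking first is that work done for a caller who has already timed out is pure waste, and it takes capacity away from requests that can still succeed.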

Make debugging easy

Share your service status

Publish service status (via status pages) and allow clients to subscribe to outage alerts (you already have CloudWatch alarms for rollbacks)

Generate a client

Your generated client should include the following features (I’ll have another article dedicated to just this)

Don’t give up after the first attempt (give yourself more than one chance to succeed)

Use the SDK provided by the service producer, which will automatically retry (has other benefits too)
Put sync calls in their own Lambda triggered by SQS. This decouples you from failures and latency increases, and gives you retries with exponential backoff (in addition to those provided by the client)
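To make the retry idea concrete, here is a rough sketch of client-side retries with exponential backoff and full jitter. It is not what any particular SDK does internally; `call_with_retries` and its parameters are names I made up, and it retries on every exception, which is only reasonable when the call is idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Backoff ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Full jitter: sleep a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment too.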

Be a considerate consumer

Surround sync calls with the circuit breaker pattern. If you are using Lambda, I’d highly suggest following this pattern, as it prevents the Lambda from being triggered at all, rather than just sending messages to the DLQ
Only send requests where the original request has not yet timed out, and propagate the timeout header
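For readers who haven't seen a circuit breaker, here is a small in-process sketch of the state machine. It only illustrates the pattern: it does not disable a Lambda trigger as described above, it is not thread-safe, and the class name, thresholds, and injected clock are my own assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without contacting the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast and give the struggling service room to recover.
                raise CircuitOpenError("circuit open; skipping call")
            # Reset timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage looks like `breaker.call(lambda: client.get_order(order_id))`, where `client.get_order` stands in for whatever sync call you are protecting.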

Make debugging easier

Periodically exercise calls to the endpoint. I prefer real traffic with CloudWatch Synthetics, but you can execute connectivity checks (not health checks) if needed
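If you go the connectivity-check route, one possible shape is a plain TCP connect from wherever the client runs. This sketch only tells you the network path and port are reachable, not that the service is healthy; the function name and default timeout are my own choices.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run it on a schedule from the consumer's network, so that a firewall or routing change shows up before real traffic hits it.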

Don’t depend purely on server-side monitoring

Monitor your end-to-end latency, since server-side metrics don’t include network latency
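A simple way to get that client-side number is to time the call where it is made and track percentiles yourself. The sketch below is one way to do it, with hypothetical helper names; in practice you would publish the samples to your metrics system rather than keep them in a list.

```python
import math
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def timed_call(
    operation: Callable[[], T],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[T, float]:
    """Run `operation` and return (result, elapsed milliseconds) as seen by the client."""
    start = clock()
    result = operation()
    elapsed_ms = (clock() - start) * 1000.0
    return result, elapsed_ms


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, e.g. pct=99 for p99."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]
```

Because the timer wraps the whole call, the measurement includes connection setup, network transit, and any client-side queuing, none of which the server can see.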

We’ve seen a variety of ways to increase the odds of success when using synchronous calls. These responsibilities fall on both the service provider and the service consumer. They range from making services idempotent, using canary deploys, and generating clients, to clients retrying requests and using circuit breaker patterns.

Do you have any other tips for ensuring the reliability of synchronous systems? If so, please leave feedback.