Resiliency patterns

Resiliency is the ability of a system to gracefully handle and recover from failures. The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is an increased likelihood that both transient and more permanent faults will arise. Detecting failures, and recovering quickly and efficiently, is necessary to maintain resiliency.

Pattern

Bulkhead

Isolate elements of an application into pools so that if one fails, the others will continue to function.

Read more

Compensating Transaction

Undo the work performed by a series of steps, which together define an eventually consistent operation.

Read more

Leader Election

Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances.

Read more

Retry

Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.

Read more

Circuit Breaker

Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.

Read more

Health Endpoint Monitoring

Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.

Read more

Queue-Based Load Leveling

Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.

Read more

Scheduler Agent Supervisor

Coordinate a set of actions across a distributed set of services and other remote resources.

Read more