3 Simple Steps to Resilience in .NET Applications

GuruPrakash December 01, 2024

Introduction

Modern applications are heavily dependent on external services, hosted either on the cloud or other machines. These services might occasionally become unavailable due to temporary network glitches, high server load, or other unforeseen issues. This might cause failed requests or service interruptions in your application.
You can implement a retry mechanism to make your application more robust. This strategy involves automatically reattempting failed requests after a brief delay. If the service recovers, the retry will likely succeed.
Resilience is the ability of a system to recover from failures and continue operating. It's not about preventing failures entirely, but about gracefully handling them to minimize disruptions.
This blog post describes how you can realize such resilience in your .NET applications using the Polly library, a powerful tool for handling transient faults.

What is Polly?

Polly is a versatile .NET library designed to enhance the resilience of your applications. It provides a simple and efficient way to implement strategies like:

  • Retry: Automatically reattempt failed operations, giving the underlying service a chance to recover.
  • Circuit Breaker: Prevent cascading failures by temporarily halting requests to a failing service.
  • Timeout: Set time limits for operations to avoid indefinite waits.
  • Bulkhead Isolation: Limit the impact of failures by isolating specific operations or services.
  • Rate Limiting: Control the frequency of requests to prevent overloading services.
  • Fallback: Provide alternative responses or actions when primary operations fail.

By including these policies in your code, you significantly improve the reliability and responsiveness of your application in the face of unexpected problems.

How does Polly work?

Polly uses "policies" to define how it should act on transient errors. These policies, usually configured during application start-up, define what should occur in the event of an error. You can inject these policies into your code via dependency injection or create them directly using factories.
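As a minimal sketch of the direct-creation approach (the endpoint URL here is a placeholder, not a real service):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

// Create a policy directly via the static Policy factory:
// retry up to 3 times when an HttpRequestException is thrown.
AsyncRetryPolicy retryPolicy = Policy
    .Handle<HttpRequestException>()
    .RetryAsync(retryCount: 3);

// Execute a delegate under the policy; failed attempts are
// retried transparently before the exception is surfaced.
using var httpClient = new HttpClient();

// Placeholder URL for illustration only.
HttpResponseMessage response = await retryPolicy.ExecuteAsync(
    () => httpClient.GetAsync("https://example.com/api/health"));
```

The same policy object could instead be registered in the service container at start-up and injected wherever it is needed, as the later examples show.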


Next, we'll look at some of the most important Polly policies and their practical applications:

Polly policies

Retry

If the fault is likely to be transient, you can retry the operation immediately or after a short delay.

Waiting between retries: wait for a configured interval before sending the request again. Techniques such as exponential backoff and jitter refine this by spacing retries out, so that the retries themselves don't become a source of further load or traffic spikes.

Circuit Breaker

A circuit breaker acts as a safety mechanism. It tracks the number of failed requests it receives. If the number of failures exceeds a defined threshold, it automatically stops routing new requests to the failing service. This prevents a cascade of failures and protects the overall system's stability.

A circuit breaker goes through three distinct states as it operates.

  1. Closed: This is the initial state. The circuit breaker allows all requests to be processed normally, even in the presence of occasional transient failures.
  2. Open: If the number of failures exceeds a predefined threshold, the circuit breaker enters the open state. In this state, all requests are immediately rejected, and no further strain is put on the failing service. The circuit breaker stays in this state for a specified duration.
  3. Half-Open: The circuit breaker moves to the half-open state once the open-state timeout elapses. Here, a limited number of requests are allowed through. If they all succeed, the circuit breaker returns to the closed state; if any of them fails, it moves back to the open state.

The primary role of circuit breakers is to handle remote service failures. They are generally not the best choice for handling exceptions from local resources, which call for their own error-handling strategies.

Fallback

The fallback policy lets you specify a default response or action to be executed when an operation fails, even after multiple retries. This provides a graceful way to handle failures and prevent unexpected errors from disrupting your application.

Timeout

A timeout policy sets a maximum time limit for an operation to complete. If the operation exceeds this limit, it's automatically canceled. Polly supports two timeout strategies:

  • Optimistic Timeout: Assumes the underlying operation honors cooperative cancellation via a cancellation token.
  • Pessimistic Timeout: Assumes the operation might not be cancelable. In this case, Polly abandons the operation and returns control to you; the abandoned task may keep running, and it's your responsibility to handle it appropriately.
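To illustrate the optimistic strategy, the sketch below assumes the delegate forwards Polly's cancellation token to the underlying call (the URL is a placeholder):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

// Optimistic timeout: Polly passes a CancellationToken to the delegate,
// which is expected to honor it cooperatively.
var timeoutPolicy = Policy.TimeoutAsync(
    TimeSpan.FromSeconds(10), TimeoutStrategy.Optimistic);

using var httpClient = new HttpClient();

// The delegate must forward the token (ct) so the operation
// can actually be canceled when the timeout elapses.
HttpResponseMessage response = await timeoutPolicy.ExecuteAsync(
    ct => httpClient.GetAsync("https://example.com/api/data", ct),
    CancellationToken.None);
```

If the delegate ignores the token, the optimistic timeout cannot interrupt it, which is exactly the situation the pessimistic strategy exists for.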

For additional information about these policies and their options for configuration, refer to Polly's official documentation.

Bulkhead Isolation

The bulkhead isolation policy is designed to prevent a system from catastrophic failures by limiting the impact of resource-intensive operations. It does this by allocating a fixed number of resources, such as threads or connections, to a specific operation or service. If this limit is reached, subsequent requests are either queued or rejected, preventing resource exhaustion and cascading failures.
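As a rough sketch of this (the concurrency numbers and `DoWorkAsync` below are illustrative placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Bulkhead;

// Allow at most 4 concurrent executions, with up to 2 more queued;
// anything beyond that is rejected with BulkheadRejectedException.
AsyncBulkheadPolicy bulkhead = Policy.BulkheadAsync(
    maxParallelization: 4,
    maxQueuingActions: 2,
    onBulkheadRejectedAsync: context =>
    {
        Console.WriteLine("Bulkhead rejected an execution");
        return Task.CompletedTask;
    });

// Placeholder for the resource-intensive operation being isolated.
static Task DoWorkAsync() => Task.Delay(100);

await bulkhead.ExecuteAsync(() => DoWorkAsync());
```

Callers should be prepared to catch `BulkheadRejectedException` (or pair the bulkhead with a fallback policy) for requests that exceed the configured capacity.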

Combining Polly Policies

You can combine multiple Polly policies to create more sophisticated strategies for handling a variety of failure scenarios. Later in this post, we will explore how to use PolicyWrap to combine these policies effectively and enhance your application's resilience.

Implementing Polly in a .NET Application

Install Polly NuGet package

Run the Install-Package Polly command in the NuGet Package Manager Console, or add the package with dotnet add package Polly from the command line.

Implementing a Simple Retry Policy

A simple retry policy reattempts a failed operation at fixed (or linearly growing) intervals. This may not be the best approach, however, since it can overload the system if the underlying service remains unavailable.

builder.Services.AddSingleton(x => {
  var _policy = Policy.Handle<HttpRequestException>().WaitAndRetryAsync(
    retryCount: 5,
    sleepDurationProvider: (retryCount) => TimeSpan.FromMilliseconds(300 * retryCount),
    onRetry: (exception, timeSpan, retryCount, context) => {
      Log.Logger.Information($"Begin retry {retryCount} for correlation {context.CorrelationId} with {timeSpan.TotalSeconds} seconds of delay.");
    });
  return _policy;
});

A better approach would be to use an exponential backoff algorithm. The algorithm increases the delay between retries exponentially, so that as the number of failures grows, the requests are made less and less frequently. This will avoid overloading the service and give sufficient time for recovery.

builder.Services.AddSingleton(p => {
  int firstRetryDelay = 45;
  int retryCount = 12;

  // Backoff.DecorrelatedJitterBackoffV2 comes from the Polly.Contrib.WaitAndRetry package.
  var jitter_delay = Backoff.DecorrelatedJitterBackoffV2(TimeSpan.FromMilliseconds(firstRetryDelay), retryCount);

  var _policy = Policy.Handle<HttpRequestException>().WaitAndRetryAsync(jitter_delay);

  return _policy;
});

Implementing an Async Circuit Breaker Policy

Here's an example of how to configure an asynchronous circuit breaker policy:

builder.Services.AddSingleton(x => {
  var _policy = Policy.Handle<HttpRequestException>().CircuitBreakerAsync(
    exceptionsAllowedBeforeBreaking: 5,
    durationOfBreak: TimeSpan.FromSeconds(5),
    onBreak: (exception, timeSpan) => {
      Log.Logger.Warning("Circuit breaker is now open");
      // add your other logic
    },
    onHalfOpen: () => {
      Log.Logger.Warning("Circuit breaker moved to half open");
      // add your other logic
    },
    onReset: () => {
      Log.Logger.Warning("Resetting circuit breaker");
      // add your other logic
    });

  return _policy;
});

Fallback Policy

var fallback_policy = Policy<string>
  .Handle<Exception>()
  .Fallback(
    fallbackAction: () => { /* Demonstrates fallback action/func syntax */
      return "Please try again later";
    },
    onFallback: e => {
      Log.Logger.Debug("Fallback invoked after the call eventually failed with: " + e.Exception.Message);
    }
  );

Network timeout policy

int timeout_inSecs = 10; // example timeout duration

var timeout_policy = Policy.TimeoutAsync(timeout_inSecs, TimeoutStrategy.Pessimistic,
  onTimeoutAsync: (context, timespan, _, _) => {
    Log.Logger.Error("Timeout during execution of the call");
    return Task.CompletedTask;
  });

Policy Wraps

AsyncPolicyWrap<string> CreateResilientPolicies()
{
    int maxRetries = 5;

    // Specify the type of exception that our policy can handle.
    // Alternately, we could specify the return results we would like to handle.
    var policy_builder = Policy<string>.Handle<Exception>();

    // Fallback policy:
    var fallback_policy = policy_builder.FallbackAsync((cancellationToken) =>
        {
            // In our case we return a default value once the retries are exhausted.
            Log.Logger.Information("{Now:u} - Fallback value is returned.", DateTime.Now);

            return Task.FromResult("Some value");
        });

    // Wait and Retry policy:
    // Retry with exponential backoff
    var retry_policy = policy_builder.WaitAndRetryAsync(maxRetries, retryAttempt =>
        {
            var waitTime = TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));
            Log.Logger.Information(
                "{Now:u} - RetryPolicy | Retry Attempt: {RetryAttempt} | WaitSeconds: {WaitSeconds}",
                DateTime.Now, retryAttempt, waitTime.TotalSeconds);

            return waitTime;
        });

    // The outermost policy is listed first: the fallback wraps the retries.
    return Policy.WrapAsync(fallback_policy, retry_policy);
}

var resilientPolicies = CreateResilientPolicies();
builder.Services.AddSingleton(resilientPolicies);

Usage

await resilientPolicies.ExecuteAsync(async () => await doSomething()).ConfigureAwait(false);

Conclusion

While Polly is a very powerful tool for handling transient failures, it's not a replacement for a general error-handling strategy. Use Polly as an addition to what your application already does.
