Retrying tasks with TPL, async and synchronous code

by Matthew Adams

In this world, we have to face up to the horrible possibility of failure. Sometimes, that failure is irrecoverable. There’s no going back. Sometimes, we can have another bash at it and see if it comes out OK the second time around. Or maybe the third. Or the fourth.

In the software business, historically, we’ve kind of got away with assuming that everything is going to work fine. Resources are ‘just there’ when we ask for them. They don’t mysteriously disappear only to reappear a moment later. Tasks complete if you set them going…

Of course, that was never really true, and the unreality of that attitude is all the more obvious when dealing with cloud services. Here resources really can be temporarily missing, refused or moved. Tasks may begin and never complete because our thread has suspended or died in some virtual infrastructure, only to be resurrected again later.

When we start an operation that may fail, there are a few scenarios that could play out:

1) A catastrophic error has occurred, so we want to propagate that exception straight out to the caller

2) A potentially transient error has occurred, so we want to retry the exact same operation again

3) We want to abandon the operation

Abandoning the operation is straightforward. We can use the standard .NET CancellationToken mechanism.
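As a rough illustration of that mechanism, here is a minimal sketch using only the standard CancellationTokenSource and CancellationToken types. The CountStepsAsync and RunAndAbandonAsync methods are hypothetical names invented for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationSketch
{
    // Hypothetical long-running operation that observes the token between steps.
    public static async Task<int> CountStepsAsync(int steps, CancellationToken token)
    {
        int completed = 0;
        for (int i = 0; i < steps; i++)
        {
            // Throws OperationCanceledException once cancellation has been requested.
            token.ThrowIfCancellationRequested();
            await Task.Delay(10, token);
            completed++;
        }

        return completed;
    }

    // Starts the operation, abandons it, and reports whether it was cancelled.
    public static async Task<bool> RunAndAbandonAsync()
    {
        using var cts = new CancellationTokenSource();
        Task<int> work = CountStepsAsync(1000, cts.Token);
        cts.Cancel();
        try
        {
            await work;
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
    }
}
```

The caller owns the CancellationTokenSource and the operation only ever sees the token, so the decision to abandon stays with whoever started the work.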

Scenarios 1 and 2 require a retry mechanism of some kind.

For those, we’ve provided the Endjin Retry Framework. (That link takes you to the NuGet package download.) Source code is also available, along with a sample.

Policy

Policies are used to determine whether we should consider retrying at all, given that a particular exception has occurred. The default policy is AnyException – you can always retry, regardless of the particular exception or its content.

We also provide an AggregatePolicy which allows you to retry if and only if all of a set of policies allow you to retry.

It is up to you to write a custom policy if you want particular exceptions to be “non-retryable”. For example, you might set up a policy that does not allow you to retry if you get a 404 (not found) from an HTTP operation, but does retry if you get a 500 (internal server error). To do this, you implement the (very simple) IRetryPolicy interface.
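To make the idea concrete, here is a self-contained sketch of such a policy. The interface shape (a single CanRetry(Exception) method) is an assumption made for this example rather than a copy of the framework's definition. HttpStatusException, AlwaysRetryPolicy and AllOfPolicy are hypothetical stand-ins, the last two for AnyException and AggregatePolicy.

```csharp
using System;
using System.Linq;
using System.Net;

// Assumed shape of the policy contract; the real IRetryPolicy may differ.
public interface IRetryPolicy
{
    bool CanRetry(Exception exception);
}

// Hypothetical exception type that carries an HTTP status code.
public sealed class HttpStatusException : Exception
{
    public HttpStatusException(HttpStatusCode statusCode)
        : base($"HTTP {(int)statusCode}")
    {
        StatusCode = statusCode;
    }

    public HttpStatusCode StatusCode { get; }
}

// Stand-in for the default behaviour: every exception is retryable.
public sealed class AlwaysRetryPolicy : IRetryPolicy
{
    public bool CanRetry(Exception exception) => true;
}

// Refuses to retry on 404; allows a retry on anything else, including 500.
public sealed class NotFoundIsFatalPolicy : IRetryPolicy
{
    public bool CanRetry(Exception exception)
    {
        return !(exception is HttpStatusException http
                 && http.StatusCode == HttpStatusCode.NotFound);
    }
}

// Stand-in for an aggregate policy: retry only if every inner policy agrees.
public sealed class AllOfPolicy : IRetryPolicy
{
    private readonly IRetryPolicy[] policies;

    public AllOfPolicy(params IRetryPolicy[] policies)
    {
        this.policies = policies;
    }

    public bool CanRetry(Exception exception)
    {
        return policies.All(p => p.CanRetry(exception));
    }
}
```

Keeping each policy to a single yes/no question makes them easy to combine: the aggregate simply asks every policy in turn and retries only when none objects.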

Strategy

The strategy determines how a task is retried. There are two phases to this. In the first phase, we prepare to retry given a particular exception, and calculate an optional delay before we retry the operation. In the second phase, the framework checks whether we are allowed to attempt a retry (we haven’t, for example, exceeded some maximum number of retries).

The strategy also aggregates the exceptions that caused us to require a retry attempt, and raises a Retrying event just before a retry attempt occurs.
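The two phases, the aggregated exceptions and the Retrying event can be sketched as follows. The member names echo the ones mentioned in this article, but their signatures and the driving loop are assumptions made for illustration, not the framework's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical strategy; the signatures are assumed, not taken from the framework.
public sealed class CountingStrategy
{
    private readonly int maxRetries;
    private readonly List<Exception> failures = new List<Exception>();

    public CountingStrategy(int maxRetries)
    {
        this.maxRetries = maxRetries;
    }

    // Raised just before each retry attempt.
    public event EventHandler Retrying;

    // Phase two: have we stayed within the allowed number of retries?
    public bool CanRetry => failures.Count <= maxRetries;

    // Phase one: record the failure and work out how long to wait.
    public TimeSpan PrepareToRetry(Exception lastException)
    {
        failures.Add(lastException);
        return TimeSpan.Zero;
    }

    // Every exception seen so far, wrapped for the caller.
    public AggregateException ToAggregateException() => new AggregateException(failures);

    public void OnRetrying() => Retrying?.Invoke(this, EventArgs.Empty);
}

public static class RetryLoop
{
    public static T Run<T>(Func<T> operation, CountingStrategy strategy)
    {
        while (true)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                TimeSpan delay = strategy.PrepareToRetry(ex);
                if (!strategy.CanRetry)
                {
                    throw strategy.ToAggregateException();
                }

                strategy.OnRetrying();
                if (delay > TimeSpan.Zero)
                {
                    Thread.Sleep(delay);
                }
            }
        }
    }
}
```

When the retries run out, the caller receives every failure wrapped in a single AggregateException rather than just the last one.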

We provide three strategies in the box:

Count: retries immediately, up to a maximum number of times.

Incremental: retries up to a maximum number of times, with an (optionally increasing) delay between retries.

Backoff: is similar to Incremental, but provides an exponentially increasing delay between retries, with a random element.
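The difference between the three comes down to how the delay before each retry is calculated. The sketch below shows one plausible set of formulas. The parameter names, the zero-based retry index and the jitter range are assumptions for illustration, and the framework's own calculations may differ.

```csharp
using System;

public static class DelaySketch
{
    // Count: no delay at all, the retry happens immediately.
    public static TimeSpan Count(int retry) => TimeSpan.Zero;

    // Incremental: an initial delay plus a fixed step for each retry already made.
    // A step of zero gives a constant delay.
    public static TimeSpan Incremental(int retry, TimeSpan initial, TimeSpan step)
    {
        return initial + TimeSpan.FromTicks(step.Ticks * retry);
    }

    // Backoff: the delay doubles with each retry, is scaled by a random factor
    // between 50% and 100%, and is capped at a maximum.
    public static TimeSpan Backoff(int retry, TimeSpan baseDelay, TimeSpan max, Random random)
    {
        double exponential = baseDelay.TotalMilliseconds * Math.Pow(2, retry);
        double jittered = exponential * (0.5 + (random.NextDouble() * 0.5));
        return TimeSpan.FromMilliseconds(Math.Min(jittered, max.TotalMilliseconds));
    }
}
```

The random element in the backoff calculation stops many clients that failed at the same moment from all retrying at the same moment too.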

Note – you don’t want to use the Incremental or Backoff strategies in Windows Azure. It is better to hammer the fabric and let it adapt to your preferred usage pattern.

If you want to implement your own strategy, you inherit from RetryStrategy and override the PrepareToRetry() and CanRetry() methods. Here’s an example.
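The following is a sketch of what a custom strategy might look like, not the framework's own sample. The RetryStrategy base class is redeclared here with an assumed shape so that the example stands alone, and DeadlineStrategy, which keeps retrying until a time budget runs out, is a hypothetical strategy invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the base class; the real RetryStrategy may differ.
public abstract class RetryStrategy
{
    private readonly List<Exception> failures = new List<Exception>();

    public IReadOnlyList<Exception> Failures => failures;

    protected void AddFailure(Exception exception) => failures.Add(exception);

    // Phase one: record the failure and return the delay before the next attempt.
    public abstract TimeSpan PrepareToRetry(Exception lastException);

    // Phase two: decide whether another attempt is allowed.
    public abstract bool CanRetry();
}

// Hypothetical custom strategy: retry at a fixed interval until a deadline.
public sealed class DeadlineStrategy : RetryStrategy
{
    private readonly Func<DateTimeOffset> clock;
    private readonly DateTimeOffset deadline;
    private readonly TimeSpan delay;

    // The clock is injected so that the strategy can be tested without waiting.
    public DeadlineStrategy(TimeSpan budget, TimeSpan delay, Func<DateTimeOffset> clock)
    {
        this.clock = clock;
        this.delay = delay;
        this.deadline = clock() + budget;
    }

    public override TimeSpan PrepareToRetry(Exception lastException)
    {
        AddFailure(lastException);
        return delay;
    }

    // Allow a retry only if the wait would end before the deadline.
    public override bool CanRetry() => clock() + delay < deadline;
}
```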

Scenario One: TPL

If you want to start a new Task using the TPL, but automatically retry that task using our retry framework, we provide source-compatible replacements for Task<T>.Factory.StartNew() and Task.Factory.StartNew(), called RetryTask<T>.Factory.StartNew() and RetryTask.Factory.StartNew().

Each of the overloads also takes additional optional parameters for the Strategy and Policy, defaulting to Count (with a maximum of 5 retries) and AnyException respectively.

Here’s an example of that.
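Because the framework itself is not reproduced here, the sketch below uses a hypothetical stand-in, RetryTaskSketch.StartNew(), to show the idea: the same call shape as Task.Factory.StartNew(), with optional retry settings added on the end. The parameter names and the retry loop are assumptions. With the real package you would call RetryTask<T>.Factory.StartNew() instead.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class RetryTaskSketch
{
    // Hypothetical stand-in: starts a task that re-runs the function until it
    // succeeds, the retries run out, or canRetry rejects the exception.
    public static Task<T> StartNew<T>(
        Func<T> function,
        CancellationToken cancellationToken = default,
        int maxRetries = 5,
        Func<Exception, bool> canRetry = null)
    {
        return Task.Factory.StartNew(
            () =>
            {
                var failures = new List<Exception>();
                while (true)
                {
                    // Abandon the whole operation if the caller has cancelled.
                    cancellationToken.ThrowIfCancellationRequested();
                    try
                    {
                        return function();
                    }
                    catch (Exception ex) when (canRetry == null || canRetry(ex))
                    {
                        failures.Add(ex);
                        if (failures.Count > maxRetries)
                        {
                            throw new AggregateException(failures);
                        }
                    }
                }
            },
            cancellationToken,
            TaskCreationOptions.None,
            TaskScheduler.Default);
    }
}
```

A call then looks just like the TPL original, for example `RetryTaskSketch.StartNew(() => LoadValue())`, where LoadValue is whatever potentially flaky function you want to run. Exceptions that the canRetry delegate rejects are not caught, so they fault the task straight away.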

Scenario Two: Async

Sometimes, you are using the async/await pattern and need to call an async method. In that case, you can use our Retriable.RetryAsync() method.

This also works for inline async delegates.
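Both forms are sketched below against a hypothetical stand-in, RetriableSketch.RetryAsync(), whose signature is an assumption rather than the framework's own. FlakyService is likewise invented for the example and fails on its first two calls.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RetriableSketch
{
    // Hypothetical stand-in: awaits the operation, retrying on any exception.
    public static async Task<T> RetryAsync<T>(Func<Task<T>> operation, int maxRetries = 5)
    {
        var failures = new List<Exception>();
        while (true)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex)
            {
                failures.Add(ex);
                if (failures.Count > maxRetries)
                {
                    throw new AggregateException(failures);
                }
            }
        }
    }
}

// Invented service whose async method fails on its first two calls.
public sealed class FlakyService
{
    public int Calls { get; private set; }

    public async Task<string> GetGreetingAsync()
    {
        Calls++;
        await Task.Yield();
        if (Calls < 3)
        {
            throw new TimeoutException("transient failure");
        }

        return "hello";
    }
}

public static class AsyncUsage
{
    public static async Task<string> RunAsync(FlakyService service)
    {
        // Calling an existing async method.
        string first = await RetriableSketch.RetryAsync(() => service.GetGreetingAsync());

        // Using an inline async delegate.
        string second = await RetriableSketch.RetryAsync(async () =>
        {
            string greeting = await service.GetGreetingAsync();
            return greeting.ToUpperInvariant();
        });

        return first + " " + second;
    }
}
```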

Scenario Three: Synchronous

Finally, it works just as well for a synchronous method.
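A synchronous version can be sketched in the same way. SyncRetrySketch.Retry() is a hypothetical stand-in with assumed parameters, not the framework's method. Here the isTransient delegate plays the role of the policy, so a catastrophic error propagates straight out to the caller while a transient one is retried.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class SyncRetrySketch
{
    // Hypothetical stand-in: runs the action on the calling thread, retrying
    // only while isTransient says the failure might go away.
    public static void Retry(
        Action action,
        Func<Exception, bool> isTransient,
        int maxRetries = 5,
        int delayMilliseconds = 0)
    {
        var failures = new List<Exception>();
        while (true)
        {
            try
            {
                action();
                return;
            }
            catch (Exception ex) when (isTransient(ex))
            {
                failures.Add(ex);
                if (failures.Count > maxRetries)
                {
                    throw new AggregateException(failures);
                }

                // Block the calling thread for the delay before trying again.
                Thread.Sleep(delayMilliseconds);
            }
        }
    }
}
```

Because the exception filter rejects non-transient failures before the catch block runs, a fatal exception reaches the caller unwrapped and with its original stack trace.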