Error handling strategy

Temporal automatically retries failed Activities and recovers from infrastructure failures through Durable Execution. But not all failures should be retried. This page covers how to categorize failures, when to mark errors as non-retryable, and how to implement compensation when retries are not enough.

For background on how Temporal represents and propagates failures, see Application failures.

Categorize failures

When an operation fails, the appropriate response depends on the nature of the failure. Failures fall into three categories based on whether retrying can resolve them.

Transient failures

A transient failure is a one-off event that resolves on its own without intervention. For example, a Worker happens to make a network request at the exact moment an administrator replaces a network cable. The cause is unlikely to affect future requests.

Transient failures are resolved by retrying the operation shortly after the failure. Temporal's default Retry Policy handles transient failures automatically.

Intermittent failures

An intermittent failure is one that recurs but resolves over time. For example, a service that uses rate limiting will reject requests once the threshold is reached, but will accept requests again after the rate limiter resets.

Intermittent failures require retries spaced out over a longer period. Configure your Retry Policy with an appropriate backoffCoefficient and maximumInterval to avoid overwhelming the failing service.
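To see how these two settings shape a retry schedule, the following plain-Python sketch computes the wait before each retry. It assumes the interval starts at an initial value, is multiplied by the backoff coefficient after each attempt, and is capped at the maximum interval. The function `retry_delays` and its parameter names are illustrative only and are not part of any SDK.

```python
from datetime import timedelta
from typing import List


def retry_delays(initial_interval: timedelta,
                 backoff_coefficient: float,
                 maximum_interval: timedelta,
                 retries: int) -> List[timedelta]:
    """Return the wait before each of the first `retries` retries."""
    delays = []
    for n in range(retries):
        # Exponential growth: initial * coefficient^n, capped at the maximum.
        delay = initial_interval * (backoff_coefficient ** n)
        delays.append(min(delay, maximum_interval))
    return delays


# A rate-limited service: start at 1s, double each time, never wait more than 30s.
schedule = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), retries=7)
print([d.total_seconds() for d in schedule])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A larger coefficient spreads retries out faster, while the cap keeps the wait from growing without bound once the service has had time to recover.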

Permanent failures

A permanent failure is one that will recur indefinitely until the cause is fixed. For example, a request that fails due to an invalid email address will continue to fail no matter how many times the operation retries. The only resolution is to correct the email address.

Permanent failures cannot be resolved through retries. They require different input data, a code fix, or some external intervention. Mark these errors as non-retryable to fail fast instead of consuming resources on retries that will not succeed.

Mark errors as non-retryable

When your code detects a permanent failure, mark the error as non-retryable to prevent unnecessary retry attempts.

Use non-retryable errors for situations like:

Invalid input data: A malformed email address, a negative payment amount, or a missing required field.
Business rule violations: A customer outside the service area, an order exceeding credit limits, or an expired promotion code.
Authorization failures: The caller does not have permission to perform the operation.
Data validation errors: A referenced record does not exist, or data fails integrity checks.

There are two ways to mark errors as non-retryable:

In the Activity (implementer decides): Set the non_retryable flag when throwing an Application Failure. This enforces the constraint for all callers. Use this when the Activity implementer knows that the error can never be resolved through retries.

In the Retry Policy (caller decides): Add the error type to the Retry Policy's list of non-retryable error types. This lets different Workflows make different decisions about the same Activity. Use this when the decision depends on the caller's business logic.
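The two mechanisms combine as a logical OR: a failure is not retried if either the implementer flagged it or the caller listed its type. The sketch below is a simplified model of that decision in plain Python, not SDK code. `AppFailure`, `CallerRetryPolicy`, and `should_retry` are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


class AppFailure(Exception):
    """Hypothetical stand-in for an Application Failure raised by an Activity."""

    def __init__(self, message: str, error_type: str, non_retryable: bool = False):
        super().__init__(message)
        self.error_type = error_type
        self.non_retryable = non_retryable


@dataclass
class CallerRetryPolicy:
    """Hypothetical stand-in for the Retry Policy chosen by the calling Workflow."""
    non_retryable_error_types: List[str] = field(default_factory=list)


def should_retry(failure: AppFailure, policy: CallerRetryPolicy) -> bool:
    # Implementer's decision: the flag on the failure applies to every caller.
    if failure.non_retryable:
        return False
    # Caller's decision: this Workflow opted out of retrying this error type.
    if failure.error_type in policy.non_retryable_error_types:
        return False
    return True


invalid_email = AppFailure("malformed address", "InvalidEmail", non_retryable=True)
card_declined = AppFailure("card declined", "CardDeclined")

strict_caller = CallerRetryPolicy(non_retryable_error_types=["CardDeclined"])
lenient_caller = CallerRetryPolicy()

print(should_retry(invalid_email, lenient_caller))  # False: flagged by the implementer
print(should_retry(card_declined, strict_caller))   # False: listed by this caller
print(should_retry(card_declined, lenient_caller))  # True: neither side opted out
```

In the Python SDK, the corresponding pieces are the `non_retryable` argument of `ApplicationError` and the `non_retryable_error_types` field of `RetryPolicy`; see the SDK guide for exact usage.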

Use non-retryable errors sparingly. In most cases, let the Retry Policy handle retry limits through timeouts and maximum attempts. Reserve non_retryable for cases where retrying is guaranteed to be futile.

For SDK-specific syntax and code examples, see the error handling guide for your language.

Design Activities for idempotence

Activities may execute more than once due to retries, so design them to be idempotent: an idempotent Activity produces the same result whether it runs once or several times.

This is especially important because of an edge case in distributed systems. A Worker can execute an Activity, complete it, and then crash before reporting the result to the Temporal Service. The Activity is retried even though it completed, because the Service has no record of the completion.

Use idempotency keys to prevent duplicate operations. Combining the Workflow Run ID and the Activity ID yields a key that stays the same across retries of an Activity but differs between Workflow Executions.
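As a minimal sketch of the idea, the code below builds a key from the two IDs and passes it to a downstream service that deduplicates on it. `idempotency_key`, `PaymentService`, and the sample ID values are hypothetical; a real payment provider would expose its own idempotency mechanism.

```python
from typing import Dict


def idempotency_key(workflow_run_id: str, activity_id: str) -> str:
    # Both IDs are the same on every retry of this Activity, so the key is stable.
    return f"{workflow_run_id}:{activity_id}"


class PaymentService:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._charges_by_key: Dict[str, str] = {}

    def charge(self, key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of charging again.
        if key in self._charges_by_key:
            return self._charges_by_key[key]
        charge_id = f"charge-{len(self._charges_by_key) + 1}"
        self._charges_by_key[key] = charge_id
        return charge_id

    def charge_count(self) -> int:
        return len(self._charges_by_key)


service = PaymentService()
key = idempotency_key("run-7f3a", "activity-2")

first = service.charge(key, 4999)   # first attempt
retry = service.charge(key, 4999)   # retry after a lost completion report
other = service.charge(idempotency_key("run-9b1c", "activity-2"), 4999)

print(first, retry, other, service.charge_count())
# charge-1 charge-1 charge-2 2
```

Inside a Python SDK Activity, the two IDs are available as `activity.info().workflow_run_id` and `activity.info().activity_id`.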

For a detailed explanation, see Activity idempotence.

Implement compensation with the Saga pattern

Some operations cannot be "retried away." When a multi-step process fails partway through, previous steps may need to be undone. The Saga pattern provides a structured way to handle this.

What is the Saga pattern

A saga coordinates a sequence of operations where each operation has a corresponding compensating action that reverses its effects. If any operation in the sequence fails, the compensating actions for previously completed operations execute in reverse order.

For example, an order fulfillment process might involve three steps:

1. Reserve inventory (compensating action: release inventory)
2. Charge payment (compensating action: refund payment)
3. Create shipment (compensating action: cancel shipment)

If the payment charge fails, the saga runs the compensation for step 1 (release inventory). If the shipment fails, the saga runs compensations for steps 2 and 1 (refund payment, then release inventory).

When to use it

Use the Saga pattern when:

A Workflow involves multiple steps that produce side effects in external systems.
Each step can be reversed with a compensating action.
Retrying the failed step is not sufficient because earlier steps have already committed changes.

The Saga pattern is not needed when Temporal's built-in retries can resolve the failure, or when operations are naturally idempotent and do not produce side effects that need to be reversed.

Designing compensating actions

Each forward action needs a corresponding compensating action. Keep these guidelines in mind:

Make compensating actions idempotent. Compensations may also be retried, so they must be safe to execute more than once.
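Add compensations before executing the step. Register each compensating action before running the corresponding forward action, so the compensation is available if the forward action partially completes and then fails. As a consequence, a compensation can run for a step that never took effect, so it must treat that case as a safe no-op.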
Run compensations in reverse order. Undo operations in the opposite order from which they were performed to maintain data consistency.
Handle compensation failures. A compensating action can itself fail. Log the failure and continue executing remaining compensations rather than stopping. This prevents a single compensation failure from leaving the system in a partially rolled-back state.

Example: order fulfillment

The following pseudocode shows the structure of a Saga implementation in a Workflow:

compensations = []

try:
    // Step 1: Reserve inventory
    compensations.add(release_inventory)
    execute reserve_inventory(order)

    // Step 2: Charge payment
    compensations.add(refund_payment)
    execute charge_payment(order)

    // Step 3: Create shipment
    compensations.add(cancel_shipment)
    execute create_shipment(order)

    return success

catch error:
    // Run compensations in reverse order
    for each compensation in reverse(compensations):
        try:
            execute compensation(order)
        catch compensation_error:
            log("Compensation failed", compensation_error)

    raise ApplicationFailure("Order failed", cause: error)

In Temporal, compensating actions are implemented as Activities. Because Workflow state is preserved by Durable Execution, the compensation list survives Worker crashes, and each compensation Activity is retried according to its Retry Policy. This makes the Saga pattern more straightforward to implement than in systems without Durable Execution.
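The same structure can be sketched in plain Python. This is an illustration under simplifying assumptions, not a Temporal Workflow: ordinary functions stand in for Activities, `create_shipment` is hard-coded to fail, and the `log` list exists only to show the order in which actions run. All function names are hypothetical.

```python
from typing import Callable, Dict, List

log: List[str] = []  # records which actions ran, in order


def reserve_inventory(order: Dict) -> None:
    log.append("reserve_inventory")


def release_inventory(order: Dict) -> None:
    log.append("release_inventory")


def charge_payment(order: Dict) -> None:
    log.append("charge_payment")


def refund_payment(order: Dict) -> None:
    log.append("refund_payment")


def create_shipment(order: Dict) -> None:
    raise RuntimeError("carrier rejected shipment")  # simulated failure


def cancel_shipment(order: Dict) -> None:
    log.append("cancel_shipment")  # safe no-op if nothing was shipped


def fulfill_order(order: Dict) -> str:
    steps = [
        (reserve_inventory, release_inventory),
        (charge_payment, refund_payment),
        (create_shipment, cancel_shipment),
    ]
    compensations: List[Callable[[Dict], None]] = []
    try:
        for forward, compensate in steps:
            compensations.append(compensate)  # register before running the step
            forward(order)
        return "success"
    except Exception as error:
        # Undo in reverse order; keep going even if one compensation fails.
        for compensate in reversed(compensations):
            try:
                compensate(order)
            except Exception as compensation_error:
                log.append(f"compensation failed: {compensation_error}")
        raise RuntimeError("Order failed") from error


try:
    fulfill_order({"id": "order-1"})
except RuntimeError as exc:
    print(exc, "| caused by:", exc.__cause__)
print(log)
```

Because each compensation is registered before its forward step runs, `cancel_shipment` runs even though the shipment was never created, which is why compensations need to tolerate that case.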

For SDK-specific implementations with working code examples, see the error handling guide for your language:

Python
