When an application needs to integrate with external APIs, making workflows fault-tolerant and resumable can be a subtle, easily overlooked requirement. Why design a workflow with these qualities in mind, and what are the pitfalls?
Reasons to make a workflow fault-tolerant and resumable
“Fault-tolerant and resumable” describes a set of error handling strategies for maintaining integrity and reaching eventual data consistency in the face of expected and unexpected errors, which makes it a very important quality in certain workflows.
When integrating with external systems, we’ll generally want to keep track of their data in a database that we fully control; that may include external IDs, monetary amounts, taxes, and any other valuable data returned to our system.
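As a concrete reference for the examples that follow, here’s a minimal sketch of the local schema this article implies; the migration, table, and column names are illustrative assumptions:

class AddExternalReferences < ActiveRecord::Migration[7.1]
  def change
    # The account created on the external system for a contract
    add_column :contracts, :external_account_id, :string

    # Per-funding bookkeeping used later to resume the workflow
    add_column :fundings, :released, :boolean, default: false, null: false
    add_column :fundings, :ledger_transaction_id, :string
    add_column :fundings, :status, :string
  end
end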
But how do we ensure those data references are consistent and correctly in sync with the external system?
What not to do
Let’s use the following example workflow:
- Given a contract with fundings, issue an API request to create an external account;
- Iterate over existing contract fundings, and for each:
  - Issue an API request to fund the external account;
  - Mark the funding as released.
Here’s a naive way to design such a workflow:
ApplicationRecord.transaction do
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)

  contract.fundings.each do |funding|
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
Let’s assume the first funding fails to be updated, and an exception is raised. What happens? The database transaction would be rolled back, and all of the local state created within the transaction would evaporate. External state, however, wouldn’t be rolled back; we’d have an external account, potentially with a few fundings, not tracked anywhere in our local system.
Worse yet, a new attempt to run the same workflow would yield the same side effects, which means the workflow is not idempotent. That may be far from acceptable depending on our requirements and how the external system works.
What to do
Divide into control points
When designing a workflow like that, we’ll never want to roll back local transactions. A much better version of our workflow would be:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
The first noticeable improvement is that there is no transaction. Second, we’re guarding the critical steps of our workflow under conditionals; if the local state for a step is already recorded, we skip that step. In other words, we’re defining units of work that serve as control points, allowing the computation to resume where it left off with the help of persisted state. Creating the account is the first unit of work, and funding the account is the second.
Now, if a funding fails to be updated, at least the creation of the external account would be persisted. Rerunning the workflow would skip creation of the external account and hopefully release the remaining fundings, and we’d be good to go!
That improves the resumability of our workflow but is still not ideal. Let’s find out why.
Roll back external state
The problem when a funding fails to be released is that its external counterpart would not be rolled back:
fund_external_account(contract, funding)
# If failing here, the previous command wouldn't roll back
funding.update!(released: true)
If we rerun our workflow successfully, we’d be double-funding the external account, which would be a very dangerous and undesirable side effect. The solution is to roll back the external funding, but how?
A general answer to that question lies in the Saga pattern. Since external systems are unaware of local transactions, the solution is to emit compensating actions that roll back the external state established within the current unit of work. How to do that varies depending on what the external system provides.
In our example, a “funding” creates ledger entries. One possible approach would involve a new API request to delete the extra ledger entries, but let’s assume that’s not possible. Let’s assume instead that ledger entries can’t be deleted, but they can be created with a pending status and then archived if something goes wrong. If all goes well, we can move their status to posted. Let’s introduce that change into our snippet:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true)
      PostLedgerEntries.call(ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.call(ledger_transaction_id)
      end
      # Reraising is important to signal a failed workflow
      raise e
    end
  end
end
Archiving ledger entries is how we roll back fundings. That’s great! So, is our mission accomplished? Not yet.
Run select parts in the background
What if the “archive” step throws an exception?
rescue Exception => e
  if ledger_transaction_id
    # What if this line throws an exception?
    ArchiveLedgerEntries.call(ledger_transaction_id)
  end
  # ...
When rerunning the workflow, the failed funding would be left orphaned and unreleased, and another funding could be created with a posted status. That would be disastrous since we’re dealing with money. And the same would happen if the entries failed to be posted!
We can solve that problem by running the “archive” and “post” steps in another process or thread. Background processing tools such as Sidekiq have built-in retry capabilities, making our workflow more likely to reach eventual consistency. Moving parts into the background should be analyzed case by case, but it makes sense in our example. Let’s do it:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.perform_async(ledger_transaction_id)
      end
      # Reraising is important to signal a failed workflow
      raise e
    else
      PostLedgerEntries.perform_async(ledger_transaction_id)
    end
  end
end
We’ve turned ArchiveLedgerEntries and PostLedgerEntries into jobs with a perform_async method, which can be independently dispatched to the background. Unfortunately, no solution is perfect, so there’s still a chance that retries would be exhausted and neither step would ever finish. In that case, we could learn something new about the domain and improve our code, or even devise tools to detect and repair inconsistencies, but we won’t cover that in this article.
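For reference, here’s a minimal sketch of what one of those jobs could look like with Sidekiq; the class body and the archive_ledger_entries helper are illustrative assumptions:

class ArchiveLedgerEntries
  include Sidekiq::Job

  # Sidekiq retries failed jobs with exponential backoff by default;
  # the retry count is configurable per job
  sidekiq_options retry: 10

  def perform(ledger_transaction_id)
    # Hypothetical API call that archives the entries on the external ledger
    archive_ledger_entries(ledger_transaction_id)
  end
end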
Use locks to handle concurrency
Our workflow should be perfect now that it’s resumable and fault-tolerant; API requests shouldn’t be able to make it through more than once, right? Wrong.
What if two or more instances run at exactly the same time and read back the same state? One or more if conditions could resolve to truthy, which would result in two single-funded accounts (one of them being orphaned), or, even worse, one double-funded account! When it comes to critical workflows, we can’t assume this will never happen. It can happen, and it’s more frequent than one would imagine. For example:
- A user could click on a submit button multiple times and trigger multiple web requests (if the workflow is synchronous);
- Two or more jobs could accidentally be scheduled for the same contract and be executed concurrently (if the workflow is asynchronous);
- Two or more people could trigger our workflow on hundreds of contracts and, due to stale UIs on either side, trigger it concurrently for the same contract.
Many things could happen, and resolving the issue on the front end alone is never enough for a critical workflow. The back end should guarantee data consistency.
Since our workflow is backed by a contract database row, a simple solution is to take a row-level lock when working with PostgreSQL or another compliant database. The lock serves solely as a global mutex across processes and machines. However, row-level locks run inside a transaction, and that’s where we need to be careful. We want to ensure our transaction is always committed and never rolled back, so the following pattern is advisable:
captured_exception = nil

# Warning, implicit transaction here!
contract.with_lock do
  contract.reload
  # Execute work
rescue Exception => e
  # Let the transaction commit no matter what
  captured_exception = e
end

raise captured_exception if captured_exception
Let’s apply that pattern to our code snippet:
captured_exception = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
      rescue Exception => e
        if ledger_transaction_id
          ArchiveLedgerEntries.perform_async(ledger_transaction_id)
        end
        raise e
      else
        PostLedgerEntries.perform_async(ledger_transaction_id)
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

raise captured_exception if captured_exception
Now only one workflow instance can hold the lock at a time, while concurrent instances are blocked until they are able to acquire it.
Note that we’d usually resort to advisory locks when dealing with application-level locks, but they are harder to test under transactional fixtures. If you have a backing object like contract, row-level locks should be pretty much equivalent in functionality.
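For completeness, here’s a minimal sketch of the advisory lock alternative, assuming PostgreSQL; deriving the numeric lock key with Zlib.crc32 is just one possible scheme:

require "zlib"

# Derive a stable numeric key for this contract's workflow
lock_key = Zlib.crc32("contract_workflow_#{contract.id}")

connection = ActiveRecord::Base.connection

# Session-level advisory lock: blocks until acquired, no transaction needed
connection.execute("SELECT pg_advisory_lock(#{lock_key})")
begin
  # Execute work, free of any implicit transaction
ensure
  connection.execute("SELECT pg_advisory_unlock(#{lock_key})")
end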
Fixing possible transient failures
Can you spot an innocuous bug now that we’ve introduced locks into our code? 👀
Answer: we’re scheduling background jobs before committing the transaction, so there’s a chance the lookup by ledger_transaction_id would fail within the job because it wouldn’t see uncommitted data. It would, however, likely succeed on the next try, because the database transaction would have been committed by then. This is a reminder to be careful when mixing background jobs with database transactions!
Fixing our transient issue is a bit tricky: it requires keeping track of successful and failed ledger transaction ids and moving the scheduling of background jobs higher up the stack, outside of any database transactions:
captured_exception = nil
ok_ledger_transaction_ids = []
failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id
        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
Thank you, Elias Rodrigues, for calling out this bug!
Use idempotency keys
Some APIs (such as Stripe) provide a feature called “Idempotency keys” to safely retry requests without performing the same side effect twice. To make it work, we provide a unique key along with the request. Let’s see how we could model our first step, creating the account, with that in mind:
external_account_id = create_external_account(
  contract,
  idempotency_key: "create_account_#{contract.id}"
)

contract.update!(external_account_id: external_account_id)
Our idempotency key uses contract.id to ensure uniqueness, which means only a single account would be created if we ran the same workflow multiple times. Does that render our guarding if condition unnecessary? Should we drop it? I don’t believe so.
Unfortunately, idempotency keys usually expire after 24 hours (look up your API’s documentation). Considering that tools like Sidekiq may attempt retries for around three weeks after the first attempt, it’s not safe to rely only on an idempotency key. I would advise having both the guard conditions and the idempotency keys if possible, for extra security. That would make the requests safer to use outside of the workflow.
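Combining both techniques, the first step of our workflow would look like this:

if contract.external_account_id.nil?
  external_account_id = create_external_account(
    contract,
    idempotency_key: "create_account_#{contract.id}"
  )
  contract.update!(external_account_id: external_account_id)
end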
When using idempotency keys, we should make sure to change the key’s value if we ever want to retry our workflow after a validation error or another expected error, so be careful! That requires tracking the key’s version with the help of database state, whether in Postgres, Redis, etc.:
begin
  create_external_account(
    contract,
    idempotency_key: "create_account_#{idempotency_key.version}"
  )
rescue *VALIDATION_ERRORS => e
  idempotency_key.increment_version
  raise e
end
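Here’s a sketch of what that tracking could look like with an ActiveRecord-backed model; the IdempotencyKey class and its columns are hypothetical:

class IdempotencyKey < ApplicationRecord
  # Assumed columns: key (string, unique index), version (integer, default: 1)

  def self.for(key)
    find_or_create_by!(key: key)
  end

  def increment_version
    increment!(:version)
  end
end

idempotency_key = IdempotencyKey.for("create_account_#{contract.id}")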
How much error handling do you need?
If you’re paying attention, you may have noticed a possible point of failure. The following snippet has no error handler:
ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end
What if the connection to Redis fails (assuming Sidekiq) and throws an exception? We’d be left with a pending ledger transaction! Among a few possible solutions, we could record the funding status:
funding.update!(
  released: true,
  status: "pending",
  ledger_transaction_id: ledger_transaction_id
)
If an error is raised when trying to schedule the jobs, assuming that a retry would be attempted, we could initialize ok_ledger_transaction_ids with the inconsistent ids:
# Using example Rails code
ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)
When the workflow reached the bottom again, it would retry scheduling the jobs that previously failed to be scheduled. Here’s the final code:
captured_exception = nil
ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)
failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(
          status: "pending",
          released: true,
          ledger_transaction_id: ledger_transaction_id
        )
        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id
        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
We’re assuming that PostLedgerEntries will update the funding status to posted and ArchiveLedgerEntries to archived.
There could be room for improvement. For instance, would it make sense to only archive a ledger transaction if PostLedgerEntries retries are exhausted? It depends on our requirements and on how long we can keep a pending funding around.
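If we went that route, Sidekiq’s sidekiq_retries_exhausted hook would be one way to express it. Here’s a minimal sketch, reusing the hypothetical post_ledger_entries API call and assuming a Funding model:

class PostLedgerEntries
  include Sidekiq::Job

  # Called by Sidekiq once all retries for a job have failed
  sidekiq_retries_exhausted do |job, _exception|
    ledger_transaction_id = job["args"].first
    ArchiveLedgerEntries.perform_async(ledger_transaction_id)
  end

  def perform(ledger_transaction_id)
    # Hypothetical API call that posts the entries on the external ledger
    post_ledger_entries(ledger_transaction_id)
    Funding.where(ledger_transaction_id: ledger_transaction_id)
           .update_all(status: "posted")
  end
end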
When to choose a resumable strategy?
Choose a resumable error handling strategy when:
- The workflow is composed of discrete steps that run independently;
- There is loose coupling between the steps;
- The workflow has a reasonable number of steps and API requests to deal with;
- You can bear temporary inconsistency;
- There is a retry mechanism in place, such as a background processing tool or an external API that retries webhooks on errors.
For workflows running in the background, it is possible to use WebSockets to broadcast flash messages and other data if delivering feedback to the user is essential.
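With Rails, for instance, that could be as simple as broadcasting over Action Cable at the end of the workflow; the stream name and payload are illustrative:

# At the end of the background workflow
ActionCable.server.broadcast(
  "contract_#{contract.id}",
  { event: "workflow_finished", flash: "Fundings released successfully" }
)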
Conclusion
There is no single answer on how to handle errors, so the strategy should always be analyzed case by case. The key idea behind this post is to present a few techniques that will hopefully inspire your own error handling strategies, which are especially important in critical workflows.
Make sure that you handle both expected and unexpected errors. Never assume unexpected errors can’t happen. They can and will happen, for a multitude of possible reasons: API timeouts, outages, cloud instances restarting, unstable internet connections, unexpected database constraints kicking in, and so on. Whenever possible, design with failure in mind.