When an application needs to integrate with external APIs, making workflows fault-tolerant and resumable can be a subtle, easily overlooked requirement. Why design a workflow with these qualities in mind, and what are the pitfalls?
Reasons to make a workflow fault-tolerant and resumable
“Fault-tolerant and resumable” describes a set of error handling strategies for maintaining integrity and reaching eventual data consistency in the face of expected and unexpected errors, which makes it a very important quality in certain workflows.
When integrating with external systems, we’ll generally want to keep track of their data in a database that we fully control; that may include external IDs, monetary amounts, taxes, and any other valuable data returned to our system.
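As a concrete reference for the examples that follow, here’s a minimal sketch of the local schema this article implies; the migration, table, and column names are illustrative assumptions:

class AddExternalReferences < ActiveRecord::Migration[7.1]
  def change
    # The account created on the external system for a contract
    add_column :contracts, :external_account_id, :string

    # Per-funding bookkeeping used later to resume the workflow
    add_column :fundings, :released, :boolean, default: false, null: false
    add_column :fundings, :ledger_transaction_id, :string
    add_column :fundings, :status, :string
  end
end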
But how do we ensure those data references are consistent and correctly in sync with the external system?
What not to do
Let’s use the following example workflow:
- Given a contract with fundings, issue an API request to create an external account;
- Iterate over existing contract fundings, and for each:
  - Issue an API request to fund the external account;
  - Mark the funding as released.
Here’s a naive way to design such a workflow:
ApplicationRecord.transaction do
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)

  contract.fundings.each do |funding|
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
Let’s assume the first funding fails to be updated, and an exception is raised. What happens? The database transaction would be rolled back, and all of the local state created within the transaction would evaporate. External state, however, wouldn’t be rolled back; we’d have an external account, potentially with a few fundings, not tracked anywhere in our local system.
Worse yet, a new attempt to run the same workflow would yield the same side effects, which means the workflow is not idempotent. That may be far from acceptable depending on our requirements and how the external system works.
What to do
Divide into control points
When designing a workflow like that, we’ll never want to roll back local transactions. A much better version of our workflow would be:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    fund_external_account(contract, funding)
    funding.update!(released: true)
  end
end
The first noticeable improvement is that there is no transaction. Second, we’re guarding the critical steps of our workflow under conditionals; if the local state for a step is already recorded, we skip that step. In other words, we’re defining units of work that serve as control points, allowing the computation to resume where it left off with the help of persisted state. Creating the account is the first unit of work, and funding the account is the second.
Now, if a funding fails to be updated, at least the creation of the external account would be persisted. Rerunning the workflow would skip creation of the external account and hopefully release the remaining fundings, and we’d be good to go!
That improves the resumability of our workflow but is still not ideal. Let’s find out why.
Roll back external state
The problem when a funding fails to be released is that its external counterpart would not be rolled back:
fund_external_account(contract, funding)
# If failing here, the previous command wouldn't roll back
funding.update!(released: true)
If we rerun our workflow successfully, we’d be double-funding the external account, which would be a very dangerous and undesirable side effect. The solution is to roll back the external funding, but how?
A general answer to that question lies in the Saga pattern. Since external systems are unaware of local transactions, the solution is to emit compensating actions that roll back the external state established within the current unit of work. How to do that varies depending on what the external system provides.
In our example, a “funding” creates ledger entries. One possible approach would involve a new API request to delete the extra ledger entries, but let’s assume that’s not possible. Let’s assume instead that ledger entries can’t be deleted, but they can be created with a pending status and then archived if something goes wrong. If all goes well, we can move their status to posted. Let’s introduce that change into our snippet:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true)
      PostLedgerEntries.call(ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.call(ledger_transaction_id)
      end
      # Reraising is important to signal a failed workflow
      raise e
    end
  end
end
Archiving ledger entries is how we roll back fundings. That’s great! So, is our mission accomplished? Not yet.
Run select parts in the background
What if the “archive” step throws an exception?
rescue Exception => e
  if ledger_transaction_id
    # What if this line throws an exception?
    ArchiveLedgerEntries.call(ledger_transaction_id)
  end
  # ...
When rerunning the workflow, the failed funding would be left orphaned and unreleased, and another funding could be created with a posted status. That would be disastrous since we’re dealing with money. And the same would happen if the entries failed to be posted!
We can solve that problem by running the “archive” and “post” steps in another process or thread. Background processing tools such as Sidekiq have built-in retry capabilities, making our workflow more likely to reach eventual consistency. Moving parts into the background should be analyzed case by case, but it makes sense in our example. Let’s do it:
if contract.external_account_id.nil?
  external_account_id = create_external_account(contract)
  contract.update!(external_account_id: external_account_id)
end

contract.fundings.each do |funding|
  if funding.unreleased?
    begin
      ledger_transaction_id = fund_external_account(contract, funding)
      funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
    rescue Exception => e
      if ledger_transaction_id
        ArchiveLedgerEntries.perform_async(ledger_transaction_id)
      end
      # Reraising is important to signal a failed workflow
      raise e
    else
      PostLedgerEntries.perform_async(ledger_transaction_id)
    end
  end
end
We’ve turned ArchiveLedgerEntries and PostLedgerEntries into jobs with a perform_async method, which can be independently dispatched to the background. Unfortunately, no solution is perfect, so there’s still a chance that retries would be exhausted and neither step would ever finish. In that case, we could learn something new about the domain and improve our code, or even devise tools to detect and repair inconsistencies, but we won’t cover that in this article.
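For reference, here’s a minimal sketch of what one of those jobs could look like with Sidekiq; the class body and the archive_ledger_entries helper are illustrative assumptions:

class ArchiveLedgerEntries
  include Sidekiq::Job

  # Sidekiq retries failed jobs with exponential backoff by default;
  # the retry count is configurable per job
  sidekiq_options retry: 10

  def perform(ledger_transaction_id)
    # Hypothetical API call that archives the entries on the external ledger
    archive_ledger_entries(ledger_transaction_id)
  end
end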
Use locks to handle concurrency
Our workflow should be perfect now that it’s resumable and fault-tolerant; API requests shouldn’t be able to make it through more than once, right? Wrong.
What if two or more instances run at exactly the same time and read back the same state? One or more if conditions could resolve to truthy, which would result in two single-funded accounts (one of them being orphaned), or, even worse, one double-funded account! When it comes to critical workflows, we can’t assume this will never happen. It can happen, and it’s more frequent than one would imagine. For example:
- A user could click on a submit button multiple times and trigger multiple web requests (if the workflow is synchronous);
- Two or more jobs could accidentally be scheduled for the same contract and be executed concurrently (if the workflow is asynchronous);
- Two or more people could trigger our workflow on hundreds of contracts and, due to stale UIs on either side, trigger it concurrently for the same contract.
Many things could happen, and resolving the issue on the front end alone is never enough for a critical workflow. The back end should guarantee data consistency.
Since our workflow is backed by a contract database row, a simple solution is to take a row-level lock when working with PostgreSQL or another compliant database. The lock serves solely as a global mutex across processes and machines. However, row-level locks run inside a transaction, and that’s where we need to be careful. We want to ensure our transaction is always committed and never rolled back, so the following pattern is advisable:
captured_exception = nil

# Warning, implicit transaction here!
contract.with_lock do
  contract.reload
  # Execute work
rescue Exception => e
  # Let the transaction commit no matter what
  captured_exception = e
end

raise captured_exception if captured_exception
Let’s apply that pattern to our code snippet:
captured_exception = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
      rescue Exception => e
        if ledger_transaction_id
          ArchiveLedgerEntries.perform_async(ledger_transaction_id)
        end
        raise e
      else
        PostLedgerEntries.perform_async(ledger_transaction_id)
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

raise captured_exception if captured_exception
Now only one workflow instance can hold the lock at a time, while concurrent instances are blocked until they are able to acquire it.
Note that we’d usually resort to advisory locks when dealing with application-level locks, but they are harder to test under transactional fixtures. If you have a backing object like contract, row-level locks should be pretty much equivalent in functionality.
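For completeness, here’s a minimal sketch of the advisory lock alternative, assuming PostgreSQL; deriving the numeric lock key with Zlib.crc32 is just one possible scheme:

require "zlib"

# Derive a stable numeric key for this contract's workflow
lock_key = Zlib.crc32("contract_workflow_#{contract.id}")

connection = ActiveRecord::Base.connection

# Session-level advisory lock: blocks until acquired, no transaction needed
connection.execute("SELECT pg_advisory_lock(#{lock_key})")
begin
  # Execute work, free of any implicit transaction
ensure
  connection.execute("SELECT pg_advisory_unlock(#{lock_key})")
end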
Fixing possible transient failures
Can you spot an innocuous bug now that we’ve introduced locks into our code? 👀
Answer: we’re scheduling background jobs before committing the transaction, so there’s a chance the lookup by ledger_transaction_id would fail within the job because it wouldn’t see uncommitted data. It would, however, likely succeed on the next try, because the database transaction would have been committed by then. This is a reminder to be careful when mixing background jobs with database transactions!
Fixing our transient issue is a bit tricky: it requires keeping track of successful and failed ledger transaction ids and moving the scheduling of background jobs higher up the stack, outside of any database transactions:
captured_exception = nil
ok_ledger_transaction_ids = []
failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(released: true, ledger_transaction_id: ledger_transaction_id)
        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id
        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
Thank you, Elias Rodrigues, for calling out this bug!
Use idempotency keys
Some APIs (such as Stripe) provide a feature called “Idempotency keys” to safely retry requests without performing the same side effect twice. To make it work, we provide a unique key along with the request. Let’s see how we could model our first step, creating the account, with that in mind:
external_account_id = create_external_account(
  contract,
  idempotency_key: "create_account_#{contract.id}"
)

contract.update!(external_account_id: external_account_id)
Our idempotency key uses contract.id to ensure uniqueness, which means only a single account would be created if we ran the same workflow multiple times. Does that render our guarding if condition unnecessary? Should we drop it? I don’t believe so.
Unfortunately, idempotency keys usually expire after 24 hours (look up your API’s documentation). Considering that tools like Sidekiq may attempt retries for around three weeks after the first attempt, it’s not safe to rely only on an idempotency key. I would advise having both the guard conditions and the idempotency keys if possible, for extra security. That would make the requests safer to use outside of the workflow.
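Combining both techniques, the first step of our workflow would look like this:

if contract.external_account_id.nil?
  external_account_id = create_external_account(
    contract,
    idempotency_key: "create_account_#{contract.id}"
  )
  contract.update!(external_account_id: external_account_id)
end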
When using idempotency keys, we should make sure to change the key’s value if we ever want to retry our workflow after a validation error or another expected error, so be careful! That requires tracking the key’s version with the help of database state, whether in Postgres, Redis, etc.:
begin
  create_external_account(
    contract,
    idempotency_key: "create_account_#{idempotency_key.version}"
  )
rescue *VALIDATION_ERRORS => e
  idempotency_key.increment_version
  raise e
end
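Here’s a sketch of what that tracking could look like with an ActiveRecord-backed model; the IdempotencyKey class and its columns are hypothetical:

class IdempotencyKey < ApplicationRecord
  # Assumed columns: key (string, unique index), version (integer, default: 1)

  def self.for(key)
    find_or_create_by!(key: key)
  end

  def increment_version
    increment!(:version)
  end
end

idempotency_key = IdempotencyKey.for("create_account_#{contract.id}")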
How much error handling do you need?
If you’re paying attention, you may have noticed a possible point of failure. The following snippet has no error handler:
ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end
What if the connection to Redis fails (assuming Sidekiq) and throws an exception? We’d be left with a pending ledger transaction! Among a few possible solutions, we could record the funding status:
funding.update!(
  released: true,
  status: "pending",
  ledger_transaction_id: ledger_transaction_id
)
If an error is raised when trying to schedule the jobs, assuming that a retry would be attempted, we could initialize ok_ledger_transaction_ids with the inconsistent ids:
# Using example Rails code
ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)
When the workflow reached the bottom again, it would retry scheduling the jobs that previously failed to be scheduled. Here’s the final code:
captured_exception = nil
ok_ledger_transaction_ids = contract
  .fundings
  .where(released: true, status: "pending")
  .pluck(:ledger_transaction_id)
failed_ledger_transaction_id = nil

contract.with_lock do
  contract.reload

  if contract.external_account_id.nil?
    external_account_id = create_external_account(contract)
    contract.update!(external_account_id: external_account_id)
  end

  contract.fundings.each do |funding|
    if funding.unreleased?
      begin
        ledger_transaction_id = fund_external_account(contract, funding)
        funding.update!(
          status: "pending",
          released: true,
          ledger_transaction_id: ledger_transaction_id
        )
        ok_ledger_transaction_ids << ledger_transaction_id
      rescue Exception => e
        failed_ledger_transaction_id = ledger_transaction_id
        raise e
      end
    end
  end
rescue Exception => e
  captured_exception = e
end

ok_ledger_transaction_ids.each { |id| PostLedgerEntries.perform_async(id) }

if failed_ledger_transaction_id
  ArchiveLedgerEntries.perform_async(failed_ledger_transaction_id)
end

raise captured_exception if captured_exception
We’re assuming that PostLedgerEntries will update the funding status to posted and ArchiveLedgerEntries to archived.
There could be room for improvement. For instance, would it make sense to only archive a ledger transaction if PostLedgerEntries retries are exhausted? It depends on our requirements and on how long we can keep a pending funding around.
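If we went that route, Sidekiq’s sidekiq_retries_exhausted hook would be one way to express it. Here’s a minimal sketch, reusing the hypothetical post_ledger_entries API call and assuming a Funding model:

class PostLedgerEntries
  include Sidekiq::Job

  # Called by Sidekiq once all retries for a job have failed
  sidekiq_retries_exhausted do |job, _exception|
    ledger_transaction_id = job["args"].first
    ArchiveLedgerEntries.perform_async(ledger_transaction_id)
  end

  def perform(ledger_transaction_id)
    # Hypothetical API call that posts the entries on the external ledger
    post_ledger_entries(ledger_transaction_id)
    Funding.where(ledger_transaction_id: ledger_transaction_id)
           .update_all(status: "posted")
  end
end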
When to choose a resumable strategy?
Choose a resumable error handling strategy when:
- The workflow is composed of discrete steps that run independently;
- There is loose coupling between the steps;
- The workflow has a reasonable number of steps and API requests to deal with;
- You can bear temporary inconsistency;
- There is a retry mechanism in place, such as a background processing tool or an external API that retries webhooks on errors.
For workflows running in the background, it is possible to use WebSockets to broadcast flash messages and other data if delivering feedback to the user is essential.
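With Rails, for instance, that could be as simple as broadcasting over Action Cable at the end of the workflow; the stream name and payload are illustrative:

# At the end of the background workflow
ActionCable.server.broadcast(
  "contract_#{contract.id}",
  { event: "workflow_finished", flash: "Fundings released successfully" }
)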
Conclusion
There is no single answer on how to handle errors, so the strategy should always be analyzed case by case. The key idea behind this post is to present a few techniques that will hopefully inspire your own error handling strategies, which are especially important in critical workflows.
Make sure that you handle both expected and unexpected errors. Never assume unexpected errors can’t happen. They can and will happen, for a multitude of possible reasons: API timeouts, outages, cloud instances restarting, unstable internet connections, unexpected database constraints kicking in, and so on. Whenever possible, design with failure in mind.