Exploring software design problems and solutions: Transactions and side effects

As practice in thinking and talking about practical software design approaches (at the high and low level), I want to explore some scenarios from first principles. The goal of this exercise is to analyse problems, the solutions we apply to them, and the new problems those bring up. I encourage you to pause the article at certain points and think through it yourself. Code examples are Ruby/Rails-ish, but should be easy enough to follow.

Importantly, these thoughts are subjective, as a lot of design is. Feel free to question my reasoning, and leave a comment to let me know what you think.


We start with a simple system for a shopping website: you have an endpoint where you want to create an account and a cart for a customer at the same time.

def call
  account = Account.create!(...)
  cart = Cart.create!(account) 
  CartItems.create_many!(cart, ...) # Add items to the cart
  # Maybe also create some other things ...
end

There's a problem here: if the cart item creation fails, you end up with inconsistent state. The endpoint will return an error, but the account will already exist, and maybe the cart too, albeit empty.

We can solve this by having the caller retry on failure. But we need to change Account.create! to Account.find_or_create! to prevent duplicates, and so on (see the sketch after this list). What problems does this bring?

  • Creation logic gets more confusing and messy, as now another non-trivial yet orthogonal concern (idempotency) has been mixed in.
  • All future changes to this endpoint need to know about and carry on with the find_or_create! pattern, or risk incorrectness. This gets more difficult to enforce if we extract parts of the endpoint into other methods or classes.
  • If the caller does not retry, you're left with orphaned entities (such as the Account which will never be used).
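
For reference, here's a rough sketch of that retry-friendly version (assuming find_or_create! helpers that look records up by some natural key, such as the customer's email):

def call
  # Every step must now tolerate being re-run when the caller retries
  account = Account.find_or_create!(...)
  cart = Cart.find_or_create!(account)
  CartItems.create_many!(cart, ...) # This too must be made idempotent somehow
end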

We know the right answer here: use database transactions! If one step fails, the transaction is rolled back, so nothing is persisted. But before we jump to that, I want to explore why.

You don't have to use a database transaction; you could also write your custom rollback implementation.

def call
  @account = Account.create!(...)
  @cart = Cart.create!(@account)
  @cart_items = CartItems.create_many!(@cart, ...)
rescue
  # Manually undo whatever was persisted before the failure
  @cart_items.delete_all if @cart_items.present?
  @cart.delete if @cart
  @account.delete if @account
  raise # Re-raise so the caller still sees the error
end

I'm not judging. There may be valid reasons for this. You may want to avoid nested, sequential, or long-running transactions, or you might just work in a weird place that shuns database transactions.

But let's step back. Why is this frowned upon?

Major reason: It's still susceptible to failure. If there's a bug in your code or something goes wrong in your database while you're trying to delete the orphaned entities, you're screwed.

Minor reason: It leaves traces. For instance, the deleted rows will still likely take up disk space until the database reclaims it (for example, via vacuuming). Any autoincrementing sequences (like IDs) will also likely not reset.

This is why we in the industry have agreed to rely on database transactions for this. Instead of treating the data layer as a dumb store and trying to handle everything ourselves, we lean on its ACID guarantees to ensure correctness. When we open a transaction, the database promises us that all the operations in that transaction will be treated as one atomic operation: they will succeed together or fail together. Database transactions also help us mitigate concurrency problems.

We've made progress. Databases with ACID guarantees (typically SQL databases) are really good at this. We should lean on them.

So now we have:

def call
  in_transaction do
    account = Account.create!(...)
    cart = Cart.create!(account) 
    CartItems.create_many!(cart, ...)
  end
end

But things can get murky. Suppose we want to send the user a verification email after we create the account. Let's say we do this by enqueueing an async job, which is pushed to Redis or a separate database and then processed by Sidekiq.

def call
  in_transaction do
    account = Account.create!(...)
    SendVerificationEmail.perform_later(account.id)
    # or
    # SendVerificationEmail.perform_later(account.email)
    cart = Cart.create!(account) 
    CartItems.create_many!(cart, ...)
  end
end

We have a problem again. If the account creation succeeds, but the cart creation does not, the whole transaction will be rolled back. But the SendVerificationEmail job will have already been enqueued! It will either fail (no account found), or succeed (if we passed the account email directly).

This is what happens when you introduce a side effect. I see side effects as a change in a different system outside the control of the current executor (the database). Here, it is an async job, but it could also be, for instance, a synchronous API call to a payment provider to charge a user's card. How do we solve this?

Let's consider some options. The simplest fix is moving the side effect outside of the transaction.

def call
  account = in_transaction do
    account = Account.create!(...)
    cart = Cart.create!(account) 
    CartItems.create_many!(cart, ...)
    account # Return account
  end
  
  # If the transaction was rolled back, the method would exit.
  # This means that, at this point, account definitely exists.
  SendVerificationEmail.perform_later(account.id)
end

If you can, you should do this. But this isn't always possible. For instance, maybe the account creation logic is not directly in this method, but encapsulated in a service class somewhere in your application:

class AccountCreator
  def call
    account = Account.create!(...)
    SendVerificationEmail.perform_later(account.id)
  end
end

and then used here and other places:

def call
  in_transaction do
    account = AccountCreator.call(...)
    # Maybe you also have a CartCreator
    cart_with_items = CartCreator.call(account, ...)
  end
end

The AccountCreator is probably used in a few other places, like the normal sign up flow. It's not so easy to extract the email sending from the transaction without affecting all these places. And this could happen for many other service classes: they innocently initiate their own external calls without knowing we're in a transaction.

In some ways, this is a silly problem. It's not a case of race conditions, possible error scenarios, or high-level design problems; it's purely a consequence of how we organize our code. If we choose not to use service classes and instead inline everything, we're back to the previous version, where we can simply move the job to the end.

But it's a real problem. Code structure matters. The design and domain model we choose at the high level influences how we structure our code, and this in turn locks us into certain paths of development.

So how is this solved in the Rails world? Rails provides a callback called after_commit, which we could use like this:

# It only works in models, so we must put the job enqueueing in our Account model
class Account
  # This runs after the transaction that creates an Account commits
  after_commit :send_verification_email, on: :create
  
  def send_verification_email
    SendVerificationEmail.perform_later(id)
  end
end

With this we can solve the problem of nested service classes: move all external system calls into after_commit in the model. But you can probably already spot some problems this raises:

  1. Actions are now coupled to models rather than business processes, which means they are always invoked, even when not relevant. send_verification_email will always be called, even in cases where we may not want it (for example, backfilling accounts via a one-off script). We could add filters in the callbacks (see the sketch after this list), but this increases the amount of logic they hold.
  2. Dependencies are now hidden. It's easy to add a new flow for creating accounts that calls Account.create!, without realizing that it will also send emails.
  3. Logic is fragmented. Rather than having each service class handle a business process, some of the logic needs to be moved into the model. The code becomes harder to follow, and we end up with hidden side effects. And remember, code structure matters. We will pay the price.
  4. Visibility (and thus debugging) becomes harder, since we're now relying on a chain of responsibility handled by the framework, not our code, thus introducing a new layer of indirection. Stack traces will now jump from your code deep into the framework, before getting back to your code. And when multiple callbacks or multiple models are involved, it's hard to be sure of the order in which they were executed. I have personally wasted several minutes of my life hunting down the source of some external change. I had to do a mix of code reading and runtime debugging, tracing it across several boundaries, and I can tell you, it sucks.
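
For instance, the filter from point 1 might look something like this (a sketch; skip_verification_email is a hypothetical flag the backfill script would have to set):

class Account
  # Hypothetical flag a backfill script can set to opt out of the email
  attr_accessor :skip_verification_email

  after_commit :send_verification_email, on: :create, unless: :skip_verification_email
end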

One more big problem is that after_commit can have surprising behaviour when you have nested transactions.
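
A quick sketch of the kind of surprise I mean: with ActiveRecord, nested transaction blocks are by default folded into the outermost one, so the callback only fires when that outermost transaction commits.

Account.transaction do     # Outer transaction
  Account.transaction do   # Nested block: by default, just part of the outer transaction
    Account.create!(...)   # after_commit has NOT fired when this inner block ends
  end
  # If anything raises here, everything above rolls back too
end
# after_commit callbacks run only here, after the outermost transaction commits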

A way to improve on this is with the library after_commit_everywhere. This library makes it possible for you to use after_commit outside models, so now the logic can remain in the corresponding service class.

class AccountCreator
  include AfterCommitEverywhere

  def call
    account = Account.create!(...)
    after_commit { SendVerificationEmail.perform_later(account.id) }
  end
end

I much prefer this, as we can now keep related concerns together. But I still see some minor problems:

  • after_commit_everywhere depends on monkey-patching, which feels like a hack.
  • It doesn't solve the problem of indirection. In fact, it makes it a bit worse, as we've now added another layer on top of the framework.
  • Is this a leaky abstraction? I don't know. Someone could argue that it is, as it means the service class needs to know/care about the existence of a database transaction.

But overall, after_commit_everywhere seems like a good solution for most cases.

Taking stock of where we are:

  • We introduced transactions to solve the problem of incorrectness due to partial failure of a multi-step process.
  • We introduced after_commit to solve the problem of inconsistency due to nested service classes interacting with external systems during a transaction.
  • We introduced after_commit_everywhere to solve the fragmentation of regular after_commit.

Great! Let's backtrack a bit and consider some alternative approaches to after_commit.

I haven't seen this done anywhere yet, but in theory you could make each service class declare or return its side effects, and leave it to the caller to execute them. This could look something like this:

Result = Data.define(:item, :side_effects)

class AccountCreator
  def call
    account = Account.create!(...)
    Result.new(
      item: account, 
      side_effects: [
        lambda { SendVerificationEmail.perform_later(account.id) },
      ]
    )
  end
end

# ... In the original caller
def call
  @side_effects = []
  in_transaction do
    result = AccountCreator.call(...)
    @side_effects.concat(result.side_effects)
    cart_with_items = CartCreator.call(result.item, ...)
  end
  
  @side_effects.each(&:call)
end

I like this because it separates side effects clearly. But now it relies on the caller to explicitly dispatch those side effects. Our custom Result class makes it harder to miss, but it's still fairly easy to forget to write that last line. It also gets pretty unwieldy with nesting, as each nested service class will have to pass to its parent not just its result and side effects, but also those from all other service classes it invoked.

Let's consider yet another alternative: If we look at the problem some more (enqueueing external jobs), we can see that it arises due to control. The queue system which handles the SendVerificationEmail job is outside the control of our database, so these two systems cannot synchronize. By moving the job to after the transaction, we're saying, "Since this external system (queued jobs) is outside the control of our transaction, let's wait until the database relinquishes control back to us." But what if we brought the jobs under the control of our database?

If we changed our queue system to store jobs in the same database, the SendVerificationEmail job would become part of the current transaction, and so it would not be visible to our queue executor until the transaction commits. And if the transaction is rolled back, so is the job. Problem solved.
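
As a sketch, with a database-backed Active Job adapter such as GoodJob (which stores its jobs in the application's own Postgres database), this is mostly a configuration change:

# config/application.rb
config.active_job.queue_adapter = :good_job

# Enqueueing inside the transaction now just inserts a row into a jobs table
# in the same database, so it commits or rolls back along with the account.
SendVerificationEmail.perform_later(account.id)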

I think this is a fair solution for a small app, but suboptimal for larger ones. Background jobs are a different kind of workload from regular application records. Storing them in the same database means job processing can impact your regular application (for example, queue workers constantly polling your database could affect app performance, and job rows could contribute to database disk bloat).

So let's iterate on this. What if we put the jobs under our database's control, but only temporarily? This is essentially what the transactional outbox pattern is:

  1. Rather than enqueuing jobs to our queue system, we store them as messages in a table in our app's database. This allows our database to retain control over these messages, and roll them back if the transaction is rolled back.
  2. We then have a separate process (for example, a cron job) which is responsible for checking for new messages regularly and enqueueing them as jobs in our queue system. The table of messages is our outbox, containing messages from our application to our queue system.

It could look like this:

# Now we write messages
class AccountCreator
  def call
    account = Account.create!(...)
    # JobMessage is the model for our outbox table;
    # it stores the job's class name and arguments
    JobMessage.create!(job_class: "SendVerificationEmail", params: [account.id])
  end
end

# A cron that runs every n seconds, and enqueues the jobs
class EnqueueJobsFromMessages
  def call
    JobMessage.find_each do |message|
      message.job_class.constantize.perform_later(*message.params)
      message.delete
    end
  end
end

The transactional outbox is awesome: we can synchronize our queue system with the app database while still keeping the two decoupled.

Of course, it has its problems too!

  1. We've now introduced infrastructure complexity. We've added one more component in between our application and queue workers, and every new component is a new point of failure.
  2. We'll likely see an increase in latency (probably tiny, though), since jobs are no longer processed immediately; they wait for the cron job to pick them up. Our throughput may also decrease, since batch jobs like this create a temporary bottleneck until the messages are fanned out to the workers.
  3. We also need to watch for concurrency issues in the cron job. With enough usage, we will quickly run into funny situations such as different instances of the cron job overlapping and then trying to enqueue the same messages. A more robust version of this cron job would look like:
class EnqueueJobsFromMessages
  def call
    in_transaction do
      # FOR UPDATE locks each selected message so another cron job instance doesn't pick it up;
      # SKIP LOCKED tells other instances to skip locked rows instead of waiting on them
      JobMessage.lock("FOR UPDATE SKIP LOCKED").each do |message|
        message.job_class.constantize.perform_later(*message.params)
        message.delete
      end
    end
  end
end
  4. What about side effects which aren't jobs? For example, calling an external service API. To make these work with the transactional outbox, we'd need to wrap them in jobs (as sketched below), which also means we can no longer process them synchronously. At that point, I'd say we need to rethink our paradigm and consider whether we needed these to be synchronous in the first place. It makes me wonder about durable workflows, something I hope to explore in the future.
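
To illustrate that last point: a synchronous payment call would have to become a job plus an outbox message, something like this sketch (ChargeCard and PaymentProvider are hypothetical):

# Hypothetical job wrapping what used to be a synchronous API call
class ChargeCard < ApplicationJob
  def perform(account_id, amount)
    PaymentProvider.charge(account_id, amount)
  end
end

# Inside the transaction, we only record the intent as an outbox message
JobMessage.create!(job_class: "ChargeCard", params: [account.id, amount])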

Final thoughts

And now I'll call it a day. We could keep expanding on these with more solutions and problems, both common and novel. For me, this exercise in deconstructing the patterns we use demonstrated once again that the solutions we build for common problems introduce problems of their own. They don't exist in isolation, and for every solution we pick, we must pay a price. Sometimes they increase complexity, sometimes they require a change of paradigm, sometimes they reduce visibility, but sometimes they're good enough.

I don't think this is a surprise, as we all know there are tradeoffs. Still, there are patterns that we've come to universally accept/reject, and it's always interesting to dig into the why, so we can more confidently pick one over the other. Adios!

(PS: This article was inspired by poring over this.)



I write about my software engineering learnings and experiments. Stay updated with Tentacle: tntcl.app/blog.shalvah.me.
