In defence of complexity

I've noticed a pattern when I join a new company. I'm being onboarded, someone is explaining their systems to me, and they joke about how it's unnecessarily complex and could probably be simplified. We laugh, but I often think to myself, "Maybe, maybe not".

I'm a strong advocate for simplicity in software design, but in this article I want to try to explain, with some personal examples, why we should have a different attitude to complexity.

Minimum Necessary Complexity

Every engineering problem has a minimum necessary complexity: a floor that any solution has to reach, however cleverly it is designed.

Yes, solutions are sometimes over-engineered, carrying more complexity than they need, but that is not easy to determine. Even the simplest solution to a non-trivial problem will need to overcome a certain amount of complexity. As software architects, our job is to determine how much complexity is appropriate and how we wish to distribute it.

Distributing complexity

Take the example of a text-based note-taking app such as Google Keep. Compared to richer organizers such as OneNote, Keep has always been more basic, which makes it appealing for simple cases: the equivalent of a set of digital sticky notes. In the early days, Keep only supported text. If you wanted to add an image, you would have to upload it somewhere else and then insert the URL.

As the developer, you might be fine with this. It's a win-win: your software remains simple to use, while users can still put in whatever they want, as long as they can convert it to text. But suppose you want to provide a better experience for your users, so you add image support. This will undoubtedly raise the complexity of your software. Now you need to think about file storage, backwards compatibility (keeping all notes text-based internally), presentation (how should images appear), interface (UI for uploading images), limits, abuse, and so on.

Another example is Twitter's "Tweet all" feature (writing a thread at once). Users can post one tweet and then reply to it, and so on. But Twitter added the ability to post multiple tweets that would then be linked as one thread. Again, we're trading some simplicity in our system for a better UX.

Of course, the amount of complexity involved in both cases will vary depending on factors like your tech stack and architecture, as well as your skills. But there will be some.

Taking complexity on: Simple does not mean "not complex"

I call this the act of taking on complexity. We don't have to, but we want to reduce friction for our end users and make things "simple" for them. So we step in and shift some of this complexity to our end.

So, yes, something that appears simple to the user may in fact hide a lot of complexity: adding a new feature, supporting alternatives, adding redundancy, caching to improve performance, and so on.

There are also non-product reasons to add complexity, chief among them probably being security. Rate limits, authorization, RBAC, MFA, and the like all add complexity.

Complexity happens in various forms. It could be:

  • combinatorial (6 independent on/off toggles give 2⁶ = 64 possible scenarios)
  • infrastructure
  • performance
  • system architecture
  • code architecture
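To make the combinatorial point concrete, here is a minimal Python sketch that enumerates every on/off combination of six toggles. The toggle names are hypothetical; the point is only how quickly the scenario count grows.

```python
from itertools import product

# Six hypothetical on/off feature toggles; the names are illustrative only.
TOGGLES = ["images", "dark_mode", "sharing", "reminders", "labels", "offline"]


def all_configurations(toggles):
    """Yield every combination of on/off states as a dict of toggle -> bool."""
    for states in product([False, True], repeat=len(toggles)):
        yield dict(zip(toggles, states))


configs = list(all_configurations(TOGGLES))
print(len(configs))  # 64, i.e. 2**6
```

Each of those 64 configurations is, in principle, a distinct scenario to reason about and test. Adding a seventh toggle doubles the count to 128.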

You can't build good (intuitive, reliable, secure, insert-adjective-here) software without these two things:

  • the willingness to take on complexity
  • the discernment to know what complexity should be taken on.

Modern civilization runs on complex systems. Water and power delivered to your house, plumbing, telecommunications, Internet, traffic lights, airports, manufacturing of cans and bottles—these are examples of complex systems we've come to take for granted, because of the simplicity of their outputs.

Complexity's problems

But yes, complexity does have problems...

Cost

For one, it's costly, especially when it's intentional and well-designed. It takes time, effort and the right skills. This is why there's another option: offload it to an external provider. This lets you achieve the final goal, such as a better UX, without the full investment. The cost is still borne by you (monetarily), but your core system can remain "simple".

Offloading complexity also works as a starting point, similar to starting with a cloud provider before moving to self-hosting. Especially when you don't have the needed skills, using an external provider gives you a chance to get progressively familiar with the challenges of such systems, until you get to a point where you can finally take it on.

Growth

A second problem is that complexity grows. Complex systems beget complex systems. Once you've accepted the complexity, problems that were previously "simple" may now take additional work. For example, improving performance in a monolith is fairly straightforward; with microservices, you pay for the extra network calls, which can force more extreme measures like aggressive caching, switching network protocols and serialization formats, and the like.
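As an illustration of the "aggressive caching" measure, here is a minimal Python sketch of a time-to-live cache wrapped around a call to another service. `fetch_user_profile` is a hypothetical stand-in for a network call, and the 30-second TTL is an arbitrary assumption.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache results of an expensive call for `ttl` seconds (illustrative sketch)."""

    def __init__(self, fetch: Callable[[str], Any], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no network call
        value = self._fetch(key)  # miss or stale entry: pay for the network call
        self._entries[key] = (now, value)
        return value


# Hypothetical stand-in for a call to another service; records each "network" hit.
calls = []


def fetch_user_profile(user_id: str) -> dict:
    calls.append(user_id)
    return {"id": user_id}


cache = TTLCache(fetch_user_profile, ttl=30.0)
cache.get("u1")
cache.get("u1")
print(len(calls))  # 1
```

Note that the cache is itself new complexity: you now have to decide how stale a result may be and when to invalidate it, which is exactly the "complexity begets complexity" effect.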

This growth is often accidental. You could start out with an architecture that makes sense for current and future use cases, but then the product evolves in unforeseen ways, trapping you in an unplanned mess. Unfortunately, you can't always avoid this.

Accidental complexity

And this leads us to our main enemy. Accidental complexity happens when a process is complex in a way we didn't expect. This usually indicates a mismatch between our mental model of how complex the process should be and the requirements forced on us.

A common example today is deployment of web apps, especially with tools like Kubernetes. Sometimes people adopt Kubernetes because they've heard about the reliability features, and then they get frustrated by all of the extra work that it imposes on them. The complexity may be necessary, but only because Kubernetes is engineered towards a specific class of problems. And to this user, who doesn't care about those problems, all of this feels accidental.

Avoiding accidental complexity is hard, especially when it's not created by us. Sometimes the complexity is necessary, but we're just poorly informed. The best option is probably to get properly informed, weigh it up and decide if it is worth it for us. If it is, can we convert it into intentional complexity, via proper planning, documentation, and changing our approach? If not, it probably makes sense to migrate to a different solution.

A personal story

An example of complexity that sticks with me is a project from a previous job. We had an appointment booking system in our app, which was simply an embedded Calendly widget. We did this because we didn't want to deal with the vagaries of scheduling. Calendly was good at this stuff—timezones, round-robin scheduling, managing availability, group scheduling, etc. Unfortunately, they require you to use their UI for scheduling, hence the widget.

But even with this, we still had a lot of complexity to deal with.

  • We had to sync appointments from Calendly into our system, but our models did not match 1:1
  • We had to do this sync via listening to their webhooks. Let's just say: there's a reason people are trying to build startups out of webhook processing.
  • Their API was limited in some annoying ways, so things that should actually be simple weren't.
  • Worse, we had a second external service where agents could set up appointments on their own. This provider had a different model from Calendly and an even more frustrating API.
  • Even worse, an appointment booked via Calendly could also show up in this other service and be managed from there. (PS: "Single source of truth" is a lie.)
  • The icing on the cake was that the process was driven by humans, so you couldn't rely on them not doing stupid things. They would often do random things that would cause undefined behaviour, such as changing an item mid-processing (with no change history).

Complex? As fuck.

But was this really necessary? Mostly.

Primarily because this was a process that existed before we began incorporating it into our backend, so we couldn't simply rip it up and build it from scratch in an ideal manner. Also, we didn't have the resources (people or time) to work on a whole appointment scheduling + CRM system.

As the lead developer on this system, I spent days just thinking about the complexity of it all, looking for ways we could reduce it. I rewrote a bunch of the code several times, and wrote tons of comments and documents describing how the system worked.

More problems

A second example: a problem caused by the complexity of this system. I mentioned that we used the Calendly scheduling widget in our app. The flow was like this: user is on appointments page (our app) → user clicks "New appointment" → we redirect to Calendly widget → they finish booking → Calendly page closes, they are back in our app and they should see the newly booked appointment.

The problem? We relied on webhooks for notifying us of new appointments, so if the Calendly webhook had not arrived or been processed when the user was redirected back to our app, they would still see an empty list of appointments, creating confusion.

To solve this, we added redundancy: whenever the user landed on that page, we would call the Calendly API proactively to check if there were any appointments we hadn't yet processed via webhook, and thus process them preemptively, while of course avoiding race conditions (processing the same appointment twice). This worked reliably; the only downside was the latency introduced by that API call.
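The redundancy described above, two paths (webhook and page-load API check) feeding the same processing step without handling an appointment twice, can be sketched roughly as follows. This is a minimal Python sketch, not the original implementation: it assumes each booking carries a stable provider ID, and `AppointmentStore` and `fetch_recent` are hypothetical names standing in for the database and a wrapper around the provider's API.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Appointment:
    external_id: str  # stable ID from the scheduling provider (assumed to exist)
    user_id: str
    start: str


class AppointmentStore:
    """In-memory stand-in for the appointments table (hypothetical)."""

    def __init__(self) -> None:
        self._by_external_id: Dict[str, Appointment] = {}
        self._lock = threading.Lock()

    def process(self, appt: Appointment) -> bool:
        """Record an appointment once; return False if it was already processed."""
        # The lock makes check-then-insert atomic within one process. Across
        # several workers, a unique constraint on external_id would do this job.
        with self._lock:
            if appt.external_id in self._by_external_id:
                return False
            self._by_external_id[appt.external_id] = appt
            return True

    def for_user(self, user_id: str) -> List[Appointment]:
        return [a for a in self._by_external_id.values() if a.user_id == user_id]


def handle_webhook(store: AppointmentStore, appt: Appointment) -> bool:
    """Webhook path: may arrive before or after the user returns to the app."""
    return store.process(appt)


def on_appointments_page_load(
    store: AppointmentStore,
    user_id: str,
    fetch_recent: Callable[[str], Iterable[Appointment]],
) -> List[Appointment]:
    """Page-load path: pull recent bookings so a late webhook doesn't leave the list empty."""
    for appt in fetch_recent(user_id):  # hypothetical wrapper around the provider's API
        store.process(appt)
    return store.for_user(user_id)


store = AppointmentStore()
booked = Appointment("evt_123", "user_1", "2024-05-01T10:00Z")
# The user lands on the page before the webhook has been processed.
page = on_appointments_page_load(store, "user_1", lambda uid: [booked])
# The webhook arrives afterwards and is a no-op.
late = handle_webhook(store, booked)
print(len(page), late, len(store.for_user("user_1")))  # 1 False 1
```

Whichever path arrives first does the work; the other sees the existing record and does nothing. The extra API call on page load is where the latency cost comes from.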

A third example: an evolution of the system. Users could reschedule an appointment in our app, which would use our UI and Calendly's API. However, we wanted to change the flow in certain cases. Rescheduling in Calendly keeps the same agent, but we wanted to return the user to the pool of available agents.

The solution here was straightforward: instead of rescheduling via Calendly, we would cancel the old appointment and create a new one. But in order to do this seamlessly for the user, we had to come up with a trick: whenever the user clicked our "Reschedule" link, we would create a record on our backend called a "PendingCancellation", attached to the existing appointment. We would then redirect the user to the Calendly "new appointment" page. From here, we would process new appointments as normal (Calendly webhook), but also check for any PendingCancellation recently created by the user and, if one existed, cancel the old appointment.
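The PendingCancellation trick can be sketched like this in Python. It is a rough reconstruction under stated assumptions, not the original code: "recently created" is assumed to mean within one hour, and `cancel_appointment` is a hypothetical callable standing in for whatever actually cancels the old booking with the provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional

# Assumed window; the story only says "recently created".
PENDING_WINDOW = timedelta(hours=1)


@dataclass
class PendingCancellation:
    user_id: str
    old_appointment_id: str
    created_at: datetime
    resolved: bool = False


class RescheduleFlow:
    def __init__(self, cancel_appointment: Callable[[str], None]):
        # `cancel_appointment` is a hypothetical wrapper around the provider's cancel call.
        self._cancel = cancel_appointment
        self._pending: List[PendingCancellation] = []

    def start_reschedule(self, user_id: str, old_appointment_id: str, now: datetime) -> None:
        """On "Reschedule" click: record intent, then redirect to the new-booking page."""
        self._pending.append(PendingCancellation(user_id, old_appointment_id, now))

    def on_new_appointment(self, user_id: str, now: datetime) -> Optional[str]:
        """While processing a new-booking webhook: cancel the old appointment if pending."""
        candidates = [p for p in self._pending
                      if p.user_id == user_id and not p.resolved
                      and now - p.created_at <= PENDING_WINDOW]
        if not candidates:
            return None  # an ordinary new booking, not a reschedule
        latest = max(candidates, key=lambda p: p.created_at)
        self._cancel(latest.old_appointment_id)
        latest.resolved = True  # never cancel the same appointment twice
        return latest.old_appointment_id


cancelled: List[str] = []
flow = RescheduleFlow(cancelled.append)
t0 = datetime(2024, 5, 1, 9, 0)
flow.start_reschedule("user_1", "appt_old", t0)
flow.on_new_appointment("user_1", t0 + timedelta(minutes=5))
print(cancelled)  # ['appt_old']
```

The time window is the fragile part: a user who abandons the reschedule and books something unrelated within the window would have their old appointment cancelled, which is the kind of human-driven edge case the story warns about.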

We didn't have to do this; we could have pushed the complexity onto the user by asking them to cancel the previous appointment first. But that didn't align with our business goals, so we took it on ourselves.

You can probably see a pattern: we took on complexity initially, and this led to more complex problems we had to solve with more complex solutions. But most of this was needed, and was ultimately managed quite well. Some of this complexity was accidental (due to Calendly's API making things harder). But since we had already bought into creating a complex system at the start, we could take a step back and come up with reliable ways of handling these. It was ultimately quite fun (when I wasn't tearing my hair out)!

Managing complexity

So the answer is not necessarily to reject complexity outright, but to know how to manage it. Some things I've learnt that help:

  • Discernment: It's super important to know what complexity is worth taking on, and where to distribute it in your stack. Some kinds of complexity, such as combinatorial, are best avoided as much as possible. Take the time to assess the situation and decide whether it's worth it.

  • Documentation: Be sure to document things. Describe:

    • the desired results
    • the path we're taking to achieve them
    • why

    Everything from comments in code to diagrams and external docs helps here.

  • Standardization: It helps if you can stick to common conventions. This means you only need to document the places where you diverge.

  • Observability: Complex systems can and will fail in surprising ways. A system is only as good as the confidence you have in it. You must invest in your ability to understand your systems (here's my free book on Observability Basics). This means architecting the system at a low and high level in such a way that changes can be understood and problems debugged easily. This entails:

    • thinking about failure modes and consequences
    • making the system verbose via logs, metrics and traces
    • collecting metrics to track certain paths
    • building features like history and changelogs to track changes
    • adding the ability to observe and test behaviour in production (eg via feature flags or parallel experiments)
    • adding the ability to simulate specific scenarios

    These measures aren't easy, and they add complexity too, but they will increase your confidence in the system. For example, in my Calendly story, we often didn't even have IDs we could rely on for looking up models, but had to use heuristics like "how long ago was this scheduled?" For cases like this, I used metrics, logs and occasional spot checks to compare the results of the heuristics with what was expected.

  • Iteration: Don't be afraid to rewrite things. This architecture may have made sense last year, but we've grown and now it's hard to keep up. There's never a perfect setup.

  • Git gud: Finally, you have to level up in your stack. Being knowledgeable in your tools will give you more options and save you from building things afresh. Additionally, not all implementations will be readable or easy to understand; building up your expertise will make them more approachable.
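To illustrate the observability point about heuristics, here is a minimal Python sketch of matching a record to a booking by booking time when no shared ID exists, counting each outcome and logging the suspicious ones, plus a spot-check helper that compares guesses against manually verified answers. Everything here is an assumption for illustration: the `Booking` shape, the 2-minute tolerance, and the use of a `Counter` as a stand-in for a real metrics client.

```python
import logging
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

log = logging.getLogger("appointment_matching")
metrics: Counter = Counter()  # stand-in for a real metrics client

# Assumed tolerance; a real value would come from observing production data.
MATCH_TOLERANCE = timedelta(minutes=2)


@dataclass
class Booking:
    id: str
    user_id: str
    booked_at: datetime


def match_by_booking_time(record_user: str, record_booked_at: datetime,
                          candidates: List[Booking]) -> Optional[Booking]:
    """Guess which booking a record refers to when no shared ID exists."""
    close = [b for b in candidates
             if b.user_id == record_user
             and abs(b.booked_at - record_booked_at) <= MATCH_TOLERANCE]
    if len(close) == 1:
        metrics["match.unique"] += 1
        return close[0]
    if not close:
        metrics["match.none"] += 1
        log.warning("no booking near %s for user %s", record_booked_at, record_user)
    else:
        metrics["match.ambiguous"] += 1
        log.warning("%d candidate bookings for user %s", len(close), record_user)
    return None  # refuse to guess rather than pick arbitrarily


def spot_check(guesses: Dict[str, Optional[str]], verified: Dict[str, str]) -> float:
    """Fraction of manually verified records where the heuristic's guess agreed."""
    checked = [rid for rid in verified if rid in guesses]
    if not checked:
        return 0.0
    agree = sum(1 for rid in checked if guesses[rid] == verified[rid])
    return agree / len(checked)


t = datetime(2024, 5, 1, 10, 0)
bookings = [Booking("b1", "u1", t), Booking("b2", "u1", t + timedelta(hours=3))]
found = match_by_booking_time("u1", t + timedelta(seconds=40), bookings)
print(found.id, metrics["match.unique"])  # b1 1
```

A rising `match.ambiguous` or `match.none` count, or a falling spot-check agreement rate, is the signal that the heuristic no longer fits reality and needs revisiting.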

Conclusion

Always question complexity. But don't assume it's always unnecessary. And when you need to take it on, be equipped for how to manage it.



I write about my software engineering learnings and experiments. Stay updated with Tentacle: tntcl.app/blog.shalvah.me.
