Everyone is a FAANG engineer

I’m entirely convinced that basically every developer alive today heard the adage “dress for the job you want, not the job you have” and figured that, since they always wear jeans and a t-shirt anyway, they might as well apply it to their systems’ architecture. This explains why the stack of every single company I’ve seen is invariably AWS/GCP with at least thirty microservices (how else will you keep the code tidy?), a distributed datastore that charges per query but whose reads depend on how long it’s been since the last write, a convoluted orchestrator to make sure that you never know which actual computer your code runs on, autoscaling so random midnight breakages ensure you don’t get too complacent with your sleep schedule, and exactly two customers (well, potential customers).

I don’t know the exact point when everything went wrong, but I suspect it was somewhere in the 2000s, when Google introduced Map/Reduce and every developer thought “well that’s cool, I’m going to base all our production code on that paradigm, and eventually I will hopefully understand how it works”. We’ve been in “FAANG architecture by default” hell ever since.

FAANG architecture by default

[Image: I asked ChatGPT for some filler images. Thanks ChatGPT, thanks for nothing.]

The first problem every startup solves is scalability. The first problem every startup should solve is “how do we have enough money to not go bust in two months”, but that’s a hard problem, whereas scalability is trivially solvable by reading a few engineering blogs, and anyway it’s not like anyone will ever call you out on it, since you’ll go bust in two months.

Questions like “how do we make something people want” and “how do we make people give us more money” are as uninteresting to developers as “how do elderly people have sex”. In both cases, the answer is “with great difficulty and by taking risks”, but developers would much rather answer the question “how do we make our infrastructure scale to millions of users”, which, to a developer, has the same answer as the elderly sex issue: It’s not hard.

You just get some AWS products, abstract the hardware away, turn all your code into functions instead of services, put a network under those functions, and boom! Infinite scalability, and it costs nothing because we got a few million in AWS credits, which means it’s all free forever, where “forever” means “for much longer than we’ll be in business”.

The issue with scalability

Do you know what the difference between Google and your startup is? It’s definitely not scalability, you’ve solved that problem. It’s that Google has billions upon billions with which to pay for that scalability, which is really good because scalability is expensive. Scalability is expensive because it’s complicated, and complexity doesn’t come cheap, in whatever form you encounter it.

The tragedy I see these days is that building scalable services is the height of tech fashion, and every engineer wants a fashionable CV that will help her get the next job. Nobody has ever gotten a job with a CV that said “I don’t know AWS and Kubernetes, but I know how to fulfill all your SLAs for a $100/mo infra bill”, because every hiring manager has stopped reading before the comma, and that engineer has starved to death.

Anyway, all this is to say that complexity is expensive, and scalability needs a whole bunch of complexity, so make sure you don’t pay that cost until you absolutely have to. Even then, you should accept some slowness if it buys you a few more months on the simple architecture, because those are months in which you’ll keep shipping much faster than your competitors.

The alternative

Look, honestly, I get it. Who among us really has the wherewithal to avoid having all the separate components of our architecture call each other in obscure and convoluted ways? Isn’t making each component its own service an elegant solution?

No it isn’t, and if that was the first solution you jumped to, I don’t want to know how many separate services you managed to make a distributed monolith out of.

First of all, you should come to terms with the idea that you really should deploy a monolith. Putting a network under function calls has always been a horrific atrocity, one to be committed only when you absolutely have no choice, not the first thing you do when setting up a project. And, before you ask, no, “monolith” doesn’t mean you’ll only have one web worker for your code: you can deploy the same code to multiple servers and load-balance between them. You’ll usually only have one database, though, and that setup should be enough all the way up to the point where you have so many customers you really don’t know what to do with all that money.

What has worked well for me is the following architecture:

  • All the code that’s part of our architecture goes into the one monolith we have.
  • The monolith is composed of separate modules (modules which all run together in the same process).
  • Modules cannot call each other, except through specific interfaces (for our Python monolith, we put those in a file called some_module/api.py, so other modules can do from some_module.api import some_function, SomeClass and call things that way; there’s a sketch of this right after the list).
  • All of these interfaces are statically typed: what the functions accept and what they return is declared up front, with the types usually being Pydantic classes (no passing around a bunch of opaque dicts!).
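
To make that concrete, here’s a minimal sketch of what this looks like in practice. The file layout, the Invoice model, and the function names are invented for illustration; only the some_module/api.py convention comes from the list above.

```python
# some_module/api.py -- the only file other modules are allowed to import from.
# (Models and names are illustrative; only the api.py convention is from the post.)
from pydantic import BaseModel


class Invoice(BaseModel):
    id: int
    customer_id: int
    total_cents: int


def get_invoice(invoice_id: int) -> Invoice:
    """Public, typed entry point into this module.

    In real code this would call the module's private internals
    (e.g. some_module/_repository.py), which other modules never
    import directly.
    """
    return Invoice(id=invoice_id, customer_id=1, total_cents=4_200)
```

And a caller in another module only ever goes through that interface:

```python
# billing_module/charges.py -- a caller living in a different module.
from some_module.api import Invoice, get_invoice


def invoice_total_dollars(invoice_id: int) -> float:
    invoice: Invoice = get_invoice(invoice_id)
    return invoice.total_cents / 100
```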

These rules are enforced via automated checks in CI and in the git pre-commit stage. I’ve heard Tach is a good tool for enforcing them programmatically, though I haven’t used it yet.
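
If you’d rather not pull in a tool yet, even a small homegrown check gets you most of the way there. The sketch below is just an illustration of the idea, not what we actually run, and the module names are placeholders; a dedicated tool like Tach will do the same job with a lot less babysitting.

```python
# check_module_boundaries.py -- a rough homegrown boundary check (a sketch;
# MODULES is a placeholder list of your monolith's top-level modules).
import ast
import pathlib
import sys

MODULES = {"some_module", "billing_module"}


def violations(path: pathlib.Path):
    tree = ast.parse(path.read_text(), filename=str(path))
    owner = path.parts[0]  # which module this file lives in
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            target = node.module.split(".")[0]
            # Reaching into another module anywhere except its api.py is a violation.
            if target in MODULES and target != owner and node.module != f"{target}.api":
                yield f"{path}:{node.lineno}: imports {node.module}"


def main() -> int:
    found = [v for p in pathlib.Path(".").rglob("*.py") for v in violations(p)]
    print("\n".join(found))
    return 1 if found else 0  # non-zero exit fails CI / the pre-commit hook


if __name__ == "__main__":
    sys.exit(main())
```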

The benefits

The above simple rules bring massive benefits. First of all, you’re guaranteed to always have clean separation between modules, as they can’t reach into each other’s internals, and they have to always call the designated interfaces. This means that you won’t end up with the dreaded “ball of yarn” monolith that nobody can ever debug or extend.

Another incredible benefit is that you’ll be able to change any API whenever you want, without having to version your APIs, worry about who’s using which version, write backwards-compatible endpoints, or any of that. If you change an API and break a caller, your type checker will immediately tell you exactly what you broke. You then just go to each of those call sites, update them to the new API, and you’re done; the change is atomic, so you can deploy the module and all its callers at the same exact time. Amazing.
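
As a tiny, made-up illustration, continuing the invoice sketch from earlier: if a public function grows a required argument, every stale call site fails the type check before anything ships.

```python
# some_module/api.py -- hypothetical change: get_invoice now also needs a currency.
from pydantic import BaseModel


class Invoice(BaseModel):
    id: int
    total_cents: int


def get_invoice(invoice_id: int, currency: str) -> Invoice:
    return Invoice(id=invoice_id, total_cents=4_200)


# billing_module/charges.py -- a caller that wasn't updated yet. The module
# still imports fine, but mypy/pyright will flag the call below as missing
# the "currency" argument, so the breakage is caught before it ships:
def stale_caller() -> Invoice:
    return get_invoice(42)  # type checker error: "currency" is missing
```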

You have rich type information at every call site. You’ll no longer have to debug HTTP endpoints with payloads that could have whatever random crap in them. Now, if your function gets called, you know exactly what’s in the input arguments, and if you’re the caller, you know exactly what you’ll get back. You’ll never have to deal with random dicts nested ten levels deep.
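
The day-to-day difference looks roughly like this (names invented for illustration):

```python
from pydantic import BaseModel


class Address(BaseModel):
    city: str


class Customer(BaseModel):
    name: str
    address: Address


class Order(BaseModel):
    id: int
    customer: Customer


# Over HTTP, the payload is whatever the other side felt like sending,
# and you find out at runtime whether the keys you hoped for exist:
payload = {"order": {"customer": {"address": {"city": "Athens"}}}}
city = payload["order"]["customer"]["address"]["city"]  # KeyError roulette

# In-process, the same data arrives as a typed object: fields autocomplete,
# a typo is a type-checker error, and validation already happened at the boundary.
order = Order(id=1, customer=Customer(name="Alice", address=Address(city="Athens")))
city = order.customer.address.city
```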

It also kind of goes without saying that there’s a massive speed difference as well. Calling a function in your own process is a few thousand times faster than a network roundtrip, and those savings add up.
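
Back-of-envelope, using typical ballpark figures rather than anything measured for this post: a local function call costs on the order of 100 nanoseconds, while even a same-datacenter roundtrip is around half a millisecond, which is where the “few thousand times” comes from. You can sanity-check the local half yourself:

```python
# Rough sanity check of the local-call side; the 0.5 ms network figure below
# is a commonly quoted intra-datacenter ballpark, not something measured here.
import timeit


def add(a: int, b: int) -> int:
    return a + b


runs = 1_000_000
per_call_ns = timeit.timeit("add(1, 2)", globals=globals(), number=runs) / runs * 1e9
assumed_roundtrip_ns = 500_000  # ~0.5 ms
print(f"local call: ~{per_call_ns:.0f} ns")
print(f"vs. a 0.5 ms roundtrip: ~{assumed_roundtrip_ns / per_call_ns:,.0f}x faster in-process")
```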

The downsides

This isn’t all upsides; there are some disadvantages to the method as well:

It’s not as easy to scale each component on its own. If you have an authentication module that everything calls, and which is I/O heavy, it’s hard to split it out into its own thing, because it now gets called not via the network but via plain in-process function calls. You can still add more workers to your entire monolith, as well as scale each worker vertically, but you can’t add more resources to one specific module; you have to add them to all of the modules at once. This hasn’t been a big problem for us in practice, since you can usually just get a bigger server (and that’s what you should do), but it is a downside.

That’s it for downsides, really.

The other stuff

By far the most horrified question I get when I describe this architecture is “what, we’ll have to code in a monorepo?! (disgusted face)”. No, no you don’t: you can split your modules into their own repos and work on them that way, though you then lose the nice property of being able to atomically deploy changes across your entire codebase when an API changes. That’s a tradeoff you’ll have to decide on for yourself, though.

Epilogue

All in all, this approach has worked well for us. The main gist of the matter, though, is this:

Avoid paying the cost of a distributed architecture for as long as you can. Everybody else pays it up-front, and almost nobody gets a positive ROI on that, so you’ll be very far ahead of the pack with just this one simple trick.

If you have any feedback or hatemail, tweet or toot at me, or email me directly.