Should you even bother loadtesting

At my last four jobs, loadtesting was a big part of my job. At my last two jobs, I started loadtesting as soon as I landed.

Here’s the problem: it’s hard to loadtest well. People don’t agree on acceptance criteria, you can fall down an infinite rabbit hole of building tooling, but too little tooling is also a waste of time. Loadtesting is frequently not the low hanging fruit for scalability improvements.

My tests are often bad comparisons for prod traffic, or I’ve tested at the wrong scale, or the test is so hard to run that you can’t iterate effectively or be sure of what scenario you’re testing.

Based only on mature prod codebases I’ve worked in: I believe prod systems that have never been optimized can run with 2x the traffic on the same hardware with 2 weeks of analysis and 2 weeks of targeted refactoring. But this may not matter.

If your annual cloud bill costs less than a person it’s not worth hiring a person to lower it. There are tons of good tech companies that run on under 10 boxes, have really good margins, and the productivity cost of focusing on cloud perf is non-zero. Getting to 5 boxes might not be worth it.

Still, there are larger cos where the economics are different. Read on.

When to care
Workload
Passing criteria
Unit-of-scale testing
Conclusion – think twice before you take this on

When to care

Loadtesting matters when some of these things are true:

You don’t have continuous load, i.e. your traffic is bursty or you’re turning on a new service. If you do have continuous load, you can learn a lot about your load characteristics by running better metrics in prod.
You have good metrics. You’ve already made an investment in understsanding your stack by setting up metrics, logging, timing spans, maybe some long-period reporting. If you don’t have good metrics, you won’t be able to prove that the LT workload is similar to prod and interpreting LT results will be an uphill battle.
Horizontal scaling is a big chunk of your bill. For what shall it profit a man, if he shall gain the whole world, and lose his own soul.
Something falls over at scale. If there’s a non-reproable error that only happens at load, or your performance gets nonlinearly wonky at a certain scale point, consider LT if you can’t explain it any other way.
Idle hands. Loadtesting is basically the devil’s work, so idle hands are a prerequisite. If every scale-minded person at your place is busy with something else, LT needs a really good bottom-line justification.

Ultimately the goal is to either improve your product or reduce your cloud bill, and if you can make a bottom-line case for one of those things, and you’re being realistic about the project scope and the impact on your team of running LT, then fine, do it.

Workload

There are various ways to verify that your LT workload is equivalent to prod, and various bad things that can happen when you don’t.

Workload isn’t just about rough ratios of API calls. If your system is complex, other things that matter are:

Fixtures. Are you creating users / data on the fly or do you have a DB dump that you init from? Is the DB dump up to date? If the dataset is synthetic, have you captured all the edge cases?
Cache. Do you have a cache? How many layers? Unrealistic cache hit ratios can make your LT falsely appear to succeed or fail.
Realtime / order. Order of calls per user may matter. If route X does some critical setup for route Y, and users typically hit X first, and your LT doesn’t capture that, it’s not a realistic test even if you got the X to Y ratio right
Infra. Are the global bottlenecks the same in your LT env as prod? (Database, machine type, network caliber, binpacking of services to machines).
Hops. Does your LT use load balancers in the same way as prod, or is it sourcing the traffic internally? Load balancers aren’t infinite resources and as of the last time I hosted big software on AWS, some of them take time to scale up.

If you have a few of these factors in play, you may need to pull in the in-house experts who designed your system to explain to you what’s happening.

Replaying prod traffic is in theory a way to do LT but it’s not super maintainable and repeatable and you have to be pretty careful about getting a DB snapshot.

I’ve frequently gotten workload wrong at the beginning, gradually adding metrics to measure the things I care about (cache hit ratio, for example), and discarding LT results where those metrics are out of band.

If you’re not trying to LT in a whole separate env, you can direct test traffic at your prod cluster during your slow times (but this is not zero risk). Or you can use ‘dark testing’, i.e. mirror production traffic to an experimental backend service. But these techniques make it hard to iterate and experiment.

Passing criteria

What does it mean for your LT to pass? It depends.

Most frequently, API route timings and error rates are what matters.

You may not be trying to get a ‘pass’ at first, but instead to repro a production failure so that you can debug it.

You might be trying for a fail, i.e. instead of saying yes/no can I handle traffic T, trying to solve for the T at which your system falls over.

Given that you’ll want to run a bunch of traffic patterns through a bunch of code versions, the more repeatable and automatic your loadtest is, the better. Issues I’ve had with loadtest repeatability in the past:

Hard to split up metrics by test, especially when your metrics are not highly granular. Traditional metrics systems are frequently time series and not conducive to producing a report along arbitrary boundaries
If you can pull metrics per test run, you’ll have to rewrite a bunch of your prod alerts in a stats context to get binary pass/fail marks
Knowing that the LT system is in a bad state is more art than science, and it may get flummoxed more often than not. Repair steps are not automatic
Some aspect of the traffic sourcing, backend deployment or metrics collection is not automatic

This is what I mean about tooling; getting all this stuff set up is hard but running LT without this stuff is often a waste.

Unit-of-scale testing

Another way to approach LT is to test the smallest unit of your system you can isolate – typically a single server process or single machine. Then you can repro some load-based errors, still do a lot of experiments around max density, but maintain simplicity and iteration speed.

The fact that you’re not testing your central bottlenecks is a con but also a pro. It’s obviously bad because you won’t run into global limits like connection counts, certain networking limits, DB CPU, and other failures of non-horizontal services.

But it’s a pro because it’s more repeatable and way easier to design and run. It’s the same as how in a test suite, you don’t need to integration-test everything; at a certain point you should trust the glue.

If you ever set up a CI regression test for scalability, it will likely be at the single unit of scale.

Testing the range of 10-100 simultaneous connections can still reveal a lot of weirdness that wouldn’t show up in single-connection-at-a-time tests or a ‘unit-gration’ test suite test that hits API & DB. One example from my personal history is some lockup that happened only when a python process exceeded its DB pool.

Conclusion – think twice before you take this on

Scoping LT projects is really hard and the state of tools is pretty bad.