You have to force quit because cache design is hard

I just finished a high-adrenaline refund convo with an app provider. Their server-side logs showed my device making successful requests, but screenshots from my device showed stale data. From my seat, their app wasn’t working.

Fascinatingly, the app vendor knew their app sucked and advised me to turn off wifi or force quit. Google’s help page on IAP (archive.org link) also tells you to ‘force stop, then reopen the app or game’.

I bet most smartphone users have in their lifetime ‘juiced the cache’ somehow – drag from the top of screen, force quit and reload, in the worst case uninstall/reinstall. We’ve also had the experience of different parts of the app having different information, like you sent the email reply but your inbox still thinks there’s a draft. Or of something completely stroking out when your connection goes down – not just clearing its data but becoming unresponsive to input. (Hint: don’t do IO in the main thread).

Even google maps, which in theory should be first class on android, freaks out if you leave it backgrounded for too long and erases your directions. This is inconvenient when I set up my route last night and am now offline in the subway and need to know which stop to get off at.

Cache design is a missing skill
The tools that can help don’t play well with languages
Technical example(s)
What’s next

Cache design is a missing skill

By cache design I mean the engineering sub-discipline of making sure that when you update data in one place, you update it everywhere. This is particularly hard when you’re storing data in multiple layers of caches, including some that are in the user’s pocket. Mobile devices are (1) hard to include in automated tests and (2) and may be running an old version of your software.

If you have the budget to get somebody good, it’s easy to hire good mobile devs. Do a couple hours of whiteboarding, make sure they’re using recent tools, keto up the snack fridge and make them an offer.

Not so for cache design. The hiring ecosystem isn’t there. People are starting to say all programmers are distributed programmers now, but the skills are diffusing slowly.

In particular, these parts of the ecosystem are missing:

Job title: Unless you ship a product that has caching as a feature (like cockroachdb or cloudflare), you probably don’t have a person(s) in your place who is responsible for this. That means when caching is broken you don’t know who to assign it to. You may not even recognize that your recent bugs all fall under this umbrella.

If one person is responding to all these issues, they may not have the authority to impose changes across the org (backend / mobile divide, for example). And if you need to hire for this, it’s hard to find because you don’t know what to call it.

Testing: Manual testing doesn’t always uncover these edge cases, because part of the skill of testing is getting a clean starting point. And in-house dogfooders are using your app on hot new hardware + silky smooth office wifi. Or cruddy office wifi from downstairs, and so the devs blame the wifi for logic errors.

Even the original author of the caching logic may not have a testing strategy. For any kind of logic with a big state space, it’s not uncommon to ship code with a ‘recovery case’ that hasn’t been run ever.

An example of this in a GCE postmortem from 2016:

In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

My read is that this recovery case was never end-to-end tested with the given versions of the software.

The tools that can help don’t play well with languages

Functional languages are in vogue now and we’re seeing functional features in daily drivers on a lot of frontend platforms (JS, kotlin, swift). And with functional tends to come strong types. But type safety ends at the water’s edge, or more accurately the serialization boundary.

Type safety has cured a lot of ills in our industry but it doesn’t help with caching, and in fact creates false confidence – one species of caching bug is mismatched types between client and server, or between the version that wrote data and the one that reads it.

Rule 1 of protobuf use at google is ‘all fields optional’. In other words, tolerate missing fields. This balloons the line count of every function that has to deal directly with externally stored data, but decreases the odds of schema changes leading to failure, either at the parsing step or in biz logic.

Amazon uses verification tools for their high-reliabilty systems. I’ve always loved this paper on TLA+ / PlusCal at amazon. They say:

So far we have used TLA+ on 10 large complex real-world systems. In every case TLA+ has added significant value, either finding subtle bugs that we are sure we would not have found by other means, or giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness.

But you can’t write runnable systems directly in TLA, it’s just ‘exhaustively executable pseudocode’. Taking an implementation that’s running across multiple codebases on different OSes and model-checking it requires hefty work from lots of different kinds of developers. Bridging the gap between verification models and runnable code is something we have yet to see in industrial languages, and would be a good improvement.

Technical example(s)

Some ultra-simple examples that will be familiar to anyone who’s been stuck in this particular well:

Change doesn’t trigger refresh: let’s say ‘change’ means ‘purchase’ in this case. A book you’ve bought may not immediately appear in your library, and may require banging on your refresh button to get it to show up. The mobile device in question was just used to make the purchase, it definitely knows what happened, it’s just not moving the data to the right place.

Real-life examples of this: sent messages from an email client don’t show up in ‘sent’ list under intermittent connectivity, or don’t clear draft preview in ‘inbox’ list. (This is in a widely used email client that came with my shitty phone). New purchases aren’t visible in library right away (on the VR interface to a major game purchase platform).

Bundled state: A home screen includes bundled preview data from different parts of the app. Data may be correct when you click through to the subsection, but the preview will be wrong on the homepage even after a force-refresh because the bundle is new and the server won’t invalidate it. It’s being stored as a blob and there aren’t individual TTLs on its subfields.

Summary doesn’t agree with list: summaries are updated through event bus messages that can be slow or missed, so you’ll see a transaction in a ‘list of all transactions’ screen but not in a tally (‘account balance’, for example, or ‘point summary’).

Detecting that you need to recover from a missed update isn’t easy. But users usually sense that something is wrong – a UX solution to this is that when a user has forced an update, invalidate all cache layers, not just those on the user’s device.

A schema design solution is to synchronously mark the summary dirty, then asynchronously do the (presumably expensive) computation to update it. If the user requests the summary between those two events, you can recompute it in-band or something (but make sure to use a vector clock for the dirty bit).

What’s next

I should caveat all this. My daily driver is an older android device which means I get hit with a triple whammy when it comes to app quality:

Many apps don’t exist for android at all
If they do, it’s an outsourced web view or less-tested implementation
Even if the android app is high-quality, it may not be tested on my OS version

Still, my main point is that software sucks and so do the people who write it (myself included) and I think that’s true on all OSes.

The good news is that distributed programming knowledge, including cache design, is diffusing through the landscape. We’ll see more tricks for preventing these failures, both built into our languages and pounded into our brains. And eventually even in our job descriptions.