A lot of small Outages/Restarts over the last few days

Hi folks,

Sorry for being later to this thread than we normally aim for. Suffice it to say that if you’re seeing a 503 error (aka the page with the Anvil logo), something big has fallen over and all the alarms are going off on our end.

This incident should now be resolved, and normal service resumed. If you see anything else, please let us know.


A little more behind-the-curtain detail: We’ve been experiencing escalating load on the main data store that backs the Anvil hosted platform - that includes both platform data (user accounts, apps, etc) and Data Tables data - and it’s caused a bit of platform instability. (Unfortunately, @nickantonaccio, this is a “scaling a growing service” problem rather than a “deployed bad code” problem. And, as I’ll talk about in a minute, the only way out involves shipping more new code, which makes “LTS” not really an option.)

We’re responding by improving our data storage in a few ways, including:

  • Shipping performance improvements to increase throughput and remove bottlenecks (for example, we have fully rebuilt how Media is stored in Data Tables, addressing the underlying cause of this outage last month), and several less braggable-about but crucial efficiency improvements;
  • Upgrading our version of Postgres to a version with better scalability characteristics for the access patterns we use in Data Tables; and
  • More fully isolating shared Data Tables storage from the platform bookkeeping data, so that saturating one does not produce a slowdown in the other (there's a rough sketch of this idea just below).
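
To make that isolation point a bit more concrete: if bookkeeping queries and Data Tables queries share one pool of database connections (or one database), a flood of Data Tables traffic can starve everything else. Giving each workload its own pool and, ultimately, its own storage bounds the blast radius. Here's a minimal sketch of the idea in Python; the DSNs, pool sizes and helper function are illustrative, not Anvil's actual architecture:

```python
from psycopg2.pool import ThreadedConnectionPool

# Illustrative DSNs and pool sizes, not Anvil's real configuration.
PLATFORM_DSN = "dbname=platform host=db-platform"
DATA_TABLES_DSN = "dbname=data_tables host=db-datatables"

# Two independent pools: a stampede of Data Tables queries can exhaust
# datatables_pool, but platform bookkeeping still has connections to use.
platform_pool = ThreadedConnectionPool(minconn=2, maxconn=20, dsn=PLATFORM_DSN)
datatables_pool = ThreadedConnectionPool(minconn=2, maxconn=50, dsn=DATA_TABLES_DSN)

def run_platform_query(sql, params=()):
    """Run a bookkeeping query using only the platform pool."""
    conn = platform_pool.getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        platform_pool.putconn(conn)
```

The specific library doesn't matter; the point is that exhausting one pool leaves the other untouched.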

Doing this on a live service that’s processing many thousands of queries per second feels a little bit like open-heart surgery. We can’t just shut down for the weekend and dump/restore or migrate all of our data. What’s worse – and the proximate cause of recent outages – is that some of these changes require migrations such as “rewriting every Media object stored in Data Tables”. That’s terabytes of data, and this produces a lot of load on the data store (the same data store that’s causing problems by being overloaded). If we slip up (as we did a couple of times today, trying to migrate too much, too fast), it can all go pear-shaped very fast. In the extreme case, enough of our platform can be wedged trying to get a word in with the data store that our load-balancer decides that all the servers are down and there’s nowhere to send your request, and hands out 503 errors for a few moments.
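
For the curious, the shape of a “migrate gently” job looks roughly like the sketch below. This is illustrative only: the `media` table, its columns, `rewrite_blob` and the batch/pause numbers are all made up rather than taken from our actual migration code.

```python
import time
import psycopg2

# Illustrative knobs: the real migration's batch size and pacing are unknown.
BATCH_SIZE = 200      # Media objects rewritten per transaction
PAUSE_SECONDS = 2.0   # breathing room for live traffic between batches

def rewrite_blob(old_bytes):
    """Placeholder for converting a Media object to the new storage format."""
    return old_bytes  # hypothetical: the real conversion is Anvil-internal

def migrate_media(dsn):
    """Rewrite stored Media in small, throttled batches.

    Assumes a hypothetical `media` table with `id`, `data` and `new_data`
    columns; none of these names come from Anvil's actual schema.
    """
    conn = psycopg2.connect(dsn)
    try:
        while True:
            # One transaction per batch, so a failure only rolls back
            # a bounded amount of work.
            with conn, conn.cursor() as cur:
                cur.execute(
                    "SELECT id, data FROM media WHERE new_data IS NULL "
                    "ORDER BY id LIMIT %s FOR UPDATE SKIP LOCKED",
                    (BATCH_SIZE,),
                )
                rows = cur.fetchall()
                if not rows:
                    break  # nothing left to migrate
                for media_id, old in rows:
                    cur.execute(
                        "UPDATE media SET new_data = %s WHERE id = %s",
                        (psycopg2.Binary(rewrite_blob(bytes(old))), media_id),
                    )
            # Pause so live queries on the same data store get a look-in.
            time.sleep(PAUSE_SECONDS)
    finally:
        conn.close()
```

The throttle is the important part: each batch is a bounded amount of work in its own transaction, and the pause between batches leaves headroom for live traffic on the same data store. Get the batch size or pacing wrong (too much, too fast) and the migration itself becomes the overload.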

A note to @mark.breuss: Normally, your Dedicated plan insulates you from issues like this. Unfortunately, the “blast radius” of this data store includes parts of the system that your requests pass through on the way to your dedicated server, so you’ve been affected this time. (We’re working on making this more robust and isolated – see above. If you’re on a Dedicated plan, problems in shared Data Tables should not become your problems!)

Finally, an apology to everyone who’s been receiving an earful from their customers, senior management, or anyone else about this. I know that feeling really sucks, especially when the cause is not within your control. This incident should now be resolved, some of the performance improvements I talked about are already kicking in, and we’re racing towards the next Big Upgrade as quickly as we can, which should buy us significant breathing space.
