Hi all,
As you may have noticed, the hosted Anvil service experienced some stability issues last week. Starting on Wednesday the 3rd and running through Friday the 5th, we saw intermittent outages, ranging from reduced performance and sporadic Data Table access failures to (for an agonising few minutes) total inaccessibility of all hosted Anvil apps.
The summary is “this wasn’t good, and we’re taking steps to avoid it happening again”. In the interest of transparency, I’d like to share some of our post-mortem conclusions about this incident:
The original impetus here was scaling – over the last few months, Anvil’s traffic and user numbers have grown enormously, and that growth exposed some weak spots in our infrastructure. While these are mostly “good problems to have”, they are still problems, and I’d like to apologise for the bumpy ride. This is definitely not the level of service we aim to provide!
The proximate causes were issues in our work-scheduling and Postgres connection pool systems, both of which had been creaking under the increased load and had contributed to a number of minor incidents over the preceding weeks. Last week, we shipped upgrades to both of those systems to improve their scalability and observability. Each of those upgrades turned out to contain bugs that did not reveal themselves in our testing but showed up semi-reliably under production load, and they took turns kneecapping Anvil for the next two and a half days.
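To make that failure mode a little more concrete, here is a minimal sketch of a bounded Postgres connection pool in Python. This is emphatically not our actual code (the pool sizes, DSN and helper names are invented for illustration); it just shows why this class of bug can look perfectly healthy in testing and still fall over under production concurrency.

```python
# Illustrative sketch only, not the actual Anvil code. It shows the general
# shape of a bounded Postgres connection pool, and why an unreturned
# connection is invisible in light testing but exhausts the pool under
# sustained concurrent load.
import contextlib

import psycopg2.pool

# A small, bounded pool; the sizes and DSN here are placeholder examples.
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=2, maxconn=20, dsn="dbname=example user=example"
)

@contextlib.contextmanager
def borrowed_connection():
    """Check a connection out of the pool and guarantee it goes back,
    even if the caller raises. Dropping the finally-block (or returning
    early on an error path) leaks one connection per failure: harmless
    in a quick test, fatal once production traffic hits that path often
    enough to drain all 20 slots."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)

def fetch_one(query, params=()):
    with borrowed_connection() as conn:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchone()
```

The point of the sketch is just that this class of failure scales with the number of requests in flight, which is exactly what synthetic test load tends not to reproduce.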
As the first issues appeared and the alerts went off, we made the decision to “roll forward” and fix these issues rather than “roll back” to the version that was deployed on Tuesday morning. There were several reasons behind this, some good, some bad: We originally anticipated a quick fix; we did not know how to reproduce these issues except under the peculiarities of production load (some of them turned out to be concurrency bugs that would have been very difficult to tease out under synthetic load); the release contained several other important fixes (accumulated while we held back the new upgrades for testing); and we knew that the pressures of scaling were only going to get more acute, and possibly cause more outages, if we rolled back. Ultimately, however, it was this decision not to roll back early that turned a minor incident into a major one.
Contributing factors include the “fog of war”: By Friday, the team had spent two days responding to an apparently never-ending series of alerts, was somewhat sleep-deprived, and had started to develop tunnel vision. By that point most of the issues were in fact resolved, but Friday evening’s database outage, for example, resulted from an operator error at a live production console, a mistake that would have been much less likely with a fresher mind.
Finally, a word about communication – we have definitely heard the requests for more information! During an outage, communicating is difficult: You don’t have all the information, the things you think you know might be wrong, and the top priority is restoring service rather than carefully crafting a message that conveys both that uncertainty and what we do know to a justifiably irritated audience. We do, however, understand that many of you would have really benefited from knowing that the team was aware of the problem and working on it, and we are working out how best to communicate this in future.
The good news: With the new subsystems deployed, Anvil is now ready for another round of users to pour in and start building things, and the increased observability should give us tools to minimise or avoid scaling-related problems. We’re thinking about how best to provide effective status updates while staying focused on the problem, and we’re working on deeper technical initiatives to better manage deployment.
Thank you all for your patience as we go through these growing pains; we hope the next order of magnitude will be a bit smoother.