Hi everyone,
Everything should now be back to normal.
I’m sorry for the interruption. You’re quite right that several klaxons go off when things like this occur, but outages like this still happen more often than we would like.
There’s a lot of redundancy in the Anvil architecture, and we can cope with individual failures in many parts of the infrastructure. Unfortunately, a couple of single points of failure inevitably remain, and this hardware failure took out one of those: the server responsible for managing storage of your apps in git. This meant that many apps continued to work for a while, until caches elsewhere timed out, and then eventually all apps were inaccessible. By that time we were already well on our way to restoring the failed instance.
The maximum total downtime for any app was about 23 minutes, which is clearly much longer than we would like, and we recognise that this has a big impact on many of our customers. Please be assured that we are working both to reduce the number of single points of failure in our infrastructure and to recover more quickly when outages like this do occur.
If anyone is seeing any residual issues in the platform, please let us know by responding to this thread.
- Thanks, Ian.