[RESOLVED] Anvil Incident - Apps not loading

Yeah, all back for me now.

yes all back now. thank you

Hi everyone,

Everything should now be back to normal.

I’m sorry for the interruption. You’re quite right that several klaxons go off when things like this occur, but it still happens more often than we would like.

There’s a lot of redundancy in the Anvil architecture, and we can cope with individual failures in many parts of the infrastructure. Unfortunately, a couple of single points of failure inevitably remain, and this hardware failure took out one of those: the server responsible for managing storage of your apps in git. This meant that many apps continued to work for a while, until caches elsewhere timed out, and then eventually all apps were inaccessible. By that time we were already well on our way to restoring the failed instance.

The maximum total downtime for any app was about 23 minutes, which is clearly much longer than we would like, and we recognise that this has a big impact on many of our customers. Please be assured that we are working both to reduce the number of single points of failure in our infrastructure and to recover more quickly when outages like this do occur.

If anyone is seeing any residual issues in the platform, please let us know by responding to this thread.

- Thanks, Ian.

Yes - my app is also not working

“You may have been logged out - please refresh the page.”

@Ilqar Thanks for letting us know, sorry about that. Please try again now - I believe that issue is now resolved too.

Thanks. The app seems to load now, but calling something from the database is not possible yet.
11.47 UTC

@kempynck.jan All database operations are working okay as far as we can see at this end. Please can you start a new thread to describe the particular issue you’re seeing? If you can let us know what you’re trying to do, and what error you’re seeing, we can investigate. Thanks!

Ok, I deleted the most recently added row of my database, which was stored during the outage, and now it works fine. Thank you

App is back again - Thank you very much Ian !

I can’t seem to merge branches.

This pops up.

And a refresh yields the same issue.

Ok, this is not a residual issue per se, but one of my 8 or so uplink apps went into a loop of
Anvil fatal error: Repository unavailable: This app is currently offline - please try again in a few minutes.

forever, until I manually restarted it. If it had just crashed, it would have reset itself.

Only mentioning this because none of my other uplinks did this, so if you are having trouble with an uplink you may need a manual restart as I did. The loop continued way past the point when everything came back up.

(this log file is now 11,000 lines long)
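For anyone who would rather not restart by hand, here is a minimal sketch of a watchdog that restarts a child process once the same fatal line has repeated several times in a row. It is an illustration only, not an Anvil feature: `hit_threshold`, `supervise`, the threshold of 5, and the pause between restarts are all hypothetical names and values, and `cmd` is assumed to be whatever command launches your uplink script and prints its log to stdout.

```python
import subprocess
import time

# Text taken from the error message quoted above; adjust if yours differs.
FATAL_MARKER = "Repository unavailable"


def hit_threshold(lines, marker=FATAL_MARKER, threshold=5):
    """Return True as soon as `marker` appears in `threshold` consecutive lines."""
    streak = 0
    for line in lines:
        streak = streak + 1 if marker in line else 0
        if streak >= threshold:
            return True
    return False


def supervise(cmd, max_restarts=3, pause_seconds=30):
    """Run `cmd`, restarting it whenever its output gets stuck on the marker.

    Returns the exit code if the process ends on its own, or None if the
    restart budget is used up.
    """
    for _ in range(max_restarts + 1):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        if not hit_threshold(proc.stdout):
            # Output ended without a streak: the process exited by itself.
            return proc.wait()
        # Stuck in the error loop: stop the process and try again after a pause.
        proc.terminate()
        proc.wait()
        time.sleep(pause_seconds)
    return None
```

Usage would look something like `supervise(["python", "my_uplink.py"])`, where `my_uplink.py` is a placeholder for your own script. Requiring a consecutive streak rather than a single match is a deliberate choice, so that one transient error during a short blip does not trigger a restart.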

Given the recent discussions about LTS and the perceived lack of reliability of Anvil, I would like to point out that a 23-minute downtime due to a hardware failure in one of the few non-redundant areas of the Anvil infrastructure is still better than what I could achieve with a self-hosted solution, regardless of whether it is LTS or not.
My thanks to Anvil for this.
:slight_smile:

Try to reset the branch to an older commit.
See this.

Downtime:
If the downtime was a hardware issue, something you could not have prepared for, the answer might be yes. BUT if any change made by Anvil (e.g. replacing hardware, reconnecting a wire, etc.) caused the failure, then it is a very different issue. If you change something yourself, you can be prepared; if someone else changes it in the background, you will only find out when something goes wrong.

If it was a HW failure that had nothing to do with Anvil (no changes made that might have caused it), then it depends on the provider. If you compare it to Google or Microsoft, I believe Anvil was faster and more responsive on the forums, so you can spot the problem sooner. However, if you compare it to some other providers, the really good ones hosting stable vservers, they send an email IMMEDIATELY when an attack, HW failure, or downtime is happening. So you are aware of it, you can communicate with your clients if needed, and you can be sure the problem is under investigation and someone will fix it. It is not your task anymore, other than notifying your customers.

LTS is not about downtime; it is about changes made on the fly to a live system. A HW failure would cause issues with LTS as well.

Btw, I assume this issue also blocked any running apps outside the editor. It was not only an editor/developer issue; customers would also have been blocked from using any Anvil apps, right?
(It’s scary for me to see… I’m on a trial period, testing the system, and I’ve already seen multiple small hiccups, which makes me rethink Anvil.)

Anyway, in case of a similar failure I would really recommend that the Anvil support team send an email to Anvil users, so we are aware of it even if we are not working on a project at the moment, and we don’t need to search or post in the forums to figure out what is going on.

It would be relatively easy for Anvil to catch issues like this if it monitored for them in the background (e.g. an email or log entry sent directly to Anvil support about a crash, a problem starting up apps, DB access, etc.) instead of waiting for users’ reports. I have built similar systems where a big organisation’s IT department or dev team was notified immediately when a crash or similar error happened, so they knew about it before the customer called them… It looks very different from the customer’s point of view when the developer is not asking but telling you what happened, since they can see the error in detail in a generated report.
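To make the idea concrete, here is a minimal sketch of the kind of generated report I mean, using only the Python standard library. It says nothing about how Anvil’s own monitoring works. `build_crash_report` and `send_report` are hypothetical helper names, the addresses are placeholders, and the SMTP host is an assumption you would replace with your own mail server.

```python
import smtplib
import traceback
from email.message import EmailMessage


def build_crash_report(exc, app_name, to_addr, from_addr):
    """Turn an exception into an email with the full traceback as the body."""
    # Use only the first line of the message so the Subject header stays valid.
    summary = (str(exc).splitlines() or [""])[0]
    msg = EmailMessage()
    msg["Subject"] = f"[{app_name}] {type(exc).__name__}: {summary}"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    )
    return msg


def send_report(msg, smtp_host="localhost"):
    """Send the report; assumes an SMTP server is reachable at `smtp_host`."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

In practice you would wrap the app’s entry point in a `try`/`except`, call `build_crash_report` on the caught exception, and pass the result to `send_report`, so the team sees the traceback before any customer calls.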

I have always been impressed by how quickly Anvil staff responds to app downtime issues. But, as someone who now runs a production system, it would be awfully nice to know about the downtime as early as possible so I can mitigate the effects on my end (in terms of customer support, a notice on the static page, etc).

An email out to all Anvil customers on a paid plan about an outage affecting running apps does seem like a reasonable thing to expect.

For now, there’s a workaround available, to set up an email alert for yourself for whenever something is posted to the Announcements forum category (which is how I learned of this incident this morning).

edit: I’m unclear on what the lag was between the first user report in the forum and this Announcement post, but I received the email notification around 20 minutes after the first user report.

I use the free tier of UptimeRobot: Preferred Application Monitoring Solutions? - #3 by stefano.menci
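If you are curious what a check like that boils down to, here is a rough sketch of a single HTTP probe in Python. This is not how UptimeRobot is implemented; `is_up` is a hypothetical helper, the URL in the usage note is a placeholder, and the `opener` parameter exists only so the function can be tried out without network access.

```python
import urllib.request


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx or 3xx status within `timeout`."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx) and timeouts are all OSError.
        return False
```

You could run something like `is_up("https://your-app.example.com")` from a scheduled job on a machine outside Anvil and alert yourself when it returns False a few times in a row. A hosted monitor does the same thing from several locations and handles the alerting for you.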

yeah, just found it a few min before… Thanks!

Also, thanks to @hugetim for pointing out the email alert feature!

Btw, don’t get me wrong. It might seem I have “trouble” with Anvil. I like, no, I LOVE what Anvil is doing and how easy and quick it is to work with. That is why I’m still here… I just might have a bit more concern about QA, stability and production environments…

In the last week I had two outages.

One was this failure. I got immediately notified about the Anvil failure on my app by UptimeRobot and I got an explanation on the forum.

Fortunately this happened during the night, so my users were not affected and no one called me.

The other was a Google Maps service returning NOT_FOUND as the distance between two zip codes for about 30 minutes. I would expect better redundancy and reliability from Google; instead I got calls from clients who couldn’t place an order. My logs show the call returning NOT_FOUND for about 30 minutes, after which it returned the correct value. There was no way for me to know about the failure (I can’t possibly set up an UptimeRobot monitor for every pair of zip codes!), I didn’t get any explanation from Google, and it lasted longer than the Anvil failure.

In this case a client called because they couldn’t place an order on the ecommerce website.

I am not denying that Anvil has more issues than the typical provider. My point is that there is no provider that is 100% reliable and that the benefits provided by Anvil’s development far outweigh the downtime.

Obviously my point of view applies to my use case: a single developer maintaining apps mostly for internal use (some, like the one with HTTP endpoints used by the ecommerce site, are also used by external users). Your requirements may be different.

What’s the typical provider in this space? I can’t think of another Python-based web browser editor and app framework system that is as feature-filled, transparent, community-focused, and business-savvy as Anvil and its team.
