A lot of small Outages/Restarts over the last days

Is this specific to our server or is this something global?

4 Likes

Having some problems as well - loading is veryyyyy slow

2 Likes

We haven't seen speed issues but a lot of confused customer calls about this logo:

It's getting to a point where confused customer calls turn into angry customer calls.

2 Likes

Yes I have been experiencing the same thing. Luckily my tools are in-house to the company. Unluckily, my upper management is NOT happy with the stability of my applications.

1 Like

Same here. I’ve had intermittent slowdowns and trouble accessing apps.

Same here. Constant timeout and server errors today.

I can’t say I’ve had a single problem lately, which is odd, since when everyone else has problems I usually do as well.

Only one of my uplink logs caught the outages:
(the integers are Unix timestamps taken right after the script restarts)

'Connection to Anvil Uplink server lost'
1680616997

'Connection to Anvil Uplink server lost'

1680699796

'Connection to Anvil Uplink server lost'
[Errno 11001] getaddrinfo failed
1680700044

'Connection to Anvil Uplink server lost'
1680782844

'Connection to Anvil Uplink server lost'
Invalid response status: b'503' b'Service Unavailable'
1680783037

1680783038

1680783038

1680783039

1680783039

'Connection to Anvil Uplink server lost'
1680790443
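
To line these restarts up against other logs, the Unix timestamps can be converted to readable UTC times. This is a minimal sketch using only the standard library; the helper names `to_utc` and `format_restarts` are made up for illustration, and the list is simply the timestamps copied from the log above.

```python
from datetime import datetime, timezone

# Timestamps copied from the uplink log above (one per script restart).
restart_timestamps = [
    1680616997, 1680699796, 1680700044, 1680782844,
    1680783037, 1680783038, 1680783038, 1680783039, 1680783039,
    1680790443,
]


def to_utc(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def format_restarts(timestamps):
    """Return one readable line per restart, with the gap since the previous one."""
    lines = []
    previous = None
    for ts in timestamps:
        gap = "" if previous is None else f" (+{ts - previous}s since previous)"
        lines.append(f"{to_utc(ts):%Y-%m-%d %H:%M:%S} UTC{gap}")
        previous = ts
    return lines


for line in format_restarts(restart_timestamps):
    print(line)
```

The gaps make it easy to tell isolated disconnects apart from the burst of restarts that follows the 503 response.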


1 Like

Seeing these little complaints and signs of problems pop up regularly is emotionally painful. I absolutely love using Anvil, but in the end reliability is more important to me than productivity (even though I treasure productivity - if the system can’t be trusted, then productivity goes out the window) :frowning: Luckily, for my needs, the open source version currently provides a solution for in-house needs, but in the past I’ve experienced an entire ecosystem I relied upon disappear, and I’m very guarded about ever getting myself into that position again…

3 Likes

There was a brief discussion about an ‘LTS’ version of Anvil. I have to wonder if having a system like that in place for production apps is a possibility.

1 Like

Not everything is Anvil-related in the end, either; AWS is just plain :poop: sometimes.

At one of the last places I worked we had heavily integrated Zendesk email into our customer service workflows, and the European Amazon EC2 cluster/pod that ran our instance of ZD would sometimes go down for hours or days with no resolution in sight other than :person_shrugging: from Zendesk and :no_mouth: from Amazon.

2 Likes

Please don’t get me wrong, Anvil seems to provide top notch technical service and the company appears to be fantastically talented. I just see a potential need to look into how to provide more rock solid support for existing apps that are fully working. Some sort of ‘LTS’ system seems to be a useful proposition that could put a lot of users at ease…

1 Like

Dear Meredydd,
could we please get some information about what’s going on?
I guess it is connected to the release of accelerated tables on April 14th.
Cheers Aaron

1 Like

My biggest annoyance is that Anvil has yet to make any statements or declare an ETR.

1 Like

I’m assuming they’re well into their investigation right now. Let’s wait it out. :grimacing:

my editor is down as well

I haven’t had any issues myself (mostly just editing an app on-and-off over the past 24 hours). But I want to second the concern about that logo popping up. Such a message may be disconcerting to users of apps-in-production who have no idea what “Anvil” is.

1 Like

Hi folks,

Sorry for being later to this thread than we normally aim for. Suffice it to say that if you’re seeing a 503 error (aka the page with the Anvil logo), something big has fallen over and all the alarms are going off on our end.

This incident should now be resolved, and normal service resumed. If you see anything else, please let us know.


A little more behind-the-curtain detail: We’ve been experiencing escalating load on the main data store that backs the Anvil hosted platform - that includes both platform data (user accounts, apps, etc) and Data Tables data - and it’s caused a bit of platform instability. (Unfortunately, @nickantonaccio, this is a “scaling a growing service” problem rather than a “deployed bad code” problem. And, as I’ll talk about in a minute, the only way out involves shipping more new code, which makes “LTS” not really an option.)

We’re responding by improving our data storage in a few ways, including:

  • Shipping performance improvements to improve throughput and remove bottlenecks (for example, we have fully rebuilt how Media is stored in Data Tables, addressing the underlying cause of this outage last month), and several less braggable-about but crucial efficiency improvements;
  • Upgrading our version of Postgres to a version with better scalability characteristics for the access patterns we use in Data Tables; and
  • More fully isolating shared Data Tables storage from the platform bookkeeping data, so that saturating one does not produce a slowdown in the other.

Doing this on a live service that’s processing many thousands of queries per second feels a little bit like open-heart surgery. We can’t just shut down for the weekend and dump/restore or migrate all of our data. What’s worse – and the proximate cause of recent outages – is that some of these changes require migrations such as “rewriting every Media object stored in Data Tables”. That’s terabytes of data, and this produces a lot of load on the data store (the same data store that’s causing problems by being overloaded). If we slip up (as we did a couple of times today, trying to migrate too much, too fast), it can all go pear-shaped very fast. In the extreme case, enough of our platform can be wedged trying to get a word in with the data store that our load-balancer decides that all the servers are down, there’s nowhere to send your request, and hands out 503 errors for a few moments.
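
As an illustration only (this is not Anvil's actual migration code), the "too much, too fast" failure mode is commonly avoided by migrating in small batches and pausing whenever the data store looks busy. The sketch below assumes two hypothetical callables: `migrate_one`, which rewrites a single object, and `get_load`, which returns the store's current load as a fraction between 0 and 1. The thresholds are arbitrary placeholder values.

```python
import time


def migrate_in_batches(items, migrate_one, get_load, batch_size=100,
                       max_load=0.7, pause_s=1.0, max_pause_s=60.0,
                       sleep=time.sleep):
    """Migrate items in batches, backing off while the store is overloaded.

    migrate_one and get_load are hypothetical hooks supplied by the caller;
    sleep is injectable so the pacing can be tested without waiting.
    Returns the number of items migrated.
    """
    migrated = 0
    for start in range(0, len(items), batch_size):
        pause = pause_s
        # Before each batch, wait until load drops below the threshold,
        # doubling the pause each time (capped at max_pause_s).
        while get_load() > max_load:
            sleep(pause)
            pause = min(pause * 2, max_pause_s)
        for item in items[start:start + batch_size]:
            migrate_one(item)
            migrated += 1
    return migrated
```

Checking load before every batch means the migration yields to live traffic instead of competing with it, at the cost of a longer total migration time.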

A note to @mark.breuss: Normally, your Dedicated plan insulates you from issues like this. Unfortunately, the “blast radius” of this data store includes parts of the system that your requests pass through on the way to your dedicated server, so you’ve been affected this time. (We’re working on making this more robust and isolated – see above. If you’re on a Dedicated plan, problems in shared Data Tables should not become your problems!)

Finally, an apology to everyone who’s been receiving an earful from their customers, senior management, or anyone else about this. I know that feeling really sucks, especially when the cause is not within your control. This incident should now be resolved, some of the performance improvements I talked about are already kicking in, and we’re racing towards the next Big Upgrade as quickly as we can, which should buy us significant breathing space.

6 Likes

Thank you for the explanation Meredydd, I hope you’re able to get some sleep sometime soon :slight_smile: When you’re able to breathe again - is it possible to distribute Anvil geographically (i.e., anvilworks.uk, anvilworks.us, anvilworks.in, etc.), so that traffic coming from various parts of the globe is initially handled by entirely different instances of what is now anvil.works (load balancers and all)?

Or perhaps anvillts1.com, anvillts2.com, etc. - where the back end system follows a less frequent LTS upgrade schedule - and when you’ve got max users at each domain, each new group of clients who choose ‘lts’ accounts gets set up at anvillts3.com, anvillts4.com, etc. I don’t personally care what URL I have to log into to get to the editor, or if my subdomains work at myurl.anvil.app vs myurl.anvillts1.app… 2 cents…

Hi @meredydd,

thanks for the thorough update on the issue.
I do get scaling issues and I appreciate the complexity you guys handle.

Needless to reiterate your own words, but isolation from such outages is top priority for us.

Also, the part of me that is not responsible for building stable software is rooting for you guys, since additional load = additional customers :wink: :rocket:

2 Likes