Code running slow since upgrading to Business plan

Yesterday I upgraded my account to the Business plan due to an expected increase in usage. Today, there was a major failure on one app, and another has been running very slowly since upgrading and enabling the persistent server. I'm unsure what could be causing this issue when neither usage nor code has changed dramatically.

I doubt that changing plans will have a noticeable effect on performance.

An increase in usage, on the other hand, could have an exponentially worsening impact. Once you hit the limit of transactions per second or of available memory, for example, a 10% increase in the number of requests could mean a 1000% slowdown together with tons of failures. The load balancer will mitigate the problem, but scalability is not a problem that is fixed by upgrading a plan. For example (I think), if you upgraded from Business to Dedicated, where you get much more of some resources, like database rows, but only have one server, you couldn't take advantage of the load balancer plus Anvil's server farm when the number of requests increases.

As for the "keep server running" option, it usually helps. I have it enabled in all my apps, and the startup time for every request is exactly zero. Big improvement.

But if the app uses global variables that grow in size, or, worse, that affect other requests, the persistent server could cause slowdowns (that has never happened to me) or failures (that has happened with badly designed apps).
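For instance, here is a minimal sketch of the anti-pattern (my own illustration with made-up names, not code from any real app): a module-level variable that keeps growing, because with the persistent server the module is imported once and its state lives for the lifetime of the process.

```python
import anvil.server

# Module-level state: with the persistent server this survives between
# requests instead of being recreated on every call.
request_log = []

@anvil.server.callable
def do_work(payload):
    # Memory climbs on every call, and each request sees state left
    # behind by earlier (possibly concurrent) requests.
    request_log.append(payload)
    return len(request_log)
```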

The solution here is easy: disable the persistent server and see what happens.


I should probably give a bit more context; it's been a long stressful day, so I forgot to do that to begin with lol

I have two apps that get served as iframes. One app runs across 6 clients concurrently, each making a server request every 2 seconds or so. Those requests take around 1.5 seconds to complete. The second app is nearly identical to the first in every meaningful way. Last week I ran the 6 clients of the first app successfully without issue on the Professional tier, yet today adding 4 clients of the second app caused ONLY the second app to fail spectacularly with the following error, despite moving up tiers and using the persistent server:

Downlink disconnected anvil.server.RuntimeUnavailableError: Server resources not available for 'python3-full'. If this error persists for more than a few minutes, please contact support@anvil.works.

The reason, from what I could tell, was that imports were taking 10x longer than they normally would. Even after all those instances of the second app had stopped, I was unable to start it up again for a while; once all 6 instances of the first app stopped, that changed. The issue is that I've had similar levels of use in the past and never had it fail like this. The only other interesting thing was that I received two of these same errors in a third, separate app a bit before the issues started occurring.

Obviously there are about 100,000 reasons why my app, and not Anvil itself, would cause all this, and I'd love to dig deeper into where the issue lies, as I definitely don't want this to occur again :sweat_smile:

Edit: no global variables, and in the middle of everything I changed the request timer from 2 to 10 seconds, which made zero improvement on anything.

If I understand correctly, you have control over the load; that is, you can experiment with starting and stopping it.

This is great, as it makes your experimenting much easier.

For example, you could disable the persistent server and see if that makes a difference. If it does, make sure you don't have server-side globals, unless you really need them, in which case make sure they are thread safe.
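If you genuinely do need shared server-side state, a sketch along these lines (hypothetical names, just to show the idea) guards it with a lock so concurrent requests can't corrupt it:

```python
import threading

import anvil.server

_cache = {}                     # shared state, assumed to be needed
_cache_lock = threading.Lock()  # serializes access between requests

@anvil.server.callable
def get_cached(key):
    with _cache_lock:
        return _cache.get(key)

@anvil.server.callable
def set_cached(key, value):
    with _cache_lock:
        _cache[key] = value
```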

You could also increase the interval from 2 to 5 seconds, and see what happens.
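Assuming the polling is driven by an Anvil Timer component on the form (here hypothetically named timer_1, calling a made-up do_work server function), that's a one-property change:

```python
from ._anvil_designer import Form1Template
import anvil.server

class Form1(Form1Template):
    def __init__(self, **properties):
        self.init_components(**properties)
        # Slow the polling from every 2 s to every 5 s while experimenting
        self.timer_1.interval = 5

    def timer_1_tick(self, **event_args):
        # Stand-in for the real per-tick workload
        anvil.server.call('do_work')
```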

EDIT
Ahahah… I saw your edit after I posted this.

After I've had a good night's sleep I'll definitely dig deeper. At least I have very specific testing criteria to go by. I'll try to replicate the conditions; it's all just so weird. Appreciate the input.

EDIT I think the fact that the imports were directly affected (one session in the third app was showing a 566,356 ms import time across 113 server calls) is the most interesting part; that's not normal at all.

That would be about 5 seconds per import (566,356 ms ÷ 113 calls ≈ 5,011 ms).

And the persistent server should bring that down to zero.

You need to enable the persistent server, then refresh the client to load the new app version.

A print in a server module should show when the import happens. I have a dedicated server, and I only see the imports happening once after the app changes.
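Something like this minimal sketch (my suggestion; the function name is made up) makes it visible in the app logs: the top-level print fires once per import, while the callable fires on every request, so you can tell whether the persistent server is reusing the same process.

```python
import time

import anvil.server

# Runs only when the module is (re)imported, i.e. when a fresh server
# instance spins up for this app.
print(f"Server module imported at {time.time()}")

@anvil.server.callable
def ping():
    # Runs on every request; compare its frequency to the import print.
    return "pong"
```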

You have a load balancer, and perhaps the next request goes to another server that needs to spin up a new instance? I would imagine they try to minimize the number of instances serving the same app, but…?

Every app had the persistent server enabled before the issues started occurring, which is what makes it even weirder.

My normal import time was around 700-800 ms.

Which suggests that either

  1. The Persistent Server is not being used, for some reason, or
  2. Background tasks are being launched.

So the Persistent Server wasn't being used in some instances, but the time for imports to complete had increased tenfold. I believe I was affected by that recent issue with the images needing to be rebuilt, as the problem hasn't popped up since.

Just an update on my investigations. I have since moved all 3 apps onto Python 3.10 Beta with the Minimal set of packages and enabled the Persistent Server on all of them. The Persistent Server seems to be working, confirmed with a print() at the start of a server module. I have stress tested the first two apps, and they seem to be running as they should; requests now seem to be starting and completing as expected. I hope this was an outlier. I'm not sure if there are other bugs in the persistent server machinery causing issues here; it seems to have happened to a few people in a variety of situations fairly recently.