How to handle unexpected disconnect in Uplink program?

p.colbert · December 4, 2019, 3:42pm

Context
We’re developing an App that requests lengthy calculations. Calcs are performed in-house. To handle multiple simultaneous incoming requests, we have an in-house “pipeline”, with 3 programs: Listener, Calc, and Shipper. Calc, the slow one, has an input queue, so jobs can pile up without being lost. Once Calc finishes a result, it moves it to Shipper’s in-queue. Shipper connects to Anvil, and waits for these results, ultimately packaging and uploading each result to its destination (database row).

At least, when things are working as planned.

Possible problem
We’ve noticed that on some days, our connections to Anvil get interrupted or lost. We see this when running the App in a browser. (No big deal; re-trying the operation almost always works.)

At about the same time, Shipper has been observed to terminate, for no apparent reason. It occurred to me that it might be getting disconnected, too.

Listener can call anvil.server.wait_forever(), so the Anvil library is in control, and can re-connect at will. In fact, it does, as can easily be seen in Listener’s own console window.

Shipper, however, can’t use anvil.server.wait_forever() to re-connect automatically. It is waiting not for Anvil’s Remote Procedure Calls, but for local events (e.g., arrival of a file). Therefore, until such an event happens, its Anvil connection remains unused. If, in the meantime, it loses its connection, for whatever reason, then Shipper won’t know – until it tries to use it, i.e., to update a table row, or call a Server Module function. If that raises an Exception, then that would explain why Shipper terminates.

Whether this is actually the problem is hard to tell. But it’s certainly one we’d want to address anyway. Shipper should be able to:

- Detect when it has disconnected
- Attempt to re-connect
- Confirm the connection
- Determine whether the operation failed, and if so, re-try it.

This looks an awful lot like @anvil.tables.in_transaction. Is that the prescribed solution? If not, what is?
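
The "determine whether the operation failed, and if so, re-try it" step in the list above can be sketched as a plain Python decorator. This is only an illustration, not an Anvil API: the name retry_on_failure and its parameters are hypothetical, and since the thread does not say which exception a lost connection raises, catching Exception is an assumption.

```python
import functools
import time


def retry_on_failure(attempts=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Hypothetical helper: re-run a function when it raises.

    attempts   -- total number of tries before giving up
    delay      -- seconds to wait between tries
    exceptions -- exception types that trigger a retry (an assumption;
                  narrow this once the real error type is known)
    sleep      -- injectable for testing
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of tries: let the caller see the error
                    sleep(delay)
        return wrapper
    return decorator
```

Wrapping the function that uploads a result in @retry_on_failure() would re-run it after a failure. The wrapped operation should be safe to repeat, because a failed attempt may already have done part of its work.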

stefano.menci · December 4, 2019, 5:36pm

Do you see any error message on the console?

If you are running it in a window that closes as soon as the process ends, try wrapping it in a script that keeps the window open so you can see any error message.

Very likely you can fix your problem with a simple try: ... except: that catches the errors so the process doesn’t end.

I have a similar setup, and when something goes wrong I don’t even bother trying to recover from the crash. The script sends me an email (so I know that something went wrong) and carries on with the next item from the queue. If the previous run crashed, that item is the same one as before; otherwise it is the next one.

My scripts have been running for months without problems.

p.colbert · December 4, 2019, 6:00pm

As a separate process, the console window closes immediately. So, yes, my next step will be to use a more persistent window, to try to catch the message, for the next time it occurs. But I’ll still have to wait for it to actually occur.

A “blanket” except would also keep the window open. (I’ll look up how to dump a reasonably complete traceback, for diagnosis.) I wouldn’t necessarily want to leave it that way, but this is an early draft of the program, and the more details I can exhume from the event, the better.
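
One way to capture a reasonably complete traceback is the standard library's traceback module. The sketch below is a minimal example; the function names and the log file name shipper_crash.log are made up for illustration.

```python
import datetime
import traceback


def format_crash_report(exc):
    """Return a timestamped, multi-line report for a caught exception."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    trace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return f"[{stamp}] Unhandled exception\n{trace}"


def log_crash(exc, path="shipper_crash.log"):
    """Append the report to a log file (hypothetical name) and return it."""
    report = format_crash_report(exc)
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(report + "\n")
    return report
```

Calling log_crash(e) inside the blanket except leaves a record on disk even if the console window closes.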

stefano.menci · December 4, 2019, 6:19pm

Here is an excerpt from my main cycle. It makes repeated attempts to send the email, because the connection problem that caused the first crash might still affect send_error_email. The script waits 1 second and tries again, repeating until the connection problem is solved and the email goes out.

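import time
import traceback

# send_error_email is defined elsewhere in the script (not shown here).

if __name__ == '__main__':
  while True:
    try:
      # check if the file exists and do something that could crash
      time.sleep(0.2)
    except Exception as e:
      # Keep retrying the email: the connection problem that caused
      # the crash may still be in effect.
      while True:
        try:
          # Match on the class name, so no Anvil internals need importing.
          if type(e).__name__ == 'AnvilWrappedError':
            send_error_email(f'Error:<br>\n{repr(e)}<br>\n{e.error_obj}')
          else:
            send_error_email(f'Error:<br>\n{traceback.format_exc()}')
          break
        except Exception:
          time.sleep(1)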

p.colbert · December 4, 2019, 6:44pm

Thank you, @stefano.menci! I can really sink my study teeth into this.

This code does not try to re-establish a connection. Is that somewhere else? Or does it count on Anvil’s library to re-establish the connection on its own?

stefano.menci · December 4, 2019, 7:51pm

There is one anvil.server.connect("xxx") at the beginning, and that’s enough to try to reconnect when there is a problem.

… Well, almost.

That is enough to reconnect and let your uplink manage data tables, send emails and call server functions: anything initiated from the uplink side.
But it is not enough to reconnect to the server, register the server callable functions defined in the uplink, and allow the server modules to call them. Which, if I understand your question, is not your case.

But it is my case. So I added a try: ... except: block also around the calls to the uplink scripts that are waiting forever, and if the call to the function defined in the uplink fails I do:

On the server:
- Send myself an email so I know I need to restart the uplink script
- Add a row to a table that says that I tried and failed to call the uplink function
On the uplink script:
- Poll the server every 60 seconds and check for the above-mentioned rows, and if I detect one I act upon it as if I were responding to the request coming from the server, only with a few seconds' delay
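
The uplink-side polling described above might look roughly like the sketch below. It is written without any Anvil calls: fetch_failed_rows and handle_request are hypothetical callables standing in for "read and clear the failure rows from the table" and "do the work the server originally asked for".

```python
import time


def poll_failed_calls(fetch_failed_rows, handle_request,
                      interval=60, max_cycles=None, sleep=time.sleep):
    """Repeatedly look for missed requests and process them.

    fetch_failed_rows -- hypothetical: returns (and clears) the rows the
                         server wrote when its call to the uplink failed
    handle_request    -- hypothetical: handles one such row
    interval          -- seconds between polls (60 in the post above)
    max_cycles        -- stop after this many polls; None means run forever
    """
    cycles = 0
    handled = 0
    while max_cycles is None or cycles < max_cycles:
        for row in fetch_failed_rows():
            handle_request(row)
            handled += 1
        cycles += 1
        sleep(interval)
    return handled
```

The max_cycles and sleep parameters exist only so the loop can be exercised in a test; in a real script the defaults would leave it running indefinitely.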

I think I could do better. That is, the 60-second polling could start a new instance of itself and terminate, so the new instance would establish a fresh connection. But (1) the reboot is required only once or twice a month and (2) everything works well, at worst with a few seconds' delay that doesn’t bother me.

Now that I think about it, two nights ago all the connections were interrupted (I saw it from other logs), but I didn’t receive any emails and everything is still working without restarting the uplink scripts.
Question for Anvil Central: did something change in the automatic reconnection and now the server callable functions keep working when the connection is automatically reconnected?