Yes, that approach will work – it’s one of the intended use cases for Background Tasks!
The crucial design question is: What happens when something fails?
Let’s start with the unpleasant truth: Everything fails, from time to time. The best you can do is to manage it to a level that’s acceptable for your application. Even modern airliners, whose safety engineering is a monument to human achievement and should leave all of us in slack-jawed awe, budget for 1 catastrophic failure every billion flight hours. They can’t achieve “perfectly safe”, but they don’t need to: all they need is “safer than crossing the street”.
Neither servers nor networks nor databases are airliners: they fail rather more often than that. Transient failures (eg “that request came in while the database was restarting”) are vastly more likely than permanent failures (“oh no Postgres lost my data”). Postgres is good at not losing your data, even if you are unable to read/write it for a moment.
If you’re using a well-designed system, it is very likely that transient failures will be reported to you – that is, you’ll experience them as exceptions, rather than silent data loss.
This is the standard Anvil holds itself to: transient failures should raise exceptions, and durable data should be highly reliable. Thus, transient failures (by default) propagate to the user, who (a) sees an error message and is aware that the operation they just attempted failed, and (b) can refresh their browser or try again later. Meanwhile, anything that’s committed to Data Tables should be durable and consistent – so any successful operation stays that way. (We use well-proven technology for this: Data Tables are backed by Postgres, with continuous backups.)
Let’s look at your case: You’re taking in an HTTP API call, and writing to a database and a message queue. This is subject to all sorts of transient failures: perhaps your database restarts, or the network connection between Anvil and your DB goes down for a minute, or one of these things happens to us and you get an Anvil-internal error.
If you do it all synchronously (ie while processing the API call), then a transient failure produces an exception which produces an HTTP 500 error to the caller.
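For concreteness, the synchronous version looks roughly like this (the `events` table and the `publish_to_rabbitmq` helper are stand-ins for whatever you’re actually doing):

```python
import anvil.server
from anvil.tables import app_tables

@anvil.server.http_endpoint("/ingest", methods=["POST"])
def ingest(**params):
    payload = anvil.server.request.body_json
    # Any exception raised here (say, a transient DB outage) propagates out
    # of the endpoint, and the caller gets an HTTP 500.
    app_tables.events.add_row(payload=payload)
    publish_to_rabbitmq(payload)  # stand-in for your RabbitMQ publish
    return {"status": "ok"}
```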
If you do it asynchronously (ie starting a Background Task and returning immediately), then a transient failure will produce an exception in the Background Task which will produce an error message in your App Logs…and nothing else. That’s one of those “silent permanent data loss” things we usually try to avoid.
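And here’s a rough sketch of the fire-and-forget version (same made-up names as above):

```python
import anvil.server
from anvil.tables import app_tables

@anvil.server.http_endpoint("/ingest-async", methods=["POST"])
def ingest_async(**params):
    payload = anvil.server.request.body_json
    # Launch the work and return immediately: the caller always sees a 200,
    # even if the Background Task later fails.
    anvil.server.launch_background_task("process_event", payload)
    return {"status": "accepted"}

@anvil.server.background_task
def process_event(payload):
    # A transient failure here raises an exception that shows up only in the
    # App Logs; the HTTP caller was already told everything was fine.
    app_tables.events.add_row(payload=payload)
    publish_to_rabbitmq(payload)  # stand-in for your RabbitMQ publish
```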
Off the top of my head, I can think of three ways of addressing this issue:
- Keep the process synchronous, and report errors via HTTP errors. The API caller then gets to decide whether to retry the operation or accept the loss. The good news is, this is already their job! The API caller already has to account for what happens if their network connection goes down and they can’t reach your API, so fault tolerance (or acceptance) is their responsibility anyway.
- Just swallow the error: decide it’s an acceptable loss. (Not every silent permanent data loss is bad. For example, if this data were website analytics, silently losing a tiny fraction of the clickstream data probably wouldn’t be a big deal, but response time would be very, very important.)
- Write code to provide improved durability. For example: Your HTTP endpoint writes to a Data Table, recording what it’s about to do and the ID of the BG task it launched. The BG task deletes that row once done. A Scheduled Task sweeps the table every hour, looking for stale tasks that haven’t completed, and relaunches them, kill()ing the old Background Task if it’s still going and wedged somewhere. A transient failure can interrupt any of these processes, and the data still won’t be lost. (There’s a rough sketch of this pattern just after this list.)
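Here’s a rough sketch of that third option. The `pending_tasks` table, its columns, and the one-hour staleness cutoff are all assumptions for illustration, not a definitive implementation:

```python
from datetime import datetime, timedelta

import anvil.server
import anvil.tables.query as q
from anvil.tables import app_tables

@anvil.server.http_endpoint("/ingest-durable", methods=["POST"])
def ingest_durable(**params):
    payload = anvil.server.request.body_json
    # Record what we're about to do *before* doing it.
    row = app_tables.pending_tasks.add_row(payload=payload, started=datetime.now())
    task = anvil.server.launch_background_task("process_event_durable", row)
    row["task_id"] = task.get_id()
    return {"status": "accepted"}

@anvil.server.background_task
def process_event_durable(row):
    app_tables.events.add_row(payload=row["payload"])
    publish_to_rabbitmq(row["payload"])  # stand-in for your RabbitMQ publish
    # Only delete the ledger row once everything has succeeded.
    row.delete()

# Set this up as an hourly Scheduled Task in the Anvil editor.
@anvil.server.background_task
def sweep_stale_tasks():
    cutoff = datetime.now() - timedelta(hours=1)
    for row in app_tables.pending_tasks.search(started=q.less_than(cutoff)):
        try:
            old_task = anvil.server.get_background_task(row["task_id"])
            if old_task.is_running():
                old_task.kill()  # it's wedged; stop it before retrying
        except Exception:
            pass  # the old task is already gone; nothing to kill
        row["started"] = datetime.now()
        new_task = anvil.server.launch_background_task("process_event_durable", row)
        row["task_id"] = new_task.get_id()
```

The key property is that the ledger row is created before any work starts and deleted only after all of it finishes, so a crash at any point leaves evidence behind for the sweeper to find.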
(You do have to worry about duplication from retries, of course – what happens if the DB write succeeds but the RabbitMQ operation fails, and then you retry both? Judicious use of DB transactions can reduce your window of vulnerability here, but keeping multiple components consistent remains a Hard Problem, and I have already rambled too long without going into the options here.)
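As one example of “judicious use of DB transactions”: if your API caller can send a unique request ID with each call, you can make the Data Table write idempotent, so a retry that arrives after the first attempt committed won’t create a duplicate row. (The `request_id` column is made up for this sketch, and it does nothing to deduplicate the RabbitMQ side.)

```python
from anvil.tables import app_tables, in_transaction

@in_transaction
def record_event_once(request_id, payload):
    # A retry that arrives after the first write committed will see the
    # existing row and skip the insert. The transaction narrows (though
    # doesn't perfectly eliminate) the window for truly concurrent retries.
    if app_tables.events.get(request_id=request_id) is None:
        app_tables.events.add_row(request_id=request_id, payload=payload)
        return True   # first time we've seen this request
    return False      # a retry of something we've already recorded
```

You’d call something like `record_event_once()` from the Background Task, and only publish to RabbitMQ when it returns True.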
Reliability engineering is fun, and there’s so much more to it than I’ve discussed here. It being a Sunday afternoon, I don’t immediately have any good further-reading references to hand, but “reliability” and “distributed systems” are good Google search terms.