Very Fast Replies

I’m just after some validation for my ideas -

I have an API endpoint that writes to a database and posts a message to a RabbitMQ queue, both of which have the potential to be relatively slow. My API needs to return as fast as humanly possible, because the calling end blocks while waiting for a result.

Would using background tasks be the right thing to do here? I was thinking of firing up a background task to do both the DB write and the RabbitMQ post. They can then take their time, and I can return a value to the calling party almost immediately.

Does that seem right?

When calling a background task you can get the status and update it as needed. So what you are doing sounds like it could work.

My issue is - can I return a value, essentially exiting the API session, with the background process still running?

I haven’t tried any of this yet, I’m mapping out Monday’s work ahead of time :slight_smile:

Yes, that approach will work – it’s one of the intended use cases for Background Tasks!

The crucial design question is: What happens when something fails?

Let’s start with the unpleasant truth: Everything fails, from time to time. The best you can do is to manage it to a level that’s acceptable for your application. Even modern airliners, whose safety engineering is a monument to human achievement and should leave all of us in slack-jawed awe, budget for 1 catastrophic failure every billion flight hours. They can’t achieve “perfectly safe”, but they don’t need to: all they need is “safer than crossing the street”.

Neither servers nor networks nor databases are airliners: they fail rather more often than that. Transient failures (eg “that request came in while the database was restarting”) are vastly more likely than permanent failures (“oh no Postgres lost my data”). Postgres is good at not losing your data, even if you are unable to read/write it for a moment.

If you’re using a well-designed system, it is very likely that transient failures will be reported to you – that is, you’ll experience them as exceptions, rather than silent data loss.

This is the standard Anvil holds itself to: transient failures should raise exceptions, and durable data should be highly reliable. Thus, transient failures (by default) propagate to the user, who (a) sees an error message and is aware that the operation they just attempted failed, and (b) can refresh their browser or try again later. Meanwhile, anything that’s committed to Data Tables should be durable and consistent – so any successful operation stays that way. (We use well-proven technology for this: Data Tables are backed by Postgres, with continuous backups.)


Let’s look at your case: You’re taking in an HTTP API call, and writing to a database and a message queue. This is subject to all sorts of transient failures: perhaps your database restarts, or the network connection between Anvil and your DB goes down for a minute, or one of these things happens to us and you get an Anvil-internal error.

If you do it all synchronously (ie while processing the API call), then a transient failure produces an exception which produces an HTTP 500 error to the caller.

If you do it asynchronously (ie starting a Background Task and returning immediately), then a transient failure will produce an exception in the Background Task which will produce an error message in your App Logs…and nothing else. That’s one of those “silent permanent data loss” things we usually try to avoid.

Off the top of my head, I can think of three ways of addressing this issue:

  1. Keep the process synchronous, and report errors via HTTP errors. The API caller then gets to decide whether to retry the operation or accept the loss. The good news is, this is already their job! The API caller already has to account for what happens if their network connection goes down and they can’t reach your API, so fault tolerance (or acceptance) is their responsibility anyway.

  2. Just swallow the error: decide it’s an acceptable loss. (Not every silent permanent data loss is bad. For example, if this data were website analytics, silently losing a tiny fraction of the clickstream data probably wouldn’t be a big deal, but response time would be very, very important.)

  3. Write code to provide improved durability. For example: Your HTTP endpoint writes to a Data Table, recording what it’s about to do and the ID of the BG task it launched. The BG task deletes that row once done. A Scheduled Task sweeps the table every hour, looking for stale tasks that haven’t completed, and relaunches them, kill()ing the old Background Task if it’s still going and wedged somewhere. A transient failure can interrupt any of these processes, and the data still won’t be lost. (There’s a rough sketch of this after the list.)

    (You do have to worry about duplication from retries, of course – what happens if the DB write succeeds but the RabbitMQ operation fails, and then you retry both? Judicious use of DB transactions can reduce your window of vulnerability here, but keeping multiple components consistent remains a Hard Problem, and I have already rambled too long without going into the options here.)
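To make that concrete, here’s a rough sketch of what option 3 could look like in Anvil code. It’s only a sketch: the table and helper names (`pending_jobs`, `write_to_db`, `post_to_rabbitmq`) are made up, and you’d configure `sweep_stale_jobs` as a Scheduled Task in the IDE:

```python
import anvil.server
from anvil.tables import app_tables
from datetime import datetime, timedelta, timezone

# Hypothetical 'pending_jobs' table: columns payload (simple object),
# task_id (text) and created (date and time).

@anvil.server.http_endpoint("/incoming", methods=["POST"])
def incoming(**params):
    # Record what we're about to do *before* we do it, then return immediately.
    row = app_tables.pending_jobs.add_row(payload=dict(params),
                                          created=datetime.now(timezone.utc))
    task = anvil.server.launch_background_task('process_job', row)
    row['task_id'] = task.get_id()
    return "ok"

@anvil.server.background_task
def process_job(row):
    write_to_db(row['payload'])        # placeholder for your slow DB write
    post_to_rabbitmq(row['payload'])   # placeholder for your queue publish
    row.delete()                       # only forget the job once it has fully succeeded

@anvil.server.background_task
def sweep_stale_jobs():
    # Run this as a Scheduled Task (eg hourly) from the Anvil IDE.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    for row in app_tables.pending_jobs.search():
        if row['created'] < cutoff:
            try:
                old = anvil.server.get_background_task(row['task_id'])
                if old.is_running():
                    old.kill()         # it's wedged; stop it before retrying
            except Exception:
                pass                   # the old task is long gone; just relaunch
            anvil.server.launch_background_task('process_job', row)
```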


Reliability engineering is fun, and there’s so much more to it than I’ve discussed here. It being a Sunday afternoon, I don’t immediately have some good further-reading references to hand, but “reliability” and “distributed systems” are good Google search terms :slight_smile:


Thanks for the detailed reply, especially for a Sunday!

I always assume everything will go wrong. My use case is to do with receiving real-time SMS from a carrier, at a rate of several per second.

The SMPP server that I have bought can be configured to call a URL on every message. The URL called must return “result=1”, else the software will not ack the message and the carrier will not count it in their stats (and therefore won’t pay us). Not only must it return that value, it must do so quickly, else it will time out and reject the message anyway.

The fast acking of the inbound message is the most important step in all of this. My stats logging (the DB) and my forwarding of the message on to a client (the RabbitMQ queue) can fail with far fewer consequences, though knowing it has failed is useful when trying to tally up stats at the end of any billing period.

I am actually going to remove the DB step and have the RabbitMQ consumer process perform both the DB insert and the forwarding of the messages. I may even use two queues so I can run two consumers, each dedicated to one task. The consumers essentially run at their leisure, and the queues can hold on to the messages for as long as required without holding anything up.
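Roughly, I’m picturing each consumer looking something like this (using pika; the connection URL, queue name, and insert_into_db are all placeholders):

```python
import json
import pika

# Placeholder connection details - in my case this would point at Stackhero.
params = pika.URLParameters("amqps://user:password@rabbitmq-host/vhost")
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="sms_stats", durable=True)
channel.basic_qos(prefetch_count=1)   # one message at a time; the consumer takes its leisure

def handle_message(ch, method, properties, body):
    record = json.loads(body)
    insert_into_db(record)            # placeholder for the stats DB insert
    # Only ack once the work has succeeded; an unacked message is redelivered
    # if the consumer dies.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="sms_stats", on_message_callback=handle_message)
channel.start_consuming()
```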

Duplication is broadly taken care of by unique keys and UUIDs on the database. The actual forwarding can be, and is, an issue, especially as with SMS you can have partial messages and multiple parts, and any one of those can time out, leaving you to piece together where the message actually is and how many parts there are.

Can I get to those App Log errors programmatically? I’ll add a feature request if not, because again, failures are kind of fine so long as I can account for them at some point in the future.

In the meantime, a quick ack back to the SMPP server and a silent failure of admin functions is the preferable trade-off here. At least we get paid that way. Once it’s running I can start to analyse the pain points.

edit - I used to do this by writing out a text file for each record, then having a scheduled process every 2 minutes hoover up the files and write them to databases, etc. The problem is I’m handling nearly half a million messages a day, and that’s just the start of it. Things get somewhat unwieldy when the process crashes out and stops picking up the files, leaving a directory so big you can’t even ls it. I like that simple approach, though. Do nothing complicated in real time and handle the hard stuff separately at your “leisure”.

edit 2 - speaking of retries, this looks interesting: retrying · PyPI
Not sure how useful it is but I like the idea.
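Presumably the usage would be something along these lines (untested; publish_to_queue and channel are just illustrative):

```python
from retrying import retry

# Retry the publish a few times with exponential backoff before giving up.
@retry(stop_max_attempt_number=5,
       wait_exponential_multiplier=500,   # backoff values are in milliseconds
       wait_exponential_max=10000)
def publish_to_queue(message):
    channel.basic_publish(exchange="", routing_key="sms_stats", body=message)
```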


I would get the HTTP endpoint to add one row to an Anvil table and return immediately.

Then a background task scheduled to run every minute would search all the rows and process them one by one. Be careful not to start a transaction (intentionally or not), otherwise it will slow down the HTTP endpoint that is trying to add a row.

The background task could write to its own to-do-list table: note that it’s starting to process a row before processing it, and note that it has finished successfully at the end.

The background task should, before starting to process the jobs, check if there are unfinished jobs and take any required action.

This would allow you to know if something went wrong without analyzing the app log (which would be way slower than managing a to-do-list table).
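Something along these lines, for example (a minimal sketch: the jobs table, do_the_work and handle_unfinished_job are made up, and process_jobs would be scheduled to run every minute in the IDE):

```python
import anvil.server
from anvil.tables import app_tables
from datetime import datetime, timezone

@anvil.server.http_endpoint("/incoming", methods=["POST"])
def incoming(**params):
    # No transaction here - a single add_row, then return straight away.
    app_tables.jobs.add_row(payload=dict(params), status="new",
                            created=datetime.now(timezone.utc))
    return "result=1"

@anvil.server.background_task
def process_jobs():
    # Anything still marked "started" from a previous run died part-way through.
    for row in app_tables.jobs.search(status="started"):
        handle_unfinished_job(row)         # eg retry it, or flag it for review
    for row in app_tables.jobs.search(status="new"):
        row['status'] = "started"
        do_the_work(row['payload'])        # the slow DB/queue work happens here
        row['status'] = "done"
```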


I’m just discovering how slow that is, once you get a few hundred thousand rows in it :slight_smile:

The RabbitMQ method seems to work nicely (I use https://www.stackhero.io for that - one less thing to manage!) but I might give your idea a go, too. See which is quicker/more reliable. They both essentially achieve the same thing but it might be nice to keep it all on Anvil.

What will really be the decider is the throughput rate. Traffic peaks at 3-4 messages a second at the moment, but when I switch all of it over we could peak at around 10-20/second.