I’m working on a chatbot using Langchain, and I’ve run into a latency issue. The client calls a server function, which in turn launches a background task to start streaming. However, there is a significant delay before the streaming starts, which hurts the user experience.
In an attempt to optimize the load time, I’ve placed the imports directly inside the server functions, so each function only loads the libraries it needs. Yet I’ve noticed that some Langchain LLM imports can take up to 4 seconds to load.
I’m contemplating upgrading to the Business plan to use the Persistent Server. However, I understand that background tasks run in separate Python environments. My question is: if I move the imports outside the functions, will it help reduce the load time? Or will background tasks still face the same issue because of the separate Python environments?
I appreciate any insights or suggestions in advance.
Without a persistent server there is no way to get rid of the first delay, but you may be able to get rid of the subsequent delays during the stream.
You could try doing the job in a background task. The background task stays up, does its work, and updates a row in a data table when more data is available.
On the client, a Timer calls a server function every second to read that row from the data table.
If the slow imports are done inside the background task, the server call that reads and returns the row should be fast.
All of this will work assuming that the library you are using does not use http endpoints to stream the result back to you. If it uses http endpoints, then those may be the bottleneck.
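For what it’s worth, here is a minimal sketch of the polling pattern above. The `stream_chunks` table, its columns, and the Langchain import are all assumptions; adjust them to your own schema and Langchain version:

```python
import anvil.server
from anvil.tables import app_tables

@anvil.server.background_task
def generate_reply(row_id, prompt):
    # The slow imports live only here, so the user-facing
    # server calls below never pay for them.
    from langchain_openai import ChatOpenAI  # assumed import; use whatever your chain needs

    row = app_tables.stream_chunks.get_by_id(row_id)
    llm = ChatOpenAI()
    text = ""
    for chunk in llm.stream(prompt):     # append tokens as they arrive
        text += chunk.content
        row.update(text=text)            # make partial output visible to the poller
    row.update(done=True)

@anvil.server.callable
def start_reply(prompt):
    row = app_tables.stream_chunks.add_row(text="", done=False)
    anvil.server.launch_background_task('generate_reply', row.get_id(), prompt)
    return row.get_id()

@anvil.server.callable
def poll_reply(row_id):
    # No heavy imports here, so this returns quickly.
    row = app_tables.stream_chunks.get_by_id(row_id)
    return row['text'], row['done']
```

On the client, a Timer with its interval set to 1 second might do something like:

```python
def timer_1_tick(self, **event_args):
    text, done = anvil.server.call_s('poll_reply', self.row_id)
    self.reply_label.text = text
    if done:
        self.timer_1.interval = 0  # stop polling once the reply is complete
```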
Also, any script you keep running on a machine you control outside of Anvil, which you then connect to using anvil-uplink, will obviously stay running without having to reload the Langchain library for as long as you keep the script alive.
There is also no 30-second timeout on functions called over the Anvil Uplink, but you would have to call them from a background task if you want them to return a response that might take longer than 30 seconds.
Or call them directly from the client, where there is also no 30-second timeout (still watch out for security and blocking concerns, though).
Running something like Langchain outside of Anvil also gives you the opportunity to tune other machine parameters, like the amount of memory a model has available, if you wanted to run a local LLM.
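As a rough sketch of what that can look like (the key, model, and function names here are placeholders, not anything from your app):

```python
# uplink_worker.py -- runs on a machine you control, outside Anvil
import anvil.server

# The heavy imports happen exactly once, when the script starts,
# not on every call from the app.
from langchain_openai import ChatOpenAI  # assumed import; swap in your own chain

anvil.server.connect("YOUR_UPLINK_KEY")  # key from the app's Uplink settings

llm = ChatOpenAI()

@anvil.server.callable
def ask_llm(prompt):
    # Callable from the app like any server function:
    #   anvil.server.call('ask_llm', "...")
    return llm.invoke(prompt).content

anvil.server.wait_forever()
```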
I have been grappling with this “slow import” issue for months, on and off. As you found out, the latency is caused by loading the server modules and then loading them again when the background task is launched. Persistent Server will shave off a second or two, but you can’t avoid the serverless nature of launching background tasks. You also can’t avoid using background tasks, because you want to be able to “stream” the responses. Here are your options:
1. Use an uplink script with the slow imports, defining the background task in there. Run your uplink script wherever you want. I have shaved off a few seconds with this method, and you can control the CPU power behind it without upgrading (most laptops will be faster than your Personal/Professional plan). The downside is that uplink scripts can disconnect; there are a few forum threads about managing that.
2. Persistent Server, which will shave off a few seconds. Try Option 1 first to get a sense of your lower-bound latency.
3. Persistent Server, call_async, and writing to / reading from the database (the database streaming method). I haven’t had success with this, but someone else may have.
4. Create an external web service that supports streaming (FastAPI?) and stream back to a barebones background task, effectively eliminating all your slow imports from your Anvil app (there is a rough sketch below). This requires knowledge of other web frameworks, which is why I haven’t gone this route.
A possible fifth alternative is to find a way to launch the background task before the question is submitted (when the user starts typing for example), and have the background task wait for a question to be read from the database (input question gets sent to the database). I have toyed with this idea but it is a bit elaborate and might not actually improve latency.
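For option 4, a rough sketch under assumed names (the service URL, table, and columns are placeholders, and the generator is a stand-in for a real Langchain streaming call):

```python
# --- External service, e.g. run with: uvicorn service:app ---
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate(prompt: str):
    # Stand-in for a Langchain streaming call
    for word in ("streamed", "words", "go", "here"):
        yield word + " "

@app.get("/chat")
def chat(prompt: str):
    return StreamingResponse(generate(prompt), media_type="text/plain")
```

```python
# --- Anvil background task: no Langchain imports at all ---
import requests
import anvil.server
from anvil.tables import app_tables

@anvil.server.background_task
def stream_from_service(row_id, prompt):
    row = app_tables.stream_chunks.get_by_id(row_id)  # assumed shared table
    text = ""
    with requests.get("https://your-service.example.com/chat",
                      params={"prompt": prompt}, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
            text += chunk
            row.update(text=text)   # partial text visible to whatever polls the row
    row.update(done=True)
```

The UI side can then poll the same row exactly as in the earlier sketch.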
Another big part of the latency is due to the multiple prompts used in a chain (especially the Conversation Retrieval Chain). You might want to explore using fewer steps in the prompting chain as another way to improve latency. Removing the conversation summary prompt, for example, might help.
I combine three of these in some of my programs/apps.
An uplink script runs all the time with wait_forever()** and, once connected to the Anvil Uplink, exposes a function that launches another copy of the script with a different argument (or a different script altogether, if you prefer). This second copy is a finite-running “listening server” that connects to the Uplink and checks an Anvil table for new “work” to process every few seconds.
When any user launches the app, that function checks whether the “listening server” is running and launches it if not. The same happens when they click the button to process some “work”. The “listening server” only runs for 5 minutes, the typical window in which a user would have asked for “work” to be done.
If the listening server finds work, it processes it, builds the result, writes it to the table, and marks the job as “done” with a bool column.
You can either use a background task to check the row over and over until it is marked “done”, or pass the job’s row back to the client and have a Timer component check it.
This is not one-to-one with Langchain, obviously, because you are streaming text back and forth through an API, whereas my system is more of a “press button, receive bacon” style for the purposes of this business app.
** You might get away with running your own while loop with a time.sleep() call, and a call to anvil.server._get_connection(), since that is all that .wait_forever() does. All you would be doing is injecting your own code to check a table for work to process.
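A minimal sketch of that idea, assuming a `jobs` table with `payload`, `result` and `done` columns (all made up here), and noting that `_get_connection()` is a private API that could change:

```python
import time
import anvil.server
from anvil.tables import app_tables

anvil.server.connect("YOUR_UPLINK_KEY")

def do_the_work(payload):
    # Placeholder for the real processing
    return payload.upper()

while True:
    # Keep the uplink connection alive -- roughly what wait_forever() does
    anvil.server._get_connection()

    # Our own injected code: check the table for work to process
    for job in app_tables.jobs.search(done=False):
        job.update(result=do_the_work(job['payload']), done=True)

    time.sleep(5)
```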
I don’t know how the library works, but you may be able to split the app in two:
The app with the big, slow library does its job with its own slow-to-start background task. The background task updates a row in a table when there are updates.
The app with the UI does NOT use the slow library at all; it just shares a table with the slow app. The form polls a server function to check for updates in the table.
This might work if you don’t need to use the library at every call.
The UI app cannot use the slow app as dependency, otherwise you are back at the starting point.
But the UI app can call a server function that stores some info in a row and calls an HTTP endpoint of the slow app, passing the row id. The slow app gets the row, does its slow job, and keeps the row updated while the UI app polls its own server callables to check for updates.
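A sketch of how the two halves might talk to each other (the endpoint path, API URL, table and column names below are all assumptions):

```python
# --- Slow app: server module ---
import anvil.server
from anvil.tables import app_tables

@anvil.server.http_endpoint("/start-job")
def start_job(row_id, **params):
    # Kick off the heavy work and return immediately
    anvil.server.launch_background_task('run_job', row_id)
    return "started"

@anvil.server.background_task
def run_job(row_id):
    from langchain_openai import ChatOpenAI  # slow import lives only in this app
    row = app_tables.jobs.get_by_id(row_id)  # table shared with the UI app
    row.update(result=ChatOpenAI().invoke(row['prompt']).content, done=True)
```

```python
# --- UI app: server module (no slow imports anywhere) ---
import anvil.server
import anvil.http
from anvil.tables import app_tables

SLOW_APP_API = "https://slow-app.anvil.app/_/api/start-job"  # your slow app's API URL

@anvil.server.callable
def submit(prompt):
    row = app_tables.jobs.add_row(prompt=prompt, done=False)
    anvil.http.request(SLOW_APP_API + "?row_id=" + anvil.http.url_encode(row.get_id()))
    return row.get_id()

@anvil.server.callable
def check(row_id):
    # Fast call: the form polls this from a Timer
    row = app_tables.jobs.get_by_id(row_id)
    return row['result'], row['done']
```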
First and foremost, I’d like to extend my gratitude for all the suggestions and insights you’ve provided. After considering the recommendations, I managed to solve the problem using a hybrid approach inspired by your collective input.
1. Upgrade to Business Plan: I decided to move to the Business plan to leverage the benefits of the Persistent Server.
2. Background Task Implementation: as soon as the form loads, I launch a background task that continuously monitors the database, waiting for a user message.
3. Message Processing: when this task picks up a new message in the database, it processes it and then removes the message once done.
4. Task Restart: after the background task finishes processing the message, I terminate it and launch another one so it is ready for the next user input.
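For anyone curious, the core of that loop looks roughly like this (table and column names are illustrative, not the actual template):

```python
import time
import anvil.server
from anvil.tables import app_tables

@anvil.server.background_task
def wait_for_message(session_id):
    # Pay the slow imports here, before the user has sent anything
    from langchain_openai import ChatOpenAI  # adjust to your own chain
    llm = ChatOpenAI()

    while True:
        msg = app_tables.messages.get(session=session_id)
        if msg is not None:
            app_tables.replies.add_row(session=session_id,
                                       text=llm.invoke(msg['text']).content)
            msg.delete()   # remove the message once it has been handled
            return         # task ends; the form launches a fresh one
        time.sleep(0.5)

@anvil.server.callable
def start_listener(session_id):
    # Called from the form's 'show' event, and again after each reply
    anvil.server.launch_background_task('wait_for_message', session_id)
```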
I’m sharing a template with the implementation I came up with.
If anyone has suggestions to further optimize the process, I’d be more than grateful for your feedback.
@yahiakalabs There is an old thread from over a year ago where we discussed various ways to interact with background tasks, back when there were fewer interaction options built into Anvil.
(It also ended up spawning a few FRs that helped with those options.)
It might give you even more ideas, because some of the workarounds for the limitations at the time worked quite well and were fully distilled concepts.