What is the best way to upload large amounts of data?
I would like to synchronize a few tables daily, but using an http_endpoint times out after 20-30 seconds (the limit seems to be inconsistent) and using an uplink is very slow.
I tried uploading 800 rows by sending them as one long JSON document to an http_endpoint; adding them to a table takes 15.5 seconds without a transaction and 5.5 seconds with one.
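The endpoint side looks roughly like this (a minimal sketch; the `/upload_inventory` path and the `inventory` table are placeholders for my real ones, and the "without a transaction" version is just the same loop without the decorator):

```python
import anvil.server
import anvil.tables as tables
from anvil.tables import app_tables

@tables.in_transaction
def _add_rows(rows):
    # One transaction for the whole batch; each dict becomes one table row
    for r in rows:
        app_tables.inventory.add_row(**r)

@anvil.server.http_endpoint("/upload_inventory", methods=["POST"])
def upload_inventory(**params):
    # body_json is the parsed JSON body of the POST request
    rows = anvil.server.request.body_json
    _add_rows(rows)
    return {"inserted": len(rows)}
```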
I tried to upload the same 800 rows with 800 add_row() calls via uplink, and it takes 89 seconds with a transaction and 99 seconds without.
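The uplink version is roughly this (again a sketch; the uplink key, table name and dummy data are placeholders):

```python
import anvil.server
import anvil.tables as tables
from anvil.tables import app_tables

anvil.server.connect("YOUR-UPLINK-KEY")

@tables.in_transaction
def upload(rows):
    # Even inside the transaction, every add_row() still seems to be
    # a separate round-trip from this script to the Anvil servers
    for r in rows:
        app_tables.inventory.add_row(**r)

rows = [{"part_id": "P%04d" % i, "qty": i} for i in range(800)]  # dummy data
upload(rows)
```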
So wrapping all in a transaction saves most of the time when using an http_endpoint, but doesn’t help much with an uplink.
I had a little hope that the uplink with transaction would work on a local cache, but apparently I was wrong.
| Method | With transaction | Without transaction |
|---|---|---|
| http_endpoint | 5.5 s | 15.5 s |
| uplink | 89 s | 99 s |
It looks like the fastest way is to split the data into chunks that each take less than 20 seconds to process and call an http_endpoint for each chunk. The problem with this is that it is impossible to do the whole upload in one transaction, which leaves the database in an inconsistent state between the time the first and the last chunks are sent.
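From the local side the chunked workaround would look something like this (a sketch; the URL format and chunk size are guesses I would have to tune):

```python
import requests

CHUNK_SIZE = 500   # tuned so each request stays well under the timeout
URL = "https://my-app.anvil.app/_/api/upload_inventory"   # placeholder

def upload_in_chunks(rows):
    for i in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[i:i + CHUNK_SIZE]
        # Each chunk commits in its own transaction, so the tables are
        # inconsistent between the first and the last request
        requests.post(URL, json=chunk).raise_for_status()
```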
Is this a one-time operation, or do you need to do it regularly? Where is the data coming from?
It is a daily synchronization with our inventory.
Our inventory has tens of thousands of rows in our local database.
The Anvil app has a subset of our inventory. About 2-3 thousand rows generated after filtering, consolidating and massaging the data in the inventory.
For the first few weeks or months I would like to do a daily complete overwrite of the Anvil tables. When everything works well, I will compare the local tables with the Anvil tables and update/add/delete only the rows that need it.
The best solution for bulk uploads would be the uplink with locally cached writes. I don’t see any reason why the writes should be sent one by one when working inside a transaction. (Perhaps there are many good reasons, I just don’t see them from here.)
Are you using Data Tables or an external database?
Anvil guys might jump on me here, but for any significant amount of data I use an external MySQL DB. I regularly upload 100k+ rows using LOAD DATA INFILE and it takes a few seconds, sometimes as often as every 2 minutes.
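The load itself is nothing fancy; roughly like this (a sketch, with connection details and file/table names as examples only):

```python
import pymysql

# Connection details and names below are examples only
conn = pymysql.connect(host="db.example.com", user="sync", password="...",
                       database="inventory", local_infile=True)
with conn.cursor() as cur:
    cur.execute("""
        LOAD DATA LOCAL INFILE '/tmp/inventory.csv'
        INTO TABLE inventory_rows
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        IGNORE 1 LINES
    """)
conn.commit()
```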
@david.wylie: I am using Data Tables. They are good enough for this application and I am trying to keep it simple.
@navigate:

- The concept you show is correct, but my logic is more complex: I need to check whether each row needs to be added, deleted or modified. Even if the row with the part ID is there, some fields could have changed. It is not a full inventory; it is a consolidated version where one row per part ID has several fields with quantities and other attributes for each of our warehouses. I can do it (there is a rough sketch of that logic after this list), but I would like to wait before investing time in it and, in the meantime, use a brute-force full overwrite.
- The old database is in Lotus; if I touch that I get green rashes. For now, a script that exports, filters, consolidates and massages the data is all I can do. We are working on a new inventory management system, but that’s another project, much larger than the app I’m working on now.
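The diff logic I have in mind is roughly this (only a sketch; the `inventory` table, the `part_id` key and the shape of `local_rows` are placeholders):

```python
import anvil.tables as tables
from anvil.tables import app_tables

@tables.in_transaction
def sync_inventory(local_rows):
    # local_rows: {part_id: {column: value, ...}} built from the exported inventory
    existing = {r["part_id"]: r for r in app_tables.inventory.search()}

    for part_id, fields in local_rows.items():
        row = existing.pop(part_id, None)
        if row is None:
            app_tables.inventory.add_row(part_id=part_id, **fields)
        elif any(row[k] != v for k, v in fields.items()):
            row.update(**fields)   # only touch rows that actually changed

    # Whatever is left was deleted from the local inventory
    for row in existing.values():
        row.delete()
```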
Regardless of the specific requirements for my current use case, I think a faster bulk upload would help.
5 seconds for 800 rows with an http_endpoint is not bad, but it times out with 8000 rows, so it’s unusable.
90 seconds for 800 rows with uplink is definitely too slow. Here the bottleneck seems to be too much talking to the server, and if there were a way to postpone the talking until the end, when the transaction is committed, we could perhaps be down to 5 seconds. At that point we could go to 8000 rows with no timeout constraint.
I am going to use ElephantSQL.com as a caching system.
Anvil app -> SQLAlchemy connection to elephantsql.com (20 MB) -> Golang connection dump to external local database (2 terabytes). (I don’t need to backfill any data, so I can just keep old data in my tower.)