What I’m trying to do:
I’m uploading a CSV file about 10,000 lines long to a data table and it’s visibly/painfully slow. The client wants to upload CSVs with 100,000+ rows each time, starting this week.
I’m broadly following code posted elsewhere on this forum, using Pandas:
import anvil.media
import anvil.server
from anvil.tables import TableError
import pandas as pd


@anvil.server.callable
def get_next_uid(table):
    """UID isn't the anvil built-in row ID but a simple index (int) starting at zero."""
    total_rows = len(table.search())
    assert not table.get(UID=total_rows)
    return total_rows


@anvil.server.background_task
def save_csv_to_database_old(file, table, overwrite):
    with anvil.media.TempFile(file) as file_name:
        df = pd.read_csv(file_name)
        total_rows = df.shape[0]
        df.columns = df.columns.str.lower()

        table_cols = {x['name'] for x in table.list_columns()}
        df_cols = set(df.columns)
        try:
            assert df_cols.issubset(table_cols)
        except AssertionError:
            anvil.server.task_state['progress'] = "COLUMN MISMATCH"
            anvil.server.task_state['current_row'] = f"{df_cols} {table_cols}"
            return

        for col in df.columns:
            df[col] = df[col].apply(str)

        for n, d in enumerate(df.to_dict(orient="records")):
            anvil.server.task_state['progress'] = (
                f"Index {n} in range 0-{total_rows - 1} inclusive "
                f"[{total_rows} total] ({round(((n + 1) / total_rows) * 100)}%)"
            )
            try:
                try:
                    row = table.get(**d)  # Keys when checking for duplicates
                except TableError:
                    row = None
                if not row:
                    d['UID'] = get_next_uid(table)
                    row = table.add_row(**d)
                elif not overwrite:
                    anvil.server.task_state['current_row'] = f"Duplicate: {', '.join([str(x) for x in d.values()])}"
                    continue
                row.update(**d)  # Overwrite meaningless for SEARCH_ITEMS and IMPORT_LIST?
                # now() is a datetime helper defined elsewhere in the app (not shown)
                row['JOB_HISTORY'] = f"{row['JOB_HISTORY'] or ''}{now()}, CSV Upload\n"
                anvil.server.task_state['current_row'] = ', '.join([str(x) for x in d.values()])
            except Exception as E:
                anvil.server.task_state['current_row'] = repr(E)
                print(E, d)
The code above works, but things process VERY slowly, like 10-20 rows per second.
What I’ve tried and what’s not working:
On a hunch I removed the following checks and noticed a significant improvement in speed…
Check for existing/duplicate row. Remove:
try:
    row = table.get(**d)  # Keys when checking for duplicates
except TableError:
    row = None
Check for the next available UID (i.e. just overwrite the data table completely). Remove:
if not row:
    d['UID'] = get_next_uid(table)
    row = table.add_row(**d)
elif not overwrite:
    anvil.server.task_state['current_row'] = f"Duplicate: {', '.join([str(x) for x in d.values()])}"
    continue
Even with the above changes, it’s taking 6 minutes 21 seconds to create 7,550 rows (on the Professional plan), which seems overly slow…
Any suggestions for optimising performance of what I think is a fairly straightforward database job would be greatly appreciated.
Are there further things I can do to improve performance?
Is there a more efficient way to check for duplicate table entries so I can reinstate that check without killing performance?
Is this only fixed by throwing money at the problem and upgrading my account to a Business/Dedicated plan?
UPDATE: Assigning next UID is no longer a problem (see below)
I’m not great at DB optimizations so take this with a grain of salt, but to avoid the slowness of going back to the DB every time you check for duplicates, I think that rather than wholly overhauling the code, you might more easily get away with doing this:
1. Create a new data table for storing UIDs: one row per data table, with a single JSON column that contains an array of UIDs.
2. Create a dictionary in your server module that has the data table you are updating as the key, and the Anvil ID of the row holding that table’s list of UIDs as the value.
3. At the beginning of your server call, fetch the pertinent row for the data table in question, using the dictionary from step 2.
4. Instead of making a round trip for every row in the CSV, you can now just do a simple membership check against that list of UIDs.
5. Add each new UID to your server module’s variable containing the UIDs.
6. At the end of your other operations, make a single update to the data table row containing the UID JSON list (only if it has changed; if it stayed the same you can avoid the call).
I’m not sure how big a single JSON column can be, so size constraints might be an issue, but I think the above might at least relieve some performance pressure.
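A rough sketch of what I mean, assuming a helper table called uid_cache with a text column table_name and a simple-object column uids (those names are made up for illustration), and assuming this runs inside your existing background task where df and table are already in scope:

import anvil.server
from anvil.tables import app_tables

def load_uid_cache(table_name):
    """One call at the start of the upload: fetch the cached UID list."""
    cache_row = app_tables.uid_cache.get(table_name=table_name)  # uid_cache is hypothetical
    return cache_row, set(cache_row['uids'] or [])

# Inside the background task, roughly:
cache_row, uids = load_uid_cache('my_table')   # 'my_table' is illustrative
original_uids = set(uids)

for d in df.to_dict(orient="records"):
    # assumes each incoming record carries a UID to check against the cache
    if d['UID'] in uids:       # membership test in memory - no DB round trip
        continue               # duplicate
    table.add_row(**d)
    uids.add(d['UID'])

# One write at the end, and only if anything actually changed
if uids != original_uids:
    cache_row['uids'] = sorted(uids)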
Playing with Duncan’s suggestion, I didn’t end up implementing a new data table, but something that made a big positive difference was moving this line outside the loop, so that it (and therefore table.search()) is called just once, and then incrementing a variable instead:
d['UID'] = get_next_uid(table)
My learning on that point is that even though table.search() returns a “lazy” iterable, there is still quite an overhead in calling it, so calling it once per row really slows things down.
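In other words, roughly this (a sketch, assuming nothing else is adding rows while the upload is running):

next_uid = get_next_uid(table)    # one table.search() for the whole upload

for n, d in enumerate(df.to_dict(orient="records")):
    d['UID'] = next_uid
    next_uid += 1
    row = table.add_row(**d)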
I’m still very keen to hear if there are more efficient options for checking for duplicates, or for speeding this whole operation up further generally - many thanks!
Not really. Each situation is different so all I can do is point you at the docs for transactions and suggest you read a few previous posts on here about using them.
After a quick look at your code, and assuming that this process is never running twice in parallel, here are my comments:
Touching the database is slow. Replace get_next_uid with a counter variable
anvil.server.task_state might be slow (I haven’t tested it). Unless you need it to see the exact point where it crashes, you could do it once every second or two. Something like this (even better if the function builds the strings, instead of building the strings only to discard them):
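For example, a rough, untested sketch (the function name and the two-second interval are just illustrative):

import time
import anvil.server

_last_report = 0.0

def report_progress(n, total_rows, d, every=2.0):
    """Update task_state at most once every `every` seconds; the strings are
    only built when we actually report."""
    global _last_report
    now_ts = time.time()
    if now_ts - _last_report >= every:
        anvil.server.task_state['progress'] = (
            f"Row {n + 1} of {total_rows} ({round((n + 1) / total_rows * 100)}%)"
        )
        anvil.server.task_state['current_row'] = ', '.join(str(x) for x in d.values())
        _last_report = now_ts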
If the chance of finding duplicate rows is low, it may be better to add all the rows without checking for duplicates, then scan the table with an ordered search and delete the duplicates afterwards. One search at the end with a few deletes may be much faster than checking every row
Another approach to checking for duplicates is to keep a set in memory with all the rows. If each row is small, keeping 100,000 items in a set should be manageable. If each row is large, then you could hash them and keep the hashes in the set (assuming that they are hashable, which should be the case because your data is coming from a csv)
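A rough illustration of the set idea, assuming it runs inside the existing background task where df and table are in scope (row_key is my own helper, not part of Anvil):

def row_key(record, columns):
    """Hashable fingerprint of a record, built from the CSV's columns only."""
    return tuple(str(record[c]) for c in columns)

csv_columns = sorted(df.columns)

# One search over the existing table to build the set of fingerprints
seen = {row_key(r, csv_columns) for r in table.search()}

for d in df.to_dict(orient="records"):
    key = row_key(d, csv_columns)
    if key in seen:
        continue               # duplicate detected without touching the database
    table.add_row(**d)
    seen.add(key)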
The reason why you use a transaction is to manage atomicity.
The nice side effect is that they are faster.
Between the start and the end of the transaction, the database engine doesn’t need to notify anybody about what’s going on.
Between the start and the end of the same set of operations without a transaction, the database engine needs to make sure that any other access to the database knows about those operations. That adds an overhead that in some cases has an impact on performance.
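For example, something like this wraps each chunk of inserts in a single transaction rather than one per row (the chunking and chunk size are my own illustration, assuming df and table are in scope):

import anvil.tables as tables

@tables.in_transaction
def add_chunk(table, records):
    """All the add_row calls in here run inside one database transaction."""
    for d in records:
        table.add_row(**d)

records = df.to_dict(orient="records")
chunk_size = 1000               # arbitrary; tune to taste
for i in range(0, len(records), chunk_size):
    add_chunk(table, records[i:i + chunk_size])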
Yes, that is what I use them for (inserting chunks of 10, 50 or 100k rows at a time) in place of something like .executemany(). I have always found it faster, so I never changed it, but for some reason there was an idea floating around that it wasn’t optimized that way.
Thanks for clarifying it.
Also, I like your idea of using sets. I could imagine implementing a UNIQUE constraint across multiple columns by constructing a set of tuples for the current table state and another for the data to insert, then using .intersection() to find the collisions.
It’s obviously possible to do this just using .search() etc. from Anvil’s toolkit, but I bet it would be way faster for tens of thousands of rows to collect and compare first, then write only the relevant inserts or updates.
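Something like this, perhaps (a sketch; unique_cols is whatever compound key you care about, and the column names are illustrative):

# Columns that together should behave like a UNIQUE constraint (illustrative)
unique_cols = ['name', 'date', 'reference']

def key_of(record):
    return tuple(str(record[c]) for c in unique_cols)

existing = {key_of(r) for r in table.search()}
incoming = {key_of(d): d for d in df.to_dict(orient="records")}

collisions = existing.intersection(incoming)    # keys that already exist in the table
new_records = [d for k, d in incoming.items() if k not in existing]

for d in new_records:
    table.add_row(**d)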
To clarify what I meant in that other post: Anvil is not going to ‘stash’ your changes and submit them to the database as a single SQL call - it’ll just tell postgres to open and close a transaction around your statements. So you still incur the overhead of touching the database each time you make a CRUD operation on the Anvil table/row object.
I didn’t know that a transaction might offer some performance benefits too - interesting. I’m still guessing that’s small potatoes vs the Python stuff… anyone want to do some testing?
Because of this constraint, I don’t know of a good answer to AWSOM’s question. As Stefano says, one database access is faster than thousands of Python statements. However, there’s no ‘bulk insert’ made available through the Anvil Tables API (feature request?). So the best you can do is one Python statement per row, and therefore there is no way to avoid tens or hundreds of thousands of Python statements in this case.
I did some tests a long time ago, and it saved anywhere between nothing and 30% of the time, depending on your luck.
Getting rid of the get_next_uid(table) used at the top of the thread and the get used to check for duplicates before adding every row, on the other hand, takes you from hitting the database four times per row to once, so it should save 75-ish% of the time.