Faster way to find duplicates in data table?

What I’m trying to do:
I’m uploading a CSV file of about 10,000 lines to a data table and it’s painfully slow. The client wants to upload CSVs with 100,000+ rows each time, starting this week.

I’m broadly following code posted elsewhere on this forum and using Pandas:

import anvil.media
import anvil.server
from anvil.tables import TableError
from datetime import datetime
import pandas as pd

@anvil.server.callable
def get_next_uid(table):
    """UID isn't the anvil built-in row ID but a simple index (int) starting at zero"""
    total_rows = len(table.search())
    assert not table.get(UID=total_rows)
    return total_rows

@anvil.server.background_task
def save_csv_to_database_old(file, table, overwrite):
    with anvil.media.TempFile(file) as file_name:
        df = pd.read_csv(file_name)
        total_rows = df.shape[0]
        df.columns = df.columns.str.lower()
        table_cols = {x['name'] for x in table.list_columns()}
        df_cols = set(df.columns)        
        try:
            assert df_cols.issubset(table_cols)
        except AssertionError:
            anvil.server.task_state['progress'] = f"COLUMN MISMATCH"
            anvil.server.task_state['current_row'] = f"{df_cols} {table_cols}"
            return
        for col in df.columns:
            df[col] = df[col].apply(str)
    for n, d in enumerate(df.to_dict(orient="records")):
        anvil.server.task_state['progress'] = f"Index {n} in range 0-{total_rows-1} inclusive [{total_rows} total] ({round(((n+1) / total_rows) * 100)}%)"
        try:
            try:
                row = table.get(**d)  #  Keys when checking for duplicates
            except TableError:
                row = None
            if not row:
                d['UID'] = get_next_uid(table)
                row = table.add_row(**d)
            elif not overwrite:
                anvil.server.task_state['current_row'] = f"Duplicate: {', '.join([str(x) for x in d.values()])}"
                continue
            row.update(**d)  #  Overwrite meaningless for SEARCH_ITEMS and IMPORT_LIST?
            row['JOB_HISTORY'] = f"{row['JOB_HISTORY'] or ''}{datetime.now()}, CSV Upload\n"
            anvil.server.task_state['current_row'] = ', '.join([str(x) for x in d.values()])
        except Exception as E:
            anvil.server.task_state['current_row'] = repr(E)
            print(E, d)

The code above works, but things process VERY slowly, like 10-20 rows per second.

What I’ve tried and what’s not working:

On a hunch I removed the following checks and noticed a significant improvement in speed…

  1. Check for existing/duplicate row. Remove:
            try:
                row = table.get(**d)  #  Keys when checking for duplicates
            except TableError:
                row = None
  2. Remove check for next available UID and just overwrite datatable completely:
            if not row:
                d['UID'] = get_next_uid(table)
                row = table.add_row(**d)
            elif not overwrite:
                anvil.server.task_state['current_row'] = f"Duplicate: {', '.join([str(x) for x in d.values()])}"
                continue

Even with the above changes, it’s taking 6 mins 21 seconds to create 7,550 rows (on the Professional plan), which seems overly slow…

Any suggestions for optimising performance of what I think is a fairly straightforward database job would be greatly appreciated.

  • Are there further things I can do to improve performance?
  • Is there a more efficient way to check for duplicate table entries so I can reinstate that check without killing performance?
  • Is this only fixed by throwing money at the problem and upgrading the account to a Business/Dedicated Plan?

UPDATE: Assigning next UID is no longer a problem (see below)

I’m not great at DB optimizations, so take this with a grain of salt, but to solve the slowness of having to go back to the DB every time you check for dups, I think that rather than wholly overhauling the code, you might more easily get away with doing this:

  1. Create a new DataTable for storing UIDs for each datatable: one row per data table, with a single JSON column that contains an array of UIDs
  2. Create a dictionary in your server module that has the data table you are updating as the key, and the Anvil id for the row that has a list of all UIDs
  3. At the beginning of your server call, fetch the pertinent row for the data table in question (using the dictionary from step 2)
  4. Instead of making a round trip for every row in the CSV, you can now just do a simple in-memory check for the UID’s membership in that list of UIDs
  5. Add any new UID to your server module’s variable containing the UIDs
  6. At the end of your other operations, make a single update to the data table row containing the UID JSON list (only if it’s different; if it stayed the same, you can avoid the call :slight_smile: )

I’m not sure how big a single JSON column can be, so size constraints might be an issue, but I think the above might at least relieve some performance pressure.
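
Roughly what I have in mind, as an untested sketch (the uid_registry table, its column names and the add_records function are all made-up names for illustration):

    from anvil.tables import app_tables

    # Step 1: a hypothetical 'uid_registry' table with one row per data table and
    # a 'uids' simple object column holding a list of UIDs.
    # Step 2: map each data table name to the Anvil row id of its registry row.
    REGISTRY_ROW_IDS = {
        'my_data_table': '[...]',   # placeholder row id, filled in for real tables
    }

    def add_records(table_name, records):
        # Step 3: one call to fetch the registry row for this data table
        registry_row = app_tables.uid_registry.get_by_id(REGISTRY_ROW_IDS[table_name])
        known_uids = set(registry_row['uids'] or [])
        before = set(known_uids)
        table = getattr(app_tables, table_name)
        for d in records:   # records = df.to_dict(orient="records")
            # Step 4: in-memory membership check instead of a per-row DB round trip
            if d['UID'] in known_uids:
                continue
            table.add_row(**d)
            known_uids.add(d['UID'])   # step 5
        # Step 6: a single write at the end, and only if the list actually changed
        if known_uids != before:
            registry_row['uids'] = sorted(known_uids)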

Many thanks for the speedy and constructive suggestion Duncan - much appreciated.

I’d still love to hear from others, particularly the Anvil support team, regarding official best practice, but I will try your suggestion and report back.

Playing with Duncan’s suggestion, I didn’t actually end up implementing a new data table, but something that made a big positive difference was moving this line outside the loop, so that it (and therefore table.search()) is called just once, and then incrementing a variable instead:

d['UID'] = get_next_uid(table)

My learning on that point is that even though table.search() returns a “lazy” iterable, there is still quite an overhead in calling it, so calling it once per row really slows things down.
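
Roughly, the changed part of the background task now looks like this (a sketch only; the rest of the loop is unchanged):

    # One call, and therefore one table.search(), before the loop
    next_uid = get_next_uid(table)

    for n, d in enumerate(df.to_dict(orient="records")):
        # ... progress reporting and duplicate check as before, setting `row` ...
        if not row:
            d['UID'] = next_uid      # just a local increment per new row, no DB call
            next_uid += 1
            row = table.add_row(**d)
        # ... rest of the loop as before ...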

I’m still very keen to hear if there are more efficient options for checking for duplicates, or for speeding this whole operation up further generally - many thanks!

You might want to try using a transaction.

In your present code, each db update is done individually and, to maintain data consistency, the dbms has to write that change to disk.

If you place a batch of updates within a transaction, those are all written to disk in one go.

Thanks Owen - do you have a link or example you could share please?

Not really. Each situation is different so all I can do is point you at the docs for transactions and suggest you read a few previous posts on here about using them.

After a quick look at your code, and assuming that this process is never running twice in parallel, here are my comments:

  • Touching the database is slow. Replace get_next_uid with a counter variable

  • anvil.server.task_state might be slow (I haven’t tested it). Unless you need it to see the exact point where it crashes, you could update it just once every second or two. Something like this (even better if the function builds the strings itself, instead of the caller building strings only to discard them):

    import time

    _time_next_update = 0

    def update_task_state(progress, current_row):
        # Skip the write unless at least 2 seconds have passed since the last one
        global _time_next_update
        if time.time() < _time_next_update:
            return
        _time_next_update = time.time() + 2
        anvil.server.task_state['progress'] = progress
        anvil.server.task_state['current_row'] = current_row
  • If the chance of finding duplicated rows is low, it may be better to add all the rows without checking for duplicates, then scan the table with an ordered search and delete the duplicates afterwards. One search at the end with a few deletes may be much faster than checking every row

  • Another approach to checking for duplicates is to keep a set in memory with all the rows. If each row is small, keeping 100,000 items in a set should be manageable. If each row is large, then you could hash them and keep the hashes in the set (assuming that they are hashable, which should be the case because your data is coming from a csv). A rough sketch follows below.
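
A minimal sketch of that last idea, assuming the duplicate check covers every CSV column (the helper names are made up, and values are compared as strings because the upload code converts every column with .apply(str)):

    def row_key(d):
        # Hashable fingerprint of one row's values
        return tuple(sorted(d.items()))

    def load_existing_keys(table, columns):
        # One search over the whole table instead of one get() per CSV row
        return {row_key({c: str(r[c]) for c in columns}) for r in table.search()}

    # In the background task, before the loop:
    #     seen = load_existing_keys(table, df.columns)
    # and inside the loop:
    #     key = row_key(d)
    #     if key in seen:
    #         ...            # duplicate: handled entirely in memory
    #     else:
    #         seen.add(key)
    #         table.add_row(**d)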

@owen.campbell I was told by others that the transaction blocks do not work like that and only do validation / rollback for atomicity.

Do you know if that is incorrect, and whether it actually works the way you (and I, who previously assumed the same) think it does?

I’ve never been able to find a concrete answer to this.

Boy, if only you could put a UNIQUE constraint on data tables… Wonder if this has been mentioned before in feature requests…

:wink:

Programmatic constraints on Data Tables

The reason why you use a transaction is to manage atomicity.

The nice side effect is that they are faster.

Between the start and the end of the transaction, the database engine doesn’t need to notify anybody about what’s going on.

Between the start and the end of the same set of operations without a transaction, the database engine needs to make sure that any other access to the database knows about those operations. That adds an overhead that in some cases has an impact on performance.
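
For example, a minimal sketch using the @anvil.tables.in_transaction decorator to commit each chunk of adds in one go (the chunk size and the add_chunk name are just illustrative):

    import anvil.tables as tables

    @tables.in_transaction
    def add_chunk(table, records):
        # Everything added here is committed together when the function returns
        for d in records:
            table.add_row(**d)

    # In the background task, instead of adding rows one at a time:
    #     records = df.to_dict(orient="records")
    #     for i in range(0, len(records), 500):
    #         add_chunk(table, records[i:i + 500])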

Yes, that is what I use them for (inserting chunks of 10, 50 or 100k rows at a time) in place of something like .executemany(). I always found it to be faster so I never changed it, but for some reason there was an idea floating around that it wasn’t optimized that way.

Thanks for clarifying it.

Also, I like your idea of using sets. I could imagine implementing a UNIQUE constraint across multiple columns by constructing a set of tuples for the current table state and another for the data to insert, then using .intersection() to find the collisions.

This is obviously possible to do just using .search() etc. from Anvil’s toolkit, but I bet it would be way faster for tens of thousands of rows: collecting, comparing, and then only writing the relevant inserts or updates (rough sketch below).
:thinking:
…really got me thinking now.
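
Something like this, perhaps, as an untested sketch (the key columns and helper names are hypothetical):

    KEY_COLS = ('name', 'email')   # hypothetical columns covered by the "UNIQUE constraint"

    def key_of(mapping):
        # Tuple of the key column values, so it can live in a set
        return tuple(str(mapping[c]) for c in KEY_COLS)

    def find_collisions(table, incoming):
        existing_keys = {key_of(r) for r in table.search()}   # one pass over the table
        incoming_keys = {key_of(d) for d in incoming}
        return existing_keys.intersection(incoming_keys)

    # Then write only the rows that don't collide:
    #     collisions = find_collisions(table, records)
    #     for d in records:
    #         if key_of(d) not in collisions:
    #             table.add_row(**d)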

These two rules of thumb are always at the base of any performance decision:

  • one python statement is faster than one database access - one operation with sets is faster than one operation on the database

  • one database access is faster than thousands of python statements - one database search is faster than a long python loop

After taking care of these two points, you can keep in mind that:

  • grouping all the database operations in one transaction is a little faster
  • decreasing the number of accesses to anvil.server.task_state can help
  • some operations are faster in sets than in lists
  • etc.

But usually most of the performance gains come from those two points.

To clarify what I meant in that other post: Anvil is not going to ‘stash’ your changes and submit them to the database as a single SQL call - it’ll just tell postgres to open and close a transaction around your statements. So you still incur the overhead of touching the database each time you make a CRUD operation on the Anvil table/row object.

I didn’t know that a transaction might offer some performance benefits too - interesting. I’m still guessing that’s small potatoes vs the Python stuff… anyone want to do some testing? :smiley:

Because of this constraint, I don’t know of a good answer to AWSOM’s question. As Stefano says, one database access is faster than thousands of Python statements. However, there’s no ‘bulk insert’ made available through the Anvil Tables API (feature request?). So, at best, you cannot avoid one Python statement per row, and therefore tens or hundreds of thousands of Python statements in this case.

Yes.

I did some tests a long time ago, and it saved anywhere between nothing and 30% of the time, depending on your luck.

Meanwhile, getting rid of the get_next_uid(table) used at the top of the thread, and the get() used to check for duplicates before adding each row, would go from hitting the database 4 times per row to 1, so it would save 75-ish% of the time.

Very helpful thank you!

Thanks for the link Owen. Very helpful.

Great suggestions thanks Stefano.