Upsert or avoid duplicate rows in data table

What I’m trying to do:
I’m trying to add a row to a data table, but not if a matching row already exists (avoiding duplicate entries). Essentially, an ‘upsert’.

What I’ve tried and what’s not working:
I have an if-statement that checks whether there’s a matching row and then, if there isn’t, adds a row to the data table.

This if-statement slows down my server function quite a bit.

Code Sample:

import anvil.server
import anvil.users
from anvil.tables import app_tables

@anvil.server.callable
def upsert_urls_to_bot(urls, bot_name):
    """Upsert URLs to a bot as a document."""
    user = anvil.users.get_user(allow_remembered=True)
    bot = app_tables.chatbots.get(User=user, bot_name=bot_name)
    if user and bot:
        print('Adding URLs to chatbot sources')
        for url in urls:
            # Up to two .get() calls per URL: one to check, one to fetch
            if app_tables.documents.get(bot=bot, url=url):
                row = app_tables.documents.get(bot=bot, url=url)
                row['url'] = url
            else:
                app_tables.documents.add_row(bot=bot, url=url)
        return app_tables.documents.search(bot=bot)
    else:
        print('add_urls_to_bot: User or bot not defined')
        return None

If there’s no upsert, what else can I do to speed this up?

Is user a linked column in bot? Or are you storing an entire copy of the “user” object/row, which could be larger than, say, just the user’s email?

If so, a text search of just the email might be faster?
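For instance, a minimal sketch, assuming a hypothetical text column owner_email on the chatbots table (not from the original schema):

# Hypothetical: if chatbots stored the owner's email in a text column
# called 'owner_email', the lookup compares short strings instead of
# whole user row objects
bot = app_tables.chatbots.get(owner_email=user['email'], bot_name=bot_name)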

There are many more things that can make it faster.

Also, look at the app tables docs for app_tables.table_name.get(); they show how to short-circuit the or operator to get-or-create a row without the if statement (and with one call to .get() instead of two).

You can then just go:

row['url'] = url
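A minimal sketch of that get-or-create pattern, using the table and column names from the code above:

# .get() returns None when no row matches, so `or` falls through to add_row()
row = app_tables.documents.get(bot=bot, url=url) or app_tables.documents.add_row(bot=bot, url=url)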

Also, the same question applies to the bot column in the documents table. If you are just comparing rows, you might want to use accelerated tables and fetch_only so that nothing but the row id is loaded inside the row object for the comparison.
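Something like this sketch, assuming the Accelerated Tables beta is enabled:

import anvil.tables.query as q

# Precache only the 'url' column; other columns (including the contents
# of the linked bot row) are fetched lazily if they are ever accessed
rows = app_tables.documents.search(q.fetch_only('url'), bot=bot)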


Thanks for that, really helpful.

I drastically increased the speed by short-circuiting the “or” operator (Anvil Docs | Using Data Tables from Python) and gained a bit more by using fetch_only (Anvil Docs | Accelerated Tables Beta).

Here is my take (typed on a phone, I hope it’s not messy).

Your code seems to add rows, but doesn’t seem to update anything. This version does a single search rather than many gets, so it should be faster:

import anvil.server
import anvil.users
from anvil.tables import app_tables
import anvil.tables.query as q

@anvil.server.callable
def upsert_urls_to_bot(urls, bot_name):
    """Upsert URLs to a bot as a document."""
    user = anvil.users.get_user(allow_remembered=True)
    if not user:
        print('add_urls_to_bot: User not defined')
        return None
    bot = app_tables.chatbots.get(User=user, bot_name=bot_name)
    if not bot:
        print('add_urls_to_bot: bot not defined')
        return None

    print('Adding URLs to chatbot sources')
    # One search for all matching URLs, instead of one .get() per URL
    # (note q.any_of takes the URLs as separate arguments, hence *urls)
    rows = list(app_tables.documents.search(bot=bot, url=q.any_of(*urls)))
    existing_urls = [row['url'] for row in rows]
    for url in urls:
        if url not in existing_urls:
            rows.append(app_tables.documents.add_row(bot=bot, url=url))
    return rows

Love it, thanks! That did indeed shave off another 1-2 seconds for my dataset, which will translate into much more time saved with my bigger datasets.

I am curious about using fetch_only on searches since I’m only using the ‘url’ column here. I tried something like:

rows = list(app_tables.documents.search(q.fetch_only('url'), bot=bot, url=q.any_of(*urls)))

But it doesn’t seem to help with speed beyond what I have done so far.

q.fetch_only allows you to decrease the number of precached values.

If you are returning row objects to the client and you have many columns that you are not going to use, or some columns with large amounts of data, then not transferring all that unused data from the server to the client can save some time.

But when you are on the server, it only helps if it skips the precaching of very large amounts of data. You are unlikely to see a difference if you prefetch less than a few megabytes.
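For example, a sketch of the client-facing case (get_document_urls is a hypothetical helper, not from the thread):

import anvil.server
import anvil.users
from anvil.tables import app_tables
import anvil.tables.query as q

@anvil.server.callable
def get_document_urls(bot_name):
    """Return slim row objects where only 'url' is precached."""
    user = anvil.users.get_user(allow_remembered=True)
    bot = app_tables.chatbots.get(User=user, bot_name=bot_name)
    # Columns left out of fetch_only (e.g. a large scraped-text column,
    # if one existed) are only transferred if the client actually reads them
    return app_tables.documents.search(q.fetch_only('url'), bot=bot)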


Is there a reason you put the rows in a list before iterating over them? We can already iterate over the .search() result.

The function returns the rows matching the query, including both the rows that already existed and the newly created ones.

I made a list, so the newly created rows are added to it as they are created, then I return the list.

Otherwise I would need a second search at the end to return the rows.
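In other words, without the list, the tail of the function above would need something like this sketch:

# Add the missing rows first...
for url in urls:
    if url not in existing_urls:
        app_tables.documents.add_row(bot=bot, url=url)
# ...then hit the database a second time just to return everything
return app_tables.documents.search(bot=bot, url=q.any_of(*urls))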
