What is the fastest way to update over 80000 rows

What I’m trying to do:
I have a large database with over 800,000 rows, and I would like to update a specific value in each row.
What I’ve tried and what’s not working:
A for loop, but it takes hours!
Code Sample:


def update_row_status(all_rows, status):
    for r in all_rows:
        r.update(status=status)

update_row_status(app_tables.tasks.search(), "done")
      


Is it 80,000 or 800,000?

Regardless, Anvil app_tables, as far as I know, don’t allow bulk updates. You can speed it up a little by wrapping the loop in a transaction, but first do a test with 100 rows, because I’m not sure it will help. Also, that could lock your entire table for hours.
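
Something like this, maybe (untested; a minimal sketch assuming the same tasks table and status column as in your code):

import anvil.tables as tables
from anvil.tables import app_tables

def update_row_status_in_transaction(status):
    # Wrap the whole loop in one transaction; test on ~100 rows first,
    # since a long transaction may lock the table.
    with tables.Transaction():
        for r in app_tables.tasks.search():
            r.update(status=status)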

If you have a dedicated server, you can use SQL and it will be very fast.
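
Roughly what that looks like, as a generic illustration only (on a dedicated plan the underlying Data Tables schema is Anvil’s own internal one, so the connection string, table name and column name below are placeholders you would have to adapt):

import psycopg2  # assumes direct Postgres access on a dedicated/own-server plan

def bulk_update_status(conn_string, status):
    # A single UPDATE statement touches every row server-side,
    # with no per-row round trips from application code.
    with psycopg2.connect(conn_string) as conn:
        with conn.cursor() as cur:
            cur.execute("UPDATE tasks SET status = %s", (status,))
        # leaving the connection's with-block commits the transaction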

1 Like

Are you using Accelerated Tables? Because if so, you might be interested in our (prerelease) support for batch updates. It works like this:

with anvil.tables.batch_update:
    for r in rows:
        r['status'] = foo

(You might want to work in batches smaller than 80000 - perhaps a thousand at a time?)
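
For example, a minimal sketch of that chunking idea (untested; assumes Accelerated Tables and the prerelease batch_update, with an invented helper name):

import itertools
import anvil.tables

def update_status_in_chunks(rows, status, chunk_size=1000):
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        # Each with-block flushes one batch of writes to the database.
        with anvil.tables.batch_update:
            for r in chunk:
                r['status'] = status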

5 Likes

Here is a generic batch-update function, modified for use with this new anvil.tables.batch_update (untested):

def update_batch_sized_rows(rows, update: dict, batch_size=1000, verbose=False):
    enumerated_rows = enumerate(rows, start=1)
    while True:
        with anvil.tables.batch_update:
            for i, row in enumerated_rows:
                row.update(**update)
                if i % batch_size == 0:
                    if verbose:
                        print(f"{i} rows have been updated.")
                    break

            # Peek at the next row: if there is none, every row has been updated.
            i, row = next(enumerated_rows, (None, None))
            if i is None:
                # The enumerator is exhausted, so break out of the while loop.
                if verbose:
                    print(f"A total of {len(rows)} rows have been updated.")
                break

            row.update(**update)  # next() consumed a row while peeking, so update it here
            continue  # Leaving the with block commits this batch, then the loop starts the next one

3 Likes

Thank you @meredydd and @ianb. I will give it a try!

So, when I tested today:

  • I tried running it on the server, with its 30-second limit (for 1,000 rows), but I got a timeout error
  • then I tried a background task, and it took 7 minutes to complete the update for 1,000 rows!
@anvil.server.callable
def update_rows():
    tasks = app_tables.tasks.search(q.fetch_only("Email", "Contribution"))
    print(len(tasks))
    # update_batch_sized_rows(tasks, {'Contribution': '3'})
    anvil.server.launch_background_task('update_batch_sized_rows', tasks, {'Contribution': '3'})
  
@anvil.server.background_task
def update_batch_sized_rows(rows, update: dict, batch_size=1000, verbose=False):
    enumerated_rows = enumerate(rows, start=1)
    while True:
        with anvil.tables.batch_update:
            for i, row in enumerated_rows:
                row.update(**update)
                if i % batch_size == 0:
                    if verbose:
                        print(f"{i} rows have been updated.")
                    break

            # Peek at the next row: if there is none, every row has been updated.
            i, row = next(enumerated_rows, (None, None))
            if i is None:
                # The enumerator is exhausted, so break out of the while loop.
                if verbose:
                    print(f"A total of {len(rows)} rows have been updated.")
                break

            row.update(**update)  # next() consumed a row while peeking, so update it here
            continue  # Leaving the with block commits this batch, then the loop starts the next one

Any idea how to increase the speed?

Just got a quick look, not a detailed examination, so take it with a kg of salt…

It looks like you are writing the exact same data (update) to thousands of rows.

Could those rows all simply refer to a single record that gets updated once?

I am not sure what you mean by

all simply refer to a single record that gets updated once?

I have no idea whether this would actually work in your particular situation, but let me explain.

Imagine a new table, with a row that contains the contents of update. This row contains one copy of the data common to (and shared by) all the rows you wish to update.

Also imagine that every one of the rows passed to update_batch_sized_rows() has a link column, pointing to the above row.

Now, updating the shared row is a single, one-record step. The references to that row don’t need to be updated. They continue to point to the shared row. The only thing that changes is the content of that one shared row.

The update has thus been condensed from an 80,000-row job to a one-row job.
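
A hypothetical sketch of that idea (the shared_status table, its Contribution column and the status_link link column are invented names, not anything from the original app):

from anvil.tables import app_tables

def set_shared_contribution(value):
    # The single shared row holds the value common to all tasks.
    shared = app_tables.shared_status.get()  # assumes at most one row in this table
    if shared is None:
        app_tables.shared_status.add_row(Contribution=value)
    else:
        shared['Contribution'] = value

# Each task row would then have a link column pointing at that shared row,
# so task['status_link']['Contribution'] reads the shared value.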

This seems to fit the workflow of this one batch-update step. Whether this fits your other uses of those 80,000 rows, I have no idea.

1 Like

I did not test it, since I am not using Accelerated Tables.
However, I have seen plenty of the new server code, since it is open source, and most of it falls back to the old-style tables if you do not have Accelerated Tables enabled.
That means you would not get an error even if the code calls a method that is unavailable without Accelerated Tables (the v2 tables code).

Are you using the beta version, with Accelerated Tables ticked and turned on, as in the link @meredydd posted?

1 Like

Yes, I am using the beta editor and Accelerated Tables.

Thank you for the cool idea, but unfortunately it doesn’t work with my project :confused:

I think you have the answer to “what is the fastest”, but I understand that is still very slow in the context of the question. I have regular updates that touch 20-60k rows (admittedly a very large number of columns by design, but still…), and they regularly take 5-6 hours, and that is after I optimized them heavily so they would not take longer than a week.

I peeked into the batch-update code, and it seems to just be doing a sequential write to the database after checking that you should actually be able to write to all of the rows you have given the batch-update context manager. This is going to be faster than a round trip to the database for every single operation, but it is still handing off every operation separately to be completed by the database handler in whatever way it finds most efficient.
This is simply not fast enough to update 80,000+ rows, in my opinion.

This is not bad; I am just saying that if this is what you need, you have exceeded the use case for Data Tables, and you should look into using something like the Uplink attached to another machine that runs a different database.
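
A hedged sketch of what the Uplink route could look like (the key, function name and body are placeholders):

import anvil.server

anvil.server.connect("YOUR_UPLINK_KEY")  # placeholder key from the app's Uplink settings

@anvil.server.callable
def bulk_update_contribution(value):
    # Here you would run a single UPDATE against your own database
    # (e.g. via sqlite3 or psycopg2) instead of row-by-row writes.
    pass

anvil.server.wait_forever()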

To test out this concept at the lowest cost, I suggest trying it on your local machine, using Python’s sqlite3 module to create a file-based SQLite database so you can tinker with the concept.

Doing a single update on millions of rows takes less than a second with standard modern database access.
Updating over a million rows at once is a common thing.
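
For example, a quick local experiment along those lines, using only the standard library (schema and row count are arbitrary):

import sqlite3
import time

conn = sqlite3.connect("test.db")
conn.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, contribution TEXT)")
conn.executemany("INSERT INTO tasks (contribution) VALUES (?)",
                 (("0",) for _ in range(1_000_000)))
conn.commit()

start = time.time()
cur = conn.execute("UPDATE tasks SET contribution = '3'")  # one statement, every row
conn.commit()
print(f"Updated {cur.rowcount} rows in {time.time() - start:.2f} s")
conn.close()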

3 Likes