Ways to speed up CSV import to a data table?

What I’m trying to do:

I’m building a little app that requires the user to upload a CSV file of at least 500 (ideally more than 1,000) rows. So far it works, but it’s very slow: currently, uploading a CSV of 1,000 rows (single column) takes ~3 minutes.

What I’ve tried and what’s not working:

I’m relying on the code from the “CSV Import” documentation. It totally works, it’s just impractically slow.

I’m also running my server code via the Anvil Uplink, since my app uses some packages that aren’t available/installed on Anvil yet. Not sure if this could be the reason for the lag?

Any suggestions are much appreciated!

Code Sample:

import anvil.media
import pandas as pd
from anvil.tables import app_tables

def store_upload_data(file):
    with anvil.media.TempFile(file) as file_name:
        if file.content_type == 'text/csv':
            df = pd.read_csv(file_name).dropna()
            for d in df.to_dict(orient="records"):
                # Pass the whole row dict as keyword arguments via Python's **kwargs syntax
                app_tables.upload_data.add_row(**d)

Welcome to the forum!

So the code you pasted is in an Uplink? Be aware that to actually add those data table rows, Anvil has to make a call to the actual server. So you’re making 500 (or 1,000) server calls in short succession. That’s bound to introduce some delay.

You might try bundling new rows into a list of dictionaries, and then making a server call that loops over that list and adds the rows on the server. You’d need to play with how big that list of dictionaries can be, and do timings to see if there’s a sweet spot where you get the best performance. At the very least you’d be making fewer server calls than you are now, which should improve throughput.
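
A minimal sketch of the server half of that idea, assuming a hosted Server Module and a hypothetical callable named add_upload_rows writing to your upload_data table:

# Hosted Server Module
import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def add_upload_rows(rows):
    # Each add_row here is a local database operation on the Anvil
    # server, not a network round trip from the Uplink
    for row in rows:
        app_tables.upload_data.add_row(**row)

# Uplink side: send the parsed rows in chunks and time each call
# (records = df.to_dict(orient="records") from the parsed CSV)
chunk = 100  # tune this with timings to find the sweet spot
for i in range(0, len(records), chunk):
    anvil.server.call('add_upload_rows', records[i:i + chunk])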

I have a nightly job that uploads 1,000 rows with 8 columns and takes less than 10 seconds.

In my case it’s not a Python script, so instead of using the Uplink I created an HTTP endpoint to which I pass a list of dictionaries as JSON.

Whether it’s an Uplink or an HTTP endpoint, the concept is the same one described by @jshaffstall: minimize the number of round trips by sending one list of rows, then doing the bulk of the database operations on the server.
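
For the HTTP endpoint variant, a rough sketch, assuming a hypothetical /import-rows path and the same upload_data table:

import anvil.server
from anvil.tables import app_tables

@anvil.server.http_endpoint("/import-rows", methods=["POST"])
def import_rows(**params):
    # The request body is expected to be a JSON list of {column: value} dicts
    rows = anvil.server.request.body_json
    for row in rows:
        app_tables.upload_data.add_row(**row)
    return {"imported": len(rows)}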

Just to add: if you’re using the Accelerate Tables beta, there’s a new API for doing this much faster.

app_tables.my_table.add_rows(list_of_dicts)

It can take any iterable of dictionary-like objects.
For a pandas DataFrame, your code might look like this:

def store_upload_data(file):
    with anvil.media.TempFile(file) as file_name:
        if file.content_type == 'text/csv':
            df = pd.read_csv(file_name).dropna()
            app_tables.upload_data.add_rows(df.to_dict(orient="records"))

So rather than making 1,000 server calls, you make a single server call :smile:

Thanks very much for all the help! @stucork, I have a few newbie questions after trying the add_rows functionality from the Accelerate Tables beta:

What I’ve tried so far:

I moved my store_upload_data function into a hosted Server Module in Anvil (instead of using the Uplink) just to see if I’d notice any difference, and, after opting in to the Accelerate Tables beta, modified the code as shown:

import datetime

import anvil.media
import anvil.server
import pandas as pd
from anvil.tables import app_tables

@anvil.server.callable
def store_upload_data(file):
    with anvil.media.TempFile(file) as file_name:
        if file.content_type == 'text/csv':
            df = pd.read_csv(file_name).dropna()

            # Record start & end times and print the elapsed time for .add_rows
            s1 = datetime.datetime.now()
            print("Initiating .add_rows function ...")
            app_tables.upload_data.add_rows(df.to_dict(orient="records"))
            s2 = datetime.datetime.now()
            func_time = s2 - s1

            print("Total time to execute .add_rows: {}".format(func_time))

What I’ve noticed testing this so far …

  1. My print statements don’t appear in either the ‘Running App Console’ or the ‘Server Console’ windows. Should I be looking elsewhere to see these?

  2. The upload still takes a really long time, around the same as before.

(I don’t doubt I’m completely missing something very basic here, as I’m pretty unfamiliar with app development.)

Also, both @stefano.menci and @jshaffstall suggested minimizing calls by ‘sending one list of rows and doing the bulk of database operations in the server’. I get this conceptually, but I’m having a hard time understanding how the original store_upload_data() function (shown below for reference) isn’t already doing this:

def store_upload_data(file):
    with anvil.media.TempFile(file) as file_name:
        if file.content_type == 'text/csv':
            df = pd.read_csv(file_name).dropna()
            for d in df.to_dict(orient="records"): 
                app_tables.upload_data.add_row(**d)

Doesn’t df.to_dict(orient="records") already convert the dataframe to a list of rows, which is then looped through? And isn’t this loop technically within a single server call, namely

anvil.server.call('store_upload_data', file)

?

How might one refactor this code to avoid making a server call for every row of data?

(Thank you in advance for your patience with me on this - I’m learning as I go, and there’s lots to learn … )

We all made the assumption that your store_upload_data function was in an Uplink program. While that technically counts as being a server call, it’s a server call that runs on the machine the Uplink program lives on (probably your home machine). Every call inside that function that needs to do something with the server has to go out over the network to the actual Anvil server.

Every call to the actual Anvil server spins up a new instance of the Python interpreter, which delays things beyond what network transit time alone would suggest. So making a lot of calls to add_row from an Uplink program means making a lot of separate server calls.

If the assumption about store_upload_data being in an Uplink program isn’t right, then the comments about sending a list of dictionaries over the network don’t apply.
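
Assuming it does live in an Uplink script, a minimal sketch of the refactor: keep the CSV parsing where it is, but replace the per-row add_row loop with a single call to a server-side function (such as the hypothetical add_upload_rows sketched earlier in the thread):

# Uplink script
import anvil.media
import anvil.server
import pandas as pd

def store_upload_data(file):
    with anvil.media.TempFile(file) as file_name:
        if file.content_type == 'text/csv':
            df = pd.read_csv(file_name).dropna()
            # One network round trip instead of one per row
            anvil.server.call('add_upload_rows', df.to_dict(orient="records"))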

Gotcha. I’ve since moved the function into an Anvil Server Module to see if it helps; the lag persists.

I also made sure to disconnect my Uplink, and checked the App Logs to see if the print statements from my server-side code show up there, but they only show browser sessions & output. Still working on it…

Quick update to say this has been resolved by using @stucork’s recommended modifications and moving the store_upload_data() function into an Anvil Server Module.
