Performance Issues Working with Anvil Data Tables and Row Iterators

My app has been fairly slow for a while (7-15 second data grid load time for ~500 records), and I finally got around to investigating it more deeply. I will probably write another post soon with some questions about best practices. However, I seem to have reached a point where I may need to abandon Anvil Data Tables altogether due to performance, so I wanted to see if I am missing anything before doing so.

I have gone through a ton of performance posts. Here are a few good ones:

However, I was still struggling to get down to a <3 second load time for my data grid. After investigating, it seems that a lot of my issues arose from constructing lists of dicts/Anvil Rows from a table.search(). In many solutions on the forums and in the docs, people construct these lists of dicts from the table's row iterator (e.g. Table search question).

Anyway, I decided to time some of these constructions for a 21-column / 500-record Anvil data table. There is an example here:

https://anvil.works/build#clone:HJYFARTFSFW5S7X2=OEKV3BR4O6E4IQ6TSS2SOYVR

Basically, clicking the "GET" button runs this code:

  def button_3_click(self, **event_args):
    # assumes `import datetime` and `import anvil.server` at the top of the form module
    print("Anvil table with 21 columns and 500 rows\n")
    then = datetime.datetime.now()
    
    columns_21_without_images = anvil.server.call('get_it')
    
    print("(1) Server load 21 columns w/o images:")
    print("      Time: " + str(datetime.datetime.now() - then) + '\n')
    then = datetime.datetime.now()
    
    columns_21_with_2_image_columns = anvil.server.call('get_it2')
    
    print("(2) Server load 21 columns with 2 image columns:")
    print("      Time: " + str(datetime.datetime.now() - then) + '\n')
    then = datetime.datetime.now()
    
    dict_columns_21_without_images = [r for r in columns_21_without_images]
    
    print("(3) Create list of dicts from search w/o images:")
    print("      Time: " + str(datetime.datetime.now() - then) + '\n')
    then = datetime.datetime.now()    

    [r for r in columns_21_with_2_image_columns]
    print("(4) Create list of dicts from search with images:")   
    print("      Time: " + str(datetime.datetime.now() - then) + '\n')
    then = datetime.datetime.now()
    
    [r for r in dict_columns_21_without_images]
    print("(5) Create list of dicts from list of dicts:")   
    print("      Time: " + str(datetime.datetime.now() - then) + '\n')

with an output of:

[screenshot of the timing output]

The point is that any time I have to run a loop over the 500-record search iterator, I burn 1-2 seconds of load time. The last case does the same thing with a list of 500 dictionaries, which runs roughly 1000-2000x faster. This seems to suggest a lot of relative overhead for Anvil rows (also see stefano.menci's comment here).

Why would I want to loop over the search? Well, there are many reasons. I will give one example (simplified from my real case), and I can give more if needed. Each of my Anvil rows has a category column. The category column is actually a link to another Anvil table, but let's pretend it is text. So each of my 500 records is assigned to, say, one of 10 categories. I would like to make a dropdown that displays those 10 categories so people can filter by them. One way to do this would be to construct a list of dictionaries from my 500-row search iterator and feed that into the dropdown.
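For concreteness, the naive version of that dropdown looks roughly like this (column and component names are made up); the set comprehension is exactly the kind of full unpack that burns the 1-2 seconds above:

rows = anvil.server.call('get_it')  # returns the lazy search iterator
# forces every row to be fetched just to find ~10 unique values
categories = sorted(set(r['category'] for r in rows))
self.drop_down_1.items = categories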

Sorry for the long post. I actually really enjoy aspects of the Anvil tables and am hesitant to give them up. But my questions are…

Questions

  1. When Anvil tables get past the size of 100-1000 rows, do they stop being useful?
  2. Is the performance of the Anvil row iterator really as bad as I am getting (compared to list of dicts)? Am I doing something incorrectly? Any ideas on speeding it up?
  3. If the answer to (1.) is yes, what is the best alternative? Should I store data in an external SQL database and call a server function to ask the database for a list of dicts? Then, send that list of dicts to my data grid? Could someone describe an example of a workflow they use for a database with >1000 records?
  4. In particular, I am very interested in creating something like an Amazon-style display: you have 1000 inventory items and you want to give the user dropdowns and boxes to filter things. Should the workflow be to request a large, unfiltered amount of data from your database with one server call and then construct a pandas dataframe (or an alternative) on the client side for all filtering events? Or should I make a server call every time a new filter is requested? For example, I would ask the server to ask the database for all the items and then send them to the client. Then the user wants to filter out expensive items, so I go back to the server with the query, ask the database to apply the filter, and return to the client exactly what needs to be displayed as a list of dicts. Which one do you guys go for?

Again, sorry for the long post. These questions have built up over 6 months, and I have many details around each. Thanks for any help.

I will try my best to give a bit of insight here.

First, an app_tables search call returns a lazy search iterator. I think they load on the order of 20 row objects at a time, and only load objects as needed. This is sometimes good, but when you iterate through an entire search iterator, you load all the rows in batches, and (I think) that means a back-end call every 20-ish rows. So performance rapidly degrades, and unpacking a search iterator is expensive, as you're finding. It's a work-around in some cases, but with any real number of rows you end up where you are. The good news is that if you don't need to unpack the search iterator, you can pass it to a DataGrid on the client side and it will only load rows as the client paginates through the DataGrid. Fancy.
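For example, something like this keeps everything lazy (table and component names are hypothetical):

Client code:

self.repeating_panel_1.items = anvil.server.call('get_rows')

Server code:

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_rows():
    # return the lazy search iterator itself; the DataGrid's repeating panel
    # only fetches rows as the user paginates through it
    return app_tables.product_data.search()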

It would be great if you could sort/filter by a linked column but right now I don't think that's possible (without unpacking the entire search iterator, which we want to avoid). Here's a plug for everyone to +1 this feature request.

You could add a redundant field to your rows and filter/sort on that, like your hypothetical category column, and keep a second data table that tracks the unique values in that column. That way you can return that (small) list to the client for the list of categories and avoid doing something like set([r['categories'] for r in search_results]), which, again, unpacks the entire search iterator.
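A rough sketch of that pattern (table and column names are hypothetical):

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_category_choices():
    # 'categories' is a small helper table with one row per unique category
    return [r['name'] for r in app_tables.categories.search()]

@anvil.server.callable
def add_product(**fields):
    app_tables.products.add_row(**fields)
    # maintain the redundant lookup so the dropdown never has to scan 'products'
    if app_tables.categories.get(name=fields['category']) is None:
        app_tables.categories.add_row(name=fields['category'])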

There IS a way to hack around this by leveraging the 'to_csv' method of a search iterator. It returns the entire result set of the search and does so shockingly quickly. You can then load it into a dataframe via a StringIO object, or just deal with it in Python, at Python speed.

Here's an example of returning a 10,000 row table as a list of dictionaries in 3 seconds, including conversion to and then from a dataframe:
https://anvil.works/build#clone:575YYURRGKKAGWHU=A7PKA5XWLYUID363T2YHCHLB
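The server-side part of that looks roughly like this (a sketch; the table name is made up, and it assumes pandas is available in your server environment):

import io
import pandas as pd
import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_big_table_as_dicts():
    # one bulk export instead of hundreds of lazy round trips
    csv_media = app_tables.big_table.search().to_csv()
    df = pd.read_csv(io.BytesIO(csv_media.get_bytes()))
    # plain dicts are cheap to send to, and loop over on, the client
    return df.to_dict(orient='records')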

So to answer your questions:

  1. Still useful, but you do run into some constraints (especially with filtering/sorting on linked tables)
  2. Remember they are lazy, and unpacking them naively will always suck if they are even a little bit big.
  3. External SQL would be fine but it's probably not necessary.
  4. Lots of ways to skin that cat! But despite the datatable -> csv -> dataframe hack, I'd still try to keep things in search iterators to keep them lazy, use helper tables to keep track of the known values you might be filtering on, and include those in the data table for filtering and sorting purposes. There are a lot of benefits to staying with search iterators/row objects (lazy loading, for example) and I'd resist giving that up at least a little longer!
10 Likes

I've been able to use some tricks to make all my app load times much better, but it often takes more work, and a lot more thinking about data structures (eek!), than the built-in data tables do.

As @danbolinson mentioned, if you are dealing with a large number of rows then loading it into a dataframe will add about 1 second, but future filtering or transforms are then lightning fast (because pandas leverages C on the back end). Pandas isn't supported on the client side, though. You can always export to a list of dicts with:

my_list_of_dicts = df.to_dict(orient='records')

To answer your questions from my perspective:

  1. I don't think so. It just depends on how the user will interact with the database. If huge amounts of unique data will be constantly needed and you want lightning-fast response times, then using your own database might be a good idea. Then again, it's a lot of work to optimize a database and get it working fast with your specific type of queries.

For example, I have an app that uses an external database of about 2 million rows, and the database has to search and filter through the data on every load (https://agaar.mongol.ai). It takes about 9-10 seconds. It's not a commercial app, or I would definitely be using the dedicated plan (so that data would be available in the server). I use a Postgres database where I've set up indexing for the specific columns I filter on. But honestly, I hate that. It's annoying, it takes time, and you have to remember what the heck you did later and then figure out why performance is degrading because your host, or your app, or you did something stupid.

  2. What you are doing is returning the iterator to the client, which means the client has to go back and fetch batches to do the sorting. It's like having a bad internet connection with lots of dropped packets: you have to call back to the source and get more packets. It's quite inefficient. Instead, do something like:

Client-Code:

product_data = anvil.server.call('get_product_data')

Server Code:

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_product_data():
    results = app_tables.product_data.search()
    # convert each Row to a plain dict so everything comes back in one go
    return [dict(r) for r in results]

If you want to filter using linked rows then you have to use Anvil's tools. But for your use case that probably isn't efficient. You should add a unique key like you would in a standard RDBMS, and then you can filter the list of dicts with that key, like so:

filtered_results = [x for x in product_data if x['category_id'] == 'shoes']

  3. For my large database I use the psycopg2 package to pull data, and typically sqlalchemy to send data because the syntax is easier. I use the pandas SQL connector with those libraries. But for your case I really don't think you need an external database.

  4. For your case, if you want a lightning-fast experience, you are going to have to get creative. You could return only the first 20 rows as a list of dicts, then silently load the rest of the data (with anvil.server.call_s) into a client module. You can then do all the filtering on the client.

But you have a lot of edge cases to deal with there. For example what if someone filters before your background data is done loading? You would need some logic to deal with that.
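Something along these lines (a rough sketch; the table, module, and function names are made up):

Client code (in the form, with a client module data_cache imported at the top):

  def form_show(self, **event_args):
      # paint the grid immediately with the first page
      self.repeating_panel_1.items = anvil.server.call('get_first_page')
      # then quietly pull the rest; call_s suppresses the loading spinner
      data_cache.all_rows = anvil.server.call_s('get_all_rows')
      data_cache.loaded = True

Server code:

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_first_page(n=20):
    rows = []
    for r in app_tables.product_data.search():
        rows.append(dict(r))
        if len(rows) >= n:
            break   # only the first batch is ever fetched
    return rows

@anvil.server.callable
def get_all_rows():
    return [dict(r) for r in app_tables.product_data.search()]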

As usual, the answer is mostly "it depends". If you have a specific situation we can give some more help with that.

3 Likes

I always get terrible performance when I do this, even on the server side:

filtered_results = [x for x in product_data]

I had assumed it was still making many (separate) calls to the data service and that explained the poor performance. Is that correct? Or is this expected to be performant on the Server?

Thanks!

Edit: Removed filtering from list comprehension example.

1 Like

You may be looking for this: Searching (querying) a Table

Thanks Phil but I'm specifically interested in the performance implication of unpacking a search iterator (i.e. by list comprehension) server side vs client side. I'll edit my message to remove the filtering.

This situation has been noted and explained in dozens of forum messages, including this one (if you start at the top), and has been the topic of several Feature Requests. I really can't add anything more to those excellent discussions.

1 Like

Irrespective of server vs. client for me as well. I timed it after you guys first posted because @robert seemed to hint that it might improve performance.

Thank you for your insights btw, @danbolinson and @robert. They have been very helpful to me. I was planning to respond with thanks and give some updates, but I have been very busy recently. Hopefully soon.

1 Like

I have just tested this method, and it seems linked to the type of plan you are on. I am presuming that if you are not on a dedicated server, where the data tables and app exist on the same server, performance could be slow.

If you are on a dedicated plan with the app and tables on the same hardware, creating a list of rows in this way adds almost no overhead (less than 0.0001 seconds on top of a search that takes 4 seconds to run).

I'm getting a different result. With 2000 rows I can return the search iterator in 2 seconds, but if I have to convert it to a list (i.e. by applying filtering through a list comprehension) then it takes almost 20 seconds for just 2000 rows.

I am on the 'free' account here but expect the same results on other runtimes. Here's my performance testing if you want to try it on a different runtime:

https://anvil.works/build#clone:575YYURRGKKAGWHU=A7PKA5XWLYUID363T2YHCHLB

I did the following tests with your clone link: the first is the free plan's Basic Python, the second is the paid-for Full Python.

It is half the time, but IMO that is not great. I also wanted to see what happens if you do the comprehension on the client side (using the client's browser). I assume this would mean each user, on different hardware, would get different results.

So, for the middle part of the test, my machine (running Chrome and the Anvil (not beta) IDE) was:
[screenshot of machine specs]

Here are the results:
Basic Python:
[screenshot: Basic Python timing results]

Full Python:
[screenshot: Full Python timing results]

…so it made no sense to me that it would take longer for the iterator under Full Python, so I ran it again and got this result:
[screenshot: second Full Python run]

I… also don't know what conclusions to draw from this. :man_shrugging:

You should run each test many times and use the fastest. That's how fast it can go.

It is not going to be faster, because it can't go faster, but it can go slower if other things are affecting the test result. By running it many times you weed out the noise coming from those other things.
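For example, on the server you could do something like this (a sketch; the table name is made up):

import timeit
from anvil.tables import app_tables

def unpack():
    return [dict(r) for r in app_tables.product_data.search()]

# keep the minimum of several runs - that is "how fast it can go"
best = min(timeit.repeat(unpack, number=1, repeat=5))
print("best of 5 runs: %.3f s" % best)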

I wonder what the effects of caching are, whether on the Anvil cloud side, the client browser, or the PostgreSQL side; I don't really know what the internal mechanisms are.

Either way, I already deleted the cloned app.

I'll refer you to my earlier message. You're observing the laziness of the search iterator. I am guessing the Full Python runtime is beefier, so you are faster, but the fact remains that, due to the laziness of the iterator, you are making multiple subsequent calls to get the data.

Search iterators are generator-like: they return SOME (not all) of the data with each yield. So if you use a list comprehension (or any other method that unpacks the iterator), it goes back and forth to the server until it's exhausted. Expensive! If you do this client side, it'll be even worse (probably MUCH worse), because those subsequent calls incur the network time as well.
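Conceptually it behaves something like this (purely illustrative, not Anvil's actual implementation; fetch_batch stands in for a round trip to the data tables service):

def fetch_batch(start, size):
    # pretend this is a network round trip returning one batch of rows
    return [{"id": i} for i in range(start, start + size)]

def lazy_search(total, batch_size=20):
    for start in range(0, total, batch_size):
        for row in fetch_batch(start, batch_size):   # one "round trip" per batch
            yield row

# a list comprehension exhausts the generator, paying for every round trip up front
all_rows = [r for r in lazy_search(total=2000)]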

It is worth noting, this only matters when you are dealing with a lot of rows (ballpark >1000 server side, a couple hundred client side).

Look for "Search is Lazy":

1 Like

I understand the iterator is lazy (it's a feature, not a bug, etc.). In every circumstance where I have run into this problem I end up limiting the results by creating more specific search queries.

Database engines are much more adept at returning specific, ordered results while self-optimizing for speed than almost anything I could ever come up with in a JIT-compiled programming language.

I can use the programming language as an intermediate step, to apply logic that chooses what type of search, search query, etc. is best, but the bulk of the 'chaff' of the data should be handled at the db search level.
I can programmatically build a better search query, but the database engine is always going to be the go-to place where speed happens.
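For example, Anvil's own query operators let the database do the narrowing while the result stays a lazy iterator (a sketch; table and column names are made up):

from anvil import tables
from anvil.tables import app_tables
from anvil.tables import query as q

# filtering and ordering happen in the database, not in a Python loop
cheap_shoes = app_tables.products.search(
    tables.order_by("price"),
    category="shoes",
    price=q.less_than(50),
)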

This is just in my experience, feel free to disagree. I would love to learn a new method.

3 Likes

I use as few tables as possible and put all the stuff that is not used for querying in a simple object column. Linked tables are often victims of the n+1 problem; simple object columns are often faster.

In some cases, when the simple object columns contain large amounts of data, even working with simple objects can be slow and just getting 50 of them can take 1 second.

So my apps usually use the Anvil query operators when working with one record and, sometimes, use SQL (I think only available on dedicated plans) when working with or aggregating many rows.

For example, the forms that show a list of documents may use SQL to get the list, dig inside the simple object columns, and add some statistics. Once the user picks a document, the forms that work on that document use only Anvil queries. Most of the apps that use SQL use only one SQL query.
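A rough sketch of that layout (table and key names are made up):

from anvil.tables import app_tables

# columns used for querying stay as real columns; everything else
# lives inside one simple object column
app_tables.documents.add_row(
    doc_id="D-1042",
    status="draft",
    details={
        "title": "Q3 report",
        "tags": ["finance", "internal"],
        "sections": [{"name": "Summary", "words": 850}],
    },
)

# later: query on the real columns, read the blob from the matching row
row = app_tables.documents.get(doc_id="D-1042")
tags = row['details']['tags']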

2 Likes

Have you tried this?