Serializing data in linked tables is slow

What I’m trying to do:
I have two linked tables, Trips and Shipments (a trip consists of one or more shipments); each table has about 20 columns.

I am trying to retrieve all trips as a serialized list. To do this, I turn each trip row in the Trips table into a dict; I then go to the linked rows in the Shipments table for that trip and do the same.

What I’ve tried and what’s not working:
It is slow: it takes 15 seconds for just 100 trips/rows in the Trips table, which feels like a long wait for the user. I would greatly appreciate any pointers or suggestions to help reduce this time.

Code Sample:

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_trips_serialized():
  trip_list = []
  search_iter = app_tables.trips.search()
  for trip_row in search_iter:
    # One dict() call per trip row...
    trip_dict = dict(trip_row)
    trip_dict['shipments'] = []
    # ...and one more per linked shipment row.
    for shipment in trip_row['shipments']:
      trip_dict['shipments'].append(dict(shipment))
    trip_list.append(trip_dict)
  return trip_list

First of all, I would check where the 15 seconds are spent by adding print('a', time.time()) after the first for, print('b', time.time()) after the second, and print('c', time.time()) after the loops have completed.
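Applied to the function above, that instrumentation might look like this sketch (time is from the Python standard library; the prints fire once per iteration, which is the point, so you can see where the gaps are):

import time

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_trips_serialized():
  trip_list = []
  for trip_row in app_tables.trips.search():
    print('a', time.time())          # after the first for
    trip_dict = dict(trip_row)
    trip_dict['shipments'] = []
    for shipment in trip_row['shipments']:
      print('b', time.time())        # after the second for
      trip_dict['shipments'].append(dict(shipment))
    trip_list.append(trip_dict)
  print('c', time.time())            # after the loops are completed
  return trip_list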

If you find that the serialization is fast, the problem could be a slow connection combined with a large amount of data, and you should work on shrinking the payload.

If the problem is in the serialization, a way to speed it up could be to avoid the n+1 query problem by not using linked tables.

Instead of two 20-column tables, you could have one trips table with a handful of columns: the columns you need to sort and search your trips, plus one Simple Object column with everything else in it. Or you could have two Simple Object columns, one with the trip info that is not used for sorting and searching, and one with a list of shipments.
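As a sketch of that layout (all names here are illustrative, not from the original schema), writing and serializing a trip might look like:

from anvil.tables import app_tables

# Hypothetical schema: a text 'driver' column kept for sorting/searching,
# plus Simple Object columns 'trip_info' and 'shipments' holding the rest.
app_tables.trips.add_row(
    driver='Alice',
    trip_info={'origin': 'Leeds', 'destination': 'York'},
    shipments=[
        {'shipment_id': 1, 'weight_kg': 120},
        {'shipment_id': 2, 'weight_kg': 45},
    ],
)

# Serializing is then a single query, with no per-row link traversal:
trips = [dict(row) for row in app_tables.trips.search()]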

Another solution is, again, to not use a linked table column: add a trip_id column to both tables, do one search on the trips table and get your dictionaries, iterate them to collect a list of trip ids, then do one search on the shipments table with .search(trip_id=q.any_of(*trip_ids)), get those dictionaries, and iterate them to add each shipment to its trip. Here you are running more code, but I think that these loops, running in memory without touching the database, should be fast.
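A minimal sketch of that two-query approach, assuming a trip_id column exists in both tables (note that any_of takes its values as separate arguments, hence the * unpacking):

import anvil.server
import anvil.tables.query as q
from anvil.tables import app_tables

@anvil.server.callable
def get_trips_serialized():
    # One query for all trips.
    trips = [dict(row) for row in app_tables.trips.search()]
    trip_ids = [t['trip_id'] for t in trips]

    # One query for all shipments belonging to those trips.
    shipments_by_trip = {tid: [] for tid in trip_ids}
    for row in app_tables.shipments.search(trip_id=q.any_of(*trip_ids)):
        shipment = dict(row)
        shipments_by_trip[shipment['trip_id']].append(shipment)

    # Stitch the results together in memory, without touching the database.
    for t in trips:
        t['shipments'] = shipments_by_trip[t['trip_id']]
    return trips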


My understanding is that the ‘dict’ call requires another trip to the database. So you have a LOT of back and forth, hence the slow performance (which will get worse as your tables grow!).

So, the goal needs to be to reduce the number of times you hit the database.

First, is there a reason you are serializing the trips, vs sending them back as row objects (or a search iterator of row objects)? ‘Row’ objects are dict-like, so I’m betting you can achieve a similar outcome without the coercion to a Python dict, thereby avoiding the database call. Because of this database call you really should avoid dict(row_object) pretty much always. Also, search iterators load lazily, which usually helps with performance (though if you do need to unpack them, don’t do it client side).

Second, if this is something you do frequently, can you make ‘shipments’ a “Multiple Rows” linked column in your Trips table? This means that you are storing a list of links to shipments in your data table, and this column acts exactly like what you have in your syntax here (but handling Anvil row/search iterator objects instead of Python lists/dicts).

If you can do both of these things, your entire function boils down to:

@anvil.server.callable
def get_trips_serialized():
    return app_tables.trips.search()

You will still be able to reference columns by their column name (e.g. trip_list[0]['my_col']), you can get the shipments from the trip column (e.g. trip_list[0]['shipments']), and your performance will be very fast even with a LOT of trips/shipments.
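Client side, using the returned search iterator might then look like this sketch ('my_col' is from above; 'tracking_number' is a hypothetical shipment column):

import anvil.server

trip_list = anvil.server.call('get_trips_serialized')
print(trip_list[0]['my_col'])               # columns by name, no dict() needed
for shipment in trip_list[0]['shipments']:  # rows from the Multiple Rows column
    print(shipment['tracking_number'])      # hypothetical shipment column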

See:

My understanding (we go back and forth with subjective “my understandings” because these details are not documented) is that the laziness on the server is less aggressive than the laziness on the client.

So you have fewer calls to the database if you make it a dict on the server side rather than on the client side, and you have only one round trip if you make sure that the dict has everything, rather than returning the SearchIterator and having the client fetch stuff when needed.

The SearchIterator laziness:

  • helps a little (smaller data transfer) when the client side code does not need to access the lazy bits
  • hurts a lot (more round trips) when the client side code does need them
  • does the same on the server side, but to a lesser extent, because there is no round-trip overhead and because row objects seem to be less lazy on the server side

So I have learned to use dictionaries instead of SearchIterator or Row objects on the client.
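In code, the contrast between the two approaches looks something like this sketch:

import anvil.server
from anvil.tables import app_tables

# One round trip: the server flattens everything into plain dicts up front.
@anvil.server.callable
def get_trips_as_dicts():
    return [dict(row) for row in app_tables.trips.search()]

# Lazy: the client receives a SearchIterator and may trigger further
# round trips whenever it touches data that has not been fetched yet.
@anvil.server.callable
def get_trips_as_iterator():
    return app_tables.trips.search()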

Not only that, but server-side code should run much closer to the database, i.e., with less lag per database request. So even if the db requests themselves were identical, you’d still benefit from bundling them all into a single server call (or at least fewer of them).


Just to be clear, the behavior of calling dict on a row object is separate from the laziness of a search iterator. As I understand it (+1 that this is not well documented), it is a separate database call, server-side or client-side, and if client-side it’s also a separate network call (double-whammy).

Here is a clone where I compare four methods: returning a search_iterator, unpacking a search_iterator into a list of row objects, unpacking a search_iterator into a list of dicts WITHOUT calling dict, and unpacking a search_iterator into a list of dicts WITH calling dict.

Time Test
Results on a Python 3 Basic environment, retrieving 450 rows:

Method return_search_iterator took 0:00:01.151000
Method return_list_of_rows took 0:00:01.866000
Method return_list_of_dicts_wo_dict took 0:00:14.390000
Method return_list_of_dicts_w_dict took 0:00:25.424000
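
(The clone isn’t reproduced here, but judging by the method names, the four variants presumably look something like this sketch; the column names are illustrative.)

import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def return_search_iterator():
    return app_tables.trips.search()

@anvil.server.callable
def return_list_of_rows():
    return list(app_tables.trips.search())

@anvil.server.callable
def return_list_of_dicts_wo_dict():
    # Build each dict by reading columns directly, without calling dict(row).
    cols = ['col_1', 'col_2']  # hypothetical column names
    return [{c: row[c] for c in cols} for row in app_tables.trips.search()]

@anvil.server.callable
def return_list_of_dicts_w_dict():
    return [dict(row) for row in app_tables.trips.search()]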

So what can we learn? The time to return a search iterator is more or less independent of the amount of data you are returning. You pay a small price unpacking server-side (and this price depends on the size of the search iterator), though if you unpack that same search iterator client-side, you will pay big time (so if you have to do it, do it server-side).

Accessing columns by iterating over a row object is about twice as fast as calling ‘dict’ on that row object, for reasons that are not entirely clear, but probably have something to do with going back to fetch from the database.

FWIW I use search iterators pretty much exclusively for passing data back from the server to populate a datagrid, and it’s great for this. That is pretty much what they are for: a search iterator loads 20 (ish? not documented!) rows, populates the datagrid, and goes and gets more data as and when needed by the user clicking ‘next’. However, I do have a bias towards passing around row objects, which is contrary to your Performance Wiki (which I think is fantastic!!).


First, thank you so much. I have been able to bring the time down to 5 seconds.

The part that was slowing me down was the [dict(row) for row in .search()]. I think the linked table design is a bad idea for this. As you suggested…

Another solution is to again not use a linked table column, add a trip_id column to both tables,

is something I will remember for future design, but I am too far along here.
