I wrote some code that builds a list of all rows from a database that match my unique numbers and saves them as a list of rows, found_rows. I need it for further processing later.
The code below does what I need, but it’s really slow… Making so many queries in a loop is not the best option. Is there a better way to do it with a dict or a list?
chunk is a list of dicts containing the unique number ['barcode'] that I’m using in my query:
dict_search = {}  # dict used for the database search
found_rows = []   # list for data table rows

# Create a list of barcodes for the search
list_barcodes = [row['barcode'] for row in chunk]

# Build the search dict and query the database once per barcode
for x in list_barcodes:
    dict_search['sample_id'] = x
    # Search the database for matching rows
    for records in app_tables.results_database.search(**dict_search):
        found_rows.append(records)  # collect the found rows
# (assumes the usual imports: from anvil.tables import app_tables and import anvil.tables.query as q)
list_barcodes = [row['barcode'] for row in chunk]
found_rows = app_tables.results_database.search(sample_id=q.any_of(*list_barcodes))
A little note about variable naming that can make reading the code confusing: here the object returned by the iterator is one row or one record, so the variable should be called record rather than records. Obviously you can name it whatever you like, but that plural can be confusing.
Thanks for the solution, but now I’m facing a new problem: found_rows is always around 2,000 rows, because I keep the chunks that size to avoid exceeding the memory limit.
It takes about 3 minutes to build one matching_rows list in each loop iteration (measured between the "1" and "1b" prints).
Why does iterating through it take so long? It was always really fast.
If I avoid getting all rows from the results in one query, it takes about 20 minutes to process 30,000 rows.
Is there another way to match all the rows to create one or multiple links?
found_rows = app_tables.results_database.search(q.fetch_only("sample_id"), sample_id=q.any_of(*list_barcodes))

# Iterate through the chunk using row['barcode'] to find matching rows in found_rows
for row in chunk:  # assign results to the sample
    barcode = row['barcode']
    print(f"1 {datetime.now()}")
    # Find matching rows based on the value of row['barcode']
    matching_rows = [found_row for found_row in found_rows if found_row['sample_id'] == barcode]
    print(f"1b {datetime.now()}")
It’s very unoptimized for repetitive processing, but since we don’t know what you actually need the rows for, it’s hard for someone else to optimize it for you.
You are keeping a complete row iterator object in found_rows, and it stays connected to the database. Every time you loop over it, you are iterating over the entire result set through a database connection just to match a few items.
I would convert the row iterator into in-memory objects once, and let Python do the mapping to get your result. That will be much faster than repeatedly digesting the row iterator over and over and turning the results into a filtered list.
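A minimal sketch of that idea, reusing the names from the snippets above (results_database, sample_id, list_barcodes):

import anvil.tables.query as q
from anvil.tables import app_tables

# One query for all barcodes, fetching only the column we need
found_rows = app_tables.results_database.search(
    q.fetch_only("sample_id"),
    sample_id=q.any_of(*list_barcodes),
)

# Materialise the lazy row iterator into an ordinary Python list exactly once;
# from here on, filtering and matching happen in memory, not over the database connection
rows_in_memory = list(found_rows)

From here the per-barcode filtering runs against a plain list, or better, against a dict keyed by sample_id as suggested further down.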
If you are running out of memory, your chunks need to be smaller, or you need to decide what you actually need “for later”.
Could you get by with just creating a list of pairs of row IDs and the data to change?
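If that would work, a sketch of those pairs could look like this; get_id() returns a row’s unique id string, and which “data to change” you keep next to it depends on your processing:

# Keep lightweight (row id, value) pairs instead of live row objects
pairs = [(record.get_id(), record['sample_id']) for record in found_rows]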
Is this code running on the server or client side?
Make sure you have advanced tables active.
You could try creating a dictionary of lists, so you iterate only once. That one iteration will be slower, but you go from O(n²) (iterate n items n times) to O(n) (do something n times). As far as I know, hashing and appending to a dictionary or a list is O(1), so as n grows the cost of that extra bookkeeping becomes negligible compared to the quadratic cost of the nested iteration.
Something like this:
from collections import defaultdict

# Group the found rows by sample_id in a single pass
matching_rows = defaultdict(list)
for row in found_rows:
    matching_rows[row['sample_id']].append(row)
Once you have this dictionary, you can access the matching rows for a barcode with a simple matching_rows[barcode], which is a snappy O(1).
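Put together with the loop over chunk from earlier, the lookup could look like this (using .get so a barcode with no results yields an empty list instead of inserting a new key into the defaultdict):

for row in chunk:
    barcode = row['barcode']
    # O(1) dictionary lookup instead of scanning found_rows for every barcode
    rows_for_this_sample = matching_rows.get(barcode, [])
    # ... link rows_for_this_sample to the object being created ...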
@ianb I did the conversion into pairs. The search was much quicker, but the performance of converting into pairs was comparable to querying the database directly row by row.
A short description of what I’m doing:
I’m running it as a background task.
Excel data is converted with pandas into a list of rows as dicts.
The first part creates some results in database A.
The second part, which is causing the problems, is responsible for creating objects in database B that have a multiple-link column that needs to be linked to the newly created results in database A (roughly as in the sketch below).
One object in database B can have zero, one, or multiple results that have to be linked.
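A minimal sketch of that pipeline, under heavy assumptions: the table names (database_a, database_b), their columns, and the Excel filename are made up for illustration; pd.read_excel / to_dict(orient="records") and add_row are the only standard calls:

import pandas as pd
from collections import defaultdict
from anvil.tables import app_tables

# Excel upload -> list of dicts (one dict per data row)
chunk = pd.read_excel("upload.xlsx").to_dict(orient="records")

# Part 1: create the result rows in database A
for row in chunk:
    app_tables.database_a.add_row(sample_id=row['barcode'])  # hypothetical columns

# Group the newly created results by sample_id in one pass
results_by_barcode = defaultdict(list)
for result in app_tables.database_a.search():
    results_by_barcode[result['sample_id']].append(result)

# Part 2: create the parent objects in database B, linking each one
# to zero, one, or many results via a multiple-link column
for row in chunk:
    app_tables.database_b.add_row(
        barcode=row['barcode'],                               # hypothetical column
        results=results_by_barcode.get(row['barcode'], []),   # multiple-link column
    )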
As I understand it, for a link (from source table X to target table Y) to be usable in database Z, both X and Y must be in Z. The source has to be present in order to see the link; the target has to be present in order to follow the link.
By mistake, I’ve had links where the source table was shared between database A and B, but the target table was in A only. Both databases could see the links, but only Database A could follow the links. Database B could not see the target, so it could not follow the links. Attempting to do so raised an exception.
So, linking rows, with multiple databases, should be done with care.
Yes, that’s why my main object can have multiple links, but those linked objects don’t have a direct link back.
In case I ever needed to search from a linked object back to its parent: I’m using Base36-generated unique IDs (created in a transaction), and there is a column that holds the parent’s ID. But the chance of that being needed is really low. I just need quick access from my parent object to its results; I was just looking for some way to make the performance better.
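(As an aside, one common way to generate such Base36 IDs is to encode an integer counter with the digits 0-9 and A-Z; a plain-Python sketch, leaving out the counter/transaction bookkeeping since that part is app-specific:)

import string

ALPHABET = string.digits + string.ascii_uppercase  # 36 symbols: 0-9, A-Z

def to_base36(n: int) -> str:
    # Encode a non-negative integer as a Base36 string
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

# e.g. to_base36(30000) == 'N5C'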
I’m uploading an Excel file with the data and converting it with pandas into dicts. Then I upload the results first, then the parent objects plus the creation of the links to the results. For now, what works is querying the results for each data row separately. Each query takes microseconds, but if you do it 30,000 times it adds up to a fair amount of time.
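That “one query per data row” approach is essentially the pattern from the first snippet above; roughly:

# One database query per data row: microseconds each, but 30,000 round trips in total
for row in chunk:
    results_for_sample = list(app_tables.results_database.search(sample_id=row['barcode']))
    # ... create the parent object and link results_for_sample ...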
No, I haven’t had time to try it yet. I will do it later today or tomorrow.
I did do something similar once, though. That first iteration is always really slow, but I will test it.