A Single-Row Data Table as a search-index for another

Hello Friends:

I have a growing Data Table with one row per image and its companion sha3_512 hash (to prevent duplicate images). (sha3_512 may be overkill.)

Before adding a new image (add_row(…)), I can certainly perform a search against this hash column to see if the image already exists. But I was thinking of creating a second Data Table with a single row that stores a dict of those hashes in a Simple Object column, and performing the lookup like this instead:

# new_hash is the sha3_512 hex digest of the incoming image.
lookup_row = app_tables.lookup_table.get_by_id(row_id)
hashes = lookup_row['image_hashes_dict'] or {}  # 'image_hashes_dict' is the Simple Object column name
if new_hash not in hashes:
    row = app_tables.my_table.add_row(...)  # add the new image row, including new_hash
    hashes[new_hash] = row.get_id()  # map hash -> row id
    lookup_row['image_hashes_dict'] = hashes  # write the updated dict back so it persists

This one-row-with-a-dict approach would offer an O(1) dictionary lookup, versus an O(n) row search. Note: my Personal plan doesn’t offer indexing.
(._.)

Does this sound reasonable? Any gotchas?

Sidenote: A second row in app_tables.lookup_table would serve the same purpose for a different Data Table that also stores unique hashes in a column.
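Something like this is what I have in mind for that sidenote (a rough sketch only; it assumes app_tables.lookup_table gains a text column such as 'table_name' to tell the rows apart, which I haven't built, and get_hash_index is just an illustrative name):

from anvil.tables import app_tables

def get_hash_index(table_name):
    # Each row of lookup_table acts as the hash index for one Data Table.
    row = app_tables.lookup_table.get(table_name=table_name)
    return row, (row['image_hashes_dict'] or {})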

Thank you!

Without actual timings of both approaches, it’s impossible to say which would be better. In general, I trust the database to do searches efficiently, so I’d avoid the index table until the point at which searches in the other table were starting to take too long. Then I’d look at optimizing. Optimizing without knowing you have a problem is just wasted time and unneeded complexity.

You can fill the table with a bunch of junk rows with unique hashes and time the searches. It’d be pretty easy to pump 100,000 rows into it to see how long a search takes.
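Something along these lines would do it (a rough sketch, meant as server code; it assumes your table is called my_table and the hash lives in a text column named sha3_512):

import hashlib
import os
import time

from anvil.tables import app_tables

def fill_with_junk(n=100_000):
    # Give each junk row a unique hash of some random bytes.
    for _ in range(n):
        app_tables.my_table.add_row(sha3_512=hashlib.sha3_512(os.urandom(32)).hexdigest())

def time_search(target_hash):
    # Time a single lookup against the hash column.
    start = time.time()
    row = app_tables.my_table.get(sha3_512=target_hash)  # None if there's no match
    return row, time.time() - start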


Thank you. Yes, you mean premature optimization. I agree. No performance problems yet. I’m seeking feedback from the pros. Your points are well received. 🙂

On top of what @jshaffstall said, I would say databases are far more optimized and efficient, when used correctly, than you might think. Most SQL databases have been optimized for specific operations like text search for more decades than Python has even been a language, and most of that optimization is done in C and C++.

And if you add indexing, there isn’t much of a comparison at all.

There will also be a time penalty to load that one object into memory, so as the object grows, every call will take a little longer too.
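If you want to see how that penalty grows, a rough timing sketch (using the lookup table and column names from your post; time_dict_load is just an illustrative name) could look like this:

import time

from anvil.tables import app_tables

def time_dict_load(lookup_row_id):
    # Measures only the cost of pulling the whole Simple Object back over the wire.
    start = time.time()
    lookup_row = app_tables.lookup_table.get_by_id(lookup_row_id)
    hashes = lookup_row['image_hashes_dict'] or {}
    return len(hashes), time.time() - start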

A modern SQL database with a hand-crafted query should be able to find a single result out of 100,000,000 records in less than 3 seconds, maybe less than 1, depending on the data type.

Also, I’m sure MD5 is good enough. The odds of a hash collision where the colliding data is even another image are so small that you’d probably be waiting many times the age of the universe to see it happen once.
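For what it’s worth, swapping the hash is a one-liner with the standard library (image_hash and the algo parameter here are just illustrative):

import hashlib

def image_hash(image_bytes, algo="md5"):
    # md5 gives a 32-character hex digest vs sha3_512's 128, so the stored
    # hashes (and any dict keys built from them) stay much smaller.
    return hashlib.new(algo, image_bytes).hexdigest()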


@ianb Thank you. Your points are well received, too.

@ianb @jshaffstall Gentlemen, I’ll table the idea (no pun intended, LoL), which I don’t mind because, although I posed the question to engage your thoughts, I wasn’t really in the mood to create an indirection. Thank you for the advice, guys. This was helpful!

@ianb PS: Hah, I remember the age of C and no Python. My journey began writing C and assembly for boards I designed based on the Motorola 68030 and various Texas Instruments DSP chips. Different times. 🤷

Fortran 77 and 6800 for me!
