I would try this, but I am currently on the road. Typing from my phone.
Within the same Row of a Data Table, can a “collection” of images be stored? (Say, 10 images per row.)
If yes, would this be implemented via a Media column; or by stringifying the images in some way and embedding them into a list() or dict() that is written to a Simple Object column?
PS: I’m guessing that a Media column can store just one item, and so that might not work, but you can correct me on that.
Please don’t! While this is technically possible, converting binary data to string increases the size of the data, and putting that much data into a simple object column is asking for performance issues.
You’d be better off having a table whose purpose is for holding media, one media item per row.
If each media item only belonged to one of the other data table rows, then you could do it like a traditional one-to-many relationship, where the media table has a link back to the original table.
If each media item could belong to multiple rows, you could have a multiple link column in the original table.
You could also do 10 separate media columns in your original table, but unless 10 is a magic never-to-be-changed number, I’d go with the flexibility of a separate media table.
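For example, the one-to-many version might look roughly like this (untested sketch; the table names, column names, and the 'created' column are placeholders):

from datetime import datetime
import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def save_session_media(media_files):
    # One row per media item, each linked back to its session row
    # ('sessions' and 'media' are placeholder table names)
    session_row = app_tables.sessions.add_row(created=datetime.now())
    for media in media_files:
        app_tables.media.add_row(file=media, session=session_row)
    return session_row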
Thank you. You always capture the nuance of my questions.
Indeed, 10 items (the content) is a magic number per Row, but only because that’s how many of them are created per session. However, you make an excellent point relating to content reuse (ie, mixing & matching them, irrespective of the session they happened to be created in), and not “siloing” them in a “bundle”.
So I’ll model Data Tables with Links so that consumption is more generalized and flexible (basically your 2nd option).
And, of course, the other takeaway is discarding the stringification approach.
If it’s likely that the same media file might be stored multiple times, you can store a file hash in the media table, and then compare hashes to avoid storing duplicate files.
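A rough sketch of that de-duplication check (untested; assumes the media table has ‘file’ and ‘hash’ columns, which are placeholder names):

import hashlib
from anvil.tables import app_tables

def get_or_add_media(media):
    # Hash the file contents and reuse an existing row if one matches
    file_hash = hashlib.sha256(media.get_bytes()).hexdigest()
    existing = app_tables.media.get(hash=file_hash)  # None if no match
    if existing is not None:
        return existing
    return app_tables.media.add_row(file=media, hash=file_hash)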
Also, for what it’s worth, my experience is traditional relational DB, so my solutions tend to model that way. There might very well be better solutions that come from a more object DB approach.
Great point on storing a companion hash (will do)! Some people can be lazy and re-apply previous content (reducing the quality of the data store). How would you know otherwise? Yes, a hash is a good solution there.
Quick follow-up.
I already have and use the data model (with linked columns) we discussed above; the difference now will be inserting 10 rows, each with one Media column, instead of 1 row with a 10-item collection in a Simple Object column.
My question now turns to read-back (ie, the consumption side). When a Data Table search() is performed and you wish to randomize the rows you pick out of the results, I don’t think (can’t remember) that results objects support indexing (get_item()); so I suspect you have to iterate through them, “pick out” the items you want (via a list of random position numbers), and execute continue otherwise.
If that’s the case, I’m hoping each iterated item (a Row object) is just a reference to the Row, and that embedded content isn’t actually fetched until / if you subscript it (ie, row['a_col_name']).
Old behavior: simple object, media and link columns are lazy
New behavior: the fetch_only argument decides what’s lazy
The old behavior is to not read media or simple object columns when they are searched. They are only read when the specific key is accessed. This hurts especially when you access a media or simple object column on a row object on the client side, because it triggers a round trip.
The new behavior (enabled by the accelerated tables option) is to get at search time the columns specified by the fetch_only argument, then get other columns lazily, when the specific key is accessed.
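For example, with Accelerated Tables enabled, something along these lines fetches only the named column eagerly (table and column names are placeholders):

from anvil.tables import app_tables
import anvil.tables.query as q

# Only 'content' is fetched at search time; other columns load lazily on access
rows = app_tables.media.search(q.fetch_only("content"))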
Results objects do support indexing. In general, they behave like Python lists. You can index into them and slice them.
They are lazily loaded, though, so if your search results would generate one million rows, you don’t get all million rows back at once. The search results object loads rows as needed, probably with some sort of paging mechanism.
If you’re doing random indexing, I have no idea how that will play with the efficiency of the search results; you’d need to experiment to see how it works with the amount of data you’re using.
I’ve never needed to randomly pick rows out of search results, so can’t really say how well it’d work.
@jshaffstall That’s good news about being indexable. So I can do a random.choice() (or something like that). Point well taken with respect to temporal caching efficiency. I’m definitely nowhere near millions of Rows (dozens perhaps), but your page-cache point still applies.
@stefano.menci I haven’t yet thought it through (on the road), but yes, the frontend will need just one column (the content column). So, based on what you described: have the client send search criteria that also include a fetch_only optimization to the server; let the server code select the rows randomly (like I mentioned above); and finally, return those results to the frontend with only the fetch_only columns materialized. I think this echoes your description of how this is optimized.
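Roughly what I have in mind (untested sketch; ‘media’ and ‘content’ are placeholder names, and the real search criteria would come from the client):

import random
import anvil.server
from anvil.tables import app_tables
import anvil.tables.query as q

@anvil.server.callable
def get_random_content(sample_size=10):
    # Fetch only the 'content' column eagerly, then pick rows at random
    # (real search criteria from the client would be added to search())
    rows = app_tables.media.search(q.fetch_only("content"))
    picks = random.sample(range(len(rows)), k=min(sample_size, len(rows)))
    return [rows[i] for i in picks]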
Combined, both answers will help significantly. Thank you.
PS: I have to check if the Personal plan allows Accelerated Tables. 🤷
I might also try, instead of random.choice() on the Client, a random.shuffle() on the Server and see if (a) it doesn’t bomb out and (b) it returns the same Anvil search iterator object type to the Client.
import random
from anvil.tables import app_tables

sample_size = 25
rows = app_tables.table_name.search()
for sample_row in random.sample(rows, k=sample_size):
    ...
…based on how I interpret it works from peeking inside the random module, it should be able to get random samples from an indexable row_iterator of arbitrary size pretty quickly. The row_iterator class seems to allow you to access rows by index (and by slicing) without turning the entire object into a list and losing the connected nature of each row / row iterator.
That error doesn’t mean you can’t do what you tried to do, just that you can’t do it in that particular context. There are certain situations, based on how Skulpt interacts with Javascript, where you can’t make a server call. When the search iterator has to go back to the server, that counts.
So your random.choice would work in other contexts on the client, just not where you had the code.
Not that it’s a good idea to do on the client (perhaps incurring multiple server round-trips), just that it would work.
Thank you. I understood: In theory it could work in other scopes / settings, but it’s not ideal on the Client, in general. I agree… That’s what triggered my trying the randomization on the backend.
You might even get away with just returning an object (from a server module) that is the result of
random.sample(rows, k=sample_size)
Because it returns a list of row objects, the client might just get back a random sample of rows instead of the data contained within them.
…but again, no idea if this works
sample_size = 10
return random.sample(content_rows, k=sample_size) # TypeError: Population must be a sequence or set. For dicts, use list(d).
I might roadmap randomization of results. Or, on the client, maybe I’ll try generating random numbers in [0 .. len(content_rows)-1] and subscripting with those. Perhaps not ideal, but worth a try.
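Another thing I might try, per the error message’s hint, is wrapping the iterator in list() first (untested; I assume it pages through every row just to build the list):

import random
from anvil.tables import app_tables

sample_size = 10
content_rows = app_tables.content.search()  # placeholder table name
# list() materializes the iterator into a plain list of Row objects,
# which random.sample() accepts as a population
sample = random.sample(list(content_rows), k=sample_size)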
I was just thinking of another way to do it on the client side for quickly retrieving an arbitrarily large data table, but a small sample size.
Since calling len(rows) is optimized by Anvil, using only one server round-trip…
import random
from anvil.tables import app_tables

rows = app_tables.table_name.search()
sample_size = 25
random_index = random.sample(range(len(rows)), k=sample_size)
for i in random_index:
    row = rows[i]
    # this will only incur 1 + sample_size round trips (one for len(), one per sampled row)
    ...
Edit: it looks like you already thought of that
There is no penalty for indexing into a row_iterator, and slicing it should return as little as possible. (It returns the smallest page (or pages) that covers what you want to slice.)
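For example (my understanding of the paging behaviour, not verified against the internals):

from anvil.tables import app_tables

rows = app_tables.table_name.search()
one_row = rows[3]        # fetches only the page that contains index 3
some_rows = rows[10:15]  # slicing returns a list of Row objects from the covering page(s)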