Yield large downloads with generator

Hi,

I have very large files my users need to download. They get generated on the fly using a generator.

Using a generator has the advantage of the download starting immediately as the content is created, instead of (the poorer user experience of) making them wait for the entire file to be created before the download starts. It’s also a lot easier on memory requirements.
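The memory advantage is easy to see in plain Python, independent of Anvil. In this sketch (hypothetical names throughout; `rows()` stands in for data produced on the fly), the CSV is yielded one line at a time, so only a single row's worth of text exists in memory at any moment, instead of the whole file:

```python
import csv
import io

def rows():
    # Stand-in for data that is generated on the fly
    for i in range(3):
        yield [i, i * i]

def generate_csv():
    """Yield the CSV one line at a time; only one row is buffered at once."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["n", "n_squared"])  # header line
    yield buf.getvalue()
    for row in rows():
        buf.seek(0)
        buf.truncate(0)       # reuse the same small buffer for each row
        writer.writerow(row)
        yield buf.getvalue()

chunks = list(generate_csv())
print("".join(chunks))
```

The consumer can start writing the first chunk to disk before the producer has computed the last one, which is exactly the behaviour being asked about.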

Here’s what I hoped would work:

def generate():
    df = pd....
    yield df.to_csv(index=True)

@anvil.server.http_endpoint("/my-end-point")
def get_download():
    response = anvil.server.HttpResponse(200)
    csv_filename = 'test.csv'
    response.headers['Content-Disposition'] = f'attachment; filename="{csv_filename}"'
    response.body = generate()
    return response

I also looked into using a Media object but couldn’t get that to work either.

So … is there a way to pass a generator (yield) into either a download endpoint response or a media object?

Thanks.

Welcome, @aherncm, to the Anvil Forum.

Thanks for sharing the context. That helps others suggest an alternate solution.

It looks like you want to set up a kind of “streaming protocol” that lets the sending and receiving computers work separately, but simultaneously, to produce and consume a long stream of data. Do I have that right?

Unfortunately, for a Python generator to execute at the receiving end of the connection, the sender and the receiver would both have to be routines running in the same Python process on the same machine.

An http endpoint, however, supports the case of two separate machines, and two separate programs, joined by a request-reply style communications link. With http as the communication language, the receiver could be a program running on any hardware, using any language, not just Python. But since you hoped that a Python generator would work, let’s assume that the receiver is running Python.

In that case, the receiver might use more advanced Anvil-supplied features to approximate the desired effect. In particular, Anvil’s Background Tasks and Data Tables.

  • A Background Task could stay running, generator-like, in an Anvil server, to gather results incrementally.
  • It could call upon your generator to produce results.
  • It can make those results available to the caller, as a series of database records (table rows). See Storing Data in Data Tables.
  • The caller can launch such a Background Task, and read/write the relevant database records, by using Anvil’s Uplink feature.

If that seems like a very roundabout way to simulate a point-to-point streaming protocol, well, you’re right. If anyone here has experience setting up such a protocol with less overhead, please chime in.

And again, welcome to the Anvil Community!

Hi @p.colbert

Thanks for your in-depth reply.

The requirement for us is really just to download a large file (in a browser app) without incurring the memory and time overhead of having to completely pre-generate the file before it downloads (so no need for a Python client).

I presume there’s no way for a media object or http response to accept a generator as a parameter. Would there be any other way to do this? I have a prototype of this working in flask but it would be great to have it in Anvil and not have to spin up another web server.

btw, very impressed with Anvil! It’s a game changer!

Thanks.

I believe you presume correctly. A Media Object is a container, not a generator. On creation, its full content and length are known. Remote function calls – HTTP or otherwise – return a set of data of known size at return time.

In short, I’m seeing chunk-oriented protocols here, both inside and outside of Anvil, not streaming protocols. You ask for a chunk, you get a chunk. If you ask for a Media Object via anvil.server.call, and it’s a big object, it will be streamed, but the Object must exist in its entirety before the transfer begins.
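The chunk-oriented shape can be sketched in plain Python (hypothetical names; in Anvil, something like `get_chunk` would be exposed as a server function or an HTTP endpoint that takes an offset). The caller drives the loop, asking for one chunk at a time until nothing comes back:

```python
DATA = b"0123456789" * 5  # pretend this is the large file (50 bytes)
CHUNK_SIZE = 16

def get_chunk(offset):
    """Request-reply: the caller asks for one chunk at a known offset."""
    return DATA[offset:offset + CHUNK_SIZE]

# Consumer loop: keep asking until an empty chunk comes back.
received = b""
offset = 0
while True:
    chunk = get_chunk(offset)
    if not chunk:
        break
    received += chunk
    offset += len(chunk)

print(len(received))  # 50
```

Each reply is a complete, known-size piece of data, which is exactly what makes it a chunk protocol rather than a stream.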

Above, I outlined a way to organize the data into a series of chunks, so that production and consumption could operate in parallel. Production writes them as database records, at its own pace. Consumption reads those records, in some stated order. To do that as outlined, with Anvil’s facilities, you do need Python at both ends.

If you want to bypass most or all of Anvil’s facilities, have a look at Anvil’s List of Packages. These are Python packages that can be used, server-side. One of them may have a communications protocol you can use instead of Anvil’s HTTP or Anvil’s server.calls.

I suspect that you’ll still need to use an Anvil Background Task to keep the generator running until it completes.

Thanks. I’ve scanned the (extensive) package list and nothing jumps out. I’ll give it some thought.

Great support!