Ad-hoc Analysis with Jupyter Notebooks and Uplink

I’ve been working with a client who has now migrated the entire dataset for their research department to Anvil’s data tables. I thought I’d share a little of how they are now set up to perform ad-hoc analytics on their data securely, reliably and with relatively little development effort.

The dataset comprises several hundred thousand records over 3 or 4 main tables, plus some ancillary metadata in a further 6 or 8 tables. There are also several tens of thousands of PDF files which are the original source of much of the data. Those are held as Media objects in a data table, and each main data record is linked to the PDF that contains its original content.

There is an Anvil app to manage and maintain this dataset - it handles the create, update and delete functions you might expect, and has some built-in processes for verification and validation.

But one of the main uses of this dataset by the researchers themselves is ad-hoc analysis, from which the business can gain new insights and offer new products and services. That work has previously been done in local Jupyter notebooks on the researchers’ own machines, and they wished to continue operating in the same manner.

At first, one or two of the researchers were given the privileged uplink key and shown how to query the data using the data tables interface. That was a necessary first step to allow work to progress, but it has a number of drawbacks:

  1. Since the key has privileged access to the data, it opens the possibility of accidental or malicious corruption
  2. Not all the researchers had the inclination to learn the new query mechanism and, with such a large dataset, several tended to write very inefficient queries that would take excessive amounts of time to execute.
  3. If a member of staff leaves, the key needs to be reset and re-issued. That can only be done via the IDE, which means several people would still need a level of access that allows for accidental or malicious corruption.
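Drawback 2 in concrete terms: over the uplink, fetching a whole table and filtering it in the notebook drags every row across the wire on every run, whereas a query clause such as `app_tables.records.search(year=year)` runs the predicate where the data lives. A pure-Python stand-in for the two shapes of query (table, column and function names are all invented; a list of dicts plays the part of the data table so the sketch runs anywhere):

```python
# 100,000 fake records spread evenly over ten years, standing in for
# the main data table.
ALL_ROWS = [{"id": i, "year": 2015 + i % 10} for i in range(100_000)]

# A prebuilt index plays the part of the server-side search machinery.
INDEX = {}
for row in ALL_ROWS:
    INDEX.setdefault(row["year"], []).append(row)

def filter_in_notebook(year):
    # What an inefficient notebook query effectively does: pull every
    # row, then filter locally. Over a real uplink the full transfer
    # happens on each run.
    return [r for r in ALL_ROWS if r["year"] == year]

def filter_at_source(year):
    # What pushing the predicate to the server does instead: only the
    # matching rows travel back to the notebook.
    return INDEX.get(year, [])
```

Both return the same rows; the difference is how much data crosses the connection to produce them.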

To solve 1 and 2, we realised that almost all the queries being used were variations on three or four main themes. They could be written as server-side functions with parameters to handle the variations; the researchers simply needed to connect, call those functions to retrieve their dataset and carry on from there. Once all the queries were covered by server functions, we reset the privileged key and issued only the unprivileged key to those who needed it.
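The shape of one of those server functions, sketched in pure Python. All names here are invented, not the client's actual code: in the real app each function sits in an Anvil server module, carries the `@anvil.server.callable` decorator and queries the data tables with `app_tables.<table>.search(...)`, and a notebook invokes it over the unprivileged key with `anvil.server.call("get_records", year=2021)`. An in-memory list of dicts stands in for the data table so the sketch runs anywhere:

```python
# Stand-in for the main data table.
MAIN_TABLE = [
    {"id": 1, "year": 2020, "topic": "pricing"},
    {"id": 2, "year": 2021, "topic": "pricing"},
    {"id": 3, "year": 2021, "topic": "churn"},
]

def get_records(year=None, topic=None, limit=None):
    """One query 'theme': each keyword argument covers one of the
    variations the researchers used to hand-write for themselves."""
    rows = MAIN_TABLE
    if year is not None:
        rows = [r for r in rows if r["year"] == year]
    if topic is not None:
        rows = [r for r in rows if r["topic"] == topic]
    return rows[:limit] if limit is not None else rows
```

From the researchers' side, nothing changes except the first cell of the notebook: connect with the unprivileged key, call the function, and carry on in pandas as before.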

But what about 3.?

For that, we turned to Role Based Authorisation. Every server function that is callable from client code also requires the calling user to be logged in, and most also check that the user holds the necessary level of permission.

That way, we were able to build ‘admin’ forms within the app for managing users and their permissions. We never have to reset and re-issue the uplink key because we can disable access for individual users at will.
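The permission check each callable performs can be sketched as a decorator. This is the pattern, not the client's code: in Anvil the wrapper would call `anvil.users.get_user()` and inspect role or permission columns on the Users table, while here a module-level `CURRENT_USER` dict stands in for the logged-in user so the sketch is runnable, and all names are invented:

```python
import functools

# Stand-in for anvil.users.get_user() and the Users table row it returns.
CURRENT_USER = {"email": "researcher@example.com", "permissions": {"read_data"}}

def requires_permission(perm):
    """Refuse to run the wrapped server function unless the calling
    user is logged in and holds `perm`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            user = CURRENT_USER  # anvil.users.get_user() in the real app
            if user is None or perm not in user["permissions"]:
                raise PermissionError(f"'{perm}' permission required")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_permission("read_data")
def get_records():
    # Would query the data tables server-side in the real app.
    return ["row-1", "row-2"]
```

Disabling a leaver is then a one-row change in the admin form - clear their permissions (or their Users row) and every callable starts refusing them, with no key reset needed.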

It’s all working perfectly and the effort to do so within Anvil itself was trivial. To quote one of the researchers: “If I never have to write another line of SQL just to get my hands on some data, it’s worth it for that alone.”
