### Live Chat

If you don't want to do this, you can email us instead at contact@anvil.works.

Blog - Page 2

## Overthinking T-Shirts with SciPy 6th of May, 2019

##### Statistically Modelling Conference Swag at PyCon 2019

Sponsoring a conference has many challenges, and one of them is making sure you don't run out of T-shirts!

In his popular lightning talk at PyCon 2019, Meredydd described how we use SciPy to model the distributions, and minimise our chances of running out:

(Scroll down for a transcript)
##### Transcript:

Hi! My name is Meredydd, and I run a startup called Anvil. We make tools for building full-stack web apps with nothing but Python, and we are sponsoring PyCon again this year. It's great to be back!

Like any good sponsor, we give out T-shirts — specifically, to anyone who builds an app with Anvil and shows it to us at our stand.

There are two problems with this. For one, Cleveland is a long way from home, and all these T-shirts have to come with me in a very heavy suitcase.

Problem number two: Python developers, it turns out, come in many different shapes and sizes. Pictured here are two Python programmers: a shirt that fits the one on the left is going to look pretty undignified on the one on the right.

So, the question is, “how many shirts should we be bringing of each size?”

We've been here before, so we could just bring twice as many of each size as we gave out last year.

Last year, we gave out two women's-cut extra-small shirts, so perhaps we should bring four this year. That seems plausible.

But last year, we gave out 27 men's large shirts. If we brought 54 of them this year, that would definitely be overkill.

If you think about it, that 54th men's large shirt is much, much less likely to get used than that 4th women's extra-small.

It's the Law of Large Numbers: If you've got a larger sample size, it will average out more reliably.

We can model this with a binomial distribution. Imagine we're rolling 3,500 dice — one for each person at PyCon — and then counting up how many rolled "Men's Large".

``````from scipy.stats import binom
@anvil.server.callable
def get_dist(n_attendees, prob):
return [binom.pmf(k, n_attendees, prob)
for k in range(n_attendees)]
``````

Thankfully, SciPy has a function for calculating this distribution, and so I'm going to use it to write an interactive tool for exploring this distribution.

I'm going to write a function that gets the probability distribution for a given number of attendees, and a given probability of each attendee claiming a particular size of shirt.

At this point, the live-coding begins. Open the source code to follow along:

Now we have our distribution, we can make an interactive tool to explore it.

Our user interface will have a text box where we can enter how many of this size of shirt we used last time; and then underneath it will be a plot so we can explore the distribution.

``````def text_box_1_pressed_enter(self, **event_args):
"""This method is called when the user presses
Enter in this text box."""

dist = anvil.server.call('get_dist', 3200,
int(self.text_box_1.text)/3200.0)

self.plot_1.data = go.Bar(y=dist)
``````

When you hit Enter in this text box, we're going to call that `get_dist()` function we defined earlier. The number of attendees is 3,200, and we can estimate the probability from the number we gave out last time, because that was also out of a population of 3,200

Once we've got that distribution, we can plot it as a bar chart.

When we plot the women's extra-smalls, the distribution is actually quite wide. Of course, we're most likely to need two shirts, same as last time. But we could easily need twice that number, or even more.

Whereas if we check out the men's large shirts, the distribution is a lot tighter. Still, again, most likely to need 27, same as last time, but we're vanishingly unlikely to need twice that number.

``````return binom.ppf(0.95, n_attendees, prob)
``````

Now, we've constructed a statistical model that can actually answer our question. We want to know how many shirts to bring, to avoid running out.

What we want to do is to find a number of shirts such that there is a 95% probability of needing that number or less. This is the 95% point of the probability distribution, and SciPy provides a function for calculating this: `binom.ppf()`.

So we calculate the 95% point for the probability distribution of every size of shirt, and that's how many shirts we bring.

We wire this up in the UI, to display the number of shirts.

We see that for the women's extra-smalls, we need 5 shirts -- more than double the number we gave out last year -- to be 95% sure of not running out.

Whereas for the men's large shirts, we need 36 -- that's only 33% more than last time.

You can get the source code of the app I've just built here:

And if everyone in this hall comes to our stand, builds an app, claims a T-shirt, and completely cleans us out?

Well, at least that'll show the statisticians. Thanks very much!

## The Easiest Way to Build HTTP APIs in Python 30th April 2019

Anvil isn’t just the easiest way to build Web user interfaces in Python. It’s also a fully-featured app platform, great for building serverless apps and HTTP APIs! Today, we’ll take a tour:

### How to build a simple HTTP API

It’s pretty simple. Just create an Anvil app, add a Server Module, and write an `@anvil.server.http_endpoint()` function:

``````@anvil.server.http_endpoint('/hello-world')
def hello_world(**q):
``````

That was pretty fast, right? And you didn’t have to host it anywhere: Once you publish your app, your API is live!

### What does an API look like?

That depends on your app, of course! But for this post, we’ll be building an API for our multi-user To-Do List example.

Our To-Do list is a classic data-storage app. An API for this kind of app needs to do four things:

• Read records - in this case, we want to make a `GET` request to a URL and get all the tasks in our to-do list.
• Create new records - in this case, we want to `POST` to a URL to add a new task.
• Update records - in this case, we want to mark tasks as done, or change their title. By convention, each task has its own URL, which we can update by making a `PUT` request to it.
• Delete records - in this case, by making a `DELETE` request to a task’s URL.

(This is often abbreviated “CRUD”, for Create/Read/Update/Delete.)

### Let’s get going!

As we build up this API, step by step, we’ll see common patterns used by many HTTP or REST APIs:

• Authentication – Make sure only authorised users can access your API.

• Accepting Parameters – To create new records, we’ll have to take parameters with a POST request.

• URL parameters let us make custom URLs for each record.

• Updates and Deletes finish up our API by handing `PUT` and `DELETE` requests.

If you’re feeling impatient, you can get the source code for the finished, working API:

### Returning records from your database

We’ll start with the simplest endpoint: getting all the tasks in our to-do list, as an array of JSON objects. It’s pretty simple – if we return a list of dictionaries, Anvil will turn it into JSON for us:

``````@anvil.server.http_endpoint('/tasks')
``````

If we published our app at `https://my-todo-list.anvil.app`, we can run this function by fetching `https://my-todo-list.anvil.app/_/api/tasks`:

``````\$ curl https://my-todo-list.anvil.app/_/api/tasks
[{"done":true,"title":"Wash the car"},{"done":true,"title":"Do the dishes"}]
``````

Note: If you haven’t made your app public, your API URLs will be long and difficult to remember. To get a nicer URL, make your app public from the Publish dialog. See: How to publish your app

### Authentication

In an earlier tutorial, we expanded our simple to-do list into a multi-user application, using Anvil’s built-in Users service. We want to do this same with our API – we want to authenticate our users, and only allow them to access their own data.

We can automatically require authentication for our API endpoint, using Anvil’s Users Service, by passing `authenticate_users=True` to the `anvil.server.http_endpoint(...)` decorator. This prevents access unless the user provides a valid username and password via standard HTTP Basic authentication.

Within our endpoint function, we can use `anvil.users.get_user()` to find out what user they logged in as. Let’s modify our `/tasks` endpoint so it returns only the current user’s tasks:

``````@anvil.server.http_endpoint('/tasks', authenticate_users=True)
this_user = anvil.users.get_user()
``````

Let’s try that with `curl`:

``````\$ curl https://my-todo-list.anvil.app/_/api/tasks
Unauthorized

[{"done":true,"title":"Wash the car","id":"[23,567]"},{"done":true,"title":"Do the dishes","id":"[23,568]"}]
``````

You can also see that I’ve added an `id` field to the responses, containing the unique ID of each database row. Each database row has an ID, which is a a string like `"[23,568]"`. We can get it by calling `get_id()` on the row object, and we can use it later to retrieve that row from the database.

Want to do your own authentication, rather than using Anvil’s built-in Users Service? No problem! Just pass `require_auth=True` instead of `authenticate_users=True`, and then examine `anvil.server.request.username` and `anvil.server.request.password` for yourself. Learn more in our reference documentation.

### Accepting parameters

Now, we want an endpoint to add a new task. We use form parameters to specify the title of the new task. Form parameters are passed as keywords to the endpoint function, so to make a mandatory parameter we just make our function take an argument called `title`:

``````@anvil.server.http_endpoint('/tasks/new',
methods=["POST"], authenticate_users=True)
owner=anvil.users.get_user())
return {'id': new_row.get_id()}
``````

This endpoint should only accept HTTP POST requests, so we’ve specified `methods=["POST"]`. We also return the ID of the newly-created row.

Here’s how to create a new task from `curl`:

``````\$ curl -u me@example.com:my_password -d title="New task" https://my-todo-list.anvil.app/_/api/tasks/new
{"id":"[23,590]"}
``````

#### Form parameters vs JSON

The example above uses “form encoding” to pass parameters. These days, API users often prefer to send JSON as the request body.

We can accommodate this, by using `anvil.server.request.body_json` to decode the body of the request as a JSON object:

``````@anvil.server.http_endpoint('/new_task', methods=["POST"], authenticate_users=True)
title = anvil.server.request.body_json['title']
owner=anvil.users.get_user())
return {'id': new_row.get_id()}
``````

Here’s how to use this JSON-based endpoint from `curl` – it’s a little more verbose:

``````\$ curl -u me@example.com:my_password \
-X POST \
-H "Content-Type: application/json" \
{"id":"[23,591]"}
``````

### URL parameters

We want each task to have its own URL. We can use the unique ID of the database row for this. We’ll make each task available at `/tasks/[ID]`.

To do this, we make an endpoint with a URL parameter called `id`. URL parameters are passed into the endpoint function as keyword arguments, same as form parameters:

``````@anvil.server.http_endpoint('/tasks/:id', authenticate_users=True)
this_user = anvil.users.get_user()
return anvil.server.HttpResponse(status=404)
else:
``````

Of course, the user might request a bogus URL, with an invalid ID. Or, worse, with the ID of a row belonging to another user!

If the row with the specified ID is not available in the `tasks` table, or if it is not owned by this user, our function returns an HTTP 404 (“Not Found”) response.

Let’s test it by accessing one of the tasks we saw from the listing earlier. We’ll have to URL-encode the ID so that curl will swallow it, but that’s OK: Anvil will decode the `id` parameter before passing it to our function:

``````\$ curl -u me@example.com:my_password https://my-todo-list.anvil.app/_/api/tasks/%5B23%2C567%5D
[{"done":true,"title":"Wash the car","id":"[23,567]"}
``````

### Rounding out the example: Updates and deletes

We now know enough to complete the last two operations: updating a task and deleting it. These are done by making `PUT` and `DELETE` requests, respectively, to the task’s url.

Let’s expand our `task_url` endpoint to handle `GET`, `PUT` and `DELETE` methods:

``````@anvil.server.http_endpoint('/tasks/:id',
authenticate_users=True,
methods=["GET","PUT","DELETE"])
this_user = anvil.users.get_user()

request = anvil.server.request

return anvil.server.HttpResponse(status=404)
elif request.method == 'DELETE':
return None
elif request.method == 'PUT':
if 'title' in request.body_json:
if 'done' in request.body_json:

# PUT and GET both fall through to here and return the task
# as JSON

``````

### …and that’s a working API!

We’ve just built a fully functional, secure and authenticated API for our to-do list app – all in about 30 lines of Python.

Because Anvil is “serverless”, this API is already live on the internet, without us having to set up or maintain web servers.

You can open the source code of this app in Anvil, ready to run:

I’ll just finish up by mentioning a few more advanced things you might want to do with Anvil’s HTTP endpoints:

Returning different types of data

You can return plain text (strings), binary data with any content type (Media objects), or JSON (anything else).

If you return an `anvil.server.HttpResponse()` object, you can control the HTTP status, headers, and content of your response.

Anvil’s support for cookies comes with extra security. Anvil cookies are encrypted (so the user cannot see what you’ve set in a cookie) and authenticated (so the user cannot tamper with values you’ve set in a cookie).

In short, the contents of `anvil.server.cookies.local[...]` and `anvil.server.cookies.shared[...]` are trustworthy, unlike traditional HTTP environments.

If you want your API to be accessible from other web pages, you’ll need to set CORS (Cross Origin Resource Sharing) headers. You can do this by passing `cross_origin=True` to the `anvil.server.http_endpoint()` decorator.

XSRF protection

HTTP endpoints in Anvil are automatically protected from XSRF (Cross-Site Request Forgery). When a request comes to an HTTP endpoint from outside your app (based on `Origin` and/or `Referer` headers), it executes in an independent session from the rest of your app. This prevents malicious requests from external sites from performing actions with the credentials of a logged-in user.

You can opt out of this protection: To execute cross-site requests in the same session as the rest of your app, pass `cross_site_session=True` to the `anvil.server.http_endpoint()` decorator.

## Let's Build a Search Engine 29th March 2019

How does a search engine work? Let’s find out – by building one!

Search engines have become the gateway to the modern web. How often do you know exactly which page you want, but you search for it anyway, rather than typing the URL into your web browser?

Like many great machines, the simple search engine interface - a single input box - hides a world of technical magic tricks. When you think about it, there are a few major challenges to overcome. How do you collect all the valid URLs in existence? How do you guess what the user wants and return only the relevant pages, in a sensible order? And how do you do that for 130 Trillion pages faster than a human reaction time?

I’ll be some way closer to understanding these problems when I’ve built a search engine for myself. I’ll be using nothing but Python (even for the UI) and my code will be simple enough to include in this blog post.

You can copy the final version, try it out, and build on it yourself:

There will be three parts to this.

1. First, I’m going to build a basic search engine that downloads pages and matches your search query against their contents. (That’s this post)

2. Then I’m going to implement Google’s PageRank algorithm to improve the results. (See Part 2)

3. Finally, I’ll play with one of computer science’s powertools - indexing - to speed up the search and make the ranking even better. (See Part 3)

## Collecting URLs

I’m going to a build a web crawler that iteratively works its way through the web like this:

1. Start at a known URL
3. Make a note of any URLs it contains
4. GOTO 1 (for the new URLs I’ve found)

I need a known URL to start with. I’ll allow webmasters and other good citizens to submit URLs they know about. I’ll store them in a database (I’m using Anvil’s Data Tables) and if I know the URL already, I won’t store it twice.

``````@anvil.server.callable
def submit_url(url):
url = url.rstrip('/')  # URLs with and without trailing slashes are equivalent
if not app_tables.urls.get(url=url):
``````

I’ve also made it possible to submit sitemaps, which contain lists of many URLs (see our Background Tasks tutorial for more detail.) I’m using BeautifulSoup to parse the XML.

``````from bs4 import BeautifulSoup

@anvil.server.callable
def submit_sitemap(sitemap_url):
response = anvil.http.request(sitemap_url)

soup = BeautifulSoup(response.get_bytes())
for loc in soup.find_all('loc'):
submit_url(loc.string)
``````

If I submit the Anvil sitemap, my table is populated with URLs:

I’m in good company by allowing people to submit URLs and sitemaps for crawling - Google Search Console does this. It’s one way of avoiding my crawler getting stuck in a local part of the web that doesn’t link out to anywhere else.

Stealing shamelessly from Google Search Console, I’ve created a webmaster’s console with ‘submit’ buttons that call my `submit_url` and `submit_sitemap` functions:

``````  def button_sitemap_submit_click(self, **event_args):
"""This method is called when the button is clicked"""
self.label_sitemap_requested.visible = False
anvil.server.call('submit_sitemap', self.text_box_sitemap.text)
self.label_sitemap_requested.visible = True

def button_url_submit_click(self, **event_args):
"""This method is called when the button is clicked"""
self.label_url_requested.visible = False
anvil.server.call('submit_url', self.text_box_url.text)
self.label_url_requested.visible = True
``````

## Crawling

Now that I know some URLs, I can download the pages they point to. I’ll create a Background Task that churns through my URL list making requests:

``````@anvil.server.background_task
def crawl():
for url in app_tables.urls.search():
# Get the page
try:
response = anvil.http.request(url)
html = response.get_bytes().decode('utf-8')
except:
# If the fetch failed, just try the other URLs
continue

row['html'] = html
``````

Because it’s a Background Task, I can fire off a crawler and download all the pages I know about in the background without blocking the user’s interaction with my web app.

That’s all very well, but it doesn’t really crawl yet. The clever thing about a web crawler is how it follows links between pages. The web is a directed graph – in other words, it consists of pages with one-way links between them. That’s why it’s such a wonderful information store - if you’re interested in the subject of one page, you’re likely to be interested in the subjects of pages it links to. If you’ve ever been up ’til dawn in the grip of a Wikipedia safari, you’ll know what I’m talking about.

So I need to find the URLs in the pages I download, and add them to my list. BeautifulSoup, the brilliant HTML/XML parser, helps me out again here.

I also record which URLs I found on each page - this will come in handy when I implement PageRank.

``````  from bs4 import BeautifulSoup

soup = BeautifulSoup(html)

# Parse out the URLs
for a in soup.find_all('a', href=True):
submit_url(a['href'])

``````

While I’m at it, I’ll grab the page title to make my search results a bit more human-readable:

``````  # Parse out the title from the page
title = str(soup.find('title').string) or 'No Title'
``````

The crawler has become rather like the classic donkey following a carrot: the further it gets down the URL list, the more URLs it finds, so the more work it has to do. I visualised this by plotting the length of the URL list alongside the number of URLs processed.

The list grows initially, but the crawler eventually finds all the URLs and the lines converge. It converges because I’ve restricted it to https://anvil.works (I don’t want to denial-of-service anybody’s site by accident.) If I had it crawling the open web, I imagine the lines would diverge forever - pages are probably being added faster than my crawler can crawl.

By the time it has finished, there’s a nice crop of page data waiting for me in the `pages` table.

Time to implement search. I’ve thrown together the classic “input box and button” UI using the drag-and-drop editor. There’s also a Data Grid for listing the results, which gives me pagination for free. Each result will contain the page title and a link.

The most basic search algorithm would just break the query into words and return pages that contain any of those words. That’s no good at all, and I can do better straight away.

I’ll remove words that are too common. Let’s say a user enters ‘how to build a web app’. If a page happens to contain exactly the text ‘how to build a web app’, it will be returned. But they would also get pages containing the text ‘how to suckle a lamb’.

So I’ll remove words like ‘how’ and ‘to’. In the lingo, these are called stop words.

I’ll include words that are closely related to those in the query. The search for ‘how to build a web app’ should probably return pages with ‘application builder’, even though neither of those words are exactly in the query.

In the lingo, this is called stemming.

Both of these requirements are met by Anvil’s `full_text_match` operator, so I can get a viable search running right away:

``````# On the server:
@anvil.server.callable
def basic_search(query):
return app_tables.pages.search(html=q.full_text_match(query))
``````
``````# On the client:
def button_search_click(self, **event_args):
"""This method is called when the button is clicked"""
self.repeating_panel_1.items = anvil.server.call('basic_search', self.text_box_query.text)
``````

Later on we’ll talk about indexing and tokenization, which gets to the nuts and bolts of how to optimise a search. But for now, I have a working search engine. Let’s try some queries.

## Test queries

For each stage in its development, I’m going to run three queries to see how the results improve as I improve my ranking system. Each query is chosen to reflect a different type of search problem.

I’ll only look at the first page of ten results. Nobody ever looks past the first page!

(If you’re wondering why all the results are from the same site, bear in mind that I’ve restricted my crawler to https://anvil.works to avoid getting my IP address quite legitimately blocked by anti-DoS software, and to keep my test dataset to a manageable size.)

#### ‘Plots’

‘Plots’ is a fairly generic word that you would expect to show up all over the place in technical writing. The challenge is to return the pages that are specifically about plotting, rather than those that just use the word in passing.

When I search for ‘plots’, I get this:

The first result is Using Matplotlib with Anvil, which is definitely relevant. Then there’s the reference docs, which has a section on the Plot component. And result number nine is the original announcement dating back to when we made Plotly available in the client-side Python code.

But there’s also a lot of fairly generic pages here. They probably mention the word ‘plot’ once or twice, but they’re not really what I’m looking for when I search for ‘plots’.

‘Uplink’ differs from ‘plots’ because it’s unlikely to be used by accident. It’s the name of a specific Anvil feature and it’s not a very common word in normal usage. If it’s on a page, that page is almost certainly talking about the Anvil Uplink.

If you’re not familiar with it, the Uplink allows you to `anvil.server.call` functions in any Python environment outside Anvil. So I’d expect to get the Using Code Outside Anvil tutorial high up the results list. It shows up at position four.

I also get Escape Hatches and Ejector Seats, which mentions the Uplink as one of Anvil’s ‘Escape Hatches’. And at number 10 we have Remote Control Panel, which uses the Uplink to run a test suite on a remote machine.

It’s good that all three of these show up, but it would be better if they were more highly ranked. The rest of the results probably talk about the Uplink in some way, but the Uplink is not their primary subject.

#### ‘Build a dashboard in Python’

This is included as an example of a multi-word query. I’d expect this to be harder for the search engine to cope with, since the words ‘build’ and ‘Python’ are going to be used a lot on the Anvil site, but a user typing this in is specifically interested in Python dashboarding.

There are two pages I would expect to see here: Building a Business Dashboard in Python and the Python Dashboard workshop. Neither of these appear in the results.

A few of the pages are tangentially related to dashboard building, but generally the signal appears to have been overwhelmed by the noise introduced by the words ‘build’ and ‘Python’.

#### So how did it do?

The basic search engine I’ve put together does manage to get some relevant results for one-word queries. The user has to look past the first few results to find what they’re looking for, but the main pages of interest are there somewhere.

It gets confused by multi-word queries. It can’t distinguish very well between the words that matter, and those that don’t. Anvil’s `full_text_match` does remove words like ‘a’ and ‘in’, but it obviously won’t guess that ‘build’ is less important than ‘dashboard’ in this particular situation.

## Next steps

I’m going to make two improvements in an attempt to address these problems. First, I’ll try to rank more interesting pages more highly. Google has an algorithm called PageRank that assesses how important each page is, and I’ve always wanted to learn how it works, so now is probably a good time! I explore it and implement it in the next post.

Second, I’ll take account of the number of times each word appears on the page. This will help with the ‘building a dashboard in Python’ query, because pages that just happen to mention ‘building’ will do so once or twice, whereas pages about building dashboards will use those words a lot. That gives me an excuse to explore two simple but powerful concepts from Computer Science - tokenization and indexing, which I implement in the final post.

## Implementing PageRank

So here goes, I’m digging into Google’s PageRank. It’s surprisingly simple; you might even say it’s ‘beautifully elegant’. Come read about it:

## Let's Build a Search Engine: How PageRank Works 29th March 2019

I’m building a search engine using Anvil. So far, I’ve built a machine that roams the internet digesting pages, and a simple query engine. I can make some search queries, but the most important pages don’t always show up at the top of the results.

I need a way to score pages based on how important they are. I need PageRank - the secret sauce that won Google the search engine wars of the late 90s.

Let’s look at how PageRank works, then I’ll build a Python web app that implements it.

## How it works

PageRank was developed in 1996 at Stanford by Sergey Brin and Larry Page, after whom it is named. It was the ‘key technical insight’ that made Google work so well, so they kept it a tightly-guarded secret.

Only kidding! Page and Brin wrote a brilliant and openly-available paper about it in 1998, which contains this priceless quote:

To test the utility of PageRank for search, we built a web search engine called Google.

The algorithm is based on the idea that important pages will get linked to more. Some terminology: if python.org links to my page, then my page has a ‘backlink’ from python.org. The number of ‘backlinks’ that a page has gives some indication of its importance.

PageRank takes this one step further - backlinks from highly-ranked pages are worth more. Lots of people link to python.org, so if they link to my page, that’s a bigger endorsement than the average webpage.

If my page is the only one linked to from python.org, that’s a sign of great importance, so it should be given a reasonably high weighting. But if it’s one of fifty pages python.org links to, perhaps it’s not so significant.

An equation paints a thousand words, so for the mathematically-inclined reader, this version of the algorithm can be written:

where R(u) is the rank of the page we’re ranking, R(v) is the ranking of its backlinks, Nv is the number of links page v contains, and c is a constant we can tweak at will. The sum is over all the backlinks of the page we’re ranking. (If equations aren’t your thing - this equation just re-states what I said above.)

It turns out that PageRank models exactly what a random web surfer would do. Our random surfer is equally likely to click each link on her current page, and her chance of being on this page is based on the number of pages that link to it.

We can’t calculate this all in one go, because pages can link to each other in a loop. If the links between pages form a loop, we don’t know what rank to give them, since they all depend on each other’s rank.

We solve this by splitting the calculation into iterations. A page in iteration i + 1 has its rank worked out using the PageRanks from iteration i. So the ranks of pages in loops don’t depend on themselves, they depend on the previous iteration of themselves.

In terms of the equation from earlier, we plug the ranks from iteration i into the right-hand side of the equation, and the ranks for iteration i + 1 come out on the left-hand side.

We have to make a guess about what ranks to start with. Luckily, this is a convergent algorithm, meaning wherever we start, if we do it enough times, the ranks will eventually stop changing and we know we have something that satisfies our equation.

So that solves the problem of self-referential pages. There’s another problem with loops - if none of the pages in the loop link to a page outside of the loop, our random surfer can never escape:

This is solved by adding an extra little bit of rank to every page, to balance out the rank that flows into dead-end loops. In our random surfer analogy, the surfer gets bored every so often and types a random URL into her browser. This changes our equation: there’s a bit of rank added, which we denote E(u):

And that’s the PageRank equation!

PageRank was revolutionary. The paper compares the results from a simple PageRank-based engine against the leading search engine at the time. The top result for “University” in the PageRank engine was the Stanford University homepage. The competing engine returned “Optical Physics at the University of Oregon”.

That sounds like exactly what I need to improve my search results. Luckily, it’s fairly simple to implement if you have a relatively small number of pages. Here goes…

## Implementing PageRank

First I need to calculate the backlinks for each page. My crawler has figured out the forward links from each page, so all the information is there. I just need to iterate through the pages adding ‘page X’ as a backlink on everything ‘page X’ links to.

``````def calculate_backlinks():
# Reset the backlinks for each page
for page in app_tables.pages.search():

# We have forward links for each page - calculate the backlinks
for page in app_tables.pages.search():
if page['urls'] is None:
continue

for url in page['urls']:
``````

That’s the backlinks figured out. Now to set up the initial condition. The PageRank calculation gradually refines the answer by a series of iterations, so I must set the initial values to start off with. For newly-discovered pages, I’ll just guess at 0.5. Pages I’ve seen before will have a PageRank from previous runs, so those pages can start with that old value.

``````@anvil.server.background_task
def ranking_agent():

for page in app_tables.pages.search():
# Initial condition
if page['rank'] is None:
page['rank'] = 0.5
``````

Now for the PageRank calculation itself. I expect the calculated values of the PageRanks to converge on the correct solution. So I want to invent a metric to tell if the calculation has stopped changing much. I’ll calculate how much the average page has changed in rank, and if that’s small, my calculation has converged.

``````  # ... still in the ranking_agent ...

# Iterate over all pages, and repeat until the average rank change is small
average_rank_change = 0.
while average_rank_change > 1.001 or average_rank_change < 0.999:
# Work out the next PageRank
new_ranks = calculate_pagerank(page)

# Step on the calculation
average_rank_change = step_calculation()
``````

Inside the loop, I calculate the PageRank for the next iteration, then step the calculation on by putting the values from step `i+1` into box `i`, and calculating the average rank change.

And that’s how to calculate PageRank! I’ve left out two important details. First, the actual PageRank calculation. This is just the equation from above, but written out in Python. The `CONSTANT` is c and the `RATING_SOURCE_FACTOR` is E(u) (I’ve assumed it’s the same value, 0.4, for each page).

``````CONSTANT = 0.7
RATING_SOURCE_FACTOR = 0.4

def calculate_pagerank():
"""Calculate the PageRank for the next iteration."""
new_ranks = {}
for page in app_tables.pages.search():
rank = RATING_SOURCE_FACTOR
rank = CONSTANT * rank
new_ranks[page['url']] = rank

return new_ranks
``````

And for completeness, here’s exactly how I step the calculation forward:

``````def step_calculation():
"""Put 'next rank' into the 'current rank' box and work out average change."""
num_pages = 0.
sum_change = 0.
for page in app_tables.pages.search():
sum_change += new_ranks[page['url']] / page['rank']
page['rank'] = new_ranks[page['url']]
num_pages += 1
return sum_change / num_pages
``````

To make the search engine order results by PageRank, I just need to use `tables.order_by` in my Data Tables query:

``````@anvil.server.callable
def ranked_search(query):
pages = app_tables.pages.search(
tables.order_by("rank", ascending=False),
html=q.full_text_match(query)
)
return [{"url": p['url'], "title": p['title']} for p in pages]
``````

If you want to see the code in its natural habitat, you can clone the final app, which contains everything I’ve talked about in this series of posts.

Let’s spin it up and get some ranks!

## Crunch those numbers

I built a UI that tracks the progress of the calculation. It polls the Background Task using a Timer component and plots the convergence metric with each iteration. This means you can watch it run and see the calculation gradually converge in real time.

I put some test data together to check that my results make sense intuitively. Consider the four pages (and links) shown in the diagram below:

The PageRank for each page is also shown in the diagram, as calculated by my ranking engine. The page with 3 backlinks has a PageRank of 1.5, the page with 2 backlinks has a PageRank of 0.97, and the pages with 1 backlink each from the same page have rankings of 0.63. This sounds about right. I can tune the spread of these numbers by changing `CONSTANT` in my code.

## Test queries

In Part 1, I tested the basic version of the search engine by running three queries and making judgements about the quality of the results.

Let’s run the same queries again to see how PageRank has improved things.

#### ‘Plots’

‘Plots’ is my example of a fairly generic word that appears a lot in technical writing. There are a few pages that are specifically about plots in Anvil, and I want to see whether they come up.

Overall, the PageRank search seems to do better. Five of the results are specifically about plotting:

The basic search only manages to get three of these into the top ten.

The basic search also included some pages from the middle of the blog. These have been ranked lower by PageRank, so the more-relevant pages have had a fighting chance.

The PageRank search does worse in one respect - the top spot has been taken by the reference docs, replacing the Using Matplotlib with Anvil guide. I’m ranking pages purely based on importance and not on relevance. The reference documentation is clearly ‘more important’ overall than the Matplotlib guide - but not more relevant to a search for ‘plots’.

I’m using ‘Uplink’ as an example of a word that’s not likely to be used accidentally - it’s not commonly used in normal speech, so any uses of it are probably about the Anvil Uplink. If you’re not familiar with it, the Uplink allows you to `anvil.server.call` functions in any Python environment outside Anvil.

There are three relevant pages in the basic search results, and they appear in the PageRank results too. They are Using Code Outside Anvil, Escape Hatches and Ejector Seats and Remote Control Panel. Sadly, these pages have lost position to more ‘major’ pages, the tutorials and reference docs.

The basic search just presents pages in the order it crawled them, so the ranking is ‘more random’ than the PageRank’s importance-based ordering. It looks like the PageRank search is doing worse than chance in this case, because it’s placing pages that have more backlinks at the top regardless of relevance.

We’ve learnt something - even something as powerful as PageRank can be counterproductive in some circumstances.

#### ‘Building a dashboard in Python’

A query with multiple words is harder because it’s difficult to work out which words are the subject of the query, and which words are incidental. I’m using ‘building a dashboard in Python’ to test this. This tripped up the basic search because of the noise introduced by the words ‘building’ and ‘Python’, which are very common on the Anvil site.

The PageRank search did slightly better. The basic search missed Building a Business Dashboard in Python and the Python Dashboard workshop, but they appear in the PageRank search. Again, some minor pages such as page 6 of the blog are now ranked lower, making room for these better-matching pages.

That said, PageRank has put Interactive Plots for Your Apps lower and Accepting Payment with Stripe higher. As with the ‘Uplink’ query, the ranking system does not take page content into account, so sometimes the ‘more random’ ranking of the basic search happens to do better.

#### So how did it do?

The PageRank search de-ranked pages 2 to 6 of the blog, because they don’t have many backlinks. This made room for other matching pages. Sometimes those pages were good matches, sometimes they were not.

It favoured the main pages such as ‘Tutorials’ and ‘Documentation’ above pages that deal with specific subjects. Older pages were also favoured because pages gradually acquire backlinks over time.

PageRank is much more powerful when used to choose between sites, rather than pages on the same site. A single site is a curated system where there are only a small number of relevant results. The web is an unmanaged jungle where there will be many sites on the same topic, and PageRank is a great way to decide which sites are better than others.

So it’s made things better, but I still need to do more.

## Next time: Tokenization and indexing

I need to relate the search rankings to the contents of the page. I’ll take advantage of two trusty Computer Science workhorses: tokenization and indexing.

Once I’ve done that, I think my search engine will be good enough to show to the public!

## Let's Build a Search Engine: How to Run Queries 29th March 2019

I’m building a search engine using Anvil. So far I’ve built a Python web app that crawls the web downloading pages. I’ve also implemented PageRank so my app surfaces the most important pages.

PageRank weeded out minor pages and made room for more important matches. But it doesn’t take page content into account, so the ranking was still a bit hit-and-miss.

I’m currently using Anvil’s `full_text_match` to perform the search for me. I’m treating it as a black box, which has saved me effort. But it’s time to take matters into my own hands. I’m going to explore more about how the actual search process works.

## Tokens

I want to rank pages higher if they have more matching words in them.

My pages are currently just stored as big unrefined strings of characters. It’s time to digest each page and get a precise definition of its meaning. I need to tokenize the page - split it into the elemental parts of language.

Python is great for string processing, so this is really easily accomplished. First I use BeautifulSoup to get rid of the `<script>` and `<style>` blocks that don’t tell me much about the page’s meaning. I also discard the HTML tags and punctuation:

``````from string import punctuation

def tokenize(html):
soup = BeautifulSoup(html)

# Remove script and style blocks
for script in soup.find_all('script'):
script.decompose()
for style in soup.find_all('style'):
style.decompose()

# Remove HTML
text = soup.get_text()

# Remove punctuation
text = text.translate(str.maketrans('', '', punctuation))

# Split the string into a list of words
tokens = text.split()
``````

The last step splits the string into a list of separate words. These are one kind of token, but I want more useful tokens, so I’m going to go a bit further.

I’m going to stem my tokens, and I’m going to remove stop words. I mentioned these techinques when I discussed `full_text_search` in my first search engine post. To recap:

• Stemming means each token will be the stem of the word, rather than the word itself. The words ‘build’, ‘building’ and ‘builds’ will all get stored as ‘build’, so they all have the same impact on the search results.
• Stop words are common words like ‘how’ and ‘to’ that don’t have much bearing on the meaning of a page. I’m going to remove any words from the stop words list used by PostgreSQL’s `tsquery`.

I’m going to do stemming the way you do anything in Python - import a library that does it all for you! In this case I’m using `nltk`, the Natural Language Toolkit that has a selection of stemming algorithms.

``````from nltk import PorterStemmer
stemmer = PorterStemmer()

STOP_WORDS = [
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up',
'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor',
'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]

def tokenize(html):
# The code I had before, which got me a list of words, then...

# Stem
tokens = [stemmer.stem(t) for t in tokens]

# Remove stop words
tokens = [t for t in tokens if t not in STOP_WORDS]

# Count occurrances of each token
return dict(Counter(tokens))
``````

The final bit of processing in the `return` statement counts the occurrence of each token in the list. So if my list was originally:

``````  [
'application'
'build',
'python',
'web',
'application',
'web'
]
``````

I end up with:

``````  {'application': 2, 'build': 1, 'web': 2, 'python': 1}
``````

This is just what I need to improve my search results - a count of each meaningful word in the page. Now I can order the pages based on how many times they mention each query word.

It’s more than that though – it’s also going to let me make my search a lot more scalable. If I ever hope to handle the entire web, I can’t rely on scanning every page for each query. Loading 130 Trillion HTML documents into memory every time somebody searches is a little sub-optimal.

Luckily, I already have everything I need to build a search index.

## Scaling search using an index

Think of an index just like an index in the back of a book. If you’re looking for a particular word in a book, you don’t read the entire book hoping to spot it. You look in the back of the book at the index - you find the word, and it tells you which pages it’s on. That’s exactly what I’m going to do with my web pages.

I’ll create a Data Table to list each token I know about alongside the pages it appears on:

To populate this table, I need to run my `tokenize` function on all the pages I have stored.

``````@anvil.server.background_task
def index_pages():
"""Build an index to find pages by looking up words."""
index = {}
for n, page in enumerate(app_tables.pages.search()):
# Tokenize the page
page['token_counts'] = tokenize(page['html'])

# Then organize the tokens into the index
index = collate_tokens(page, index)

# Now persist the index
for token, pages in index.items():
``````

That second `for` loop writes the index to the database. It’s much faster to build the index in-memory and persist it all at once than to build it directly into the database.

Here’s how I’m building the index. For each token in the page, I check if there’s an entry for it in the index. If there is, I add this page to it. If not, I create a new row in the index for this token.

``````def collate_tokens(page, index):
# For each token on the page, add a count to the index
for token, count in page['token_counts'].items():
# Add this count to the index
entry = [{'url': page['url'], 'count': count}]
if token in index:
index[token] += entry
else:
index[token] = entry

return index
``````

I’ve created a UI to launch the indexing process. There’s a button to trigger the Background Task, and I’m keeping track of its progress by plotting the size of the index vs. the number of pages indexed.

The graph is pretty interesting. You would probably expect it to smoothly approach a particular value, “the number of commonly-used English words”, but actually there are a couple of sudden jumps at the beginning. This must be where it processes large pages with many words, so it discovers a large proportion of the English language at once.

Here’s my index in my Data Table. Now I can look up any word stem and instantly get a list of pages it’s on!

This has been remarkably simple to build, so it’s easy to miss how powerful this is. If I wanted to search all the pages on the web without an index, I would have to scan 130 Trillion HTML documents, each on avarage a few kilobytes in size. Let’s say the avarage size is 10 kB, that’s

or 1.3 Exabytes of data. You’d need one million 1 Terabyte hard disks to store that much data, and you couldn’t possibly scan it in time to answer a query.

The upper limit on my index size is much, much smaller. There are about 170,000 words in English, and about 6,500 languages in the world. That gives us an estimate of the size of the index:

And each entry is just a word, average length around 10 bytes. So about 11 Gigabytes of data. And of course, we can store that in an easily searchable order – for example, alphabetically, so we can find records we want quickly. (Anvil’s Data Tables take care of that for us.)

Somewhere in the multiverse, there’s a universe where the laws of mathematics don’t permit indexing. In that universe, the web never took off.

## Ranking by match strength

I’m just moments away from my production-quality search engine. I just need to concoct an algorithm that combines the token counts I’ve created with the PageRank data from the previous post. This will rank pages based on both relevance and importance.

Just as I split my pages into tokens, I can split the search query into tokens too:

``````  query_tokens = tokenize(query)
``````

Then I can look up each query token in the index:

``````  index_entries = app_tables.index.search(token=q.any_of(*query_tokens))
``````

(If you’d like to understand exactly how I put this line together, check out Querying Data Tables.)

So now I have every index entry that relates to my query. Now we have to work out which of these pages is most relevant! This is where most of Google’s secret sauce is, but we’ll start simple. To compute how relevant each page is to my query, I’ll just sum up the token counts for each page, and use that as the relevance score for the page:

``````  # Sum up the word count scores
page_scores = defaultdict(int)
for index_entry in index_entries:
for index_page in index_entry['pages']:
page = app_tables.pages.get(url=index_page['url'])
page_scores[page] += index_page['count']
``````

(Remember: `defaultdict(int)` sets unknown keys to have a value of `0`, so it’s a great way to calculate sums.)

My `page_scores` dictionary is just what its name suggests: it maps each page to a total score. This is already a set of ranked search results! But they’re only ranked by match strength. I’ll multiply each value by the PageRank as well.

``````  # Calculate total scores
final_scores = {}
for page, score in page_scores.items():
final_scores[page] = score * page['rank']
``````

If PageRank is Google’s secret sauce, it’s important to know how much to add. In the real app, I’ve used a slightly more complicated formula that uses a constant to tweak the influence of PageRank vs. the relevance score:

``````  score * (1 + (page['rank']-1)*PAGERANK_MULTIPLIER)
``````

You can change the `PAGERANK_MULTIPLIER` to make PageRank more or less important - smaller numbers squeeze the PageRanks towards 1. What value should I use? You could tune it automatically by changing `PAGERANK_MULTIPLIER` and comparing the search results against training data. I did a manual version of this - I tweaked it and ran queries until I was happy with the results. I went with 1/3.

So now I have a list of pages and a final score for each. I’ll sort in descending order of score, and return only the data I’m interested in - the URL and title of each page, in order.

``````  # Sort by total score
pages = sorted(final_scores.keys(), key=lambda page: -final_scores[page])

# ... and return the page info we're interested in.
return [{"url": p['url'], "title": p['title']} for p in pages]
``````

And that’s my final search function! It takes in the same arguments as my other search functions - the query as a string - and returns the pages in the same format, so I can use it from the same UI as the other search functions. Time to compare results and see if I really have improved things.

## Final search tests

In each of these posts, I’ve been running three test queries to make a (pretty unscientific) judgement about the quality of my search results. Let’s see how the final algorithm stacks up.

#### ‘Plots’

I selected ‘plots’ as a word that would be common in general technical writing. There are a few pages on the Anvil site that are directly related to plotting, so I’m hoping to see them near the top.

And I do! In fact, the new search algorithm knocks it out of the park.

Four of the top five results directly cover how to make plots in Anvil:

Two other results use plots in a big way: SMS Surveys in Pure Python and Using SciPy to work out how many T-shirts I need for a conference.

A user who wants to learn about plots in Anvil is pretty well served by this algorithm!

I selected ‘Uplink’ because it has a specific meaning for Anvil, and it’s not a commonly-used word otherwise. (If you’re not familiar with the Anvil Uplink: it allows you to `anvil.server.call` to and from Python running anywhere.)

The new algorithm does brilliantly at this one too.

The two most relevant pages appear at the top: Using code outside Anvil, followed by Remote Control Panel. The first of these is the Uplink tutorial, so that’s probably what a user searching for ‘Uplink’ is looking for.

The Anvil On-Site Installation page is also in the top five. Among other things, it describes how to connect your app to a Python instance running on your private network, and the answer is ‘Use the Uplink’. That is indeed something somebody interested in the Uplink would like to know.

Overall, a great job.

#### ‘Build a dashboard in Python’

This is the most challenging query to answer because it has multiple words, and only one of them particularly signals what the user is looking for - ‘dashboard’.

Item three in the results is Build a Business Dashboard with Python, so that’s a plus point. It doesn’t appear in the Basic Search results at all.

The other page I’d expect to see, the Python Dashboard workshop, isn’t there on the first page. Most of the other pages seem to be included because they mention ‘build’ and ‘Python’ a lot, rather than because they talk about building dashboards in Python.

A couple of partially-related results from the Basic Search don’t appear in the new search: Interactive plots for your apps and the Remote Control Panel workshop.

So mixed results for the multi-word query. Neither a massive improvement, nor any worse.

#### So how did it do?

Nearly perfect results for ‘plots’ and ‘Uplink’, but not much improvement for ‘build a dashboard in Python’. Calculating relevance is hard!

I was hoping that pages with a few mentions of ‘build’ and ‘Python’ would get dropped because they only had low counts of those words. But, this being Anvil, there are pages that mention ‘build’ and ‘Python’ a lot.

## Next steps

I need to improve my relevance calculation, and to reduce the noise introduced by ‘build’ and ‘python’ in the multi-word query. Some ideas:

• I could try matching the entire phrase and/or portions of the phrase (N-grams).
• I could also start looking at the grammar of the query - ‘dashboard’ is the object noun, so I should weight it higher in the results.
• Something as simple as giving more weight to words that appear in the page title could also make a big difference.

The most promising step, though, is to ask “which words appear more than usual on this page?” instead of “which words appear often on this page?”. I could do that simply by dividing the word frequencies by the average frequency across all pages. That means words like ‘build’ and ‘python’ that show up on lots of pages would be de-emphasised in the results, but the word ‘dashboard’ will still score highly on dashboard-related pages.

The fun thing about playing with search engines is how many things you can come up with to improve the results. I’ve constructed a simple heuristic based on word counts and PageRank, but you can roll things together in so many different ways. You could start analysing media as well as the pure HTML; you could take account of page structure; or you could even track browsing behaviour (anonymised and with permission, of course!).

Why not try some things out for yourself? Feel free to clone the final app and tinker with it to see if you can do better than me. Then get in touch via the Forum to show me and the rest of the Anvil community what you’ve done!

## Make it live!

I’m quite pleased with what I’ve achieved. In a few hundred lines of code, I’ve built a search engine that does a really great job of serving single-word queries from the Anvil site. There’s still room for improvement with multi-word queries, but the results are good enough that the user can find the best match somewhere on the first page of results.

I’m ready to make it public.

As with any Anvil app, it’s auto-hosted at a private URL. To give it a public URL, I just enter my desired subdomain of `anvil.app` in this box:

And now my search engine is hosted at https://search.anvil.app!

(If I wanted to use my own domain, rather than `anvil.app`, of course I could. That’s the “Add Custom Domain” button in that screenshot.)

That’s the end of my search engine journey. I haven’t quite made a competitor to Google, but I’ve built a working search engine from first principles, and I’ve learnt about some important concepts in the process. Not only that, but I’ve implemented the PageRank algorithm in a few dozen lines of code!

You can see the full source code of my search engine app here:

I hope you’ve learnt something too. And I hope you’ve seen how easy Anvil has made this. Anvil allowed me to describe exactly how my search engine works, and show you the code.

If you’re not already an Anvil user, watch this 7-minute video to learn more. We build a database-backed multi-user web app in real-time and publish it on the internet.