
The Easiest Way to Build HTTP APIs in Python
30th April 2019


Anvil isn’t just the easiest way to build Web user interfaces in Python. It’s also a fully-featured app platform, great for building serverless apps and HTTP APIs! Today, we’ll take a tour:

How to build a simple HTTP API

It’s pretty simple. Just create an Anvil app, add a Server Module, and write an @anvil.server.http_endpoint() function:

@anvil.server.http_endpoint('/hello-world')
def hello_world(**q):
  return {'the answer is': 42}

Screencast of adding an HTTP endpoint

That was pretty fast, right? And you didn’t have to host it anywhere: Once you publish your app, your API is live!

What does an API look like?

That depends on your app, of course! But for this post, we’ll be building an API for our multi-user To-Do List example.

Our To-Do list is a classic data-storage app. An API for this kind of app needs to do four things:

  • Read records - in this case, we want to make a GET request to a URL and get all the tasks in our to-do list.
  • Create new records - in this case, we want to POST to a URL to add a new task.
  • Update records - in this case, we want to mark tasks as done, or change their title. By convention, each task has its own URL, which we can update by making a PUT request to it.
  • Delete records - in this case, by making a DELETE request to a task’s URL.

(This is often abbreviated “CRUD”, for Create/Read/Update/Delete.)

Let’s get going!

As we build up this API, step by step, we’ll see common patterns used by many HTTP and REST APIs.

If you’re feeling impatient, you can get the source code for the finished, working API:

Open source code

Returning records from your database

We’ll start with the simplest endpoint: getting all the tasks in our to-do list, as an array of JSON objects. It’s pretty simple – if we return a list of dictionaries, Anvil will turn it into JSON for us:

@anvil.server.http_endpoint('/tasks')
def get_tasks(**q):
    return [{'title': task['title'], 'done': task['done']}
            for task in app_tables.tasks.search()]

If we publish our app at https://my-todo-list.anvil.app, we can call this function by fetching https://my-todo-list.anvil.app/_/api/tasks:

$ curl https://my-todo-list.anvil.app/_/api/tasks
[{"done":true,"title":"Wash the car"},{"done":true,"title":"Do the dishes"}]

Note: If you haven’t made your app public, your API URLs will be long and difficult to remember. To get a nicer URL, make your app public from the Publish dialog. See: How to publish your app

Authentication

In an earlier tutorial, we expanded our simple to-do list into a multi-user application, using Anvil’s built-in Users service. We want to do the same with our API – we want to authenticate our users, and only allow them to access their own data.

We can automatically require authentication for our API endpoint, using Anvil’s Users Service, by passing authenticate_users=True to the anvil.server.http_endpoint(...) decorator. This prevents access unless the user provides a valid username and password via standard HTTP Basic authentication.

Within our endpoint function, we can use anvil.users.get_user() to find out what user they logged in as. Let’s modify our /tasks endpoint so it returns only the current user’s tasks:

@anvil.server.http_endpoint('/tasks', authenticate_users=True)
def get_tasks(**q):
    this_user = anvil.users.get_user()
    return [{'title': task['title'], 'done': task['done'], 'id': task.get_id()}
            for task in app_tables.tasks.search(owner=this_user)]

Let’s try that with curl:

$ curl https://my-todo-list.anvil.app/_/api/tasks
Unauthorized

$ curl -u me@example.com:my_password https://my-todo-list.anvil.app/_/api/tasks
[{"done":true,"title":"Wash the car","id":"[23,567]"},{"done":true,"title":"Do the dishes","id":"[23,568]"}]

You can also see that I’ve added an id field to the responses, containing the unique ID of each database row. Each database row has an ID, which is a string like "[23,568]". We can get it by calling get_id() on the row object, and we can use it later to retrieve that row from the database.

Want to do your own authentication, rather than using Anvil’s built-in Users Service? No problem! Just pass require_auth=True instead of authenticate_users=True, and then examine anvil.server.request.username and anvil.server.request.password for yourself. Learn more in our reference documentation.
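
For example, here’s a minimal sketch of what rolling your own check might look like – the hard-coded credentials and the endpoint path are placeholders for illustration, not part of the to-do API above:

import anvil.server

@anvil.server.http_endpoint('/my-tasks', require_auth=True)
def get_my_tasks(**q):
  # require_auth=True guarantees *some* HTTP Basic credentials were supplied;
  # deciding whether they're valid is up to us.
  request = anvil.server.request
  if request.username != 'api-user' or request.password != 'my-secret-token':
    return anvil.server.HttpResponse(status=401, body='Unauthorized')
  # ...look up and return this user's data here...
  return {'tasks': []}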

Accepting parameters

Now, we want an endpoint to add a new task. We use form parameters to specify the title of the new task. Form parameters are passed as keywords to the endpoint function, so to make a mandatory parameter we just make our function take an argument called title:

@anvil.server.http_endpoint('/tasks/new',
                            methods=["POST"], authenticate_users=True)
def new_task(title, **q):
  new_row = app_tables.tasks.add_row(title=title, done=False,
                                     owner=anvil.users.get_user())
  return {'id': new_row.get_id()}

This endpoint should only accept HTTP POST requests, so we’ve specified methods=["POST"]. We also return the ID of the newly-created row.

Here’s how to create a new task from curl:

$ curl -u me@example.com:my_password -d title="New task" https://my-todo-list.anvil.app/_/api/tasks/new
{"id":"[23,590]"}

Form parameters vs JSON

The example above uses “form encoding” to pass parameters. These days, API users often prefer to send JSON as the request body.

We can accommodate this by using anvil.server.request.body_json to decode the body of the request as a JSON object:

@anvil.server.http_endpoint('/new_task', methods=["POST"], authenticate_users=True)
def new_task(**q):
  title = anvil.server.request.body_json['title']
  new_row = app_tables.tasks.add_row(title=title, done=False,
                                     owner=anvil.users.get_user())
  return {'id': new_row.get_id()}

Here’s how to use this JSON-based endpoint from curl – it’s a little more verbose:

$ curl -u me@example.com:my_password \
       -X POST \
       -H "Content-Type: application/json" \
       -d '{"title":"New task"}' \
       https://my-todo-list.anvil.app/_/api/new_task
{"id":"[23,591]"}

URL parameters

We want each task to have its own URL. We can use the unique ID of the database row for this. We’ll make each task available at /tasks/[ID].

To do this, we make an endpoint with a URL parameter called id. URL parameters are passed into the endpoint function as keyword arguments, same as form parameters:

@anvil.server.http_endpoint('/tasks/:id', authenticate_users=True)
def task_url(id, **q):
  this_user = anvil.users.get_user()
  task = app_tables.tasks.get_by_id(id)
  if task is None or task['owner'] != this_user:
    return anvil.server.HttpResponse(status=404)
  else:
    return {'title': task['title'], 'done': task['done'], 'id': id}

Of course, the user might request a bogus URL, with an invalid ID. Or, worse, with the ID of a row belonging to another user!

If the row with the specified ID is not available in the tasks table, or if it is not owned by this user, our function returns an HTTP 404 (“Not Found”) response.

Let’s test it by accessing one of the tasks we saw from the listing earlier. We’ll have to URL-encode the ID so that curl will swallow it, but that’s OK: Anvil will decode the id parameter before passing it to our function:

$ curl -u me@example.com:my_password https://my-todo-list.anvil.app/_/api/tasks/%5B23%2C567%5D
[{"done":true,"title":"Wash the car","id":"[23,567]"}

Rounding out the example: Updates and deletes

We now know enough to complete the last two operations: updating a task and deleting it. These are done by making PUT and DELETE requests, respectively, to the task’s URL.

Let’s expand our task_url endpoint to handle GET, PUT and DELETE methods:

@anvil.server.http_endpoint('/tasks/:id',
                            authenticate_users=True,
                            methods=["GET","PUT","DELETE"])
def task_by_url(id, **p):
  this_user = anvil.users.get_user()
  task = app_tables.tasks.get_by_id(id)

  request = anvil.server.request

  if task is None or task['owner'] != this_user:
    return anvil.server.HttpResponse(status=404)
  elif request.method == 'DELETE':
    task.delete()
    return None
  elif request.method == 'PUT':
    if 'title' in request.body_json:
      task['title'] = request.body_json['title']
    if 'done' in request.body_json:
      task['done'] = request.body_json['done']

  # PUT and GET both fall through to here and return the task
  # as JSON

  return {'title': task['title'], 'done': task['done'], 'id': id}

…and that’s a working API!

We’ve just built a fully functional, secure and authenticated API for our to-do list app – all in about 30 lines of Python.

Because Anvil is “serverless”, this API is already live on the internet, without us having to set up or maintain web servers.

You can open the source code of this app in Anvil, ready to run:

Open this app

Further Reading

I’ll just finish up by mentioning a few more advanced things you might want to do with Anvil’s HTTP endpoints:

Returning different types of data

You can return plain text (strings), binary data with any content type (Media objects), or JSON (anything else).

If you return an anvil.server.HttpResponse() object, you can control the HTTP status, headers, and content of your response.
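
For example, here’s a hypothetical endpoint (not part of the to-do API above) that sets its own status code, body and an extra header:

import anvil.server

@anvil.server.http_endpoint('/ping')
def ping(**q):
  # Build a response with an explicit status, a plain-text body and a custom header
  response = anvil.server.HttpResponse(status=200, body="pong")
  response.headers['X-Served-By'] = 'anvil'
  return response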

Cookies

Anvil’s support for cookies comes with extra security. Anvil cookies are encrypted (so the user cannot see what you’ve set in a cookie) and authenticated (so the user cannot tamper with values you’ve set in a cookie).

In short, the contents of anvil.server.cookies.local[...] and anvil.server.cookies.shared[...] are trustworthy, unlike traditional HTTP environments.
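
As a rough sketch (the endpoint and cookie key here are invented for illustration), using a cookie from an HTTP endpoint looks something like this:

import anvil.server

@anvil.server.http_endpoint('/visit')
def visit(**q):
  # Read the (encrypted, tamper-proof) counter, defaulting to 0 on a first visit
  count = anvil.server.cookies.local.get('visit_count', 0) + 1
  anvil.server.cookies.local['visit_count'] = count
  return {'visits': count}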

CORS made simple

If you want your API to be accessible from other web pages, you’ll need to set CORS (Cross Origin Resource Sharing) headers. You can do this by passing cross_origin=True to the anvil.server.http_endpoint() decorator.
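
In other words, it’s one extra argument on the decorator. A minimal sketch, using a made-up endpoint:

@anvil.server.http_endpoint('/public-tasks', cross_origin=True)
def get_public_tasks(**q):
  # Anvil adds the appropriate CORS headers for us, so JavaScript
  # running on other sites' pages can call this endpoint.
  return {'tasks': []}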

XSRF protection

HTTP endpoints in Anvil are automatically protected from XSRF (Cross-Site Request Forgery). When a request comes to an HTTP endpoint from outside your app (based on Origin and/or Referer headers), it executes in an independent session from the rest of your app. This prevents malicious requests from external sites from performing actions with the credentials of a logged-in user.

You can opt out of this protection: To execute cross-site requests in the same session as the rest of your app, pass cross_site_session=True to the anvil.server.http_endpoint() decorator.


To learn more about all these features, read more in the Anvil reference docs

Let's Build a Search Engine
29th March 2019

How does a search engine work? Let’s find out – by building one!

Search engines have become the gateway to the modern web. How often do you know exactly which page you want, but you search for it anyway, rather than typing the URL into your web browser?

Like many great machines, the simple search engine interface - a single input box - hides a world of technical magic tricks. When you think about it, there are a few major challenges to overcome. How do you collect all the valid URLs in existence? How do you guess what the user wants and return only the relevant pages, in a sensible order? And how do you do that for 130 Trillion pages faster than a human reaction time?

I’ll be some way closer to understanding these problems when I’ve built a search engine for myself. I’ll be using nothing but Python (even for the UI) and my code will be simple enough to include in this blog post.

You can copy the final version, try it out, and build on it yourself:

Edit this app

There will be three parts to this.

  1. First, I’m going to build a basic search engine that downloads pages and matches your search query against their contents. (That’s this post)

  2. Then I’m going to implement Google’s PageRank algorithm to improve the results. (See Part 2)

  3. Finally, I’ll play with one of computer science’s powertools - indexing - to speed up the search and make the ranking even better. (See Part 3)

Collecting URLs

Let’s start building a machine that can download the entire web.

I’m going to build a web crawler that iteratively works its way through the web like this:

  1. Start at a known URL
  2. Download the page
  3. Make a note of any URLs it contains
  4. GOTO 1 (for the new URLs I’ve found)

I need a known URL to start with. I’ll allow webmasters and other good citizens to submit URLs they know about. I’ll store them in a database (I’m using Anvil’s Data Tables) and if I know the URL already, I won’t store it twice.

@anvil.server.callable
def submit_url(url):
  url = url.rstrip('/')  # URLs with and without trailing slashes are equivalent
  if not app_tables.urls.get(url=url):
    app_tables.urls.add_row(url=url)

I’ve also made it possible to submit sitemaps, which contain lists of many URLs (see our Background Tasks tutorial for more detail.) I’m using BeautifulSoup to parse the XML.

from bs4 import BeautifulSoup

@anvil.server.callable
def submit_sitemap(sitemap_url):
  response = anvil.http.request(sitemap_url)
  
  soup = BeautifulSoup(response.get_bytes())
  for loc in soup.find_all('loc'):
    submit_url(loc.string)

If I submit the Anvil sitemap, my table is populated with URLs.

I’m in good company by allowing people to submit URLs and sitemaps for crawling - Google Search Console does this. It’s one way of avoiding my crawler getting stuck in a local part of the web that doesn’t link out to anywhere else.

Stealing shamelessly from Google Search Console, I’ve created a webmaster’s console with ‘submit’ buttons that call my submit_url and submit_sitemap functions:

  def button_sitemap_submit_click(self, **event_args):
    """This method is called when the button is clicked"""
    self.label_sitemap_requested.visible = False
    anvil.server.call('submit_sitemap', self.text_box_sitemap.text)
    self.label_sitemap_requested.visible = True

  def button_url_submit_click(self, **event_args):
    """This method is called when the button is clicked"""
    self.label_url_requested.visible = False
    anvil.server.call('submit_url', self.text_box_url.text)
    self.label_url_requested.visible = True

Crawling

Now that I know some URLs, I can download the pages they point to. I’ll create a Background Task that churns through my URL list making requests:

@anvil.server.background_task
def crawl():
  for url_row in app_tables.urls.search():
    url = url_row['url']
    # Get the page
    try:
      response = anvil.http.request(url)
      html = response.get_bytes().decode('utf-8')
    except Exception:
      # If the fetch failed, just try the other URLs
      continue

    row = app_tables.pages.get(url=url) or app_tables.pages.add_row(url=url)
    row['html'] = html

Because it’s a Background Task, I can fire off a crawler and download all the pages I know about in the background without blocking the user’s interaction with my web app.
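
Kicking off the crawl is a one-liner with Anvil’s Background Tasks API. Here’s a sketch – the start_crawl wrapper and the button handler name are my own additions for illustration, not code from the app above:

# On the server:
@anvil.server.callable
def start_crawl():
  # Launch the crawl() Background Task; this call returns immediately
  return anvil.server.launch_background_task('crawl')

# On the client (e.g. a button's click handler):
def button_crawl_click(self, **event_args):
  anvil.server.call('start_crawl')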

That’s all very well, but it doesn’t really crawl yet. The clever thing about a web crawler is how it follows links between pages. The web is a directed graph – in other words, it consists of pages with one-way links between them. That’s why it’s such a wonderful information store - if you’re interested in the subject of one page, you’re likely to be interested in the subjects of pages it links to. If you’ve ever been up ’til dawn in the grip of a Wikipedia safari, you’ll know what I’m talking about.

So I need to find the URLs in the pages I download, and add them to my list. BeautifulSoup, the brilliant HTML/XML parser, helps me out again here.

I also record which URLs I found on each page - this will come in handy when I implement PageRank.

  from bs4 import BeautifulSoup

  soup = BeautifulSoup(html)

  # Parse out the URLs, and record them against this page
  row['forward_links'] = []
  for a in soup.find_all('a', href=True):
    submit_url(a['href'])

    # Record the URLs for this page
    row['forward_links'] += [a['href']]

While I’m at it, I’ll grab the page title to make my search results a bit more human-readable:

  # Parse out the title from the page
  title_tag = soup.find('title')
  title = str(title_tag.string) if title_tag and title_tag.string else 'No Title'
  row['title'] = title

The crawler has become rather like the classic donkey following a carrot: the further it gets down the URL list, the more URLs it finds, so the more work it has to do. I visualised this by plotting the length of the URL list alongside the number of URLs processed.

The list grows initially, but the crawler eventually finds all the URLs and the lines converge. It converges because I’ve restricted it to https://anvil.works (I don’t want to denial-of-service anybody’s site by accident.) If I had it crawling the open web, I imagine the lines would diverge forever - pages are probably being added faster than my crawler can crawl.
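
I haven’t shown that restriction in the code above, but it only takes a small change to submit_url. Something like this sketch would do it (the hard-coded prefix being the assumption):

ALLOWED_PREFIX = 'https://anvil.works'

@anvil.server.callable
def submit_url(url):
  url = url.rstrip('/')  # URLs with and without trailing slashes are equivalent
  # Stay inside the one site we're allowed to crawl
  if not url.startswith(ALLOWED_PREFIX):
    return
  if not app_tables.urls.get(url=url):
    app_tables.urls.add_row(url=url)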

By the time it has finished, there’s a nice crop of page data waiting for me in the pages table.

Time to implement search. I’ve thrown together the classic “input box and button” UI using the drag-and-drop editor. There’s also a Data Grid for listing the results, which gives me pagination for free. Each result will contain the page title and a link.

The most basic search algorithm would just break the query into words and return pages that contain any of those words. That’s no good at all, and I can do better straight away.

I’ll remove words that are too common. Let’s say a user enters ‘how to build a web app’. If a page happens to contain exactly the text ‘how to build a web app’, it will be returned. But they would also get pages containing the text ‘how to suckle a lamb’.

So I’ll remove words like ‘how’ and ‘to’. In the lingo, these are called stop words.

I’ll include words that are closely related to those in the query. The search for ‘how to build a web app’ should probably return pages with ‘application builder’, even though neither of those words appears exactly in the query.

In the lingo, this is called stemming.

Both of these requirements are met by Anvil’s full_text_match operator, so I can get a viable search running right away:

# On the server:
@anvil.server.callable
def basic_search(query):
  return app_tables.pages.search(html=q.full_text_match(query))

# On the client:
def button_search_click(self, **event_args):
  """This method is called when the button is clicked"""
  self.repeating_panel_1.items = anvil.server.call('basic_search', self.text_box_query.text)

Later on we’ll talk about indexing and tokenization, which gets to the nuts and bolts of how to optimise a search. But for now, I have a working search engine. Let’s try some queries.

Test queries

For each stage in its development, I’m going to run three queries to see how the results improve as I improve my ranking system. Each query is chosen to reflect a different type of search problem.

I’ll only look at the first page of ten results. Nobody ever looks past the first page!

(If you’re wondering why all the results are from the same site, bear in mind that I’ve restricted my crawler to https://anvil.works to avoid getting my IP address quite legitimately blocked by anti-DoS software, and to keep my test dataset to a manageable size.)

‘Plots’

‘Plots’ is a fairly generic word that you would expect to show up all over the place in technical writing. The challenge is to return the pages that are specifically about plotting, rather than those that just use the word in passing.

When I search for ‘plots’, I get this:

The first result is Using Matplotlib with Anvil, which is definitely relevant. Then there’s the reference docs, which has a section on the Plot component. And result number nine is the original announcement dating back to when we made Plotly available in the client-side Python code.

But there’s also a lot of fairly generic pages here. They probably mention the word ‘plot’ once or twice, but they’re not really what I’m looking for when I search for ‘plots’.

‘Uplink’

‘Uplink’ differs from ‘plots’ because it’s unlikely to be used by accident. It’s the name of a specific Anvil feature and it’s not a very common word in normal usage. If it’s on a page, that page is almost certainly talking about the Anvil Uplink.

If you’re not familiar with it, the Uplink allows you to anvil.server.call functions in any Python environment outside Anvil. So I’d expect to get the Using Code Outside Anvil tutorial high up the results list. It shows up at position four.

I also get Escape Hatches and Ejector Seats, which mentions the Uplink as one of Anvil’s ‘Escape Hatches’. And at number 10 we have Remote Control Panel, which uses the Uplink to run a test suite on a remote machine.

It’s good that all three of these show up, but it would be better if they were more highly ranked. The rest of the results probably talk about the Uplink in some way, but the Uplink is not their primary subject.

‘Build a dashboard in Python’

This is included as an example of a multi-word query. I’d expect this to be harder for the search engine to cope with, since the words ‘build’ and ‘Python’ are going to be used a lot on the Anvil site, but a user typing this in is specifically interested in Python dashboarding.

There are two pages I would expect to see here: Building a Business Dashboard in Python and the Python Dashboard workshop. Neither of these appear in the results.

A few of the pages are tangentially related to dashboard building, but generally the signal appears to have been overwhelmed by the noise introduced by the words ‘build’ and ‘Python’.

So how did it do?

The basic search engine I’ve put together does manage to get some relevant results for one-word queries. The user has to look past the first few results to find what they’re looking for, but the main pages of interest are there somewhere.

It gets confused by multi-word queries. It can’t distinguish very well between the words that matter, and those that don’t. Anvil’s full_text_match does remove words like ‘a’ and ‘in’, but it obviously won’t guess that ‘build’ is less important than ‘dashboard’ in this particular situation.

Next steps

I’m going to make two improvements in an attempt to address these problems. First, I’ll try to rank more interesting pages more highly. Google has an algorithm called PageRank that assesses how important each page is, and I’ve always wanted to learn how it works, so now is probably a good time! I explore it and implement it in the next post.

Second, I’ll take account of the number of times each word appears on the page. This will help with the ‘building a dashboard in Python’ query, because pages that just happen to mention ‘building’ will do so once or twice, whereas pages about building dashboards will use those words a lot. That gives me an excuse to explore two simple but powerful concepts from Computer Science - tokenization and indexing, which I implement in the final post.

Implementing PageRank

So here goes, I’m digging into Google’s PageRank. It’s surprisingly simple; you might even say it’s ‘beautifully elegant’. Come read about it:

Implementing Google's PageRank


Or sign up for free, and open our search engine app in the Anvil editor:

Get building

Let's Build a Search Engine: How PageRank Works
29th March 2019


Back to: How to build a search engine

I’m building a search engine using Anvil. So far, I’ve built a machine that roams the internet digesting pages, and a simple query engine. I can make some search queries, but the most important pages don’t always show up at the top of the results.

I need a way to score pages based on how important they are. I need PageRank - the secret sauce that won Google the search engine wars of the late 90s.

Let’s look at how PageRank works, then I’ll build a Python web app that implements it.

How it works

PageRank was developed in 1996 at Stanford by Sergey Brin and Larry Page, after whom it is named. It was the ‘key technical insight’ that made Google work so well, so they kept it a tightly-guarded secret.

Only kidding! Page and Brin wrote a brilliant and openly-available paper about it in 1998, which contains this priceless quote:

To test the utility of PageRank for search, we built a web search engine called Google.

The algorithm is based on the idea that important pages will get linked to more. Some terminology: if python.org links to my page, then my page has a ‘backlink’ from python.org. The number of ‘backlinks’ that a page has gives some indication of its importance.

PageRank takes this one step further - backlinks from highly-ranked pages are worth more. Lots of people link to python.org, so if they link to my page, that’s a bigger endorsement than the average webpage.

If my page is the only one linked to from python.org, that’s a sign of great importance, so it should be given a reasonably high weighting. But if it’s one of fifty pages python.org links to, perhaps it’s not so significant.

An equation paints a thousand words, so for the mathematically-inclined reader, this version of the algorithm can be written:

R(u) = c × Σ R(v) / Nv

where R(u) is the rank of the page we’re ranking, R(v) is the ranking of each of its backlinks, Nv is the number of links page v contains, and c is a constant we can tweak at will. The sum is over all the backlinks of the page we’re ranking. (If equations aren’t your thing - this equation just re-states what I said above.)

It turns out that PageRank models exactly what a random web surfer would do. Our random surfer is equally likely to click each link on her current page, and her chance of being on this page is based on the number of pages that link to it.

We can’t calculate this all in one go, because pages can link to each other in a loop. If the links between pages form a loop, we don’t know what rank to give them, since they all depend on each other’s rank.

We solve this by splitting the calculation into iterations. A page in iteration i + 1 has its rank worked out using the PageRanks from iteration i. So the ranks of pages in loops don’t depend on themselves, they depend on the previous iteration of themselves.

In terms of the equation from earlier, we plug the ranks from iteration i into the right-hand side of the equation, and the ranks for iteration i + 1 come out on the left-hand side.

We have to make a guess about what ranks to start with. Luckily, this is a convergent algorithm, meaning wherever we start, if we do it enough times, the ranks will eventually stop changing and we know we have something that satisfies our equation.

So that solves the problem of self-referential pages. There’s another problem with loops - if none of the pages in the loop link to a page outside of the loop, our random surfer can never escape.

This is solved by adding an extra little bit of rank to every page, to balance out the rank that flows into dead-end loops. In our random surfer analogy, the surfer gets bored every so often and types a random URL into her browser. This changes our equation: there’s a bit of rank added, which we denote E(u):

R(u) = c × ( Σ R(v) / Nv + E(u) )

And that’s the PageRank equation!
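
To make the iteration concrete, here’s a tiny, self-contained sketch of the calculation on a made-up three-page link graph, with plain Python dictionaries standing in for the Data Tables the real app uses:

# Toy link graph: page -> pages it links to (invented data for illustration)
links = {
  'home': ['docs', 'blog'],
  'docs': ['home'],
  'blog': ['home', 'docs'],
}

C = 0.7  # the constant c
E = 0.4  # E(u), assumed to be the same for every page

ranks = {page: 0.5 for page in links}  # initial guess

for iteration in range(100):
  new_ranks = {}
  for page in links:
    # Backlinks: every page that links to this one
    backlinks = [p for p, outgoing in links.items() if page in outgoing]
    new_ranks[page] = C * (E + sum(ranks[b] / len(links[b]) for b in backlinks))
  # Stop once the ranks have (all but) stopped changing
  if all(abs(new_ranks[p] - ranks[p]) < 1e-9 for p in links):
    break
  ranks = new_ranks

print(ranks)  # 'home' ends up with the highest rank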

PageRank was revolutionary. The paper compares the results from a simple PageRank-based engine against the leading search engine at the time. The top result for “University” in the PageRank engine was the Stanford University homepage. The competing engine returned “Optical Physics at the University of Oregon”.

That sounds like exactly what I need to improve my search results. Luckily, it’s fairly simple to implement if you have a relatively small number of pages. Here goes…

Implementing PageRank

First I need to calculate the backlinks for each page. My crawler has figured out the forward links from each page, so all the information is there. I just need to iterate through the pages adding ‘page X’ as a backlink on everything ‘page X’ links to.

def calculate_backlinks():
  # Reset the backlinks for each page
  for page in app_tables.pages.search():
    page['backlinks'] = []
  
  # We have forward links for each page - calculate the backlinks
  for page in app_tables.pages.search():
    if page['forward_links'] is None:
      continue

    # Add this page as a backlink to everything it links to
    for url in page['forward_links']:
      forward_linked = app_tables.pages.get(url=url.rstrip('/'))
      if forward_linked:
        forward_linked['backlinks'] += [page['url']]

That’s the backlinks figured out. Now to set up the initial condition. The PageRank calculation gradually refines the answer by a series of iterations, so I must set the initial values to start off with. For newly-discovered pages, I’ll just guess at 0.5. Pages I’ve seen before will have a PageRank from previous runs, so those pages can start with that old value.

@anvil.server.background_task
def ranking_agent():
  calculate_backlinks()
  
  for page in app_tables.pages.search():
    # Initial condition
    if page['rank'] is None:
      page['rank'] = 0.5

Now for the PageRank calculation itself. I expect the calculated values of the PageRanks to converge on the correct solution. So I want to invent a metric to tell if the calculation has stopped changing much. I’ll calculate how much the average page has changed in rank, and if that’s small, my calculation has converged.

  # ... still in the ranking_agent ...

  # Iterate over all pages, and repeat until the average rank change is small
  average_rank_change = 0.
  while average_rank_change > 1.001 or average_rank_change < 0.999:
    # Work out the next PageRank for every page
    new_ranks = calculate_pagerank()

    # Step on the calculation
    average_rank_change = step_calculation(new_ranks)

Inside the loop, I calculate the PageRank for the next iteration, then step the calculation on by putting the values from step i+1 into box i, and calculating the average rank change.

And that’s how to calculate PageRank! I’ve left out two important details. First, the actual PageRank calculation. This is just the equation from above, but written out in Python. The CONSTANT is c and the RATING_SOURCE_FACTOR is E(u) (I’ve assumed it’s the same value, 0.4, for each page).

CONSTANT = 0.7
RATING_SOURCE_FACTOR = 0.4

def calculate_pagerank():
  """Calculate the PageRank for the next iteration."""
  new_ranks = {}
  for page in app_tables.pages.search():
    rank = RATING_SOURCE_FACTOR
    for backlink in page['backlinks']:
      backlinked_page = app_tables.pages.get(url=backlink)
      rank += backlinked_page['rank'] / len(backlinked_page['forward_links'])
    rank = CONSTANT * rank
    new_ranks[page['url']] = rank
  
  return new_ranks

And for completeness, here’s exactly how I step the calculation forward:

def step_calculation(new_ranks):
  """Put 'next rank' into the 'current rank' box and work out average change."""
  num_pages = 0.
  sum_change = 0.
  for page in app_tables.pages.search():
    sum_change += new_ranks[page['url']] / page['rank']
    page['rank'] = new_ranks[page['url']]
    num_pages += 1
  return sum_change / num_pages

To make the search engine order results by PageRank, I just need to use tables.order_by in my Data Tables query:

@anvil.server.callable
def ranked_search(query):
  pages = app_tables.pages.search(
    tables.order_by("rank", ascending=False),
    html=q.full_text_match(query)
  )
  return [{"url": p['url'], "title": p['title']} for p in pages]

If you want to see the code in its natural habitat, you can clone the final app, which contains everything I’ve talked about in this series of posts.

Edit this app

Let’s spin it up and get some ranks!

Crunch those numbers

I built a UI that tracks the progress of the calculation. It polls the Background Task using a Timer component and plots the convergence metric with each iteration. This means you can watch it run and see the calculation gradually converge in real time.

I put some test data together to check that my results make sense intuitively. Consider the four pages (and links) shown in the diagram below:

The PageRank for each page is also shown in the diagram, as calculated by my ranking engine. The page with 3 backlinks has a PageRank of 1.5, the page with 2 backlinks has a PageRank of 0.97, and the pages with 1 backlink each from the same page have rankings of 0.63. This sounds about right. I can tune the spread of these numbers by changing CONSTANT in my code.

Test queries

In Part 1, I tested the basic version of the search engine by running three queries and making judgements about the quality of the results.

Let’s run the same queries again to see how PageRank has improved things.

‘Plots’

‘Plots’ is my example of a fairly generic word that appears a lot in technical writing. There are a few pages that are specifically about plots in Anvil, and I want to see whether they come up.

Overall, the PageRank search seems to do better. Five of the results are specifically about plotting.

The basic search only manages to get three of these into the top ten.

The basic search also included some pages from the middle of the blog. These have been ranked lower by PageRank, so the more-relevant pages have had a fighting chance.

The PageRank search does worse in one respect - the top spot has been taken by the reference docs, replacing the Using Matplotlib with Anvil guide. I’m ranking pages purely based on importance and not on relevance. The reference documentation is clearly ‘more important’ overall than the Matplotlib guide - but not more relevant to a search for ‘plots’.

‘Uplink’

I’m using ‘Uplink’ as an example of a word that’s not likely to be used accidentally - it’s not commonly used in normal speech, so any uses of it are probably about the Anvil Uplink. If you’re not familiar with it, the Uplink allows you to anvil.server.call functions in any Python environment outside Anvil.

There are three relevant pages in the basic search results, and they appear in the PageRank results too. They are Using Code Outside Anvil, Escape Hatches and Ejector Seats and Remote Control Panel. Sadly, these pages have lost position to more ‘major’ pages, the tutorials and reference docs.

The basic search just presents pages in the order it crawled them, so the ranking is ‘more random’ than the PageRank’s importance-based ordering. It looks like the PageRank search is doing worse than chance in this case, because it’s placing pages that have more backlinks at the top regardless of relevance.

We’ve learnt something - even something as powerful as PageRank can be counterproductive in some circumstances.

‘Building a dashboard in Python’

A query with multiple words is harder because it’s difficult to work out which words are the subject of the query, and which words are incidental. I’m using ‘building a dashboard in Python’ to test this. This tripped up the basic search because of the noise introduced by the words ‘building’ and ‘Python’, which are very common on the Anvil site.

The PageRank search did slightly better. The basic search missed Building a Business Dashboard in Python and the Python Dashboard workshop, but they appear in the PageRank search. Again, some minor pages such as page 6 of the blog are now ranked lower, making room for these better-matching pages.

That said, PageRank has put Interactive Plots for Your Apps lower and Accepting Payment with Stripe higher. As with the ‘Uplink’ query, the ranking system does not take page content into account, so sometimes the ‘more random’ ranking of the basic search happens to do better.

So how did it do?

The PageRank search de-ranked pages 2 to 6 of the blog, because they don’t have many backlinks. This made room for other matching pages. Sometimes those pages were good matches, sometimes they were not.

It favoured the main pages such as ‘Tutorials’ and ‘Documentation’ above pages that deal with specific subjects. Older pages were also favoured because pages gradually acquire backlinks over time.

PageRank is much more powerful when used to choose between sites, rather than pages on the same site. A single site is a curated system where there are only a small number of relevant results. The web is an unmanaged jungle where there will be many sites on the same topic, and PageRank is a great way to decide which sites are better than others.

So it’s made things better, but I still need to do more.

Next time: Tokenization and indexing

I need to relate the search rankings to the contents of the page. I’ll take advantage of two trusty Computer Science workhorses: tokenization and indexing.

Once I’ve done that, I think my search engine will be good enough to show to the public!

Read about it here:

Tokenization and indexing


Or sign up for free, and open our search engine app in the Anvil editor:

Get building

Let's Build a Search Engine: How to Run Queries
29th March 2019


Back to: How PageRank Works

I’m building a search engine using Anvil. So far I’ve built a Python web app that crawls the web downloading pages. I’ve also implemented PageRank so my app surfaces the most important pages.

PageRank weeded out minor pages and made room for more important matches. But it doesn’t take page content into account, so the ranking was still a bit hit-and-miss.

I’m currently using Anvil’s full_text_match to perform the search for me. I’m treating it as a black box, which has saved me effort. But it’s time to take matters into my own hands. I’m going to explore more about how the actual search process works.

Tokens

I want to rank pages higher if they have more matching words in them.

My pages are currently just stored as big unrefined strings of characters. It’s time to digest each page and get a precise definition of its meaning. I need to tokenize the page - split it into the elemental parts of language.

Python is great for string processing, so this is really easily accomplished. First I use BeautifulSoup to get rid of the <script> and <style> blocks that don’t tell me much about the page’s meaning. I also discard the HTML tags and punctuation:

from string import punctuation
from bs4 import BeautifulSoup

def tokenize(html):
  soup = BeautifulSoup(html)

  # Remove script and style blocks
  for script in soup.find_all('script'):
    script.decompose()
  for style in soup.find_all('style'):
    style.decompose()

  # Remove HTML
  text = soup.get_text()

  # Remove punctuation
  text = text.translate(str.maketrans('', '', punctuation))

  # Split the string into a list of words
  tokens = text.split()

The last step splits the string into a list of separate words. These are one kind of token, but I want more useful tokens, so I’m going to go a bit further.

I’m going to stem my tokens, and I’m going to remove stop words. I mentioned these techniques when I discussed full_text_match in my first search engine post. To recap:

  • Stemming means each token will be the stem of the word, rather than the word itself. The words ‘build’, ‘building’ and ‘builds’ will all get stored as ‘build’, so they all have the same impact on the search results.
  • Stop words are common words like ‘how’ and ‘to’ that don’t have much bearing on the meaning of a page. I’m going to remove any words from the stop words list used by PostgreSQL’s tsquery.

I’m going to do stemming the way you do anything in Python - import a library that does it all for you! In this case I’m using nltk, the Natural Language Toolkit that has a selection of stemming algorithms.

from collections import Counter
from nltk import PorterStemmer
stemmer = PorterStemmer()

STOP_WORDS = [
  'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 
  'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 
  'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 
  'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 
  'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 
  'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 
  'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 
  'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 
  'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]

def tokenize(html):
  # The code I had before, which got me a list of words, then...
  
  # Stem
  tokens = [stemmer.stem(t) for t in tokens]

  # Remove stop words
  tokens = [t for t in tokens if t not in STOP_WORDS]  
  
  # Count occurrences of each token
  return dict(Counter(tokens))

The final bit of processing in the return statement counts the occurrence of each token in the list. So if my list was originally:

  [
    'application',
    'build',
    'python',
    'web',
    'application',
    'web'
  ]

I end up with:

  {'application': 2, 'build': 1, 'web': 2, 'python': 1}

This is just what I need to improve my search results - a count of each meaningful word in the page. Now I can order the pages based on how many times they mention each query word.

It’s more than that though – it’s also going to let me make my search a lot more scalable. If I ever hope to handle the entire web, I can’t rely on scanning every page for each query. Loading 130 Trillion HTML documents into memory every time somebody searches is a little sub-optimal.

Luckily, I already have everything I need to build a search index.

Scaling search using an index

Think of an index just like an index in the back of a book. If you’re looking for a particular word in a book, you don’t read the entire book hoping to spot it. You look in the back of the book at the index - you find the word, and it tells you which pages it’s on. That’s exactly what I’m going to do with my web pages.

I’ll create a Data Table to list each token I know about alongside the pages it appears on.

To populate this table, I need to run my tokenize function on all the pages I have stored.

@anvil.server.background_task
def index_pages():
  """Build an index to find pages by looking up words."""
  index = {}
  for n, page in enumerate(app_tables.pages.search()):
    # Tokenize the page
    page['token_counts'] = tokenize(page['html'])
    
    # Then organize the tokens into the index
    index = collate_tokens(page, index)
    
  # Now persist the index
  for token, pages in index.items():
    app_tables.index.add_row(token=token, pages=pages)

That second for loop writes the index to the database. It’s much faster to build the index in-memory and persist it all at once than to build it directly into the database.

Here’s how I’m building the index. For each token in the page, I check if there’s an entry for it in the index. If there is, I add this page to it. If not, I create a new row in the index for this token.

def collate_tokens(page, index):
  # For each token on the page, add a count to the index
  for token, count in page['token_counts'].items():
    # Add this count to the index
    entry = [{'url': page['url'], 'count': count}]
    if token in index:
      index[token] += entry
    else:
      index[token] = entry

  return index

I’ve created a UI to launch the indexing process. There’s a button to trigger the Background Task, and I’m keeping track of its progress by plotting the size of the index vs. the number of pages indexed.

The graph is pretty interesting. You would probably expect it to smoothly approach a particular value, “the number of commonly-used English words”, but actually there are a couple of sudden jumps at the beginning. This must be where it processes large pages with many words, so it discovers a large proportion of the English language at once.

Here’s my index in my Data Table. Now I can look up any word stem and instantly get a list of pages it’s on!

This has been remarkably simple to build, so it’s easy to miss how powerful this is. If I wanted to search all the pages on the web without an index, I would have to scan 130 trillion HTML documents, each on average a few kilobytes in size. Let’s say the average size is 10 kB; that’s

130 × 10¹² pages × 10 kB ≈ 1.3 × 10¹⁸ bytes

or 1.3 Exabytes of data. You’d need one million 1 Terabyte hard disks to store that much data, and you couldn’t possibly scan it in time to answer a query.

The upper limit on my index size is much, much smaller. There are about 170,000 words in English, and about 6,500 languages in the world. That gives us an estimate of the size of the index:

170,000 words × 6,500 languages ≈ 1.1 × 10⁹ index entries

And each entry is just a word, average length around 10 bytes. So about 11 Gigabytes of data. And of course, we can store that in an easily searchable order – for example, alphabetically, so we can find records we want quickly. (Anvil’s Data Tables take care of that for us.)

Somewhere in the multiverse, there’s a universe where the laws of mathematics don’t permit indexing. In that universe, the web never took off.

Ranking by match strength

I’m just moments away from my production-quality search engine. I just need to concoct an algorithm that combines the token counts I’ve created with the PageRank data from the previous post. This will rank pages based on both relevance and importance.

Just as I split my pages into tokens, I can split the search query into tokens too:

  query_tokens = tokenize(query)

Then I can look up each query token in the index:

  index_entries = app_tables.index.search(token=q.any_of(*query_tokens))

(If you’d like to understand exactly how I put this line together, check out Querying Data Tables.)

So now I have every index entry that relates to my query. Now we have to work out which of these pages is most relevant! This is where most of Google’s secret sauce is, but we’ll start simple. To compute how relevant each page is to my query, I’ll just sum up the token counts for each page, and use that as the relevance score for the page:

  # Sum up the word count scores
  page_scores = defaultdict(int)
  for index_entry in index_entries:
    for index_page in index_entry['pages']:
      page = app_tables.pages.get(url=index_page['url'])
      page_scores[page] += index_page['count']

(Remember: defaultdict(int) sets unknown keys to have a value of 0, so it’s a great way to calculate sums.)

My page_scores dictionary is just what its name suggests: it maps each page to a total score. This is already a set of ranked search results! But they’re only ranked by match strength. I’ll multiply each value by the PageRank as well.

  # Calculate total scores
  final_scores = {}
  for page, score in page_scores.items():
    final_scores[page] = score * page['rank']

If PageRank is Google’s secret sauce, it’s important to know how much to add. In the real app, I’ve used a slightly more complicated formula that uses a constant to tweak the influence of PageRank vs. the relevance score:

  score * (1 + (page['rank']-1)*PAGERANK_MULTIPLIER)

You can change the PAGERANK_MULTIPLIER to make PageRank more or less important - smaller numbers squeeze the PageRanks towards 1. What value should I use? You could tune it automatically by changing PAGERANK_MULTIPLIER and comparing the search results against training data. I did a manual version of this - I tweaked it and ran queries until I was happy with the results. I went with 1/3.

So now I have a list of pages and a final score for each. I’ll sort in descending order of score, and return only the data I’m interested in - the URL and title of each page, in order.

  # Sort by total score
  pages = sorted(final_scores.keys(), key=lambda page: -final_scores[page])
  
  # ... and return the page info we're interested in.
  return [{"url": p['url'], "title": p['title']} for p in pages]

And that’s my final search function! It takes in the same arguments as my other search functions - the query as a string - and returns the pages in the same format, so I can use it from the same UI as the other search functions. Time to compare results and see if I really have improved things.
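
For reference, stitching those snippets together gives a server function shaped roughly like this. It’s a sketch assembled from the pieces above (with the imports they rely on); the real app, linked at the end of this post, differs in the details:

from collections import defaultdict
import anvil.server
import anvil.tables.query as q
from anvil.tables import app_tables

PAGERANK_MULTIPLIER = 1/3

@anvil.server.callable
def search(query):
  # Tokenize the query the same way the pages were tokenized
  query_tokens = tokenize(query)

  # Look up each query token in the index
  index_entries = app_tables.index.search(token=q.any_of(*query_tokens))

  # Sum up the word-count scores for each matching page
  page_scores = defaultdict(int)
  for index_entry in index_entries:
    for index_page in index_entry['pages']:
      page = app_tables.pages.get(url=index_page['url'])
      page_scores[page] += index_page['count']

  # Blend in PageRank to get the final score
  final_scores = {}
  for page, score in page_scores.items():
    final_scores[page] = score * (1 + (page['rank'] - 1) * PAGERANK_MULTIPLIER)

  # Sort by total score and return just the info the UI needs
  pages = sorted(final_scores.keys(), key=lambda page: -final_scores[page])
  return [{"url": p['url'], "title": p['title']} for p in pages]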

Final search tests

In each of these posts, I’ve been running three test queries to make a (pretty unscientific) judgement about the quality of my search results. Let’s see how the final algorithm stacks up.

‘Plots’

I selected ‘plots’ as a word that would be common in general technical writing. There are a few pages on the Anvil site that are directly related to plotting, so I’m hoping to see them near the top.

And I do! In fact, the new search algorithm knocks it out of the park.

Four of the top five results directly cover how to make plots in Anvil.

Two other results use plots in a big way: SMS Surveys in Pure Python and Using SciPy to work out how many T-shirts I need for a conference.

A user who wants to learn about plots in Anvil is pretty well served by this algorithm!

‘Uplink’

I selected ‘Uplink’ because it has a specific meaning for Anvil, and it’s not a commonly-used word otherwise. (If you’re not familiar with the Anvil Uplink: it allows you to anvil.server.call to and from Python running anywhere.)

The new algorithm does brilliantly at this one too.

The two most relevant pages appear at the top: Using code outside Anvil, followed by Remote Control Panel. The first of these is the Uplink tutorial, so that’s probably what a user searching for ‘Uplink’ is looking for.

The Anvil On-Site Installation page is also in the top five. Among other things, it describes how to connect your app to a Python instance running on your private network, and the answer is ‘Use the Uplink’. That is indeed something somebody interested in the Uplink would like to know.

Overall, a great job.

‘Build a dashboard in Python’

This is the most challenging query to answer because it has multiple words, and only one of them particularly signals what the user is looking for - ‘dashboard’.

Item three in the results is Building a Business Dashboard in Python, so that’s a plus point. It doesn’t appear in the Basic Search results at all.

The other page I’d expect to see, the Python Dashboard workshop, isn’t there on the first page. Most of the other pages seem to be included because they mention ‘build’ and ‘Python’ a lot, rather than because they talk about building dashboards in Python.

A couple of partially-related results from the Basic Search don’t appear in the new search: Interactive plots for your apps and the Remote Control Panel workshop.

So mixed results for the multi-word query. Neither a massive improvement, nor any worse.

So how did it do?

Nearly perfect results for ‘plots’ and ‘Uplink’, but not much improvement for ‘build a dashboard in Python’. Calculating relevance is hard!

I was hoping that pages with a few mentions of ‘build’ and ‘Python’ would get dropped because they only had low counts of those words. But, this being Anvil, there are pages that mention ‘build’ and ‘Python’ a lot.

Next steps

I need to improve my relevance calculation, and to reduce the noise introduced by ‘build’ and ‘python’ in the multi-word query. Some ideas:

  • I could try matching the entire phrase and/or portions of the phrase (N-grams).
  • I could also start looking at the grammar of the query - ‘dashboard’ is the object noun, so I should weight it higher in the results.
  • Something as simple as giving more weight to words that appear in the page title could also make a big difference.

The most promising step, though, is to ask “which words appear more than usual on this page?” instead of “which words appear often on this page?”. I could do that simply by dividing the word frequencies by the average frequency across all pages. That means words like ‘build’ and ‘python’ that show up on lots of pages would be de-emphasised in the results, but the word ‘dashboard’ will still score highly on dashboard-related pages.
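
That change would only be a few lines on top of the token counts I already store. Here’s a rough sketch of the idea (these helpers aren’t in the app – the names are mine), dividing each page’s counts by the corpus-wide average:

from collections import defaultdict

def average_token_counts():
  """Average number of times each token appears per page, across all pages."""
  totals = defaultdict(int)
  num_pages = 0
  for page in app_tables.pages.search():
    num_pages += 1
    for token, count in (page['token_counts'] or {}).items():
      totals[token] += count
  return {token: total / num_pages for token, total in totals.items()}

def relative_token_counts(page, averages):
  """Score each token by how much more often it appears here than usual."""
  return {token: count / averages[token]
          for token, count in (page['token_counts'] or {}).items()}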

The fun thing about playing with search engines is how many things you can come up with to improve the results. I’ve constructed a simple heuristic based on word counts and PageRank, but you can roll things together in so many different ways. You could start analysing media as well as the pure HTML; you could take account of page structure; or you could even track browsing behaviour (anonymised and with permission, of course!).

Why not try some things out for yourself? Feel free to clone the final app and tinker with it to see if you can do better than me. Then get in touch via the Forum to show me and the rest of the Anvil community what you’ve done!

Edit this app

Make it live!

I’m quite pleased with what I’ve achieved. In a few hundred lines of code, I’ve built a search engine that does a really great job of serving single-word queries from the Anvil site. There’s still room for improvement with multi-word queries, but the results are good enough that the user can find the best match somewhere on the first page of results.

I’m ready to make it public.

As with any Anvil app, it’s auto-hosted at a private URL. To give it a public URL, I just enter my desired subdomain of anvil.app in this box:

And now my search engine is hosted at https://search.anvil.app!

(If I wanted to use my own domain, rather than anvil.app, of course I could. That’s the “Add Custom Domain” button in that screenshot.)

Now it’s your turn

That’s the end of my search engine journey. I haven’t quite made a competitor to Google, but I’ve built a working search engine from first principles, and I’ve learnt about some important concepts in the process. Not only that, but I’ve implemented the PageRank algorithm in a few dozen lines of code!

You can see the full source code of my search engine app here:

Edit this app

I hope you’ve learnt something too. And I hope you’ve seen how easy Anvil has made this. Anvil allowed me to describe exactly how my search engine works, and show you the code.

If you’re not already an Anvil user, watch this 7-minute video to learn more. We build a database-backed multi-user web app in real-time and publish it on the internet.

Or sign up for free, and try it yourself:

Get building

Hand-drawn web apps
3rd March 2019

Xkcd-style sketched Web apps

This morning, I was reminded of Jake VanderPlas’s delightful xkcd theme for Matplotlib by Hacker News. As well as being fun to look at, sketch graphs are a great way of signalling “this is a prototype”, or “only the trend here is important”. My immediate thought was, “wouldn’t it be great if we could sketch out web apps as easily?”. Clearly, I wasn’t the only one:

wodenokoto says: "Is there a bootstrap or css theme to go along, so my PoC web-app can match the style of my proof of concept graphs?"

So, on my train ride into London today, I built an XKCD-style sketch theme for building apps in Anvil. Here’s what it looks like:

Animation of building an app with the XKCD sketch theme

This theme is ready to use yourself:

Copy app: XKCD-style theme

Or read on to see how it works…

Step 1: Write some CSS

Normally, you don’t need anything except Python to build an app in Anvil. But if we want, we can drop down and customise the look and feel of our page with HTML and CSS.

Start with a new “Custom HTML” app, with the “Standard Page” layout, so the theme.css in our Assets starts out blank. Let’s give ourselves a more hand-drawn feel:

@import url('https://fonts.googleapis.com/css?family=Patrick+Hand|Patrick+Hand+SC');

body {
  font-family: "Patrick Hand", sans-serif;
  font-size: 18px;
}


input.form-control, textarea.form-control, select.form-control, .btn {
  box-shadow: none;
  border: 2px solid black;
  border-radius: 15px 255px 15px 225px / 225px 15px 255px 15px;
}

/* Make the buttons look a bit more flat */
.btn {
  background-image: none;
  padding: 5px 12px;
}

.btn:active, .btn:hover, .btn:focus {
  border-color: black;
  background-image: none;
  background-color: #e3e3e3;
  outline: none;
}

Step 2: Cards and Headings

As well as primitives like “text”, “input box”, or “drop-down”, applications have higher-level UI elements that should look consistent. For example, we might want to group our components into “cards”. Likewise, we want all our headings to match each other.

We can make these UI elements available in the Toolbox, so we can use them with the visual designer:

Creating a card with a heading

To do this, we don’t create new components; we create new “roles” for existing components. A “Card” is a panel component (ColumnPanel) with a border and drop-shadow; and a “Heading” is a Label component with a different font and bigger text.

We use the Roles editor to create roles called card and heading (applicable to ColumnPanels and Labels respectively), and make them available in the Toolbox. Then we just need to write the CSS:

.anvil-role-card {
  border: 2px solid black;
  padding: 5px;
  border-radius: 125px 10px 20px 185px / 25px 205px 205px 25px;
  box-shadow: 2px 2px 5px 0 rgba(0,0,0,.2);
}

.anvil-role-heading .label-text {
  font-family: "Patrick Hand SC", sans-serif;
  font-size: 26px;
  padding: 13px 0 0;
}

Step 3: HTML scaffolding

Often pages have common structure - a title bar, some navigation links, etc. Rather than building this every time with components, our theme can provide HTML scaffolding. This scaffolding can include drag-and-drop points for creating special elements like titles or navigation links. It then typically has space for a standard ColumnPanel to lay out the main page content.

We edit standard-page.html (also in our Assets) to define a “title” slot and a “nav-links” slot, with placeholders for each. We use anvil-drop-slot= attributes to make drop zones: if you drop components onto these elements, they will be placed in the designated slot:

<div class="title-bar">
  <div class="title">
    <div class="placeholder" anvil-if-slot-empty="title" anvil-drop-slot="title">
      Drop a title here
    </div>
    <div anvil-slot="title"></div>
  </div>

  <div class="nav-links">
    <div anvil-slot="nav-links"></div>
    <div class="placeholder" anvil-if-slot-empty="nav-links"
         anvil-drop-slot="nav-links">
      Drop a FlowPanel here
    </div>
  </div>
  <div style="clear:both"></div>
</div>

And finally, we set up the default drop behaviour. If the mouse isn’t directly over a drop zone, we drop into the nearest container:

<div anvil-drop-default anvil-drop-container=".anvil-container"></div>

And there we have it!

We now have an Anvil theme for putting together xkcd-style web apps, super quickly and with nothing but Python. Combine it with the xkcd matplotlib style (it’s now built into matplotlib, and you can use that with Anvil) to create beautifully sketchy-looking applications.
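
If you haven’t combined the two before, here’s a minimal sketch of a server function that renders an xkcd-style Matplotlib plot and hands it back as a Media object for an Image component (this assumes the standard anvil.mpl_util helper; adapt to taste):

import anvil.server
import anvil.mpl_util
import matplotlib.pyplot as plt

@anvil.server.callable
def sketchy_plot():
  # plt.xkcd() switches Matplotlib into its built-in hand-drawn style
  with plt.xkcd():
    plt.figure()
    plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
    plt.title("A very scientific graph")
  # Return the current figure as an Anvil Media object for an Image component
  return anvil.mpl_util.plot_image()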

You can see this app’s source code too:

Copy app: Plotting Example

To find out more about Anvil, check out our tutorials. For the gory details on how to create your own custom themes that work with Anvil’s visual designer, check out the reference docs.
