ValueError: signal only works in main thread of the main interpreter

Hello Team,

As you know, I’m trying to make a simple web scraping app using Scrapy.

But I got this error and I don’t know what this means:

ValueError: signal only works in main thread of the main interpreter
at /usr/local/lib/python3.10/signal.py:56
called from /home/anvil/.env/lib/python3.10/site-packages/twisted/internet/base.py:1281
called from /home/anvil/.env/lib/python3.10/site-packages/twisted/internet/posixbase.py:142
called from /home/anvil/.env/lib/python3.10/site-packages/scrapy/utils/ossignal.py:19
called from /home/anvil/.env/lib/python3.10/site-packages/scrapy/crawler.py:356
called from worldometers_spider, line 34
called from Form1, line 17

Here’s a screenshot for better understanding:

Appreciate any help from this community.

Thanks,
Joey

I’m wildly guessing here, very wildly, because I don’t know how Anvil works, how Scrapy works, or what you are doing.

The Anvil server runs on the main thread of a Python process. When it receives a new request, it processes it on another thread (either one that already exists and is waiting to be reused, or one created on demand). At any point in time there are many threads running: the main one managing the whole shebang, one thread per request being processed, one per background task, etc.

Scrapy does something similar: the main thread manages a bunch of worker threads and delegates scraping jobs to each of them.

Now imagine if the main thread for Scrapy were running in one of the Anvil request threads, and the main thread of Anvil decided to kill that thread, for example because it exceeded the 30-second timeout. Scrapy would not appreciate that!

I’m guessing that Scrapy wants to make sure that its main thread, the one that manages the other scraper threads, is the main thread of the Python process, because it wants to make sure everything is under control.
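
One concrete bit I’m fairly sure about: the error message itself comes from a plain Python rule, not from Anvil or Scrapy. signal.signal() may only be called from the main thread of the main interpreter, and your traceback shows Scrapy’s crawler.py going through its ossignal helper straight into signal.py, so it looks like Scrapy tries to install shutdown signal handlers when it starts. A minimal sketch, with no Anvil or Scrapy involved, that reproduces the same ValueError:

```python
import signal
import threading

def install_handler():
    # Python only allows this call from the main thread of the main
    # interpreter; from any other thread it raises
    # "ValueError: signal only works in main thread of the main interpreter"
    signal.signal(signal.SIGTERM, lambda signum, frame: None)

worker = threading.Thread(target=install_handler)
worker.start()
worker.join()  # the worker thread dies with the ValueError shown above
```

Anvil runs your server code on a worker thread, so anything that tries to install a signal handler from there hits that rule.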

If that’s the case, I’m afraid you are out of luck. I don’t think you can run a second Python process on the Anvil server alongside the Anvil server process itself (unless you are running the open source server on your own machine).

I’m sure some of the things I mentioned are wrong, but chances are I got the big picture right. You can use it as a starting point for some further research.
Good luck!

Thanks for taking a shot @stefano.menci ! Really appreciate it.

I think it’s a good idea to hear from @meredydd if there’s a way to use Scrapy in Anvil.

Server-side environments will always be constrained, to some degree, because they run on shared hardware, and are managed alongside numerous similar environments.

But that’s not the only option.

Uplink programs are not so constrained. Developers often turn to them for access to things that they directly control.

In my case, I’m running a custom compute engine, on my own PCs.

In your case, you could run your own Python installation with Scrapy installed, and use it from there.
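
A minimal sketch of what an Uplink script looks like (the key and function name are placeholders):

```python
# Runs on your own machine, in a Python environment you control.
import anvil.server

anvil.server.connect("YOUR_UPLINK_KEY")  # placeholder: your app's Uplink key

@anvil.server.callable
def scrape_site(start_url):
    # Called from Anvil code via anvil.server.call("scrape_site", url).
    # This function runs in a local process you fully control, so Scrapy
    # (and anything else installed on this machine) is available to it.
    ...

anvil.server.wait_forever()  # keep the connection to Anvil open
```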


And, just to be clear, Uplink is not limited to running something locally on your own PC. You can package the Scrapy code as a cloud-hosted web service such as in this tutorial: Deploy a Google Colab Notebook with Docker


If I were trying to do what I think you are trying to do, I would get a working system running first, and treat moving it from a ‘development’ machine to a production environment as step 2.

Here is how I would structure it:

1. Create an Anvil Uplink script on your machine. This script would have a callable server function that takes information from the Anvil server on what to crawl. You could either pass a data structure, or some kind of unique ID from a column in a request table you built.
2. Instead of doing the crawling itself, the function in the Uplink script would spawn a new Python process to do the crawling, using Scrapy installed on your machine (see the sketch after this list).
3. You would pass that process the information it needs; if it is a table ID, the new process could log into Uplink and gather what is to be scraped from that request table in Anvil Data Tables.
4. Once it has the desired results from Scrapy, it would write the information directly back to an Anvil data table, and the new process would then close.
5. The Anvil Uplink script with the callable function would remain running.
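
A rough sketch of those steps (the uplink key, file names, table names, and columns are all made up for illustration):

```python
# uplink_host.py - long-running Uplink script on your machine (names are illustrative).
import subprocess
import sys

import anvil.server

anvil.server.connect("YOUR_UPLINK_KEY")  # placeholder

@anvil.server.callable
def start_crawl(request_id):
    # Don't crawl in this thread: hand the job to a fresh Python process,
    # so Scrapy gets to run in the main thread of its own interpreter.
    subprocess.Popen([sys.executable, "run_spider.py", str(request_id)])
    return "started"

anvil.server.wait_forever()
```

```python
# run_spider.py - the short-lived worker process spawned above (illustrative).
import sys

import anvil.server
from anvil.tables import app_tables

anvil.server.connect("YOUR_UPLINK_KEY")  # placeholder

request_id = sys.argv[1]
request = app_tables.requests.get(request_id=request_id)  # hypothetical 'requests' table

# ... run Scrapy here; this process's main thread is all yours ...
results = {"status": "done"}  # stand-in for whatever the spider collected

# Write the results straight back to a (hypothetical) 'results' data table, then exit.
app_tables.results.add_row(request_id=request_id, data=results)
```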

On the Anvil side, something would check for the results until it found them in the data table.
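
For that checking step, a small server function that the client polls (for example from a Timer component) is enough on the free tier; table and column names are again made up:

```python
# Anvil server module (table and column names are illustrative).
import anvil.server
from anvil.tables import app_tables

@anvil.server.callable
def get_crawl_results(request_id):
    # Returns None until the worker process has written its row.
    row = app_tables.results.get(request_id=request_id)
    return row["data"] if row is not None else None
```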

After you get all of that working, you can look into containerizing it, moving it to the cloud, etc. for production.
If you are doing lots of scrapes, you may want each container spin-up to use a different VPN connection anyway, so you don’t end up with blocked IPs; or you could just try not to do anything that would get on the bad side of a web server’s ruleset (aka being nice).
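
For the “being nice” side of that, Scrapy already has settings for it; these are example values only, tune them to the sites you crawl:

```python
# settings.py - polite-crawling settings (example values).
ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 2                  # wait a couple of seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # don't hammer a single host
AUTOTHROTTLE_ENABLED = True         # back off automatically if responses slow down
```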

Edit:
I just wanted to add that many of the ‘keeping track of, and spinning off’ tasks can be handled using the Background Tasks and Scheduled Tasks features in Anvil, to avoid the 30-second limit while you wait for your scraping information to return. I did not include this above because I was trying to stay within the bounds of a free-tier user. If you are a paid user, you will have access to no-time-limit background tasks and scheduled tasks.
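
For the paid-tier route, a sketch of what the waiting part might look like as a background task (function, table, and column names are made up):

```python
import time

import anvil.server
from anvil.tables import app_tables

@anvil.server.background_task
def wait_for_crawl(request_id):
    # Background tasks aren't subject to the 30-second call timeout,
    # so this can poll until the uplink worker writes its row.
    while app_tables.results.get(request_id=request_id) is None:
        time.sleep(10)
    anvil.server.task_state["done"] = True  # let the app see that the crawl finished
```

You would kick it off with anvil.server.launch_background_task("wait_for_crawl", request_id) right after calling the uplink function.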


Thank you @p.colbert @hugetim @ianbuywise :pray:t2:
