Extracting PDF Tables using Tabula-Py

jim · April 1, 2020, 6:25am

I am trying to extract some tables from PDF files using the library tabula-py (thanks for installing!).

I can save PDF files as media objects in a data table, and I was following this post to work out how to manipulate the files. However, the simplest approach looks to be extracting tables directly from the URL, using the following syntax:

import tabula

pdf_url = 'https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/documentos/Actualizacion_61_COVID-19.pdf'
test_df = tabula.read_pdf(test_pdf_url)[0]
print(test_df)

This code works perfectly in a Colab notebook. However, when I try it in Anvil I get a Java error (tabula-py is actually a wrapper around tabula-java):

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', '/usr/local/lib/python3.7/site-packages/tabula/tabula-1.0.3-jar-with-dependencies.jar', '--guess', '--format', 'JSON', '/tmp/c46cadd9-8a0f-4050-a284-29444a2129c6.pdf']' returned non-zero exit status 1.
at /usr/local/lib/python3.7/subprocess.py, line 512
  called from /usr/local/lib/python3.7/site-packages/tabula/io.py, line 85
  called from /usr/local/lib/python3.7/site-packages/tabula/io.py, line 322

I have tried a number of different options and approaches but get the same error. Any help would be greatly appreciated, thanks! If it helps, you can see the available options by using:

help(tabula.io.build_options)

bridget · April 3, 2020, 6:05pm

Hi @jim,

The code you’ve provided above is incomplete. Could you let me know where test_pdf_url is coming from?

When I edit the code you’ve provided to use pdf_url rather than test_pdf_url, the code is working for me.

Making some guesses here, it looks like there might be an issue with a temporary file, but without knowing how you’re generating that file, I can’t help any more than that, I’m afraid.
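If the error does come back, something along these lines might help narrow it down. It is only a rough sketch: it downloads the PDF to a local file first, so the temporary file is one you control, and it prints whatever Java wrote to stderr instead of just the exit status. The names describe_failure and read_first_table are made up for illustration, and I am assuming the installed tabula-py version attaches stderr to the CalledProcessError, which may not be the case (the helper falls back to a placeholder if it is missing).

```python
import os
import subprocess
import tempfile
import urllib.request


def describe_failure(exc):
    """Summarise a CalledProcessError, including stderr if it was captured."""
    stderr = exc.stderr
    if isinstance(stderr, bytes):
        stderr = stderr.decode("utf-8", errors="replace")
    detail = (stderr or "").strip() or "(no stderr captured)"
    return f"exit status {exc.returncode}: {detail}"


def read_first_table(url):
    """Download the PDF to a local file, then pass the path to tabula-py."""
    # Imported here so describe_failure can be used without tabula-py installed.
    import tabula

    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = os.path.join(tmp_dir, "input.pdf")
        with urllib.request.urlopen(url) as response, open(pdf_path, "wb") as f:
            f.write(response.read())
        try:
            tables = tabula.read_pdf(pdf_path)
        except subprocess.CalledProcessError as exc:
            # Assumption: stderr holds the message from tabula-java.
            print(describe_failure(exc))
            raise
    # Like the original snippet, this treats the result as a list of DataFrames.
    return tables[0] if tables else None
```

Calling read_first_table(pdf_url) in place of the tabula.read_pdf(...)[0] line should behave the same when things work, and give a more useful message when they don't.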

jim · April 6, 2020, 3:33pm

Thanks @bridget, I tried it just now and for some reason it now works. No idea why but marvellous, thanks.