Trying to read table-stored .xlsx files using openpyxl

dconnell · April 14, 2022, 12:38am

What I’m trying to do:
I create .xlsx files using openpyxl, they are stored as .xlsx files, they download and open in Excel with no issue, we use these all day. These Workbooks have multiple Worksheets, I’d like to later refer to/open these and ‘read’ specific cell values.

What I’ve tried and what’s not working:
I tried xlrd, no dice… it does not handle .xlsx format. I’m calling the server function below and I’m getting a messy error, I think I’m close but not quite packaging or decoding the stored Workbook properly.

Code Sample:

@anvil.server.callable
def read_xlsx(rfa_id):
  jobs = app_tables.jobs.search(rfa_id = rfa_id)
  if len(jobs) > 1:
    for job in jobs:
      jobs = sorted(jobs,key = lambda i: i['post_datetime'], reverse=True)
      job = jobs[0]
  else:
    job = jobs[0]

    
  wb = job['download'].get_bytes()
  
  workbook = load_workbook(wb)
  print(f"Worksheet names: {workbook.sheetnames}")
  sheet = workbook.active
  print(sheet)
  print(f"The title of the Worksheet is: {sheet.title}")

I’m getting this error:

InvalidFileException: openpyxl does not support b’.xmlpk\x05\x06\x00\x00\x00\x00\x10\x00\x10\x00(\x04\x00\x00\xf2s\x00\x00\x00\x00’ file format, please check you can open it with Excel first. Supported formats are: .xlsx,.xlsm,.xltx,.xltm at /usr/local/lib/python3.7/site-packages/openpyxl/reader/excel.py, line 94 called from /usr/local/lib/python3.7/site-packages/openpyxl/reader/excel.py, line 124 called from /usr/local/lib/python3.7/site-packages/openpyxl/reader/excel.py, line 316 called from [PDFServerModule, line 471](javascript:void(0)) called from [Dev_Starter, line 45](javascript:void(0))

Clone link:
Unfortunately I don’t have/cannot see to create a separate app so as not to post sensitive data

danbolinson · April 14, 2022, 2:49pm

Are you trying to read tabularized data or data from specific cells?

If reading tabular data pandas.read_excel has the best (fastest and most robust) implementation I’ve come across. I’m sure it’s something else under the hood…
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

If you need to read specific cells, then I’ve used openpyxl with .xlsx files in the past. I’m not sure why you’d get that error you are seeing there. I wonder if this is actually encoded in .xls or something… I would be curious to see if the issue was something to do with how you’re saving the file. Does this work if you try a .xlsx file you have saved FROM Excel?

EDIT: Looking at your code again, I think that openpyxl load_workbook expects a file-type object, not a bytes string. Can you load it into an io.BytesIO first?

import io
bytestream = io.BytesIO(job['download'].get_bytes())
bytestream.seek(0)
workbook = load_workbook(bytestream)

Edit2: Corrected code above per feedback from @dconnell

dconnell · April 14, 2022, 3:47pm

Thanks so much for the thoughtful response. I used your code above but added a bit that seems to be working:

  file = job['download'].get_bytes()
  bytestream = io.BytesIO(file)
  bytestream.seek(0)
  workbook = load_workbook(bytestream)
  print(workbook.sheetnames)

danbolinson · April 14, 2022, 4:02pm

Thanks, I will correct my post!

p.colbert · April 14, 2022, 9:51pm

According to
https://pandas.pydata.org/docs/getting_started/install.html

Pandas uses the following libraries for Excel access:

Excel files¶

Dependency	Minimum Version	Notes
xlrd	2.0.1	Reading Excel
xlwt	1.3.0	Writing Excel
xlsxwriter	1.2.2	Writing Excel
openpyxl	3.0.3	Reading / writing for xlsx files
pyxlsb	1.0.6	Reading for xlsb files

Notice that Pandas is using openpyxl to write and read .xlsx files. So you should be able to do the same.