What I’m trying to do:
I’m parsing a PDF into HTML and saving the different stages of that process to tables in Anvil.
I’m getting this error:
anvil.tables.TableError: Internal database error: ERROR: unsupported Unicode escape sequence Detail: \u0000 cannot be converted to text. Where: JSON data, line 1: …histologische Wachstumsmuster</li></ul></ul>",…
The text is part of the PDF.
What is happening and how can I ensure it will never happen? The user gets to upload whatever PDF they want and I need it to never break on me. I extract the text from the PDFs using pdfminer.six and then make a bunch of manipulations and changes.
I found information on the error elsewhere and fixed it as suggested there:
loading data report error: “unsupported Unicode escape sequence, \u0000 cannot be converted to text.”
Issue
When loading certain JSON data into GPDB v6 (PostgreSQL v9.4), the following error is reported:
ERROR: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: {“data”:…
Resolution
This is due to a design change in PostgreSQL v9.4 (GPDB v6). A PostgreSQL community post discusses this change; the relevant exchange is quoted below:
Source:
Matthew Byrne writes: Are there any plans to support \u0000 in JSONB and, relatedly, UTF code point 0 in TEXT?
No. It’s basically never going to happen because of the widespread use of C strings (nul-terminated strings) inside the backend. Making \0 a legal member of strings would break all those internal APIs, requiring touching far more code than anyone would want to do. It’d likely break a great deal of client-side code as well.
The GPDB R&D team also confirmed that this usage reports the same error even in PostgreSQL v12, and that there are no plans to change this design.
However, we do have a workaround, listed below:
SELECT (regexp_replace(the_string::text, '\\u0000', '', 'g'))::json;
https://community.pivotal.io/s/article/loading-data-report-error-unsupported-Unicode-escape-sequence?language=en_US
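On the Anvil side, the same idea can be applied in Python before anything is written to a table. The following is only a minimal sketch. It assumes the offending character is a literal NUL (`\x00`) in the text extracted from the PDF, which Python's `json` module serializes as the `\u0000` escape. The helper `strip_nul` and the sample string are hypothetical and not part of pdfminer.six or Anvil.

```python
import json
from typing import Any

NUL = "\x00"  # the character PostgreSQL refuses to store in text/JSON


def strip_nul(value: Any) -> Any:
    """Recursively remove NUL characters from strings in nested data.

    Hypothetical helper: dicts and lists are rebuilt, tuples come back
    as lists, and non-string leaves are returned unchanged.
    """
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [strip_nul(item) for item in value]
    return value


# Made-up sample resembling the HTML fragment from the error message.
raw_html = "<li>histologische Wachstumsmuster\x00</li>"

# The NUL shows up as the \u0000 escape once the string is JSON-encoded.
has_escape_before = "\\u0000" in json.dumps(raw_html)

clean_html = strip_nul(raw_html)
has_escape_after = "\\u0000" in json.dumps(clean_html)

print(has_escape_before, has_escape_after)
```

Running every value through a helper like this just before saving it (for example, right before the `add_row` call on the table) should keep the NUL from ever reaching the database, whatever PDF the user uploads.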