2

I am working on invoice parsing using the invoice2data library. The library ships with predefined YAML templates for parsing invoices, but when I run the samples I get a YAML parsing error for all of the templates.

Running it as:

invoice2data --input-reader tesseract FlipkartInvoice.pdf

Exception:

Traceback (most recent call last):
  File "/home/webwerks/.local/bin/invoice2data", line 10, in <module>
    sys.exit(main())
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/main.py", line 191, in main
    templates += read_templates()
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/extract/loader.py", line 88, in read_templates
    tpl = ordered_load(template_file.read())
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/extract/loader.py", line 36, in ordered_load
    return yaml.load(stream, OrderedLoader)
  File "/usr/local/lib/python3.5/dist-packages/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0082: special characters are not allowed
  in "<unicode string>", position 312

The last line says:

  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0082: special characters are not allowed
  in "<unicode string>", position 312

I have checked the templates; all of them are valid UTF-8.
The issue seems to be with the python-yaml package. Has anyone encountered this issue?

Anthon
Rajesh Gosemath

2 Answers

4

That your input is valid UTF-8 is irrelevant: a YAML source may only contain a subset of Unicode code points, independent of whether it is encoded as UTF-8 or something else.

In particular, YAML only supports the printable subset of Unicode. The old YAML 1.1 specification, the one that PyYAML supports, elaborates on that:

The allowed character range explicitly excludes the surrogate block #xD800-#xDFFF, DEL #x7F, the C0 control block #x0-#x1F (except for #x9, #xA, and #xD), the C1 control block #x80-#x9F, #xFFFE, and #xFFFF. Any such characters must be presented using escape sequences.

So the non-printable "BREAK PERMITTED HERE" code point, 0x0082, is clearly disallowed (and is not one of those things PyYAML should allow, but doesn't).
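To see which characters in a template fall outside that range, something like the following sketch may help. It uses only the standard library; the pattern is an approximation of the ranges quoted above (with NEL, #x85, treated as allowed), and `find_disallowed` and the sample text are hypothetical names made up for illustration, not part of PyYAML or invoice2data.

```python
import re

# Characters outside the YAML 1.1 printable set. This approximates the
# ranges quoted above; NEL (#x85) is treated as allowed.
DISALLOWED = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_disallowed(text):
    """Return (position, line, column, code point) for each disallowed character."""
    hits = []
    for match in DISALLOWED.finditer(text):
        pos = match.start()
        # Lines and columns are 1-based, counted from the preceding newline.
        line = text.count("\n", 0, pos) + 1
        column = pos - (text.rfind("\n", 0, pos) + 1) + 1
        hits.append((pos, line, column, "#x%04X" % ord(match.group())))
    return hits


# Hypothetical template text containing the code point from the error message.
sample = "issuer: Flipkart\nprice: \x82 100\n"
print(find_disallowed(sample))
```

Running this over the decoded contents of each template file should point at the same position that the `ReaderError` reports, along with the line and column.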

Anthon
  • @Androidjack-RajeshGosemath Open an issue with whatever generates the YAML (I think tesseract in this case) saying it generates invalid YAML. – flyx Apr 29 '19 at 14:14
  • @Androidjack-RajeshGosemath Remove the offending Unicode code point from your input before loading. You should probably use `safe_load()` anyway instead of `yaml.load()`, so you can do `yaml.safe_load(template_file.read().replace(u'\x82', ''))` – Anthon Apr 29 '19 at 14:29
  • If the character appears within a double quoted scalar, you might consider replacing it with `\u0082`, or you can check the templates that generate the YAML and make sure the character is removed at the source. – Anthon Apr 29 '19 at 14:34
  • It would be a LOT better if it could specifically report an "unprintable UTF-8" character, rather than generate a Python hissy-fit. At least then it could report EXACTLY where and in which file the character appeared... Still room for improvement! – anthony Jun 17 '22 at 04:51
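The removal suggested in the comments can be generalized from the single `\x82` to every disallowed code point. This is only a sketch: `strip_disallowed` is a hypothetical helper, the pattern approximates the YAML 1.1 printable range, and silently dropping characters may hide a real encoding problem in the template.

```python
import re

# Approximation of the characters outside the YAML 1.1 printable range.
DISALLOWED = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_disallowed(text):
    """Remove every character matched by the DISALLOWED pattern."""
    return DISALLOWED.sub("", text)


# Hypothetical template fragment with a stray U+0082 inside a quoted scalar.
raw = 'currency: "\x82EUR"\n'
cleaned = strip_disallowed(raw)
print(repr(cleaned))

# With PyYAML installed, the cleaned text can then be parsed, for example:
#   import yaml
#   data = yaml.safe_load(cleaned)
```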
0

It could be that your LANG environment variable is set incorrectly. On some operating systems the default LANG is en_US.ASCII, so a byte such as 0x82 that is valid within UTF-8 text is not recognized as a valid ASCII character, which would cause this issue.

The solution is simply to set LANG to en_US.UTF-8.
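As a rough illustration of how the locale can matter, the sketch below prints the encoding Python typically uses for text files when none is given, and shows one way U+0082 can appear: the UTF-8 bytes of the euro sign decoded with a single-byte codec. That a Latin-1 style locale is the cause here is an assumption, not something the traceback confirms.

```python
import locale
import os
import tempfile

# The encoding open() typically falls back to when none is passed.
print(locale.getpreferredencoding(False))

# The euro sign is three bytes in UTF-8: E2 82 AC.
euro_bytes = "\u20ac".encode("utf-8")

# Decoded with the wrong single-byte codec, the middle byte becomes U+0082.
wrong = euro_bytes.decode("latin-1")
right = euro_bytes.decode("utf-8")
print(repr(wrong), repr(right))

# Passing the encoding explicitly makes reading independent of LANG.
with tempfile.NamedTemporaryFile(delete=False, suffix=".yml") as handle:
    handle.write(b"currency: " + euro_bytes + b"\n")
    path = handle.name
try:
    with open(path, encoding="utf-8") as template_file:
        text = template_file.read()
finally:
    os.remove(path)
print(repr(text))
```

If the first line printed is not a UTF-8 encoding, either setting LANG as described above or opening the templates with an explicit `encoding="utf-8"` should avoid the mis-decoding.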

He David