
I read each page of a PDF and appended the XML extract from Page.get_text("xml") to a single string variable. The resulting text consists of many units like this:

<page id="page0" width="595.276" height="841.89">\n<block bbox="84.95639 235.90979 382.4564 316.3398">\n<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">\n<font name="AkkuratPro-Bold" size="35">

I understand that these are bounding boxes around the text, and the documentation says this output is best parsed with lxml, so I tried the following:

from lxml import etree

root = etree.fromstring(texts)

and got the following error:

Traceback (most recent call last):

  File "C:\Users\z34534534\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-11-209409d1172d>", line 3, in <module>
    root = etree.fromstring(texts)

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 196
XMLSyntaxError: Extra content at the end of the document, line 196, column 2

I would really like to know the correct way to parse this output with lxml and use the bounding boxes to get the text out of the PDF document.
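For context, "Extra content at the end of the document" is what an XML parser reports when it finds more markup after the root element has closed. Since the string above is a concatenation of one <page> element per page, a plausible cause is that it has several roots. Below is a minimal sketch of parsing one page at a time instead. The function name lines_with_bbox and the sample string are hypothetical, and the <char> elements with a "c" attribute are an assumption about the full output, since the excerpt above stops at <font>.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library module covers the calls used below.
    import xml.etree.ElementTree as etree


def lines_with_bbox(page_xml):
    """Return (bbox, text) pairs for every <line> in one page's XML."""
    root = etree.fromstring(page_xml)  # exactly one <page> root per call
    results = []
    for line in root.iter("line"):
        # bbox is stored as four space-separated numbers.
        bbox = tuple(float(v) for v in line.get("bbox").split())
        # Assumption: each character is a <char> element with a "c" attribute.
        text = "".join(ch.get("c", "") for ch in line.iter("char"))
        results.append((bbox, text))
    return results


# Hand-written sample shaped like the excerpt above (not real extractor output).
sample_page = (
    '<page id="page0" width="595.276" height="841.89">'
    '<block bbox="84.95639 235.90979 382.4564 316.3398">'
    '<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">'
    '<font name="AkkuratPro-Bold" size="35">'
    '<char c="H"/><char c="i"/>'
    '</font></line></block></page>'
)

# Keep one string per page and parse each on its own,
# rather than joining them into a single string first.
pages_xml = [sample_page, sample_page.replace("page0", "page1")]
for page_xml in pages_xml:
    print(lines_with_bbox(page_xml))
```

With a real document, pages_xml would presumably be built by calling Page.get_text("xml") on each page and collecting the results in a list, so that every string holds a single <page> root.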
