XML parsing error (invalid token) caused by PDF

Question

A colleague of mine filled in a dynamic PDF form, saved it, and sent it to me. However, probably because of some unusual symbol in it, the file would not open on either my colleague's PC or mine. It gave the error "XML parsing error: not well-formed (invalid token) (error code 4)". The document contained a lot of important information, so I really needed a way to recover it.

I tried many of the commonly recommended fixes, such as:

Upgrading the official Adobe Acrobat Reader to the latest version, then repairing the installation.
Opening the file with other software, such as Foxit Reader and general document tools (LibreOffice, Notepad, Sublime, etc.).
Opening it with Adobe LiveCycle Designer, the software with which (I suppose) this form was created.
Using different PDF-to-text libraries (written in Python). Because the form was dynamic, this approach did not work.
Posting on the official Adobe Support website (yes, that's the only way to get help from Adobe when using the free versions of its software).

However, none of these produced any result.

The only thing that partly succeeded was opening the PDF with the default Windows Notepad. It showed XML-formatted text, but most of the content was encoded (the gist shows a small part of the encoded content at the end, but there is much more). It looked something like this:

%PDF-1.6
%âãÏÓ
1 0 obj
<</AcroForm 59 0 R/MarkInfo<</Marked true>>/Metadata 2 0 R/Names 60 0 R/Pages 235 0 R/Type/Catalog/Perms 233 0 R/StructTreeRoot 243 0 R/NeedsRendering true>>
endobj
2 0 obj
<</Length 4114/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c005 78.150055, 2013/08/07-22:58:47        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:desc="http://ns.adobe.com/xfa/promoted-desc/">
         <dc:format>application/pdf</dc:format>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>DAAD</rdf:li>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">PBF: Gutachtenformular</rdf:li>
            </rdf:Alt>
         </dc:title>
         <pdf:Producer>Adobe XML Form Module Library</pdf:Producer>
         <xmp:CreateDate>2008-08-14T09:56:29+02:00</xmp:CreateDate>
         <xmp:CreatorTool>Adobe LiveCycle Designer ES 10.4</xmp:CreatorTool>
         <xmp:MetadataDate>2017-03-17T09:14:06+01:00</xmp:MetadataDate>
         <xmp:ModifyDate>2017-03-17T09:14:06+01:00</xmp:ModifyDate>
         <xmpMM:DocumentID>uuid:d62a53c0-8974-4b14-888e-569579f416d8</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:c097e78e-1dd1-11b2-0a00-9e91daf58acd</xmpMM:InstanceID>
         <desc:embeddedHref rdf:parseType="Resource">
            <rdf:value>G:\Z2\00- Verbindliche Formulare, Vorlagen\___Logo_fuer_Formulare_06_2015\DAAD_Globe_Logo-Supplement_eng_tl_rgb_300dpi.jpg</rdf:value>
            <desc:ref>/template/subform[1]/pageSet[1]/pageArea[1]/draw[2]</desc:ref>
         </desc:embeddedHref>
         <desc:Schema-Anmerkung rdf:parseType="Resource">
            <rdf:value>16 byte UUID in 32 chars (hexadecimal encoded)</rdf:value>
            <desc:ref>/template/subform[1]/subform[1]/field[1]</desc:ref>
         </desc:Schema-Anmerkung>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
                                                                                                    
                                                                                                    
 
<?xpacket end="w"?>
endstream
endobj
214 0 obj
<</Filter[/FlateDecode]/Length 419>>stream
H‰¼“[kÂ0Çßýg}²LŠ¦àæC7'†nÞØžB°§.,¶¥IÕáüîKÓ8[´2˜¬”^’ÿ¹äwÎa>Tåg„¡_]û”°@HÊ9z6t:`%‡>Ð³àërº%Æ‚…Á1UnnáÊiØ•M

I tried many different decoding tools, with no success.

Can you share the pdf in question for inspection? Furthermore, how exactly does the [tag:python] tag relate to your question? Does that only refer to the use of *"different PDF2text libraries (written in Python)"* or does it indicate you hope for a solution in python? — mkl, Aug 19 '17 at 12:58
@mkl, I was writing question-answer post, so as my solution is based on Python code - this tag is included. I will upload PDF — techkuz, Aug 19 '17 at 15:52

score 0 · Accepted Answer · edited Jun 20 '23 at 21:25

The encoded parts are streams compressed with the FlateDecode filter (zlib/deflate), so they have to be decompressed rather than decoded as text. There is a working solution written by Stephen Haywood. I checked that it works in Python 2. Just change the PDF file name to yours and run the script from a terminal with the python command. Here is the gist.

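#!/usr/bin/env python
# Decompress every FlateDecode stream in a PDF and print the result.
import re
import zlib

# Read the file as raw bytes; replace some_doc.pdf with your file name.
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Bytes pattern (so it also runs on Python 3): an object dictionary that
# mentions FlateDecode, followed by the stream body.
stream = re.compile(br'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('utf-8', errors='replace'))
        print("")
    except zlib.error:
        # Not a valid zlib stream (for example, other filters applied); skip it.
        pass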
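
As a variation on the script above, here is a minimal sketch that wraps the same idea in a function and keeps only the decompressed streams that look like XML, since a dynamic (XFA) form generally stores its field data as XML. The names `extract_flate_streams` and `looks_like_xml`, and the tiny in-memory sample, are hypothetical and made up for illustration; they are not part of the original gist. It uses `zlib.decompressobj` instead of `zlib.decompress` so that stray bytes after the compressed data (such as the line ending before `endstream`) do not cause a failure.

```python
import re
import zlib

# Capture everything between a "stream" keyword (plus its line ending)
# and the next "endstream" keyword.
STREAM_RE = re.compile(br'stream\r?\n(.*?)endstream', re.S)


def extract_flate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # A decompress object puts trailing bytes into unused_data
            # instead of raising, unlike zlib.decompress().
            inflater = zlib.decompressobj()
            results.append(inflater.decompress(body) + inflater.flush())
        except zlib.error:
            # Uncompressed stream or a different filter; skip it.
            continue
    return results


def looks_like_xml(data):
    """Crude check (an assumption): XML payloads start with '<'."""
    return data.lstrip().startswith(b'<')


if __name__ == "__main__":
    # Hypothetical sample standing in for a real PDF object.
    payload = b'<xfa:data><name>example</name></xfa:data>'
    sample = (b'1 0 obj\n<</Filter[/FlateDecode]>>stream\n'
              + zlib.compress(payload) + b'\nendstream\nendobj\n')
    for chunk in extract_flate_streams(sample):
        if looks_like_xml(chunk):
            print(chunk.decode('utf-8', errors='replace'))
```

To try it on a real file, read the PDF with `open("some_doc.pdf", "rb").read()` and pass the bytes to `extract_flate_streams`. Note that a truncated stream yields partial output here rather than an error, which may actually help when recovering a damaged file.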
