
I am using Adobe's EchoSign API to retrieve a string representation of a PDF file. The problem I am running into is that writing the file to disk is not working properly. The resulting file's length is very different from the length of the string, and the file won't open as a PDF.

As a test, I used an existing PDF file - one that I know is a true PDF, and tried to pull the contents of the file as a string like their API provides and then write it back to another file. The result is the same. I can open the "real" PDF using Adobe, but the new file will not open. This should be simple, but I am obviously missing something.

Here is what I have done to test this out:

Scenario 1: Using the string received from the API

File.WriteAllText(fileName, PDFstring, new UTF8Encoding(false));

Scenario 2: Using string received from the API. Yeah, it seemed dumb, but nothing has been working.

        using (var sw = File.CreateText(fileName))
        {
            for (int p = 0; p < PDFstring.Length; p++)
            {
                var c = PDFstring.Substring(p, 1);
                sw.Write(c);
            }
        }

Scenario 3: Use a known good PDF file and try to copy it by creating a string and writing it to a new file.

        var filename = @"C:\Adobe\GoodDocument.pdf";
        var newFile = @"C:\Adobe\Rewrite.pdf";
        var fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
        var file = new StreamReader(fs);
        var allAdobe = file.ReadToEnd();
        fs.Close();
        File.WriteAllText(newFile, allAdobe, new UTF8Encoding(false));

All three scenarios gave the same results. I cannot use the new file. The file lengths are all longer than they should be. Attempting to open the new file asks for a password where the original does not.

Observation: I just ran scenario 3 again, except this time using the copied (incorrect) file as the original. The result was an exact duplicate! What gives? Is Adobe playing tricks on me?

Mark Bonafe
  • Is the `string` you're getting from the API maybe a [Base64](https://en.wikipedia.org/wiki/Base64) representation of the actual byte-content? – Corak Jul 12 '18 at 13:38
  • If so, does `File.WriteAllBytes(newFile, Convert.FromBase64String(stringFromAPI));` work? – Corak Jul 12 '18 at 13:54
  • I looked a bit at the EchoSign API. Can you elaborate on how you get the document from Adobe? Are you calling the REST service at `/agreements/{agreementId}/combinedDocument`? – Hans Kilian Jul 12 '18 at 13:55
  • A "valid" PDF should start with `%PDF`. Does the string from the API start like that? If not, could you post the first ten or so characters from the string? – Corak Jul 12 '18 at 14:02
  • I don't know EchoSign but if you get a _String Representation_ of the PDF-File, I suppose it's text-only. If you write that into any file, you can open it with a text-editor, but not with a PDF reader. – derpirscher Jul 12 '18 at 15:30
  • And for your 3rd example: PDF can contain binary data. You can't read that into a string. If you do that, some bytes will be lost. And when you write that string back to a file, it does not have the correct structure any more, and thus cannot be opened by a PDF reader – derpirscher Jul 12 '18 at 15:31
  • Here is the start of the string, it's obviously a pdf. %PDF-1.7 %���� 1 0 obj <>/ProcSet[/PDF/Text]>>/Rotate 0/StructParents 25/Type/Page>> endobj 2 0 obj < – Mark Bonafe Jul 12 '18 at 16:15
  • **Strings are not byte arrays**, so do not treat them as byte arrays. Never try to read a binary file format into a string; read it into a byte array! – Eric Lippert Jul 12 '18 at 17:22
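The two checks suggested in the comments above (does the data start with `%PDF`, and is the string perhaps Base64) can be sketched as a small helper. This is only an illustration: the class and method names `PdfPayload`, `LooksLikePdf` and `TryDecodeBase64Pdf` are hypothetical and not part of any Adobe or .NET API.

```csharp
using System;

public static class PdfPayload
{
    // True when the data begins with the ASCII signature "%PDF".
    public static bool LooksLikePdf(byte[] data)
    {
        return data != null && data.Length >= 4
            && data[0] == (byte)'%' && data[1] == (byte)'P'
            && data[2] == (byte)'D' && data[3] == (byte)'F';
    }

    // Returns the decoded bytes when the text is Base64 for a PDF,
    // or null when it is not valid Base64 or does not decode to a PDF.
    public static byte[] TryDecodeBase64Pdf(string text)
    {
        try
        {
            byte[] decoded = Convert.FromBase64String(text);
            return LooksLikePdf(decoded) ? decoded : null;
        }
        catch (FormatException)
        {
            // Characters such as '%' are not part of the Base64 alphabet.
            return null;
        }
    }
}
```

If `TryDecodeBase64Pdf(stringFromAPI)` returns a byte array, it can be passed straight to `File.WriteAllBytes`. If the string itself already starts with `%PDF`, it is not Base64; the binary content has then most likely been decoded as text somewhere upstream, and the original bytes are needed instead.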

3 Answers

0

PDF is a binary format. So you need to read and write them as bytes like this:

var document = File.ReadAllBytes("document.pdf");
File.WriteAllBytes("new document.pdf", document);
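To see why going through a string damages the file, here is a minimal sketch that pushes a few bytes through the same decode and encode steps that `StreamReader.ReadToEnd` and `File.WriteAllText` perform with UTF-8. The class name `RoundTripDemo` and the sample bytes are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes as UTF-8 text, then encodes the text back to bytes
    // without a byte order mark.
    public static byte[] ThroughUtf8String(byte[] original)
    {
        string text = Encoding.UTF8.GetString(original);
        return new UTF8Encoding(false).GetBytes(text);
    }

    public static void Main()
    {
        // "%PDF" followed by four high bytes that are not valid UTF-8.
        byte[] original = { 0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3 };

        byte[] copy = ThroughUtf8String(original);
        byte[] copyOfCopy = ThroughUtf8String(copy);

        Console.WriteLine("original length: " + original.Length);
        Console.WriteLine("copy length:     " + copy.Length);
        Console.WriteLine("copy == original:     " + original.SequenceEqual(copy));
        Console.WriteLine("copy of copy == copy: " + copy.SequenceEqual(copyOfCopy));
    }
}
```

The invalid bytes are replaced with U+FFFD during decoding, so the copy no longer matches the original. The damaged copy is itself valid UTF-8, which is why copying it a second time yields an exact duplicate.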
Hans Kilian
  • I think the common practice is to call files that contain a mix of text and binary data 'binary'. At least I would be confused if you gave me a PDF file and said that it was a text file. – Hans Kilian Jul 12 '18 at 13:43
  • Ok, I've tried this approach and the results are different, but the same. The file size is now smaller than the original and still won't open as a PDF. Looks like progress though. – Mark Bonafe Jul 12 '18 at 13:49
  • Yeah, well Adobe isn't about to change their format. – Mark Bonafe Jul 12 '18 at 13:57
  • @AdrianoRepetti *"PDF is a text format (ASCII 7 bit) which MAY contain some binary content"* - There are many questions here on SO by developers whose code because of this opinion damages PDFs beyond repair. – mkl Jul 12 '18 at 14:14
  • @MarkBonafe *"Adobe isn't about to change their format."* - PDF isn't *their format* anymore. It became an ISO standard 10 years ago. – mkl Jul 12 '18 at 14:18
  • @mkl - thanks for all the info. Question, if it became an ISO standard so long ago, then why is it so difficult to receive a string from their API and write it to disk as a usable PDF file? That's "all" I'm trying to do. You'd think this would be easy or at least some examples would exist. Sadly, it is not easy and I have not seen any examples. – Mark Bonafe Jul 12 '18 at 14:50
  • @AdrianoRepetti *"PDF file format is officially defined as text"* - it is not **defined as text**. According to the specification *"A PDF file is represented as a sequence of 8-bit bytes, some of which are interpreted as character codes in the ASCII character set and some of which are treated as arbitrary binary data depending upon the context."* So it is defined as a binary format parts of which can be *interpreted* as text. Unfortunately indeed the specification often uses terms that come from the world of text, but to an attentive reader it makes clear that one cannot *handle it like text*. – mkl Jul 12 '18 at 15:08
  • @MarkBonafe *"if it became an ISO standard so long ago, then why is it so difficult to receive a string from their API and write it to disk as a usable PDF file?"* - it is not the Adobe API that is an ISO norm, merely the PDF format. Any API, in particular proprietary ones, can be made arbitrarily difficult to use. *"That's "all" I'm trying to do."* - if you answered Corak's requests for clarification in comments to your question, we might be able to help. – mkl Jul 12 '18 at 15:13
  • @mkl - Thanks again, but I have tried converting to a byte array and writing the array to a file with the exact same - unsatisfactory - results. Somewhere there must be an example or a "rule book" on how to accomplish this. Do you have any information at all on how to actually code a working solution? – Mark Bonafe Jul 12 '18 at 15:40
  • @mkl you're right, just found [official documentation](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) says _"...PDF is based on a structured binary file format..."_ – Adriano Repetti Jul 12 '18 at 17:04
0

Hans Kilian's answer is enough if you don't need to edit anything before rewriting the document, but I think you can read it as a string by changing the reading and writing encoding to ANSI (System.Text.Encoding.Default):

        var filename = @"C:\Adobe\GoodDocument.pdf";
        var newFile = @"C:\Adobe\Rewrite.pdf";
        var fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
        var file = new StreamReader(fs, System.Text.Encoding.Default);
        var allAdobe = file.ReadToEnd();
        fs.Close();
        File.WriteAllText(newFile, allAdobe, System.Text.Encoding.Default);
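One caveat: what `Encoding.Default` means depends on the runtime. On .NET Framework it is the system ANSI code page, while on .NET Core and later it is UTF-8, which is not byte-preserving. A hedged alternative, assuming you really do need the content as a string, is ISO-8859-1 (Latin-1), which maps every byte value to the character with the same code. The class name `Latin1RoundTrip` is hypothetical.

```csharp
using System;
using System.Text;

public static class Latin1RoundTrip
{
    // ISO-8859-1 maps each byte 0x00..0xFF to the code point with the same value,
    // so decoding and re-encoding returns the original bytes.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string BytesToText(byte[] data)
    {
        return Latin1.GetString(data);
    }

    public static byte[] TextToBytes(string text)
    {
        return Latin1.GetBytes(text);
    }
}
```

Typical use would be `Latin1RoundTrip.BytesToText(File.ReadAllBytes(filename))` on the way in and `File.WriteAllBytes(newFile, Latin1RoundTrip.TextToBytes(text))` on the way out. This only works when you start from the original bytes; it cannot repair a string that was already decoded with another encoding. Any edit that changes lengths would also invalidate the byte offsets in the PDF's xref table.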

EDIT: I realize only now that your string comes from an API, so that's the only viable solution :)

EDIT2: OK, I read your link and I understand that you need to Base64-decode some chunks of your PDF string, which I think is what I was suggesting in my comment yesterday:

  • I opened a "test.pdf" with Notepad++ and this is what I got:

%PDF-1.7

4 0 obj
(Identity)
endobj
5 0 obj
(Adobe)
endobj
8 0 obj
<<
/Filter /FlateDecode
/Length 146861
/Type /Stream
>>
stream


[.......] LOTS OF ANSI CHARACTERS [.......]


endstream
endobj
13 0 obj
<<
/Font <<
/F1 11 0 R
>>
>>
endobj
3 0 obj
<<
/Contents [ 12 0 R ]
/CropBox [ 0.0 0.0 595.32001 841.92004 ]
/MediaBox [ 0.0 0.0 595.32001 841.92004 ]
/Parent 2 0 R
/Resources 13 0 R
/Rotate 0
/Type /Page
>>
endobj
10 0 obj
<<
/Length 535
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 15 beginbfchar <0003> <0020> <0018> <0044> <0026> <0046> <002C> <0048> <0057> <0050> <0102> <0061> <011E> <0065> <015D> <0069> <0175> <006D> <0190> <0073> <019A> <0074> <01C7> <0079> <0355> <002C> <0357> <003A> <035B> <2019> endbfchar endcmap CMapName currentdict /CMap defineresource pop end end 
endstream
endobj
9 0 obj
[ 3 3 226 24 24 615 38 38 459 44 44 623 87 87 516 258 258 479 286 286 497 349 349 229 373 373 798 400 400 391 410 410 334 455 455 452 853 853 249 855 855 267 859 859 249 ]
endobj
6 0 obj
[ -798 -268 798 952 ]
endobj
7 0 obj
798
endobj
2 0 obj
<<
/Count 1
/Kids [ 3 0 R ]
/Type /Pages
>>
endobj
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
14 0 obj
<<
/Author (user)
/CreationDate (D:20180713094854+02'00')
/ModDate (D:20180713094854+02'00')
/Producer (Microsoft: Print To PDF)
/Title (Microsoft Word - Documento1)
>>
endobj
xref
0 15
0000000000 65535 f
0000148893 00000 n
0000148834 00000 n
0000147825 00000 n
0000000009 00000 n
0000000035 00000 n
0000148778 00000 n
0000148815 00000 n
0000000058 00000 n
0000148591 00000 n
0000148004 00000 n
0000147008 00000 n
0000147480 00000 n
0000147780 00000 n
0000148942 00000 n
trailer
<<
/Info 14 0 R
/Root 1 0 R
/Size 15
>>
startxref
149133
%%EOF
(I used a code snippet just to have the code correctly formatted ;) )
  • What I have in place of [.......] LOTS OF ANSI CHARACTERS [.......] is ANSI, but in your situation it is a Base64 string that needs to be replaced with its Base64-decoded ANSI string. If I'm right, you can do that like below:

    byte[] data = Convert.FromBase64String(your_base_64_string);
    string decodedString = Encoding.Default.GetString(data);

Let me know if you can hit the goal :)

Legion
  • Have you tried it? It doesn't work. It strips the high bit off of the binary data. The files have the same length but they're not the same. – Hans Kilian Jul 12 '18 at 13:48
  • Tested right now... and I agree with you, 7-bit ASCII isn't enough, but I think it is only an encoding error, let me check two things ;) – Legion Jul 12 '18 at 13:54
  • Yes, I've tried it. The file is reported to be smaller using Windows Explorer, I haven't actually checked exact file length. But, you are right, the files are not the same. The encoding seems to make a big difference, but that isn't my wheelhouse. Thanks for the help! – Mark Bonafe Jul 12 '18 at 14:00
  • Edited my answer: the encoding needed is ANSI, so you should use System.Text.Encoding.Default – Legion Jul 12 '18 at 14:07
  • That works for an existing file, but it didn't work for the string returned from the API. ...sigh... I suppose now I will have to walk the string and look for binary data and encode each section. sheez! If it works I'll post the code here because Adobe tech support is dumb as a box of rocks. – Mark Bonafe Jul 12 '18 at 14:30
  • If you can post the string format here... maybe they send you string-formatted data and you need a buffer. It is hard to say without seeing it. – Legion Jul 12 '18 at 14:59
  • When you say "string format", do you mean the actual string? Sadly, no because it would violate HIPAA regulations. If that is what you mean, I can setup a generic PDF and pull that. – Mark Bonafe Jul 12 '18 at 15:34
  • The guy in this link said he had a fix. https://stackoverflow.com/questions/35470113/echosign-combineddocument-api Anyone know how to convert this to c#? – Mark Bonafe Jul 12 '18 at 15:42
  • I don't have enough time to read it today, but I see that it uses a Base64 conversion. I don't know if your entire string is encoded or only the stream (between the stream & endstream keywords), but you can use "System.Convert.FromBase64String(encodedData)" to decode it. – Legion Jul 12 '18 at 17:10
0

While Legion technically answered the posed question, I feel it's necessary for anyone following in my footsteps to get the full answer.

What led to this question was me trying to write the content of a response to an Adobe Sign API call to a file.

I am using C# and the RestSharp library. This is important. The RestSharp IRestResponse object apparently builds its Content property, a string, from the data received from the call. Because the content is binary data, creating the string representation right away made it impossible to write it out as a valid PDF file. Digging deeper into the response object, I noticed a property called RawBytes. This is a byte array of the response. If I write the byte array directly to disk, everything.just.works.
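For anyone who wants to see the shape of it, here is a rough sketch rather than my exact code. It assumes the RestSharp 106.x-era API (`IRestResponse`, `Method.GET`) and needs the RestSharp NuGet package. The class name `AgreementDownloader`, its parameters, and the `Authorization: Bearer` header are placeholders; check the Adobe Sign documentation for the real base URL, resource path and authentication header.

```csharp
using System;
using System.IO;
using RestSharp;

public static class AgreementDownloader
{
    // True when the data begins with the ASCII signature "%PDF".
    public static bool LooksLikePdf(byte[] data)
    {
        return data != null && data.Length >= 4
            && data[0] == (byte)'%' && data[1] == (byte)'P'
            && data[2] == (byte)'D' && data[3] == (byte)'F';
    }

    // baseUrl, resource and accessToken are supplied by the caller (hypothetical values).
    public static void Download(string baseUrl, string resource, string accessToken, string outputPath)
    {
        var client = new RestClient(baseUrl);
        var request = new RestRequest(resource, Method.GET);
        // Assumption: bearer-token authentication; your account may require a different header.
        request.AddHeader("Authorization", "Bearer " + accessToken);

        IRestResponse response = client.Execute(request);
        if (response.ResponseStatus != ResponseStatus.Completed || (int)response.StatusCode != 200)
        {
            throw new InvalidOperationException("Request failed: " + response.StatusCode);
        }

        // Use the raw bytes, not response.Content, which is the body decoded into a string.
        byte[] pdf = response.RawBytes;
        if (!LooksLikePdf(pdf))
        {
            throw new InvalidDataException("Response does not start with %PDF.");
        }

        File.WriteAllBytes(outputPath, pdf);
    }
}
```

The `%PDF` check is optional; it just fails early if the service returned an error page or JSON instead of the document.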

Sorry to bother everyone with this. I was one layer above the actual problem.

Mark Bonafe