
PDF content can be stored in several ways: `(abc) Tj`, `(<0035><0035>) Tj` or `\u065`.

I want to know if there is a way to convert the PDF code to a single form, whether it is direct text `(abc) Tj`, hexadecimal `(<0035><0035>) Tj`, or octal `\u065`.

I think that if the PDF is converted and encoded in a single form, it will be easier to analyse the content.

Is it possible to use Ghostscript or something to do that? Thanks

SuperBerry
  • Your second example of 'several ways' is wrong, it should be `<00350035> Tj`. The rules for converting the input format to the exact bytes they represent are outlined in the formal specifications and are not that hard to implement. – Jongware Aug 22 '15 at 10:28

1 Answer


Essentially, no, there is no way to do so. There are two kinds of strings: regular strings, delimited by '(' and ')', and hexadecimal strings, delimited by '<' and '>'. Hex strings need no escaping, whereas regular text strings must escape 'special' characters such as carriage return and line feed. Octal escapes are also permitted in regular strings.
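For reference, decoding either form back to raw bytes is mechanical; the rules are in ISO 32000-1, section 7.3.4. Below is a minimal illustrative sketch (the helpers `decode_literal` and `decode_hex` are my own names, not from any library), which skips corner cases such as balanced nested parentheses and backslash line continuations:

```python
# Sketch: decode both PDF string forms to raw bytes (ISO 32000-1, 7.3.4).
# Illustration only; nested parentheses and line continuations are skipped.

ESCAPES = {
    b'n': b'\n', b'r': b'\r', b't': b'\t', b'b': b'\b', b'f': b'\f',
    b'(': b'(', b')': b')', b'\\': b'\\',
}

def decode_literal(body: bytes) -> bytes:
    """Decode the contents of a (...) literal string, delimiters stripped."""
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i:i+1]
        if c != b'\\':
            out += c
            i += 1
            continue
        nxt = body[i+1:i+2]
        if nxt and nxt in b'01234567':   # octal escape: one to three digits
            j = i + 1
            while j < min(i + 4, len(body)) and body[j:j+1] in b'01234567':
                j += 1
            out.append(int(body[i+1:j], 8) & 0xFF)
            i = j
        elif nxt in ESCAPES:             # named escapes: \n \r \t \b \f \( \) \\
            out += ESCAPES[nxt]
            i += 2
        else:                            # a stray backslash is dropped
            out += nxt
            i += 2
    return bytes(out)

def decode_hex(body: bytes) -> bytes:
    """Decode the contents of a <...> hex string, delimiters stripped."""
    digits = b''.join(body.split())      # whitespace inside is ignored
    if len(digits) % 2:                  # odd length: a trailing 0 is implied
        digits += b'0'
    return bytes.fromhex(digits.decode('ascii'))
```

For example, `decode_literal(rb'\101')`, `decode_literal(b'A')` and `decode_hex(b'41')` all return `b'A'`.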

PDF producers are free to mix and match these as they like, but in general a given PDF producer will use one technique throughout.

Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
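(For reference, a typical pdfwrite invocation is something like `gs -sDEVICE=pdfwrite -o out.pdf in.pdf`; the exact flags depend on your needs.)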

What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.

KenS
  • So I have no way to change the technique in the PDF? – SuperBerry Aug 22 '15 at 09:00
  • Fundamentally, no. You could write code to do so, of course. Given that the length of `<2121>` is not the same as `(!!)` or `(\041\041)`, changing the string representation will alter the length of the content stream, which will mean altering the xref table. Of course the content stream will usually be compressed as well, so you'll need to decompress it, alter the string representation, recompress it, write it back to the original file (shifting the following bytes) and finally update the xref table. Seems like a lot of trouble for no gain. – KenS Aug 22 '15 at 09:13
  • ... there seems to be no practical *reason* to do this. The format of the strings is for storage only. Any PDF parser should be totally oblivious to how a text "!?" is stored: as `(!?)`, as `<213F>`, or as `(\41\77)` (see the sketch after these comments). The storage format is not kept in memory "as is"; it is parsed into an internal format. – Jongware Aug 22 '15 at 10:25
  • I have tried to extract text from a PDF by reading the code directly from the uncompressed file ("[]TJ" and "()Tj"), but I found it very hard to do because there are too many string techniques, like `<>` or `(\041\041)` or `(!!)`. Besides, there are the `/ToUnicode` and `/Differences` parameters... So hard. So I think that if I convert them to a single technique, the extraction would be easier. – SuperBerry Aug 22 '15 at 10:30
  • I know there are free command-line tools like PDFtoText.exe, but I want to do it in my own program and render the text in my application area. – SuperBerry Aug 22 '15 at 10:36
  • @SuperBerry: no, the `/ToUnicode` and `/Differences` arrays have nothing to do with how a string is written in a file. E.g., a sequence `\n` will *always* represent the internal number `10`, no matter how the backslash or the character `n` gets remapped *after* reading raw binary data. Separate those two steps. – Jongware Aug 22 '15 at 10:59
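To illustrate Jongware's point about storage forms, here is a hypothetical check built on the `decode_literal`/`decode_hex` sketch from the answer above (my own helpers, not part of any real PDF library):

```python
# Three storage forms, one internal byte sequence.
assert decode_literal(b'!?') == b'!?'        # plain literal string
assert decode_hex(b'213F') == b'!?'          # hex string
assert decode_literal(rb'\41\77') == b'!?'   # octal escapes
```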