
I'm processing a PDF document in a program.

The only part of the document I have access to is a list of PDF operations (with their arguments), and a list of horizontal displacements for the glyphs and fonts that appear in the document.

Is it possible to calculate the coordinates of each string on any given page from this? By "a string" I mean either the argument of a Tj, ' or " operator, or a string element of the array argument of a TJ operator. I don't care in what coordinate space these coordinates are defined, or in what units, only that the space and units are the same for every point, since I'm mostly trying to calculate relative distances, not actually display the strings properly.

If that's relevant, the PDF document in question doesn't have any images or vertical text, but it can have multiple Text Objects on a single page, and the strings are not drawn in reading order (moreover the order they're drawn in changes from page to page).

I've tried to figure this out myself from the PDF reference document, but I always have problems with linear algebra, so I'm having a really hard time understanding how transformations and the different spaces actually work. I've been trying to use the Tm[3,0] and Tm[3,1] elements of the text matrix Tm as coordinates, and that mostly worked (in that when I order strings using those elements they are usually in the correct reading order), but there are still issues (e.g. on some pages the order is completely wrong, and in some cases symbols that appear really close to each other on the page have a larger distance between them than symbols that appear far away from each other, etc.)

For example let's say I have this sequence of operators:

BT
0 7.3001 -7.3001 0 124.64 301.79 Tm
A Tj
/T1_2 1 Tf
E Tj
0.0157 Tc
0 7.3001 -7.3001 0 124.64 518.99 Tm
SOME Tj
ET
BT
WOW Tj
1.359 0.041 Td
T*
SECOND Tj
0 7.6068 -7.3001 0 269.54 245.01 Tm
LAST Tj
ET

How would one calculate the coordinates of the strings in the resulting file?

JohnDiGriz
  • @KJ well, that's not a big problem, since I know the glyph's bounding box, and I also know which font is used (since I have access to Tf operations). The problem is about transformation and coordinate spaces – JohnDiGriz Mar 10 '23 at 02:16
  • @KJ I know that PDF is not human readable, as I said I'm writing a program that needs to do this processing. – JohnDiGriz Mar 10 '23 at 02:25
  • @KJ the snippet is just for the general idea of what kind of info I have, not from an actual file. To clarify, I'm not trying to create a PDF file, I'm trying to read one automatically, and for that I need to be able to a) find all of the strings there, b) find the distances between them – JohnDiGriz Mar 10 '23 at 06:12
  • Remember you also need to take account of the CTM ("current transformation matrix", set by the "cm" operator). The text matrices also need to be combined with that to get to "device space", which is what you want as your output. To interpret all this properly, you will need to deal with the BT / ET / q / Q operators too. So as you go through the operator list, you maintain the current graphics and text matrices on a stack or stacks, pushing on a q operator and popping on a Q operator. – johnwhitington Mar 10 '23 at 12:59

1 Answer

How would one calculate the coordinates of the strings in the resulting file?

Assuming this is the full page content stream, the positions in the default user space coordinate system would be:

BT
0 7.3001 -7.3001 0 124.64 301.79 Tm
A Tj

This is invalid, according to the PDF specification: There is no initial value for either font or size; they shall be specified explicitly by using Tf before any text is shown.

Thus, whether string A is shown at all (in some arbitrary font at some arbitrary size) depends on the PDF viewer implementation.

Assuming, though, that you have a viewer that assumes some font and font size, A should be shown at (124.64, 301.79).

/T1_2 1 Tf
E Tj

If the string A is drawn somehow, the string E is likely to be shown as a continuation of A. As this is implementation-dependent (see above), though, there is no telling where.
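Where a continuation would start can be sketched with the glyph advance rule from the PDF specification: after a glyph is shown, the text matrix is translated by tx = (w0 × Tfs + Tc + Tw) × Th in text space. The following is only an illustration under assumed values: the glyph width 0.6 and font size 1 for A are hypothetical (the snippet sets neither), and the helper names `multiply` and `advance` are not from any library.

```python
def multiply(m, n):
    """Product m x n of two PDF matrices given as (a, b, c, d, e, f),
    using the row-vector convention of the PDF specification."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def advance(tm, glyph_width, font_size, char_spacing=0.0,
            word_spacing=0.0, h_scale=1.0):
    """Return the text matrix after showing one glyph.

    glyph_width is in text space units (font width / 1000 for most fonts).
    """
    tx = (glyph_width * font_size + char_spacing + word_spacing) * h_scale
    return multiply((1, 0, 0, 1, tx, 0), tm)


# Text matrix from the first Tm instruction.
tm = (0, 7.3001, -7.3001, 0, 124.64, 301.79)

# ASSUMPTION: width 0.6 and size 1 are made-up values for illustration.
tm_after_a = advance(tm, glyph_width=0.6, font_size=1)
print(tm_after_a[4], tm_after_a[5])
```

Because this text matrix rotates text by 90°, the "horizontal" advance moves the position along the y axis: under these assumed values, E would start at roughly (124.64, 306.17), not to the right of A.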

0.0157 Tc
0 7.3001 -7.3001 0 124.64 518.99 Tm
SOME Tj

Unless the invalid statement before has left the PDF viewer in a completely befuddled state, the string SOME is drawn at (124.64, 518.99).

ET
BT
WOW Tj

At the start of a text object (BT) the text matrix and the text line matrix are reset to the identity matrix. Thus, the string WOW is drawn at (0, 0) using font /T1_2 at size 1 with character spacing 0.0157.

1.359 0.041 Td
T*
SECOND Tj

The text line matrix is still the identity matrix (see above). The Td instruction, therefore, sets both the text matrix and the text line matrix to

1     0     0     1 0 0     1     0     0
0     1     0  ×  0 1 0  =  0     1     0
1.359 0.041 1     0 0 1     1.359 0.041 1

As the text leading has not been set, its value is still the default 0. Thus, the T* instruction doesn't change anything.

The string SECOND, therefore, is drawn at (1.359, 0.041).
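The Td step can be checked with a few lines of Python. This is a minimal sketch, with matrices written as the six numbers (a, b, c, d, e, f) that PDF uses; the helper names `multiply` and `apply_td` are made up for this illustration.

```python
def multiply(m, n):
    """Product m x n of two PDF matrices given as (a, b, c, d, e, f),
    using the row-vector convention of the PDF specification."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply_td(tlm, tx, ty):
    """Td: new Tm = new Tlm = translation(tx, ty) x old Tlm."""
    return multiply((1, 0, 0, 1, tx, ty), tlm)


IDENTITY = (1, 0, 0, 1, 0, 0)

# The case from this answer: Tlm is the identity, so the offsets pass through.
tlm = apply_td(IDENTITY, 1.359, 0.041)
print(tlm[4], tlm[5])

# With a rotated and scaled Tlm the same kind of offset is transformed first.
rotated = apply_td((0, 7.3001, -7.3001, 0, 124.64, 518.99), 1, 0)
print(rotated[4], rotated[5])
```

The second call shows why the Td operands alone are not page coordinates in general: with the rotated matrix from the first text object, `1 0 Td` moves the position by 7.3001 units along y, not by 1 unit along x.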

0 7.6068 -7.3001 0 269.54 245.01 Tm
LAST Tj
ET

LAST is drawn at (269.54, 245.01).
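The whole walk-through can be sketched as a small state machine. This is a rough illustration rather than a complete interpreter: the function name `string_origins`, the `(operator, operands)` input format, and the `glyph_width` callback are assumptions of this sketch, and it ignores TJ, ', ", Tw, Tz and Ts. The default callback returns 0 for every glyph, so a string that merely continues a previous one (such as E) is reported at the same origin as its predecessor unless real widths are supplied.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)


def multiply(m, n):
    """Product m x n of two PDF matrices (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def string_origins(ops, glyph_width=lambda font, ch: 0.0):
    """Return (text, x, y) for every Tj string, in default user space.

    glyph_width(font, ch) is a hypothetical lookup that returns the glyph
    width in text space units (font width / 1000 for most fonts).
    """
    ctm, font, size, tc, leading = IDENTITY, None, 0.0, 0.0, 0.0
    stack = []                      # graphics states saved by q
    tm = tlm = IDENTITY
    out = []
    for name, args in ops:
        if name == "q":
            stack.append((ctm, font, size, tc, leading))
        elif name == "Q":
            ctm, font, size, tc, leading = stack.pop()
        elif name == "cm":
            ctm = multiply(tuple(args), ctm)
        elif name == "BT":
            tm = tlm = IDENTITY     # matrices reset; text state persists
        elif name == "Tf":
            font, size = args
        elif name == "Tc":
            tc = args[0]
        elif name == "TL":
            leading = args[0]
        elif name == "Tm":
            tm = tlm = tuple(args)
        elif name in ("Td", "TD", "T*"):
            tx, ty = (0.0, -leading) if name == "T*" else args
            if name == "TD":
                leading = -ty       # TD also sets the leading
            tm = tlm = multiply((1, 0, 0, 1, tx, ty), tlm)
        elif name == "Tj":
            text = args[0]
            trm = multiply(tm, ctm)             # text space -> user space
            out.append((text, trm[4], trm[5]))
            for ch in text:                     # advance past each glyph
                tx = glyph_width(font, ch) * size + tc
                tm = multiply((1, 0, 0, 1, tx, 0), tm)
    return out


ops = [
    ("BT", []), ("Tm", [0, 7.3001, -7.3001, 0, 124.64, 301.79]),
    ("Tj", ["A"]), ("Tf", ["T1_2", 1]), ("Tj", ["E"]), ("Tc", [0.0157]),
    ("Tm", [0, 7.3001, -7.3001, 0, 124.64, 518.99]), ("Tj", ["SOME"]),
    ("ET", []), ("BT", []), ("Tj", ["WOW"]), ("Td", [1.359, 0.041]),
    ("T*", []), ("Tj", ["SECOND"]),
    ("Tm", [0, 7.6068, -7.3001, 0, 269.54, 245.01]), ("Tj", ["LAST"]),
    ("ET", []),
]
origins = string_origins(ops)
for text, x, y in origins:
    print(text, x, y)
```

For relative distances on one page, comparing the reported (x, y) pairs is enough as long as every string goes through the same CTM handling; passing a real width lookup as `glyph_width` additionally places continuation strings correctly.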

mkl