
I'm using PoDoFo to extract character displacement to update a text matrix correctly. This is a code fragment of mine:

PdfString str, ucode_str;
std::stack<PdfVariant> *stack;
const PdfFontMetrics *f_metrics;
...

/* Convert string to UTF8 */
str = stack->top().GetString();
ucode_str = ts->font->GetEncoding()->ConvertToUnicode(str, ts->font);
stack->pop();
c_str = (char *) ucode_str.GetStringUtf8().c_str();

/* Font metrics to obtain a character displacement */
f_metrics = ts->font->GetFontMetrics();

for (j = 0; j < strlen(c_str); j++) {
    str_w = f_metrics->CharWidth(c_str[j]);

    /* Adjust text matrix using str_w */
    ...
}

It works well for some PDF files (str_w contains a useful width), but doesn't work for others. In these cases str_w contains 0.0. I took a look at the PoDoFo 0.9.5 sources and found CharWidth() implemented for all sub-classes of PdfFontMetrics.

Am I missing something important during this string conversion?

Update from 04.08.2017

@mkl did a really good job reviewing PoDoFo's code. However, I realized that I actually needed a slightly different value: the glyph width expressed in text space units (see PDF Reference 1.7, 5.1.3 Glyph Positioning and Metrics). CharWidth() is implemented in PdfFontMetricsObject.cpp like this:

double PdfFontMetricsObject::CharWidth(unsigned char c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();

        return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
    }

    if (m_missingWidth != NULL)
        return m_missingWidth->GetReal();
    else
        return m_dDefWidth;
}

The width is calculated using additional parameters (font size, character spacing, horizontal scaling). All I really needed was dWidth * m_matrix.front().GetReal(). So I implemented GetGlyphWidth(int c) in the same file like this:

double PdfFontMetricsObject::GetGlyphWidth(int c) const
{
    if (c >= m_nFirst && c <= m_nLast &&
        c - m_nFirst < static_cast<int>(m_width.GetSize())) {
        double dWidth = m_width[c - m_nFirst].GetReal();
        return dWidth * m_matrix.front().GetReal();
    }
    return 0.0;
}

and call this one instead of CharWidth() from the first listing.
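
To make the difference between the two values concrete, here is a minimal standalone sketch in plain C++ (it does not use PoDoFo). FontState, scaledCharWidth, glyphWidth and horizontalDisplacement are hypothetical names introduced only for illustration. The first function mirrors the expression in CharWidth(), the second mirrors GetGlyphWidth(), and the third is the horizontal displacement formula as I read it in PDF Reference 1.7, 5.3.3 (Text Space Details). The numbers mentioned below (a /Widths entry of 500, a matrix factor of 0.001, a 12 pt font) are assumed example values.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the text state values that CharWidth() mixes in.
struct FontState {
    double matrixA;    // first font matrix entry, usually 0.001 for non-Type 3 fonts
    double fontSize;   // Tfs
    double charSpace;  // Tc
    double fontScale;  // Tz in percent (100 means no horizontal scaling)
};

// Mirrors the expression in CharWidth(): the width is already scaled by
// font size, character spacing and horizontal scaling.
double scaledCharWidth(double rawWidth, const FontState &s) {
    assert(std::isfinite(rawWidth));
    return (rawWidth * s.matrixA * s.fontSize + s.charSpace) * s.fontScale / 100.0;
}

// Mirrors GetGlyphWidth(): only the glyph space to text space conversion.
double glyphWidth(double rawWidth, const FontState &s) {
    assert(std::isfinite(rawWidth));
    return rawWidth * s.matrixA;
}

// tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th, with Th = Tz / 100.
// w0 is the glyph width in text space units, tj the TJ array adjustment.
double horizontalDisplacement(double w0, double tj, double wordSpace,
                              const FontState &s) {
    return ((w0 - tj / 1000.0) * s.fontSize + s.charSpace + wordSpace)
           * s.fontScale / 100.0;
}
```

With the assumed values, glyphWidth gives 0.5 while scaledCharWidth gives 6.0. Feeding the 0.5 into horizontalDisplacement with no TJ adjustment and no word spacing gives 6.0 again, which is why the unscaled value is the one to plug into the displacement formula.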

  • Please share a sample PDF for analysis. There are some very weird PDFs in which all glyph widths indeed are 0 and the text matrix is moved along by separate instructions. Probably you have such a file. – mkl Aug 03 '17 at 13:15
  • [This PDF](https://drive.google.com/file/d/0B5bUjbiZdo9nVFlQVFNNWHpBaVE/view?usp=sharing) is processed with error (displacement is `0.0`) – Dmitry Salychev Aug 03 '17 at 13:43
  • [This one](https://drive.google.com/file/d/0B1fFjmlHIxF2T2twUTNLTmRmVWM/view?usp=sharing) is correctly processed (I'm interested in first page only) – Dmitry Salychev Aug 03 '17 at 13:45
  • @mkl, do you mean something like `[ (A) 120 (W) 120 (A) 95 (Y again) ] TJ`? I thought that these characters should have the correct displacements, also. – Dmitry Salychev Aug 03 '17 at 13:55
  • *" do you mean something like ... I thought that these characters should have the correct displacements, also."* - Yes, they *should* have the correct displacement and the values in **TJ** are only to be used for kerning; but there indeed are some PDFs out there in which the numbers in **TJ** do the whole displacement. But this is not the issue in case of your sample PDF, cf. my answer. – mkl Aug 03 '17 at 14:50

1 Answer

If I understand the PoDoFo code correctly (I'm not really a PoDoFo expert...), the PdfFontMetricsObject class is used to represent the metrics of fonts contained in an already existing PDF:

/** Create a font metrics object based on an existing PdfObject
 *
 *  \param pObject an existing font descriptor object
 *  \param pEncoding a PdfEncoding which will NOT be owned by PdfFontMetricsObject
 */
PdfFontMetricsObject( PdfObject* pFont, PdfObject* pDescriptor, const PdfEncoding* const pEncoding );

The method CharWidth here is implemented like this:

double PdfFontMetricsObject::CharWidth( unsigned char c ) const
{
    if( c >= m_nFirst && c <= m_nLast
        && c - m_nFirst < static_cast<int>(m_width.GetSize()) )
    {
        double dWidth = m_width[c - m_nFirst].GetReal();

        return (dWidth * m_matrix.front().GetReal() * this->GetFontSize() + this->GetFontCharSpace()) * this->GetFontScale() / 100.0;
    }

    if( m_missingWidth != NULL )
        return m_missingWidth->GetReal ();
    else
        return m_dDefWidth;
}

In particular, one sees that the parameter c is not mapped through the font encoding but is used as is for the lookup in the widths array. Thus, the expected input of this method does not appear to be an ASCII or ANSI character code but the original glyph ID.

Your code, on the other hand, has already converted the glyph IDs to Unicode (as UTF-8) and therefore essentially tries to look up widths by ANSI character codes.


This matches the example documents. A typical font encoding in the PDF processed with error looks like this:

28 0 obj
<<
  /Differences[0/B/G/W/a/d/e/f/g  9/i/l/n/o/p/r/space/t/w]
  /BaseEncoding/MacRomanEncoding
  /Type/Encoding
>>
endobj

with glyph codes from 0 (FirstChar) to 17 (LastChar), or

12 0 obj
<<
  /Differences[1/A/B/C/D/F/I/L/M/N/O/P/R/T/U/a/c/d
                /degree/e/eight/f/five/four/g/h
               27/i/l/m/n/o/one/p/parenleft/parenright
                /period/r/registered/s/space
                /t/three/two/u/w/zero]
  /BaseEncoding/MacRomanEncoding
  /Type/Encoding
>>
endobj 

with glyph codes from 1 (FirstChar) to 46 (LastChar).

So these encodings assign glyph codes starting from 0 (or 1) to just the glyphs that are actually required, and they don't cover many glyphs.

Thus, CharWidth will return 0 for all char values above 17 or above 46, respectively, which means all (in the former case) or most (in the latter case) non-control ANSI characters.
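
The effect can be reproduced with a small standalone model of the lookup (plain C++, not PoDoFo code). WidthTable, lookupWidth and makeSubsetTable are hypothetical names, the width of 500 for every glyph is made up, and the fallback of 0 stands in for whatever missing or default width the font provides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a simple font's /FirstChar, /LastChar and /Widths.
struct WidthTable {
    int first;
    int last;
    std::vector<double> widths;  // one entry per code in [first, last]
    double fallback;             // stands in for the missing/default width
};

// Same range check as in CharWidth(): the code indexes /Widths directly,
// without going through the font encoding.
double lookupWidth(const WidthTable &t, int code) {
    assert(t.first <= t.last);
    if (code >= t.first && code <= t.last &&
        code - t.first < static_cast<int>(t.widths.size())) {
        return t.widths[code - t.first];
    }
    return t.fallback;
}

// A table for codes 0..17, like the first encoding of the failing sample.
// Every width is an invented 500; the fallback is assumed to be 0.
WidthTable makeSubsetTable() {
    return WidthTable{0, 17, std::vector<double>(18, 500.0), 0.0};
}
```

Looking up code 0 (the glyph /B in that encoding) returns 500 here, while looking up 'B' as an ANSI character (66) falls outside 0..17 and returns the fallback of 0.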

On the other hand a typical font encoding in the PDF processed correctly looks like this:

1511 0 obj
<<
  /Type/Encoding
  /BaseEncoding/WinAnsiEncoding
  /Differences[
    1/Delta/Theta
    8/Phi
    11/ff/fi/fl/ffi
    39/quoteright
  ]
>>
endobj 

with glyph codes from 1 (FirstChar) to 122 (LastChar).

These encodings basically are WinAnsiEncoding with minor additions in the lower values, in particular the control character values.


What you can do, therefore, is to iterate over the glyph codes in str (allowing you to call CharWidth for them) and convert them individually to Unicode when needed, instead of first converting str to Unicode ucode_str and then iterating over the ANSI characters in ucode_str.
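
As a rough, untested sketch of that loop order in plain C++ (no PoDoFo calls): SimpleFont, PlacedGlyph and measure are hypothetical names, and the sketch assumes a simple font with single-byte codes. How to get the raw bytes and the per-code Unicode mapping out of PoDoFo is deliberately left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a simple (single-byte) font.
struct SimpleFont {
    int first;                         // /FirstChar
    std::vector<double> widths;        // width per code, starting at `first`
    std::map<int, char32_t> toUnicode; // code -> Unicode, from the encoding
};

struct PlacedGlyph {
    char32_t unicode;  // what the glyph means (for text extraction)
    double width;      // what moves the text matrix
};

// Iterate over the ORIGINAL string bytes: each byte is used as the width
// lookup key first and is translated to Unicode only afterwards.
std::vector<PlacedGlyph> measure(const SimpleFont &font,
                                 const std::string &rawBytes) {
    assert(font.first >= 0);
    std::vector<PlacedGlyph> out;
    out.reserve(rawBytes.size());
    for (unsigned char code : rawBytes) {
        const int idx = static_cast<int>(code) - font.first;
        const double w =
            (idx >= 0 && idx < static_cast<int>(font.widths.size()))
                ? font.widths[idx]
                : 0.0;  // assumed fallback for codes outside /Widths
        const auto it = font.toUnicode.find(code);
        const char32_t u =
            (it != font.toUnicode.end()) ? it->second : U'\uFFFD';
        out.push_back(PlacedGlyph{u, w});
    }
    return out;
}
```

The point of the design is that the width and the Unicode value are both derived from the same original code, so a font that uses codes 0 and 1 for /B and /G still yields the right widths even though those codes are control characters in ANSI terms.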

mkl
  • @DmitrySalychev Great! Admittedly I wrote the answer without testing my proposal, merely based on Podofo code review, so I'm happy, too, that my analysis was right. – mkl Aug 03 '17 at 16:03