
I'm using Apache Tika 1.14, which uses PDFBox 2.0.3, to extract the text content of files. In production, when processing many files, I get many log statements like these:

WARN  o.a.p.pdmodel.font.PDTrueTypeFont - Using fallback font 'LiberationSans' for 'TimesNewRomanPSMT'
WARN  o.a.p.pdmodel.font.PDTrueTypeFont - Using fallback font 'LiberationSans' for 'TimesNewRomanPS-BoldMT'

I compared the extracted text of a few files with the original documents, and nothing is missing.

My question is: if I use PDFBox only for text extraction, can I safely ignore warnings of this type without risking that I miss some content? Or should I install the missing fonts?

Thanks for any advice.

user3151361
  • I'm not 100% sure of this (didn't test), but I recommend that you install the so-called standard 14 fonts (Times, Courier, Arial, Symbol, Dingbats). My fear is that sizes might be wrong, so characters would overlap due to the different sizes of the fallback font and the font in the PDF. – Tilman Hausherr May 22 '17 at 17:52
  • My understanding is that when parsing a PDF document, PDFBox first visualizes it and only then parses it, so it is important to have the fonts installed. Am I right? – user3151361 May 23 '17 at 07:51
  • No, PDFBox doesn't visualize for text extraction. Btw, another argument for installing the fonts: you'll get rid of the warnings. These clutter your log file and may prevent you from seeing more important things. – Tilman Hausherr May 23 '17 at 14:50

1 Answer

According to https://pdfbox.apache.org/1.8/cookbook/workingwithfonts.html, the PDFBox documentation recommends installing the so-called Standard 14 Fonts.

Due to licensing requirements we need to provide substitute fonts.

Based on the code in class org.apache.pdfbox.pdmodel.font.FontMapperImpl, these are the Standard 14 Fonts and their substitutes:

Courier:CourierNew,CourierNewPSMT,LiberationMono,NimbusMonL-Regu
Courier-Bold:CourierNewPS-BoldMT,CourierNew-Bold,LiberationMono-Bold,NimbusMonL-Bold
Courier-Oblique:CourierNewPS-ItalicMT,CourierNew-Italic,LiberationMono-Italic,NimbusMonL-ReguObli
Courier-BoldOblique:CourierNewPS-BoldItalicMT,CourierNew-BoldItalic,LiberationMono-BoldItalic,NimbusMonL-BoldObli
Helvetica:ArialMT,Arial,LiberationSans,NimbusSanL-Regu
Helvetica-Bold:Arial-BoldMT,Arial-Bold,LiberationSans-Bold,NimbusSanL-Bold
Helvetica-Oblique:Arial-ItalicMT,Arial-Italic,Helvetica-Italic,LiberationSans-Italic,NimbusSanL-ReguItal
Helvetica-BoldOblique:Arial-BoldItalicMT,Helvetica-BoldItalic,LiberationSans-BoldItalic,NimbusSanL-BoldItal
Times-Roman:TimesNewRomanPSMT,TimesNewRoman,TimesNewRomanPS,LiberationSerif,NimbusRomNo9L-Regu
Times-Bold:TimesNewRomanPS-BoldMT,TimesNewRomanPS-Bold,TimesNewRoman-Bold,LiberationSerif-Bold,NimbusRomNo9L-Medi
Times-Italic:TimesNewRomanPS-ItalicMT,TimesNewRomanPS-Italic,TimesNewRoman-Italic,LiberationSerif-Italic,NimbusRomNo9L-ReguItal
Times-BoldItalic:TimesNewRomanPS-BoldItalicMT,TimesNewRomanPS-BoldItalic,TimesNewRoman-BoldItalic,LiberationSerif-BoldItalic,NimbusRomNo9L-MediItal
Symbol:Symbol,SymbolMT,StandardSymL
ZapfDingbats:ZapfDingbatsITC,Dingbats,MS-Gothic

As I understand it, when processing a file that uses, for instance, the font Helvetica, and that font is not installed, one of the substitute fonts will be used: ArialMT, Arial, LiberationSans, or NimbusSanL-Regu. That's clear.
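
To make that lookup order concrete, here is a minimal plain-Java sketch of a "first installed substitute wins" search. It is an illustration under assumptions, not PDFBox code: the class `SubstituteLookupSketch`, the method `resolve`, and the `installed` set are hypothetical names, and only two rows of the table above are copied in.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch only; this is NOT how FontMapperImpl is implemented.
final class SubstituteLookupSketch {

    // Two rows copied from the substitute table above (Standard 14 name -> substitutes).
    static final Map<String, List<String>> SUBSTITUTES = new LinkedHashMap<>();
    static {
        SUBSTITUTES.put("Helvetica",
                Arrays.asList("ArialMT", "Arial", "LiberationSans", "NimbusSanL-Regu"));
        SUBSTITUTES.put("Times-Roman",
                Arrays.asList("TimesNewRomanPSMT", "TimesNewRoman", "TimesNewRomanPS",
                        "LiberationSerif", "NimbusRomNo9L-Regu"));
    }

    // Returns the requested font if it is installed, otherwise the first
    // installed substitute in table order, otherwise an empty Optional.
    static Optional<String> resolve(String requested, Set<String> installed) {
        if (installed.contains(requested)) {
            return Optional.of(requested);
        }
        List<String> candidates =
                SUBSTITUTES.getOrDefault(requested, Collections.<String>emptyList());
        for (String candidate : candidates) {
            if (installed.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical machine with only the Liberation fonts installed.
        Set<String> installed =
                new HashSet<>(Arrays.asList("LiberationSans", "LiberationSerif"));
        System.out.println(resolve("Helvetica", installed));
        System.out.println(resolve("Arial", installed));
    }
}
```

With only the Liberation fonts installed, Helvetica resolves to LiberationSans, while Arial resolves to nothing, because Arial appears in the table only as a substitute and never as a key.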

But what if I don't have the font Arial installed (which is not one of the Standard 14 Fonts) and I'd like LiberationSans to be used when processing a file that uses Arial? Is there a way to configure such a mapping?
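
The kind of mapping meant here can be sketched in plain Java as an alias layer consulted when the requested font is not installed. Everything in this block is hypothetical (`AliasingLookupSketch`, `alias`, `resolve`); it is not PDFBox API. In PDFBox 2.x, `FontMappers.set(FontMapper)` appears to be the hook for installing a custom mapper, but that route is untested here and should be treated as an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and behavior are assumptions, not PDFBox API.
final class AliasingLookupSketch {

    // Hypothetical user-supplied aliases: requested font name -> preferred replacement.
    private final Map<String, String> aliases = new HashMap<>();
    private final Set<String> installed;

    AliasingLookupSketch(Set<String> installed) {
        this.installed = installed;
    }

    // Registers an alias and returns this instance so calls can be chained.
    AliasingLookupSketch alias(String requested, String replacement) {
        aliases.put(requested, replacement);
        return this;
    }

    // Returns the requested name if installed, else its alias if that is
    // installed, else null to signal that no mapping applies.
    String resolve(String requested) {
        if (installed.contains(requested)) {
            return requested;
        }
        String replacement = aliases.get(requested);
        if (replacement != null && installed.contains(replacement)) {
            return replacement;
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> installed = new HashSet<>(Arrays.asList("LiberationSans"));
        AliasingLookupSketch lookup =
                new AliasingLookupSketch(installed).alias("Arial", "LiberationSans");
        System.out.println(lookup.resolve("Arial"));
    }
}
```

Keeping the alias check separate from the built-in substitute table means a user rule such as Arial to LiberationSans never changes how the Standard 14 names themselves are resolved.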

One more thing: in version 1.8.13 I saw that the class org.apache.pdfbox.pdmodel.font.FontManager loads the resource file org/apache/pdfbox/resources/FontMapping.properties, which could be used to provide such mappings. In version 2.x I don't see any possibility to do this. I wonder why it was removed...
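
For illustration, a properties-style mapping file could be read with `java.util.Properties` roughly as follows. The `requested=replacement` line format and the class name `FontMappingPropertiesSketch` are assumptions made for this sketch; the actual format of FontMapping.properties in 1.8.x is not reproduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative sketch only; the key=value layout is an assumed format.
final class FontMappingPropertiesSketch {

    // Parses "requested=replacement" lines into a Properties object.
    static Properties parse(String text) {
        Properties mappings = new Properties();
        try {
            mappings.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return mappings;
    }

    public static void main(String[] args) {
        // Hypothetical mapping content, inlined instead of read from a file.
        String text = "Arial=LiberationSans\n"
                + "Arial-BoldMT=LiberationSans-Bold\n";
        Properties mappings = parse(text);
        System.out.println(mappings.getProperty("Arial"));
        System.out.println(mappings.getProperty("Unknown", "(no mapping)"));
    }
}
```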

Nicomedes E.
user3151361