I'm using Apache PDFBox to write Arabic text on a page without embedding the font. ArialMT appears to be widely available, so PDFBox should be able to use it and a PDF viewer should have no trouble with the final document. However, I have not found a way in code to use the font without it being embedded.

Note: This is perfectly possible by the PDF standard and I've seen such generated documents.

ADDENDUM (further explaining the case)

The specific case for a non-embedded font is this: I'm generating a document with images and placing invisible text (e.g. produced via OCR) on top of the images. When conforming to the PDF/A standard, embedding the font is not necessary in such cases, as the image is the only source for rasterization of the document. The "standard 14" fonts do not include Arabic codepoints, so another font has to be referenced for PDFBox to work, but loading a font causes it to be embedded.

Doc
  • Just because you can do something doesn't mean you should. There are computers that don't have many fonts and the result may be weird. PDFBox will always embed the full font or a subset of it. – Tilman Hausherr Mar 19 '18 at 14:43
  • @TilmanHausherr I do have legitimate reasons for doing this and they are according to PDF standards. Other PDF-producing tools already do exactly what I'm trying to do. I'm sorry I can't give you the whole picture, but I assure you it's legitimate. – Doc Mar 19 '18 at 15:59
  • Tesseract for that use case uses a glyph-less font (cf. [this answer](https://stackoverflow.com/a/49306240/1729265)) which has a fairly small footprint. Using such fonts might also be an option for you. – mkl Mar 20 '18 at 11:34
  • @mkl Interesting, but not practical. I suspect that the font only supports Latin codepoints. [This](https://github.com/overview/pdfocr/tree/master/the-all-font) looks a lot more interesting, but I need to see how big the font actually is. – Doc Mar 20 '18 at 12:14
  • For those who insist on embedding all fonts: please note that a normal Chinese font is about 10 MB, and a small document may use several fonts. Without embedded fonts it may be 25 KB in size; with embedded fonts you get 30 MB. A PDF library should provide an option not to embed fonts. – bob dawson Mar 23 '19 at 03:33
  • Even though it is only said explicitly in Mike's answer, the discussion was always about embedding the *subset of **actually used glyphs** only*, and that subset is small for Chinese texts, too. – mkl Mar 23 '19 at 07:23

1 Answer

To elaborate on Tilman's comment:

> Just because you can do something doesn't mean you should. There are computers that don't have many fonts and the result may be weird

They're entirely correct: don't do this. Use subset embedding, because different setups can have different versions of Arial, all of which resolve against the ArialMT identifier but can have completely different internal glyph IDs.

PDFs point to glyph IDs, not 'letters', so what looks like `cake` with your copy of Arial could, once encoded as a glyph ID array, end up as `B^r(` in a different version of Arial. That even includes newer versions of Arial that you yourself might end up using a year from now: suddenly your PDF files are completely unusable even for you.
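To make the mismatch concrete, here is a minimal Java sketch of the idea. It is a toy model, not PDFBox code: the class `GlyphIdDemo`, its methods, and every glyph ID in the two tables are invented for illustration and do not come from any real Arial build. It only shows what happens when text is stored as IDs from one table and resolved against another.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: the glyph IDs below are invented, not taken from any real font file. */
final class GlyphIdDemo {

    /** Hypothetical character-to-glyph-ID table of the font the PDF was written with. */
    static Map<Character, Integer> fontA() {
        return Map.of('c', 70, 'a', 68, 'k', 78, 'e', 72);
    }

    /** Hypothetical table of a different build that answers to the same font name. */
    static Map<Character, Integer> fontB() {
        return Map.of('B', 70, '^', 68, 'r', 78, '(', 72,
                      'c', 102, 'a', 100, 'k', 110, 'e', 104);
    }

    /** What the writer stores: one glyph ID per character, looked up in the writer's font. */
    static int[] encode(String text, Map<Character, Integer> cmap) {
        int[] glyphIds = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer gid = cmap.get(text.charAt(i));
            if (gid == null) {
                throw new IllegalArgumentException("No glyph for '" + text.charAt(i) + "'");
            }
            glyphIds[i] = gid;
        }
        return glyphIds;
    }

    /** What a viewer ends up showing: each stored ID resolved against the viewer's font. */
    static String decode(int[] glyphIds, Map<Character, Integer> cmap) {
        Map<Integer, Character> reverse = new HashMap<>();
        cmap.forEach((ch, gid) -> reverse.put(gid, ch));
        StringBuilder shown = new StringBuilder();
        for (int gid : glyphIds) {
            // IDs the viewer's font does not know become the replacement character.
            shown.append(reverse.getOrDefault(gid, '\uFFFD'));
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        int[] stored = encode("cake", fontA());
        System.out.println(decode(stored, fontA())); // same font as the writer
        System.out.println(decode(stored, fontB())); // different build, same name
    }
}
```

With a subset of the writer's font embedded in the file, the viewer always resolves the stored IDs against the first table, which is the point of the advice above.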

PDFs should be stand-alone documents. If you want people to read your PDFs, use subset embedding for the fonts you used, even if they're supposedly "generally available". The only way to not embed a font is to make the document use only fonts from the predefined standard set of 14 fonts, which any PDF-spec compliant reader must come with in order to render content without font embeds. And notice that Arial is not in that list.
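As a quick way to apply that rule, the following sketch lists the standard 14 base font names and checks a font name against them. The class `Standard14Check` and the helper `needsEmbedding` are hypothetical names made up for this example, not part of PDFBox; the helper simply encodes the rule stated in this answer.

```java
import java.util.Set;

/** Sketch: the rule "anything outside the standard 14 gets embedded" as a lookup. */
final class Standard14Check {

    /** PostScript names of the standard 14 fonts. */
    static final Set<String> STANDARD_14 = Set.of(
            "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
            "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
            "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
            "Symbol", "ZapfDingbats");

    /** Hypothetical helper: true when the font is not one of the standard 14. */
    static boolean needsEmbedding(String baseFontName) {
        return !STANDARD_14.contains(baseFontName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Helvetica", "ArialMT"}) {
            System.out.println(name + " needs embedding: " + needsEmbedding(name));
        }
    }
}
```

None of the fonts in this set cover Arabic, which is why the question cannot be solved by picking one of them.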

Mike 'Pomax' Kamermans
  • Your ardent support of the standards is admirable. I admit that my corner case requires some imagination to accept because of the way it was (partially) stated, so I've updated the question to make it clearer. – Doc Mar 20 '18 at 11:28
  • That should really be a proper post rewrite including code that shows how you're generating things, as your edit describes a completely different thing from what you started with. Note that ghost text is still text, and unless you've issued very specific instructions to store literal blocks (which we won't know without seeing code) that's still going to be stored as glyphids and you're still going to *have* to do subset embedding if you want that text to stay the same from version to version rather than just being random, unusable, numerical data. – Mike 'Pomax' Kamermans Mar 20 '18 at 14:45