3

UPDATE: Please see https://softwarerecs.stackexchange.com/questions/71464/java-library-to-insert-invisible-text-into-a-pdf instead.

I want to insert invisible text into an existing PDF file, to make it searchable.

What library should I use?
I would appreciate links to specific API methods to use.

Free, ideally open source.
Thanks a lot!

(For the curious: I want to automatically OCR incoming scanned papers and make them searchable, in an Alfresco repository)

Nicolas Raoul
  • @AndrewMorton *"Does this answer your question?"* - that is very unlikely. The question here after all is about *regular text* which merely shall be invisible, not *metadata*. Furthermore, the question is nearly 9 years old and closed with an accepted answer. Chances are the op meanwhile is not dealing with that issue anymore... – mkl Jan 02 '20 at 15:12
  • @mkl The OP may have been unaware at the time that metadata could be added to a PDF document, and that it will be [indexed by Alfresco](https://docs.alfresco.com/6.0/references/dev-extension-points-custom-metadata-extractor.html). The question would be regarded as off-topic nowadays as it's asking for a library, but I thought that the duplicate would be more useful. – Andrew Morton Jan 02 '20 at 15:25
  • Still a useful question, but now recommendations have their own site so I just posted the same question there: https://softwarerecs.stackexchange.com/questions/71464/java-library-to-insert-invisible-text-into-a-pdf – Nicolas Raoul Jan 02 '20 at 15:27
  • @AndrewMorton no. This closing as duplicate is incorrect. The amount of OCR'ed text makes pdf metadata the completely wrong place to put it. – mkl Jan 02 '20 at 16:16

3 Answers

4

There are 3 options. My answers are iText-specific, but you should be able to translate the underlying methods to any sufficiently advanced PDF library.

  1. Text render mode 3: "No stroke, no fill". With iText: `myPdfContentByte.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);`
  2. Draw the text behind something. You're presumably using scanned page images. iText's `myPdfStamper.getUnderContent(pageNum)` makes this easy, and will let you draw the text under the scan. Other libraries that let you access a page's contents might require you to add your text 'in the raw' at the beginning of an existing content stream. You'll want to check out the "PDF Spec" (google that, you'll be fine) for details. Chapter 9 is all about text rendering.
  3. Draw the text outside the page's media or crop box. If you just want some random PDF-savvy search engine to turn up your page this will work, but if you want people looking at the PDF to see the appropriate text selection box... not so much.
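
For option 1 'in the raw', here is a minimal sketch of the content stream operators involved, using only the Java standard library. The class and method names (`InvisibleTextOps`, `invisibleText`) and the font resource name `F1` are made up for illustration. It assumes the page's resource dictionary already defines a font under that name and that the text is plain ASCII in a simple font encoding. Getting the resulting bytes into the page's content stream is left to whatever PDF library you use.

```java
import java.util.Locale;

class InvisibleTextOps {

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escapePdfString(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Builds a content stream fragment that shows text at (x, y) using
     * text render mode 3 (no stroke, no fill), i.e. invisible text.
     * fontName is a hypothetical resource name such as "F1".
     */
    static String invisibleText(String fontName, double size,
                                double x, double y, String text) {
        // q/Q save and restore the graphics state so the render mode
        // does not leak into whatever follows in the stream.
        return String.format(Locale.ROOT,
                "q BT /%s %.2f Tf 3 Tr 1 0 0 1 %.2f %.2f Tm (%s) Tj ET Q\n",
                fontName, size, x, y, escapePdfString(text));
    }

    public static void main(String[] args) {
        System.out.print(invisibleText("F1", 12, 72, 700, "Invoice (copy)"));
    }
}
```

The `3 Tr` operator is what makes the text invisible; everything else is ordinary text positioning and showing.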
Mark Storer
1

This shows how to create a PDF document containing text and this shows how to add an image. Add the text first and then add the image on top of it - the text will become 'invisible' to the end user but will remain searchable by search engines. This may also be useful.

nikhil500
  • I don't want to add an image. As I said, I am modifying an *existing* PDF file. – Nicolas Raoul Mar 02 '11 at 03:19
  • Ok, I somehow assumed that the scanned pages are images. In that case, [this](http://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/Overlay.java?view=markup) may help - you can create a new PDF with the text and overlay the original PDF on top of it. – nikhil500 Mar 02 '11 at 05:09
0

You do not have to render the text invisible. Just render it in the appropriate place, but overlay the scanned image on the text. Or, you could render the text over the image and set the alpha value of the stroke and brush colors to zero.
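
Putting the text "in the appropriate place" usually means converting OCR pixel coordinates into PDF units. Here is a rough sketch of that arithmetic. It assumes the OCR engine reports word boxes in pixels with the origin at the top-left of the scan, that the scan resolution (DPI) is known, that the image fills the whole page, and that the page uses the default PDF user space (1 unit = 1/72 inch, origin at the bottom-left, no rotation). The class and method names are made up for illustration.

```java
class OcrToPdfCoords {

    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in scan pixels to PDF points. */
    static double toPoints(double pixels, double dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    /**
     * Returns {x, y} in PDF points for the lower-left corner of a word box.
     * leftPx and bottomPx are measured from the top-left of the scan,
     * so the y axis has to be flipped.
     */
    static double[] lowerLeft(double leftPx, double bottomPx,
                              double imageHeightPx, double dpi) {
        double x = toPoints(leftPx, dpi);
        double y = toPoints(imageHeightPx - bottomPx, dpi);
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        // A word whose box starts 300 px from the left and ends 600 px
        // from the top, on a 3300 px tall scan at 300 DPI.
        double[] p = lowerLeft(300, 600, 3300, 300);
        System.out.println(p[0] + " " + p[1]);
    }
}
```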

BZ1
  • Sure, as long as the text is not visible to the end user, anything is fine (that's what I meant by "invisible"). What API methods of what library would you use for this? – Nicolas Raoul Feb 28 '11 at 05:59
  • If you already have the OCR'd text and the scanned image from some other component, then most PDF libraries will be able to render the scanned image on a page and then the individual text-outs over that. You should render the text on the page, not on the image; just overlay the text elements on the image element in the PDF page. I work for a company (www.gnostice.com) that makes commercial PDF components, but my guess is that you should be able to do this using PDFBox or iText. – BZ1 Feb 28 '11 at 11:14