
I am using iText to parse text from PDF files. I found that some of the returned text is not visible on the page. For example, I got:

  • “financial derivative”, which does exist on the page
  • “FINANCIAL DERIVATIVE”, which is not visible but is still returned by iText. In addition, it cannot be selected in Adobe Acrobat or Foxit.

Does iText have a way to differentiate visible from invisible text? Or does the PDF format specify anything relating to this?

Feng

Feng Qing
  • How is that particular text made "invisible"? Is there something on top of it? Very small? Having no color, or the same as its background? Off the page? Rendered in a font containing no outlines? Rendered in a font that maps everything to a space character? – Jongware Oct 30 '14 at 18:43

1 Answer


As you give no code, it's hard to suggest actual code. However, the customary way to make text invisible is to use the text render mode. All text in a PDF has a text render mode, which determines whether the text is rendered as filled text (the normal case), stroked text, filled and stroked text, and so on. One of the possible modes is "invisible", which means the text is not painted at all.
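To make the render mode values concrete, here is a minimal sketch of the eight modes as listed in Table 106 of ISO 32000-1 (the operand of the Tr operator). The enum and its constant names are illustrative only; they are not part of iText or of any PDF library.

```java
/** Text rendering modes from Table 106 of ISO 32000-1. Names are illustrative. */
enum TextRenderMode {
    FILL(0),                      // normal filled text
    STROKE(1),                    // outline only
    FILL_THEN_STROKE(2),
    INVISIBLE(3),                 // neither fill nor stroke
    FILL_AND_CLIP(4),
    STROKE_AND_CLIP(5),
    FILL_THEN_STROKE_AND_CLIP(6),
    CLIP(7);                      // only adds the text to the clipping path

    final int code;

    TextRenderMode(int code) {
        this.code = code;
    }

    /** Maps the integer operand of Tr to a mode, rejecting values outside 0..7. */
    static TextRenderMode fromCode(int code) {
        for (TextRenderMode mode : values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown text render mode: " + code);
    }

    /** True only for mode 3, the mode the specification itself labels "invisible". */
    boolean isInvisible() {
        return this == INVISIBLE;
    }
}
```

With this, `TextRenderMode.fromCode(3).isInvisible()` is true, while mode 0 (ordinary filled text) is not.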

When parsing text on a page, iText lets you, among other things, filter the text that is returned; see the FilteredRenderListener, for example. During filtering you can decide whether you are interested in a given piece of text. The TextRenderInfo object exposes a lot of information about the text for you to inspect, including a method called "getTextRenderMode" that returns the text render mode described above. If that call returns 3, you know the text is rendered invisibly.
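The filtering decision described above can be sketched as follows. To keep the example runnable without iText on the classpath, it uses a hypothetical `Chunk` class as a stand-in for the two values taken from TextRenderInfo; the class names, the sample text, and the sample render modes are assumptions for illustration, not taken from the question's PDF.

```java
import java.util.ArrayList;
import java.util.List;

class VisibleTextFilterSketch {
    // Render mode 3: neither fill nor stroke (invisible).
    static final int INVISIBLE_RENDER_MODE = 3;

    /** Hypothetical stand-in for the text and render mode read from TextRenderInfo. */
    static final class Chunk {
        final String text;
        final int renderMode;

        Chunk(String text, int renderMode) {
            this.text = text;
            this.renderMode = renderMode;
        }
    }

    // With iText 5, this check would presumably live in a RenderFilter subclass:
    //   @Override public boolean allowText(TextRenderInfo info) {
    //       return info.getTextRenderMode() != 3;
    //   }
    // and the filter would be handed to a filtering render listener.
    static boolean allowText(Chunk chunk) {
        return chunk.renderMode != INVISIBLE_RENDER_MODE;
    }

    /** Keeps only the chunks whose render mode is not "invisible". */
    static List<String> visibleText(List<Chunk> chunks) {
        List<String> kept = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (allowText(chunk)) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("financial derivative", 0)); // assumed: ordinary filled text
        page.add(new Chunk("FINANCIAL DERIVATIVE", 3)); // assumed: invisible text
        System.out.println(visibleText(page));          // prints [financial derivative]
    }
}
```

This only catches text hidden through the render mode; text hidden by other means, such as being covered or drawn off the page, would pass this filter.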

Now, if you want to know for sure whether this text is indeed rendered invisibly (and not hidden by one of the other nasty tricks @jongware suggests in his comment), you'll have to inspect the PDF or share an example with us so that we can take a look.

David van Driessche
  • Interesting--does this particular rendering mode translate to [this "no stroke, no fill"](http://stackoverflow.com/a/5184903/2564301) or is it actually part of the PDF specification? (In which case I'll be happy to add it to my bag o' tricks!) – Jongware Oct 30 '14 at 21:45
  • (*one of the possibilities is "invisible" which makes sure the text isn't shown*) - *does this particular rendering mode translate to this "no stroke, no fill" or is it actually part of the PDF specification* - Table 106 of ISO 32000-1 describes rendering mode 3 as **Neither fill nor stroke text (invisible)**, so **no stroke, no fill** and **invisible** should indeed coincide. – mkl Oct 30 '14 at 23:17