
I have a series of PDF files that I need to search for keywords, but many of them contain a large amount of hidden text. For example, when I use CTRL+F to count occurrences of the keyword "CJP", I get about 35 results, but only about 9 of them are actually visible; the rest seem to be hidden at random places all over the page. I have tried several APIs, and all of them report 35 rather than 9. So I wanted to try the TextRenderInfo class in iTextSharp, because its GetTextRenderMode method is supposed to return 3 if the text is hidden, which would let me ignore strings that are invisible.

Here is my current code:

static void Main(string[] args)
{
    Gerdau.ITextSharpCount(@"Source.pdf", "CJP");
}

public static int ITextSharpCount(string filePath, string searchString)
{
    StringBuilder sb = new StringBuilder();
    string file = filePath;
    using (PdfReader reader = new PdfReader(file))
    {
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            // Does not compile: there is no TextRenderInfo instance in scope here.
            textRenderInfo.GetTextRenderMode();
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
            text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
            sb.Append(text);
        }
    }
    int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
    return numberOfMatches;
}

The issue is that I don't know how to get hold of a TextRenderInfo instance so that I can check for hidden text. If anyone knows how to do this, it would be a huge help, and the more code the merrier :).

Lukeluke
*"because the method GetTextRenderMode is supposed to return 3 if the text is hidden"* - beware, a `TextRenderMode` value 3 is but one of many possible causes for invisible text. The text can also be normally drawn (with a different `TextRenderMode` value) and then be covered by something, or be drawn outside the page area or the current clip path, or it may be drawn white on white, or it may be drawn in a glyphless font, or it may be drawn all transparent or in a blend mode that hides it... thus, checking the `TextRenderMode` alone quite likely won't help you. – mkl Jun 16 '21 at 04:48
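
For the narrow case the question asks about (render mode 3 only), one possible way to reach a `TextRenderInfo` is to wrap the extraction strategy in a `FilteredTextRenderListener` with a custom `RenderFilter`. The following is a minimal sketch, assuming iTextSharp 5.5.x; the names `VisibleTextFilter`, `PdfKeywordCounter` and `CountVisible` are hypothetical and not part of the library. As the comment above warns, this does not detect text that is covered, clipped, drawn white on white, or otherwise hidden, so the count may still be too high.

```csharp
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical filter: drops text chunks drawn with text render mode 3
// (neither filled nor stroked, i.e. invisible).
class VisibleTextFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        return renderInfo.GetTextRenderMode() != 3;
    }
}

static class PdfKeywordCounter
{
    // Hypothetical helper: counts matches of searchString in the text
    // that passes VisibleTextFilter, across all pages of the PDF.
    public static int CountVisible(string filePath, string searchString)
    {
        var sb = new StringBuilder();
        using (var reader = new PdfReader(filePath))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so text does not carry over.
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new SimpleTextExtractionStrategy(), new VisibleTextFilter());
                sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        // Escape the keyword so it is matched literally, not as a regex pattern.
        return Regex.Matches(sb.ToString(), Regex.Escape(searchString)).Count;
    }
}
```

It could be called as `PdfKeywordCounter.CountVisible(@"Source.pdf", "CJP")`. If the result is still 35, the hidden text is most likely drawn with a different render mode and hidden by one of the other mechanisms listed in the comment.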
