I have a series of PDF files I need to search for keywords, but many of them contain a huge amount of hidden text. What I mean is when you try to CTRL+F to see how many key words are named "CJP" there are about 35 results, but in reality there are only about 9 that are actually visible, the rest just seem to be randomly hidden all over the page. I have tried out several APIs with them all reading 35 and not 9, so I wanted to try out this class named TextRenderInfo
in ITextSharp
because the method GetTextRenderMode
is suppose to return 3 if the text is hidden, meaning I can use that to ignore strings that are invisable.
Here is my current code:
static void Main(string[] args)
{
Gerdau.ITextSharpCount(@"Source.pdf", "CJP");
}
public static int ITextSharpCount(string filePath, string searchString)
{
StringBuilder sb = new StringBuilder();
string file = filePath;
using (PdfReader reader = new PdfReader(file))
{
for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
{
textRenderInfo.GetTextRenderMode();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
sb.Append(text);
}
}
int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
return numberOfMatches;
}
The issue is I don't know how to set up the TextRenderInfo
class to check for the hidden text. If anyone knows how to do it, it would be a huge help and more code the merrier :).