I am developing a PDF reader. I want to search for a string in a PDF and find the page number it appears on. I am using iTextSharp.
You'll need to extract text from every page; check out PdfTextExtractor: http://stackoverflow.com/a/4893285/231316 – Chris Haas Apr 22 '12 at 15:13
2 Answers
Something like this should work:
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// add any string you want to match on
Regex regex = new Regex("the",
    RegexOptions.IgnoreCase | RegexOptions.Compiled
);
PdfReader reader = new PdfReader(pdfPath);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// PDF page numbers are 1-based
for (int i = 1; i <= reader.NumberOfPages; i++) {
    ITextExtractionStrategy strategy = parser.ProcessContent(
        i, new SimpleTextExtractionStrategy()
    );
    if (regex.IsMatch(strategy.GetResultantText())) {
        // do whatever with corresponding page number i...
    }
}
reader.Close();
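If you need every matching page rather than an action inside the loop, the same idea can be wrapped in a helper. This is only a sketch, assuming iTextSharp 5.x: the class name PdfSearch and the method FindPages are made up for illustration, and it uses PdfTextExtractor.GetTextFromPage with a plain case-insensitive substring check instead of a Regex.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
static class PdfSearch
{
    // Returns the 1-based numbers of every page whose extracted text contains term.
    public static List<int> FindPages(string pdfPath, string term)
    {
        var pages = new List<int>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, i, new SimpleTextExtractionStrategy());

                // Case-insensitive substring match on the extracted page text.
                if (text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    pages.Add(i);
                }
            }
        }
        finally
        {
            // Release the underlying file handle even if extraction throws.
            reader.Close();
        }
        return pages;
    }
}
```

Calling PdfSearch.FindPages(pdfPath, "the") would then give you the list of page numbers to jump to. Keep in mind that text extraction depends on how the PDF was produced, so a phrase that is split across text chunks or lines may not match as a single string.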

kuujinbo
Besides iTextSharp,
you can use Acrobat.dll
to find the current page number. First open the PDF file and search for the string using
Acroavdoc.Open("Filepath", "Temporary title")
and
Acroavdoc.FindText("String").
If the string is found in the PDF file, the view moves to the page that contains it and the match is highlighted. Then call Acroavpageview.GetPageNum()
to get the current page number. Note that GetPageNum() is zero-based: the first page of the document is page 0.
Imports Acrobat

Dim AcroXAVDoc As CAcroAVDoc
Dim Acroavpage As AcroAVPageView
Dim AcroXApp As CAcroApp
AcroXAVDoc = CType(CreateObject("AcroExch.AVDoc"), Acrobat.CAcroAVDoc)
AcroXApp = CType(CreateObject("AcroExch.App"), Acrobat.CAcroApp)
AcroXAVDoc.Open(TextBox1.Text, "Original document")
' FindText arguments: text, case-sensitive, whole words only, reset to first page
If AcroXAVDoc.FindText("String to search for", True, True, False) Then
    Acroavpage = AcroXAVDoc.GetAVPageView()
    ' GetPageNum is zero-based, so add 1 for a human-readable page number
    Dim x As Integer = Acroavpage.GetPageNum() + 1
    MsgBox("The string was found on page number " & x)
End If

venkatesh