I am using iTextSharp to extract text from a PDF within a particular rectangle.

Filtering by the rectangle's height works fine, but filtering by its width does not: the whole line is returned instead of only the words inside the rectangle.

The code I am using is below:

    PdfReader reader = new PdfReader(Home.currentInstance.Get_PDF_URL());
    iTextSharp.text.Rectangle pageRectangle = reader.GetPageSize(currentPage);

    // Scale the on-screen selection (canvas units) to PDF page units.
    float selection_x = ((float)(selectionRectangle.RenderTransform.Value.OffsetX) / (float)canvas.Width) * pageRectangle.Width;
    // Canvas Y grows downward while PDF Y grows upward, so flip the Y coordinate.
    float selection_y = pageRectangle.Height - (((float)(selectionRectangle.RenderTransform.Value.OffsetY) / (float)canvas.Height) * pageRectangle.Height);
    float selection_height = ((float)(selectionRectangle.Height) / (float)canvas.Height) * pageRectangle.Height;
    float selection_width = ((float)(selectionRectangle.Width) / (float)canvas.Width) * pageRectangle.Width;
    // Move from the top edge of the selection down to its bottom edge.
    selection_y -= selection_height;

    // Keep only the text that the filter accepts for the selection rectangle.
    RectangleJ rect = new RectangleJ(selection_x, selection_y, selection_width, selection_height);
    RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
    String pageText = PdfTextExtractor.GetTextFromPage(reader, currentPage, strategy);
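
For reference, the coordinate conversion in the snippet above can be isolated as a small helper. This is only a sketch: `SelectionMapper` and `ToPdfRectangle` are hypothetical names, and it assumes, as the code above does, that the page's lower-left corner is at (0, 0) and that the page is not rotated.

```csharp
using System;

public static class SelectionMapper
{
    // Maps a selection given in canvas units (origin top-left, Y down)
    // to PDF units (origin bottom-left, Y up).
    // Returns the lower-left corner plus width and height.
    public static (float X, float Y, float Width, float Height) ToPdfRectangle(
        float offsetX, float offsetY, float selWidth, float selHeight,
        float canvasWidth, float canvasHeight, float pageWidth, float pageHeight)
    {
        float scaleX = pageWidth / canvasWidth;
        float scaleY = pageHeight / canvasHeight;

        float x = offsetX * scaleX;
        float width = selWidth * scaleX;
        float height = selHeight * scaleY;

        // Flip Y, then step down from the selection's top edge to its bottom edge.
        float y = pageHeight - offsetY * scaleY - height;

        return (x, y, width, height);
    }
}
```

With a 300 x 400 canvas and a 600 x 800 page, a 100 x 50 selection at canvas offset (50, 100) maps to a 200 x 100 rectangle whose lower-left corner is at (100, 500).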

Any help will be highly appreciated.

Thanks in advance

Aman Chhabra

1 Answer

I was finally able to resolve the issue.

I created the following class, which wraps an existing strategy and forwards each text chunk to it one character at a time:

    using iTextSharp.text.pdf.parser;

    // Wraps another extraction strategy and feeds it text one character at a time.
    public class LimitedTextStrategy : ITextExtractionStrategy
    {
        public readonly ITextExtractionStrategy textextractionstrategy;

        public LimitedTextStrategy(ITextExtractionStrategy strategy)
        {
            this.textextractionstrategy = strategy;
        }

        public void RenderText(TextRenderInfo renderInfo)
        {
            // Split the chunk into per-character render infos and forward each one,
            // so the wrapped strategy sees every character separately.
            foreach (TextRenderInfo info in renderInfo.GetCharacterRenderInfos())
            {
                this.textextractionstrategy.RenderText(info);
            }
        }

        // The remaining members simply delegate to the wrapped strategy.
        public string GetResultantText()
        {
            return this.textextractionstrategy.GetResultantText();
        }

        public void BeginTextBlock()
        {
            this.textextractionstrategy.BeginTextBlock();
        }

        public void EndTextBlock()
        {
            this.textextractionstrategy.EndTextBlock();
        }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
            this.textextractionstrategy.RenderImage(renderInfo);
        }
    }

and then changed the extraction line to

String pageText = PdfTextExtractor.GetTextFromPage(reader, currentPage, new LimitedTextStrategy(strategy));

Now it works fine. I hope this helps someone else as well.
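
As far as I understand it, the reason this helps is that the region filter accepts or rejects a text chunk as a whole, based on whether the chunk's baseline intersects the rectangle. A chunk often spans an entire line, so a rectangle that covers only part of the line still lets the whole line through. Splitting the chunk into characters lets each character be tested separately. The sketch below illustrates the idea without iTextSharp: `Chunk` and `RegionFilterDemo` are hypothetical stand-ins, only the horizontal axis is modelled, and all glyphs are assumed to have equal width.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk drawn on a horizontal baseline.
public sealed class Chunk
{
    public string Text { get; }
    public float X1 { get; }   // baseline start
    public float X2 { get; }   // baseline end

    public Chunk(string text, float x1, float x2)
    {
        Text = text;
        X1 = x1;
        X2 = x2;
    }

    // One chunk per character, assuming equal glyph widths (a simplification).
    public IEnumerable<Chunk> SplitIntoCharacters()
    {
        float glyphWidth = (X2 - X1) / Text.Length;
        for (int i = 0; i < Text.Length; i++)
        {
            yield return new Chunk(Text[i].ToString(),
                                   X1 + i * glyphWidth,
                                   X1 + (i + 1) * glyphWidth);
        }
    }
}

public static class RegionFilterDemo
{
    // A chunk is accepted if its baseline overlaps [left, right] at all.
    public static bool Intersects(Chunk c, float left, float right)
    {
        return c.X1 <= right && c.X2 >= left;
    }

    // Filtering whole chunks: a partially covered chunk is kept in full.
    public static string FilterWholeChunks(IEnumerable<Chunk> chunks, float left, float right)
    {
        return string.Concat(chunks.Where(c => Intersects(c, left, right)).Select(c => c.Text));
    }

    // Filtering per character: only the characters inside the region survive.
    public static string FilterPerCharacter(IEnumerable<Chunk> chunks, float left, float right)
    {
        return FilterWholeChunks(chunks.SelectMany(c => c.SplitIntoCharacters()), left, right);
    }
}
```

For a single chunk `"Hello World"` spanning x = 0 to 110 and a region from 61 to 109, `FilterWholeChunks` returns the whole line, while `FilterPerCharacter` returns only `"World"`.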

Aman Chhabra
  • Some explanations can be found in [this answer](http://stackoverflow.com/questions/21000256/pdf-reading-highlighed-text-highlight-annotations-using-c-sharp/21023311#21023311), which focuses on the equivalent solution for iText / Java. – mkl Mar 01 '14 at 08:23
  • `public void RenderImage(ImageRenderInfo renderInfo) { this.textextractionstrategy.RenderImage(renderInfo); }` is useless. The text extractor will ignore this method. If you do need to render images, you can implement a custom RenderListener instead of a custom text extraction strategy. – Silent Sojourner Jul 06 '17 at 19:17
  • @Silent Sojourner I am not aware of any issue with the above implementation, and it is working fine in one of my live applications. If you are facing an issue, please share it and I will try to sort it out. – Aman Chhabra Jul 08 '17 at 00:33