
We used the algorithm suggested in the link above to read highlighted text from PDF files. It works fine for single-line highlights.

However, the text read from multi-line highlights is incorrect: the extracted text is completely jumbled with text from neighboring lines.

Please help us resolve this issue. Thanks.

user5342176
  • Thanks, Tilman, for your quick response. I did not find any option to share the document on Stack Overflow. Shall I share it by uploading it to Google Drive? – user5342176 Jul 23 '18 at 08:06
  • Uploaded the document at [link](https://drive.google.com/file/d/1nXtTMwjLeCHL4nlZU1yfIAeQhBEkuAtY/view). Three highlights are present on page 1. The first one is a single-line highlight and the second and third are multi-line highlights. – user5342176 Jul 23 '18 at 12:01
  • The text read from the 2nd and 3rd highlights is jumbled. For example, the text read from the 3rd highlight is: "iad circumstances, including cashing a check, At one time, the federal government assigned social closing on a loan, gaining employment, and securing access to a commercial airplane. At one security numbers for certain valid nonwork purposes, including for the purpose of obtaining". – user5342176 Jul 23 '18 at 12:11
  • That is hardly "jumbled". The highlight touches the text above and so the extraction took that line too. – Tilman Hausherr Jul 23 '18 at 12:45
  • Indeed, if you want to extract only text *completely contained* in the marked area (in contrast to *at least partially contained* therein), you either have to fudge the `PDFTextStripperByArea` a bit to test differently, or you have to shrink the rectangles you retrieved from the **QuadPoints** a bit vertically so they do not include text from other lines. – mkl Jul 23 '18 at 15:16
  • @user5342176 Could you resolve your issue given the tip above? Or do you still need help? – mkl Aug 07 '18 at 10:28
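
The second option from the comments, shrinking each **QuadPoints** rectangle vertically, could look roughly like the sketch below. It is not taken from the linked algorithm. The class name `HighlightRegions`, the method `shrunkRegions`, and the `insetFraction` parameter are hypothetical names introduced for illustration. The sketch assumes an unrotated page whose visible area starts at the origin, and that the caller supplies the page height so the y axis can be flipped to the top-left-based coordinates that `PDFTextStripperByArea` regions use.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns highlight QuadPoints into vertically shrunk regions.
final class HighlightRegions {
    private HighlightRegions() {}

    /**
     * quadPoints:    8 floats (4 x/y pairs) per highlighted line, in PDF user space
     *                (origin at the bottom-left of the page).
     * pageHeight:    page height in the same units, used to flip the y axis.
     * insetFraction: share of each quad's height cut off at the top AND at the
     *                bottom; an assumed tuning value, must be in [0, 0.5).
     */
    static List<Rectangle2D> shrunkRegions(float[] quadPoints, float pageHeight,
                                           float insetFraction) {
        if (quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException("expected 8 values per quad");
        }
        if (insetFraction < 0f || insetFraction >= 0.5f) {
            throw new IllegalArgumentException("insetFraction must be in [0, 0.5)");
        }
        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE, maxX = -Float.MAX_VALUE;
            float minY = Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Bounding box of the four corners, independent of their order.
            for (int j = 0; j < 8; j += 2) {
                minX = Math.min(minX, quadPoints[i + j]);
                maxX = Math.max(maxX, quadPoints[i + j]);
                minY = Math.min(minY, quadPoints[i + j + 1]);
                maxY = Math.max(maxY, quadPoints[i + j + 1]);
            }
            // Pull the top and bottom edges inward so neighboring lines are not touched.
            float inset = (maxY - minY) * insetFraction;
            float top = maxY - inset;
            float bottom = minY + inset;
            // Flip y: the returned rectangle measures y downward from the top of the page.
            regions.add(new Rectangle2D.Float(
                    minX, pageHeight - top, maxX - minX, top - bottom));
        }
        return regions;
    }
}
```

Each returned rectangle could then be registered with `PDFTextStripperByArea.addRegion(name, rect)`, and the per-line texts concatenated in quad order after `extractRegions(page)`. The right `insetFraction` depends on the font and on how the highlights were drawn, so it would need tuning against the actual document.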

0 Answers