Extra spaces and newline characters, and unable to identify bold headers, when reading a PDF from Python

Asked Aug 31 '23 at 09:45

Active Aug 31 '23 at 10:04

Viewed 25 times

-4

/* Hi everyone,

I have a PDF file that has some bold side headings (visually bold, not capital letters). The paragraphs between the headings are considered the sections.

I am searching for a particular word in the PDF. If any section contains that word, I want to display the entire section along with its heading. I think that can be achieved with newline-character logic. However, the text extracted from the PDF contains extra spaces and extra newline characters, so that logic fails.

Can anyone help with this situation? */
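
For the extra spaces and newlines described above, here is a minimal sketch of a cleanup pass that could run before any newline-based splitting. The function name `normalize_whitespace` is hypothetical, and the sketch assumes a blank line is the only newline pattern worth keeping as a paragraph break; real extractor output may need different rules.

```python
import re


def normalize_whitespace(text):
    """Collapse noisy whitespace in text extracted from a PDF (illustrative only)."""
    # Use "\n" as the only line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs and non-breaking spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Drop the single space left on either side of a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Heading  one \n\n\n\n  body   text\t here\r\n"
print(repr(normalize_whitespace(raw)))
```

This only cleans whitespace. It cannot tell a bold heading from body text, because plain extracted text carries no font information.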


Uma Mahesh

4

Specify, with an example, exactly what you need and what you've tried – adityanithariya Aug 31 '23 at 09:50
1

Check this to learn how to make a [reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) – Ant0ine64 Aug 31 '23 at 09:58
1

Technically, the OP has provided code. The question is clearly commented code (surrounded by `/*` `*/`). The only problem is that it is not formatted as code, and `/*` `*/` are not valid comment delimiters in Python – Sembei Norimaki Aug 31 '23 at 10:11
Try the PyMuPDF package. It lets you extract text in multiple ways: by word, by paragraph, with associated information (font, font characteristics, position and color information, writing direction, ...) – Jorj McKie Aug 31 '23 at 10:47
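
A rough sketch of how the PyMuPDF suggestion might be applied to the question, not a tested solution for the OP's file. It assumes headings are lines whose spans are all bold, detected through the bold bit (16) in each span's `flags` or a font name containing "bold". The helper names `read_lines`, `group_sections` and `find_sections` are hypothetical, and text before the first heading is simply dropped.

```python
BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" value marks bold text


def read_lines(pdf_path):
    """Yield (text, is_bold) for every non-empty text line in the PDF."""
    import fitz  # PyMuPDF; imported here so the helpers below work without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks have no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    spans = [s for s in line["spans"] if s["text"].strip()]
                    if not spans:
                        continue
                    # Join the spans and collapse repeated whitespace.
                    text = " ".join("".join(s["text"] for s in spans).split())
                    # Assumption: a heading line consists only of bold spans.
                    is_bold = all(
                        s["flags"] & BOLD_FLAG or "bold" in s["font"].lower()
                        for s in spans
                    )
                    yield text, is_bold


def group_sections(lines):
    """Start a new section at each bold line; other lines join the current one."""
    sections = []
    for text, is_bold in lines:
        if is_bold:
            sections.append({"heading": text, "body": []})
        elif sections:
            sections[-1]["body"].append(text)
    return sections


def find_sections(sections, word):
    """Return the sections whose heading or body contains word (case-insensitive)."""
    needle = word.lower()
    return [
        s for s in sections
        if needle in " ".join([s["heading"], *s["body"]]).lower()
    ]


# Usage with a real file (the path is a placeholder):
# for s in find_sections(group_sections(read_lines("input.pdf")), "budget"):
#     print(s["heading"])
#     print("\n".join(s["body"]))
```

Because sections are keyed on font information rather than newlines, stray spaces and blank lines in the extracted text no longer affect where a section starts or ends.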


0 Answers