I am trying to extract text from a PDF document, and I am wondering how PDF handles bulleted paragraphs. Consider this example:

[Image: two bulleted paragraphs]

Does PDF retain any logical meta-information indicating that the two chunks of text shown above are members of a bulleted list, or is it left to the human reader to interpret the bullet symbols? This information would be very helpful to me in developing the text-mining tool I am currently working on.

Thanks, S

Sau001
  • I don't think the PDF knows about bulleted lists. It's likely just outlines and text blocks, but it depends on which tool you use. Acrobat Pro can convert PDFs to Word files. – Dan Wilson Apr 18 '18 at 20:34
  • @Sau Is your PDF tagged or not? If it is not tagged, chances are the bullets are merely symbols drawn somewhere on the page. They may be drawn using a symbol font, but they may also be drawn using PDF vector graphics instructions. – mkl Apr 18 '18 at 20:54
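
If the PDF is untagged, as the comments suggest may be the case, a text-mining tool has to infer list structure from the extracted text itself. The following is only a minimal sketch of that idea, not a definitive approach. It assumes the text has already been extracted line by line with some other tool, and that the bullets survive extraction as ordinary characters, which will not hold if they were drawn with vector graphics. The names `BULLET_RE` and `group_bullets` and the set of bullet glyphs are hypothetical choices made for illustration.

```python
import re
from typing import List

# Glyphs treated as bullets. This set is an assumption; extend it to
# match whatever symbols your extractor actually produces.
BULLET_RE = re.compile(r"^\s*([\u2022\u25E6\u25AA\u2023\u00B7\-\*])\s+(.*)$")


def group_bullets(lines: List[str]) -> List[str]:
    """Merge extracted text lines into one string per bulleted item.

    A line starting with a bullet glyph opens a new item. Following
    non-blank lines are treated as wrapped continuations of that item.
    A blank line closes the current item. Text outside any item is ignored.
    """
    items: List[str] = []
    in_item = False
    for line in lines:
        match = BULLET_RE.match(line)
        if match:
            items.append(match.group(2).strip())
            in_item = True
        elif not line.strip():
            in_item = False
        elif in_item:
            items[-1] += " " + line.strip()
    return items


if __name__ == "__main__":
    sample = [
        "Intro",
        "\u2022 First point that",
        "  wraps here",
        "\u2022 Second point",
        "",
        "Trailing paragraph",
    ]
    for item in group_bullets(sample):
        print(item)
```

This heuristic works purely on characters and line breaks, so it cannot tell a real list item from a sentence that happens to start with a hyphen. For a tagged PDF, the structure tree would be a more reliable source than this kind of guessing.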

0 Answers