What information does a PDF document store with regards to bulleted lists?

Asked Apr 18 '18 at 10:40

Active Apr 18 '18 at 20:11

I am trying to extract text from a PDF document, and I am wondering how PDF handles bulleted paragraphs. Consider this example:

Does PDF retain any logical meta-information indicating that the 2 chunks of text shown above are members of a bulleted list, or is it left to the human reader to interpret the bullet symbols? This information would be very helpful for a text mining tool that I am currently developing.

Thanks, S

Sau001

I don't think the PDF knows about bulleted lists. It's likely just outlines and text blocks, but it depends on which tool you use. Acrobat Pro can convert PDFs to Word files. – Dan Wilson Apr 18 '18 at 20:34
@Sau Is your PDF tagged or not? If it is not tagged, chances are the bullets are merely some symbols drawn somewhere on the page. They may be drawn using some symbol font, but they may also be drawn using PDF vector graphics instructions. – mkl Apr 18 '18 at 20:54
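
For the untagged case described in the comment above, a text mining tool has to guess list membership from the extracted text itself. The following is only a minimal sketch of such a heuristic, not something the PDF format provides. It assumes the bullets survive extraction as ordinary characters (bullets drawn with vector graphics will not appear in the text at all), that the extractor preserves leading indentation on continuation lines, and that the glyph set in `BULLET_CHARS` covers the documents at hand. The names `BULLET_CHARS`, `BULLET_RE` and `group_bullets` are hypothetical.

```python
import re

# Glyphs often used as bullets. This set is an assumption and is not exhaustive;
# "\uf0b7" is the private-use code point that Symbol-font bullets often map to.
BULLET_CHARS = "\u2022\u25e6\u25aa\u25cf\u2023\u2043\uf0b7"

# A bullet glyph (or "-" / "*"), then whitespace, then the item text.
BULLET_RE = re.compile(rf"^\s*(?:[{BULLET_CHARS}]|[-*])\s+(?P<body>\S.*)$")


def group_bullets(lines):
    """Classify extracted text lines as ("item", text) or ("text", text).

    An indented, non-bullet line that directly follows a list item is treated
    as a continuation of that item. Blank lines are skipped.
    """
    blocks = []
    for line in lines:
        match = BULLET_RE.match(line)
        if match:
            blocks.append(("item", match.group("body").strip()))
        elif blocks and blocks[-1][0] == "item" and line[:1].isspace() and line.strip():
            # Indented continuation of the previous list item.
            kind, text = blocks[-1]
            blocks[-1] = (kind, text + " " + line.strip())
        elif line.strip():
            blocks.append(("text", line.strip()))
    return blocks


if __name__ == "__main__":
    sample = [
        "Intro:",
        "\u2022 First point",
        "  continues here",
        "\u2022 Second point",
        "",
        "Closing.",
    ]
    for kind, text in group_bullets(sample):
        print(kind, "|", text)
```

If the PDF is tagged, this guesswork should be unnecessary, because the list structure can then be read from the document's structure tree rather than inferred from glyphs.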

0 Answers