I am working on a project where I need to break up 10-Ks into their constituent paragraphs. For some 10-Ks I am able to do something simple like `soup.find_all('p')`, but I am also seeing other 10-Ks that use `<div>` for everything instead of `<p>` tags. Here are three different ways I am seeing companies declare paragraph breaks:
Case where empty div tags are used to create space between paragraphs:
<div></div><div>Text of a paragraph</div><div></div>
Case where margins/padding are used on either the top or bottom to create space:
`<div style="padding-top: 10pt">Text of a paragraph</div>`, `<div style="margin-bottom: 10pt"></div>`
Case where the company uses `<br>` tags:
<div><br></div><div>Text of paragraph</div><div><br></div>
I have had to write new code for each of these three cases, and I am worried that there could be other ways of marking paragraphs that I haven't encountered yet.
QUESTION: Is there a package or method I can use to standardize all these different ways of declaring paragraph breaks, or should I continue to write code for each new case I encounter?
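As a starting point for discussion, here is a minimal sketch of one way the three cases above might be handled in a single pass instead of case by case. It assumes that a paragraph is any non-empty run of text between block-level tag boundaries, so spacer divs, margin-only divs, and `<br>`-only divs all collapse to nothing. It uses the standard library's `html.parser` rather than BeautifulSoup to stay dependency-free. The names `ParagraphSplitter`, `split_paragraphs`, and `BLOCK_TAGS` are hypothetical, and the tag set is a guess that would likely need tuning for real filings.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as paragraph boundaries; extend as needed.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphSplitter(HTMLParser):
    """Collects runs of text separated by block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Collapse whitespace and drop empty runs (spacer divs, <br>-only divs).
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "br":
            # Assumption: a <br> inside text is a line break, not a paragraph break.
            self._buffer.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def split_paragraphs(html):
    """Return the list of non-empty text runs found in the HTML string."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    sample = (
        '<div></div><div>First paragraph</div><div></div>'
        '<div style="padding-top: 10pt">Second paragraph</div>'
        '<div><br></div><div>Third paragraph</div><div><br></div>'
    )
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

This ignores styling entirely, so it would not detect paragraph breaks expressed only as consecutive `<br>` tags inside one `<div>`, and it would also pick up text inside `<script>` or `<style>` elements if a filing contains them.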