I am trying to extract italic content from a PDF in Python. I have converted the PDF to HTML so that I can use the italic markup to extract the text. Here is what the HTML looks like:

<br></span></div><div style="position:absolute; border: textbox 1px
solid; writing-mode:lr-tb; left:71px; top:225px; width:422px;
height:15px;"><span style="font-family: TTPGFA+Symbol; font-
size:12px">•</span><span style="font-family: YUWTQX+ArialMT; font-
size:14px">  Kornai, Janos. 1992. </span><span style="font-family:
PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The
Political Economy of Communism</span><span style="font-family:
YUWTQX+ArialMT; font-size:14px">.

This is how the code looks:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/../..myfile.html"))
bTags = []
for i in soup.findAll('span'):
    bTags.append(i.text)

I am not sure how to get only the italic text.

M.D
  • Is the text italic via `<i>`, `<em>`, inline styles, or CSS? – Julien Sep 12 '16 at 19:48
  • I don't know `BeautifulSoup`, but you probably need to get the `style` attribute of the tag and check whether it contains `Italic`. Print `i` to see if it has an `attributes` key, for example. –  Sep 12 '16 at 19:48
  • @Julien from the sample it looks like you need to check the `style` attribute to see if `font-family` contains `Italic`. –  Sep 12 '16 at 19:51
  • @Camil, my point was that there are many ways that text can be italic on an HTML page. Checking the `style` attribute would only cover the case where it is set via inline styles, and none of the other ways. Also, since this HTML is being auto-generated, it's safe to assume that it will be consistent, so it doesn't make sense to cover all the different possibilities. EDIT: Sorry, for some reason I did not see the example HTML... – Julien Sep 12 '16 at 19:55
  • Converting to HTML was the only way I could find to extract italic text from a PDF. – M.D Sep 12 '16 at 19:59

1 Answer

Try this:

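from bs4 import BeautifulSoup

# `html` is assumed to hold the converted HTML as a string
soup = BeautifulSoup(html, "html.parser")
bTags = []
# keep only the spans whose style attribute mentions an italic font
for i in soup.find_all('span', style=lambda x: x and 'Italic' in x):
    bTags.append(i.text)

print(bTags)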

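When you pass a function as the style argument, BeautifulSoup calls it with the value of each tag's style attribute and keeps the tag only if the result is truthy. Spans without a style attribute pass None to the function, which is why the lambda checks x first. It then tests whether the string Italic appears in the attribute, which matches font names such as Arial-ItalicMT.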
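To see the filter in action, here is a minimal, self-contained sketch. The `html` string is a shortened stand-in modelled on the question's markup (the extra unstyled span is added only for illustration), and `has_italic_font` is a hypothetical name for the same check the lambda performs.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's HTML (shortened for illustration).
html = (
    '<div>'
    '<span style="font-family: YUWTQX+ArialMT; font-size:14px">Kornai, Janos. 1992. </span>'
    '<span style="font-family: PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: '
    'The Political Economy of Communism</span>'
    '<span>no style attribute here</span>'
    '</div>'
)

def has_italic_font(style):
    # BeautifulSoup passes None when the tag has no style attribute,
    # so guard against that before the substring test.
    return style is not None and "Italic" in style

soup = BeautifulSoup(html, "html.parser")
italic_texts = [span.get_text() for span in soup.find_all("span", style=has_italic_font)]
print(italic_texts)
# ['The Socialist System: The Political Economy of Communism']
```

A named function behaves the same as the lambda; it just makes the None case explicit.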

You may need a more sophisticated algorithm depending on the rest of what your HTML looks like.
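As one possible direction, the sketch below parses the style attribute into individual declarations instead of searching the whole string, so that only `font-family` and `font-style` are inspected. It is an illustration under assumptions, not a complete solution: the helper names (`parse_style`, `is_italic`, `italic_spans`) are hypothetical, and treating "Oblique" font names and `font-style: italic` as italic is a guess about what other converters might emit.

```python
import re
from bs4 import BeautifulSoup

# Assumption: italic faces carry "Italic" or "Oblique" in the font name.
ITALIC_FONT = re.compile(r"italic|oblique", re.IGNORECASE)

def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} with lowercased keys."""
    declarations = {}
    for part in (style or "").split(";"):
        name, sep, value = part.partition(":")
        if sep:
            declarations[name.strip().lower()] = value.strip()
    return declarations

def is_italic(style):
    """Return True if the inline style selects an italic font."""
    props = parse_style(style)
    if props.get("font-style", "").lower().startswith(("italic", "oblique")):
        return True
    return bool(ITALIC_FONT.search(props.get("font-family", "")))

def italic_spans(html):
    """Collect the text of italic spans, collapsing wrapped whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    return [" ".join(span.get_text().split())
            for span in soup.find_all("span", style=is_italic)]

sample = (
    '<span style="font-family: PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The\n'
    'Political Economy of Communism</span>'
    '<span style="font-family: YUWTQX+ArialMT; font-size:14px">.</span>'
    '<span style="font-style: italic">Also italic</span>'
    '<span style="background: url(italic.png)">not italic</span>'
)
print(italic_spans(sample))
# ['The Socialist System: The Political Economy of Communism', 'Also italic']
```

Compared with a plain substring test, this avoids false matches when the word "italic" shows up in an unrelated declaration, at the cost of a little more code.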

Julien