I want to extract the title of a page after fetching its HTML and parsing it with the BeautifulSoup library in Python. The whole title tag is:

 <title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

I want to extract only the text between the two &quot; entities, that is: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried this:

import urllib.error
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
titles = []
try:
    # Send a browser-like User-Agent header with the request
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h, "html.parser")
    for i in data.find_all("title"):
        titles.append(i.text)
    # Print once, after the loop, and only if a title was found
    if titles:
        print(titles[0])
except urllib.error.HTTPError as err:
    # Report the failure instead of silently ignoring it
    print(err)

I also tried:

for i in data.find_all("title.&quot"):

for i in data.find_all("title>&quot"):

for i in data.find_all("&quot"):

and

for i in data.find_all("quot"):

But none of them works.

Amar

3 Answers

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser")

Find the title this way:

title = data.find("title").string  # this is without <title> tag

Now find two quotes (") in the string. There are many ways to do that. I would use regex:

import re

# Match from the first double quote to the last one
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

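You never need to search for &quot; or any other &NAME; sequence, because BeautifulSoup converts such entities to the actual characters they represent. That is also why find_all("&quot") finds nothing: find_all looks for tags by name, and there is no such tag.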
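To see the entity conversion for yourself, here is a minimal sketch that needs no network access. The markup is a shortened, made-up title (not the real tweet), and it assumes the bs4 package is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened title used only for illustration
html = '<title>Some User on Twitter: &quot;Hello, world&quot;</title>'

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

print(title)                           # the entities are now real double quotes
print("&quot;" in title)               # False: the entity text is gone
print(len(soup.find_all("&quot")))     # 0: find_all matches tag names, not text
```

So any searching has to be done on the plain string, looking for the " character itself.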

EDIT:

A regex that leaves the quotes out of the match would be:

re.search(r'(?<=").*(?=")', title)
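A quick side-by-side of the two patterns, as a sketch on a made-up title string (the sample text is an assumption, not the real tweet):

```python
import re

# Hypothetical title, already decoded by BeautifulSoup
title = 'Some User on Twitter: "Hello, world"'

# Plain pattern: the match includes the surrounding quotes
with_quotes = re.search(r'".*"', title)

# Lookbehind/lookahead: the quotes must be there but are not part of the match
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))
    print(without_quotes.group(0))
```

The first print keeps the quotes, the second drops them.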

zvone

Here is a simple complete example using regex to extract the text within quotes:

import re
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
# Capture the text between the quotes
quote = re.match(r'^.*\"(.*)\"', title)
if quote:
    print(quote.group(1))

What happens here is that, after getting the page's source and finding the title, we run a regular expression against the title to extract the text within the quotes.

We tell the regular expression to look for any number of characters at the beginning of the string (^.*) before the opening quote (\"), then capture the text between it and the closing quote (second \").

Then we print the captured text by telling Python to print the first captured group (the part between parentheses in the regex).

Here's more on matching with regex in python - https://docs.python.org/3/library/re.html#match-objects
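One caveat with the pattern above: the leading ^.* is greedy, so if the tweet itself contains double quotes, only the last quoted piece is captured. Here is a sketch of a variant that captures from the first quote to the last one instead. The helper name extract_quoted and the sample string are made up for illustration:

```python
import re

def extract_quoted(title):
    """Return the text between the first and last double quote, or None."""
    match = re.search(r'"(.*)"', title)
    return match.group(1) if match else None

# Hypothetical title whose tweet text contains quotes of its own
nested = 'Some User on Twitter: "He said "hi" to me"'

# Original pattern: the greedy prefix swallows everything up to the last pair
print(repr(re.match(r'^.*\"(.*)\"', nested).group(1)))

# Variant: spans from the first quote to the last quote
print(repr(extract_quoted(nested)))

# No quotes at all: returns None instead of raising AttributeError
print(extract_quoted("no quotes here"))
```

For titles with a single pair of quotes, both patterns give the same result.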

4140tm

Just split the text on the colon:

In [1]:  h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup  = BeautifulSoup(h, "lxml")

In [4]: print(soup.title.text.split(": ", 1)[1])
 "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

Actually, looking at the page, you don't need to split at all: the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests

In [9]: from bs4 import BeautifulSoup


In [10]: soup  = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")


In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)

In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can do it either way and get the same text; the only difference is that the split version keeps the surrounding quotes.
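If you go with the split and want the text without its surrounding quotes, something like this should do. It is only a sketch on a made-up title string, and it assumes the title always has the form NAME on Twitter: "TEXT":

```python
# Hypothetical title, as returned by soup.title.text
title = 'Some User on Twitter: "Hello, world"'

# partition splits on the first ": " only; sep is empty if it is not found
_, sep, rest = title.partition(": ")
text = rest if sep else title

# Drop exactly one pair of surrounding quotes, if present
if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
    text = text[1:-1]

print(text)
```

Removing a single pair by slicing, rather than calling strip('"'), leaves alone any quotes that belong to the tweet itself.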

Padraic Cunningham
  • Caunnungham This worked! Thanks for informing. `print(soup.select_one("div.js-tweet-text-container p").text)` – Amar Sep 22 '16 at 04:20