Extract data from &quot; under title tag using BeautifulSoup?

Question

I want to extract the title of a link after getting its HTML via the BeautifulSoup library in Python. Basically, the whole title tag is:

 <title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

I want to extract only the text between the &quot; entities, that is: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried this:

import urllib
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List=list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h,"html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err:
    pass

I also tried as

for i in data.find_all("title.&quot"):

for i in data.find_all("title>&quot"):

for i in data.find_all("&quot"):

and

for i in data.find_all("quot"):

But none of them works.

I would expect that BeautifulSoup converts `&quot;` to `"`, so you just have to look for `"`... — zvone, Sep 21 '16 at 18:49

score 0 · Answer 1 · answered Sep 21 '16 at 19:05

Once you have parsed the html:

data = BeautifulSoup(h,"html.parser")

Find the title this way:

title = data.find("title").string  # this is without <title> tag

Now find two quotes (") in the string. There are many ways to do that. I would use regex:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

You never search for &quot; or any other &NAME; sequences, because BeautifulSoup converts them to the actual characters they represent.

EDIT:

Regex which does not capture the quotes would be:

re.search(r'(?<=").*(?=")', title)
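To see the point about entities concretely, here is a minimal offline sketch. It assumes the title markup from the question is pasted in as a string (the name html_doc is made up for illustration) instead of being fetched from the live page, and it applies the lookaround regex from the edit above.

```python
import re
from bs4 import BeautifulSoup

# Title markup copied from the question; no network access is needed.
html_doc = (
    "<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, "
    "financial and military support to dictators in Latin America during "
    "the cold war. REALLY, AMERICA? (3)&quot;</title>"
)

soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("title").string

# After parsing, the entity is gone and real quote characters remain.
has_entity = "&quot;" in title
has_quote = '"' in title

# Lookbehind and lookahead match the text between the first and last
# quote without including the quotes themselves.
match = re.search(r'(?<=").*(?=")', title)
tweet_text = match.group(0) if match else None
print(has_entity, has_quote)
print(tweet_text)
```

This is also why find_all("&quot") and similar calls find nothing: find_all looks for tag names, and by the time you have a soup object there is no &quot; text left to match anyway.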

score 0 · Answer 2 · answered Sep 21 '16 at 20:52

Here is a simple complete example using regex to extract the text within quotes:

import urllib.request
import re
from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))

What happens here is that, after getting the page's source and finding the title, we use a regular expression against the title to extract the text within the quotes.

We tell the regular expression to look for an arbitrary number of symbols at the beginning of the string (^.*) before the opening quote (\"), then capture the text between it and the closing quote (second \").

Then we print the captured text by telling Python to print the first captured group (the part between parentheses in the regex).

Here's more on matching with regex in python - https://docs.python.org/3/library/re.html#match-objects
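Two things may be worth guarding against with this pattern: re.match returns None when the title has no quotes, and the greedy ^.* in front pushes the opening quote as far right as it can, so a tweet that itself contains quotes gets cut short. Below is a small sketch of one possible alternative, not a definitive fix. The helper name quoted_text and the nested title are made up for illustration.

```python
import re

def quoted_text(title):
    """Return the text between the first and last double quote, or None."""
    # search() scans the whole string, so no leading '^.*' is needed;
    # the greedy '(.*)' then stretches to the last quote in the title.
    match = re.search(r'"(.*)"', title)
    return match.group(1) if match else None

plain = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'
nested = 'Someone on Twitter: "he said "hi" to me"'  # hypothetical title

# The pattern from the answer, for comparison on the nested title.
original = re.match(r'^.*\"(.*)\"', nested).group(1)

print(quoted_text(plain))
print(quoted_text(nested))
print(repr(original))
print(quoted_text("no quotes here"))
```

On the nested title the original pattern captures only the tail after the last inner quote, while the search-based version keeps the whole tweet text.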

Padraic Cunningham · Accepted Answer · 2016-09-21T22:42:24.870

Just split the text on the colon:

In [1]:  h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup  = BeautifulSoup(h, "lxml")

In [4]: print(soup.title.text.split(": ", 1)[1])
 "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

Actually, looking at the page, you don't need to split at all. The text is in the p tag inside the div.js-tweet-text-container:

In [8]: import requests

In [9]: from bs4 import BeautifulSoup


In [10]: soup  = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")


In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)

In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can do it either way. Note that the title-based version keeps the surrounding quotes, while the p tag gives the bare text.
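If you go the title route and want the text without the surrounding quotes, as the question asks, one option is to split once and then trim a single quote from each end. This is only a sketch: the helper name tweet_from_title is made up, and it assumes the title always has the "Name on Twitter: "text"" shape shown above.

```python
def tweet_from_title(title_text):
    """Return the tweet text from a 'Name on Twitter: "text"' title, or None."""
    # partition() splits on the first ': ' only, so colons inside the
    # tweet itself are left alone.
    _, sep, tail = title_text.partition(": ")
    if not sep:
        return None  # title does not have the expected shape
    # Drop exactly one quote from each end, keeping any inner quotes.
    if len(tail) >= 2 and tail.startswith('"') and tail.endswith('"'):
        return tail[1:-1]
    return tail

sample = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'
print(tweet_from_title(sample))
print(tweet_from_title("A title without the separator"))
```

Slicing is used instead of strip('"') so that a tweet which itself ends with a quote character does not lose it.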

Cunningham, this worked! Thanks for the information. `print(soup.select_one("div.js-tweet-text-container p").text)` — Amar, Sep 22 '16 at 04:20
