Crawling a news website and getting the news content

Question

I'm trying to download the text from a news website. The HTML is:

<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
        <div class="field-item odd">
                 <p>"My Text" target="_blank">www.injuv.cl</a></strong></p>         </div>

The output should be: My Text

I'm using the following Python code:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)

But the code prints None. What is wrong with my code?

Even if you parsed the HTML and not the URL, the HTML isn't valid. You can't parse that with BeautifulSoup. — tobltobs, Jun 09 '16 at 20:14
@tobltobs `BeautifulSoup` attempts to fix broken HTML; it can parse that HTML just fine. — That1Guy, Jun 09 '16 at 20:19

score 2 · Accepted Answer · answered Jun 09 '16 at 20:23

The problem is that you are not parsing the HTML, you are parsing the URL string:

html = "My URL"
parsed_html = BeautifulSoup(html)
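
To see why find() comes back empty, here is a minimal sketch that feeds the same placeholder string to BeautifulSoup. It assumes bs4 is installed, and it names the built-in "html.parser" explicitly, which the snippet above does not do:

```python
from bs4 import BeautifulSoup

# This is the literal placeholder string from the question, not page source.
url_text = "My URL"
soup = BeautifulSoup(url_text, "html.parser")

# The parsed tree contains no tags at all, only a text node,
# so there is no <div> for find() to match.
print(soup.find_all(True))
print(soup.find("div", attrs={"class": "pane-content"}))
```

Both lookups come back empty because the input never contained any markup.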

Instead, you need to download the page source first. For example, in Python 2:

from urllib2 import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

In Python 3, it would be:

from urllib.request import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
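
If you want the page as text rather than handing the response object straight to BeautifulSoup, a standard-library-only sketch could look like the following. The helper name fetch_html and the UTF-8 fallback are assumptions for illustration, and the data: URL only stands in for a real page so the example runs without network access:

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download `url` and return the response body as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # the body arrives as bytes
        # Prefer the charset declared in the Content-Type header, if any.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")


# A data: URL is used here instead of a real address (assumption for the demo).
sample = fetch_html("data:text/html;charset=utf-8,%3Cp%3EMy%20Text%3C%2Fp%3E")
print(sample)
```

Decoding by hand is optional: BeautifulSoup also accepts the raw bytes and tries to detect the encoding itself.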

Or, you can use the third-party "for humans"-style requests library:

import requests

html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)

Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

with just:

from bs4 import BeautifulSoup

That1Guy · Answer 2 · 2016-10-04T16:35:54.797

BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.

Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:

# Python 3 import; on Python 2 use: from urllib import urlopen
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Get the HTML
conn = urlopen("http://www.example.com")
html = conn.read()

# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html)

From here, just parse as you attempted previously.

p = soup.find("div", attrs={'class':'pane-content'})
print(p)
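
Once the real HTML is parsed, the text can be pulled out of the matched element with get_text(). The sketch below runs against the fragment from the question, pasted in as a string so no download is needed; it assumes bs4 is installed and uses the built-in "html.parser". The fragment is garbled in the original post, so the extracted text contains leftovers besides "My Text":

```python
from bs4 import BeautifulSoup

# Fragment copied from the question: the closing </div> tags are missing
# and the <p> contents are garbled, exactly as posted.
html = """
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
        <div class="field-item odd">
                 <p>"My Text" target="_blank">www.injuv.cl</a></strong></p>         </div>
"""

soup = BeautifulSoup(html, "html.parser")
pane = soup.find("div", attrs={"class": "pane-content"})

# Guard against None so a missing element does not raise AttributeError.
paragraph = pane.find("p") if pane is not None else None
text = paragraph.get_text(strip=True) if paragraph is not None else None
print(text)
```

The parser tolerates the unclosed div elements, so the lookup succeeds even though the markup is not valid.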
