How can I extract images and articles from an HTML page using readability-lxml?

Question

    import urllib.request
    from readability import Document

    url = 'http://edition.cnn.com/'
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    f = urllib.request.urlopen(req)
    html = f.read()

    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()

    print(readable_title)
    print(html)

    print(Document(html))

How can I fetch all images and articles? Is there a built-in function for that? If not, how can it be done?

score 0 · Answer 1 · edited Nov 27 '18 at 13:29

I'd suggest the newspaper module, which can do what you are looking for.

Here are some examples taken from their site. You can download and install the module from https://pypi.org/project/newspaper3k/

        >>> from newspaper import Article

        >>> url = u'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
        >>> article = Article(url)
        >>> article.download()

        >>> article.html
        u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
        >>> article.parse()

        >>> article.authors
        [u'Leigh Ann Caldwell', u'John Honway']

        >>> article.publish_date
        datetime.datetime(2013, 12, 30, 0, 0)

        >>> article.text
        u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

        >>> article.top_image
        u'http://someCDN.com/blah/blah/blah/file.png'

        >>> article.movies
        [u'http://youtube.com/path/to/link.com' ...]
        >>> article.nlp()

        >>> article.keywords
        [u'New Years', u'resolution', ...]

        >>> article.summary
        u'The study shows that 93% of people ...'

        >>> import newspaper

        >>> cnn_paper = newspaper.build(u'http://cnn.com')

        >>> for article in cnn_paper.articles:
        ...     print(article.url)
        http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
        http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
        ...

        >>> for category in cnn_paper.category_urls():
        ...     print(category)

        http://lifestyle.cnn.com
        http://cnn.com/world
        http://tech.cnn.com
        ...

        >>> cnn_article = cnn_paper.articles[0]
        >>> cnn_article.download()
        >>> cnn_article.parse()
        >>> cnn_article.nlp()
        ...

        >>> import requests
        >>> from newspaper import fulltext

        >>> html = requests.get(...).text
        >>> text = fulltext(html)
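
If you would rather stay with readability-lxml, as in the question, one option is to pull the img tags out of the HTML fragment that Document(html).summary() returns. The following is a minimal sketch using only the standard library. ImageCollector and extract_images are hypothetical helper names, and the sample string merely stands in for a real summary.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))


def extract_images(fragment, base_url=""):
    """Return the image URLs found in an HTML fragment, in document order."""
    collector = ImageCollector(base_url)
    collector.feed(fragment)
    collector.close()
    return collector.images


# Stand-in for the string returned by Document(html).summary().
sample = (
    '<html><body><div><p>Text</p>'
    '<img src="/a.png"><img src="http://cdn.example.com/b.jpg"/>'
    '</div></body></html>'
)
print(extract_images(sample, "http://edition.cnn.com/"))
```

With the question's code this would be called as extract_images(Document(html).summary(), url). It only finds images that readability kept in the summary, so it may miss images that the page shows outside the article body.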

It looks like the newspaper package has installation issues. I don't know whether Python 3.4 compatibility is available or not. I cannot install it. Any comments? — Anand VL, May 18 '16 at 09:00
For Python 3 you should use newspaper3k (https://pypi.python.org/pypi/newspaper3k) — Lukas Klein, Sep 19 '16 at 22:23
