
The newspaper3k library on GitHub is quite useful. Currently it works with Python 3. I wonder whether it can handle already downloaded/stored text. The point is that we have already downloaded the contents of the URL and do not want to repeat the download every time we use certain functions (keywords, summary, date, ...). We would like to query the stored data for the date and authors, for example. The obvious code execution flow is: 1. download, 2. parse, 3. extract various info (text, title, images, ...). It looks like a chain reaction to me that always starts with a download:

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'
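Since every later step in that chain only needs the HTML that `download()` fetched, the fetch can happen once and the result can be cached to disk; newspaper3k's `Article.download()` accepts an `input_html` argument, so a later run can skip the network entirely. A minimal sketch, assuming newspaper3k is installed; the cache path is illustrative:

```python
from pathlib import Path

CACHE = Path("article_cache.html")  # illustrative cache location


def save_html(html: str, path: Path = CACHE) -> None:
    """Persist the HTML fetched by article.download() for later reuse."""
    path.write_text(html, encoding="utf-8")


def load_html(path: Path = CACHE) -> str:
    """Read previously cached HTML back from disk."""
    return path.read_text(encoding="utf-8")


def demo() -> None:
    """Fetch once, cache, then re-parse offline (requires newspaper3k)."""
    from newspaper import Article

    url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
    article = Article(url)
    article.download()        # the one and only network fetch
    save_html(article.html)   # cache the raw HTML

    # Later (even in another process): no network access needed.
    offline = Article('', language='en')
    offline.download(input_html=load_html())
    offline.parse()
    print(offline.title, offline.authors, offline.publish_date)
```

The cache helpers are plain stdlib, so the second run works entirely offline.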
tursunWali

1 Answer


After your comment about saving news sources with Ctrl+S, I removed my first answer and added this one.

I downloaded the content of this article -- https://www.latimes.com/business/story/2021-02-08/tesla-invests-in-bitcoin -- to my file system.

The example below shows how I can query this article from my local file system.

from newspaper import Article

with open("Elon Musk's Bitcoin embrace is a bit of a head-scratcher - Los Angeles Times.htm", 'r') as f:
    # note the empty URL string; the saved HTML is supplied instead of downloading
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    article_meta_data = article.meta_data

    article_published_date = ''.join({value for (key, value) in article_meta_data['article'].items()
                                      if key == 'published_time'})
    print(article_published_date)
    # output:
    # 2021-02-08T15:52:56.252

    print(article.title)
    # output:
    # Elon Musk’s Bitcoin embrace is a bit of a head-scratcher

    article_author = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}
    print(''.join(article_author).rsplit('/', 1)[-1])
    # output:
    # russ-mitchell

    article_summary = ''.join({value for (key, value) in article_meta_data['og'].items() if key == 'description'})
    print(article_summary)
    # output:
    # The Tesla CEO says climate change is a threat to humanity, but his endorsement is driving demand for a cryptocurrency with a massive carbon footprint.
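The set comprehensions above work, but `article.meta_data` behaves like a nested dict, so plain lookups are simpler. Here is a sketch of two small plain-Python helpers; the dict shape is assumed to match the meta_data output shown above:

```python
def meta_value(meta: dict, section: str, key: str, default: str = "") -> str:
    """Look up one value in a nested meta_data-style dict, e.g.
    meta_value(article.meta_data, 'article', 'published_time')."""
    return meta.get(section, {}).get(key, default)


def author_slug(author_url: str) -> str:
    """Reduce an author URL like '.../staff/russ-mitchell' to its last segment."""
    return author_url.rsplit('/', 1)[-1]
```

With the parsed article above, `author_slug(meta_value(article_meta_data, 'article', 'author'))` reproduces the `russ-mitchell` output without building an intermediate set.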

Life is complex
  • thank you for the above code. Imagine the download is already done, so we skip these steps: //// url = 'https://www.latimes.com/business/story/2021-02-08/tesla-invests-in-bitcoin' article = Article(url, config=config) article.download() ///// We should directly use methods to get the publish date, title, description, author(s), text and top image. – tursunWali Feb 09 '21 at 21:36
  • "Imagine the download is already done" -- if so, how are you storing the information? – Life is complex Feb 10 '21 at 04:47
  • Stored in the normal way: Ctrl+S, then save. That creates one .html file and one folder in the Downloads folder. – tursunWali Feb 11 '21 at 05:08
  • You're manually saving all the webpage content offline in multiple folders? – Life is complex Feb 11 '21 at 13:33
  • Yes, that is how the webpages are stored. – tursunWali Feb 11 '21 at 17:12
  • @tursunWali my current answer shows how to query offline articles. Please accept this answer, because it has all the details needed to handle your use case. – Life is complex Feb 11 '21 at 17:54
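The question also mentions keywords and summaries. In newspaper3k those come from `article.nlp()`, which operates on the already-parsed text, so it works on offline articles too (it needs NLTK's punkt tokenizer data downloaded once). A sketch, wrapped in a function since it requires newspaper3k and NLTK to actually run; the file path is illustrative:

```python
def extract_nlp_fields(path: str):
    """Parse a saved .htm file and return (keywords, summary).
    Requires newspaper3k and NLTK's 'punkt' tokenizer data."""
    from newspaper import Article

    with open(path, 'r', encoding='utf-8') as f:
        article = Article('', language='en')
        article.download(input_html=f.read())
    article.parse()
    article.nlp()  # populates .keywords and .summary from the parsed text
    return article.keywords, article.summary
```

Called on the saved LA Times file from the answer above, this would return the article's keyword list and an extractive summary, again without touching the network.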