Newspaper3k: how to retrieve cashed articles?

Question

This document says that that by default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted.

>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
1030

>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
2

Okay, but it says nothing if I build once a website how I can retrieve the cashed articles?

score 2 · Answer 1 · edited Mar 14 '23 at 22:31

newspaper3k uses memoize to cache the articles for a source

setting memoize to false would stop the caching mechanism

cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)

however, if you still want the caching and want to access the cached articles you can find .newspaper_scraper dir inside the temp folder (Path in windows machine)

C:\Users\your_user\AppData\Local\Temp\.newspaper_scraper\memoized

For Linux-based OSes, try looking in

/tmp/.newspaper_scraper/memoized/

For macOS, look in the directory specified by $TMPDIR. This may be any or none of the following:

/tmp/.newspaper_scraper/memoized/
/private/tmp/.newspaper_scraper/memoized/
~/Library/Caches/TemporaryItems/.newspaper_scraper/memoized/

Newspaper3k: how to retrieve cashed articles?

1 Answers1