I'm trying to create a dataset for sentiment analysis on news articles. I'm using Newspaper3k to scrape articles from news websites. I scraped a few websites but didn't store the articles properly, so I can't use them. When I scrape the same websites again, it only picks up the new articles and skips the ones it already scraped. Is there a way to scrape the previously scraped articles again?
1 Answer
By default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted.
This feature exists to prevent duplicate articles and to increase extraction speed.
You may opt out of this feature with the memoize_articles parameter.
For example, in your case set it to False:
newspaper.build('http://cbs.com', memoize_articles=False)
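Since the original problem was that the scraped articles weren't stored properly, here is a rough sketch of how the re-scrape could be combined with saving the results to disk. The helper names `scrape` and `save_articles`, the `limit` argument, and the JSON Lines output format are my own choices for illustration, not part of newspaper; only `build`, `memoize_articles`, `download`, `parse`, and the `url`/`title`/`text` attributes come from the library.

```python
import json


def save_articles(records, path):
    """Write each record as one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def scrape(source_url, limit=10):
    """Re-scrape a site with caching disabled (hypothetical helper)."""
    import newspaper  # third-party: pip install newspaper3k

    # memoize_articles=False makes build() return previously seen articles too
    paper = newspaper.build(source_url, memoize_articles=False)

    records = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except newspaper.ArticleException:
            # Skip articles that could not be downloaded or parsed
            continue
        records.append(
            {"url": article.url, "title": article.title, "text": article.text}
        )
    return records
```

Calling save_articles(scrape('http://cbs.com'), 'articles.jsonl') should then leave one article per line in the output file, which can be reloaded later without scraping again.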

Ami Hollander