Questions tagged [newspaper3k]

49 questions
0
votes
1 answer

Author extraction in newspaper example is not working

I'm trying, with no luck, to use newspaper3k to extract speaker names from webpages containing speeches. Although I follow the package documentation, article.authors always seems to return an empty list. Using the example in the docs here. In: from…
0
votes
0 answers

Cannot append article contents to list

I am using the Python newspaper3k package and trying to loop through all of the articles on a website and build a dataframe with the contents of the articles. The meta_data of the article comes as a nested dictionary and I am able to pull it out of a…
0
votes
0 answers

Error when using reticulate with shiny

I am trying to use a Python package inside a Shiny app to extract the main text from a webpage: https://newspaper.readthedocs.io/en/latest/ What I mean by main text is the body of the article, without any ads, links, etc. (very similar to the…
Bahi8482
  • 489
  • 5
  • 15
0
votes
1 answer

Newspaper3k: Any way to download multiple web articles to one variable?

I am trying to download a number of web articles for parsing. They are similar articles (annual reports), and I'd like all three to be downloaded into a single output/variable for simplicity. When I separate multiple URLs, the code works,…
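One possible way to keep several articles in a single variable is to map each URL to its extracted text. This is a minimal sketch, not the asker's code: `fetch_article_text` and `collect_texts` are hypothetical helper names, and the sketch assumes the usual newspaper3k flow of `Article(url)`, `download()`, `parse()`, and the `text` attribute.

```python
def fetch_article_text(url):
    """Download and parse one article with newspaper3k and return its body text."""
    # Imported lazily so collect_texts can be used with any other fetcher.
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def collect_texts(urls, fetch=fetch_article_text):
    """Return one dict mapping each URL to its extracted text (hypothetical helper)."""
    return {url: fetch(url) for url in urls}
```

Calling `collect_texts([...])` with a list of URLs would then give one dictionary holding all the texts; the `fetch` parameter is there only so the download step can be swapped out.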
0
votes
1 answer

newspaper3k: find author name in visible text after first "by" word

newspaper3k is a good Python library for news content extraction. It mostly works well. I want to extract names after the first "by" word in the visible text. This is my code; it did not work well. Can somebody please help? import re from newspaper…
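The idea of taking the name after the first "by" can be sketched with a regular expression over text that has already been extracted. This is an illustration under assumptions, not the asker's code: `BYLINE_RE` and `first_byline` are hypothetical names, and the pattern assumes a name is one to four capitalized words or initials.

```python
import re

# "by" or "By" as a standalone word, followed by 1-4 capitalized words or
# single-letter initials such as "J." (an assumption about what a name looks like).
BYLINE_RE = re.compile(
    r"\b[Bb]y\s+((?:[A-Z]\.|[A-Z][\w'-]*)(?:\s+(?:[A-Z]\.|[A-Z][\w'-]*)){0,3})"
)


def first_byline(text):
    """Return the name after the first 'by' followed by a capitalized word, or None."""
    match = BYLINE_RE.search(text)
    return match.group(1) if match else None
```

An occurrence such as "by the river" is skipped because the word after "by" is not capitalized, so the result is the first "by" that looks like a byline rather than strictly the first "by" in the text.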
0
votes
1 answer

newspaper3k: am I doing something wrong? The author function did not pick up the author in a news article

This is about the author function of the newspaper3k library. I have a list of news URLs, and ">>> article.authors" sometimes does not pick up the authors. An example is here: authors missing
tursunWali
  • 71
  • 8
0
votes
1 answer

newspaper3k: do its functions work on stored data? I already downloaded the contents of the URL

The newspaper3k library on GitHub here is quite useful. Currently, it works with Python 3. I wonder if it can handle downloaded/stored text. The point is that we have already downloaded the contents of the URL and do not want to repeat this every time when…
tursunWali
  • 71
  • 8
0
votes
2 answers

How to get the right url after redirection (the one given by the browser) using python

I'm working on a project whose aim is to retrieve all the information from a news article (media website). For this I'm using the newspaper3k library, which works quite well. However, I have a problem concerning some URLs (redirected links): according…
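For plain HTTP redirects, one possible approach is to let the standard library follow them and then read the final address from the response. This is a sketch under assumptions: `resolve_final_url` is a hypothetical helper, and it only covers HTTP-level redirects, not ones performed by JavaScript or meta-refresh tags in the page.

```python
import urllib.request


def resolve_final_url(url, opener=urllib.request.urlopen, timeout=10):
    """Follow HTTP redirects and return the URL that was finally served.

    `opener` defaults to urllib.request.urlopen, which follows redirects;
    it is a parameter only so another opener can be substituted.
    """
    with opener(url, timeout=timeout) as response:
        return response.geturl()
```

The resolved address could then be passed on to the article extraction step in place of the original redirected link.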
0
votes
0 answers

How to use Sharingan for newspaper text extraction?

I want to test Sharingan for newspaper text extraction https://github.com/vipul-sharma20/sharingan, but I don't understand how to use it. I cloned the project and installed the requirements. What else is needed? Is there any example to start with?
Ryad_B
  • 17
  • 3
0
votes
1 answer

Web scraping news articles and keyword search

I have code which fetches the titles of news articles in webpages. I have used a for loop in which I get the titles of 4 news websites. I have also implemented a word search which tells the number of articles in which the word "coronavirus" is…
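The keyword count described above can be sketched as a case-insensitive substring check over the collected titles. This is a minimal illustration, not the asker's code: `count_titles_with_keyword` is a hypothetical name, and the titles are assumed to be plain strings that have already been fetched.

```python
def count_titles_with_keyword(titles, keyword):
    """Count the titles that contain the keyword, ignoring case."""
    # Stripping the keyword avoids missing titles because of a stray space,
    # e.g. " coronavirus" would not match a title that starts with the word.
    needle = keyword.strip().casefold()
    return sum(1 for title in titles if needle in title.casefold())
```

Running it once per website inside the existing for loop would give a per-site count; summing those counts gives the total across all sites.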
0
votes
1 answer

Get more article URLs from a news source with newspaper3k?

When I do import newspaper paper = newspaper.build('http://cnn.com', memoize_articles=False) print(len(paper.articles)) I see that newspaper found 902 articles from http://cnn.com, which seems quite few to me, considering that they publish many…
HelloGoodbye
  • 3,624
  • 8
  • 42
  • 57
0
votes
1 answer

Why does newspaper3k differentiate between http://cnn.com and http://www.cnn.com?

When I run the Python code import newspaper print(len(newspaper.build('http://cnn.com', memoize_articles=False).articles)) exit() in Python 3 I get the output 897 (i.e. newspaper3k found 897 pages considered articles on the domain http://cnn.com),…
HelloGoodbye
  • 3,624
  • 8
  • 42
  • 57
0
votes
1 answer

Newspaper3k: how to retrieve cached articles?

This document says that by default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted. >>> cbs_paper = newspaper.build('http://cbs.com') >>> cbs_paper.size() 1030 >>> cbs_paper =…
Ahmad
  • 8,811
  • 11
  • 76
  • 141
0
votes
1 answer

Python newspaper3k library multithreading hangs indefinitely

I'm working on a project to extract articles from gaming media sites, and I'm doing a basic test run, which according to VSCode's debugger consistently hangs at the point after which I've set up a multi-threaded extraction (changing the number of…
0
votes
1 answer

Newspaper API for scraping articles

I have used the newspaper3k API from Python for scraping articles. I am not able to scrape Times of India articles: the publish date in the response is null, while other articles are returned properly. article =…