0

This is about the author extraction feature of the Newspaper3k library. I have a list of news URLs, and ">>> article.authors" sometimes does not pick up the authors. An example is here: authors missing

tursunWali

1 Answer

0

Newspaper3k uses Python HTML-parsing packages (lxml, along with Beautiful Soup) to extract items, such as author names, from a news website. The tags that Newspaper3k queries are predefined in the Newspaper3k source code. Newspaper3k makes a best effort to extract content from these standard tags on a news site.

But not all news sources are structured the same way, so Newspaper3k will miss certain content when an element (e.g., the author) sits in a different place in the HTML structure.

For instance, Newspaper3k looks for the author name in these tags:

VALS = ['author', 'byline', 'dc.creator', 'byl']
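
To make the lookup concrete, here is a minimal sketch of this kind of tag matching, written with the standard library's html.parser rather than Newspaper3k's own parser. Only VALS comes from the answer above; the ATTRS list, the AuthorTagFinder class, the find_author_tags helper and the sample HTML are illustrative assumptions, not Newspaper3k's API.

```python
from html.parser import HTMLParser

# Assumption: attribute names worth checking (illustrative, not taken from this answer).
ATTRS = ["name", "rel", "itemprop", "class", "id"]
# Values quoted in the answer above.
VALS = ["author", "byline", "dc.creator", "byl"]


class AuthorTagFinder(HTMLParser):
    """Collects start tags that carry an ATTRS attribute equal to one of VALS."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute name, attribute value) pairs.
        for key, value in attrs:
            if key in ATTRS and value in VALS:
                self.matches.append((tag, key, value))


def find_author_tags(html):
    """Return (tag, attribute, value) triples for every matching start tag."""
    finder = AuthorTagFinder()
    finder.feed(html)
    return finder.matches


# Hypothetical page fragment: the third tag uses property="article:author",
# which none of the ATTRS/VALS combinations match, so it is skipped.
sample_html = (
    '<meta name="dc.creator" content="Jane Doe">'
    '<span class="byline">By Jane Doe</span>'
    '<meta property="article:author" content="https://example.com/people/jane-doe">'
)
print(find_author_tags(sample_html))
```

The sketch compares attribute values exactly, so a site that stores the author under any other attribute or value is missed, which is the failure mode described above.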

The dc.creator tag is always located in the META tag section of a news source. If your news source uses a different author tag, such as article.author, which the LA Times uses, then you must query that tag like this:

article_meta_data = article.meta_data
article_author = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}
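
As the comments on this answer note, for the LA Times this lookup returns the author's profile URL (for example 'https://www.latimes.com/people/john-myers') rather than a name. The sketch below shows one way to guard against a missing key and turn such a URL into a readable name. The author_from_meta helper and the sample_meta dictionary are hypothetical; the dictionary stands in for article.meta_data so the example runs without downloading anything, and it assumes meta_data['article'] is a nested dictionary.

```python
from urllib.parse import urlparse


def author_from_meta(meta_data):
    """Return an author name from meta_data['article']['author'], or None if absent."""
    author = meta_data.get("article", {}).get("author")
    if not author:
        return None
    # If the value is a profile URL, keep only the last path segment.
    if author.startswith(("http://", "https://")):
        slug = urlparse(author).path.rstrip("/").rsplit("/", 1)[-1]
        # Assumption: the slug is the author's name with hyphens between words.
        return slug.replace("-", " ").title()
    return author


# Stand-in for article.meta_data, shaped like the LA Times example.
sample_meta = {"article": {"author": "https://www.latimes.com/people/john-myers"}}
print(author_from_meta(sample_meta))  # John Myers
```

Parsing the URL with urllib.parse avoids converting the set to a string and cleaning it up with regular expressions afterwards.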

I cover many of these harvesting issues in my Newspaper3k overview document, which I have shared on my GitHub page.

Life is complex
  • Yes, I have looked at that: VALS = ['author', 'byline', 'dc.creator', 'byl']. Thank you for the overview document; it is up to date and I will read it. One spontaneous question: how about adding a tag to VALS? This is where the author is located: – tursunWali Feb 11 '21 at 03:41
  • My observation: on some webpages, the author name appears after the first "by" word (in the visible text; it may not follow the first "by" in the page source). I wonder if this may be the way: "Finding a word after a specific word in Python using regex from text file", on stackoverflow.com – tursunWali Feb 11 '21 at 04:35
  • I don't control *Newspaper3k,* so I cannot modify the source code. You have to modify your code to use *Newspaper3k* to harvest the elements correctly. Open an issue with the *Newspaper3k* owners to add a new tag in the source code. – Life is complex Feb 11 '21 at 05:19
  • I do not control newspaper3k either; I was just asking about the possibility of forking it on GitHub. I checked your wonderful examples in the "newspaper3K overview document". Please tell me where to find "bbc_dictionary.items()" and "cnn_dictionary.items()"; sorry, they produce an error message. – tursunWali Feb 11 '21 at 05:38
  • There are already discussions about forking the package, because it hasn't been updated since 2016. Concerning "bbc_dictionary.items()" and "cnn_dictionary.items()": these names appear in my code example only as part of my naming convention for variables. I just tested the example code from my overview document that contains those variables, and the examples worked with no errors. – Life is complex Feb 11 '21 at 13:52
  • Did my answer help you? If so, please accept the answer. If not, please follow-up specifically so any outstanding concerns based on the original question can be addressed. – Life is complex Feb 11 '21 at 13:55
  • "Concerning "bbc_dictionary.items()" and "cnn_dictionary.items()." These items are only linked to my code example as part of my naming convention for variables being used." A rule-based system has to be very specific, and this is a pain right now. The webpages I want to extract authors from are not from the BBC or CNN. One clue I know about them: the author name appears after the word "by" and is usually followed by the publication date. – tursunWali Feb 11 '21 at 17:17
  • All my code example in the [overview document](https://github.com/johnbumgarner/newspaper3_usage_overview) can be easily modified to fit your needs. – Life is complex Feb 11 '21 at 17:52
  • Please accept this answer, because it has all the details needed to handle your original use case. – Life is complex Feb 11 '21 at 19:17
  • The code in above Answer1 works by giving me the website address of the author in LA times: 'https://www.latimes.com/people/john-myers' when I tried on this weblink: https://www.latimes.com/california/story/2021-02-09/california-deal-over-teacher-vaccines-could-reopen-elementary-schools, I think I should use re package to extract the last part in "https://www.latimes.com/people/john-myers" – tursunWali Feb 12 '21 at 02:45
  • My other answer showed how to extract that part of the author URL. – Life is complex Feb 12 '21 at 03:10
  • Oww, I didn't notice that sorry. But I did the following and it worked: import re url='https://www.latimes.com/california/story/2021-02-09/california-deal-over-teacher-vaccines-could-reopen-elementary-schools ' article = Article(url.strip(), config=config) article.download() article.parse() article_meta_data = article.meta_data authorURL = {value for (key, value) in article_meta_data['article'].items() if key == 'author'} string=repr(authorURL) author= re.sub('{}', '', string) article_author=re.sub(r'^.+/([^/]+)}$', r'\1', author) print(article_author) – tursunWali Feb 12 '21 at 03:43
  • @tursunWali no worries. I will be adding another section to my overview document related to your questions. Please accept this answer. Thanks. – Life is complex Feb 12 '21 at 03:46
  • Already accepted, appreciate your help. Love the Stack Overflow spirit – tursunWali Feb 12 '21 at 03:48
  • Unfortunately not everyone has the same spirit to help others and share their knowledge. Please let me know If you have any more questions about Newspaper3k. Happy coding. – Life is complex Feb 12 '21 at 04:06
  • If you say so, I would say: let's improve newspaper3k, please! – tursunWali Feb 12 '21 at 04:08
  • here is the discussion on maintenance and forking the project: https://github.com/codelucas/newspaper/issues/813 – Life is complex Feb 12 '21 at 04:11
  • Yes, just now visited there and expressed my interest, also urge those cool guys to start early. – tursunWali Feb 12 '21 at 04:22
  • Life is complex, could you have a look at my new post, "find author name in the visible text after the first “by” word", please! – tursunWali Feb 12 '21 at 06:16
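
The "by" heuristic raised in the comments (the author name follows the word "by" and is usually followed by the publication date) could be sketched roughly as below. This is only an illustration under stated assumptions: author_after_by, BYLINE and MONTHS are hypothetical names, the name is assumed to be one to four capitalized words, and the date is assumed to start with an English month name or abbreviation. It will misfire on sentences such as "By Friday the vote ...", so treat it as a fallback rather than a reliable extractor.

```python
import re

# Month tokens that commonly start a publication date right after the byline.
MONTHS = {
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august", "sep", "sept",
    "september", "oct", "october", "nov", "november", "dec", "december",
}

# "By" or "by" as a whole word, followed by one to four capitalized words.
BYLINE = re.compile(r"\b[Bb]y\s+((?:[A-Z][\w'.-]*\s*){1,4})")


def author_after_by(text):
    """Return the capitalized words after the first matching 'by', or None."""
    match = BYLINE.search(text)
    if match is None:
        return None
    name_words = []
    for word in match.group(1).split():
        # Stop as soon as the publication date appears to begin.
        if word.rstrip(".,").lower() in MONTHS:
            break
        name_words.append(word)
    return " ".join(name_words) or None


print(author_after_by("By John Myers Feb. 9, 2021"))  # John Myers
```

The input here would be the visible article text (for example article.text after parsing), not the raw page source, since the comments point out that the first "by" in the source may not be the byline.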