newspaper3k: am I doing something wrong? The authors attribute did not pick up the author in a news article

Question

This is about the authors attribute of the newspaper3k library. I have a list of news URLs, and ">>> article.authors" sometimes does not pick up the authors. An example is here: authors missing

Life is complex · Accepted Answer · 2021-02-10T14:57:31.257

Newspaper3k uses the Python package Beautiful Soup to extract items, such as author names, from a news website. The tags that Newspaper3k queries are pre-defined within the Newspaper3k source code. Newspaper3k makes a best effort to extract content from these standard tags on a news site.

However, not all news sources are structured the same way, so Newspaper3k will miss certain content when a tag (e.g., author) is in a different place in the HTML structure.

For instance, Newspaper3k looks for the author name in these tags:

VALS = ['author', 'byline', 'dc.creator', 'byl']
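
To illustrate why a fixed list of values can miss an author, here is a minimal sketch of that kind of lookup using only the standard library. The ATTRS list, the AuthorTagFinder class, and the find_author_tags helper are assumptions made for this illustration; they are not taken from the answer, and this is not Newspaper3k's actual code.

```python
from html.parser import HTMLParser

VALS = ['author', 'byline', 'dc.creator', 'byl']
# Assumption for illustration: the attributes whose values are compared to VALS.
ATTRS = ['name', 'rel', 'itemprop', 'class', 'id']


class AuthorTagFinder(HTMLParser):
    """Collect tags that have an attribute in ATTRS with a value in VALS."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        content = dict(attrs).get('content')
        for attr_name, attr_value in attrs:
            if attr_name in ATTRS and attr_value in VALS:
                self.matches.append((tag, attr_name, attr_value, content))


def find_author_tags(html):
    """Return (tag, attribute, value, content) tuples for matching tags."""
    finder = AuthorTagFinder()
    finder.feed(html)
    return finder.matches
```

With this sketch, a tag such as `<meta name="author" content="Jane Doe">` is found, while `<meta property="article:author" content="...">` is not, because neither the attribute nor its value is in the lists.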

The tag dc.creator is always located in the META tag section of a news source. If your news source uses a different author tag, such as article.author, which the LA Times uses, then you must query that tag like this:

article_meta_data = article.meta_data
article_author = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}
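
The lookup above raises a KeyError when a page has no 'article' section in its meta data. A slightly more defensive version might look like the sketch below. The helper name author_from_meta and its parameters are hypothetical, and it assumes meta_data behaves like a nested dictionary, as in the snippet above.

```python
def author_from_meta(meta_data, namespace='article', key='author'):
    """Return meta_data[namespace][key] if present, otherwise None.

    meta_data is assumed to be a nested mapping such as the one
    exposed by article.meta_data after article.parse().
    """
    section = meta_data.get(namespace, {})
    if not isinstance(section, dict):
        return None
    return section.get(key)


# Example with a hand-written dictionary standing in for article.meta_data.
sample_meta = {'article': {'author': 'https://www.latimes.com/people/john-myers'}}
print(author_from_meta(sample_meta))
```

After parsing a real article, the call would be author_from_meta(article.meta_data).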

I cover many of these harvesting issues in my newspaper3K overview document, which I have shared on my Github page.

edited Feb 10 '21 at 14:57

answered Feb 10 '21 at 14:52

Life is complex


Yes, I was there: VALS = ['author', 'byline', 'dc.creator', 'byl']. Thank you for the overview document; it is up to date, and I am going to read it. One spontaneous question: how about adding a tag to VALS? Here is where the author is located:
by Nelson Daily Staff on Thursday November 19 2020
– tursunWali Feb 11 '21 at 03:41
My observation: on some webpages, the author name appears after the first "by" word (in the visible text; it may not be after the first "by" word in the page source). I wonder if this may be the way: "Finding a word after a specific word in Python using regex from text file" on stackoverflow.com – tursunWali Feb 11 '21 at 04:35
I don't control *Newspaper3k,* so I cannot modify the source code. You have to modify your code to use *Newspaper3k* to harvest the elements correctly. Open an issue with the *Newspaper3k* owners to add a new tag in the source code. – Life is complex Feb 11 '21 at 05:19
I also do not control newspaper3k; I just asked about such possibilities by forking on GitHub. I checked your wonderful examples in the "newspaper3K overview document". Please tell me where to find "bbc_dictionary.items()" and "cnn_dictionary.items()"; sorry, it shows an error message. – tursunWali Feb 11 '21 at 05:38
There are already discussions about forking the package, because it hasn't been updated since 2016. Concerning "bbc_dictionary.items()" and "cnn_dictionary.items()": these items are only linked to my code examples as part of my naming convention for variables. I just tested the example code from my overview document that contains those variables, and the examples worked with no errors. – Life is complex Feb 11 '21 at 13:52
Did my answer help you? If so, please accept the answer. If not, please follow-up specifically so any outstanding concerns based on the original question can be addressed. – Life is complex Feb 11 '21 at 13:55
"Concerning "bbc_dictionary.items()" and "cnn_dictionary.items()." These items are only linked to my code example as part of my naming convention for variables being used." A rule-based system has to be very specific. This is a pain right now. The webpages I want to extract authors from are not from the BBC or CNN. One clue I know about them: the author name appears after the word "by" and is usually followed by the publication date. – tursunWali Feb 11 '21 at 17:17
All my code examples in the [overview document](https://github.com/johnbumgarner/newspaper3_usage_overview) can be easily modified to fit your needs. – Life is complex Feb 11 '21 at 17:52
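
For bylines like the one quoted earlier in this thread ("by Nelson Daily Staff on Thursday November 19 2020"), a regular expression over the visible text is one possible approach. The sketch below is only an illustration: the pattern and the helper name author_from_byline are assumptions, and it only handles the "by NAME on WEEKDAY ..." shape, so other sites would need their own pattern.

```python
import re

# Assumed byline shape: "by <author> on <weekday> <date>".
BYLINE = re.compile(
    r'\bby\s+(?P<author>.+?)\s+on\s+'
    r'(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b',
    re.IGNORECASE,
)


def author_from_byline(text):
    """Return the text between 'by' and 'on <weekday>', or None if absent."""
    match = BYLINE.search(text)
    return match.group('author').strip() if match else None


print(author_from_byline('by Nelson Daily Staff on Thursday November 19 2020'))
```

The text passed in could be article.text or any other visible text pulled from the page.
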
Please accept this answer, because it has all the details needed to handle your original use case. – Life is complex Feb 11 '21 at 19:17
The code in the answer above works: it gives me the URL of the author's page at the LA Times, 'https://www.latimes.com/people/john-myers', when I tried it on this link: https://www.latimes.com/california/story/2021-02-09/california-deal-over-teacher-vaccines-could-reopen-elementary-schools. I think I should use the re package to extract the last part of "https://www.latimes.com/people/john-myers". – tursunWali Feb 12 '21 at 02:45
My other answer showed how to extract that part of the author URL. – Life is complex Feb 12 '21 at 03:10
Oh, I didn't notice that, sorry. But I did the following and it worked:
import re
url='https://www.latimes.com/california/story/2021-02-09/california-deal-over-teacher-vaccines-could-reopen-elementary-schools '
article = Article(url.strip(), config=config)
article.download()
article.parse()
article_meta_data = article.meta_data
authorURL = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}
string=repr(authorURL)
author= re.sub('{}', '', string)
article_author=re.sub(r'^.+/([^/]+)}$', r'\1', author)
print(article_author)
– tursunWali Feb 12 '21 at 03:43
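
As an alternative to running a regular expression over the repr of a set, the last path segment of the author URL can be taken with the standard library's URL parser. This is a small sketch, not the code from the other answer mentioned above; the helper names author_slug and slug_to_name are hypothetical.

```python
from urllib.parse import urlparse


def author_slug(author_url):
    """Return the last path segment of an author URL, or None if there is none."""
    path = urlparse(author_url).path.rstrip('/')
    return path.rsplit('/', 1)[-1] or None


def slug_to_name(slug):
    """Turn a slug such as 'john-myers' into 'John Myers'."""
    return ' '.join(part.capitalize() for part in slug.split('-'))


slug = author_slug('https://www.latimes.com/people/john-myers')
print(slug, '->', slug_to_name(slug))
```

Capitalizing each part of the slug is a rough guess at the display name and will not be right for every author.
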
@tursunWali no worries. I will be adding another section to my overview document related to your questions. Please accept this answer. Thanks. – Life is complex Feb 12 '21 at 03:46
Already accepted, I appreciate your help. Love the Stack Overflow spirit. – tursunWali Feb 12 '21 at 03:48
Unfortunately, not everyone has the same spirit to help others and share their knowledge. Please let me know if you have any more questions about Newspaper3k. Happy coding. – Life is complex Feb 12 '21 at 04:06
If you say so, I would say: let's improve newspaper3k, please! – tursunWali Feb 12 '21 at 04:08
Here is the discussion on maintenance and forking the project: https://github.com/codelucas/newspaper/issues/813 – Life is complex Feb 12 '21 at 04:11
Yes, I just visited there and expressed my interest, and also urged those cool guys to start early. – tursunWali Feb 12 '21 at 04:22
Life is complex, could you have a look at my new post, "find author name in the visible text after the first “by” word", please? – tursunWali Feb 12 '21 at 06:16
