
Newspaper3k is a good Python library for news content extraction, and it works mostly well. I want to extract the names that appear after the first "by" word in the visible text. This is my code; it does not work well. Could somebody out there please help?

import re
from newspaper import Config
from newspaper import Article
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

url = 'https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(url.strip(), config=config)
article.download()
article.parse()

# BeautifulSoup needs the downloaded HTML, not the Article object itself
soup = BeautifulSoup(article.html, 'html.parser')

## I want to take only the visible text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.get_text()

# Iterate over lines of text, not over single characters
for line in visible_text.splitlines():
    # Capture the word after the first "By" or "by"
    match = re.search(r'\bby (\S+)', line, re.IGNORECASE)

    # Did we find a match?
    if match:
        # Yes, print it
        by = match.group(1)
        print('By {}'.format(by))
tursunWali
  • Are you trying to get this code to work only for *saugeentimes.com* or do you plan on querying multiple sources with the same code? – Life is complex Feb 12 '21 at 15:08
  • Yes, I want to query multiple sources similar to saugeentimes.com. – tursunWali Feb 12 '21 at 15:11
  • Thanks for the info. Please provide additional sources in your question. – Life is complex Feb 12 '21 at 16:09
  • In following webpages, author names appear after the first visible "by" word: 1. http://thenelsondaily.com/regionalnews?amp%3Bquicktabs_1=2&quicktabs_1=1%22%27&qt-qt_nelson_regional_international=1&page=5 2. http://thenelsondaily.com/regionalnews?amp%3Bquicktabs_1=2&quicktabs_1=1%22%27&qt-qt_nelson_regional_international=1&page=7 3. https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid/ 4. https://www.cnn.com/2020/10/09/business/edinburgh-woollen-mill-job-cuts/index.html – tursunWali Feb 12 '21 at 22:18
  • The author name (in some web articles) comes between the title and the date. – tursunWali Feb 12 '21 at 22:42
  • I would recommend reviewing my [newspaper overview document](https://github.com/johnbumgarner/newspaper3_usage_overview), because some of the sites that you provided can be easily queried using code from that document. Some of the other sites will likely require a different approach. Most likely requests, BS4 and regex. Please rework your question for the sites that cannot be extracted using my overview examples. – Life is complex Feb 13 '21 at 04:50
  • I will explore your newspaper overview document again. It would be great if you could also teach the other approaches you mentioned ("Some of the other sites will likely require a different approach"). @Life is complex – tursunWali Feb 13 '21 at 17:16
  • I spent the time developing an answer that can handle several "author name" use cases from the URLs that you provided. The answer can be expanded to fit additional use cases as needed. Please accept this answer, because it has all the details that you requested in your question. – Life is complex Feb 14 '21 at 22:24

1 Answer


This is not a comprehensive answer, but it is one that you can build on. You will need to expand this code as you add additional sources. As I stated before, my Newspaper3k overview document has lots of extraction examples, so please review it thoroughly.

Regular expressions should be a last-ditch effort, tried only after these extraction methods with newspaper3k (the code below demonstrates the first and last of them; a sketch of the meta tag and json approaches follows it):

  • article.authors
  • meta tags
  • json
  • soup

from newspaper import Config
from newspaper import Article
from newspaper.utils import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
        'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
        'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
        'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water-quality',
        'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']

for url in urls:
    try:
        article = Article(url, config=config)
        article.download()
        article.parse()
        author = article.authors
        if author:
            print(author)
        else:
            # Fall back to scraping a byline element from the raw HTML
            soup = BeautifulSoup(article.html, 'html.parser')
            author_tag = soup.find(True, {'class': ['td-post-author-name', 'byline']}).find(['a', 'span'])
            if author_tag:
                print(author_tag.get_text().replace('By', '').strip())
            else:
                print('no author found')
    except AttributeError:
        # Raised when no matching byline element exists on the page
        pass
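
The code above covers the first and last items from the list (article.authors and soup). For the other two items, meta tags and json, a minimal sketch could look like the following. The meta tag names, the JSON-LD layout, and the choice of the CNN URL from the list above are assumptions that vary from site to site, so adjust them per source.

import json

from newspaper import Config
from newspaper import Article
from newspaper.utils import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

url = 'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html'
article = Article(url, config=config)
article.download()
article.parse()
soup = BeautifulSoup(article.html, 'html.parser')

# Meta tags: 'author' and 'article:author' are common names, but they vary by site
meta_author = (soup.find('meta', attrs={'name': 'author'})
               or soup.find('meta', attrs={'property': 'article:author'}))
if meta_author and meta_author.get('content'):
    print(meta_author['content'])

# JSON-LD: many news sites embed schema.org metadata that contains an 'author' key
for script in soup.find_all('script', {'type': 'application/ld+json'}):
    try:
        data = json.loads(script.string or '')
    except json.JSONDecodeError:
        continue
    if isinstance(data, dict) and 'author' in data:
        author = data['author']
        if isinstance(author, dict):
            print(author.get('name'))
        elif isinstance(author, list):
            print([item.get('name') for item in author if isinstance(item, dict)])
        else:
            print(author)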
Life is complex
  • Now I have doubts about using newspaper3k as a complete solution for web article scraping: title, URL, authors, images, text, publication date, and summary. Is there any other tool worth trying that is better than newspaper3k, dear experts? – tursunWali Feb 16 '21 at 03:13
  • newspaper3k isn't perfect, but it's a decent Python module. I have learned the module's parameters, and I know that I need to look at the source of a target website to determine which harvesting methods will work. You have to do the same thing when using Beautiful Soup, Selenium, or Scrapy. My newspaper3k overview document was written to show others how to use the module against specific sites, and it gives anyone the base knowledge needed to wield the library efficiently. – Life is complex Feb 16 '21 at 12:41
  • Thank you, I took that as a "no" answer. So Newspaper is still a choice. – tursunWali Feb 16 '21 at 14:29
  • Not a no, but a maybe with limitations. – Life is complex Feb 16 '21 at 18:20
  • I mean, Newspaper is still the way to go. – tursunWali Feb 16 '21 at 18:24
  • Yes, newspaper is still a thing to use when harvesting specific content. – Life is complex Feb 16 '21 at 18:26
  • I'm still stuck trying to find "by"; please help: `s = requests_html.HTMLSession()`, `page = s.get('http://oshawaexpress.ca/oshawa-student-receives-70000-scholarship/')`, `soup = bs(page.text, 'lxml')`, `teks = soup.get_text()`, `teks = "\n".join([ll.rstrip() for ll in teks.splitlines() if ll.strip()])`. I also tried `for line in teks: m = re.search(r'(by\=.+?(?= )|by\=.+?$)', line)` and, to find and print the 'by' word, `for line in teks: for part in line.split(): if "by" in part: print(part)` (a cleaned-up version of this snippet appears after this thread). – tursunWali Feb 16 '21 at 18:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/228808/discussion-between-life-is-complex-and-tursunwali). – Life is complex Feb 16 '21 at 19:16
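
For reference, a readable working version of the code in the second-to-last comment could look like the sketch below. It assumes the requests-html, beautifulsoup4, and lxml packages are installed, iterates over lines of the visible text rather than characters, and prints whatever follows the first case-insensitive "by" on a line; real bylines will still need site-specific handling.

import re

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
page = session.get('http://oshawaexpress.ca/oshawa-student-receives-70000-scholarship/')
soup = BeautifulSoup(page.text, 'lxml')

# Collapse the visible text into non-empty lines
text = "\n".join(line.rstrip() for line in soup.get_text().splitlines() if line.strip())

# Search line by line (not character by character) for the first "by"
for line in text.splitlines():
    match = re.search(r'\bby\s+(.+)', line, re.IGNORECASE)
    if match:
        print(match.group(1))
        break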