Python Newspapers3k Newspapers library mutithreading hangs indefinitely

Question

I'm working on a project to extract articles from gaming media sites, and I'm doing a basic test run, which according to VSCode's debugger consistently hangs at the point after which I've set up a multi-threaded extraction (changing the number of threads does not help) on two sites. I'm honestly not sure what I'm doing wrong here; I followed the examples that have been laid out. One of the sites, Gamespot, is even used in someone's tutorial, and I tried removing the other (Polygon) and it doesn't seem to help. I've created a virtual environment and tried this with both Python 3.8 and 3.7. All dependencies appear to be satisfied; I also tested in in repl dot it and got the same hang.

I would love to hear I'm just doing something wrong so I can fix it; I really want to do some data science on these specific websites and their articles! But it seems as if, at least for an OS X user, there's some sort of bug with multithreading. Here's my code:

#import system functions
import sys
import requests
sys.path.append('/usr/local/lib/python3.8/site-packages/')
#import basic HTTP handling processes
#import urllib
#from urllib.request import urlopen
#import scraping libraries

#import newspaper and BS dependencies

from bs4 import BeautifulSoup
import newspaper
from newspaper import Article 
from newspaper import Source 
from newspaper import news_pool

#import broad data libraries
import pandas as pd

#import gaming related news sources as newspapers
gamespot = newspaper.build('https://www.gamespot.com/news', memoize_articles=False)
polygon = newspaper.build('https://www.polygon.com/gaming', memoize_articles=False)

#organize the gaming related news sources using a list
gamingPress = [gamespot, polygon]
print("About to set the pool.")
#parallel process these articles using multithreading (store in mem)
news_pool.set(gamingPress, threads_per_source=4)
print("Setting the pool")
news_pool.join()
print("Pool set")
#create the interim pandas dataframe based on these sources
final_df = pd.DataFrame()

#a limit on sources could be placed here; intentionally I have placed none
limit = 10

for source in gamingPress:
    #these are temporary placeholder lists for elements to be extracted
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        article_extract.parse()
        
        #further limit functionality could be placed here; not placed
        if count > limit:
            break

        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.apprend(article_extract.source_url)

        print(count)
        count +=1 #progress the loop *via* count

    temp_df = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append this to the final DataFrame
    final_df = final_df.append(temp_df, ignore_index=True)

#export to CSV, placeholder for deeper analysis/more limited scope, may remain
final.df.to_csv('gaming_press.csv')

and here's what I get back when I finally give up and hit the interrupt at console:


About to set the pool.
Setting the pool
^X^X^CTraceback (most recent call last):
  File "scraper1.py", line 31, in <module>
    news_pool.join()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 103, in join
    self.pool.wait_completion()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 63, in wait_completion
    self.tasks.join()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/queue.py", line 89, in join
    self.all_tasks_done.wait()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
KeyboardInterrupt

Update on this: I used two example scripts by other people that did the same thing, but with different news sources, and they also hang at news_pool.join(). I'm trying to rebuild my entire local Python environment in case this is an issue of newspaper3k needing an older version of Python, but I'm not hopeful since cloud compilers failed too. The traceback was essentially the same with other peoples' scripts. — Ellie Lockhart, Aug 29 '20 at 17:16
I have not, no - and it seems to me as if attempting to use the multithreaded extraction features of newspaper3k in *any* context causes a lock (and I've even tried virtualizing very simple code in the cloud, to make sure it wasn't my computer). This is unfortunate since multithreading is considered to be best practices for any sort of web harvesting. — Ellie Lockhart, Oct 12 '20 at 11:59
Ellie, Just checking in to see if you tried the code in my answer. — Life is complex, Nov 04 '20 at 22:53

Life is complex · Accepted Answer · 2020-10-18T15:37:58.277

I decided to look into Newspaper mutithreading issue. I looked at the source code for Newspaper on github and devised this answer. In my testing I was able to obtain the article titles.

It seems that this processing is time intensive, because it takes on average 6 minutes. After doing some more research it looks like the time lag is directly related to the articles being downloaded in the background. I'm unsure how to speed this process up using Newspaper.

import newspaper
from newspaper import Config
from newspaper import news_pool

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

gamespot = newspaper.build('https://www.gamespot.com/news', config=config, memoize_articles=False)
polygon = newspaper.build('https://www.polygon.com/gaming', config=config, memoize_articles=False)

gamingPress = [gamespot, polygon]

# this setting is adjustable 
news_pool.config.number_threads = 2

# this setting is adjustable 
news_pool.config.thread_timeout_seconds = 2

news_pool.set(gamingPress)
news_pool.join()

for source in gamingPress:
  for article_extract in source.articles:
    article_extract.parse()
    print(article_extract.title)

To be honest, I'm trying to determine the benefit of using news_pool. It seems from the comments in the source code of Newspaper that news_pool primary purpose is related to connection rate-limiting. I also noted that several attempts have been made to improve the threading model, but those code updates haven't been pushed into the production code.

Nevertheless... The answer below starts processing in under 1 minute and it doesn't use news_pool. More testing needs to be done to see if a source rate limits the connections or other issues arise.

import newspaper
from newspaper import Config
from newspaper import news_pool

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

gamespot = newspaper.build('https://www.gamespot.com/news', config=config, memoize_articles=False)
polygon = newspaper.build('https://www.polygon.com/gaming', config=config, memoize_articles=False)
gamingPress = [gamespot, polygon]
for source in gamingPress:
   source.download_articles()
   for article_extract in source.articles:
      article_extract.parse()
      print(article_extract.title)

Concerning the news_pool code section. For some reason I noted redundant article titles in my limited testing of your target sources.

Python Newspapers3k Newspapers library mutithreading hangs indefinitely

1 Answers1