Python newspaper - cannot extract article if the URL is not in English

Question

I am trying to get the content of a news article with the Python newspaper module. I can find the body of a news item with the following code. The code parses the feed URL in the feed_url variable with feedparser, then tries to find the news body and publishing date with the newspaper module.

import newspaper
from newspaper import Article
import feedparser
import urllib.parse

count = 0
feed_url="https://www.extremetech.com/feed"
#feed_url="http://www.prothomalo.com/feed/"
d = feedparser.parse(feed_url)
for post in d.entries:
    count+=1
    if count == 2:
        break

    # post_link = post.link
    # Added later: decode the percent-encoded URL back into the original Bengali
    post_link = urllib.parse.unquote(post.link)
    print("count= ", count, " url = ", post_link, end="\n ")

    try:

        content = Article(post_link)
        content.download()
        content.parse()
        print(" content = ", end=" ")
        print(content.text[0:50])
        print(" content.publish_date = {}".format(content.publish_date))


    except Exception as e:
        print(e)

The code shows two different values for the feed_url variable: one from the extremetech site and the other from the prothomalo website.

For example, extremetech has a news item (which I get through feedparser.parse) with the URL https://www.extremetech.com/computing/263951-mit-announces-new-neural-network-processor-cuts-power-consumption-95 and I can easily get the news body text and publishing date for this URL.

But prothomalo, for example, has a news item with this URL (obtained from feedparser.parse): http://www.prothomalo.com/sports/article/1432086/%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-%E0%A6%B9%E0%A6%BE%E0%A6%B0

But the actual URL does not look like this on the prothomalo website. If you visit the URL, you will find that it is shown in Bengali. I think the URL looks encoded (encrypted?) like this because parts of it are in Bengali. The content of the page is also in Bengali.
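The form returned by feedparser is percent-encoding rather than encryption: each non-ASCII character is written as its UTF-8 bytes in %XX form. A minimal sketch using only the standard library and the prothomalo URL quoted above; the variable names are illustrative, and the sketch only looks at the URL string without fetching the page.

```python
from urllib.parse import quote, unquote

# URL as returned by feedparser: the Bengali part is percent-encoded UTF-8.
encoded_url = (
    "http://www.prothomalo.com/sports/article/1432086/"
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-"
    "%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-"
    "%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-"
    "%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)

# Decoding turns every %XX%XX%XX triple back into one Bengali character.
decoded_url = unquote(encoded_url)
slug = decoded_url.rsplit("/", 1)[-1]

# Encoding the decoded form again (keeping ":" and "/" literal)
# reproduces the original string, so no information is lost either way.
reencoded_url = quote(decoded_url, safe=":/")

print(len(slug), "characters in the decoded last path segment")
print(reencoded_url == encoded_url)
```

Since the two forms convert into each other losslessly, the encoded URL from the feed and the Bengali URL seen in the browser should refer to the same page.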

The Python newspaper module can extract the content and publishing date from the extremetech site but not from prothomalo. Is the failure due to the non-English characters in the prothomalo URL?

How can I get the news content, publishing date, etc. from the prothomalo site (i.e. perhaps from sites with non-English URLs) as well?

EDIT 1: I could decode the encoded prothomalo URL into the original Bengali with the line post_link = urllib.parse.unquote(post.link). I still cannot get the content or publishing date.
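One way to see why decoding may not change anything: the encoded and decoded strings can be normalized to the same ASCII URL, so an HTTP client would be expected to request the same resource either way. The sketch below assumes a hypothetical helper name (to_ascii_url) and a made-up example.com URL; it uses only the standard library and does not call newspaper.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with its path percent-encoded as UTF-8 (hypothetical helper)."""
    parts = urlsplit(url)
    # Unquote first so an already-encoded path is not encoded a second time.
    # Limitation: an encoded reserved character such as %2F would be decoded
    # to a literal "/" by this round trip.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# Made-up URL for illustration; the last segment is three Bengali characters.
encoded = "http://example.com/article/1/%E0%A6%B9%E0%A6%BE%E0%A6%B0"
decoded = unquote(encoded)

# Both spellings normalize to the same ASCII URL.
print(to_ascii_url(encoded) == to_ascii_url(decoded))
print(to_ascii_url(decoded) == encoded)
```

If both forms name the same page, an empty result would point at how the page content is parsed rather than at the URL.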

I removed my answer. In my test, both URIs (with and without the last part) gave the same result with `newspaper.Article`. It looks like the URI doesn't have any impact, but the page itself does. Could you confirm? - Arount, Feb 16 '18 at 10:44
@Arount, 'It looks like the URI doesn't have any impact, but the page itself does' - what do you mean? - Istiaque Ahmed, Feb 16 '18 at 10:45
I tried to parse the page with `newspaper.Article` and got empty text with both the long URL and the shortened one (the one without the last part). So I suspect `newspaper.Article` is not able to parse the page's **content**, whatever the URL is. - Arount, Feb 16 '18 at 10:46
Sorry for the last comment, I misread. Well, maybe it does not work for non-English content. I don't know this library, to be honest. - Arount, Feb 16 '18 at 10:55
@Arount, the doc lists several languages that the module works with, and Bengali is not listed there: 'If you are certain that an entire news source is in one language, go ahead and use the same api :)' - https://pypi.python.org/pypi/newspaper3k/0.1.5 - Istiaque Ahmed, Feb 16 '18 at 10:58


0 Answers