I am trying to get the content of a news article with the Python newspaper module. The following code finds the body of a news item: it parses the feed URL stored in the feed_url variable with feedparser, then tries to extract the news body and publishing date with newspaper.

    import newspaper
    from newspaper import Article
    import feedparser
    import urllib.parse

    count = 0
    feed_url = "https://www.extremetech.com/feed"
    # feed_url = "http://www.prothomalo.com/feed/"
    d = feedparser.parse(feed_url)
    for post in d.entries:
        count += 1
        if count == 2:
            break
        # post_link = post.link
        # Added later to decode the encoded URL into the original Bengali language
        post_link = urllib.parse.unquote(post.link)
        print("count= ", count, " url = ", post_link, end="\n ")
        try:
            content = Article(post_link)
            content.download()
            content.parse()
            print(" content = ", end=" ")
            print(content.text[0:50])
            print(" content.publish_date = {}".format(content.publish_date))
        except Exception as e:
            print(e)

The code mentions two different values for the feed_url variable: one from the extremetech site and another from the prothomalo website.
For example, extremetech has a news item (which I get through feedparser.parse) with the URL
https://www.extremetech.com/computing/263951-mit-announces-new-neural-network-processor-cuts-power-consumption-95
I can easily get the news body text and publishing date for this URL.
But prothomalo has, for example, a news item with this URL (obtained from feedparser.parse):
http://www.prothomalo.com/sports/article/1432086/%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93-%E0%A6%B9%E0%A6%BE%E0%A6%B0
The actual URL does not look like this on the prothomalo website, though. If you visit the URL, you will find that it changes into Bengali in the address bar. I think the URL is encoded this way (percent-encoded, as far as I can tell, rather than encrypted) because some parts of it are in Bengali. The content of the page is also in Bengali.
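
To illustrate what the %E0%A6... sequences are: each %XX is one byte of the UTF-8 encoding of a Bengali character, so the encoded and the Bengali forms carry the same information. Below is a minimal sketch with the standard library, using the last path segment of the prothomalo URL above. The variable names are illustrative only and do not come from newspaper or feedparser.

```python
from urllib.parse import quote, unquote

# Last path segment of the prothomalo URL quoted in the question,
# split at the hyphens for readability.
encoded_slug = (
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A"
    "-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0"
    "-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93"
    "-%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)

# unquote() turns the %XX escapes into bytes and decodes them as UTF-8.
decoded_slug = unquote(encoded_slug)

# quote() does the reverse: UTF-8 encode, then escape every non-ASCII byte.
reencoded_slug = quote(decoded_slug)

print(len(encoded_slug), "characters encoded,", len(decoded_slug), "decoded")
print("round trip is lossless:", reencoded_slug == encoded_slug)
```

The round trip being lossless is the point here: decoding changes how the URL looks, not which page it names.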
The Python newspaper module can extract the content and publishing date from the extremetech site but not from prothomalo. Is the failure due to the non-English characters in the prothomalo URL?
How can I get the news content, publishing date, etc. from the prothomalo site (i.e. perhaps from any site with non-English URLs) as well?
EDIT 1:
I could decode the encoded prothomalo URL into the original Bengali with the line post_link = urllib.parse.unquote(post.link). Still, I cannot get the content or publishing date.
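
For reference, here is a minimal sketch of the opposite direction: turning a URL that contains Bengali characters back into the percent-encoded form that the feed returns. The helper to_percent_encoded is hypothetical (it is not part of newspaper, feedparser, or urllib) and it assumes an ASCII host name, so it does not handle internationalized domain names. It is not a fix for the extraction problem. It only shows that the decoded and encoded links name the same address, which may explain why unquoting alone changes nothing.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_percent_encoded(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8.

    Hypothetical helper for illustration only. The host is left untouched,
    so an ASCII host name is assumed.
    """
    parts = urlsplit(url)
    # "%" stays safe so escapes that are already present are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# The prothomalo link from the question, as returned by the feed.
feed_link = (
    "http://www.prothomalo.com/sports/article/1432086/"
    "%E0%A6%B8%E0%A6%B0%E0%A7%8D%E0%A6%AC%E0%A7%8B%E0%A6%9A%E0%A7%8D%E0%A6%9A"
    "-%E0%A6%B8%E0%A7%8D%E0%A6%95%E0%A7%8B%E0%A6%B0"
    "-%E0%A6%97%E0%A7%9C%E0%A7%87%E0%A6%93"
    "-%E0%A6%B9%E0%A6%BE%E0%A6%B0"
)
decoded_link = unquote(feed_link)  # the Bengali form used in EDIT 1

print("decoded form maps back to feed link:",
      to_percent_encoded(decoded_link) == feed_link)
print("encoded form is left unchanged:",
      to_percent_encoded(feed_link) == feed_link)
```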