Why does newspaper3k differentiate between http://cnn.com and http://www.cnn.com?

Question

When I run the Python code

import newspaper
print(len(newspaper.build('http://cnn.com', memoize_articles=False).articles))
exit()

in Python 3 I get the output 897 (i.e. newspaper3k found 897 pages considered articles on the domain http://cnn.com), but when I run

import newspaper
print(len(newspaper.build('http://www.cnn.com', memoize_articles=False).articles))
exit()

(i.e., with an additional www.; nothing else has changed) I only get 895. These numbers are consistent when I switch back and forth between the two URLs. Is the www. actually significant in a URL? If so, why are the article counts so similar for these two URLs when using the newspaper3k library? Otherwise, why isn't the article count exactly the same?

"Is the `www.` actually significant in a URL"—yes, certainly. Many sites default to their `www` subdomain, but there's nothing magical about it. It could be entirely different from the root domain, simply an alias, or anything in between. — ChrisGPT was on strike, Sep 13 '20 at 20:20
I mean, they redirect traffic from `example.com` to `www.example.com`, but this is not mandatory. The two domains could be entirely separate. — ChrisGPT was on strike, Sep 13 '20 at 20:39
CNN also redirects traffic depending upon whether it's international — Peter Wood, Sep 13 '20 at 20:46

score 1 · Answer 1 · answered Sep 13 '20 at 21:45

As you can see below, several URLs in the article list built from the www-less address appear in two variants:

with www
without www

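from collections import Counter

import newspaper

# Collect every article URL found when building from the www-less address.
articles = newspaper.build('https://cnn.com', memoize_articles=False).articles

# Drop the 'www.' host prefix so both variants of a URL compare equal.
urls = [a.url.replace('://www.', '://', 1) for a in articles]

# Keep the URLs that occur more than once after normalization.
duplicated = {u for u, n in Counter(urls).items() if n > 1}

for d in duplicated:
    print(d)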

result:

https://cnn.com/business/media
https://cnn.com/travel/news
https://cnn.com/travel/article/hong-kong-cbd-cafe-found-wellness-intl-hnk/index.html
https://cnn.com/travel/article/rent-fire-lookout-towers-covid-19/index.html
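To see which original URLs collapse onto the same page, a small standard-library helper can group them by their www-less form. This is only a sketch: `strip_www`, `group_by_normalized`, and `www_duplicates` are hypothetical helper names, not part of newspaper3k, and the `sample` list is illustrative data (two paths from the result above, plus a made-up `/world` entry), not real output from the library.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return url with a leading 'www.' removed from the host part only."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def group_by_normalized(urls):
    """Map each www-less URL to the set of original URLs that collapse onto it."""
    groups = defaultdict(set)
    for url in urls:
        groups[strip_www(url)].add(url)
    return dict(groups)


def www_duplicates(urls):
    """Keep only the www-less URLs reached by more than one distinct original."""
    return {
        key: variants
        for key, variants in group_by_normalized(urls).items()
        if len(variants) > 1
    }


# Illustrative input; in practice this would be [a.url for a in paper.articles].
sample = [
    "https://www.cnn.com/business/media",
    "https://cnn.com/business/media",
    "https://cnn.com/travel/news",
    "https://www.cnn.com/travel/news",
    "https://www.cnn.com/world",
]

for key, variants in sorted(www_duplicates(sample).items()):
    print(key, sorted(variants))
```

Unlike a plain `str.replace('www.', '')`, this only touches the host, so a path that happens to contain `www.` is left alone. Running the same grouping on the article lists from both builds and comparing the key sets may help show where the 897 versus 895 difference comes from.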
