5

I need to count the words in a web page using Python 3. Which module should I use? urllib?

Here is my code:

import urllib.request

def web():
    # Fetch the page and print its raw bytes
    f = urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)
Fruchtzwerg
birajad

1 Answer

18

The self-explanatory code below is a good starting point for counting words within a web page:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page and parse it (naming the parser explicitly avoids a bs4 warning)
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content, "html.parser")

# Get the words within paragraphs
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Get the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the word counts from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()

For example, to get the 10 most common words:

total.most_common(10)

Which in this case outputs:

In [100]: total.most_common(10)
Out[100]: 
[('the', 2097),
 ('and', 1651),
 ('of', 998),
 ('in', 625),
 ('i', 592),
 ('a', 529),
 ('to', 529),
 ('that', 426),
 ('is', 369),
 ('my', 365)]
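To see what the counting step does on its own, here is a minimal offline sketch that applies the same `split()`, `rstrip(punctuation)` and `lower()` chain to a short string. The sample sentence is made up for illustration. Note that `rstrip` only removes trailing punctuation, so a word with a leading quote keeps it; `strip(punctuation)` would remove both sides.

```python
from collections import Counter
from string import punctuation

# Hypothetical sample text, not taken from the page above
text = 'The cat, the "dog" and the bird.'

# Same normalisation as in the answer: split on whitespace,
# drop trailing punctuation, lowercase
tokens = [w.rstrip(punctuation).lower() for w in text.split()]
counts = Counter(tokens)

print(counts["the"])      # 3
print('"dog' in counts)   # True: the leading quote survives rstrip
print("dog" in counts)    # False

# Stripping both ends gives the bare word instead
both_sides = Counter(w.strip(punctuation).lower() for w in text.split())
print(both_sides["dog"])  # 1
```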
Cedric Zoppolo
  • 1
    I don't know who downvoted this question. A downvote for no reason. – birajad Nov 08 '17 at 02:10
  • 1
    Not me. By the way, if you found my answer useful, please consider upvoting/accepting it. – Cedric Zoppolo Nov 08 '17 at 02:13
  • 1
    My vote is not counted because of inadequate reputation. However, you have my upvote. I was just wondering if I can check Python code for plagiarism, but I am not getting any response from anyone. – birajad Nov 08 '17 at 02:28
  • You can accept answers with the tick under the up and down vote – Cedric Zoppolo Nov 08 '17 at 02:29
  • Okay, I just did that, and thank you. Do you know of any online websites to check code for plagiarism for free, if I may ask? – birajad Nov 08 '17 at 03:26
  • No. Sorry. I don't know. You may ask a new question for that. Good luck. – Cedric Zoppolo Nov 08 '17 at 12:29
  • 2
    I found that the above method may not output exact numbers, as a paragraph can be within a div and vice versa. Not sure how it works, but I found an interesting online tool for checking word counts within a website: https://wordcounter.net/website-word-count – Cedric Zoppolo May 09 '19 at 19:50
  • Using this with the example URL given (https://en.wikiquote.org/wiki/Kahlil_Gibran) shows it vastly overstates the words. E.g. CTRL+F and searching "the" on the actual page only returns 687 results at the time of writing, while this says "the" appears 2139 times. – ZhouW Dec 30 '20 at 02:32
  • @ZhouW you are right. The problem is that many web pages have 'div' and 'p' elements nested one within the other, as I mentioned in my previous comment. – Cedric Zoppolo Jan 06 '21 at 22:55
  • Interesting. Another tool I found that does the same thing fairly accurately is this webpage word counter: https://randomtools.io/webpage-word-counter/ – user2475624 Mar 06 '22 at 09:07
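
The double counting discussed in the comments comes from walking both 'p' and 'div' elements, so text in a paragraph nested inside a div is counted more than once. One way around it, sketched below, is to visit each text node exactly once. This is not the answer's method: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the `TextCollector` class, the `count_words` helper and the sample markup are all hypothetical names invented for illustration. It also skips `<script>` and `<style>` contents, which is an assumption about what should count as page text.

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation


class TextCollector(HTMLParser):
    """Collect every text node once, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html):
    """Return a Counter of lowercased words, each text node counted once."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    words = (w.strip(punctuation).lower()
             for chunk in collector.chunks for w in chunk.split())
    return Counter(w for w in words if w)


# Hypothetical sample markup: a <p> nested inside a <div>
sample = "<div><p>The cat and the dog.</p><script>var the = 1;</script></div>"
counts = count_words(sample)
print(counts.most_common(1))  # [('the', 2)]
```

For a live page, the HTML string could come from `requests.get(url).text`, as in the answer. The counts may still differ from a browser's CTRL+F, since that search matches substrings (for example "the" inside "other") and only covers rendered text.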