Removing All Spacing from HTML String

Question

I'm trying to write code that removes all whitespace from a page's HTML source and then counts the 3 most frequent alphanumeric characters. My question is twofold.

1) The split-and-join approach I'm using doesn't seem to work, and I'm not sure why. To the best of my knowledge, splitting and then joining should remove all whitespace from the HTML source, but it doesn't (see the first value returned in the amazon example below).

2) I'm not very familiar with the most_common method. When I tested my code on "http://amazon.com", I got the following output:

The top 3 occuring alphanumeric characters in the html of http://amazon.com 
:  [(u' ', 258), (u'a', 126), (u'e', 126)]

What does the u mean in the returned most_common(3) values?

My Current Code:

import requests
import collections


url = raw_input("please enter the url of the desired website (include http://): ")

response = requests.get(url)
responseString = response.text

print responseString

topThreeAlphaString = " ".join(filter(None, responseString.split()))

lineNumber = 0

for line in topThreeAlphaString:
    line = line.strip()
    lineNumber += 1

topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3)

print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha

The `u` prefix means the value is a unicode string. You are joining the pieces back together with a space, `" ".join(...)`, which is why the space is still there. Join with an empty string instead: `"".join(...)` — AChampion, Mar 02 '17 at 02:17
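To make the comment above concrete, here is a minimal sketch comparing the two joins. It is written for Python 3 so it runs without the `print` statement and `raw_input` used in the question, and `sample` is a made-up stand-in for `response.text`, not real page content. In Python 2 the `u` prefix is only how a unicode string is displayed; it is not part of the data being counted.

```python
import collections

# Hypothetical stand-in for response.text; any string with mixed whitespace works.
sample = "<p>  aa b\n\tc </p>"

# split() with no arguments breaks on runs of whitespace and drops empty pieces,
# so the filter(None, ...) call in the question is not needed.
pieces = sample.split()

collapsed = " ".join(pieces)  # whitespace runs become single spaces
stripped = "".join(pieces)    # whitespace is removed entirely

print(repr(collapsed))
print(repr(stripped))
print(collections.Counter(stripped).most_common(3))
```

With the space separator, the space character is still in the string and can end up as the most common character, which matches the first value in the amazon output.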

score 0 · Answer 1 · answered Mar 02 '17 at 02:23

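To handle the whitespace, create an instance of HTMLParser.HTMLParser and use its unescape method to convert HTML entities (for example &nbsp; or &amp;) into the characters they stand for. Then split the result on whitespace and join the pieces with an empty string. To count the characters, collections.Counter, which you are already using, does the job.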

import requests
from collections import Counter
from HTMLParser import HTMLParser

response = requests.get('http://www.example.com')
responseString = response.text

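parser = HTMLParser()
# Decode HTML entities, split on whitespace, and rejoin with no separator.
c = Counter(''.join(parser.unescape(responseString).split()))

# The three most common characters as (character, count) pairs.
print(c.most_common(3))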
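The code above counts every non-whitespace character, including punctuation such as `<` and `>`. Since the question asks for alphanumeric characters only, one possible variation is to filter on `str.isalnum()`. This is a sketch under a few assumptions: it targets Python 3, where `html.unescape` is the standard-library counterpart of `HTMLParser().unescape`; the helper name `top_alphanumeric` is made up for illustration; and `sample` stands in for `response.text`.

```python
from collections import Counter
from html import unescape  # Python 3 counterpart of HTMLParser().unescape


def top_alphanumeric(text, n=3):
    """Return the n most common alphanumeric characters in text."""
    decoded = unescape(text)  # e.g. "&amp;" becomes "&"
    # str.isalnum() is False for whitespace and punctuation, so filtering on it
    # drops spaces, angle brackets, and quotes in a single pass.
    return Counter(ch for ch in decoded if ch.isalnum()).most_common(n)


# Hypothetical stand-in for response.text.
sample = "<p>banana &amp; bread</p>"
print(top_alphanumeric(sample))
```

Decoding entities first matters here: without it, the letters in `&amp;` would be counted as if they were page text. Note that letters inside tag names are still counted, since the question asks about the HTML source rather than the visible text.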
