I'm trying to write code that strips all whitespace from a page's HTML source and then reports the three most frequent alphanumeric characters in it. My question is twofold.
1) The split approach I'm using doesn't seem to work, and I'm not sure why. As far as I understand, splitting and then joining should remove all whitespace from the HTML source, but it doesn't (see the first value returned in the amazon example below).
2) I'm not very familiar with most_common, and when I tested my code on "http://amazon.com" I got the following output:
The top 3 occuring alphanumeric characters in the html of http://amazon.com
: [(u' ', 258), (u'a', 126), (u'e', 126)]
What does the u mean in the returned most_common(3) values?
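To narrow down point 1, here is a smaller reproduction that skips the network request. The `sample` string is made up for illustration and only stands in for `response.text`; the variable names are mine as well. As far as I can tell, it shows the same behaviour: the pieces come back separated by single spaces when I join on " ", and with no whitespace at all when I join on the empty string.

```python
# Hypothetical sample standing in for response.text (not real page content)
sample = u"<p>  a  b\n\tc </p>"

# Same expression as in my code: split() drops every run of whitespace,
# but " ".join() then puts one space back between the pieces.
joined_with_space = " ".join(filter(None, sample.split()))

# Joining on the empty string instead leaves no whitespace behind.
joined_without_space = "".join(sample.split())

print(repr(joined_with_space))
print(repr(joined_without_space))
```

If that reading is right, the space in my amazon output would be coming from the join separator rather than from split.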
My current code:
import requests
import collections
url = raw_input("please enter the url of the desired website (include http://): ")
response = requests.get(url)
responseString = response.text
print responseString
topThreeAlphaString = " ".join(filter(None, responseString.split()))
lineNumber = 0
for line in topThreeAlphaString:
    line = line.strip()
    lineNumber += 1
topThreeAlpha = collections.Counter(topThreeAlphaString).most_common(3)
print "The top 3 occuring alphanumeric characters in the html of", url,": ", topThreeAlpha
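For reference, this is roughly the result I'm after, sketched on a made-up string rather than a live page. It assumes that `isalnum()` is an acceptable definition of "alphanumeric" (so letters inside tag names count too); `sample`, `alnum_chars` and `top_three` are names invented for this sketch and are not part of the code above.

```python
import collections

# Hypothetical sample standing in for response.text (not real page content)
sample = u"<b>aaa bb 1</b>"

# Keep only letters and digits; whitespace and punctuation are both dropped.
alnum_chars = [ch for ch in sample if ch.isalnum()]

# Count what is left and take the three most frequent characters.
top_three = collections.Counter(alnum_chars).most_common(3)
print(top_three)
```

With this sample, 'b' should come first (it also appears in the two tags), followed by 'a' and then '1'.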