How to get the same result in book "Web Scraping with Python: Collecting Data from the Modern Web" Chapter 7 Data Normalization section

Question

Python version: 2.7.10

My code:

# -*- coding: utf-8 -*-

from urllib2 import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict
import re
import string

def cleanInput(input):
    input = re.sub('\n+', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    # input = bytes(input, "UTF-8")
    input = bytearray(input, "UTF-8")
    input = input.decode("ascii", "ignore")

    cleanInput = []
    input = input.split(' ')

    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = []

    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
html = urlopen(url)
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = ngrams(content, 2)
keys = range(len(ngrams))
ngramsDic = {}
for i in range(len(keys)):
    ngramsDic[keys[i]] = ngrams[i]
# ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
ngrams = OrderedDict(sorted(ngramsDic.items(), key=lambda t: t[1], reverse=True))


print ngrams
print "2-grams count is: " + str(len(ngrams))

I recently learning how to do web scraping by following the book Web Scraping with Python: Collecting Data from the Modern Web, while in Chapter 7 Data Normalization section I first write the code as same as the book shows and got an error from the terminal:

Traceback (most recent call last):
  File "2grams.py", line 40, in <module>
    ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
AttributeError: 'list' object has no attribute 'items'

Therefore I've changed the code by creating a new dictionary where the entities are the lists of ngrams. But I've got a quite different result:

Question:

If I wanna have the result as the book shows (where sorted by values and the frequency), should I write my own lines to count the occurrence of each 2-grams, or the code in the book already had that function (codes in the book were python 3 code) ? book sample code on github
The frequency in my output was quite different with the author's, for example [u'Software', u'Foundation'] were occurred 37 times but not 40. What kinds of reasons causing that difference (could it be my code errors)?

Book screenshot:

score 1 · Answer 1 · answered Jan 10 '16 at 20:44

1

Got an error in this chapter too because ngrams was a list. I converted it to dict and it worked

ngrams1 = OrderedDict(sorted(dict(ngrams1).items(), key=lambda t: t[1], reverse=True))

answered Jan 10 '16 at 20:44

Sandor Caetano

145
5

score 1 · Answer 2 · answered Mar 28 '16 at 06:31

I got the same problem when I read this book.ngrams should be dict. python version 3.4

here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict
import re
import string

def cleanInput(input):
    input = re.sub('\n+',' ', input)
    input = re.sub('\[0-9]*\]', '', input)
    input = re.sub('\+', ' ', input)
    input = bytes(input, 'utf-8')
    input = input.decode('ascii', 'ignore')
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) >1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "lxml")
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams1 = ngrams(content, 2)
#ngrams1  is something like this [['This', 'article'], ['article', 'is'], ['is', 'about'], ['about', 'the'], ['the', 'programming'], ['programming', 'language'],
ngrams = {}
for i in ngrams1:
    j = str(i)   #the key of ngrams should not be a list
    ngrams[j] = ngrams.get(j, 0) + 1
    # ngrams.get(j, 0) means return a value for the given key j. If key j is not available, then returns default value 0.
    # when key j appear again, ngrams[j] = ngrams[j]+1

ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)
print("2-grams count is:"+str(len(ngrams)))

This is a part of my result:

OrderedDict([("['Python', 'Software']", 37), ("['Software', 'Foundation']", 37), ("['of', 'the']", 37), ("['of', 'Python']", 35), ("['Foundation', 'Retrieved']", 32),

bkalcho · Answer 3 · 2017-07-16T11:06:46.593

More elegant solution to this would be using collections.defaultdict.

Here is my code (using Python 2.7+):

import requests
import re
import string
from bs4 import BeautifulSoup
from collections import OrderedDict, defaultdict


def clean_input(input):
    input = re.sub('\n+', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input).decode(encoding='utf-8')
    input = input.encode(encoding='ascii', errors='ignore')
    clean_input = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            clean_input.append(item)
    return clean_input


def ngrams(input, n):
    input = clean_input(input)
    output = []
    for i in xrange(len(input)-n+1):
        output.append(input[i:i+n])
    return output


response = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language")
bsObj = BeautifulSoup(response.content, "html.parser")
content = bsObj.find("div", {"id":"mw-content-text"}).get_text()
ngrams1 = ngrams(content, 2)
ngrams = defaultdict(int)
for k in ngrams1:
    ngrams[str(k)] += 1 
ngrams = OrderedDict(sorted(ngrams.items(), key=(lambda t: t[1]), reverse=True))
print ngrams
print "2-grams count is: %d" % len(ngrams)

This is part of my result:

OrderedDict([("['Python', 'programming']", 5), ("['programming', 'language']", 4), ("['for', 'Python']", 3), ("['the', 'page']", 2), ("['language', 'in']", 2), ("['sister', 'projects']", 1), ("['language', 'article']", 1), ("['page', 'I']", 1), ("['see', 'Why']", 1),

score 0 · Answer 4 · answered Dec 09 '17 at 05:04

List doesn't have a item. I just changed list to dict. Here is my code I changed

def ngrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input)-n+1):
        new_ng = " ".join(input[i:i+n])
        if new_ng in output:
            output[new_ng] += 1
        else:
            output[new_ng] = 1
    return output

Kernel · Answer 5 · 2018-01-24T09:32:54.283

Actually, most of our programming books already told you where to find the material or code about the book you are reading.

For this book, you can find all example code at:

http://pythonscraping.com/code/ and will redirect you to

https://github.com/REMitchell/python-scraping.

Then you can find your code in chapter7 folder. You can see following screenshot in your book, and the url of example code which I marked with blue box:

Example code in 2-clean2grams.py:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import OrderedDict

def cleanInput(input):
    input = re.sub('\n+', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input)-n+1):
        newNGram = " ".join(input[i:i+n])
        if newNGram in output:
            output[newNGram] += 1
        else:
            output[newNGram] = 1
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "html.parser")
content = bsObj.find("div", {"id":"mw-content-text"}).get_text()
#ngrams = getNgrams(content, 2)
#print(ngrams)
#print("2-grams count is: "+str(len(ngrams)))

ngrams = getNgrams(content, 2)
ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)

In this example code you might get result like:

[('Python Software', 37), ('Software Foundation', 37), ...

If you want your result like:

[("['Python', 'Software']", 37), ("['Software', 'Foundation']", 37), ...

You just need make a little modification as follow:

How to get the same result in book "Web Scraping with Python: Collecting Data from the Modern Web" Chapter 7 Data Normalization section

5 Answers5