Counting phrase frequencies in an html file

Question

I'm currently trying to get used to Python and have recently hit a block in my coding. I couldn't get code to run that would count the number of times a phrase appears in an HTML file. I recently received some help constructing code that counts phrase frequencies in a text file, but I'm wondering whether there is a way to do this directly from the HTML file (to avoid the copy-and-paste workaround). Any advice will be sincerely appreciated. The code I have used so far is the following:

#!/usr/bin/env python3
import collections
import re
from itertools import tee

# Generator that yields every lowercased word in the file, in order.
def findWords(filepath):
  with open(filepath) as infile:
    for line in infile:
      words = re.findall(r'\w+', line.lower())
      yield from words

phcnt = collections.Counter()

phrases = {'central bank', 'high inflation'}
# Two copies of the word stream; advancing fw2 by one word makes
# zip(fw1, fw2) yield consecutive word pairs.
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2, None)  # the default avoids StopIteration on an empty file
for w1, w2 in zip(fw1, fw2):
  phrase = ' '.join([w1, w2])
  if phrase in phrases:
    phcnt[phrase] += 1

print(phcnt)

@Ashish Nitin Patil: Unfortunately, that only gives me a way to count words, not phrases. – Raul Nov 18 '13 at 17:36

score 1 · Answer 1 · answered Nov 18 '13 at 08:13

You can use the string method str.count, called as some_str.count(some_phrase). It is case-sensitive, so lowercase the text first:

In [19]: txt = 'Text mining, also referred to as text data mining, Text mining,\
         also referred to as text data mining,'
In [20]: txt.lower().count('data mining')
Out[20]: 2
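
One caveat: str.count matches raw substrings, so 'data mining' is also counted inside 'metadata mining', and a phrase split across a line break is missed. Below is a minimal sketch of a whole-word variant, assuming a regular expression with word boundaries is acceptable here. The helper name count_phrase is made up for illustration and is not part of any library.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    # \b keeps 'data mining' from matching inside 'metadata mining';
    # \s+ lets the words be separated by any whitespace, including newlines.
    words = [re.escape(w) for w in phrase.split()]
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

txt = 'Metadata mining is not data mining.'
print(count_phrase(txt, 'data mining'))    # whole-word matches only: 1
print(txt.lower().count('data mining'))    # raw substring matches: 2
```

Whether the stricter matching is wanted depends on the phrases being counted; for phrases that never appear inside longer words, plain str.count gives the same result.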

mclafee


Hey man, the original code I posted works on text files but what I'm wondering is how to use it directly on an html file. – Raul Nov 18 '13 at 17:40

Adrian Genaid · Answer 2 · 2013-11-19T00:16:00.747

What about just stripping the html tags before doing the analysis? html2text does this job quite well.

import html2text
# infile is an already opened file object for the HTML document
content = html2text.html2text(infile.read())

This gives you the text content (with some Markdown-style formatting, but I think that is no problem for your approach). There are also options to ignore images and links, which you would use like this:

h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
content = h.handle(infile.read())
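
If installing a third-party package is not an option, the same idea (strip the markup first, then count phrases) can be sketched with the standard library's html.parser instead of html2text. This is only a rough sketch under that assumption: the names TextExtractor, html_to_text and count_phrases and the sample string are invented for illustration, and the counting step mirrors the two-word phrase approach from the question.

```python
import collections
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together.
    return ' '.join(parser.chunks)

def count_phrases(text, phrases):
    """Count two-word phrases, as in the question's code."""
    words = re.findall(r'\w+', text.lower())
    counts = collections.Counter()
    for w1, w2 in zip(words, words[1:]):
        phrase = w1 + ' ' + w2
        if phrase in phrases:
            counts[phrase] += 1
    return counts

sample = ('<html><body><p>The <b>central</b> bank fears high inflation.</p>'
          '<script>var x = "central bank";</script>'
          '<p>Central bank policy.</p></body></html>')
print(count_phrases(html_to_text(sample), {'central bank', 'high inflation'}))
```

For a real file, you would read its contents with open(...).read() and pass the resulting string to html_to_text. Joining the chunks with spaces means a word broken up by inline tags would be split in two, which is a trade-off of this simple approach.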
