I'm a newbie to text mining, working on a toy project to scrape the text from a website and split it into tokens. However, after downloading the content using BeautifulSoup, I failed to split it with the .split method using the following code:

# -*- coding: utf-8 -*-
import nltk
import operator
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
url = 'http://python.org/'
response = http.request('GET', url)
# nltk.clean_html was dropped from NLTK, so use BeautifulSoup instead
clean = BeautifulSoup(response.data, "html5lib")
# clean should now be the whole document with the HTML noise removed
tokens = [tok for tok in clean.split()]
print tokens[:100]

Python told me that

TypeError: 'NoneType' object is not callable

According to a previous Stack Overflow question, it's due to the fact that

clean is not a string, it's a bs4.element.Tag. When you try to look up split in it, it does its magic and tries to find a subelement named split, but there is none. You are calling that None.

In this case, how should I adjust my code to achieve my goal of getting the tokens? Thank you.
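A minimal sketch of what is actually happening (attribute access on a BeautifulSoup/Tag object searches for a child tag of that name, and returns None when there is none):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")
# There is no <split> tag in the document, so the lookup returns None
print(soup.split)   # None
# Calling it therefore calls None:
# soup.split() -> TypeError: 'NoneType' object is not callable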

    It almost appears to me that you haven't read the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. There is no single way of getting tokens from a page in a useful way. It's necessary to make a study of each page. – Bill Bell Sep 05 '17 at 15:03
  • Possible duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – Kos Sep 05 '17 at 15:41

1 Answer


You could use get_text() to return just the text from the HTML and pass that to NLTK's word_tokenize() as follows:

from bs4 import BeautifulSoup
import requests
import nltk

# word_tokenize needs NLTK's 'punkt' model: run nltk.download('punkt') once
response = requests.get('http://python.org/').content
soup = BeautifulSoup(response, "html.parser")
text_tokens = nltk.tokenize.word_tokenize(soup.get_text())

print(text_tokens)

(You can also use urllib3 to get your data)
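For completeness, a minimal sketch of the same approach keeping the urllib3 request from your original code (only the parser choice, "html.parser", differs from it):

import urllib3
import nltk
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
response = http.request('GET', 'http://python.org/')
# get_text() extracts just the text, with the markup stripped
soup = BeautifulSoup(response.data, "html.parser")
text_tokens = nltk.tokenize.word_tokenize(soup.get_text())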

This would give you output starting with:

[u'Welcome', u'to', u'Python.org', u'{', u'``', u'@', u'context', u"''", u':'...

If you are only interested in the words, you could further filter the returned list to remove entries consisting only of punctuation, for example:

import re
import string

# re.escape stops characters such as ']' in string.punctuation from breaking the character class
text_tokens = [t for t in text_tokens if not re.match('[' + re.escape(string.punctuation) + ']+', t)]
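As an alternative to filtering after the fact, NLTK's RegexpTokenizer can skip punctuation during tokenisation itself; a minimal sketch reusing the soup object from above:

from nltk.tokenize import RegexpTokenizer

# \w+ matches only runs of word characters, so punctuation never becomes a token
tokenizer = RegexpTokenizer(r'\w+')
text_tokens = tokenizer.tokenize(soup.get_text())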
Martin Evans