I'm a newbie to text mining and am working on a toy project that scrapes the text from a website and splits it into tokens. However, after downloading the content with BeautifulSoup, I could not split it with the .split method. Here is my code:
# -*- coding: utf-8 -*-
import nltk
import operator
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
url = 'http://python.org/'
response = http.request('GET', url)
# nltk.clean_html has been dropped by NLTK
clean = BeautifulSoup(response.data,"html5lib")
# clean is expected to hold the entire text with all the HTML noise removed
tokens = [tok for tok in clean.split()]
print tokens[:100]
Python raises this error:
TypeError: 'NoneType' object is not callable
According to a previous Stack Overflow question, this happens because:
clean is not a string, it's a bs4.element.Tag. When you try to look up split in it, it does its magic and tries to find a subelement named split, but there is none. You are calling that None.
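The quoted explanation can be reproduced without any network access. This is a minimal sketch that assumes a made-up one-paragraph HTML string and the standard-library "html.parser" backend instead of html5lib. It shows that attribute access on a BeautifulSoup object is treated as a search for a child tag of that name:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
soup = BeautifulSoup("<p>Hello world</p>", "html.parser")

# Attribute access is treated as soup.find("split"): a search for a <split> tag.
attr = soup.split
print(attr)      # None, because the document contains no <split> element

# The same mechanism finds tags that do exist.
print(soup.p)    # the <p> element

# Calling the None returned above reproduces the error from the traceback.
try:
    soup.split()
    raised = False
except TypeError:
    raised = True
print(raised)
```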
In this case, how should I adjust my code to get the tokens? Thank you.
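One possible adjustment is to extract the text from the parsed document first and split that plain string, rather than calling split on the BeautifulSoup object itself. The sketch below assumes a small made-up HTML string and the "html.parser" backend so that it runs offline. With the real page, response.data from the code above would be passed in place of the html variable:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; replace with response.data for the real site.
html = (
    "<html><head><title>Welcome</title></head>"
    "<body><p>Hello, <b>text</b> mining!</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text() returns an ordinary string; the separator keeps
# text from adjacent tags from being glued together.
text = soup.get_text(separator=" ")

# Plain str.split() now works and splits on any run of whitespace.
tokens = text.split()
print(tokens[:100])
```

Since nltk is already imported in the original code, nltk.word_tokenize(text) may be an alternative to str.split() if punctuation should become separate tokens, assuming the required NLTK tokenizer data is installed.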