
I am trying to implement LDA on a set of tweets, with each tweet treated as a document. During preprocessing, the stemming step fails with this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

My code is shown below:

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import csv
import itertools

tokenizer = RegexpTokenizer(r'\w+')

en_stop = get_stop_words('en')

p_stemmer = PorterStemmer()

reader = csv.reader(open('/home/balki/Documents/Bangalore-13062016.csv', 'rU'), dialect=csv.excel_tab)

your_list = list(reader)
chain=itertools.chain(*your_list)
your_list2=list(chain)

texts = []

for i in your_list2:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    stopped_tokens = [token for token in tokens if token not in en_stop]

    # the UnicodeDecodeError is raised on this line
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    print(stemmed_tokens)

Please suggest what should be done.
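
For reference, a minimal sketch of decoding each row to unicode before stemming, assuming Python 2 and a UTF-8 encoded CSV (the file's actual encoding is an assumption, since it is not shown above):

# Sketch only: assumes Python 2 and UTF-8 encoded input; adjust the
# encoding to match the real CSV file.
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

line = 'It\xe2\x80\x99s a caf\xc3\xa9 tweet'  # UTF-8 byte string, like rows read by csv.reader

raw = line.decode('utf-8').lower()  # decode to unicode before tokenizing/stemming
tokens = raw.split()
print([p_stemmer.stem(t) for t in tokens])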
