I'm following a tutorial on LDA and have run into a problem: the tutorial is written for Python 3 and I'm working in Python 2.7 (the tutorial claims to work in both). As far as I understand, in Python 2.x I need to convert strings to unicode before I can call token.isnumeric(), because plain str objects don't have that method. Due to my lack of experience I'm not sure how to do this cleanly in the following script. Does anyone have a solution?
import os
from nltk.tokenize import RegexpTokenizer

data_dir = 'nipstxt/'
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen) as fid:
            txt = fid.read()
        docs.append(txt)
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
# Remove tokens that are purely numeric (this is the line that fails in 2.7).
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]
# Remove one-character tokens.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
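A minimal sketch of one way the conversion might look, assuming the files are UTF-8 encoded (the script does not say which encoding they use). The helper names read_text, to_unicode and drop_numeric are hypothetical and not part of the tutorial. The idea is either to read the files as unicode in the first place with io.open, which accepts encoding and errors on both Python 2.7 and 3, or to decode byte-string tokens just before calling isnumeric().

```python
import io


def read_text(path, encoding='utf-8'):
    # Assumption: the files are UTF-8. io.open returns unicode text on both
    # Python 2.7 and 3; errors='ignore' drops bytes that cannot be decoded.
    with io.open(path, encoding=encoding, errors='ignore') as fid:
        return fid.read()


def to_unicode(token, encoding='utf-8'):
    # On Python 2, `bytes` is an alias for `str`, so byte strings are decoded;
    # tokens that are already unicode text are returned unchanged.
    if isinstance(token, bytes):
        return token.decode(encoding, 'ignore')
    return token


def drop_numeric(doc):
    # Keep only the tokens that are not purely numeric.
    return [tok for tok in map(to_unicode, doc) if not tok.isnumeric()]
```

With these helpers, the file-reading step could become txt = read_text(data_dir + yr_dir + '/' + filen), and the numeric filter could become docs = [drop_numeric(doc) for doc in docs]. If the files are read as unicode from the start, the decoding in to_unicode is probably only a safety net.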