Effectively turning strings into unicode for python 2.7

Question

I'm following a tutorial on LDA and have run into a problem: the tutorial was written for Python 3 and I'm working in 2.7 (the tutorial claims to work in both). As far as I understand, I need to turn strings into unicode in Python 2.x before I can call token.isnumeric(). Due to my lack of experience I'm not sure how to do this nicely in the following script. Does anyone have a solution?

data_dir = 'nipstxt/'
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: ignoring characters that cause encoding errors.
        with open(data_dir + yr_dir + '/' + filen) as fid:
            txt = fid.read()
        docs.append(txt)

tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

docs = [[token for token in doc if len(token) > 1] for doc in docs]

Mark Ransom · Accepted Answer · 2016-12-07T16:32:10.423


The generic way to convert a byte string to a Unicode string is with decode. If you know the string will only contain ASCII characters (as a number will), you don't have to specify an encoding; in Python 2 it defaults to ascii.

docs = [[token for token in doc if not token.decode().isnumeric()] for doc in docs]
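As a minimal sketch of what that line does, the example below runs the same filter on a small hand-made list of byte-string tokens. The sample data is made up for illustration, and the encoding is spelled out as 'ascii' so the behaviour should be the same on Python 2.7 and Python 3.

```python
# Hypothetical sample data: one tokenized document of byte strings,
# which is what Python 2 gives you when a file is read with plain open().
docs = [[b'lda', b'2016', b'topic', b'42', b'model']]

# Decode each token to Unicode before calling isnumeric(). 'ascii' is
# passed explicitly so the result does not depend on the interpreter default.
docs = [[token for token in doc if not token.decode('ascii').isnumeric()]
        for doc in docs]

print(docs)
```

The tokens that remain are still byte strings; only the numeric test is done on the decoded copy.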

If there's any chance that the string will contain non-ASCII characters, you can have them replaced with the replacement character (U+FFFD), which won't count as numeric.

docs = [[token for token in doc if not token.decode(errors='replace').isnumeric()] for doc in docs]
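To illustrate the difference, here is a small sketch that wraps the check in a helper. The name is_numeric_token and the sample tokens are assumptions made for this example, not part of the tutorial. One of the tokens contains the byte 0xf8, which a strict ascii decode would reject.

```python
def is_numeric_token(token):
    """Return True if a byte-string token holds only numeric characters.

    Hypothetical helper for illustration. Bytes that are not valid ASCII
    are replaced with U+FFFD, which is not a numeric character.
    """
    return token.decode('ascii', errors='replace').isnumeric()


# Made-up tokens; b'\xf8l' starts with the non-ASCII byte 0xf8.
tokens = [b'1987', b'\xf8l', b'neural', b'7']

# Keep everything that is not purely numeric.
kept = [t for t in tokens if not is_numeric_token(t)]
print(kept)
```

With errors='replace' the non-ASCII token is simply treated as non-numeric and kept, instead of raising UnicodeDecodeError.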

edited Dec 07 '16 at 16:32

answered Dec 07 '16 at 16:21


Thanks, it seems to be on the right track since it gave me a new error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 0: ordinal not in range(128). I guess it means there are some non-ASCII characters. Can I set a parameter to deal with this? – WiggyStardust Dec 07 '16 at 16:35
@WiggyStardust I already anticipated that problem, see my edit. – Mark Ransom Dec 07 '16 at 16:40
