
I am a linguist trying to figure out how to use NLTK and how to tag parts of speech in corpora.

I am trying to use the pos_tag function and get the same error message as another poster: 'ascii' codec can't decode byte...

See this link: NLTK 3 POS_TAG throws UnicodeDecodeError

I tried all of the suggested solutions, including the one given by the original poster, but none of them worked.

Are there any more possible solutions to this problem?

Community
  • Welcome to Stack Overflow. "I have the same problem as this guy, I tried the solution but it didn't work" doesn't leave us much to go on. Try reading a short text (a couple of sentences), and come back here with the text, its encoding, and the error message. – alexis Jun 02 '15 at 20:53
  • PS. If you're getting Unicode errors, you'll be much better off if you just forget about Python 2. Python 3 is much better at handling multiple encodings. – alexis Jun 02 '15 at 20:54
  • PPS. The question you link to is obsolete: the current NLTK version (3.0.2) *is* compatible with Python 3, and you should use it that way. – alexis Jun 02 '15 at 20:56

1 Answer


It sounds like you are getting a Unicode error. Where is your corpus from? Your text probably contains bytes outside the ASCII range, such as 0xd1 or something similar. This is a pretty standard issue to run into, and it can be painful to deal with. In my experience, you have to use regular expression substitutions to remove these characters.

What is the exact error? If you post it, I can help you write a regex to remove the offending characters.
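In the meantime, here is a minimal sketch of the regex approach, assuming Python 3. The sample bytes, the Latin-1 encoding, and the helper name strip_non_ascii are all made up for illustration; your corpus may well use a different encoding. It first reproduces the kind of error you describe, then decodes the bytes with an explicit encoding, and finally drops whatever is left outside the ASCII range.

```python
import re

# Hypothetical sample: Latin-1 bytes in which 0xd1 encodes "Ñ".
raw = b"ESPA\xd1A is a country"

# Decoding as ASCII fails on the 0xd1 byte, much like the error in the question.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decode with the encoding the data actually uses (assumed to be Latin-1 here).
text = raw.decode("latin-1")

# Regex substitution that removes every run of non-ASCII characters.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def strip_non_ascii(s):
    """Return s with all characters outside the ASCII range removed."""
    return NON_ASCII.sub("", s)


cleaned = strip_non_ascii(text)

# Naive whitespace tokenization; the tokens could then be passed to nltk.pos_tag.
tokens = cleaned.split()
```

Keep in mind that stripping characters throws information away ("ESPAÑA" becomes "ESPAA"), so if you know the real encoding of your corpus, decoding with it and keeping the text as it is may already be enough.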

brandomr