I just want to do POS tagging, but I got an error.

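    import nltk

    # Read the whole article into memory (Python 2: this is a byte string)
    Text = open('news/article.txt')
    t = Text.read()
    print t

    # Tokenize, then tag each token with its part of speech
    text = nltk.word_tokenize(t)
    posTagged = nltk.pos_tag(text)
    print posTagged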

and got this:

    Maybe that is why whenever we go to watch any live sport in India they lock us within cages. Thanks to the Cricket Lovers in the Barabati Stadium in Cuttack, this is probably only going to get worse.
    But right now you, the Cricket Lovers at the Barabati Stadium, have a bigger problem to deal with. I hope you realize what you have done. You didn’t just disrupt a game last evening, you may have just ensured you won’t get international cricket in your city. So much for your love!
    Thanks to a bunch of hooligans, every Indian fan has been blackened. We are all hanging our heads in shame. This feeling is far worse than losing just a cricket match.

    Traceback (most recent call last):
      File "C:\Python27\TestProj1.py", line 12, in <module>
        posTagged=nltk.pos_tag(text)
      File "C:\Python27\lib\site-packages\nltk\tag\__init__.py", line 106, in pos_tag
        return tagger.tag(tokens)
      File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 61, in tag
        tags.append(self.tag_one(tokens, i, tags))
      File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 81, in tag_one
        tag = tagger.choose_tag(tokens, index, history)
      File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 634, in choose_tag
        featureset = self.feature_detector(tokens, index, history)
      File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 736, in feature_detector
        'prevtag+word': '%s+%s' % (prevtag, word.lower()),
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 4: ordinal not in range(128)

It works perfectly for some other text files, though. How can I solve this?

  • So you are getting a `UnicodeDecodeError`? Because your question had a different error a few minutes ago (sorry for the rollback confusion). – alexis Oct 06 '15 at 16:39
  • Your NLTK code is working fine, you just need to learn how to print unicode on your terminal (that's why regular ascii files work fine). It's a well-known topic, but a _different_ one from your previous question. – alexis Oct 06 '15 at 16:41
  • See [here](http://stackoverflow.com/questions/10569438/how-to-print-unicode-character-in-python). Better yet, switch to Python 3 _today._ – alexis Oct 06 '15 at 16:42
  • PS. I just saw the dates and figured out what you're doing: You should have just asked a new question with your new problem, not edited this long-closed one. – alexis Oct 06 '15 at 16:44
  • What is unicode? I created the text file by copying and pasting from an article. There are some other files I created in the same manner, and they work perfectly. These errors only come up in this case. – Salah Oct 06 '15 at 16:50
  • They barred me from asking questions for 7 days. @alexis – Salah Oct 06 '15 at 16:51
  • `'ascii' codec can't decode byte 0x92 in position 4: ordinal not in range(128)`. What is this? – Salah Oct 06 '15 at 16:56
  • `t.decode("eng")`: I think this will work. What is the valid encoding name for English? E.g. `t.decode("latin-1")`. – Salah Oct 06 '15 at 17:08
  • They barred you from asking questions? This question was closed back in July. I suggest you try to address the reason, not look for backdoors. – alexis Oct 06 '15 at 17:33
  • Who knows what encoding your file has? Probably utf-8. READ UP on the link I suggested, and google "python unicode". And switch to Python 3. Trust me on this: If you have *any* interaction with non-English texts, there's no reason to mess with Python 2's handling of them. – alexis Oct 06 '15 at 17:34
  • Thank you @alexis, I got it. Encoding is actually not a big issue in my project; all I need is a text file. The error happened because I copied and pasted from different sites, so there were some encoding issues in the text. Now I understand. Thanks a lot. – Salah Oct 06 '15 at 19:08
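
To make the encoding discussion above concrete: byte 0x92 is the curly apostrophe (’) in Windows-1252, and in "didn’t" it sits at index 4, which matches the error message. That suggests the file may be Windows-1252 rather than UTF-8, though this is only an inference from the traceback. Below is a minimal sketch, assuming that encoding; `read_text` is a hypothetical helper, not part of NLTK, and the temporary file merely stands in for `news/article.txt`.

```python
import io
import os
import tempfile


def read_text(path, encoding="cp1252"):
    """Hypothetical helper: read a file and return decoded (unicode) text.

    The default encoding is an assumption based on the 0x92 byte in the
    traceback; change it if the file turns out to be something else.
    """
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# "didn’t" as it would be stored in a Windows-1252 file: the curly
# apostrophe is the single byte 0x92, at index 4.
raw = b"didn\x92t"

# Decoding as ASCII fails, because 0x92 is outside the 0-127 range.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

# Decoding as Windows-1252 maps 0x92 to U+2019 (right single quotation mark).
decoded = raw.decode("cp1252")

# Round-trip through a temporary file to exercise read_text().
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(raw)
text_from_file = read_text(tmp_path)
os.remove(tmp_path)
```

With the text decoded this way, the result could be passed to `nltk.word_tokenize` and `nltk.pos_tag` in place of the raw bytes. `io.open` accepts the `encoding` argument on both Python 2.7 and Python 3, so the helper should not depend on which interpreter is used.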
