
Hi text mining champions,

I'm using Anaconda with NLTK v3.2 on Windows 10 (the client's environment).

When I try to POS tag, I keep getting a urllib2 error:

URLError: <urlopen error unknown url type: c>

It seems urllib2 is unable to recognize Windows paths. How can I work around this?

The command is as simple as:

nltk.pos_tag(nltk.word_tokenize("Hello World"))

edit: There is a duplicate question, however I think the answers obtained here by manan and alvas are a better fix.

alvas
Max
  • Possible duplicate of [Python NLTK pos\_tag throws URLError](http://stackoverflow.com/questions/35827859/python-nltk-pos-tag-throws-urlerror) – alvas Mar 13 '16 at 15:26
  • looks like yeah. I read that post prior. – Max Mar 14 '16 at 03:11

3 Answers


EDITED

This issue was resolved in NLTK v3.2.1. Upgrading NLTK resolves it, e.g. `pip install -U nltk`.
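To check from a script whether an installed version is at or above the fixed release, something like the following sketch could be used. The helper name `at_least` is made up for illustration, and it only looks at the leading numeric part of a version string such as the one in `nltk.__version__`.

```python
import re

def at_least(version, minimum=(3, 2, 1)):
    """Return True if a dotted version string is >= `minimum`.

    Hypothetical helper: only the leading numeric part is compared,
    so a suffix such as 'rc1' is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("not a dotted version string: %r" % (version,))
    parts = tuple(int(p) for p in match.group(0).split("."))
    return parts >= tuple(minimum)
```

With NLTK installed, `at_least(nltk.__version__)` returning False would suggest that an upgrade is still needed.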


I faced the same issue, and the error I got was as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\lib\site-packages\nltk-3.2-py2.7.egg\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 391, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 414, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1206, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>

The URLError you mentioned is due to a bug in the perceptron.py file of the NLTK library that shows up on Windows. On my machine, the file is at this location:

C:\Python27\Lib\site-packages\nltk-3.2-py2.7.egg\nltk\tag\perceptron.py

(Look in the equivalent location on your machine, wherever your Python27 folder is.)

The bug was in the code that locates the averaged_perceptron_tagger model on your machine: the location was passed on as a plain Windows path and then opened as a URL. See lines 801 and 924 of data.py in the traceback above.
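A quick way to see why the error message says "unknown url type: c" is to look at how Python's URL parser splits a Windows path. The sketch below uses a made-up path purely for illustration; the real location depends on where your nltk_data lives.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical model location, for illustration only.
windows_path = r"C:\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle"

# Everything before the first ':' is read as the URL scheme,
# so the drive letter becomes the (unknown) scheme 'c'.
bare_scheme = urlparse(windows_path).scheme

# With an explicit 'file:' prefix the scheme is 'file'.
prefixed_scheme = urlparse("file:" + windows_path).scheme

print(bare_scheme, prefixed_scheme)
```

This matches the traceback: urlopen has no handler for a scheme called "c", so it falls through to unknown_open.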

I think the NLTK developers recently fixed this bug. Have a look at this commit, made a few days ago:

https://github.com/nltk/nltk/commit/d3de14e58215beebdccc7b76c044109f6197d1d9#diff-26b258372e0d13c2543de8dbb1841252

The snippet where the change was made is as follows:

        self.tagdict = {}
        self.classes = set()
        if load:
            AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
            self.load(AP_MODEL_LOC)
            # Initially it was: AP_MODEL_LOC = str(find('taggers/averaged_perceptron_tagger/'+PICKLE))

    def tag(self, tokens):

Updating the file to the most recent commit worked for me, and I was then able to use the nltk.pos_tag command. I believe this will resolve your problem as well (assuming you have everything else set up).

alvas
MananVyas
  • Works like a dream. Thanks @MananVyas – Max Mar 11 '16 at 03:17
  • FWIW I had the same error on Win10 with Python 3.4 (64-bit), with nltk installed via pip and up to date as of April 2nd. Finding the perceptron.py file and making the change in the snippet above worked, after a restart for good measure. Wish I had seen this post 4 hours ago though, because I thought it was my tokens that were the problem – mobcdi Apr 02 '16 at 22:38
  • Sorry for adding the edit to your answer, this is to avoid cross-platform communication and NLTK users starting new issues on the github repo on this resolved issue. – alvas Apr 20 '16 at 04:02

EDITED

This issue was resolved in NLTK v3.2.1. Please upgrade your NLTK!


First read @MananVyas answer for the why:

https://stackoverflow.com/a/35902494/610569


Here's the how: if you want to stay on NLTK 3.2 without downgrading to NLTK v3.1, you can use this "hack":

>>> from nltk.tag import PerceptronTagger
>>> from nltk.data import find
>>> PICKLE = "averaged_perceptron_tagger.pickle"
>>> AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
>>> tagger = PerceptronTagger(load=False)
>>> tagger.load(AP_MODEL_LOC)
>>> pos_tag = tagger.tag
>>> pos_tag('The quick brown fox jumps over the lazy dog'.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
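
If you need this in more than one place, the same hack could be wrapped in a small function. This is only a sketch for NLTK 3.2: `to_file_url` and `load_perceptron_tagger` are names made up here, not part of NLTK, and the `PerceptronTagger(load=False)` / `tagger.load(...)` calls are the ones from the session above.

```python
def to_file_url(path):
    """Prefix a filesystem path with 'file:' unless it already has it.

    Hypothetical helper, not part of NLTK.
    """
    path = str(path)
    return path if path.startswith("file:") else "file:" + path


def load_perceptron_tagger():
    """Build a PerceptronTagger the way the hack above does (NLTK 3.2)."""
    # Imported here so that to_file_url can be used without NLTK installed.
    from nltk.data import find
    from nltk.tag import PerceptronTagger

    pickle_name = "averaged_perceptron_tagger.pickle"
    tagger = PerceptronTagger(load=False)
    tagger.load(to_file_url(find("taggers/averaged_perceptron_tagger/" + pickle_name)))
    return tagger
```

Then `load_perceptron_tagger().tag(tokens)` can stand in for `nltk.pos_tag(tokens)` until NLTK is upgraded.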
Community
alvas
  • I ran the above code and it worked fine but when I try to run my nltk routine it still gives <> the code I am using is in http://stackoverflow.com/questions/36255291/extract-city-names-from-text-using-python/36255377?noredirect=1#comment60196241_36255377 I also ran Sarim Hussain's suggestion successfully but no luck. – GeorgeC Mar 30 '16 at 00:11
  • try upgrading your nltk, `pip install -U nltk` – alvas Mar 30 '16 at 07:13
  • just tried that. Still same error. On pip command I get << writing dependency_links to nltk.egg-info\dependency_links.txt warning: manifest_maker: standard file '-c' not found reading manifest template 'MANIFEST.in' warning: no files found matching 'Makefile' under directory '*.txt' warning: no previously-included files matching '*~' found anywhere in distribution writing manifest file 'nltk.egg-info\SOURCES.txt' Successfully installed nltk-3.2>> – GeorgeC Mar 31 '16 at 13:48
  • Which OS are you using? What is your Python version? How did you install Python? How did you install NLTK? Did you install through `pip` or `conda`? Where are you running Python? From the command prompt, terminal or in some IDE? Are you running it through a server or a cloud? Are you running it on your laptop/computer? Or in some school's lab where there might be a firewall? Where are you running the Python script? Do you have any other file named `nltk.py` in your directory? – alvas Mar 31 '16 at 14:23
  • After upgrading to NLTK 3.2 did you use the `AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))` hack? – alvas Mar 31 '16 at 14:23
  • Sorry for the multiple questions, your short comment isn't enough to help us debug the problem. Please answer each of the questions in the previous 2 comments and we'll try to find a solution afterwards. Actually, it'll also be easier if you ask another question and include the answers to those questions there; it looks like it's another problem. – alvas Mar 31 '16 at 14:24
  • Thanks -I have created a new question for this at http://stackoverflow.com/questions/36349755/nltk-routine-gives-raise-urlerrorunknown-url-type-s-type-in-python – GeorgeC Apr 01 '16 at 06:48

I faced the same issue a while back. Solution:

nltk.download('averaged_perceptron_tagger')
Undo