
I just started my first NLTK project and am confused about the proper setup. I need several resources, such as the Punkt tokenizer and the maxent POS tagger. I downloaded them myself using the nltk.download() GUI. For my collaborators, of course, I want these things to be downloaded automatically. I haven't found any idiomatic code for that in the documentation.

Am I supposed to just put nltk.data.load('tokenizers/punkt/english.pickle') and the like into the code? Is this going to download the resources every time the script is run? Should I give the user (i.e. my co-developers) feedback about what is being downloaded and why it is taking so long? There MUST be something out there that does the job, right? :)

Edit: To clarify my question:
How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not?

alvas
Zakum
  • I'm having trouble determining what you're asking. A concise, testable code example demonstrating your current approach would be very helpful. –  May 16 '14 at 21:43
  • Let me reframe the question: How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not? – Zakum May 16 '14 at 22:54
  • Edit your question to match your comment. Putting the short question in the comments may let it get overlooked – Spaceghost May 17 '14 at 12:30

2 Answers


You can use the nltk.data.find() function, see https://github.com/nltk/nltk/blob/develop/nltk/data.py:

>>> import nltk
>>> nltk.data.find('tokenizers/punkt.zip')
ZipFilePathPointer(u'/home/alvas/nltk_data/tokenizers/punkt.zip', u'')

When the resource is not available, nltk.data.find() raises a LookupError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-3.0a3-py2.7.egg/nltk/data.py", line 615, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'punkt.zip' not found.  Please use the NLTK Downloader
  to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

To make sure that your collaborators have the package, you will most likely want to do something like this:

>>> try:
...     nltk.data.find('tokenizers/punkt')
... except LookupError:
...     nltk.download('punkt')
... 
[nltk_data] Downloading package punkt to /home/alvas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
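The same check can be wrapped in a small helper that covers several resources at once. This is only a sketch: `ensure_resources` is a hypothetical name, not part of NLTK, and the `find` and `download` parameters are an assumption added here so the logic can be tried without network access. By default they fall back to `nltk.data.find` and `nltk.download`. Resource paths and package ids differ between NLTK versions, so the mapping you pass in has to match your installation.

```python
def ensure_resources(resources, find=None, download=None):
    """Download every NLTK package whose resource path cannot be found.

    `resources` maps a resource path (as accepted by nltk.data.find) to the
    package id expected by nltk.download. Returns the list of package ids
    that had to be downloaded.

    `find` and `download` are hypothetical hooks for this sketch; when they
    are omitted, the real NLTK functions are used.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper itself has no hard dependency
        find = find if find is not None else nltk.data.find
        download = download if download is not None else nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            # nltk.download returns True on success, as in the session above
            if not download(package):
                raise RuntimeError(f"could not download NLTK package {package!r}")
            fetched.append(package)
    return fetched


# Typical call at program start-up (uses the real NLTK functions):
# ensure_resources({"tokenizers/punkt": "punkt"})
```

Because the lookup runs first, nothing is downloaded on later runs once the data is in place.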
ChrisG
alvas
  • There is a trap to this approach, which is that you can't reliably use it to install the data in a non-interactive application. Python will import nltk _without_ the downloaded resource. If you discover this fact with a LookupError and then try to run `nltk.download` and then re-import the relevant nltk module, Python will believe nltk was already imported and not re-import anything. So even though you'll have downloaded the new data artifact, the imported version of NLTK will still be the one that was booted up without access to it. – ely Jul 25 '19 at 15:11
  • For example, you often need `from nltk import wordnet` but this submodule of nltk only exists if wordnet was downloaded prior to when nltk was imported. If you `try` .. `except` this import and check for `LookupError` and then dynamically run `nltk.download('wordnet')`, it will indeed install the data for wordnet, but re-running `from nltk import wordnet` will still fail (the `nltk` module being referenced will still be the one that booted up with no `wordnet` submodule in it). – ely Jul 25 '19 at 15:13
  • @ely what's the remedy then? – vpap Dec 14 '21 at 19:06
  • PEP8 recommends placing all imports at the top of the file. In this case, though, it seems the only way to avoid the trap is to run "import nltk" first, then the try-except clause, and only then the specific import such as "from nltk import ...". This still feels like a bit of a workaround. – Fab Feb 17 '22 at 09:46
  • Or maybe "from nltk import data, download" first, and then "import nltk" after the try-except? – Fab Feb 17 '22 at 09:53
  • @Fab Can you share some code for the workaround? I don't quite understand it. – Somnath Rakshit May 31 '22 at 00:35
  • @SomnathRakshit, I'm posting an example below – Fab Jul 29 '22 at 12:13

Following Somnath's comment, I am posting an example of the try-except workaround. Here we look for the comtrans corpus, which is not included in the NLTK data by default.

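from nltk.corpus import comtrans
from nltk import download

try:
    # First access to the corpus; raises LookupError if the data is missing
    words = comtrans.words('alignment-en-fr.txt')
except LookupError:
    print('resource not found. Downloading now...')
    download('comtrans')
    # Retry now that the corpus data has been downloaded
    words = comtrans.words('alignment-en-fr.txt')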
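If several corpora need this treatment, the try-except pattern above could be factored into a helper, roughly as sketched here. `with_download_retry` is a hypothetical name, not an NLTK function, and the `download` parameter is an assumption of this sketch that defaults to `nltk.download`.

```python
def with_download_retry(action, package, download=None):
    """Run `action`; on LookupError, download `package` and run it once more.

    `download` is a hypothetical hook for this sketch and defaults to
    nltk.download when omitted.
    """
    if download is None:
        from nltk import download  # real downloader, imported only when needed
    try:
        return action()
    except LookupError:
        print(f"resource {package!r} not found. Downloading now...")
        download(package)
        # Second and last attempt; a LookupError here propagates to the caller
        return action()


# Equivalent to the example above:
# from nltk.corpus import comtrans
# words = with_download_retry(
#     lambda: comtrans.words('alignment-en-fr.txt'), 'comtrans')
```

The action is passed as a callable so that the corpus is only touched inside the try block, which is where the LookupError is raised.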
Fab