I need to perform text pre-processing tasks such as sentence splitting, tokenization and tagging using NLTK. I want to use the GENIA tagger for tagging. I am using Anaconda version 3.10 and installed geniatagger with the following command.

python setup.py install

In the IPython console, I entered the following code.

import geniatagger
tagger =geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')

The following error message appears when I press Enter.

---------------------------------------------------------------------------
WindowsError                              Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
  2 print tagger.parse('Welcome to natural language processing!')
  3 

 C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
 19         self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
 20                                         cwd=self._dir_to_tagger,
 ---> 21                                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
 22 
 23     def parse(self, text):

 C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708                                 p2cread, p2cwrite,
709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
711         except Exception:
712             # Preserve original exception in case os.close raises.

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956                                          env,
957                                          cwd,
--> 958                                          startupinfo)
959             except pywintypes.error, e:
960                 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] The system cannot find the file specified

Why do I get this error message? How can I fix this?

If I use this tagging straight away, will it perform the tokenization part as well?

Note: geniatagger python file is inside the 'geniatagger' folder.

Cœur
Dakshila Kamalsooriya

1 Answer

TL;DR:

# Install Genia Tagger (C code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.

>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]

I'm not sure whether the packages for the Genia tagger work out of the box from conda, so I think a native python/pip fix is simpler.

Firstly, there's no support for Genia Tagger in NLTK (At least not yet =) ), so it isn't a problem with the NLTK installation/modules.

The problem might lie in some outdated imports that the original GeniaTagger C code uses (http://www.nactem.ac.uk/tsujii/GENIA/tagger/).

So to resolve the problem, you have to add #include <cstdlib> to the original code, but thankfully @saffsd has already done so and put it nicely in his GitHub repo (https://github.com/saffsd/geniatagger/blob/master/morph.cpp).

Then comes installing the Python wrapper. You can either:

  • install from the official pypi with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz

  • or use some other github repo to install, e.g. https://github.com/informationsea/geniatagger-python that appears first from google search

Lastly, the GeniaTagger initialization in python is rather weird because it doesn't really take the path to the directory of the tagger but the tagger itself and assumes that the model files are in the same directory as the tagger, see https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .

It possibly also expects a leading './' at the start of the path, so you would have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger').
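Since the wrapper launches whatever file the path points to (see the Popen call in the traceback), it can help to confirm that the binary really exists before constructing the tagger. The following is a minimal sketch, not part of geniatagger-python: the helper name resolve_tagger_path is hypothetical, and it is written for Python 3, whereas the session above is Python 2.

```python
import os


def resolve_tagger_path(path_to_tagger):
    """Return an absolute path to the tagger binary, or raise if it is missing.

    Hypothetical helper for illustration; it only checks that the file exists,
    not that the model files sit next to it.
    """
    full_path = os.path.abspath(path_to_tagger)
    if not os.path.isfile(full_path):
        raise FileNotFoundError("GENIA tagger binary not found: " + full_path)
    return full_path


# Intended use (assumes the wrapper accepts an absolute path to the binary):
# from geniatagger import GeniaTagger
# tagger = GeniaTagger(resolve_tagger_path('./geniatagger/geniatagger'))
```

If this raises, the path is pointing at the wrong place (for example at the folder rather than the compiled binary), which would be consistent with the "cannot find the file specified" error in the question.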


Beyond the installation issues: if you use the Python wrapper for the GeniaTagger, the GeniaTagger object has only one method, parse(). Its input is one sentence string, and its output is a list of tuples, one per token. Judging by the sample output above, the items in each tuple are the word, its base form, its part-of-speech tag, its chunk tag and its named-entity tag.
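To make the five positions in each tuple easier to work with, they can be given names. This is a sketch in Python 3 that reuses the sample output from the session above rather than calling the tagger. The field names are my own labels, assumed from the GENIA tagger's documented output columns; the wrapper itself returns plain tuples.

```python
from collections import namedtuple

# Field names are assumptions chosen for illustration.
Token = namedtuple("Token", ["word", "base_form", "pos", "chunk", "named_entity"])

# Output of tagger.parse('This is a pen.') copied from the session above.
parsed = [
    ('This', 'This', 'DT', 'B-NP', 'O'),
    ('is', 'be', 'VBZ', 'B-VP', 'O'),
    ('a', 'a', 'DT', 'B-NP', 'O'),
    ('pen', 'pen', 'NN', 'I-NP', 'O'),
    ('.', '.', '.', 'O', 'O'),
]

# Wrap each plain tuple so fields can be read by name instead of by index.
tokens = [Token(*item) for item in parsed]

for token in tokens:
    print(token.word, token.pos)
```

Note that the final period comes back as its own item, which suggests that parse() also tokenizes the sentence, so no separate tokenization step appears to be needed for a single sentence.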

Community
alvas
  • Hi alvas, @saffsd: the [Nactem webpage](http://www.nactem.ac.uk/GENIA/tagger/) mentions version 3.0.2 of the GENIA tagger (uploaded on February 9, 2016), whereas based on the commit comments in [saffsd's github repository](https://github.com/saffsd/geniatagger) it seems to be version 3.0.1. Do you have any idea what has been updated? – Kaushik Acharya Jun 05 '18 at 13:40
  • Upon executing x = tagger.parse('This is a pen.'), I get an error: TypeError: a bytes-like object is required, not 'str' – PinkBanter Sep 02 '19 at 12:49