12

Is there any way to use the Stanford Tagger in a more performant fashion?

Each call to NLTK's wrapper starts a new Java instance for every analyzed string, which is very slow, especially when a larger foreign-language model is used.

http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford

Jabb

2 Answers

14

Found the solution. It is possible to run the POS Tagger in servlet mode and then connect to it via HTTP. Perfect.

http://nlp.stanford.edu/software/pos-tagger-faq.shtml#d

example

start server in background

nohup java -mx1000m -cp /var/stanford-postagger-full-2014-01-04/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -model /var/stanford-postagger-full-2014-01-04/models/german-dewac.tagger -port 2020 >& /dev/null &

adjust firewall to limit access to port 2020 from localhost only

iptables -A INPUT -p tcp -s localhost --dport 2020 -j ACCEPT
iptables -A INPUT -p tcp --dport 2020 -j DROP

test it with wget

wget "http://localhost:2020/?die welt ist schön"
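For calling the running server from Python instead of wget, a plain TCP socket may be enough. The following is a minimal sketch, not a tested client for any particular tagger release. It assumes that the server reads one line of text, writes the tagged text back and then closes the connection, that it speaks UTF-8, and that tokens and tags are joined by `_`. The function names `tag_via_server` and `parse_tagged` are made up for this example.

```python
import socket


def tag_via_server(text, host="localhost", port=2020, encoding="utf-8", timeout=10.0):
    """Send one line of text to a running tagger server and return its raw reply."""
    # Assumption: the server handles a single line per connection,
    # so embedded newlines are collapsed into spaces first.
    line = " ".join(text.split()) + "\n"
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode(encoding))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode(encoding).strip()


def parse_tagged(reply, separator="_"):
    """Split 'token_TAG token_TAG' style output into (token, tag) pairs."""
    pairs = []
    for item in reply.split():
        if separator in item:
            # Split on the last separator so tokens containing '_' survive.
            token, tag = item.rsplit(separator, 1)
        else:
            token, tag = item, None
        pairs.append((token, tag))
    return pairs
```

With the server from the step above still running, `parse_tagged(tag_via_server("die welt ist schön"))` would return a list of `(token, tag)` tuples. Because the model stays loaded in the server process, each call costs only a socket round trip rather than a JVM start.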

shut down the server

pkill -f stanford

restore the iptables settings

iptables -D INPUT -p tcp -s localhost --dport 2020 -j ACCEPT
iptables -D INPUT -p tcp --dport 2020 -j DROP
Jabb
  • 2
    can you add the python code you used to connect this/use this from NLTK? I am interested in this question but for now I am using the previous solution as it solves the problems if you queue sentences for processing. – Tommy Jul 19 '14 at 21:22
  • 1
    @Jabb hi, can you please help on the wget command, when I tried it, this is error that I got, please see the continuation comment `--2020-11-24 17:28:00-- http://localhost:2020/?die Resolving localhost (localhost)... 127.0.0.1 Connecting to localhost (localhost)|127.0.0.1|:2020... connected. HTTP request sent, awaiting response... 200 No headers, assuming HTTP/0.9 Length: unspecified Saving to: ‘index.html?die’ index.html?die [ <=>` – William Nov 24 '20 at 10:38
  • 1
    `‘index.html?die’ saved [47] http://welt/ Resolving welt (welt)... failed: Name or service not known. wget: unable to resolve host address ‘welt’ - http://ist/ Resolving ist (ist)... failed: No address associated with hostname. wget: unable to resolve host address ‘ist’ --2020-11-24 17:28:04-- http://xn--schn-7qa/ Resolving xn--schn-7qa (xn--schn-7qa)... failed: Name or service not known. wget: unable to resolve host address ‘xn--schn-7qa’ FINISHED --2020-11-24 17:28:06-- Total wall clock time: 5.6s Downloaded: 1 files, 47 in 0s (5.47 MB/s)` – William Nov 24 '20 at 10:39
7

Use nltk.tag.stanford.POSTagger.tag_sents() to tag multiple sentences in a single call.

tag_sents has replaced the old batch_tag function; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L61
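A rough sketch of how queued sentences could be passed to tag_sents in batches, so that the tagger is invoked once per batch rather than once per sentence. The helper names `build_tagger` and `tag_queue` and the default batch size are invented for this example, and the model and jar paths are placeholders. It also assumes an NLTK release where the class is exposed as `StanfordPOSTagger`; older 3.0.x releases call it `POSTagger`, as in the text above.

```python
def build_tagger(model_path, jar_path):
    """Create the NLTK wrapper; both paths are placeholders you must supply."""
    # Imported lazily so tag_queue below can be used without NLTK installed.
    # Assumption: your NLTK version exposes StanfordPOSTagger in this module.
    from nltk.tag.stanford import StanfordPOSTagger
    return StanfordPOSTagger(model_path, path_to_jar=jar_path, encoding="utf8")


def tag_queue(tagger, sentences, batch_size=500):
    """Tag tokenized sentences in batches, with one tag_sents call per batch.

    `sentences` is a list of token lists, e.g. [["die", "welt"], ["ist", "schön"]].
    """
    tagged = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # Each tag_sents call runs the tagger once for the whole batch.
        tagged.extend(tagger.tag_sents(batch))
    return tagged
```

Usage would look like `tag_queue(build_tagger(model, jar), queue)`, where `model` and `jar` point at your local tagger files. Batching keeps individual calls bounded in size while still avoiding a tagger start for every sentence.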


DEPRECATED:

Tag the sentences using batch_tag instead of tag, see http://www.nltk.org/_modules/nltk/tag/stanford.html#StanfordTagger.batch_tag

alvas
  • This works really well, simply by queueing sentences for processing and then using `taggedlist = batch_tag([sent for sent in queue])`. – Tommy Jul 19 '14 at 21:23
  • it's because `batch_tag` loads the tagger and the model only once. using `tag` will reload the tagger and model each time you tag a sentence ;) – alvas Jul 20 '14 at 06:26
  • Yeah its awesome, i would upvote three times if I could! – Tommy Jul 20 '14 at 15:12
  • 1
    Did this go away very recently? I followed the link, and could not find the batch_tag method there. – scharfmn Feb 10 '15 at 10:23
  • 1
    @bahmait, i've updated the function name for the latest version. The `batch_tag` has been refactored to `tag_sents` – alvas Feb 10 '15 at 13:07
  • I am not sure why tagging multiple sentences using batch_tag is about 2 times slower than tagging the sentences one by one for me. – Brana Nov 23 '15 at 02:49
  • update your nltk with `pip install -U nltk` and then use `tag_sents` instead of `batch_tag` – alvas Nov 23 '15 at 08:14
  • Are you sure batch_tag works slower than tag_sents [nltk 2.0 vs 3.0]? Because it was pretty difficult to make the Stanford NLP tagger work on nltk 2, and it could be even more difficult to make it work on nltk 3. I even had to edit some lines in the source code to make NLTK 2.0 work, so I would not like to install 3.0 unless I have to. – Brana Nov 23 '15 at 23:50
  • 1
    upgrade to NLTK v3.1, the devs have solved quite a lot of bugs since 3.0 and of course a lot more since 2.0. If not, go to the corenlp code from stanford. – alvas Nov 24 '15 at 00:34
  • I am going to try to do this. I couldn't install corenlp on windows; I spent a few days trying. – Brana Nov 24 '15 at 00:52
  • 1
    Calm down. First update NLTK to v3.1, then take a look at http://moin.delph-in.net/ZhongPreprocessing and see whether it works. If that fails, write down all the steps you've taken to do the NLTK update and add more details to your question. Then we can try and see whether we can help you fix the issue. – alvas Nov 24 '15 at 01:02
  • StanfordPos tagger 3.1 works for me - they did fix the bug that was present in 2.0.3. I was talking about stanford nlpcore which I tried to install few years ago. – Brana Nov 24 '15 at 01:37
  • I measured the speed - it takes around 45 sec to tag 1600 sentences using this method while using just tag it would take around 3000 sec to do the same job. Therefore I just got 60-70x of improvement in speed. So, thanks very much for the suggestion. – Brana Nov 24 '15 at 01:54