I want to perform part-of-speech tagging and entity recognition in Python, similar to the Maxent_POS_Tag_Annotator and Maxent_Entity_Annotator functions of openNLP in R. I would prefer Python code that takes a sentence as input and outputs features such as the number of "CC", the number of "CD", the number of "DT", etc. CC, CD, and DT are POS tags from the Penn Treebank, so there should be 36 columns/features, one for each of the 36 Penn Treebank POS tags. I want to implement this in the Azure ML "Execute Python Script" module, and Azure ML supports Python 2.7.7. I heard that NLTK in Python may do the job, but I am a beginner in Python. Any help would be appreciated.

Ming Xu - MSFT
ankur

1 Answer

Take a look at the NLTK book, in particular the "Categorizing and Tagging Words" chapter.

Here is a simple example; pos_tag uses the Penn Treebank tagset:

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad.")) 

[('John', 'NNP'),
 ("'s", 'POS'),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'VBZ'),
 ("n't", 'RB'),
 ('all', 'DT'),
 ('that', 'DT'),
 ('bad', 'JJ'),
 ('.', '.')]

Then you can use

from collections import defaultdict
counts = defaultdict(int)
for (word, tag) in pos_tag(word_tokenize("John's big idea isn't all that bad.")):
    counts[tag] += 1

to get frequencies:

defaultdict(<type 'int'>, {'JJ': 2, 'NN': 1, 'POS': 1, '.': 1, 'RB': 1, 'VBZ': 1, 'DT': 2, 'NNP': 1})
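The question asks for a fixed set of 36 columns rather than a dictionary that only holds the tags seen in one sentence. A minimal sketch of that step follows, assuming the tagged pairs come from pos_tag(word_tokenize(sentence)) as above; the names PENN_TAGS and tag_features are made up for illustration and are not part of NLTK. Tags outside the 36 word-level Penn Treebank tags (for example the '.' punctuation tag) are simply ignored.

```python
from collections import Counter

# The 36 word-level Penn Treebank POS tags (punctuation tags are left out).
# PENN_TAGS is a hypothetical name chosen for this sketch.
PENN_TAGS = """CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT POS
PRP PRP$ RB RBR RBS RP SYM TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB""".split()


def tag_features(tagged_tokens):
    """Return a list of 36 counts, one per tag, in PENN_TAGS order."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    # Counter returns 0 for tags that never occurred in the sentence.
    return [counts[tag] for tag in PENN_TAGS]


# Tagged pairs copied from the example output above; in practice this would be
# tagged = pos_tag(word_tokenize(sentence)).
tagged = [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'),
          ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'),
          ('bad', 'JJ'), ('.', '.')]

features = tag_features(tagged)
row = dict(zip(PENN_TAGS, features))  # column name -> count for one sentence
```

Each sentence then yields one row of 36 numbers in a stable column order, which is the shape a table of features needs; only Counter from the standard library is used, and it is available in Python 2.7.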
hellpanderr
  • Thanks @hellpanderr. Can you please also explain the steps to import nltk in Python? I am new to Python. Windows 7, 64-bit. – ankur Sep 07 '15 at 06:17
  • @ankur The steps to install nltk for Python: 1. open a cmd window; 2. 'cd' into the path where Python is installed; 3. run 'Scripts/pip.exe install nltk' – Peter Pan Sep 07 '15 at 09:11
  • @PeterPan-MSFT I am using Python 2.7.7 and pip is not installed. It shows the error "Scripts is not recognized as internal or external command". – ankur Sep 07 '15 at 09:22
  • @PeterPan-MSFT One more related question: if I want to use just pos_tag and word_tokenize, which id should I pass as nltk.download(info_or_id=' ')? – ankur Sep 07 '15 at 09:24
  • @ankur Download pip-7.1.2.tar.gz from the page https://pypi.python.org/pypi/pip, decompress it, and run 'python setup.py install' to install pip. – Peter Pan Sep 07 '15 at 09:28
  • @PeterPan-MSFT ImportError: No module named setuptools – ankur Sep 07 '15 at 09:38
  • @ankur Please create a new thread for your further questions about NLTK; comments are not meant for answering detailed programming questions. Thanks. – Peter Pan Sep 07 '15 at 09:38