
I am trying to tag an HTML page full of space-separated numbers like "5320412185 5320412184 5320412189..." to observe how the tagger behaves with numbers. I'm passing english-left3words-distsim.tagger to the constructor. On the console I see that most of the numbers are tagged as CD, but some are tagged as NN. I searched the FAQ page on nlp.stanford.edu but couldn't find anything about this. Can anyone help me understand why this happens?

I don't know whether this is relevant, but I'm feeding each number to the tagger separately, after splitting the huge input (1,045,000 numbers!) on the space delimiter.
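A minimal sketch for narrowing down which numerals end up outside CD, assuming the tagger output has already been collected as whitespace-separated `token_TAG` items (the underscore separator is an assumption about the output format, for example what `MaxentTagger.tagString` returns by default; adjust it if your output differs). The class and method names (`TagAudit`, `countTags`, `digitTokensNotTagged`) are hypothetical helpers, not part of the Stanford tagger, and the sample string in `main` is made up for illustration rather than real tagger output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TagAudit {

    // Splits one "token_TAG" item at the LAST underscore, so tokens that
    // themselves contain underscores are still handled. Returns {token, tag}.
    public static String[] splitTagged(String item) {
        int i = item.lastIndexOf('_');
        if (i < 0) {
            return new String[] {item, ""}; // no tag present
        }
        return new String[] {item.substring(0, i), item.substring(i + 1)};
    }

    // Counts how often each tag occurs in whitespace-separated tagged text.
    public static Map<String, Integer> countTags(String taggedText) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String item : taggedText.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank input yields a single empty item
            }
            counts.merge(splitTagged(item)[1], 1, Integer::sum);
        }
        return counts;
    }

    // Returns the tagged items whose token is all digits but whose tag
    // differs from expectedTag (for example "CD").
    public static List<String> digitTokensNotTagged(String taggedText, String expectedTag) {
        List<String> unexpected = new ArrayList<>();
        for (String item : taggedText.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue;
            }
            String[] parts = splitTagged(item);
            if (parts[0].matches("\\d+") && !parts[1].equals(expectedTag)) {
                unexpected.add(item);
            }
        }
        return unexpected;
    }

    public static void main(String[] args) {
        // Hypothetical sample in token_TAG form, NOT real tagger output.
        String sample = "5320412185_CD 5320412184_NN 5320412189_CD";
        System.out.println(countTags(sample));
        System.out.println(digitTokensNotTagged(sample, "CD"));
    }
}
```

Listing the digit-only tokens that did not receive CD makes it easier to check whether the odd tags cluster around particular numbers or appear regardless of the token's value.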

A.R.K.S
  • Hi, my first answer was not correct; I realized that I hadn't fully understood your problem. I have now revised my answer, and it should solve your problem. Have you checked it yet? – Ferit Oct 17 '15 at 11:27

1 Answer


From Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision):

Sometimes, it is unclear whether one is cardinal number or a noun. In general, it should be tagged as a cardinal number (CD) even when its sense is not clearly that of a numeral.

EXAMPLE: one/CD of the best reasons

But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).

EXAMPLE: the only (good) one/NN of its kind
         (cf. the only (good) ones/NNS of their kind)

In the collocation "another one", "one" should also be tagged as a common noun (NN).

Hyphenated fractions one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths should be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be replaced by double or twice.

For further reading: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports

Ferit
    Thanks for your response @Saibot. What you explained seems to apply to numbers written out as English words, but my input page has actual numerals, some of which are being tagged as "NNS" and "NN". I'll go through this PDF. Thank you very much for this file! – A.R.K.S Oct 18 '15 at 08:04