
NOTE: Before I begin, this F-measure is not related to precision and recall; its name and definition are taken from this paper.

I have a feature known as the F-measure, which is used to measure the formality of a given text. It is mostly used in gender classification of text, which is what I am working on as a project.

The F-measure is defined as:

F = 0.5 * (noun freq. + adjective freq. + preposition freq. + article freq. – pronoun freq. – verb freq. – adverb freq. – interjection freq. + 100)

where the frequencies are taken from a given text (for example, a blog post).
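As a rough illustration, the frequencies and F can be computed with something like the following Python/NLTK sketch. The mapping from Penn Treebank tags to the paper's word categories is my own approximation (for example, IN subsumes prepositions and DT subsumes articles), so treat it as a sketch rather than a definitive implementation:

```python
import nltk
from nltk import pos_tag, word_tokenize

# One-time downloads needed by the tokenizer and tagger:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

# Approximate mapping from Penn Treebank tags to the word categories in the
# formula. IN subsumes prepositions (and subordinating conjunctions), and
# DT subsumes articles (and other determiners).
CATEGORY_TAGS = {
    "noun":         {"NN", "NNS", "NNP", "NNPS"},
    "adjective":    {"JJ", "JJR", "JJS"},
    "preposition":  {"IN"},
    "article":      {"DT"},
    "pronoun":      {"PRP", "PRP$", "WP", "WP$"},
    "verb":         {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adverb":       {"RB", "RBR", "RBS"},
    "interjection": {"UH"},
}

def f_measure(text):
    """Formality F-measure; frequencies are percentages of all words."""
    tags = [tag for _, tag in pos_tag(word_tokenize(text))]
    total = len(tags) or 1  # avoid division by zero on empty input
    freq = {cat: 100.0 * sum(tag in tagset for tag in tags) / total
            for cat, tagset in CATEGORY_TAGS.items()}
    return 0.5 * (freq["noun"] + freq["adjective"] + freq["preposition"]
                  + freq["article"] - freq["pronoun"] - freq["verb"]
                  - freq["adverb"] - freq["interjection"] + 100)
```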

I would like to normalize this feature for use in a classification task. My first thought was that since the value of F is bounded by the number of words in the given text (text_length), I could divide F by text_length. Secondly, since this measure can take on both positive and negative values (as can be inferred from the equation), I then thought of squaring (F/text_length) to obtain only positive values.

Trying this, I found that the normalised values did not seem right: I got very small values (below 0.10) for every case I tested the feature on. I suspect the reason is that I am squaring the value, which makes it even smaller since it is the square of a fraction; however, the squaring is required if I want to guarantee only positive values. I am not sure what else to consider to improve the normalisation so that a nice distribution within [0, 1] is produced, and I would like to know whether there is some strategy for correctly normalising NLP features.

How should I approach the normalisation of my feature, and what might I be doing wrong?

mesllo
  • Side note, but how did you find the preposition/interjection/article frequencies for the text you were working with? It seems like in NLTK and Stanford CoreNLP, there are no POS tags for those (there are only tags that subsume them). – VMS Jul 13 '17 at 06:13
  • If I understood you correctly, such counts can be obtained with NLP tools that tag and count these word categories for you automatically. – mesllo Dec 01 '17 at 11:29

1 Answer


If you carefully read the article, you'll find that the measure is already normalized:

F will then vary between 0 and 100%

The reason for this is that "frequencies" in the formula are calculated as follows:

The frequencies are here expressed as percentages of the number of words belonging to a particular category with respect to the total number of words in the excerpt.

That is, you should normalize them by the total number of words (just as you suggested), but afterwards don't forget to multiply each one by 100.
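In other words, once the frequencies are percentages, F already lies in [0, 100], so producing a [0, 1] feature is a single division. A minimal sketch, where `f_measure` and `blog_post_text` are illustrative names for whatever function and input you use to compute the formula from the question:

```python
# f_measure(text) is assumed to return the score in [0, 100], with each
# frequency computed as a percentage of the total word count.
f_normalized = f_measure(blog_post_text) / 100.0  # now in [0, 1]
```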

Vsevolod Dyomkin
  • I did not understand why I should multiply each frequency by 100; since it is already a value within the range 0–100, I simply divided by 100 to transform it into 0–1. Am I correct? – mesllo Feb 18 '15 at 12:35
  • 1
    @jablesauce yes, if you already have each frequency as a value between 0 and 100 and the sum of all frequencies is also below 100, you have the correct formula. So, just dividing by 100 will scale your feature into the 0-1 range – Vsevolod Dyomkin Feb 18 '15 at 15:38