I'm doing some work that involves document comparison. To do this, I'm analyzing each document and basically counting the number of times some key words appear in each of these documents. For instance:
Document 1:      Document 2:
Book   ->  3     Book   ->  9
Work   ->  0     Work   ->  2
Dollar ->  5     Dollar ->  1
City   -> 18     City   ->  6
So after the counting process, I store this sequence of numbers in a vector. This sequence of numbers represents the feature vector for each document.
Document 1: [ 3, 0, 5, 18]
Document 2: [ 9, 2, 1, 6]
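For reference, this is roughly how I build those vectors (a minimal sketch; the keyword list and function name are just illustrative, not from any library):

```python
KEYWORDS = ["book", "work", "dollar", "city"]

def feature_vector(text, keywords=KEYWORDS):
    """Count how many times each keyword appears in the document text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in keywords]

# e.g. feature_vector(document_1_text) -> [3, 0, 5, 18]
```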
The final step would be to normalize the data to the range [0, 1]. But here is where I realized this could be done following two different approaches:
- Dividing each count by the total number of occurrences in the document
- Dividing each count by the maximum count in the document
Following the first approach, the result of the normalization would be:
Document 1: [ 0.11538, 0.00000, 0.19231, 0.69231] (divided by 26)
Document 2: [ 0.50000, 0.11111, 0.05556, 0.33333] (divided by 18)
While following the second approach, the result would be:
Document 1: [ 0.16667, 0.00000, 0.27778, 1.00000] (divided by 18)
Document 2: [ 1.00000, 0.22222, 0.11111, 0.66667] (divided by 9)
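In code, the two approaches look like this (a minimal sketch in plain Python, function names are my own):

```python
def normalize_by_total(counts):
    """First approach: divide each count by the sum of all counts."""
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def normalize_by_max(counts):
    """Second approach: divide each count by the largest count."""
    peak = max(counts)
    return [c / peak for c in counts] if peak else counts

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

print(normalize_by_total(doc1))  # [0.11538..., 0.0, 0.19230..., 0.69230...]
print(normalize_by_max(doc2))    # [1.0, 0.22222..., 0.11111..., 0.66666...]
```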
For this specific case:
- Which of these two approaches better enhances the representation and comparison of the feature vectors?
- Are the results going to be the same?
- Will either of these approaches work better with a specific similarity measure (Euclidean, cosine)?
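For context, this is how I plan to compare the normalized vectors (a plain-Python sketch; the function names are my own and I could equally use an existing library):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```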