I'm doing some work that involves document comparison. To do this, I'm analyzing each document and basically counting the number of times some key words appear in each of these documents. For instance:
Document 1:      Document 2:
Book   ->  3     Book   ->  9
Work   ->  0     Work   ->  2
Dollar ->  5     Dollar ->  1
City   -> 18     City   ->  6
So after the counting process, I store this sequence of numbers in a vector. This sequence of numbers represents the feature vector for each document.
Document 1: [ 3, 0, 5, 18]
Document 2: [ 9, 2, 1, 6]
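For reference, this is roughly how I build those vectors (a minimal sketch; the keyword list and function name are just illustrative, not from any library):

```python
KEYWORDS = ["book", "work", "dollar", "city"]

def feature_vector(text, keywords=KEYWORDS):
    """Count how many times each keyword appears in the document text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in keywords]

# e.g. feature_vector(document_1_text) -> [3, 0, 5, 18]
```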
The final step would be to normalize the data to the range [0, 1]. But here is where I realized this could be done following two different approaches:
- Dividing each count by the total number of occurrences in the document
- Dividing each count by the maximum count in the document
Following the first approach, the result of the normalization would be:
Document 1: [ 0.11538, 0.00000, 0.19231, 0.69231] (divided by 26)
Document 2: [ 0.50000, 0.11111, 0.05556, 0.33333] (divided by 18)
While following the second approach, the result would be:
Document 1: [ 0.16667, 0.00000, 0.27778, 1.00000] (divided by 18)
Document 2: [ 1.00000, 0.22222, 0.11111, 0.66667] (divided by 9)
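In code, the two approaches look like this (a minimal sketch in plain Python, function names are my own):

```python
def normalize_by_total(counts):
    """First approach: divide each count by the sum of all counts."""
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def normalize_by_max(counts):
    """Second approach: divide each count by the largest count."""
    peak = max(counts)
    return [c / peak for c in counts] if peak else counts

doc1 = [3, 0, 5, 18]
doc2 = [9, 2, 1, 6]

print(normalize_by_total(doc1))  # [0.11538..., 0.0, 0.19230..., 0.69230...]
print(normalize_by_max(doc2))    # [1.0, 0.22222..., 0.11111..., 0.66666...]
```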
For this specific case:
- Which of these two approaches better enhances the representation and comparison of the feature vectors?
- Are the results going to be the same?
- Will either of these approaches work better with a specific similarity measure (Euclidean, cosine)?
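For context, this is how I plan to compare the normalized vectors (a plain-Python sketch; the function names are my own and I could equally use an existing library):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```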