In count vectorizer which axis to use?

Question

I want to create a document-term matrix. In my case it is not documents x words but sentences x words, so the sentences act as the documents. I apply 'l2' normalization after creating the document-term matrix.

The term counts matter to me because I use them for summarization with SVD in later steps.

My question is which axis is appropriate for the 'l2' normalization. From my research I understand:

Axis=1: gives the importance of a word within a sentence (each row, i.e. each sentence, is normalized).
Axis=0: gives the importance of a word across the whole document (each column, i.e. each word, is normalized over all sentences).

Even knowing the theory, I am not able to decide which alternative to choose, because the choice will greatly affect my summarization results. Please suggest which one to use, along with the reasoning behind it.

gtancev · Accepted Answer · 2020-03-22T19:18:19.110

1

By L2 normalization, do you mean division by the total count? (Strictly, dividing by the total count is L1 normalization, while L2 divides by the Euclidean norm; the argument about the axes is the same either way.) If you normalize along axis=0, then the value of x_{i,j} is the probability of word j in sentence i relative to all sentences (division by the global count of that word). This depends on the length of the sentence: a longer sentence can repeat a word over and over, contributes more to the global word count, and so gets a much higher probability for that word. If you normalize along axis=1, then you are asking whether sentences have the same composition of words, because you normalize by the length of the sentence.

edited Mar 22 '20 at 19:18

answered Mar 22 '20 at 10:18



Perfectly put, it made me question myself. I would go with axis=0, because I am interested in the impact of the word on the whole document, and this can give me a higher probability for key terms. In contrast, axis=1 would give me a low probability, because I am sure those key terms would hardly reappear in the same sentence. Thank you so much for making me think it over. – shrikanth singh Mar 22 '20 at 11:20
