
I was trying to create ngrams using the hash_vectorizer function in text2vec when I noticed that it doesn't change the dimensions of my dtm when I change the ngram values.

h_vectorizer = hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L))
dtm_train = create_dtm(it_train, h_vectorizer)
dim(dtm_train)

In the above code, the dimensions don't change whether the ngram window is 2-10 or 9-10.

vocab = create_vocabulary(it_train, ngram = c(1L, 4L))
ngram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, ngram_vectorizer)

In the above code, the dimensions change, but I want to use the hash_vectorizer as well, since it saves on space. How do I go about using it?


1 Answer


When using hashing you set the size of your output matrix in advance. You did so by setting hash_size = 2 ^ 14. This stays the same independently of the ngram window specified in the model. However, the counts within the output matrix change.
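To illustrate, here is a minimal sketch using the same two toy strings as in the example further below; the dimensions are fixed by hash_size, while the ngram window only changes which cells are filled:

library(text2vec)
txt <- c("a string string", "and another string")
it = itoken(txt, progressbar = FALSE)

#both dtms are 2 x 16384, because hash_size = 2^14 fixes the number of columns
dtm_a = create_dtm(it, hash_vectorizer(hash_size = 2 ^ 14, ngram = c(1L, 2L)))
dtm_b = create_dtm(it, hash_vectorizer(hash_size = 2 ^ 14, ngram = c(2L, 10L)))
dim(dtm_a)
#[1]     2 16384
dim(dtm_b)
#[1]     2 16384   - same shape, different counts inside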

(In response to the comments below:) Below you find a minimal example with two very simple strings to demonstrate the different outputs for two different ngram windows used with a hash_vectorizer. For the bigram case I have added the output matrix of a vocab_vectorizer for comparison. Note that you have to set a hash size sufficiently large to account for all terms; if it is too small, the hash values of individual terms may collide.

Your comment that you would always have to compare the outputs of a vocab_vectorizer approach and a hash_vectorizer approach leads in the wrong direction, because you would then lose the efficiency/memory advantage of hashing, which avoids generating a vocabulary. Depending on your data and desired output, hashing may trade accuracy (and interpretability of the terms in the dtm) for efficiency. Hence, it depends on your use case whether hashing is reasonable, which it is especially for classification tasks at the document level on large collections.
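To make the trade-off concrete, here is a rough sketch of the two pipelines (again with the toy strings; an illustration, not a benchmark). The vocabulary route needs an extra pass over the corpus to build and store the vocabulary, while the hashing route does not:

library(text2vec)
txt <- c("a string string", "and another string")
it = itoken(txt, progressbar = FALSE)

#vocabulary route: additional pass to build the vocabulary;
#columns of the dtm are interpretable terms
v = create_vocabulary(it)
dtm_vocab = create_dtm(it, vocab_vectorizer(v))

#hashing route: no vocabulary is built or stored;
#columns are anonymous hash buckets whose number is fixed in advance
dtm_hash = create_dtm(it, hash_vectorizer(hash_size = 2 ^ 18))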

I hope this gives you a rough idea about hashing and what you can and cannot expect from it. You might also check some posts on hashing on Quora or Wikipedia, or refer to the detailed original sources listed on text2vec.org.

library(text2vec)
txt <- c("a string string", "and another string")

it = itoken(txt, progressbar = FALSE)


#the following four examples demonstrate the effect of the size of the hash
#and the use of signed hashes (i.e., a secondary hash function to reduce the risk of collisions)
vectorizer_small = hash_vectorizer(2 ^ 2, c(1L, 1L)) #unigrams only
hash_dtm_small = create_dtm(it, vectorizer_small)
as.matrix(hash_dtm_small)
#    [,1] [,2] [,3] [,4]
# 1    2    0    0    1
# 2    1    2    0    0  #collision of the hash values of and / another

vectorizer_small_signed = hash_vectorizer(2 ^ 2, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_small = create_dtm(it, vectorizer_small_signed)
as.matrix(hash_dtm_small)
#     [,1] [,2] [,3] [,4]
# 1    2    0    0    1
# 2    1    0    0    0 #no collision but some terms (and / another) not represented as hash value

vectorizer_medium = hash_vectorizer(2 ^ 3, c(1L, 1L)) #unigrams only
hash_dtm_medium = create_dtm(it, vectorizer_medium)
as.matrix(hash_dtm_medium)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1    0    0    0    1    2    0    0    0
# 2    0    1    0    0    1    1    0    0 #no collision, all terms represented by hash values


vectorizer_medium = hash_vectorizer(2 ^ 3, c(1L, 1L), signed_hash = TRUE) #unigrams only
hash_dtm_medium = create_dtm(it, vectorizer_medium)
as.matrix(hash_dtm_medium)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# 1    0    0    0    1    2    0    0    0
# 2    0   -1    0    0    1    1    0    0 #no collision, all terms represented as hash values
                                            #in addition second hash function generated a negative hash value


#the following examples demonstrate the difference between
#two hash vectorizers (one with unigrams only, one also allowing bigrams)
#and one vocab vectorizer with unigrams + bigrams
vectorizer = hash_vectorizer(2 ^ 4, c(1L, 1L)) #unigrams only
hash_dtm = create_dtm(it, vectorizer)
as.matrix(hash_dtm)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1    0    0    0    0    0    0    0    0    0     0     0     1     2     0     0     0
# 2    0    0    0    0    0    0    0    0    0     1     0     0     1     1     0     0

vectorizer2 = hash_vectorizer(2 ^ 4, c(1L, 2L)) #unigrams + bigrams
hash_dtm2 = create_dtm(it, vectorizer2)
as.matrix(hash_dtm2)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# 1    1    0    0    1    0    0    0    0    0     0     0     1     2     0     0     0
# 2    0    0    0    0    0    1    1    0    0     1     0     0     1     1     0     0

v <- create_vocabulary(it, c(1L, 2L))
vectorizer_v = vocab_vectorizer(v) #unigrams + bigrams
v_dtm = create_dtm(it, vectorizer_v)
as.matrix(v_dtm)
#   a_string and_another a another and string_string another_string string
# 1        1           0 1       0   0             1              0      2
# 2        0           1 0       1   1             0              1      1


sum(Matrix::colSums(hash_dtm) > 0)
#[1] 4   - these are the four unigrams a, string, and, another
sum(Matrix::colSums(hash_dtm2) > 0)
#[1] 8   - these are the four unigrams as above plus the 4 bigrams string_string, a_string, and_another, another_string 
sum(Matrix::colSums(v_dtm) > 0)
#[1] 8 - same as hash_dtm2
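
As an additional sanity check: with an unsigned hash, collisions can only merge columns, never drop counts, so the total counts of the hash dtm and the vocab dtm should agree even though the hash dtm carries no term labels (with signed_hash = TRUE this need not hold, since counts may cancel):

sum(hash_dtm2) == sum(v_dtm)
#[1] TRUE   - total counts match, only the term labels are lost in the hash dtm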
  • Hi Manuel, I checked for 1-1, 1-2, and 1-4; the sums for them were 12850, 16384, and 16384 respectively. I agree that your logic makes sense, and I got my code from the vignette, but how can I really make sure? – Akhil Dec 14 '17 at 16:38
  • However, when I check sum(colSums(as.matrix(dtm_train))) I get different sums for all variations. Why have you checked for (>0)? The values, I guess, will be 0 or 1 (or maybe >1); there will be no negative values, correct? – Akhil Dec 14 '17 at 16:41
  • (i) I checked for `>0` since the hash size I set was larger than the number of tokens, hence, not all hash elements will have entries, only the positive ones count. For additional clarification, test my example with `c("a string string", "and another string")`. You will notice a count of 2 at one of the positions (plus an additional bigram at another). (ii) What hash size did you set for your 1-4 test (maybe you need to increase it) / what is the number of unigrams/tokens in your data. (iii) You should provide a reproducible example of text otherwise SO members need to keep guessing. – Manuel Bickel Dec 14 '17 at 16:51
  • Additional issue: What was the `dim` of your dtm created with a `vocab_vectorizer` in your above stated test of 1-1, 1-2, 1-4? – Manuel Bickel Dec 14 '17 at 16:54
  • (1) I have used the "spooky author identification" dataset from Kaggle. There are 19K+ rows in the train set alone, so I don't really know the count of bigrams and above. (2) I have used 2^14 as the hash size of the hash_vectorizer for 1-10 ngrams; will this be enough? (3) Here are the dim values I got from the vocab_vectorizer: 1-1 (19579 x 25108), 1-2 (19579 x 246512), 1-4 (19579 x 1100631). – Akhil Dec 14 '17 at 17:02
  • The sum(colSums(dtm_train)) from the vocab vectorizer matches the one from the hash vectorizer. Wow, great relief! This means that even though the dim might be the same for the hash vectorizer, one should really check sum(colSums(dtm)). Is my understanding correct? – Akhil Dec 14 '17 at 17:14
  • Please see my updated answer. Please note that the dtms generated by a hash or a vocab vectorizer are related but not "the same": the hash dtm has no information about terms, only the hash values of the features. – Manuel Bickel Dec 15 '17 at 08:29