
I want to get bigrams and trigrams from the example sentences I have mentioned.

My code works fine for bigrams. However, it does not capture trigrams in the data (e.g., "human computer interaction", which appears in five of my sentences).

Approach 1: Below is my code using Phrases in Gensim.

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2: I also tried using both Phraser and Phrases, but it didn't work.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Please help me to fix this issue of getting trigrams.

I am following the example documentation of Gensim.

1 Answer


I was able to get bigrams and trigrams with a few modifications to your code:

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

    print(bigrams_)
    print(trigrams_)

I removed the threshold=1 parameter from the bigram Phrases because otherwise it seems to form odd bigrams, which in turn allow odd trigrams to be built (notice that bigram is used to build the trigram Phrases). This parameter will probably be more useful when you have more data. For trigrams, min_count also needs to be specified, because it defaults to 5 if not provided.

To retrieve the bigrams and trigrams of each document, you can use a list comprehension to filter out elements that aren't made of exactly two or three words, respectively.
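As a gensim-free illustration of that filter, the sketch below applies the same space-counting idea to a hand-written token list. The helper name `ngrams_of_size` and the sample tokens are made up for this example; they are not part of gensim.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only tokens made of exactly n words joined by the delimiter."""
    return [t for t in tokens if t.count(delimiter) == n - 1]

# Hypothetical output of a trigram model for one sentence.
tokens = ["human computer interaction", "is", "a", "new subject"]

bigrams_ = ngrams_of_size(tokens, 2)
trigrams_ = ngrams_of_size(tokens, 3)
print(bigrams_)   # ['new subject']
print(trigrams_)  # ['human computer interaction']
```

This only works if the delimiter never occurs inside a single word, which holds here because the sentences were split on spaces.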


Edit - a few details about the threshold parameter:

This parameter is used by the estimator to determine whether two words a and b form a phrase, which happens only if:

(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

where N is the total vocabulary size. By default the parameter value is 10 (see docs). So, the higher the threshold, the stricter the constraint for words to form phrases.
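To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on a three-sentence subset of the question's corpus. The function name `phrase_score` is hypothetical, and N is assumed here to be the number of distinct single words; gensim's own vocabulary may be counted differently, so treat the resulting number as illustrative only.

```python
from collections import Counter

def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score from the formula above; a pair is a phrase when score > threshold."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

documents = [
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "machine learning can be useful sometimes",
]
sentences = [doc.split(" ") for doc in documents]

# Count single words and adjacent word pairs.
unigrams = Counter(w for s in sentences for w in s)
pairs = Counter(p for s in sentences for p in zip(s, s[1:]))

# Assumption: N is the number of distinct single words in this small corpus.
vocab_size = len(unigrams)

score = phrase_score(
    unigrams["human"],
    unigrams["computer"],
    pairs[("human", "computer")],
    vocab_size,
    min_count=1,
)
print(score)  # 3.5
```

With these counts, "human computer" would pass a threshold of 1 but not the default of 10, which shows why small corpora often need a lower threshold.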

For example, in your first approach you used threshold = 1, so you would get ['human computer', 'interaction is'] as bigrams for 3 of the 5 sentences that begin with "human computer interaction"; that odd second bigram is a result of the more relaxed threshold.

Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (filtered out by the threshold). Because that result is a 4-gram rather than a trigram, it would also be filtered out by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you can get ['human computer interaction'] as a trigram for the two remaining sentences. It doesn't seem easy to find a good combination of parameters; here's more about it.
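The way two merging passes can produce a 4-gram is easier to see in a small simulation. The sketch below is not gensim code: `merge_pairs` is a hypothetical helper that greedily joins adjacent tokens found in a hand-picked set of phrases, standing in for what the bigram and trigram models learned in the scenario described above.

```python
def merge_pairs(tokens, phrases, delimiter=" "):
    """Greedily join adjacent tokens, left to right, when the pair is a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sent = "human computer interaction is interesting".split(" ")

# Assumed phrases: what a relaxed bigram pass and a second pass might accept.
first_pass = merge_pairs(sent, {("human", "computer"), ("interaction", "is")})
second_pass = merge_pairs(first_pass, {("human computer", "interaction is")})

print(first_pass)   # ['human computer', 'interaction is', 'interesting']
print(second_pass)  # ['human computer interaction is', 'interesting']

# The merged token has three spaces, so a `count(' ') == 2` filter drops it.
trigrams_ = [t for t in second_pass if t.count(" ") == 2]
print(trigrams_)    # []
```

Because the second pass can only join tokens the first pass produced, an unwanted bigram such as 'interaction is' blocks the intended 'human computer interaction' trigram.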

I'm not an expert, so take this conclusion with a grain of salt: I think it's better to first get good bigram results (not ones like 'interaction is') before moving on, as odd bigrams can confuse the trigrams, 4-grams, and so on that are built from them.

stjernaluiht
  • 1
    Many thanks for your very valuable answer. Cheers! :) By the way, could you please tell me what the `threshold` value does, as it is not very clear to me? –  Sep 11 '17 at 11:07
  • 2
    You're welcome! Yes, I edited the answer, hopefully now it's a bit clearer. – stjernaluiht Sep 11 '17 at 15:20
  • 1
    Thanks a lot! Found your answer very useful :) –  Sep 12 '17 at 00:02
  • 1
    It was not obvious from [gensim](https://radimrehurek.com/gensim/models/phrases.html) that `delimiter=b' '` must be in binary format. Thanks for that. – Max Mar 01 '18 at 02:05
  • 1
    How to use it for train and test data? It doesn't have any fit and transform method like scikit learn vectorizers. – user_12 Sep 10 '19 at 14:24