Phrase detection using PhrasesTransformer

Question

from gensim.sklearn_api.phrases import PhrasesTransformer

# Create the model. Make sure no term is ignored and combinations seen 3+ times are captured.
m = PhrasesTransformer(min_count=1, threshold=3)
text = [['I', 'love', 'computer', 'science', 'computer', 'science', 'is', 'my', 'passion', 'I', 'studied', 'computer', 'science']]

# Use sklearn fit_transform to see the transformation.
# Since computer and science were seen together 3+ times they are considered a phrase.
m.fit_transform(text)

The above code does return computer_science as expected. But what is the right method to extract phrases pragmatically?

What do you mean by "right way" and "pragmatically"? (The `Phrases` statistical technique works OK for a bunch of purposes, but will both miss phrases/concepts/entities a person would perceive, and combine multigrams that a person could tell, from context, aren't a real unit-of-meaning. So the results will often be unaesthetic for showing to users, but still helpful behind-the-scenes for things like classification or info-retrieval.) — gojomo, Apr 28 '20 at 05:21
Something like m.get_phrases() so it can return computer_science. I am not sure if there's such method or property that can do it — John, Apr 28 '20 at 05:44
Exactly, I am not sure if there's such a method or property, as I am new to n-grams — John, Apr 29 '20 at 02:05

score 0 · Answer 1 · answered Apr 29 '20 at 20:11

The PhrasesTransformer wraps an instance of gensim's Phrases model, which does the actual compilation of co-occurrence statistics & promotion of bigrams to phrases.

Unfortunately, the Phrases object does not, at least through gensim-3.8.3 (April 2020), offer any list of all phrases it could emit. Essentially, it just compiles statistics, then when presented with new text, checks to see which bigrams should be combined (in accordance with its current stats & parameters).

Downsides of this approach include:

lots of redundant re-calculations
no handy list of "all" the possible phrases
retains memory-consuming potential-phrases, that don't qualify with current scoring/threshold parameters, but might if those changed

Upsides of this approach include:

changing scoring/threshold parameters can immediately enable new phrases
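
To make these trade-offs concrete, here is a minimal pure-Python sketch of the count-once, score-on-demand idea. It does not call gensim. It assumes the default scoring formula is (bigram_count - min_count) / count_a / count_b * vocab_size, with the vocab holding both unigrams and bigrams, and `collect_counts` and `phrases_at` are hypothetical helper names used only for illustration. It also ignores details such as common terms and vocab pruning.

```python
from collections import Counter

def collect_counts(sentences):
    """Count unigrams and adjacent bigrams once (hypothetical helper)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def phrases_at(unigrams, bigrams, min_count, threshold):
    """Score every stored bigram against the given parameters.

    Assumed formula (a sketch of the default scorer, not gensim's code):
    (bigram_count - min_count) / count_a / count_b * vocab_size
    """
    vocab_size = len(unigrams) + len(bigrams)
    found = {}
    for (a, b), count in bigrams.items():
        score = (count - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            found[(a, b)] = score
    return found

text = [['I', 'love', 'computer', 'science', 'computer', 'science', 'is',
         'my', 'passion', 'I', 'studied', 'computer', 'science']]

# The statistics are gathered a single time...
unigrams, bigrams = collect_counts(text)

# ...and only the scoring pass is repeated when parameters change.
strict = phrases_at(unigrams, bigrams, min_count=1, threshold=5)
loose = phrases_at(unigrams, bigrams, min_count=1, threshold=3)
print(sorted(strict))
print(sorted(loose))
```

Under these assumptions, no bigram qualifies at a threshold of 5, while lowering the threshold to 3 promotes ('computer', 'science') without recounting anything. It also shows that the threshold is compared against a score (about 4 here), not against a raw co-occurrence count, and that every scoring pass has to revisit all stored bigrams, including the ones that never qualify.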

To actually get a list of the phrases a Phrases instance could create, you need to feed it text with all potential phrases, and see what it promotes. This should probably be a utility method on gensim Phrases – as a very similar enumeration is already done inside the associated Phraser class initialization – but isn't. Such a method could work roughly like this (it may have bugs, as I haven't tested it):

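from gensim.models.phrases import pseudocorpus

def report_all_phrases(phrases_model):
    # Build a synthetic corpus containing every candidate bigram in the model's vocab.
    corpus = pseudocorpus(phrases_model.vocab, phrases_model.delimiter, phrases_model.common_terms)
    phrasegrams = set()
    # Keep each bigram the model promotes under its current scoring/threshold settings.
    for bigram, score in phrases_model.export_phrases(corpus, phrases_model.delimiter, as_tuples=True):
        phrasegrams.add(bigram)
    return phrasegrams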
