The PhrasesTransformer wraps an instance of gensim's Phrases model, which does the actual compilation of co-occurrence statistics and promotion of bigrams to phrases.
Unfortunately, the Phrases object does not, at least through gensim-3.8.3 (April 2020), offer any list of all the phrases it could emit. Essentially, it just compiles statistics; then, when presented with new text, it checks which bigrams should be combined (in accordance with its current stats and parameters).
Downsides of this approach include:
- lots of redundant re-calculations
- no handy list of "all" the possible phrases
- retains memory-consuming potential phrases that don't qualify under the current scoring/threshold parameters, but might if those changed
Upsides of this approach include:
- changing scoring/threshold parameters can immediately enable new phrases
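To make that trade-off concrete, here is a minimal pure-Python sketch of the "store counts, score lazily" idea. It is not gensim's implementation: the ToyPhrases class, its method names, and the tiny corpus are all hypothetical, and the scoring formula is only assumed to resemble the Mikolov-style score that gensim uses by default.

```python
from collections import Counter


class ToyPhrases:
    """Hypothetical toy model: stores counts only, scores bigrams on demand."""

    def __init__(self, min_count=1, threshold=1.0, delimiter="_"):
        self.min_count = min_count
        self.threshold = threshold
        self.delimiter = delimiter
        self.unigrams = Counter()
        self.bigrams = Counter()

    def add_sentences(self, sentences):
        # Only raw counts are stored; no phrase decisions are made here.
        for tokens in sentences:
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))

    def score(self, a, b):
        # Assumed Mikolov-style score:
        # (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
        count_ab = self.bigrams[(a, b)]
        if count_ab == 0:
            return 0.0
        vocab_size = len(self.unigrams)
        return (count_ab - self.min_count) * vocab_size / (
            self.unigrams[a] * self.unigrams[b]
        )

    def apply(self, tokens):
        # Scores are recomputed on every call, using the current parameters.
        out, i = [], 0
        while i < len(tokens):
            has_next = i + 1 < len(tokens)
            if has_next and self.score(tokens[i], tokens[i + 1]) > self.threshold:
                out.append(tokens[i] + self.delimiter + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def all_phrases(self):
        # The only way to list phrases is to score every stored bigram.
        return {
            a + self.delimiter + b
            for (a, b) in self.bigrams
            if self.score(a, b) > self.threshold
        }


sentences = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "busy"],
    ["new", "york", "has", "parks"],
    ["the", "city", "is", "big"],
]

model = ToyPhrases(min_count=1, threshold=1.8)
model.add_sentences(sentences)
strict = model.all_phrases()

# Lowering the threshold takes effect immediately, with no retraining,
# because the low-scoring bigram counts were kept around.
model.threshold = 1.2
relaxed = model.all_phrases()

print(strict, relaxed, model.apply(["new", "york", "is", "big"]))
```

The same design choice produces both lists above: keeping every bigram count costs memory and forces rescoring on each call, but it is also what lets a parameter change promote a bigram that previously fell short.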
To actually get a list of the phrases a Phrases instance could create, you need to feed it text containing all potential phrases and see what it promotes. This should probably be a utility method on gensim Phrases, since a very similar enumeration is already done inside the associated Phraser class initialization, but it isn't. Such a method could work roughly like this (it may have bugs, as I haven't tested it):
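from gensim.models.phrases import pseudocorpus

def report_all_phrases(phrases_model):
    # Build a synthetic corpus from the candidate bigrams stored in the model's vocab
    corpus = pseudocorpus(phrases_model.vocab, phrases_model.delimiter, phrases_model.common_terms)
    phrasegrams = set()
    # export_phrases() yields only the bigrams that pass the current scoring/threshold
    for bigram, score in phrases_model.export_phrases(corpus, phrases_model.delimiter, as_tuples=True):
        phrasegrams.add(bigram)
    return phrasegrams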