How to define a pos_pattern that extracts a noun followed by zero or more nouns or adjectives for KeyphraseCountVectorizer?

Question

I'm trying to extract Arabic keywords from tweets. I'm using KeyBERT with KeyphraseCountVectorizer:

vectorizer = KeyphraseCountVectorizer(pos_pattern='< N.*>*')

I'd like to write a more specific POS pattern that selects a noun followed by a sequence of zero or more nouns or adjectives, but no verbs. Could you please help me write the right regular expression? Thank you.

score 1 · Accepted Answer · answered Jan 14 '23 at 22:26

I interpret your requirement ("nouns followed by zero or more sequence of nouns or adjectives") as one or more consecutive nouns (i.e. <N.*>+) followed by zero or more adjectives (i.e. <J.*>*). Putting these together gives the following pattern:

vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+<J.*>*")
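To see what this pattern does and does not capture, here is a minimal, standard-library-only sketch that emulates NLTK-style tag-pattern matching on a hand-tagged token list. It is not the keyphrase_vectorizers implementation. The helper names (tag_pattern_to_regex, find_chunks), the example tokens, and the Penn-Treebank-style tags (NN, JJ, VBZ) are all assumptions made for illustration, and the tags a real spaCy model emits depend on that model's tag set. The sketch also tries a second pattern, <N.*><N.*|J.*>*, which is an alternative reading of the question in which nouns and adjectives may be mixed freely after the first noun.

```python
import re


def tag_pattern_to_regex(pos_pattern):
    """Turn an NLTK-style tag pattern into a regex over a '<TAG><TAG>...' string.

    Hypothetical helper for illustration only.
    """
    # Whitespace inside tag patterns is ignored.
    pattern = re.sub(r"\s+", "", pos_pattern)
    # Treat each <...> as one unit so quantifiers apply to the whole tag.
    pattern = re.sub(r"<([^<>]*)>", r"(?:<(?:\1)>)", pattern)
    # '.' must not run across tag boundaries.
    pattern = pattern.replace(".", "[^<>]")
    return re.compile(pattern)


def find_chunks(tagged_tokens, pos_pattern):
    """Return the token sequences whose tags match pos_pattern."""
    regex = tag_pattern_to_regex(pos_pattern)
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # skip empty matches (possible with patterns like <N.*>*)
        # Count '<' characters to map string offsets back to token indices.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return chunks


# Hand-tagged example in noun-adjective order; the tags are assumed, not model output.
tagged = [("car", "NN"), ("new", "JJ"), ("engine", "NN"), ("runs", "VBZ")]

print(find_chunks(tagged, "<N.*>+<J.*>*"))     # ['car new', 'engine']
print(find_chunks(tagged, "<N.*><N.*|J.*>*"))  # ['car new engine']
```

With the first pattern, a noun that follows an adjective starts a new phrase. The second pattern keeps the whole noun/adjective run together. Neither pattern ever includes the verb.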

As a side point, you note that you are attempting to extract Arabic keywords. As far as I understand, the keyphrase_vectorizers package relies on the text being annotated with spaCy PoS tags, so to switch from the default language (English) you have to load a pipeline/model for the desired language and set the stop words to those of that language. For example, if using the Keyphrase Vectorizer for German:

vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', stop_words='german')

However, at present spaCy does not have a pipeline trained for Arabic text, which means that using KeyphraseCountVectorizer in a straightforward manner with Arabic text is not possible without workarounds (something you may have already solved but I just thought I'd mention it).
