I interpret your requirement to match "nouns followed by zero or more sequence of nouns or adjectives" as matching one or more sequential nouns (i.e. <N.*>+), followed by zero or more adjectives (i.e. <J.*>*). Putting these together gives the full pattern:
from keyphrase_vectorizers import KeyphraseCountVectorizer
vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+<J.*>*")
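To see what this pattern selects, here is a minimal standard-library sketch of matching a tag pattern against a sequence of PoS tags. It is not the package's actual implementation, just an illustration of the idea. The helper names (tag_pattern_to_regex, match_phrases) and the hand-assigned Penn Treebank style tags in the sample are my own assumptions, not output from spaCy.

```python
import re

def tag_pattern_to_regex(pos_pattern):
    # Turn each <...> element into a non-capturing group so that a following
    # quantifier (+, *) applies to the whole tag, and keep '.' inside one tag.
    def convert(match):
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<" + inner + ">)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, pos_pattern))

def match_phrases(tagged_tokens, pos_pattern):
    regex = tag_pattern_to_regex(pos_pattern)
    # Encode the tag sequence as one string, e.g. "<DT><NN><JJ>".
    tag_string = "".join("<" + tag + ">" for _, tag in tagged_tokens)
    # Record the character offset at which each token's tag starts.
    offsets, pos = [], 0
    for _, tag in tagged_tokens:
        offsets.append(pos)
        pos += len(tag) + 2
    phrases = []
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue  # ignore empty matches from fully optional patterns
        start = offsets.index(m.start())
        end = sum(1 for o in offsets if o < m.end())
        phrases.append(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return phrases

# Hand-tagged sample (assumed tags, for illustration only).
tagged = [("the", "DT"), ("attorney", "NN"), ("general", "JJ"),
          ("reviewed", "VBD"), ("machine", "NN"), ("learning", "NN"),
          ("models", "NNS")]
phrases = match_phrases(tagged, "<N.*>+<J.*>*")
print(phrases)  # ['attorney general', 'machine learning models']
```

Note that the pattern only allows adjectives after the run of nouns. If you want nouns and adjectives to be freely mixed after the first noun, the pattern would need to be adapted.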
As a side point, you note that you are attempting to extract Arabic keywords. As far as I understand, the keyphrase_vectorizers package relies on the text being annotated with spaCy PoS tags, so to change languages from the default (English) you have to load a corresponding spaCy pipeline/model for the desired language and set the stop words to those of the new language. For example, to use the Keyphrase Vectorizer with German:
vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', stop_words='german')
However, at present spaCy does not have a pipeline trained for Arabic text, which means that using KeyphraseCountVectorizer in a straightforward manner with Arabic text is not possible without workarounds (something you may have already solved, but I thought I'd mention it).