I want to extract keywords/tags from a set of documents (pdf, docx, txt) using the opennlp API, for tagging purposes.
Can anyone suggest how I can make use of the opennlp tool for keyword extraction?
Welcome to SO! If you think of a "keyword" as a relative term, then OpenNLP can help you in several ways. For instance, you can use the part-of-speech tagger to extract nouns and index only the nouns as keywords (you could do the same for verbs). You could use the chunker to extract noun phrases or verb phrases and index the phrases. You could perform Named Entity Recognition with the NameFinder and index the entities by type; your search engine could then support searching specifically on people's names or the names of organizations. This can be powerful, depending on your use case. To get the text out of the pdf and doc/docx files, you should think about using Apache Tika.
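To make the noun idea concrete, here is a minimal sketch of the filtering step only. It assumes you already have two parallel arrays: the tokens, and the tags you would get back from OpenNLP's `POSTaggerME.tag(String[] tokens)`. It also assumes the model emits Penn Treebank tags, where noun tags start with `NN`; a model trained on a different tag set would need a different check. The class and method names (`NounKeywords`, `countNouns`, `topKeywords`) are made up for illustration and are not part of OpenNLP, and the tags in `main` are hand-written stand-ins, not real tagger output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: picks nouns out of tagged tokens.
class NounKeywords {

    // Counts every token whose tag starts with "NN"
    // (NN, NNS, NNP, NNPS under the assumed Penn Treebank tag set).
    static Map<String, Integer> countNouns(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        // LinkedHashMap keeps first-seen order, so ties stay predictable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("NN")) {
                counts.merge(tokens[i].toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to 'limit' nouns, most frequent first.
    static List<String> topKeywords(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equally frequent nouns keep first-seen order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for tokenizer and tagger output; in a real pipeline these
        // arrays would come from OpenNLP rather than being typed by hand.
        String[] tokens = {"The", "parser", "reads", "documents", "and", "the", "parser", "emits", "tags"};
        String[] tags   = {"DT",  "NN",     "VBZ",   "NNS",       "CC",  "DT",  "NN",     "VBZ",   "NNS"};
        System.out.println(topKeywords(countNouns(tokens, tags), 2));
    }
}
```

Counting by raw frequency is just the simplest ranking I could think of; you would probably want to drop very common nouns or weight by how rare a noun is across your document set before indexing.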
Here are some links to other SO questions
Also, if you are using Solr, I think some work has been done to use OpenNLP as a tokenizer, though I have never used it.