Integrating Elasticsearch & Stanford NLP without re-indexing

Question

We've been using Elasticsearch in our system. Although I've used its analyzers and queries, I haven't dug deep into its indexing. As of now, I don't know how far ES lets us work with the Lucene (inverted) indexes it keeps in its shards.

We're now looking at a range of NLP features, NER for one, and Stanford NLP appealed to us.

There doesn't seem to be a plug-in that makes these two packages work together(?)

I haven't had a deep look into Stanford NLP. However, as far as I can see, it works entirely on its own indexes: whichever object or type is passed to it, Stanford NLP indexes it itself and goes from there.

This would make the system maintain two different indexes for the same set of documents, those of ES and those of Stanford NLP, and this would be costly.

Is there a way to get around this?

One scenario I have in mind: make Stanford NLP work on the Lucene segments, i.e. the inverted indexes that ES has already built. In this case:

1.) Can Stanford NLP use Lucene indexes without re-indexing anything for itself? I don't know Stanford NLP's indexing structure, or even how far it uses (or doesn't use) Lucene.

2.) Are there any restrictions on using the Lucene indexes in ES shards? Would we hit a wall if we used these Lucene segments directly as is, bypassing ES?

I'm still trying to put things together; it's all up in the air for now. Sorry for the naive question.

I'm aware of OpenNLP and its plug-in. I haven't checked, but I'm guessing it wouldn't be "double-indexing" and would use ES's indexes(?) However, it's Stanford NLP we're after.

TIA.

score 6 · Answer 1 · answered Jul 20 '15 at 15:45


Stanford NER neither uses a Lucene/SOLR index, nor makes its own text index. It maps a piece of text or a token sequence to a sequence of tokens with NER annotations.
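
To make that input/output shape concrete, here is a minimal sketch in Python. The (token, label) pairs below are hand-written to mimic what a 3-class tagger (PERSON, ORGANIZATION, LOCATION, with O for everything else) returns; they are not produced by actually running Stanford NER, and `group_entities` is a hypothetical helper introduced only for illustration.

```python
from itertools import groupby

# Hand-written sample mimicking tagger output: one (token, label) pair per token.
# "O" marks tokens outside any entity. Assumed data, not real Stanford NER output.
tagged = [
    ("Christopher", "PERSON"), ("Manning", "PERSON"), ("teaches", "O"),
    ("at", "O"), ("Stanford", "ORGANIZATION"), ("University", "ORGANIZATION"),
    ("in", "O"), ("California", "LOCATION"), (".", "O"),
]

def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share a non-"O" label into mentions."""
    mentions = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            mentions.append((" ".join(token for token, _ in run), label))
    return mentions

entities = group_entities(tagged)
print(entities)
```

Note that nothing here is an index: the input is a token sequence and the output is the same sequence with labels attached. Merging adjacent same-label tokens is a simplification that would glue together two distinct entities of the same type if they happen to sit next to each other.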

Typically, you would run NER on each document on ingestion, around the time of tokenization, prior to indexing, and then index each document for entities as well as words.
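
A rough sketch of that ingestion step, in Python, might look like the following. Everything named here is an assumption made for illustration: `GAZETTEER` and `tag_tokens` are a toy lookup standing in for a real call to Stanford NER, and the `entities_*` field names are made up rather than any Elasticsearch convention. The point is only that the entities end up as extra fields on the same document, so there is one index, not two.

```python
import json
import re

# Hypothetical stand-in for a real NER tagger: a tiny lookup table.
# A real pipeline would call Stanford NER at this point instead.
GAZETTEER = {"Obama": "PERSON", "Google": "ORGANIZATION", "London": "LOCATION"}

def tag_tokens(text):
    """Split text into word and punctuation tokens and label each one."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [(token, GAZETTEER.get(token, "O")) for token in tokens]

def enrich(doc):
    """Return a copy of doc with one extra list field per entity type found."""
    enriched = dict(doc)
    for token, label in tag_tokens(doc["body"]):
        if label != "O":
            field = "entities_" + label.lower()  # assumed field naming scheme
            values = enriched.setdefault(field, [])
            if token not in values:
                values.append(token)
    return enriched

doc = {"id": 1, "body": "Obama visited Google in London."}
enriched = enrich(doc)

# This JSON body is what would then be sent to the search engine for indexing.
print(json.dumps(enriched, indent=2))
```

The original `body` field is still indexed for words as usual; the added fields can be mapped as exact-match fields so that queries and aggregations on entities run against the same index.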

I know of no existing Elasticsearch plugin for Stanford NER, but it may be informative to look at how people have done this with Solr: http://www.searchbox.com/named-entity-recognition-ner-in-solr/ . Both Solr and Elasticsearch use Lucene Analyzers and indexes internally.

answered Jul 20 '15 at 15:45

Christopher Manning


Thanks for the response. Is there a way to feed in an already indexed set of documents, i.e. the inverted indexes on these documents, so that Stanford NLP can use them without lengthy processing to turn them into its own indexing structure? It seems there's no way around double indexing unless I work on a per-document basis, get the output of a Stanford NLP component and take it from there(?). I am now looking at the processing-time cost of converting to/from Stanford NLP types. – Roam Jul 20 '15 at 17:30
There is a range of packages: http://nlp.stanford.edu/software/index.shtml. I'm looking at them generally for now. I don't know what each does specifically; however, I'm wondering how big a concern it would be to run them over a whole document base and not just a few documents. – Roam Jul 20 '15 at 17:34

score 0 · Answer 2 · answered Dec 17 '15 at 08:32

There is a repository on GitHub that has been experimenting with NER on Elasticsearch using OpenNLP: github page. It uses the Elasticsearch plugin architecture, so it should be easy to test out in an ES instance. I haven't tried the plugin, but I have experience using OpenNLP from previous jobs, and it has a very solid NER parser.
