
I haven't found any example of how to set up an ES index with term vectors and to retrieve them later programmatically in Java by document ID.

The JSON variant described here works: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/docs-termvectors.html

Can anyone give a Java "translation" for this?

Currently, I create the index like so:

CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
createIndexRequestBuilder.execute().actionGet(); 

And add a document like this:

XContentBuilder sourceBuilder = XContentFactory.jsonBuilder()
                .startObject()
                .field("text", text)
                .field("type", "testType")
                .endObject(); // close the object opened by startObject()
IndexRequest request = new IndexRequest(indexName, esContentType).source(sourceBuilder);
client.index(request);

This is how I can fetch a document again:

GetResponse response = client.prepareGet(indexName, esContentType, id).execute().actionGet();
Oliver
  • This Q&A should help: http://stackoverflow.com/questions/29450241/elasticsearch-java-termvectorrequest-termvector – Val Mar 11 '16 at 15:43
  • Thanks for the quick answer, which seems to solve the question at least partially for the retrieval of the term vector, once you know that it is called TermVectorsResponse in the latest version of ES. ;-) Any pointers how to activate the term vectors in the index programmatically? – Oliver Mar 11 '16 at 16:00

2 Answers


OK, I finally figured out what I was looking for (this link was also quite helpful). Since it may help others, I would like to share it here:

Create your index like so:

CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate("indexName");
createIndexRequestBuilder.execute().actionGet(); 

try {
    client.admin().indices().preparePutMapping("indexName").setType("docType")
        .setSource(XContentFactory.jsonBuilder().prettyPrint()
        .startObject()
            .startObject("docType")
            .startObject("properties")
                .startObject("text").field("type", "string").field("index", "not_analyzed").field("term_vector", "yes").endObject()
            .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();
} catch (IOException e) ...
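
One caveat about the mapping above: as far as I understand, a field mapped as "not_analyzed" is indexed as one single term (the whole field value), so its term vector would contain just that one entry. If you want one entry per word, you would probably leave the field analyzed instead. Here is a rough plain-Java sketch of the difference. It makes no Elasticsearch calls; the class and method names are made up for illustration, and the lowercase-and-split logic is only an approximation of what a real analyzer does:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TermVectorSketch {

    // Rough stand-in for an analyzed field (an assumption, not the real analyzer):
    // lowercase the text and split it on anything that is not a letter or digit.
    public static Map<String, Integer> analyzedTerms(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum); // count occurrences per term
            }
        }
        return freqs;
    }

    // A not_analyzed field keeps the whole value as one single term.
    public static Map<String, Integer> notAnalyzedTerms(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put(text, 1);
        return freqs;
    }

    public static void main(String[] args) {
        String text = "The quick fox and the lazy dog";
        System.out.println(analyzedTerms(text));    // one entry per distinct word
        System.out.println(notAnalyzedTerms(text)); // one entry for the whole string
    }
}
```

So if the term vectors you get back only ever contain one term per document, the "index" setting in the mapping is the first thing I would check.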

And here is how you can get back the term vectors from ES:

TermVectorsResponse resp = client.prepareTermVectors().setIndex("indexName")
                          .setType("docType").setId("docId").execute().actionGet();

XContentBuilder builder;
try {
    builder = XContentFactory.jsonBuilder().startObject();
    resp.toXContent(builder, ToXContent.EMPTY_PARAMS);
    builder.endObject();
    System.out.println(builder.string());
} catch (IOException e) ...

This works for me so far, but if anyone has another or a better solution, please feel free to share.

Oliver

To get the terms, we parse the TermVectorsResponse as follows:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvectors.TermVectorsResponse;

...

// Collects every term of every field contained in the term vectors response.
public List<String> getTerms(TermVectorsResponse resp) throws IOException {

    List<String> termStrings = new ArrayList<>();
    Fields fields = resp.getFields(); // may throw IOException
    for (String field : fields) {
        Terms terms = fields.terms(field);
        if (terms == null) {
            continue; // no terms available for this field
        }
        TermsEnum termsEnum = terms.iterator();
        BytesRef term;
        // next() returns the next term, or null once the enum is exhausted
        while ((term = termsEnum.next()) != null) {
            termStrings.add(term.utf8ToString());
        }
    }
    return termStrings;
}

The TermsEnum object provides further methods, such as docFreq() and totalTermFreq(), that return aggregated values for the current term. If you need values for individual documents (such as the frequency of a term per document), you can probably use termsEnum.postings(...) to retrieve them.

We use Elasticsearch 2.3 with Lucene 5.5.0.

c_froehlich