Our team is currently migrating a legacy project from Elasticsearch v1.7.3 to v7.8.0. It is mostly written in Scala, so along with the migration we would like to replace the Java client: Maven Repository: org.elasticsearch » elasticsearch » 1.7.3

During this work, we found a piece of code we are very uncertain about, something like:

SignificantTerms.Bucket bucket = // fetched significant terms;
bucket.getDocCount();
bucket.getSupersetDf();
bucket.getSubsetSize();
bucket.getSupersetSize();

We could not find what getSupersetDf, getSubsetSize and getSupersetSize stand for. The ES 1.7.3 documentation for the significant terms aggregation (Significant Terms Aggregation | Elasticsearch Reference [1.7] | Elastic) presents only doc_count, bg_count and score per bucket.

We can only guess what those methods stand for. One of our guesses is that getSupersetDf is the same value as bg_count, but the main problem remains: there is no direct mapping between the values in the Java client and the Elasticsearch documentation.

Could you help us, please?

Thanks!

Ivan Kurchenko

1 Answer

We can find this in ES source code:

@Override
public final XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
    builder.startObject();
    keyToXContent(builder);
    builder.field(CommonFields.DOC_COUNT.getPreferredName(), getDocCount());
    builder.field(InternalSignificantTerms.SCORE, getSignificanceScore());
    builder.field(InternalSignificantTerms.BG_COUNT, getSupersetDf());
    getAggregations().toXContentInternal(builder, params);
    builder.endObject();
    return builder;
}

You can see that getSupersetDf does indeed stand for bg_count, which is the number of documents in the background (superset) that contain the term.

Another part of the source suggests that getSubsetSize is the doc_count reported at the top level of the aggregation, while getSupersetSize is computed separately. I think it means the total number of documents in the background (whether they contain the term or not).

So to summarize:

  1. bucket.getDocCount: foreground count, the doc_count in each significant terms bucket.

  2. bucket.getSupersetDf: background count, the bg_count in each significant terms bucket.

  3. bucket.getSubsetSize: total foreground document count, the doc_count that appears in the response outside of the bucket list.

  4. bucket.getSupersetSize: total background document count, the bg_count that appears in the response outside of the bucket list.
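
To make the relationship between the four values concrete, here is a minimal sketch, assuming the interpretation summarized above. SignificantTermCounts is a hypothetical helper class and not part of the Elasticsearch client; you would fill it from the four getters on the bucket. It only computes how often the term occurs in the foreground and in the background, and does not reproduce how Elasticsearch calculates the score.

```java
// Hypothetical holder for the four counts read from a SignificantTerms.Bucket.
// This class is an illustration only; it does not exist in the Elasticsearch client.
final class SignificantTermCounts {
    final long docCount;     // bucket.getDocCount(): foreground docs containing the term
    final long supersetDf;   // bucket.getSupersetDf(): background docs containing the term (bg_count)
    final long subsetSize;   // bucket.getSubsetSize(): all foreground docs
    final long supersetSize; // bucket.getSupersetSize(): all background docs

    SignificantTermCounts(long docCount, long supersetDf, long subsetSize, long supersetSize) {
        this.docCount = docCount;
        this.supersetDf = supersetDf;
        this.subsetSize = subsetSize;
        this.supersetSize = supersetSize;
    }

    // Fraction of foreground documents that contain the term.
    double foregroundFrequency() {
        return subsetSize == 0 ? 0.0 : (double) docCount / subsetSize;
    }

    // Fraction of background documents that contain the term.
    double backgroundFrequency() {
        return supersetSize == 0 ? 0.0 : (double) supersetDf / supersetSize;
    }
}
```

For example, a bucket with docCount 50 out of subsetSize 1000, and supersetDf 200 out of supersetSize 100000, has the term in 5% of the foreground documents but only 0.2% of the background documents, which is the kind of gap that makes a term significant.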

Doron Yaacoby