I'm on MongoDB Compass version 1.5.1 for Mac.

When I look at the distribution of values, Compass returns plots like the following:

[screenshot: values distribution plots]

As you can see, min and max values are available. But the min values are wrong: I know the minimum values of those two keys are 1 and 1, not 9 and 13.

Does anyone know how to fix this problem?


1 Answer

Got it. The standard report is based on a sample of at most 1000 documents.
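
Since the report is sampled, exact minima have to be computed against the full collection yourself. A minimal sketch with pymongo (the URI, database, collection, and field names are placeholders for your own):

    from pymongo import MongoClient

    # Connect to the same deployment Compass points at (placeholder URI).
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["mycollection"]

    # Unlike Compass's sampled schema report, a $group over the whole
    # collection scans every document, so the min/max are exact.
    pipeline = [
        {"$group": {
            "_id": None,
            "min_value": {"$min": "$myField"},
            "max_value": {"$max": "$myField"},
        }}
    ]
    print(next(coll.aggregate(pipeline)))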

From the doc:

Sampling in MongoDB Compass is the practice of selecting a subset of data from the desired collection and analyzing the documents within the sample set.

Sampling is commonly used in statistical analysis because analyzing a subset of data gives similar results to analyzing all of the data. In addition, sampling allows results to be generated quickly rather than performing a potentially long and computationally expensive collection scan.

MongoDB Compass employs two distinct sampling mechanisms.

Collections in MongoDB 3.2 are sampled via the $sample operator in the aggregation framework of the core server. This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.

Collections in MongoDB 3.0 and 2.6 are sampled via a backwards compatible algorithm executed entirely within Compass. It comprises three phases:

  1. Query for a stream of _id values, limit 10000 descending by _id
  2. Read the stream of _ids and save sampleSize randomly chosen values. We employ reservoir sampling to perform this efficiently.
  3. Then query the selected random documents by _id.

The choice of sampling method is transparent in usage to the end-user.

sampleSize is currently set to 1000 documents.
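
You can reproduce the 3.2+ behaviour yourself with a $sample stage and see how far the sampled min drifts from the true one (again a pymongo sketch with placeholder names):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    coll = client["mydb"]["mycollection"]              # placeholder names

    # $sample draws a random subset server-side; this mirrors the
    # 1000-document sample Compass analyzes on MongoDB 3.2+.
    docs = coll.aggregate([{"$sample": {"size": 1000}}])
    values = [d["myField"] for d in docs if "myField" in d]
    print(min(values), max(values))  # sampled, not exact, min/max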
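
For the pre-3.2 path, phase 2 above is classic reservoir sampling. Here is a self-contained sketch of the idea, not Compass's actual code:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)       # fill the reservoir first
            else:
                j = random.randrange(i + 1)  # replace with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # e.g. keep 1000 of 10000 streamed _id values, as in phases 1-2.
    sample_ids = reservoir_sample(range(10000), 1000)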

  • Is there a way to increment `sampleSize`? – floatingpurr Jan 14 '17 at 17:42
  • Not right now, the limit is hard coded to 1000 documents. – sweaves Mar 15 '17 at 04:52
  • How can the sample size be hard coded? Can a sample size of 1000 be meaningful in a population of hundreds of millions of documents (or even only hundreds of thousands)? In practice, when I let Compass re-analyze the collection schema, I get very different results and distributions each time, depending on which documents more or less randomly made it into the sample. How can each of those very different snapshots be similar to the result of analysing all of the data? I think the premise justifying such a sample size is only valid for a limited range of data characteristics. – rexford Mar 12 '18 at 23:37