Estimated size of the self-join operation on a relation R, given a histogram for R

Question

Query optimizers typically use summaries of data distributions to estimate the sizes of the intermediate tables generated during query processing. One popular such summarization scheme is a histogram, whereby the input range is partitioned into buckets and a cumulative count is maintained of the number of tuples falling in each bucket. The distribution within a bucket is assumed to be uniform for the purposes of estimation.

The following shows one such histogram for a relation R on a discrete attribute a with domain [1..10]:

Bucket 1: range = [1..2] Cumulative tuple count = 6 

Bucket 2: range = [3..8] Cumulative tuple count = 30

Bucket 3: range = [9..10] Cumulative tuple count = 10

What is the estimated size of the self-join operation R x R

A) 46
B) 218
C) 248
D) 1,036
E) 5,672

Answer given in solutions : B

How is the answer to be calculated?

score 2 · Answer 1 · edited Feb 08 '15 at 22:15

The size of a self-join on attribute R is equal to the summation of the frequency of each value of attribute R.

Here the frequency is given in buckets, e.g. the first bucket has 2 values r with frequency = 6, so we can assume the frequency of each value in bucket one is frequency = 3, similarly for bucket two frequency of each = 30/6 = 5, and for bucket three frequency of each value = 10/2 = 5.

Therefore, the size is

Size =  [(3^2)*2] + [(5^2)*6] + [(5^2)*2]
     =  218

score 0 · Answer 2 · answered Nov 11 '11 at 16:08

I've been trying to figure this one out myself (it's from the GRE Computer Science subject test preparation exam). So far I haven't found an answer as to why the answer is 218, but I have found a connection between the numbers given and the correct answer.

It turns out that that sum of the square of the cumulative tuple counts divided by the number of discrete values in each bucket, you get 218. Less abstractly: 6²/2 + 30²/5 + 10²/2 = 218.

It's not an answer, but at least there's a connection =)

Estimated size of the self-join operation on a relation R, given a histogram for R

2 Answers2