
I'm reading Algorithms, 4th edition, and I have some questions about Chapter 3, Searching. The cost summary says the worst-case insert cost of BinarySearchST (2N) is a little worse than that of SequentialSearchST (N). But the FrequencyCounter test with VisualAccumulator (which draws plots) shows:

Returning to the cost of the put() operations for FrequencyCounter for words of length 8 or more, we see a reduction in the average cost from 2,246 compares (plus array accesses) per operation for SequentialSearchST to 484 for BinarySearchST.

Shouldn't the put() operations of BinarySearchST need more compares (plus array accesses) than SequentialSearchST?

Another question: for BinarySearchST, the book says

Proposition B (continued). Inserting a new key into an ordered array of size N uses ~2N array accesses in the worst case, so inserting N keys into an initially empty table uses ~N^2 array accesses in the worst case

When I look at the code of BinarySearchST, I think inserting a new key into an ordered array of size N uses ~4N array accesses:

    public void put(Key key, Value val) {
        if (key == null) throw new IllegalArgumentException("first argument to put() is null");

        if (val == null) {
            delete(key);
            return;
        }

        int i = rank(key);

        // key is already in table
        if (i < n && keys[i].compareTo(key) == 0) {
            vals[i] = val;
            return;
        }

        // insert new key-value pair
        if (n == keys.length) resize(2*keys.length);

        for (int j = n; j > i; j--) {
            keys[j] = keys[j-1];
            vals[j] = vals[j-1];
        }
        keys[i] = key;
        vals[i] = val;
        n++;

        assert check();
    }

Because every iteration of the loop makes 4 array accesses: 2 to read and write keys, and 2 to read and write vals. So why does prop B say it uses ~2N array accesses?

Grantly
Chen Li

4 Answers


Shouldn't the put() operations of BinarySearchST need more compares (plus array accesses) than SequentialSearchST?

The key thing to understand is where the complexity comes from in each of these two symbol-table implementations. SequentialSearchST reaches its worst case when the input key is not present, because then it must compare against all N keys before declaring a miss. Depending on the input text, this can happen quite often. And even when the key is already there, finding it sequentially takes N/2 compares on average.
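
For reference, here is a minimal sketch of what such a linked-list put() looks like, modeled on the algs4 SequentialSearchST (Node is assumed to be the usual private node class with key, val, and next fields):

    public void put(Key key, Value val) {
        // scan the list: a hit overwrites the value after at most N compares
        for (Node x = first; x != null; x = x.next) {
            if (key.equals(x.key)) {
                x.val = val;
                return;
            }
        }
        // a miss has already cost N compares; the insertion itself
        // is O(1) because the new node is simply prepended
        first = new Node(key, val, first);
    }

This is why SequentialSearchST pays for search, not insertion: the prepend is constant-time, but every miss first traverses the whole list.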

As for BinarySearchST, searching for a key costs ~log N compares in the worst case, so here the complexity comes from resizing the array and/or from moving the existing elements to the right to make room for a new key. Notice that when the key is missing you must make N/2 moves on average, while when the key is already there you only need ~log N compares on average. The total running time therefore depends heavily on the distribution of the keys - if new keys keep coming, the running time will be higher!
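
The ~log N search is the usual binary search over the sorted keys array; a sketch in the style of the algs4 rank() method:

    public int rank(Key key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            // the interval is halved on every iteration: ~log N compares
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(keys[mid]);
            if      (cmp < 0) hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else return mid;   // hit: position of the key
        }
        return lo;             // miss: number of keys smaller than key
    }

On a miss, the returned rank is exactly the index where put() must open a gap, which is what makes the subsequent shift cost up to N moves.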

The test they performed used the text of A Tale of Two Cities by Charles Dickens, taking only words with 8 letters or more. There are 14350 such words, of which 5737 are distinct. After 14350 put() operations with 5737 keys in the table, you would expect about 5737 / 2 = 2868 compares to perform another put() in SequentialSearchST. In fact it does a bit better than that: "only" 2246 compares. BinarySearchST's running time depends significantly on whether the key is already present; the experiment showed that for this text there were far more O(log N) searches of existing keys than O(N) element moves for inserting new keys, which combined gives a smaller cost than SequentialSearchST. Do not mix up average- and worst-case running times: this analysis relies on the average-case complexity for this specific example.

When I look at the code of BinarySearchST, I think inserting a new key into an ordered array of size N uses ~4N array accesses.

The authors should have clarified the exact definition of an access. If referencing an array element counts as an access, then there are even more: 8N array accesses in the worst case, because you must first resize the whole array (take a look at the implementation of resize()). Of course, the whole implementation could be rewritten to optimize the number of accesses in this case, by putting the new key in the right place during the resize pass.
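
For reference, a resize() along the lines of the algs4 implementation; copying each array touches every element once for reading and once for writing, which is where the extra ~4N worst-case accesses come from:

    private void resize(int capacity) {
        assert capacity >= n;
        Key[]   tempk = (Key[])   new Comparable[capacity];
        Value[] tempv = (Value[]) new Object[capacity];
        for (int i = 0; i < n; i++) {
            tempk[i] = keys[i];   // read keys[i], write tempk[i]
            tempv[i] = vals[i];   // read vals[i], write tempv[i]
        }
        keys = tempk;
        vals = tempv;
    }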

Miljen Mikic
  • BinarySearchST does not use logN/2 compares for search on average – Panic Jan 15 '18 at 10:50
  • I was able to replicate the results for SequentialSearchST, but not for BinarySearchST. According to my results, BinarySearchST moves 1275 array entries on average in each array (`keys` and `vals`), which means that the average cost is about 5100 array accesses (much more than 484). – Panic Jan 15 '18 at 13:27
  • @Panic if it's sorted, I think C * (ln(n+1) / ln(2)) is accurate -- you're right of course. it's definitely O(log(N)), you're cutting the search set in half with every iteration. a hash is usually better at O(1). must have memory restrictions – Abdul Ahad Jan 15 '18 at 14:42
  • @Panic Interesting. How did you count array entries - did you use the same tool as they? Citing, "for the ith put() operation we plot a gray point with x coordinate i and y coordinate the number of key compares it uses and **a red point with x coordinate i and y coordinate the cumulative average number of key compares used for the first i put() operations**" – Miljen Mikic Jan 15 '18 at 15:23
  • @Panic I believe the key is to observe only a single array while counting (hence their 2N operations in the worst case for BinarySearchST). Let's try to estimate the average number of operations in BinarySearchST: there are 5737 misses and 8613 hits. If we take N operations for misses and log N for hits, the average is ((1 + 2 + .. + 5737) + 1.5 * (log(1) + log(2) + .. + log(5737))) / 14350 = (5737 * 5738 / 2 + 1.5 * log(5737!)) / 14350 ≈ (5737 * 5738 / 2 + 1.5 * 5737 log(5737)) / 14350 = 1154. That is around double their number, and very close to your measurement. – Miljen Mikic Jan 15 '18 at 17:14
  • what is a pointer? also, it might be better to try to append multiple records and then sort only once – Abdul Ahad Jan 15 '18 at 21:29
  • @AbdulAhad Actually, SequentialSearchST uses an unordered list, so adding a new element does not require shifting all elements. However, there is a huge cost of sequential search for each key. And yes, there are much better implementations of a symbol table, such as binary search trees and hash tables, that are covered later in that course. – Miljen Mikic Jan 15 '18 at 21:34
  • @MiljenMikic yeah, so the binary search is expensive to put, and the sequential search is expensive to get. It's basically a trade-off, but I'm not sure it makes sense to compare the put for the two methods because a sequential list has instant inserts. I'm not sure I understand what's being asked. the expense of the put can be offset if you expect to insert many records at a time between gets and don't sort every time. academically interesting anyway. you should also consider a symbol table size of 1,000,000,000,000 or some other number like that when deciding between the two – Abdul Ahad Jan 15 '18 at 21:58
  • @AbdulAhad In practice neither should be used as a symbol table (like I mentioned, there are more efficient implementations). It is interesting to compare them though, because sequential search "eats" all the benefit of the O(1) insert, and put() ends up more efficient in BinarySearchST than in SequentialSearchST. – Miljen Mikic Jan 15 '18 at 22:12
  • @MiljenMikic BinarySearchST moves N/2 elements on average when inserting a new key, therefore 5131*5130*0.5*0.5/14350=459 elements are moved. Search cost is negligible in this case because it is less than (14350-5131)*log2(5131)/14350=7.9. I was able to verify these numbers (previously I made a trivial mistake in my measurements). Also there are 5131 distinct keys, not 5737 (it is mentioned in the errata https://algs4.cs.princeton.edu/errata/errata-printing4.php) – Panic Jan 16 '18 at 17:29
  • @Panic I avoided another 2 in denominator in my calculation because there are 2 arrays moved, but yes, that was probably their logic. Nice! – Miljen Mikic Jan 16 '18 at 18:01

"Shouldn't the put() operations of BinarySearchST need more compares(plus array accesses) than SequentialSearchST?"

No, because previously the book talks about the WORST case.

Worst and average cases are different. The next sentence of the book reads: "As before, this cost is even better than would be predicted by analysis, and the extra improvement is likely again explained by properties of the application ..."

"So why prop B says it uses ~2N array accesses?"

Up to a point you are right: formally there are 4N accesses, but

what if we rewrite the loop as:

    keys[j]   = keys[j-1];
    keys[j-1] = keys[j-2];
    keys[j-2] = keys[j-3];
    ...
    keys[i+1] = keys[i];

will it mean that we still use 4N accesses? I assume the JIT compiler can optimize the loop in the right way.

We can also note that arrays are laid out linearly in memory and computers read data in pages, so the page containing the element just accessed is likely already in a cache.
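
As a concrete alternative (also suggested in the comments below), the hand-written shift can be replaced by System.arraycopy, which the JVM implements as an optimized block move; a sketch against the put() code from the question:

    // shift keys[i..n-1] and vals[i..n-1] one slot to the right to open
    // a gap at index i -- same effect as the for loop in put()
    System.arraycopy(keys, i, keys, i+1, n - i);
    System.arraycopy(vals, i, vals, i+1, n - i);
    keys[i] = key;
    vals[i] = val;
    n++;

System.arraycopy is specified to behave as if the source range were first copied to a temporary array, so the overlapping ranges here are handled correctly.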

zenwraight
yvs
  • What is the right way for the JVM to optimize the loop? Btw, the loop can be replaced with something like `System.arraycopy(keys, i, keys, i+1, n-i)` – Panic Jan 10 '18 at 21:39
  • Sure, it can be replaced! I think it is even better to do this. Good IDEs can offer to convert it if the code looks like: `for (int j = n; j > i; j--) { keys[j] = keys[j-1]; } for (int j = n; j > i; j--) { vals[j] = vals[j-1]; }` – yvs Jan 11 '18 at 11:56
  • The rest of this does not relate to algorithms, and I assume it is not very important for algorithm theory, but for computer science it is important to understand that modern processors have different types of memory: virtual, RAM, and the L3, L2, and L1 caches, with very different access times. So when you access `keys[i]`, under some conditions it could go to virtual memory on disk, while just after that `keys[i+1]` could be located in L1 and accessed a thousand times faster – yvs Jan 11 '18 at 11:58

If a binary search tree is "balanced", there will be far fewer comparisons.

    1         d
            /   \
    2     b       f
        /   \   /   \
    3  a     c e     g

In the worst case, "unbalanced", there will be more, on the same "order" as sequential search. The reduction when the tree is balanced is not linear; I think the cost is C * (ln(n+1) / ln(2)), or just O(log(N)) for short. So for millions of records there are far fewer comparisons.

    1  a
        \
    2    b
          \
    3      c
            \
    4        d
              \
    5          e
                \
    6            f
                  \
    7              g

If it's only a little unbalanced, the result will be somewhere in the middle.

    1         d
            /   \
    2     b       e
        /   \       \
    3  a     c        f
                       \
    4                    g

I'm not sure that your code is optimal, but if the book says there are twice as many operations in the worst case, it's probably accurate. Try to get it to 2x at each level if you're interested in the details for academic reasons.

I wouldn't worry about the value of C - you probably only want to use a BST if you know in advance it's going to be balanced, or close to balanced, based on your insertion/update method, because O(N) will probably be catastrophic. Consider 40 * (ln(1,000,000,000,000+1) / ln(2)) versus 1 * 1,000,000,000,000.

Abdul Ahad
  • This question is not about BST. It is about two symbol table implementations: 1. sequential search in a linked list 2. binary search in a sorted array – Panic Jan 14 '18 at 19:38
  • @Panic really, hehe, I just scanned the question in 5 seconds or so, sorry hehe, I think the numbers are the same for a sorted array and a balanced BST, probably requiring a BST sort. you can assume the BST is saved as an array, it would only affect C – Abdul Ahad Jan 15 '18 at 14:05

The point about BinarySearchST vs SequentialSearchST performance in the average case was already covered in the other answers.

Concerning the second question: the ~2N figure is per array, and it's true. BinarySearchST uses 2 arrays, but either way, when you insert N keys into an initially empty table you get ~N^2 array accesses; the difference is only a constant multiplier. Either you have 2 + 4 + 6 + ... + 2N accesses or twice that - either way it's ~N^2.
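
Writing the sum out makes the asymptotics explicit (the constant factor does not survive the tilde notation):

$$\sum_{k=1}^{N} 2k = 2 \cdot \frac{N(N+1)}{2} = N(N+1) \sim N^2$$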

algrid
  • @Panic The second question was concerned with the complexity of the loop: `for (int j = n; j > i; j--) { keys[j] = keys[j-1]; vals[j] = vals[j-1]; }` So how the hell will we get "when you're inserting into an initially empty tree N times you get ~N^2 operations" from this loop? – yvs Jan 16 '18 at 01:13
  • ~N^2 is not even close – yvs Jan 16 '18 at 01:21
  • @yvs In BinarySearchST an insert is O(N), therefore N inserts are O(N^2) – Panic Jan 16 '18 at 06:50