Let's say you have a dynamically allocated array of words, word, and the number of words in it, words:
char **word;
size_t words;
If you want to know the number of unique words, and the number of times they repeat in the array, you can use a simplified version of a disjoint-set data structure and an array of counts.
The idea is that we have two arrays, each with words elements:
size_t *rootword;
size_t *occurrences;
The rootword array contains, for each word, the index of the first occurrence of that word. The occurrences array contains the total number of occurrences at the index of each first occurrence, and zero at the index of each duplicate.
For example, if words = 5 and word = { "foo", "bar", "foo", "foo", "bar" }, then rootword = { 0, 1, 0, 0, 1 } and occurrences = { 3, 2, 0, 0, 0 }.
To fill in the rootword and occurrences arrays, you first initialize both to the "all words are unique and occur exactly once" state:
for (i = 0; i < words; i++) {
    rootword[i] = i;
    occurrences[i] = 1;
}
Next, you use a double loop. The outer loop iterates over the unique words, skipping the duplicates. We mark duplicates by setting their occurrences count to zero. The inner loop iterates over the later words that are not yet known to be unique or duplicates, and picks off the duplicates of the current unique word:
for (i = 0; i < words; i++) {
    if (occurrences[i] < 1)
        continue;
    for (j = i + 1; j < words; j++)
        if (occurrences[j] == 1 && strcmp(word[i], word[j]) == 0) {
            /* word[j] is a duplicate of word[i]. */
            occurrences[i]++;
            rootword[j] = i;
            occurrences[j] = 0;
        }
}
In the inner loop, we ignore words that are already known to be duplicates. Because j only iterates over indexes after i, and only root words at index i or earlier can have a count above one, occurrences[j] can only be 0 or 1 there. This also speeds up the inner loop for later root words, because we only compare candidate words, not words we have already found a root word for.
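Putting the initialization and the double loop together, a minimal sketch might look like the following. The function name count_unique_words and its return value (the number of unique words) are assumptions made for illustration; the caller is assumed to provide rootword and occurrences arrays with room for words elements each.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name and return value are illustrative only).
   Fills rootword[] and occurrences[], each of which must have room
   for 'words' elements, and returns the number of unique words. */
size_t count_unique_words(char **word, size_t words,
                          size_t *rootword, size_t *occurrences)
{
    size_t unique = 0;
    size_t i, j;

    /* Initially, every word is unique and occurs exactly once. */
    for (i = 0; i < words; i++) {
        rootword[i] = i;
        occurrences[i] = 1;
    }

    for (i = 0; i < words; i++) {
        /* Skip words already marked as duplicates. */
        if (occurrences[i] < 1)
            continue;

        /* word[i] is the first occurrence of a new unique word. */
        unique++;

        for (j = i + 1; j < words; j++) {
            if (occurrences[j] == 1 && strcmp(word[i], word[j]) == 0) {
                /* word[j] is a duplicate of word[i]. */
                occurrences[i]++;
                rootword[j] = i;
                occurrences[j] = 0;
            }
        }
    }

    return unique;
}
```

Counting the unique words inside the outer loop avoids a separate pass over occurrences afterwards. In the worst case (all words different), the double loop still performs on the order of words² string comparisons.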
Let's examine what happens in the loops with the input word = { "foo", "bar", "foo", "foo", "bar" }.
 i ╷ j ╷ rootword  ╷ occurrences ╷ description
───┼───┼───────────┼─────────────┼──────────────────
   │   │ 0 1 2 3 4 │ 1 1 1 1 1   │ initial values
───┼───┼───────────┼─────────────┼──────────────────
 0 │ 1 │           │             │ "foo" != "bar".
 0 │ 2 │     0     │ 2   0       │ "foo" == "foo".
 0 │ 3 │       0   │ 3     0     │ "foo" == "foo".
 0 │ 4 │           │             │ "foo" != "bar".
───┼───┼───────────┼─────────────┼──────────────────
 1 │ 2 │           │             │ occurrences[2] == 0.
 1 │ 3 │           │             │ occurrences[3] == 0.
 1 │ 4 │         1 │   2     0   │ "bar" == "bar".
───┼───┼───────────┼─────────────┼──────────────────
 2 │   │           │             │ j loop skipped, occurrences[2] == 0.
───┼───┼───────────┼─────────────┼──────────────────
 3 │   │           │             │ j loop skipped, occurrences[3] == 0.
───┼───┼───────────┼─────────────┼──────────────────
 4 │   │           │             │ j loop skipped, occurrences[4] == 0.
───┼───┼───────────┼─────────────┼──────────────────
   │   │ 0 1 0 0 1 │ 3 2 0 0 0   │ final state after loops.
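Once the arrays are in their final state, reading the results is straightforward. The sketch below shows one way to do it; the helper names occurrences_of and print_word_counts are hypothetical and not part of the original code. The first looks up the count of any word through its root, and the second reports each unique word once.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: how many times word[i] occurs in the array,
   whether i is a first occurrence or a duplicate. */
size_t occurrences_of(size_t i, const size_t *rootword,
                      const size_t *occurrences)
{
    /* A duplicate has a zero count itself, so follow it to its root. */
    return occurrences[rootword[i]];
}

/* Hypothetical helper: prints each unique word once with its count,
   and returns the number of unique words printed. */
size_t print_word_counts(FILE *out, char **word, size_t words,
                         const size_t *occurrences)
{
    size_t unique = 0;
    size_t i;

    for (i = 0; i < words; i++) {
        /* Only first occurrences have a nonzero count. */
        if (occurrences[i] > 0) {
            fprintf(out, "%s: %zu\n", word[i], occurrences[i]);
            unique++;
        }
    }

    return unique;
}
```

With the final state from the table, print_word_counts(stdout, word, 5, occurrences) prints "foo: 3" and "bar: 2" on separate lines and returns 2, and occurrences_of(3, rootword, occurrences) returns 3, because word[3] is a duplicate of word[0].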