
I need to count how many times each character occurs in a given string. I have to do it in C or C++, and I can use any library. The problem is that I am not a C/C++ developer, so I am not sure my code is optimal. Getting the best-performing algorithm is the main reason for this question.

I am using the following code at the moment:

#include <map>
using namespace std;
...

char* text;        // some text, may be very long
int text_length;   // I know this value, if it can help

map<char,int> table;
map<char,int>::iterator it;
char c;

for (int i = 0; (c = text[i]); i++) {
    it = table.find(c);
    if (it == table.end()) {
        table[c] = 1;
    } else {
        table[c]++;
    }
}

I could use any other structure instead of std::map, but I do not know which structure is better.

Thanks for your help!

Dmitrii Tarasov

4 Answers

You are doing it right using bucket sort. There cannot be a faster (non-parallel) algorithm for counting elements in a finite universe (such as characters).

If you only use ASCII characters, you can use a simple array int table[256] to avoid the overhead of C++ containers.

Using Duff's device to unroll the counting loop by hand (which is actually slower on some CPUs nowadays):

int table[256];                            // indexing with plain char is safe for ASCII (0-127) input
memset(table, 0, sizeof(table));           // needs <string.h>; zeroes all 256 ints
int iterations = (text_length + 7) / 8;    // ceil(text_length / 8) rounds of 8
switch (text_length % 8) {                 // assumes text_length > 0
    case 0:      do {    table[ *(text++) ]++;
    case 7:              table[ *(text++) ]++;
    case 6:              table[ *(text++) ]++;
    case 5:              table[ *(text++) ]++;
    case 4:              table[ *(text++) ]++;
    case 3:              table[ *(text++) ]++;
    case 2:              table[ *(text++) ]++;
    case 1:              table[ *(text++) ]++;
                 } while (--iterations > 0);
}

Update: As MRAB remarked, processing chunks of the text in parallel might give you a performance boost. But be aware that creating a thread is quite expensive, so you should measure the minimum input size that justifies the thread-creation cost.
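
As a minimal sketch of that idea (assuming C++11 std::thread, which postdates this answer; the helper name parallel_count and the even chunking are illustrative, not something the answer prescribes): each thread fills a private table so no locking is needed, and the per-thread tables are summed at the end.

#include <string.h>
#include <thread>
#include <vector>

// Hypothetical helper: counts the bytes of text[0..len) into table[0..256).
void parallel_count(const char* text, int len, int* table, int nthreads) {
    // one private 256-entry table per thread, so threads never share a counter
    std::vector<std::vector<int>> local(nthreads, std::vector<int>(256, 0));
    std::vector<std::thread> threads;
    int chunk = len / nthreads;
    for (int t = 0; t < nthreads; t++) {
        int begin = t * chunk;
        int end = (t == nthreads - 1) ? len : begin + chunk; // last thread takes the tail
        threads.emplace_back([&local, text, t, begin, end] {
            for (int i = begin; i < end; i++)
                local[t][(unsigned char)text[i]]++;
        });
    }
    for (std::thread& th : threads)
        th.join();                          // wait for all chunks to finish
    memset(table, 0, 256 * sizeof(int));
    for (int t = 0; t < nthreads; t++)      // merge the per-thread tables
        for (int c = 0; c < 256; c++)
            table[c] += local[t][c];
}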

Kijewski
  • If the string is very large, you could perform bucket sorts in parallel on different substrings and then combine the results. – MRAB Jul 31 '11 at 19:56
  • @MRAB: Right. If OP does this process often, he could even measure the minimum amount of characters needed to justify the thread creation time. – Kijewski Jul 31 '11 at 20:02
  • That has got to be the most hideously unreadable code I have ever read. – Puppy Jul 31 '11 at 21:09
  • A) It's not a bucket sort. B) Duff's device (or an equivalent) is performed automatically by any reasonably good optimizing compiler -- no need to code it. – Hot Licks Jul 31 '11 at 21:45

You could make an array of 256 ints, one for each possible character value.

Initialize them all to 0, then for each character you see, increment the entry at that character's ASCII value.
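
A minimal sketch of that approach, reusing the text and text_length variables from the question (the unsigned char cast is an added precaution: plain char may be signed, which would make bytes above 127 produce negative indexes):

#include <stdio.h>

int counts[256] = {0};                    // one slot per possible byte value
for (int i = 0; i < text_length; i++)
    counts[(unsigned char)text[i]]++;     // cast guards against a signed char

for (int c = 0; c < 256; c++)             // print the characters that occurred
    if (counts[c] > 0)
        printf("'%c': %d\n", c, counts[c]);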

Yochai Timmer

Just use a 256-entry table and index the table by the character value.

int table[256];
// memset(table, 0, 256) would be wrong: it clears 256 bytes, not 256 ints
memset(table, 0, sizeof(table));
for (int i = 0; i < text_length; i++) {
    table[(unsigned char)text[i]]++;   // cast: plain char may be signed
}
Hot Licks

You can use a hash map for O(1) expected insertion and lookup, which gives you O(n) total runtime instead of the O(n log n) of std::map. You can find one in Boost, TR1, or C++0x.
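
A minimal sketch, assuming std::unordered_map as standardized in C++11 (what C++0x became), reusing text and text_length from the question:

#include <cstdio>
#include <unordered_map>

std::unordered_map<char, int> counts;
for (int i = 0; i < text_length; i++)
    counts[text[i]]++;                    // operator[] value-initializes new counts to 0

for (const auto& entry : counts)
    std::printf("'%c': %d\n", entry.first, entry.second);

Note that a hash map still carries per-element overhead, so for plain 8-bit characters the flat 256-entry array from the other answers will usually be faster in practice.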

Puppy