
I need to count how many times each character occurs in a given string. I have to do it in C or C++, and I can use any library. The problem is that I am not a C/C++ developer, so I am not sure my code is optimal. Getting the best-performing algorithm is the main reason for this question.

I am using the following code at the moment:

#include <map>
using namespace std;
...

char* text;        // some text, may be very long
int text_length;   // I know this value, if it can help

map<char,int> table;
map<char,int>::iterator it;
char c;

for (int i = 0; (c = text[i]); i++) {
    it = table.find(c);
    if (it == table.end()) {
        table[c] = 1;
    } else {
        table[c]++;
    }
}

I could use any other structure instead of std::map, but I do not know which structure is better.

Thanks for your help!

Dmitrii Tarasov

4 Answers

You are doing it right using bucket sort. There cannot be a faster (non-parallel) algorithm for counting elements in a finite universe (such as characters).

If you only use ASCII characters, you can use a simple array int table[256] to avoid the overhead of C++ containers.

Using Duff's device to unroll the counting loop by hand (which is actually slower on some CPUs nowadays):

int table[256];                            // indexing with plain char is safe for ASCII (0-127) input
memset(table, 0, sizeof(table));           // needs <string.h>; zeroes all 256 ints
int iterations = (text_length + 7) / 8;    // ceil(text_length / 8) rounds of 8
switch (text_length % 8) {                 // assumes text_length > 0
    case 0:      do {    table[ *(text++) ]++;
    case 7:              table[ *(text++) ]++;
    case 6:              table[ *(text++) ]++;
    case 5:              table[ *(text++) ]++;
    case 4:              table[ *(text++) ]++;
    case 3:              table[ *(text++) ]++;
    case 2:              table[ *(text++) ]++;
    case 1:              table[ *(text++) ]++;
                 } while (--iterations > 0);
}

Update: As MRAB remarked, processing chunks of the text in parallel might give you a performance boost. But be aware that creating a thread is quite expensive, so you should measure the minimum input size that justifies the thread-creation cost.
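
As a minimal sketch of that idea (assuming C++11 std::thread, which postdates this answer; the helper name parallel_count and the even chunking are illustrative, not something the answer prescribes): each thread fills a private table so no locking is needed, and the per-thread tables are summed at the end.

#include <string.h>
#include <thread>
#include <vector>

// Hypothetical helper: counts the bytes of text[0..len) into table[0..256).
void parallel_count(const char* text, int len, int* table, int nthreads) {
    // one private 256-entry table per thread, so threads never share a counter
    std::vector<std::vector<int>> local(nthreads, std::vector<int>(256, 0));
    std::vector<std::thread> threads;
    int chunk = len / nthreads;
    for (int t = 0; t < nthreads; t++) {
        int begin = t * chunk;
        int end = (t == nthreads - 1) ? len : begin + chunk; // last thread takes the tail
        threads.emplace_back([&local, text, t, begin, end] {
            for (int i = begin; i < end; i++)
                local[t][(unsigned char)text[i]]++;
        });
    }
    for (std::thread& th : threads)
        th.join();                          // wait for all chunks to finish
    memset(table, 0, 256 * sizeof(int));
    for (int t = 0; t < nthreads; t++)      // merge the per-thread tables
        for (int c = 0; c < 256; c++)
            table[c] += local[t][c];
}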

Kijewski
  • If the string is very large, you could perform bucket sorts in parallel on different substrings and then combine the results. – MRAB Jul 31 '11 at 19:56
  • @MRAB: Right. If OP does this process often, he could even measure the minimum amount of characters needed to justify the thread creation time. – Kijewski Jul 31 '11 at 20:02
  • That has got to be the most hideously unreadable code I have ever read. – Puppy Jul 31 '11 at 21:09
  • A) It's not a bucket sort. B) Duff's device (or an equivalent) is performed automatically by any reasonably good optimizing compiler -- no need to code it. – Hot Licks Jul 31 '11 at 21:45

You could make an array of 256 ints, one for each possible character value.

Initialize them all to 0, then for each character you see, increment the entry at that character's ASCII value.
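
A minimal sketch of that approach, reusing the text and text_length variables from the question (the unsigned char cast is an added precaution: plain char may be signed, which would make bytes above 127 produce negative indexes):

#include <stdio.h>

int counts[256] = {0};                    // one slot per possible byte value
for (int i = 0; i < text_length; i++)
    counts[(unsigned char)text[i]]++;     // cast guards against a signed char

for (int c = 0; c < 256; c++)             // print the characters that occurred
    if (counts[c] > 0)
        printf("'%c': %d\n", c, counts[c]);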

Yochai Timmer

Just use a 256-entry table and index the table by the character value.

int table[256];
// memset(table, 0, 256) would be wrong: it clears 256 bytes, not 256 ints
memset(table, 0, sizeof(table));
for (int i = 0; i < text_length; i++) {
    table[(unsigned char)text[i]]++;   // cast: plain char may be signed
}
Hot Licks

You can use a hash map for O(1) expected insertion and lookup, which gives you O(n) total runtime instead of the O(n log n) of std::map. You can find one in Boost, TR1, or C++0x.
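
A minimal sketch, assuming std::unordered_map as standardized in C++11 (what C++0x became), reusing text and text_length from the question:

#include <cstdio>
#include <unordered_map>

std::unordered_map<char, int> counts;
for (int i = 0; i < text_length; i++)
    counts[text[i]]++;                    // operator[] value-initializes new counts to 0

for (const auto& entry : counts)
    std::printf("'%c': %d\n", entry.first, entry.second);

Note that a hash map still carries per-element overhead, so for plain 8-bit characters the flat 256-entry array from the other answers will usually be faster in practice.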

Puppy