Karatsuba multiplication improvement

Question

I have implemented the Karatsuba multiplication algorithm for educational purposes, and now I am looking for further improvements. I have implemented a kind of long arithmetic, and it works well as long as the base of the integer representation does not exceed 100. With base 10 and compiling with clang++ -O3, multiplying two random integers in the range [10^50000, 10^50001] takes:

Naive algorithm took me 1967 cycles (1.967 seconds)
Karatsuba algorithm took me 400 cycles (0.4 seconds)

And the same numbers with base 100:

Naive algorithm took me 409 cycles (0.409 seconds)
Karatsuba algorithm took me 140 cycles (0.14 seconds)

Is there a way to improve these results? Currently I use the following function to finalize my result:

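void finalize(vector<int>& res) {
    // Propagate carries from the least significant digit upward.
    // Stop one short of the end so that res[i + 1] stays in bounds;
    // res is assumed to be sized so that the top digit ends up below base.
    for (size_t i = 0; i + 1 < res.size(); ++i) {
        res[i + 1] += res[i] / base;
        res[i] %= base;
    }
}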

As you can see, at each step it calculates the carry and pushes it to the next digit. If I take base >= 1000, the result overflows.
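A minimal sketch of the same carry propagation with a wider accumulator, in case it helps frame the problem. The name `normalize` and the choice of `std::int64_t` for the unnormalized column sums are assumptions made for illustration, not part of the code above; the sketch also assumes that each sum plus its incoming carry still fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical variant of finalize(): takes unnormalized column sums held in
// 64-bit integers and returns digits in [0, base), least significant first.
std::vector<int> normalize(const std::vector<std::int64_t>& sums, int base) {
    assert(base >= 2);
    std::vector<int> digits;
    digits.reserve(sums.size() + 1);
    std::int64_t carry = 0;
    for (std::int64_t s : sums) {
        assert(s >= 0);
        std::int64_t cur = s + carry;  // assumed not to overflow int64_t
        digits.push_back(static_cast<int>(cur % base));
        carry = cur / base;
    }
    // The final carry may need more than one extra digit.
    while (carry > 0) {
        digits.push_back(static_cast<int>(carry % base));
        carry /= base;
    }
    return digits;
}
```

For example, the column sums {25, 13} in base 10 stand for 25 + 13 * 10 = 155, and the function returns the digits 5, 5, 1. Appending digits for the final carry also avoids writing past the end of the vector.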

As you can see in my code, I use vectors of int to represent a long integer. Depending on the base, a number is split into separate elements of the vector. I see several options:

use the long long type for the vector elements, although it might also overflow for very long integers
implement carry handling in the long arithmetic itself

After seeing some comments I decided to expand the question. Assume that we want to represent a long integer as a vector of ints. For instance:

ULLONG_MAX = 18446744073709551615

As input we pass the 210th Fibonacci number, 34507973060837282187130139035400899082304280, which does not fit in any standard type. If we represent it as a vector of int with base 10000000, it looks like this:

v[0]: 2304280
v[1]: 89908
v[2]: 1390354
v[3]: 2187130
v[4]: 6083728
v[5]: 5079730
v[6]: 34
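The split above can be reproduced with a short helper. This is only a sketch: the name `to_limbs` and the `width` parameter are made up for illustration, the input is assumed to contain decimal digits only, and `width` is assumed to be at most 9 so that every chunk fits in an int.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a decimal string into chunks of `width` decimal
// digits (base 10^width), least significant chunk first.
std::vector<int> to_limbs(const std::string& decimal, std::size_t width = 7) {
    std::vector<int> limbs;
    for (std::size_t end = decimal.size(); end > 0;) {
        // Take up to `width` characters ending at position `end`.
        std::size_t begin = end > width ? end - width : 0;
        limbs.push_back(std::stoi(decimal.substr(begin, end - begin)));
        end = begin;
    }
    return limbs;
}
```

Leading zeros inside a chunk are dropped by `std::stoi`, which is why the chunk 0089908 shows up as 89908 in v[1].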

And when we do the multiplication (for simplicity, let the two numbers be identical), (34507973060837282187130139035400899082304280)^2, we get:

v[0] * v[0] = 5309706318400
...
v[0] * v[4] = 14018612755840
...

That was only the first row, and there are six more rows like it. Certainly, some step will overflow an int, either during the multiplication or after the carry calculation.
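One common way to sidestep this kind of overflow is to compute each digit product in a 64-bit integer and reduce it modulo the base immediately, so that only values below the base are ever stored. The sketch below is schoolbook multiplication, not Karatsuba; the name `multiply_digits` is hypothetical, digits are assumed to be in [0, base), and the base is assumed to fit in an int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical schoolbook multiplication, digits least significant first.
std::vector<int> multiply_digits(const std::vector<int>& a,
                                 const std::vector<int>& b, int base) {
    assert(base >= 2);
    std::vector<int> res(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::int64_t carry = 0;
        // Keep going past the end of b until the carry is used up.
        for (std::size_t j = 0; j < b.size() || carry != 0; ++j) {
            std::int64_t bj = j < b.size() ? b[j] : 0;
            // cur < base * base, which fits in int64_t when base fits in int.
            std::int64_t cur = res[i + j] + a[i] * bj + carry;
            res[i + j] = static_cast<int>(cur % base);
            carry = cur / base;
        }
    }
    // Drop leading zero digits, keeping at least one digit.
    while (res.size() > 1 && res.back() == 0) res.pop_back();
    return res;
}
```

With base 10000000, multiplying the single digit 2304280 by itself gives the digits 6318400 and 530970, which is 5309706318400 split at the base.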

If I missed something, please, let me know and I will change it. If you want to see full version, it is on my github

If you have working code then perhaps your question is better suited for http://codereview.stackexchange.com/? — EdChum, Jul 10 '15 at 10:13
I'm voting to close this question as off-topic because it belongs to codereview. There, You would need to post the code in your post rather than linking to your github. You only have to post relevant parts. — UmNyobe, Jul 10 '15 at 10:16
If you really want to accelerate, you should then use another faster algorithm: Toom-Cook for example or Fourier transform. — Jean-Baptiste Yunès, Jul 10 '15 at 10:17
@UmNyobe I'm really sorry, didn't notice that. I'll delete the comment. Mark this as obsolete when you see it. — Ismael Miguel, Jul 10 '15 at 10:25
@EdChum Please read this check-list before recommending a migration to Code Review http://meta.codereview.stackexchange.com/questions/1687/what-questions-are-suitable-for-migration-to-code-review-and-how-does-the-proce/1689#1689 — jacwah, Jul 10 '15 at 10:25
@UmNyobe thanks for your comment, I made some changes in my topic in the problem part that needs to be improved as I think. — vpetrigo, Jul 10 '15 at 10:26

score 0 · Answer 1 · answered Jul 10 '15 at 10:28

Base 2^64 and base 2^32 are the most popular bases for doing high precision arithmetic. Usually, the digits are stored in an unsigned integral type, because they have well-behaved semantics with regard to overflow.

For example, one can detect the carry from an addition as follows:

uint64_t x, y; // initialize somehow
uint64_t sum = x + y;
uint64_t carry = sum < x; // 1 if true, 0 if false
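The same comparison trick extends to whole numbers stored as vectors of base 2^64 digits. This is a rough sketch; the function name `add_limbs` and the least-significant-first layout are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: add two base-2^64 numbers, least significant digit first.
std::vector<std::uint64_t> add_limbs(const std::vector<std::uint64_t>& a,
                                     const std::vector<std::uint64_t>& b) {
    const std::size_t n = std::max(a.size(), b.size());
    std::vector<std::uint64_t> out(n);
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t x = i < a.size() ? a[i] : 0;
        std::uint64_t y = i < b.size() ? b[i] : 0;
        std::uint64_t sum = x + y;     // wraps modulo 2^64
        std::uint64_t c1 = sum < x;    // 1 if x + y wrapped
        sum += carry;
        std::uint64_t c2 = sum < carry;  // 1 if adding the carry wrapped
        out[i] = sum;
        carry = c1 | c2;               // both can never be set at once
    }
    if (carry != 0) out.push_back(1);
    return out;
}
```

Adding 1 to the one-digit number 2^64 - 1 gives the two digits 0 and 1, that is, 2^64.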

Also, assembly languages usually have a few "add with carry" instructions; if you can write inline assembly (or have access to intrinsics) you can take advantage of these.

For multiplication, most computers have machine instructions that can compute a one machine word -> two machine word product; sometimes, the instructions to get the two halves are called "multiply hi" and "multiply low". You need to write assembly to get them, although many compilers offer larger integer types whose use would let you access these instructions: e.g. in gcc you can implement multiply hi as

uint64_t mulhi(uint64_t x, uint64_t y)
{
    return ((__uint128_t) x * y) >> 64;
}

When people can't use this, they do multiplication in base 2^32 instead, so that they can use the same approach to implement a portable mulhi instruction, using uint64_t as the double-digit type.
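A sketch of what that portable version might look like. The names `mulhi32` and `mullo32` are made up here; the only assumption is that `uint64_t` is available as the double-digit type.

```cpp
#include <cstdint>

// High base-2^32 digit of the product of two base-2^32 digits.
std::uint32_t mulhi32(std::uint32_t x, std::uint32_t y) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(x) * y) >> 32);
}

// Low base-2^32 digit of the same product.
std::uint32_t mullo32(std::uint32_t x, std::uint32_t y) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(x) * y);
}
```

For instance, (2^32 - 1)^2 = 0xFFFFFFFE00000001, so the high digit is 0xFFFFFFFE and the low digit is 1.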

If you want to write efficient code, you really need to take advantage of these wider multiply instructions. One digit in base 2^32 carries about 9.6 decimal digits, and schoolbook multiplication does work proportional to the square of the digit count, so a single digit multiplication in base 2^32 does the work of more than ninety digit multiplications in base 10. A digit multiplication in base 2^64 does four times as much again. And your computer can probably perform these more quickly than whatever you implement for base 10 multiplication.
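To make this concrete, here is a rough sketch of schoolbook multiplication on base 2^32 digits with uint64_t as the double-digit type. The name `multiply_limbs` and the least-significant-first layout are assumptions for illustration; a Karatsuba implementation could use a routine like this as its base case.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical schoolbook multiplication in base 2^32, least significant
// digit first. The result always has a.size() + b.size() digits.
std::vector<std::uint32_t> multiply_limbs(const std::vector<std::uint32_t>& a,
                                          const std::vector<std::uint32_t>& b) {
    std::vector<std::uint32_t> res(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            // (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1, so cur cannot overflow.
            std::uint64_t cur = static_cast<std::uint64_t>(a[i]) * b[j]
                              + res[i + j] + carry;
            res[i + j] = static_cast<std::uint32_t>(cur);  // low digit
            carry = cur >> 32;                             // high digit
        }
        // This slot has not been written by earlier rows, so plain assignment works.
        res[i + b.size()] = static_cast<std::uint32_t>(carry);
    }
    return res;
}
```

No division or modulo by the base is needed: splitting the double digit is a truncation and a shift, which is part of why power-of-two bases are popular.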

But as I said, I use long arithmetic based on a vector of ints to represent huge numbers, as even a plain `uint64_t` cannot hold an integer with more than 20 digits. — vpetrigo, Jul 10 '15 at 10:32
@vpetrigo: But a *two* digit number in base `2^64` can store a number with 20 decimal digits. — , Jul 10 '15 at 10:54
