4

This piece of code is from the dot-product method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).

When the vector length is an odd number (262145), the compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), the compute time is 1.93 seconds.

     time1=System.nanoTime();
     int count=0;
     for(int j=0;j<1000;i++)
     {

             b=vektors[i]; // selects next vector(b) to multiply as inner product.
                           // each vector has an array of float elements.

             if(((N/2)*2)!=N)
             {
                 for(int i=0;i<N;i++)
                 {
                     t1+=elements[i]*b.elements[i];
                 }
             }
             else if(((N/8)*8)==N)
             {
                 float []vek=new float[8];
                 for(int i=0;i<(N/8);i++)
                 {
                     vek[0]=elements[i]*b.elements[i];
                     vek[1]=elements[i+1]*b.elements[i+1];
                     vek[2]=elements[i+2]*b.elements[i+2];
                     vek[3]=elements[i+3]*b.elements[i+3];
                     vek[4]=elements[i+4]*b.elements[i+4];
                     vek[5]=elements[i+5]*b.elements[i+5];
                     vek[6]=elements[i+6]*b.elements[i+6];
                     vek[7]=elements[i+7]*b.elements[i+7];


                     t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
                     //t1 is total sum of all dot products.
                 }
             }
     }
     time2=System.nanoTime();
     time3=(time2-time1)/1000000000.0; //seconds

Question: Could the reduction in time from 4.37 s to 1.93 s (about 2x faster) be the JIT's wise decision to use SIMD instructions, or is it just the positive effect of my loop unrolling?

If the JIT cannot do SIMD optimization automatically, does that mean that in this example the JIT also does no automatic unrolling? Is that true?

For 1M iterations (vectors) and a vector size of 64, the speedup goes up to 3.5x (a cache advantage?).

Thanks.

huseyin tugrul buyukisik

2 Answers

8

Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?

Your first loop does this, indented more conventionally:

 for(int j=0;j<1000;i++) {
     b=vektors[i]; // selects next vector(b) to multiply as inner product.
                   // each vector has an array of float elements.
 }

Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.

Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
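
For illustration, here's a rough sketch (my own code, with a made-up method name, not a drop-in fix for yours) of an unrolled dot product that strides by 8 over the whole vector and keeps several independent partial sums, so no single chain of adds has to wait on all the previous ones:

    // Sketch only: unrolled dot product with a stride of 8 and four
    // independent accumulators, plus a scalar tail for leftover elements.
    static float dotUnrolled(float[] a, float[] b, int n) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            s0 += a[i]     * b[i]     + a[i + 1] * b[i + 1];
            s1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
            s2 += a[i + 4] * b[i + 4] + a[i + 5] * b[i + 5];
            s3 += a[i + 6] * b[i + 6] + a[i + 7] * b[i + 7];
        }
        for (; i < n; i++) {
            s0 += a[i] * b[i]; // handle lengths that aren't a multiple of 8
        }
        return s0 + s1 + s2 + s3;
    }

Note that this adds the products up in a different order than the simple loop, so the float result can differ in the last bits; that reassociation is exactly the kind of thing the JVM won't do for you automatically.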

I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.

tmyklebu
  • I'm sorry, I was going to delete that "}" (a debugging leftover) but forgot to. – huseyin tugrul buyukisik Jul 03 '13 at 23:09
  • Ugh. Yeah, it's kind of annoying to make it work. You need something like `hsdis-amd64.so` to be found by the JVM. You need to pass `-XX:+UnlockDiagnosticVMOptions` before `-XX:+PrintAssembly`. And you have to capitalise everything right. – tmyklebu Jul 03 '13 at 23:26
  • I get "Could not load hsdis-amd64.dll; library not loadable; PrintAssembly is disabled" ----> I'm searching for a download location; do I then add it as an external jar file (or library)? – huseyin tugrul buyukisik Jul 03 '13 at 23:29
  • Build your own openjdk. Or build the `hsdis-base` package or whatever. Anyway, when I tried this with similar code, I still see it issue a bunch of scalar instructions. Bottom line: Don't use java for anything where speed matters at all. – tmyklebu Jul 03 '13 at 23:33
  • I was using OpenCL, but I read that direct use of AVX could be at least 10 times faster for now. – huseyin tugrul buyukisik Jul 03 '13 at 23:39
5

I have done a little research on this (and am drawing on knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.

As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code uses vectorization (and to know for sure), you'll need to use JNI.

See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
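
To give a rough idea of what cache blocking looks like, here's a sketch in Java rather than the C we used, with square row-major matrices and an arbitrary block size you'd have to tune to your cache:

    // Sketch: cache-blocked matrix multiply C += A * B for n x n matrices
    // stored row-major in float[n*n]. BLOCK should be tuned so that three
    // BLOCK x BLOCK tiles fit in cache; 64 is just a placeholder.
    static final int BLOCK = 64;

    static void blockedMultiply(float[] A, float[] B, float[] C, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                        for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                            float aik = A[i * n + k];
                            for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }

The point is that the tiles touched by the inner loops stay resident in cache while they're reused, instead of being evicted between passes.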

Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.
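
For instance, something along these lines (just a sketch; it assumes 1000 dot products of length N, as in your question, and counts one multiply and one add per element):

    // Sketch: report the benchmark in flop/s instead of seconds.
    int N = 262144;                 // vector length from the question
    long t0 = System.nanoTime();
    // ... run the 1000 dot products here ...
    long t1 = System.nanoTime();
    double seconds = (t1 - t0) / 1e9;
    double gflops = 2.0 * N * 1000 / seconds / 1e9; // 2*N flops per dot product
    System.out.printf("%.3f s, %.2f GFLOP/s%n", seconds, gflops);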

Community
  • Also, the compute time for a vector of size 13 equals that of a vector of size 13*16; isn't this weird? Multiplying the problem size by 16 gives the same time (nearly 1 second for both). I thought the CPU was strong at branching. – huseyin tugrul buyukisik Jul 03 '13 at 23:07
  • 13 is tiny, and 13*16 allows you to use loop unrolling. But both are pretty small vectors, so I doubt you'll see much time variation. The overhead of the program probably dwarfs both of their calculation times. – Arjun Baokar Jul 03 '13 at 23:09
  • If his second loop wasn't broken, the two loops would use the cache in exactly the same ways. The speedup isn't because of the reduced condition checking. It's because the computations are organised differently. If you reduce the size of the vector to 64, then everything fits in L1 and it doesn't matter at all what order you traverse it in. Flops aren't a great measure of speed on modern processors because they're highly pipelined and arithmetic usually isn't the bottleneck. – tmyklebu Jul 03 '13 at 23:09
  • I just deleted the "}" above; I forgot to remove it when I was debugging, sorry. So you mean a different vector size means a different cache level? No SIMD at all? – huseyin tugrul buyukisik Jul 03 '13 at 23:11
  • To use JNI, do I need that .h file to really use the CPU? I tried and it did not work (I downloaded the javacpp jar and set the path), but I will try harder. – huseyin tugrul buyukisik Jul 03 '13 at 23:16