
I have an implementation of a generic Matrix, and I added an option to use the '*' and '+' operators with either parallel or serial processing.

Parallel calculation example: given matrices m1 and m2 with m3 = m1 * m2, each row i of m3 is computed by a different thread. The '+' operator is parallelized the same way.

Serial calculation simply computes m3[0,0], m3[0,1], ... one element at a time.

Then I measured the time of each operation on big and small matrices, and I noticed that on the small matrices serial processing was faster than parallel processing, but parallel processing performed better on the big matrices.

The results:
+----------------------------+--------------------------------+------------+
|             Big            |              Small             |            |
+----------------------------+--------------------------------+------------+
|      *      |       +      |       *       |        +       |            |
+-------------+--------------+---------------+----------------+------------+
| 0.697798sec | 0.0407585sec | 8.7408e-05sec | 0.000109052sec | Parallel   |
+-------------+--------------+---------------+----------------+------------+
| 11.9984sec  | 0.0235058sec | 6.68e-07sec   | 7.76e-07sec    |  Serial    |
+-------------+--------------+---------------+----------------+------------+

Can someone please explain why?

Thanks a lot!

Eliav
    CPU caches are optimised for linear memory access. Get some tools that will show you L1, L2 and L3 cache performance. – Richard Critten Sep 19 '16 at 19:16
  • Spinning up a thread takes time. – NathanOliver Sep 19 '16 at 19:16
  • Also, Big and Small are meaningless terms — are we talking 10 and 100? And for each size, how many threads, and were you using a thread pool or starting a new thread for each parcel of work? – Richard Critten Sep 19 '16 at 19:20
  • we are talking about small matrices of size 2*2 and big ones of size 1000*1000 – Eliav Sep 19 '16 at 19:27
  • and I start new thread for each parcel of work – Eliav Sep 19 '16 at 19:31
  • Edit your post to show us the code. For example, are you letting the OS schedule the threads? How many threads are you using? How did you benchmark the program? Just because you create threads doesn't mean they will be run on separate cores by the OS or run in parallel by the OS. – Thomas Matthews Sep 19 '16 at 19:44

2 Answers


For a small matrix, say 10*10, serial processing is favourable because the work doesn't need to be broken into smaller pieces before processing. When that same 10*10 matrix is operated on in parallel, it is split into pieces that are handed to the individual processors — and keep in mind that all this splitting and hand-off takes time — so the performance of parallel processing suffers on small matrices.

When a large matrix, say 100*100, is handed to a serial processor, the processor can't devote itself to this one program: it also has to handle interrupts and a plethora of other processes, so wait time increases. But when the same 100*100 matrix is processed in parallel, it is broken into reasonably small pieces operated on by possibly more than one processor. If the CPU has two cores, for example, it can dedicate one to the matrix and the other to interrupt handling and other programs, so wait time is significantly reduced.


There are many factors in performance with multiple threads or tasks.

Overhead
Regardless of whether you have an OS or not, parallel programming has an overhead. Minimally, the other processor (core) has to be set up to execute the thread's code. This takes execution time.

Another item is the synchronization wait time. At some point, the primary processor needs to wait for the other processor(s) to finish.

There is also an overhead related to signalling or communications. The secondary processor(s) must take execution time to notify the primary processor that computation is complete and they must store the results somewhere.

If the overhead in your thread takes more time than the execution time of the thread (such as a simple single multiplication), you may not notice any time savings from the parallel effort.

Workload
The amount of work performed by the threads should be significant. A 10x10 matrix may not contain enough work to overcome the overhead expense. When you create threads, they should perform enough work to justify the cost of their creation and monitoring.

Delegation
If there is an OS, the OS is in charge of how the threads will be executed. They could be executed round-robin style on one processor, even when there are multiple processors in the system. (The OS could delegate one task or application per processor). There will not be much performance improvement when the OS is sharing your threads on a single core. You may want to research the OS to see if you can force delegation of your thread code to other cores.

Memory collisions
Most multi-processor platforms share memory. The hardware may have only one data bus between the processors and the memory. Thus, one processor will have to wait while the other is accessing memory (since only one processor can use the data bus at a time). This can slow down the efficiency of your program so that your performance results are negligible.

Data Caches and Instruction Pipelines
If your program is not optimized for the data cache, creating multiple threads will not produce a significant performance improvement. A processor can waste time reloading its cache from memory (especially while waiting for the other processors to finish using the data bus). You may be able to show more improvement in a single execution thread by designing your data and data processing around the data cache structure.

Processors may also have caches or pipelines for instructions. Transfers of control in program execution are annoying to the processor. It must waste time evaluating whether the code is in a cache or whether it must go and fetch the code. Reducing the number of execution transfers will speed up your program, usually more than creating threads.

Your results may not be significant due to various factors, especially if your platform is executing other applications while yours is running. Study up on techniques for benchmarking. You may need either a significant amount of data or a significant number of runs of your program (or both). Usually significant is 1E09 iterations around the benchmark area (many computers can execute instructions in around 1E-8 seconds, so you'll have to run many times to get a good average).

Thomas Matthews