Many factors affect the performance you get from multiple threads or tasks.
Overhead
Regardless of whether you have an OS or not, parallel programming has overhead. At a minimum, the other processor (core) has to be set up to execute the thread code, and that setup takes execution time.
Another item is the synchronization wait time. At some point, the primary processor needs to wait for the other processor(s) to finish.
There is also an overhead related to signalling or communications. The secondary processor(s) must take execution time to notify the primary processor that computation is complete and they must store the results somewhere.
If the overhead of a thread costs more time than the work the thread actually performs (such as a single multiplication), you may not see any time savings from the parallel effort.
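As a rough illustration (this sketch is my own, assuming a C++11 compiler with std::thread), you can time a trivial operation done inline against the same operation done in a freshly created thread; the threaded version is dominated by creation and join overhead:

    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        using clock = std::chrono::steady_clock;
        volatile double result = 0.0;

        // Trivial work done inline.
        auto t0 = clock::now();
        result = 3.14 * 2.71;
        auto t1 = clock::now();

        // The same trivial work delegated to a thread: creation, scheduling
        // and the join (synchronization wait) all count against it.
        auto t2 = clock::now();
        std::thread worker([&result] { result = 3.14 * 2.71; });
        worker.join();
        auto t3 = clock::now();

        std::printf("inline: %lld ns\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
        std::printf("thread: %lld ns\n", (long long)
            std::chrono::duration_cast<std::chrono::nanoseconds>(t3 - t2).count());
        return 0;
    }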
Workload
The amount of work performed by the threads should be significant. A 10x10 matrix may not contain enough work to overcome the overhead expense. Any thread you create should perform enough work to justify the cost of creating and monitoring it.
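A minimal sketch of giving each thread enough work (the function name parallel_sum and the chunking scheme are my own illustration, again assuming std::thread): every thread receives a large contiguous block rather than a handful of elements:

    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Split a large array across num_threads workers; each worker's chunk
    // should dwarf the cost of creating and joining the thread.
    double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
        std::vector<double> partial(num_threads, 0.0);
        std::vector<std::thread> workers;
        const std::size_t chunk = data.size() / num_threads;

        for (unsigned i = 0; i < num_threads; ++i) {
            const std::size_t begin = i * chunk;
            const std::size_t end =
                (i + 1 == num_threads) ? data.size() : begin + chunk;
            workers.emplace_back([&data, &partial, i, begin, end] {
                // Each thread accumulates its own chunk; results are
                // combined after the joins.
                partial[i] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0.0);
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }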
Delegation
If there is an OS, the OS decides how your threads are executed. They could be scheduled round-robin on a single processor even when the system has multiple processors (the OS might, for example, delegate one task or application per processor). You will not see much performance improvement when the OS is time-sharing your threads on a single core. Check whether your OS lets you force delegation of your thread code to other cores.
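For example, on Linux with glibc you can pin a std::thread to a particular core through its native handle. This sketch assumes g++ on Linux (which defines _GNU_SOURCE by default); other systems need different calls (e.g. SetThreadAffinityMask on Windows):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Restrict a running std::thread to a single core (Linux/glibc only).
    void pin_to_core(std::thread& t, int core_id) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core_id, &cpuset);
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    }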
Memory collisions
Most multi-processor platforms share memory. The hardware may have only one data bus between the processors and the memory, so one processor has to wait while another is accessing memory (only one processor can use the data bus at a time). This contention can reduce your program's efficiency to the point where the parallel speedup is negligible.
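One common mitigation is to keep each thread working on its own local data and touch shared memory as rarely as possible. This sketch (my own illustration, using std::atomic) contrasts a shared counter updated on every element with a thread-local total written back once:

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<long> shared_total{0};

    void count_shared(const std::vector<int>& v) {
        for (int x : v)
            if (x > 0) shared_total.fetch_add(1);  // shared-memory traffic on every hit
    }

    void count_local(const std::vector<int>& v) {
        long local = 0;                            // stays in a register / local cache
        for (int x : v)
            if (x > 0) ++local;
        shared_total.fetch_add(local);             // one shared access per thread
    }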
Data Caches and Instruction Pipelines
If your program is not optimized for the data cache, creating multiple threads will not produce a significant performance improvement. A processor can waste time reloading its cache from memory (especially while waiting for the other processors to finish using the data bus). You may be able to gain more in a single execution thread by designing your data and data processing around the data cache structure.
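A classic single-threaded illustration (this sketch is mine, assuming a row-major matrix stored in a std::vector): the same sum is far cheaper when the loop order matches the memory layout:

    #include <cstddef>
    #include <vector>

    // Walking row by row reads memory sequentially and reuses each cache
    // line; walking column by column touches a new line on almost every access.
    double sum_row_major(const std::vector<double>& m,
                         std::size_t rows, std::size_t cols) {
        double s = 0.0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                s += m[r * cols + c];   // consecutive addresses: cache friendly
        return s;
    }

    double sum_col_major(const std::vector<double>& m,
                         std::size_t rows, std::size_t cols) {
        double s = 0.0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                s += m[r * cols + c];   // strided addresses: frequent cache misses
        return s;
    }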
Processors may also have caches or pipelines for instructions. Transfers of execution (branches, jumps, calls) are expensive for the processor: it must spend time evaluating whether the target code is already in a cache or has to be fetched from memory. Reducing the number of execution transfers will speed up your program, often more than creating threads.
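As a sketch of removing a transfer of execution from a hot loop (my own example; whether it pays off depends on how predictable the branch is), the same clamp can be written with a branch or with straight-line arithmetic:

    #include <cstddef>

    void clamp_branchy(int* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (data[i] < 0)      // branch evaluated on every iteration
                data[i] = 0;
    }

    void clamp_branchless(int* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            data[i] = data[i] & -(data[i] >= 0);  // mask is all ones or zero: no branch
    }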
Your results may not be significant due to other factors, especially if your platform is running other applications while yours executes. Study up on benchmarking techniques. You may need a significant amount of data, a long run time, or both. A common rule of thumb is around 1E09 iterations of the code being benchmarked (many computers execute an instruction in roughly 1E-8 seconds, so you have to run many times to get a good average).
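A bare-bones version of that pattern (this sketch assumes C++ and <chrono>; the iteration count and the multiplication are placeholders for your own code under test) looks like:

    #include <chrono>
    #include <cstdio>

    int main() {
        const long iterations = 1000000000L;   // ~1E09, adjust for your hardware
        volatile double x = 1.0000001;         // volatile keeps the loop from being optimized away

        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < iterations; ++i)
            x = x * 1.0000001;                 // code being benchmarked
        auto stop = std::chrono::steady_clock::now();

        double total_ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("average per iteration: %.2f ns\n", total_ns / iterations);
        return 0;
    }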