3

I am working on a multithreaded application (a Forex trading app written in C#) and had the client upgrade from a 12-core 3.0 GHz machine (Intel) to a 32-core 2.2 GHz machine (AMD). The PassMark benchmark results were significantly higher for multi-core integer, floating-point, and other calculations, while for single-core calculations it was a bit slower than the rest of the pack (the other machines it was compared against, which had configurations similar to the 12-core one). It also comes with 64 GB of RAM (4 times as much as the other one) and a much faster SSD.

So after configuring and running the application on that machine, not only did it not perform as well, it was significantly slower. We're talking about 30 seconds to 1 minute slower on an app that usually completes processing within 5-20 seconds. The application uses the TPL's MaxDegreeOfParallelism setting, which I've tried setting to the number of cores and also to half of that. I've also tried running single-threaded and without setting any limit on parallelism.

While it may be that the hardware has some issues, I am wondering if the CPU clock speed is the issue. I can overclock to 3.0 GHz. But is that even a good idea?

Server Info -

AMD http://www.passmark.com/forum/showthread.php?4013-AMD-Dual-6272-performance-is-60-lower-than-benchmarks Seems that benchmark was wrong to start with - officially.

Intel i7 3930k

OS (same in both) Windows 7 Professional 64-bit

Related issue - https://stackoverflow.com/questions/7747573/net-performance-on-amd-processor

EDIT: I see a lot of useful information. I want to modify the question slightly now. Forget the Intel processor for now. What can be done with the AMD system to get more out of it? We're working on profiling. We've had a DBA look into the indexing, fragmentation, and other parameters like I/O usage. There seem to be a lot more reads and writes than on the Intel-based system. I saw an answer on AMD-based optimization. Is there a way to do this other than using OpenCL? How about overclocking? Would that cook the CPU?

In terms of owning up: I see people are kind of pissed off at me! The PC was on sale, and my boss and I discussed whether the resources available (4 times more RAM, almost 3 times as many cores, and a much faster SSD) would help us gain a lot of performance. We're always looking to tune things from the software end, but the purchase hasn't (I won't say didn't) turned out to be the magical bang for the buck we were hoping for. I do feel every bit miserable about this, thus the lengthy post.

More Edit: I just wish some AMD rep would say, "This is bull****! You're doing it the wrong way! You've overlooked this and haven't used this feature." To make matters worse, I read that AMD has made huge losses this year and is waiting on a bailout. :(

Mukus
  • Please provide the actual CPU model numbers, operating system and version. We can't help you without that information. – ewwhite Dec 19 '12 at 03:16
  • Is your app even capable of 32-way parallelization? – Michael Hampton Dec 19 '12 at 03:25
  • @ewwhite I will add the specs in a bit. @Michael Yes. – Mukus Dec 19 '12 at 03:27
  • c# multithread performance intel vs amd? system ticks on old and new system? – John Siu Dec 19 '12 at 03:28
  • it's not really an Intel VS AMD.. I wouldn't think of it that way. – Mukus Dec 19 '12 at 03:31
  • @TejaswiRana I've worked in high-frequency trading for years, and never found an appropriate use for AMD-based systems... but you need to provide some real details of your setup in order for this question to be useful. – ewwhite Dec 19 '12 at 03:37
  • TejaswiRana - what @ewwhite said; he is an expert in this area. The actual processor model numbers are very important, as I can get a quad-core 3 GHz processor from 4 years ago or a quad-core 2.6 GHz processor from today, and I'd choose the latest, technically slower, processor every time. – Mark Henderson Dec 19 '12 at 03:41
  • Hey guys, I am waiting on my client to get back from dinner. Sorry. Both the servers have been shipped back. One returned and the AMD one to check back for faults. I will post the specs as soon as he gets back. – Mukus Dec 19 '12 at 03:43
  • Some CPUs have larger caches. That's why the model numbers may help explain things a little. – hookenz Dec 19 '12 at 04:22
  • I've added the specs – Mukus Dec 19 '12 at 04:54
  • The big question is -- how does your app *actually* use the CPU? If, for example, most of the time it's loading only a single core, then more cores will make very little difference and single-core performance is critical. – David Schwartz Dec 19 '12 at 13:21
  • We're using TPL. The max degree of parallelism is set to number of cores. – Mukus Dec 19 '12 at 14:35
  • @JohnSiu I stand corrected. Read my last edit. – Mukus Dec 19 '12 at 15:56

4 Answers

8

Let me get this straight. You upgraded the client based on a hunch and a single benchmark?

That's a mistake. Benchmarks are entirely artificial and do not reflect how real-world programs will perform. I will say, however, that they do provide an indication of potential performance.

Firstly, there is a lot involved in getting apps to perform well on multiple cores and to use all the available memory effectively.

Many apps are not written with large concurrency in mind and not all problem domains lend themselves to concurrent solutions. The bottleneck on your app may be locks around shared memory.

For example, I've seen graphs of concurrent apps that seem to scale really well up to, say, 4 threads, but then for no apparent reason the performance drops off linearly as the number of threads is increased. This is an indication of starvation of a resource. Locks are really expensive. Consider using lock-free structures, or minimise the amount of shared resources and interaction between threads.

Another slowdown can be around caches. A really interesting example is the LZ4 compressor. Earlier versions were very fast, but another, more complex compressor (Snappy) gave similar performance. The reason was the way the caches were used. Don't underestimate this. If you know what you're doing, you can speed up some algorithms and data structures many times over, which is exactly what the author of LZ4 did.

See the following link for interest's sake: http://fastcompression.blogspot.co.nz/2011/06/lz4-improved-performance.html

The first thing I'd do, though, is run your code on the 32-core system and profile it to get an idea of where it's spending its time. It's probably in locks. Also, try reducing the number of threads and benchmarking again. You may find performance increases; in fact, I'd say that's likely.
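To make the "reduce the number of threads and benchmark again" suggestion concrete, here is a rough sketch of a worker-count sweep. It is written in Python purely as a language-neutral illustration, not in the app's C#; in the real application the same idea would presumably mean varying `MaxDegreeOfParallelism` and timing each run. The workload (`contended_task`, where every iteration takes one shared lock) and the task and iteration counts are made up for the example, and CPython's global interpreter lock means the absolute numbers say nothing about the AMD box. Only the shape of the experiment carries over.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def contended_task(lock, counter, iterations):
    """Hypothetical unit of work: every iteration takes one shared lock."""
    for _ in range(iterations):
        with lock:
            counter[0] += 1


def sweep(worker_counts, tasks=32, iterations=2000):
    """Run the same batch of tasks at each worker count and time each run."""
    timings = {}
    for workers in worker_counts:
        lock = threading.Lock()
        counter = [0]
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(contended_task, lock, counter, iterations)
                for _ in range(tasks)
            ]
            for future in futures:
                future.result()  # re-raise any exception from a worker
        timings[workers] = time.perf_counter() - start
        # Sanity check: the lock kept every increment intact.
        assert counter[0] == tasks * iterations
    return timings


if __name__ == "__main__":
    for workers, seconds in sweep([1, 2, 4, 8, 16, 32]).items():
        print(f"{workers:>2} workers: {seconds:.4f} s")
```

If the timings stop improving, or get worse, well before the worker count reaches the core count, that points at a shared resource rather than at the CPU itself.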

hookenz
  • when they help pick servers they display the benchmarks to back that up. Think of auto industry or any other. IF benchmarking results are faked to sell something... Also the other system had to be either returned or purchased. It does however sound like a mistake to have purchased this.. that's why the question, right? – Mukus Dec 19 '12 at 04:26
  • Benchmark results were off! http://www.passmark.com/forum/showthread.php?4013-AMD-Dual-6272-performance-is-60-lower-than-benchmarks.. Now your answer is making a lot more sense to me. UpVote. – Mukus Dec 19 '12 at 04:56
  • It also sounds like you have lots of memory available. Is your app able to utilise this more to speed things up? As I know nothing about what your software does it's all guesswork. – hookenz Dec 19 '12 at 10:41
4

One way to think about this: you went from 12 cores x 2 threads per core (HT enabled) x 3.0 GHz = 72.0 to a system with 32 x 1 x 2.2 = 70.4.

Edit: Based on your updated info, the 3930k as described in the ARK has a 6x2 arch = 12 threads, not a 12x2 arch as I suggested. (http://ark.intel.com/products/63697/Intel-Core-i7-3930K-Processor-12M-Cache-up-to-3_80-GHz)
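Redoing that arithmetic with the corrected core count gives a rough picture. The snippet below is only a back-of-the-envelope sketch: `thread_ghz` and `ht_adjusted_ghz` are hypothetical helper names, the threads-times-clock product is the same crude metric as above, and the 15-30% Hyper-Threading gain is the range quoted in the comments on this answer, not a measured figure.

```python
def thread_ghz(cores, threads_per_core, clock_ghz):
    """Crude capacity figure: hardware threads multiplied by clock speed."""
    return cores * threads_per_core * clock_ghz


def ht_adjusted_ghz(cores, clock_ghz, ht_gain):
    """Same idea, but counting Hyper-Threading as a fractional gain per core."""
    return cores * (1.0 + ht_gain) * clock_ghz


amd = thread_ghz(32, 1, 2.2)            # 32 cores, no HT
intel_assumed = thread_ghz(12, 2, 3.0)  # the original (wrong) 12 x 2 guess
intel_actual = thread_ghz(6, 2, 3.0)    # 3930K: 6 cores, 12 threads

print(f"AMD 32 x 1 x 2.2      = {amd:.1f}")
print(f"Intel 12 x 2 x 3.0    = {intel_assumed:.1f}")
print(f"Intel 6 x 2 x 3.0     = {intel_actual:.1f}")

# Assumption: HT adds roughly 15-30% per core rather than doubling it.
low = ht_adjusted_ghz(6, 3.0, 0.15)
high = ht_adjusted_ghz(6, 3.0, 0.30)
print(f"Intel with HT at 15-30%: {low:.1f} to {high:.1f}")
```

On paper the AMD box comes out well ahead by this measure, which is exactly why the measure is suspect: it ignores per-clock efficiency and any time threads spend waiting on each other.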

Oversimplified view of the system aside - Intel has more efficient physical cores while the "virtual" (HT) cores are less efficient, and there are many other variables to consider - triple-channel memory controller etc.

But one thing possibly stands out: thread blocking. If there are threads that block / prevent other threads from executing, the faster clock rates + more efficient architectures are going to win out over having simply more thread capability. That is more of a software optimization problem.
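One way to put numbers on the blocking argument is an Amdahl's-law style estimate. This is a model I am bringing in for illustration, not something measured on either machine: it assumes a fixed fraction of the work is serial (for example, spent behind a lock), the rest divides evenly across threads, and both CPUs do the same work per clock. The comments on this answer suggest that last assumption flatters the AMD part, and counting the Intel chip's 12 logical threads as full cores flatters Intel. The function names are hypothetical.

```python
def relative_runtime(serial_fraction, threads, clock_ghz):
    """Amdahl-style estimate of runtime in arbitrary units.

    The serial part runs on one thread; the rest divides evenly across
    all threads. Dividing by clock speed assumes equal work per clock.
    """
    work = serial_fraction + (1.0 - serial_fraction) / threads
    return work / clock_ghz


def break_even_serial_fraction(threads_a, clock_a, threads_b, clock_b):
    """Serial fraction at which machines A and B take the same time.

    Solves relative_runtime(f, A) == relative_runtime(f, B) for f.
    Not meaningful if the two machines never cross over.
    """
    inv_a = 1.0 / threads_a
    inv_b = 1.0 / threads_b
    numerator = clock_a * inv_b - clock_b * inv_a
    denominator = clock_b * (1.0 - inv_a) - clock_a * (1.0 - inv_b)
    return numerator / denominator


if __name__ == "__main__":
    for fraction in (0.0, 0.05, 0.10, 0.25, 0.50):
        intel = relative_runtime(fraction, threads=12, clock_ghz=3.0)
        amd = relative_runtime(fraction, threads=32, clock_ghz=2.2)
        faster = "Intel" if intel < amd else "AMD"
        print(f"serial {fraction:.0%}: Intel {intel:.4f}, AMD {amd:.4f} -> {faster}")
    crossover = break_even_serial_fraction(12, 3.0, 32, 2.2)
    print(f"break-even serial fraction: {crossover:.1%}")
```

Under these assumptions the crossover sits at roughly 10% serial work: above that, the 12-thread 3.0 GHz machine finishes first despite having far fewer threads.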

Another thing to look at: are you using an AMD-optimized compiler for the C# app, or are you still using the Intel-optimized version? Edit: Visual Studio and most other compilers have options that allow you to target specific CPU architectures, i.e. 32-bit vs 64-bit, ARM, specific instruction sets (SSE2/SSE3/SSE4 etc). I wonder aloud if that could be a factor at play?

Joshua
  • I don't think it is 12 x 2. It is 6 x 2: 6 physical and 12 logical. – Mukus Dec 19 '12 at 03:44
  • Can you shed some light on your last paragraph? I haven't heard of that before. Looks quite useful. – Mukus Dec 19 '12 at 03:46
  • Intel HT only gives an approximately 15-30% performance boost, not 100% because it's not a fully independent core. So the weird comparison you make in the first paragraph is fairly meaningless. – hookenz Dec 19 '12 at 04:24
  • @TejaswiRana updated. Matt of course you're right, and I alluded to such. – Joshua Dec 19 '12 at 06:59
  • Intel cores also tend to outperform AMD cores MHz for MHz. That's been true for some time. – hookenz Dec 19 '12 at 10:49
  • @Matt Truth. Yet "tend to" is understated. I think even the low-end Intel i3 parts out-perform AMD parts on a MHz-for-MHz basis. Sad :( AMD's main selling point right now is the GPU, but that will change soon. But I digress... – Joshua Dec 19 '12 at 17:35
2

There are many things to consider.

  • Is the SSD the only "drive" on the system? If the SSD is NOT the only drive on the system is the SSD being used only for the operating system? Are you employing RAID for the application and if so does it connect to other servers that are databases that run RAID? RAID has been found to kill some aspects of database data retrieval.

  • Regarding the CPU, you really do need the chip model number to know that you are comparing apples to apples. The model number will tell you the chip's cache size, number of cores and threads, processor speed, bus type, and the interconnect speed in gigatransfers per second (GT/s). For example, one Intel CPU may have an 8.00 GT/s link and another a 6.5 GT/s link, and that matters a great deal for moving data between cores. If data is stuck on a CPU core after it has done its work, it effectively deadlocks the entire system, hardware and software.

Intel Server Processors

AMD Server Processors

  • Have you checked how large the data set is, and how large the application is when running in RAM? How fast is the RAM in each of the two systems being compared, and does the chip that you purchased support the speed of the RAM purchased? It is well known that motherboards support many different speeds of RAM, but the CPU that you ordered the system with may not. So you may order a system with a motherboard that supports 1300MHz and, due to the chip that you ordered, get less than 1000MHz. If this system has so many cores, why does a new system only have 64GB of RAM? I have a Dell T-410 as a home system, purchased around 2009, and it maxes out at 64GB with 8 cores (2 quad cores); the newer model has 128GB of RAM available with 12 cores (2 x 6). If you reorder the system, consider more RAM if you need it. Heck, I use 32GB for an 8-core home system running VMware 5.0.

  • Methinks, based on how you wrote your post and the type of inquiry being made, you did not bone up on the hardware aspects before ordering. If you look at the small print, you may be able to return it for another system. Just tell the boss that the performance is not as expected for the application that it is running, and do not delay, because the return window may be only a week or two, and after that YOU OWN IT.

Do not be ashamed. Just own up to it and let management know that the numbers you are getting back from initial testing are not within the ballpark of what you believed you would get for the outlay of cash, and that the system needs to be exchanged for another one.

T I
  • While I agree with you on the last para, it was on sale (final sale), so that isn't quite an option. – Mukus Dec 19 '12 at 11:49
1

As others have already noted, benchmarks are not always a good guide to which processor to choose. PassMark especially is definitely not something you would want to look at for non-general-purpose applications.

If you have some idea about what resources your software is using and where it is going to be bottlenecked, you might want to look at "raw" performance data like memory latency and memory throughput, and maybe also at the individual tests of the SPEC benchmark suite in the CINT (Intel 3960, AMD 6274) and CFP (Intel 3990, AMD 6274) disciplines.

Keep in mind that results (and also the perceived or measured application performance) may vary significantly depending on the compiler options or the compiler version used to produce a particular binary. Things are somewhat different for .NET, as its compilers only produce intermediate code (IL), which is translated to actual architecture-dependent code by the JIT compiler at run time. But even there you can specify optimization parameters for a specific architecture. Also, the specific patch level of your OS might be significant as well: Microsoft has released patches to fix underperformance on certain AMD CPUs.

the-wabbit