
I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using benchmark suites (Parboil, Rodinia), and then analyzed the results with the NVIDIA Visual Profiler. In most of the applications, the number of global instructions is enormously increased, up to 7-10 times, on the Maxwell architecture.

Specs for both graphics cards:

GTX 760: 6.0 Gbps memory, 2048 MB, 256-bit bus, 192.2 GB/s

GTX 750 Ti: 5.4 Gbps memory, 2048 MB, 128-bit bus, 86.4 GB/s

Ubuntu 14.04

CUDA driver 340.29

CUDA toolkit 6.5

I compiled the benchmark applications (no modifications), then collected the results with NVVP (6.5). Under Analyze All > Kernel Memory > From L1/Shared Memory, I recorded the global load transaction counts.
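For reference, the same counter can also be collected from the command line with nvprof (shipped with toolkit 6.5); the binary name and arguments below are placeholders for whichever benchmark is being profiled:

    nvprof --metrics gld_transactions ./histo <benchmark args>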

I have attached screenshots of our profiling results for histo run on Kepler (link) and Maxwell (link).

Does anyone know why the global instruction counts are increased on the Maxwell architecture?

Thank you.

hkim
  • There are some simplifications in the Maxwell architecture that can lead to an increase in dynamic instruction count. For example, 32-bit integer multiplication is now a short inline instruction sequence rather than a single instruction. I have seen instruction count expansion of up to 2x in certain standard math functions. I don't see how any of the architecture changes would cause dynamic instruction count changes by a factor of 7-10x. Are you sure both of your builds are release builds? – njuffa Mar 18 '15 at 14:09
  • Can you provide OS, driver version, toolkit version, the names of the counters/metrics you are collecting, and directions on how to get and run the benchmark in question? Without investigating the SASS and counter values, I'm not sure anyone can provide you a good answer. – Greg Smith Mar 19 '15 at 03:30
  • Ubuntu 14.04 / 340.29 / toolkit 6.5 / I compiled the benchmark then I collected the results from NVVP(6.5). Analyze all > Kernel Memory > From L1/Shared Memory section, I collected global load transaction counts. @GregSmith – hkim Mar 19 '15 at 04:50
  • @njuffa I built the release version (no modifications). – hkim Mar 19 '15 at 04:51
  • @hkim - Sorry for the long delay in responding. See my answer below. A future version of the tools should have better metrics for Maxwell that are actionable and comparable to past architectures. – Greg Smith Mar 24 '15 at 19:53
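For anyone reproducing this: the SASS that Greg mentions can be dumped from the compiled binary with cuobjdump (the binary name is again a placeholder), making it possible to compare the instruction sequences generated for each architecture:

    cuobjdump --dump-sass ./histo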

1 Answer


The counter gld_transactions is not comparable between the Kepler and Maxwell architectures. Furthermore, it is not equivalent to the count of global instructions executed.

On Fermi/Kepler this counts the number of 128-byte SM-to-L1 requests. It can increment from 0-32 per global/generic instruction executed.

On Maxwell, all global operations go through the TEX (unified) cache, which is completely different from the Fermi/Kepler L1 cache. There, global transactions measure the number of 32-byte sectors accessed in the cache. This can also increment from 0-32 per global/generic instruction executed.

If we look at 4 different cases:

CASE 1: Each thread in a warp accesses the same 32-bit offset.

CASE 2: Each thread in a warp accesses a 32-bit offset with a 128-byte stride.

CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.

CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128-byte memory range that is 128-byte aligned.

gld_transactions for each case listed above, by architecture:

            Kepler      Maxwell
Case 1      1           4
Case 2      32          32
Case 3      1           8
Case 4      1           4-16
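As a minimal sketch of the four cases (kernel names and launch configuration are illustrative, one warp per launch; profiling each kernel with gld_transactions should show counts along the lines of the table above):

    #include <cuda_runtime.h>

    // Case 1: every thread in the warp loads the same 32-bit word.
    __global__ void case1(const float *in, float *out) {
        out[threadIdx.x] = in[0];
    }

    // Case 2: consecutive threads load words 128 bytes (32 floats) apart.
    __global__ void case2(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x * 32];
    }

    // Case 3: each thread loads a unique 32-bit word by lane index.
    __global__ void case3(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x];
    }

    // Case 4: unique 32-bit words, permuted within one 128-byte-aligned
    // line (cudaMalloc guarantees sufficient alignment).
    __global__ void case4(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x ^ 1];
    }

    int main() {
        float *in, *out;
        cudaMalloc(&in, 32 * 32 * sizeof(float)); // large enough for the strided case
        cudaMalloc(&out, 32 * sizeof(float));
        case1<<<1, 32>>>(in, out);                // one warp per launch
        case2<<<1, 32>>>(in, out);
        case3<<<1, 32>>>(in, out);
        case4<<<1, 32>>>(in, out);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }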

My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.

I would recommend looking at l2_{read, write}_{transactions, throughput}.
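For example, with nvprof (binary name again a placeholder):

    nvprof --metrics l2_read_transactions,l2_write_transactions,l2_read_throughput,l2_write_throughput ./histo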

Greg Smith