
I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using benchmark suites (Parboil, Rodinia), and then analyzed the results with the NVIDIA Visual Profiler. In most of the applications, the number of global instructions is enormously increased, up to 7-10 times, on the Maxwell architecture.

Specs for both graphics cards:

GTX 760: 6.0 Gbps memory, 2048 MB, 256-bit bus, 192.2 GB/s

GTX 750 Ti: 5.4 Gbps memory, 2048 MB, 128-bit bus, 86.4 GB/s

Ubuntu 14.04

CUDA driver 340.29

CUDA toolkit 6.5

I compiled the benchmark applications (no modifications), then collected the results with NVVP (6.5). Under Analyze All > Kernel Memory > From L1/Shared Memory, I recorded the global load transaction counts.
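For reference, the same counter can also be collected from the command line with nvprof (shipped with toolkit 6.5); the binary name and arguments below are placeholders for whichever benchmark is being profiled:

    nvprof --metrics gld_transactions ./histo <benchmark args>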

I have attached screenshots of our profiling results for histo run on Kepler (link) and Maxwell (link).

Does anyone know why the global instruction counts are increased on the Maxwell architecture?

Thank you.

hkim
  • There are some simplifications in the Maxwell architecture that can lead to an increase in dynamic instruction count. For example, 32-bit integer multiplication is now a short inline instruction sequence rather than a single instruction. I have seen instruction count expansion of up to 2x in certain standard math functions. I don't see how any of the architecture changes would cause dynamic instruction count changes by a factor of 7-10x. Are you sure both of your builds are release builds? – njuffa Mar 18 '15 at 14:09
  • Can you provide OS, driver version, toolkit version, the names of the counters/metrics you are collecting, and directions on how to get and run the benchmark in question? Without investigating the SASS and counter values, I'm not sure anyone can provide you a good answer. – Greg Smith Mar 19 '15 at 03:30
  • Ubuntu 14.04 / 340.29 / toolkit 6.5 / I compiled the benchmark then I collected the results from NVVP(6.5). Analyze all > Kernel Memory > From L1/Shared Memory section, I collected global load transaction counts. @GregSmith – hkim Mar 19 '15 at 04:50
  • @njuffa I built the release version (no modifications). – hkim Mar 19 '15 at 04:51
  • @hkim - Sorry for the long delay in responding. See my answer below. A future version of the tools should have better metrics for Maxwell that are actionable and comparable to past architectures. – Greg Smith Mar 24 '15 at 19:53
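For anyone reproducing this: the SASS that Greg mentions can be dumped from the compiled binary with cuobjdump (the binary name is again a placeholder), making it possible to compare the instruction sequences generated for each architecture:

    cuobjdump --dump-sass ./histo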

1 Answer


The counter gld_transactions is not comparable between the Kepler and Maxwell architectures. Furthermore, it is not equivalent to the count of global instructions executed.

On Fermi/Kepler this counts the number of 128-byte SM-to-L1 requests. It can increment from 0-32 per global/generic instruction executed.

On Maxwell, all global operations go through the TEX (unified) cache, which is completely different from the Fermi/Kepler L1 cache. There, global transactions measure the number of 32-byte sectors accessed in the cache. This can also increment from 0-32 per global/generic instruction executed.

If we look at 4 different cases:

CASE 1: Each thread in a warp accesses the same 32-bit offset.

CASE 2: Each thread in a warp accesses a 32-bit offset with a 128-byte stride.

CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.

CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128-byte memory range that is 128-byte aligned.

gld_transactions for each case listed above, by architecture:

            Kepler      Maxwell
Case 1      1           4
Case 2      32          32
Case 3      1           8
Case 4      1           4-16
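As a minimal sketch of the four cases (kernel names and launch configuration are illustrative, one warp per launch; profiling each kernel with gld_transactions should show counts along the lines of the table above):

    #include <cuda_runtime.h>

    // Case 1: every thread in the warp loads the same 32-bit word.
    __global__ void case1(const float *in, float *out) {
        out[threadIdx.x] = in[0];
    }

    // Case 2: consecutive threads load words 128 bytes (32 floats) apart.
    __global__ void case2(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x * 32];
    }

    // Case 3: each thread loads a unique 32-bit word by lane index.
    __global__ void case3(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x];
    }

    // Case 4: unique 32-bit words, permuted within one 128-byte-aligned
    // line (cudaMalloc guarantees sufficient alignment).
    __global__ void case4(const float *in, float *out) {
        out[threadIdx.x] = in[threadIdx.x ^ 1];
    }

    int main() {
        float *in, *out;
        cudaMalloc(&in, 32 * 32 * sizeof(float)); // large enough for the strided case
        cudaMalloc(&out, 32 * sizeof(float));
        case1<<<1, 32>>>(in, out);                // one warp per launch
        case2<<<1, 32>>>(in, out);
        case3<<<1, 32>>>(in, out);
        case4<<<1, 32>>>(in, out);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }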

My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.

I would recommend looking at l2_{read, write}_{transactions, throughput}.
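For example, with nvprof (binary name again a placeholder):

    nvprof --metrics l2_read_transactions,l2_write_transactions,l2_read_throughput,l2_write_throughput ./histo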

Greg Smith