
I am trying to benchmark GPUs. I have a Radeon 7670M (480 shaders x 600 MHz), an Intel HD 4000 (16? execution units x 1100 MHz) and a Radeon 4850 (800 shaders x 625 MHz). I feed in a 4096 x 4096 Rg32f texture and read back a Red texture. Each pixel takes 400 - 800 ns. The 7670M runs at around 56% of the 4850, as expected from (480 x 600) / (800 x 625) = 58%. But the HD 4000 runs at 74%, which is nowhere near the (16 x 1100) / (800 x 625) = 3.5% I expected. Working backwards, the HD 4000 would need about 350 stream processors (?) to reach 77%.
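For reference, a minimal sketch of the back-of-envelope arithmetic above (the unit counts and clocks are simply the figures quoted in this question, not authoritative specs):

// Back-of-envelope throughput estimate: shader units x clock (MHz).
double radeon4850 = 800 * 625.0;   // reference card
double radeon7670m = 480 * 600.0;
double intelHd4000 = 16 * 1100.0;  // the HD 4000 unit count is the open question

Console.WriteLine("7670M expected:  {0:P0}", radeon7670m / radeon4850);   // ~58 %
Console.WriteLine("HD 4000 expected: {0:P0}", intelHd4000 / radeon4850);  // ~3.5 %

// Working backwards from the measured ~77 %:
double impliedUnits = 0.77 * radeon4850 / 1100.0;                          // ~350
Console.WriteLine("Implied HD 4000 units: {0:F0}", impliedUnits);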

The 7670M is only faster than the HD 4000 when the texture is smaller than 512 x 512 and the shader time is under 45 ns / pixel (perfect for gaming?). My target workload is around 30 minutes per texture, which would mean roughly 10 minutes extra on the 7670M.

Now, could it be that the HD 4000 really has the equivalent of ~350 stream processors, is the 7670M simply optimized for small(?) textures and gaming (and I have been shortchanged again), or is there some way to get the 7670M to run faster through OpenGL?

I'll try any hints on this test bed!

Here are the shaders:

#version 330
uniform sampler2D inData;
in vec2 glFragCoord;
out float outData;
void main(void)
{
   float x = texture(inData, glFragCoord.xy).r;
   // ... do stuff at 400 - 800 microseconds / pixel, randomly
   outData = x;
}

#version 330
in vec2 position;
out vec2 glFragCoord;
void main()
{
   glFragCoord = position * vec2(0.5) + vec2(0.5);
   gl_Position = vec4(position, 0, 1);
}

Here is the texture/buffer stuff:

// input data: two 32-bit floats per texel (Rg32f)
float[] data = new float[textureSize * textureSize * 2];
Int32 frameBufferId = GL.GenFramebuffer();
GL.BindFramebuffer(FramebufferTarget.Framebuffer, frameBufferId);
Int32 textureId = GL.GenTexture();
GL.BindTexture(TextureTarget.Texture2D, textureId);
GL.TexImage2D(TextureTarget.Texture2D, 0, PixelInternalFormat.Rg32f, textureSize, textureSize, 0, PixelFormat.Rg, PixelType.Float, data);
// attach the texture as the FBO's color target
GL.FramebufferTexture2D(FramebufferTarget.Framebuffer, FramebufferAttachment.ColorAttachment0, TextureTarget.Texture2D, textureId, 0);
// no mipmaps are generated, so use a non-mipmapped minification filter
GL.TexParameter(TextureTarget.Texture2D, TextureParameterName.TextureMinFilter, (Int32)TextureMinFilter.Linear);
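One thing worth verifying right after this setup (a small sketch, assuming the same OpenTK bindings as above) is that the framebuffer is actually complete before drawing into it:

// Sanity check: an incomplete FBO silently produces garbage instead of errors.
FramebufferErrorCode status = GL.CheckFramebufferStatus(FramebufferTarget.Framebuffer);
if (status != FramebufferErrorCode.FramebufferComplete)
    throw new InvalidOperationException("FBO incomplete: " + status);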

and here is the new stuff (Ortho, Begin, Vertex2, and End were removed):

Int32 arrayBufferId = GL.GenBuffer();
GL.BindBuffer(BufferTarget.ArrayBuffer, arrayBufferId);
// full-screen quad in clip space: one (x, y) pair per corner
float[] arrayBufferData = new float[4 * 2] { -1, -1, 1, -1, 1, 1, -1, 1 };
GL.BufferData(BufferTarget.ArrayBuffer, new IntPtr(4 * 2 * sizeof(float)), arrayBufferData, BufferUsageHint.StaticDraw);
Int32 positionId = GL.GetAttribLocation(programId, "position");
GL.EnableVertexAttribArray(positionId);
GL.VertexAttribPointer(positionId, 2, VertexAttribPointerType.Float, false, 0, 0);
// one fragment per texel of the render target
GL.Viewport(0, 0, textureSize, textureSize);
GL.DrawArrays(PrimitiveType.Quads, 0, 4);
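If this ever runs on a strict core-profile context, note that a vertex array object is mandatory there and GL_QUADS is removed. A sketch of the equivalent setup, reusing the buffer and attribute from above and drawing the same four corners as a triangle fan (an assumption on my part, not something required on the compatibility contexts used here):

// Core profile (3.2+) requires a VAO to be bound before drawing.
Int32 vaoId = GL.GenVertexArray();
GL.BindVertexArray(vaoId);

GL.BindBuffer(BufferTarget.ArrayBuffer, arrayBufferId);
GL.EnableVertexAttribArray(positionId);
GL.VertexAttribPointer(positionId, 2, VertexAttribPointerType.Float, false, 0, 0);

// Same four corners as above, drawn as two triangles instead of a quad.
GL.DrawArrays(PrimitiveType.TriangleFan, 0, 4);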

and here is the timing stuff:

float[] result = new float[textureSize * textureSize];
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
GL.ReadPixels(0, 0, textureSize, textureSize, PixelFormat.Red, PixelType.Float, result);
stopWatch.Stop();
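Note that the Stopwatch above measures the CPU wall time of the whole ReadPixels call, which also waits for any rendering that has not finished yet. A sketch of a variant that tries to isolate the draw itself (assuming a blocking GL.Finish is acceptable in a benchmark):

// Drain any pending work first so the timer starts from an idle pipeline.
GL.Finish();

Stopwatch drawWatch = Stopwatch.StartNew();
GL.DrawArrays(PrimitiveType.Quads, 0, 4);
// DrawArrays only queues work; Finish blocks until the GPU has executed it.
GL.Finish();
drawWatch.Stop();

Console.WriteLine("Draw + shader = " + drawWatch.ElapsedMilliseconds + " ms");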

Different timer:

Int32 timerQuery = GL.GenQuery();
GL.BeginQuery(QueryTarget.TimeElapsed, timerQuery);

GL.ReadPixels(0, 0, textureSize, textureSize, PixelFormat.Red, PixelType.Float, result);

GL.EndQuery(QueryTarget.TimeElapsed);
Int32 done = 0;
while (done != 1) { GL.GetQueryObject(timerQuery, GetQueryObjectParam.QueryResultAvailable, out done); }
Int64 elapsedTime = 0;
GL.GetQueryObject(timerQuery, GetQueryObjectParam.QueryResult, out elapsedTime);

Console.WriteLine("GPU query = " + (elapsedTime / 1000000.0).ToString() + " ms");
  • There are many factors such as number of texture units, number of ROP units, memory bandwidth, and cache sizes that affect the performance. You cannot directly translate a timing to a number of shader units and vice versa. – Damon Jan 05 '14 at 17:26
  • You also cannot do timing this way at all. The **only** thing you are doing is timing how long it takes GL to transfer the image from VRAM to CPU... there are **timer queries** in modern OpenGL that you can use to time the execution of commands/tasks in the pipeline. – Andon M. Coleman Jan 05 '14 at 19:54
  • OK. So, the different timer I added above should give a more accurate measure? – user3162781 Jan 05 '14 at 20:20
  • Yes, that will give a much more accurate measure, but still it is only measuring the amount of time taken to read the framebuffer. You have a lot of other things going on in your sample code (e.g. shaders), and this timer query only times your readpixels operation, not the time taken to execute your shader. If you time `glDrawArrays (...)` instead (sketched after these comments), that will include the execution of your shader in the returned time. – Andon M. Coleman Jan 05 '14 at 21:58
  • Thanks. That makes more sense. Now the StopWatch and the Query show roughly the same time. And, in the manual it states ReadPixels is framebuffer complete. However, I still get a run time of 9.6s for the Intel 4000 and 12.9s for the Radeon 7670m -- that pesky 35% that does not make sense. Thanks for helping! – user3162781 Jan 06 '14 at 09:56
  • Intel HD4000's compute units can do SIMD16 pixel processing, which means 16 operations issued per cycle, and there are 16 compute units, so this means 16*16=256 things at the same time for pixel computing. Also, being inside the CPU can be advantageous when the CPU needs to send data to the GPU or the opposite. Sharing the CPU's cache can improve performance for big textures. – huseyin tugrul buyukisik Jan 15 '14 at 21:02
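Following Andon M. Coleman's suggestion above, a sketch of the same timer query wrapped around the draw call instead of the readback (mirroring the query code in the question):

// Time the draw call itself on the GPU, i.e. the fragment shader execution.
Int32 drawQuery = GL.GenQuery();
GL.BeginQuery(QueryTarget.TimeElapsed, drawQuery);

GL.DrawArrays(PrimitiveType.Quads, 0, 4);

GL.EndQuery(QueryTarget.TimeElapsed);

// Spin until the GPU has finished and the result is available.
Int32 drawDone = 0;
while (drawDone != 1) { GL.GetQueryObject(drawQuery, GetQueryObjectParam.QueryResultAvailable, out drawDone); }

Int64 drawTime = 0;
GL.GetQueryObject(drawQuery, GetQueryObjectParam.QueryResult, out drawTime);
Console.WriteLine("Shader execution = " + (drawTime / 1000000.0).ToString() + " ms");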

0 Answers