I've run into a bit of an issue writing a fragment shader for a project. I'm creating a palette-less terminal emulator, so I figured I'd do it with the following shader:

#version 110

uniform sampler2D tileset;
uniform sampler2D indices;
uniform sampler2D colors;
uniform sampler2D bgcolors;

uniform vec2 tileset_size;
uniform vec2 size;

varying vec2 tex_coord;

void main(void)
{
    // Calculated texture coordinate
    vec2 screen_pos = vec2(gl_FragCoord.x / 800.0, 1.0 - gl_FragCoord.y / 500.0);

    // Indirect texture lookup 1
    vec2 index = texture2D(indices, screen_pos.st).rg;
    vec4 color = texture2D(colors, screen_pos.st);
    vec4 bgcolor = texture2D(bgcolors, screen_pos.st);

    // Calculated texture coordinate
    vec2 tileCoord;
    //256.0 because the [0,256) byte value is normalized on [0,1)
    tileCoord.x = mod(screen_pos.x, 1.0/size.x)*(size.x/tileset_size.x) + floor(index.x*256.0)/tileset_size.x;
    tileCoord.y = mod(screen_pos.y, 1.0/size.y)*(size.y/tileset_size.y) + floor(index.y*256.0)/tileset_size.y;

    // Indirect texture lookup 2
    vec4 tile = texture2D(tileset, tileCoord);

    vec4 final = tile*color;

    gl_FragColor = vec4(mix(bgcolor.rgb, final.rgb, final.a), 1.0);
}

To render this to the screen, I draw one big quad and let the shader do the rest.
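
For completeness, the draw itself is nothing fancy. A rough sketch of what it looks like (the program/texture handles and the size/tileset_size values here are placeholders, not my exact values; only the uniform names come from the shader):

    glUseProgram(program);

    /* Samplers point at texture units 0-3; the four textures are
       already bound to those units with glActiveTexture/glBindTexture. */
    glUniform1i(glGetUniformLocation(program, "indices"),  0);
    glUniform1i(glGetUniformLocation(program, "colors"),   1);
    glUniform1i(glGetUniformLocation(program, "bgcolors"), 2);
    glUniform1i(glGetUniformLocation(program, "tileset"),  3);

    /* Terminal grid size (in tiles) and tileset size - example values only. */
    glUniform2f(glGetUniformLocation(program, "size"), 80.0f, 50.0f);
    glUniform2f(glGetUniformLocation(program, "tileset_size"), 16.0f, 16.0f);

    /* One quad covering the whole viewport; the shader does the rest. */
    glBegin(GL_QUADS);
    glVertex2f(-1.0f, -1.0f);
    glVertex2f( 1.0f, -1.0f);
    glVertex2f( 1.0f,  1.0f);
    glVertex2f(-1.0f,  1.0f);
    glEnd();

    SDL_GL_SwapWindow(window);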

This code generates the desired output. However, it does so at 5 seconds per frame. From what I've researched, this is likely due to the display driver executing my shader in software rather than in hardware. I found that commenting out texture2D() calls made things run smoothly again.

This led me to the following code:

void main(void)
{
    //vec2 screen_pos = vec2(gl_FragCoord.x / 800.0, 1.0 - gl_FragCoord.y / 500.0);
    vec2 screen_pos = vec2(0.5, 0.5);

    vec2 index = texture2D(indices, screen_pos.st).rg;
    vec4 color = texture2D(colors, screen_pos.st);
    vec4 bgcolor = texture2D(bgcolors, screen_pos.st);
    vec4 tiles = texture2D(tileset, screen_pos.st);

    gl_FragColor = vec4(index.rgg + color.rgb + bgcolor.rgb + tiles.rgb, 1.0);
}

This turned out to be just as awfully slow. Commenting out the last texture lookup, vec4 tiles = ..., and removing it from the output made it run smoothly again. So I looked up how many texture2D() calls my device supports. I got the following results:

GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS_ARB: 8
GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS_ARB: 16
GL_MAX_TEXTURE_IMAGE_UNITS_ARB: 8
GL_MAX_PROGRAM_TEX_INDIRECTIONS_ARB: 8

So something must be up. Even if every one of my lookups were an indirect access (which I'm pretty sure they're not), I should be allowed up to 8 of them! Additionally, glGetShaderInfoLog() and glGetProgramInfoLog() have nothing to say.
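
For what it's worth, the way I'm checking those logs is roughly this (a sketch; shader and program are the handles from glCreateShader()/glCreateProgram()):

    GLint status, len = 0;
    char log[4096];

    /* Compile status and log for the fragment shader */
    glGetShaderiv(shader, GL_COMPILE_STATUS, &status);
    glGetShaderInfoLog(shader, sizeof(log), &len, log);
    printf("compile status %d, log: '%s'\n", status, len > 0 ? log : "");

    /* Link status and log for the program */
    glGetProgramiv(program, GL_LINK_STATUS, &status);
    glGetProgramInfoLog(program, sizeof(log), &len, log);
    printf("link status %d, log: '%s'\n", status, len > 0 ? log : "");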

I should list my specs:

  • Machine: Intel Atom Duo running Linux 3.17.1 (Arch, specifically)
  • GPU: Intel 945GM/GMS/GME, 943/940GML Integrated Graphics Controller
  • Mesa version: 10.4.5

And yes, I am checking GL_ARB_fragment_program after calling the standard glewInit() procedure.

So, I have two possible solutions in mind.

  1. The spec sheet for ARB_fragment_shader states that the minimum number of texture indirections should be 4. It could be that my program hasn't initialized the ARB_fragment_program correctly, and the system is falling back to the default. (I tried putting "ARB" in as many shader-related places as I could, but I think glewInit() takes care of this anyway.)
  2. Mesa's compiler has a bug affecting my particular chip. The final post here mentioned this, and describes a similar-sounding GPU. Basically, the compiler falsely labels all texture reads as indirect texture reads, and so wrongly decides the program exceeds the hardware's limits.

If anyone has any incredible knowledge in this area, I'd really like to hear it. Normally I'd say "screw it, get a better computer," but the sheer irony of needing a high-end graphics card just to run a terminal emulator is... well... ironic.

If I've forgotten to write some information here, let me know.

Edits

glxinfo -l: pastebin

ARB assembly (partially generated by cgc)

Disabling any one of the TEX instructions puts it back in hardware mode; with all 4 enabled it drops back to software.

  • Can't say thanks for reading? I figure it's a pretty long post to follow. – lowq Mar 04 '15 at 04:14
  • Can you add output of `glxinfo -l` (fragment program and fragment shader parts)? Also you can try calculating the resulting colour in-place, like `vec4 result = vec4(0); vec4 texresult = texture2D(...); result += texresult; texresult = texture2D(...); ...` - maybe you're limited by the temporary register count. – keltar Mar 04 '15 at 06:50
  • As a side solution, you can use ARB_fragment_program instead of fragment shader, which could give you finer control on code. That will require writing shader in ARB assembly, or using compiler that can output it (e.g. nvidia cgc). – keltar Mar 04 '15 at 07:35
  • I installed the cgc compiler. Do I send this to OpenGL the same way? (autodetecting ARB assembly) I've also found glProgramBinary(), maybe use this? – lowq Mar 04 '15 at 18:25
  • To a sane GLSL compiler, none of those texture fetches in the second shader are indirect. They all use a constant coordinate. So if you're still having performance issues, then you can throw out indirect texture fetches as the problem. Perhaps your actual problem is the image format / filter used by `tileset`? That particular sampler seems to be the problem in both shaders. – Andon M. Coleman Mar 04 '15 at 18:36
  • Hmph. None of my specs indicate that too many texture accesses would be the problem (see glxinfo link). What else explains why 3 texture accesses work, but 4 don't? (I'm suspecting the compiler isn't quite sane.) – lowq Mar 04 '15 at 18:42
  • `tileset` is RGBA, the others are RGB. Sampling `tileset` worked just fine before I introduced `bgcolors`. – lowq Mar 04 '15 at 19:15
  • After writing my dummy shader in ARB assembly and sending it to the graphics card via `glProgramStringARB()` I observe the same behavior as before. 3 texture accesses are ok, 4 puts it in software mode. I'm suspecting a bad driver here. – lowq Mar 04 '15 at 21:32

1 Answer


Fragment Program

Well, looks like the following ARB fragment program assembly did the trick. It was initially generated by cgc, but the vast majority was scrapped and rewritten by hand.

!!ARBfp1.0
# cgc version 3.1.0013, build date Apr 18 2012
# command line args: -oglsl -profile arbfp1
# source file: tilemap.frag
#vendor NVIDIA Corporation
#version 3.1.0.13
#profile arbfp1
#program main
#semantic tileset
#semantic indices
#semantic colors
#semantic bgcolors
#semantic tileset_size
#semantic size
#var float4 gl_FragCoord : $vin.WPOS : WPOS : -1 : 1
#var float4 gl_FragColor : $vout.COLOR : COL : -1 : 1
#var sampler2D tileset :  : texunit 3 : -1 : 1
#var sampler2D indices :  : texunit 0 : -1 : 1
#var sampler2D colors :  : texunit 1 : -1 : 1
#var sampler2D bgcolors :  : texunit 2 : -1 : 1
#var float2 tileset_size :  : c[0] : -1 : 1
#var float2 size :  : c[1] : -1 : 1
#var float2 tex_coord :  :  : -1 : 0
#const c[2] = 0.0020000001 1 0.00125 256
PARAM c[3] = {
        program.local[0..1],
        { 0.0020000001, 1, 0.00125, 256 }
};
TEMP R0;
TEMP R1;
TEMP R2;
TEMP R3;

# R2 := normalized screen coords

MAD R2.z, -fragment.position.y, c[2].x, c[2].y;
MUL R2.x, fragment.position, c[2].z;
MOV R2.y, R2.z;

TEX R3, R2, texture[2], 2D;
TEX R0, R2, texture[1], 2D;
TEX R1, R2, texture[0], 2D;

# multiply by screen size
MUL R2.x, R2.x, c[0].x;
MUL R2.y, R2.y, c[0].y;
# backup original
MOV R2.z, R2.x;
MOV R2.w, R2.y;

# multiply by inverse of font size
MUL R2.x, R2.x, c[1].z;
MUL R2.y, R2.y, c[1].w;
FLR R2.x, R2.x;
FLR R2.y, R2.y;
MUL R2.x, R2.x, c[1].x;
MUL R2.y, R2.y, c[1].y;
# now we have a bit of a staircase, take the original minus staircase
ADD R2.x, R2.z, -R2.x;
ADD R2.y, R2.w, -R2.y;
# modulo is complete

# normalize per unit (inv font size)
MUL R2.x, R2.x, c[1].z;
MUL R2.y, R2.y, c[1].w;
# divide by 16 for proper texture offset
MUL R2.x, R2.x, .0625;
MUL R2.y, R2.y, .0625;
# add to given texture offset
ADD R2.x, R2.x, R1.x;
ADD R2.y, R2.y, R1.y;

# ... and sample!
TEX R2, R2, texture[3], 2D;

#R2 is tile color
#R3 is background color
#R0 is foreground color
MUL R0, R0, R2;
#R0 is result color
SUB R3, R3, R0;
#R3 is bgcolor - rescolor

# lerp: scale R3 by (1 - alpha)
MAD R3, R3, -R0.a, R3;

#R3 is now (bgcolor - rescolor) * (1 - rescolor.a)
ADD result.color, R3, R0;
END

For whatever reason, writing out the assembly for the simplified case e.g.

TEX ...
TEX ...
TEX ...
TEX ...

put the shader in software mode, just like before. After using cgc to compile several different versions, I found that some still worked with 4 texture accesses. Additionally, I swapped what was originally:

TEX R1, R2, texture[2], 2D;
TEX R0, R2, texture[1], 2D;
ADD R0, R0, R1;
TEX R1, R2, texture[0], 2D;

into

TEX R3, R2, texture[2], 2D;
TEX R0, R2, texture[1], 2D;
TEX R1, R2, texture[0], 2D;

# ... addition done later

based on what I read in the ARB_fragment_program spec:

A texture indirection can be considered a node in the texture dependency chain. Each node contains a set of texture instructions which execute in parallel, followed by a sequence of ALU instructions. A dependent texture instruction is one that uses a temporary as an input coordinate rather than an attribute or a parameter. A program with no dependent texture instructions (or no texture instructions at all) will have a single node in its texture dependency chain, and thus a single indirection.

So, at least, I removed one texture indirection. It appears that cgc (and likely the GLSL compiler) was trying to minimize temporaries rather than texture indirections. It turned out to be fine to use 4 temporaries after all; I'm still not sure why that kind of optimization is necessary.


ARB GL-Code

It was difficult to find documentation for this API; it dates back to 2002, I think? Either way, I made it work.

    if(!GLEW_ARB_fragment_program)
    {
            printf("GLEW_ARB_fragment_program is unavailable.\n");
            return false;
    }

    glClear(GL_COLOR_BUFFER_BIT);
    SDL_GL_SwapWindow(window);


    glEnable(GL_FRAGMENT_PROGRAM_ARB);

    glGenProgramsARB(1, &tilemap_prog);

    if(!tilemap_prog)
    {
            printf("Failed to generate fragment program\n");
            return false;
    }

    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, tilemap_prog);

    glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB, strlen(tilemap_frag_asm), tilemap_frag_asm);

    GLenum error = glGetError();

    if(error == GL_INVALID_OPERATION)
    {
            printf("GL_INVALID_OPERATION!\n");

            printf("glGetString(GL_PROGRAM_ERROR_STRING_ARB): %s\n", glGetString(GL_PROGRAM_ERROR_STRING_ARB));

            GLint texture_units;
            glGetIntegerv(GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS_ARB, &texture_units);
            printf("GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS_ARB: %d\n", texture_units);
            glGetIntegerv(GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS_ARB, &texture_units);
            printf("GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS_ARB: %d\n", texture_units);
            glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS_ARB, &texture_units);
            printf("GL_MAX_TEXTURE_IMAGE_UNITS_ARB: %d\n", texture_units);
            glGetIntegerv(GL_MAX_PROGRAM_TEX_INDIRECTIONS_ARB, &texture_units);
            printf("GL_MAX_PROGRAM_TEX_INDIRECTIONS_ARB: %d\n", texture_units);

            return false;
    }

    // Window size
    glProgramLocalParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 0, width, height, 0.0, 0.0);
    // Font output size and inverse font output size
    glProgramLocalParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 1, 10.0, 10.0, 1/10.0, 1/10.0);
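
One gotcha worth noting: with ARB_fragment_program there is no glUniform1i() for samplers. The texture unit is baked into each TEX instruction (texture[0] through texture[3] above), so the textures just have to be bound to the matching units before drawing. A sketch, with hypothetical texture handle names:

    /* Bind each texture to the unit its TEX instruction samples from,
       matching the texunit assignments in the #var comments above. */
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, indices_tex);   /* texture[0] */
    glActiveTexture(GL_TEXTURE1);
    glBindTexture(GL_TEXTURE_2D, colors_tex);    /* texture[1] */
    glActiveTexture(GL_TEXTURE2);
    glBindTexture(GL_TEXTURE_2D, bgcolors_tex);  /* texture[2] */
    glActiveTexture(GL_TEXTURE3);
    glBindTexture(GL_TEXTURE_2D, tileset_tex);   /* texture[3] */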

Kind of finicky, but it worked in the end. Special thanks to keltar for pointing me in the right direction.
