
Everyone with intermediate experience writing 2D renderers knows that a sprite batcher keeps its data in graphics-API-specific buffers that need updating, and we always look for the fastest way to update them. I've run into a dilemma: for Metal and Swift, what is the smartest thing to update, and what is the smartest way to do it? More specifically, should I update the vertices before sending them to the GPU (do the vertex and tex coord transformations on the CPU), or build the transform matrix and tex coord parameters and send them in one instanced uniforms buffer (do the vertex and tex coord transformations on the GPU)? My current approach uses instanced rendering and one giant uniforms buffer, aligned to 8 bytes.

Static Data

static let spritesPerBatch: Int = 1024
static var spritesData: [Float] = [Float](count: spritesPerBatch * BufferConstants.SIZE_OF_SPRITE_INSTANCE_UNIFORMS / sizeof(Float), repeatedValue: 0.0)
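
As a side note on sizing: the per-instance payload queued below is 11 floats (6 transform + 4 tex coord + 1 texture index), i.e. 44 bytes, so the 8-byte alignment mentioned above would presumably round SIZE_OF_SPRITE_INSTANCE_UNIFORMS up to 48. A quick sketch of that arithmetic (the value 48 is my assumption, not something stated in the post):

```swift
// 6 transform floats + 4 tex coord floats + 1 texture-index float = 11 floats
let payloadBytes = (6 + 4 + 1) * 4        // sizeof(Float) == 4, so 44 bytes
let alignment = 8                          // the 8-byte alignment from the question
let stride = (payloadBytes + alignment - 1) / alignment * alignment
print(stride)                              // 48
```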

Queueing Sprite Data

Method: SpriteBatch.begin()

spritesInBatch = 0

Method: SpriteBatch.submit(sprite)

let offset: Int = spritesInBatch * BufferConstants.SIZE_OF_SPRITE_INSTANCE_UNIFORMS / sizeof(Float)
// transform matrix (3x2)
spritesData[offset + 0] = wsx * cosMetaRot * xOrtho
spritesData[offset + 1] = wsx * sinMetaRot * yOrtho
spritesData[offset + 2] = -hsy * sinMetaRot * xOrtho
spritesData[offset + 3] = hsy * cosMetaRot * yOrtho
spritesData[offset + 4] = (tx * cosNegCameraRotation - ty * sinNegCameraRotation) * xOrtho
spritesData[offset + 5] = (tx * sinNegCameraRotation + ty * cosNegCameraRotation) * yOrtho

// tex coords and lengths
spritesData[offset + 6] = sprite.getU()
spritesData[offset + 7] = sprite.getV()
spritesData[offset + 8] = sprite.getUVW()
spritesData[offset + 9] = sprite.getUVH()

// which texture to use out of the 16 that could be bound
spritesData[offset + 10] = Float(targetTextureIDIndex)

spritesInBatch += 1
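
For reference, the six matrix entries above can be read as a 3x2 transform whose columns are the rotated/scaled basis vectors plus a translation, all pre-multiplied by the ortho factors. A standalone sketch of the same math (the helper name is mine, and the camera-rotation factor applied to tx/ty in the original is left out for brevity):

```swift
import Foundation

// Hypothetical helper mirroring the writes above; returns the six floats in the
// same order they are stored: [m00, m01, m10, m11, m20, m21].
func spriteTransform(_ wsx: Float, _ hsy: Float, _ rot: Float,
                     _ tx: Float, _ ty: Float,
                     _ xOrtho: Float, _ yOrtho: Float) -> [Float] {
    let c = Float(cos(Double(rot)))
    let s = Float(sin(Double(rot)))
    return [
         wsx * c * xOrtho, wsx * s * yOrtho,   // scaled, rotated x basis
        -hsy * s * xOrtho, hsy * c * yOrtho,   // scaled, rotated y basis
         tx * xOrtho,      ty * yOrtho         // translation
    ]
}

// With no rotation and unit ortho factors this reduces to plain scale + translate:
// spriteTransform(2, 3, 0, 5, 7, 1, 1) == [2, 0, 0, 3, 5, 7]
```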

Copying sprite data into the uniforms buffer

Method: SpriteBatch.end()

instancedUniformsBuffer = device.newBufferWithLength(spritesPerBatch * BufferConstants.SIZE_OF_SPRITE_INSTANCE_UNIFORMS, options: MTLResourceOptions.CPUCacheModeWriteCombined)
instancedUniformsPointer = instancedUniformsBuffer.contents()
memcpy(instancedUniformsPointer, spritesData, instancedUniformsBuffer.length)
Renderer.renderSpriteBatch()

Sprite batch render method

Method: Renderer.renderSpriteBatch()

Shaders.setShaderProgram(Shaders.SPRITE)

let textureIDs: [TextureID] = SpriteBatch.getTextureIDs()
for i in 0..<textureIDs.count {
    renderEncoder.setFragmentTexture(TextureManager.getTexture(textureIDs[i]).texture, atIndex: i)
}

let instancedUniformsBuffer: MTLBuffer = SpriteBatch.getInstancedUniformsBuffer().buffer
renderEncoder.setVertexBuffer(VertexBuffers.SPRITE.buffer, offset: 0, atIndex: 0)
renderEncoder.setVertexBuffer(instancedUniformsBuffer, offset: 0, atIndex: 1)
renderEncoder.drawIndexedPrimitives(MTLPrimitiveType.Triangle, indexCount: BufferConstants.SPRITE_INDEX_COUNT, indexType: MTLIndexType.UInt16, indexBuffer: IndexBuffers.SPRITE.buffer, indexBufferOffset: 0, instanceCount: SpriteBatch.getSpritesInBatch())
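
One small assumption worth stating: with instanced rendering, each instance presumably reuses a single shared quad, so SPRITE_INDEX_COUNT would be 6 (two triangles over four vertices). A sketch of what such a shared index list could look like (the vertex order and winding are guesses):

```swift
// Hypothetical contents of IndexBuffers.SPRITE: one quad as two triangles over
// vertices 0..3 (assumed order: top-left, top-right, bottom-left, bottom-right).
let quadIndices: [UInt16] = [0, 1, 2, 2, 1, 3]
```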

I currently get about 1400 sprites sized 32x64, with 8 separate textures, at 60 fps on an iPhone 5s. I'm mostly satisfied with this and could finish my iOS game at that number. However, I want to push the boundary so that I can use better effects in the game. To reiterate the question in case I haven't made it clear yet, I'm wondering two major things that are specific to PERFORMANCE.

  1. Would it be better to use a larger vertex buffer (as opposed to my current method of sharing one vertex and index buffer for ALL sprites), where I set the position and texture coordinates of each vertex using memory copies on the CPU side? This would also mean NOT using instanced draw calls.
  2. If not, is there a faster way to prepare and copy the sprite data?

Thanks and sorry for the super long post! :)

amedley

1 Answer


Just a few thoughts...

  1. I'd use Instruments to see what's costing you the most time in your game loop. However, the Time Profiler probably won't help you much on the GPU side of things.

  2. Look at the GPU report in Xcode; it should show you how much time each frame spends on the GPU and on the CPU. There's no point shifting more work to the GPU if it's already hovering near 16 ms.

  3. Look at replacing memcpy with a memory buffer that is shared between the GPU and CPU. This way you'd write directly into the buffer's contents from Swift, and it would be available to the GPU without the need to copy memory.

  4. You could look at rewriting SpriteBatch.submit(sprite) as a Metal compute shader, but the method doesn't seem computationally expensive if you're only running it a couple of thousand times. The output MTLBuffer would contain all your spritesData and could be fed straight into the render encoder. You'd still need to get the input data from the CPU to the GPU (compute) though.

  5. Your point 1 is interesting. I don't think you want to transform vertices on the CPU, but this could be a good candidate for a compute shader. It's similar to a boid simulation I did a little while ago: a Metal compute shader updates each boid's location and velocity, and also creates a per-boid transformation matrix that is then used to transform the 6 vertex positions (a simple fish drawn with 2 triangles) that make up each boid's visual representation. My scene is constructed in SceneKit, so using instanced draw calls wasn't really an option.
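
To make point 3 concrete, here is a rough sketch in the same Swift-1-era style as the question (names and sizes mirror the question but are my assumptions). On iOS, MTLBuffer contents are CPU-visible, so submit() can write straight into contents() instead of staging in spritesData and calling memcpy in end():

```swift
// Allocate ONE buffer up front (e.g. at init) rather than per frame in end():
let sharedBuffer = device.newBufferWithLength(
    spritesPerBatch * BufferConstants.SIZE_OF_SPRITE_INSTANCE_UNIFORMS,
    options: MTLResourceOptions.CPUCacheModeWriteCombined)

// In submit(sprite), write directly into the buffer's memory:
let floats = UnsafeMutablePointer<Float>(sharedBuffer.contents())
let offset = spritesInBatch * BufferConstants.SIZE_OF_SPRITE_INSTANCE_UNIFORMS / sizeof(Float)
floats[offset + 0] = wsx * cosMetaRot * xOrtho
// ... same writes as before; end() then just binds sharedBuffer, no memcpy.
```

One caveat: with CPUCacheModeWriteCombined, keep the writes sequential and never read back from the buffer on the CPU, since reads from write-combined memory are very slow.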

lock