The Littlest CPU Rasterizer

Okay, so I’ve been promising for a long time that I’m going to blog about some of the stuff I’ve developed for my personal project, but I never seem to get around to it.

Finally I thought I’d just start off with a cute little trick: here’s how I’m doing “image-based” occlusion for my sky light. A little while ago, while I was thinking about ambient lighting, I realized I actually had a simple enough problem that I could render very small cubemaps, on the CPU, in vast numbers.

Here’s the basic underlying idea: A 16-by-16 black-and-white image is 256 bits, or 32 bytes. That’s exactly two SSE registers, or one AVX register.

That is, you can store tiny cubemaps whole in your CPU’s SIMD registers!
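
Concretely, that looks like this (a minimal sketch; the variable names are mine, not from the engine):

#include <emmintrin.h>  // SSE2

// One 16x16 1-bit image: 16 rows of 16 bits = 256 bits.
__m128i img0 = _mm_setzero_si128();  // rows 0-7
__m128i img1 = _mm_setzero_si128();  // rows 8-15

// ...or, with AVX, a single 256-bit register:
// __m256i img = _mm256_setzero_si256();  // needs immintrin.h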

Next, I have a very limited problem domain. I’m working with a cube-world, so that helps. I’m only going to render from the cube vertices, so I have a very limited number of relative offsets between my camera and other cubes – few enough that I can actually exhaustively enumerate them.

So I wrote a little bit of code to render out 4096 (that’s 16x16x16) black-and-white images of cubes, and toss them into a header file. Here’s what that looks like:

static const U16 kCubeRenders[] = {
    
.... twenty thousand lines or so ....

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000011100000000,
    0b0000011111000000,
    0b0000011111000000,
    0b0000001111000000,
    0b0000001111000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000011100000,
    0b0000000011111000,
    0b0000000011111000,
    0b0000000011111000,
    0b0000000001111000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000011100,
    0b0000000000011111,
    0b0000000000011111,
    0b0000000000011111,
    0b0000000000000111,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

.... and on, and on, and on ....
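
(If you’re curious how a table like that gets baked, here’s a hypothetical sketch – not my actual generator, and the flat-projection details are assumptions – that ray-tests each pixel of one face against a unit cube at a given offset:)

#include <cstdint>
#include <cstdio>
#include <algorithm>

// Slab-method test: does the ray from the origin along (dx,dy,dz)
// hit the unit cube at integer offset (ox,oy,oz)?
static bool RayHitsCube(float dx, float dy, float dz, int ox, int oy, int oz)
{
    const float d[3]  = { dx, dy, dz };
    const float lo[3] = { (float)ox, (float)oy, (float)oz };
    float tMin = 0.0f, tMax = 1e9f;
    for(int a = 0; a < 3; ++a)
    {
        if(d[a] == 0.0f)
        {
            // ray parallel to this slab: miss unless the origin is inside it
            if(lo[a] > 0.0f || lo[a] + 1.0f < 0.0f) return false;
            continue;
        }
        float t0 = lo[a] / d[a];
        float t1 = (lo[a] + 1.0f) / d[a];
        if(t0 > t1) std::swap(t0, t1);
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
    }
    return tMin <= tMax;
}

// Print the sixteen 16-bit rows of the +Z face for one cube offset,
// in the 0b... format used above.
static void EmitFacePlusZ(int ox, int oy, int oz)
{
    for(int j = 0; j < 16; ++j)
    {
        uint16_t row = 0;
        for(int i = 0; i < 16; ++i)
        {
            // flat projection: pixel centers map to [-1,1]^2 on the z=1 plane
            float u = (i + 0.5f) / 8.0f - 1.0f;
            float v = (j + 0.5f) / 8.0f - 1.0f;
            if(RayHitsCube(u, v, 1.0f, ox, oy, oz))
                row |= (uint16_t)(1u << i);
        }
        printf("    0b");
        for(int b = 15; b >= 0; --b) putchar(((row >> b) & 1) ? '1' : '0');
        printf(",\n");
    }
}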

To render an image, I just zero two registers, then loop over the world searching for any filled cubes, and logical OR my 32-byte image with the prerendered images of any cubes I find. Essentially nothing but this:

// this is a renderer
__m128i bits0 = _mm_setzero_si128();  // top half of the image
__m128i bits1 = _mm_setzero_si128();  // bottom half
for(I32 zz=zStart; zz<zEnd; ++zz)
{
    for(I32 yy=yStart; yy<yEnd; ++yy)
    {
        for(I32 xx=xStart; xx<xEnd; ++xx)
        {
            U8 terr = pBlock->Get(xx, yy, zz);
            if(terr != kTerrainTypeEmpty)
            {
                // each relative offset has two prerendered __m128i halves
                U32 ind = (xx-xStart + (yy-yStart)*16 + (zz-zStart)*16*16) * 2;

                bits0 = _mm_or_si128(bits0, pRenders[ind]);
                bits1 = _mm_or_si128(bits1, pRenders[ind+1]);
            }
        }
    }
}

(That’s a bit of a simplification, since I actually render three cubemap faces at once. I also have special cases for cubes in the 0,0,0 position, that is, actually touching the camera. Cubes in that spot block the entire octant, so you can skip the whole render.)

So now I have a 1-bit cubemap; all that remains is to convert it to a spherical harmonic. I’m using first-order harmonics – just 4 floats – because I think it looks fine and I’m cheap. For this I have 32-bit float cubemaps of the spherical harmonic bases, and I convolve them with the image using code along these lines:

// four copies of the first 32 bits of the image
const __m128i splat = _mm_shuffle_epi32(render[0], _MM_SHUFFLE(0,0,0,0));

// mask off everything from word 0 except bit 0, word 1 gets only bit 1, etc.
// (_mm_setr_epi32 takes its arguments in element order, low word first)
const __m128i mask = _mm_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3);
const __m128i sel = _mm_and_si128(splat, mask);

// set word 0 to 0xffffffff if image bit 0 was 0, and so on
const __m128i nmask = _mm_cmpeq_epi32(sel, _mm_setzero_si128());

// mask the first four pixels of the SPH basis
const __m128 maskedImage = _mm_and_ps( sphBasis[0], _mm_castsi128_ps(nmask) );

// finally add the masked pixels to a running sum
sum = _mm_add_ps( sum, maskedImage );

This conversion is actually the slowest part of the whole technique – it takes more time than rendering the cubemap! – so if I’m lucky, this post will nerd-snipe somebody who actually knows anything about SSE and get me a faster routine. (The compiler does seem to be converting the constants into vector loads.)
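
Roughly, the surrounding loop looks like this (a simplified sketch, not the production code; assume sphBasis holds 64 four-pixel __m128 groups per basis function, in the same bit order as the image):

#include <emmintrin.h>  // SSE2

// one SH coefficient from the 256-bit image
static float ConvolveOneBasis(const __m128i render[2], const __m128* sphBasis)
{
    __m128 sum = _mm_setzero_ps();
    const __m128i zero = _mm_setzero_si128();
    int p = 0;  // index of the current 4-pixel basis group
    for(int half = 0; half < 2; ++half)
    {
        for(int word = 0; word < 4; ++word)
        {
            // broadcast one 32-bit word (32 image pixels) into all four lanes;
            // _MM_SHUFFLE needs an immediate, hence the switch
            __m128i splat;
            switch(word)
            {
                case 0:  splat = _mm_shuffle_epi32(render[half], _MM_SHUFFLE(0,0,0,0)); break;
                case 1:  splat = _mm_shuffle_epi32(render[half], _MM_SHUFFLE(1,1,1,1)); break;
                case 2:  splat = _mm_shuffle_epi32(render[half], _MM_SHUFFLE(2,2,2,2)); break;
                default: splat = _mm_shuffle_epi32(render[half], _MM_SHUFFLE(3,3,3,3)); break;
            }
            for(int group = 0; group < 8; ++group, ++p)
            {
                // lane i selects bit (group*4 + i) of the splatted word
                const __m128i mask = _mm_setr_epi32(
                    (int)(1u << (group*4 + 0)), (int)(1u << (group*4 + 1)),
                    (int)(1u << (group*4 + 2)), (int)(1u << (group*4 + 3)));
                const __m128i sel   = _mm_and_si128(splat, mask);
                const __m128i nmask = _mm_cmpeq_epi32(sel, zero);  // ~0 where the bit was 0
                sum = _mm_add_ps(sum, _mm_and_ps(sphBasis[p], _mm_castsi128_ps(nmask)));
            }
        }
    }
    float lanes[4];
    _mm_storeu_ps(lanes, sum);  // horizontal add of the four partial sums
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}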

It’s still fast enough, though, that this step isn’t a significant part of my terrain generation. My pokey 2.53 GHz Core 2 MacBook can render around 100 of these spherical harmonics per core per millisecond. That’s much too slow to run every frame for everything, but running it once on each new block of terrain isn’t a concern.

Once I have a harmonic at each vertex, I linearly interpolate them to get a field of harmonics for the entire world.
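
For first-order SH that’s just four floats per vertex, so the blend is an ordinary trilinear interpolation (a minimal sketch, with names that are mine):

struct SH4 { float c[4]; };  // first-order SH: 4 coefficients

// Trilinear blend of the 8 cell-corner harmonics at fractional position
// (fx,fy,fz) inside the cell. Corner order: x fastest, then y, then z.
static SH4 LerpSH(const SH4 k[8], float fx, float fy, float fz)
{
    SH4 out;
    for(int i = 0; i < 4; ++i)
    {
        // blend along x, then y, then z
        float x00 = k[0].c[i] + (k[1].c[i] - k[0].c[i]) * fx;
        float x10 = k[2].c[i] + (k[3].c[i] - k[2].c[i]) * fx;
        float x01 = k[4].c[i] + (k[5].c[i] - k[4].c[i]) * fx;
        float x11 = k[6].c[i] + (k[7].c[i] - k[6].c[i]) * fx;
        float y0  = x00 + (x10 - x00) * fy;
        float y1  = x01 + (x11 - x01) * fy;
        out.c[i]  = y0 + (y1 - y0) * fz;
    }
    return out;
}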

So, results! Here’s the underlying voxel field I’m generating cubemaps of:

A boring voxel image.

And here’s the detailed geometry, with sky occlusion only:

I've actually never seen this before! Never had a reason to look at it.

The ambient contribution I’m actually using (occlusion, convolved with the surface normal, convolved with the sky dome):

With normals

The full lighting (above plus the sun and a little bit of nonphysical nonsense):

Lighting component

And final render, in all its Perlin-noise-overuse glory:

Final render

For this sort of simple terrain, it seems to work very nicely! There are some artifacts where the lumpy terrain goes “too deep” inside its corresponding voxel and you get inaccurate deep black, but it’s still not unconvincing to my eye.

Caves, obviously, are hard. I only render out to a distance of 16 cubes, so light is going to leak through any object larger than 16 voxels – I haven’t decided what to do about that. Presumably you could create a “mip-mapping” scheme – convert the world into “megacubes” and do a second render – if you figured out what you were going to do with partially filled areas. (My instinct says you can probably fish something out of your ear that’ll work OK…)

Similarly, if you decided what to do about partially filled blocks, you could presumably use a technique like this on a world voxelized from polygons rather than on a world polygonized from voxels.

And of course the day is probably coming when you just render a zillion images on the GPU, or cone trace everything, or whatever you do on monster machines — but I have a soft spot for specialized software renderers and I think this one is especially adorable.

Anyone else ever done anything like this?


15 Responses to The Littlest CPU Rasterizer

  1. Nathan Kurz says:

    > (The compiler does seem to be converting the constants into vector loads.)

    Yes, compilers often do a poor job here, and the constant loads and re-loads within tight loops can be a bottleneck. It’s worth noting that at the assembly level, unlike for integers or floats, there are no ‘immediate’ vector constants. Your only choices are to load one from memory/cache, or to create it with vector operations.

    Since creation often costs only a single cycle of latency, it’s frequently the better choice. Section 13.4 here shows ways to create some useful vector constants: http://www.agner.org/optimize/optimizing_assembly.pdf
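
    For instance, all-zeros and all-ones can be generated in-register with no memory load at all:

    __m128i zeros = _mm_setzero_si128();            // compiles to pxor xmm, xmm
    __m128i ones  = _mm_cmpeq_epi32(zeros, zeros);  // x == x in every lane: all bits set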

    But what may benefit you more is just moving the load/creation out of the loop. You didn’t think your compiler would do this for you, did you? You should be able to get a good boost just by generating the constants once in advance and then using them. On x64, you have 16 vector registers available, so you might as well use them.

    For stuff like this, you will want to be looking at the generated assembly to make sure it makes sense. On Linux, ‘objdump -d’ is a good place to start. It’s not actually that hard to understand the assembly, and if it doesn’t look ‘right’ to you, you can probably do better. Don’t assume that any ‘improvements’ the compiler made are actually for the better. While it’s not 1:1, fast assembly code is usually simple assembly code.

  2. Charlie says:

    Yeah, I’ve written vector assembly for SPU; it’s just that this is my first time working with SSE. I really just wanted to make it clear that that nasty-looking _mm_set_epi32 was “in reality” a load; of course the load could be slow.

    In fact the strategy the compiler (clang/llvm if you’re wondering) has chosen is to hoist everything it can out of the loop; the inner loop is just movdqa, movdqa, pand, pcmpeqd, pandn, addps – the splat is computed pre-loop and stored in memory, then it, the bitmask, and the SPH table entries are loaded each time through. So you may be right that the loads are the bottleneck.

    And thanks for the link; I have a copy but hadn’t yet found that nice constant table. Doesn’t help me here, though, because I’m looking for “1,2,4,8” (for example) and the examples there only offer the same constant in all slots. There may be some way to generate my mask, though, if I can come up with it.

    But unfortunately I can’t get rid of that SPH table; it’s the result of a bunch of offline computation so I have no choice but to load one vector from it each time through the loop. So if cache misses caused by loads are my problem then I may be just screwed.

    Not that screwed, though, as I said I’m not really unhappy with current performance.

  3. Nathan Kurz says:

    > the compiler (clang/llvm if you’re wondering) has chosen is to hoist everything it can
    > out of the loop; the inner loop is just movdqa, movdqa, pand, pcmpeqd, pandn, addps

    I haven’t used Clang much for this, but it’s specifically those movdqa’s that you could hoist out manually. You might be able to convince Clang to treat these as globals, or you might use inline assembly and just stash them somewhere that you are confident won’t be overwritten.

    > So you may be right that the loads are the bottleneck.

    I think this is likely the bottleneck. There is a 7 cycle latency to load a 128-bit vector on Sandy Bridge. You can issue two loads per cycle (or one load and one store) but because of the limited number of line fill buffers, you can only have 10 loads or stores outstanding. When you run out of buffers you stall.

    > There may be some way to generate my mask, though, if I can come up with it.

    Yes, this is more reason to load it once and reuse it. You probably can just change your function call to a macro so the constant vector stays in scope. Generating all zeros with pxor is usually a free operation, as it can execute on any of the 3 vector execution units, but anything else often has an associated cost.
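
    Something like this (a sketch; ‘splats’ stands in for your per-iteration data):

    // build the constant once, outside the loop...
    const __m128i kMask = _mm_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3);
    for(int i = 0; i < n; ++i)
    {
        // ...and reuse it every iteration, with no reload from memory
        const __m128i sel = _mm_and_si128(splats[i], kMask);
        /* ... */
    }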

  4. Torsten Schroeder says:

    Is the source available, to take a look at and recompile locally? Looks really nice.

    • Charlie says:

      That’s a very nice offer… unfortunately I’m not really ready to open this up as a whole. If I make time to pull out the masking code I’ll let you know.

      I admit I was kind of hoping there was some magic SSE instruction I didn’t know about that would make everything fast, but it sounds like my troubles are in the context, not in my choice of operations, so if I want more performance I’m just going to have to do a bunch of tedious fiddling.

  5. Mmalex says:

    Super cool! Kautz and Aila did something like this in 2004, AND-ing together precomputed bit masks of half-planes to get triangle rasterization -> AO on triangle meshes rendered into a bit mask.
    Linky: https://mediatech.aalto.fi/~timo/publications/kautz2004egsr_paper.pdf

  6. Jaakko L says:

    Cool!

    Got the link from Johan Andersson’s tweet.

    We did bit rasterization on a hemispherical domain for SH illumination using precomputed edge masks in our EGSR 2004 paper: http://www.tml.hut.fi/~jaakko/publications/kautz2004egsr_paper.pdf

    Related GPU algorithm for AO is given by Laine and Karras (EGSR 2010): https://mediatech.aalto.fi/~samuli/publications/laine2010egsr_paper.pdf

    • Charlie says:

      Very nice! I particularly like the spherical projection; it now seems obvious that if you’re just merging prerendered images you can use whatever projection you like, but it never occurred to me to use anything other than flat.

      Of course there are many, many ways to map a sphere or a hemisphere to 2D, going back hundreds of years, so one could presumably spend many happy afternoons trying out the quality/size tradeoffs of different projection techniques. :)

  7. Jaakko L says:

    Whoops, Mmalex had the earlier link already.

  8. This looks great! I did a less optimal version for my JavaScript voxel renderer at:

    http://casual-effects.blogspot.com/2013/05/sol-lewitt-and-voxel-ambient-occlusion.html

    using precomputed occlusion textures based on the local voxel pattern.

  9. Steven An says:

    Hi there! Awesome stuff. I’m just wondering, what are you doing to get the detail in your voxels, like the rock pebbles and stuff? Is that just a hand-made scene, or is it procedurally done with like 3D voronoi cells or something? Cheers!

    • Charlie says:

      Thanks!

      The fine geometry is procedural. Your guess of voronoi cells is right on the mark; that’s both the little pebbles and the big rock clumps. If you look closely you can see some floating rocks.

      Basically there’s a coarse field that comes from the big “logical” voxels; then I modulate it with a “detail” field, then I feed it into a point-based thing that renders the zero isosurface.

    • Steven An says:

      Very cool. Can I ask what the “point-based thing” is? :P Is it a ray marcher or..?

  10. mmalex says:

    Hi Charlie! Been meaning to post this observation for ages and it just came back to me for no good reason. Regarding your reply to Jaakko L about there being different hemisphere projections: indeed! It occurred to me that, because both you and Jaakko et al. were using entirely precomputed masks, the bits of your mask need not represent square pixels at all, and need not be on a regular grid. You can just arrange to perfectly cover the hemisphere any way you like, with no need for projection at all: little patches, one patch per bit. Only the code that precomputes the bitmask lookup table need care what shape these patches are. In the 2004 paper, I think they were for example only using the pixels that fell within the unit circle inside their 32x32 framebuffer, i.e. only 800 pixels were used, wasting 224 needlessly. It’s probably an obvious observation, but one that was missed (or not mentioned) in that classic old paper.
    Anyway, sorry for the super-delayed comment :)

    • Charlie says:

      Right! It does seem obvious in retrospect, but I totally missed that for the longest time.

      Actually you have me thinking again about what a “good” projection might be, and it occurred to me that you could use the “projection” of “256 rays importance-sampled over the SH basis.” With the sampling set up properly, the conversion to SH would then be nothing but counting zero bits in the registers.
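
      Something like this, say (a sketch for a non-negative basis term like the DC band; a signed band would need separate counts for its positive and negative lobes):

      #include <cstdint>
      #include <nmmintrin.h>  // _mm_popcnt_u64 (SSE4.2)

      // 256 rays importance-sampled against one basis function: the
      // coefficient reduces to counting the visible (zero) bits.
      float CoefficientFromBits(const uint64_t bits[4], float perRayWeight)
      {
          uint64_t occluded = _mm_popcnt_u64(bits[0]) + _mm_popcnt_u64(bits[1])
                            + _mm_popcnt_u64(bits[2]) + _mm_popcnt_u64(bits[3]);
          return (float)(256 - occluded) * perRayWeight;
      }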
