Hi,
I'm the developer of swShader. I don't see any specific questions here, so I'll just try to explain how exactly swShader works.
First of all, software rendering is fast, blazing fast. Ok, you don't believe me... Well, five years ago we were all playing Quake in software mode at good resolutions and framerates on a Pentium 133. Nowadays processors are at least ten times faster, yet nobody does software rendering any more? You might think that this is because of things like bilinear filtering. Indeed, you need more than a few instructions for that (Quake had four instructions per pixel in the inner loop). But current processors feature instructions that were not available in Quake's day. The main reason is that software rendering until now couldn't keep up with the flexibility of hardware rendering. That's right: the one thing software rendering should be best at, it gets beaten at.
Let me explain. Hardware can support tens of blending modes, filter modes, border modes, you name it. A handful of transistors makes the 'decision' about which modes are active. Software is, in principle, the most flexible you can get, but at a huge price. If you had to implement all those modes, you would have two options: write one version full of control statements, or write every version separately and optimize them manually. The first option is the DirectX reference rasterizer. Even in the inner pixel loop it has hundreds of control statements. This is horribly slow because most of the time you're just 'jumping over code'. The Pentium 4 also has huge penalties for mispredicted jumps. The second option is not very attractive either. We're talking about thousands of combinations here, and every new option doubles that number. Manually optimizing is also not manageable, because when you make a small design change or an optimization, all the other code has to change as well. And I'm still only talking about the fixed-function pipeline here...
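To make the first option concrete, here is a hypothetical sketch of what such an 'everything in one loop' pixel routine looks like (the names and modes are made up for illustration, this is not actual reference rasterizer code):

    #include <algorithm>

    enum BlendMode { BLEND_MODULATE, BLEND_ADD };

    struct RenderState
    {
        BlendMode blend;
        bool fogEnable;
        // ...dozens more switches in a real fixed-function pipeline
    };

    // One color channel, 0..255. Every pixel re-tests state that is
    // constant for the whole triangle, so the branches are pure overhead.
    inline int shadePixel(const RenderState &s, int texel, int diffuse,
                          int fogColor, int fogFactor)   // fogFactor: 0..256
    {
        int color = texel;

        switch(s.blend)
        {
        case BLEND_MODULATE: color = color * diffuse >> 8;           break;
        case BLEND_ADD:      color = std::min(color + diffuse, 255); break;
        }

        if(s.fogEnable)
        {
            color += (fogColor - color) * fogFactor >> 8;
        }

        return color;
    }

Multiply that by every filter mode, border mode, stencil mode and so on, and you can see where the time goes.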
The solution is simple, elegant and mighty powerful. For every combination of render modes, generate the render function. For this purpose I first wrote my own run-time assembler: SoftWire. Its main feature is what I call run-time intrinsics. They are regular functions, with the names of assembly instructions (the whole instruction set). When called, they generate the corresponding machine code for the instruction and place it in a buffer. Once all instructions are generated, I link the external symbols, load the code into memory, and it's ready to be called just like a regular function! Register allocation is resolved at the same time, so I can work with symbolic names instead of figuring out which value is stored in which register. This also ensures optimal register usage.
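If you've never seen run-time code generation, here is a minimal, self-contained illustration of the underlying mechanism: emit raw x86 machine code into a buffer, make it executable, and call it like an ordinary function. SoftWire wraps exactly this behind one function per instruction, plus symbol linking and register allocation. (This sketch assumes a 32-bit x86 build on a POSIX system; on Windows you'd use VirtualAlloc instead of mmap.)

    #include <sys/mman.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        // Machine code for: mov eax, [esp+4] ; add eax, eax ; ret
        // i.e. int doubleIt(int x) { return x + x; } in cdecl
        unsigned char code[] =
        {
            0x8B, 0x44, 0x24, 0x04,   // mov eax, [esp+4]
            0x01, 0xC0,               // add eax, eax
            0xC3                      // ret
        };

        void *buffer = mmap(0, sizeof(code),
                            PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buffer, code, sizeof(code));

        int (*doubleIt)(int) = (int(*)(int))buffer;
        printf("%d\n", doubleIt(21));   // prints 42

        munmap(buffer, sizeof(code));
        return 0;
    }

With run-time intrinsics you never write those hex bytes yourself: calling a function named after the instruction appends them for you.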
In other words, this means I select exactly those instructions that do the actual rendering operations. No more conditional statements in the inner pixel loop. It has disadvantages too, but they are solvable. First of all, I do not generate the functions for each and every render mode combination, just the ones that are needed. Run-time intrinsics are very fast, so generating a few shaders per frame has no influence on performance. I also cache the last hundred shaders, so it's efficient in memory usage too. Another problem with run-time generated code is that you obviously can't hand-optimize it any more. For this I have solutions as well. You are not forced to always let the shader be generated: if you have a very efficient hand-optimized rendering function for one specific, frequently used combination of render modes, you can place that in the cache beforehand. I have also been experimenting with automatic instruction scheduling. Because of the complex processor architectures this isn't easy, but I'm making progress.
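To give an idea of how the caching side could fit together, here is a rough sketch (the key encoding, the capacity and all the names are made up; swShader's actual cache differs in its details):

    #include <list>
    #include <map>

    typedef void (*PixelShader)();
    typedef unsigned long long StateKey;   // render modes packed into bits

    class ShaderCache
    {
    public:
        ShaderCache(unsigned int capacity = 100) : capacity(capacity) {}

        // Returns 0 on a miss; the caller then generates the shader
        // and inserts it. A hit moves to the front of the LRU list.
        PixelShader lookup(StateKey key)
        {
            std::map<StateKey, Entry>::iterator hit = entries.find(key);
            if(hit == entries.end()) return 0;

            order.erase(hit->second.position);
            order.push_front(key);
            hit->second.position = order.begin();
            return hit->second.shader;
        }

        // Assumes the key is not present yet. Also usable to pre-seed
        // a hand-optimized routine for a hot mode combination.
        void insert(StateKey key, PixelShader shader)
        {
            if(entries.size() >= capacity)   // evict the least recent
            {
                entries.erase(order.back());
                order.pop_back();
            }

            order.push_front(key);
            Entry entry = {shader, order.begin()};
            entries[key] = entry;
        }

    private:
        struct Entry
        {
            PixelShader shader;
            std::list<StateKey>::iterator position;
        };

        unsigned int capacity;
        std::list<StateKey> order;           // most recently used first
        std::map<StateKey, Entry> entries;
    };

On a miss you run the run-time intrinsics once, insert the result, and from then on that mode combination is a single map lookup away. (A real cache would also free the evicted code, which this sketch leaves out.)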
So what about real results? Well, here you can find a Quake III renderer in software: Real Virtuality. The performance is not that great, but remember that Quake III was never designed for software rendering! Also, there are still tons of optimization opportunities in that project. For example, only the pixel pipeline is run-time generated; the vertex pipeline and the clipper could benefit from it as well.
And finally we come back to the swShader project. Currently it is just a very successful proof-of-concept for an article I wrote for the upcoming ShaderX 2 Tips and Tricks book. It uses full 32-bit floating-point precision, with SSE instructions everywhere. But the shader instruction translator was written in a hurry, so it could be optimized further as well. And just like in the Real Virtuality project, the vertex pipeline and clipper are still written in plain C++ code. So there's still lots of room for performance increases.
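As a taste of what the SSE path is built from, here's roughly the kind of operation the generated code strings together, shown with compiler intrinsics for readability (swShader emits the raw instructions through SoftWire instead; the function and its names are made up for illustration):

    #include <xmmintrin.h>

    // One SSE multiply modulates four 32-bit float color components
    // (r, g, b, a) at once, at full precision.
    void modulate4(float color[4], const float texel[4], const float diffuse[4])
    {
        __m128 t = _mm_loadu_ps(texel);
        __m128 d = _mm_loadu_ps(diffuse);
        _mm_storeu_ps(color, _mm_mul_ps(t, d));
    }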
I hope this answers most of the questions you had in mind.