Real-time 3D graphics has come a long, long way since it first arrived on microcomputers less than 20 years ago. We've moved from simple wireframes to texture-mapped polygons with per-vertex lighting—all handled in software. Then, in 1996, custom graphics chips hit the scene in the form of Rendition's Verite and 3dfx's Voodoo Graphics. These cards, especially the Voodoo, were an instant success among gamers, who naturally appreciated the additional graphics power afforded by dedicated hardware. The Voodoo's success in the PC market caught even 3dfx off guard; the chip was originally intended for video arcade machines, not consumer PCs.
PC platform custodians like Intel and Compaq never saw the 3D tide coming. Unlike nearly every new PC feature we've seen in the past ten years, 3D graphics hardware was not a "must have" feature integrated and promoted by PC OEMs as a means of driving demand for new systems. (Intel, of course, would have preferred to do graphics processing work on the CPU.) At first, Voodoo cards sold primarily as retail, aftermarket upgrades. PC builders caught on pretty quickly, but in truth, raw consumer demand pushed dedicated 3D graphics chips into the mainstream.
Since the Voodoo chip, graphics hardware has skyrocketed to a position of prominence on the PC platform that rivals that of the CPU. During this time, graphics ASICs have moved from relatively simple pixel-filling devices into much more complex vertex and pixel processing engines. By nature, graphics lends itself to parallel processing, so graphics chips have been better able to take advantage of Moore's Law than even CPUs. Moore predicted exponential increases in transistor counts, and graphics chips have followed that progression like clockwork. Consider ATI's chips: there were about 30 million transistors in the original Radeon chip, roughly 60 million in the Radeon 8500, and about 110 million transistors in the new Radeon 9700. Desktop CPUs haven't advanced at anything near that pace. The resulting increases in graphics performance have been staggering.
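To put a number on that progression, here's a quick back-of-the-envelope calculation. The transistor counts come from the paragraph above; the roughly two-year gap between the original Radeon and the Radeon 9700 is my own approximation, not a figure from ATI.

```python
import math

# Transistor counts cited above (approximate, in millions)
chips = {"Radeon": 30, "Radeon 8500": 60, "Radeon 9700": 110}

# Assumed span between the original Radeon and the Radeon 9700
# (my approximation: roughly two years)
years = 2.0

growth = chips["Radeon 9700"] / chips["Radeon"]          # ~3.67x overall
doubling_time = years * math.log(2) / math.log(growth)   # in years

print(f"overall growth: {growth:.2f}x")
print(f"implied doubling time: {doubling_time * 12:.0f} months")
```

Under those assumptions, transistor counts are doubling roughly every year, a clip even faster than the classic Moore's Law cadence.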
In the early years, graphics hardware incorporated new features one by one, adding custom circuitry to support a particular graphics technique, like environmental bump mapping or cubic environment mapping. What's more, each one of these techniques was a hack, a shortcut used to approximate reality. But as time passed, graphics chips developed into GPUs, incorporating more programmability and allowing developers to replace some of their hacks with more elegant approximations of reality—much cooler hacks, or shortcuts that cut fewer corners.
The progress of consumer graphics ASICs has shattered the traditional order of the graphics world. Long-time high-end leader SGI nearly imploded a few years ago, and the ranks of companies like NVIDIA and ATI are populated heavily with ex-SGI engineers. Consumer "gaming" chips have developed the necessary performance and internal precision to compete—in rebadged forms as Quadros and FireGL cards—against workstation stalwarts like 3DLabs' Wildcat line. And heck, 3DLabs recently turned the tables, getting itself bought out by Creative in order to fund a move into the consumer market with its P10 chip.
The next frontier
Consumer graphics chips have come a long way, but they haven't yet supplanted general-purpose microprocessors and software renderers for the high-quality graphics now used commonly in cinematic production. The sheer complexity and precision of rendering techniques used by professional production houses—not to mention the gorgeous quality of the resulting images—have kept the worlds of consumer graphics and high-end rendering apart.
Of course, the graphics chip companies have frequently pointed to cinematic-style rendering as an eventual goal. NVIDIA's Jen-Hsun Huang said at the launch of the GeForce2 that the chip was a "major step toward achieving" the goal of "Pixar-level animation in real-time". But partisans of high-end animation tools have derided the chip companies' ambitious plans, as Tom Duff of Pixar did in reaction to Huang's comments at the GeForce2 launch. Duff wrote:
`Pixar-level animation' runs about 8 hundred thousand times slower than real-time on our renderfarm cpus. (I'm guessing. There's about 1000 cpus in the renderfarm and I guess we could produce all the frames in TS2 in about 50 days of renderfarm time. That comes to 1.2 million cpu hours for a 1.5 hour movie. That lags real time by a factor of 800,000.)
Do you really believe that their toy is a million times faster than one of the cpus on our Ultra Sparc servers? What's the chance that we wouldn't put one of these babies on every desk in the building? They cost a couple of hundred bucks, right? Why hasn't NVIDIA tried to give us a carton of these things? -- think of the publicity milage [sic] they could get out of it!
Duff had a point, and he hammered it home by handicapping the amount of time it would take NVIDIA to reach such a goal:
At Moore's Law-like rates (a factor of 10 in 5 years), even if the hardware they have today is 80 times more powerful than what we use now, it will take them 20 years before they can do the frames we do today in real time. And 20 years from now, Pixar won't be even remotely interested in TS2-level images, and I'll be retired, sitting on the front porch and picking my banjo, laughing at the same press release, recycled by NVIDIA's heirs and assigns.
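Duff's back-of-the-envelope figures hold up. Here's a quick check of his arithmetic, using only the numbers from his posts above:

```python
import math

# Duff's renderfarm estimate: ~1000 CPUs running for ~50 days
cpu_hours = 1000 * 50 * 24          # 1.2 million CPU-hours
movie_hours = 1.5                   # running time of the movie
slowdown = cpu_hours / movie_hours  # factor behind real time

# His Moore's-Law handicap: a factor of 10 every 5 years, even
# granting the graphics hardware an 80x head start
remaining = slowdown / 80           # gap still left to close
years = 5 * math.log10(remaining)

print(f"lag behind real time: {slowdown:,.0f}x")
print(f"years to close the gap: {years:.0f}")
```

The 800,000x figure and the 20-year horizon both fall straight out of his stated assumptions.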
Clearly Pixar-class rendering was out of the chip companies' reach—at least, that was the thinking at the time.
Bringing the horizon closer
But Duff wasn't counting right when he implicitly equated general-purpose UltraSPARC processors with dedicated graphics ASICs. Later that same year at the Siggraph CG conference, Mark S. Peercy and his colleagues from SGI presented a paper entitled Interactive Multi-Pass Programmable Shading. Peercy's paper provided the missing link between consumer graphics chips and cinematic CG by proving that any OpenGL-capable chip could produce high-end rendering techniques through multi-pass rendering. Essentially, Peercy showed how even highly complex effects could be broken down into manageable, bite-sized steps—steps even a GeForce2 could process.
Peercy's paper demonstrated precisely how to translate complex shader code—in this case, programs written in the RenderMan Shading Language used at places like Pixar—into OpenGL rendering passes using a compiler. The compiler would accept RenderMan shading programs and output OpenGL instructions. In doing so, Peercy was treating the OpenGL graphics accelerator as a general SIMD computer. (SIMD, or Single Instruction Multiple Data, is the computational technique employed by CPU instruction set extensions like MMX and SSE. SIMD instructions perform the same operation on whole arrays of data simultaneously.) If you know a little bit about CPUs and a little bit about graphics, the notion makes sense as Peercy explains it:
One key observation allows shaders to be translated into multi-pass OpenGL: a single rendering pass is also a general SIMD instruction—the same operations are performed simultaneously for all pixels in an object. At the simplest level, the framebuffer is an accumulator, texture or pixel buffers serve as per-pixel memory storage, blending provides basic arithmetic operations, lookup tables support function evaluation, the alpha test provides a variety of conditionals, and the stencil buffer allows pixel-level conditional execution. A shader computation is broken into pieces, each of which can be evaluated by an OpenGL rendering pass. In this way, we build up a final result for all pixels in an object.
Peercy goes on to explain how the OpenGL "computer" handles data types, arithmetic operations, variables (which are stored in textures) and flow control. It's heady stuff, and it works. Peercy's demonstration compiler was able to produce output nearly identical to RenderMan's built-in renderer.
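To make the idea concrete, here's a toy emulation of Peercy's observation. This is my own illustration, not code from the paper: each "rendering pass" applies one operation uniformly to every pixel in a framebuffer, so a shader is just a sequence of such passes.

```python
# A framebuffer is per-pixel storage; each "pass" applies one
# operation to every pixel at once -- in effect, a SIMD instruction.
def render_pass(framebuffer, texture, blend_op):
    return [blend_op(dst, src) for dst, src in zip(framebuffer, texture)]

# Toy shader: out = base * light + glow, decomposed into two passes,
# the way a compiler might break a shading expression into OpenGL
# blending operations.
base  = [0.2, 0.5, 0.8]   # one color channel of three pixels
light = [1.0, 0.5, 0.25]
glow  = [0.1, 0.1, 0.1]

fb = base
fb = render_pass(fb, light, lambda d, s: d * s)  # modulate blend
fb = render_pass(fb, glow,  lambda d, s: d + s)  # additive blend
print(fb)
```

Swap the Python lists for texture buffers and the lambdas for blend modes, and you have the skeleton of Peercy's OpenGL "computer."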
Peercy compiled RenderMan Shading Language to OpenGL, but the same principle could be applied to other high-level shading languages and other graphics APIs, like Direct3D. The implications were simple but powerful: this method would enable consumer graphics chips to accelerate the rendering of just about "anything." Even if the graphics chips couldn't handle all the necessary passes in real time, they could generate the same output far faster than even the speediest general-purpose microprocessor.
Crossing the threshold
Graphics chips could get close to rendering "anything," at least. Peercy noted two capabilities needed in graphics hardware in order to run nearly all general shading routines with relative ease: higher-precision data types and what he called "pixel texture." The latter of those capabilities arrived with the Radeon 8500 in the form of pixel shaders capable of dependent texture operations. NVIDIA extended its pixel shaders to incorporate this capability in the GeForce4 Ti chip, and newer cards like Matrox's Parhelia have this feature, as well.
ATI's "natural light" demo shows off the effects possible with Radeon 9700's extended dynamic range
The former of those capabilities is set to arrive on the scene with ATI's R300—now powering the Radeon 9700—and NVIDIA's forthcoming chip, code-named NV30. Both of these next-gen chips can process 128-bit floating-point datatypes for increased precision and extended dynamic range. Extended dynamic range will allow lifelike representations of subtle lighting effects and soft color tones. Increased precision will allow the chips to render complex shader techniques without accumulating error through multiple rendering passes. (Color artifacting with 32-bit color isn't so bad through one or two rendering passes, but it becomes a real problem after the same data makes 10 to 20 passes through the pipeline.)
Some recently released chips, like Matrox's Parhelia and 3DLabs' P10, offer increased color precision by assigning 10 bits to each red, green, and blue color channel at the expense of precision in the alpha (transparency) channel. These solutions are sometimes helpful, but they can't match true floating-point datatypes. The inability to represent fractions makes some types of math difficult, and quantization error is a problem. Plus, some applications will want to use the alpha channel, and two bits of alpha is hardly enough.
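The cumulative error problem is easy to demonstrate. In this sketch of mine (the 3% per-pass scaling is an arbitrary stand-in for shader math), one value is rounded to an 8-bit grid after every pass while a floating-point reference is not, and the two drift apart as the passes pile up:

```python
def quantize(x, bits=8):
    """Round a [0, 1] color value to an n-bit integer grid."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels

value_int, value_fp = 0.5, 0.5
for _ in range(20):                          # 20 rendering passes
    value_int = quantize(value_int * 1.03)   # round-trip through 8-bit storage
    value_fp = value_fp * 1.03               # stay in floating point

error = abs(value_int - value_fp)
print(f"8-bit path: {value_int:.6f}  float path: {value_fp:.6f}")
print(f"error after 20 passes: {error:.6f}")
```

One or two passes of rounding are invisible; twenty passes produce error on the order of whole color steps, which is exactly the banding problem floating-point pipelines avoid.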
So ATI's R300 and NVIDIA's NV30 will comprise the first generation of dedicated graphics chips capable of cinematic quality shading. They won't be capable of rendering all of the best effects seen in recent movies with all of the detail in each scene in real time, but they should be able to deliver some exceptionally compelling graphics in real time. Gamers had better hold on to their seats once games that use these chips arrive. And these chips will challenge entire banks of servers by rendering production-quality frames at near-real-time speeds. Graphics guru John Carmack's recent Slashdot post on the subject anticipates replacing entire render farms with graphics cards within 12 months:
Note that this doesn't mean that technical directors at the film studios will have to learn a new language -- there will be translators that will go from existing languages. Instead of sending their RIB code to the renderfarm, you will send it to a program that decomposes it for hardware acceleration. They will return image files just like everyone is used to.
Multi chip and multi card solutions are also coming, meaning that you will be able to fit more frame rendering power in a single tower case than Pixar's entire rendering farm. Next year.
I had originally estimated that it would take a few years for the tools to mature to the point that they would actually be used in production work, but some companies have done some very smart things, and I expect that production frames will be rendered on PC graphics cards before the end of next year. It will be for TV first, but it will show up in film eventually.
ATI has already stated that its R300 chip can work in parallel configurations using as many as 256 chips.
Not only that, but artists and animators should be able to preview effects and animation sequences instantaneously, tweaking and fiddling with them in real time on a workstation equipped with a single or multi-chip AGP card.
So Pixar had better be ready to receive its carton of graphics cards. Only two years after Tom Duff laughed out loud at NVIDIA's ambitions, graphics chip makers are on the brink of reaching their goal of producing Hollywood-class graphics on a chip.
With this set of advances, we can see the way forward in graphics hardware more clearly than before. As ATI's Rick Bergman noted at the Radeon 9700 launch, what comes next will largely build on the foundation established by this new generation of chips. Future chips will be faster versions of what we have now, and the additional power will make ever more complex scenes and rendering techniques possible in real time. Newer generations of chips will further clean up the rendering pipeline, bringing more mathematical precision to any remaining pockets of lower precision. And although most of the advancements we've talked about are related primarily to pixel shading, vertex processing power will continue to mushroom, as well.
In fact, Peercy's prescient insights about multi-pass rendering's possibilities may prove to be only a way station to the future of real-time graphics. Already, graphics chip makers have introduced programmability into their chips in the form of pixel and vertex shaders, making rendering in multiple passes less necessary. The move toward general programmability on graphics chips has begun in earnest.
But before I get too far into such things, we should turn our attention to the next-gen graphics chips arriving soon and see more precisely how they fit into the picture.
The DirectX 9 generation
Microsoft's relationship with graphics chip makers is unique, because the chip companies help define the Direct3D API layer that Windows programs will use to talk to their chips. At the same time, Microsoft's specification helps define the feature sets of the chips themselves. It's a give-and-take process, and in some cases, one graphics chip maker is giving and another is taking. For instance, NVIDIA made a lot of noise about playing a large part in the genesis of DirectX 8. At the time, NVIDIA was developing the GPU for Microsoft's Xbox alongside the DX8-capable GeForce3 chip. In doing so, NVIDIA was helping move the industry forward and defining a standard developers could use to write applications for its chips. However, NVIDIA was also giving away much of its graphics technology.
Months later, ATI came along behind with its Radeon 8500 chip, which integrated nearly all of DirectX 8's (and the GeForce3's) functionality. In fact, over the last few generations, ATI's Radeon chips have been nearly dead-on implementations of Microsoft's Direct3D specifications—with a few enhancements here or there. The original Radeon, for instance, was the embodiment of DirectX 7 with a few extensions for features like matrix palette skinning. Likewise, the DX8-class Radeon 8500 introduced a functionally similar, but somewhat improved, pixel shader implementation.
For would-be competitors in graphics, the Direct3D specification offers a roadmap for engineering the logic units necessary to create a "clone" GPU. Today, DirectX 8-class GPUs are available from the likes of Matrox, Trident, and SiS, to name a few.
The first of the DirectX 9 chips available, of course, is ATI's R300. I won't speculate about who played a hand in defining which aspects of the DX9 specification or about how much of the final result is primarily attributable to ATI, NVIDIA, or Microsoft itself. However, some companies have already laid claim to certain technologies. Matrox has taken credit for DX9's displacement mapping capabilities, which were first integrated into its Parhelia chip. NVIDIA seems to have contributed much to Microsoft's High Level Shading Language, which is closely related to NVIDIA's Cg initiative.
Regardless, the final DX9 specification determines, in large part, the actual capabilities of the two truly DX9-compliant chips due on the market this year: ATI's R300 and NVIDIA's NV30. In fact, the DX9 spec is so specific and calls for so much added computational power that there are vast areas of overlap in the new capabilities of these two chips. I've spent a fair amount of time combing through both public and non-public presentations from ATI and NVIDIA and quizzing engineers from both companies, and I have been able to pinpoint some key differences between R300 and NV30. But it wasn't easy. Nearly every time I thought I found a chink in the armor of the R300, an additional clarification from an ATI engineer would assuage my concerns. The NV30 is a little tougher to gauge. NVIDIA has been helpful in answering my questions, but they aren't talking about all the chip's specifics just yet.
For the most part, these chips will differ in implementation more than in features.
Likely common features
I want to talk about the differences between the chips, because they aren't entirely trivial, but we should start with a look at DX9-class capabilities the chips ought to—at least as a baseline, if all goes as planned—have in common. DirectX 9 is a complex beast, but its highlights include:
More precision everywhere — The watchword for DX9 is precision, as you might have gathered by now. DX9 calls for larger, floating-point datatypes throughout the rendering pipeline, from texture storage to pixel shaders, from the Z-buffers to the frame buffers. 128-bit floating-point color is the most precise color mode, but the DX9 spec calls for a range of color formats, including 32, 40, and 64-bit integer modes (with red, green, blue, and alpha channels of 8:8:8:8, 10:10:10:10, and 16:16:16:16), plus 16 and 32-bit floating-point modes.
The additional precision will, of course, help with color fidelity and dynamic range, but it will also help with visual quality in other ways. For example, depth-handling errors will be dramatically reduced with the addition of floating-point Z buffers. Bump mapping elevations will increase (and quantization error will be reduced) with added precision for normal maps.
Pixel shader 2.0 — DX9 pixel shaders will take steps toward general programmability while exposing more detail about a chip's underlying capabilities. Since high-level shader languages may compile to DX9 API calls, these changes make sense. At the same time, DX9 pixel shaders will offer floating-point precision. These new pixel shaders incorporate a number of powerful vector and scalar instructions, including log, exponent, power, and reciprocal.
DX9 pixel shaders should be able to intermix as many as 32 texture address operations and 64 math operations arbitrarily. That is, they will be able to execute 32 address and 64 math operations in a single rendering pass, without having to resort to multipass rendering. With this kind of power, reasonably complex shader effects should be possible in only one or two passes.
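Those per-pass limits translate directly into pass counts. A small sketch of the arithmetic, using the DX9 limits just mentioned (the helper function is my own, not part of any API):

```python
import math

def passes_needed(addr_ops, math_ops, addr_limit=32, math_limit=64):
    """Rendering passes a shader needs, given DX9 per-pass limits."""
    return max(math.ceil(addr_ops / addr_limit),
               math.ceil(math_ops / math_limit))

# A modest shader fits in one pass...
print(passes_needed(10, 40))    # 1
# ...while a heavyweight shader has to go to memory and back.
print(passes_needed(40, 100))   # 2
```

Every increment in that result means another full write to and read from the framebuffer, which is why the per-pass limits matter.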
Vertex shader 2.0 — Vertex shaders aren't as dramatically improved in DX9 as pixel shaders, but they do gain one powerful improvement: flow control. In this iteration, vertex shaders gain the ability to handle vertex programs with jumps, loops, and subroutines. Vertex shader units will add more registers and constants in order to make that happen. Improved flow control will allow for much more efficient vertex programs that reuse code and the like.
Higher-order surface handling improvements — DX9 adds several new tricks to its arsenal here. Displacement mapping is similar to bump mapping, but it actually modifies the geometry of an object, adding real polygon detail. Bump mapping just fakes it with lighting. Microsoft is working on tools to take a high polygon model (say, for a game character) and generate a low-poly model and a displacement map for it. When mated up again on the other side of the AGP bus, these two elements could create a high-poly model yet again—much more efficient than sending the whole high-poly model over the bus.
DX9 also brings adaptive tessellation to displacement maps and other models, and does so with floating-point precision. This ability to cut out unneeded polygon detail as objects move away from the viewer will allow more complex scenes to be rendered faster.
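At its core, displacement mapping just pushes each vertex along its normal by a height sampled from the map. A minimal sketch of that step, with made-up data of my own:

```python
def displace(vertices, normals, heights, scale=1.0):
    """Move each vertex along its normal by the sampled displacement."""
    return [tuple(v + scale * h * n for v, n in zip(vert, norm))
            for vert, norm, h in zip(vertices, normals, heights)]

# A flat strip of three vertices, all facing +Z, with a displacement
# map that raises the middle vertex -- real geometry, not a lighting fake.
verts   = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
norms   = [(0.0, 0.0, 1.0)] * 3
heights = [0.0, 0.5, 0.0]

displaced = displace(verts, norms, heights)
print(displaced)
```

Send the low-poly mesh and the tiny height map across the bus, run this step on the far side, and the detailed surface comes back for a fraction of the bandwidth.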
Multiple render targets — Chips should be able to perform pixel shader operations on multiple color buffers at once. ATI has demonstrated this ability on the R300, rendering several copies of a race car model with several different colors of paint but the same, shiny gloss.
Improved gamma correction — Now that the pixel pipeline is more precise, DX9 includes some provisions for handling gamma correction properly, at least in the pixel shaders. Better gamma handling will help with tuning output brightness while preserving dynamic range.
All of these impressive capabilities should be present in DX9-class graphics chips. Individual implementations, of course, will vary in their scope, quality, and performance. ATI demonstrated many of these features live on the Radeon 9700 at its launch.
Differences between the chips
As I said before, I've read up on both chips and talked to folks from both NVIDIA and ATI in an attempt to understand the similarities and differences between these chip designs. Both designs look very good, and somewhat to my surprise, I've found very few weaknesses in the ATI design, despite the fact it's hitting the market well before NVIDIA's chip. There are some differences between the chips, however, and they point to different approaches taken by the two companies. Most of my attention here is focused on the pixel pipeline, and the pixel shaders in particular, because that's where the key differences seem to be.
For what it's worth, I've boiled down those differences into a table, which does a nice job of encapsulating the issues for us. As is often the case with chips this complex, sticking the specs into a table can't do them full justice. I realize this is a gross oversimplification, and I'm probably totally missing the boat on vertex shader implementations and various other things. But I'll persist, because I think the table is instructive.
                                      R300                               NV30
Pixels per clock                      8                                  8 (rumored)
Textures per pipe per rendering pass  ?                                  ?
Textures per pipe per clock cycle     ?                                  ?
Texture address operations per pass   32                                 ?
Color instructions per pass           64*                                Hundreds
Max pixel shader precision            96-bit floating point              128-bit floating point
Integer color modes                   16, 32, 64 bits (signed/unsigned)  16, 32 bits
FP color modes                        16, 32, 64, 128 bits               64, 128 bits
Early Z occlusion culling?            Yes                                Yes**
Pixel shader ops on video streams?    Yes                                Yes
Max parallel configuration            256 chips                          Unknown
*Up to 128 instructions are possible with carefully arranged data and the use of swizzle
**NVIDIA won't confirm this one, but one of their Siggraph presentations mentions Early Z in next-gen hardware
NVIDIA isn't filling in all the blanks for us yet about the NV30, but we have enough information to see that NV30 allows for the execution of much longer pixel shader programs in a single pass. Now let me give you my caveats about the table above, and then we can talk about the implications.
First, I may not have a complete set of color modes listed for the NV30. I actually expect it to support all the modes I've listed for the R300, which ATI says are all part of the DX9 specification.
Next, I've listed the NV30's potential for working in multi-chip solutions as "unknown," but NVIDIA did confirm for me that it plans to continue to support Quantum3D, which ships multi-chip products based on current NVIDIA chips. I'd expect to see multi-chip NV30-based solutions of this class, as well.
Also, NV30 has long been rumored to have eight parallel rendering pipelines, as R300 does. This configuration only makes sense, and I'd be surprised to see NV30 have fewer than eight pixel pipelines.
Note that NV30 can, like the R300, apply pixel shader effects to video streams. This capacity changes the image processing and video editing game, because Photoshop-like effects can be applied to incoming video streams in real time.
Finally, I've not included any numbers on memory bandwidth here. The Radeon 9700 will launch with DDR memory and a 4-by-64-bit memory interface capable of moving about 20GB/s of data. ATI claims the R300 can address DDR-II memory types, as well. NVIDIA will only say that NV30 can use "DDR-II-like memory." Use of DDR-II type memory has the potential to double memory bandwidth from the current 20GB/s, but only time will tell how this one will play out.
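The 20GB/s figure follows straightforwardly from the interface width and memory clock. In this check, the 256-bit width comes from ATI's stated 4-by-64-bit interface; the roughly 310MHz base memory clock is my own assumption, chosen to match the quoted bandwidth:

```python
bus_bits = 4 * 64             # four 64-bit channels = 256-bit interface
mem_clock_hz = 310e6          # assumed base memory clock (my estimate)
transfers_per_clock = 2       # DDR: two transfers per clock

bytes_per_sec = bus_bits / 8 * mem_clock_hz * transfers_per_clock
print(f"{bytes_per_sec / 1e9:.1f} GB/s")
```

Doubling `transfers_per_clock`'s effective rate via DDR-II-class memory is where the potential doubling of bandwidth would come from.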
Of pixels, floats, loops, and bits
These chips' pixel shaders stand apart for two big reasons: their maximum internal precision and the number of instructions they can process per pass.
The R300's 96-bit precision in its pixel shaders seems to run contrary to the 128-bit color precision available in other parts of the chip. These pixel shaders compute color data in four channels of 24 bits each, or 24:24:24:24 for red, green, blue, and alpha. That's not as good as 128 bits and 32:32:32:32 RGBA, but it's not exactly horrible, either. This is one of those compromises that happen in chip design, but given where we're coming from, I'm having a hard time complaining about 24 bits of floating-point precision per color channel. Still, technically, NVIDIA should have an edge here. I expect ATI to add more precision to its pixel shaders in future hardware. Whether or not it matters in R300 is another story.
The question about the number of pixel shader instructions per pass is more interesting. The ability to execute more instructions per pass probably won't help NV30 in purely DirectX 9 scenarios, because the API only supports 32 address ops and 64 color ops per pass. However, using OpenGL or compiling from high-level shading languages might change the picture. A shading language compiler could compile to an API like DirectX, but it might also compile directly to machine language for the underlying graphics hardware. A good compiler from NVIDIA for NV30 could take full advantage of the extra instructions per pass with complex shader programs. In fact, future performance battles could turn on the quality of compilers for a chip, as they do now in CPUs.
NVIDIA claims this class of shading will be possible in real time with NV30
NVIDIA points out that executing hundreds of shader operations in a single pass places the burden almost entirely on the GPU instead of that old graphics chip bottleneck, memory bandwidth. The R300 might have to resort to multiple passes to render a particularly complex shader, writing the results of the first pass out to memory and reading it back in before going to the next one. With a 128-bit framebuffer, there would be no loss of precision, but the time required for the write to and read from the buffer would slow things down. NV30's ability to execute more shader ops before going to memory is vastly more efficient, in theory.
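It's easy to see why those extra round trips hurt. With a 128-bit framebuffer, each intermediate pass writes the whole buffer out and reads it back; at a resolution I've picked purely for illustration, the traffic looks like this:

```python
width, height = 1024, 768           # example resolution (my choice)
bytes_per_pixel = 128 // 8          # 128-bit floating-point framebuffer

buffer_mb = width * height * bytes_per_pixel / 2**20
round_trip_mb = 2 * buffer_mb       # write the buffer out, read it back
print(f"framebuffer: {buffer_mb:.0f} MB")
print(f"extra-pass round trip: {round_trip_mb:.0f} MB")
```

At 60 frames per second, every extra pass in the shader would add well over a gigabyte per second of memory traffic, which is exactly the bandwidth a single-pass design keeps on the chip.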
Too much is still up in the air for us to know whether these differences in pixel shading power will matter much. Questions linger. When executing a shader program with, say, over 100 instructions, will it really hurt the R300's performance significantly to execute 60-some ops, write to and read from memory, and then execute the rest? More importantly, will the NV30's ability to execute more operations per pass enable it to process new kinds of effects in real time—where R300 can't?
Practical considerations weigh in, too. NV30's apparent advantages in pixel processing power and precision might make it better suited for a render farm, which is great for Quantum3D, but these abilities may mean next to nothing to gamers. Developers tend to target their games for entire generations of hardware, and it's hard to imagine many next-gen games working well on the NV30 but failing on R300 for want of more operations per pass or 128-bit pixel shaders.
And finally, will any of this matter for this generation of hardware? I suspect we'll be at least one generation beyond R300 and NV30 before 32 address ops and 64 math ops become a significant limitation.
But I could be wrong. We'll see. These are issues to watch.
We've talked about NV30's potential leg up on ATI's R300. ATI's advantage right now is much simpler: they have a great chip ready to roll immediately. The Radeon 9700 should simply outclass anything else on the market by miles when it hits store shelves. I've seen the Radeon 9700 demonstrate in real time many of the capabilities discussed above, and I can tell you: it's for real.
NVIDIA's challenge is a little more complex. NVIDIA is banking on making the NV30 superior to the competition by harnessing a pair of new technologies: TSMC's 0.13-micron chip fabrication process and some kind of DDR-II-like memory. ATI played it safe and went with 0.15-micron fab technology for manufacturing the R300. NVIDIA's use of 0.13-micron tech could give NV30 lower costs, less heat, and higher clock speeds, but TSMC's ability to produce 0.13-micron chips in volume is unproven. The same goes for DDR-II memory, which holds promise, but could prove difficult and costly to implement or manufacture. NVIDIA risks delaying NV30 significantly by employing bleeding-edge tech. The payoff could be big, but the Christmas buying season looms, and folks are already wondering whether NV30 will make it to market in time.
Meanwhile, ATI is squeezing NVIDIA on price with its super-cheap Radeon 9000 cards. NVIDIA is set to fire back with yet another series of GeForce2 and GeForce3-based chips branded as GeForce4 products. These chips will offer AGP 8X support but none of the next-gen features of R300 or NV30.
But the business considerations are overshadowed by the amazing reality of what the next-gen graphics chips will be able to do. Hyperbole is a common vice among tech types, and I'm still a little shy after the embarrassing spectacle of the dot-com boom and bust. Still, I have to say it, because I believe it's true: graphics chips are poised to transform the PC as we know it. I can't explain precisely why that is with words, specs, and static screenshots. There is nothing quite so compelling as a compelling visual. Wait until you see one of these chips in action, and then you'll understand.