Power of XBOX360 - raw triangle power

Hi,

I was wondering whether it is possible to compare the graphics power of the XBOX360 to any of the current gfx cards, say, whether it's capable of rendering the same number of triangles as e.g. a GF6800. The old XBOX was usually compared to a GF3 in terms of graphics performance.

The reason I ask is that I would like to know the general boundaries of the raw gfx power, since I'm planning on converting some of my current games to XNA (and later to XBOX360) as soon as it becomes available.

Currently I'm pushing the GF6800 card to its limits, which means I'm rendering in the range of 0.5M-1.0M triangles (all unique, non-instanced) per frame, and it seems that the GF6800 starts to slow down when it crosses the 1M barrier. No pixel shaders there, just Vertex Shaders (mainly for decompression and lighting).

Obviously, the full power can be gained only by using all cores of the XBOX360, but at least with a single thread, is it safe to say that 1M triangles per frame (just vertex-shaded, no pixel shaders, unique, not instanced) is the lower bound of the available scene complexity?

Thanks

[1097 byte] By [VladR] at [2007-12-23]
# 1
If you search the WWW for something like "XBox 360 specs" you should be able to find theoretical maximum rates. If you compare those to the specs of the card you're using now, you should be able to get some idea of the actual performance you might get on the XBox 360.
RossRidge at 2007-8-30 > top of Msdn Tech,Game Technologies: DirectX, XNA, XACT, etc.,Game Technologies: General...
# 2

Thanks, I found that nVidia states 400M vertices per second for the GF6800 (and 600M vertices for the 6800 Ultra).

Is it safe to assume that they count vertices as triangles here too? Probably because of triangle strips, where only 1 new vertex has to be transformed for each triangle, right?
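
A rough sketch of that arithmetic in Python, assuming one long strip with no restarts and taking the quoted 400M vertices/s at face value. The function names and the 60 fps target are made up for illustration; the result is a theoretical ceiling, not a measured number.

```python
def strip_vertices(num_triangles: int) -> int:
    """Vertices needed for one continuous triangle strip (no restarts)."""
    if num_triangles <= 0:
        return 0
    # The first triangle costs 3 vertices, every further one only 1.
    return num_triangles + 2


def triangles_per_frame(verts_per_second: float, fps: float,
                        verts_per_triangle: float) -> float:
    """Per-frame triangle ceiling for a given transform rate."""
    return verts_per_second / fps / verts_per_triangle


# Quoted GF6800 figure, ~1 vertex per triangle (long strips), assumed 60 fps.
budget = triangles_per_frame(400e6, 60.0, 1.0)
print(f"{budget / 1e6:.2f}M triangles per frame (theoretical ceiling)")
```

For long strips the vertex count approaches the triangle count, which is why a vertex rate and a triangle rate can end up quoted almost interchangeably.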

Now the only remaining question is whether you need to use all threads/cores for such performance (just a benchmarking scenario, no real-world game), or whether it's possible with just a single thread and 1 call to the DrawIndexedPrimitive() function.

VladR at 2007-8-30
# 3

They would count vertices as transformed vertices. You should be able to figure out how many vertices you have per triangle in your scene. The complicating factor is what your vertex cache miss rate is. You also need to find out what the limiting factor in your application is. From the sound of things, it's not the vertex transformation rate of your card.

I believe the XBox 360 graphics specs are only for the graphics chip itself and are independent of the CPU.

RossRidge at 2007-8-30
# 4

Vertex cache miss rate is minimal, especially with terrain. While with a tri-strip you can only get an efficiency (transforms per triangle) of 1.01, with a triangle list it's easy to get an efficiency of 0.612, meaning I can render 3024 triangles with only 1853 transforms (I once had the idea of a chunk pattern fitting into the cache and started playing with it). It's even possible to get an efficiency of 0.52 by priming the vertex cache, but I haven't implemented that yet.
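
A minimal sketch of how such a transforms-per-triangle figure can be measured offline, assuming the post-transform cache behaves like a simple FIFO. The 24-entry size, the row-major grid layout and both function names are assumptions for illustration, not the chunk pattern described above and not a documented property of any particular GPU.

```python
from collections import deque


def transforms_per_triangle(indices, cache_size=24):
    """Simulate a FIFO post-transform cache; return transforms / triangles."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    transforms = 0
    for idx in indices:
        if idx not in cache:          # cache miss: vertex must be transformed
            transforms += 1
            cache.append(idx)
    return transforms / (len(indices) // 3)


def grid_indices(cols, rows):
    """Indexed triangle list for a cols x rows quad grid, row-major order."""
    stride = cols + 1                 # vertices per grid row
    out = []
    for y in range(rows):
        for x in range(cols):
            a = y * stride + x        # top-left corner of the quad
            b = a + 1                 # top-right
            c = a + stride            # bottom-left
            d = c + 1                 # bottom-right
            out += [a, b, c, b, d, c]
    return out


# Narrow strips of quads tend to reuse the previous row while it is still cached.
print(transforms_per_triangle(grid_indices(8, 64), cache_size=24))
```

Trying different strip widths against the assumed cache size shows how strongly the ratio depends on the traversal order rather than on the mesh itself.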

And I would really like to spend at least 200k tris on the terrain+~75k on trees. Such a terrain (with appropriate procedural texturing) is finally quite pretty. Remaining ~100k can be spent on environment+characters.

If the raw transform power is independent of the other cores, then it's great, since it's possible to let the other cores handle the decompression of the terrain and the textures.

VladR at 2007-8-30
# 5
It's impossible to take rough PR stats and calculate from them. Those NVIDIA numbers are 'stretching' the truth (not exactly false, but...); the max PC triangle rate on the best video card money can buy is ~200M tris/s.

You need to look at your graphics dataflow, determine your bottlenecks, and then check that against real data from the GPU you're working on.

Given you said vertex shaders and no (or, more correctly, very simple) pixel shaders, you're likely going to be bound (assuming static data) by either vertex transform rate, triangle setup or attribute transfer...

In all three areas (IIRC) the X360 will perform better than a GF6800. The ALU architecture allows the X360 to perform very well on vertex-heavy/low-pixel-shader work, and it also has a high setup rate.

Of course, all this is independent of the CPU on both the GF6800 and the X360 GPU.

And the disclaimer: this is all a simplification. You haven't given enough detail on render target resolution, AA settings, or dynamic/static generation of data. Also be very careful about benchmark data, especially on X360 (because of ALU sharing and shared memory bandwidth).

DeanoCalver at 2007-8-30
# 6

OK, to be more specific, all data would be in static vertex/index buffers. The Vertex Shader has just about 30 instructions (just fog, normal, UV1/2 and decompression of all parameters). However, I'm using streams, and it's said that those incur an additional 10% performance hit (on common consumer-level cards).

That's probably going to be true on XBOX too, since the architecture is similar. However, streams are necessary, since they save about 48 MB of Vertex Buffer data that would otherwise repeat with each terrain chunk: there's no need for terrain vertices to have Position and UV duplicated 4 million times (or at least for the visible cached area that's updated every few seconds). Since compressed Position and UV take 10 Bytes, with 4M vertices on terrain that would be 40 MB of VRAM wasted; add 8 MB for padding and we have 48 MB wasted without using separate streams. That's worth a 10% performance hit, isn't it?
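
The memory figure can be checked with a few lines of Python. The 2 bytes of padding per vertex is an assumption derived from the numbers above (8 MB over 4M vertices), MB is taken as 10^6 bytes, and the names are made up for illustration.

```python
VERTEX_COUNT = 4_000_000   # terrain vertices, from the post
SHARED_BYTES = 10          # compressed Position + UV per vertex, from the post
PADDING_BYTES = 2          # assumption: 8 MB of padding / 4M vertices


def duplicated_megabytes(vertex_count, shared_bytes, padding_bytes):
    """VRAM spent on per-vertex data that a shared stream would store once."""
    return vertex_count * (shared_bytes + padding_bytes) / 1_000_000


print(duplicated_megabytes(VERTEX_COUNT, SHARED_BYTES, 0))              # data only
print(duplicated_megabytes(VERTEX_COUNT, SHARED_BYTES, PADDING_BYTES))  # with padding
```

The real saving is slightly smaller than the second number, since the shared Position/UV stream still has to be stored once for a single chunk.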

Pixel shading is just simple multitexturing with 2 textures; the rest is in the vertex colors. Also, AA should affect only fillrate, not transform rate. Or is the XBOX360 a sort-of DX10 gfx chip where there aren't specialized vertex/pixel units and all work is distributed equally to all available "general" units?

As for the shared memory bandwidth: does that mean that Vertex/Index data are moved each frame from main shared memory? That would seem slow, but maybe there are much faster RAM modules which compensate for the disadvantage.

Thanks for info

VladR at 2007-8-30
# 7
I am not a technical guy, but according to the spec sheet the Xbox360 uses 48 unified shader pipelines. So if you are only doing vertex operations, all 48 pipelines will be working on that. It is better to put in textures to test the real power.
magicalclick at 2007-8-30
# 8

Yeah, it's unified, I just checked the specs again. I must have overlooked that when reading the specs in a hurry previously. However, that might mean that the official polygon performance of 500Mtris/s is reached only when all 48 pipelines are working on vertices, and that would mean no texturing at all. So it seems that in the case of basic multitexturing about 40 pipes would be working on vertices (just guessing though), and that's about ~420Mtris/s. And that may be the optimistic number. So my plan of using a terrain that has 300k in the frustum (after LOD) is slowly starting to look impossible.

But I just realized that games are supposed to run at 60fps (aren't they?), so that almost doubles that amount of polygons (I still hope for ideal parallelism of all cores with the GPU), so all is not lost.

VladR at 2007-8-30
# 9
Multiple streams are often not as bad as assumed. There is usually a small (possibly negligible) DMA overhead per stream (note this might not apply to X360, where the mem controller is the GPU...). The real slowdown comes from potentially worse pre-transform cache performance. If you can optimise for this, it is possible for multiple streams to be free, and they can actually win in multi-pass situations, due to being able to turn off attribute fetches.

X360 has a pool of 48 ALUs that are distributed to vertex and pixel shaders at run-time by the system. There is no current PC card with a similar architecture; all PC chips currently have a fixed allocation of vertex and pixel shader ALUs.

Yep, except for framebuffer writes, all reads/writes on X360 by CPU or GPU go to and from the same memory pool... However, it's fast RAM, so hopefully not a problem...

Resolution and AA can affect triangle rate due to pipeline bubbles. There are small caches between the various stages on the GPU. If these fill up because a stage further down the line is unable to process its work fast enough, it will cause stalls all the way up the pipe.

DeanoCalver at 2007-8-30
# 10
The official triangle count figure isn't related to shader power but triangle setup rate.

The maximum vertex rate doing a WVP transform would be 48 * 500MHz / 4 (4 ops for a vector/matrix mul in a shader) = 6 billion verts per second with no pixel shading (i.e. back-face culled). The triangle setup rate of 1 triangle per clock is hit before this is possible to achieve, hence 500Mtris/s max.
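
The two limits can be put side by side in a few lines of Python. This only encodes the simplified model from this post (48 ALUs, 500MHz, 4 ops per vertex, 1 triangle set up per clock); the constant and function names are made up for illustration.

```python
ALU_COUNT = 48                 # unified shader ALUs
CLOCK_HZ = 500e6               # GPU clock
SETUP_TRIS_PER_CLOCK = 1       # triangle setup throughput


def max_vertex_rate(ops_per_vertex):
    """Vertices per second if every ALU does nothing but vertex work."""
    return ALU_COUNT * CLOCK_HZ / ops_per_vertex


def max_setup_rate():
    """Triangles per second the setup stage can accept."""
    return CLOCK_HZ * SETUP_TRIS_PER_CLOCK


verts = max_vertex_rate(4)     # bare WVP transform
tris = max_setup_rate()
print(f"vertex limit: {verts / 1e9:.1f} billion verts/s")
print(f"setup limit:  {tris / 1e6:.0f}M tris/s")
```

Since the setup limit is more than an order of magnitude below the vertex limit, a bare transform never gets close to saturating the ALUs in this model.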

DeanoCalver at 2007-8-30
# 11

DeanoCalver wrote:
The official triangle count figure isn't related to shader power but triangle setup rate.

The triangle setup rate of 1 triangle per clock is hit before this is possible to achieve, hence 500Mtris/s max.

So, if I understand it correctly, it's the "Triangle Setup" stage (of the whole rendering pipeline) that's the limiting factor in MTris/s, right? Simply put, no matter how many vertices are already transformed, this stage ("Triangle Setup") can only process 1 triangle per clock, right?

DeanoCalver wrote:
The maximum vertex rate doing a WVP transform would be 48 * 500MHz / 4 (4 ops for a vector/matrix mul in a shader) = 6 billion verts per second with no pixel shading (i.e. back-face culled).
But is it possible for each of the 48 pipes to be processing a different vertex? Otherwise I can't see how 48 pipes would be competing to process a single vertex (in case there are no pixel shaders).

Or maybe that's the reason why many effects are essentially "free" on XBOX360, since it's the only way (lots of vertex AND pixel shaders at any time) to make all 48 pipes work efficiently?

Is there any official document (not under NDA) on how the work is delegated among all the pipes?

So, from a purely theoretical standpoint, if my Vertex Shader has 30 instructions, the maximum vertex rate would be (48 * 500MHz) / 30 = 800MVerts/s. Add to that some texturing, and we're approaching those 500MTris/s. And only if all 48 pipes are working at once.

But we're getting to an interesting point here. Let's say that with some texturing we can get 600MVerts/s. Since with terrain (indexed tri-lists) it's easy to reach a ratio of 0.627 (transforms per triangle), we could have 956MTris/s. But "Triangle Setup" limits us to 500MTris/s anyway. Am I right in guessing that we can use this free performance for other/better effects (e.g. longer and more complicated shaders), which become affordable because "Triangle Setup" is the bottleneck anyway?
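
As a back-of-the-envelope check of those numbers, here is a small Python sketch using only the simplified model from this thread (48 ALUs at 500MHz, setup capped at 1 triangle per clock) and ignoring pixel shading entirely. The function names, including the "headroom" helper, are hypothetical and only restate the arithmetic; they say nothing about how the real hardware schedules work.

```python
ALU_COUNT = 48
CLOCK_HZ = 500e6
SETUP_LIMIT = CLOCK_HZ          # 1 triangle per clock


def vertex_rate(instructions_per_vertex):
    """Verts/s if all ALUs run the vertex shader."""
    return ALU_COUNT * CLOCK_HZ / instructions_per_vertex


def triangle_rate(verts_per_second, transforms_per_triangle):
    """Tris/s from a vertex rate, capped by triangle setup."""
    unbounded = verts_per_second / transforms_per_triangle
    return min(unbounded, SETUP_LIMIT)


def instruction_headroom(transforms_per_triangle):
    """Instructions per vertex affordable while still feeding setup at full rate."""
    verts_needed = SETUP_LIMIT * transforms_per_triangle
    return ALU_COUNT * CLOCK_HZ / verts_needed


print(f"{vertex_rate(30) / 1e6:.0f}M verts/s for a 30-instruction shader")
print(f"{600e6 / 0.627 / 1e6:.0f}M tris/s before the setup cap")
print(f"{triangle_rate(600e6, 0.627) / 1e6:.0f}M tris/s after the setup cap")
print(f"{instruction_headroom(0.627):.1f} instructions per vertex of headroom")
```

In this idealised model the slack between the vertex limit and the setup limit is what could be spent on longer shaders, but only to the extent the real scene is actually setup-bound.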

BTW, what's the post-transform Vertex Cache like in XBOX360? Is it limited by size (say, there's 768 Bytes reserved for the post-transform cache somewhere), or is it limited, as always, just by the number of transformed vertices (e.g. 24 vertices)?

I ask because I have pretty nice compression (fitting a whole terrain vertex into 12 Bytes (Position, Normal, Alpha, Index) and a character vertex into 8 Bytes (Pos+Normal)), so if there was a post-transform cache architecture that would benefit from a smaller vertex size, I might be able to keep twice as many transformed vertices cached as in the regular case. But that's most probably not the case, and the post-transform cache works here the same way as it does on PC gfx chipsets, right?

Thanks for the detailed explanation Deano, it's really appreciated.

VladR at 2007-8-30
# 12

DeanoCalver wrote:
Resolution and AA can affect triangle rate due to pipeline bubbles. There are small caches between the various stages on the GPU. If these fill up because a stage further down the line is unable to process its work fast enough, it will cause stalls all the way up the pipe.
So, the AA is actually processed through several of those 48 pipes? Then we won't be able to reach 500MTris/s if AA is processed like this. Or isn't the "Triangle Setup" stage influenced by this?

PS: My previous reply didn't go at the end of the thread but is instead two replies above.

VladR at 2007-8-30
# 13
The AA is a separate fixed-function unit (on X360 it's a part of the EDRAM), but because of pipeline bubbles it's still possible to stall much further up the pipe.

Hmm, let's see if I can sketch an 'average' PC GPU pipeline...

[push buffer read]        [ Vertex Fetch ]
        |                        |
        |              [ pre-transform cache]
        |                        |
[ command proc ]                 |
        |              [ Vertex Processing ]
        |                        |
  [Index Fetch]        [ Post Transform Cache]
        |                        |
  [Index Cache]-[ Primitive Assembly (AKA Triangle setup) ]
                     |                        |
      [ Fragment Shader ]   [ Early ROP (z cull, alpha kill) ]
         |          |                   |
[ N Texture units ] |------- [ fragment assembly]
         |                              |
 [ Texture Cache ]          [ Late ROP (MSAA, Alpha) ]
         |                              |
 [Texture Fetch]            [ Framebuffer output ]

That's a vast simplification (no vertex texturing, for example), and the X360 is also significantly different due to it being the 1st generation of unified ALU (so rather than fragment and vertex units it has a pool of ALUs that are allocated where needed, though with a certain granularity... not sure that info is public...). Each stage has a limited amount of FIFO to stop stalls, but any later stage, if it doesn't process its work fast enough, can cause a stall further up... So if you're ROP limited (e.g. using lots of AA or floating-point blending), eventually the ROP buffers will fill, and that stall will then trickle up the pipe, causing a so-called 'bubble' (because it's an empty space in your graphics throughput, just like a bubble is an empty space in water...).
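
The stall behaviour can be illustrated with a toy Python model: one upstream stage that wants to push an item every cycle, one bounded FIFO, and one downstream stage that only drains an item every few cycles. This is a deliberately crude sketch with made-up names and parameters, not a description of how any real GPU stage or buffer is sized.

```python
def simulate(cycles, fifo_depth, consumer_period):
    """Return (items produced, cycles the upstream stage spent stalled)."""
    fifo = 0          # items currently waiting between the two stages
    produced = 0
    stalled = 0
    for cycle in range(cycles):
        # Downstream stage (e.g. a busy ROP) drains one item every
        # consumer_period cycles, if there is anything to drain.
        if fifo > 0 and cycle % consumer_period == 0:
            fifo -= 1
        # Upstream stage tries to push one item per cycle.
        if fifo < fifo_depth:
            fifo += 1
            produced += 1
        else:
            stalled += 1  # FIFO full: the stall propagates upstream
    return produced, stalled


for period in (1, 2, 4):
    produced, stalled = simulate(cycles=100, fifo_depth=4, consumer_period=period)
    print(f"drain every {period} cycle(s): produced={produced}, stalled={stalled}")
```

A deeper FIFO only delays the point at which the upstream stage starts stalling; once it is full, upstream throughput drops to whatever the slow stage can sustain.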

So in practice, things like AA and average triangle size drastically affect things like vertex throughput. Oddly, it's worth noting that small triangles can be just as bad as (if not worse than) big triangles, because the fragment system works in units of quads (2x2 pixel blocks) and any pixels in that quad that aren't covered are just wasted.
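
To put a number on the quad waste, here is a small Python sketch that takes the set of pixels a triangle covers and reports what fraction of the shaded 2x2 quads is useful work. The function name and the pixel-set input are assumptions for illustration; real rasterizers decide coverage by their own sampling rules.

```python
def quad_utilization(covered_pixels):
    """Fraction of shaded pixel slots that are actually covered.

    covered_pixels: iterable of (x, y) integer pixel coordinates.
    """
    pixels = set(covered_pixels)
    if not pixels:
        return 0.0
    # Every touched 2x2 block is shaded as a whole quad (4 pixel slots).
    quads = {(x // 2, y // 2) for x, y in pixels}
    return len(pixels) / (4 * len(quads))


print(quad_utilization([(0, 0)]))                          # one-pixel triangle
print(quad_utilization([(0, 0), (1, 0), (0, 1), (1, 1)]))  # fills one quad
print(quad_utilization([(1, 1), (2, 2)]))                  # two pixels, two quads
```

A triangle that covers a single pixel still costs a full quad, so three quarters of the fragment work for it is thrown away.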

Each unit has a specified speed (mostly not public)... Different architectures share some bits (X360 shares ALUs, the latest Intel DX10 IGP shares ALUs with triangle setup!). There are many hidden bottlenecks (things like post-vertex attribute interpolation), but it's easy to see why triangle setup is the hub of everything...

Sorry, that's a heavy post and quite vague, but hopefully helpful... modern GPUs are extremely complex...

DeanoCalver at 2007-8-30
# 14

Excellent info, Dean! Obviously, if I had a picture of the whole XBOX360 architecture, it would all be more obvious. But that's under NDA (AFAIK). So I'm just trying to pick up as much info as possible.

The issue with small triangles is interesting. Causing bubbles while rendering half of the terrain wouldn't be the wisest move, so more LOD is needed (or a LOD that takes the silhouette into account and collapses just the inner triangles). So it's going to be quite challenging to get the most out of the XBOX despite its raw power.

BTW, do you have a link to a more detailed, officially released description of the XBOX360 rendering architecture?

VladR at 2007-8-30