VOGONS


First post, by silikone

Rank Member

Though I have found some vague info on this subject, I still find the numbers associated with texture mapping units and pixel pipelines confusing.
Where does one see the strongest benefit (and lack thereof) of using multiple pipelines with one TMU each over having the units paired together?
How are uneven quantities of multitextures handled with an even number of TMUs?

Do not refrain from refusing to stop hindering yourself from the opposite of watching nothing other than that which is by no means porn.

Reply 1 of 11, by agent_x007

Rank Oldbie

With TMUs it's all about what a texture unit can do in a single pass.
Multiple TMUs per ROP pay off when a game layers many texture-based effects on each pixel, so that texture fetches rather than pixel writes limit throughput (for example, games from the Voodoo 2 era up to DX7/DX8).

With DX9 and later, the main bottleneck shifted from raw texture and pixel pushing to the math crunching in increasingly complex shader programs.
That is also why the Radeon X1900 series has more than one pixel shader unit per TMU.

What do you mean by "uneven quantities of multitextures" ?

Reply 2 of 11, by silikone

Rank Member
agent_x007 wrote:

What do you mean by "uneven quantities of multitextures" ?

Multitexturing using three textures when you have something capable of four.

Reply 3 of 11, by Scali

Rank l33t
silikone wrote:

Multitexturing using three textures when you have something capable of four.

That's simple: NOP 😀
You can use however many textures you like, up to the maximum the hardware allows (if anything, that's required for backward compatibility with single-textured software).
So if you have more TMUs than you actually need, they can simply be bypassed.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 4 of 11, by silikone

Rank Member
Scali wrote:
silikone wrote:

Multitexturing using three textures when you have something capable of four.

That's simple: NOP 😀

Yes, but does this then simply waste the resources whether one has one or two TMUs per pipeline?
I'm looking for something like an equation that shows how different combinations operate with different rendering tasks in theory, including the strain put on the memory bandwidth and fillrate. Tests indicate that the latter gets large savings from efficiently leveraging multitexturing, though the results were inconsistent.

Reply 5 of 11, by Scali

Rank l33t
silikone wrote:

Yes, but does this then simply waste the resources whether one has one or two TMUs per pipeline?

Yes, but if you only want one texture, then what else were you going to do with that TMU? 😀
Anyway, in the real world it is far more complicated...
Different generations of hardware use TMUs differently.
For example, early hardware had the TMUs hardwired to the pipeline.
So if you had only one TMU, you could only apply one texture per pass. You couldn't do multitexturing; your only choice was to do multiple render passes and blend the results together, which isn't very efficient.
Early multitexturing hardware, like a GeForce, still had the TMUs hardwired to the pipeline. So you could blend two textures together in a single pass. Of course, if you only used one texture, the other TMU sat there idling.
Later hardware could allocate its TMUs more efficiently. So you could re-use the same TMU multiple times during a pass, with different textures.
A nice example of that is the Kyro II, which could do 8 textures per pass, even though it only had 1 TMU per pipeline.
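
The pass counts described above can be sketched as a small Python model. This is only an illustration: `passes_needed` is a hypothetical helper, and the per-pass limits are the ones mentioned in this post (one texture for single-TMU hardware, two for a GeForce, eight for the Kyro II).

```python
import math


def passes_needed(num_textures: int, textures_per_pass: int) -> int:
    """Render passes required to apply num_textures layers to a surface."""
    if textures_per_pass < 1:
        raise ValueError("hardware must apply at least one texture per pass")
    # An untextured surface still needs one pass to be drawn at all.
    return max(1, math.ceil(num_textures / textures_per_pass))


# Per-pass texture limits as described in the post above.
HARDWARE = [("single-texture card", 1), ("GeForce", 2), ("Kyro II", 8)]

for name, per_pass in HARDWARE:
    counts = [passes_needed(n, per_pass) for n in (1, 2, 3, 4)]
    print(f"{name}: passes for 1..4 textures = {counts}")
```

With three textures on two-texture hardware, the second pass uses only one of its two slots, which is the idle-TMU case described above.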

Reply 6 of 11, by silikone

Rank Member
Scali wrote:

Of course, if you only used one texture, the other TMU sat there idling.

I was under the impression that the GeForce's QuadPipe allowed it to work on multiple adjacent pixels in parallel, rendering a single texture at full speed, while a single-pipeline setup with multitexturing (Voodoo 2) would conversely only gain speed when performing multitexturing. Also, why only two textures per pass on a GeForce when it is supposedly a four-pipeline chip, as NVIDIA's trademark implies?

Reply 7 of 11, by Scali

Rank l33t
silikone wrote:

I was under the impression that the GeForce's QuadPipe allowed it to work on multiple adjacent pixels in parallel

Yes, quadpipe literally means 4 (pixel) pipelines.

silikone wrote:

Also, why only two textures per pass on a GeForce when it is supposedly a four-pipeline chip, as NVIDIA's trademark implies?

You have to separate pipelines and textures here.
A pipeline renders a single pixel. A single pixel may have 0 or more textures applied.
The number of textures you can apply per pixel depends on a combination of the number of TMUs per pipeline, and how often these TMUs can be (re-)used, as mentioned above with the example of the Kyro II.
In the era of fixed-function texturing, this was known as 'texture stages'. The GeForce supports two texture stages. So the pipeline can apply two textures (+ a texture blend function) to a pixel in a single pass.
The ATi Radeon in comparison supports three texture stages. And the aforementioned Kyro II supports 8 texture stages.

If you look here, you see that they mention the 'ratio' of "Pixel pipelines: texture mapping units: render output units"
https://en.wikipedia.org/wiki/List_of_Nvidia_ … orce_256_series
The GF256 is 4:4:4, so it has 4 pixel pipelines, 4 TMUs and 4 ROPs. So you have a straightforward pipeline of 4 pixels at a time, each with a single TMU (which can be re-used for dual texturing in a single pass).
The GF2 is 4:8:4, so they doubled the number of TMUs. This means it has twice the texture throughput, making dual texturing far more efficient. It also allows for more efficient trilinear filtering (where two mipmaps of the same texture are sampled). Wikipedia, however, claims that the GF256 also has that second TMU and can do the same efficient trilinear filtering, and that dual texturing was disabled because of a bug.
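
To make the ratios concrete, here is a rough throughput model in Python. It assumes each TMU delivers one filtered texel per clock, that a pipeline re-uses its TMUs on successive clocks until the textures of a pass are done, and that memory bandwidth is never the limit. None of that is confirmed for real hardware, and `clocks_per_pixel` is a hypothetical helper.

```python
import math


def clocks_per_pixel(textures: int, tmus_per_pipe: int, stages: int) -> int:
    """Clocks one pipeline spends on a pixel, summed over all render passes.

    Assumed model: a pass applies at most `stages` textures, and on each
    clock the pipeline fetches up to `tmus_per_pipe` of them.
    """
    if textures < 1 or tmus_per_pipe < 1 or stages < 1:
        raise ValueError("all arguments must be at least 1")
    clocks = 0
    remaining = textures
    while remaining > 0:
        in_pass = min(remaining, stages)
        clocks += math.ceil(in_pass / tmus_per_pipe)
        remaining -= in_pass
    return clocks


# (pipelines, TMUs per pipeline) taken from the ratios quoted above.
CHIPS = {"GF256 4:4:4": (4, 1), "GF2 4:8:4": (4, 2)}

for name, (pipelines, tmus_per_pipe) in CHIPS.items():
    for textures in (1, 2, 3, 4):
        clocks = clocks_per_pixel(textures, tmus_per_pipe, stages=2)
        rate = pipelines / clocks
        print(f"{name}: {textures} texture(s) -> {rate:.2f} pixels/clock")
```

In this model the 4:8:4 layout spends as many clocks on three textures as on four, because the second pass leaves one TMU idle.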

Reply 8 of 11, by silikone

Rank Member

That does clear some things up. I thought the pipelines with their single TMUs would work together by each processing their own texture, combining the results together just as they are about to be sent to the framebuffer.
So I suppose that the second texture stage is initialized right away, avoiding sending a single-textured pixel and wasting precious memory bandwidth.
Using an equation from NVIDIA themselves, reading the Z-buffer and textures, then writing the final 32-bpp output, should consume 20 bytes per pixel. Nothing else considered, a GeForce 256 DDR would in theory have just enough bandwidth for five full 1024x768 passes at 60 FPS. Assuming that multitexturing helps by reading and writing the framebuffer only once for two textures, the second texture adds a footprint of just four bytes, compared with the 20 it would cost had it been blended in a separate pass. This should enable eight full texture blends at 60 FPS. Is this reasoning valid?
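
The arithmetic above can be checked with a few lines of Python. The 4.8 GB/s peak bandwidth for the GeForce 256 DDR is an assumption (the thread does not state it), `passes_per_frame` is a hypothetical helper, and everything else that consumes memory bandwidth is ignored.

```python
BANDWIDTH = 4.8e9      # bytes per second; assumed peak for a GeForce 256 DDR
PIXELS = 1024 * 768    # pixels written in one full-screen pass
FPS = 60               # target frame rate


def passes_per_frame(bytes_per_pixel: int) -> float:
    """Full-screen passes per frame that the assumed bandwidth can sustain."""
    return BANDWIDTH / (PIXELS * bytes_per_pixel * FPS)


single = passes_per_frame(20)      # one texture per pass, 20 bytes per pixel
dual = passes_per_frame(20 + 4)    # second texture adds 4 bytes per pixel

print(f"single-texture passes per frame: {single:.2f}")
print(f"dual-texture passes per frame:   {dual:.2f}")
print(f"texture layers, single vs dual:  {int(single)} vs {2 * int(dual)}")
```

Under these assumptions the budget covers roughly five single-textured passes or four dual-textured passes per frame, that is, five versus eight texture layers.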

Reply 9 of 11, by Reputator

Rank Member
Scali wrote:

You have to separate pipelines and textures here.
A pipeline renders a single pixel. A single pixel may have 0 or more textures applied.
The number of textures you can apply per pixel depends on a combination of the number of TMUs per pipeline, and how often these TMUs can be (re-)used, as mentioned above with the example of the Kyro II.
In the era of fixed-function texturing, this was known as 'texture stages'. The GeForce supports two texture stages. So the pipeline can apply two textures (+ a texture blend function) to a pixel in a single pass.
The ATi Radeon in comparison supports three texture stages. And the aforementioned Kyro II supports 8 texture stages.

If you look here, you see that they mention the 'ratio' of "Pixel pipelines: texture mapping units: render output units"
https://en.wikipedia.org/wiki/List_of_Nvidia_ … orce_256_series
The GF256 is 4:4:4, so it has 4 pixel pipelines, 4 TMUs and 4 ROPs. So you have a straightforward pipeline of 4 pixels at a time, each with a single TMU (which can be re-used for dual texturing in a single pass).
The GF2 is 4:8:4, so they doubled the number of TMUs. This means it has twice the texture throughput, making dual texturing far more efficient. It also allows for more efficient trilinear filtering (where two mipmaps of the same texture are sampled). Wikipedia, however, claims that the GF256 also has that second TMU and can do the same efficient trilinear filtering, and that dual texturing was disabled because of a bug.

This explains a lot. I was really curious why a GeForce 256 got a speed-up when doing multi-texturing despite only having a single TMU per pipe, whereas a TNT2 behaved as I expected, with no speed-up.

https://www.youtube.com/c/PixelPipes
Graphics Card Database

Reply 10 of 11, by Scali

Rank l33t
silikone wrote:

This should enable eight full texture blends at 60 FPS. Is this reasoning valid?

Well, not entirely. The GeForce 256 can still perform two textures in a single pass, if I'm not mistaken. It just has fewer TMUs. So the second texture read can still be done in just 4 extra bytes. But you will get additional latency.

Reply 11 of 11, by silikone

Rank Member
Scali wrote:
silikone wrote:

This should enable eight full texture blends at 60 FPS. Is this reasoning valid?

Well, not entirely. The GeForce 256 can still perform two textures in a single pass, if I'm not mistaken. It just has fewer TMUs. So the second texture read can still be done in just 4 extra bytes. But you will get additional latency.

Oh yeah, I meant to say four pairs of multitextures for a total of eight textures. Dividing the memory bandwidth limit by four 24-byte passes at 1024x768 yields 63.5 Hz. How close it gets to this theoretical limit, I don't know, but I reckon that it is a fair bit lower in practice.
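
The same estimate can be turned around to give a frame-rate ceiling for a given number of passes. This is a sketch only: the 4.8 GB/s bandwidth figure is an assumption not given in the thread, `max_fps` is a hypothetical helper, and only the 24 bytes per pixel discussed above are counted.

```python
def max_fps(passes: int, bytes_per_pixel: int = 24,
            width: int = 1024, height: int = 768,
            bandwidth: float = 4.8e9) -> float:
    """Upper bound on frame rate if memory bandwidth were the only limit."""
    if passes < 1:
        raise ValueError("at least one pass is required")
    return bandwidth / (width * height * bytes_per_pixel * passes)


for passes in range(1, 6):
    print(f"{passes} dual-texture pass(es): {max_fps(passes):.1f} FPS ceiling")
```

Four passes come out at about 63.6 FPS with these inputs, in line with the figure above.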
