All is clear now...
NV30 vs R300:
"If we compare NV30 to its competitor from ATi on a per-clock basis, the NV30 receives a thorough beating from R300. ATi comes ahead with 8 texture ops and 8 arithmetic ops in parallel. The higher core clock improves the situation a bit for NV30, but it still only does 2600 MTO/s + 700 MAO/s or 2000 MTO/s + 1000 MAO/s. The R300 as R9700Pro reaches 2600 MTO/s + 2600 MAO/s without this balancing option. nVidia can only dare to dream of this raw performance, or try to overclock an NV30 to 1 GHz. Not only the award for best performance per clock, but also the one for best overall raw performance goes to R300 from ATi.
Let us have a look at how prone to wasting performance the NV30 design is. Its universal FPU can execute any operation, which means that the texture op to arithmetic op ratio has no influence on performance waste. However, to be able to perform two texture ops at once they have to occur as pairs. If not, you waste half the possible texture ops.
R300 reaches its maximum of 5200 MI/s at a ratio of 1:1, dropping at both extremes (only texture/arithmetic ops, respectively) to 2600 MI/s. Without paired texops, NV30 yields a constant 2000 MI/s, far below R300 even in the best case.
Conclusively one can say that NV30 is less prone to wasting performance, but R300 has enough raw performance, so wasting a bit doesn't hurt too much."
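To sanity-check the quoted numbers, here is a rough sketch of the throughput model the article seems to be using. It is my own simplification, not anything taken from the article: I assume R300's texture and arithmetic units run fully in parallel and a shader is limited by whichever instruction type it has more of. The 325 MHz clock is just 2600 MTO/s divided by 8 ops per clock, and the 4 ops x 500 MHz split for NV30 is my assumption for where the constant 2000 MI/s comes from. The function names are made up for illustration.

```python
# Simplified throughput model built from the figures quoted above.
# Assumption: texture and arithmetic units work in parallel, so the
# instruction type with more work determines the number of clocks.

R300_OPS_PER_CLOCK = 8   # per unit type: 8 texture + 8 arithmetic (quoted)
R300_CLOCK_MHZ = 325     # derived: 2600 MTO/s / 8 ops per clock

NV30_OPS_PER_CLOCK = 4   # assumption: unpaired case, one op per pipe per clock
NV30_CLOCK_MHZ = 500     # assumption: 2000 MI/s / 4 ops per clock


def r300_mips(tex_ops, arith_ops):
    """Million instructions per second for a tex:arith instruction mix."""
    bottleneck = max(tex_ops, arith_ops)
    if bottleneck <= 0:
        raise ValueError("need at least one instruction")
    peak_per_unit = R300_OPS_PER_CLOCK * R300_CLOCK_MHZ  # 2600
    return (tex_ops + arith_ops) * peak_per_unit / bottleneck


def nv30_unpaired_mips():
    """NV30 without paired texture ops: constant regardless of the mix."""
    return NV30_OPS_PER_CLOCK * NV30_CLOCK_MHZ


if __name__ == "__main__":
    for tex, arith in [(1, 0), (2, 1), (1, 1), (1, 2), (0, 1)]:
        print(f"{tex}:{arith}  R300 {r300_mips(tex, arith):6.0f} MI/s"
              f"  NV30 {nv30_unpaired_mips()} MI/s")
```

Under those assumptions the model reproduces the quoted 5200 MI/s at 1:1 and 2600 MI/s at both extremes, with NV30 sitting at a flat 2000 MI/s.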
NV35 improvements vs R350:
"NV35 shows a different behavior than NV30. In the ideal case of paired texture instructions followed by an arithmetic instruction, it can reach a maximum of 5400 MI/s. The second line shows a shader without paired texture operations. The last curve shows how NV35 behaves when using PS1.4 and PS2.0 in the form preferred by ATi. Because this means either 8 texture or 8 arithmetic instructions per clock, we get a constant 3600 MI/s. At a 1:6 ratio and below, NV35 is able to beat R350.
With shaders optimized for both architectures, NV35 does a much better job than its predecessor did. NV35 beats R350 outside the range of 2:1 to 1:3. But in between, ATi dominates and even R300 is able to beat NV35 here. If we consider the bigger performance hit of R350 when doing dependent reads, we can conclude that NV35 and R350 are competitors of equal weight if both get fed with optimized shader code."
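The "1:6 and below" crossover can be checked with the same kind of back-of-the-envelope model. This is only a sketch under my own assumptions: the ratio is read as texture:arithmetic, R350 is taken to be a Radeon 9800 Pro at 380 MHz with 8 texture + 8 arithmetic ops per clock (neither figure is in the quoted text), and NV35 is pinned at the quoted constant 3600 MI/s for ATi-style shaders. The function name is hypothetical.

```python
from fractions import Fraction

NV35_FLAT_MIPS = 3600          # quoted: constant rate with ATi-style shaders
R350_PEAK_PER_UNIT = 8 * 380   # assumption: 8 ops/clock per unit type at 380 MHz


def r350_mips(tex_ops, arith_ops):
    """R350 rate for a tex:arith mix, limited by the busier unit type."""
    bottleneck = max(tex_ops, arith_ops)
    if bottleneck <= 0:
        raise ValueError("need at least one instruction")
    # Fraction keeps the comparison exact instead of relying on float rounding.
    return Fraction((tex_ops + arith_ops) * R350_PEAK_PER_UNIT, bottleneck)


if __name__ == "__main__":
    for arith in range(1, 9):
        rate = r350_mips(1, arith)
        winner = "NV35" if NV35_FLAT_MIPS > rate else "R350"
        print(f"1:{arith}  R350 {float(rate):7.1f} MI/s"
              f"  NV35 {NV35_FLAT_MIPS} MI/s  -> {winner}")
```

With these assumed clocks, R350 is still ahead at 1:5 and falls behind the flat 3600 MI/s from 1:6 onward, which lines up with the quoted claim.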
This never materialized I guess?
"But nVidia can't expect an application to always deliver such code. At this point we can only preach that nVidia has to put instruction reordering capabilities into their drivers, but without changing the function of the code. The method the current driver uses is nothing more than a stop-gap solution that is acceptable as long as the number of applications containing unfavorable shader code is small. But with a growing number of titles using shader technology, nVidia can't expect GeForceFX users to wait for new drivers to enjoy higher performance through replacement shaders."