dexvx wrote:You do not believe in mathematical theories such as the average, geometric mean, median, etc? GTX 1080 is, on average, about 5-10% faster in 3D games than the Vega 64. Are all the singular datapoints (games) where the GTX 1080 is faster just cherry-picked and therefore be discarded? Or are you implying the opposite?
I'm not sure if you understand what cherry picking is...
If you want to talk about maths and statistics, the relevant concept here is the 'outlier'... If a certain result deviates far from the average/mean/median/whatnot, then there is something interesting going on there, which warrants deeper inspection to try and explain why that particular result is faster or slower than most.
Cherry-picking is where you only pick the results that support your argument. In some cases that can be people only picking the 'outliers' where their brand of choice is faster, discarding the results that bring the average down.
In other cases, people may discard the 'outliers' where their brand of choice is slower.
It all depends on what outliers you have.
In the case of DX12/Vulkan, the sample space is currently too small to even have a good idea of what the expected results are, vs what the outliers would be.
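To make the difference concrete, here is a minimal sketch (the ratios and the 15% threshold are completely made up): instead of keeping or dropping outliers to suit an argument, you flag anything that deviates far from the median and then go figure out why.
[code]
// Minimal sketch (made-up numbers, made-up 15% threshold): flag the per-game
// results that deviate far from the median, then go investigate those.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Hypothetical relative results: fps of card A divided by fps of card B.
    std::vector<std::pair<std::string, double>> results = {
        {"Game 1", 1.08}, {"Game 2", 1.05}, {"Game 3", 1.10},
        {"Game 4", 0.85}, {"Game 5", 1.07}, {"Game 6", 1.32},
    };

    // Median of the ratios.
    std::vector<double> ratios;
    for (const auto& r : results) ratios.push_back(r.second);
    std::sort(ratios.begin(), ratios.end());
    const double median = ratios[ratios.size() / 2];

    // Anything more than 15% away from the median is an outlier: not something
    // to silently keep or throw away, but something that needs an explanation.
    for (const auto& r : results) {
        if (std::fabs(r.second - median) / median > 0.15)
            std::printf("%s: %.2f vs median %.2f -> investigate\n",
                        r.first.c_str(), r.second, median);
    }
    return 0;
}
[/code]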
dexvx wrote:I'll proceed and tell those game devs to shut up.
Game devs don't necessarily understand hardware these days. Heck, many of them don't even know assembly language anymore. They're probably doing the same as you: repeating stuff they read on the internet, because of confirmation bias.
Really, you'll have to come up with something better.
dexvx wrote:It's software from my perspective because I can't change the SM's on the fly. How the fvk would I know what my fractional workload (which is often times dynamic) is going to be?
As far as I recall, Maxwell can very well repartition the SMs on the fly (as pointed out, they have been able to do that since Kepler with Hyper-Q); the only limitation is that it cannot do so while a draw call is executing, because draw calls are not pre-emptible. So it can repartition the SMs between any two draw calls.
In which case you are misunderstanding the hardware and misrepresenting the facts.
dexvx wrote:Please cite anywhere that AoTS was developed with the intention of deliberately hurting performance.
Why would I need to cite anything? How about common sense?
They released a benchmark that performed worse on Maxwell when async compute was enabled.
If you didn't want to hurt performance on NV, you would either not enable the async path at all, or you would make an alternative path that doesn't hurt performance.
In fact, why would they even release a benchmark at all, of a game that was still far from finished at the time?
Not to mention that AoTS was an AMD-sponsored game, so the writing is on the wall, isn't it?
If you look at DOOM, yes it was AMD-sponsored, and yes, it has AMD-specific shader optimizations, and yes it uses async compute optimized for AMD... But at least they don't enable async compute on NV at all, so it doesn't HURT performance. That's what one expects in a game.
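Just to make concrete what 'not enabling it' looks like from the engine's side, here is a minimal sketch (the function name and the policy are mine, purely for illustration; the DXGI call and the PCI vendor IDs are real): the separate compute-queue path is only taken on hardware where it has actually been verified to help.
[code]
// Minimal sketch of gating an async compute path per adapter. The function
// name and the policy are purely illustrative; the DXGI call and the PCI
// vendor IDs (0x1002 = AMD, 0x10DE = NVIDIA) are real.
#include <dxgi.h>

bool ShouldEnableAsyncCompute(IDXGIAdapter1* adapter) {
    DXGI_ADAPTER_DESC1 desc = {};
    if (FAILED(adapter->GetDesc1(&desc)))
        return false;               // when in doubt, stay on the default path

    switch (desc.VendorId) {
        case 0x1002: return true;   // assumed: QA verified a gain on this hardware
        case 0x10DE: return false;  // no verified gain -> plain single-queue path
        default:     return false;  // unknown hardware -> plain single-queue path
    }
}
[/code]
The rendered output is identical either way; it's purely a performance switch, so defaulting to 'off' costs nothing except on hardware where you've actually measured a win.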
dexvx wrote:From the papers I've read, AoTS's design was to just emulate a single logical compute queue and then serializing tasks into the graphics queue. It just so happens that Maxwell's very static implementation of async compute (with explicit scheduling of graphics and compute tasks) was terrible at executing the task structured as such.
Putting the cart before the horse, are we?
The point of writing a game should be to make it run as fast as possible, and make it look as good as possible. What you're saying just proves my point: they ran a task that was structured in a way that it ran very poorly on Maxwell.
Why would you even allow such a code path to run on the hardware? QA should have figured out that this didn't work on that hardware, so you disable it. After all, async compute doesn't change anything about how the game looks. It's merely a basic tool that may or may not allow you to get small gains on certain hardware if you can use it correctly.
It should be disabled by default, unless you made specific optimizations and have verified that they indeed improve performance during QA. This is also what the DX12 best practices docs say.
Instead, not only did they enable it by default, they even went as far as to shout in the media that NV's hardware was broken and whatnot. Which is what got us to where we are today, with people like you arguing that only AMD has "true async". It's a dirty game that AMD has been playing, and you fell for it.
dexvx wrote:Anyways, you give nice and detailed technical explanations (like some of our PE's and higher). However, like many PE's, you seem to ignore real world results that don't conform to your internal theories (and dismiss anomalies as 'cherry picked' data).
Actually, it's the other way around.
See, async compute is mainly an AMD marketing tool. It is basically the only DX12 feature they can sorta do. Not to mention that they get it 'for free' on the PC platform, since game devs also use it on consoles.
As a result, AMD's DX12 strategy has been to focus 100% on async compute (and completely ignore other new features of the API, many of which they didn't even implement). The only software out there that uses async compute and ISN'T AMD-sponsored/biased is Futuremark's Time Spy.
So the async compute consensus on the net is very much AMD-biased. I am one of the few who has a more balanced view, and actually understands how async compute works and what the differences are between the architectures.
Pretty much everything else is "Look, AMD is faster, NV is fake!", which is nonsense of course. It's about as nonsensical as saying that AMD's CPUs must be 'pseudo-hardware' because they can't run x86 software as quickly as Intel can.
Different architectures just have different solutions to the same problem, which comes with different performance characteristics and optimization strategies.
Time Spy is the only 'fair' async compute test we have so far, and we can see that it does indeed work on NV hardware. It doesn't get as much of a boost as it does on AMD hardware, but does that make NV's implementation bad or fake? No. Their architecture is just different. As I already said before, NV's pipeline is far more efficient than AMD's, so there is less to gain from async compute in the first place.
Even if NV were to copy AMD's async compute implementation 1:1 and glue it onto Pascal, you wouldn't see the same gains as you get on AMD hardware. That's not because the async compute wouldn't work as well; it's because the other parts of the GPU are more efficient and already take up more of the available resources, leaving fewer gaps for async compute to fill. Just look at how much more performance NV gets out of the same memory bandwidth or GFLOPS: compare the RX 580 with the GTX 1070, which have the same memory interface. Then think about what that means for something like async compute. NV cards that generally perform at the same level as AMD cards are far 'lighter' in terms of raw resources; they just perform the same because they're more efficient.
Of course you also have to consider the compute units themselves, which are not the same between AMD and NV.
dexvx wrote:What does HPC compute tasks have to do with async compute?
Isn't that obvious?
Async compute is about running compute tasks asynchronously. It doesn't necessarily have to be paired with graphics tasks. The concept existed long before DX12, because keeping the GPU fed with multiple independent compute tasks was already a real problem in HPC.
In the end, a graphics task is just another shader task. At the time of Kepler, the hardware wasn't generalized this far yet, so you could not run graphics and compute in parallel. But later hardware did make this generalization, and the basic concept is just the same: you have multiple shader tasks running in parallel, and they can be dynamically scheduled over the SMs.
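On the API side, that is really all there is to it. Here is a minimal D3D12-flavoured sketch of the idea (device setup, command-list recording and error handling omitted; the names are mine):
[code]
// Minimal D3D12-flavoured sketch: one DIRECT (graphics) queue and one COMPUTE
// queue, with a fence marking the only point where they have to synchronize.
// Queues and the fence would normally be created once at init, not per frame.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void SubmitIndependentWork(ID3D12Device* device,
                           ID3D12GraphicsCommandList* gfxList,
                           ID3D12GraphicsCommandList* computeList) {
    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;

    D3D12_COMMAND_QUEUE_DESC gq = {};
    gq.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    device->CreateCommandQueue(&gq, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC cq = {};
    cq.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute + copy only
    device->CreateCommandQueue(&cq, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Kick off the compute work; the hardware is free to run it alongside
    // the graphics work submitted below.
    ID3D12CommandList* c[] = { computeList };
    computeQueue->ExecuteCommandLists(1, c);
    computeQueue->Signal(fence.Get(), 1);

    // Graphics work that does not depend on the compute results.
    ID3D12CommandList* g[] = { gfxList };
    gfxQueue->ExecuteCommandLists(1, g);

    // Only work submitted to the graphics queue after this point has to wait
    // for the compute results.
    gfxQueue->Wait(fence.Get(), 1);
}
[/code]
Note that the application never says which SMs or CUs run what; it only expresses that the two streams of work are independent. How much of that independence gets exploited, and with what granularity of pre-emption, is exactly where Kepler, Maxwell, Pascal and GCN differ.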
You can see that in Maxwell v2, they made the first step here: a graphics task could run in parallel with compute tasks, but they had not implemented pre-emption of graphics tasks yet, so an entire draw call had to complete before context switches could be made.
In Pascal they improved granularity considerably: they can now pre-empt graphics tasks at the pixel level.
You could argue that they could still take another step: pre-empting graphics tasks at the instruction level. However, unless you have some extremely long pixel shaders, it is probably not going to make a difference in practice. And I can see why they did it per-pixel... this way they can perform synchronization at the raster operation level. That probably makes context switching for the special case of graphics tasks a whole lot less complicated to implement.
As far as I know, AMD has not specified exactly when or how they pre-empt, but I doubt that they go beyond pixel level on graphics tasks.