Archer57 wrote on 2025-07-09, 00:29:
Honestly I have a very hard time buying the narrative that it was a screw-up and was fixed. Newer, supposedly fixed, chips fail just as much if not more.
GTX470-480, 570-580 would be a good example - most are dead. And AMD HD4870/4890 are no better. Most laptops, especially high-end ones, die because of this issue. Either the GPU or the chipset fails.
Well, specifically regarding high-end GPUs like the GTX 280, 470, 480, 580, etc., along with the HD 4850/70/90 - all of those failures were simply down to the GPUs running at too high a temperature (inadequate coolers and lousy fan profiles that prioritized low noise over proper cooling), with the GTX 470 and 480 probably having the most underwhelming cooling of the bunch. I mean, 85C is NOT OK, no matter what nVidia and ATI/AMD tell you, especially when you consider that with each generation, the complexity and average die area have either stayed the same or gone up.
A larger die area is good for improving heat dissipation (and for carrying more current to the GPU core), but it has the downside of greater material expansion across temperature deltas (e.g. take two sheets of the same plastic material, cut one to 10x10 cm and the other to 50x50 cm, heat them up, and guess which one shows more expansion). This is why, if you've noticed, temperatures on modern GPUs have overall been (very slowly) declining compared to what we saw in the GTX 280-580 era. *AND* GPU silicon has also become thinner - not only because of smaller manufacturing nodes, but also to dissipate heat better. In essence, though, that makes the GPU die weaker, as material expansion can now cause the die to crack or fracture - this is more a failure mode of modern sub-20 nm GPUs than of the older stuff.
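To put some rough numbers on that two-sheet example (the CTE and temperature swing below are purely illustrative values I picked, not measurements of any real material):

```python
# Back-of-the-envelope version of the two-sheet example above.
# CTE and temperature rise are assumed, illustrative values only.
cte = 70e-6      # coefficient of thermal expansion, 1/K (generic plastic, assumed)
delta_t = 50.0   # temperature rise, K (assumed)

for side_cm in (10.0, 50.0):
    side_mm = side_cm * 10.0
    growth_mm = side_mm * cte * delta_t   # delta_L = L * CTE * delta_T
    print(f"{side_cm:.0f}x{side_cm:.0f} cm sheet grows ~{growth_mm:.2f} mm per side")
```

Same material, same temperature swing - the bigger sheet still moves several times as much in absolute terms, and it's the absolute movement that the bumps and underfill of a bigger die have to absorb.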
So, to put it more briefly: new GPUs fail for different reasons now... or rather, there have been new challenges in manufacturing modern GPUs, and not all of them have been solved, or solved completely.
Take, for example, cards like the RTX 4000 and 5000 series or their AMD equivalents - we are now at 300+ Watts at around 1 V... so about 300 AMPS (!!!), all going into an area of less than 400 mm^2. Sure, the distance between the GPU die and the substrate underneath is tiny (the height of the solder bumps)... but that's still a massive amount of current. And if you compare that to what bumpgate-era cards like the 7800/7900 had to carry through their high-lead bumps, you can start to see where the issue is with modern cards compared to the old stuff.
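Just to show the arithmetic behind that current figure (round, assumed numbers for illustration - nothing measured from an actual card):

```python
# Rough current and current-density estimate for a modern high-end GPU.
# All inputs are assumed round numbers.
power_w = 300.0        # GPU power, W
core_v = 1.0           # core voltage, V
die_area_mm2 = 400.0   # die area, mm^2

current_a = power_w / core_v               # I = P / V
density_a_mm2 = current_a / die_area_mm2   # A per mm^2 of die area

print(f"~{current_a:.0f} A total, ~{density_a_mm2:.2f} A per mm^2 of die area")
```

And every one of those amps has to pass through the bumps between the die and the substrate.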
In other words, the failures are for different reasons now. But temperature is still a major player in the field. Which brings me to your next point.
Archer57 wrote on 2025-07-09, 00:29:
As for "preserving"... my question would be: what actually kills - high temperature, or fast temperature changes?
Both.
And not only.
This is purely materials physics.
Higher temperatures (or, equivalently, a larger temperature delta between cold and hot) mean greater expansion and contraction of the different materials in the GPU package - and those materials don't all expand at the same rate.
Then there's the number of cycles sustained - i.e. how many times you expose these materials to the hot-cold cycles. The more cycles sustained, the more fatigued the material will get (mainly the solder bumps, as this is the material between the GPU die and GPU substrate that is exposed to the most stress, along with the underfill that is supposed to alleviate most of this stress from the bumps.)
And lastly, "fast changes in temperature" (of the GPU): in a way, that is a sub-set / combination of the above phenomena (number of cycles sustained and temperature delta). With that said, fast changes in temperature probably won't contribute much to material stress in the GPU if the temperature deltas are small (e.g. the GPU constantly bouncing up and down 2-3C every second). But if the temperature deltas are relatively large (e.g. the GPU at 50C one instant and 60C the next), that will surely contribute to a faster failure.
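If you want a feel for how strongly the size of the swing matters, here is a very simplified Coffin-Manson-style scaling. The exponent and the reference point are assumed, illustrative values only - real solder fatigue models are a lot more involved:

```python
# Simplified Coffin-Manson-style scaling: cycles to failure ~ (delta_T)^-n.
# Exponent and reference point are assumed for illustration only.
exponent = 2.0         # fatigue exponent (assumed; real values depend on the alloy)
ref_delta_t = 10.0     # reference temperature swing, K
ref_cycles = 100_000   # assumed cycles to failure at the reference swing

for delta_t in (3.0, 10.0, 30.0, 60.0):
    cycles = ref_cycles * (ref_delta_t / delta_t) ** exponent
    print(f"swing of {delta_t:>4.0f} C  ->  roughly {cycles:>12,.0f} cycles to failure")
```

So a constant 2-3C wobble barely registers, while repeated 50-60C swings chew through the fatigue life hundreds of times faster - which is exactly the point above.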
Archer57 wrote on 2025-07-09, 00:29:
So theoretically whatever cooling solution used should be built with that in mind - as high heat capacity as possible to prevent rapid changes. In this sense modern cooling solutions, be it heatpipes+thin fin stack or liquid, are bad and may be a part of the reason why failures happen in the first place. Giant slab of copper might do better...
Good point... and probably one reason why the P4 Prescott (with its soldered IHS) never had any problems compared to Northwood (IHS not soldered), which would occasionally fail "out of the blue".
Unfortunately, a giant slab of copper is not only more expensive, but also has inferior thermal conductivity compared to heat pipes. And given the TDP per surface area of modern chips, an all-copper heatsink may simply not cut it anymore to transfer all of the heat to the cooling fins. So heat pipes are somewhat necessary at this point.
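As a rough illustration of why a plain copper slab struggles over any real distance (the conductivities, path length, and cross-section below are all assumed, ballpark values - the heat pipe figure in particular is just a commonly quoted effective value, not a spec for any specific pipe):

```python
# Conductive thermal resistance R = L / (k * A) for a solid copper path
# versus a heat pipe with an assumed effective conductivity.
# All numbers are ballpark assumptions for illustration.
length_m = 0.10      # heat path from die to the far end of the fin stack, m
area_m2 = 1.0e-4     # 1 cm^2 cross-section
k_copper = 400.0     # W/(m*K), solid copper
k_heatpipe = 8000.0  # W/(m*K), assumed effective value for a heat pipe
power_w = 250.0      # heat to move, W

for name, k in (("copper slab", k_copper), ("heat pipe", k_heatpipe)):
    r = length_m / (k * area_m2)   # thermal resistance, K/W
    print(f"{name:11s}: {r:.2f} K/W  ->  ~{power_w * r:.0f} K drop at {power_w:.0f} W")
```

The copper number is deliberately extreme - in practice you'd compensate with a much fatter cross-section, which is exactly where the cost and weight penalty of an all-copper cooler comes from.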
But in the end, it's not that designers and manufacturers can't solve the cooling issues. Rather, they are just interested in providing a solution that is only "good enough" for the "average user" - e.g. around 5 years of daily use... which is more or less the average upgrade cycle for most people both in the IT field and gaming.
Archer57 wrote on 2025-07-09, 00:29:
IMO, discounting conspiracy theories like "intentionally built to fail", there is simply a fundamental flaw in this way of building chips which has not been resolved and still very much kills new hardware.
But then again, we spin back to the question: why don't CPUs fail (nearly as often), then?
IMO, there is no fundamental flaw of how chips are built.
It's just more of the industry following the trends of the user/consumer.
Again, people tend to upgrade their GPU about every 5 years - at least anyone who plays games demanding enough to need a discrete GPU... and the industry knows this. Meanwhile, CPUs may stick around with their users for 10 years... or sometimes even more in the case of industrial applications. Also, CPUs are still a fundamental building block of any PC, while a discrete GPU is not. As such, most discrete GPUs are treated more like a "consumable", even when they are not upgradeable or replaceable in a system.
So in short, CPU and GPUs are built to different expected lifecycle standards. And because of that, manufacturers simply don't see a reason to make GPUs as durable as CPUs.
Of course, now with more modern and very small manufacturing nodes, we've also started seeing CPUs fail a lot more frequently than they did back in the day (socket 462 excepted 🤣).
The Serpent Rider wrote on 2025-07-08, 19:31:
Not sure about ATi. So far my experience with X1800/X1900/X1950/HD 2900 is that they are built like a tank, despite being Devil's armpit hot.
Well, the HD 2000 and 3000 series specifically (and to a smaller extent the HD 4000 and 5000 series) - yes, they really are quite well-built. If you keep them cool, you can actually make them last for a very long time. The problem with cards like the HD 3850/70 and 4850/70/90, though, is that their stock fan profile was often set to spin up only after the GPU reached some absurd temperature, like 70-75C, and not ramp up noticeably until things were cooking at 80+C. That's bad, and 100% the fault of the cooling. (It also shows how much trust ATI/AMD had in their GPUs at the time, despite the rumors of nVidia's bumpgate issue already on the horizon.)
As for the X1000 series - I don't have that many in my collection and haven't analyzed them too closely as a result. But from what I have seen and sold over the years, they do appear to be relatively OK too.
In the case of the Xbox 360, that too was mostly a cooling issue - the GPU heatsink (both the plain aluminum one and the one with the heatpipe and small extension) is really only good for cooling chips up to 30-35W TDP... maybe 40W tops. Past that, the temperatures start to go past 70C - and that's with lots of airflow in an open-air test box. The Xbox 360 offers neither of those: inside the console, that same GPU heatsink gets poor airflow, and the air it does get isn't very cool. So the older, higher-TDP GPUs in these tend to run very hot (80C), which is probably the primary reason they die. Now, if the early GPU revisions also had issues with a soft underfill like nVidia did, that explains even better why the oldest Xbox 360s were so failure-prone.
The Serpent Rider wrote on 2025-07-08, 19:31:
Now Radeon 9600/9700/9800 were notoriously dropping like flies, but mostly due to RAM failures.
Well, the Radeon 9600 should not be grouped in the same category as the 9700 and 9800.
If anything, the "Radeon 9600 series" can be split into two categories - cards with BGA RAM and cards with regular TSSOP RAM. The latter (TSSOP RAM models) are pretty bullet-proof if they come with active cooling - i.e. a cooler with a fan. The passively-cooled ones (often the 9550, the vanilla / non-Pro 9600, and a few 9600 Pro cards) would occasionally fail from running too hot. And the BGA RAM models (mostly 9600 XT cards) had issues due to the RAM.
As for the 9700/9800 series - in addition to failing RAM, these also failed due to inadequate cooling of the GPU chip. Actually, the two go hand in hand: if the GPU chip runs hot, it heats up the PCB underneath and around it as well. The BGA RAM used on these cards relies on being cooled through the PCB underneath (via the BGA solder balls). The solder itself is leaded and not prone to failing at all. But when the PCB is already very hot because the GPU runs too hot, obviously the RAM can't cool properly... and that can lead to premature RAM failures - if the GPU didn't fail first from the high temperatures.
So for Radeon 9700/9800 cards, the overall cooling of the card is just dismal, no matter which way you look at it. Anyone who runs these with the stock cooler is essentially asking for their card to die. On that note, if you put a massive heatsink on the GPU, you'll notice that the RAM now also runs much cooler. So not only do you extend the life of the GPU, but also that of the RAM... though for completeness, one should also install RAM heatsinks and actively cool those as well. Doing so should greatly extend the life of these cards. I have two that were reflown over 10 years ago and still work with my improvised oversized coolers, so I take that as good enough proof that these can last with good cooling. FWIW, reflowing a GPU should not be considered a fix, only a temporary revival. The fact that I could extend their service life this much goes to show that temperature is the enemy of these cards.