Saving defective chips with improvised cooling?


First post, by Vany

Rank: Newbie

Over the years I've come across various retro laptops from the mid-2000s that have defective chips. These all have the same problem as the PS3's GPU: cracked solder balls in the substrate package.

I have several ideas but would like to know if anyone has attempted these before, so I can cross them off and not waste time researching them. These would apply to known-defective chips and to chips that have already failed but can be temporarily revived via heat.

Repairing + Preserving
*Solution A - Re-heat the chip to get it working again, then add a larger/improvised heatsink to permanently keep it cooler. Add extra fans if necessary. (A quick temperature-logging sketch for checking the result follows this list.)
*Solution B - Re-heat the chip and apply enough pressure onto the chip itself to squeeze it tighter and possibly negate the cracks as they make contact again (this works on launch PS4s - the infamous coin trick).

Preserving good chips
*Solution C - An improvised heatsink that would allow liquid metal to cool the chip without messing up the motherboard (larger area coverage than thermal paste? possibly never having to touch it again, as AFAIK LM doesn't dry out)
*Solution D - Undervolting the chip to the point that it no longer heats up as much? (I tried this on a Toshiba C55 with a GeForce 710 that ran at 95C regularly; undervolting caused crashes and system instability.)
*Solution E - Assuming the chips fail because of poor laptop design/planned obsolescence (a heatsink that isn't big enough, or a lack of cooling holes) and rectifying these design flaws?
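
Whichever of these I try, the plan is to log chip temperatures during a load run and check them against a target. Below is a minimal sketch of what I mean, assuming the machine can boot a Linux live environment with lm-sensors working and the psutil package installed; the sensor labels and the 65C limit are placeholders to adjust per machine:

# Minimal temperature logger for checking an improvised cooling mod.
# Assumes Linux with lm-sensors working and psutil installed; the
# sensor labels below are placeholders -- adjust them for your machine.
import csv
import time

import psutil

WATCH_LABELS = {"GPU", "temp1", "Core 0"}   # hypothetical labels, adjust
LIMIT_C = 65.0                              # example threshold from this thread
INTERVAL_S = 2.0

with open("templog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "chip", "label", "temp_c", "over_limit"])
    start = time.time()
    while True:
        elapsed = round(time.time() - start, 1)
        for chip, readings in psutil.sensors_temperatures().items():
            for r in readings:
                if r.label in WATCH_LABELS:
                    writer.writerow([elapsed, chip, r.label, r.current,
                                     r.current > LIMIT_C])
        f.flush()
        time.sleep(INTERVAL_S)

Run a game or stress test alongside it and the CSV shows whether the modified cooler actually holds the line.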

I am not interested in doing chip replacements or BGA rework, as I see that as curing the symptoms rather than the disease.
In case anyone wonders which chips I believe to be defective, I made a list (I've come across these way too many times to count):

NVIDIA:
Laptops:
GeForce Go 6000 series
GeForce Go 7000 series
GeForce 8000M series
GeForce 9000M series
Desktop:
GeForce 6600 (later models), 6700, 6800
GeForce 7000 series (models above 7600 GT)
GeForce 8000 series (all models + workstation models)
GeForce 9000 series (models above 9500)

ATI:
Laptops:
Mobility Radeon X1000 series (Only above X1600?)
Mobility Radeon HD 3000 series (except integrated chips such as the HD 3100 and HD 3200; the HD 3450, for example, is affected)
Mobility Radeon HD 4000 series (all)
Mobility Radeon HD 5000 series (all)
Mobility Radeon HD 6000 series (all except integrated chips)

Trident Cyber 9525DVD Test, Review and supported games list

Reply 2 of 24, by Fish3r

Rank: Newbie

I've not used a laptop variant of these chips, but I feel like even with a combo of undervolting + improved TIM you're going to struggle to keep the higher-end chips below the temps required to cause issues, which according to https://youtu.be/3qKtS_uxdcU?t=2206 is around 65C.

Reply 3 of 24, by Vany

Rank: Newbie
MikeSG wrote on 2025-03-14, 19:14:

Baking the GPU seems to work for some. I don't know if I'd do it unless it was a last resort.

https://linustechtips.com/topic/989014-so-i-d … the-oven-trick/

That damages other components and is not really a fix, more of a crutch put up to hold a damaged wall.

Trident Cyber 9525DVD Test, Review and supported games list

Reply 4 of 24, by Vany

Rank: Newbie
Fish3r wrote on 2025-03-14, 19:28:

I've not used a laptop variant of these chips, but I feel like even with a combo of undervolting + improved TIM you're going to struggle to keep the higher-end chips below the temps required to cause issues, which according to https://youtu.be/3qKtS_uxdcU?t=2206 is around 65C.

Thank you. So let's say 65C is the max for a defective substrate: if the system is cool enough, does that "guarantee" that the chip won't destroy itself under normal usage? If true, then not much is needed, as I've seen significant temp drops from adding another copper pipe and gluing it to the existing ones with thermal glue. It may just be enough. Hmm

Trident Cyber 9525DVD Test, Review and supported games list

Reply 5 of 24, by Trashbytes

Rank: Oldbie
Vany wrote on 2025-03-16, 00:23:
Fish3r wrote on 2025-03-14, 19:28:

I've not used a laptop variant of these chips, but I feel like even with a combo of undervolting + improved TIM you're going to struggle to keep the higher-end chips below the temps required to cause issues, which according to https://youtu.be/3qKtS_uxdcU?t=2206 is around 65C.

Thank you. So let's say 65C is the max for a defective substrate: if the system is cool enough, does that "guarantee" that the chip won't destroy itself under normal usage? If true, then not much is needed, as I've seen significant temp drops from adding another copper pipe and gluing it to the existing ones with thermal glue. It may just be enough. Hmm

Heat isn't the main issue, nor is the substrate or the die; it's the shitty lead-free solder they used combined with inadequate laptop cooling designs. You can't solve the solder issue by baking it, modifying the cooler, undervolting or gimping the GPU; the shit solder can only be fixed by replacing it. The bad cooling designs are also pretty damn hard to fix, if not impossible, on most laptops.

There are no guarantees here either; you can try band-aiding it and it may work fine till it doesn't. Much like baking a GPU in the oven doesn't mean it's fixed - it will break again, and this is due to the solder.

Not telling you not to try any of this, but set your expectations on it breaking again or being woefully underpowered just to get it working without breaking. (Also, once the damage has been done it'll happen again faster and will likely be terminal - I dealt with a lot of 8800 cards dying from this, and no amount of modding stopped it from happening again.)

Reply 6 of 24, by lti

Rank: Member

On the Nvidia and AMD chips listed above (except possibly the X1000 series), reballing with leaded solder didn't help. The success rate (both immediate and long-term) was identical to the shitty oven bake. However, reballing does work with other chips (especially those from other manufacturers).

I have seen Nvidia chips last longer if they're cooled well from the factory, but that's both rare and only extended their life to 8-10 years.

Reply 7 of 24, by momaka

Rank: Oldbie
Trashbytes wrote on 2025-03-16, 00:49:

Heat isn't the main issue, nor is the substrate or the die; it's the shitty lead-free solder they used combined with inadequate laptop cooling designs.

NO, it is most definitely NOT the lead-free solder that's the cause of this.
And it's NOT the (solder) connections between the board and the chip substrate that fail, but rather the (bump) connections between the chip substrate and the silicon die.
SO REBALLING WON'T FIX ANYTHING OR MAKE ANYTHING BETTER!!
C'mon people, it's 2025 already, so kindly PLEASE STOP PROPAGATING THESE MYTHS!!!

If you don't believe me, then ask yourself these questions:
Why do Radeon 9700 and 9800 video cards fail, despite using leaded solder??
Or better yet, why is it that CPUs almost never seem to fail (at least older stuff, before around 6th/7th gen Intel) while GPUs tend to fail very often (and have for a pretty long time)?
FWIW, both CPUs and GPUs jumped on the lead-free bandwagon quite a long time ago... yet we didn't see CPUs with the same kind of failures as GPUs. So MOST CERTAINLY, lead-free solder cannot be the reason here. Here's a hint to answer the above questions: what did/do you consider a "normal" operating temperature for your CPU and for your GPU? See any difference, especially a little further back in the day? 😉

The REAL reason why flip-chip stuff fails is almost always due to HIGH HEAT... which is mostly due to shitty cooling designs. And in the case of more modern flip-chips (starting around chips made around the last 8-10 years), also silicon degradation.

The two reasons excessive heat kills flip-chip packages are as follows:
1) This is the MAJOR one: differing thermal expansion coefficients between materials. The silicon die/core expands at a slightly different rate than the (typically glass-epoxy-based PCB) substrate on which it is mounted. So the higher the temperature differential between the "hot" and "cold" states, the bigger the expansion/contraction mismatch between these two. What joins the two electrically are small solder bumps (collectively, just "bumps"). It IS TRUE that leaded solder is more flexible/ductile and thus can take more abuse when it comes to stretching and flexing. However, the solder bumps aren't what physically holds the silicon die/core and the PCB substrate together; that is the job of the underfill epoxy/glue material (collectively, just "underfill"). No matter how strong/good this underfill material is, the thermal expansion mismatch between the die and substrate will always be there. So eventually, given enough expansion/contraction and enough cycles, the solder bumps between the two will break.
YES, leaded solder can take more abuse from thermal expansion than lead-free solder, but that still won't save a chip from failing (as evidenced by the Radeon R300 series of GPUs).
Thus, reducing the maximum operating temperature, and with it the thermal expansion mismatch between the die and the substrate, will directly result in a longer-lasting chip, all else being equal (some rough numbers on this follow below).

Small side note: in the case of the nVidia "bumpgate" problem, the issue was mostly caused by an underfill material that was found to soften too much past 60-65C... meaning the solder bumps were then exposed to even more stress, with the underfill not really doing its job. Hence why nVidia "bumpgate" chips failed so much more often than others, especially at higher temperatures.

2) Silicon degradation. It is a REAL phenomenon. High heat changes the properties of silicon and also accelerates electromigration. This is why CPUs and GPUs have a "max" temperature. Actually, this is not a "binary" fixed number either - i.e. "so long as the chip doesn't go past its max XX-degree rating, it won't have a problem". NO! The closer a chip gets to its maximum rated temperature, the quicker the silicon material in the die is pushed towards degradation. That being said, there is another factor that comes into play here: the manufacturing node (size/lithography). Older (larger) process nodes mean chips with physically bigger transistors inside, which directly translates to more silicon material per transistor. What this means is that even if a small part of the transistor starts degrading, it still won't affect its overall operation that much. But for a smaller transistor (especially nowadays with sub-20/10 nm nodes), the changes in transistor parameters caused by degradation can have a much more dramatic effect on its operation. Furthermore, older chips operate at higher voltages and higher currents... so again, this is why a small change in transistor properties/parameters isn't likely to affect anything. But for modern stuff, where signal voltages and currents are much smaller, an equivalent change in transistor parameters can lead to fatal signal errors. And again, all of this degradation is only further accelerated by elevated heat.

So to say it in fewer words: less heat / lower temperatures = longer lasting silicon. PERIOD.
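
To put some rough numbers on the two points above, here's a quick back-of-the-envelope sketch. The CTE figures, die size, temperature swing and activation energy are assumed "typical" values, not measurements of any particular chip:

import math

# 1) Differential thermal expansion between the die and the substrate.
CTE_SILICON = 2.6e-6     # 1/K, rough figure for a silicon die
CTE_SUBSTRATE = 15e-6    # 1/K, rough figure for an organic (glass-epoxy) substrate
DIE_EDGE_M = 0.015       # 15 mm die edge, hypothetical
DELTA_T = 40.0           # K, e.g. ~25C idle -> ~65C load

mismatch_m = (CTE_SUBSTRATE - CTE_SILICON) * DIE_EDGE_M * DELTA_T
print(f"Differential expansion across the die: {mismatch_m * 1e6:.1f} um")
# A bump near the die edge sits about half the edge length from the neutral
# center, so it has to absorb roughly half of that figure as shear per cycle.

# 2) Arrhenius-style acceleration of silicon/electromigration degradation.
EA_EV = 0.7              # assumed activation energy, eV
K_B_EV = 8.617e-5        # Boltzmann constant, eV/K

def degradation_speedup(t_hot_c, t_cool_c):
    t_hot, t_cool = t_hot_c + 273.15, t_cool_c + 273.15
    return math.exp((EA_EV / K_B_EV) * (1.0 / t_cool - 1.0 / t_hot))

print(f"Degradation speed-up at 80C vs 50C: ~{degradation_speedup(80, 50):.0f}x")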

Now back to the topic at hand...

Vany wrote on 2025-03-16, 00:23:

Thank you. So let's say 65C is the max for a defective substrate: if the system is cool enough, does that "guarantee" that the chip won't destroy itself under normal usage?

No, there's no guarantee how long a reflown / re-heated chip will work. It's really a matter of luck.
However, reducing the temperature does reduce thermal expansion difference between materials, so it should still help.
Even more importantly, for nVidia "bumpgate" era chips, it's the underfill material that softens at elevated temperatures, so keeping the chip cool should help prolong the repair even more... if you are lucky, of course.

That said, I still have a Compaq Presario v6000 laptop with a GeForce 6100/6150 Go chipset still working after my initial reflow some 8 years ago. Granted, I haven't used it that much in the last 4 years or so. But still, after I reflowed it, I did use it quite a bit for about a year, and it went without a hitch - that is, after I heavily undervolted and underclocked the CPU (via CrystalCPUID) to keep the temperatures under 60C (idling around 50C most of the time). If I didn't do that, the idle temps were closer to 58-59C, and any CPU load could quickly bring the temperature up to 65C. FWIW, that laptop has a "uni-cooler" design like just about every laptop from that era, where the chipset dumps heat into the same heatsink as the CPU. In my case, the laptop only had a thermal pad to couple the chipset core to the CPU heatsink, so the chipset was often about 5 degrees off (higher or lower, depending on load) from the CPU temperature. I changed that for a folded copper shim "sandwich" (with a thermal pad inside it), reducing that temperature offset to 1C most of the time.

I have also reflown Radeon 9700 video cards - 8 to be exact. I only managed to get 3 working. Of these, one failed pretty quickly (the first one I did), simply because I used the same (shitty) stock cooler. I suspect this was the reason these cards failed even back then... so when I saw that happen, I quickly ditched the stock cooler. The truth with these is: the stock cooler is only good for 15-20W TDP... maybe 25W on a good day in the Arctic circle. The GPU core itself is closer to 40W TDP, so the core on these cards probably runs in the 60-70C range, if not more (certainly more when the fan on these starts to grind / slow down). My latest "homebrew" iteration of a cooler for one of the reflown cards is an Xbox 360 CPU heatsink. I've tested those 360 CPU heatsinks pretty thoroughly, and they seem to be able to handle up to ~60-70W before the temperatures start creeping past 60C. With that, I have not had my 2nd or 3rd cards fail... yet. And I have used them a decent amount over the years. So suffice it to say that cooler temperatures can prolong a "repair" like this. But how much is probably still down to luck.
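
For anyone wanting to size an improvised cooler the same way, the rough arithmetic is just the allowable thermal resistance from die to air. A small sketch; the wattages and temperatures below are example inputs, not measurements from my cards:

# Rough sizing check for an improvised cooler: what total thermal
# resistance (die to air, in C/W) is needed to hold a target temperature?
# The wattages and temperatures are example inputs only.

def required_resistance_c_per_w(tdp_w, t_target_c, t_case_air_c):
    return (t_target_c - t_case_air_c) / tdp_w

for tdp_w in (20, 40, 70):   # roughly: stock R300 cooler class, R300 core, Xbox 360 heatsink class
    r = required_resistance_c_per_w(tdp_w, t_target_c=60.0, t_case_air_c=35.0)
    print(f"{tdp_w:>3} W chip, 35C case air, 60C target -> need <= {r:.2f} C/W")

# Any cooler whose real-world die-to-air resistance is worse than that figure
# will let the chip creep past the target under sustained load.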

Last edited by momaka on 2025-03-17, 20:57. Edited 2 times in total.

Reply 8 of 24, by PcBytes

Rank: l33t
Trashbytes wrote on 2025-03-16, 00:49:
Heat isn't the main issue, nor is the substrate or the die; it's the shitty lead-free solder they used combined with inadequate la […]

That's been debunked long ago. The substrate IS the issue. It has been ever since the GeForce 6.

"Enter at your own peril, past the bolted door..."
Main PC: i5 3470, GB B75M-D3H, 16GB RAM, 2x1TB
98SE : P3 650, Soyo SY-6BA+IV, 384MB RAM, 80GB

Reply 9 of 24, by Vany

Rank: Newbie
momaka wrote on 2025-03-17, 20:33:

Thank you for the detailed reply. I don't even bother with re-flowing anymore, let alone exposing myself to magic fumes from the oven method. I'm mostly interested in preserving the defective chips that haven't failed yet or that are about to fail (artefacting etc.). The temps and TDP are something I considered when adding coolers; I usually go kind of overkill on cooling - say, the CPU is ~35 watts and I'll add a 70 or 80 watt cooler on it. I know that some laptop coolers have a "socket" for the thermal pad, but I do wonder if adding thermal paste, a copper plate (cut to fit) and then thermal paste again could fill the socket? I suppose the goal would be to measure how long a defective chip survives under "best" conditions at roughly 90-95% load.
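
For what it's worth, the simple conduction arithmetic suggests a paste + copper shim + paste stack should beat a thick pad by a wide margin. A rough sketch, with assumed typical conductivities and a guessed contact area (not measured values):

# Compare a 1 mm thermal pad against a thin-paste / copper-shim / thin-paste
# stack bridging the same gap. Conductivities and contact area are assumed
# typical values, not measured ones.

AREA_M2 = 0.012 * 0.012          # ~12 x 12 mm contact patch, hypothetical

def layer_resistance_c_per_w(thickness_m, k_w_per_mk, area_m2=AREA_M2):
    """Conduction resistance of one flat layer: R = t / (k * A)."""
    return thickness_m / (k_w_per_mk * area_m2)

pad_only = layer_resistance_c_per_w(1.0e-3, 5.0)           # 1 mm pad, ~5 W/mK
shim_stack = (layer_resistance_c_per_w(0.1e-3, 6.0)        # thin paste layer
              + layer_resistance_c_per_w(0.8e-3, 390.0)    # 0.8 mm copper shim
              + layer_resistance_c_per_w(0.1e-3, 6.0))     # thin paste layer

print(f"1 mm thermal pad:     {pad_only:.2f} C/W")
print(f"paste+copper+paste:   {shim_stack:.2f} C/W")

Interface/contact resistance isn't modeled here, which is why the shim only helps if both paste layers are kept thin.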

Trident Cyber 9525DVD Test, Review and supported games list

Reply 10 of 24, by The Serpent Rider

Rank: l33t++

Felix has dropped a new video about PS3 bumpgate which pretty much confirms that the entire 110nm lineup of GeForce 6xxx is affected. Some 65nm chips like the G92 of certain revisions (packaging plant?) already had a fixed substrate.
So yeah, the only reliable way to preserve old defective-package chips is to keep them as cool as possible. I would consider even a reported 60C somewhat risky, because chip hotspots could be way higher than that.

I must be some kind of standard: the anonymous gangbanger of the 21st century.

Reply 11 of 24, by momaka

Rank: Oldbie
The Serpent Rider wrote on 2025-06-20, 06:23:

Felix has dropped a new video about PS3 bumpgate which pretty much confirms that the entire 110nm lineup of GeForce 6xxx is affected. Some 65nm chips like the G92 of certain revisions (packaging plant?) already had a fixed substrate.

Very nice video!
Thank you for finding it and posting it here. 😀

TLDR: Both nVidia and ATI screwed up their designs in a similar way, and TSMC had no problem manufacturing what they ordered... plus introduced even more problems of their own. So it seems this was just the technology of the times, and both ATI and nVidia made similar mistakes in their designs.

The video also echoes what I have been saying for years: if it was the BGA failing, then how come CPUs (not only in consoles, but in general) so rarely fail, yet GPUs fail all the time? I'm really glad to see someone went so much in depth and to such lengths to reach a final verdict.

That said, if nVidia fixed their bumpgate issue (and ATI / now AMD their equivalent), then why do new(er) GPUs still continue to fail? And I think here, no matter how deep we dig, we can probably still get to the following conclusions:
1) More modern GPUs, with their ever-increasing silicon area, are becoming more prone to breaking simply because of the different expansion rates of materials. So in that sense, GPUs with cores of larger surface area are always more likely to fail than smaller chips... or, put another way, can withstand smaller temperature changes and lower operating temperatures.
2) Modern silicon is pretty much disposable at this point due to too small a manufacturing node - basically, transistors are made of too few silicon atoms, literally, and therefore any degradation of the material makes them malfunction more easily. On top of that, modern GPUs use ever-decreasing communication voltages in order to save power and not have the voltage "jump across" transistors... meaning these lower voltages are also becoming more susceptible to noise.
3) Manufacturers still only design stuff with "just enough" cooling in mind for the device/card/board/CPU to last the expected average lifecycle of the device - i.e. probably around 5-6 years or so, give or take 2 years depending on manufacturing tolerances. So even though cooling solutions have seen vast improvements over the decades, cooling is still "sub-par" in terms of performance. Then again, I think the latest RTX lines from nVidia (and equivalents from AMD) have just gone to "stoopid" levels with their TDP.

In conclusion, at least with the older stuff, I think we have a pretty good shot at making things last *IF* we really keep them running very cool. I would actually be very interested to try (or see someone else try) a NOS, never-used bumpgate nVidia card like the 7900 GTX run with a massive heatsink or water cooling from day one, keeping the GPU chip below 45C at all times (if possible). I bet such a setup would last way, waaay longer.

The Serpent Rider wrote on 2025-06-20, 06:23:

So yeah, the only reliable way to preserve old defective-package chips is to keep them as cool as possible. I would consider even a reported 60C somewhat risky, because chip hotspots could be way higher than that.

Yes.
My goal is usually to keep known-defective GPUs under 50C... which can get pretty challenging with air cooling.
And you are completely right about silicon hot spots. For example, on AMD/ATI cards like the HD 4850/4870 that have multiple core sensors, you can often see the GPU MEMIO area regularly running about 10C hotter than the overall GPU core temperature. So if the GPU chip "core" is running at 50C, MEMIO could be as hot as 60C. When I keep the core at 45C or lower (usually with an FPS limit to lower CPU utilization), I can see the MEMIO is also often lower, at 50-55C. As a result, I still have reflown cards that are working. I'm sure they will fail again... but I've squeezed a pretty good number of hours out of some of them by now - not bad!

Reply 12 of 24, by The Serpent Rider

Rank: l33t++

Both nVidia and ATI screwed up their designs in a similar way

Not sure about ATi. So far my experience with X1800/X1900/X1950/HD 2900 is that they are built like a tank, despite being Devil's armpit hot. I had no issues building quite a hefty collection of working X1xxx cards. GeForce 7xxx and all GeForce 8xxx? Yeah, so many dead cards on my hands. Now Radeon 9600/9700/9800 were notoriously dropping like flies, but mostly due to RAM failures.

Xbox 360 chips were defective, but Microsoft also had some say in the final design.

I must be some kind of standard: the anonymous gangbanger of the 21st century.

Reply 13 of 24, by Archer57

Rank: Member
momaka wrote on 2025-07-08, 17:35:

TLDR: Both nVidia and ATI screwed up their designs in a similar way and TSMC had no problems manufacturing what they ordered... plus introducing even more problems of their own. So it seems this was just technology of the times and both ATI and nVidia made similar mistakes in their designs.

Honestly, I have a very hard time buying the narrative that it was a screw-up and was fixed. Newer, supposedly fixed, chips fail just as much if not more.

The GTX 470-480 and 570-580 would be a good example - most are dead. And the AMD HD 4870/4890 are no better. Most laptops, especially high-end ones, die because of this issue. Either the GPU or the chipset fails.

IMO, discounting conspiracy theories like "intentionally built to fail", there is simply a fundamental flaw in this way of building chips which has not been resolved and still very much kills new hardware.

Why CPUs do not fail is a good question. For example, in a laptop with shared cooling and everything sitting around 100C, it will always be either the GPU or the chipset that dies, not the CPU. Which is curious, because if it were related to power, the chipset should not be the one to die...

As for "preserving"... my question would be: what actually kills - high temperature, or fast temperature changes? Because given what the issue supposedly is, fast changes in temperature would be more dangerous than just high temperature. So theoretically, whatever cooling solution is used should be built with that in mind - as high a heat capacity as possible, to prevent rapid changes. In this sense modern cooling solutions, be they heatpipes + a thin fin stack or liquid, are bad and may be part of the reason why the failures happen in the first place. A giant slab of copper might do better...
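
Rough numbers on the thermal-mass idea, treating the cooler as a simple lumped mass; the masses and the 40W load step are just illustrative assumptions:

# How fast does the cooler itself heat up when the chip jumps from idle to
# load? Lumped-mass estimate: dT/dt ~= P / (m * c). Masses are illustrative.

SPECIFIC_HEAT_J_PER_KG_K = {"copper": 385.0, "aluminium": 900.0}

def initial_rise_k_per_s(step_power_w, mass_kg, material):
    return step_power_w / (mass_kg * SPECIFIC_HEAT_J_PER_KG_K[material])

STEP_W = 40.0   # sudden extra dissipation when a game or stress test starts

slab = initial_rise_k_per_s(STEP_W, 1.0, "copper")     # ~1 kg copper slab
laptop = initial_rise_k_per_s(STEP_W, 0.05, "copper")  # ~50 g heatpipe + fin stack

print(f"~1 kg copper slab:   {slab:.2f} K/s initial rise")
print(f"~50 g laptop cooler: {laptop:.2f} K/s initial rise")
# The heavy slab soaks up the transient and drifts slowly; the light cooler
# follows the load almost instantly, so the die sees more frequent swings.

This ignores how quickly heat actually spreads from the die into the slab, so it only illustrates the damping effect, not absolute die temperatures.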

Reply 14 of 24, by momaka

Rank: Oldbie
Archer57 wrote on 2025-07-09, 00:29:

Honestly, I have a very hard time buying the narrative that it was a screw-up and was fixed. Newer, supposedly fixed, chips fail just as much if not more.

The GTX 470-480 and 570-580 would be a good example - most are dead. And the AMD HD 4870/4890 are no better. Most laptops, especially high-end ones, die because of this issue. Either the GPU or the chipset fails.

Well, specifically regarding high-end GPUs like the GTX 280, 470, 480, 580, etc., along with the HD 4850/70/90 - all of those were simply due to the GPUs running at too high of a temperature (inadequate coolers and lousy fan profiles that prioritize less noise over better cooling), with the GTX 470 and 480 having probably the most underwhelming cooling of the bunch. I mean, 85C is NOT OK, no matter what nVidia and ATI/AMD tell you, especially when you consider that with each generation, the complexity and average die area have either remained the same or gone up. A larger die area is good for improving heat dissipation (and also for carrying more current to the GPU core), but it also has the negative effect of greater material expansion with temperature deltas (e.g. if you take two sheets of the same plastic material, cut one to 10x10 cm and the other to 50x50 cm, and heat them up, guess which one will show more expansion). This is why, if you've noticed, temperatures on modern GPUs have overall been declining (very slowly) compared to what we saw in the GTX 280-580 era. *AND* GPU silicon has also become thinner - not only due to the smaller manufacturing node, but also to dissipate heat better. In essence, though, this makes the GPU die weaker, as material expansion can now cause the die to crack or fracture - this is more a failure mode of modern sub-20/10 nm GPUs than of the older stuff.

So to say it in a shorter way: new GPUs fail for different reasons now... or rather, there have been new challenges in manufacturing modern GPUs, and not all of them have been solved, or solved completely.

Take for example cards like the RTX 4000 and 5000 series or their AMD equivalents - we are now at 300+ Watts at around 1V... so about 300 AMPS (!!!), all going into an area of less than 400 mm^2. Sure, the distance between the GPU die and the substrate underneath is tiny (the width of the solder bumps)... but that's still a massive amount of current. And if you compare that to what bumpgate-era cards like the 7800/7900 had to carry with their high-lead bumps, you can start to see where the issue is with modern cards compared to the old stuff.

In other words, the failures are for different reasons now. But temperature is still a major player in the field. Which brings me to your next point.

Archer57 wrote on 2025-07-09, 00:29:

As for "preserving"... my question would be: what actually kills - high temperature, or fast temperature changes?

Both.
And not only.

This is purely materials physics.

Higher temperatures (or equivalently, higher temperature delta between cold/hot) means greater expansion and contraction rates between the different materials of the GPU chip.

Then there's the number of cycles sustained - i.e. how many times you expose these materials to the hot-cold cycles. The more cycles sustained, the more fatigued the material will get (mainly the solder bumps, as this is the material between the GPU die and GPU substrate that is exposed to the most stress, along with the underfill that is supposed to alleviate most of this stress from the bumps.)

And lastly, "fast changes in temperature" (of the GPU): in a way, that is a sub-set / combination of the above phenomena (number of cycles sustained and temperature delta). With that said, fast changes in temperature probably won't contribute too much to material stress in the GPU if the temperature deltas are rather small (e.g. the GPU constantly bouncing up and down 2-3C every second). But if the temperature deltas are relatively large (e.g. the GPU at 50C one instant and 60C the next), that will surely contribute to a faster failure.
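
A crude way to see how much the delta matters is the usual Coffin-Manson-style scaling for solder fatigue. The exponent below is an assumed ballpark for solder joints, not something measured on these GPUs:

# Relative cycles-to-failure under thermal cycling, Coffin-Manson style:
# N_f is proportional to (delta_T)^(-n). The exponent n ~= 2 is an assumed
# ballpark for solder joints, not a value measured on these chips.

FATIGUE_EXPONENT = 2.0

def relative_cycle_life(delta_t_c, reference_delta_t_c=40.0):
    """Cycle life relative to a 40C idle-to-load swing (reference = 1.0)."""
    return (reference_delta_t_c / delta_t_c) ** FATIGUE_EXPONENT

for delta_t in (40, 30, 20, 10):
    print(f"{delta_t:>2}C swing -> ~{relative_cycle_life(delta_t):.1f}x the cycle life of a 40C swing")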

Archer57 wrote on 2025-07-09, 00:29:

So theoretically whatever cooling solution used should be built with that in mind - as high heat capacity as possible to prevent rapid changes. In this sense modern cooling solutions, be it heatpipes+thin fin stack or liquid, are bad and may be a part of the reason why failures happen in the first place. Giant slab of copper might do better...

Good point... and probably one reason why the P4 Prescott (with its soldered IHS) never had any problems compared to Northwood (IHS not soldered), which would occasionally fail "out of the blue".

Unfortunately, a giant slab of copper is not only more expensive, but also has inferior thermal conductivity compared to heat pipes. And given the TDP per surface area of modern chips, an all-copper heatsink may simply not cut it anymore to transfer all of the heat to the cooling fins. So heat pipes are somewhat necessary at this point.

But in the end, it's not that designers and manufacturers can't solve the cooling issues. Rather, they are just interested in providing a solution that is only "good enough" for the "average user" - e.g. around 5 years of daily use... which is more or less the average upgrade cycle for most people both in the IT field and gaming.

Archer57 wrote on 2025-07-09, 00:29:

IMO, discounting conspiracy theories like "intentionally built to fail", there is simply a fundamental flaw in this way of building chips which has not been resolved and still very much kills new hardware.

But then again, we spin back to the question: why don't CPUs fail (nearly as often), then?

IMO, there is no fundamental flaw of how chips are built.
It's just more of the industry following the trends of the user/consumer.
Again, people tend to upgrade their GPU about every 5 years - at least anyone who plays games demanding enough to need a discrete GPU... and the industry knows this. Meanwhile, CPUs may stick around with their users for 10 years... or sometimes even more in the case of industrial applications. Also, the CPU is still a fundamental building block of any PC, while a discrete GPU is not. As such, most discrete GPUs are considered more of a "consumable", even when they are not upgradeable or replaceable in a system.

So in short, CPUs and GPUs are built to different expected-lifecycle standards. And because of that, manufacturers simply don't see a reason to make GPUs as durable as CPUs.

Of course, now with more modern and very small manufacturing nodes, we've also started seeing CPUs fail a lot more frequently than before / back in the day (socket 462 excepted. 🤣 )

The Serpent Rider wrote on 2025-07-08, 19:31:

Not sure about ATi. So far my experience with X1800/X1900/X1950/HD 2900 is that they are built like a tank, despite being Devil's armpit hot.

Well, the HD 2000 and 3000 series specifically (and to a lesser extent the HD 4000 and 5000 series) - yes, they really are quite well-built. If you keep these cool, you can actually make them last a very long time. The problem with cards like the HD 3850/70 and 4850/70/90, though, is that their stock fan profile was often set to spin up the fan only after the GPU reached some absurd temperature, like 70-75C, and not really ramp up noticeably until things were cooking at 80+C. That's bad, and 100% the fault of the cooling. (It also shows how much trust ATI/AMD had in their GPUs at the time, despite the then-rumors of the nVidia bumpgate issue on the horizon.)

As for the X1000 series - I don't have that many in my collection and haven't analyzed them too closely as a result. But from what I have seen sold over the years, they do appear to be relatively OK too.

In the case of the Xbox 360, that too was mostly a cooling issue - the GPU heatsink (both the simple aluminum one and the one with the heatpipe and small extension) is really only good for cooling chips up to 30-35W TDP... maybe 40W tops. Past that, the temperatures start to go past 70C - and that's with lots of airflow in an open-air test box. The Xbox 360 has neither of those last two. The same GPU heatsink in it gets both poor airflow and air that isn't very cool. So the older, higher-TDP GPUs in these tend to run very hot (80C), which is probably the primary reason they die. Now, if the early GPU models also had issues with a soft underfill like nVidia did, that explains even more why the older/oldest Xbox 360s were so failure-prone.

The Serpent Rider wrote on 2025-07-08, 19:31:

Now Radeon 9600/9700/9800 were notoriously dropping like flies, but mostly due to RAM failures.

Well, the Radeon 9600 should not be grouped in the same category as the 9700 and 9800.
If anything, the "Radeon 9600 series" can be split into two categories - those with BGA RAM and those with regular TSSOP RAM. The latter (TSSOP RAM models) are pretty bullet-proof if they come with an active cooler - i.e. one with a fan. The passively-cooled ones (often the 9550, 9600 vanilla / non-Pro, and a few 9600 Pros) would occasionally fail due to running too hot. And the BGA RAM models (mostly 9600 XT cards) had issues due to the RAM.

As for the 9700/9800 series - in addition to failing RAM, these also failed due to inadequate cooling of the GPU chip. Actually, the two go hand-in-hand: if the GPU chip runs hot, it lets the PCB underneath and around it get hot as well. The BGA RAM used on these cards relies on being cooled by the PCB underneath (through the BGA solder balls). The solder itself is leaded and not prone to failing at all. But when the PCB is already very hot because the GPU is running too hot, obviously the RAM can't cool properly... and that can lead to premature RAM failures... if the GPU didn't fail first from the high temperatures.

So for the Radeon 9700/9800 cards, the overall cooling is just dismal, no matter which way you look at it. Anyone who runs these with the stock cooler is essentially asking for their card to die. On that note, if you put a massive heatsink on the GPU, you'll notice that the RAM now also runs much cooler. So not only do you extend the life of the GPU, but also that of the RAM... though for completeness, one should also install RAM heatsinks and actively cool those as well. Doing so should greatly extend the life of these cards. I have 2 that were reflown over 10 years ago and still work with my improvised oversized coolers. So I take that as good enough proof that these can last with good cooling. FWIW, reflowing a GPU should not be considered a fix, but only a temporary revival. The fact that I can extend their service life this much goes to show that temperature is the enemy here.

Reply 15 of 24, by The Serpent Rider

Rank: l33t++

In the case of the Xbox 360, that too was mostly a cooling issue

It's pretty much proven that early Xbox 360 had the same "soft" substrate that all Nvidia chips used. Shitty cooling just accelerated degradation exponentially.

I must be some kind of standard: the anonymous gangbanger of the 21st century.

Reply 16 of 24, by Archer57

Rank: Member
momaka wrote on 2025-07-12, 13:07:

Both.

Well, my opinion is different. Temperature by itself does not do much. It may accelerate certain processes a bit, but neither of those has been a very significant reason for failure so far (though they may become one in time). That is assuming nothing is defective, of course; a 70C underfill Tg + an 85C fan target is a nice and fast way to kill stuff.

If you were able to make a chip sit steadily at 55C or at 85C with no changes at all, both would be just fine.

What matters are the changes and the related stresses, not only at the whole-chip level but also at localized hotspots. And that's where messing with cooling should be done carefully - it is easy enough to do more harm than good, even if temperatures are lower as a result.

Also, IMO that, and not just temperatures, is the reason stuff in laptops and such dies a lot: the very low heat capacity of the cooling system results in fast, large and frequent temperature changes.

Reply 17 of 24, by myne

Rank: Oldbie

The TLDW of the PlayStation guy's video is that the underfill between the die and substrate, which is supposed to surround the mini-BGA-style bumps and provide a level of mechanical support, was less than ideal for the task.

Think of it like the grout under and between your tiles.

If you regularly shower at 100C and then rapidly cool it to 20C, while also being a fat bastard, it's likely many grouts will not perform over the long term.

I built:
Convert old ASUS ASC boardviews to KICAD PCB!
Re: A comprehensive guide to install and play MechWarrior 2 on new versions on Windows.
Dos+Windows 3.11+tcp+vbe_svga auto-install iso template
Script to backup Win9x\ME drivers from a working install
Re: The thing no one asked for: KICAD 440bx reference schematic

Reply 18 of 24, by Vany

Rank: Newbie

Small update on this: I've acquired a laptop from the late 2000s, a ThinkPad T500, that has such a chip (an ATI Mobility Radeon HD 3650 in this case). Upon disassembly, I noticed that the cooler for the GPU was looking quite dodgy, as if it was designed to fail. The GPU chip wasn't completely covered by the heatsink, as it had bent during years of usage; this wouldn't have been noticeable when the machine was new. I suppose the chip survived because this laptop has two graphics chips, the other being the GMA 4500MHD. I went with "Solution C" and modified the heatsink by gluing a single 1mm washer on top of it so that the keyboard keeps it squeezed down.

This would be my second IBM/Lenovo with really mind-boggling design decisions when it comes to cooling. My older one (an A30?) had a Radeon 7500 that was known to fail due to overheating. I opened that one too, and it was just a naked chip, no heatsink on it whatsoever - and that one hadn't been opened before.

Now that I've confirmed though that the Radeon chip in the T500 is good, I'll do some stress testing.

Trident Cyber 9525DVD Test, Review and supported games list

Reply 19 of 24, by Archer57

Rank: Member
Vany wrote on Yesterday, 05:17:

I went with "Solution C" and modified the heatsink by gluing a single 1mm washer on top of it so that the keyboard keeps it squeezed down.

Did you use glue that can handle up to 100C? It may fall off otherwise...
Also, be really careful with this - it's pretty easy to crack the die if you apply uneven pressure. I'd be pretty hesitant to keep it "squeezed" by the keyboard, just in case I type on it with a bit too much force...

Vany wrote on Yesterday, 05:17:

Now that I've confirmed though that the Radeon chip in the T500 is good, I'll do some stress testing.

Be careful with this. I am not sure it is a good idea to torture an old chip with stress tests. If it's working, I'd just use it normally and hope it lasts; there's no reason to basically do accelerated aging/wear on purpose...