Accurately troubleshooting video memory faults with VMTCE \ VOGONS

First post, by Thermalwrong

Posted on 2023-12-07, 01:20

Rank: Oldbie
Posts: 1821
Joined: 2018-03-18, 19:24
Location: England

Testing failing video memory on a 3Dfx card is something that I've struggled with for a while, because I keep purposely buying really old broken video cards to see if they can be fixed. For the initial part of this thread I'll focus on a 3Dfx Voodoo Banshee card with faulty memory that I got hold of about 3 years ago, the ELSA Victory II: http://hw-museum.cz/vga/31/elsa-victory-ii
It was sold as 'untested' which meant faulty but not as roughed up as the cards I get from the recycler.
For a couple of years it confounded me how I'd get working memory for it and how to troubleshoot it, so I kept putting it off.
The card looked okay in text mode, but on switching to a VGA graphics mode or running any kind of 3D game the memory errors were immediately visible.

There's a DOS-level video memory tester called VMTCE from a long time ago that can run from a floppy or CD drive, and it's my preferred tool at this point for testing memory function: https://sourceforge.net/projects/vmtce/files/
Up to this point I've used the floppy disk version because it's fast to set up, but today I got tired of waiting for it to start up for each test. So I've fdisk'd and formatted a small CF card with FreeDOS (what VMTCE runs on) and copied the floppy disk files onto it. Startup is super quick now, and the FDCONFIG.SYS file can be edited to reduce or remove the wait time in the startup menu.

VMTCE displays memory locations that give bad values like this - you can see noise on the picture too because the bad memory here is in a low enough range to break the framebuffer in 640x480 VGA mode:

The attachment VMTCE-initially.JPG is no longer available

I don't think VMTCE is infallible with all cards: I've seen it report memory errors on a pair of ATI Rage Pro Turbo cards that seem to work fine in use. VMTCE does work well with 3Dfx cards and Nvidia cards though 😀

Recently I was going through my junk card selection and found a damaged ATI Rage (Pro Turbo?) 8MB SGRAM AGP card with no bracket. Its 8MB (4 chips) of SGRAM comes from a different manufacturer than the memory on my ELSA Victory II, ISSI IS42G32256 instead of the Samsung km4132g512, but 100-pin QFP SGRAM is standard. The pinout and memory layout are the same: 256k (cells) x 32 (bits) x 2 (banks) at 3.3V, which works out to 2MB per chip. Perfect, we thank you for your service, little ATI Rage. As a bonus the donor card got a run through VMTCE before being dismantled for parts.
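A quick sanity check of that capacity, as a minimal sketch. It assumes 32-bit-wide chips, which is what four chips adding up to 8MB implies; the function name is just for illustration.

```python
# Capacity of one SGRAM chip organised as cells x width x internal banks.
# Assumption: 256K cells per bank, 32 data bits, 2 internal banks.
CELLS_PER_BANK = 256 * 1024
WIDTH_BITS = 32
INTERNAL_BANKS = 2


def chip_megabytes(cells=CELLS_PER_BANK, width_bits=WIDTH_BITS, banks=INTERNAL_BANKS):
    """Return the capacity of one chip in megabytes."""
    total_bits = cells * width_bits * banks
    return total_bits / 8 / (1024 * 1024)


per_chip = chip_megabytes()
print(per_chip)       # 2.0 (MB per chip)
print(per_chip * 4)   # 8.0 (MB for the four chips on the donor card)
```

With a 16-bit width the same sum would only give 1MB per chip, so four chips could not add up to 8MB.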

To figure out how to replace the Banshee's SGRAM efficiently instead of replacing every chip - because I have fewer chips and the chance of a soldering error goes up with every chip changed - I wanted to understand the memory layout of the Voodoo Banshee card. As far as I'm aware the true pinout of the Voodoo Banshee is unknown, but we can use documents for the Voodoo 3 in its stead. The Voodoo 3 is more a sequel to the Voodoo Banshee than to the Voodoo 2, it's a Banshee 2 😀
Incidentally, there just so happens to be a full application schematic of a 3Dfx Banshee 2 PCI card on the web. I've checked its pinout and it refers to the Voodoo 3's pin layout rather than the Banshee's.
Anyway, we have the Voodoo 3 databook, which gives us a diagram of the layout as well as the pins:

The attachment Voodoo3-&-Banshee-chip-layout.jpg is no longer available

That means the chip in the upper left of the card takes frame buffer data pins 0 to 31 as one 32-bit SGRAM chip, and the chip in the lower right is the end of the bus, finishing at frame buffer data pin 127. The ELSA Victory II has 4 SGRAM chips on the front and 4 on the back - the data pins go to the same places, but the chips on the front are Bank 0 and those on the back are Bank 1.

To simplify testing, since it's the framebuffer area that has a problem, the Voodoo 3 and Banshee can be limited to operating just 1 bank of memory (4-chip mode) by moving a 'strap' resistor. This is a small pull-up or pull-down resistor connected to one of the VGA BIOS ROM's Data or Address pins; the card reads these on power-up and sets itself up accordingly. I had to remove the big 40-pin header from the card to find the straps, but you can trace this one easily by following the EEPROM's Data pin 5 to a nearby resistor. There will usually be an unpopulated resistor pad nearby that connects to the opposite level, in this case Ground:

The attachment mapping strap pins.jpg is no longer available

To limit the ELSA Victory II to 4-chip mode, or 8MB of ram, I moved the 4.7k resistor from R23 to R4.

It made no difference to the display issue in this case, but it was helpful for diagnostic purposes because this card had a fault in both banks. I made some initial guesses from what I was seeing in VMTCE which I'll not elaborate on, but I changed chips in this order:

Okay change the top-left chip U1
No change huh, okay well now we'll put the chip I just took off from U1 into U2's spot since we know that wasn't the problem
Oh still no good huh, well let me think about this before I swap the chip again, oh huh the bottom right chip has a scratch on it.
Got it! That bottom chip was bad and now the card passes VMTCE tests at 8MB of memory. Now the display is normal again, I'm sure the card is fixed now...

The attachment IMG_2662 (Large).JPG is no longer available

BTW - that first VMTCE picture shows how I know that it's a RAM fault and not a trace / signal fault: notice that the bad bit moves to a different position at one point.

The attachment VMTCE-8MB-working.JPG is no longer available

I was very happy to see this, so now the card appears to be working and this confirms that the Banshee's BGA chip and all memory traces connecting to it are working properly. Excellent! Surely everything will go fine once I enable the rest of the memory

Reply 1 of 6, by Thermalwrong

Posted on 2023-12-07, 01:55

When I put the strap resistor back to its original position for 8-chip operation instead of 4-chip, disappointment ensued.

Oh I also skipped a bit where the card didn't work after swapping the correct chip because of an unsoldered pin. Prior to that the card was displaying lines all over the place from where I was trying to resolder the pins without enough flux, that got fixed ages ago though.

From here, the card would give an error the moment it moved onto Bank 1 of memory, the 800025 hexadecimal address is 8388645 in decimal so just over the 8MB boundary:

The attachment VMTCE-Bank1-Enabled.JPG is no longer available
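The address arithmetic above can be sketched as follows. This is only an illustration: the 8MB bank size is taken from this card, and the function name is made up for the example.

```python
BANK_SIZE = 8 * 1024 * 1024  # assumption: 16MB card split into two 8MB banks


def locate(address_hex):
    """Convert a VMTCE hex address to (decimal address, bank, offset in bank)."""
    address = int(address_hex, 16)
    bank = address // BANK_SIZE
    offset = address % BANK_SIZE
    return address, bank, offset


print(locate("800025"))  # (8388645, 1, 37)
```

So the first failing byte sits only 37 bytes into Bank 1.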

I swapped chips pretty much at random because my understanding of what I was seeing was limited; it took a few hours of bashing against this to understand what's what. Eventually I started lifting Data pins on individual memory chips to see what would change, trying to get a value similar to the one displayed above, where the common elements are: above 8MB and ending in "5".
To try to find the fault, here I've lifted DQ2 away on Bank1-Chip4 - notice that I'm not soldering it to Ground or VCC. I quickly found that tying a Data line like that breaks the line for everything on that bus, so Bank 0 - Chip 4 was broken when Bank 1 - Chip 4's DQ2 was hooked up to Ground:

The attachment VMTCE-lifting pins.jpg is no longer available

Isolating it results in the pin floating so it could be either high or low without the memory chip affecting that result.

The attachment VMTCE-lifted-pins.JPG is no longer available

Notice that the test now shows an error ending in "F". It took me a while to cotton on to what was happening. I tried a few more, just lifting pins at the edge of a chip so they're easy to get to and then put back:

Lifting DQ3 on bank1-chip2 (bits 32-63) resulted in errors ending in "4"
Lifting DQ3 on bank1-chip1 resulted in errors ending in "0"
Lifting DQ29 on bank1-chip1 resulted in errors ending in "3"

Then I removed Bank1-Chip1 and tested without it fitted to see what would happen:

The attachment Rip off Bank1-Chip1.JPG is no longer available

And the resulting error confirmed it:

The attachment VMTCE-rip-off-chip1.JPG is no longer available

Now the errors go 0-1-2-3-5. That means that if the address value ends in 0, 1, 2 or 3 then it's an error on Chip 1.
The last hex digit of the faulting address breaks down like this:
0 - CHIP1-DQ0 TO DQ7
1 - CHIP1-DQ8 TO DQ15
2 - CHIP1-DQ16 TO DQ23
3 - CHIP1-DQ24 TO DQ31
4 - CHIP2-DQ0 TO DQ7
5 - CHIP2-DQ8 TO DQ15
6 - CHIP2-DQ16 TO DQ23
7 - CHIP2-DQ24 TO DQ31
8 - CHIP3-DQ0 TO DQ7
9 - CHIP3-DQ8 TO DQ15
A - CHIP3-DQ16 TO DQ23
B - CHIP3-DQ24 TO DQ31
C - CHIP4-DQ0 TO DQ7
D - CHIP4-DQ8 TO DQ15
E - CHIP4-DQ16 TO DQ23
F - CHIP4-DQ24 TO DQ31
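The table above follows a regular pattern, so it can be sketched as a small lookup. The function name and the chip_width_bits parameter are my own for illustration, and the chip numbering is just the 1 to 4 used in the table.

```python
def decode_lane(address_hex, chip_width_bits=32):
    """Map the last hex digit of a faulting address to (chip, first DQ, last DQ)."""
    lane = int(address_hex, 16) & 0xF        # byte lane on the 128-bit bus
    bytes_per_chip = chip_width_bits // 8    # 4 byte lanes per 32-bit chip
    chip = lane // bytes_per_chip + 1        # chips numbered from 1, as in the table
    first_dq = (lane % bytes_per_chip) * 8
    return chip, first_dq, first_dq + 7


print(decode_lane("800025"))  # (2, 8, 15)
```

An address ending in "5" therefore lands on Chip 2, DQ8 to DQ15, which is the same row as in the table.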

That matches up with the original fix on the first bank of memory. That's great right? Then surely just replacing Bank 1 - Chip 2 will fix the card, right? 😁 😁 😁

Reply 2 of 6, by Thermalwrong

Posted on 2023-12-07, 02:30

Now that I knew the fault was somewhere on Bank1-Chip2, I tried swapping the chip for a replacement but it made no difference. I had a suspicion that would be the case - the error value never fluctuated, always just wrong at the same bit in the same way.
Thinking about it:

It's not an address pin because it appears to be getting values for most of that 8-bit cell of memory
Since Bank 0 has all of its Data bits working, if there's a problem it's probably a trace that goes from a Bank 0 data pin into the Bank 1 data pin
Swapping the chip didn't help so we know it's not a chip fault
Need to narrow down what we're seeing for this error value - what data pin is this 00010000 really referring to?

Since we know that "5" means the 2nd set of 8 data bits on Chip2, that should mean DQ8 to DQ15 is where the problem lies.

Btw if you're using VMTCE to troubleshoot, this could mean bank 1 or bank 0 since they share the trace, though you can figure that out from the faulting address being under or over 8MB / half the card memory.
Additionally, lifting a pin is fairly easy if you use a small-ish tip, not much flex and you need a craft knife blade - that's thin enough and gives a good angle to carefully wedge out the pin to be isolated.
To aid in locating the pin to be isolated, I took a well-lit picture just above the memory chip in question, then in paint.net overlaid the datasheet's pin diagram on top of it to see which Data bit is at which pin.

To test, I tried disconnecting DQ15 and that gave this result:

The attachment VMTCE-DQ15-lifted.JPG is no longer available

So DQ15 is the left-most bit! That means VMTCE displays memory blocks as 8-bit groups going from highest value on the left to lowest (LSB?) on the right:
Data bit:  15  14  13  12  11  10   9   8
VMTCE bit:  0   0   0   1   0   0   0   0
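That bit order can be sketched in a few lines. This assumes the display really is most significant bit on the left, as the DQ15 test suggests; the function name is hypothetical.

```python
def failing_dqs(lane_in_chip, bits):
    """Return the DQ numbers flagged in a VMTCE bit string.

    lane_in_chip: which byte of the 32-bit chip (0-3), i.e. last address digit mod 4.
    bits: the 8-character string VMTCE shows, most significant bit on the left.
    """
    base = lane_in_chip * 8
    return [base + (7 - i) for i, b in enumerate(bits) if b == "1"]


print(failing_dqs(1, "00010000"))  # [12]
print(failing_dqs(1, "10000000"))  # [15]
```

The second call matches the lifted DQ15 pin, and the first is the original fault pattern.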

Data Bit 12 is the one that has no connection! 😁 That even makes sense - this card had some corrosion from poor storage or something and I had tried cleaning up the pins around that chip but there was a problem hiding underneath:

The attachment IMG_2648 (Custom).JPG is no longer available

So there are vias going from the front of the card to the back, and the one connecting Data bit 12 on Bank1-Chip2 has no connection to the main trace anymore. We'll just fix that with a bodge wire because I don't want to remove the chips front & back:

The attachment IMG_2663 (Custom).JPG is no longer available

The attachment IMG_2664 (Custom).JPG is no longer available

Now the memory checks out - I have no idea whether it'll work properly for games yet but now I can install drivers which I'll do soon.

The attachment VMTCE-Memory-Repaired.JPG is no longer available

I hope this information is helpful to others, it should be useful knowledge for repairing not just a Voodoo Banshee card, but should also be applicable to the Voodoo 3 and lots of Nvidia cards. Now that I know this much I should have another go at repairing this GF4-Ti4200 with bad memory.

Reply 3 of 6, by Thermalwrong

Posted on 2023-12-07, 16:32

Now for a repair that's not successful - here's a 3Dfx Voodoo 3, it has 16MB of SDRAM and one of the bits is bad. I caused this, I was fitting one of those big zalman passive heatsinks on it and while tightening the screws heard a 'pop' and now the video output is broken.

The attachment A Voodoo 3.JPG is no longer available

It's fair game for testing though, and this should be helpful for anyone else repairing a Voodoo 3 in future, especially since the Voodoo 3 is one PCB layout while my ELSA Victory II is a rather specific board. The V3 differs from the SGRAM-equipped Banshee card because its memory chips are just 16 bits wide. That means every single chip contributes to the 128-bit data bus for the video card's memory, with no real memory banking going on either.
This means that the straps I mentioned in my previous post do nothing at all on a Voodoo 3. I just tested it by moving them, and the card still sees itself as 16 Megabytes when configured for 1-bank operation with 8-megabit DRAM. That does limit us to testing the full memory at once.

However, because there's no banking, shorting a data pin to ground has no adverse effects on the rest of the bus, since each data line goes to only one memory chip. That means I can short a convenient data pin to its nearest ground to determine which chip results in which end-value in VMTCE. For my testing, I shorted DQ1 (Data bit 1) of an SDRAM to VSSQ (Ground for data bus pins):

The attachment HYB39S16160.jpg is no longer available

Just a little bit of solder which cleans up easily afterwards:

The attachment Shorting a data pin to ground.JPG is no longer available

It results in an error like this:

The attachment VMTCE-broken-voodoo3 (Large).JPG is no longer available

In fact, this is the original error that the card had. Notice that it results in lines down the screen rather than small lines / noise at intervals; from what I understand so far, that means there's a signal/trace/pin-connection problem rather than a memory chip problem. A pin could be disconnected from the PCB, or the connection could even be broken at the GPU itself.
This error ends in "7" and is bit -----x--, what does that mean?

Well here's a reference picture of what I found bit errors on each chip segment resulted in:

The attachment Voodoo3-memory-layout.jpg is no longer available

Because it ends in "7", that means the broken data line is on chip U8 and is the upper half of the data lines. Because of how VMTCE displays the bit order I've laid out the diagram to make that easier to follow. If we count from DQ15 down to the 6th data pin, that's DQ10 on U8 that has failed.
It's soldered down correctly and I doubt that the chip is damaged - given where my heatsink was applying pressure I believe that the trace for U8-Data10 has broken away under the GPU. Maybe Oskhar could fix this but it can stay where it is for now.
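The counting for a 16-bit chip can be sketched like this. It assumes odd address digits are the upper byte (DQ8 to DQ15) and even digits the lower byte, following the same pattern as on the Banshee; which physical chip a digit belongs to (U8 here) still has to come from the reference picture, so the sketch does not try to guess it.

```python
def decode_v3_error(last_digit_hex, pattern):
    """Return (half, dq) for a VMTCE error on a bus built from 16-bit chips.

    last_digit_hex: final hex digit of the faulting address.
    pattern: the 8-character bit display with 'x' marking the bad bit,
             most significant bit on the left.
    """
    lane = int(last_digit_hex, 16) & 0xF
    upper = lane % 2 == 1                   # assumption: odd lanes carry DQ8-DQ15
    bit_in_byte = 7 - pattern.index("x")    # leftmost character is bit 7
    dq = bit_in_byte + (8 if upper else 0)
    return ("upper" if upper else "lower", dq)


print(decode_v3_error("7", "-----x--"))  # ('upper', 10)
```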

So this results in a no-fix but now you should be able to pinpoint memory faults down to individual chips & pins on a Voodoo 3 with VMTCE 😀

This method of testing is applicable to any memory bus where memory banking is not used, because banking shares each data line between chips. It's safer than lifting pins, which on small TSOP packages like this could easily result in broken chips because the legs are too small and short.
Technically this method of memory diagnostic should be applicable to any card with TSOP memory like Geforce and Radeon cards up to the point they moved to BGA memory.

Reply 4 of 6, by Thermalwrong

Posted on 2023-12-08, 17:15

Ah yes, this looks Perfect 🤣 Here's the Voodoo Banshee again now that I can install drivers, not quite 100% working as I'd hoped even though it passes memory tests:

The attachment lol-banshee.png is no longer available

Still looks like a framebuffer issue rather than the card being actual toast since games *do* run. 24-bit display mode looks pretty much fine as well, just 256-colour, 16-bit and 32-bit have lots of corruption. Dropping the memory clock changes / lessens the corruption. I guess I now need to see if VMTCE can do some more thorough testing than setting static values.

edit: wow, Unreal and Deus Ex actually look quite good and not horrifically glitchy. Textures and polygons are mostly fine, with a couple of graphical lines showing up occasionally for a split second at a time.
If the memory layout is linear, I guess that means it would be framebuffer, then polygon stuff, then textures. Perhaps that would mean the error is in the lower part of memory or bank 0.
Maybe I should swap out the mis-matched ISSI chips for Samsung ones. And fix those address lines at trace-level rather than a bodge wire.

Reply 5 of 6, by Thermalwrong

Posted on 2023-12-10, 22:58

Now that I've got this knowledge, if the card can display on the screen then I can run VMTCE from a floppy disk, which auto-logs to log.txt on the disk.
With that I was just able to find a solder bridge I'd accidentally created between data bits 12 & 13 on RAM chip 1, and that took about a minute. I'm currently dismantling the ELSA Victory II Banshee card because that horror on the screen led me to pull RAM chips bank0-chip2 & bank1-chip2 - I've drilled holes through the vias that had corrosion under those chips and uh, that was probably a really bad idea but the card still runs. It cost me all my small carbide drill bits 😠

Here's the floppy disk image that's set up to log to a file with no user intervention. Make sure the system can boot from floppy and then boot up with the video card to be tested - the display can be garbage, it'll still work. Wait a minute for it to start, at which point it should go from displaying a text mode screen to 640x480 VGA mode. After 30 seconds of testing you'll hear the floppy drive going kchnk-kchnk as it starts writing the log file to disk.

To stop the test and have it write to the disk fully, and avoid damaging the FAT / log file data: Press Escape and then Enter and wait a few seconds. The computer should then reboot and the floppy disk can be removed to check the log data. Here's what my soldering error & missing chips look like:

LOG FILE: "a:\log.txt".
[10.12.23 22:47:48] Test started...
Processing offscreen...
Error at [00096001] must be FF, but found CF (bits: 00110000)
Error at [00096004] must be FF, but found 18 (bits: 11100111)
....
Error at [000964C6] must be FF, but found 18 (bits: 11100111)
Error at [000964C7] must be FF, but found C6 (bits: 00111001)
Error at [000964D1] must be FF, but found CF (bits: 00110000)
Error at [000964D4] must be FF, but found 18 (bits: 11100111)
Error at [000964D5] must be FF, but found C6 (bits: 00111001)
Error at [000964D6] must be FF, but found 18 (bits: 11100111)
TEST CANCELED
Some ERRORS FOUND, use arrow keys for scroll results log or press ENTER to exit...
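A log like this can be boiled down to a per-chip list of suspect data pins. The following is a rough sketch, not part of VMTCE: it assumes the 32-bit Banshee chip layout from earlier in the thread, and it works out the wrong bits itself by XORing the expected and found bytes (which matches the "bits" column in the lines above). The function name and regular expression are my own.

```python
import re

# Matches lines such as: Error at [00096001] must be FF, but found CF (bits: 00110000)
LINE = re.compile(
    r"Error at \[([0-9A-Fa-f]+)\] must be ([0-9A-Fa-f]{2}), but found ([0-9A-Fa-f]{2})"
)


def summarise(log_text, chip_width_bits=32):
    """Return {chip number: sorted list of DQ pins that ever read wrong}."""
    bytes_per_chip = chip_width_bits // 8
    faults = {}
    for match in LINE.finditer(log_text):
        address, expected, found = (int(g, 16) for g in match.groups())
        lane = address & 0xF                      # last hex digit = byte lane
        chip = lane // bytes_per_chip + 1         # chips numbered from 1
        base = (lane % bytes_per_chip) * 8        # first DQ of this byte
        wrong = expected ^ found                  # set bits are the ones that differ
        pins = faults.setdefault(chip, set())
        pins.update(base + bit for bit in range(8) if (wrong >> bit) & 1)
    return {chip: sorted(pins) for chip, pins in sorted(faults.items())}


sample = """Error at [00096001] must be FF, but found CF (bits: 00110000)
Error at [00096004] must be FF, but found 18 (bits: 11100111)
Error at [000964C7] must be FF, but found C6 (bits: 00111001)"""
print(summarise(sample))
# {1: [12, 13], 2: [0, 1, 2, 5, 6, 7, 24, 27, 28, 29]}
```

Chip 1 shows only DQ12 and DQ13, which is the solder bridge, while chip 2 is wrong across several bytes, as you'd expect with the chip removed.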

Reply 6 of 6, by lemonlime

Posted on 2023-12-11, 22:07

Rank: Newbie
Posts: 45
Joined: 2019-10-15, 14:08
Location: Canada

This is very useful information. Thanks so much for sharing this. I always hoped there was a better way than just randomly replacing SDRAM chips until artifacting disappears. The next time I come across a Voodoo 3 with memory troubles, I will definitely refer to this thread 😀

Also known as vswitchzero. Check out my YouTube channel: https://www.youtube.com/c/vswitchzero
