VOGONS


First post, by darry

Rank l33t++

My Windows 98 SE rig crashed while running 3DMark 2001 SE today and I am trying to determine why. Here is what happened:
- I run the 3DMark 2001 SE benchmark on the FX 5900 GPU.
- After running a few tests, Windows bluescreens with a disk write error.
- I press the reset button. The PC POSTs but the SIL3114 controller does not detect the hard disk.
- I power down, wait 30 seconds and power up, but the hard disk is still not detected.
- I test the drive in another PC: SMART status OK, disk detectable, SMART history has logged 39 write errors. I get no read errors while running an Acronis backup of the disk.
- I reconnect the hard drive in the Windows 98 SE PC.
- I power up the Windows 98 SE PC, but get no POST, so I power down.
- I notice that the DVI connector is loose, so I push it in and power up again, but still no POST, so I power down.
- I power up again; this time it POSTs, but with garbage on the right side of the screen. I power cycle the monitor and the garbage disappears.
- I check the BIOS health monitor and all voltages are OK.
- I boot into Windows 98 SE and run the 3DMark 2001 SE benchmark on the FX 5900 with no errors, artifacts or issues.
- I reboot and check the health monitor again for voltages: no issues (3.3V, 5V, 12V, -12V, -5V all still well within 5%).
- I reboot into Windows and run the 3DMark benchmark on the FX 5900 again without issue.
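
For anyone wanting to repeat the "within 5%" check on their own readings, here is a minimal Python sketch. The helper names and the example readings are made up for illustration (they are not the values from this machine), and a single reading in the BIOS only shows the rails at that moment, so a clean result does not rule out an intermittent problem.

```python
# Nominal rail voltages for the rails listed in the BIOS health monitor.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0, "-12V": -12.0, "-5V": -5.0}


def rail_deviation(nominal, measured):
    """Signed deviation (measured - nominal) as a fraction of |nominal|."""
    return (measured - nominal) / abs(nominal)


def out_of_spec(readings, tolerance=0.05):
    """Return {rail: deviation in percent} for rails outside the tolerance."""
    flagged = {}
    for rail, volts in readings.items():
        deviation = rail_deviation(NOMINAL_RAILS[rail], volts)
        if abs(deviation) > tolerance:
            flagged[rail] = round(deviation * 100, 2)
    return flagged


# Hypothetical readings, chosen so that one rail falls outside 5%.
example_readings = {"+3.3V": 3.31, "+5V": 4.97, "+12V": 11.3, "-12V": -11.9, "-5V": -5.05}
print(out_of_spec(example_readings))
```

An empty result means every supplied reading is inside the tolerance passed in; the 5% default simply mirrors the figure quoted above.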

I vote for the PSU possibly being intermittently out of spec. What do you all think?
I may yet run an extended SMART test on the drive in another machine just to make sure it is OK.

Excessively detailed system specs:

- Enermax EG465P-VE PSU
- Pentium 3 Tualatin-S @1400MHz with 512K L2 cache
- Ipox 3ETI23 industrial motherboard (815EP-based with 3 ISA slots and onboard Fast Ethernet)
- 512 MB (2x256) of PC133 SDRAM
- MSI AGP Nvidia FX 5900
- 3Dfx Voodoo 3 3000 (PCI)
- generic SIL3114 SATA controller
- 500GB Toshiba SATA drive (with 4 partitions, each smaller than 127GB)
- 5 and 1/4" floppy drive
- 3 and 1/2" floppy drive
- LG GSA-4167 DVD-ROM drive
- Gravis Ultrasound 3.73 with 1 MB RAM
- Creative AWE64 Value with 28MB RAM (thanks to SIMMCONN revival)
- Mediatrix Audiotrix 3DXG (OPL3SA) without DB60XG daughterboard (for SB Pro compatibility and authentic embedded OPL3)
- Windows 98SE with big HDD patch
- MIDIMAN MIDISport 2x4 running in passive mode and with output 2 looped into input 2 (turning it into 1x3 midi splitter) connected to AWE64 via DB15 to MIDI cable
- first generation Roland MT32
- Roland SC-88VL
- Yamaha MU500
- Akai DPS12 multi-track recorder used as an audio mixer
- RCA brand 4x1 component AV switcher (not enough inputs on the DPS12)
- Samsung 204B (20-inch 4/3 1600x1200 LCD monitor)
- Aopen QF50C blue and beige early 2000s case

EDIT: corrected sound cards in specs

Last edited by darry on 2020-04-19, 20:44. Edited 1 time in total.

Reply 2 of 8, by derSammler

Rank l33t
darry wrote on 2020-04-19, 19:47:

- After running a few tests, Windows bluescreens with a disk write error.
- I press the reset button. The PC POSTs but the SIL3114 controller does not detect the hard disk.
- I power down, wait 30 seconds and power up, but the hard disk is still not detected.
- I test the drive in another PC: SMART status OK, disk detectable, SMART history has logged 39 write errors. I get no read errors while running an Acronis backup of the disk.

I guess there's not much detective work to do here. The hard disk failed writing data, so better replace it before it dies completely. Note that if SMART logged the errors, you can rule out all the rest. SMART is running on the hard disk itself.

Reply 3 of 8, by darry

Rank l33t++
derSammler wrote on 2020-04-19, 20:21:
darry wrote on 2020-04-19, 19:47:

- After running a few tests, Windows bluescreens with a disk write error.
- I press the reset button. The PC POSTs but the SIL3114 controller does not detect the hard disk.
- I power down, wait 30 seconds and power up, but the hard disk is still not detected.
- I test the drive in another PC: SMART status OK, disk detectable, SMART history has logged 39 write errors. I get no read errors while running an Acronis backup of the disk.

I guess there's not much detective work to do here. The hard disk failed writing data, so better replace it before it dies completely. Note that if SMART logged the errors, you can rule out all the rest. SMART is running on the hard disk itself.

Alas, I feel it is not quite that simple, as the disk is fine read-wise in another machine and now apparently fine read/write-wise in its original machine. I will definitely run a SMART extended test to know 100%, and as I have the "bad" LBA addresses in the SMART log, I can try dd-ing the blocks under Linux.
What really gets me is the undetectability of the disk for a while. A (likely) media-related write error would not normally cause that, AFAIK, especially after a hard power cycle. Also, shouldn't the pending reallocation count have incremented (which was not the case here)?
Maybe the drive is just being intermittent, but I wish I could rule out a power issue, just in case.
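
As a rough illustration of what "dd-ing the block" amounts to, here is a minimal read-only Python sketch for Linux that reads sectors by LBA. It assumes 512-byte logical sectors, the function name is made up, and the device node in the usage note is a placeholder, not the actual device in this machine.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def read_sectors(path, lba, count=1, sector_size=SECTOR_SIZE):
    """Read `count` logical sectors starting at `lba` from a device node or image file.

    An unreadable sector on a real drive would normally surface as an OSError
    raised by the read itself.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, count * sector_size, lba * sector_size)
    finally:
        os.close(fd)
    if len(data) != count * sector_size:
        raise OSError(f"short read at LBA {lba}: got {len(data)} bytes")
    return data
```

Usage would look like `read_sectors("/dev/sdX", lba)` run as root, with `lba` taken from the SMART error log. This only tests reads; a write test overwrites data and should only be done on a drive that has already been imaged.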

Reply 4 of 8, by aha2940

Rank Member

In my experience, when several parts of the PC start acting up at about the same time, the power supply is usually the culprit. Try using a different one when doing your testing. Also, having a backup of the hard drive would be good, just in case 😀

Reply 5 of 8, by darry

Rank l33t++
aha2940 wrote on 2020-04-19, 20:55:

In my experience, when several parts of the PC start acting up at about the same time, the power supply is usually the culprit. Try using a different one when doing your testing. Also, having a backup of the hard drive would be good, just in case 😀

Good advice. I already had a backup, but a fresher one never hurts; that's why I imaged it with Acronis TrueImage.

My video card issue was due to a flaky cable, so it looks like it's "only" a disk issue. I am running an extended SMART self-test now.
I will probably end up changing the SATA cable, though, because upon closer inspection, the logged errors might be interface errors (GSmartControl calls them Interface CRC errors, ICRC in the SMART output):

Complete error log:

SMART Error Log Version: 1
ATA Error Count: 39 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 39 occurred at disk power-on lifetime: 7704 hours (321 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 2e 6c 15 01 Error: ICRC, ABRT 1 sectors at LBA = 0x01156c2e = 18181166

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 2e 6c 15 41 00 00:09:49.391 READ DMA
c8 00 01 2e 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2e 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA

Error 38 occurred at disk power-on lifetime: 7704 hours (321 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 2e 6c 15 01 Error: ICRC, ABRT 1 sectors at LBA = 0x01156c2e = 18181166

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 2e 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2e 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA

Error 37 occurred at disk power-on lifetime: 7704 hours (321 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 2e 6c 15 01 Error: ICRC, ABRT 1 sectors at LBA = 0x01156c2e = 18181166

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 2e 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA

Error 36 occurred at disk power-on lifetime: 7704 hours (321 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 2d 6c 15 01 Error: ICRC, ABRT 1 sectors at LBA = 0x01156c2d = 18181165

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 2d 6c 15 41 00 00:09:49.390 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA

Error 35 occurred at disk power-on lifetime: 7704 hours (321 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 2d 6c 15 01 Error: ICRC, ABRT 1 sectors at LBA = 0x01156c2d = 18181165

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.389 READ DMA
c8 00 01 2d 6c 15 41 00 00:09:49.388 READ DMA
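
For reference, the register dump in each entry can be decoded by hand. Below is a minimal Python sketch (function names are made up) that assumes 28-bit LBA addressing, which matches the READ DMA (0xc8) commands shown, and uses the ATA error and status register bit names. The two low bits of the error register are command-dependent, so they are left unnamed here.

```python
# Error register bits as reported after a READ DMA; bits 1 and 0 are left out
# because their meaning depends on the command.
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF", 3: "MCR", 2: "ABRT"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC", 3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}


def set_flags(value, names):
    """Names of the bits set in `value`, highest bit first."""
    return [names[bit] for bit in sorted(names, reverse=True) if value & (1 << bit)]


def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA: the low nibble of DH holds bits 27..24."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def decode_error_line(line):
    """Decode a smartctl 'ER ST SC SN CL CH DH' register line (hex values)."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return {
        "error": set_flags(er, ERROR_BITS),
        "status": set_flags(st, STATUS_BITS),
        "sectors": sc,
        "lba": lba28(sn, cl, ch, dh),
    }


print(decode_error_line("84 51 01 2e 6c 15 01"))
```

For the registers of error 39 this gives the ICRC and ABRT flags and LBA 18181166, matching smartctl's own summary on the same line.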

Reply 6 of 8, by pentiumspeed

Rank l33t

When SMART attributes show errors and failures of the hard drive, it is time to pull the plug on that drive. Getting another hard drive is a requirement.
SMART monitors the health of the disk surfaces and all parts of the hard drive. A bad power supply rarely affects this, but for safety's sake, replace that hard drive!

Reliability and data are important.

Cheers,

Great Northern aka Canada.

Reply 7 of 8, by darry

Rank l33t++
pentiumspeed wrote on 2020-04-19, 21:51:

When SMART attributes show errors and failures of the hard drive, it is time to pull the plug on that drive. Getting another hard drive is a requirement.
SMART monitors the health of the disk surfaces and all parts of the hard drive. A bad power supply rarely affects this, but for safety's sake, replace that hard drive!

Reliability and data are important.

Cheers,

I agree that reliability and data are important. Just to be clear, I am not clinging for dear life to that 500GB hard drive and am ready to chuck it. What worries me is that the drive may not be the (sole) culprit.

The extended self-test ran without a hitch, the "bad" blocks are perfectly readable and writable with dd under Linux, and the only "bad" thing in the SMART variables is a reallocated sector count of 5 (which, if memory serves, is the same value as when this drive was pulled from a set-top box PVR and put into service in its current home; it is my only current drive with a non-zero reallocated sector count). I would have been really happy if something had gone wrong during testing, as that would have confirmed without the shadow of a doubt that the drive is going bad.

To be on the safe side, I will put this drive in my unreliable/for-parts bin, put in a new drive and change the SATA cable.

I will also keep an eye out for other issues and replace the PSU if something suspicious crops up that points towards it. It will be fun finding a decent modern replacement that has enough amps on the 5V rail, all Japanese caps (the primary in my Enermax is a Hitachi; I have not checked the secondaries) and a reasonable price (at least I don't need -5V).

smartctl 5.43 2012-06-30 r3573 [i686-w64-mingw32-win8(64)] (sf-5.43-1)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA DT01ABA050V
Serial Number: 84F8U7BAS
LU WWN Device Id: 5 000039 ffcee4838
Firmware Version: MU1OA720
User Capacity: 499,570,991,104 bytes [499 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Apr 19 20:06:36 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 4417) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 74) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 132 132 024 Pre-fail Always - 157 (Average 167)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 700
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 7709
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 669
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 700
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 700
194 Temperature_Celsius 0x0002 166 166 000 Old_age Always - 36 (Min/Max 6/46)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 39
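
To make the table easier to scan, here is a minimal Python sketch (hypothetical helper names) that parses whitespace-separated smartctl attribute rows like the ones above and lists which of the usual "is the media failing?" counters have a non-zero raw value. The sample rows are copied from the table. Note that in the table as posted, 005 sits in the THRESH column for Reallocated_Sector_Ct while its RAW_VALUE is 0.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0007 132 132 024 Pre-fail Always - 157 (Average 167)
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 39
"""

# Counters commonly checked first: media-related ones plus the interface CRC count.
WATCHED = [
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
]


def parse_attributes(text):
    """Parse attribute rows into {name: {id, value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": int(fields[9]),  # first token of RAW_VALUE only
        }
    return attrs


def nonzero_raw(attrs, names=WATCHED):
    """Return the watched attributes whose raw value is not zero."""
    return {n: attrs[n]["raw"] for n in names if n in attrs and attrs[n]["raw"] != 0}


print(nonzero_raw(parse_attributes(SAMPLE)))
```

On these rows the only watched counter that is non-zero is UDMA_CRC_Error_Count at 39, the same number as the "ATA Error Count: 39" in the error log, which is consistent with the errors being interface (cable or controller side) rather than media errors.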

Reply 8 of 8, by darry

Rank l33t++

Restored the backup and replaced the drive (the new one is 7200 rpm, yay!) and the SATA cable, and I am running 3DMark 2001 SE in a 20-iteration loop to test the video card/CPU. So far so good.

I will keep on using it and will post back if anything craps out. Hopefully, it was just a hard drive with a slightly odd intermittent failure mode.

Thanks to all for your valued feedback.