VOGONS


Reply 340 of 1037, by CoffeeOne

Rank: Oldbie
red-ray wrote on 2020-02-23, 17:46:
CoffeeOne wrote on 2020-02-23, 17:08:

The program terminated (but no windows lock-up) also when I un-selected only [smb-bus] from save local.
So for the save-local I unselected all 3 smb entries (see above)

Which section was being processed when SIV64X.exe crashed out?

It looks to be device [ 1_69 ] that triggered the lockup, and the attached SIV64X 5.47 DL580-01 should skip it. I feel it's a good idea to run siv64x -dbgsmb -save=[smb-bus]=SIV_DBGOUT.txt > SIV_DBGOUT.log | more again in case there is a 2nd rogue device. If that runs OK then please go for the full Save Local.

Looking at the save file, SIV did quite well: it read 32 lots of SPD data and the 4 Xeon PIRs via the system's 5 SMBuses (1 x PCH + 4 x IMC). Once we get [SMB Bus] working it will be interesting to see if there are any SMBus multiplexers.

Hello again,
I created the save file, separately for smbus-setup, smbus-system and smb-bus, and there was no crash.
Strange and frightening.
So there should be 6 files in this zip file.

Attachments

  • SIV64X-smbbus.zip (168.2 KiB, 45 downloads, license: Fair use/fair dealing exception)

Reply 341 of 1037, by red-ray

Rank: Oldbie
CoffeeOne wrote on 2020-02-23, 21:08:

Strange and frightening.

Thank you for all your testing. I suspect there is an SMBus device that causes the issue intermittently. Now that I have SIV_HP_DL580_G7-smb-bus.txt I can see what all the devices are, so I plan to add some more exclusions. On a positive note I spotted:

[ 1_70 ] PCA9544A (NXP) 10 249 0024 00 05 .. 01 02 03 04 05 06 07 .. 01 02 03 04 05 06

This is an SMBus mux, so once things have settled down I will add support for it, which with luck should allow all 64 sets of SPD to be read 😀.

I have also improved the memory timings SIV reports, but need to check all is OK on my Gulftown before I give you an updated SIV.

Reply 342 of 1037, by red-ray

CoffeeOne wrote on 2020-02-23, 21:08:

I created the save file, extra for smbus-setup, smbus-system and smb-bus and there was no crash.

Both [ smbus-setup ] + [ smbus-system ] simply report the current SMBus skip settings and I can't see how they could cause any lockup issues as they don't access the SMBus hardware at all.

I merged the [ smb-bus ] data into the SIV save file and then looked at what devices were present. Having done this I added several more exclusions which I am hoping will make the lockups go away.

I have attached SIV64X 5.47 More-01, which contains these changes. It also contains a number of improvements to the memory timings reporting and should no longer try to report the DIMM temperatures, as we can't figure out how to do word mode reads of the Westmere-EX IMC SMBuses.

I suggest running SIV64X -DBGSDM > SIV_DBGOUT.log | MORE and generating several Save Local files. I hope things won't lock up, but if they do, the last few lines in SIV_DBGOUT.log should tell us which device needs to be added to the exclusions.

How long does SIV take to start, please? I think the initial screen should pop up after about 0.985 seconds and all the DIMM sizes should be filled in after about another 4.125 seconds. Is this what you see?

Last edited by red-ray on 2020-02-26, 23:13. Edited 1 time in total.

Reply 343 of 1037, by CoffeeOne

red-ray wrote on 2020-02-24, 09:50:
Both [ smbus-setup ] + [ smbus-system ] simply report the current SMBus skip settings and I can't see how they could cause any l […]

Hello,
I have now run this version on the DL580 G7 again.
Save Local 3 times, no crash. Save Local needs a lot of time (3 and a half minutes) because of numa-speed.
Then I ran the command:
siv64x -dbgsmb -save=SIV_DBGOUT2.txt > SIV_DBGOUT2.log | more
which also took time.
Starting the program takes approx. 6-8 seconds.

Attachments

  • SIV64X-dl580-save.zip (1.3 MiB, 43 downloads, license: Fair use/fair dealing exception)

Reply 344 of 1037, by red-ray

CoffeeOne wrote on 2020-02-26, 20:19:

3 times Save local, no crash. Save local needs a lot of time (3 and a half minutes) because of numa-speed.
Then I ran the command siv64x -dbgsmb -save=SIV_DBGOUT2.txt > SIV_DBGOUT2.log | more, which also took time.

Starting the program is approx. 6-8 seconds.

Thank you for checking things out. numa-speed takes a while as there are eight NUMA nodes and 80 CPUs. At least you can see what's happening from the % CPU usage plots! It's similar on my DL580 G5, which I run with 3 CPU groups (so 3 NUMA nodes); it has 24 CPUs and takes 67 seconds.

For the SMBus tests I would be inclined to have a .BAT file that loops doing siv64x -dbgsmb -save=[smb-bus][pmb-bus]SIV_DBGOUT.txt > SIV_DBGOUT.log | more, as we only need SIV_DBGOUT.log if it fails.

6-8 seconds is longer than I expected, as SIV was ready to create the initial screen after 0.985 seconds. It has 983 controls, so I suspect it's the ES1000 (RN50) GPU slowing things down.

I have been looking into driving the SMBus mux, but we have an issue: we can't find out how to do SMBus writes! SMBus reading is the same as on the 7300/5400/5000 FBD SMBuses, but writing is different and we don't know how to make it work. Further, the 7300/5400/5000 FBD SMBuses can't do WORD reads but Westmere-EX can, and again we don't have all the information we need.

Last edited by red-ray on 2020-02-26, 23:10. Edited 1 time in total.

Reply 345 of 1037, by CoffeeOne

red-ray wrote on 2020-02-26, 20:53:

Thank you for checking things out. numa-speed takes a while as there are eight NUMA nodes and 80 CPUs.

It's only 4 numa nodes, right?

The other machine HP DL585 G7 has 8 numa nodes, each 6386SE CPU consists of 2 numa nodes.
But a Xeon E7-4870 is one node only AFAIK.

Reply 346 of 1037, by red-ray

CoffeeOne wrote on 2020-02-26, 21:09:

It's only 4 numa nodes, right?

Sorry, it's just 4; I guess I got them mixed up. I suspect you could get it to have 8 by doing bcdedit /set {current} groupsize 10, or even more by bcdedit /set {current} maxgroup on, and rebooting. An HP DL980 G7 will have 8 NUMA nodes and at least 2 CPU Groups.

Reply 347 of 1037, by red-ray

CoffeeOne wrote on 2020-02-26, 20:19:

I now ran this version on the DL580 G7 again.

I was looking at the SPD data and it looks like SIV had issues reading the IMC SMBuses. In [ SPD ] there appear to have been read errors, as for some DIMMs the CRC is preceded by a red rather than a green blob, and I am wondering why.
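
As background on what an SPD CRC check involves: the JEDEC DDR3 SPD layout stores a CRC-16 (polynomial 0x1021, initial value 0) in bytes 126-127, computed over bytes 0-116 or 0-125 depending on bit 7 of byte 0. The following is a minimal Python sketch of that check, assuming a DDR3 SPD dump of at least 128 bytes. The function names are made up for illustration and this is not SIV's code.

```python
def spd_crc16(data: bytes) -> int:
    """CRC-16 with polynomial 0x1021 and initial value 0 (MSB first)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def ddr3_spd_crc_ok(spd: bytes) -> bool:
    """Return True if the stored CRC matches the computed one.

    Assumes the DDR3 layout: bit 7 of byte 0 selects the coverage
    (set: bytes 0-116, clear: bytes 0-125); the CRC is stored with
    its low byte at offset 126 and its high byte at offset 127.
    """
    if len(spd) < 128:
        raise ValueError("need at least the first 128 SPD bytes")
    covered = 117 if spd[0] & 0x80 else 126
    stored = spd[126] | (spd[127] << 8)
    return spd_crc16(spd[:covered]) == stored
```

A single flipped bit anywhere in the covered range changes the computed value, so a read that was disturbed part-way through would be expected to fail this check even if the DIMM's stored data is fine.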

All of CPUZ + HWiNFO + SIV + … use the Global\Access_SMBUS.HTP.Method lock to synchronise SMBus access so that only one program accesses it at any one time (see Menu->Help->Lock Handle), but HP don't, and when I checked C:\Program Files\Hewlett-Packard\iLO 3\service\ProLiantMonitor.exe was active. What does this report? Does it report things such as DIMM SPD data and/or temperatures? If so, it could be the root cause of the CRC errors, as the DIMMs with the bad CRC vary from run to run.

Do you feel it would be worth stopping the ProLiantMonitor Service (sc stop ProLiantMonitor) and then starting SIV a couple of times to see if the CRC errors "go away"?

I think I have acquired the information I need to control the PCA9544A and thereby report SPD for all 64 DIMMs. Do you feel things are now stable enough for me to add initial support for you to test?

Update: I have been looking in the SIV save files and noticed the values shown below. The highlighted 6 is the IMC SMBus command, which SIV never sets to 6, so I am pretty sure some HP software must be accessing the SMBus. I also think 6 must be the read word command. Given this I have created the attached experimental SIVX64.sys. Please replace the existing one with it, exit/restart SIV64.exe and post what Menu->Hardware->SPD Thermal reports. I am hoping the word at offset 5 is C2xx; if so, the next SIV64x.exe should correctly report your DIMM temperatures 😀
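
For context on what a DIMM temperature word encodes: thermal sensors that follow JEDEC JC-42.4 (TSE2002-type parts) expose the temperature in register 5 as a 13-bit two's-complement value in 1/16 °C steps, with the top three bits used as alarm flags. The sketch below decodes such a value. It assumes these DIMMs carry a JC-42.4 style sensor, and the order in which the Westmere-EX IMC returns the two bytes is an open question here, so a byte-swap helper is included. Both function names are hypothetical and nothing here reflects how SIV does it.

```python
def jc42_temperature_c(reg: int) -> float:
    """Decode a JC-42.4 style temperature register value to degrees Celsius.

    Bits 15-13 are alarm flags, bit 12 is the sign and bits 11-0 hold
    the magnitude in 1/16 degree steps.
    """
    value = reg & 0x1FFF      # drop the three alarm flag bits
    if value & 0x1000:        # sign bit set: negative temperature
        value -= 0x2000
    return value / 16.0


def swap_word(word: int) -> int:
    """Swap the two bytes of a 16-bit word (for when a bus read returns them reversed)."""
    return ((word & 0xFF) << 8) | ((word >> 8) & 0xFF)


# Example: 0x01C2 is 450 sixteenths of a degree.
print(jc42_temperature_c(0x01C2))  # 28.125
```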

file.php?id=77799

Attachments

  • Westmere-EX-Word.png (69.48 KiB, 830 views, license: Public domain): Westmere-EX IMC SMBus control and status registers
Last edited by red-ray on 2020-02-28, 09:53. Edited 1 time in total.

Reply 348 of 1037, by CoffeeOne

red-ray wrote on 2020-02-27, 12:30:
I was looking at the SPD data and it looks like SIV had issues reading the IMC SMBuses as looking at [ SPD ] there look to have […]

Hi
Sorry, I am a bit lost. The machine is up and running. Yes, I can disable the "Proliant Monitor Service". It is a part of ILO3, which I don't use.
It's now stopped.

The description is:
"Monitors the management controller and shuts down the system in the event of overheating or loss of cooling."
So even when no cable is connected to the ILO interface, the machine could still be shut down via that service.

What shall I do now?
Exchanging the sys file did not provide temperatures for the DIMMs, and still only half of the DIMMs (32) are read out.

How shall I post: Menu->Hardware->SPD Thermal?
it does not fit on the screen:

spd-termal-sensors.jpg (171.91 KiB, 829 views, license: Fair use/fair dealing exception)

Please advise

Reply 349 of 1037, by CoffeeOne

I did Save -> Local twice:
1st with the ProliantMonitor service stopped: those are the 3 files with -4
2nd with the ProliantMonitor service running: those are the 3 files with -5

Attachments

  • SIV64X-modsys-save.zip (701.12 KiB, 40 downloads, license: Fair use/fair dealing exception)

Reply 350 of 1037, by red-ray

CoffeeOne wrote on 2020-02-27, 21:02:

Exchanging the sys file did not provide temperatures for the DIMMs, still only half of the DIMMs (32) are read out.
it does not fit on the screen: Please advise

Yes. As I said, it is the next SIV that will report temperatures, so I was not expecting them yet. The way to get the whole of the panel on the screen is to get a bigger screen 😀, but I just needed to see some of the entries anyway.

I have not as yet added the muxing code, so only 32 sets of SPD are expected. I feel it's best to make these changes one at a time so that if there is an issue we know which change caused it.

Please try the attached SIV 5.47 Temp-01, which should report the temperatures for 32 of the DIMMs. May I have a new save file?

I strongly advise that you do not run any of the HP software at the same time as SIV. If you do, problems such as those in https://rog.asus.com/forum/showthread.php?927 … -problems/page2 may happen. Running SIV + HWiNFO + CPUZ at the same time should be fine, but AFAIK CPUZ can't report the SPD anyway. Out of interest, can it?

Note: With the attached test SIV the driver will not run on systems with secure boot enabled, as I did not bother to do the Microsoft driver countersigning.

Last edited by red-ray on 2020-03-04, 13:25. Edited 1 time in total.

Reply 351 of 1037, by CoffeeOne

red-ray wrote on 2020-02-28, 10:11:
Yes, as I said the next SIV will report temperatures I was not expecting them. The way to get the whole of the panel on the scre […]

Hello,
Everything done now with ProLiant service stopped.
Yes, temperatures are shown now for 32 DIMMs

temp-01.jpg (486.78 KiB, 803 views, license: Fair use/fair dealing exception)

Yes, cpuinfo cannot show any SPD information.

Attachments

  • SIV64X-save-temp1.zip (347.96 KiB, 38 downloads, license: Fair use/fair dealing exception)

Reply 352 of 1037, by red-ray

CoffeeOne wrote on 2020-02-28, 20:21:

Everything done now with ProLiant service stopped.

Good, thank you for the new testing and sorry for the delay. All looks good 😀.

Please do Menu->Tools->Maximum Lines->Set 60 Lines (or more), then exit and restart SIV. Once you do this the initial window should look better. Does it?

I have been busy adding SMBus mux support and, with luck, if you replace SIVX64.sys with the attached one, all 64 DIMM temperatures should be displayed and all 64 lots of SPD should be there. May I have a new save file?

If it does not work I may need you to get a driver trace. Have you ever used https://docs.microsoft.com/en-us/sysinternals … loads/debugview?

Update: I noticed two DIMMs did not report their temperature and tracked this down to the SPD being corrupted. Byte 0 should be 0x92 but is actually 0x04.
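
To make the difference between 0x92 and 0x04 concrete: in the JEDEC DDR3 SPD layout, byte 0 packs three fields (CRC coverage in bit 7, total SPD size in bits 6-4, bytes used in bits 3-0). The sketch below decodes them, assuming DDR3 modules. The function and table names are illustrative only, and codes not listed in the tables are reported as None.

```python
# Field encodings for DDR3 SPD byte 0; codes not listed are undefined or reserved.
BYTES_USED = {1: 128, 2: 176, 3: 256}
BYTES_TOTAL = {1: 256}


def decode_ddr3_spd_byte0(value: int) -> dict:
    """Split DDR3 SPD byte 0 into its three fields."""
    return {
        "crc_covers": "0-116" if value & 0x80 else "0-125",
        "bytes_total": BYTES_TOTAL.get((value >> 4) & 0x7),
        "bytes_used": BYTES_USED.get(value & 0xF),
    }


print(decode_ddr3_spd_byte0(0x92))  # {'crc_covers': '0-116', 'bytes_total': 256, 'bytes_used': 176}
print(decode_ddr3_spd_byte0(0x04))  # {'crc_covers': '0-125', 'bytes_total': None, 'bytes_used': None}
```

So 0x92 describes a normal 256-byte SPD with 176 bytes in use, while 0x04 leaves both size fields at values this table does not define.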

Attachments

  • Bad SPD.png (103.05 KiB, 793 views, license: Public domain): Byte 0 should be 0x92 but is actually 0x04
Last edited by red-ray on 2020-02-29, 09:41. Edited 1 time in total.

Reply 353 of 1037, by CoffeeOne

red-ray wrote on 2020-02-28, 22:41:
Good, thank you for the new testing and sorry for the delay. All looks good :). […]

Yes, looks better with 61 lines.
Still only 32 DIMMs.
No, haven't used debugview yet.
I can't believe that 2 DIMMs have a bad SPD, but who knows?

Attachments

  • SIV64X-save-sys2.zip (349.06 KiB, 56 downloads, license: Fair use/fair dealing exception)

Reply 354 of 1037, by red-ray

CoffeeOne wrote on 2020-02-29, 00:00:

Still only 32 DIMMs.
I can't believe that 2 DIMMs have a bad SPD, but who knows?

Thank you. I suspected it would look better with more lines. SIV was using an old driver: Menu->Help->SIV Lookup shows V5.47 Built Feb 28 2020 at 09:42:20 WDK 6001.18000 V14.00, and the build time should be 22:11. Did you exit and restart SIV?

Look at Menu->System->SMB Bus and you should see that byte 0 is 92 for most DIMMs, but 04 for two.

Please try the attached SIV64X 5.47 Frig-01, which has the updated SIV driver and a SIV64X that frigs things so that the temperature should be reported for the DIMMs with corrupt SPD.

Attachments

  • Bad SPD.png (50.71 KiB, 789 views, license: Public domain): 0x04 @ 0 for [ 2_53 ]
Last edited by red-ray on 2020-02-29, 19:44. Edited 1 time in total.

Reply 355 of 1037, by CoffeeOne

red-ray wrote on 2020-02-29, 00:15:
Thank you and I suspected with more lines it would. SIV was using an old driver as looking at Menu->Help->SIV Lookup it's V5.47 […]

Hmm, seems that went wrong: now there is NO SPD info for any of the 64 DIMMs.

notemp.jpg (502.45 KiB, 772 views, license: Fair use/fair dealing exception)

I will attach the save -> local.

Attachments

  • SIV64X-notemp.zip (339.02 KiB, 37 downloads, license: Fair use/fair dealing exception)

Reply 356 of 1037, by red-ray

CoffeeOne wrote on 2020-02-29, 18:21:

Hmm, seems that went wrong, now there is NO SPD info for any of the 64 DIMMs.

Thank you. None at all is actually a good sign, as it means SIV is managing to update the SMBus mux. I think I have found the issue: I suspect the DL580 G7 uses PCA9544 rather than PCA9545 chips. Can you see either of them in the system? They should be 20-pin DIL packages.

I guess the easy option is to try a driver that uses the PCA9544 protocol. Please exit SIV64X.exe, replace SIVX64.sys with the attached one, and with luck the temperatures + SPD will be back.
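
To illustrate why the two protocols are not interchangeable: going by the NXP datasheets, the PCA9544A control register uses bit 2 as an enable and bits 1-0 as a channel number, whereas the PCA9545A uses bits 3-0 as a per-channel enable mask. A minimal Python sketch of the two encodings follows. The function names are hypothetical and this is not the SIV driver's code.

```python
def pca9544a_control(channel):
    """Control byte for a PCA9544A: bit 2 enables, bits 1-0 pick one of 4 channels.

    Pass None to deselect all channels.
    """
    if channel is None:
        return 0x00
    if not 0 <= channel <= 3:
        raise ValueError("channel must be 0..3")
    return 0x04 | channel


def pca9545a_control(channels):
    """Control byte for a PCA9545A: bits 3-0 are a per-channel enable mask."""
    mask = 0
    for ch in channels:
        if not 0 <= ch <= 3:
            raise ValueError("channel must be 0..3")
        mask |= 1 << ch
    return mask


for ch in range(4):
    print(ch, hex(pca9544a_control(ch)), hex(pca9545a_control([ch])))
```

Under these encodings, a PCA9545A-style write of 0x01 or 0x02 leaves the PCA9544A enable bit clear, so no downstream channel is selected. That would be consistent with all the SPD disappearing while the mux is nonetheless being updated.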

Did you notice that [SMB Bus] now reports 17 SMBuses in total? That is the PCH SMBus plus 4 for each of the 4 CPUs, as it's a 4-way mux. Once I see a save file I may reduce this to 2 for each CPU if the DIMMs are only on 2 of the 4 channels.

Last edited by red-ray on 2020-02-29, 20:40. Edited 1 time in total.

Reply 357 of 1037, by CoffeeOne

red-ray wrote on 2020-02-29, 19:41:
Thank you and none at all is actually a good sign as it means SIV is managing to update the SMBus mux. I think I have found the […]

Sorry, can you send me the full SIV to test? I deleted all the copies so as not to mix anything up.

Reply 359 of 1037, by CoffeeOne

red-ray wrote on 2020-02-29, 20:44:
CoffeeOne wrote on 2020-02-29, 20:20:

Sorry, can you send me the full SIV, I shall test?

Sure

Thx.
I don't see a difference.
save local attached

Attachments

  • SIV64X-test.zip (339.23 KiB, 38 downloads, license: Fair use/fair dealing exception)