While troubleshooting DIMM errors on a UCS C-Series server, I ran into an issue where the ECC error counters would not reset, even after replacing a DIMM and rebooting multiple times. During the course of troubleshooting, I triggered a number of additional ECC errors, which raised faults on several DIMMs across the motherboard. Each fault displayed “EQUIPMENT_INOPERABLE”, listed the DIMM and slot as inoperable, and gave instructions to replace the DIMM. These faults would not go away, regardless of clearing the SEL in the CIMC, removing power, or fully booting the system multiple times.
While the system was reporting an issue with a single DIMM, it was believable. However, once there were multiple errors on DIMMs that had previously passed diagnostics, something else had to be going on. Sure enough, even after I replaced the DIMMs, the errors continued to show, both as orange LEDs on the motherboard and in the faults in the CIMC.
After a lot of searching without finding much information, I came across a KB article from Cisco with a list of CIMC CLI commands. One of those commands is <code>reset-ecc</code>: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/sw/cli/config/guide/2-0/b_Cisco_UCS_C-Series_CLI_Configuration_Guide_201/b_Cisco_UCS_C-Series_CLI_Configuration_Guide_201_chapter_01.html#concept_CD37122CF0A94BAC9DA28D2EF379E002
Conditions
So, if you match these conditions:
- CIMC is showing faults for DIMMs inoperable (CIMC > Chassis > Faults & Logs > Faults Summary)
- Orange LED indicators appear beside DIMM or DIMMs on the server motherboard (correlates to the faults)
- In Cisco Diagnostics, you see an error that “ECC error count 60500 is critically high” (assuming 60500 is the limit…) under Server Information > Status, with Memory highlighted
- In Cisco Diagnostics, you see a reading of 60500 (or possibly higher) listed in Server Information > Sensors
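If you want to keep track of which DIMMs match the last two conditions, a small script can help. This is only a rough sketch: the readings are typed in by hand from Server Information > Sensors, the DIMM names and values are made up, and 60500 is assumed to be the limit based on the error message above.

```python
# Hypothetical helper: nothing here talks to the CIMC. The readings are
# copied by hand from Server Information > Sensors.
ECC_CRITICAL_THRESHOLD = 60500  # assumed limit, taken from the error message


def dimms_at_critical_ecc(readings, threshold=ECC_CRITICAL_THRESHOLD):
    """Return the names of DIMMs whose ECC error count is at or above threshold."""
    return sorted(name for name, count in readings.items() if count >= threshold)


# Made-up example values for illustration only.
readings = {"DIMM_A1": 60500, "DIMM_A2": 0, "DIMM_B1": 61020}
flagged = dimms_at_critical_ecc(readings)
print(flagged)
```

If the flagged list includes DIMMs that you have already replaced or that previously passed diagnostics, that points at stale counters rather than bad memory.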
Solution [potentially]
You can reset the ECC error counters to clear the condition, continue troubleshooting, and test for recurrence. To do this:
- Go to CIMC > Admin > Communication Services and ensure that SSH is enabled
- Open PuTTY or another SSH tool, enter the IP address or DNS name of the CIMC, and connect.
- In the SSH session, run <code>scope chassis</code>
- then run <code>reset-ecc</code>
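If you have several servers to clear, the steps above could be scripted. The following is a minimal sketch, not something I have seen documented by Cisco: it assumes the standard `ssh` client is installed, that the CIMC login is `admin`, and that the CIMC CLI accepts commands piped in on standard input. The function names are my own, and depending on how authentication is set up, `ssh` may still prompt for the password interactively.

```python
import subprocess

# The two CIMC CLI commands from the steps above, in order.
CIMC_COMMANDS = ["scope chassis", "reset-ecc"]


def build_ssh_invocation(host, user="admin"):
    """Return the ssh argument list and the text to send on stdin.

    The "admin" default user is an assumption; use your own CIMC account.
    """
    argv = ["ssh", f"{user}@{host}"]
    stdin_text = "\n".join(CIMC_COMMANDS) + "\n"
    return argv, stdin_text


def reset_ecc(host, user="admin"):
    """Open an SSH session to the CIMC and send the reset commands.

    Assumes the CIMC CLI reads commands from stdin; this is untested
    and may need an interactive session instead.
    """
    argv, stdin_text = build_ssh_invocation(host, user)
    return subprocess.run(
        argv, input=stdin_text, text=True, capture_output=True, timeout=60
    )


if __name__ == "__main__":
    # Only show what would be run; call reset_ecc(host) to actually connect.
    argv, stdin_text = build_ssh_invocation("cimc.example.com")  # placeholder host
    print(argv)
    print(stdin_text, end="")
```

Building the command list separately from running it makes it easy to check what will be sent before pointing the script at a real CIMC.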
That’s it, fairly simple, but I haven’t seen a doc that says to do this, so use at your own discretion. After resetting, Cisco Diagnostics under Server Information > Sensors now lists the ECC error count (Reading) as 0 on all DIMMs. If you have a real problem, expect this count to climb again the next time you run your memory test suite.
I recommend running a comprehensive memory diagnostic after clearing the ECC errors. If a DIMM is truly faulty, the diagnostic should trigger new ECC errors on it, so a clean run gives you confidence that there are no further problems with your DIMMs.
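To compare the counters before and after the diagnostic, something like the sketch below works. As before, the readings are entered by hand from Server Information > Sensors, and the function name, DIMM names, and values are purely illustrative.

```python
def dimms_with_new_errors(after_reset, after_test):
    """Return {dimm: increase} for DIMMs whose ECC count rose during the test.

    Both arguments are dicts of DIMM name -> ECC error count, copied by
    hand from the Sensors page. DIMMs missing from after_reset count as 0.
    """
    increases = {}
    for name, count in after_test.items():
        baseline = after_reset.get(name, 0)
        if count > baseline:
            increases[name] = count - baseline
    return increases


# Made-up example: every counter is 0 after reset-ecc, one DIMM errors again.
after_reset = {"DIMM_A1": 0, "DIMM_A2": 0, "DIMM_B1": 0}
after_test = {"DIMM_A1": 0, "DIMM_A2": 17, "DIMM_B1": 0}
suspects = dimms_with_new_errors(after_reset, after_test)
print(suspects)
```

Any DIMM that shows up here after a reset is accumulating fresh errors and is worth a closer look; an empty result suggests the earlier faults were stale.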