Protocol for establishing a baseline for Root Cause Analysis (RCA) of intermittent lockups on systems integrated by Silicon Mechanics:
1. Ascertain the current system IPMI firmware revision and compare it with the latest revision available from Supermicro for the given motherboard model. (See: NOTE #1)
2. If applicable, flash the IPMI firmware to the latest revision using Supermicro Update Manager: From the IPMI webGUI, go to: Maintenance > Firmware Update > Enter Update Mode. (See: NOTE #2)
3. Ascertain the current system BIOS firmware revision and compare it with the latest revision available from Supermicro for the given motherboard model. (See: NOTE #1)
4. If applicable, flash the BIOS firmware to the latest revision using Supermicro Update Manager: From the IPMI webGUI, go to: Maintenance > BIOS Update > Browse to and Upload BIOS. (See: NOTE #2)
5. Check the IPMI System Event Log (SEL) by clicking on: Server Health > Event Log. Look for recent hardware events that may be relevant to the lockups and freezing behavior, as well as any 'Unknown' errors; forward these to Silicon Mechanics Support.
6. Check for any irregular sensor readings under: Server Health > Sensor Readings; forward these to Silicon Mechanics Support.
7. From the motherboard BIOS settings menu, ensure that ASPM (Active State Power Management, a PCIe power-saving feature) is disabled. Go to: Advanced > (PCIe/PnP) > ASPM Support; set to: Disabled.
8. For Unix and Linux-based systems, disable ASPM at boot time using the GRUB menu (on Linux, typically by adding pcie_aspm=off to the kernel command line). Our Support Technicians can provide guidance on this procedure upon request.
9. Once the IPMI, BIOS, and ASPM optimizations described in the preceding steps (#1-8) are in place, please provide details, to the extent possible, on the nature and timing of the lockups you have observed with this and other systems of the same motherboard type. (See: NOTE #3)
10. Any lockups occurring after these 'baseline' optimizations (current IPMI/BIOS firmware and ASPM disabled) are in place can be analyzed using a remote diagnostic tool provided by Supermicro called "SMC CPU Crash Dump Utility". (See: NOTE #3) -- UPDATE 07/21/17: For X10-based systems, IPMI firmware Rev. 3.58 has the Crash Dump utility built-in. Please contact our Support department for instructions on the use of this new built-in diagnostic tool.
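After making the ASPM changes described above on a Linux system, it can be useful to confirm that they took effect. The following is a minimal sketch, not an official Silicon Mechanics or Supermicro tool. It assumes a Linux host that exposes /proc/cmdline and /sys/module/pcie_aspm/parameters/policy; the function names are illustrative. On many distributions the pcie_aspm=off parameter is made persistent via GRUB_CMDLINE_LINUX in /etc/default/grub, but the exact procedure varies, so please confirm with our Support Technicians.

```python
import re
from pathlib import Path

# Standard Linux locations; assumed to exist on the target system.
CMDLINE_PATH = Path("/proc/cmdline")
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def aspm_disabled_on_cmdline(cmdline):
    """Return True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed (active) ASPM policy name, or None if absent.

    The policy file lists all policies and marks the active one in
    brackets, e.g. "default performance [powersave] powersupersave".
    """
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report_aspm_state():
    """Print a short summary of the ASPM-related kernel settings."""
    cmdline = read_text_or_none(CMDLINE_PATH)
    policy = read_text_or_none(ASPM_POLICY_PATH)
    if cmdline is None:
        print("Could not read the kernel command line (not a Linux host?)")
    else:
        print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))
    if policy is None:
        print("ASPM policy file not available")
    else:
        print("Active ASPM policy:", active_aspm_policy(policy))


if __name__ == "__main__":
    report_aspm_state()
```

Including this output alongside the lockup details you send to Support helps confirm that the system was running with the baseline settings when a lockup occurred.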
NOTE #1: How to ascertain current system IPMI and BIOS firmware revisions
The IPMI and BIOS firmware revisions can be viewed from the IPMI webGUI landing page: "System", above the 'Remote Console Preview', as in:
Firmware Revision : 03.45
Firmware Build Time : 09/19/2016
BIOS Version : 2.0a
BIOS Build Time : 08/25/2016
(This example is for Supermicro Board/Model X10DRL-i ONLY.)
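If you are checking many systems, the summary shown above can be copied from each webGUI and compared in bulk. The sketch below parses that text block and compares the IPMI revision against a reference value. The helper names are hypothetical, and the "latest" revision must be taken from Supermicro's download page for your motherboard model; the value used in the usage example is illustrative only.

```python
SAMPLE_SUMMARY = """\
Firmware Revision : 03.45
Firmware Build Time : 09/19/2016
BIOS Version : 2.0a
BIOS Build Time : 08/25/2016
"""


def parse_summary(text):
    """Turn 'Label : value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[label.strip()] = value.strip()
    return fields


def ipmi_revision_tuple(revision):
    """Convert an IPMI revision such as '03.45' into (3, 45)."""
    return tuple(int(part) for part in revision.split("."))


def ipmi_update_needed(current, latest):
    """Return True if the current IPMI revision is older than the latest."""
    return ipmi_revision_tuple(current) < ipmi_revision_tuple(latest)


if __name__ == "__main__":
    summary = parse_summary(SAMPLE_SUMMARY)
    current = summary["Firmware Revision"]
    # "3.58" is an illustrative reference value, not necessarily the
    # latest revision for any particular board.
    print("IPMI update needed:", ipmi_update_needed(current, "3.58"))
    print("BIOS version reported:", summary["BIOS Version"])
```

BIOS versions such as 2.0a mix digits and letters, so this sketch only reports them; compare them by eye against the version listed on the download page.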
NOTE #2: Locating and flashing firmware to the latest revisions for your system
- With your motherboard model/product name in hand...
- Navigate to: https://www.supermicro.com/ResourceApps/BIOS_IPMI_intel.html.
- Search for or navigate to the specified motherboard model.
- Download the relevant ZIP file(s) to prepare to update firmware.
- Unzip and review the read-me files and instructions contained in these archives prior to flashing firmware.
- When flashing IPMI firmware, upload the IPMI '.bin' firmware file; opt for "DO NOT preserve BMC configuration".
- After flashing IPMI firmware, the system will be rebooted. (If using a Static IP addressing mode for the BMC, you may need to re-configure or re-enter the Static IP address post-flashing.)
- When flashing BIOS firmware, upload the UEFI BIOS image file ending with <.###> (a three-digit number).
- After flashing BIOS firmware, the system will be rebooted again.
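To help pick the right files out of the unzipped archives, the following sketch lists candidate IPMI images (files ending in .bin) and candidate BIOS images (files whose extension is a three-digit number). The function names and the directory name in the usage example are assumptions for illustration; always confirm the correct file against the read-me in the archive before flashing.

```python
import re
from pathlib import Path

# A BIOS image name ends with a dot followed by exactly three digits.
BIOS_IMAGE_PATTERN = re.compile(r"\.\d{3}$")


def find_bios_images(directory):
    """Return files under 'directory' whose extension is three digits."""
    return sorted(
        path for path in Path(directory).rglob("*")
        if path.is_file() and BIOS_IMAGE_PATTERN.search(path.name)
    )


def find_ipmi_images(directory):
    """Return files under 'directory' with a .bin extension."""
    return sorted(
        path for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() == ".bin"
    )


if __name__ == "__main__":
    # "unzipped_firmware" is a hypothetical directory name.
    for image in find_bios_images("unzipped_firmware"):
        print("BIOS image candidate:", image)
    for image in find_ipmi_images("unzipped_firmware"):
        print("IPMI image candidate:", image)
```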
NOTE #3: Gathering relevant data to facilitate RCA: Historical, Workload-related, Crash Dump Files
- We will work with Supermicro and our other Vendors and Industry Partners to analyze crash dump data, to the extent that this can be gathered.
- The SMC CPU Crash Dump Utility can only be executed remotely on systems that are in a locked-up state.
- Silicon Mechanics Support Technicians will furnish documentation and instructions on the use of the Crash Dump Utility upon request.
- In addition to crash dump data, details on the nature of the jobs run and the workload placed upon these systems will be especially helpful in our analysis.
- To the extent possible, please provide a full listing of SM serial numbers for the machines showing the lockup behavior, and note the number and timing of lockups for each.
- We will work with you to establish a list of 'repeat offender' systems, should the lockup behavior on specific systems recur following the establishment of the baselines described in the Protocol, Steps #1-10 (top).
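One simple way to organize the lockup history requested above is a list of (serial number, timestamp) records. The sketch below groups such records by serial number and flags systems that locked up repeatedly after the baseline was established. The serial numbers, dates, timestamp format, and the threshold of two lockups are all assumptions for illustration, not a Silicon Mechanics definition of a 'repeat offender'.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # assumed format for lockup records


def group_lockups(records):
    """Group (serial, timestamp string) records into sorted lists per serial."""
    by_serial = defaultdict(list)
    for serial, stamp in records:
        by_serial[serial].append(datetime.strptime(stamp, TIMESTAMP_FORMAT))
    return {serial: sorted(times) for serial, times in by_serial.items()}


def repeat_offenders(records, baseline_date, min_lockups=2):
    """Return serials with at least 'min_lockups' lockups on or after the baseline date."""
    baseline = datetime.strptime(baseline_date, "%Y-%m-%d")
    offenders = []
    for serial, times in group_lockups(records).items():
        after_baseline = [t for t in times if t >= baseline]
        if len(after_baseline) >= min_lockups:
            offenders.append(serial)
    return sorted(offenders)


if __name__ == "__main__":
    # Hypothetical serial numbers and lockup times.
    records = [
        ("SM0001", "2017-07-01 03:15"),
        ("SM0001", "2017-07-25 11:40"),
        ("SM0001", "2017-08-02 22:05"),
        ("SM0002", "2017-07-30 09:00"),
    ]
    for serial, times in group_lockups(records).items():
        print(serial, "lockups:", len(times))
    print("Repeat offenders:", repeat_offenders(records, "2017-07-21"))
```

A spreadsheet serves the same purpose; what matters is that each lockup is tied to a serial number and a time, and that lockups before and after the baseline are distinguishable.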