Establishing a Baseline for Intermittent Lockup RCA

Protocol for establishing a baseline for Root Cause Analysis (RCA) of intermittent lockups on systems integrated by Silicon Mechanics:

  1. Compare the system's current IPMI firmware revision against the latest revision available from Supermicro for the given motherboard model. (See: NOTE #1)
  2. If a newer revision is available, flash the IPMI firmware to the latest revision using Supermicro Update Manager. From the IPMI webGUI, go to: Maintenance > Firmware Update > Enter Update Mode. (See: NOTE #2)
  3. Compare the system's current BIOS firmware revision against the latest revision available from Supermicro for the given motherboard model. (See: NOTE #1)
  4. If a newer revision is available, flash the BIOS firmware to the latest revision using Supermicro Update Manager. From the IPMI webGUI, go to: Maintenance > BIOS Update > Browse to and Upload BIOS. (See: NOTE #2)
  5. Check the IPMI System Event Log (SEL) by clicking on: Server Health > Event Log. Look for recent hardware events that may be relevant to the lockups and freezing behavior, as well as any 'Unknown' errors, and forward these to Silicon Mechanics Support.
  6. Check for any irregular sensor readings from: Server Health > Sensor Readings; forward these to Silicon Mechanics Support.
  7. From the motherboard BIOS settings menu, ensure that ASPM (Active State Power Management, a PCIe power-saving feature) is disabled. Go to: Advanced > (PCIe/PnP) > ASPM Support; set to: Disabled.
  8. For Unix and Linux-based systems, disable ASPM at boot time using the GRUB menu. Our Support Technicians can provide guidance on this procedure upon request.
  9. Once the IPMI, BIOS, and ASPM optimizations described in Steps #1-8 are in place, please provide as much detail as possible on the nature and timing of the lockups you have observed with this system and with other systems of the same motherboard type. (See: NOTE #3)
  10. Any lockups that occur after these 'baseline' optimizations (current IPMI/BIOS firmware and ASPM disabled) are in place can be analyzed using "SMC CPU Crash Dump Utility", a remote diagnostic tool provided by Supermicro. (See: NOTE #3) -- UPDATE 07/21/17: For X10-based systems, IPMI firmware Rev. 3.58 has the Crash Dump utility built in. Please contact our Support department for instructions on the use of this built-in diagnostic tool.
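
Step 8 does not name the exact mechanism for disabling ASPM at boot. A common approach on Linux is to add the kernel parameter pcie_aspm=off to the kernel command line in GRUB. As a minimal sketch under that assumption (the function names are hypothetical, not part of any Supermicro or Silicon Mechanics tool), the following Python helper checks whether a kernel command line, such as the contents of /proc/cmdline, carries that parameter:

```python
from pathlib import Path

# Assumption: ASPM was disabled with the "pcie_aspm=off" kernel parameter.
ASPM_OFF_PARAM = "pcie_aspm=off"


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if the kernel command line contains pcie_aspm=off."""
    # Split on whitespace so only an exact parameter match counts.
    return ASPM_OFF_PARAM in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    return Path(path).read_text().strip()
```

On a running Linux system, aspm_disabled_on_cmdline(read_kernel_cmdline()) would report whether the parameter took effect after reboot. It does not verify the BIOS setting from Step 7.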

NOTE #1: How to ascertain current system IPMI and BIOS firmware revisions

The IPMI and BIOS firmware revisions are shown on the IPMI webGUI landing page ("System"), above the 'Remote Console Preview', for example:

Firmware Revision : 03.45
Firmware Build Time : 09/19/2016
BIOS Version : 2.0a
BIOS Build Time : 08/25/2016

(This example is for Supermicro Board/Model X10DRL-i ONLY.)
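
When many systems need to be checked, the summary text above can be compared against a target revision in a script. The following is a rough Python sketch, assuming the text has been copied from the webGUI in the "Field : value" layout shown. The function names are hypothetical, and the numeric comparison applies only to dotted IPMI revisions such as 03.45; BIOS versions such as 2.0a carry a letter suffix and are not compared here.

```python
SUMMARY = """\
Firmware Revision : 03.45
Firmware Build Time : 09/19/2016
BIOS Version : 2.0a
BIOS Build Time : 08/25/2016
"""


def parse_summary(text: str) -> dict:
    """Turn 'Field : value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def revision_tuple(revision: str) -> tuple:
    """Convert a dotted numeric revision such as '03.45' to (3, 45)."""
    return tuple(int(part) for part in revision.split("."))


def ipmi_update_available(current: str, latest: str) -> bool:
    """Return True if 'latest' is a higher revision than 'current'."""
    return revision_tuple(current) < revision_tuple(latest)
```

For instance, ipmi_update_available(parse_summary(SUMMARY)["Firmware Revision"], "3.58") compares the example revision against Rev. 3.58, the revision mentioned in Step #10. Whether 3.58 is the latest revision for a given board must still be confirmed on the Supermicro download page.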


NOTE #2: Locating and flashing firmware to the latest revisions for your system

  1. Have your motherboard model/product name on hand.
  2. Navigate to: https://www.supermicro.com/ResourceApps/BIOS_IPMI_intel.html.
  3. Search for or navigate to the specified motherboard model.
  4. Download the relevant ZIP file(s) to prepare to update firmware.
  5. Unzip and review the read-me files and instructions contained in these archives prior to flashing firmware.
  6. When flashing IPMI firmware, upload the IPMI '.bin' firmware file; opt for "DO NOT preserve BMC configuration".
  7. After flashing IPMI firmware, the system will be rebooted. (If using a Static IP addressing mode for the BMC, you may need to re-configure or re-enter the Static IP address post-flashing.)
  8. When flashing BIOS firmware, upload the UEFI BIOS image file, whose file extension is a three-digit number (<.###>).
  9. After flashing BIOS firmware, the system will be rebooted again.
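
To illustrate Step 8 of this note, the BIOS image can be picked out of an unzipped archive by its three-digit extension. This is only a sketch: the function name and the file names used with it are hypothetical, and the archive's read-me files remain the authority on which file to upload.

```python
import re

# Matches names whose extension is exactly three digits, e.g. ".123".
BIOS_IMAGE_PATTERN = re.compile(r"\.\d{3}$")


def find_bios_images(filenames):
    """Return the names that end in a dot followed by exactly three digits."""
    return [name for name in filenames if BIOS_IMAGE_PATTERN.search(name)]
```

Given a listing such as ["readme.txt", "BOARD_A.123", "flash.nsh"] (made-up names), only "BOARD_A.123" is returned. If more than one candidate appears, check the read-me before flashing.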

NOTE #3: Gathering relevant data to facilitate RCA: Historical, Workload-related, Crash Dump Files

  1. We will work with Supermicro and our other Vendors and Industry Partners to analyze crash dump data, to the extent that this can be gathered.
  2. The SMC CPU Crash Dump Utility is run remotely, and only against systems that are in a locked-up state.
  3. Silicon Mechanics Support Technicians will furnish documentation and instructions on the use of the Crash Dump Utility upon request.
  4. In addition to crash dump data, details about the nature of the jobs run and the workload placed on these systems will be especially helpful in our analysis.
  5. To the extent possible, please provide a full listing of SM serial numbers for machines showing the lockup behavior, and note the number and timing of lockups for each.
  6. We will work with you to establish a list of 'repeat offender' systems, should the lockup behavior on specific systems recur following the establishment of the baselines described in the Protocol, Steps #1-10 (top).
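
The bookkeeping described in items 5 and 6 can be kept in a simple script. The sketch below assumes lockups are recorded as (serial number, timestamp) pairs and counts only those occurring after the baseline was established. The serial number format, the function name, and the threshold of two lockups are illustrative assumptions, not Silicon Mechanics criteria.

```python
from collections import Counter
from datetime import datetime


def repeat_offenders(lockups, baseline_time, min_lockups=2):
    """Count post-baseline lockups per serial number.

    lockups: iterable of (serial, datetime) pairs.
    Returns {serial: count} for serials with at least min_lockups
    lockups at or after baseline_time.
    """
    counts = Counter(
        serial for serial, when in lockups if when >= baseline_time
    )
    return {serial: n for serial, n in counts.items() if n >= min_lockups}


# Hypothetical records; real serial numbers and times come from your logs.
example_lockups = [
    ("SM-0001", datetime(2017, 6, 20, 3, 15)),  # before baseline, ignored
    ("SM-0001", datetime(2017, 7, 5, 22, 40)),
    ("SM-0001", datetime(2017, 7, 9, 1, 5)),
    ("SM-0002", datetime(2017, 7, 6, 14, 30)),
]
```

Calling repeat_offenders(example_lockups, datetime(2017, 7, 1)) keeps only the serial with repeated lockups after the baseline date. The per-lockup timestamps are worth forwarding alongside the counts.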