
x3850 X5 Quick Start Guide

John R. Encizo System X ATS jrencizo@us.ibm.com

February 15, 2011

Revision History
First Draft (11/11/2010)
Second Draft (12/01/2010): Stylistic updates
Third Draft (12/20/2010): Editorial updates
Fourth Draft (12/22/2010): Troubleshooting workflows added
Fifth Draft (1/7/2011): Windows Boot Issues added
Sixth Draft (2/15/2011): Editorial updates

Notices: This paper is intended to provide information regarding IBM System X 3850X5. It discusses findings based on configurations that were created and tested under laboratory conditions. These findings may not be realized in all customer environments, and implementation in such environments may require additional steps, configurations, and performance analysis. The information herein is provided AS IS with no warranties, express or implied. This information does not constitute a specification or form part of the warranty for any IBM or non-IBM products. Information in this document was developed in conjunction with the use of the equipment specified, and is limited in application to those specific hardware and software products and levels. The information contained in this document has not been submitted to any formal IBM test and is distributed as is. The use of this information or the implementation of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. IBM may not officially support techniques mentioned in this document. For questions regarding officially supported techniques, please refer to the product documentation, announcement letters, or contact the IBM Support Line at 1-800-IBM-SERV. This document makes references to 3rd party applications or utilities. It is the customer's responsibility to obtain licenses for these utilities prior to their usage. Copyright International Business Machines Corporation 2010. All rights reserved. US Government Users Restricted Rights: Use, duplication, or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Copyright IBM Corporation,2010 Page 2 of 37


Table of Contents
Contributions
Abstract
Definitions:
Initial System Configuration & Component Installation
   Table 1: DIMM Installation Order
Firmware Update Procedures
System Cabling
Basic Troubleshooting
   Sample Troubleshooting Workflows
OS Installation Hints and Tips
   Table 3: Supported OS CPU & Memory Limits
Basic UEFI Performance Tuning
   Table 4: UEFI Tuning Table
System Images
Appendix A: RAID Configuration
Appendix B: Fixing Boot Issues with Windows 2008 in UEFI mode
Links
Trademarks


Contributions
This document could not have been put together without the work of the High-End RDS community: William Davey, Samuel Drew, Mahmoud Hussein, Kenny James, Bruce Kenney, Steven King, Doug Peaslee, Stephen Rasp, Rob Russell, Greg Shackleford, William Smutzer and Wayne Wigley. The assistance of the X5 development team was also invaluable in gathering additional information on these systems. Special thanks to Ralph Begun, David Drez and Mark Kapoor for their help in tracking down technical information on the Nehalem EX and X5 architectures.


Abstract
This document addresses the best practices for running an eX5 server (x3850X5 / x3950X5 MTM 7145) in a single node, QPI scaled or MAX5 attached configuration. This document is not designed to replace the official documentation, the IBM Redbooks or the IBM Help Desk; rather, it is designed to give one all of the information necessary to unpack a system and be up and running in the minimal amount of time. In this document we will cover the following topics:
Definitions
Initial System Configuration & Component Installation
Firmware Update Procedures
System Cabling
Basic Troubleshooting
OS Installation Hints and Tips
Quick Tips on UEFI Tuning
RAID Configuration
Fixing Windows 2008 Boot Issues


Definitions:
1. Node/Chassis: These terms may be used interchangeably to refer to a single eX5 system. A single eX5 system is based on the Intel Nehalem EX architecture and does not include any IBM specific eX5 enhancements to the Nehalem EX architecture.
2. Complex: Used to describe a merged x3950X5.
3. QPI Wrap Card: A card (two required) that fits into a single x3850X5 and serves to interconnect all four CPUs in a full mesh QPI interconnect.
4. QPI Scalability: Means of scaling a Nehalem EX-based server from 4 to 8 sockets utilizing Intel's out-of-the-box scalability functionality. Also referred to as glueless or native scalability.
5. EXA Scalability: Means of scaling a system via MAX5. EXA differs from QPI scalability in that it is proprietary to IBM and requires a MAX5.
6. MAX5: The official name of the IBM Memory and Scalability Expansion drawer. The MAX5 contains IBM's node and memory controller, the eX5 chipset, which allows IBM to offer memory and node scalability past that which is supported natively by Intel. Only one node can connect to a MAX5, and only one MAX5 can connect to a node.
7. QPI Scalability cable: A cable used to connect two base eX5 nodes together.
8. MAX5 Scalability cable: A cable used to connect a base eX5 node to a MAX5.
9. Boxboro: The name of the I/O bridge chips used in the Nehalem EX system architecture.


Initial System Configuration & Component Installation


The x3850X5 is the scalable 4-socket member of the eX5 portfolio. It is based on the Nehalem EX architecture and runs the Intel 7500 series CPUs. The system has up to 64 DIMMs and 7 PCIe Gen2 slots, and can either scale memory via the MAX5 or expand CPU, memory and PCIe resources by adding a second x3850X5. Base systems ship with two CPUs (sockets 1 & 4) populated, two memory cards (slots 1 & 7) installed and the Emulex 10GbE card pre-populated in PCIe slot 7.

An important note regarding the CPU sockets in the X5 systems: the 7500 series CPUs use a socket in which the CPU-to-system interconnect pins are in the socket rather than on the CPU. This makes the sockets extremely fragile, because the pins that connect the CPUs to the remainder of the system components are easily bent. IBM ships a CPU insertion tool with all CRU / FRU CPUs and system-boards. It is highly recommended to use this tool any time work is being done on the CPUs. This helps ensure that no CPU socket pins get bent; bent pins can lead to numerous memory, QPI and I/O errors. The solution for bent pins on a system-board is almost always a new system-board, which can add significant delays to system implementation.

CPUs should be added to the system in the following order: CPU 1, CPU 4, CPU 2 and CPU 3.

Figure 1

Upon receipt of a new system, there are a number of steps that one should take prior to implementing the system in production. One of the first is to remove the case cover and do a visual inspection of the components. A visual inspection of all of the installed PCIe cards and verification that DIMMs are installed correctly can save troubleshooting time later in the installation process.

One common cause of issues is missing or improperly seated QPI wrap cards. Because the wrap card filler plates look identical to the wrap cards themselves, systems are often installed without the wrap cards. If this is the case, the system will manifest poor performance and may generate NMIs under load. In addition to making sure that the wrap cards are in the system, verify that the cards are fully latched in place. There is a blue latch handle that should lock into place when the cards are correctly installed (see Figure 1).

Figure 2

When installing memory, it is important to validate that it is optimized across all CPUs, as proper memory layout is essential not only for future expandability but also for performance. It is recommended to populate DIMMs on one memory card per CPU prior to moving on to populating the second memory card for that CPU. This ensures that DIMM quantity is balanced across CPUs. It is also highly recommended that memory capacity per memory card and per CPU be equivalent, to ensure the best possible memory bandwidth in the system. The minimum memory layout for performance is eight memory cards with four DIMMs / memory card, for a total of eight DIMMs / CPU. The minimum memory layout for scalability is eight memory cards with two DIMMs / memory card, for a total of four DIMMs / CPU.


The following table lists the correct DIMM plug order within a socket:

Table 1: DIMM Installation Order

DIMM Pair    Memory Card    DIMM Slot
1            1              1, 8
2            2              1, 8
3            1              3, 6
4            2              3, 6
5            1              2, 7
6            2              2, 7
7            1              4, 5
8            2              4, 5
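As a rough illustration, the plug order in Table 1 can be captured as a lookup so that a planned DIMM count can be checked before opening the chassis. This is only a sketch: the names DIMM_PLUG_ORDER, dimm_slots_for and dimms_per_card are hypothetical, and it assumes the memory card numbers 1 and 2 in the table are relative to a single CPU socket.

```python
# DIMM plug order within one CPU socket, transcribed from Table 1.
# Each entry is (memory_card, (dimm_slot_a, dimm_slot_b)).
# Assumption: cards 1 and 2 are numbered relative to the CPU, as in the table.
DIMM_PLUG_ORDER = [
    (1, (1, 8)),
    (2, (1, 8)),
    (1, (3, 6)),
    (2, (3, 6)),
    (1, (2, 7)),
    (2, (2, 7)),
    (1, (4, 5)),
    (2, (4, 5)),
]


def dimm_slots_for(pairs):
    """Return the (card, slots) entries to populate for `pairs` DIMM pairs."""
    if not 0 <= pairs <= len(DIMM_PLUG_ORDER):
        raise ValueError("a CPU socket holds between 0 and 8 DIMM pairs")
    return DIMM_PLUG_ORDER[:pairs]


def dimms_per_card(pairs):
    """Count how many DIMMs land on each memory card for `pairs` DIMM pairs."""
    counts = {1: 0, 2: 0}
    for card, slots in dimm_slots_for(pairs):
        counts[card] += len(slots)
    return counts
```

For example, dimms_per_card(4) gives four DIMMs on each of the two cards, which corresponds to the eight DIMMs / CPU performance minimum, and dimms_per_card(2) gives two DIMMs per card, the scalability minimum.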

Prior to powering on the system, it is important to verify that IO cards are properly seated and distributed correctly within the system. There is no hot-swap IO on the X5 system, and hot-plugging / removing IO cards can lead to damage of the card, the system, or both. In order to optimize the IO subsystem for performance, adapters should be alternated between the Boxboros during installation. Adapter cards of the same type (i.e., NICs vs. HBAs) should be split across the Boxboros. Figure 2 lays out how IO slots are arranged in the system.

Figure 3

There are a number of best practices when installing adapters and configuring connections. In the case of NICs, the best practices fall primarily around adapter placement and NIC teaming. Adapter placement follows the rule of alternating devices between Boxboros; however, one needs to remember that the onboard NICs are homed on Boxboro 1. It is strongly recommended that the first NIC card be placed on Boxboro 2, particularly if the onboard NICs are to be used and / or NIC teaming is to be implemented. NIC teaming should be implemented such that NIC teams only include NICs from the same vendor and that teams are created horizontally across ports wherever possible (i.e., NIC 1 port 1 teamed to NIC 2 port 1).

Fibre cards have card placement rules similar to NICs in that they should be alternated between Boxboros. It is strongly suggested to use at least two cards to ensure that IO is balanced across the Boxboros rather than contained within a single card and Boxboro. It is also recommended that LUNs assigned in a multi-pathed arrangement are zoned horizontally, then vertically (i.e., HBA 1 port 1 and HBA 2 port 1 have LUNs from controller A while HBA 1 port 2 and HBA 2 port 2 have LUNs from controller B).

Multi-node systems have slightly different card placement recommendations than previous iterations of eXA systems. It is recommended to split NICs and HBAs across nodes so as to be able to distribute IO to each node's local Boxboros. Once adapters are split between nodes, follow the horizontal before vertical teaming / multi-pathing configuration for single node systems. NIC teaming is never recommended between nodes, as the additional latency it introduces can negatively impact network performance.

Table 2: PCIe Card Installation Order

Card Number    PCIe Slot Number
1              1
2              5
3              3
4              6
5              4
6              7
7              2
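The slot order in Table 2 and the horizontal teaming rule can be sketched as two small helpers. This is an illustrative sketch only: the function names and the card labels passed in are hypothetical, and it ignores the fact that base systems ship with the Emulex 10GbE card already in slot 7.

```python
# PCIe slot fill order transcribed from Table 2: the Nth card installed
# goes into PCIE_SLOT_ORDER[N - 1].
PCIE_SLOT_ORDER = [1, 5, 3, 6, 4, 7, 2]


def assign_slots(cards):
    """Map an ordered list of card labels to slots, following Table 2."""
    if len(cards) > len(PCIE_SLOT_ORDER):
        raise ValueError("the system has only 7 PCIe slots")
    return dict(zip(PCIE_SLOT_ORDER, cards))


def horizontal_teams(nic_a, nic_b, ports):
    """Pair port N of one NIC with port N of the other (horizontal teaming)."""
    return [((nic_a, port), (nic_b, port)) for port in range(1, ports + 1)]
```

Calling assign_slots(["nic-a", "hba-a", "nic-b"]) places the three cards in slots 1, 5 and 3, and horizontal_teams("NIC1", "NIC2", 2) pairs port 1 with port 1 and port 2 with port 2.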


Firmware Update Procedures


Prior to installing an OS on the system, it is recommended to update to the newest firmware (FW) currently available on the web. There are three primary means of updating FW:

1. Via the IMM: This method allows the user to update base system FW (IMM, FPGA, UEFI) without the need for any OS or additional tools. The user can accomplish this by connecting a web browser to the IMM (either via crossover cable or regular network connection) at the default IMM IP (192.168.70.125). The down-sides of this method are that only base FW can be updated and that it is only supported in single node configurations.
2. Via Bootable Media Creator (BoMC): This method allows the user to create a bootable image (ISO, USB or PXE) with the most current FW for all options installed in the system. When selecting this method, the user can specify whether to choose the latest UXSP or the latest FW on the web. This author recommends utilizing the latest FW on the web for new installations. The down-sides of this method are that the creation and maintenance of an up-to-date BoMC image requires the download of a large number of files, and that the BoMC attempts to execute all FW flashes for the specified model, regardless of whether the options are installed in the system or not. This method is recommended for pre-OS deployment FW updates.
3. Via In-Band FW Updates: This method allows the user to deploy and schedule FW updates from within an installed operating system. The upside of this method is that it allows users to script, schedule and deploy the updates via whatever means are currently used for system maintenance. The downside of this method is that it requires an OS to be on the system.

Once the appropriate means for FW updates has been chosen, perform the FW update procedure in the following manner:

1. For systems with an OS currently installed, update RAID and network drivers prior to FW updates.
2. Update the FW for the major system level components first. The FW to update during this first round, in order, is:
   a) IMM (minimum recommended level is 1.14 YUOO73K)
   b) Restart the IMM
   c) Reboot the system to the F1 setup screen
   d) FPGA (minimum recommended level is 1.01 GOUD29CUS)
   e) UEFI (requires the system to be at least in the F1 setup screen) (minimum recommended level is 1.23 G0E122D)
   f) Reboot system
3. Update the FW for the remainder of the system next, in the following order:
   a) pre-boot DSA
   b) Storage controllers (ServeRAID, HBA)
   c) HDD / SSD
   d) Network cards
   e) Other adapters
   f) Reboot system
4. If the system is going to be part of a multi-node system, repeat this same process on the second node prior to cabling the systems. Note: Once the systems are merged, all FW updates can be performed from the primary node either from BoMC or from within the OS.
5. If the system is to have a MAX5 attached, then:
   a) Shut down the system
   b) Remove AC power from the node and MAX5
   c) Cable the systems together
   d) Reapply AC power
   e) Connect via the IMM and re-flash the FPGA code with the system turned off
   f) Restart the IMM
   g) Power on the system
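The flash order in steps 2 and 3 above can be expressed as a simple ordering rule, for example to sort a list of pending updates before running them. This is a minimal sketch: the short component keys and the function name plan_updates are hypothetical labels chosen for illustration, not IBM package names, and the restarts and reboots between rounds are not modeled.

```python
# Flash order from the procedure above; a lower index is flashed earlier.
# The keys are hypothetical labels, not IBM package names.
FLASH_ORDER = [
    "imm", "fpga", "uefi",        # round 1: base system firmware
    "dsa", "storage", "drives",   # round 2: pre-boot DSA, controllers, HDD / SSD
    "network", "other",           # round 2 (cont.): network cards, other adapters
]


def plan_updates(pending):
    """Return the pending components sorted into the recommended flash order."""
    unknown = set(pending) - set(FLASH_ORDER)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return sorted(set(pending), key=FLASH_ORDER.index)
```

For instance, plan_updates(["network", "uefi", "imm"]) yields the IMM first, then UEFI, then the network cards.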


System Cabling
Proper system cabling in the x3850X5 is critical, as it has an impact not simply on performance but also on primary node ownership status. When an additional node or a MAX5 is added to a base node, the primary node assumes responsibility for system POST, OS initialization, error alert aggregation and FW updating of the complex.

In a native QPI (i.e., glueless) 8-socket configuration, primary node status is determined by the manner in which the cables are connected. The QPI cables are keyed on each end (J1, J2 on one end & J3, J4 on the other end). The J1, J2 end of each cable denotes which system will be the primary. These cables are fragile; treat them with care to avoid damaging the connector. The proper cabling sequence for an 8-socket system is (top-bottom) Port1-Port1, Port2-Port3, Port3-Port2, Port4-Port4. If you do not have all cables aligned with J1, J2 on the primary node, you will experience sporadic system issues.

When cabling a MAX5 to a base node, the correct cabling sequence is Port1-Port1, Port2-Port2, Port3-Port3, Port4-Port4.

Note: These cables are not hot-add/hot-swap. AC power must be removed prior to cabling the system. See the following figures for proper connection of cables to ports.

Figure 4


Figure 5

Figure 6


Figure 7
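The two cabling sequences described in this section can be written down as port maps and compared against what is actually plugged in. The sketch below is illustrative: the names QPI_8_SOCKET, MAX5_ATTACH and miscabled_ports are hypothetical, and it assumes the first port in each pair from the text is the port on the primary (or base) node.

```python
# Port pairings (primary/base node port -> port on the attached unit),
# transcribed from the cabling sequences in this section.
# Assumption: the first port of each pair in the text is on the primary node.
QPI_8_SOCKET = {1: 1, 2: 3, 3: 2, 4: 4}   # node to node
MAX5_ATTACH = {1: 1, 2: 2, 3: 3, 4: 4}    # node to MAX5


def miscabled_ports(observed, expected):
    """Return the primary-side ports whose far end differs from `expected`.

    A port missing from `observed` is reported as miscabled.
    """
    return sorted(p for p in expected if observed.get(p) != expected[p])
```

Cabling two nodes straight through, as one would for a MAX5, gives miscabled_ports({1: 1, 2: 2, 3: 3, 4: 4}, QPI_8_SOCKET) == [2, 3], flagging the two crossed ports.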

When scaling the x3850X5 to 8 sockets, there are a number of steps to follow:

1. Update the FW on the nodes to be scaled to the pre-requisite FW levels for scaling. This author always recommends updating the systems to the most current FW publicly available prior to scaling. Current code levels required for scaling are:
   a. UEFI: 22C
   b. FPGA: 29C
   c. IMM: 73K
2. Shut down the nodes to be cabled together.
3. Remove power from both nodes.
4. Remove the QPI Wrap Cards. NOTE: These parts are NOT hot-swappable. Removing a wrap card with power to the system can damage the CPU board.
5. Remove the cable management arm and make sure that all cables from the secondary node are bundled together to allow oneself more room to work.
6. Insert the QPI scalability cables in the following sequence (right-to-left): Port4-Port4, Port3-Port2, Port2-Port3, Port1-Port1. Working right-to-left gives you more room to work and to correctly route IO and power cables.
7. Reattach all IO cables.
8. Reconnect the cable management arm and Velcro clip.
9. Attach power cables in the following order (PSU 1 is the leftmost PSU): Top PSU 2, Bottom PSU 1, Top PSU 1, and Bottom PSU 2. All nodes need to receive power at roughly the same time to ensure the IMMs can communicate across the QPI cables, which is the reason the author recommends the above plug-in sequence.
10. The IMMs should both be initializing. You can verify this by the flashing blue LED on the front of each system and the 5 pulse per second flashing of the power button on both systems.
11. Once the IMMs have synchronized, push the power button on the top node, which will be the primary node based on the cabling scheme laid out here, to start the system. Note: Once the systems are successfully merged, either power button will initiate/terminate DC power to the system.
12. If there are no error LEDs on the front of the system, go into the UEFI via F1 once the IBM splash screen comes up. You should see both nodes listed by their serial numbers.
13. Verify your UEFI settings.
   a. NOTE: There are now Global and Nodal UEFI values. Global UEFI settings affect the entire 8-socket system; nodal values only affect the individual systems comprising the 8-socket system.
   b. Global settings include CPU settings, Boot and Option ROM scan order, Slot enable/disable and memory settings such as Patrol scrub.
   c. Nodal values include such things as re-enabling memory cards, memory speed and com port settings.
14. Save any changes and allow the system to complete POST and bring the OS up.

When scaling a x3850 X5 to a MAX5, the following steps are recommended:

Note: The MAX5 scalability cables will fit correctly only one way.

1. Update the FW on the base node to be scaled to the pre-requisite FW levels for scaling. This author always recommends updating the systems to the most current FW publicly available prior to scaling.
2. Shut down the OS on the base node.
3. Remove power from both the MAX5 and base node.
4. Remove the QPI Wrap Cards. NOTE: These parts are NOT hot-swappable. Removing a wrap card with power to the system can damage the CPU board.
5. Remove the cable management arm from the base node.
6. Insert the MAX5 scalability cables in the following sequence (right-to-left): Port4-Port4, Port3-Port3, Port2-Port2, Port1-Port1. The reason for working right-to-left is the small size and bend radius of the scalability cables.
7. Reconnect cable management arms.
8. Attach power cables in the following order (PSU 1 is the leftmost PSU): MAX5 PSU 2, Node PSU 1, Node PSU 2, and MAX5 PSU 1. All components need to receive power at roughly the same time to ensure the IMM can communicate across the scalability cables and identify the MAX5, which is the reason the author recommends the above plug-in sequence.
9. The IMM should be initializing both itself and the MAX5. You can verify this by the flashing blue LED on the front of each system and the 5 pulse per second flashing of the power button on the base system.
10. Once the IMM has merged the MAX5 to itself, push the power button on the base node to start the system. Note: There is NO power button on the MAX5.
11. If there is no error LED on the front of the system, you should see a message stating that the system is initializing a x3850X5 and a MAX5.
12. Once the IBM splash screen comes up, go into the UEFI via F1. Based on the OS installed, you will need to set the memory management of the MAX5. There are two different settings:
   a. Pooled: This setting is only supported under Linux. All of the memory behind the MAX5 is reserved as a single pool of memory addressable by any CPU as local RAM.
   b. Non-Pooled: This setting is required to use Windows or VMware with a MAX5 attached. In this setting each CPU is logically assigned a set of memory registers from the MAX5's overall pool of memory registers. See the diagram to follow.
13. Save any other UEFI changes and allow the system to complete POST and bring the OS up.
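The OS dependency of the MAX5 memory setting can be reduced to a small lookup, which may be useful in a pre-deployment checklist script. This is a sketch under the rules stated in the MAX5 steps above only: the function name pooled_mode_allowed and the OS family strings are hypothetical, and the guide does not say whether Linux may also run in non-pooled mode, so that question is left out.

```python
# OS families as named in the MAX5 memory settings above.
POOLED_SUPPORTED = {"linux"}                # Pooled is only supported under Linux
NON_POOLED_REQUIRED = {"windows", "vmware"}  # Non-Pooled is required for these


def pooled_mode_allowed(os_family):
    """Return True if the Pooled MAX5 memory setting may be used for this OS."""
    family = os_family.strip().lower()
    if family in POOLED_SUPPORTED:
        return True
    if family in NON_POOLED_REQUIRED:
        return False
    # The guide gives no rule for other operating systems.
    raise ValueError(f"no guidance for OS family: {os_family!r}")
```

A deployment script could call pooled_mode_allowed("VMware") and, on a False result, remind the installer to select Non-Pooled in UEFI before the first boot.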

Basic Troubleshooting
In this section we will address some basic hardware troubleshooting steps on the x3850X5 systems. Generally, hardware issues tend to fall into two broad categories: system level component issues (including adapter cards) and firmware related issues. While many of these issues can be effectively troubleshot by the end user, this section is by no means a replacement for the Problem Determination and Service Guide or the assistance of the IBM Help Desk.

Five essential tools in successful problem determination are the IMM event logs accessed via the IMM web interface, the system event logs accessed via F1 or ipmitool, the Dynamic System Analysis (DSA) tool (either the pre-boot version, which runs outside of the OS, or the portable version, run from within the OS), the Problem Determination and Service Guide (PDSG) from the IBM web site and the Retain Tips for the system as listed on the IBM web site.

System level component problem diagnosis generally involves diagnosing issues with three main subsystems: CPU, memory and IO. Intel's use of the LGA 1567 socket has changed the CPU-to-systemboard connection paradigm: where the interconnect pins used to be on the CPU, they are now in the socket. When interconnect pins become damaged, symptoms range from random CPU connectivity errors and QPI link errors to memory DIMMs failing to be recognized / dropping off line and IO link errors. Because of this new design, most CPU errors tend to actually be systemboard errors. It is important to note that if a systemboard is replaced, one can get into a firmware mismatch situation. For this reason, it is always recommended to perform firmware upgrades via the IMM after component replacement, prior to DC powering on the system. Most often the code which will need to be updated is FPGA, though some boards may be back level on UEFI code as well. Also, one should never insert or remove a CPU from a socket without the CPU removal tool. It is specially designed for CPU (re)placement without damaging the socket pins.
Memory errors require more troubleshooting than in previous generation platforms. Because of the inclusion of the memory controller on the CPU, troubleshooting memory errors involves more than replacing DIMMs. When troubleshooting memory errors, the following procedure is recommended:

1. Verify which memory card and DIMMs are reporting as bad by checking the IMM log
2. Shut down the system
3. Remove AC power
4. Reseat DIMMs and memory cards
5. Reapply power and go into F1>System Settings>Memory>Memory Details and re-enable the DIMMs reported as bad
6. Save the settings
7. Reboot the server
8. If the system reflags the DIMMs as bad, shut down the server
9. Move the DIMMs to a memory card on another CPU (i.e., the next higher/lower even/odd card depending on which card the suspect DIMMs came off of)
10. Reapply power and go into F1>System Settings>Memory>Memory Details and re-enable the DIMMs reported as bad
11. If the error follows the DIMMs, work with IBM Support to replace the DIMMs
12. If the error stays with the original memory card (source) rather than following the DIMMs, shut down the server and remove AC power
13. Swap the memory card hosting the previously suspect DIMMs (target) into the slot still reporting memory errors
14. Reapply power and go into F1>System Settings>Memory>Memory Details and re-enable the DIMMs reported as bad
15. Save the settings
16. Reboot the server
17. If the system reflags the DIMMs in the source slot, the most likely cause is bent pins either in the memory card socket or CPU socket. Work with IBM Support to replace the system board
18. If the errors follow the memory card, the most likely cause is a faulty memory card. Work with IBM Support to replace the memory card

IO hardware errors are often the result of either bent socket / slot pins or cards that have come unseated. IMM logs will often show uncorrectable bus errors, SMIs or NMIs when there are IO hardware errors. Troubleshooting IO hardware errors primarily involves examination of the IO slots as well as verification that IO cards are seated. Other steps that one can take to troubleshoot IO hardware errors are to remove all cards of the same type (i.e., all HBAs from the same vendor) and / or swap card slot locations. Certain cards may consume excess OptionROM space, which poses no problem in a single node solution but can lead to exhaustion of resources in multi-node scenarios. It is important to note that the IO board hosts not only the IMM, which means that both the MAC and IP address will be reset if the IO board is replaced, but also much of the firmware.

Scalability issues most often manifest themselves as link errors. There are two types of link errors, depending on whether the issue is related to a scalability port cable or an internal QPI link. External QPI link failures are most commonly associated with bad cables or QPI ports. Proper troubleshooting for these types of errors involves inspecting the QPI cable connector for damage, moving the cable from the QPI port identified in the IMM log to a port in the adjacent port group (ports 1 and 2 are in port group 1, ports 3 and 4 are in port group 2) or breaking the system into individual 4-socket nodes. The following workflow will aid in troubleshooting external link scalability issues:

1. Remove AC power
2. Reseat all QPI cables in both nodes
3. Restart the system
4. If errors continue, remove the QPI cables by depressing the blue tab on the cable. Be sure to remove the ends of the cable from both nodes at the same time for inspection. Trying to remove a cable from one node at a time can damage the cable due to the limited length and bend radius of the cables.
5. If the connectors on the cable appear damaged, work with IBM Support to replace the cable.
6. If there is no apparent cable damage, one can attempt to visually inspect the pins in the QPI port.
7. In lieu of inspecting pins, swap the QPI cable with a cable from the other port group. Note: If the QPI cable is damaged and one attempts to insert it into another QPI port, the connector pins on that port may become damaged, leading to a systemboard swap.
8. Re-apply AC power and restart the system.
9. If the error follows the cable, the issue is most likely the cable.
10. If the error remains in the original port group, the error is most likely the systemboard. Work with IBM Support to replace the systemboard.

Internal QPI link errors are most often due to a QPI link on the systemboard. The best means of troubleshooting these issues is to remove AC power, remove the QPI scalability cables, reinsert the QPI wrap cards and see if the problem continues to manifest itself. If it does, the error is most likely the systemboard. Work with IBM Support to replace the systemboard.
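The port group rule and the outcome of the cable swap in the external link workflow above can be sketched as follows. This is illustrative only: the function names are hypothetical, and the two return strings simply echo the conclusions of steps 9 and 10.

```python
def port_group(port):
    """Ports 1 and 2 form port group 1; ports 3 and 4 form port group 2."""
    if port not in (1, 2, 3, 4):
        raise ValueError("QPI scalability ports are numbered 1 to 4")
    return 1 if port <= 2 else 2


def diagnose_after_cable_swap(original_port, failing_port):
    """Interpret where the link error appears after swapping the suspect
    cable with a cable from the other port group."""
    if port_group(failing_port) == port_group(original_port):
        return "systemboard"   # error remained in the original port group
    return "cable"             # error followed the cable to the other group
```

If the IMM log first flagged port 1 and, after the swap, the error shows up on port 3, diagnose_after_cable_swap(1, 3) points to the cable; if the error is still reported on port 1, it points to the systemboard.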

Firmware related issues on X5 systems primarily revolve around mismatched and outdated firmware, both at the base system level and on option cards. Mismatched firmware at the base system level is often the result of a hardware service action or corrupted firmware updates. Most often this can be remedied by re-flashing the base system from the IMM. In the case of corrupted UEFI, there is a jumper (J22) located under power supply (PSU) 1 which can be reset to boot into the backup UEFI block, allowing the primary UEFI block to be flashed again.

Figure 8

It is important to note that all IBM systems come with Automatic Boot Failure Recovery (ABR). With ABR, if the system detects an issue with the firmware in the primary bank, it switches to the backup bank so that the primary firmware can be recovered. Upon reboot one will see a prompt to Press F3 to restore to primary; this allows the system to attempt to recover the primary firmware bank.

When dealing with a MAX5, it is important that the firmware on the MAX5 and the base node are in sync. The firmware on the system and the MAX5 must be identical, as the FPGA code tells the MAX5 which type of machine it is connected to and enables the base node to control power permissions and error alerting for the MAX5. Since the FPGA code on the MAX5 is flashed via the base node, the first thing one should do when attaching a MAX5 to a system, whether a new system or one that has just been serviced, is to (re)flash the FPGA with the MAX5 attached but with the system not turned on. This will ensure the firmware on the MAX5 is in sync with the base node.

Down-level firmware on option cards can often lead to NMIs or connectivity issues. Most option cards can be flashed from within the OS by using the packages located on the IBM website, though certain adapters have additional requirements: some (such as Emulex) may require one to load pieces of their card management framework, some need the adapter to be up (Broadcom NICs) and, in the case of Linux, some need the vendor's drivers loaded (as opposed to the drivers which ship with the Linux distribution).

One of the primary tools for flashing firmware should be the Bootable Media Creator (BoMC) utility from the IBM ToolsCenter web site. When troubleshooting firmware-related issues, it is suggested that one allow the BoMC tool to gather the most current firmware from the web. This will ensure that the most up-to-date system firmware is applied. For more information on using the BoMC tool, please reference the User Documentation.

Sample Troubleshooting Workflows

Troubleshooting memory errors
1) Validate the DIMM
   a) Move the DIMM to another bank on a different CPU
   b) Remember to re-enable the bank
2) Isolate the board
   a) Swap memory boards among CPUs
3) Isolate the CPU
   a) Swap CPUs into another socket
      i) Use the IBM CPU puller to avoid damaging pins on the system board.
4) Validate the socket
   a) Break out your flashlight and inspect the pins

Troubleshooting IO errors
1) Examine IMM / IPMI event logs to pinpoint card and slot
   a) NMIs are often caused by unsupported adapters and/or adapters with downlevel FW
   b) Storage adapters (ServeRAID, HBA & InfiniBand) are sensitive to correct FW levels
      i) Many IO card related issues can be attributed to downlevel or corrupted FW
2) Isolate and validate the card
   a) Swap the card to another IO slot
      i) On a different IO bridge and, if supported by your architecture, on a different CPU
   b) Validate the card is PCIe Gen2
      i) If it is not, change the slot to PCIe Gen 1 in UEFI
   c) Try disabling the Option ROM
3) Isolate the CPU
   a) Swap CPUs into another socket
      i) Use the IBM CPU puller to avoid damaging pins on the system board.
4) Validate the socket
   a) Break out your flashlight and magnifying glass to inspect the pins in the socket

Troubleshooting QPI errors
1) Examine IMM / IPMI event logs to pinpoint CPU TX/RX
   a) NMIs can present with Int QPI link errors
   b) QPI lane reduction / failover can also be seen
2) Isolate and validate the node
   a) If in a single node, try swapping the CPUs to verify whether it is a socket or CPU issue
   b) If in a multi-node solution, try removing the scalability cables and replacing them with wrap cards to verify whether the cables are the issue
      i) IMM log will identify remote node by SN:FRU
3) Validate the connector
   a) Break out your flashlight and inspect the pins on the scalability ports and within the socket
   b) Validate that the cable ends are not crushed and/or damaged
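The memory workflow is a process of elimination: swap one part at a time and see whether the error moves with it. The Python sketch below illustrates that reasoning only. It assumes the technician records, for each swap, whether the error followed the swapped part; the function name and the mapping it takes are hypothetical and not part of any IBM tool.

```python
from typing import Optional

# Order in which parts are swapped, following the memory workflow:
# DIMM first, then memory board, then CPU. Whatever is left is the socket.
SWAP_ORDER = ["DIMM", "memory board", "CPU"]


def isolate_fault(error_followed: dict) -> Optional[str]:
    """Return the suspected part, or None if more swaps are needed.

    error_followed maps a part name to True if the error moved with the
    part after the swap, False if the error stayed at the old location.
    Parts that have not been swapped yet are absent from the mapping.
    """
    for part in SWAP_ORDER:
        if part not in error_followed:
            return None  # this swap has not been performed yet
        if error_followed[part]:
            return part  # the error moved with the part, so the part is suspect
    # Every swapped part was cleared, so inspect the socket pins.
    return "socket"


print(isolate_fault({"DIMM": False, "memory board": True}))
```

The same pattern applies to the IO workflow, with the card and slot taking the place of the DIMM and memory board.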

Finally, due to certain quirks in UEFI, there are certain installation issues which one should be aware of:
1. When an OS is installed in UEFI mode, it is necessary to either re-initialize the disk or use software tools to clean the metadata off of the drive before it can be used for a legacy OS. This is because UEFI checks for UEFI boot devices prior to legacy devices. Even if Legacy Only is placed as the primary boot device, unless the disk is converted from a GPT disk to a legacy MBR disk, the OS will be unable to install due to limitations in legacy OSs.
2. When additional LUNs are added to a UEFI-based Windows 2008+ OS installation, the boot target information may become out of sync with the bootloader. See Appendix B for quick tips on how to repair the Windows boot configuration datastore.
3. When installing Linux, NICs may change their order (eth0 becoming eth1) due to differences in how the Linux kernel scans devices versus how the devices are reported in the PCIe table. The solution at present is to associate each eth device with a MAC address.
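Before reusing a disk for a legacy OS, it can help to confirm whether GPT metadata is still present. The Python sketch below classifies a disk from its first two sectors. It assumes 512-byte logical sectors and relies only on two on-disk signatures: the ASCII string "EFI PART" at the start of LBA 1 for GPT, and the 0x55 0xAA bytes at the end of LBA 0 for an MBR. The function name is illustrative.

```python
SECTOR_SIZE = 512  # assumed logical sector size


def partition_scheme(first_two_sectors: bytes) -> str:
    """Classify a disk as 'GPT', 'MBR' or 'unknown' from its first 1024 bytes."""
    if len(first_two_sectors) < 2 * SECTOR_SIZE:
        raise ValueError("need at least the first two sectors")
    lba0 = first_two_sectors[:SECTOR_SIZE]
    lba1 = first_two_sectors[SECTOR_SIZE:2 * SECTOR_SIZE]
    # The primary GPT header lives in LBA 1 and starts with "EFI PART".
    # Check it first: a GPT disk also carries a protective MBR in LBA 0.
    if lba1[:8] == b"EFI PART":
        return "GPT"
    # A legacy MBR ends with the boot signature bytes 0x55 0xAA.
    if lba0[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"


# Synthetic example: protective MBR signature followed by a GPT header signature.
sample = bytes(510) + b"\x55\xaa" + b"EFI PART" + bytes(504)
print(partition_scheme(sample))
```

On Linux the input could be obtained by reading the first 1024 bytes of the block device as root (for example /dev/sdX, a placeholder name). GPT also keeps a backup header at the end of the disk, which is one reason a full re-initialization is needed rather than overwriting the first sectors alone.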


OS Installation Hints and Tips


In this section we will attempt to cover some of the common pitfalls customers encounter when installing a select set of Operating Systems (OS). This section is not meant to be a comprehensive list of all issues associated with OS installation, nor is it meant to replace the OS installation guides available on the web site. The current list of supported OS and versions is outlined in Table 2. For an up-to-date list of all currently supported OSs, their minimum supported levels and any issues around OS support, check the IBM ServerProven web site.

Table 3: Supported OS CPU & Memory Limits

HW Configuration   CPU Threads   Memory Capacity (16GB DIMMs)
x3850X5 1-node     64            1TB
x3850X5 + MAX5     64            1.5TB
x3850X5 2-node     128           2TB

OS PLAN
RHEL 6 64 bit
RHEL 6 w/KVM 64 bit
RHEL 5.4 64 bit
RHEL 5.4 w/KVM &/or Xen 64 bit
SLES 11 64 bit
SLES 11 with Xen 64 bit
SLES 10 SP3 64 bit
SLES 10 SP3 with Xen 64 bit
Windows 2008 R2 DC
Windows 2008 x64 DC
VMware 4.0 update 1
VMware 4.1

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes 1 TB Limit 1 TB Limit Yes Yes Yes Yes Yes Yes 1 TB Limit 1 TB Limit Yes Yes Yes 64 Cores & 1.6 TB Limit Yes 2 TB MS test statement Yes 64 Cores & 2 TB Limit Not Supported Not Supported 1 TB Limit 128 Cores & 1 TB Limit


General Installation Information

Depending on the RAID adapter installed, one may not have access to the HDDs until arrays are created. The BR10i and M1015 both support JBOD, and as such the HDDs are visible without being included in arrays. Neither the M5014 nor the M5015 supports JBOD, therefore HDDs must be placed into arrays before they are available as installation targets (see Appendix A for information on creating arrays). The M5014 / M5015 adapters can migrate arrays (not JBODs) upwards without the need for reinstallation or damage to the array structure; however, migrating an array from an M5014 / M5015 down to a BR10i is not supported.

Windows 2008 Installation Tips
1. The M5015 RAID driver for Windows 2008 is not included on the Windows 2008 DVD.
2. The Intel chipset driver is not included on the Installation CD. It is required for proper operation.
3. The Broadcom NIC driver is not on the Windows 2008 installation CD.
4. The dual-port Emulex 10GbE NIC driver is not included on the Windows 2008 Installation CD.
5. If installing Windows 2008 Enterprise Edition (EE) on an 8-socket system, Hyperthreading must be turned off prior to installation or the system will BSOD.
6. The default Installation Media installs the OS in UEFI mode. This requires a different imaging methodology as UEFI mode requires the OS to reside on a GPT disk.
7. The system can re-enumerate the boot drive when fibre devices are hooked up. This will cause the OS to fail to boot properly. Using Windows Boot Manager from the UEFI Boot Manager can fix this.

Windows 2008 R2 Installation Tips
1. The dual-port Emulex 10GbE NIC driver is not included on the Windows 2008 R2 DVD.
2. If installing Windows 2008 R2 Enterprise Edition (EE) on an 8-socket system, Hyperthreading must be turned off prior to installation or the system will BSOD.
3. Windows 2008 R2 currently experiences issues with more than 1TB of memory and requires a Hotfix.
4. Windows 2008 R2 requires Hotfix (KB975535) to correct an issue with GPT disks.
5. The default Installation Media installs the OS in UEFI mode unless Legacy Only is the first entry in the boot order. This requires a different imaging methodology as UEFI mode requires the OS to reside on a GPT disk.
6. In 8-socket configurations, edit the BCD Store to enable High Precision Event Timer. See RETAIN tip: H196919
7. The system can re-enumerate the boot drive when fibre devices are hooked up. This will cause the OS to fail to boot properly. Using Windows Boot Manager from the UEFI Boot Manager can fix this.

Linux
1. All system drivers are included in both RHEL 5.4 and SLES 11.
2. RHEL 5.5 is required for 8-socket scaling.
3. SLES 11 SP1 can install in UEFI mode. It requires installation on a GPT disk.
4. SLES 10 SP3 is supported, but requires a mass storage driver for the M5015.
5. RHEL has a tendency to assign eth0 to the Emulex 10GbE NIC. The suggestion is to blacklist the module during PXE installs. From within the OS, it is recommended to hard-code the Ethernet device names based on MAC address.
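Hard-coding Ethernet device names by MAC address is commonly done with one udev rule per interface. The Python sketch below only generates the rule text. It assumes a distribution that reads persistent network rules from a udev rules file (for example /etc/udev/rules.d/70-persistent-net.rules on RHEL 6 or SLES 11), and the MAC addresses shown are placeholders. RHEL 5 typically uses a HWADDR entry in each ifcfg-ethX file instead.

```python
import re

MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def udev_net_rule(mac: str, name: str) -> str:
    """Build one udev rule line that binds an interface name to a MAC address."""
    mac = mac.strip().lower()  # sysfs reports the address in lower case
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return (
        'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
        f'ATTR{{address}}=="{mac}", ATTR{{type}}=="1", '
        f'KERNEL=="eth*", NAME="{name}"'
    )


# Placeholder addresses: onboard NIC pinned to eth0, 10GbE adapter to eth1.
for mac, name in [("00:1A:64:00:00:01", "eth0"), ("00:00:C9:00:00:01", "eth1")]:
    print(udev_net_rule(mac, name))
```

Generating the rules from an inventory of MAC addresses keeps the naming consistent across reinstallations, regardless of the order in which the kernel discovers the adapters.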

VMware
1. VMware 4.0 U1 is the minimum supported OS level for a single-node x3850X5.
2. The addition of a MAX5 or a second node requires at least VMware 4.1.
3. The drivers for the included Emulex 10GbE NIC, the Emulex VFA and the Qlogic CNA are not included in VMware ESX 4.0 U1. Download the drivers from VMware's site.
4. To scale to 8 sockets or with MAX5, you will need to edit your grub.conf file to include the following: allowInterleavedNUMAnodes=TRUE.
5. ESX 4.1 will generate a warning with 1TB RAM.
6. ESX requires equal amounts of memory per CPU and per node or it will fail to start.
7. ESX requires the MAX5 memory configuration in UEFI to be in non-pooled mode.
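The requirement for equal memory per CPU and per node can be checked against a DIMM inventory before installing ESX. The Python sketch below is a minimal illustration; the inventory layout (node name, then CPU name, then a list of DIMM sizes in GB) and the function name are assumptions made for the example.

```python
def check_memory_balance(nodes: dict) -> list:
    """Return a list of imbalance findings (empty if memory is balanced).

    nodes maps a node name to {cpu name: [DIMM sizes in GB]}.
    """
    findings = []
    # Total memory behind each CPU, keyed by (node, cpu).
    cpu_totals = {
        (node, cpu): sum(dimms)
        for node, cpus in nodes.items()
        for cpu, dimms in cpus.items()
    }
    if len(set(cpu_totals.values())) > 1:
        findings.append(f"memory per CPU differs: {cpu_totals}")
    # Total memory in each node.
    node_totals = {
        node: sum(sum(dimms) for dimms in cpus.values())
        for node, cpus in nodes.items()
    }
    if len(set(node_totals.values())) > 1:
        findings.append(f"memory per node differs: {node_totals}")
    return findings


inventory = {
    "node1": {"cpu1": [16, 16], "cpu2": [16, 16]},
    "node2": {"cpu1": [16, 16], "cpu2": [16]},  # one DIMM short
}
for finding in check_memory_balance(inventory):
    print(finding)
```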


Basic UEFI Performance Tuning


While performance tuning on the x3850X5 is a complicated topic beyond the scope of this document, we will give some general rule-of-thumb settings for the X5 that can be a good starting point for performance tuning the system. The following table gives general recommendations for the UEFI settings most commonly adjusted when optimizing for performance.

Table 4: UEFI Tuning Table

Setting                       Maximum Performance  Virtualization   Low Latency      Performance per Watt  HPC
TurboMode                     Enabled              Enabled          Disabled         Enabled               Disabled
TurboBoost                    Traditional          Power Optimized  N/A              Power Optimized       N/A
Processor Performance states  Enabled              Enabled          Disabled         Enabled               Disabled
C states                      Disabled             Enabled          Disabled         Enabled               Enabled
C1E state                     Disabled             Enabled          Disabled         Enabled               Enabled
Prefetcher                    Enabled              Enabled          Enabled          Enabled               Enabled
Hyperthreading                Enabled              Enabled          Disabled         Enabled               Disabled
Execute Disable               Disabled             Enabled          Disabled         Enabled               Disabled
Virtualization Extensions     Disabled             Enabled          Disabled         Enabled               Disabled
QPI Link Speed                Max Performance      Max Performance  Max Performance  Power Efficiency      Max Performance
IMM Thermal Mode              Performance          Performance      Performance      Normal                Performance
CKE Policy                    Disabled             Disabled         Disabled         Disabled              Disabled
DDR Speed                     Max Performance      Max Performance  Max Performance  Power Efficiency      Max Performance
Page Policy                   Closed               Closed           Adaptive         Adaptive              Closed
Mapper Policy                 Open                 Closed           Open             Closed                Open
Patrol Scrub                  Disabled             Disabled         Disabled         Disabled              Disabled
Demand Scrub                  Enabled              Enabled          Disabled         Enabled               Disabled
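The tuning table lends itself to a simple lookup that reports where a system deviates from a chosen profile. The Python sketch below encodes five rows of Table 4 as an illustration. The setting names are the table's labels, not ASU setting names, and the function names are hypothetical.

```python
PROFILES = ("Maximum Performance", "Virtualization", "Low Latency",
            "Performance per Watt", "HPC")

# A subset of Table 4; values are listed in the same order as PROFILES.
TABLE_4 = {
    "TurboMode": ("Enabled", "Enabled", "Disabled", "Enabled", "Disabled"),
    "C states": ("Disabled", "Enabled", "Disabled", "Enabled", "Enabled"),
    "Hyperthreading": ("Enabled", "Enabled", "Disabled", "Enabled", "Disabled"),
    "QPI Link Speed": ("Max Performance", "Max Performance", "Max Performance",
                       "Power Efficiency", "Max Performance"),
    "Page Policy": ("Closed", "Closed", "Adaptive", "Adaptive", "Closed"),
}


def recommended(profile: str) -> dict:
    """Return {setting: recommended value} for one profile column."""
    idx = PROFILES.index(profile)  # raises ValueError for an unknown profile
    return {setting: values[idx] for setting, values in TABLE_4.items()}


def deviations(current: dict, profile: str) -> dict:
    """Map each setting that differs to (current value, recommended value)."""
    return {
        setting: (current.get(setting), want)
        for setting, want in recommended(profile).items()
        if current.get(setting) != want
    }


current = recommended("Maximum Performance")
print(sorted(deviations(current, "HPC")))
```

Starting from the Maximum Performance column, the comparison against HPC flags only the rows in which those two columns differ.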

An important note in multi-node scenarios is that UEFI parameters can be broken down into those which impact both nodes (global parameters) and those which are node specific (nodal parameters). Global parameters include those which deal with Processor settings, OptionROM Execution Order and Memory Performance settings. Nodal parameters include DIMM enablement states and device and OptionROM enablement and control. It is vitally important to remember which settings are controlled on a per-node basis as opposed to globally, as all settings between nodes should match to ensure optimal performance. Additionally, when changing nodal values via the Advanced Settings Utility (ASU), one must specifically target the IMM of the node to be changed. All commands passed via the batch command are executed against the first IMM to which ASU connects. For more information on ASU, please reference the ASU User's Guide.

As the complexity of systems has increased, boot times have grown longer. This is due to such factors as larger memory array initialization and error-checking, more feature-rich OptionROMs and larger amounts of storage attached to a system which must be queried for metadata information. Improving boot time can be accomplished by implementing the following suggestions:
- An IMM reset adds to boot time; avoid resetting the IMM unless you need to. Remember: the IMM is the gatekeeper. Nothing happens until the IMM finishes initializing devices and starts UEFI.
- Try to do as much of your system configuration as possible outside of the actual F1 setup screens:
  - Use ASU for UEFI tuning and boot ordering
  - Use the MSM or MegaCLI for RAID configuration outside of the boot volume
  - Use in-band or out-of-band FW flashes as opposed to offline FW updates
- Disable the option ROMs of all devices that do not require their ROMs to be loaded (e.g., NICs if not booting via PXE, HBAs if not booting from SAN, etc.).
- Change the option ROM execution order so that devices with option ROMs enabled are at the top of the scan order.
- Place any PCIe devices to be scanned higher in the option ROM execution order than the onboard NICs.
- Remove from the boot order all devices that are not connected or will not be used as boot options.
- For a legacy OS only (any OS other than Windows 2008+ or SLES 11 SP1), add legacy mode as the first device in the boot order.
- Installing a UEFI-aware OS will serve to decrease boot times.
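The option ROM suggestions above amount to a small planning step: decide which ROMs can be disabled, then order the remaining ones with PCIe adapters ahead of the onboard NICs. The Python sketch below illustrates that step under stated assumptions. The Device record, its fields and the example device names are hypothetical, and the resulting plan would still have to be applied through F1 setup or ASU.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    rom_needed: bool           # e.g. True for a PXE-boot NIC or a boot-from-SAN HBA
    onboard_nic: bool = False  # True for the system's onboard NICs


def plan_option_roms(devices):
    """Return (roms_to_disable, rom_execution_order) as lists of device names."""
    to_disable = [d.name for d in devices if not d.rom_needed]
    enabled = [d for d in devices if d.rom_needed]
    # Enabled ROMs go first; among them, PCIe adapters precede onboard NICs.
    # list.sort is stable, so the original order is kept within each group.
    enabled.sort(key=lambda d: d.onboard_nic)
    order = [d.name for d in enabled] + to_disable
    return to_disable, order


devices = [
    Device("Onboard NIC 1", rom_needed=True, onboard_nic=True),  # PXE boot
    Device("FC HBA", rom_needed=True),                           # boot from SAN
    Device("10GbE NIC", rom_needed=False),                       # data only
]
print(plan_option_roms(devices))
```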


System Images
Figure 9

Diagonal front view of single node x3850 X5


Figure 10

Top down view of x3850 X5
Note: The heat shield above the CPUs has been removed. The heat shield is required for proper cooling and for the system cover to latch correctly.


Figure 11

Rear view of the x3850 X5
Note: The QPI wrap cards are not installed in this photo.


Appendix A: RAID Configuration


RAID configuration in the x3850X5 is accomplished either within the LSI WebBIOS or via the MegaCLI command line. It is important to note that for MegaRAID controllers, no boot devices will be available until Virtual Disks (i.e., arrays) have been created. A complete explanation of how to use either of these is beyond the scope of this document, but a quick walk-through of creating a RAID1 array is shown here.

From the WebBIOS:
1. Access the WebBIOS either through UEFI F1 setup or Ctrl + H during POST.


2. Once on the main screen of the WebBIOS, go to the Configuration Wizard.

3. Select New Configuration.


4. Select the HDDs you wish to add to the Virtual Disk (VD). Note: You can Ctrl + select the drives, but you must ensure the controller is deselected or you will get an error.
5. Once you have added the disks to the VD, accept the configuration.

6. Add the drives to a span and accept the configuration.
7. Once you have added the drives to the span, you will be able to create array types based on the number of drives in each VD and the number of spans you have added.


8. Select Next and then Accept the configuration.
9. Return to the Home screen and Exit the WebBIOS. Note: If WebBIOS access was gained via Ctrl+H, you will need to restart the system.

From the MegaCLI:
1. Determine the slot numbers of the drives in the system:
   LDPDInfo -aALL
   These are most likely 0 and 1 (and 2 for RAID5 volumes).
2. Determine the enclosure ID:
   CfgDsply -aALL
   This is most likely 0.
3. Create the logical drives on a controller:
   CfgLDAdd -R(0,1,5)[a1:b1,a2:b2,etc] -an
   Here -R is the RAID level, a:b is enclosure:slot and n is the controller number.
   For RAID1:
   CfgLDAdd -R1[0:0,0:1] WB ADRA Direct NoCachedBadBBU -a0
   This creates a RAID1 array on drives 0 and 1 on controller 0.
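When many systems are configured the same way, the CfgLDAdd argument string can be assembled from the enclosure and slot numbers gathered in the first two steps. The Python sketch below builds the string in the form shown above and does not run anything. The function name is hypothetical, and the MegaCLI executable name and whether each option needs a leading hyphen depend on the MegaCLI version installed, so both are left to the reader.

```python
def cfg_ld_add(raid_level, drives, controller=0,
               policies=("WB", "ADRA", "Direct", "NoCachedBadBBU")):
    """Build the CfgLDAdd argument string in the form used in this appendix.

    drives is a sequence of (enclosure, slot) pairs.
    """
    # Minimum drive counts are general RAID facts, not MegaCLI-specific limits.
    minimum = {0: 1, 1: 2, 5: 3}
    if raid_level not in minimum:
        raise ValueError("this sketch only covers RAID 0, 1 and 5")
    if len(drives) < minimum[raid_level]:
        raise ValueError(
            f"RAID {raid_level} needs at least {minimum[raid_level]} drives")
    drive_list = ",".join(f"{enc}:{slot}" for enc, slot in drives)
    parts = [f"CfgLDAdd -R{raid_level}[{drive_list}]", *policies, f"-a{controller}"]
    return " ".join(parts)


# RAID1 on enclosure 0, slots 0 and 1, controller 0.
print(cfg_ld_add(1, [(0, 0), (0, 1)]))
```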


Appendix B: Fixing Boot Issues with Windows 2008 in UEFI mode


Windows 2008 introduced support for two new features which are designed to allow it to take advantage of new technologies. The first was the Boot Configuration Data (BCD) store. The BCD store is the replacement for the venerable boot.ini file from previous versions of Windows. It holds all of the information describing the OS to be loaded, the partition structure of the disk the OS sits on and other parameters involved in loading the OS.

Windows 2008 also introduced support for UEFI, the replacement for BIOS. It is a 64-bit pre-OS environment responsible for loading operating systems. The newest iteration of IBM's rack and blade platforms (HS22, x3550M2, x3650M2) all ship with UEFI. With this support, Windows 2008 gained a number of features: access to up to 2TB of addressable memory space for loading device drivers, the requirement of using GPT volumes as boot devices and the eventual support for memory spaces larger than 2TB.

The introduction of these new features has led to some confusion for customers when they attempt to configure their systems with external storage. Windows 2008 and IBM's UEFI systems can have issues when external storage is attached after the OS has been installed. These issues arise primarily because the OS can re-enumerate the disks and volumes, leaving the BCD unable to locate the partition hosting the OS. One can verify that this is the issue affecting one's installation by removing the externally attached storage and rebooting the system. If the system loads the OS, then the following workflow will help to reset the BCD and return the system to normal working order:
1. Reboot the system
2. Go into F1
3. Select Boot Manager
4. Select Windows Boot Manager
5. Navigate to the HDD hosting the EFI bootloader and select it. This will cause the Windows Boot Manager to attempt to re-associate the bootloader with the disk.
6. Reboot

If the above procedure fails, then one will need to use a WinPE disk with a UEFI boot sector to rebuild the boot configuration datastore on the boot drive. The following is a brief list of commands. For more information visit the Microsoft Support portal.
1. Reboot
2. Select Startup Repair
3. Go to the command prompt
4. Type diskpart
5. Type list disk
6. Type select disk X where X=GPT drive with OS
7. Type list volume
8. Type select volume N where N=small 100MB volume w/ no drive letter
9. Type assign letter=Z
10. Type exit
11. Type Z: to switch to the Z: drive
12. Type bcdedit /export c:\bcd_backup

13. Type cd \efi\microsoft\boot
14. Type ren bcd bcd.old
15. Type bootrec /rebuildbcd
    a. If all goes correctly, you should see a screen similar to the one below:

16. Type X: to switch back to the X: drive
17. Type diskpart
18. Type list disk
19. Type select disk X where X=GPT drive with OS
20. Type list volume
21. Type select volume N where N=small 100MB volume w/ no drive letter
22. Type remove letter=Z
23. Type exit
24. Type exit
25. Restart the system


Links
Firmware and Driver Downloads
http://www-947.ibm.com/support/entry/portal/Downloads

x3850X5 Documentation
http://www-947.ibm.com/support/entry/portal/More_documentation_links/Hardware/Systems/System_x/System_x3850_X5/7145

x3850X5 Retain Tips (Problem Resolution)
http://www-947.ibm.com/support/entry/portal/Problem_resolution/Hardware/Systems/System_x/System_x3850_X5/7145

IBM Tools Center
http://www-947.ibm.com/support/entry/portal/docdisplay?brand=5000008&lndocid=TOOL-CENTER


Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States, other countries, or both: e-business logo IBM ServerGuide e-business logo Redbooks xSeries System x Intel and Xeon are trademarks of Intel Corporation in the United States, other countries, or both. VMware is a trademark of VMware, Inc. in the United States, other countries, or both. Microsoft, Windows, and Windows NT are trademarks of Microsoft Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. IBM is a trademark of International Business Machines Corporation, registered in many jurisdictions worldwide. Intel is a trademark or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Windows is a trademark of Microsoft Corporation in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies.
