
Education

XW0001 Servicing IBM System x Servers Part II

Study Guide
XW0001 Release 3.13 May, 2009

smpdr3.13-xw0001.pdf

May 2009

International Business Machines Corporation, 2009. All rights reserved.

IBM System x Service and Support Education, IBM Systems, Department EYGA, Building 203, Post Office Box 12195, Research Triangle Park, North Carolina 27709-2195

IBM reserves the right to change specifications or other product information without notice. This publication could include technical inaccuracies or typographical errors. References herein to IBM products and services do not imply that IBM intends to make them available in other countries. IBM provides this publication as is, without warranty of any kind, either expressed or implied, including the implied warranties of merchantability or fitness for a particular purpose. Some jurisdictions do not allow disclaimer of expressed or implied warranties. Therefore, this disclaimer may not apply to you.

Data on competitive products is obtained from publicly obtained information and is subject to change without notice. Please contact the manufacturer for the most recent information.

The following terms are trademarks or registered trademarks of IBM Corporation in the United States, other countries or both: Active Memory, Active PCI, AT, BladeCenter, the e-business logo, EasyServ, Enterprise X-Architecture, EtherJet, HelpCenter, HelpWare, IBM RXE-100 Remote Expansion Enclosure, IBM XA-32, IBM XA-64, IntelliStation, LANClient Control Manager, Memory ProteXion, NetBAY3, Netfinity, Netfinity Manager, Predictive Failure Analysis, RXE Expansion Port, SecureWay, ServeRAID, ServerProven, ServicePac, SMART Reaction, SMP Expansion Module, SMP Expansion Port, UM Services, Universal Manageability, Update Connector, Wake on LAN, XceL4 Server Accelerator Cache, XpandOnDemand scalability.

IBM Corporation Subsidiaries: Lotus, Lotus Notes, Domino, and SmartSuite are trademarks of Lotus Development Corporation. Tivoli and Planet Tivoli are trademarks of Tivoli Systems, Inc. LLC, Adobe, and PostScript are trademarks of Adobe Systems, Inc.

Intel Celeron, LANDesk, MMX, Pentium II, Pentium III, Pentium 4, SpeedStep, and Xeon are trademarks or registered trademarks of Intel Corporation. Linux is a trademark of Linus Torvalds. Microsoft Windows and Windows NT are trademarks or registered trademarks of Microsoft Corporation. Other company, product, and service names may be trademarks or service marks of others. For more information, visit: www.ibm.com/legal/copytrade/phtml


Preface
This publication is primarily intended for use by students enrolled in the course Servicing System x Servers Part II (XW0001). This document represents a training technique developed for and used by IBM and is not for sale. Portions of this document, such as foils, charts, and quizzes, may be copied and distributed if required to conduct a class properly. The instructor should exercise good judgment on handouts of this type. The complete document may not be copied for or sold to non-IBM personnel.

Please write your name and address below to personalize your copy.

Issued to: ____________________________________________________
Address:   ____________________________________________________
           ____________________________________________________
           ____________________________________________________

Current release date: May 2009
Current release level: 3.13
Test numbers for this guide are: xw0001r313

The information contained within this publication is current as of the date of the latest revision and is subject to change at any time without notice. Please forward all comments and suggestions regarding the course material, format, and content to your local IBM System x Service and Support Education country coordinator or contact.


Table of Contents

Preface
Table of Contents
Introduction to the Study Guide
Topic 1 Objectives and Agenda
Topic 2 High-performance System x Server Family Overview
Topic 3 RAID Adapters and Enclosures
Topic 4 High-performance Technologies Review
Topic 5 Working With Scalable Systems
Topic 6 Dynamic System Analysis
Topic 7 Problem Solving
Topic 8 Support References

Introduction to the Study Guide


Purpose

The purpose of this guide is to provide you with the necessary documentation to support the learning experience so that you can successfully fulfill the objectives defined for this course. This guide contains a number of lessons based on the instructor's presentation material, supplemental student notes within each lesson, and appropriate additional material (in the form of appendices) as required by the course learning objectives.

Limitations

1. No computer game playing or copying of games is allowed in class.
2. Do not copy recordable media on any of the systems in the lab. Adhering to this rule will keep viruses from spreading and ensure that the media that have been created especially for your systems are retained.
3. Do not remove any materials from the classroom other than those given to you by the instructor.
4. Do not remove the covers from your computer. If you encounter any problems with your system, please speak with your instructor.


XW0001 - Servicing IBM System x Servers Part II

Topic 1 Objectives and Agenda

Welcome!

Topic Objectives

By the end of this topic, you will be able to:


- Describe the overall course objectives
- Explain the course prerequisites
- Understand the course agenda

Before we begin, we need to establish some basics for the course. You need to understand what the course objectives are so you can be sure you are taking the right class. You need to understand what we expect of you by way of previous knowledge. We also need to explain the course agenda so you will know what is about to happen.

Course Objectives

By the end of this course, you will be able to:


- Identify the serviceability features of System x high-performance servers
- Describe the advanced technologies used in System x servers and their service implications
- Describe the management characteristics of System x servers
- Perform a series of setup, configuration and troubleshooting tasks on System x servers and associated peripherals

This course concentrates on problem determination and the tools that can be used to troubleshoot IBM System x servers. Before you start the practical exercises, however, we will discuss some of the key technologies used in IBM System x servers. The lab exercises revolve around best practice in dealing with IBM System x server problems. This combination of remote lab and paper exercises will enable you to become familiar with the high end of the System x server range and with how to service these servers. You will also see some of the fault-tolerant and redundant features of the servers and practice working with servers that have suffered a component failure but are still running the NOS.

Introductions

Your instructor is
-Your instructor will now introduce herself/himself

You are
-Your instructor will ask you to introduce yourself
  Tell the class what you do (not what your job title is)
  Tell the class how you got into this role
  Tell the class anything else you wish to share

We will be together for some time. It will be useful for us all to get to know each other.

Course Prerequisites

To make the most of this class, you should have completed the following education prior to attending this course:
-Strongly recommended (mandatory in some locations)
  A+ Certification
  Server+ Certification

-Required
  Servicing IBM xSeries Servers Part I (XW2001 R300)

In some locations, if you work with IBM System x server products, you are required to be A+ and Server+ certified. Even if this is not mandatory where you are, we strongly recommend that you are A+ Certified and Server+ certified. Servicing IBM System x Servers Part I is REQUIRED prior to attending this class. XW2001R300 is a self-paced, CD-ROM course. If you have not completed this training prior to attending, you will not get the most from this class. As there is a test at the end of this class, this may impact your ability to pass the end-of-class mastery test.

IBM System x Curriculum

Worldwide field support training roadmap for System x warranty authorization


Level 1 Industry certification (entry point)
- Server+ Certification
- Compulsory in some locations, highly recommended everywhere else
- Approved service providers may stop here if not providing warranty service on BladeCenter or high performance System x products

Level 2 Approved warranty service provider (high volume System x servers)
- XW2001 Core knowledge (self-study)
- XW2xxx Service update CD-ROMs

Level 3 Approved warranty service provider (high performance System x servers)
- XW0001 High perf skill (hands-on)
- XW2xxx Service update CD-ROMs

This chart identifies the position of this course in the IBM System x server service curriculum. This course is a mandatory module towards warranty approval for high-performance System x servers. Service update CD-ROMs are issued periodically to inform service technicians about new products that are announced.

Key Exit Skills

Key practical exit skills include being able to:


- Identify the physical components of the IBM System x servers covered here and how to work with them using the support documentation
- Use the features of multiple IBM System x Service Processor (SP) technologies
- Configure and bring up a scaled, multi-node IBM System x server partition

In addition, you will be able to:


- Recognize and use server-specific support tools
- Install a NOS on selected IBM System x servers
- Understand how IBM System x servers work with the Microsoft Windows Network Operating System (NOS) and how they behave under certain failure conditions

This course provides practical experience through hands-on exercises. This is a list of the key exit skills that you should be able to perform after completing this course.

Lesson Topics in This Course

Lesson topics
Topic 1: Objectives and agenda
Topic 2: High-performance System x Server Family Overview
Topic 3: RAID Adapters and Enclosures
Topic 4: High-performance Technologies Review
Topic 5: Working With Scalable Systems
Topic 6: Dynamic System Analysis
Topic 7: Problem Solving
Topic 8: Support references

Lab exercises
Details are on the next page

Test
Here are the lesson topics in this guide.

Lesson Topics in This Course

Lab exercises
Lab 1 - Locations, Removals, Flash Update and Diagnostics
Lab 2 - Remote Desktop Connection & BIOS Setup
Lab 3 - Utilizing RCM & Virtual Console software
Lab 4 - Preboot DSA / Diagnostics
Lab 5 - Updating with IBM UpdateXpress Service Packs
Lab 6 - ServeRAID Mgr & Spanned Arrays
Lab 7 - MegaRAID Storage Manager
Lab 8 - Utilizing BMC and DSA to gather the Facts
Lab 9a - Scale System x460
Lab 9b - Scale a multi-node x3950 M2

Here are the lesson topics in the associated lab guide.

Summary - Topic 1

This topic has covered the following:


- Described the course objectives
- Explained the course prerequisites
- Established the course agenda and key exit skills

We have now outlined the contents and scope of this course. The next topic is a review of the product and how the components fit together.


Topic 2 High-performance System x Server Family Overview

We will provide an overview of the IBM System x and xSeries high performance products.

Topic Objectives

By the end of this topic, you will be able to:


- Describe the System x high-performance server family of products
- List the non-scalable and scalable models
- Identify the system management components of IBM System x high-performance servers
- Describe the server security software features

This topic describes high-performance System x server family offerings and some common options.

Non Scalable IBM System x 3755 Overview

- 2-4 processors, using AMD quad core Opteron processors with HyperTransport link
- Eight DIMM slots per processor card
- Dual Broadcom 5708c Gigabit Ethernet
- Optional redundant power and cooling
- Standard DVD drive
- Four 3.5 in. hot-swap (HS) SAS HDD bays
- 4 PCI Express, 2 PCI-X and 1 HTX I/O slots
- IPMI 2.0 BMC with optional RSA II Slimline refresh
- SAS chipset supporting RAID 0, 1 or 10
- Optional RAID 5 upgrade
- RoHS compliant
- Server Enablement Suite
- Support for Windows, Linux, VMWare and Netware

The x3755 is a low-cost, high-end, AMD Dual Core Opteron based server. The system supports up to 4 Opteron revision F processors. Each processor card has up to 8 DIMM slots; using 4 GB DIMMs, the system supports up to 128 GB of memory. The I/O slot mixture is 4 PCI-E slots, 2 PCI-X slots and 1 HTX slot. The x3755 is a RoHS compliant system.

Tip: Processors must be installed in order 1 through 4.

Tip: The processor complex uses a passthru card. Older systems may have had only one processor installed; current models ship with two processors as standard. If there is no processor/memory card in processor slot 2, there is no path to the ServerWorks HT2100 B PCI-E bridge unless the passthru card is fitted, so the passthru card must be present in processor slot 2 when no processor is installed there. If processor 2 is present, no passthru card is used in that slot and none is shipped. However, if processors 1, 2 and 3 are installed, a passthru card must be installed in processor socket 4.

The Baseboard Management Controller (BMC-H8) is a system environmental monitor and controller. It performs low-level system monitoring and LED control functions, using multiple I2C bus connections to communicate out-of-band with other onboard devices. The optional RSA II Slimline Refresh systems management adapter adds advanced service processor alert notification and remote connectivity.
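The memory ceiling and the passthru card rule in these notes can be checked with a short Python sketch. It uses only the figures quoted above (4 processors, 8 DIMM slots per processor card, passthru card in slot 2 or slot 4); the function names are hypothetical and are not part of any IBM tool.

```python
# Figures come from the notes above; the function names are illustrative only.
MAX_PROCESSORS = 4
DIMM_SLOTS_PER_PROCESSOR = 8


def max_memory_gb(dimm_size_gb, processors=MAX_PROCESSORS):
    """Total memory when every DIMM slot on every processor card is filled."""
    return processors * DIMM_SLOTS_PER_PROCESSOR * dimm_size_gb


def passthru_slot(installed_processors):
    """Return the processor slot that needs a passthru card, or None.

    Processors are installed in order 1 through 4, so only the count matters.
    """
    if not 1 <= installed_processors <= MAX_PROCESSORS:
        raise ValueError("the x3755 holds one to four processors")
    # One processor: passthru card in slot 2. Three: passthru card in slot 4.
    return {1: 2, 3: 4}.get(installed_processors)


if __name__ == "__main__":
    print(max_memory_gb(4))  # 128
    for count in range(1, MAX_PROCESSORS + 1):
        print(count, passthru_slot(count))
```

With 4 GB DIMMs the first call reproduces the 128 GB figure from the notes; two-processor and four-processor configurations return None because no passthru card is needed.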

System x3800, x3850 and x3950 Product Overview

- 3U or 7U tower or rack
- XA-64e Enterprise X-Architecture chipset
- 1-way to 4-way Intel Xeon MP (EM64T processors), up to 32-way on x3950
- PC2-3200 DDR2 SDRAM, 2-way interleaving
- Disk support
  DVD-ROM standard
  Up to twelve 3.5 in. Serial Attached SCSI (SAS) hot-swap disks (x3800)
  Up to six 2.5 in. SAS hot-swap disks (x3850 and x3950)
  ServeRAID 8i (optional), RAID 0/1/5
- Up to two 1300W (x3850 and x3950) or three 770W (x3800) redundant hot-swap power supplies
- Active Memory (support for ChipKill, Memory ProteXion and Memory Mirroring)
- Remote Supervisor Adapter II slim line (optional on x3800 and x3850, standard on x3950)
- Broadcom 5704 dual port Ethernet
- Active PCI-X 2.0 slots, 64-bit/266MHz
- 3-year, next business day warranty

The IBM Enterprise X-Architecture range of servers offers 3U and 7U rack-model servers for high-volume network transaction processing. These high-performance, symmetric multiprocessing (SMP) servers are ideally suited for networking environments that require superior microprocessor performance, input/output (I/O) flexibility, and high manageability.

EM64T is a 64-bit extension to the Intel IA-32 architecture. It is compatible with legacy IA-32 software while enabling new software to access a larger memory address space. EM64T introduces a new operating mode which includes two sub-modes:
(1) Compatibility mode enables a 64-bit operating system to run most existing legacy 32-bit software unmodified.
(2) 64-bit mode enables a 64-bit operating system to run applications written specifically to access 64-bit address space.

The System x3800 and x3850 offer the Remote Supervisor Adapter II slim line (RSA II) as an option. This adapter significantly enhances the tools available to the service technician for detecting and correcting problems with the server. It supports a web interface to the error logs, dramatically simplifying the troubleshooting task without disturbing the workings of the host server. If the RSA II is connected to an Ethernet network, the system error logs can be viewed and manipulated from a ThinkPad connected to the Service Processor through the LAN (using a web browser).

x3850/3950 M2 Product Overview

- XA-64e 4th generation chipset
- Four processor sockets
- Intel Xeon dual-core, quad-core and six-core processors
- Up to four memory cards
- Up to 8 DIMMs per memory card
- PC2-5300 DDR II
- Disk support
  DVD-ROM standard
  Integrated LSI 1078 SAS RAID controller (supports RAID 0 & 1)
  Up to four 2.5 in. SAS hot-swap disks
  Optional ServeRAID MR10
- Two standard 1440 watt redundant, hot-swap power supplies
- Seven I/O slots:
  Seven PCI-E x8 slots
  Two Active/Hot-Swap
- Active Memory: ChipKill, Memory ProteXion & Memory Mirroring
- Dual embedded Broadcom 5709 Ethernet
- Remote Supervisor Adapter II standard
- Chassis scalability supported
- One or three year, next business day warranty

The IBM System x3850 M2 and x3950 M2 are high-performance, four-socket servers featuring fourth-generation Enterprise X-Architecture (EXA). As shipped, the x3850 M2 is non-scalable. The x3950 M2 contains advanced technology that combines scalable SMP power, PCI-E expansion, fourth-generation EXA, high availability, scalability, and substantial internal data storage capacity. This slide summarizes some of the features of the IBM System x3850 M2 and x3950 M2.

The x3850 M2 supports scaling with the installation of the IBM ScaleXpander Option kit. It becomes an x3950 M2 when scaled.

A multi-node configuration interconnects multiple servers. Each multi-node configuration can have one or more scalable partitions. Each scalable partition supports an independent operating system installation. The scalable partition uses a single, contiguous memory space and provides access to all associated adapters and hard disk drives. PCI slot numbering starts with the primary node and continues with the secondary nodes, in numeric order of the logical node ID. The scalability discussion is continued later in this course.
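The slot numbering rule for a scalable partition can be illustrated with a small Python sketch. It assumes seven PCI-E slots per node (the figure on the slide) and, purely for illustration, that partition-wide slot numbers start at 1 and run contiguously; the function name and the node IDs in the example are hypothetical.

```python
SLOTS_PER_NODE = 7  # seven PCI-E x8 slots per chassis, per the slide


def number_pci_slots(primary_id, secondary_ids, slots_per_node=SLOTS_PER_NODE):
    """Map each partition-wide slot number to (logical node ID, local slot).

    The primary node comes first; secondary nodes follow in numeric order
    of their logical node ID. Starting at 1 is an assumption of this sketch.
    """
    order = [primary_id] + sorted(secondary_ids)
    numbering = {}
    next_slot = 1
    for node_id in order:
        for local_slot in range(1, slots_per_node + 1):
            numbering[next_slot] = (node_id, local_slot)
            next_slot += 1
    return numbering


if __name__ == "__main__":
    slots = number_pci_slots(primary_id=0, secondary_ids=[2, 1])
    print(slots[1])   # (0, 1)
    print(slots[8])   # (1, 1)
    print(slots[21])  # (2, 7)
```

In this three-node example, slot 8 of the partition is the first slot of the secondary node with the lowest logical node ID.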

Enterprise X-Architecture Overall Design

Scalable systems use the IBM Enterprise X-Architecture (EXA) and IBM XA-64e fourth-generation chipset

Note. Third generation EXA chipsets use system memory for L4 cache

Scalable systems need a sophisticated chipset to enable processors and memory to be shared across multiple chassis under a single OS. This diagram shows the overall schematic of the EXA chipset. The processor and memory bus can be extended with the use of scalability cables, effectively joining the processors, memory and I/O into a single hardware set.

Architecture x3800, x3850, x3950, x3950E
Not present on x3800 or x3850

The x3850, x3950 and x3950E use the third generation of the IBM XA-64e chipset. The architecture consists of the following components:
- One to four Xeon MP processors
- One Memory and I/O Controller (MIOC)
- Two PCI bridges

Each memory port out of the memory controller has a peak throughput of 5.33 GBps. DIMMs are installed in matched pairs (two-way interleaving) to ensure that the memory port is fully utilized. Peak throughput for each PC2-3200 DDR2 DIMM is 2.67 GBps. (The DIMMs are run at 333 MHz to remain in sync with the throughput of the front-side bus.)

In addition, there are four memory ports. Spreading installed DIMMs across all four memory ports can improve performance, because the four independent memory ports (memory cards) provide simultaneous, concurrent access to memory. With four memory cards installed (and DIMMs in each card), peak memory bandwidth is 21.33 GBps.

The memory controller routes all traffic from the four memory ports, the two CPU ports and the two PCI bridge ports. The memory controller also has embedded DRAM, which in the x366, x3800 and x3850 holds a snoop filter lookup table. This filter ensures that snoop requests for cache lines go to the appropriate CPU bus and not both of them, thereby improving performance.

One PCI bridge supplies four of the six 64-bit 266 MHz PCI-X slots on four independent PCI-X buses. The other PCI bridge supplies the other two PCI-X slots (also 64-bit, 266 MHz), plus all the onboard PCI devices.

This illustration details the interconnect and board components. CPLD = Complex Programmable Logic Device. BMC = Baseboard Management Controller.
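The three bandwidth figures in these notes are related by simple multiplication, which the Python sketch below works through. It assumes an 8-byte (64-bit) DIMM data path and takes the quoted "333 MHz" as 333.33 million transfers per second; both are assumptions made so the arithmetic reproduces the quoted values, and the function names are illustrative.

```python
BYTES_PER_TRANSFER = 8       # assumed 64-bit DIMM data path
DIMM_RATE_MTS = 1000 / 3     # the "333 MHz" in the notes, taken as 333.33
DIMMS_PER_PORT = 2           # matched pair, two-way interleaving
MEMORY_PORTS = 4             # one port per memory card


def dimm_gbps():
    """Peak throughput of a single DIMM in GBps."""
    return DIMM_RATE_MTS * BYTES_PER_TRANSFER / 1000


def port_gbps():
    """Peak throughput of one memory port with an interleaved DIMM pair."""
    return dimm_gbps() * DIMMS_PER_PORT


def system_gbps(ports=MEMORY_PORTS):
    """Peak memory bandwidth with DIMMs spread across the given ports."""
    return port_gbps() * ports


if __name__ == "__main__":
    print(f"{dimm_gbps():.2f} GBps per DIMM")          # 2.67
    print(f"{port_gbps():.2f} GBps per memory port")   # 5.33
    print(f"{system_gbps():.2f} GBps with four cards") # 21.33
```

Calling system_gbps(ports=1) shows why populating a single memory card limits peak bandwidth to one quarter of the four-card figure.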

Architecture x3850M2, x3950M2

The x3850 M2 and x3950 M2 use the fourth generation of the IBM XA-64e chipset. The architecture consists of the following components:
- One to four Xeon dual-core or quad-core processors
- 4 Memory and I/O Controller (MIOC)
- Eight high speed memory buffers
- Two PCI Express bridges
- One South bridge

PCI bridge 1 supplies four of the seven PCI Express x8 slots on four independent PCI Express buses. PCI bridge 2 supplies the other three PCI Express x8 slots plus the onboard SAS devices, including the optional ServeRAID-MR10k. A separate South bridge supplies all the other onboard PCI devices, such as the USB ports, onboard Ethernet and the standard RSA II.

As this is a multi-board system (processor board, I/O board and RSA II adapter), hardware replacements require careful thought to ensure a working system when a board is replaced. Code is located on all major boards in the system, and this code must be matched for release levels to ensure proper operation.

The components represented by the black boxes require a BIOS/firmware update after parts replacement:
- CPU card: BIOS, BMC, and FPGA code. An FPGA (Field Programmable Gate Array) is very similar to the CPLD in previous systems.
- TPM (Trusted Platform Module)
- I/O card: SAS, Ethernet, FPGA, and DSA (Diagnostics) code
- RSA II adapter
- Broadcom 5709 Ethernet controller
- ServeRAID-MR10k SAS/SATA controller (if present)

Processor Architecture Single to Dual Core

Up to this point, we have sometimes referred to these x86 platforms as 2-socket, 4-socket, 8-socket, or 16-socket configurations. Historically, we have been used to referring to these systems as n-way systems. Due to current trends in microprocessors, the term n-way or n-CPU could become misleading if not used in the proper context.

The dual-core processors in the x3950 are the first Intel processors to offer multiple cores. A dual-core processor is similar in concept to a two-way system, except that the two cores are integrated into one silicon die. This brings the benefits of two-way SMP with less power consumption and faster data throughput between the two cores. To keep power consumption down, the resulting core frequency is lower, but the additional processing capacity means an overall gain in performance.

In addition to the two cores, the dual-core processor has separate L1 instruction and data caches for each core, as well as separate execution units (integer, floating point, and so on), registers, issue ports, and pipelines for each core. A dual-core processor achieves more parallelism than Hyper-Threading Technology, because these resources are not shared between the two cores. Estimates are that there is a 1.2 to 1.5 times improvement when comparing the dual-core Xeon MP with the current single-core Xeon MP.

With double the number of cores for the same number of sockets, it is even more important that the memory subsystem is able to meet the demand for data throughput. The 21 GB/sec peak throughput of the X3 Architecture of the x3950 with four memory cards is well-suited to dual-core processors.

For additional information, refer to the IBM Redbook Virtualization on the IBM System x3950 Server, Publication # SG 24-790-00.

Processor Architecture Dual Core to Quad Core

A dual-core processor is similar in concept to a two-way SMP system, except that the two processors, or cores, are integrated into one silicon die. This brings the benefits of two-way SMP with less power consumption and faster data throughput between the two cores. To keep power consumption down, the resulting core frequency is lower, but the additional processing capacity means an overall gain in performance.

The quad-core processors add two more cores onto the same die. Hyper-Threading Technology is not supported. Each core has separate L1 instruction and data caches, as well as separate execution units (integer, floating point, and so on), registers, issue ports, and pipelines. A multi-core processor achieves more parallelism than Hyper-Threading Technology, because these resources are not shared between the cores.

With double and quadruple the number of cores for the same number of sockets, it is even more important that the memory subsystem is able to meet the demand for data throughput. The 34.1 GBps peak throughput of the x3850 M2 and x3950 M2 eX4 Architecture with four memory cards is well-suited to dual-core and quad-core processors.

1066 MHz front-side bus

The Xeon MP uses two 266 MHz clocks, out of phase with each other by 90 degrees, and uses both edges of each clock to transmit data. A quad-pumped 266 MHz bus therefore results in a 1066 MHz front-side bus. The bus is eight bytes wide, which means it has an effective burst throughput of 8.53 GBps. This can have a substantial impact, especially on TCP/IP-based LAN traffic.
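The front-side bus figures follow from the clocking scheme described in the notes, as this short Python sketch shows. It assumes the nominal "266 MHz" clock is 266.67 MHz (800/3), which is what makes the quoted 1066 MHz and 8.53 GBps values come out; the function names are illustrative.

```python
BASE_CLOCK_MHZ = 800 / 3   # nominal "266 MHz", taken as 266.67 (assumption)
CLOCKS = 2                 # two clocks, 90 degrees out of phase
EDGES_PER_CLOCK = 2        # data is sent on both edges of each clock
BUS_WIDTH_BYTES = 8        # eight-byte-wide front-side bus


def fsb_rate_mts():
    """Effective transfer rate of the quad-pumped bus, in MT/s."""
    return BASE_CLOCK_MHZ * CLOCKS * EDGES_PER_CLOCK


def fsb_burst_gbps():
    """Effective burst throughput of the front-side bus, in GBps."""
    return fsb_rate_mts() * BUS_WIDTH_BYTES / 1000


if __name__ == "__main__":
    print(int(fsb_rate_mts()))        # 1066
    print(f"{fsb_burst_gbps():.2f}")  # 8.53
```

Two clocks times two edges gives four transfers per base clock cycle, which is where the term "quad-pumped" comes from.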

Manageability Features

All of the following are supported:


- Active PCI-X and PCI-Express x8 half-length slots (system specific)
- Predictive failure analysis (PFA) on processors, memory, disks, fans, and power supplies
- Integrated Baseboard Management Controller (BMC)
- RSA II or RSA II slim-line (standard on some models, optional on others)
- Light Path Diagnostics

This group of servers offers a high degree of redundancy, fault tolerance and manageability hardware. The BMC and RSA II are discussed in detail later.

Server Security Software (SSS) Trusted Platform Module (TPM)

Server Security Software


SSS is a set of software tools that allow a user access to the basic cryptographic key and identity capabilities of a TPM. Those familiar with the Client Security Software (CSS) will find a strong family resemblance.

Trusted Platform Module


TPM is a hardware chip used to improve the security and trustworthiness of a computer. Prior versions of the technology such as the IBM 4758 offered higher levels of security at much higher cost. The TPM chip brings security to the mass market.

Trusted Platform Module (TPM) Management is a new feature offered by Microsoft Windows. This feature will be available after the release of Windows Server 2008 Network Operating System. The feature set includes the TPM Management console, and an API called TPM Base Services (TBS). This architecture provides an infrastructure that allows Windows-based applications to use and share the TPM. TPM has the ability to create cryptographic keys and encrypt them so that they can be decrypted only by the TPM. This process, often called "wrapping" or "binding" a key, can help protect the key from disclosure. The TPM can also seal and unseal data generated outside of the TPM. With this sealed key and software like Microsoft Windows BitLocker Drive Encryption, you can lock data until specific hardware or software conditions are met. With a TPM, private portions of key pairs are kept separated from the memory controlled by the operating system.

Summary - Topic 2

This topic has enabled you to:


- Describe the System x high-performance server family of products
- List the non-scalable and scalable models
- Identify the system management components of IBM System x high-performance servers
- Describe the server security software features

This topic provided an overview of the models in this course.


Topic 3 RAID Adapters and Enclosures

Here, we will discuss IBM RAID adapters and enclosures commonly associated with System x servers.

Topic Objectives

By the end of this topic, you will be able to:


- Describe the RAID levels offered by the IBM ServeRAID adapter family
- List the ServeRAID adapter family of products
- Describe the IBM storage enclosures commonly found in the System x server environment

This topic describes the RAID levels, the ServeRAID adapter family, and the associated storage enclosures.

RAID Terminology

Array
A group of physical disks

Logical Drive
Who has control?

An array is a grouping of physical disks. A logical drive is a term given to part or all of an array. An array can contain multiple logical drives. Logical drives are recognized by the OS as physical disks.


RAID 0 (Striping)

Data distributed evenly across all disks


- No redundancy or error correction
- Fastest performance for multiple concurrent requests

          Disk 1      Disk 2      Disk 3      Disk 4
Stripe 1  Block 1     Block 2     Block 3     Block 4
Stripe 2  Block 5     Block 6     Block 7     Block 8
Stripe 3  Block 9     Block 10    Block 11    Block 12
...       ...         ...         ...         ...
Stripe x  Block n-3   Block n-2   Block n-1   Block n

RAID-0 stripes (or spreads) data across multiple disk drives without parity protection in order to maximize DASD performance. Performance is improved with larger files because reads and writes are overlapped across all disks. An additional benefit of RAID-0 is "drive spanning". With data spread across multiple drives in the array, the logical drive size is the sum of the individual drive capacities. RAID-0 is the only level of RAID that does not provide any type of fault tolerance. In other words, the failure of one drive will cause the entire disk subsystem to fail.
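The striping pattern in the diagram can be expressed as a simple calculation. The Python sketch below maps a block number to its stripe and disk for the four-disk layout shown, using 1-based numbering to match the slide; the function names are hypothetical and no real controller API is implied.

```python
def locate_block(block, disks=4):
    """Return (stripe, disk) for a 1-based block number in a RAID-0 array."""
    if block < 1:
        raise ValueError("block numbers start at 1")
    stripe, disk = divmod(block - 1, disks)
    return stripe + 1, disk + 1


def raid0_capacity(drive_sizes_gb):
    """Logical drive size: the sum of the member drive capacities."""
    return sum(drive_sizes_gb)


if __name__ == "__main__":
    print(locate_block(1))    # (1, 1)
    print(locate_block(6))    # (2, 2)
    print(locate_block(12))   # (3, 4)
    print(raid0_capacity([73, 73, 73, 73]))  # 292
```

Because consecutive blocks land on different disks, a large sequential transfer keeps all four drives busy at once, which is the performance benefit the notes describe.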

RAID 1 (Transparent Mirroring)

Data written simultaneously to two identical disks


- Faster than a single disk for reads
- Reliability cost is 100% of protected drives

Disk 1      Disk 2
Data        Mirrored Data
(Disk Duplexing)

RAID-1 is either disk mirroring or disk duplexing. Disk mirroring involves duplicating the data from one disk onto a second disk using a single controller. Disk duplexing is the same as mirroring in all respects, except that the disks are attached to separate controllers. With duplexing, the server can tolerate the loss of one disk controller or one disk without the loss of the disk subsystem's availability or the customer's data. Since each disk is attached to a separate controller, performance and throughput may be further improved. NetWare splits seeks, reading half from the data drive and half from the mirrored drive.

RAID 1 Enhanced

Allows disk mirroring with an odd number of disks
Stripes data and mirrored data across ALL disks
Approximates RAID-0 performance

                   Disk 1      Disk 2      Disk 3
Data Stripe        Data 1      Data 2      Data 3
Mirrored Stripe    Mirror 3    Mirror 1    Mirror 2
Data Stripe        Data 4      Data 5      Data 6
Mirrored Stripe    Mirror 6    Mirror 4    Mirror 5
. . . .

RAID-1E offers an enhanced version of RAID-1 that combines mirroring with data striping. The first stripe holds data and the second holds the mirrored copy of that data, offset by one drive. This improves performance and adds flexibility when configuring mirroring across more than two drives, including an odd number of drives.
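The "offset by one drive" rule can be illustrated with a short Python sketch that generates the alternating data and mirror stripes. It assumes the layout shown in the RAID 1 Enhanced diagram (the mirror of a block sits on the next drive, wrapping around); the function name is hypothetical and the real adapter firmware may differ.

```python
from __future__ import annotations


def raid1e_layout(num_disks: int, data_stripes: int) -> list[list[str]]:
    """Build rows of a RAID-1E layout: each data stripe is followed by a
    mirrored stripe whose contents are shifted one drive to the right."""
    if num_disks < 3:
        raise ValueError("this sketch assumes more than two drives")
    rows: list[list[str]] = []
    first_block = 1
    for _ in range(data_stripes):
        data_row = [f"Data {first_block + d}" for d in range(num_disks)]
        mirror_row = [""] * num_disks
        for d in range(num_disks):
            # The copy of the block on drive d lands on drive d + 1 (wrapping).
            mirror_row[(d + 1) % num_disks] = f"Mirror {first_block + d}"
        rows.append(data_row)
        rows.append(mirror_row)
        first_block += num_disks
    return rows


if __name__ == "__main__":
    for row in raid1e_layout(num_disks=3, data_stripes=2):
        print("  ".join(f"{cell:<10}" for cell in row))
```

Because of the offset, a block and its mirror never share a drive, so the loss of any single drive leaves one copy of every block available.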

RAID 5 (Data Striping with Parity)

Stripes data and parity information, sectors at a time, across all disks
Parity information is also striped across all disks
Requires a minimum of three disks
If any one disk fails, the data can still be accessed

            Disk 1                         Disk 2                    Disk 3                    Disk 4
Stripe 1    Block 1                        Block 2                   Block 3                   Checksum of blocks 1-3
Stripe 2    Block 4                        Block 5                   Checksum of blocks 4-6    Block 6
Stripe 3    Block 7                        Checksum of blocks 7-9    Block 8                   Block 9
. . . .
Stripe x    Checksum of blocks n-2 to n    Block n-2                 Block n-1                 Block n

RAID-5 spreads both the data and the parity (checksum) information evenly across the disks, one block at a time, to ensure maximum read performance when accessing large files and to improve array performance in a transaction processing environment. This removes the bottleneck of storing all of the parity data on one drive.

-High transaction rate (good for random transactions)
-Drives operate independently (they do not need to be in sync)
-Better server performance than RAID 2, 3 and 4
-Low reliability cost: capacity of one drive per array

The equivalent of one drive per array is used for the parity data, regardless of the size of the array. The capacity left for data storage is therefore always N - 1 drives, where N is the number of drives in the array.
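The reason a RAID-5 array can still serve data after one disk fails can be shown with a small Python sketch. It assumes the parity is a byte-wise XOR of the data blocks in a stripe, which is how RAID-5 parity is commonly computed; the study guide itself only says "checksum", and the function names here are hypothetical.

```python
from __future__ import annotations

from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks: list[bytes]) -> bytes:
    """Recover the one missing block of a stripe from the surviving data
    blocks plus the parity block (XOR of everything that is left)."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]      # Block 1, Block 2, Block 3
    parity = xor_blocks(stripe)               # Checksum of blocks 1-3
    # Simulate losing the disk that holds Block 2.
    recovered = rebuild_missing([stripe[0], stripe[2], parity])
    print(recovered == stripe[1])
```

Only one block per stripe can be rebuilt this way, which is why a second drive failure in the same array causes data loss.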

RAID 5 Enhanced

Stripes data, sectors at a time, across all disks with an additional stripe for parity information and hot-spare space
Requires a minimum of four disks
If any one disk fails, the data and parity information will be redistributed on the remaining drives (Logical Drive Migration)
Capacity of n - 2 (n = number of disks)

            Disk 1     Disk 2     Disk 3     Disk 4
Stripe 1    Parity     Data 1     Data 2     Data 3
Stripe 2    Data 4     Parity     Data 5     Data 6
Stripe 3    Data 7     Data 8     Parity     Data 9
Stripe x    Data 10    Data 11    Data 12    Parity
Spare       HSP        HSP        HSP        HSP

RAID 5E is firmware-specific. You can think of RAID 5E as RAID 5 with a built-in spare drive. Reading from, and writing to, four disk drives is more efficient than three disk drives and therefore improves performance. Additionally, the spare drive is actually part of the RAID 5E array. With such a configuration, you cannot share the spare drive with other arrays. If you want a spare drive for any other array, you must have another spare drive for those arrays. Like RAID 5, RAID 5E stripes data and parity across all of the drives in the array. When an array is assigned RAID 5E, the capacity of the logical drive is reduced by the capacity of two physical drives in the array (that is, one for parity and one for the spare). RAID 5E is a good choice because it offers both data protection and increased throughput, in addition to the built-in spare drive. RAID 5E gives you better utilization of the array's physical capacity than RAID 1, but RAID 1 offers better performance. RAID 5E was superseded by RAID 5EE, in which room for the hot-spare space (HSP) is left in every stripe; most prefer the RAID 5EE implementation.

RAID 6 (Block Striping with Double Distributed Parity)

RAID 6 reserves the equivalent of two disks in the array for parity information and stores two separately calculated checksums on different disks
Can survive the loss of two disks before data loss occurs
Block striping with double distributed parity
Two separate parity checksums to survive two disk failures

[Figure: RAID 6 stripe layout (Stripe 1, Stripe 2, Stripe 3, ... Stripe x) showing data blocks and two separate parity blocks per stripe, distributed across the disks]

RAID 6 is a newly emerging RAID level that has been designed to address modern data storage needs. As RAID arrays increase in size and complexity, the ability to survive more than one disk failure becomes more important to avoid catastrophic data loss. RAID 6 is:

-Block striping with double distributed parity
-Two separate parity checksums to survive two disk failures

RAID 6 reserves the equivalent of two disks in the array for parity information and stores two separately calculated checksums on different disks, so that the array can lose two disks before data loss occurs.
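The capacity rules given for the RAID levels in this topic can be collected into one small Python sketch. It assumes all drives in the array are the same size. The RAID-1E figure (half of the drives) is inferred from the fact that every block is mirrored, and the minimum of three drives used for RAID 6 is only a placeholder so the result stays positive; neither value is stated in this guide. All names are hypothetical.

```python
from __future__ import annotations

# level -> (minimum drives, usable drive-equivalents for n drives)
RAID_RULES = {
    "0": (1, lambda n: n),        # no redundancy: sum of all drives
    "1": (2, lambda n: n / 2),    # reliability cost is 100% of protected drives
    "1E": (3, lambda n: n / 2),   # inferred: every block is stored twice
    "5": (3, lambda n: n - 1),    # one drive-equivalent of parity
    "5E": (4, lambda n: n - 2),   # parity plus built-in hot spare
    "6": (3, lambda n: n - 2),    # two drive-equivalents of parity
}


def usable_capacity_gb(level: str, num_disks: int, disk_gb: float) -> float:
    """Return the usable capacity of an array of identical drives."""
    key = level.upper()
    if key not in RAID_RULES:
        raise ValueError(f"unsupported RAID level: {level}")
    min_disks, usable = RAID_RULES[key]
    if num_disks < min_disks:
        raise ValueError(f"RAID {key} needs at least {min_disks} drives here")
    if key == "1" and num_disks != 2:
        raise ValueError("RAID 1 mirrors exactly two drives")
    return usable(num_disks) * disk_gb


if __name__ == "__main__":
    for level in ("0", "1E", "5", "5E", "6"):
        print(f"RAID {level}: {usable_capacity_gb(level, 4, 300):.0f} GB usable")
```

For four 300 GB drives this gives 900 GB for RAID 5 and 600 GB for RAID 5E and RAID 6, in line with the N - 1 and n - 2 rules above.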

SCSI ServeRAID Adapters

ServeRAID 4 family
Ultra160 SCSI with one, two or four channels
-RAID levels 0, 1, 1e, 5, 5e, 00, 10, 1e0, 50
-Support for up to 56 disks

ServeRAID 5i, 6i
Zero channel RAID adapter (works with onboard SCSI controller)
-Uses full ServeRAID software stack
-Has BIOS, firmware, device drivers, and utilities
-RAID levels 0, 1, 1e, 5, 00, 10, 1e0, and 50

ServeRAID 6m
Ultra320 SCSI with two channels
-RAID levels 0, 1, 1e, 5ee, 00, 10, 1e0 and 50

The NOS device drivers are model specific. The ServeRAID 4 adapter family shares the characteristics listed here and comes in several variants (4L/4Lx, 4m, and 4H). The ServeRAID 5i and 6i adapters have no internal or external SCSI connectors. They use the server's onboard SCSI controller but enhance the basic features to provide support for additional RAID levels. The ServeRAID 6m is a dual-channel Ultra320 SCSI controller.

SATA and SAS ServeRAID Adapters

ServeRAID 7t
1.5 Gbps per port serial ATA (SATA) controller
-RAID levels 0, 1, 5, 10
-Up to four SATA disks on four separate ports

ServeRAID 7k
-The option is shipped as a special memory DIMM with a battery attached (for battery-backup purposes)
-Memory is 256 MB, 133 MHz (PC2100) DDR1 memory
-RAID levels 0, 1, 5, 10

The ServeRAID 7t is designed for smaller servers that require RAID support with SATA disks. A maximum of four disks can be connected to the ServeRAID 7t. It is unlikely that you will see a ServeRAID 7t in a high-end server, as the controller does not support the SCSI or SAS backplanes that are common in high-end models. However, a customer may choose to add such an adapter to a system that can support non-hot-swap disks. The battery on the ServeRAID 7k provides up to 33 hours of backup.

SATA and SAS ServeRAID Adapters

ServeRAID 8i
3.0 Gbps per port serial attached SCSI (SAS) controller
-RAID levels 0, 1, 5, 5ee, 6, 10, 1e0, 50, 60
-Up to eight SAS ports

ServeRAID 8k
-This option is shipped as a special memory DIMM with a battery attached via wires (for battery-backup purposes)
-The DIMM is installed in a special DIMM socket in supported servers
-Battery is connected to the DIMM by wires and is typically mounted on the server chassis

The ServeRAID 8i and 8k were introduced to support the third-generation Enterprise X-Architecture servers, which are built around SAS disk subsystems. The ServeRAID 8k option:

-Is shipped as a special memory DIMM with a battery attached via wires (for battery-backup purposes)
-The DIMM is installed in a special DIMM socket in supported servers
-Five DRAM chips on the DIMM; "Adaptec ATB-200" on battery side
-Write-back cache memory is 256 MB, 533 MHz DDR2 unbuffered memory
-Battery is connected to the DIMM by wires and is typically mounted on the server chassis

ServeRAID 10 (MR10i, MR10k, MR10M)

IBM ServeRAID-MR10i/10is SAS/SATA Controller
IBM ServeRAID-MR10k SAS/SATA Controller
IBM ServeRAID-MR10M SAS/SATA Controller

LSI 1078 RAID Adapter (MR10i/is, MR10k, MR10E)
-Eight-port SAS RAID adapter
-Two SAS connectors, 3 Gb/s throughput per port (full duplex)
-RAID levels 0, 1, 5, 6, 10, 50 and 60, with support for arrays greater than 2 TB
-x8 PCI Express host interface
-Battery-backed 256 MB DDR II 667 MHz SDRAM DIMM module
-The 10is offers encryption/security
-Protects data in cache for up to 72 hours during power loss or MegaRAID controller failure
-Allows system administrators to replace a failed adapter while the data on the DIMM module stays protected for up to 72 hours
-iTBBU support
-122 device support
-RoHS and WEEE compliant

EXP3000 Storage Enclosure

Entry level disk storage
-2U rack mount enclosure with 12 easily accessible bays
-Support for dual-port and hot-swappable SAS disks at 10,000 and 15,000 rpm speeds and SATA disks at 7,200 rpm
-3 Gbps Serial Attached SCSI (SAS) host interface technology
-Easy to deploy and manage with the DS3000 Storage Manager
-Combination of 12 SAS or SATA 3.5" drives per enclosure
-Scalable to 3.6 TB of storage capacity with 300 GB hot-swappable SAS disks, or 12.0 TB with 1.0 TB hot-swappable SATA disks, in the first enclosure
-Expandable by attaching up to three EXP3000s, for a total of 14.4 TB of storage capacity with 300 GB SAS disks or up to 48.0 TB with 1.0 TB SATA disks
-Telco model supports -48 V dc power supplies
-NEBS and ETSI compliance for AC and DC models
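The capacity figures quoted above follow directly from the bay count. The Python sketch below reproduces them, assuming a first enclosure plus up to three attached EXP3000s with 12 bays each and raw (unformatted, pre-RAID) capacity; the names are hypothetical.

```python
BAYS_PER_ENCLOSURE = 12   # from the slide: 12 drive bays per enclosure
MAX_EXPANSIONS = 3        # from the slide: up to three attached EXP3000s


def raw_capacity_gb(disk_gb: int, expansions: int = 0) -> int:
    """Raw capacity in GB for the first enclosure plus attached expansions."""
    if not 0 <= expansions <= MAX_EXPANSIONS:
        raise ValueError("expansions must be between 0 and 3")
    enclosures = 1 + expansions
    return BAYS_PER_ENCLOSURE * enclosures * disk_gb


if __name__ == "__main__":
    print(raw_capacity_gb(300) / 1000, "TB with 300 GB SAS, one enclosure")
    print(raw_capacity_gb(1000) / 1000, "TB with 1 TB SATA, one enclosure")
    print(raw_capacity_gb(300, 3) / 1000, "TB with 300 GB SAS, four enclosures")
    print(raw_capacity_gb(1000, 3) / 1000, "TB with 1 TB SATA, four enclosures")
```

Usable capacity will be lower once a RAID level with parity, mirroring or hot-spare space is applied.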

DS3000 Family

Host-side features
-DS3200: SAS host-side connection
-DS3300: iSCSI host-side connection
-DS3400: Fibre Channel host-side connection
-One or two controllers available on all models
  - Two controllers provide host-side cable redundancy

Disk-side features
-All models have SAS disk-side connections
  - SATA disks also supported
-Extension through up to three EXP3000 expansion units
  - Up to 48 disks per system
-One or two controllers
  - Two controllers provide disk path redundancy

The DS3000 family of storage servers provides flexible connections for external, managed storage. SAS, iSCSI and FC models are available. All disks can be SAS or SATA. The host requires the appropriate host bus adapter for the chosen model: a SAS adapter for the DS3200, an iSCSI (Ethernet) adapter for the DS3300, and an FC HBA for the DS3400.

DS3200 Rear View

This picture shows the rear view of the DS3200 chassis with dual power supplies and ESMs.

DS3300 Rear View

[Figure: rear view of the DS3300, with callouts "iSCSI Ports" and "Components covered"]

This picture shows the rear of the DS3300.

DS3400 Rear View

This picture ends the series showing the rear view of the DS3400.

Summary Topic 3

This topic has enabled you to:

-Describe the RAID levels offered by the IBM ServeRAID adapter family
-List the common ServeRAID adapter family of products
-Describe some of the IBM storage enclosures commonly found in the System x server environment

This topic gave an overview of the IBM RAID levels and the currently offered ServeRAID adapters. It also discussed the storage solutions that IBM System x offers.


Topic 4 High-performance Technologies Review

The prerequisite to this course introduced the design principles of IBM System x and xSeries servers and how to service them. This topic looks more closely at what these design principles mean in practice when servicing a System x or xSeries server.

Topic Objectives

By the end of this topic, you will be able to:

-Describe the advanced technologies used in high-performance System x servers
-Describe the system management capabilities of high-performance System x servers

All System x servers support some of the more advanced technologies that IBM has designed and developed. This topic discusses these technologies and describes the implications of working with them in the field.


Processor Technologies

The industry standard server (Intel processor-based) takes many forms. There are a number of processor types in common use today. This section discusses some of the features of the Intel processor family and reviews some of the service implications when working with processor problems.

Processor Types

Intel processors
Dual processor capable
-Xeon DP

Quad processor capable


-Xeon MP, EM64T (32/64-bit) and IBM XA64e 4th generation chipset

Six Core processor capable


-Xeon Processor 7400 series 2.66 GHz/1066 MHz front side bus

IBM Enterprise X-Architecture chipsets


Enables scaling of Xeon MP and Itanium II systems beyond 4-way capability up to 96 cores in a single server (multi-node)

AMD Processors
-AMD Opteron family of processors

The Intel processor family has several offerings in common use today. High-performance System x servers use all of the processors in the chart above. Note that, although the servers discussed in this course are multi-processor capable, not all servers you see in the field will actually have multiple processors installed. In many cases, the base server ships with one processor, with spare slots or sockets for additional processors as the customer's needs grow.

The x3950 M2 provides an uncomplicated, cost-effective and highly flexible solution. With the ability to scale up to a maximum of 96 cores using Intel six-core processors, while maintaining balanced performance between processors, memory and I/O, the x3950 M2 can easily accommodate business expansion and the resulting need for additional application space. The configurations are flexible enough to allow a minimum of two CPUs per chassis, which gives additional access to memory and I/O to address an organization's specific application requirements. This flexibility allows for the creation of a 12-core, 32-DIMM server that uses only two processor sockets for processor licensing-constrained applications, and it can be scaled to a 48-core, 128-DIMM server that uses only eight processors.

For servers equipped with AMD processors, IBM uses the Opteron multi-core parts.
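The core and DIMM counts quoted for the x3950 M2 can be checked with a little arithmetic. The Python sketch below assumes four chassis at most, four sockets per chassis and 32 DIMM slots per chassis; these per-chassis values are inferred from the figures in the paragraph above (96 cores from six-core processors, 32 DIMMs with one chassis, 128 DIMMs with eight processors at two per chassis), and the names are hypothetical.

```python
CORES_PER_PROCESSOR = 6    # Intel six-core processors (from the text)
SOCKETS_PER_CHASSIS = 4    # inferred: 96 cores / 6 cores / 4 chassis
DIMMS_PER_CHASSIS = 32     # inferred: 32 DIMMs in the single-chassis example
MAX_CHASSIS = 4            # inferred: 128 DIMMs / 32 DIMMs per chassis


def scaled_config(chassis: int, processors_per_chassis: int) -> dict:
    """Return the total cores and DIMM slots for a multi-node configuration."""
    if not 1 <= chassis <= MAX_CHASSIS:
        raise ValueError("chassis must be between 1 and 4")
    if not 2 <= processors_per_chassis <= SOCKETS_PER_CHASSIS:
        raise ValueError("this sketch assumes 2 to 4 processors per chassis")
    processors = chassis * processors_per_chassis
    return {
        "cores": processors * CORES_PER_PROCESSOR,
        "dimms": chassis * DIMMS_PER_CHASSIS,
    }


if __name__ == "__main__":
    print(scaled_config(1, 2))   # two sockets in one chassis
    print(scaled_config(4, 2))   # eight processors across four chassis
    print(scaled_config(4, 4))   # fully populated
```

Populating only two sockets per chassis keeps the processor count low for licensing purposes while still adding the memory and I/O of every chassis.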

Processor and VRM Failures

The SP detects the failure and handles the error

-SP will restart the server
  - SP holds the failed processor in reset to allow POST to complete
  - Server resumes on remaining processors if possible
-Performance is degraded but users have access to the server's resources

Help is available if a processor or VRM fails, as the Service Processor will log the event. When the SP detects a failed processor or VRM, it handles the error and attempts to make the server functional. The SP deals with this situation by attempting to reboot the server on any surviving processors.

Replacing a Failed Processor or VRM

Service implications
-System restarts if there are good processors/VRMs remaining
-Processor slot may need to be manually re-enabled upon repair
  - Following replacement of the failed component, run Setup (F1) to check processor slot status
-The PDSG will advise the correct part for VRMs (slot or system board)

When you replace a failed processor or VRM:

-Check system-specific configuration requirements in the PDSG
  - If necessary, re-enable the processor socket/slot and test the new processor
-Reboot the server and ensure the POST message reflects that all processors are active

A VRM failure may give the appearance that a processor has failed. The system event log should capture the specific details of the failure and enable you to identify whether it was the VRM or the processor that had the error. If it is the VRM, it could be in one of several places in the server. Some System x and xSeries models have VRMs built into the system board, some have VRM slots and some have both. Your knowledge of the System x and xSeries model will help you to identify the exact location. Light Path Diagnostics will usually indicate the failing part. If no light is visible, or if you need to verify the failure, check the HMM/PDG for additional information.

Once the failed part has been identified and replaced, it is important that you test the system to make sure that the associated processor is functioning normally. After the component is replaced, the BIOS may or may not detect and initialize the processor slot automatically. It may be necessary to manually enable the processor slot before the system is restored to full functionality. Check the HMM/PDG for the correct procedure.


Memory Technologies

This section looks at IBM's memory protection technologies and describes how these memory technologies change the behavior of a server and how you service it when memory faults occur.

Error Checking and Correcting (ECC) Memory

Additional bits on a memory DIMM store checksum data to verify memory contents (72 bits vs. 64)
-During each write, a new checksum is calculated and stored in the additional bits on the DIMM
-During a read, the checksum is compared with the data bits to verify that the data is valid and/or to correct single-bit errors

ECC memory is limited to Single Error Correct/Double Error Detect (SEC/DED)
-Memory in critical mode must be replaced as only one ECC action is available at a time

Non-servers traditionally use 64-bit (non-parity) memory, but the absolute minimum memory quality requirement for servers is ECC (72 bits). This memory type is standard across the System x and xSeries server range. Due to the nature of memory configurations in modern servers, a single-bit error is still the most common type of error. ECC offers the ability to detect and correct any single-bit error and works well in most situations for most general-purpose server requirements. From a service perspective, if ECC is correcting a persistent error, the DIMM ultimately needs to be replaced. Unless the server has encountered a second, uncorrectable error, it is likely that the server will still be running. You may need to schedule a suitable time to replace the failing DIMM.
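To make the SEC/DED behavior concrete, the Python sketch below protects 64 data bits with 8 check bits (72 bits in total) using an extended Hamming code. This is one standard way to build a SEC/DED code and is used here only as an assumption for illustration; the guide does not say which code the memory controllers actually implement, and all names are hypothetical.

```python
from __future__ import annotations

# Positions 1..71 form a Hamming code; position 0 holds an overall parity bit.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = tuple(p for p in range(1, 72) if p not in PARITY_POSITIONS)


def encode(data: int) -> list[int]:
    """Encode a 64-bit value into a 72-bit codeword (list of 0/1 bits)."""
    if not 0 <= data < 1 << 64:
        raise ValueError("data must fit in 64 bits")
    word = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit makes the parity of its group of positions even.
        word[p] = sum(word[pos] for pos in range(1, 72) if pos & p) % 2
    word[0] = sum(word) % 2  # overall parity, needed for double-error detection
    return word


def decode(word: list[int]) -> tuple[str, int | None]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        word[syndrome] ^= 1  # single-bit error: the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None  # for example, a double-bit error
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= word[pos] << i
    return status, data


if __name__ == "__main__":
    codeword = encode(0x0123456789ABCDEF)
    damaged = list(codeword)
    damaged[17] ^= 1                 # one flipped bit is corrected
    print(decode(damaged)[0])
    damaged[40] ^= 1                 # a second flipped bit is only detected
    print(decode(damaged)[0])
```

The second case is the situation described above as a second, uncorrectable error: the fault is detected, but the data cannot be repaired.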

ChipKill Memory

ChipKill memory provides a higher level of error checking and correcting capabilities
-Uses standard ECC DIMMs
-Corrects up to 4-bit memory errors
  - IBM patented technology performs on-the-fly correction
-Improves reliability 600 times over standard ECC memory
-Especially important for business-critical applications where large amounts of memory are installed
  - 1 in 5 servers with more than 1 GB of memory may have multi-bit errors each year
  - Large database servers can take many extra hours to recover from a system failure (for example, time to re-initialize the database)

Where standard ECC protection is not enough, many high-end IBM System x and xSeries servers now offer ChipKill support. This technology extends the basic ECC capabilities to tolerate the loss of an entire DRAM device on a DIMM, the equivalent of 4 bits of bad data. Very large databases can take several hours to resynchronize, rebuild or restart following a shutdown, so a customer should factor this into service availability planning when deciding on a suitable memory technology for their server. As with ECC, a memory system that has invoked a ChipKill event is likely to still be running. You will not be able to simply take out the bad DIMM and replace it without scheduling a suitable time with the customer.

Hot Spare Memory

Extends memory availability beyond ChipKill capabilities


-Some memory DIMMs are reserved for the hot-spare function
Total available memory will be reduced if hot-spare memory is enabled

-UNLESS MEMORY IS MIRRORED, memory DIMMs are NOT hot removable!


System must be brought down to replace bad memory

-Function needs to be enabled in BIOS (customer choice)


Function is supported on the System x3800

Hot-spare memory reserves a bank of memory to cover the user memory in the server. The extra/hot-spare memory is idle until it is needed. The Service Processor monitors memory performance and tracks errors. Before the ECC threshold is reached, the failing memory is copied to the hot-spare DIMMs during the refresh cycle, and the questionable memory is switched off. In order for this to work, the failure must be correctable by ECC or ChipKill correction algorithms and the memory swapped by the controller before a fatal error halts/crashes the NOS. Traditionally, for hot-spare memory to work, all memory in all banks must be identical.

Memory ProteXion

Combination of technologies for ultimate memory reliability

-Spare bits on each DIMM can be used to move data from failed bits (up to 2 bits) on DRAMs to good spares
-Memory can also be mirrored for ultimate protection
  - Mirroring is supported on x3950, x3950E, x3850 M2 and x3950 M2

Memory ProteXion is the term given to the memory system of a number of System x and xSeries servers that are based around the IBM Enterprise X-Architecture chipsets. The memory configuration provides for spare bits on each DIMM. If a bit of memory goes bad, the memory controller moves it to a new location on the DIMM. (ECC protection has routinely taken 8 extra bits out of 72, but IBM found a way to provide it with only 6, leaving two spare bits out of every 72.)

Memory can also be mirrored. In a mirrored configuration, half of the memory is reserved for the copy, so the total maximum possible memory is reduced by half. As with all memory failures, the bad DIMM must ultimately be replaced to avoid further failures stopping the NOS. ONLY in a mirrored memory configuration will you be allowed to hot replace a failed memory DIMM. If you are unable to hot replace a failed DIMM, you are likely to need to schedule downtime on high-performance System x and xSeries servers, as they are built to survive even serious memory faults.

Memory Mirroring

Requires two memory ports and supporting hardware

-Memory controller and BIOS work together to create an exact duplicate of one port in the other
-Mirrored memory systems enable hot replace of defective DIMMs

Here are some basic rules for memory mirroring to work.

When a DIMM Fails in a Server

SP detects the failure and handles the error

-SP restarts the server if memory is not mirrored
  - SP will hold the failed DIMM or bank of DIMMs in reset to allow POST to complete
  - Server resumes on remaining DIMMs if possible
  - Performance is degraded but users will be able to use the server's resources

Service implications
-If memory is mirrored, the server will be running and it may be possible to remove the failed DIMM without stopping the NOS
-If memory is not mirrored, the system may have restarted itself if there was good memory remaining
  - If so, it will be necessary to shut down the server to make repairs
-Memory slot or bank may need to be manually re-enabled
  - Following replacement of the failed component, run Setup (F1) to check memory slot status

When the SP detects a failed DIMM, it handles the error and attempts to make the server functional. If memory is mirrored, the hardware will have switched off the port containing the bad DIMM. In this case, you may be able to remove the failed DIMM without stopping the NOS. The procedures for removing a failed DIMM in a mirrored configuration are contained in the HMM or PDG. In a system without mirrored memory, you will need to shut down the server to replace a failed DIMM. Upon replacement of the component, the initialization of the memory slot may or may not be automatically detected by BIOS. It may be necessary to manually enable the DIMM slot or a bank of DIMMs before the system is restored to full functionality.
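The decision described above can be summarized as a short Python sketch. It is only an illustration of the logic in this section, not a replacement for the procedures in the HMM or PDG; the function and parameter names are hypothetical.

```python
from __future__ import annotations


def dimm_service_steps(mirrored: bool, slot_enabled_by_bios: bool) -> list[str]:
    """List the high-level steps for replacing a failed DIMM."""
    steps: list[str] = []
    if mirrored:
        # The hardware has already switched off the port with the bad DIMM.
        steps.append("Consult the HMM or PDG for the mirrored-memory procedure")
        steps.append("Replace the failed DIMM without stopping the NOS, if allowed")
    else:
        steps.append("Schedule downtime with the customer")
        steps.append("Shut down the server")
        steps.append("Replace the failed DIMM")
    if not slot_enabled_by_bios:
        # BIOS may not detect the new DIMM automatically.
        steps.append("Manually enable the DIMM slot or bank of DIMMs")
    steps.append("Confirm the system is restored to full functionality")
    return steps


if __name__ == "__main__":
    for step in dimm_service_steps(mirrored=False, slot_enabled_by_bios=False):
        print("-", step)
```

Keeping the mirrored and non-mirrored paths separate reflects the key service difference: only the mirrored configuration allows a hot replace.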

Active PCI, PCI-X and PCI-Express

Active PCI, PCI-X and PCI-Express were developed to add the ability to hot add, remove and replace adapters and controllers to a system without the need to shut down the NOS. This section describes the technology and how to work with it.

Active Slot Implementation

All 4-way and some 2-way servers have Active slots

-Additional hardware supports sensing and power control
-The OS needs to be able to support Active slots
  - Device drivers are aware of the power state of the hardware
  - Individual adapters need to be supported by intelligent drivers (this is typically supplied by the adapter manufacturer)
-In a redundant adapter configuration, failed adapters can be removed and replaced without shutting down the OS

While not exclusive to high-end servers, Active PCI technology is common to all high-end System x and xSeries servers. Active PCI (and Active PCI-X) makes it possible to add, remove and replace adapters while the NOS is running. Device drivers are needed to support both the technology and any adapters that will make use of it. Where two adapters are coupled together in a redundant configuration, for example two network adapters, a failure can be fixed without stopping the NOS.

Active PCI requirements

Hardware
-Interlock switch
-2 LEDs per Active PCI slot (Power, Attention)

Software
-Device driver: adapter manufacturer
-System driver: machine manufacturer
-System service: operating system manufacturer

Servicing a Server with Active Slots

If an adapter fails in an Active bus, it can be removed without stopping the OS

-Procedures vary according to the OS and adapter
  - Simply removing power from a failed adapter may crash the server, even though the adapter has failed
  - Adapter-specific procedures must be followed
-When a working adapter is reinserted, additional steps may be needed to activate it

If you are called to a server which has Active slots enabled and working, you will need to consult with the customer before attempting to replace a failed adapter. Any customer who adopts this technology will be reluctant to let you stop the NOS to replace the failed adapter, and you may be required to replace the adapter hot. Procedures vary from NOS to NOS AND from adapter to adapter. In general, however, the NOS is informed that an adapter is about to be removed, and the Active PCI/PCI-X switch card is used to remove power from the slot prior to removal. When you have completed the repair and fitted the replacement adapter, the NOS may need to be told that the repair is complete.


Service Processors

Here, we look at the system management hardware (Service Processors) you will find in high-performance System x and xSeries servers.

Service Processor Types

Baseboard Management Controller (BMC)

-IPMI-compliant service processor
  - Stores event log and other system information
  - Accessible via an Ethernet connection to the (shared) eth0 port of the host
-Accessible via <F1> Setup or <F2> Diagnostics but requires the OS to be stopped

Remote Supervisor Adapter II (RSA-II)

-Powerful, Ethernet graphical (Web) management tool
  - Standard on x3950, x3950E, x3850 M2 and x3950 M2
  - Optional on x3755, x3800 and x3850

Service Processors (SP) are often divided into two groups.

Basic Service Processor (BMC)
-Runs on the 5 V continuously-on power, and is used to power the server on and off
-Monitors the I2C bus for sensor activity, and stores logs and information about events
-Responds to issues and errors (light path diagnostics, fans, reboots)
-Provides limited information access to the machine while powered off (if the machine is plugged in)

Advanced Service Processor (RSA2)
-Runs on the 5 V continuously-on power and monitors/collects information from the BMC
-Can be programmed to page out support personnel when a problem occurs
-Powerful web interface for easy remote management
-Remote video, remote control and push-down code features

Baseboard Management Controller (BMC)

Independent microcontroller used to perform low level system monitoring and control functions.

BMC functions:
-Initial system check out at AC on
-BMC event log maintenance
-System power state tracking
-System initialization
-System software state tracking
-System event state monitoring
-System fan speed control

IPMI BMC event log messages
-Contains information, warning and error messages

The Intelligent Platform Management Interface (IPMI)-compliant Baseboard Management Controller is for system health monitoring and management. The BMC maintains a system event log (SEL) that can be accessed during POST via the F1 key sequence and, if it is configured for remote access, while the host OS is running.

Remote Supervisor Adapter (RSA)

RSA, RSA II and RSA II Slimline

Management independent of server or OS status
-Full remote control of hardware and OS (only direct via LAN)
-Remote power control
-Remote flash update
-User administration and security

Additional features of RSA II
-Default fixed IP address (192.168.70.125)
-Dongle provides RS-485 and serial ports
-Host video is provided by the RSA II
  - RSA II Slimline does not provide host video

Management tool access
-IBM Director, Telnet, ANSI terminal, Web browser

Remote Supervisor Adapters (RSA) are full-featured management adapters that provide both in-band and out-of-band management capabilities, including full remote control. Through the RSA and RSA II, you can interrogate and manage logs, control and monitor the power state of the host server, apply flash updates to the host and any attached I/O expansion enclosures, and take full remote control of the host console while the NOS is running. RSAs support the following:

-Web-based management: A small web server embedded in the adapter lets you connect through the dedicated LAN port and use a user-friendly, HTML-based interface to configure and monitor the server.
-Remote graphic console redirection: When connecting through the dedicated LAN port, the card can capture video data and perform a complete console redirection with text, graphics, keyboard and mouse support.
-DNS/DHCP support: In addition to static IP configuration, the RSA supports DHCP and DNS. Placing the card in a network with a DHCP server configures it automatically, avoiding the need to run configuration routines through the management software.
-NT blue screen capture: The most recent OS failure screen can be captured, avoiding the need to restart the server to reproduce the error.
-Attach event log to e-mail alerts: The event log can be sent as an e-mail attachment to notify administrators of any problem that affected the server.
-DB-9 connector (RSA only): The card has a standard DB-9 connector, making cabling easier.
-Externally visible LEDs: Power and error LEDs are on the rear bezel, removing the need to lift the covers to check the status of the card.

RSA II Adapter Features / Layout
1. Status LEDs (Heartbeat & Power: heartbeat blinking, power solid during normal operation)
2. Pinhole Reset (Service Processor Software Reset)
3. Mini-USB Connector (Host OS Comm. / Remote Disk, Mouse, Keyboard)
4. External Power Supply
5. RJ45 Ethernet Connector (Web Interface)
6. DB15 VGA Video Connector (Host Video)
7. Video Compression Memory
8. Non-Serviceable Clock Battery
9. Video Compression Chip
10. Remote Floppy, Mouse, Keyboard Chip
11. ATI Radeon 7000VE (a.k.a. RV-100) (Video)
12. PCI Connector (System Video)
13. Ethernet PHY
14. Flash Memory (Service Processor)
15. PowerPC CPU (Service Processor)
16. Video Memory (System Video)
17. CPU Memory (Service Processor)
18. Real-time Clock

The RSA II adapter replaced the RSA adapter starting in 2003 and currently comes in several slightly different versions. The photos above show some of the features of the full RSA II adapter (e.g. it is mounted on its own video card). The RSA II SlimLine adapter mounts on an existing video adapter in many of the newer System x servers. There is also an RSA II SlimLine Refresh 1 and an RSA II-EXA adapter. The essential differences between these versions can be found on the following website: http://www.redbooks.ibm.com/abstracts/tips0146.html

The RSA II is a complex, half-size adapter that needs to be flashed for the supported server in which it is installed. Depending on the level of code installed in the RSA II, the adapter can be reset with a paper clip using either a 5-5-10 second sequence (5 seconds pushed, 5 seconds not pushed, 10 seconds pushed) or a straight 10-second push. A reset sets the adapter back to factory defaults and causes it to reboot. The adapter then tries for two (2) minutes to obtain a DHCP address; if it cannot find a DHCP server, it falls back to the default address 192.168.70.125.
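
The address selection after a reset can be sketched as a small decision function. This is an illustrative model only, not adapter firmware: the function name and the idea of passing in the delay at which a DHCP offer arrived are assumptions made for the example. The two-minute window and the default address are taken from the text above.

```python
from typing import Optional

DEFAULT_STATIC_IP = "192.168.70.125"   # factory default address from the text
DHCP_WAIT_SECONDS = 120                # the adapter tries DHCP for two minutes


def address_after_reset(dhcp_offer: Optional[str],
                        offer_delay_s: Optional[float]) -> str:
    """Return the address the adapter would end up using after a reset.

    dhcp_offer    -- address offered by a DHCP server, or None if none answered
    offer_delay_s -- seconds after the reboot at which the offer arrived
    """
    offer_in_time = (
        dhcp_offer is not None
        and offer_delay_s is not None
        and offer_delay_s <= DHCP_WAIT_SECONDS
    )
    if offer_in_time:
        return dhcp_offer
    # No DHCP server answered within the window: use the fixed default.
    return DEFAULT_STATIC_IP
```

In practice this means that, after a reset on a network without DHCP, a service laptop would need an address on the same subnet as 192.168.70.125 to reach the Web interface.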

RSA II Web Interface

This picture shows an example of the interface that is presented to the user when connecting to an RSA II through a Web browser.

Event Log

If a failure occurs in the system:


-A fault LED on the operator panel card is illuminated
-Event log information can be viewed through the RSA Web interface

Through the RSA, you can interrogate and manage event logs to assist in problem isolation and repair. Note: you can access the RSA II event logs even if the host is in standby power mode.

Flash Updates

The RSA II has upgradeable BIOS and firmware

-The software can be installed from within a supported operating system
  - Microsoft and Linux executable files are available for download
-Remote flash image files are also available
  - Updates are applied via the RSA II Web interface

1. Browse for downloaded file
2. Select to Update

The IBM Remote Supervisor Adapter II has three different update package options: a Windows update package, a Linux update package, and a Zip file package.

The Windows and Linux update packages can be installed from within the corresponding NOS, provided that the NOS driver for the RSA II is installed.

The Zip file package is used to update the RSA II adapter from the Web interface. The package consists of a readme, a change history and the Zip file containing the following PKT files:
-PAETBRUS.PKT is traditionally the name of the Boot ROM file
-PAETMNUS.PKT is traditionally the name of the Main Application file

If access to the server is possible, these components can be updated with flash images. Images can be downloaded from the IBM support Web site and used to make the necessary diskettes. If access to the server is not possible, or if the RSA II is being managed through a Web browser, updates can also be applied via the Web browser connection. In this case, the update images are different but can still be downloaded from the IBM support Web site.
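
The choice between the three package types can be summarized in a short sketch. The function and its parameters are hypothetical helpers invented for illustration; only the package types, the driver prerequisite and the two PKT file names come from the text above.

```python
from typing import Optional

# Files inside the Zip package, as named in the text.
PKT_ROLES = {
    "PAETBRUS.PKT": "Boot ROM",
    "PAETMNUS.PKT": "Main Application",
}


def choose_update_package(host_os: Optional[str],
                          rsa_driver_installed: bool) -> str:
    """Pick a package type for a given situation.

    host_os is "windows", "linux", or None when the NOS cannot be used.
    """
    if host_os in ("windows", "linux") and rsa_driver_installed:
        # The NOS packages need the RSA II driver in the running NOS.
        return host_os.capitalize() + " update package"
    # Otherwise the PKT files are applied through the Web interface.
    return "Zip file package"
```

For example, a Linux host without the RSA II driver would fall through to the Zip file package, whose two PKT files are then uploaded one at a time through the Web interface.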

Console Redirection

Both text and graphics redirection are available

-Hardware connection requirements:
  Text redirection is available through the serial port as well as the Ethernet connection
  Graphical redirection works through the Ethernet connection only

-Software requirements for remote POST screens, remote Setup and remote Diagnostics:
  Terminal program, IBM Director, or a Web browser
  Supported Java engine

-During boot, the RSA II adapter can be loaded with a diskette or CD image file, and the server can boot from this image file.

Console redirection can be very useful for diagnosing problems where access to the server console is required. Using a variety of connection methods and software interfaces, the RSA gives full remote control capabilities. Depending on the level of access to the hardware, you can perform almost any task that you could perform while standing at the server itself. If the RSA Ethernet port is connected to the customer LAN, you can even take control of the server from another location (in theory, anywhere in the world), provided you know the IP address of the adapter and have the necessary security permissions to access the interface. This facility is very powerful and must be used with extreme care. Also, accessing a server console in this way should only be undertaken with the permission of the customer.

One other very important feature of the RSA II is Remote Disk: a file (a diskette or CD image file) can be accessed by the server via the RSA II adapter. The image is first loaded on the RSA II adapter. Then, when the server is rebooted, its boot sequence can be altered (e.g. by pressing F12) to boot from the image. The server then boots from the remote file as if it were an attached diskette drive or CD-ROM drive. This can be used to flash the server's various hardware components remotely.

SP Functions Comparison

Feature / Function              BMC              RSA II

Monitoring
Automatic Server Restart        Yes              Yes
Capture Windows Blue Screens    No               Yes
Environmental Monitors          Yes              Yes
Interface with Light-Path       Yes              Yes
Optional Power Source           No               Yes
PFA on system components        Yes              Yes
POST, Loader, O/S Timeouts      Yes              Yes

Alerting
Alert to pager                  Yes              Yes
SMTP Email                      No               Yes
SNMP Traps                      No               Yes*
SNMP via PPP                    No               No

Management/configuration
ANSI-based Management           Yes (via SoL)    Yes
Director-based Management       Yes              Yes
Telnet-based Management         Yes              Yes
Web-based Management            No               Yes
Remote BIOS Update              No               Yes**
Remote Control                  No               Yes
Remote POST / Diagnostics       No               Yes
View Status Logs                Yes              Yes
View Vital Product Data         Yes              Yes

Connectivity
10/100 Ethernet                 Yes (shared)     Yes
DHCP support                    No               Yes
DNS support                     No               Yes
PPP                             No               Yes
Shared serial support           No               Yes

Here is an at-a-glance comparison of the monitoring and alerting capabilities of the Service Processors found in System x servers.
*Only SNMPv1 traps supported.
**Direct flashing of BIOS/Diags firmware is not supported (can be done using the remote disk feature instead).

Summary

This topic has enabled you to:


-Describe the advanced technologies used in high-performance System x servers
-Describe the system management capabilities of high-performance System x servers

This topic discussed the technologies incorporated into high-performance System x servers.


Topic 5 Working With Scalable Systems

This topic discusses the service implications of working with scalable systems.

Topic Objectives

By the end of this topic, you will be able to:


-Define the terms scalable, node, complex and partition
-Describe the data and system management cabling of a scaled system
-Describe how to use the RSA II Web interface to configure a scalable partition on an x3950, x3950E and x3950 M2

When servicing multi-node systems, it is important to understand the relationship between nodes in the partition and how the partition is wired together.

Scalability Terminology

Scalable
-A system that is able to join with another computing resource to act as a single, larger server

Node
-A single computing resource (a server)
Capable of operating alone or joined (scaled)

Complex
-Two or more nodes
Joined together physically

Partition
-A complex that is running a single instance of an OS

The term scalable is used to describe a device that has the ability to operate in a joined fashion, along with another computing device, to appear as a single, large server. A node is the smallest unit of a scaled system. A node can operate standalone, as well as in a complex. A complex is a collection of nodes, physically joined together to form a large computing resource. A partition is a complex that is running a single instance of an OS across all processors and memory in the complex.
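
The relationship between these terms can be sketched as a small data model. The class and field names below are illustrative assumptions, not part of any IBM tool; the sketch only encodes the definitions given above (a node is one server, a complex is two or more joined nodes, a partition is a complex running a single OS instance).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """A single computing resource (one server chassis)."""
    name: str
    processors: int


@dataclass
class Complex:
    """Two or more nodes joined together physically."""
    nodes: List[Node]

    def __post_init__(self) -> None:
        if len(self.nodes) < 2:
            raise ValueError("a complex needs two or more nodes")


@dataclass
class Partition:
    """A complex that is running a single instance of an OS."""
    hardware: Complex
    os_name: str

    def total_processors(self) -> int:
        # The OS sees the processors of every node as one system.
        return sum(node.processors for node in self.hardware.nodes)
```

With two 4-way nodes, `Partition(Complex([Node("n1", 4), Node("n2", 4)]), "Linux").total_processors()` gives 8, which corresponds to the 8-way configuration discussed next.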

x3950 M2 Scalability Schematics 8-way and 16-way
[Figure: scalability cabling schematics. Panels: x3950 M2 (upper SMP module); x3950 / x3950E 8-way configuration (x3950, x460/MXE); x3950 16-way configuration. Each node exposes scalability Port 1, Port 2 and Port 3, plus BMC and RSA management connections.]

Here are the cabling schematics for 8-way and 16-way operation across all the supported scalable systems. When scaling the x3950, x3950E, 460 or MXE 460 to an 8-way partition, or all the above systems to a 16-way partition, the RSAs play a key part in creating and maintaining the partition, as they hold the partition data and maintain communications between all nodes in the partition.

The data flows across the scalability cables. Each node contains a scalability controller (part of the XA chipset) that is effectively a high-speed switch. Each node above is directly connected to every other node, so much of the switching technology embedded in the controller is not used.

Note that Ethernet hubs are used in all but the simplest partitions, as many devices need to connect to a common management LAN for scaling to work while still providing real-time access to management processors and functions.

Scalability Schematic 32-way


[Figure: xSeries 460 32-way configuration. Eight x460/MXE nodes, each with scalability Port 1, Port 2 and Port 3, plus BMC and RSA management connections.]

Here is the cabling schematic for a 32-way xSeries x3950, x3950E, 460/MXE 460 partition. As you can see from the scalability cabling in this schematic, each node is directly connected to three other nodes in the partition. This time, each node acts as a router to the nodes that are not directly connected, fully exploiting the switching capabilities of the scalability controllers in the nodes. Without the ability to maintain routing tables, it would not be possible to scale eight nodes together.
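
Why routing becomes necessary at eight nodes can be checked with a short graph sketch. With three scalability ports per node, four nodes can be fully meshed, but eight nodes cannot. The eight-node layout used below (a cube, where each node links to the three nodes whose number differs in one bit) is an assumption chosen only because it gives every node exactly three neighbours; the actual cable layout is the one shown in the schematic.

```python
from collections import deque
from typing import Dict, List

Links = Dict[int, List[int]]


def full_mesh(n: int) -> Links:
    """Every node cabled directly to every other node (needs n - 1 ports)."""
    return {a: [b for b in range(n) if b != a] for a in range(n)}


def cube() -> Links:
    """Hypothetical 8-node layout: neighbours differ in exactly one bit."""
    return {a: [a ^ 1, a ^ 2, a ^ 4] for a in range(8)}


def hops_from(links: Links, start: int) -> Dict[int, int]:
    """Breadth-first search: number of cable hops from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist
```

In the four-node mesh every node is one hop away, so no traffic is forwarded. In the assumed eight-node layout some nodes are two or three hops away, so intermediate nodes have to forward traffic, which is the routing role described above.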

Scalability Requirements

All RSAs in the partition must be connected


-An ethernet hub is required for more than two nodes

All scalability cables must be fitted

BIOS and firmware levels must match across all nodes

Previous partition information must be deleted
-Stale partition descriptor data may cause nodes to fail to merge

Here are the basic rules that will allow multiple nodes to merge into a partition. Before a partition can merge, however, parameters must be set to identify all nodes in the partition.
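
These rules lend themselves to a simple pre-merge checklist. The sketch below is a hypothetical helper, not an IBM utility: the `NodeState` fields are assumed to have been filled in by hand or by some inventory step, and the function only reports which of the rules above are violated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    name: str
    bios_level: str
    firmware_level: str
    rsa_connected: bool
    cables_fitted: bool
    has_old_partition_data: bool


def merge_blockers(nodes: List[NodeState]) -> List[str]:
    """Return reasons the nodes might fail to merge (empty list if none)."""
    problems = []
    if len({n.bios_level for n in nodes}) > 1:
        problems.append("BIOS levels differ across nodes")
    if len({n.firmware_level for n in nodes}) > 1:
        problems.append("firmware levels differ across nodes")
    for n in nodes:
        if not n.rsa_connected:
            problems.append(f"{n.name}: RSA is not connected")
        if not n.cables_fitted:
            problems.append(f"{n.name}: scalability cables are not all fitted")
        if n.has_old_partition_data:
            problems.append(f"{n.name}: previous partition information not deleted")
    return problems
```

An empty result only means that none of the listed rules is broken; the partition parameters still have to be set before the nodes can merge.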

Partitioning Overview

System x and xSeries servers use static partitioning


-A less complicated hardware implementation, and can operate with current commodity operating systems
Static partitioning (SPAR) is a model of partitioning that enables reconfiguration of a multi-node complex along nodal boundaries after shutdown and restart of the affected partitions, rather than the entire complex. The key feature of SPAR is the ability to independently manage and service individual partitions through software without having to shut down, physically power down, power up, and restart unaffected partitions.

Static partitions are those that require a reboot to change the configuration. This is a simplified model that fits well with existing OSes, which rely on hardware to mask the fact that they are running on processors and memory from several physical nodes.

Configuration

To create a complex:
-Flash BIOS, BMC and RSA of all nodes to the same levels
-Gather IP addresses (static or dynamic) for all RSAs
  SP networks must either have static IP addresses or have DHCP leases to maintain consistent IP addressing
  -The IP addresses that are assigned to the RSAs must not change once nodes are scaled and running
  -This is true for static IP addresses and DHCP leases

-For x3950, x3950E, xSeries 460 and MXE 460:
  Define the RSA and BMC IP addresses in <F1> Advanced Setup
  Define the partition details in the RSA II Web interface

-Partition tables still exist and are stored on each local RSA
-Partition tables still exist and are stored on each local BMC

Before attempting to create a complex, ensure that BIOS, BMC and RSA firmware match across all nodes and that RSA clocks match. That way, if a failure occurs, the information written to the event logs will correlate. The configuration of a complex is performed in one of two places, depending on the node type. For older systems, the configuration is created and stored via the <F1> Setup program. On newer systems, all configuration tasks are performed through the RSA II Web interface.

Scalable Partitioning Using <F1> Setup

Boot the chassis that will be the primary node


Enter <F1> Setup
-Select <Advanced Setup> on the main menu and <Static Partition Information>
  - In <Secondary Host Name>, enter the IP address of the secondary node RSA
  - Navigate to <Save Static Partition Information> and press <Enter>
-Navigate to <Start Options> on the main menu
  - Set <Boot Fail Count> to <Disabled>
-Power down the primary node and remove AC

Boot the chassis that will be the secondary node


Enter <F1> Setup
-Navigate to <Start Options> on the main menu
  - Set <Boot Fail Count> to <Disabled>
  - Do not enter any information regarding the primary node TCP/IP address
-Power down the secondary node and remove AC

Here are the instructions to create a complex using <F1> Setup.

Scalable Partitioning Using the RSA II Web Interface Sub Menus

The following tasks can be performed:


-Obtain status
-Create a partition
-Control a partition
-Delete a partition

Status
View current and new scalable partition data in the graphical user interface provided by the RSA II Scalable Partitioning Web interface. This menu is automatically displayed after each task below (create, control and delete) has completed.

Create Partition Task
Create new scalable partitions with the RSA II Scalable Partitioning Web interface.

Control Partition Task
Control new and current scalable partitions with the RSA II Scalable Partitioning Web interface. Controls are:
1. Moving new partition to current partition. The new partition is a staging area for current partitions; new partitions can be created while current partitions are running.
2. Starting current partitions
3. Stopping current partitions

Delete Partition Task
Selections are:
1. Delete Partition Settings on all ASM members of New Scalable Partition.
2. Delete Partition Settings on all ASM members of Current Scalable Partition.
3. Delete Partition Settings only for this (local) ASM member of Current Scalable Partition.

Level 4 Cache Considerations

Partitions are supported by L4 cache to speed communications across the processor busses
-On earlier scalable systems (the xSeries 440, 445, and 455), L4 cache is separate from main memory
-On the x3950, x3950E, xSeries 460 and MXE 460, the scalability chip has an integrated L4 Scalability Memory Cache (SMC) which utilizes main memory
  When BIOS reports available memory per node to the O/S, it must first subtract the scalability cache size (256 MB)

On first and second generation scalable systems, the L4 cache was physically separate from main memory. All main memory is available to the OS. On third generation scalable systems, the cache controller utilizes host memory for the cache. The customer will notice a difference between reported memory (that which is available to the OS) and physically installed memory.
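
The difference between installed and reported memory is simple arithmetic, sketched below. The function name is made up for the example; the 256 MB per-node figure comes from the slide above, and the sketch assumes that this is the only memory withheld from the OS.

```python
from typing import List

SCALABILITY_CACHE_MB = 256   # subtracted per node, per the slide above


def reported_memory_mb(installed_mb_per_node: List[int]) -> int:
    """Memory visible to the OS across all nodes of a partition."""
    return sum(mb - SCALABILITY_CACHE_MB for mb in installed_mb_per_node)
```

For example, a two-node partition with 16 GB (16384 MB) installed in each node would report 2 x (16384 - 256) = 32256 MB, which is 512 MB less than the 32768 MB physically installed.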

Scaled System Management Considerations

When a partition is merged:


-Diagnostics (<F2>) is only available at node level
-Light Path applies to each individual chassis, not the complex
-The Service Processor must be functional and the SP LAN connected to the others in the group
-The OS does not know physical boundaries and sees the complex as one system
-Event logs map memory and PCI bus information to a chassis in addition to a slot
-Processor speeds must be the same within and across all chassis
-Multi-chassis configuration code on the RSA II is only available if the Scalability Cartridge Assembly is detected in an x3950, x3950E, xSeries 460 or MXE 460

Here are some things to remember when working with partitions and scaled systems.

Scalability Port Test from Diagnostics

Scalability ports can be tested using System Diagnostics (<F2>), under the Basic menu option, from each chassis. The new diagnostic test, Scalability Port Test, is an interactive test that requires the user to follow the text on the screen.

x3950 M2 Scalability Overview

The x3950 M2 can be scaled to create a complex, or a partition that is running a single instance of an OS


x3950 M2 Scalability

Scalability configurations supported are 2, 3 and 4 nodes

Port cabling same as x3950
-New cables (deep plug with iPass connectors)

Scalability key required to enable scalability
-Key plugs into the processor board at the J14 connector

4 GB minimum is required for successful boot
-One processor and two DIMMs minimum in each node

Only USB keyboard and mouse are supported to boot standalone
-Press the Remind button to initiate standalone boot, as USB devices are not initialized at the start of the merge process

A configuration can have one or more scalable partitions. Each scalable partition supports an independent operating system installation. The scalable partition uses a single, contiguous memory space and provides access to all associated adapters and hard disk drives. PCI slot numbering starts with the primary node and continues with the secondary nodes, in numeric order of the logical node IDs.

Before you create scalable partitions, read the following information:

Make sure that all nodes in the multi-node configuration contain the following software and hardware:
-The current level of BIOS code, SAS BIOS code, service processor firmware, BMC firmware, and FPGA firmware.
  Note: To check for the latest firmware levels and to download firmware updates, go to http://www.ibm.com/systems/support/.
-Microprocessors that are the same cache size and type, and the same clock speed.

Make sure that each node contains the following hardware:
-A minimum of one microprocessor and one memory card with one pair of DIMMs
  Note: The nodes can vary in the number of microprocessors and the amount of memory each contains, above the minimum.
-A ScaleXpander key on the microprocessor board to enable multi-node operation

Make sure that the primary node contains a minimum of 4 GB of memory.

The Scalability installation Option documentation is available
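
The hardware prerequisites above can be collected into a small checklist function. This is a sketch under stated assumptions: `M2Node` and `check_nodes` are hypothetical names, a single `cpu_model` string stands in for "same cache size, type and clock speed", and the first node in the list is treated as the primary node. Firmware levels are not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class M2Node:
    name: str
    processors: int
    dimms: int
    memory_gb: int
    cpu_model: str               # stands in for cache size, type and speed
    has_scalexpander_key: bool


def check_nodes(nodes: List[M2Node]) -> List[str]:
    """Return prerequisite violations; nodes[0] is taken as the primary."""
    problems = []
    if not 2 <= len(nodes) <= 4:
        problems.append("supported configurations are 2, 3 or 4 nodes")
    if len({n.cpu_model for n in nodes}) > 1:
        problems.append("microprocessors differ between nodes")
    for n in nodes:
        if n.processors < 1 or n.dimms < 2:
            problems.append(f"{n.name}: needs at least one processor and one pair of DIMMs")
        if not n.has_scalexpander_key:
            problems.append(f"{n.name}: ScaleXpander key missing")
    if nodes and nodes[0].memory_gb < 4:
        problems.append(f"{nodes[0].name}: primary node needs at least 4 GB")
    return problems
```

Note that secondary nodes may legitimately hold fewer processors or less memory than the primary, as long as each meets the per-node minimum.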

Chassis Scalability requires ScaleXpander Option Kit

System x3850 M2 Non-Scalable

System x3950 M2 Scalable

ScaleXpander Option Kit

Scalability icon lights up when active

The x3850 M2 can be upgraded to an x3950 M2 with the ScaleXpander Option Kit

Closer look

Notes: The IBM ScaleXpander Option Kit can be used to upgrade the x3850 M2 for scalability. The IBM ScaleXpander Option Kit can be used to interconnect the SMP Expansion Ports of two or more servers to form multi-node configurations. With the ScaleXpander Option Kit, the non-scalable x3850 M2 transforms into a scalable x3950 M2. This scalable configuration supports up to 16 sockets and 96 processor cores.

ScaleXpander Option Key

-The ScaleXpander Option Kit is installed in a slot near the front of the system board
-During POST, the BMC reads VPD on the chip to verify the system can scale
-Each chassis to be scaled requires the kit to be installed

ScaleXpander Option Key

Notes: In order to merge chassis, the ScaleXpander Option Kit needs to be installed in a slot near the front of the system board. During POST, the BMC reads VPD on the chip to verify that the system can scale. Each chassis to be scaled requires the kit to be installed.

Processor Board Scalability Connectors

This slide shows the processor board connections. The scalability key is required to enable scalability; the key plugs into the processor board at the J14 connector. Three connectors on the rear of the system are used to connect the physical systems together. A management network consisting of the RSA and BMC from each of the systems to be scaled is required.

Rear view scalability connections and cable

Scalability Cable

Notes: This slide shows the scalability cable and SMP connectors on the rear of the x3950 M2.

Scalability cables

Scalability cable release lever

The cabling information is for multi-node configurations that consist of two or (when supported) three servers, for up to 12-socket operation. A node is a server that is interconnected with other servers or nodes through the SMP Expansion Ports to share system resources.

Two-node configuration
A two-node configuration requires two 3.0 m (9.8-foot) ScaleXpander cables. For a two-node configuration, attach the scalability cables from port 1 to port 1, and from port 2 to port 2.

Rear view scalability cables connected

Notes: This slide shows the deep-plug scalability cables installed into the SMP ports on the rear of the x3950 M2. Note the location of the scalability release levers.

Two Node Scalability Cable Layout

Two-node configuration
A two-node configuration requires two 3.0 m (9.8-foot) ScaleXpander cables. To cable a two-node configuration for up to eight-socket operation, complete the following steps:
1. Label each end of each ScaleXpander cable according to where it will be connected to each server.
2. Connect the ScaleXpander cables to node 1:
  a. Connect one end of a ScaleXpander cable to port 1 on node 1; then, route the cable through the node 1 wire-form clips on the cable-management arm.
  b. Connect one end of a ScaleXpander cable to port 2 on node 1; then, route the cable through the node 1 wire-form clips on the cable-management arm.
3. Connect the ScaleXpander cables to node 2:
  a. Locate the ScaleXpander cable that is connected to port 1 on node 1; then, connect the opposite end of the cable to port 1 of node 2. Next, route the cable through the node 2 wire-form clips on the cable-management arm.
  b. Locate the ScaleXpander cable that is connected to port 2 on node 1; then, connect the opposite end of the cable to port 2 of node 2. Next, route the cable through the node 2 wire-form clips on the cable-management arm.

Three-node or four-node configuration
A three-node configuration requires three 3.0 m (9.8-foot) ScaleXpander cables and supports up to 12-socket operation. For detailed instructions and cable layout, refer to the IBM System x3850 M2 and System x3950 M2 Type 7141 Problem Determination and Service Guide.
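
The two-node plan is small enough to write down as data, which can be handy when preparing the cable labels called for in the first step. The helper names and label wording below are assumptions for illustration; the port pairing (port 1 to port 1, port 2 to port 2) follows the two-node procedure. Three-node and four-node layouts are not modelled, since they are defined in the service guide.

```python
from typing import List, Tuple

CableEnd = Tuple[str, int]   # (node name, SMP Expansion Port number)


def two_node_cable_plan() -> List[Tuple[CableEnd, CableEnd]]:
    """One cable per port pair: port 1 to port 1, port 2 to port 2."""
    return [(("node 1", port), ("node 2", port)) for port in (1, 2)]


def cable_labels() -> List[str]:
    """Text for the label on each cable end, in cable order."""
    labels = []
    for near_end, far_end in two_node_cable_plan():
        for node, port in (near_end, far_end):
            labels.append(f"{node} / port {port}")
    return labels
```

Calling `cable_labels()` yields four labels, one for each end of the two cables.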

x3950 and x3950 M2 Scalability Comparison

x3950
-RSA managed partitioning
-Complex descriptor and partition descriptor
-Partitioning done across Ethernet
-No topology awareness
  Manual system discovery required to set up RSA IP addresses
  No cable status or debug reporting
-Each RSA only partition aware
  Controls only the local partition
  Control only available from primary system
-Partition deletion for updates

x3950 M2
-BMC managed partitioning
-Complex descriptor
  Describes complex and partition per system
-Partitioning done across scalability cables
-Aware of entire complex topology
  Cable problems and sophisticated debugging available
-Each system complex aware
  Control all partitions from one page
  Controls all partitions or individual systems
  Aware of all system states
-Preserve partition with standalone

Unlike previous scalable systems, the IBM System x3950 M2 BMC manages the scalable partitioning (rather than the RSA).

New Architecture

BMC automatically discovers scalable systems


New systems are discovered by checking all the ports
- System topology (all cable connections) discovered - Systems identified by UUID in complex descriptor

Changes in the scalability cables are discovered


- Remote changes available to neighbors

Complex Descriptor Data Structure filled out


Partition Control (Power On, Off, Reset), Partition Create manually or default partition
Partition Delete, Reset to Defaults
Standalone for debug purposes
Clients (RSA, BIOS) read structure and process the information
- BIOS uses it to set up systems to merge
- RSA uses it to present graphical image of the complex

External components can create and control partitions through nine available scalability commands

The new architecture uses an RSA connected to one of the nodes to act as a Web-based scalable complex management console from which partitions can be created and controlled. Cable topology and scalable port status will also be available from this complex management console. Partition creation and control may be performed from an RSA or IPMI client; partition management will be handled within each BMC.

The new architecture will perform automatic node topology discovery using the FPGA and BMC, so that every node will be able to communicate with every other node using the scalable management bus. The previous architecture required the user to set up in advance the Ethernet IP addresses of all the RSAs before partitions could be created, and further required that partition creation be performed from the boot node of each partition. The new architecture has removed all of these cumbersome requirements, making it possible to connect systems out of the box and go directly to partition creation.

Partition creation is now streamlined to a single RSA Web page where partition configuration data can be distributed to target member BMCs and stored in NVRAM. A test for pre-existing partitions is performed and their status is checked to ensure that the partition is powered off prior to reconfiguration. Partition IDs are utilized by the FPGA to enable uniform behavior by all nodes in a partition during power and reset operations. Partition-wide platform options such as mirroring are also distributed so that each BIOS can have consistent settings in advance of partition merging during the system boot phase.
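
The idea of discovering systems by checking all ports can be illustrated with a short traversal. This is a conceptual sketch only: the real discovery runs in the BMC and FPGA over the scalable management bus, and the `PortMap` structure, the function name and the example UUID strings here are all invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# system UUID -> {port number: UUID of the system cabled to that port, or None}
PortMap = Dict[str, Dict[int, Optional[str]]]


def discover(ports: PortMap,
             local_uuid: str) -> Tuple[List[str], List[Tuple[str, int, str]]]:
    """Walk outward from the local system, checking every port.

    Returns the systems found (in discovery order) and the cable
    connections seen, as (system, port, peer system) tuples.
    """
    systems = [local_uuid]
    connections = []
    index = 0
    while index < len(systems):
        uuid = systems[index]
        index += 1
        for port in sorted(ports[uuid]):
            peer = ports[uuid][port]
            if peer is None:
                continue            # nothing cabled to this port
            connections.append((uuid, port, peer))
            if peer not in systems:
                systems.append(peer)
    return systems, connections
```

The list of systems keyed by UUID plus the list of connections is roughly the information a complex descriptor would need in order to show topology and flag a port with no cable.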

BMC Role

Performs Auto Discovery


-Discovers all the systems in the complex
-Discovers the connections on all the ports
-Aware of the complete topology
-Updates necessary registers for related components (BIOS, FPGA, RSA)

Retains and maintains all Complex Information


-Stored data structure
-Clients send commands and BMC keeps the data consistent
-Performs data manipulation per partition

Routes all the information


-Controls local and remote systems and partitions
-Keeps data structures consistent on all systems in complex

The BMC is used for the creation and storage of complex descriptors, and for automatically generating a default partition based on complex topology (all complex nodes will be partition members). The BMC also controls the static partitioning states. Partition creation and configuration can be performed using the RSA II Web interface, or by a user application that is aware of RSA II dot commands or OEM IPMI commands through the BMC.

The BMC provides an automatically generated partition for users who do not want to create partitions manually, once the user identifies the primary (boot) node in the partition. The automatically generated partition descriptor only supports all complex nodes being members of the same partition. A CLI to dot commands or OEM IPMI commands will be supported, allowing scripting tools to generate partition(s) based on the user's needs.

Note: RSA is still required for partition definition.
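
The automatically generated partition follows a simple rule: the user names the primary node, and every node in the complex becomes a member. A minimal sketch of that rule is shown below; the function name and the dictionary layout are assumptions for illustration and do not reflect the actual descriptor format stored by the BMC.

```python
from typing import Dict, List


def default_partition(complex_nodes: List[str], primary: str) -> Dict[str, object]:
    """Build a default partition: all complex nodes are members.

    The primary (boot) node is listed first, followed by the remaining
    nodes in their original order.
    """
    if primary not in complex_nodes:
        raise ValueError("primary node must be part of the complex")
    secondaries = [node for node in complex_nodes if node != primary]
    return {"primary": primary, "members": [primary] + secondaries}
```

Anything other than "all nodes in one partition" falls outside this default and has to be created manually.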

RSA II Role

Unlike previous scalable systems, the role of the RSA has changed
-Reads scalable complex information from the BMC
-Displays scalable complex topology to user including:
  Incorrect cabling displayed and noted
  Port problems displayed and noted
  Non-scaled systems displayed and noted
  Provides partition and system

-As in previous versions
  Partition and system state are displayed

In multi-node integration, the partition configuration is written once through the RSA II to the BMC and then to the FPGA interface. The FPGA interface allows the partition configuration to be routed to each partition member's BMC and FPGA interface (Virtual ICMB, where ICMB is the Intelligent Chassis Management Bus). This complex configuration is stored in each local node's BMC NVRAM. The partition configurations are contained in the complex configuration.

During complex/partition configuration, the BMC uses only one buffer for all data; it no longer holds two buffers (active/candidate) as previous scalable systems did. The data structure of the complex descriptor is stored in each local node's BMC NVRAM and has a version check to ensure consistency. This data structure is shared between all the user applications that create and control static partitioning.

Note: RSA is still required for partition definition.

RSA II Interface (Create Partition)

To create a scalable partition, complete the following steps:
1. Connect the ScaleXpander cables.
2. Connect all nodes to an ac power source and make sure that they are not running an operating system.
Note: If the nodes are part of an existing partition, all nodes must be in Standby mode, which means that the nodes are part of the partition but operate independently. Click Force under Standalone Boot on the Scalable Complex Management page to enable the Standby mode.
3. Connect and log in to the Remote Supervisor Adapter II Web interface. See the Remote Supervisor Adapter II SlimLine and Remote Supervisor Adapter II User's Guide for more information; then, continue with the procedure to create a scalable partition.
4. In the navigation pane, click Manage Partition(s) under Scalable Partitioning. Use the Scalable Complex Management page to create, delete, control, and view scalable partitions.
5. Select the primary node; then, automatically or manually create a scalable partition:
-Click Auto under Partition Configure to automatically create a single partition that uses all nodes in the multi-node configuration.
-Click Create under Partition Configure to manually assign nodes to the partition.

Scalable Complex Management page

To create a scalable partition, complete the following steps:
1. Connect the ScaleXpander cables.
2. Connect all nodes to an ac power source and make sure that they are not running an operating system.
Note: If the nodes are part of an existing partition, all nodes must be in Standby mode, which means that the nodes are part of the partition but operate independently. Click Force under Standalone Boot on the Scalable Complex Management page to enable the Standby mode.
3. Connect and log in to the Remote Supervisor Adapter II Web interface. See the Remote Supervisor Adapter II SlimLine and Remote Supervisor Adapter II User's Guide for more information; then, continue with the procedure to create a scalable partition.
4. In the navigation pane, click Manage Partition(s) under Scalable Partitioning. Use the Scalable Complex Management page to create, delete, control, and view scalable partitions. A page similar to the one in the following illustration is displayed.

RSA II Interface (Partition Started)

Select the primary node; then, automatically or manually create a scalable partition:
1. Click Auto under Partition Configure to automatically create a single partition that uses all nodes in the multi-node configuration.
2. Click Create under Partition Configure to manually assign nodes to the partition.

Note: Click Redraw to reorder the sequence in which the nodes appear in the diagram on the page. You can, for example, reorder the diagram to reflect the order in which the nodes are installed in a rack. The nodes are reordered according to the ScaleXpander cabling, with the node that you select in the top position.

Partition Information Partition ID 1

Click Partition ID to define operation of the partition and view information about the partition. A page similar to the one in the following illustration is displayed.

The following non-selectable fields display information about the partition:
1. The Partition Count field displays the number of nodes in the partition.
2. The Partition Validity field displays the following status: Valid (which indicates the configuration is correct).
3. The Partition field displays one of the following statuses:
-Stopped: The partition is inactive, and the nodes can be reassigned to a partition.
-Started: The partition is active.
-Resetting: The configuration is resetting.
-Unknown: The partition contains unidentified port or chassis IDs.

a) In the Partition merge timeout minutes field, select the number of minutes POST waits for the scalable nodes to merge resources. The default value is 6 minutes.
b) Allow at least 8 seconds for each GB of memory in the scalable partition.
c) In the On merge failure, attempt partial merge? field, select whether POST should attempt a partial merge if one error is detected during full merge. Yes is the default value.
d) In the Memory Mirroring? field, select whether memory mirroring is enabled in all nodes in the partition. Yes is the default value.
e) Click Save.
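
The merge timeout guideline (at least 8 seconds for each GB of memory, against a default of 6 minutes) is simple arithmetic. The following Python sketch works it through. The function names are illustrative only and are not part of any RSA II or BMC interface.

```python
import math

SECONDS_PER_GB = 8           # guideline: allow at least 8 seconds per GB
DEFAULT_TIMEOUT_MINUTES = 6  # default value of the merge timeout field


def min_merge_timeout_minutes(total_memory_gb: float) -> int:
    """Smallest whole number of minutes that allows 8 s per GB of partition memory."""
    return math.ceil(total_memory_gb * SECONDS_PER_GB / 60)


def default_timeout_is_enough(total_memory_gb: float) -> bool:
    """True if the 6-minute default already satisfies the guideline."""
    return min_merge_timeout_minutes(total_memory_gb) <= DEFAULT_TIMEOUT_MINUTES


if __name__ == "__main__":
    for gb in (32, 64, 128):
        print(gb, "GB ->", min_merge_timeout_minutes(gb), "minute(s) minimum")
```

For example, a partition with 64 GB of memory needs 512 seconds, so the timeout should be raised from the default to at least 9 minutes.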

Chassis Merge

In order to merge chassis:


-All secondary nodes must contain same core count as primary
Can have different speeds, but not different core count

Notes: In order to merge chassis, all secondary nodes must contain the same core count as the primary node. The processors can run at different speeds, but the core count cannot differ. The screen shows the error message that appears when the processors in chassis number 2 do not match those in the primary node.

Chassis Merge

In order to merge chassis:


-Each chassis must have at least 4 GB of memory installed

Notes: In addition, in order to merge chassis, all chassis must have at least 4 GB of memory installed. The screen shows the error message if this condition is not met.
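
The two merge prerequisites described on these pages (matching core count and at least 4 GB of memory in every chassis) can be expressed as a small pre-check. This is only a sketch of the rules as stated here: the Chassis record and its field names are hypothetical, and the real checks are performed by the system firmware, not by user code.

```python
from dataclasses import dataclass
from typing import List

MIN_MEMORY_GB = 4  # each chassis must have at least 4 GB installed


@dataclass
class Chassis:
    # Hypothetical record; field names are invented for this example.
    name: str
    core_count: int
    cpu_speed_mhz: int
    memory_gb: int


def merge_problems(primary: Chassis, secondaries: List[Chassis]) -> List[str]:
    """Return a list of reasons why the chassis would not merge (empty if none)."""
    problems = []
    for node in [primary, *secondaries]:
        if node.memory_gb < MIN_MEMORY_GB:
            problems.append(f"{node.name}: less than {MIN_MEMORY_GB} GB of memory installed")
    for node in secondaries:
        # Processor speed may differ; only the core count has to match.
        if node.core_count != primary.core_count:
            problems.append(
                f"{node.name}: core count {node.core_count} does not match "
                f"primary ({primary.core_count})"
            )
    return problems


if __name__ == "__main__":
    primary = Chassis("chassis 1", core_count=16, cpu_speed_mhz=2930, memory_gb=32)
    second = Chassis("chassis 2", core_count=8, cpu_speed_mhz=2400, memory_gb=2)
    for line in merge_problems(primary, [second]):
        print(line)
```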

Boot Standalone

In order to boot standalone:


-Cannot press ESC key to bypass merge as USB support is not available at merge
Press Blue Remind button or reconfigure partition information via RSA II interface to force standalone

Notes: Any chassis can boot into standalone mode, and there are several ways to do so. Because USB support is not available at merge time, you cannot press the ESC key to bypass the merge. Instead, press the blue Remind button, or reconfigure the partition information through the RSA II interface to force standalone mode.

Boot Standalone

Once merged, press ESC to force a reboot to standalone mode

Notes: Another way to boot into standalone mode is to wait until the chassis merge, and then press the ESC key to force a reboot to standalone mode.

Boot Standalone

Notes: This is a sample Scalable Complex Management screen showing how you would modify the settings to boot into standalone.

Partition Information Manage

The following non-selectable fields display information about the partition:
1. The Partition Count field displays the number of nodes in the partition.
2. The Partition Validity field displays the following status: Valid (which indicates the configuration is correct).
3. The Partition field displays one of the following statuses:
-Stopped: The partition is inactive, and the nodes can be reassigned to a partition.
-Started: The partition is active.
-Resetting: The configuration is resetting.
-Unknown: The partition contains unidentified port or chassis IDs.

In the Partition merge timeout minutes field, select the number of minutes POST waits for the scalable nodes to merge resources. The default value is 6 minutes. Allow at least 8 seconds for each GB of memory in the scalable partition.
In the On merge failure, attempt partial merge? field, select whether POST should attempt a partial merge if one error is detected during full merge. Yes is the default value.
In the Memory Mirroring? field, select whether memory mirroring is enabled in all nodes in the partition. Yes is the default value.
Click Save.

BIOS Changes

Processors listed by Node

Notes: One of the changes in the System x3950 M2 BIOS screens is that you can now see all the processors in a multi-node complex.

Summary

This topic has enabled you to:


-Define the terms scalable, node, complex and partition
-Describe the data and system management cabling of a scaled system
-Describe how to use the RSA II Web interface to configure a scalable partition on an x3950, x3950E and x3950M2

This topic discussed scalability.


Topic 6 Dynamic System Analysis

This topic discusses Dynamic System Analysis (DSA) and how it can be used to provide service on high-performance System x servers.

Objectives

By the end of this topic, you will be able to:


-Describe the functions of Dynamic System Analysis (DSA)
-List the data gathering capabilities of DSA
-Describe the DSA package offerings and installation requirements of each
-Describe how Preboot DSA operates on high-performance System x servers

This topic discusses the significant aspects of DSA and what you need to know in order to use it to solve problems.

Dynamic System Analysis (DSA) Overview

DSA is an information collection and analysis tool


Used to aid in the diagnosis of system problems
Creates a merged log that includes events from the OS, from the service processor event logs and from any devices that store event or error information
-DSA also collects product data from the hardware that is installed in the system where it is available

The information is collected into a compressed XML file. The file can be sent to IBM Support to assist in finding and resolving problems. In addition, DSA provides a local viewer and can display the contents of the XML file in a Web browser.

Here is a summary of the main characteristics of DSA.
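
As a rough illustration of what a merged log means, the sketch below interleaves events from several already time-ordered sources into one chronological list. It is not DSA's implementation; the Event fields, source names, and sample messages are all invented for the example.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    # Hypothetical event record; ISO 8601 timestamps sort correctly as strings.
    timestamp: str
    source: str
    message: str


def merge_logs(*logs: Iterable[Event]) -> List[Event]:
    """Merge several logs, each already sorted by time, into one sorted list."""
    return list(heapq.merge(*logs, key=lambda event: event.timestamp))


if __name__ == "__main__":
    os_log = [
        Event("2009-05-01T10:00:00", "OS", "service stopped"),
        Event("2009-05-01T10:05:00", "OS", "disk warning"),
    ]
    sp_log = [Event("2009-05-01T10:02:00", "service processor", "fan speed low")]
    for event in merge_logs(os_log, sp_log):
        print(event.timestamp, event.source, event.message)
```

Seeing the operating system and service processor events side by side in time order is what makes it easier to relate a hardware event to the software symptom that followed it.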

DSA Data Collection

DSA collects and analyzes the following:

Dynamic System Analysis (DSA) is a collection of probes that search the system for information. It can plug into drivers and firmware to pull logs and then interpret the information into a usable format. IPMI and RSA drivers must be installed prior to using DSA. If no RSA is present, DSA can pull information from the BMC as long as the IPMI mapping layer and driver are installed.

DSA Packages

DSA Portable Edition


-Runs from the command prompt on a supported system without altering any system files or system settings. It collects system information in sensitive customer environments with only temporary use of system resources.

DSA Installable Edition


-Provides a permanent installation of DSA onto a system. This installation shares a similar command prompt interface with the portable edition. With DSA Installable Edition, you can also get an UpdateXpress comparison analysis.

DSA Bootable Edition


-Bootable Edition executes and starts the collection process. The DSA collection process is completed, and an interactive menu is displayed.

Preboot DSA
-A blend of the diagnostic routines behind the F2 option and the DSA data gathering capabilities
There are several editions of IBM DSA.

The portable edition runs on a supported system without altering any system files or system settings. No files are installed on the system under investigation.

The installable edition installs directly on the system. This edition can be run directly from the console of the system under investigation.

DSA is supported on Windows and Linux operating systems. The readme file lists the specific NOS support information and the installation instructions for each NOS.

Running DSA with the default options creates an XML file that can be sent to IBM support. The XML file is stored locally on the system under investigation. Command line switches are used to run DSA so that it also creates the HTML files needed to read the results locally.

Preboot Diagnostics (DSA) is installed on an internal USB key in some IBM high-performance servers. Preboot DSA is activated by pressing F2 at the BIOS prompt screen, the same procedure that was used to enter Diagnostics on older systems.

DSA versions are available for download from the IBM support Web site.

Note: Linux Portable and Installable versions are for Linux / VMware. VMware ESX 3.0 users should run the Red Hat 3, 32-bit version of DSA.

Portable and Installable DSA Prerequisites

DSA will run without any additional software but may not include all of the available logs without the installation of device drivers
To read a BMC SEL, the system must have the following device drivers installed and running:
-IPMI Device Driver
-IPMI Mapping Layer
-Note: The installation sequence of these drivers is critical. They MUST be installed in the order shown above

To read the RSA event log, the RSA driver must be installed
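
The driver prerequisites above amount to a small decision rule: the BMC SEL is readable only when both IPMI components are present and were installed in the stated order, and the RSA event log is readable only when the RSA driver is present. The sketch below encodes that rule. The identifiers are placeholders chosen for this example and are not real package or driver names.

```python
from typing import List, Set

# Placeholder identifiers, not actual driver package names.
IPMI_DEVICE_DRIVER = "ipmi_device_driver"
IPMI_MAPPING_LAYER = "ipmi_mapping_layer"
RSA_DRIVER = "rsa_driver"


def readable_logs(install_order: List[str]) -> Set[str]:
    """Given drivers in the order they were installed, return the logs DSA could read."""
    logs = set()
    if IPMI_DEVICE_DRIVER in install_order and IPMI_MAPPING_LAYER in install_order:
        # The device driver must have been installed before the mapping layer.
        if install_order.index(IPMI_DEVICE_DRIVER) < install_order.index(IPMI_MAPPING_LAYER):
            logs.add("BMC SEL")
    if RSA_DRIVER in install_order:
        logs.add("RSA event log")
    return logs


if __name__ == "__main__":
    print(readable_logs([IPMI_DEVICE_DRIVER, IPMI_MAPPING_LAYER, RSA_DRIVER]))
    print(readable_logs([IPMI_MAPPING_LAYER, IPMI_DEVICE_DRIVER]))
```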

DSA Comparison Features

DSA has the ability to compare a report for a system against known firmware and driver levels that are available from IBM
This feature compares DSA outputs for firmware and device drivers with those found on the UpdateXpress CD-ROM set
To run the comparison tool, the relevant UpdateXpress CD-ROM must be in the system CD-ROM drive

DSA also has a difference checker


Compares two DSA outputs
Highlights changes
-Firmware versions
-Device driver levels
-Installed applications and new hardware configurations

DSA can compare the code levels on a system against the set of code levels on an UpdateXpress CD-ROM. This can be useful if code mismatches are suspected to be the cause of problems. DSA can also compare two DSA reports to track changes between two points in time. The difference checker highlights any significant changes to the system environment.
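
The idea behind a difference checker can be sketched in a few lines. The example below assumes each report has already been reduced to a flat mapping of item name to level; that structure, and the sample names and values, are invented for illustration. Real DSA output is an XML file, and this is not how DSA itself performs the comparison.

```python
from typing import Dict, Optional, Tuple

Report = Dict[str, str]  # hypothetical flattened report: item name -> level


def diff_reports(old: Report, new: Report) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every item whose level changed, was added (old is None), or was removed (new is None)."""
    changes = {}
    for item in sorted(old.keys() | new.keys()):
        before, after = old.get(item), new.get(item)
        if before != after:
            changes[item] = (before, after)
    return changes


if __name__ == "__main__":
    earlier = {"BIOS": "1.0", "network driver": "2.1", "backup application": "5.0"}
    later = {"BIOS": "1.1", "network driver": "2.1", "new adapter firmware": "3.0"}
    for item, (before, after) in diff_reports(earlier, later).items():
        print(f"{item}: {before} -> {after}")
```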

Preboot DSA

Incorporated in x3850 M2 and x3950 M2


-Accessed by pressing F2 at boot time
Options to run diagnostics or enter into DSA data gathering

Preboot DSA is integrated into the System x3850 M2 and x3950 M2. It is accessed via the F2 key sequence when the IBM splash screen loads.
Preboot DSA can be accessed if the system reaches system state 4 (completion of POST).

Preboot DSA - Capabilities

System Data Collection Providers
-System Overview: Mfr, version, prod name, serial no, uuid, critical details
-Network Settings: Hostname, physical network port info, global settings
-Hardware Inventory: Processor, memory, disk info, monitor info, system card info, devices scsi, usb, optical, other
-PCI Information: Devices, bridges, slots
-Firmware/VPD: Network, SP, BIOS, other vpd
-SP Configurations: Settings general, TCP/IP, SNMP, dial-out, dial-in
-LSI Controller: Controller info, physical & logical drive info
-System Management: Data, logs, Light Path LED settings
-BIST results: RSA, IPMI
-Event logs: ASM, IPMI
-Merged devices
-Memory diagnostics log
-DSA Error log

Diagnostic Tests
-Memory Test (runs in standalone mode)
-BMC I2C Test
-Check Point Panel Test
-Optical Test: Read Error Test, Self Test, Verify Media Installed
-RSA Restart Test
-TPM Test
-Ethernet Test: Control Registers, EEPROM, Internal Memory, Interrupt, LEDs, MAC Loopback, PHY Loopback, MII Registers
-Stress Tests: CPU Stress Test, Memory Stress Test
-HDD Test

Here is a summary of the capabilities of Preboot DSA.

Initiating a Preboot DSA Session

Entering the Preboot DSA environment can take several minutes

Preboot DSA can take up to 10 minutes to load.

Memory Tests

Quick Memory Test main menu selection screen

Notes:
By default, you are taken to the Memory Test main menu screen. The following tests and options can be selected:
-Quick Memory test
-Full Memory test
-Change Options
To exit the memory test and enter DSA from here, select Quit to DSA.

Entering Preboot DSA

Select Quit to DSA to enter DSA main menu selection screen

By default, you are taken to the diagnostic menu screen. To run DSA from here, select Quit to DSA.

Preboot DSA Command Line

Preboot DSA offers several options in a command line menu system

Preboot DSA offers a command menu from which you make a selection:
GUI - takes you to a graphical environment
CMD - offers various commands as options
COPY - copies the DSA results to removable media
EXIT - exits the program
HELP - is also available

Preboot DSA Command menu:
1. COLLECT collects system information
2. VIEW displays the collected data on a local console in a text viewer
3. ENUMTESTS lists the available tests
4. EXECTEST menu used to select a test to execute
5. GETEXTENDEDRESULTS retrieves and displays diagnostic results
6. TRANSFER sends the collected data to IBM support
7. QUIT exits Preboot DSA

The copy command will be used most by customers and the field community to capture all the logs to a USB key and then have those logs emailed to IBM support for analysis. In the lab session of this course, you will run this command to capture the logs and then analyze the data.

Preboot DSA Graphical Interface

Select GUI to enter the Graphical User Menu

DSA Diagnostic Tests


The Preboot DSA graphical interface offers clickable items for system diagnostics, information gathering and help, as well as an exit button.

Graphical Diagnostics

Select Diagnostics from the main menu to load the diagnostic tests page

From this page, you can select and run a variety of diagnostic tests on system hardware.

System Information

Select System Information from the main menu to run DSA


DSA collects a wide range of information from hardware components

Preboot DSA provides the following data in System Information:
-System configuration
-Installed applications and hot fixes
-Device drivers and system services
-Network interfaces and settings
-Hardware inventory including PCI information
-Vital Product Data and BIOS and firmware information
-Drive health information
-LSI, RAID controller configuration
-Event logs for ServeRAID controller and service processors

Scaled System Information Gathering

The primary node's Preboot Diagnostic (DSA) gathers and displays information for the systems that are in the scaled partition.

In a scaled system configuration, Preboot Diagnostic (DSA) on the primary node gathers system information for all the scaled systems in the partition.

Two Node Graphical Diagnostics

In a scaled configuration, the primary node's Preboot Diagnostic tests the systems that are scaled.

Two node test

The Preboot Diagnostic on the primary node will test the scaled systems. Pay close attention to the Ethernet test in the screen shot above.

DSA Automated Report Submission

Preboot DSA has an option to send data to IBM


-This option requires the following:
Eth0 must be wired to the client's network on an active port
DHCP lease to Eth0 from an active DHCP server
Client's approval and network permissions to use the FTP protocol to transfer the files to IBM support (possible firewall issues)

Preboot DSA can automatically transmit the DSA data to IBM support for analysis. Here is a list of requirements that MUST be met in order for this process to be successful.
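
Because every one of these requirements must hold before a transfer can succeed, it can help to treat them as a checklist. The following sketch is only a way of organizing that checklist; the class and field names are invented, and the values would be filled in by hand from what is observed on site rather than detected automatically.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransferPrereqs:
    # Hypothetical checklist; each value is entered by the person on site.
    eth0_on_active_port: bool       # Eth0 wired to the client's network on an active port
    eth0_has_dhcp_lease: bool       # lease obtained from an active DHCP server
    ftp_approved_and_allowed: bool  # client approval and network permission for FTP


def unmet_prerequisites(prereqs: TransferPrereqs) -> List[str]:
    """List the requirements that still block automated report submission."""
    unmet = []
    if not prereqs.eth0_on_active_port:
        unmet.append("Eth0 is not wired to an active port on the client's network")
    if not prereqs.eth0_has_dhcp_lease:
        unmet.append("Eth0 has no DHCP lease from an active DHCP server")
    if not prereqs.ftp_approved_and_allowed:
        unmet.append("FTP transfer is not approved or not permitted (check firewalls)")
    return unmet


if __name__ == "__main__":
    status = TransferPrereqs(True, True, False)
    for item in unmet_prerequisites(status):
        print("Blocked:", item)
```

If any item is reported, copying the results to removable media is the alternative way to get the data to IBM support.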

Summary

This topic has enabled you to:


-Describe the functions of Dynamic System Analysis (DSA)
-List the data gathering capabilities of DSA
-Describe the DSA package offerings and installation requirements of each
-Describe how Preboot DSA operates on high-performance System x servers

Almost all System x and xSeries servers support some of the more advanced technologies that IBM has designed and developed. This topic discusses these service processor technologies and describes the implications of working with them in the field.


Topic 7 Problem Solving

This topic discusses how to solve problems on the System x3850 M2 and x3950 M2.

Topic Objectives

By the end of this topic, you will be able to:


-Identify the tools available for problem solving
-Identify the sequence in which to use the tools
-Identify when the tools can be used and what you can expect to get from the tools

This topic deals with information gathering and analysis. Without information, you cannot understand what is wrong and you cannot apply solutions.

Service and Support Tools

The following tools are available for the x3950 M2:


-PDSG
-Light Path Diagnostics
-Beep codes
-POST codes
-BMC
-RSA*
-Preboot DSA*
-Adapter BIOS messages
-DSA installable/portable*
* Indicates the main focus for our gathering process

The list of tools available for this system is quite large. The most important aspect of using the tools is to recognize which tools you should place your trust in. You also need to understand when and how to use them. The following pages in this topic explain those interactions.

The Six System States

The six system states are used as the basis for problem analysis and repair
-Each state offers new information gathering and analysis tools
-Each state builds on the last state for tool availability

System State
1. There is no AC power
2. There is AC power but no DC output
3. There is AC and DC power but the system fails to complete POST
4. There is AC and DC power, the system completes POST but the NOS fails to start loading
5. There is AC and DC power, the system completes POST but the NOS fails to complete loading
6. There is AC and DC power, the system completes POST and the NOS completes loading but stops during operation

Data Gathering
Visual; BMC; RSA; Light path; Checkpoint codes; F1 and F2 (possibly); Beep codes; Adapter BIOS msgs (Adaptec, LSI, etc.); ServeRAID Manager; MegaRAID Storage Manager; F2 Preboot Diagnostics (DSA); NOS boot messages; Blue screen; Safe mode; DSA; NOS event logs

Data Analysis
PDSG/HMM; SvcCon, SMBridge; RSA event log; PDSG; RETAIN tips; IBM support Web site; F2 Preboot Diagnostics (DSA); PDSG; RETAIN tips; F2 Preboot Diagnostics (DSA); NOS vendor messages; DSA

All IBM System x servers start in a uniform manner. All have a common set of interfaces that indicate how far the server has progressed in the power-up sequence. This chart shows the possible information gathering tools on the left and the possible information analysis tools on the right. All servers are supported by documentation, which forms part of the tool set for both information gathering and information analysis. For example, a Problem Determination and Service Guide (PDSG) contains lists of errors that may occur during POST (information gathering) but also contains probable causes of each error (information analysis).

It is also important to realize from the chart that each state builds on the previous state. For example, in system state two the most important tool is the RSA, but the BMC and Light Path are also available, along with the PDSG and visual symptoms from state one. Each state therefore adds to the data gathering tools and resources of the states before it.

It is important to stress that not all information sources are available in all system states. This page summarizes what tools are available and when.

Service and Support Tools

What To Expect On A Service Engagement


-Customer
Depending on system state, RSA logs, Preboot DSA logs, DSA logs

-SSR
Preboot DSA and/or RSA logs and diagnostic results

-Remote Support Agent
Analysis of DSA/RSA logs for problem isolation and FRU action plans from the analysis by RSA
Analysis of driver levels, firmware versions, installed service packs
Analysis of installed components to meet ServerProven requirements

Here is a summary of what you can expect to see when engaged on a service call with this system. Note. Available data sources will depend on the system state.

RSA

RSA II is standard in all x3950, x3850 M2 and x3950 M2 systems


Light Path Diagnostics is driven by the BMC; the BMC reports to the RSA
RSA is the primary hardware tool as it interprets the BMC logs into action plans
If RSA does not report a problem that the BMC sees, then that is a defect which needs to be addressed
-It is still important to view and capture all the sources of data input and then compare that input
-If everything is working as designed, the RSA will have the source of the problem and the plan of action

The RSA adapter is alive from system state 1 to system state 6 and is available to log into without any interruption to the customer or OS environment. As you will see in the following pages, DSA in all versions from Preboot to installable will capture the RSA logs and data into its logs to report findings. The RSA in this system is similar to all previous systems. Logon and information capture is the same as before.

Data Gathering Sources

Three data gathering sources available:


-RSA with the ability to log on from state 1-6 and view and save logs for transmittal
-Preboot DSA with the ability to gather data from system state 3-5
Captures RSA data plus more

-DSA for system state six


In all instances of DSA the RSA data, BMC data plus more is collected if the IPMI drivers are installed
DSA installable also captures all driver versions, applications info, services running and OS logs

For DSA installable and portable, the customer must install the drivers prior to running DSA for RSA data.
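
The relationship between system state and data gathering source on this page can be written down as a small lookup. The state ranges below are taken directly from the bullets on this page; the dictionary and function names are illustrative only.

```python
from typing import List

# System states in which each source can be used, as listed on this page.
DATA_SOURCES = {
    "RSA": range(1, 7),                         # states 1-6
    "Preboot DSA": range(3, 6),                 # states 3-5
    "DSA (installable/portable)": range(6, 7),  # state 6
}


def sources_for_state(state: int) -> List[str]:
    """Return the data gathering sources available in the given system state (1-6)."""
    if state not in range(1, 7):
        raise ValueError("system state must be between 1 and 6")
    return [name for name, states in DATA_SOURCES.items() if state in states]


if __name__ == "__main__":
    for state in range(1, 7):
        print(state, sources_for_state(state))
```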

Concerns and Issues

Preboot DSA has an option to send data to IBM


-This option REQUIRES the following:
Eth0 must be wired to the client's network on an active port
DHCP lease to Eth0 from an active DHCP server
Client's approval and network permissions to use the FTP protocol to transfer the files to IBM support (possible firewall issues)

Preboot DSA can automatically transmit the DSA data to IBM support for analysis. Here is a list of requirements that MUST be met in order for this process to be successful.

SVCCon and SMBridge

Both tools are available for the x460 and x3950 M2


-The only use for these tools is for a client or SSR to clear the BMC log without having to change the system state

Although the BMC gathers an event log, the system has an RSA II as standard, so the RSA event log is the preferred log. However, the system information light is illuminated if the BMC log reaches 75% full. Following any service activity, use either SVCCon or SMBridge to clear the BMC log in readiness for any future problems and log reporting.
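
The 75% rule is easy to reason about with a quick calculation. The sketch below assumes the log's fullness can be expressed as used entries out of a total capacity; that representation and the function name are assumptions made for the example, not a description of how the BMC measures it.

```python
INFO_LIGHT_THRESHOLD = 0.75  # the information light comes on when the BMC log reaches 75% full


def info_light_expected(entries_used: int, capacity: int) -> bool:
    """True if a BMC log with this many used entries would be at or above 75% full."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return entries_used / capacity >= INFO_LIGHT_THRESHOLD


if __name__ == "__main__":
    # Hypothetical log sizes, chosen only to show both outcomes.
    print(info_light_expected(300, 512))
    print(info_light_expected(384, 512))
```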

CP Codes on Light Path Card

The client will now see the CP (checkpoint) codes from the Light Path Diagnostic panel
-CP codes are not documented in the PDSG
Explain to the client that this is a service only display used only by support personnel

-The BMC records CP codes and the RSA displays them in the log
This only occurs if the system is connected to AC for a minimum of two minutes before the power on button is pressed (to give the BMC/RSA2 time to boot/communicate)

When a system is connected to AC for the first time, the BMC takes up to two minutes to initialize internally. Until this is complete, the BMC cannot communicate with the RSA, and the RSA cannot capture any power-on failures or CP codes.

RETAIN

As with any new product announcement it is extremely important to search/query RETAIN for any tips that match the symptoms displayed
-In some cases, not all features are available from the initial product release but are added to the system after product GA (General Availability) date.
Published capabilities are contained in the announcement letters
Review RETAIN for those features that are not enabled yet

New products, as they are released, may not have all of their possible features available on the GA date. The announcement letter for the product lists all of the features that are supported at GA, as well as a prediction of when new features will be forthcoming. The RETAIN tip database contains up-to-date information on the status of new features in the product.

Topic Summary

This topic has enabled you to:


-Identify the tools available for problem solving
-Identify the sequence in which to use the tools
-Identify when the tools can be used and what you can expect to get from the tools

This topic has identified the support tools available on the System x3850 M2 and x3950 M2.


Topic 8 Support References

This topic discusses where to go for help once this course is finished.

Topic Objectives

At the end of this topic, you will be able to:


-Identify documentation resources available to support the servers discussed in this class
-Identify the support web sites for the units and what they offer

Support information can take many forms. Here, we will discuss the key information sources for these systems and how to access them.

Documentation

System documentation (Users guide, installation guide, etc.)


-Useful for confirming shipping group contents (missing parts, etc.) and initial customer setup

Problem Determination and Service Guide (PDSG)


-Available electronically (Adobe Acrobat PDF format) from the IBM support web site or on the Service Update CD-ROM
-Primary support document for diagnostics and troubleshooting

The system documentation, which ships with every new system, may also prove useful for verifying the basic setup of the server or I/O expansion drawer. As many of the components of modern servers are customer replaceable units (CRUs) as well as FRUs, some setup instructions are contained in the system manuals. If you are called to a newly installed server, you will want to verify that the customer has, in fact, correctly installed everything.

The Problem Determination and Service Guide (PDSG), formerly known as the Hardware Maintenance Manual (HMM), is the primary reference document for the systems covered in this course. All PDSG/HMMs are now available electronically in Adobe Acrobat Portable Document Format (PDF). The PDSG contains all the disassembly and reassembly steps, beep codes and error descriptions to assist you in isolating a failed FRU or FRUs. You will need Adobe Acrobat Reader version 4 or higher to view the contents properly, as this is the minimum supported revision of the reader.

Server Support Web Site

Central Support site for all products


http://www.ibm.com/jct01004c/systems/support/supportsite.wss/brandmain?brandind=5000008

IBM has launched a new central support site for all products at the address listed above. Web addresses change from time to time; when this one changes, IBM normally redirects the old address to the new one for at least several months after the old site closes. If you bookmark this site in your browser, be sure to update your bookmarks as site addresses change. The navigation bar on the left of the site lists the main topics available.

Software and Device Drivers

Central site for downloading software files


http://www.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR4JTS2T&brandind=5000020

The Software and Device Drivers - IBM System x page provides quick access to the wide range of firmware updates, as well as the software and device drivers for the supported operating systems, for each System x server, BladeCenter, or storage enclosure. If you are an authorized servicer, there is also a dealer support site with a collection of the more popular links for each product (e.g., https://www304.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERVOPTN&brandind=5000008#x460 ).

ServerProven Web Site

Reference site for device compatibility


http://www.ibm.com/jct09002c/isv/eserver/serverproven/index.html

IBM extensively tests third-party hardware and software and, in many cases, approves them for use with System x servers, but not all devices or combinations of devices are tested or supported. If you are working with a server that contains third-party devices, you can check for compatibility here. The site may also contain information not found in the primary documentation that can help you isolate a fault.

System x Support Repository

Server support site with information and photographs


https://www.ibm.com/systems/support/reflib/

This site is the central repository for a collection of information and photographs of many IBM System x, BladeCenter, eServer, and xSeries servers, intended for support personnel. (Note: This site and the one described next do NOT cover the full list of IBM products; their content was largely put together from the documents for which the education group provided updated training materials.)

IBM Server - BIOS Simulators

Reference site for BIOS simulators


https://www.ibm.com/systems/support/reflib/simulators/

Support personnel often do not have immediate physical access to the machine that someone is asking for help with. These pages contain one of the ship-level BIOS files for many of the System x, BladeCenter, eServer, and xSeries machines, together with a simulator that shows how each machine can be configured. The simulator shows screens similar to the ones that the customer would use to configure their server after pressing F1 during the system boot. (Note: The BIOS level may differ between the simulator version and the one installed on the customer's machine.)

We have also included an Options simulator for the BladeCenter management module and numerous adapters. (Note: As of this writing, several servers are still missing from the support matrix.)

Configuration Tools Website

This site contains links to COG, xRef and other helpful configuration tools
http://www.ibm.com/systems/x/hardware/configtools.html

This Web site contains links to, and descriptions of, several configuration tools. (Note: While these pages are intended for pre-sale support, they are often useful for Business Partners and in a service/post-sale environment.)

The COG contains general information about IBM products and supported options for currently shipping equipment (updated each month).

The xRef documents provide a brief technical overview of each of the servers in the System x/BladeCenter, IntelliStation, and withdrawn systems families (i.e., past servers are removed from the original xRefs and made available in the withdrawn systems xRef).

Other configuration tools deal with BladeCenter interoperability, rack configuration, and power/equipment sizing.

Summary

This topic has enabled you to:


-Identify documentation resources available to support the servers discussed in this class
-Identify the support web sites for the units and what they offer

This topic has discussed several helpful support sites for configuring, maintaining, and troubleshooting IBM servers.

Course Summary

This course has enabled you to:


-Identify the serviceability features of System x high-performance servers
-Describe the advanced technologies used in System x servers and their service implications
-Describe the management characteristics of System x servers
-Perform a series of setup, configuration and troubleshooting tasks on System x servers and associated peripherals

This course is now complete. Thank you for attending. System x and BladeCenter Service and Support Education hopes you have enjoyed it and found it both interesting and valuable to your job. If you have any comments or suggestions regarding this education, please let your instructor know and s/he will pass them on to the education development teams. We ALWAYS act on comments and suggestions as we constantly seek to improve our education offerings.
