HP Advanced Memory Error Detection Technology

Technology brief
Introduction
SDRAM technology
Memory errors
  Traditional memory error classifications
  Correctable and uncorrectable errors
Why memory errors are increasing
  Server memory capacity is increasing
  DRAM technology is changing
HP Advanced Memory Error Detection Technology
  Enhancements
  Advantages
  HP ProLiant servers supported
Conclusion
For more information
Introduction
Across the industry, memory errors have increased significantly due to the growth in overall server memory capacity and the increase in the number of bits per DRAM chip. Uncorrectable memory errors can cause applications and operating systems to crash, so they are costly in terms of downtime and repairs.
Over the past 18 years, HP has introduced several memory technology innovations to ensure data reliability and protection. In 1999, we introduced the Pre-Failure Alert notification system to monitor and predict potential problems with critical components such as system memory modules (DIMMs). The notification system sends an alert to a system administrator when a DIMM exceeds a predefined threshold for correctable memory errors. This lets the administrator schedule server maintenance to replace a DIMM that may fail, avoiding unexpected interruption of business operations.
In the ProLiant System ROM upgrade (version May 2011 or later), we have enhanced protection with HP Advanced Memory Error Detection Technology. This innovation seeks out specific defects that either cause performance degradation or significantly increase the probability of an uncorrectable (non-recoverable) memory condition. By improving the prediction of non-recoverable memory events, this technology prevents unnecessary DIMM replacements and increases server uptime.
This paper details the enhancements in and advantages of HP Advanced Memory Error Detection Technology. It begins with a description of Synchronous DRAM (SDRAM) technology and memory errors, and it explains why memory errors are occurring more frequently.
SDRAM technology
A standard Error Correction Code (ECC) DDR3 DIMM delivers 72 bits at a time to a memory bus. The 72-bit data block (a 64-bit data word and 8 bits of ECC) is called a rank. As shown in Figure 1, one rank consists of data from nine DRAM chips that provide 8 bits each (called x8 or "by 8" chips) or 18 DRAM chips that provide 4 bits each (x4 chips). DIMMs are classified as single-rank, dual-rank, or quad-rank (not shown). Quad-rank DIMMs can have 72 x4 DRAM chips or 36 x8 DRAM chips, including ECC chips. Memory manufacturers use multiple ranks to increase the capacity of DIMMs per memory channel. Today, a quad-rank DDR3 DIMM with 4 Gb DRAM chips has a usable capacity of 32 GB.
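The chip counts and the 32 GB figure above follow from simple arithmetic. The short Python sketch below reproduces them; the function names are illustrative only and are not part of any HP tool.

```python
# Rank and capacity arithmetic for an ECC DIMM (illustrative helper names).
ECC_WORD_BITS = 72    # 64 data bits + 8 ECC bits delivered per transfer
DATA_WORD_BITS = 64   # bits that count toward usable capacity


def chips_per_rank(chip_width_bits):
    """Number of DRAM chips needed to supply one 72-bit block."""
    return ECC_WORD_BITS // chip_width_bits


def dimm_chip_count(ranks, chip_width_bits):
    """Total DRAM chips on the DIMM, including the ECC chips."""
    return ranks * chips_per_rank(chip_width_bits)


def usable_capacity_gb(ranks, chip_width_bits, chip_density_gbit):
    """Usable capacity in GB: only the chips that hold the 64 data bits count."""
    data_chips_per_rank = DATA_WORD_BITS // chip_width_bits
    total_gbit = ranks * data_chips_per_rank * chip_density_gbit
    return total_gbit // 8


print(chips_per_rank(8))             # 9 chips per rank with x8 chips
print(chips_per_rank(4))             # 18 chips per rank with x4 chips
print(dimm_chip_count(4, 4))         # 72 chips on a quad-rank x4 DIMM
print(dimm_chip_count(4, 8))         # 36 chips on a quad-rank x8 DIMM
print(usable_capacity_gb(4, 4, 4))   # 32 GB from 4 Gb x4 chips, quad-rank
```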
Figure 1: Single-sided and double-sided SDRAM DIMMs and corresponding DIMM rank
Each DDR3 DRAM chip contains billions of memory cells ordered in eight banks (arrays) of rows and columns. Using a 2 Gb x4 DRAM chip as an example, each bank contains 2^10 (1,024) rows and 2^16 (65,536) columns, totaling more than 256 million cells per bank and more than 2 billion cells per chip. Each memory cell contains a circuit with a transistor and a capacitor that stores an electrical charge. The charge state of the capacitor represents binary information: a 1 or 0 data bit. Capacitors can only store a charge briefly, so they must be recharged (refreshed) thousands of times per second. The operating voltage of the DIMM determines the level of the electrical charge.
As shown in the read operation in Figure 2, the memory controller sends the address signals (bank, row, and column) that specify the location of the target DRAM cell. In the designated bank, the row decoder activates the row (word line) and the column decoder activates the column (bit line). Next, the capacitor in the target cell sends its stored charge through the bit line to the sense amplifier. Because the stored charges are very small, the sense amplifiers detect and amplify each charge before sending the data to the I/O buffer. The sense amplifiers are also responsible for restoring capacitors to their original state after reading the data. The 4 or 8 bits of data, called a data symbol, then go through the output data pins to the memory bus.
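The cell counts quoted for the 2 Gb x4 example can be checked with the sketch below. It assumes that each row and column location in an x4 chip holds 4 cells (one per data pin), which is what makes the row and column figures add up to the stated totals. The flat address layout used in split_address is a hypothetical encoding chosen for illustration, not the mapping used by any particular memory controller.

```python
# Cell-count arithmetic for the 2 Gb x4 example, plus a toy address split.
BANKS = 8
ROW_BITS = 10               # 2^10 = 1,024 rows per bank
COLUMN_BITS = 16            # 2^16 = 65,536 columns per bank
BITS_PER_LOCATION = 4       # assumption: an x4 chip stores 4 cells per location

ROWS = 2 ** ROW_BITS
COLUMNS = 2 ** COLUMN_BITS

cells_per_bank = ROWS * COLUMNS * BITS_PER_LOCATION
cells_per_chip = cells_per_bank * BANKS

print(cells_per_bank)   # 268435456  (more than 256 million)
print(cells_per_chip)   # 2147483648 (more than 2 billion, that is, 2 Gb)


def split_address(location):
    """Split a flat location index into (bank, row, column).

    Hypothetical layout: bank in the top 3 bits, then row, then column.
    """
    column = location & (COLUMNS - 1)
    row = (location >> COLUMN_BITS) & (ROWS - 1)
    bank = (location >> (COLUMN_BITS + ROW_BITS)) & (BANKS - 1)
    return bank, row, column


example = (5 << 26) | (3 << 16) | 7
print(split_address(example))   # (5, 3, 7)
```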
Figure 2: Representation of DIMM, chip, bank, cell hierarchy per rank
Memory errors
Several events or conditions can cause errors in individual memory cells, in multiple cells in different rows (a column failure), or in multiple cells in different columns (a row failure). For example, a defect in a word line or bit line can prevent part of a row or column from receiving a signal, resulting in a row or column failure. A row or column failure can also result from a failure in the row decoder or column decoder circuitry. A defect in a sense amplifier can also cause a column failure. Additionally, several phenomena, called noise sources, can degrade signals en route to the sense amplifiers.
The industry has traditionally classified memory errors by the number of bits affected and the causes of the errors. But for systems with large memory footprints, it is more meaningful to classify errors as correctable or uncorrectable. The following sections explain this distinction.
Traditional memory error classifications

Memory errors are commonly classified according to the number of bits affected in a 64-bit data word. An error in one bit of a data word is a single-bit error. An error in more than one bit of a data word is a multi-bit error. Memory errors are also classified as hard or soft depending on what caused them. DRAM defects, bad solder joints, and data pin issues cause hard errors so that the device consistently returns incorrect results. For example, a stuck memory cell returns the same bit value, even when a different bit is written to it. In contrast, soft errors are transient and non-repeating. They can be caused by an electrical disturbance inside the memory array or on the memory interface.
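As a rough illustration of these classifications, the Python sketch below counts the flipped bits in a 64-bit data word and models a stuck cell as an example of a hard error. The class and function names are hypothetical and exist only for this example.

```python
# Toy model of bit-count classification and a stuck-cell hard error.
MASK64 = (1 << 64) - 1


def classify_bit_error(written, read_back):
    """Classify an error by the number of bits that differ in a 64-bit word."""
    flipped = bin((written ^ read_back) & MASK64).count("1")
    if flipped == 0:
        return "no error"
    return "single-bit error" if flipped == 1 else "multi-bit error"


class StuckCellWord:
    """A 64-bit word in which one cell always returns the same value."""

    def __init__(self, stuck_bit, stuck_value):
        self.stuck_bit = stuck_bit
        self.stuck_value = stuck_value
        self._value = 0

    def write(self, value):
        self._value = value & MASK64

    def read(self):
        # The stuck cell ignores what was written to it.
        if self.stuck_value:
            return self._value | (1 << self.stuck_bit)
        return self._value & ~(1 << self.stuck_bit)


word = StuckCellWord(stuck_bit=3, stuck_value=0)

word.write(0xFF)
print(classify_bit_error(0xFF, word.read()))   # single-bit error

word.write(0x00)
print(classify_bit_error(0x00, word.read()))   # no error
```

The stuck cell produces an error every time the written data disagrees with the stuck value, which is what makes a hard error repeatable. A soft error, by contrast, would appear once and not recur when the same location is rewritten.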
Correctable and uncorrectable errors

The outcome of a memory error depends on whether it can be corrected. Some row failures and column failures are correctable depending on both the DIMM configuration (x4 or x8) and the error correction capability of the system. Standard ECC can correct a single-bit error within a single x4 or x8 DRAM chip, but it can only detect a multi-bit error.
Only x4 DRAM chips allow the use of advanced error-correction control technologies (footnote 1) in server environments. These technologies can detect and correct multi-bit failures in a single x4 DRAM chip. Their algorithms can correct any single-bit or multi-bit error within a 4-bit symbol, also known as a symbol error. This allows recovery from a x4 DRAM chip failure. The algorithms can also detect two symbol errors across two x4 DRAM chips. Intel Xeon- and AMD Opteron-based systems use advanced error-correction control technologies to correct one 4-bit symbol error and detect two symbol errors (single-symbol correct, double-symbol detect). If errors occur in more than two symbols, the technologies may not be able to detect them.
Another technology known as Double Device Data Correction (DDDC) can correct errors in two symbols and detect errors in three symbols (double-symbol correct, triple-symbol detect). This means that if one DRAM chip fails but the DIMM remains in operation, DDDC will continue to work even if a second chip has an error or fails. Intel Xeon systems support DDDC in lockstep memory mode. In lockstep mode, two channels operate as a single channel so that each write and read operation moves a cache line two channels wide. Both channels split the cache line to provide 2x 8-bit error detection and 8-bit error correction within a single DRAM.
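The symbol-level view can be sketched as follows. This Python example does not implement an actual error-correcting code. It only counts how many symbols differ between an expected and an observed 72-bit block and reports the outcome that the text describes for a given correct and detect capability. It assumes, for illustration, that symbol i carries the bits from DRAM chip i; the function names and the outcome strings are hypothetical.

```python
# Outcome model for symbol-based error correction (not a real ECC decoder).

def bad_symbols(expected, observed, symbol_bits=4, word_bits=72):
    """Return the indexes of the symbols that differ between two blocks."""
    diff = expected ^ observed
    mask = (1 << symbol_bits) - 1
    return [
        i for i in range(word_bits // symbol_bits)
        if (diff >> (i * symbol_bits)) & mask
    ]


def outcome(expected, observed, symbol_bits=4, correct=1, detect=2):
    """Map the number of bad symbols to the outcome described in the text."""
    count = len(bad_symbols(expected, observed, symbol_bits))
    if count == 0:
        return "no error"
    if count <= correct:
        return "corrected"
    if count <= detect:
        return "detected, not corrected"
    return "may go undetected"


# Single-symbol correct, double-symbol detect (the default parameters).
print(outcome(0, 0xF))     # corrected: all 4 bits of one symbol are wrong
print(outcome(0, 0x11))    # detected, not corrected: two symbols affected
print(outcome(0, 0x111))   # may go undetected: three symbols affected

# Double-symbol correct, triple-symbol detect, as described for DDDC.
print(outcome(0, 0x11, correct=2, detect=3))    # corrected
print(outcome(0, 0x111, correct=2, detect=3))   # detected, not corrected
```

Counting symbols rather than bits is the key point: a complete x4 chip failure corrupts at most one 4-bit symbol per block, so it stays within the single-symbol correction limit.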
Why memory errors are increasing

Two trends increase the likelihood of memory errors in servers: server memory capacity is increasing, and DRAM technology is changing to meet the demand for higher DIMM storage capacity.
Server memory capacity is increasing

The growth of high-performance computing (HPC) and virtualized IT environments is driving operating systems to address more memory. This is causing manufacturers to expand the memory capacity of servers. In the last 5 years, the average memory capacity per server has grown by more than 500%, from 5.6 GB to 33 GB per server across all HP ProLiant server lines.
Footnote 1: Intel Single Device Data Correction and IBM Chipkill
Maximum server memory capacity is also increasing to meet the demands of HPC and virtualization. For example, an HP ProLiant DL580 G7 server fully populated with 32 GB DIMMs contains 2 TB of system memory, which translates to 18 trillion memory cells.
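The cell count can be checked with a few lines of Python. The sketch assumes binary units (1 TB taken as 2^40 bytes) and counts one cell per stored bit. The variable names are illustrative.

```python
# Rough cell-count arithmetic for a server with 2 TB of memory.
BYTES_PER_TB = 2 ** 40      # assumption: binary terabyte
GB_PER_TB = 1024

dimm_count = 2 * GB_PER_TB // 32          # number of 32 GB DIMMs
data_cells = 2 * BYTES_PER_TB * 8         # one cell per data bit
cells_with_ecc = data_cells * 72 // 64    # add 8 ECC bits per 64 data bits

print(dimm_count)       # 64
print(data_cells)       # 17592186044416 (about 17.6 trillion)
print(cells_with_ecc)   # 19791209299968 (about 19.8 trillion)
```

The data cells alone come to roughly 17.6 trillion, which is in line with the figure of 18 trillion quoted above. Including the ECC chips raises the count to nearly 20 trillion.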
DRAM technology is changing

Memory manufacturers increase DIMM storage capacity by decreasing DRAM feature size (increasing chip density). As DRAM cells become smaller, manufacturers lower the operating voltage to increase the memory speed and decrease power use. Memory manufacturers have lowered the operating voltage for standard DIMMs from 2.5 V to 1.8 V to 1.5 V, and eventually to 1.25 V. Smaller feature sizes and higher operating frequencies equate to fewer stored charges in the capacitors. This smaller number of stored charges reduces tolerance to noise sources and makes it more difficult for sense amplifiers to interpret the bit value of a capacitor's charge accurately. Also, reducing the number of stored charges makes it easier to change the state of a cell. This, combined with higher bit density, increases the number of bits that may be affected by an ionizing event, such as an alpha particle.
HP Advanced Memory Error Detection Technology

Because of higher memory error frequency, some server administrators are unnecessarily shutting down servers to replace DIMMs that experience correctable errors. The best way to prevent unnecessary DIMM replacements is to filter out superfluous errors and identify critical errors that can lead to a shutdown. That is the goal of HP Advanced Memory Error Detection Technology.
Enhancements
The HP Advanced Memory Error Detection Technology algorithm analyzes multiple parameters of correctable memory error events and intelligently detects when the system is at increased probability of a non-recoverable, uncorrectable memory error condition. The algorithm performs calculations on 4-bit and 8-bit symbols instead of analyzing individual bits. It tracks multiple parameters of correctable memory errors and, after considering several properties of the DIMM, it decides when to notify the administrator to replace the DIMM.
The algorithm does not prematurely alert customers to replace DIMMs based on single-bit errors because they negligibly increase the probability of an uncorrectable error. The algorithm considers different parameters of correctable memory errors for x8 DIMMs than for x4 DIMMs, because advanced error-correction control technologies cannot protect x8 DIMMs against a complete DRAM chip failure. The algorithm also detects bank failures for x4 or x8 DIMMs because these failures may increase the probability of an uncorrectable memory error.
The HP iLO3 management processor sends an alert to the server's administrator when a DIMM exceeds a predefined threshold for correctable memory errors or experiences an uncorrectable memory error. The administrator can view a log of correctable and uncorrectable memory error events through the Integrated Management Log (IML) as shown in Figures 3A and 3B. The administrator can access the IML using a supported browser, even when the server is off. The administrator's ability to view the event log when the server is off can be beneficial when troubleshooting remote host server problems.
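The filtering idea in this section can be illustrated with a toy Python sketch. This is not the HP algorithm, whose parameters and thresholds are not published in this brief. The event fields, the rules, and both threshold values below are hypothetical and chosen only to show how ignoring isolated single-bit errors differs from counting every correctable error.

```python
# Toy correctable-error filter. All rules and thresholds are hypothetical.
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    dimm: str           # DIMM identifier
    bank: int           # DRAM bank in which the error occurred
    bits_flipped: int   # number of wrong bits within the affected symbol


def dimms_to_flag(events, chip_width, x8_multibit_limit=2, bank_limit=4):
    """Return the DIMMs that this toy filter would report for replacement.

    chip_width maps each DIMM identifier to 4 or 8 (x4 or x8 chips).
    """
    multibit = Counter()
    per_bank = Counter()
    for event in events:
        if event.bits_flipped <= 1:
            continue  # isolated single-bit errors do not count toward an alert
        multibit[event.dimm] += 1
        per_bank[(event.dimm, event.bank)] += 1

    flagged = set()
    # x8 DIMMs have no protection against a whole-chip failure, so repeated
    # multi-bit symbol errors are treated as more serious (hypothetical rule).
    for dimm, count in multibit.items():
        if chip_width[dimm] == 8 and count >= x8_multibit_limit:
            flagged.add(dimm)
    # Repeated multi-bit errors in one bank suggest a bank failure
    # (hypothetical rule, applied to both x4 and x8 DIMMs).
    for (dimm, _bank), count in per_bank.items():
        if count >= bank_limit:
            flagged.add(dimm)
    return sorted(flagged)


events = (
    [CorrectableError("A", bank=0, bits_flipped=1)] * 100   # many single-bit
    + [CorrectableError("B", bank=1, bits_flipped=3)] * 2   # x8, multi-bit
    + [CorrectableError("C", bank=2, bits_flipped=2)] * 4   # x4, same bank
    + [CorrectableError("D", bank=b, bits_flipped=2) for b in (0, 1, 2)]
)
widths = {"A": 4, "B": 8, "C": 4, "D": 4}

print(dimms_to_flag(events, widths))   # ['B', 'C']
```

In this example, DIMM A reports 100 single-bit errors yet is not flagged, while DIMM B is flagged after only two multi-bit errors. A simple count of all correctable errors against one threshold would reach the opposite conclusion.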
Figure 3A: Example of IML event log with Correctable Memory Error alerts
Figure 3B: Example of IML event log with an Uncorrectable Memory Error alert
Advantages
The HP Advanced Memory Error Detection Technology algorithm is better at pinpointing critical memory errors that can shut down a server. It reduces server downtime by alerting server administrators only when the server is truly at a higher risk of receiving a non-recoverable uncorrectable memory error. Server administrators can then better plan downtime to replace degraded DIMMs, avoiding the unplanned downtime associated with a non-recoverable memory error.
HP ProLiant servers supported

HP Advanced Memory Error Detection Technology is introduced in the System ROM upgrade (May 2011 or later) for certain Intel Xeon-based ProLiant G6 and G7 platforms and for certain AMD Opteron-based ProLiant G7 platforms. For a list of specific servers, go to the "For more information" section. The technology will be implemented in future generations of ProLiant servers.
Conclusion
Since 1999, the HP Pre-Failure Alert notification system has alerted customers to potential failures in DIMMs that exceed a predefined threshold for correctable memory errors. This has allowed administrators to schedule server maintenance to replace a DIMM and avoid unexpected interruption of business operations. But over the past few years, the number of reported memory errors has increased due to the growth in server memory capacity and the increase in DRAM chip density. These reported memory errors include particular errors that do not significantly increase the probability of a non-recoverable memory condition. As a result, administrators have unnecessarily or prematurely replaced good DIMMs at a cost of unnecessary downtime and repairs.
In the ProLiant System ROM upgrade (version May 2011 or later), we have enhanced memory error protection with HP Advanced Memory Error Detection Technology. This innovation monitors several memory parameters and seeks out specific defects that either cause performance degradation or significantly increase the probability of a non-recoverable memory condition. By improving the prediction of critical memory error conditions, this technology prevents unnecessary DIMM replacement and increases server uptime.
For more information

Visit the URLs listed below if you need additional information.
Resource description: Certain ProLiant G7-Series Servers - SYSTEM ROM UPGRADE REQUIRED for Certain ProLiant G7-Series Servers Configured with Intel Xeon 5500 Series Processors or Intel Xeon 5600 Series Processors
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914487
Resource description: ProLiant Servers - SYSTEM ROM UPGRADE REQUIRED for ProLiant G6 Servers Configured with Intel Xeon 5500 Series Processors, Intel Xeon 5600 Series Processors, or Intel Xeon 3500 Series Processors
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914394
Resource description: ProLiant Servers - SYSTEM ROM UPGRADE REQUIRED - HP Advanced Error Detection Technology Increases Server Uptime and Is Available Via the May 2011 (or Later) System ROM Upgrade for Certain HP ProLiant G6 and G7 Servers
Web address: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c02914486
Resource description: Memory technology evolution: an overview of system memory technologies
Web address: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf
Resource description: DDR3 memory technology
Web address: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02126499/c02126499.pdf
Send comments about this paper to TechCom@HP.com Follow us on Twitter: http://twitter.com/ISSGeekatHP
Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Intel and Intel Xeon are trademarks of Intel Corporation in the United States and other countries. AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. TC0000818, July 2011
