Ramas PSR

GSO BASE :
Node :
Check Fault History to determine which component is reporting the temperature issue.
If it is a node component, run "nodestatus" and "measurementinfo -A" to check temperature and fan status.
Verify whether there are any fan or power supply failures on the reporting node. Use the nodestatus command and verify from the SMclient tree.
Notify the CSR of any failures on the reporting node.
1) CPU x Thermal Ctrl Upper Critical
Solution : Run "nodestatus" and "measurementinfo -T" to check the state of the CPU temperature.
If there is no over-temp condition (alert autosolved) and this is a first-time event, monitor the node for any recurrence of the alert.
If the temperature alert is for a single node, have the CSR check air flow.
If the temperature alert is for multiple nodes, have the CSR check environmental conditions for the data center.
If you are unable to determine the root cause, collect dumplook, SMclient events, and measurementinfo -A for the reporting node.
Disk array :
Log in to the system and verify the disk array status.
Collect the support bundle for the disk array and escalate to GSO HW team.
GSO hardware :
Node :
1) HSBP Temperature Lower Critical
This is a "false" alert.
Follow the instructions in KAP1B2EBA to clear it (Procedure to clear false "Critical HSBP Temperature Lower Critical" alert on Urbanna nodes).
2) CPU x Thermal Ctrl Upper Critical
Run "nodestatus" and "measurementinfo -T" to check state of CPU temperature.
If there is no over-temp condition (alert autosolved) and this is a first-time event, monitor the node for any recurrence of the alert.
If an over-temp condition is present on any CPUs, replace the CPU.
--- If the alerts repeatedly come from one node chassis, then perform the following -
Check for other alerts on this chassis that identify a component failure (fan, power supply, CPU, etc.).
Run "nodestatus" and "measurementinfo -T" to check the state of node and chassis components.
Replace any component reported as failed or marginal.
--- If the temperature alerts are being reported by multiple chassis/cabinets, then perform the following -
Run "nodestatus" and "measurementinfo -T" to verify the temperature alerts and identify any excessive temperatures being reported.
If temperatures are being reported as excessive in multiple cabinets, have the site team check the status of the air handlers and the room temperature at the customer site and take corrective action as needed to restore proper cooling to the system.
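The decision steps above for the "CPU x Thermal Ctrl Upper Critical" alert can be sketched as a small function. This is only an illustration: the `CpuTempAlert` fields and the returned action strings are hypothetical names, and checking the multi-chassis case first is an assumption about precedence that the procedure itself does not state.

```python
from dataclasses import dataclass


@dataclass
class CpuTempAlert:
    """Hypothetical summary of what nodestatus / measurementinfo -T showed."""
    overtemp_present: bool   # an over-temp condition is currently reported
    first_occurrence: bool   # first time this alert has been seen on the node
    chassis_count: int       # number of chassis/cabinets reporting the alert


def next_action(alert: CpuTempAlert) -> str:
    """Return the next step from the procedure for a CPU thermal alert."""
    # Assumption: a multi-chassis pattern points at room cooling, so it is
    # evaluated before any single-node decision.
    if alert.chassis_count > 1:
        return "have site team check air handlers and room temperature"
    if not alert.overtemp_present and alert.first_occurrence:
        return "monitor node for recurrence"
    if alert.overtemp_present:
        return "replace CPU"
    # Repeated alerts from one chassis without a current over-temp condition.
    return "check chassis components and replace failed or marginal parts"


if __name__ == "__main__":
    print(next_action(CpuTempAlert(False, True, 1)))
```

The function only encodes the order of the checks; the actual diagnosis still comes from the command output and the fault history.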
Disk arrays :
Nominal Temperature Exceeded
>> What Caused the Problem?
>> The nominal temperature of the tray has been exceeded. Either a fan has failed, an obstruction is blocking the air flow to or from the tray, or the temperature of the room is too high. The Recovery Guru Details area provides specific information you will need as you follow the recovery steps.
>> Caution: Potential loss of data access. If the temperature of the tray continues to rise, the affected tray may automatically shut down. Fix the problem immediately, before it becomes more serious. The automatic shutdown conditions depend on the model of the tray.
Nominal Temperature Exceeded

Power FAN Canister
1/26/16 12:04:08 AM 281B Failure Critical Nominal temperature exceeded 0/0/0 Temperature Sensor Tray 2, Power-fan canister 1, Slot 1 Controller in slot A
1/26/16 04:44:22 PM 281A Internal Informational Temperature changed to optimal 0/0/0 Temperature Sensor Tray 2, Power-fan canister 1, Slot 1 Controller in slot A
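To judge how long a tray stayed outside its nominal range, the timestamps of the 281B (exceeded) and 281A (back to optimal) events can be subtracted. The sketch below assumes the MEL timestamp layout seen in the two events above (month/day/two-digit year, 12-hour clock with AM/PM); the function names are hypothetical.

```python
from datetime import datetime

# Assumed layout of MEL timestamps, based on the sample events above.
MEL_TIME_FORMAT = "%m/%d/%y %I:%M:%S %p"


def mel_timestamp(text: str) -> datetime:
    """Parse a MEL timestamp such as '1/26/16 12:04:08 AM'."""
    return datetime.strptime(text, MEL_TIME_FORMAT)


def excursion_seconds(exceeded_at: str, optimal_at: str) -> int:
    """Seconds between a 281B event and the matching 281A event."""
    delta = mel_timestamp(optimal_at) - mel_timestamp(exceeded_at)
    return int(delta.total_seconds())


if __name__ == "__main__":
    seconds = excursion_seconds("1/26/16 12:04:08 AM", "1/26/16 04:44:22 PM")
    print(f"{seconds // 3600} h {seconds % 3600 // 60} min")
```

For the sample pair this works out to a little under 17 hours outside the nominal range.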
Drive
5/3/15 12:02:28 PM 481189 100A Error Informational Drive returned CHECK CONDITION 6/b/1 Drive Tray 0, Slot 17 Controller in slot B
The 06/0B/01 drive check condition indicates that the specified temperature has been exceeded. This can be either the maximum or the minimum temperature.
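The MEL shows the check condition as "6/b/1" while the description uses "06/0B/01"; both are the same sense key / ASC / ASCQ triple written in hexadecimal. A minimal sketch for normalizing the short form before comparing it (the helper names are hypothetical):

```python
TEMPERATURE_CHECK_CONDITION = "06/0B/01"


def normalize_sense(triple: str) -> str:
    """Normalize 'key/asc/ascq' hex text such as '6/b/1' to '06/0B/01'."""
    parts = triple.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected key/asc/ascq, got {triple!r}")
    # Each field is one hexadecimal byte; pad to two upper-case digits.
    return "/".join(f"{int(part, 16):02X}" for part in parts)


def is_temperature_check_condition(triple: str) -> bool:
    """True when the triple is the temperature check condition 06/0B/01."""
    return normalize_sense(triple) == TEMPERATURE_CHECK_CONDITION


if __name__ == "__main__":
    print(normalize_sense("6/b/1"))
```

This helps when searching logs, because the same condition can appear as 6/b/1, 6/0b/01 or 06/0B/01.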
Cause:
-----
E-Series Storage Arrays are designed to operate in a temperature range of 50F (10C) to 95F (35C). They can tolerate temperature change at a rate of 9F (5C) per hour.
When the array is located at a high altitude, this range must be lowered by 1.8F (1C) for every 3280 feet (1 km) above sea level.
When any component reaches 10C or 35C, the controller writes the event "Nominal temperature exceeded".
The storage array requires the relative humidity to be above 20% to prevent unexpected electrostatic discharges and below 80% to prevent corrosion.
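The limits above can be turned into a quick environment check. This is a sketch under stated assumptions: the altitude derating of 1C per km is applied to the upper limit only (the text says only that the range is lowered), and the function and constant names are hypothetical.

```python
NOMINAL_MIN_C = 10.0      # lower operating limit from the text
NOMINAL_MAX_C = 35.0      # upper operating limit at sea level
DERATE_C_PER_KM = 1.0     # 1C per km of altitude
HUMIDITY_MIN_PCT = 20.0
HUMIDITY_MAX_PCT = 80.0


def allowed_range_c(altitude_km: float = 0.0) -> tuple[float, float]:
    """Operating range in Celsius. Assumption: only the upper limit derates."""
    derate = DERATE_C_PER_KM * max(altitude_km, 0.0)
    return NOMINAL_MIN_C, NOMINAL_MAX_C - derate


def environment_problems(temp_c: float, humidity_pct: float,
                         altitude_km: float = 0.0) -> list[str]:
    """List the limits that a temperature/humidity reading violates."""
    low, high = allowed_range_c(altitude_km)
    problems = []
    # The event is written when a component reaches a limit, so the
    # comparisons are inclusive.
    if temp_c <= low:
        problems.append("temperature at or below lower limit")
    if temp_c >= high:
        problems.append("temperature at or above upper limit")
    if humidity_pct <= HUMIDITY_MIN_PCT:
        problems.append("humidity too low (electrostatic discharge risk)")
    if humidity_pct >= HUMIDITY_MAX_PCT:
        problems.append("humidity too high (corrosion risk)")
    return problems


if __name__ == "__main__":
    print(environment_problems(10.0, 50.0))
```

A reading of exactly 10C, as in Scenario I below, is flagged on the lower limit rather than the upper one.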
NOTE
There are fans located in the power supplies on each tray. There is a temperature sensor in each Enclosure Service Module (ESM), controller and power supply.
Possible Causes :
1, An open hole in the back of the cabinet, or the floor tile in front of the cabinet is a solid tile (no vent).
2, Recent changes to the room's cooling system or an obstruction of the "hot" air flow could be triggering the sensors.
3, An obstruction blocking the air flow from the tray, or hot air from another system.
4, The fan speed is low or the power-fan canister itself is bad.
Solution :
--------
SCENARIO I :
stateCaptureData :
THERMAL SENSOR - 0x156782d0
ThermalSensorRef : 0b 00 50 08 0e 52 bf 46 c0 00 03 00 00 00 00 00 00 00 00 00
status : 0x2 NOMINAL_TEMP_EXCEED
slot/tray ref : 3/0e 50 08 0e 52 bf 46 c0 00 00 00 00 00 00 00 00 00 00 00 00 <<<<<<<<<< SAS Address of the Enclosure
THERMAL SENSOR - 0x155ba3b8
ThermalSensorRef : 0b 00 50 08 0e 52 bf 46 c0 00 03 00 00 00 00 00 00 00 00 00
status : 0x2 NOMINAL_TEMP_EXCEED
slot/tray ref : 3/0e 50 08 0e 52 bf 46 c0 00 00 00 00 00 00 00 00 00 00 00 00
sasShowEncls
showEncls: A SAS level of -1 indicates no connection
Tray : SAS Level   : View From Ctlr A          : Enclosure
ID   : CtlrA CtlrB : ch#/expDevH  ch#/expDevH  : Logical ID
---- : ----- ----- : ------------ ------------ : ----------------
0    : 1     1     : ch0 /0x65    ch1 /0x12    : 0000000000000000
1    : 2     3     : ch0 /0x25    ch1 /0x45    : 50080e52bf427000
2    : 3     2     : ch0 /0x43    ch1 /0x27    : 50080e52bf46c000 <<<<<< Enclosure Logical ID is Tray#2
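Matching the thermal sensor's "slot/tray ref" to a tray ID means comparing the enclosure SAS address embedded in the reference with the Enclosure Logical ID column of sasShowEncls. The sketch below hard-codes the three rows from the output above. Treating the first byte (0e) as a type tag and the next eight bytes as the address is an assumption drawn from this sample, not a documented layout, and the names are hypothetical.

```python
# Enclosure Logical ID -> tray ID, copied from the sasShowEncls output above.
TRAY_BY_ELI = {
    "0000000000000000": 0,
    "50080e52bf427000": 1,
    "50080e52bf46c000": 2,
}


def eli_from_tray_ref(tray_ref: str) -> str:
    """Extract the 8-byte enclosure address from a 'slot/tray ref' value.

    Assumption: after the optional 'slot/' prefix, byte 0 is a type tag
    and bytes 1-8 hold the enclosure logical ID.
    """
    _, _, hex_part = tray_ref.rpartition("/")
    octets = hex_part.split()
    return "".join(octets[1:9]).lower()


def tray_for_ref(tray_ref: str):
    """Return the tray ID for a sensor's tray reference, or None if unknown."""
    return TRAY_BY_ELI.get(eli_from_tray_ref(tray_ref))


if __name__ == "__main__":
    ref = "3/0e 50 08 0e 52 bf 46 c0 00 00 00 00 00 00 00 00 00 00 00 00"
    print(tray_for_ref(ref))
```

For the sensor in this scenario the lookup lands on tray 2, which agrees with the MEL event naming Tray 2, Power-fan canister 1.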
getObjectGraph_MT(8,0,0,0,0,0,0,0,0,0)
Support CRU Slot Type:

[Temp Sensor]
Status: Non-Critical
Access Path Status: Okay
Swapped?: No
Predicted Fail?: No
Enclosure Logical ID: 50 08 0E 52 BF 46 C0 00 <<<<<<<<<<<<<<< (Tray#2 Slot#1)
devAddress: 0x0320
Slot: 3
Parent Slot: 1
Relative Slot: 1
Relative Parent Locations: Enclosure(1) Support CRU(1)
Replacement Policy: Parent
Removal Policy: Parent
Generic VPD Byte Array:
CRU Type: Subcomponent
LED Control: None cfg:0x0
LED State: No-State
Alarm Control: Off
Component Reported: Yes
Error Rep When Missing: Yes
SAA Request / SAA On: No / No
Last MEL Logged: 0x281b (for Debug ONLY!)
Temperature (C): 10 <<<<<<< Reached 10 degrees Celsius. (Tray#2 Power FAN Canister has a Warning set)
Warning?: Yes <<<<<<< Warning is set.
Failed?: No
CPU Temp Sensor?: No
Support CRU Slot Type: Fan - PS
[Fan]
Status: Okay
Access Path Status: Okay
Swapped?: No
Predicted Fail?: No
Enclosure Logical ID: 00 00 00 00 00 00 00 00
devAddress: 0x0124
Slot: 1
Parent Slot: 1
Relative Slot: 1
Relative Parent Locations: Enclosure(1) Support CRU(1)
Replacement Policy: Parent
Removal Policy: Parent
Generic VPD Byte Array:
CRU Type: Subcomponent
LED Control: None cfg:0x0
LED State: No-State
Alarm Control: Off
Component Reported: Yes
Error Rep When Missing: Yes
SAA Request / SAA On: No / No
Last MEL Logged: 0x0 (for Debug ONLY!)
Speed: 3200 RPM, Category Low <<<<<<< Check the fan speed to confirm that the fan is working well.
Fault LED?: No
Code: Speed (RPM): Description:
0000  0 - 499      Fan stopped
0001  500 - 999    Fan at lowest speed
0002  1000 - 1499  Fan at second lowest speed
0003  1500 - 1999  Fan at speed 3
0004  2000 - 2499  Fan at speed 4
0005  2500 - 2999  Fan at speed 5
0006  3000 - 3499  Fan at intermediate speed
0007  > 3500       Fan at highest speed
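The speed-code table can be applied to a reported RPM value with a simple lookup. This is a sketch: the table does not say which code covers exactly 3500 RPM, so treating 3500 as the highest speed is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_right

# Lower bound (RPM) of codes 0001..0007; anything below 500 is code 0000.
_LOWER_BOUNDS = [500, 1000, 1500, 2000, 2500, 3000, 3500]
_DESCRIPTIONS = [
    "Fan stopped",
    "Fan at lowest speed",
    "Fan at second lowest speed",
    "Fan at speed 3",
    "Fan at speed 4",
    "Fan at speed 5",
    "Fan at intermediate speed",
    "Fan at highest speed",
]


def fan_speed_code(rpm: float) -> tuple[str, str]:
    """Map a fan speed in RPM to the (code, description) pair in the table."""
    if rpm < 0:
        raise ValueError("rpm must not be negative")
    # Count how many lower bounds the reading has reached.
    # Assumption: exactly 3500 RPM falls into code 0007.
    index = bisect_right(_LOWER_BOUNDS, rpm)
    return f"{index:04d}", _DESCRIPTIONS[index]


if __name__ == "__main__":
    print(fan_speed_code(3200))
```

The 3200 RPM reading in the capture above falls in the 3000 - 3499 band of the table.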
The following command output from stateCaptureData can also be checked to confirm that the tray components are Optimal.
Controller A
-> ssmShowSubTree
[Storage Array]
[0x0101 - Enclosure - Tray ID: 0, ELI: 0000000000000000, Status: Okay]
0x0104 - Controller - Config: Snowmass in Camden (266x), Vendor: [LSI ], Produc
t: [INF-01-00 ], Firmware: 0786, Status: Okay
0x011e - Backup Drive - Status: Missing
0x0120 - Host Card - Status: Okay
0x0122 - Battery - Status: Missing
0x012a - Temp Sensor - Status: Okay
0x012b - Temp Sensor - Status: Okay
0x0130 - Cache DIMM - Status: Okay
0x011f - Backup Drive - Status: Missing
0x012c - Temp Sensor - Status: Okay
0x012d - Temp Sensor - Status: Okay
0x0102 - Support CRU - Status: Okay
0x0124 - Fan - Status: Okay
0x0128 - Power Supply - Status: Okay
0x012e - Temp Sensor - Status: Okay
0x012f - Temp Sensor - Status: Okay
0x0106 - Drive - Status: Okay
0x010a - Drive - Status: Okay
0x010b - Drive - Status: Okay
0x010c - Drive - Status: Okay
0x010d - Drive - Status: Okay
0x010e - Drive - Status: Okay
0x010f - Drive - Status: Okay
0x0110 - Drive - Status: Okay
0x0111 - Drive - Status: Okay
0x0112 - Drive - Status: Okay
0x0113 - Drive - Status: Okay
0x0114 - Drive - Status: Okay
0x0115 - Drive - Status: Missing
0x0116 - Drive - Status: Missing
0x0117 - Drive - Status: Missing
0x0118 - Drive - Status: Missing
0x0119 - Drive - Status: Missing
0x011a - Drive - Status: Missing
0x011b - Drive - Status: Missing
0x011c - Drive - Status: Missing
0x011d - Drive - Status: Okay
value = 0 = 0x0
Controller B
-> ssmShowSubTree
[Storage Array]
[0x0101 - Enclosure - Tray ID: 0, ELI: 0000000000000000, Status: Okay]
0x011e - Backup Drive - Status: Missing
0x012a - Temp Sensor - Status: Okay
0x012b - Temp Sensor - Status: Okay
0x011f - Backup Drive - Status: Missing
0x012c - Temp Sensor - Status: Okay
0x012d - Temp Sensor - Status: Okay
0x012e - Temp Sensor - Status: Okay
0x012f - Temp Sensor - Status: Okay
0x010a - Drive - Status: Okay
0x010b - Drive - Status: Okay
0x010c - Drive - Status: Okay
0x010d - Drive - Status: Okay
0x010e - Drive - Status: Okay
0x010f - Drive - Status: Okay
0x0111 - Drive - Status: Okay
0x0112 - Drive - Status: Okay
0x0113 - Drive - Status: Okay
0x0114 - Drive - Status: Okay
0x0115 - Drive - Status: Missing
0x0116 - Drive - Status: Missing
0x0117 - Drive - Status: Missing
0x0118 - Drive - Status: Missing
0x0119 - Drive - Status: Missing
0x011a - Drive - Status: Missing
0x011b - Drive - Status: Missing
0x011c - Drive - Status: Missing
0x011d - Drive - Status: Okay
value = 0 = 0x0
ACTION PLAN :
The CSR should check the tray environment:
1, Block the cold air flow to the cabinet if possible.
2, The perforated/vented floor tile can be moved and a solid tile placed instead if the temperature gets too cold.
3, If the data center looks OK, then the offending component (power-fan canister) should be replaced.
SCENARIO II :
Individual Drive experiences 6/b/1 Temperature condition
From State capture
THERMAL SENSOR - 0x9b8eef0
status : NominalTempExceed(0x2)
tray/slot : 0/2
trayRef : 0e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
thermalSensorRef: 0b 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 00
THERMAL SENSOR - 0x9be442c
status : NominalTempExceed(0x2)
tray/slot : 0/2
trayRef : 0e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
thermalSensorRef: 0b 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 00
MEL Event
2/22/16 1:09:01 PM 66011 281B Failure Critical Nominal temperature exceeded 0/0/0 Temperature Sensor Tray 0, Slot 2 Controller in slot A
Action Plan :
If 6/b/1 is reported by an individual drive, that drive can be replaced.
SCENARIO III :
Multiple drives had Drive Check Condition 6/b/1, which caused a performance issue.
DAMC001-1-19 (2660-7.84.46.30): several drives reported 6/b/1
Drives - ST9300503SS - FW MS05 and MS06
Drive ID    6/b/1 Temperature Errors
12          33
13          436
14          863
15          1559
16          509
17          3068
18          396
19          390
20          211
21          0
22          0
23          0
24          8
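When comparing arrays, it helps to reduce a per-drive table like this to a few numbers. A short sketch using the counts listed above (the function name is hypothetical):

```python
# Drive ID -> number of 6/b/1 temperature errors, copied from the table above.
ERRORS_BY_DRIVE = {
    12: 33, 13: 436, 14: 863, 15: 1559, 16: 509, 17: 3068, 18: 396,
    19: 390, 20: 211, 21: 0, 22: 0, 23: 0, 24: 8,
}


def summarize(counts: dict[int, int]) -> tuple[int, int, list[int]]:
    """Return (total errors, drive with most errors, drives with no errors)."""
    total = sum(counts.values())
    worst = max(counts, key=counts.get)
    clean = sorted(drive for drive, n in counts.items() if n == 0)
    return total, worst, clean


if __name__ == "__main__":
    total, worst, clean = summarize(ERRORS_BY_DRIVE)
    print(f"total={total} worst=drive {worst} clean={clean}")
```

Errors spread across most of the drives in a tray, rather than concentrated on one, fit an environmental cause better than a single failing drive.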
Note :
Other disk arrays in the system are also reporting 6/b/1, though not as many as this array (DAMC001-1-19).
This site in particular was using the firefly drives, which report an under-temp condition (6/0b/01) if they drop below an internal temperature of 20C and disable the aggressive seek feature, which can cause a performance hit for some customers.
There is nothing wrong with the drives and they will continue operating OK even if these alerts are generated. They DO NOT need to be replaced. The alerts are a reaction to the environment being very cold; the drives are detecting an internal temperature below 20C.
The effects of the temperature dropping below 20C are the following:
1. The 6/b/1 alert will be generated.
2. A feature known as aggressive seek will be stepped down 1 level to compensate for the lube thickening at the cooler temperature. This has caused performance issues at some sites but not all.
3. The stepdown is intended to protect the drive, so there should be no damage done to the drive.
4. The aggressive seek feature will return to the normal level of operation once the temperature returns to 20C or higher.
How to stop these alerts
----------------------------------
If the array cabinets are running very cold, the site team should examine the environment to see if the cabinets are too close to, or in the direct air flow of, air handlers in the computer room that are excessively cooling them, and see if there are ways to divert the air.
If the computer room is just excessively cold, they may need to slowly raise the temperature control a degree each day until the alerts stop.
Both options have been successfully used at other sites to stop these alerts and prevent periodic performance issues when a drive remains too cold for an extended period of time.
REFERENCE :
Incident#RECGGC52M, Site#WFOODS4, NetApp Case#2005641716
At this site, several drives have been replaced due to the performance degradation.
Replace ONLY when there is a performance degradation.
Additional info :
IMPACT ON PERFORMANCE :
When the drive case temperature drops below the threshold value, the check condition will be reported, with a likely impact on performance.
The impact on performance is a result of the drive running in a read-verify mode.
The threshold depends on the drive or the firmware, and as a result will not be the same for every drive. For example, drive models ST9300503SS (MS06 is the latest drive FW as of 2/25/2016) and ST32000445SS (MS05 is the latest drive FW as of 2/15/2016) have a low temperature threshold of 20C/68F.
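Because the low-temperature threshold is model specific, a lookup keyed by drive model is a natural way to decide whether a 6/0b/01 under-temp report is expected for a given reading. The sketch below lists only the two models named above; the function name is hypothetical and any other model returns None rather than a guess.

```python
# Low-temperature thresholds in Celsius for the drive models named in the text.
LOW_TEMP_THRESHOLD_C = {
    "ST9300503SS": 20.0,
    "ST32000445SS": 20.0,
}


def under_temp_expected(model: str, drive_temp_c: float):
    """True/False if the model is known, None when no threshold is listed."""
    threshold = LOW_TEMP_THRESHOLD_C.get(model)
    if threshold is None:
        return None
    # The condition is reported when the drive drops below the threshold and
    # clears again at the threshold or higher.
    return drive_temp_c < threshold


if __name__ == "__main__":
    print(under_temp_expected("ST9300503SS", 18.5))
```

Returning None for unlisted models keeps the check from implying a threshold that the drive documentation has not confirmed.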
NOTE :
There will be a rise in temperature of around 10C in a standard system drive tray. Accordingly, the room temperature has to be about 10C higher than the individual drive threshold value.
One case was reported where the drives were unable to write to the DACstore as a result of read-verify when recreating the configuration, until the drive temperature was above the threshold.
