https://support.microsoft.com/en-us/kb/2288515
10/12/2015 11:27 PM
State      Appearance
Healthy    Green check mark
Critical   Red check mark
Unknown    Gray agent name, gray check mark
Unknown    Green circle, no check mark
Before you begin to troubleshoot the agent "grayed out" issue, you should first
understand the Operations Manager topology, and then define the scope of the
issue. The following questions may help you to define the scope of the issue:
How many agents are affected?
Are the agents experiencing the issue in the same network segment?
Do the agents report to the same management server?
How often do the agents enter and remain in a gray state?
How do you typically recover from this situation (for example, restart the
agent health service, clear the cache, rely upon automatic recovery)?
Are the Heartbeat failure alerts generated for these agents?
Does this issue occur during a specific time of the day?
Does this issue persist if you fail over these agents to another
management server or gateway?
When did this problem start?
Were any changes made to the agents, the management servers, or the
gateway or management group?
Are the affected agents Windows clustered systems?
Is the Health Service State folder excluded from antivirus scanning?
In which environment does this occur: OpsMgr 2007 SP1, OpsMgr 2007 R2, or OpsMgr 2012?
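
The first three scoping questions above (how many agents, which network segment, which management server) can be answered mechanically from an agent inventory. The following is a minimal Python sketch, not part of the original article: it assumes a hypothetical list of (agent name, management server, is gray) tuples that you export yourself, and the function name and result strings are illustrative only.

```python
from collections import defaultdict

def classify_scope(agents):
    """agents: iterable of (agent_name, management_server, is_gray) tuples.

    The tuple layout is an assumption made for this sketch.
    """
    total = defaultdict(int)  # number of agents per management server
    gray = defaultdict(int)   # number of gray agents per management server
    for _name, server, is_gray in agents:
        total[server] += 1
        if is_gray:
            gray[server] += 1

    if not any(gray.values()):
        return "no agents affected"
    # Servers on which every reporting agent is gray
    fully_gray = sorted(s for s in total if gray[s] == total[s])
    if len(fully_gray) == len(total):
        return "all agents in the environment affected"
    if fully_gray:
        return "all agents on: " + ", ".join(fully_gray)
    return "isolated agents affected"
```

A result of "isolated agents affected" points toward agent-side causes, while a result that names a server suggests starting at that management server or gateway.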
management server.
Troubleshooting typically starts at the level immediately above the
unavailable component.
Scenario 1
Only a few agents are affected by the issue. These agents report to different
management servers. Agents remain unavailable on a regular basis. Although
you are able to clear the agent cache to help resolve the issue temporarily, the
problem recurs after a few days.
Resolution 1
To resolve the issue in this scenario, follow these steps:
1. Apply the appropriate hotfix to the affected operating systems:
Windows 2008 R2 and Windows 7: This fix is included in Service Pack 1 (SP1).
Windows 2008 and Windows Vista: Install hotfix 2553708.
Windows 2003: Install hotfix 981263.
2. Exclude the Agent cache from antivirus scanning.
3. Stop the Health service.
4. Clear the Agent cache.
5. Start the Health service.
Note We recommend that you proactively apply the hotfixes that are listed in
step 1 to all monitored systems. This includes the management servers.
Additionally, exclude the agent or management cache from antivirus scanning
to prevent this issue from spreading to other systems.
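
Steps 3 through 5 above (stop the Health service, clear the cache, start the service) can be sketched as a small Python helper. This is an illustration under stated assumptions rather than a supported procedure: the cache folder path is supplied by the caller, and `stop_service` and `start_service` are hypothetical callbacks that wrap whatever mechanism you use to control the Health service.

```python
import shutil
from pathlib import Path

def clear_agent_cache(state_dir, stop_service, start_service):
    """Stop the service, empty state_dir, then start the service again.

    state_dir, stop_service, and start_service are supplied by the caller;
    nothing here knows the real cache location or service name.
    Returns the number of top-level entries that were removed.
    """
    state_dir = Path(state_dir)
    stop_service()
    try:
        removed = 0
        for entry in state_dir.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)   # remove a cache subfolder and its contents
            else:
                entry.unlink()         # remove a single cache file or link
            removed += 1
    finally:
        # Restart the service even if clearing the cache failed part way
        start_service()
    return removed
```

The `finally` clause is a deliberate choice: an agent that is left stopped after a failed cleanup would stay gray for a different reason.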
For more information about these procedures, click the following article
numbers to view the articles in the Microsoft Knowledge Base:
Scenario 2
Only a few agents are affected by the issue. These agents report to different
management servers. Agents remain inactive constantly. Although you are able
to clear the agent cache, this does not resolve the issue.
Resolution 2
To resolve the issue in this scenario, follow these steps:
1. Determine whether the Health Service is turned on and is currently
running on the management server or gateway. If the Health Service has
stopped responding, generate an ADPlus dump in hang mode to help
determine the cause of the problem. For more information, click the
following article number to view the article in the Microsoft Knowledge
Base:
286350 How to use ADPlus to troubleshoot "hangs" and "crashes"
2. Examine the Operations Manager Event log on the agent to locate any of
the following events:
Event ID: 1102
Event Source: HealthService
Event Description:
Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be
initialized and will not be loaded. Management group "%1"
Event ID: 1103
Event Source: HealthService
Event Description:
Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them
reached the failure limit that prevents automatic reload. Management
group "%1". This is summary only event, please see other events with
descriptions of unloaded rule(s)/monitor(s).
Scenario 3
All the agents that report to a particular management server or gateway are
unavailable.
Resolution 3
To resolve the issue in this scenario, follow these steps:
1. Try to determine what kind of workloads the management server or
gateway is monitoring. Such workloads might include network devices,
cross-platform agents, synthetic transactions, Windows agents, and
agentless computers.
2. Determine whether the Health Service is running on the management
server or gateway.
3. Determine whether the management server is running in maintenance
mode. If it is necessary, remove the server from maintenance mode.
4. Examine the Operations Manager Event log on the agent for any of the
events that are listed in Scenario 2. In the case of Event ID 21006, follow
the same guidelines that are mentioned in Scenario 2. In this scenario, the
event also indicates that the management server or gateway cannot
communicate with its parent server. In Operations Manager 2007 and
Operations Manager 2007 R2, the parent server of a management server is
the root management server (RMS). For a gateway, the parent server may be
any management server. (Refer to step 3 in the Scenario 2 resolution.)
5. If the Health service is monitoring network devices, and the management
server is running on a Windows Server 2003 system, consider applying
hotfix 982501. For more information, click the following article number to
view the article in the Microsoft Knowledge Base:
982501 The monitoring of SNMP devices may stop intermittently in
System Center Operations Manager or in System Center Essentials
6. Examine the Operations Manager Event log for the following events.
These events typically indicate that performance issues exist on the
management server or on the Microsoft SQL Server instance that hosts the
OperationsManager or OperationsManagerDW database:
Event ID: 2115
Event Source: HealthService
Event Description:
A Bind Data Source in Management Group %1 has posted items to the
workflow, but has not received a response in %5 seconds. This indicates a
performance or functional problem with the workflow.%n Workflow Id :
%2%n Instance : %3%n Instance Id : %4%n
Event ID: 5300
Event Source: HealthService
Event Description:
Local health service is not healthy. Entity state change flow is stalled with
pending acknowledgement. %n%nManagement Group: %2
%nManagement Group ID: %1
Event ID: 4506
Event Source: HealthService
Event Description: Operations Manager
Data was dropped due to too much outstanding data in rule "%2" running
for instance "%3" with id:"%4" in management group "%1".
Event ID: 31551
Event Source: Health Service Modules
Event Description:
Failed to store data in the Data Warehouse. The operation will be
retried.%rException '%5': %6 %n%nOne or more workflows were affected
by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID:
%4 %nManagement group: %1
Event ID: 31552
Event Source: Health Service Modules
Event Description:
Failed to store data in the Data Warehouse.%rException '%5': %6
%n%nOne or more workflows were affected by this. %n%nWorkflow
name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement
group: %1
Event ID: 31553
Event Source: Health Service Modules
Event Description:
Data was written to the Data Warehouse staging area but processing
failed on one of the subsequent operations.%rException '%5': %6
%n%nOne or more workflows were affected by this. %n%nWorkflow
name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement
group: %1
Event ID: 31557
Event Source: Health Service Modules
Event Description:
Failed to obtain synchronization process state information from Data
Warehouse database. The operation will be retried.%rException '%5': %6
%n%nOne or more workflows were affected by this. %n%nWorkflow
name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement
group: %1
7. Event IDs 3155x may also be logged because of incorrect "Run as" account
configurations or missing permissions for the "Run as" accounts. For more
information, see the following Microsoft TechNet blog post, which includes
a Microsoft Office Excel worksheet that lists the permissions for the various
accounts that are used by OpsMgr:
OpsMgr security account rights mapping - what accounts need what
privileges?
Note To troubleshoot management server or gateway performance and SQL
Server performance, see the "Resolutions" section for the next scenarios.
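
When an exported event log is large, it can help to tally only the event IDs listed in step 6. The following Python sketch assumes that you have already extracted the event IDs into a plain list of integers (the export step is not shown). The summaries are shortened from the event descriptions above, and the function names are illustrative.

```python
from collections import Counter

# Event IDs and shortened descriptions taken from the list in step 6
PERFORMANCE_EVENTS = {
    2115: "Bind Data Source posted items but received no response",
    5300: "Entity state change flow is stalled",
    4506: "Data was dropped due to too much outstanding data",
    31551: "Failed to store data in the Data Warehouse (will be retried)",
    31552: "Failed to store data in the Data Warehouse",
    31553: "Data Warehouse staging data written but processing failed",
    31557: "Failed to obtain Data Warehouse synchronization state",
}

def summarize_events(event_ids):
    """Count the listed event IDs, most frequent first; ignore all others."""
    counts = Counter(e for e in event_ids if e in PERFORMANCE_EVENTS)
    return [(eid, n, PERFORMANCE_EVENTS[eid]) for eid, n in counts.most_common()]

def run_as_check_suggested(event_ids):
    """True if any 3155x event is present (see step 7 on "Run as" accounts)."""
    return any(31550 <= e <= 31559 for e in event_ids)
```

For example, `summarize_events([2115, 2115, 31552])` returns the 2115 entry first with a count of 2, followed by the 31552 entry with a count of 1.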
Scenarios 4 and 5
Scenario 4
All the agents that report to a specific management server alternate
intermittently between healthy and gray states.
Scenario 5
All the agents in the environment alternate intermittently between healthy and
gray states.
Resolutions 4 and 5
To resolve the issue in either of these scenarios, first determine the cause of the
issue. Common causes of temporary server unavailability include the following:
The parent server of the agents is temporarily offline.
Agents are flooding the management server with operational data, such
as alerts, states, discoveries, and so on. This may cause an increased use of
system resources on the OpsMgr database and on the OpsMgr servers.
Network outages caused a temporary communication failure between the
parent server and the agents.
Management pack (MP) changes were made in the OpsMgr console. These
changes require an OpsMgr configuration update and an MP redistribution to
the agents. If a change affects a large number of agents, this may increase
system resource usage on the OpsMgr database and on the OpsMgr
servers.
The key to troubleshooting in these scenarios is to understand the duration of
the server unavailability and the time of day during which it occurred. This will
help you to quickly narrow the scope of the problem.
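
To make the duration and time-of-day pattern visible, the gray-state periods can be summarized as follows. This Python sketch assumes that you have recorded each period as a (start, end) pair of `datetime` values, for example from heartbeat failure alerts. The function name and the shape of the result are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta

def summarize_outages(intervals):
    """intervals: list of (start, end) datetime pairs for gray-state periods."""
    # How many periods began in each hour of the day (0-23)
    by_hour = Counter(start.hour for start, _end in intervals)
    durations = [end - start for start, end in intervals]
    longest = max(durations, default=timedelta(0))
    total = sum(durations, timedelta(0))
    return {"by_start_hour": dict(by_hour), "longest": longest, "total": total}
```

A cluster of start times in one hour often lines up with a scheduled job, such as a backup or a management pack import window, which narrows the search considerably.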
Management server
During a configuration update burst (that is caused by MP import and
discovery), the typical bottlenecks are, first, the CPU and, second, the OpsMgr
installation disk I/O. The management server is responsible for forwarding
Gateway
The gateway is both CPU-bound and I/O-bound. When the gateway is relaying a
large amount of data, both the CPU and I/O operations may show high usage.
Most of the CPU usage is caused by the decompression, compression,
encryption, and decryption of the incoming data, and also by the transfer of that
data. All data that the gateway receives from the agents is stored in a
persistent queue on disk, to be read and forwarded to the management server
by the gateway Health service. This can cause heavy disk usage. The usage can
be significant when the gateway is taken offline temporarily and must then
handle the data that the agents generated and tried to send while the
gateway was offline.
To troubleshoot the issue in this situation, collect the following information for
each affected management server or gateway:
Exact Windows version, edition, and build number (for example, Windows
Server 2003 Enterprise x64 SP2)
Number of processors
Amount of RAM
Drive that contains the Health Service State folder
Whether the antivirus software is configured to exclude the Health Service
store
Note For more information, click the following article number to view the
article in the Microsoft Knowledge Base:
975931 Recommendations for antivirus exclusions that relate
to Operations Manager
RAID level (0, 1, 5, 0+1 or 1+0) for the drive that is used by the Health
Service State
Number of disks used for the RAID
General troubleshooting
To troubleshoot the issue in this situation, collect the following information for
each affected management server or gateway:
Exact Windows version, edition, and build number (for example, Windows
Server 2003 Enterprise x64 SP2)
Number of processors
Amount of RAM
Amount of memory that is allocated to SQL Server
Whether SQL Server is 32-bit and is AWE enabled
Note You can find most of this information in SQL Server Management
Studio or in SQL Server Enterprise Manager. To do this, open the
Properties window of the server, and then click the General and Memory
tabs. The General tab includes the SQL Server version, the Windows
version, the platform, the amount of RAM, and the number of processors.
The Memory tab includes the memory that is allocated to SQL Server. In
Microsoft SQL Server 2008 and in Microsoft SQL Server 2005, the
Memory tab also includes the AWE option. To determine whether AWE is
enabled in Microsoft SQL Server 2000, run the following command in the
Microsoft SQL Query Analyzer:
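-- Expose advanced options so that 'awe enabled' can be queried
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
-- Report the current AWE setting (config_value and run_value)
EXEC sp_configure 'awe enabled'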
The returned values for config_value and for run_value will be 1 if AWE is
enabled.
If the operating system is 32-bit and RAM is 4 GB or greater, check whether
the /pae or /3gb switches exist in the Boot.ini file. These options could be
configured incorrectly if the server was originally installed with 4 GB or
less of RAM, and if the RAM was later upgraded.
For 32-bit servers that have 4 GB of RAM, the /3gb switch in Boot.ini
increases the amount of memory that SQL Server can address (from 2 to 3
GB). For 32-bit servers that have more than 4 GB of RAM, the /3gb switch
in Boot.ini could actually limit the amount of memory that SQL Server can
address. For these systems, add the /pae switch to Boot.ini, and then
enable AWE in SQL Server.
On a multi-processor system, check the Max Degree of Parallelism
(MAXDOP) setting. In SQL Server 2008 and in SQL Server 2005, this option
is on the Advanced tab in the Properties dialog box for the server. To
determine this setting in SQL Server 2000, run the following command in
SQL Query Analyzer:
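-- Expose advanced options so that MAXDOP can be queried
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
-- Report the current Max Degree of Parallelism setting
EXEC sp_configure 'max degree of parallelism'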
The default value is 0, which means that all available processors will be
used. A setting of 0 is fine for servers that have eight or fewer processors.
For servers that have more than eight processors, the time that it takes
SQL Server to coordinate the use of all processors may be
counterproductive. Therefore, for servers that have more than eight
processors, you generally should set Max Degree of Parallelism to a
value of 8. To do this, run the following command in SQL Query Analyzer:
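-- Make advanced options such as MAXDOP configurable
EXEC sp_configure 'show advanced options', 1
GO
RECONFIGURE WITH OVERRIDE
GO
-- Limit parallel query plans to eight processors
EXEC sp_configure 'max degree of parallelism', 8
GO
RECONFIGURE WITH OVERRIDE
GO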
Drive letters that contain the data warehouse, Ops DB, and Tempdb files
Whether the antivirus software is configured to exclude SQL data and log
files (Antivirus software should not scan SQL Server database files, because
doing so can degrade performance.)
Amount of free space on the drives that contain the data warehouse, Ops DB,
and Tempdb files
Storage type (SAN or local)
RAID level (0, 1, 5, 0+1 or 1+0) for drives that are used by SQL Server
If SAN storage is used: number of spindles on each LUN that is used by
SQL Server
In OpsMgr 2007 SP1: whether hotfix 969130 (data warehouse event
grooming) or SP1 hotfix rollup 971541 is applied
If the converted Exchange 2007 management pack is being used or has
ever been used: number of rows in the LocalizedText table in the Ops DB
and in the EventPublisher table in the data warehouse database
Note To determine the row counts, run the following commands:
USE OperationsManager
SELECT COUNT(*) FROM LocalizedText
GO
USE OperationsManagerDW
SELECT COUNT(*) FROM EventPublisher
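
The Boot.ini and Max Degree of Parallelism guidance in the list above reduces to two small decision rules. The following Python sketch restates them so that they can be applied to an inventory of database servers. The function names are illustrative, and the rules cover only the cases that the article describes.

```python
def recommend_boot_switches(is_32bit, ram_gb):
    """Return the Boot.ini switches suggested by the guidance above."""
    if not is_32bit or ram_gb < 4:
        return []          # the guidance covers only 32-bit servers with 4 GB or more
    if ram_gb == 4:
        return ["/3gb"]    # lets SQL Server address 3 GB instead of 2 GB
    return ["/pae"]        # more than 4 GB: use /pae and also enable AWE in SQL Server

def recommend_maxdop(processor_count):
    """Return 0 (use all processors) for up to eight processors, otherwise 8."""
    return 0 if processor_count <= 8 else 8
```

For example, a 32-bit server that has 8 GB of RAM and 16 processors yields `["/pae"]` and a Max Degree of Parallelism value of 8.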
Process(Microsoft.Mom.ConfigServiceHost)\Virtual Bytes
Process(Microsoft.Mom.ConfigServiceHost)\Working Set
Process(Microsoft.Mom.Sdk.ServiceHost)\% Processor Time
Process(Microsoft.Mom.Sdk.ServiceHost)\Private Bytes
Process(Microsoft.Mom.Sdk.ServiceHost)\Thread Count
Process(Microsoft.Mom.Sdk.ServiceHost)\Virtual Bytes
Process(Microsoft.Mom.Sdk.ServiceHost)\Working Set
OpsMgr-specific performance counters: These counters indicate the
performance of specific aspects of OpsMgr on the root management server:
Health Service\Workflow Count: The number of workflows that are
running on this root management server.
Health Service Management Groups(*)\Active File Uploads: The number of
file transfers that this root management server is handling, that is,
configuration and management pack uploads to agents. If this value stays
high for a long time while little discovery or management pack import
activity is under way, there may be a problem with file transfer.
Health Service Management Groups(*)\Send Queue % Used: The size of
the persistent queue.
Health Service Management Groups(*)\Bind Data Source Item Drop Rate:
The number of data items dropped by the root management server for
database or data warehouse data collection write actions. When this
counter value is not 0, the root management server or database is
overloaded because it cannot handle the incoming data items fast enough or
because a data item burst is occurring. The dropped data items will be
resent by agents. After the overload or burst is over, these
data items will be inserted into the database or into the data warehouse.
Health Service Management Groups(*)\Bind Data Source Item Incoming
Rate: The number of data items received by the root management server
for database or data warehouse data collection write actions.
Health Service Management Groups(*)\Bind Data Source Item Post Rate:
The number of data items that the root management server wrote to the
database or to the data warehouse for database or data warehouse data
collection write actions.
OpsMgr Connector\Bytes Received: The number of network bytes
received by the root management server, that is, the size of incoming bytes
before decompression.
OpsMgr Connector\Bytes Transmitted: The number of network bytes sent
by the root management server i.e., the size of outgoing bytes after
compression.
OpsMgr Connector\Data Bytes Received: The number of data bytes
received by the root management server i.e., the size of incoming data
after decompression.
OpsMgr Connector\Data Bytes Transmitted: The number of data bytes
sent by the root management server i.e., the size of outgoing data
before compression.
OpsMgr Connector\Open Connections: The number of connections open
on the root management server. It should be the same as the number of
agents or management servers that are directly connected to it.
OpsMgr Config Service\Number Of Active Requests: The number of
configuration or management pack requests that are being processed by
the Config service.
OpsMgr Config Service\Number Of Queued Requests: The number of
queued config or management pack requests sent to the Config service. If
it is high for a long time, the instance space or management pack space is
changing too frequently.
OpsMgr SDK Service\Client Connections: The number of SDK connections.
OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The average number of
data items or batches that are received by database write action modules.
If this number is 5,000, a data item burst is occurring.
OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of
seconds that a database write action module takes to insert a batch into
a database. If this number is often larger than 60, a database insertion
performance issue is occurring.
OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The
number of milliseconds that it takes for a data warehouse write action to
insert a batch of data items into a data warehouse.
OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of
data items or batches that are received by data warehouse write action
modules.
OpsMgr DW Writer Module(*)\Batches/sec: The number of batches
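
The thresholds stated for the OpsMgr-specific counters above can be collected into a simple check. The following Python sketch assumes that you have sampled the counters into a dictionary keyed by short counter names. Those keys, the function name, and the single-sample comparison are assumptions made for illustration; the article describes sustained behavior ("often larger than 60"), so a real check would examine several samples.

```python
def evaluate_counters(sample):
    """sample: dict that maps a short counter name to its latest numeric value.

    The keys are shortened counter names chosen for this sketch.
    """
    findings = []
    # Any non-zero drop rate indicates an overload or a data item burst
    if sample.get("Bind Data Source Item Drop Rate", 0) != 0:
        findings.append("root management server or database overloaded, or data item burst")
    # The article names 5,000 as the burst value; >= is an assumption
    if sample.get("Avg. Batch Size", 0) >= 5000:
        findings.append("data item burst")
    # More than 60 seconds to insert one batch into the database
    if sample.get("Avg. Processing Time", 0) > 60:
        findings.append("database insertion performance issue")
    return findings
```

An empty result means that none of the stated thresholds was crossed in that sample. It does not by itself rule out a performance problem.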