Troubleshooting Gray Agent States in System Center Operations Manager

https://support.microsoft.com/en-us/kb/2288515
This article describes how to troubleshoot problems in which an agent, a
management server, or a gateway is unavailable or "grayed out" in System
Center Operations Manager.
An agent, a management server, or a gateway can have one of the following
states, as indicated by the color of the agent name and icon in the Monitoring
pane.
State     Appearance          Description
Healthy   Green check mark    The agent or management server is running
                              normally.
Critical  Red check mark      There is a problem on the agent or management
                              server.
Unknown   Gray agent name,    The health service watcher on the root
          gray check mark     management server (RMS) that is watching the
                              health service on the monitored computer is no
                              longer receiving heartbeats from the agent.
                              (The health service watcher had been receiving
                              heartbeats previously, and the health service
                              was reported as healthy.) This also means that
                              the management servers are no longer receiving
                              any information from the agent. This issue may
                              occur because the computer that is running the
                              agent is not running or there are connectivity
                              issues. You can find more information in the
                              Health Service Watcher view.
Unknown   Green circle,       The status of the discovered item is unknown.
          no check mark       There is no monitor available for this specific
                              discovered item.
An agent, a management server, or a gateway may become unavailable for any
of the following reasons:
Heartbeat failure
Invalid configuration
System workflows failure
OpsMgr Database or data warehouse performance issues
RMS or primary MS or gateway performance issues
Network or authentication issues
Health service issues (service is not running)
Before you begin to troubleshoot the agent "grayed out" issue, you should first
understand the Operations Manager topology, and then define the scope of the
issue. The following questions may help you to define the scope of the issue:
How many agents are affected?
Are the agents experiencing the issue in the same network segment?
Do the agents report to the same management server?
How often do the agents enter and remain in a gray state?
How do you typically recover from this situation (for example, restart the
agent health service, clear the cache, rely upon automatic recovery)?
Are the Heartbeat failure alerts generated for these agents?
Does this issue occur during a specific time of the day?
Does this issue persist if you fail over these agents to another
management server or gateway?
When did this problem start?
Were any changes made to the agents, the management servers, or the
gateway or management group?
Are the affected agents Windows clustered systems?
Is the Health Service State folder excluded from antivirus scanning?
In which environment does this occur: OpsMgr 2007 SP1, R2, or 2012?
Your troubleshooting strategy will be dictated by which component is inactive,
where that component falls within the topology, and how widespread the
problem is. Consider the following conditions:
If the agents that report to a particular management server or gateway are
unavailable, troubleshooting should start at the management server or
gateway level.
If the gateways that report to a particular management server are
unavailable, troubleshooting should start at the management server level.
For agentless systems, for Network devices, and for Unix/Linux servers,
troubleshooting should start at the agent, management server, or
gateway that is monitoring these objects.
If all the systems are unavailable, troubleshooting should start at the root
management server.
Troubleshooting typically starts at the level immediately above the
unavailable component.
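The scoping rules above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name where_to_start and the component labels are hypothetical and are not part of Operations Manager.

```python
# Illustrative lookup of the scoping rules; every name here is hypothetical.
MONITORING_PROXY = "agent, management server, or gateway that monitors the object"

START_LEVEL = {
    "agents": "management server or gateway that the agents report to",
    "gateways": "management server that the gateways report to",
    "agentless": MONITORING_PROXY,
    "network_device": MONITORING_PROXY,
    "unix_linux": MONITORING_PROXY,
}


def where_to_start(component: str, everything_unavailable: bool = False) -> str:
    """Return the level at which troubleshooting should begin."""
    if everything_unavailable:
        # If all systems are unavailable, start at the top of the topology.
        return "root management server"
    try:
        return START_LEVEL[component]
    except KeyError:
        raise ValueError(f"unknown component type: {component!r}") from None


print(where_to_start("gateways"))
```

In each case the result is the level immediately above the unavailable component.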
Consider the following scenarios.
Scenario 1
Only a few agents are affected by the issue. These agents report to different
management servers. Agents remain unavailable on a regular basis. Although
you are able to clear the agent cache to help resolve the issue temporarily, the
problem recurs after a few days.
Resolution 1
To resolve the issue in this scenario, follow these steps:
1. Apply the appropriate hotfix to the affected operating systems:
   Windows 2008 R2 and Windows 7: This fix is included in Service Pack 1 (SP1).
   Windows 2008 and Windows Vista: Install hotfix 2553708.
   Windows 2003: Install hotfix 981263.
2. Exclude the Agent cache from antivirus scanning.
3. Stop the Health service.
4. Clear the Agent cache.
5. Start the Health service.
Note We recommend that you proactively apply the hotfixes that are listed in
step 1 to all monitored systems. This includes the management servers.
Additionally, exclude the agent or management cache from antivirus scanning
to prevent this issue from spreading to other systems.
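Steps 3 through 5 can be scripted. The following Python sketch is a hedged illustration, not a supported procedure. It assumes that clearing the cache means removing the Health Service State folder while the service is stopped, and that the agent is installed in the default OpsMgr 2007 location; verify both on your systems. The function name reset_agent_cache is hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

SERVICE_NAME = "HealthService"  # Windows service name of the Health service

# Assumption: default OpsMgr 2007 agent install path; adjust for your version.
DEFAULT_CACHE = Path(
    r"C:\Program Files\System Center Operations Manager 2007\Health Service State"
)


def reset_agent_cache(cache_dir: Path = DEFAULT_CACHE, dry_run: bool = True) -> list:
    """Stop the Health service, clear the cache folder, and start the service.

    Returns the planned steps. Nothing is executed unless dry_run is False.
    """
    steps = [
        ["net", "stop", SERVICE_NAME],
        ["<delete folder>", str(cache_dir)],
        ["net", "start", SERVICE_NAME],
    ]
    if not dry_run:
        subprocess.run(steps[0], check=True)
        # Assumption: the service recreates the folder when it starts again.
        shutil.rmtree(cache_dir, ignore_errors=True)
        subprocess.run(steps[2], check=True)
    return steps


for step in reset_agent_cache():
    print(" ".join(step))
```

The default dry run only prints the plan, which makes it possible to review the path before anything is deleted.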
For more information about these procedures, click the following article
numbers to view the articles in the Microsoft Knowledge Base:
982018 An update that improves the compatibility of Windows 7 and
Windows Server 2008 R2 with Advanced Format Disks is available
2553708 A hotfix rollup that improves Windows Vista and Windows Server
2008 compatibility with Advanced Format disks
981263 Management servers or assigned agents unexpectedly appear as
unavailable in the Operations Manager console in Windows Server 2003 or
Windows Server 2008
975931 Recommendations for antivirus exclusions that relate to MOM 2005
and to Operations Manager 2007
Scenario 2
Only a few agents are affected by the issue. These agents report to different
management servers. Agents remain inactive constantly. Although you are able
to clear the agent cache, this does not resolve the issue.
Resolution 2
1. Determine whether the Health Service is turned on and is currently
running on the management server or gateway. If the Health Service has
stopped responding, generate an ADPlus dump in hang mode to
help determine the cause of the problem. For more information, click the
following article number to view the article in the Microsoft Knowledge
Base:
286350 How to use ADPlus.vbs to troubleshoot "hangs" and "crashes"
2. Examine the Operations Manager Event log on the agent to locate any of
the following events:
Event ID: 1102
Event Source: HealthService
Event Description:
Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be
initialized and will not be loaded. Management group "%1"
Event ID: 1103
Event Description:
Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them
reached the failure limit that prevents automatic reload. Management
group "%1". This is summary only event, please see other events with
descriptions of unloaded rule(s)/monitor(s).
Event ID: 1104
Event Description:
RunAs profile in workflow "%4", running for instance "%3" with id:"%2"
cannot be resolved. Workflow will not be loaded. Management group
"%1"
Event ID: 1105
Event Description:
Type mismatch for RunAs profile in workflow "%4", running for instance
"%3" with id:"%2". Workflow will not be loaded. Management group "%1"
Event ID: 1106
Event Description:
Cannot access plain text RunAs profile in workflow "%4", running for
instance "%3" with id:"%2". Workflow will not be loaded. Management
group "%1"
Event ID: 1107
Event Description:
Account for RunAs profile in workflow "%4", running for instance "%3"
with id:"%2" is not defined. Workflow will not be loaded. Please associate
an account with the profile. Management group "%1"
Event ID: 1108
Event Description:
An Account specified in the Run As Profile "%7" cannot be resolved.
Specifically, the account is used in the Secure Reference Override "%6".
%n%n This condition may have occurred because the Account is not
configured to be distributed to this computer. To resolve this problem,
you need to open the Run As Profile specified below, locate the Account
entry as specified by its SSID, and either choose to distribute the Account
to this computer if appropriate, or change the setting in the Profile so that
the target object does not use the specified Account. %n%nManagement
Group: %1 %nRun As Profile: %7 %nSecureReferenceOverride name: %6
%nSecureReferenceOverride ID: %4 %nObject name: %3 %nObject ID: %2
%nAccount SSID: %5
Event ID: 4000
Event Description:
A monitoring host is unresponsive or has crashed. The status code for the
host failure was %1.
Event ID: 21016
Event Source: OpsMgr Connector
Event Description:
OpsMgr was unable to set up a communications channel to %1 and there
are no failover hosts. Communication will resume when %1 is available
and communication from this computer is allowed.
Event ID: 21006
Event Description:
The OpsMgr Connector could not connect to %1:%2. The error code is
%3(%4). Please verify there is network connectivity, the server is running
and has registered its listening port, and there are no firewalls blocking
traffic to the destination.
Event ID: 20070
Event Description:
The OpsMgr Connector connected to %1, but the connection was closed
immediately after authentication occurred. The most likely cause of this
error is that the agent is not authorized to communicate with the server,
or the server has not received configuration. Check the event log on the
server for the presence of 20000 events, indicating that agents which are
not approved are attempting to connect.
Event ID: 20051
Event Description:
The specified certificate could not be loaded because the certificate is not
currently valid. Verify that the system time is correct and re-issue the
certificate if necessary%n Certificate Valid Start Time : %1%n Certificate
Valid End Time : %2
Event Source: ESE
Event Category: Transaction Manager
Event ID: 623
Description: HealthService (<PID>) The version store for instance
<instance> ("<name>") has reached its maximum size of <value>Mb. It is
likely that a long-running transaction is preventing cleanup of the version
store and causing it to build up in size. Updates will be rejected until the
long-running transaction has been completely committed or rolled back.
Possible long-running transaction:
SessionId: <value>
Session-context: <value>
Session-context ThreadId: <value>.
Cleanup:<value>
3. If you locate the following specific events, follow these guidelines:

Events 1102 and 1103: These events indicate that some of the
workflows failed to load. If these are the core system workflows,
these events could cause the issue. In this case, focus on resolving
these events.
Events 1104, 1105, 1106, 1107, and 1108: These events may cause
Events 1102 and 1103 to occur. Typically, this would occur because
of misconfigured "Run as" accounts. In OpsMgr R2, this typically
occurs because the "Run as" accounts are configured to be used
with the wrong class or are not configured to be distributed to the
agent.
Event 4000: This event indicates that the MonitoringHost.exe
process crashed. If this problem is caused by a DLL mismatch or by
missing registry keys, you may be able to resolve the problem by
reinstalling the agent. If the problem persists, try to resolve it by
using the following methods:
Run a Process Monitor capture until the point the process
crashes. For more information, visit the following Microsoft
Sysinternals website: Process Monitor v2.96
Generate an ADPlus dump in crash mode. For more
information, click the following article number to view the
article in the Microsoft Knowledge Base:
286350 How to use ADPlus.vbs to troubleshoot "hangs"
and "crashes"
If the agent is monitoring network devices, and the agent is
running on Windows Server 2003, apply the hotfix in KB 982501.
For more information, click the following article number to
view the article in the Microsoft Knowledge Base:
982501 The monitoring of SNMP devices may stop
intermittently in System Center Operations Manager or in
System Center Essentials
Event ID 21006: This event indicates that communication issues
exist between the agent and the management server. If the agent
uses a certificate for mutual authentication, verify that the certificate
is not expired and that the agent is using the correct certificate. If
Kerberos is being used, verify that the agent can communicate with
Active Directory. If authentication is working correctly, this may
mean that the packets from the agent are not reaching the
management server or gateway. Try to establish a simple telnet to
port 5723 from the agent to the management server. Additionally,
run a simultaneous network trace between the agent and the
management server while you reproduce the communication
failures. This can help you to determine whether the packets are
reaching the management server, and whether any device between
the two components is trying to optimize the traffic or is dropping
some packets. For more information, click the following article
number to view the article in the Microsoft Knowledge Base:
812953 How to use Network Monitor to capture network traffic
Event ID 623: This event typically occurs in a large Operations
Manager environment in which a management server or an agent
computer manages many workflows. For more information, click the
following article number to view the article in the Microsoft
Knowledge Base:
975057 One or more management servers and their managed
devices are dimmed in the Operations Manager Console of
Operations Manager
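The telnet test to port 5723 that is described for Event ID 21006 can also be scripted. The following Python sketch only checks that a TCP connection can be opened; it does not verify certificates, Kerberos, or agent authorization. The function name can_reach_port and the host name in the usage comment are hypothetical.

```python
import socket


def can_reach_port(host: str, port: int = 5723, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage on the agent (hypothetical management server name):
#   can_reach_port("ms01.contoso.com")
```

A result of False suggests that packets from the agent are not reaching the management server or gateway, which is the case in which the network trace described above is most useful.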
Scenario 3
All the agents that report to a particular management server or gateway are
unavailable.
Resolution 3
1. Try to determine what kind of workloads the management server or
gateway is monitoring. Such workloads might include network devices,
cross-platform agents, synthetic transactions, Windows agents, and
agentless computers.
2. Determine whether the Health Service is running on the management
server or gateway.
3. Determine whether the management server is running in maintenance
mode. If it is necessary, remove the server from maintenance mode.
4. Examine the Operations Manager Event log on the agent for any of the
events that are listed in Scenario 2. In the case of Event ID: 21006, follow
the same guidelines that are mentioned in Scenario 2. Additionally, in this
case, this event indicates that the management server or gateway cannot
communicate with its parent server. In Operations Manager 2007 and R2,
the parent server for a management server is the root management server
(RMS). For a gateway, the parent server may be any management server.
(Refer to step 3 in the Scenario 2 resolution.)
5. If the health service is monitoring network devices, and the management
server is running on a Windows Server 2003 system, you may also want to
apply the hotfix in KB 982501. For more information, click the
following article number to view the article in the Microsoft Knowledge
Base:
982501 The monitoring of SNMP devices may stop intermittently in
System Center Operations Manager or in System Center Essentials
6. Examine the Operations Manager Event log for the following events.
These events typically indicate that performance issues exist on the
management server or Microsoft SQL Server that is hosting the
OperationsManager or OperationsManagerDW database:
Event ID: 2115
Event Description:
A Bind Data Source in Management Group %1 has posted items to the
workflow, but has not received a response in %5 seconds. This indicates a
performance or functional problem with the workflow.%n Workflow Id :
%2%n Instance : %3%n Instance Id : %4%n
Event ID: 5300
Event Description:
Local health service is not healthy. Entity state change flow is stalled with
pending acknowledgement. %n%nManagement Group: %2
%nManagement Group ID: %1
Event ID: 4506
Event Description: Operations Manager
Data was dropped due to too much outstanding data in rule "%2" running
for instance "%3" with id:"%4" in management group "%1".
Event ID: 31551
Event Source: Health Service Modules
Event Description:
Failed to store data in the Data Warehouse. The operation will be
retried.%rException '%5': %6 %n%nOne or more workflows were affected
by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID:
%4 %nManagement group: %1
Event ID: 31552
Event Description:
Failed to store data in the Data Warehouse.%rException '%5': %6
%n%nOne or more workflows were affected by this. %n%nWorkflow
name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement
group: %1
Event ID: 31553
Event Description:
Data was written to the Data Warehouse staging area but processing
failed on one of the subsequent operations.%rException '%5': %6
group: %1
Event ID: 31557
Event Description:
Failed to obtain synchronization process state information from Data
Warehouse database. The operation will be retried.%rException '%5': %6
group: %1
7. Event ID 3155X may also be logged because of incorrect "Run as" account
configurations or missing permissions for the "Run as" accounts. For more
information, see the following Microsoft TechNet blog, which includes
a Microsoft Office Excel worksheet that lists the permissions for various
accounts that are used by OpsMgr:
OpsMgr security account rights mapping - what accounts need what
privileges?
Note To troubleshoot management server or gateway performance and SQL
Server performance, see the "Resolutions" section for the next scenarios.
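When many events are present, it can help to map each event ID to the guidance given in Scenarios 2 and 3. The following Python sketch condenses that guidance into a lookup table. The table wording is a paraphrase of this article, and the names GUIDANCE and triage are hypothetical.

```python
# Guidance condensed from the Scenario 2 and Scenario 3 event lists.
WORKFLOW_LOAD = "Workflows failed to load; resolve first if core system workflows are affected."
RUN_AS = 'Likely misconfigured "Run as" accounts (wrong class or not distributed).'
PERF = "Possible performance issue on the management server or SQL Server."
DW = PERF + ' Also check "Run as" account configuration and permissions.'

GUIDANCE = {
    1102: WORKFLOW_LOAD,
    1103: WORKFLOW_LOAD,
    1104: RUN_AS,
    1105: RUN_AS,
    1106: RUN_AS,
    1107: RUN_AS,
    1108: RUN_AS,
    4000: "MonitoringHost.exe crashed; reinstall the agent, then capture a trace or dump if it persists.",
    21006: "Communication problem; verify the certificate or Kerberos, then test port 5723.",
    623: "Version store reached its maximum size; typical where many workflows run (see KB 975057).",
    2115: PERF,
    5300: PERF,
    4506: PERF,
    31551: DW,
    31552: DW,
    31553: DW,
    31557: DW,
}


def triage(event_ids):
    """Map each observed event ID to its guidance, skipping unknown IDs."""
    return {eid: GUIDANCE[eid] for eid in event_ids if eid in GUIDANCE}


for eid, advice in triage([1104, 2115, 31552]).items():
    print(eid, advice)
```

Event IDs that this article does not discuss are ignored rather than guessed at.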
Scenarios 4 and 5
Scenario 4
All the agents that report to a specific management server alternate
intermittently between healthy and gray states.
Scenario 5
All the agents in the environment alternate intermittently between healthy and
gray states.
Resolutions 4 and 5
To resolve the issue in either of these scenarios, first determine the cause of the
issue. Common causes of temporary server unavailability include the following:
The parent server of the agents is temporarily offline.
Agents are flooding the management server with operational data, such
as alerts, states, discoveries, and so on. This may cause an increased use of
system resources on the OpsMgr database and on the OpsMgr servers.
Network outages caused a temporary communication failure between the
parent server and the agents.
Management pack (MP) changes occurred. In the OpsMgr console, these
changes require an OpsMgr configuration and an MP redistribution to the
agents. If the change affects a large agent base, this may cause increased
use of system resources on the OpsMgr database and OpsMgr
servers.
The key to troubleshooting in these scenarios is to understand the duration of
the server unavailability and the time of day during which it occurred. This will
help you to quickly narrow the scope of the problem.
For OpsMgr 2007 and R2 - Root management server (RMS)

Configuration update bursts are caused by management pack imports and by
discovery data. When system performance is slow, the most likely bottlenecks
are, first, the CPU and, second, the OpsMgr installation disk I/O.
The RMS is responsible for generating and sending configuration files to all
affected Health Services.
For Workflow reloading (which is caused by new configuration on RMS), the
most likely bottlenecks are the same: the CPU first, and OpsMgr installation disk
I/O second. The RMS is responsible for reading the configuration file, for
loading and initializing all workflows that run on it, and for updating the RMS
HealthService store when the configuration file is updated on the RMS.
For local workflow activity bursts (which is when agents change their
availability), the most likely bottleneck is the CPU. If you find that the CPU is not
working at maximum capacity, the next most likely bottleneck is the hard disk.
The RMS is responsible for monitoring the availability of all agents that are
using RMS local workflows. The RMS also hosts distributed dependency
monitors that use the disk.
Management server
During a configuration update burst (that is caused by MP import and
discovery), the typical bottlenecks are, first, the CPU and, second, the OpsMgr
installation disk I/O. The management server is responsible for forwarding
configuration files from the RMS to the target agents.

For Operational data collection, bottlenecks are typically caused by the CPU. The
disk I/O may also be at maximum capacity, but that is not as likely. The
management server is responsible for decompressing and decrypting incoming
operational data, and inserting it into the Operational Database. It also sends
acknowledgements (ACKs) back to the agents or gateways after it receives
operational data, and uses disk queuing to temporarily store these outgoing
ACKs. Lastly, the management server will also forward monitor state changes (by
using a disk queue) to the RMS for distributed dependency monitors.
Gateway
The gateway is both CPU-bound and I/O-bound. When the gateway is relaying a
large amount of data, both the CPU and I/O operations may show high usage.
Most of the CPU usage is caused by the decompression, compression,
encryption, and decryption of the incoming data, and also by the transfer of that
data. All data that the gateway receives from the agents is stored in a
persistent queue on disk, to be read and forwarded to the management server
by the gateway Health service. This can cause heavy disk usage. This usage can
be significant when the gateway is taken temporarily offline and must then
handle accumulated data that the agents generated and tried to send
while the gateway was offline.
To troubleshoot the issue in this situation, collect the following information for
each affected management server or gateway:
Exact Windows version, edition, and build number (for example, Windows
Server 2003 Enterprise x64 SP2)
Number of processors
Amount of RAM
Drive that contains the Health Service State folder
Whether the antivirus software is configured to exclude the Health Service
store
Note For more information, click the following article number to view the
article in the Microsoft Knowledge Base:
975931 Recommendations for antivirus exclusions that relate
to Operations Manager
RAID level (0, 1, 5, 0+1 or 1+0) for the drive that is used by the Health
Service State
Number of disks used for the RAID
Whether battery-backed write cache is enabled on the array controller
Operational Database (OperationsManager)

For the OperationsManager database, the most likely bottleneck is the disk
array. If the disk array is not at maximum I/O capacity, the next most likely
bottleneck is the CPU. The database will experience occasional slowdowns and
operational "data storms" (very high incidences of events, alerts, and
performance data or state changes that persist for a relatively long time). A
short burst typically does not cause any significant delay for an extended period
of time.
During operational data insertion, the database disks are primarily used for
writes. CPU use is usually caused by SQL Server churn. This may occur when you
have large and complex queries, heavy data insertion, and the grooming of
large tables (which, by default, occurs at midnight). Typically, the grooming of
even large Events and Performance Data tables does not consume excessive
CPU or disk resources. However, the grooming of the Alert and State Change
tables can be CPU-intensive for large tables.
The database is also CPU-bound when it handles configuration redistribution
bursts, which are caused by MP imports or by a large instance space change. In
these cases, the Config service queries the database for new agent
configuration. This usually causes CPU spikes to occur on the database before
the service sends the configuration updates to the agents.
Data Warehouse (OperationsManagerDW)

For the OperationsManagerDW database, the most likely bottleneck is the disk
array. This usually occurs because of very large operational data insertions. In
these cases, the disks are mostly busy performing writes. Usually, the disks are
performing few reads, except to handle manually-generated Reporting views
because these run queries on the data warehouse.
CPU usage is usually caused by SQL Server churn. CPU spikes may occur during
heavy partitioning activity (when tables become very large and then get
partitioned), the generation of complex reports, and large numbers of alerts in
the database, with which the data warehouse must constantly sync up.
General troubleshooting
To troubleshoot the issue in this situation, collect the following information for
each affected management server or gateway:
Exact Windows version, edition, and build number (for example, Windows
Server 2003 Enterprise x64 SP2)
Number of processors
Amount of RAM
Amount of memory that is allocated to SQL Server
Whether SQL Server is 32-bit and is AWE enabled
Note You can find most of this information in SQL Server Management
Studio or in SQL Server Enterprise Manager. To do this, open the
Properties window of the server, and then click the General and Memory
tabs. The General tab includes the SQL Server version, the Windows
version, the platform, the amount of RAM, and the number of processors.
The Memory tab includes the memory that is allocated to SQL Server. In
Microsoft SQL Server 2008 and in Microsoft SQL Server 2005, the
Memory tab also includes the AWE option. To determine whether AWE is
enabled in Microsoft SQL Server 2000, run the following command in the
Microsoft SQL Query Analyzer:
sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'awe enabled'
The returned values for config_value and for run_value will be 1 if AWE is
enabled.
If the OS is 32-bit and RAM is 4 GB or greater, check whether the /pae or /3gb
switches exist in the Boot.ini file. These options could be configured
incorrectly if the server was originally installed with 4 GB or less of
RAM, and if the RAM was later upgraded.
For 32-bit servers that have 4 GB of RAM, the /3gb switch in Boot.ini
increases the amount of memory that SQL Server can address (from 2 to 3
GB). For 32-bit servers that have more than 4 GB of RAM, the /3gb switch
in Boot.ini could actually limit the amount of memory that SQL Server can
address. For these systems, add the /pae switch to Boot.ini, and then
enable AWE in SQL Server.
On a multi-processor system, check the Max Degree of Parallelism
(MAXDOP) setting. In SQL Server 2008 and in SQL Server 2005, this option
is on the Advanced tab in the Properties dialog box for the server. To
determine this setting in SQL Server 2000, run the following command in
SQL Query Analyzer:
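-- 'max degree of parallelism' is an advanced option, so expose those first
sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'max degree of parallelism'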
The default value is 0, which means that all available processors will be
used. A setting of 0 is fine for servers that have eight or fewer processors.
For servers that have more than eight processors, the time that it takes
SQL Server to coordinate the use of all processors may be
counterproductive. Therefore, for servers that have more than eight
processors, you generally should set Max Degree of Parallelism to a
value of 8. To do this, run the following command in SQL Query Analyzer:
sp_configure 'show advanced options', 1
GO
RECONFIGURE WITH OVERRIDE
GO
sp_configure 'max degree of parallelism', 8
GO
RECONFIGURE WITH OVERRIDE
GO
Drive letters that contain data warehouse or Ops and Tempdb files
Whether the antivirus software is configured to exclude SQL data and log
files (Antivirus software cannot scan SQL database files. Trying to do this
can degrade performance.)
Amount of free space on drives that contain data warehouse or Ops and
Tempdb files
Storage type (SAN or local)
RAID level (0, 1, 5, 0+1 or 1+0) for drives that are used by SQL Server
If SAN storage is used: number of spindles on each LUN that is used by
SQL Server
In OpsMgr 2007 SP1: whether hotfix 969130 (data warehouse event
grooming) or SP1 hotfix rollup 971541 is applied
If the converted Exchange 2007 management pack is being used or has
ever been used: number of rows in the LocalizedText table in the Ops DB
and in the EventPublisher table in the data warehouse database
Note To determine the row counts, run the following commands:
USE OperationsManager
SELECT COUNT(*) FROM LocalizedText
GO
USE OperationsManagerDW
SELECT COUNT(*) FROM EventPublisher
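The Boot.ini and Max Degree of Parallelism guidance in the list above reduces to two simple rules. The following Python sketch restates them so that they can be applied to an inventory of servers. The function names are hypothetical, and the advice strings paraphrase this article; the sketch does not query SQL Server.

```python
def boot_ini_advice(os_is_32bit: bool, ram_gb: float) -> str:
    """Return the Boot.ini switch guidance for a SQL Server host."""
    if not os_is_32bit or ram_gb < 4:
        # The article only discusses 32-bit systems with 4 GB of RAM or more.
        return "No Boot.ini switch guidance applies."
    if ram_gb == 4:
        return "/3gb lets SQL Server address 3 GB instead of 2 GB."
    return "/3gb may limit addressable memory; add /pae and enable AWE in SQL Server."


def recommended_maxdop(processor_count: int) -> int:
    """Return the suggested 'max degree of parallelism' value."""
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    # 0 (use all processors) is fine up to eight processors; cap at 8 beyond that.
    return 0 if processor_count <= 8 else 8


print(boot_ini_advice(True, 8))
print(recommended_maxdop(16))
```

For example, a 16-processor server would be set to 8, while a 4-processor server keeps the default value of 0.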
Counters to identify memory pressure
MSSQL$<instance>: Buffer Manager: Page life expectancy: How long
pages persist in the buffer pool. If this value is below 300 seconds, it may
indicate that the server could use more memory. It could also result from
index fragmentation.
MSSQL$<instance>: Buffer Manager: Lazy writes/sec: The lazy writer frees
space in the buffer by moving pages to disk. Generally, the value should
not consistently exceed 20 writes per second. Ideally, it would be close to
zero.
Memory: Available MBytes: Values below 100 MB may indicate memory
pressure. Memory pressure is clearly present when this amount is less
than 10 MB.
Process: Private Bytes: _Total: This is the amount of memory (physical and
page) being used by all processes combined.
Process: Working Set: _Total: This is the amount of physical memory being
used by all processes combined. If the value for this counter is
significantly below the value for Process: Private Bytes: _Total, it indicates
that processes are paging too heavily. A difference of more than 10% is
probably significant.
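The memory thresholds above can be checked against sampled counter values. The following Python sketch is a minimal illustration with a hypothetical function name. It assumes that the "difference of more than 10%" is measured relative to Private Bytes, which this article does not state explicitly, and it expects averaged samples rather than single readings.

```python
def memory_findings(page_life_expectancy_s: float,
                    lazy_writes_per_sec: float,
                    available_mb: float,
                    private_bytes_total: float,
                    working_set_total: float) -> list:
    """Return a list of findings based on the thresholds in this article."""
    findings = []
    if page_life_expectancy_s < 300:
        findings.append("Page life expectancy below 300 s: more memory may be needed.")
    if lazy_writes_per_sec > 20:
        findings.append("Lazy writes/sec above 20: buffer pool is under pressure.")
    if available_mb < 10:
        findings.append("Available MBytes below 10: memory pressure is clearly present.")
    elif available_mb < 100:
        findings.append("Available MBytes below 100: possible memory pressure.")
    # Assumption: the 10% gap is taken relative to Private Bytes.
    if private_bytes_total > 0:
        gap = (private_bytes_total - working_set_total) / private_bytes_total
        if gap > 0.10:
            findings.append("Working Set is over 10% below Private Bytes: heavy paging.")
    return findings


for finding in memory_findings(120, 35, 8, 1000, 800):
    print(finding)
```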
Counters to identify disk pressure

Capture these Physical Disk counters for all drives that contain SQL data or log
files:
% Idle Time: How much disk idle time is being reported. Anything below
50 percent could indicate a disk bottleneck.
Avg. Disk Queue Length: This value should not exceed 2 times the number
of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50
is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high.
You could use the following formulas based on the RAID level and
number of disks in the RAID configuration:
RAID 0: All of the disks are doing work in a RAID 0 set
Average Disk Queue Length <= (number of disks in the array) * 2
RAID 1: Half the disks are doing work; therefore, only half of them
can be counted toward the disk queue
Average Disk Queue Length <= (number of disks in the array / 2) * 2
RAID 10: Half the disks are doing work; therefore, only half of them
can be counted toward the disk queue
Average Disk Queue Length <= (number of disks in the array / 2) * 2
RAID 5: All of the disks are doing work in a RAID 5 set
Average Disk Queue Length <= (number of disks in the array) * 2
Avg. Disk sec/Transfer: The number of seconds it takes to complete
one disk I/O
Avg. Disk sec/Read: The average time, in seconds, of a read of data
from the disk
Avg. Disk sec/Write: The average time, in seconds, of a write of data
to the disk
Note The three Avg. Disk sec counters above should consistently have
values of approximately 0.020 (20 ms) or lower and should never
exceed 0.050 (50 ms). The following are the thresholds that are
documented in the SQL Server performance troubleshooting guide:
Less than 10 ms: very good
Between 10 - 20 ms: okay
Between 20 - 50 ms: slow, needs attention
Greater than 50 ms: serious I/O bottleneck
Disk Bytes/sec: The number of bytes being transferred to or from
the disk per second
Disk Transfers/sec: The number of input and output operations per
second (IOPS)
When % Idle Time is low (10 percent or less), this means that the disk is
fully utilized. In this case, the last two counters in this list (Disk Bytes/sec
and Disk Transfers/sec) provide a good indication of the maximum
throughput of the drive in bytes and in IOPS, respectively. The throughput
of a SAN drive is highly variable, depending on the number of spindles,
the speed of the drives, and the speed of the channel. The best bet is to
check with the SAN vendor to find out how many bytes and IOPS the
drive should support. If % Idle Time is low, and the values for these two
counters do not meet the expected throughput of the drive, engage the
SAN vendor to troubleshoot.
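The disk guidance above can be expressed as a few small checks. The following Python sketch is illustrative only: the function names are hypothetical, the RAID labels are assumed to be written as "RAID 0", "RAID 1", "RAID 5", or "RAID 10", and the handling of values that fall exactly on a boundary (for example, exactly 20 ms) is an assumption, because the thresholds above overlap at their edges.

```python
def max_avg_disk_queue_length(raid_level: str, disks: int) -> float:
    """Upper bound for Average Disk Queue Length, per the RAID formulas above."""
    level = raid_level.upper().replace(" ", "")
    if level in ("RAID0", "RAID5"):
        working_disks = float(disks)      # all disks are doing work
    elif level in ("RAID1", "RAID10"):
        working_disks = disks / 2         # only half of the disks count
    else:
        raise ValueError(f"no formula is given for {raid_level!r}")
    return working_disks * 2


def classify_disk_latency(seconds_per_io: float) -> str:
    """Map an Avg. Disk sec/Transfer, sec/Read, or sec/Write value to a rating."""
    ms = seconds_per_io * 1000.0
    if ms < 10:
        return "very good"
    if ms <= 20:                          # boundary handling is an assumption
        return "okay"
    if ms <= 50:
        return "slow, needs attention"
    return "serious I/O bottleneck"


def disk_fully_utilized(idle_time_percent: float) -> bool:
    """% Idle Time of 10 percent or less means the disk is fully utilized."""
    return idle_time_percent <= 10.0


if __name__ == "__main__":
    print(max_avg_disk_queue_length("RAID 10", 8))
    print(classify_disk_latency(0.035))
    print(disk_fully_utilized(8.0))
```

These checks only restate the thresholds in the text. They do not replace a comparison against the throughput figures that the SAN vendor provides.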
The following links provide deeper insight into troubleshooting SQL Server
performance:
Troubleshooting Performance Problems in SQL Server 2008
Troubleshooting Performance Problems in SQL Server 2005
OpsMgr Performance counters

The following sections describe the performance counters that you can use to
monitor and troubleshoot OpsMgr performance.
Gateway server role

Overall performance counters: These counters indicate the overall
performance of the gateway:
Processor(_Total)\% Processor Time
Memory\% Committed Bytes In Use
Network Interface(*)\Bytes Total/sec
LogicalDisk(*)\% Idle Time
LogicalDisk(*)\Avg. Disk Queue Length
OpsMgr process generic performance counters: These counters indicate the
overall performance of OpsMgr processes on the gateway:
Process(HealthService)\% Processor Time
Process(HealthService)\Private Bytes (depending on how many
agents this gateway is managing, this number may vary and could
be several hundred megabytes)
Process(HealthService)\Thread Count
Process(HealthService)\Virtual Bytes
Process(HealthService)\Working Set
Process(MonitoringHost*)\% Processor Time
Process(MonitoringHost*)\Private Bytes
Process(MonitoringHost*)\Thread Count
Process(MonitoringHost*)\Virtual Bytes
Process(MonitoringHost*)\Working Set
OpsMgr-specific performance counters: These counters indicate the
performance of specific aspects of OpsMgr on the gateway:
Health Service\Workflow Count
Health Service Management Groups(*)\Active File Uploads: The
number of file transfers that this gateway is handling. This
represents the number of management pack files that are being
uploaded to agents. If this value remains high for a long time
while few management packs are being imported, there may be a
problem that affects file transfer.
Health Service Management Groups(*)\Send Queue % Used: The
percentage of the persistent queue that is in use. If this value
remains higher than 10 for a long time and does not drop, the
queue is backed up. This condition is caused by an overloaded
OpsMgr system in which the management server or database is too
busy or is offline.
OpsMgr Connector\Bytes Received: The number of network bytes
received by the gateway, that is, the number of incoming bytes
before decompression.
OpsMgr Connector\Bytes Transmitted: The number of network bytes
sent by the gateway, that is, the number of outgoing bytes after
compression.
OpsMgr Connector\Data Bytes Received: The number of data bytes
received by the gateway, that is, the amount of incoming data
after decompression.
OpsMgr Connector\Data Bytes Transmitted: The number of data
bytes sent by the gateway, that is, the amount of outgoing data
before compression.
OpsMgr Connector\Open Connections: The number of connections
that are open on the gateway. This number should be the same as
the number of agents or management servers that are directly
connected to the gateway.
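The Send Queue % Used guidance (a value that stays above 10 and does not drop) can be checked against a series of collected samples. The following Python sketch is a minimal illustration. The function name is hypothetical, and the number of consecutive samples that counts as "a long time" is an assumption that should be tuned to the sampling interval, because the text does not define it.

```python
from typing import Sequence


def send_queue_backed_up(
    samples: Sequence[float],
    threshold: float = 10.0,     # "higher than 10" from the text
    min_consecutive: int = 20,   # assumed stand-in for "a long time"
) -> bool:
    """Return True if the most recent samples all stay above the threshold."""
    if len(samples) < min_consecutive:
        return False             # not enough data to call it sustained
    recent = samples[-min_consecutive:]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    steady_high = [14.0] * 25
    recovered = [14.0] * 24 + [3.0]
    print(send_queue_backed_up(steady_high))
    print(send_queue_backed_up(recovered))
```

Requiring every recent sample to exceed the threshold reflects the "does not drop" condition. A single sample below 10 resets the finding.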
Management server role

Overall performance counters: These counters indicate the overall performance
of the management server:
LogicalDisk(*)\Avg. Disk Queue Length
OpsMgr process generic performance counters: These counters indicate the
overall performance of OpsMgr processes on the management server:
Process(HealthService)\% Processor Time
Process(HealthService)\Private Bytes (Depending on how many agents
this management server is managing, this number may vary, and it could
be several hundred megabytes.)
Process(MonitoringHost*)\Working Set
OpsMgr-specific performance counters: These counters indicate the
performance of specific aspects of OpsMgr on the management server:
Health Service\Workflow Count: The number of workflows that are
running on this management server.
Health Service Management Groups(*)\Active File Uploads: The number of
file transfers that this management server is handling. This represents the
number of management pack files that are being uploaded to agents. If
this value remains high for a long time while few management packs are
being imported, there may be a problem that affects file transfer.
Health Service Management Groups(*)\Send Queue % Used: The percentage of
the persistent queue that is in use. If this value remains higher than 10
for a long time and does not drop, the queue is backed up. This condition
is caused by an overloaded OpsMgr system in which a component (for
example, the root management server) is too busy or is offline.
Health Service Management Groups(*)\Bind Data Source Item Drop Rate:
The number of data items that are dropped by the management server for
database or data warehouse data collection write actions. When this
counter value is not 0, the management server or database is overloaded
because it cannot handle the incoming data items fast enough or because a
data item burst is occurring. The dropped data items will be resent by
agents. After the overload or burst situation is finished, these data items
will be inserted into the database or into the data warehouse.
Health Service Management Groups(*)\Bind Data Source Item Incoming
Rate: The number of data items received by the management server for
database or data warehouse data collection write actions.
Health Service Management Groups(*)\Bind Data Source Item Post Rate:
The number of data items that the management server wrote to the
database or data warehouse for data collection write actions.
OpsMgr Connector\Bytes Received: The number of network bytes
received by the management server, that is, the size of incoming bytes
before decompression.
OpsMgr Connector\Bytes Transmitted: The number of network bytes sent
by the management server, that is, the size of outgoing bytes after
compression.
OpsMgr Connector\Data Bytes Received: The number of data bytes
received by the management server, that is, the size of incoming data
after decompression.
OpsMgr Connector\Data Bytes Transmitted: The number of data bytes
sent by the management server, that is, the size of outgoing data before
compression.
OpsMgr Connector\Open Connections: The number of connections open
on the management server. It should be the same as the number of agents
or root management servers that are directly connected to it.
OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The number of
data items or batches that are received by database write action
modules. If this number is 5,000, a data item burst is occurring.
OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of
seconds that a database write action module takes to insert a batch into
the database. If this number is often greater than 60, a database
insertion performance issue is occurring.
OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The
number of milliseconds that it takes for a data warehouse write action
to insert a batch of data items into a data warehouse.
OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of
data items or batches received by data warehouse write action modules.
OpsMgr DW Writer Module(*)\Batches/sec: The number of batches
received by data warehouse write action modules per second.
OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items
received by data warehouse write action modules per second.
OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of
data items dropped by data warehouse write action modules.
OpsMgr DW Writer Module(*)\Total Error Count: The number of errors
that occurred in a data warehouse write action module.
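Several of the management server counters above come with explicit thresholds. The following Python sketch gathers them into one check over a single snapshot of values. The function and parameter names are hypothetical, and treating a batch size of 5,000 "or more" as a burst is an assumption (the text says only "is 5,000"). Because the text says "often greater than 60", a real check would look at repeated samples rather than one reading.

```python
from typing import List


def database_write_findings(
    avg_batch_size: float,          # OpsMgr DB Write Action Modules(*)\Avg. Batch Size
    avg_processing_time_s: float,   # OpsMgr DB Write Action Modules(*)\Avg. Processing Time
    bind_item_drop_rate: float,     # ...\Bind Data Source Item Drop Rate
) -> List[str]:
    """Return the findings that the thresholds in the text would raise."""
    findings: List[str] = []
    if avg_batch_size >= 5000:      # ">=" is an assumption; the text says "is 5,000"
        findings.append("data item burst is occurring")
    if avg_processing_time_s > 60:
        findings.append("database insertion performance issue")
    if bind_item_drop_rate != 0:
        findings.append("management server or database is overloaded, or a burst is occurring")
    return findings


if __name__ == "__main__":
    for finding in database_write_findings(5000, 75.0, 3.0):
        print(finding)
```

An empty list means that none of the three documented conditions is met for that snapshot. It does not rule out other causes of a gray agent state.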
Root management server role

Overall performance counters: These counters indicate the overall performance
of the root management server:
LogicalDisk(*)\Avg. Disk Queue Length
OpsMgr process generic performance counters: These counters indicate the
overall performance of OpsMgr processes on the root management server:
Process(HealthService)\% Processor Time
Process(HealthService)\Private Bytes (Depending on how many agents this
root management server is managing, this number may vary and could be
several hundred megabytes.)
Process(MonitoringHost*)\Working Set
Process(Microsoft.Mom.ConfigServiceHost)\% Processor Time
Process(Microsoft.Mom.ConfigServiceHost)\Private Bytes
Process(Microsoft.Mom.ConfigServiceHost)\Thread Count
Process(Microsoft.Mom.ConfigServiceHost)\Virtual Bytes
Process(Microsoft.Mom.ConfigServiceHost)\Working Set
Process(Microsoft.Mom.Sdk.ServiceHost)\% Processor Time
Process(Microsoft.Mom.Sdk.ServiceHost)\Private Bytes
Process(Microsoft.Mom.Sdk.ServiceHost)\Thread Count
Process(Microsoft.Mom.Sdk.ServiceHost)\Virtual Bytes
Process(Microsoft.Mom.Sdk.ServiceHost)\Working Set
OpsMgr-specific performance counters: These counters indicate the
performance of specific aspects of OpsMgr on the root management server:
Health Service\Workflow Count: The number of workflows that are
running on this root management server.
Health Service Management Groups(*)\Active File Uploads: The number of
file transfers that this root management server is handling, that is,
configuration and management pack uploads to agents. If this value
remains high for a long time and does not drop, even though not much
discovery or management pack importing is occurring at the moment,
there could be a problem in file transfer.
Health Service Management Groups(*)\Send Queue % Used: The size of
the persistent queue.
Health Service Management Groups(*)\Bind Data Source Item Drop Rate:
The number of data items dropped by the root management server for
database or data warehouse data collection write actions. When this
counter value is not 0, the root management server or database is
overloaded because it cannot handle the incoming data items fast enough
or because a data item burst is occurring. The dropped data items will be
resent by agents. After the overload or burst situation is finished, these
data items will be inserted into the database or into the data warehouse.
Health Service Management Groups(*)\Bind Data Source Item Incoming
Rate: The number of data items received by the root management server
for database or data warehouse data collection write actions.
Health Service Management Groups(*)\Bind Data Source Item Post Rate:
The number of data items that the root management server wrote to the
database or to the data warehouse for data collection write actions.
OpsMgr Connector\Bytes Received: The number of network bytes
received by the root management server, that is, the size of incoming
bytes before decompression.
OpsMgr Connector\Bytes Transmitted: The number of network bytes sent
by the root management server, that is, the size of outgoing bytes after
compression.
OpsMgr Connector\Data Bytes Received: The number of data bytes
received by the root management server, that is, the size of incoming
data after decompression.
OpsMgr Connector\Data Bytes Transmitted: The number of data bytes
sent by the root management server, that is, the size of outgoing data
before compression.
OpsMgr Connector\Open Connections: The number of connections open
on the root management server. It should be the same as the number of
agents or management servers that are directly connected to it.
OpsMgr Config Service\Number Of Active Requests: The number of
configuration or management pack requests that are being processed by
the Config service.
OpsMgr Config Service\Number Of Queued Requests: The number of
queued configuration or management pack requests sent to the Config
service. If this value is high for a long time, the instance space or
management pack space is changing too frequently.
OpsMgr SDK Service\Client Connections: The number of SDK connections.
OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The number of
data items or batches that are received by database write action modules.
If this number is 5,000, a data item burst is occurring.
OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of
seconds that a database write action module takes to insert a batch into
a database. If this number is often greater than 60, a database insertion
performance issue is occurring.
OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The
number of milliseconds that it takes for a data warehouse write action to
insert a batch of data items into a data warehouse.
OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of
data items or batches that are received by data warehouse write action
modules.
OpsMgr DW Writer Module(*)\Batches/sec: The number of batches
received by data warehouse write action modules per second.
OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items
received by data warehouse write action modules per second.
OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of
data items that are dropped by data warehouse write action modules.
OpsMgr DW Writer Module(*)\Total Error Count: The number of errors
that occurred in data warehouse write action modules.
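One way to collect a subset of the counters listed above is to put them in a counter file for the Windows typeperf tool. The following Python sketch only builds the file text and the command line. It does not run anything. The helper names, the file names, and the sampling interval and count are hypothetical, and the counter paths are copied from this article with a leading backslash added, so they should be verified on the target server before use.

```python
from typing import List

# Counter paths as named in this article (leading backslash added for typeperf).
RMS_COUNTERS: List[str] = [
    r"\Process(HealthService)\% Processor Time",
    r"\Process(HealthService)\Private Bytes",
    r"\Process(Microsoft.Mom.ConfigServiceHost)\% Processor Time",
    r"\Process(Microsoft.Mom.Sdk.ServiceHost)\% Processor Time",
    r"\Health Service\Workflow Count",
    r"\Health Service Management Groups(*)\Send Queue % Used",
    r"\OpsMgr Config Service\Number Of Queued Requests",
    r"\OpsMgr DB Write Action Modules(*)\Avg. Processing Time",
]


def counter_file_text(counters: List[str]) -> str:
    """typeperf -cf expects one counter path per line."""
    return "\n".join(counters) + "\n"


def typeperf_command(
    counter_file: str,
    output_csv: str,
    interval_s: int = 15,    # assumed sampling interval
    samples: int = 240,      # assumed sample count (one hour at 15 s)
) -> List[str]:
    """Build the argument list for typeperf without executing it."""
    return [
        "typeperf",
        "-cf", counter_file,
        "-si", str(interval_s),
        "-sc", str(samples),
        "-f", "CSV",
        "-o", output_csv,
    ]


if __name__ == "__main__":
    print(counter_file_text(RMS_COUNTERS), end="")
    print(" ".join(typeperf_command("rms_counters.txt", "rms_perf.csv")))
```

Building the argument list separately from running it makes it possible to review the command first. The resulting CSV can then be compared against the thresholds described in this article.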
Article ID: 2288515 - Last Review: 07/09/2012 20:47:00 - Revision: 3.0

Applies to
Microsoft System Center Operations Manager 2007 R2
Microsoft System Center Operations Manager 2007
Microsoft System Center Operations Manager 2007 Service Pack 1
Microsoft System Center 2012 Operations Manager
Keywords:
kbtshoot KB2288515