Monitoring Platform Common Issues

S. No  Common Errors
1      SiteScope URL Availability
2      Platform Availability
3      APMDB maintenance job issues ([APM Platform] Database Job Failed)
4      CPU Utilization
5      Disk Space Alerts for C: Drive
6      Physical Memory or Virtual Memory
7      opctrapi
8      Sync_Buffer_issue
9      Event delay (SiteScope/BSMC/NNMi) to OMi console
10     OM Agent alerts
11     Execute the Downtime Service launcher (once a week) to avoid the suppression issues
12     ROD/Boomi issues
13     Atrium to UCMDB integration job
14     UCMDB to RTSM integration job
15     Backup jobs failure
16     NNMi events not arriving for 15 minutes
17     NNM Health Status : error

Steps to be taken
SiteScope URL Availability:
1. Restart the SiteScope service.
2. Check the URL availability and send a mail with the result.

Platform Availability:
1. There are three checks: WDE, LoadBalancerVerify and HealthCheck.
2. If a WDE error is received, check the WDE status for the GW.
3. The links to the specific checks for the GWs are:
   1) WDE: http://<GW>:8181/ext/mod_mdrv_wrap.dll?type=test
   2) HealthCheck: https://<GW>/healthcheck.html
   3) LoadBalancerVerify: https://<GW>/topaz/topaz_api/loadBalancerVerify_centers.jsp

[APM Platform] Database Log Space Alert [P]:
1. These emails sent to the mailbox require action on our part.
2. The log attached to the email shows, at the bottom, why the job failed, for example:
   "Msg 9002, Sev 17, State 2, Line 25 : The transaction log for database 'BSM_PROFILE_EU' is full.
   To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases [SQLSTATE 42000]"
3. If you see such a message, connect to the SQL Server, open a new query, and execute the following in the context of the database for which the failure was reported:
   sp_helpfile
4. The output shows the maximum size to which the LOG file is allowed to grow (160 GB in this case).
5. Execute the query:
   DBCC SQLPERF(LOGSPACE)
6. Once the space is cleared, follow the steps below to run the job again:
   -> Right-click SQL Server Agent, select Job Activity Monitor, then right-click and select View Job Activity.
   -> Go to the profile job that failed, right-click it, and select Start Job at Step to rerun it.
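
DBCC SQLPERF(LOGSPACE) returns one row per database with the columns Database Name, Log Size (MB), Log Space Used (%) and Status. The following Python sketch shows one way such rows could be screened once they have been copied out of the query window. The function name, the 90% threshold and the sample rows are illustrative assumptions, not values taken from this runbook.

```python
from typing import Iterable, List, Tuple

# One row of DBCC SQLPERF(LOGSPACE):
# (database name, log size in MB, log space used in %)
LogRow = Tuple[str, float, float]


def databases_with_full_logs(rows: Iterable[LogRow],
                             threshold_pct: float = 90.0) -> List[str]:
    """Return the databases whose log space used (%) is at or above the threshold."""
    return [name for name, _size_mb, used_pct in rows if used_pct >= threshold_pct]


# Hypothetical sample rows, typed in by hand for illustration only.
sample_rows = [
    ("BSM_PROFILE_EU", 163840.0, 99.7),
    ("BSM_EVENT", 20480.0, 12.4),
]
print(databases_with_full_logs(sample_rows))
```

A database that shows up in this list is a candidate for the log space check described above before the failed job is rerun.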

CPU Utilization:
1. Log in to the respective server via RDP.
2. Open Task Manager and check the Performance tab.
3. If the utilization stays high continuously, restart the SiteScope service.

Disk Space Alerts for C: Drive:
1. Log in to the respective server via RDP.
2. Go to <C:\ProgramData\HP\HP BTO Software\datafiles>
3. Check for the coda files.
4. Execute the following commands in the command prompt:
   ovc -stop coda
   ovc -start coda
5. Check whether any other heavily used folder is not system or application related, and clean it if it is not important fo
6. Check the disk space and reply to the mail.
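
The check for other heavily used folders can be done by hand in Explorer. As a rough alternative, the Python sketch below totals the size of each immediate subfolder of a given root and lists the largest ones. The function names and the default of five results are assumptions made for illustration; files that cannot be read are skipped rather than reported.

```python
import os
from typing import Dict, List, Tuple


def subfolder_sizes(root: str) -> Dict[str, int]:
    """Return the total size in bytes of each immediate subfolder of root."""
    sizes: Dict[str, int] = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for filename in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, filename))
                except OSError:
                    pass  # skip files that are locked or were removed meanwhile
        sizes[entry.name] = total
    return sizes


def largest_subfolders(root: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` largest immediate subfolders as (name, bytes), biggest first."""
    ranked = sorted(subfolder_sizes(root).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Calling largest_subfolders with the drive root (for example "C:\\") gives a starting point for deciding which folders to review; nothing is deleted by this code.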

Physical Memory or Virtual Memory:
If the alert is for the SiteScope server:
1. Stop the SiteScope service.
2. Observe the memory and CPU for 5 minutes.
3. If the usage drops, start the service.
4. Observe the server for another 5 minutes.
5. If the usage does not drop, raise an INC to the Wintel team.
6. If the alert is not for the SiteScope server, reach out to the Wintel team.

opctrapi:
1. Log in to the respective server via RDP.
2. Type ovc in the command prompt.
3. If the opctrapi service is aborted, type ovc -start opctrapi in the command prompt.
4. Type ovc in the command prompt again to confirm the status.
5. Go to <C:\ProgramData\HP\HP BTO Software\log> and check the system.txt file for error messages.
6. Go to <C:\ProgramData\HP\HP BTO Software\log\OpC> and check that the logs are up to date.
7. Take a screenshot and reply to the mail.

Sync_Buffer_issue:
1. Log in to PHCHBS-SQ350001 and execute the following queries.
   1. SELECT COUNT(1) FROM BSM_EVENT.dbo.EVENT_SYNC_BUFFER (NOLOCK) - To verify the event count.
   2. SELECT IDENTIFIER, COUNT(1) FROM BSM_EVENT.dbo.EVENT_SYNC_BUFFER (NOLOCK) GROUP BY IDENTIFIER - To verify
   3. SELECT * FROM BSM_EVENT.dbo.EVENT_SYNC_LOAD_BALANCE_LOCK (NOLOCK) WHERE TARGET_IDENTIFIER LIKE 'extran
   (GW Details)
2. If the above queries show events stuck, or the event count increasing, on any one of the GWs, please proceed th
   a. Go to the GW server with your ADM account.
   b. Move the health check file from C:\inetpub\wwwroot to another path.
   c. Stop the GW services.
   d. Clear the cache data from the paths below.
      i. D:\HPBSM\EJBContainer\server\mercury\tmp\sessions
      ii. D:\HPBSM\EJBContainer\server\mercury\tmp\deploy
      iii. D:\HPBSM\EJBContainer\server\mercury\work\jboss.web
   e. Start the GW services.
   f. Put the health check file back in C:\inetpub\wwwroot.
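
Whether the event count is increasing can be judged by running the GROUP BY IDENTIFIER query twice, a few minutes apart, and comparing the results. The Python sketch below illustrates that comparison on two hand-entered snapshots. The function name, the identifier values and the counts are hypothetical and only serve the example.

```python
from typing import Dict, List


def growing_identifiers(earlier: Dict[str, int], later: Dict[str, int]) -> List[str]:
    """Return identifiers whose buffered event count grew between two snapshots."""
    return sorted(
        identifier
        for identifier, count in later.items()
        if count > earlier.get(identifier, 0)
    )


# Hypothetical snapshots of IDENTIFIER -> COUNT(1), taken a few minutes apart.
first = {"gw-a": 120, "gw-b": 0}
second = {"gw-a": 450, "gw-b": 0}
print(growing_identifiers(first, second))
```

An identifier that keeps appearing in the result across several comparisons points to the GW on which the cache clean-up above should be performed.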

Event delay (SiteScope/BSMC/NNMi) to OMi console:
1. Log in to the Primary DPS server.
2. Click All Programs -> Aurea -> Sonic Management Console.
3. Change the connection URL value from "localhost" to the Primary DPS and click OK.
4. Go to Managed Objects -> Containers -> phuseh-SXXXX (the gateway server with the issue, where the events are blocked) -> phuseh-SXXXX
5. The Opr_gateway_queue1 message count should be 0.
6. If any messages are queued, follow the steps below:
   a. Log in to the console URL of the gateway server phuseh-SXXXX using admin credentials:
      http://phuseh-sXXXX.Company.net:11021/invoke?operation=showServiceInfoAsHTML&objectname=Foundations%3Atype%3D
   b. Click the restart button for hpbsm_opr-scripting-host.
7. Repeat steps 4, 5 and 6 for all the other gateway servers.
8. Then check the queue size, and check in the OMi event console whether events are arriving.
If the GWs are found to be queuing up events, please run the following command multiple times from both GW and DPS ser
required for getting the required analysis completed from HPE.
%topaz_home%\opr\support\opr-support-utils.sh -gtd OPR

Failed to report data to HP OM Agent:
1. Log in to the server where the issue is reported.
2. Execute the following commands:
   ovc -kill
   ovpacmd stop (specific to Windows servers)
   Remove the contents of the location: %ovdatadir%/tmp/OpC/
   Remove the coda* files from the location: %ovdatadir%/datafiles
   ovc -start
   ovpacmd start
   ovc -status (all processes should be running)
   perfstat -p (all processes should be running)

Execute the Downtime Service launcher:
Open the following URL:
http://<DPS server where HAC services are running>:8080/jmx-console/HtmlAdaptor?action=inspectMBean&name=Topaz%3A

Step 1: Click void stop() and wait 1 to 2 minutes.
Step 2: Click void start().

ROD/Boomi issues:
For issues with incident creation or updates, contact the Remedy and Boomi teams.
Resolver Group: GL_RoD-Incidents, Group Mail ID: <remedy.support@Company.com>; <is.support@Company.com>

Verification of Atrium - Full Sync Job
Atrium Integration Job:
1. Log on to https://ucmdb.Company.net/ucmdb using admin credentials.
2. Go to Data Flow Management > Integration Studio and select the integration point Atrium.
3. Verify that all the integration jobs have completed successfully.
4. If an integration job has failed, perform the steps below:
   a. Click on the failed integration job ("Import Delta" only) and review the job errors.
   b. Try to re-run the Import Delta integration job by clicking the Delta Synchronization option.
   c. If the job still fails after the re-run, go to the probe log path \\phchbs-sp320157.Company.net\d$\HP\UCMDB\DataFlowProbe\runtime\log
   d. Open the log file RemoteProcesses.log and search for the error message:
      Remote process DS_Atrium_Import Delta finished with success status false
   e. If this message is present, the issue is likely on the Atrium side; raise a Service Request for it.
      For reference, use Incident # INC000003772435.
   f. Provide the username "HPUCMDB" if the Atrium/ROD team requests it.

UCMDB to RTSM integration job:
Verify the following if any integration job fails:
1. Log in to http://phuseh-s3241:8080/status/ using admin credentials.
2. Verify the Default Client status.
3. The status should be "UP" for one of the servers and "Up: Reader" for the other.
4. Verify the status of the two URLs below.
   http://phuseh-s3324.Company.net:8080/ping/?restrictToWriter=true
   http://phuseh-s3241.Company.net:8080/ping/?restrictToWriter=true
5. One URL should show the status "UP" and the other should show the status "DOWN".
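
The expected result of the two ping URLs (exactly one "UP" and one "DOWN") can be expressed as a small check. The Python sketch below takes the two status strings as read from the browser and reports whether they match that expectation. The function name is an assumption for illustration, and the code does not call the URLs itself.

```python
def writer_reader_state_ok(first_status: str, second_status: str) -> bool:
    """Return True when exactly one ping result is UP and the other is DOWN."""
    statuses = sorted(s.strip().upper() for s in (first_status, second_status))
    return statuses == ["DOWN", "UP"]


print(writer_reader_state_ok("UP", "DOWN"))
print(writer_reader_state_ok("UP", "UP"))
```

Two "UP" results or two "DOWN" results both fail the check, which matches the condition in step 5.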

Backup jobs failure:
If three consecutive backup failures occur on the APM servers, an INC should be created for the B&R team.
Resolver Group: GL_Backup-Recovery_HCL
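
The three-consecutive-failures rule can be sketched as follows. This Python example assumes the backup history is available as a list of booleans, oldest first, where False means the run failed, and it interprets the rule as "the three most recent runs all failed". The function name and that interpretation are assumptions.

```python
from typing import Sequence


def needs_backup_incident(results: Sequence[bool], limit: int = 3) -> bool:
    """Return True when the most recent `limit` backup runs all failed.

    `results` is ordered oldest first; True means success, False means failure.
    """
    recent = results[-limit:]
    return len(recent) == limit and not any(recent)


print(needs_backup_incident([True, False, False, False]))
```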

NNMi events not arriving for 15 minutes:
1) Log in to the NNMi console and check the latest events.
2) If events are arriving in NNMi but not in OMi, follow the steps below.
To check the NNMi services using the NNMi command:
Log in to the NNMi target server as administrator or with an ADM account using RDP.
Open a command prompt and run the following command:
D:\>ovstatus -c
Sample Output
Name PID State Last Message(s)
OVsPMD 1324 RUNNING -
nnmaction 6988 RUNNING Initialization complete.
nnmtrapreceivermd 1408 RUNNING Initialization complete. Trap Receiver is running.
nmsdbmgr 1396 RUNNING Database available.
ovjboss 5064 RUNNING Initialization complete.
D:\>
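
The sample output above can also be checked programmatically. The Python sketch below parses text in the same layout (name, PID, state, last message) and lists any process whose state is not RUNNING. It assumes the output has been captured as a string and that the column layout matches the sample; the function name is hypothetical.

```python
from typing import List, Tuple


def not_running(ovstatus_output: str) -> List[Tuple[str, str]]:
    """Return (name, state) for every listed process whose state is not RUNNING."""
    problems = []
    for line in ovstatus_output.splitlines():
        fields = line.split(None, 3)  # name, PID, state, last message(s)
        if len(fields) < 3 or fields[0] == "Name":
            continue  # skip blank lines, the prompt and the header row
        name, _pid, state = fields[:3]
        if state != "RUNNING":
            problems.append((name, state))
    return problems


# The sample output shown in this document.
sample = """Name PID State Last Message(s)
OVsPMD 1324 RUNNING -
nnmaction 6988 RUNNING Initialization complete.
nnmtrapreceivermd 1408 RUNNING Initialization complete. Trap Receiver is running.
nmsdbmgr 1396 RUNNING Database available.
ovjboss 5064 RUNNING Initialization complete.
"""
print(not_running(sample))
```

An empty list means every listed service is in the RUNNING state, as in the sample.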

NNM Health Status error:
1.1 Steps to restart NNMi services:
To restart the NNMi services using the NNMi commands:
1) Log in to the NNMi active server as administrator using RDP.
2) Open a command prompt and enter the commands:
   a) ovstop [stops the NNMi services]
   b) ovstart [starts the NNMi services]
Note: Do not use the -c option with ovstart and ovstop, because application failover is configured for all RNM and GNM servers.
The current status of the services can be checked with the ovstatus -c command, as shown below.
To check the NNMi services using the NNMi command:
Log in to the NNMi target server as administrator or with an ADM account using RDP.
Open a command prompt and run the following command:
D:\>ovstatus -c
Sample Output
Name PID State Last Message(s)
OVsPMD 1324 RUNNING -
nnmaction 6988 RUNNING Initialization complete.
nnmtrapreceivermd 1408 RUNNING Initialization complete. Trap Receiver is running.
nmsdbmgr 1396 RUNNING Database available.
ovjboss 5064 RUNNING Initialization complete.
D:\>

S. No  Point of Contact
1      APM Support
2      APM Support
3      APM Support
4      APM Support
5      APM Support
6      Wintel
7      APM Support
8      APM Support
9      APM Support
10     APM Support
11     APM Support
12     ROD/Boomi Team
13     APM Support
14     APM Support
15     B&R Team
16     APM Support
17     APM Support
