
“Autonomic Computing Framework for Error Recovery in IBM WebSphere MQ”
– A Proof of Concept

Neeraj Bisht, Pawan HN & Vikram Subramanya
Summer Interns of 2007
IBM India Software Lab, Bangalore
Manager: Arun Shivaswamy
WebSphere MQ Group
Profile: IBM ISL Software Group
IBM Software Group - largest middleware company in the world

Brands: WebSphere, Information Mgmt., Lotus, Tivoli, and Rational

Technology Areas: SOA, XML, Web 2.0, Application Servers, Databases, Autonomic Computing
Motivation for our PoC

Current Scene: WebSphere MQ cannot recover from error situations by itself
– Needs manual intervention

Objective: To make MQ self-reliant
– Automatic monitoring/analysis of errors
– Automatic recovery action

Gist: Expose MQ to AC
Autonomic Computing
What’s Autonomic Computing?
Aim: To create ‘self-managing’ systems
– Overcome complexity by automating maintenance

AC makes the system:
– Self-Configuring: adapt to changes, use policies
– Self-Healing: diagnose H/W or S/W disruptions
– Self-Optimizing: maximize IT resource usage
– Self-Protecting: defend from threats/attacks
MAPE-K Loop Architecture
The MAPE-K Loop in AC
Monitor: Collect and filter details from the managed resource

Analyze: Learn the IT environment, predict future situations

Plan: Choose policy actions to achieve goals

Execute: Run the plan

Knowledge: Data shared among the MAPE stages, such as symptoms & policies
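
The four stages and the shared knowledge can be sketched as plain functions. This is a minimal illustration, not the project's code: the toy `resource` dictionary, the symptom name `QMGR_DOWN` and the action name `restart_qmgr` are all assumptions made for the example, and the Analyze stage here only looks up known symptoms rather than learning or predicting.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Data shared by all stages: symptom -> action policies (illustrative)."""
    policies: dict = field(default_factory=lambda: {"QMGR_DOWN": "restart_qmgr"})
    history: list = field(default_factory=list)


def monitor(resource):
    # Collect events from the managed resource and keep only the errors.
    return [e for e in resource["events"] if e["severity"] == "error"]


def analyze(events, knowledge):
    # Keep the events whose symptom is known to the knowledge base.
    return [e["symptom"] for e in events if e["symptom"] in knowledge.policies]


def plan(symptoms, knowledge):
    # Pick the policy action for each recognised symptom.
    return [knowledge.policies[s] for s in symptoms]


def execute(actions, resource, knowledge):
    # Run the plan; here an action only flips a flag on the toy resource.
    for action in actions:
        if action == "restart_qmgr":
            resource["running"] = True
        knowledge.history.append(action)


def mape_k_cycle(resource, knowledge):
    events = monitor(resource)
    symptoms = analyze(events, knowledge)
    actions = plan(symptoms, knowledge)
    execute(actions, resource, knowledge)
    resource["events"] = []  # events of this cycle have been handled
    return actions


# Toy managed resource standing in for a queue manager (hypothetical data).
resource = {"running": False, "events": [
    {"severity": "info", "symptom": "CHANNEL_STARTED"},
    {"severity": "error", "symptom": "QMGR_DOWN"},
]}
knowledge = Knowledge()
actions = mape_k_cycle(resource, knowledge)
print(actions, resource["running"])
```

Keeping the knowledge in one object that every stage receives mirrors the "K" in MAPE-K: policies can be changed without touching the stage functions.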
IBM WebSphere MQ
What’s IBM WebSphere MQ?
IBM’s middleware for messaging & queuing

Communication among programs across a heterogeneous network, via API calls
Messaging & Queuing

MQ is analogous to email, not phone! (Asynchronous: sender and receiver need not be available at the same time)
Queue Manager Objects

Queue: Stores messages sent by programs; local or remote

Channel: Logical communication link
– Message Channel: connects 2 QMgrs
– MQI Channel: connects client to QMgr
MQ Error Scenarios
QMgr Crash Error Scenarios

QMgr crash by killing the OAM process – amqzfuma.exe
– Recovery: Close connection, restart QMgr

QMgr crash due to access violation in the agent process
– Recovery: Close connection, restart QMgr
More MQ Error Scenarios

Older-version DLLs placed on the machine
– Recovery: Find the installation path from the registry; delete/rename the DLLs

DCOM user ID configured incorrectly
– Recovery: Run “amqmjpse -s -r”
What Did We Do?

These error scenarios were manually induced in MQ

Populated the symptom catalog with possible errors. Parsed the generated error logs to detect them

Developed an AC framework (MAPE-K loop) to call the recovery procedure automatically
IBM Tools Used
Error Log Analysis
Eclipse-based tool, IBM Log & Trace Analyzer (LTA)

Converts textual log records into Common Base Event (CBE) format by parsing

Log View of LTA:


Symptom Database

Knowledge base of problems & solutions for a software product
– Symptom description: why the problem occurs
– Rules to identify a problem: XPath expressions
– Recommended action

LTA provides a symptom editor

Can also be used for correlation of events
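
A symptom entry (description, XPath rule, recommended action) can be matched against parsed events roughly as follows. This is a sketch under assumptions: the XML below is a simplified stand-in and not the real CBE schema, the message id `EXAMPLE_QMGR_ENDED` is hypothetical, and Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for parsed log records; NOT the real CBE schema.
EVENT_XML = """
<events>
  <event severity="error" component="qmgr">
    <msg id="EXAMPLE_QMGR_ENDED">queue manager ended unexpectedly</msg>
  </event>
  <event severity="info" component="channel">
    <msg id="EXAMPLE_CHANNEL_STARTED">channel started</msg>
  </event>
</events>
"""

# Each symptom holds a description, an XPath rule and a recommended action.
SYMPTOMS = [
    {
        "description": "Queue manager crashed",
        "rule": ".//event[@severity='error']/msg[@id='EXAMPLE_QMGR_ENDED']",
        "action": "Close connection, restart QMgr",
    },
    {
        "description": "DCOM user ID configured incorrectly",
        "rule": ".//event[@component='dcom']",  # hypothetical rule
        "action": "Run amqmjpse",
    },
]


def match_symptoms(xml_text, symptoms):
    """Return the symptoms whose XPath rule matches at least one element."""
    root = ET.fromstring(xml_text)
    return [s for s in symptoms if root.find(s["rule"]) is not None]


matched = match_symptoms(EVENT_XML, SYMPTOMS)
for symptom in matched:
    print(symptom["description"], "->", symptom["action"])
```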


Symptom Editor In LTA
Closing MAPE-K Loop
IBM Problem Determination Assistant (PDA)
Tool to achieve closed AC loop

Components:
– Generic Log Adapter (GLA)
– Symptom Catalog
– Analysis Engine
– Action Processor
– Manager: Notification, configuration, auto-update
Our Project: GUI and Source Code Explained
Use Case Realization of QMgr Crash
[Architecture diagram: WebSphere MQ Management Application based on AC framework. Labeled components: Queue Manager, WMQ Logs, GLA for WMQ, Notification Router, Action APIs, Action Processor, Analysis Engine, Correlation Engine (if needed), XPath Correlation Engine (if needed), Symptom Database for WMQ, Management Context Data. AC Centric Technologies: Generic Log Adaptor (GLA) / Log Trace Analyzer (LTA) for WMQ. Runtime platform: TPTP. Labeled flows: CBE for erroneous situation; Action: Restart; Action: Change; Save; Load; rules.]
Project GUI
List of MQ Processes
‘Putter’ Application

Puts messages in a non-full queue

While (Q.Connection not closed)
    If (Q.CurrDepth < Q.MaxDepth)
        Q.Put(msg);
    Else
        wait();
‘Getter’ Application

Receives messages from a non-empty queue

While (Q.Connection not closed)
    If (Q.CurrDepth > 0)
        Q.Get(msg);
    Else
        wait();
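
The Putter and Getter loops above can be tried out with an in-process bounded queue. This is only a sketch: Python's `queue.Queue` stands in for an MQ queue (it is not the MQ API), the `closed` event stands in for "Q.Connection closed", and the blocking `put`/`get` calls with a timeout play the role of the depth check plus `wait()`.

```python
import queue
import threading


def putter(q, messages, closed):
    # Put each message; retry while the queue is full and the connection is open.
    for msg in messages:
        while not closed.is_set():
            try:
                q.put(msg, timeout=0.05)  # blocks while depth == max depth
                break
            except queue.Full:
                continue  # the "Else wait()" branch


def getter(q, received, closed):
    # Get messages until the connection is closed and the queue is drained.
    while not (closed.is_set() and q.empty()):
        try:
            received.append(q.get(timeout=0.05))  # blocks while depth == 0
        except queue.Empty:
            continue  # the "Else wait()" branch


q = queue.Queue(maxsize=2)  # max depth of 2 forces the putter to wait
closed = threading.Event()
received = []
messages = [f"msg-{i}" for i in range(5)]

g = threading.Thread(target=getter, args=(q, received, closed))
p = threading.Thread(target=putter, args=(q, messages, closed))
g.start()
p.start()
p.join()      # all messages have been put
closed.set()  # "close the connection"
g.join()      # getter drains what is left, then stops
print(received)
```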
Induce QMgr Crash: Kill OAM Process (!)
Manually issue the command
– “taskkill /f /im amqzfuma.exe”
– Forcefully kills the amqzfuma.exe image

QMgr crashes! All its processes are killed
Putter & Getter Stop
Log File Monitor

Call PDA to poll continuously in the background
– When an error log is generated, it is parsed into CBE format
– Error is matched against the symptom catalog
– User is alerted
– Recovery action is called
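
One polling step of such a monitor might look like the sketch below. It deliberately swaps in a simpler technique than the one on the slide: raw log lines are matched with regular expressions instead of being converted to CBE by PDA. The log line text, the file name and the symptom name `QMGR_CRASH` are assumptions for illustration only.

```python
import os
import re
import tempfile

# Hypothetical symptom catalog: symptom name -> pattern on a raw log line.
CATALOG = {
    "QMGR_CRASH": re.compile(r"queue manager .* ended unexpectedly", re.I),
}


def poll_once(path, offset, catalog):
    """Read what was appended since `offset`; return (new_offset, symptoms)."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    matches = []
    for line in data.decode("utf-8", errors="replace").splitlines():
        for name, pattern in catalog.items():
            if pattern.search(line):
                matches.append(name)
    return offset + len(data), matches


# Demo on a temporary file standing in for an MQ error log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "example_error.log")
    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write("channel started\n")
    offset, first = poll_once(log_path, 0, CATALOG)

    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write("Queue manager QM1 ended unexpectedly\n")
    offset, second = poll_once(log_path, offset, CATALOG)

print(first, second)
```

A background monitor would call `poll_once` in a loop with a sleep between calls, alerting the user and invoking the recovery action for each returned symptom.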
QMgr Restart Action API

After detection of QMgr crash:

Close all existing connections to the QMgr

Restart using the command “STRMQM”

If restart fails, wait and output error code
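
The restart steps above could be wrapped in a small function like this sketch. The retry count, the wait interval and the `close_connections` callback are assumptions; the slides only say to wait and output the error code on failure. The command runner is injectable so the logic can be exercised without an MQ installation, and the return codes used in the demo are placeholders, not documented `strmqm` codes.

```python
import subprocess
import time


def restart_qmgr(qmgr_name, close_connections, run=subprocess.run,
                 retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Close connections, then try `strmqm <name>`; return 0 or the last error code."""
    close_connections()
    code = None
    for attempt in range(1, retries + 1):
        code = run(["strmqm", qmgr_name]).returncode
        if code == 0:
            return 0
        print(f"strmqm attempt {attempt} failed with code {code}")
        if attempt < retries:
            sleep(wait_seconds)  # wait before trying again
    return code


# Demo with a fake runner so the sketch runs without WebSphere MQ installed.
class FakeResult:
    def __init__(self, returncode):
        self.returncode = returncode


codes = iter([1, 0])  # placeholder codes: first attempt fails, second succeeds
calls = []


def fake_run(cmd):
    calls.append(cmd)
    return FakeResult(next(codes))


result = restart_qmgr("QM1", close_connections=lambda: None,
                      run=fake_run, sleep=lambda seconds: None)
print(result, calls)
```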
Putter & Getter Restart ☺
What Have We Achieved?
For the first time, benefits of autonomic computing are realized on WebSphere MQ

Common MQ errors are successfully overcome in our demonstration

Feasibility is high, since the time & space cost is minimal

Value addition to MQ as a self-managing resource


Future As We See

AC framework extends to all MQ errors, making MQ completely self-reliant

Manual intervention reduces drastically, cutting labor costs for IBM; productivity increases

We predict a paradigm shift in the MQ product & maintenance team
Thank You!
Managers, Mr. Arun Shivaswamy of WebSphere MQ group & Mr. M R Ananda, AC team at IBM ISL, Bangalore

Team-mates: Neeraj Bisht, IITB, Pawan HN, NITK and Vikram Subramanya, NITK

MQ team, AC team at IBM

IBM, for giving us valuable exposure to industrial research, with some cash coming our way too (!)
