
Kathy Mitton, Tivoli Storage Manager Server Manager TSM Symposium, September 2011

TSM Administration Revisited

2011 IBM Corporation

Abstract

This session will explore TSM server operations, daily maintenance, and best practices to optimize your TSM server. The speaker will also discuss the administrative and reporting capabilities for the server along with examples and rationale for managing and scheduling server maintenance tasks.


Agenda
- Revisiting TSM practices
- Lifecycle
- Best Practices Workflow
- Scripts and Sequencing
- Schedules
- Operational Limits
- Monitoring
- References


Revisit TSM Practices Periodically To Keep Pace With Change


TSM environments are impacted by
- data growth
- changing business needs
- people/personnel changes
Many TSM installations propagate new TSM instances based on
- when the administrator first learned the product
- when the administrator last had to develop a solution for the challenges faced by the organization at that time
Frequently, newly deployed servers will duplicate existing servers in order to
- simplify administration: the more uniform things are, the easier it is to administer
- leverage the time/effort spent in figuring out an earlier server deployment and configuration
Revisiting your TSM administrative setup may allow you to
- further streamline your operations
- identify and exploit newer TSM functionality
Goals for TSM practices
- simplify administration
- reduce potential for errors

TSM And Your Business Are Dynamic


(Figure: TSM server with its DB2 database and storage hierarchy.)

TSM server and database change based on:
- Growth
- Changes to H/W and vendors

Storage hierarchy changes based on:
- Capacity requirements
- Changes to H/W and vendors
- Performance and cost needs of an organization

Client workloads change over time:
- More clients
- More data
- Different types of clients
- Network/infrastructure changes

Take a look at your TSM environment:
- Are you using and exploiting the best functions and features TSM offers?
- Has your environment changed such that TSM is not being used optimally?


Have you looked at reporting lately?

Reporting and monitoring introduced in V6.1.
Release-to-release improvements in:
- Install
- Configuration
- Deployment
- More reports

An aggregated view of reporting and monitoring for the entire TSM environment


Have you looked at Administration Center lately?

With TSM 6.2, the Administration Center can be used to orchestrate the push of updates to Windows clients.


Utilizing Best Practice Work Flow Will Minimize Problems


Best practice concepts for Tivoli Storage Manager are based on:
- Field experiences
- Observations from customer implementations
- Feedback through problem reports, market requirements, business partner discussions, and other communication channels
- Development insight based on the design and implementation of the algorithms, processes, and code
Generally, the topics discussed are applicable to both V5.x and V6.x TSM servers
- There may be command syntax shown that is specific to a newer release
- Some topics or discussion points may be applicable only to a newer release
  - For example, BACKUP VOLHIST is much more important for V6.x


Overview of Server Workflow Daily Cycle

Client Workload Time

Server Workload

Data Ingest (Backup, Archive, HSM)

Server Maintenance Activities


TSM Wheel of Life Overview


SAN

Whether viewed as a sine wave cycle or the Wheel of Life, some view of the cyclic nature of TSM operations and the daily support for these operations is helpful

Observing System Resource Relative to Workflow Cycle Will Help Provide Guidance For Changes
- The peaks during workload are limited by the total available resources on the machine (CPU, memory, I/O throughput, etc.)
- The client workloads are usually done using schedules
  - Most often, the main data ingest is through a nightly backup window, which may be one or more schedules initiating the backup of various groups of clients
- The server actions are the back-end maintenance actions necessary to
  - protect the client data by performing storage pool backups
  - position the data appropriately in the hierarchy based on policies, storage management, and the data flow through the server
  - perform the other server operations to keep the database, storage hierarchy, and system healthy and ready for the next set of actions
- Client operations may happen (and often do) throughout the day
  - For example, archive operations for DB logs can occur as needed rather than only in the nightly ingest window
  - Resources such as mount points need to be considered for these always-possible operations


Identify Peak Workload and Task Overlap For Workflow Improvement


- During the client workload phase, the server resources (storage, CPU, memory, and I/O bandwidth) should be devoted to supporting the workload
  - At the peak of the client workload, the majority of the server resources should be in support of the client workload
  - At 6.x, we've tested over 500 concurrent client sessions and have seen that server performance can degrade between 500 and 700 concurrent sessions
  - The number of concurrent sessions any single server might achieve is highly dependent on server resources
- During the server workload phase, the server resources are dedicated to managing the recently received data from the client workload
  - These resources are necessary for the storage, policy management, and maintenance of the server
- Optimal server size is based on whether all operations can complete in a 24-hour period
- When workloads overlap or are not given sufficient resources, impacts may occur:
  - Less CPU and memory available to support a given operation
  - Performance degradation
  - Insufficient space
  - Data placement may be sub-optimal
  - Operations may fail


Server Workflow Goals and Priorities

(Figure: TSM server with its DB2 database and storage hierarchy.)

Disaster recovery and availability:
- Onsite recovery through DB restore or clustering (where available)
- Offsite recovery through DB restore + copy storage pools
- Other offsite recovery techniques

Protect the server:
- Data movement activities (reclamation, migration)
- Expiration
- Identify processing for deduplication-enabled environments

Protect the client data:
- Storage pool backup
- Copy active
- Database backup


Illustration and Best Practice Sequencing of Server Workflows


(Timeline figure: server workflows laid out along a time axis.)

Protecting the client data:
- STORAGE POOL BACKUP
- COPY ACTIVEDATA
- DATABASE BACKUP
- BACKUP VOLHIST/DEVCONFIG

Protecting the server:
- EXPIRE INVENTORY
- RECLAMATION
- MIGRATION

Prepare and execute for disaster recovery:
- DELETE VOLHIST
- MOVE DRMEDIA
- PREPARE

Also shown on the timeline: Identify, Table Reorganization


Other Workflow Observations


- Flows and sequencing are not absolute
  - There may be reasons to sequence things differently
  - Proof is in providing capabilities specific to your environment
- Database backup for V6 changes the paradigms compared to V5
  - TYPE=FULL being used predominantly
    - This causes the pruning of the ARCHIVE log space
    - Necessary for proper care and feeding of server health
    - Helps manage or mitigate the amount of storage needed for archive logs
  - In practice, TYPE=INCR may back up close to the same amount and take almost the same amount of time as TYPE=FULL
- Take extra precautions to protect the volume history file for restore purposes
  - Make multiple copies of the file
  - Store to many different locations
  - Critical for RESTORE processing for the server
  - Unlike V5, the V6 volume history can't be built by hand, and without volume history the TSM server is NOT restorable
- Deduplication exhibits different characteristics than traditional server processes:
  - IDENTIFY can be configured as ALWAYS on
    - We do see environments that use it as an always-on process
  - Reclamation workloads increase because space must be reclaimed for both policy deletion (expiration) and duplicate chunk elimination
- V6 reorganization will require some isolated cycles

Employing Disruptive Technologies May Change Workflows


- Disruptive technologies are being used to bring new capabilities to TSM environments
  - TSM processes and activities may be eliminated or require re-sequencing as a result of disruptive technology
  - External events or capabilities may affect what needs to be done within TSM
- One example of disruptive technology is offsite disaster recovery/availability:
  - Disk subsystem
    - Replication of TSM db/log/homeDir to target site
    - Replication of FILE or other DISK storage pools to target site
    - Critical consideration: consistency groups
  - VTL
    - Replication of storage pool to target site
    - Critical consideration: how to reconcile against available server database at target site
  - TSM V6 HADR


Disruptive Technologies Example


(Figure: Server A (Primary) replicating to Server A (Disaster Recovery).)

Replication of server database (DB, log) using either:
- Device-level replication with consistency groups
- V6.2 server using database HADR

Replication of storage pool(s) using:
- Disk device-level replication with consistency groups
- VTL-to-VTL system replication


Validate Your Scripts: Implement Best-Practice Workflow


- We still see scripts not exploiting the PARALLEL and SERIAL sequencing enhancements, even though these script enhancements have been around for many years
  - Many scripts may not be running optimally because they aren't exploiting the hardware
  - Or they may not be sequenced optimally based on the total workload that needs to be accomplished
- This workflow can be implemented via:
  - Scheduled administrative actions (schedule of TYPE=ADMIN)
  - Scripts
- Important structure/sequencing commands:
  - PARALLEL
    - Starts individual threads for each action within a parallel execution block
    - Synchronizes progress by awaiting results from each parallel execution task
  - SERIAL
    - Indicates commands should be run serially, one after the other on the server thread
    - Used to re-converge or synchronize from a PARALLEL set of operations to a SERIAL (single-threaded) set of operations
- To exploit PARALLEL and SERIAL script constructs, commands should:
  - Use WAIT=YES where available
  - Also consider using DURATION=NN where necessary to better manage time


Illustration of PARALLEL and SERIAL


Single Command (thread) Until PARALLEL encountered.

PARALLEL
5 Commands run in parallel.

SERIAL
Re-converge to single when SERIAL keyword encountered.


Example Script
PARALLEL
BACKUP STGPOOL X WAIT=YES
BACKUP STGPOOL Y WAIT=YES
BACKUP STGPOOL Z WAIT=YES
SERIAL
PARALLEL
MIGRATE STGPOOL X HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
MIGRATE STGPOOL Y HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
MIGRATE STGPOOL Z HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
EXPIRE INVENTORY DURATION=qq RESOURCE=nn WAIT=YES
SERIAL
PARALLEL
RECLAIM STGPOOL X THRESHOLD=nn DURATION=qq WAIT=YES
RECLAIM STGPOOL Y THRESHOLD=nn DURATION=qq WAIT=YES
RECLAIM STGPOOL Z THRESHOLD=nn DURATION=qq WAIT=YES
SERIAL
BACKUP DB TYPE=FULL WAIT=YES
BACKUP VOLHIST FILENAMES=/path1/volhist,/path2/volhist,/path3/volhist
BACKUP DEVCONFIG FILENAMES=/path1/dc,/path2/dc,/path3/dc


Script Illustrated
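(Figure: the example script drawn as alternating parallel and serial stages.)

1. Parallel: Storage Pool Backups (x3)
2. Serial (re-converge)
3. Parallel: Migration (x3) and Expiration
4. Serial (re-converge)
5. Parallel: Reclamation (x3)
6. Serial: BACKUP DB, BACKUP VOLHIST, BACKUP DEVCONFIG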
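
As a rough illustration of the PARALLEL/SERIAL structure, the Python sketch below groups the commands of the example script into batches: every command in a batch could run concurrently, and the next batch starts only after the previous one has finished. It models only the sequencing described on these slides and is not how the TSM server itself runs scripts. The names `plan_batches` and `SCRIPT` are made up for this illustration.

```python
# Hypothetical model of PARALLEL/SERIAL sequencing; not TSM server code.
SCRIPT = [
    "PARALLEL",
    "BACKUP STGPOOL X WAIT=YES",
    "BACKUP STGPOOL Y WAIT=YES",
    "BACKUP STGPOOL Z WAIT=YES",
    "SERIAL",
    "PARALLEL",
    "MIGRATE STGPOOL X HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES",
    "MIGRATE STGPOOL Y HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES",
    "MIGRATE STGPOOL Z HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES",
    "EXPIRE INVENTORY DURATION=qq RESOURCE=nn WAIT=YES",
    "SERIAL",
    "PARALLEL",
    "RECLAIM STGPOOL X THRESHOLD=nn DURATION=qq WAIT=YES",
    "RECLAIM STGPOOL Y THRESHOLD=nn DURATION=qq WAIT=YES",
    "RECLAIM STGPOOL Z THRESHOLD=nn DURATION=qq WAIT=YES",
    "SERIAL",
    "BACKUP DB TYPE=FULL WAIT=YES",
    "BACKUP VOLHIST FILENAMES=/path1/volhist,/path2/volhist,/path3/volhist",
    "BACKUP DEVCONFIG FILENAMES=/path1/dc,/path2/dc,/path3/dc",
]


def plan_batches(script_lines):
    """Group commands into batches; commands in one batch may run together."""
    batches = []
    current = []       # commands collected inside an open PARALLEL block
    parallel = False   # True while inside a PARALLEL block
    for raw in script_lines:
        line = raw.strip()
        if not line:
            continue
        keyword = line.upper()
        if keyword in ("PARALLEL", "SERIAL"):
            # Either keyword closes any open parallel block (a sync point).
            if current:
                batches.append(current)
                current = []
            parallel = keyword == "PARALLEL"
        elif parallel:
            current.append(line)
        else:
            # Serial mode: each command is its own single-command batch.
            batches.append([line])
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(plan_batches(SCRIPT), start=1):
        print(f"batch {number}: {len(batch)} command(s)")
```

For the example script this yields six batches of 3, 4, 3, 1, 1, and 1 commands, which matches the stages in the illustration.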


Use Simple Visualization to Validate Your Schedule Optimization


Using visualization techniques to depict schedules and their relationships to one another can be useful. Helpful to determine:
- What is running when?
- What are the overlaps between schedules?
- When schedules are running, what resources are needed and what load or constraints does this put on the server?


Administrative Schedules with Overlaps


(Chart: Administrative Schedule Overlap and Sequencing (Current). Time-of-day axis from 0:00 to 22:00. Series: DB Backup, Backup VolHist, Expiration, Reclaim_Copy, Reclaim_Tape, StgPoolBk_Start, Migr_Start. Two overlaps are marked: Overlap 1 and Overlap 2.)

An example of using a spreadsheet and schedule window information to visualize schedule sequencing and overlaps.
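
The same check can be done in a few lines of code instead of a spreadsheet. The sketch below is a minimal Python example that reports which schedule windows overlap and for how long. The schedule names echo the chart legend, but the start times and durations are invented for illustration, and the sketch assumes no window wraps past midnight.

```python
from itertools import combinations

# Hypothetical windows: (name, start in minutes after midnight, duration in minutes).
WINDOWS = [
    ("StgPoolBk", 6 * 60, 120),         # 06:00-08:00
    ("Migration", 7 * 60 + 30, 90),     # 07:30-09:00
    ("Expiration", 9 * 60, 60),         # 09:00-10:00
    ("Reclaim_Tape", 9 * 60 + 30, 120), # 09:30-11:30
    ("DB Backup", 12 * 60, 60),         # 12:00-13:00
]


def find_overlaps(windows):
    """Return (name_a, name_b, overlap_minutes) for every overlapping pair."""
    overlaps = []
    for (name_a, start_a, dur_a), (name_b, start_b, dur_b) in combinations(windows, 2):
        begin = max(start_a, start_b)
        end = min(start_a + dur_a, start_b + dur_b)
        if end > begin:  # windows that merely touch do not count as overlapping
            overlaps.append((name_a, name_b, end - begin))
    return overlaps


if __name__ == "__main__":
    for name_a, name_b, minutes in find_overlaps(WINDOWS):
        print(f"{name_a} overlaps {name_b} for {minutes} minutes")
```

With these made-up windows, two overlaps of 30 minutes each are reported. Real windows would come from the administrative schedule definitions and observed process durations.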

Administrative Schedules Reconfigured to Reduce Overlap

(Chart: Administrative Scheduling Overlap and Sequencing (Proposed). Axis: Time of Day, 0:00 to 22:00. Series: Backup Stgpool, Migration, Expiration, Reclamation, DB Backup, Prepare.)

Proposed adjustments:
- Eliminate most overlap
- The only remaining overlap is expiration and migration, which generally contend for different resources

Improve Reliability By Staying Within TSM Operational Limits


- The server operational limit is the point at which
  - One or more critical server health and maintenance operations cannot be contained within the time available to do them
  - Available system resources (CPU, RAM, etc.) are exhausted at or prior to peak load being satisfied and supported
- Many different indicators exist that may signal a server is at its operational limits. Depending upon how/why the limit is reached, symptoms may be:
  - Degraded performance
  - Failed operations
  - No overt sign, with the problem not known until disaster recovery of the server or clients is needed


TSM Operational Limits: Database Backup


- Backing up the server database is critical to the health and maintenance of the server
  - Required in order to prune space from ARCHIVE log directories
  - Needed in order to restore the database and recover from local device failure for database or active log
- For TSM 6.1, database size is tested to 1 TB; for V6.2, it is tested to 2 TB
- Operational limits due to DB size may be encountered PRIOR to these upper-end tested values
  - Will vary by customer based on workload, infrastructure, and organizational requirements (RTO)
- DB backup duration should periodically be evaluated to determine if the limit is reached and another server is needed
  - Time needed to back up the database exceeds the time available
  - Time to restore the database is longer than recovery time objectives (RTO) for a disaster recovery (real or simulated)
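
The two evaluation criteria above amount to a pair of comparisons. The Python sketch below shows one possible way to record them; the function name, parameters, and sample numbers are all hypothetical, and the measured durations would have to come from your own backup and restore tests.

```python
def db_backup_limit_reasons(backup_hours, backup_window_hours, restore_hours, rto_hours):
    """Return the reasons (possibly none) the DB backup operational limit is reached."""
    reasons = []
    if backup_hours > backup_window_hours:
        reasons.append("database backup takes longer than the time available")
    if restore_hours > rto_hours:
        reasons.append("database restore takes longer than the RTO")
    return reasons


if __name__ == "__main__":
    # Made-up measurements: 3 h backup in a 2 h window, 5 h restore against an 8 h RTO.
    for reason in db_backup_limit_reasons(3.0, 2.0, 5.0, 8.0):
        print("Limit reached:", reason)
```

An empty result means neither criterion is currently exceeded; it says nothing about other operational limits such as CPU or I/O saturation.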


TSM Operational Limits: Saturation


TSM is limited to the hardware and resources available to it

Operational limit may be reached when:
- Server overruns/saturates available CPU on the system at peak workload or less than peak workload
- Server overruns available RAM on the system and drives high pagefile use
- I/O bandwidth is saturated:
  - DB or active log performance degraded because I/O can't keep up
  - Storage pool actions performance degraded because I/O can't keep up
- Saturation or overrun of CPU, RAM, or I/O bandwidth is reached at or prior to achieving peak workload
  - For example, in the lab we've demonstrated that more than 1500 concurrent client sessions to the server pushed it to saturation of available memory and CPU such that performance degraded significantly


Process Rule of Thumb


- Many server processes can be run explicitly or implicitly in parallel (concurrently).
  - Example of explicit: scheduling migration and expiration to run at the same time
  - Example of implicit: expiration (V6) provides the RESOURCE parameter, which indicates the number of threads that will be used to perform the process
- As a general rule of thumb, do not specify more concurrent processes than there are CPU cores on the machine
  - For example, a 16-way box running expiration and deduplication identify could be configured to run with:
    - Expiration RESOURCE= set to a value in the range of 4 to 8
    - 4-6 identify processes running
  - In this case, if 8 and 6 were used, that would provide 8 expiration threads and 6 identify processes
    - These 14 tasks would roughly align with 14 of the available 16 CPU cores on the box
    - This would leave 2 CPU cores for support of other operations and tasks
- This is not an absolute and will vary by processor architecture and system workload.
  - This is a starting point for server configuration and for avoiding overrunning the available resources
  - Tied to a monitoring strategy where I/O, CPU, and memory are being watched, this will assist with managing an appropriate workload and sequencing on a server
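
The arithmetic behind the rule of thumb can be written down as a small check. The Python sketch below adds up the planned concurrent tasks and compares the total with the core count, using the 16-core example from this slide. The function and dictionary key names are hypothetical and are not TSM parameters.

```python
def check_process_budget(cpu_cores, planned_tasks):
    """Compare the sum of planned concurrent tasks against available CPU cores."""
    total = sum(planned_tasks.values())
    return {
        "total_tasks": total,
        "spare_cores": cpu_cores - total,
        "within_budget": total <= cpu_cores,
    }


if __name__ == "__main__":
    # Example from the slide: 8 expiration threads and 6 identify processes on 16 cores.
    plan = {"expiration_threads": 8, "identify_processes": 6}
    print(check_process_budget(16, plan))
```

For the slide's example the check reports 14 tasks with 2 spare cores. As the slide notes, this is a starting point only and should be paired with actual CPU, memory, and I/O monitoring.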


TSM Operational Limits: Possibilities


Many operational limits are resource or time based

Taking steps to improve infrastructure may result in faster operations and may mitigate or remedy the operational limit

For example, if the operational limit is database backup:
- Using a faster device for the DB backup may eliminate the limit
- Improving I/O subsystem and bandwidth for DB and logs may address the issue

In cases where it is not possible or practical to resolve the limit via improved or changed infrastructure, this may represent a cap on the existing server and a need to implement another TSM server and balance workload onto it


Monitoring: What's Available


TSM provides:
- Messages
  - Logged to local server activity log
  - Event routing to many different targets
- Summary information
- Accounting records
- V6 Reporting and Monitoring
- Administration Center
Third-party vendor tools are also available:
- Reporting tools
- Monitoring capabilities and tools


TSM Best Practice Monitoring May Involve More than Simply TSM
TSM is a large, multi-threaded software application. It exploits or has dependencies on:
- CPU: the application and database perform many calculation/instruction-intensive operations
- I/O to disk:
  - This relates to the database, active log, and archive log
  - Bottlenecks such as not enough parallel I/O capability or insufficient bandwidth (small channels) can affect server performance, scalability, and throughput
- I/O to storage hierarchy:
  - This can be disk (TSM device classes of type DISK or FILE) and sequential media (real tape and VTLs)
  - Often controllers or other virtualized appliances are used for storage devices (SVC, VTL, etc.)
  - Devices may be locally attached (SCSI) or fiber attached (SAN)
- Network: TSM is a client/server application with its client operations almost entirely network driven


Consider Monitoring Items Internal and External to TSM


External to TSM
- TSM is application software. It relies upon:
  - Operating system
  - Drivers
  - Devices
  - Firmware
Internal to TSM
- Client operations
- Server processes
- Other server tasks such as memory management, scheduling, etc.


TSM Topology: External to TSM

(Topology figure: Server/Host running AIX (TSM Server) with NIC, HBA, and SCSI adapters; direct-attach devices (disk, tape); LAN/WAN; SAN.)


Monitoring the Server/Host and Local Devices


Example: system is AIX
- Most host hardware (NIC, HBA, SCSI, planar, etc.) logs to the system
  - errpt -a to see device-reported errors
- Monitor system resources:
  - Tools like topas or others can be used to periodically look at the system and assess health
  - topas can be configured to run periodically (AIX 6.1)
    - Inittab entry: /usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /etc/perf/daily/ -ypersistent=1
    - -r 6 indicates how many to keep around
    - Raw data viewed/formatted using the topasout command
    - Stats can be collected over time to see historical trends
- Check for filesystems running low on space
  - In particular /, /opt,
- Monitor paging space
  - Growth of page file use without corresponding increases in load or activity may indicate a resource leak and can affect performance or ultimately lead to a server termination



Monitoring the LAN/WAN


Usually monitored during exception or periods of degradation

Network teams/owners typically have monitoring tools in place to:
- Identify and alert to outages
- Identify and alert to degradation

From the TSM perspective:
- Symptoms would be failed client operations due to communication issues (socket error, send error, receive error)
- Not usually evaluated or investigated unless issues are occurring



Monitoring for a SAN and SAN attached devices


- Errors are not centrally logged.
  - Usually logged to the specific device that is generating the event
  - Often individual error logs need to be accessed and evaluated
- Errors can be anywhere in the chain from the HBA to the device
  - Fiber Channel switch, router gateway
  - Library
  - Drive
  - Disk array
  - SAN controller (SVC or equivalent)

- Virtualization can hide/mask errors
  - VTL, SAN controller, etc. are systems unto themselves, running an embedded host, OS, drivers, devices, etc.
  - Evaluation of health may require vendor involvement, as the relationship between logs, devices, and errors or symptoms may not be surfaced to the end user


TSM Monitoring: Client Operations


Client operations:
- TSM clients (backup/archive, TDP, API, etc.) are the core end user, representing the data being protected by TSM
- Operations may be:
  - Scheduled
  - Manually initiated
  - Automatically initiated (such as archiving log files from a DB2/UDB client)
- Monitor for:
  - Failed sessions or schedules
  - Schedule issues such as missed or failed schedule events
  - Unusual or unexpected session termination
- Clients log ANExxxx messages to the server
  - If the client encounters a local error (an error on the client system) while an operation is in flight, many of these will be reported and logged to the server
- Client operations record summary information (SELECT * FROM SUMMARY) as well as logging messages to the server activity log


TSM Monitoring: Server Processes


Server processes:
- Server maintenance tasks such as expiration, storage pool backup, migration, and reclamation are done as server processes
- Many processes can be WAIT=YES (synchronous) or WAIT=NO (asynchronous)
- Server processes ALL issue process started and ended messages
  - End messages report statistics as applicable
  - End messages report SUCCESS or FAILURE of an operation
- Monitor for:
  - Failed processes
  - Cancelled processes will report as failed; there should be other messages logged to the activity log indicating the cancellation
- If processes are not succeeding, evaluate:
  - Was this an appropriate time/reason for the process to be run?
  - If it is an insufficient resource issue, can additional resources be made available?
  - Or can the process be initiated at a different time when resources are available?
- Server processes record summary information (SELECT * FROM SUMMARY) as well as logging messages to the server activity log


TSM Monitoring: Analysis Using Message Tokens


- Activity log messages are tagged with session and process tokens.
  - Session token example: ANR1234E xxxxx (SESSION: 985)
  - Process token example: ANR2345E xxxx (PROCESS: 38291)
  - Session and process token example: ANR3456E xxxx (SESSION: 765 PROCESS: 998512)
- Use:
  - If a process or session encounters an error, query the activity log using the session or process number to see all the messages for that action
  - Other messages before and after the failure may be found easily in this fashion. For example: QUERY ACTLOG SEARCH=(SESSION: 985) to search for all the messages for session 985
  - More details about the events leading up to the error, or about the error itself, may be seen by doing this
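
When activity log output has been saved to a file, the same tokens can be used to filter it offline. The Python sketch below pulls the session and process numbers out of message text with regular expressions, assuming the token layout shown in the examples above. The sample messages are the placeholders from this slide, and the helper names are hypothetical.

```python
import re

# Patterns assume the "(SESSION: n)" / "(PROCESS: n)" layout shown on the slide.
SESSION_RE = re.compile(r"SESSION:\s*(\d+)")
PROCESS_RE = re.compile(r"PROCESS:\s*(\d+)")

SAMPLE = [
    "ANR1234E xxxxx (SESSION: 985)",
    "ANR2345E xxxx (PROCESS: 38291)",
    "ANR3456E xxxx (SESSION: 765 PROCESS: 998512)",
]


def extract_tokens(message):
    """Return (session, process); each is an int, or None if the token is absent."""
    session = SESSION_RE.search(message)
    process = PROCESS_RE.search(message)
    return (
        int(session.group(1)) if session else None,
        int(process.group(1)) if process else None,
    )


def messages_for_session(messages, session_number):
    """Keep only the messages tagged with the given session number."""
    return [m for m in messages if extract_tokens(m)[0] == session_number]


if __name__ == "__main__":
    for message in SAMPLE:
        print(extract_tokens(message), message)
```

Filtering by process number works the same way using the second element of the tuple.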


Conclusion
- Server workflow priorities:
  - Protect client data
  - Maintain the server
  - Protect the server
- Priorities then provide sequencing of actions, which can
  - Be orchestrated via scheduled (type=admin) scripts
  - Use scripts structured with PARALLEL and SERIAL semantics to sequence actions and manage resources while satisfying the workflow priority actions
- Operational limits
  - Have been defined
  - Steps to identify them and possible actions have been discussed
- Monitoring considerations have been discussed for:
  - Server topology
  - And the server itself


A few useful links


V6 Deployment best practices: http://www-01.ibm.com/support/docview.wss?uid=swg21421060

Database Reorganization: http://www-01.ibm.com/support/docview.wss?uid=swg21452146

Memory requirements for V6: http://www-01.ibm.com/support/docview.wss?uid=swg21450229

TSM configured for HADR: http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/Electronic+vaulting+using+deduplicated+remote+copy+storage+pools
