IBM
AIX Version 7.2
Note
Before using this information and the product it supports, read the information in “Notices” on page 275.
This edition applies to AIX Version 7.1 and to all subsequent releases and modifications until otherwise indicated in
new editions.
© Copyright IBM Corporation 2015, 2018.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

About this document
    Highlighting
    Case-sensitivity in AIX
    ISO 9000

Performance Tools Guide and Reference
    What's new in Performance Tools Guide and Reference
    CPU Utilization Reporting Tool (curt)
        Syntax for the curt Command
        Measurement and Sampling
        Examples of the curt command
    Simple performance lock analysis tool (splat)
        splat command syntax
        Measurement and sampling
        Examples of generated reports
    Hardware performance monitor APIs and tools
        Performance monitor accuracy
        Performance monitor context and state
        Performance monitoring agent
        POWERCOMPAT events
        Thread accumulation and thread group accumulation
        Security considerations
        The pmapi library
        The hpm library and associated tools
    Perfstat API programming
        API characteristics
        Global interfaces
        Component-Specific interfaces
        WPAR Interfaces
        RSET Interfaces
        Cached metrics interfaces
        Node interfaces
        Change history of the perfstat API
    Kernel tuning
        Migration and compatibility
        Tunables file directory
        Tunable parameters type
        Common syntax for tuning commands
        Tunable file-manipulation commands
        Initial setup
        Reboot tuning procedure
        Recovery Procedure
        Kernel tuning using the SMIT interface
    The procmon tool
        Overview of the procmon tool
        Components of the procmon tool
        Filtering processes
        Performing AIX commands on processes
    Profiling tools
        The timing commands
        The prof command
        The gprof command
        The tprof command
    The svmon command
        Security
        The svmon configuration file
        Summary report metrics
        Report formatting options
        Segment details and -O options
        Additional -O options
        Reports details
    Remote Statistics Interface API Overview
        Remote Statistics Interface list of subroutines
        RSI Interface Concepts and Terms
        A Simple Data-Consumer Program
        Expanding the data-consumer program
        Inviting data suppliers
        A Full-Screen, character-based monitor
        List of RSI Error Codes

Notices
    Privacy policy considerations
    Trademarks

Index
The information contained in this document pertains to systems running AIX 7.1 or later. Any content that is applicable to earlier releases is noted as such.
Highlighting
The following highlighting conventions are used in this document:
Bold        Identifies commands, subroutines, keywords, files, structures, directories, and other items whose names are predefined by the system. Also identifies graphical objects such as buttons, labels, and icons that the user selects.
Italics     Identifies parameters whose actual names or values are to be supplied by the user.
Monospace   Identifies examples of specific data values, examples of text similar to what you might see displayed, examples of portions of program code similar to what you might write as a programmer, messages from the system, or information you should actually type.
Case-sensitivity in AIX
Everything in the AIX operating system is case-sensitive, which means that it distinguishes between
uppercase and lowercase letters. For example, you can use the ls command to list files. If you type LS, the
system responds that the command is not found. Likewise, FILEA, FiLea, and filea are three distinct file
names, even if they reside in the same directory. To avoid unintended actions, always use the correct case.
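This behavior is easy to demonstrate in the shell. The following sketch (using a temporary scratch directory; the file names are just the illustrative ones above) shows that a case-sensitive file system treats the three names as separate files:

```shell
# Create a scratch directory and the three distinct file names
dir=$(mktemp -d)
touch "$dir/FILEA" "$dir/FiLea" "$dir/filea"

# On a case-sensitive file system (as in AIX), all three are listed
ls "$dir"

# Count the directory entries; a case-sensitive file system reports 3
ls "$dir" | wc -l

# Clean up
rm -r "$dir"
```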
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
The path to achieving this objective is a balance between appropriate expectations and optimizing the
available system resources. The performance-tuning process demands great skill, knowledge, and
experience, and cannot be performed by only analyzing statistics, graphs, and figures. If results are to be
achieved, the human aspect of perceived performance must not be neglected. Performance tuning also
takes into consideration problem-determination aspects as well as pure performance issues.
Limitations originating from the sizing phase will either limit the possibility of tuning, or incur greater
cost to overcome them. The system might not meet the original performance expectations because of
unrealistic expectations, physical problems in the computer environment, or human error in the design or
implementation of the system. In the worst case, adding or replacing hardware might be necessary. Be
particularly careful when sizing a system to permit enough capacity for unexpected system loads. In
other words, do not design the system to be 100 percent busy from the start of the project.
When a system in a productive environment still meets the performance expectations for which it was
initially designed, but the demands and needs of the utilizing organization have outgrown the system's
basic capacity, performance tuning is performed to delay or even to avoid the cost of adding or replacing
hardware.
Many performance-related issues can be traced back to operations performed by a person with limited
experience and knowledge who unintentionally restricted some vital logical or physical resource of the
system.
Note: The metrics reported by any statistics tool (such as lparstat, vmstat, iostat, or mpstat), including applications that are based on the Perfstat API or SPMI API, vary to a certain extent at any point in time. If a command is run multiple times for an instance, the values might not be identical for that instance.
In this PDF file, you might see revision bars (|) in the left margin that identify new and changed
information.
October 2016
The following information is a summary of the updates made to this topic collection:
• Updated the Node interfaces topic with the perfstat_cluster_disk interface example.
The curt command works with both uniprocessor and multiprocessor AIX Version 4 and AIX Version 5
traces.
curt -i inputfile [-o outputfile] [-n gensymsfile] [-m trcnmfile] [-a pidnamefile] [-f timestamp] [-l timestamp] [-r PURR] [-ehpstP]
Flags
Item Descriptor
-i inputfile Specifies the input AIX trace file to be analyzed.
-o outputfile Specifies an output file (default is stdout).
-n gensymsfile Specifies a names file produced by gensyms.
-m trcnmfile Specifies a names file produced by trcnm.
-a pidnamefile Specifies a PID-to-process name mapping file.
-f timestamp Starts processing trace at timestamp seconds.
-l timestamp Stops processing trace at timestamp seconds.
-r PURR Uses the PURR register to calculate CPU times.
-e Outputs elapsed time information for system calls.
-h Displays usage text (this information).
-p Outputs detailed process information.
-s Outputs information about errors returned by system calls.
-t Outputs detailed thread information.
-P Outputs detailed pthread information.
Parameters
The following table lists the minimum trace hooks required for the curt command. Using only these trace
hooks will limit the size of the trace file. However, other events on the system might not be captured in
this case. This is significant if you intend to analyze the trace in more detail.
Trace hooks 119 and 135 are used to report on the time spent in the exit system call. Trace hooks 134, 139,
210, and 465 are used to keep track of TIDs, PIDs and process names.
Trace hook 492 is used to report on the time spent in the hypervisor.
Trace hooks 605 and 609 are used to report on the time spent in the pthreads library.
To get the PTHREAD hooks in the trace, you must execute your pthread application using the
instrumented libpthreads.a library.
Trace and name files are generated using the following process:
1. Build the raw trace. On a 4-way machine, this will create files as listed in the example code below.
One raw trace file per CPU is produced. The files are named trace.raw-0, trace.raw-1, and so forth for
each CPU. An additional file named trace.raw is also generated. This is a master file that has
information that ties together the other CPU-specific traces.
Note: If you want pthread information in the curt report, you must add the instrumented libpthreads
directory to the library path, LIBPATH, when you build the trace. Otherwise, the export LIBPATH
statement in the example below is unnecessary.
2. Merge the trace files. To merge the individual CPU raw trace files to form one trace file, run the
trcrpt command. If you are tracing a uniprocessor machine, this step is not necessary.
3. Create the supporting gensymsfile and trcnmfile files by running the gensyms and trcnm
commands. Neither the gensymsfile nor the trcnmfile file is necessary for the curt command to run.
However, if you provide one or both of these files, or if you use the trace command with the -n
option, the curt command outputs names for system calls and interrupt handlers instead of just
addresses. The gensyms command output includes more information than the trcnm command
output. While the trcnmfile file contains most of the important address-to-name mapping data, a
gensymsfile file enables the curt command to output more names, so gensyms is the preferred
command for collecting address-to-name mapping data.
The following is an example of how to generate input files for the curt command:
# HOOKS="100,101,102,103,104,106,10C,119,134,135,139,200,210,215,38F,419,465,47F,488,489,48A,
48D,492,605,609"
# SIZE="1000000"
# export HOOKS SIZE
# trace -n -C all -d -j $HOOKS -L $SIZE -T $SIZE -afo trace.raw
# export LIBPATH=/usr/ccs/lib/perf:$LIBPATH
# trcon ; pthread.app ; trcstop
# unset HOOKS SIZE
# ls trace.raw*
trace.raw trace.raw-0 trace.raw-1 trace.raw-2 trace.raw-3
# trcrpt -C all -r trace.raw > trace.r
# rm trace.raw*
# ls trace*
trace.r
# gensyms > gensyms.out
# trcnm > trace.nm
The following is an overview of the content of the report that the curt command generates:
• A report header, including the trace file name, the trace size, and the date and time the trace was taken. The header also includes the command that was used when the trace was run. If the PURR register was used to calculate CPU times, this information is also included in the report header.
• For each CPU (and a summary of all the CPUs), processing time expressed in milliseconds and as a percentage (idle and non-idle percentages are included) for various CPU usage categories.
• For each CPU (and a summary of all the CPUs), processing time expressed in milliseconds and as a percentage for CPU usage in application mode for various application usage categories.
• Average thread affinity across all CPUs and for each individual CPU.
• For each CPU (and for all the CPUs), the physical CPU time spent and the percentage of total time this represents.
• Average physical CPU affinity across all CPUs and for each individual CPU.
• The physical CPU dispatch histogram of each CPU.
• The number of preemptions, and the number of H_CEDE and H_CONFER hypervisor calls for each individual CPU.
• The total number of idle and non-idle process dispatches for each individual CPU.
• Average pthread affinity across all CPUs and for each individual CPU.
• The total number of idle and non-idle pthread dispatches for each individual CPU.
• Information on the amount of CPU time spent in application and system call (syscall) mode, expressed in milliseconds and as a percentage, by thread, process, and process type. Also included are the number of threads per process and per process type.
• Information on the amount of CPU time spent executing each kernel process, including the idle process, expressed in milliseconds and as a percentage of the total CPU time.
• Information on the amount of CPU time spent executing calls to libpthread, expressed in milliseconds and as percentages of the total time and the total application time.
• Information on completed system calls that includes the name and address of the system call, the number of times the system call was executed, and the total CPU time expressed in milliseconds and as a percentage, with the average, minimum, and maximum time the system call was running.
• Information on pending system calls, that is, system calls for which the system call return has not occurred by the end of the trace. The information includes the name and address of the system call, the thread or process that made the system call, and the accumulated CPU time the system call was running, expressed in milliseconds.
• Information on completed hypervisor calls that includes the name and address of the hypervisor call, the number of times the hypervisor call was executed, and the total CPU time expressed in milliseconds and as a percentage, with the average, minimum, and maximum time the hypervisor call was running.
• Information on pending hypervisor calls, which are hypervisor calls that were not completed by the end of the trace. The information includes the name and address of the hypervisor call, the thread or process that made the hypervisor call, and the accumulated CPU time the hypervisor call was running, expressed in milliseconds.
• Information on completed pthread calls that includes the name of the pthread call routine, the number of times the pthread call was executed, and the total CPU time expressed in milliseconds, with the average, minimum, and maximum time the pthread call was running.
• Information on pending pthread calls, that is, pthread calls for which the pthread call return has not occurred by the end of the trace. The information includes the name of the pthread call; the process, the thread, and the pthread that made the pthread call; and the accumulated CPU time the pthread call was running, expressed in milliseconds.
To create additional, specialized reports, run the curt command using the following flags:
Item Descriptor
-e Produces reports containing statistics and additional information on the System Calls Summary Report, Pending System
Calls Summary Report, Hypervisor Calls Summary Report, Pending Hypervisor Calls Summary Report, System NFS
Calls Summary Report, Pending NFS Calls Summary, Pthread Calls Summary, and the Pending Pthread Calls Summary.
The additional information pertains to the total, average, maximum, and minimum elapsed times that a system call was
running.
-s Produces a report containing a list of errors returned by system calls.
-t Produces a report containing a detailed report on thread status that includes the amount of CPU time the thread was in
application and system call mode, what system calls the thread made, processor affinity, the number of times the thread
was dispatched, and to which CPU(s) it was dispatched. The report also includes dispatch wait time and details of
interrupts.
-p Produces a report containing a detailed report on process status that includes the amount of CPU time the process was
in application and system call mode, application time details, threads that were in the process, pthreads that were in
the process, pthread calls that the process made and system calls that the process made.
-P Produces a report containing a detailed report on pthread status that includes the amount of CPU time the pthread was
in application and system call mode, system calls made by the pthread, pthread calls made by the pthread, processor
affinity, the number of times the pthread was dispatched and to which CPU(s) it was dispatched, thread affinity, and
the number of times the pthread was dispatched and to which kernel thread(s) it was dispatched. The report also
includes dispatch wait time and details of interrupts.
This section explains the default report created by the curt command, as follows:
# curt -i trace.r -n gensyms.out -o curt.out
General information:
The general information displays the time and date when the report was generated, and is followed by
the syntax of the curt command line that was used to produce the report.
This section also contains some information about the AIX trace file that was processed by the curt
command. This information consists of the trace file's name, size, and its creation date. The command
used to invoke the AIX trace facility and gather the trace file is displayed at the end of the report.
System summary:
The system summary information produced by the curt command describes the time spent by the whole
system (all CPUs) in various execution modes.
The System Summary example indicates that the CPU is spending most of its time in application mode. There is still 4234.76 ms of IDLE time, so there is enough CPU to run applications. If there is insufficient CPU power, do not expect to see any IDLE time. The Avg. Thread Affinity value is 0.99, showing good processor affinity; that is, threads return to the same processor when they are ready to run again.
The system application summary information produced by the curt command describes the time spent by
the system as a whole (all CPUs) in various execution modes.
The same description that was given for the system summary and system application summary applies
here, except that this report covers each processor rather than the whole system.
The application summary, by Tid, displays an output of all the threads that were running on the system
during the time of trace collection and their CPU consumption. The thread that consumed the most CPU
time during the time of the trace collection is displayed at the top of the output.
Application Summary (by Tid)
----------------------------
-- processing total (msec) -- -- percent of total processing time --
combined application syscall combined application syscall name (Pid Tid)
======== =========== ======= ======== =========== ======= ===================
4986.2355 4986.2355 0.0000 24.4214 24.4214 0.0000 cpu(18418 32437)
4985.8051 4985.8051 0.0000 24.4193 24.4193 0.0000 cpu(19128 33557)
4982.0331 4982.0331 0.0000 24.4009 24.4009 0.0000 cpu(18894 28671)
83.8436 2.5062 81.3374 0.4106 0.0123 0.3984 disp+work(20390 28397)
72.5809 2.7269 69.8540 0.3555 0.0134 0.3421 disp+work(18584 32777)
69.8023 2.5351 67.2672 0.3419 0.0124 0.3295 disp+work(19916 33033)
63.6399 2.5032 61.1368 0.3117 0.0123 0.2994 disp+work(17580 30199)
63.5906 2.2187 61.3719 0.3115 0.0109 0.3006 disp+work(20154 34321)
62.1134 3.3125 58.8009 0.3042 0.0162 0.2880 disp+work(21424 31493)
60.0789 2.0590 58.0199 0.2943 0.0101 0.2842 disp+work(21992 32539)
...(lines omitted)...
In the example above, we can investigate why the system is spending so much time in application mode by looking at the Application Summary (by Tid), where the top three entries belong to a test program named cpu that uses a great deal of CPU time. The report shows that the CPU spent most of its time in application mode running the cpu process. Therefore, the cpu process is a candidate for optimization to improve system performance.
The application summary, by Pid, has the same content as the application summary, by Tid, except that
the threads that belong to each process are consolidated and the process that consumed the most CPU
time during the monitoring period is at the beginning of the list.
The name (PID) (Thread Count) column shows the process name, its process ID, and the number of
threads that belong to this process and that have been accumulated for this line of data.
Application Summary (by Pid)
----------------------------
-- processing total (msec) -- -- percent of total processing time --
combined application syscall combined application syscall name (Pid)(Thread Count)
======== =========== ======= ======== =========== ======= ===================
4986.2355 4986.2355 0.0000 24.4214 24.4214 0.0000 cpu(18418)(1)
4985.8051 4985.8051 0.0000 24.4193 24.4193 0.0000 cpu(19128)(1)
4982.0331 4982.0331 0.0000 24.4009 24.4009 0.0000 cpu(18894)(1)
83.8436 2.5062 81.3374 0.4106 0.0123 0.3984 disp+work(20390)(1)
72.5809 2.7269 69.8540 0.3555 0.0134 0.3421 disp+work(18584)(1)
69.8023 2.5351 67.2672 0.3419 0.0124 0.3295 disp+work(19916)(1)
63.6399 2.5032 61.1368 0.3117 0.0123 0.2994 disp+work(17580)(1)
63.5906 2.2187 61.3719 0.3115 0.0109 0.3006 disp+work(20154)(1)
62.1134 3.3125 58.8009 0.3042 0.0162 0.2880 disp+work(21424)(1)
60.0789 2.0590 58.0199 0.2943 0.0101 0.2842 disp+work(21992)(1)
...(lines omitted)...
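Because the summary is plain columnar text, it can be post-processed with standard tools. The following sketch is not part of the curt tooling; it assumes a file (written inline here, using data rows in the by-Pid layout shown above) and uses awk to list the processes whose system-call share of processing time exceeds 0.1 percent:

```shell
# Data rows in the Application Summary (by Pid) column layout:
# combined, application, syscall (msec); combined, application,
# syscall (percent); name(Pid)(Thread Count)
cat > summary.txt << 'EOF'
4986.2355 4986.2355  0.0000 24.4214 24.4214 0.0000 cpu(18418)(1)
4985.8051 4985.8051  0.0000 24.4193 24.4193 0.0000 cpu(19128)(1)
  83.8436    2.5062 81.3374  0.4106  0.0123 0.3984 disp+work(20390)(1)
  72.5809    2.7269 69.8540  0.3555  0.0134 0.3421 disp+work(18584)(1)
EOF

# Field 6 is the syscall percentage; print name and syscall msec
# for rows above the 0.1 percent threshold (the disp+work entries)
awk '$6 > 0.1 { printf "%s %s ms in syscalls\n", $7, $3 }' summary.txt

rm summary.txt
```

Applied to a full report, a filter like this can help decide which processes to examine further with the -p or -t detailed reports.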
The application summary by process type consolidates all processes of the same name and sorts them in
descending order of combined processing time.
...(lines omitted)...
The Kproc summary, by Tid, displays an output of all the kernel process threads that were running on
the system during the time of trace collection and their CPU consumption. The thread that consumed the
most CPU time during the time of the trace collection is displayed at the beginning of the output.
Kproc Summary (by Tid)
-----------------------
-- processing total (msec) -- -- percent of total time --
combined kernel operation combined kernel operation name (Pid Tid Type)
======== ====== =========== ======== ====== =========== ===================
1930.9312 1930.9312 0.0000 13.6525 13.6525 0.0000 wait(8196 8197 W)
2.1674 2.1674 0.0000 0.0153 0.0153 0.0000 .WSMRefreshServe(0 3 -)
1.9034 1.9034 1.8020 0.0135 0.0135 0.0128 nfsd(36882 49177 N)
0.6609 0.5789 0.0820 0.0002 0.0002 0.0000 kbiod(8050 86295 N)
...(lines omitted)...
Kproc Types
-----------
Type Function Operation
==== ============================ ==========================
W idle thread -
N NFS daemon NFS Remote Procedure Calls
Kproc Types
Item Descriptor
Type A single letter to be used as an index into this listing.
Function A description of the nominal function of this type of kernel process.
Operation A description of the traced operations for this type of kernel process.
The application Pthread summary, by Pid, displays all the multi-threaded processes that were running on the system during trace collection and that spent time making pthread calls, along with their CPU consumption. The process that consumed the most CPU time during the trace collection is displayed at the beginning of the list.
Application Pthread Summary (by Pid)
------------------------------------
-- processing total (msec) -- -- percent of total application time --
application other pthread application other pthread name (Pid)(Pthread Count)
=========== ========== ========== =========== ========== ========== =========================
1277.6602 1274.9354 2.7249 23.8113 23.7605 0.0508 ./pth(245964)(52)
802.6445 801.4162 1.2283 14.9586 14.9357 0.0229 ./pth32(245962)(12)
...(lines omitted)...
The System Calls Summary provides a list of all the system calls that have completed execution on the
system during the monitoring period. The list is sorted by the total CPU time in milliseconds consumed
by each type of system call.
System Calls Summary
--------------------
Count Total Time % sys Avg Time Min Time Max Time SVC (Address)
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ================
605 355.4475 1.74% 0.5875 0.0482 4.5626 kwrite(4259c4)
733 196.3752 0.96% 0.2679 0.0042 2.9948 kread(4259e8)
3 9.2217 0.05% 3.0739 2.8888 3.3418 execve(1c95d8)
38 7.6013 0.04% 0.2000 0.0051 1.6137 __loadx(1c9608)
1244 4.4574 0.02% 0.0036 0.0010 0.0143 lseek(425a60)
45 4.3917 0.02% 0.0976 0.0248 0.1810 access(507860)
63 3.3929 0.02% 0.0539 0.0294 0.0719 _select(4e0ee4)
2 2.6761 0.01% 1.3380 1.3338 1.3423 kfork(1c95c8)
207 2.3958 0.01% 0.0116 0.0030 0.1135 _poll(4e0ecc)
228 1.1583 0.01% 0.0051 0.0011 0.2436 kioctl(4e07ac)
9 0.8136 0.00% 0.0904 0.0842 0.0988 .smtcheckinit(1b245a8)
5 0.5437 0.00% 0.1087 0.0696 0.1777 open(4e08d8)
15 0.3553 0.00% 0.0237 0.0120 0.0322 .smtcheckinit(1b245cc)
2 0.2692 0.00% 0.1346 0.1339 0.1353 statx(4e0950)
33 0.2350 0.00% 0.0071 0.0009 0.0210 _sigaction(1cada4)
1 0.1999 0.00% 0.1999 0.1999 0.1999 kwaitpid(1cab64)
102 0.1954 0.00% 0.0019 0.0013 0.0178 klseek(425a48)
...(lines omitted)...
The pending system calls summary provides a list of all the system calls that have been executed on the
system during the monitoring period but have not completed. The list is sorted by Tid.
Pending System Calls Summary
----------------------------
Accumulated  SVC (Address)             Procname (Pid Tid)
Time (msec)
...(lines omitted)...
The Hypervisor calls summary provides a list of all the hypervisor calls that have completed execution
on the system during the monitoring period. The list is sorted by the total CPU time, in milliseconds,
consumed by each type of hypervisor call.
Hypervisor Calls Summary
------------------------
Count Total Time % sys Avg Time Min Time Max Time HCALL (Address)
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== =================
4 0.0077 0.00% 0.0019 0.0014 0.0025 H_XIRR(3ada19c)
4 0.0070 0.00% 0.0017 0.0015 0.0021 H_EOI(3ad6564)
The pending Hypervisor calls summary provides a list of all the hypervisor calls that have been executed
on the system during the monitoring period but have not completed. The list is sorted by Tid.
Pending Hypervisor Calls Summary
--------------------------------
Accumulated HCALL (Address) Procname (Pid Tid)
Time (msec)
============ ========================= ==========================
0.0066 H_XIRR(3ada19c) syncd(3916 5981)
The system NFS calls summary provides a list of all the system NFS calls that have completed execution
on the system during the monitoring period. The list is divided by NFS versions, and each list is sorted
by the total CPU time, in milliseconds, consumed by each type of system NFS call.
System NFS Calls Summary
------------------------
Count Total Time Avg Time Min Time Max Time % Tot % Tot Opcode
(msec) (msec) (msec) (msec) Time Count
======== =========== ======== ======== ======== ===== ===== =============
253 48.4115 0.1913 0.0952 1.0097 98.91 98.83 RFS2_READLINK
2 0.3959 0.1980 0.1750 0.2209 0.81 0.78 RFS2_LOOKUP
1 0.1373 0.1373 0.1373 0.1373 0.28 0.39 RFS2_NULL
-------- ----------- -------- -------- -------- ----- ----- -------------
256 48.9448 0.1912 NFS V2 TOTAL
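The columns in this summary are internally consistent: the TOTAL line is the sum of the per-opcode counts and times, and each Avg Time is Total Time divided by Count. A quick arithmetic check with awk, using the NFS V2 figures from the sample above:

```shell
awk 'BEGIN {
    count = 253 + 2 + 1                 # RFS2_READLINK + RFS2_LOOKUP + RFS2_NULL
    total = 48.4115 + 0.3959 + 0.1373   # per-opcode total times, msec
    printf "calls=%d total=%.4f msec avg=%.4f msec\n", count, total, total / count
}'
```

This reproduces the Count (256) and Avg Time (0.1912 msec) on the TOTAL line; the summed time differs from the printed 48.9448 only in the last digit, because the report rounds from unrounded internal values.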
The pending NFS calls summary provides a list of all the system NFS calls that have executed on the
system during the monitoring period but have not completed. The list is sorted by the Tid.
Pending NFS Calls Summary
-------------------------
Accumulated Sequence Number Procname (Pid Tid)
Time (msec) Opcode
============ =============== ==========================
0.0831 1038711932 nfsd(1007854 331969)
0.0833 1038897247 nfsd(1007854 352459)
0.0317 1038788652 nfsd(1007854 413931)
0.0029 NFS4_ATTRCACHE kbiod(100098 678934)
..(lines omitted)...
The Pending System NFS Calls Summary has the following fields:
Item Descriptor
Accumulated Time (msec) The accumulated CPU time that the system spent processing the pending system NFS call,
expressed in milliseconds.
Sequence Number The sequence number represents the transaction identifier (XID) of an NFS operation. It is
used to uniquely identify an operation and is used in the RPC call/reply messages. This
number is provided instead of the operation name because the name of the operation is
unknown until it completes.
Opcode The name of the pending NFS V4 operation.
Procname (Pid Tid) The name of the process associated with the thread that made the system NFS call, its
process ID, and the thread ID.
The Pthread calls summary provides a list of all the pthread calls that have completed execution on the
system during the monitoring period. The list is sorted by the total CPU time, in milliseconds, consumed
by each type of pthread call.
The pending Pthread calls summary provides a list of all the pthread calls that have been executed on the
system during the monitoring period but have not completed. The list is sorted by Pid-Ptid.
Pending Pthread Calls Summary
-----------------------------
Accumulated Pthread Routine Procname (Pid Tid Ptid)
Time (msec)
============ =============== ==========================
1990.9400 pthread_join ./pth32(245962 1007759 1)
The Pending Pthread System Calls Summary has the following fields:
Item                     Descriptor
Accumulated Time (msec)  The accumulated CPU time that the system spent processing the pending pthread call, expressed in milliseconds.
Pthread Routine          The name of the pthread routine of the libpthreads library.
Procname (Pid Tid Ptid)  The name of the process associated with the thread and the pthread that made the pthread call, its process ID, the thread ID, and the pthread ID.
FLIH summary:
The FLIH (First Level Interrupt Handler) summary lists all first level interrupt handlers that were called
during the monitoring period.
The Global FLIH Summary lists the total of first level interrupts on the system, while the Per CPU FLIH
Summary lists the first level interrupts per CPU.
Global Flih Summary
-------------------
Count Total Time Avg Time Min Time Max Time Flih Type
(msec) (msec) (msec) (msec)
====== =========== =========== =========== =========== =========
2183 203.5524 0.0932 0.0041 0.4576 31(DECR_INTR)
946 102.4195 0.1083 0.0063 0.6590 3(DATA_ACC_PG_FLT)
12 1.6720 0.1393 0.0828 0.3366 32(QUEUED_INTR)
CPU Number 1:
Count Total Time Avg Time Min Time Max Time Flih Type
(msec) (msec) (msec) (msec)
====== =========== =========== =========== =========== =========
4 0.2405 0.0601 0.0517 0.0735 3(DATA_ACC_PG_FLT)
258 49.2098 0.1907 0.0060 0.5076 5(IO_INTR)
515 55.3714 0.1075 0.0080 0.3696 31(DECR_INTR)
...(lines omitted)...
The following are the FLIH types that appear in the FLIH summary.
SLIH summary:
The Second level interrupt handler (SLIH) Summary lists all second level interrupt handlers that were
called during the monitoring period.
The Global Slih Summary lists the total of second level interrupts on the system, while the Per CPU Slih
Summary lists the second level interrupts per CPU.
Global Slih Summary
-------------------
Count Total Time Avg Time Min Time Max Time Slih Name(Address)
(msec) (msec) (msec) (msec)
====== =========== =========== =========== =========== =================
43 7.0434 0.1638 0.0284 0.3763 s_scsiddpin(1a99104)
1015 42.0601 0.0414 0.0096 0.0913 ssapin(1990490)
...(lines omitted)...
The report generated with the -e flag includes the data shown in the default report, and also includes
additional information in the System Calls Summary, the Pending System Calls Summary, the Hypervisor
Calls Summary, the Pending Hypervisor Calls Summary, the System NFS Calls Summary, the Pending
NFS Calls Summary, the Pthread Calls Summary and the Pending Pthread Calls Summary.
The following is an example of the additional information reported by using the -e flag:
# curt -e -i trace.r -m trace.nm -n gensyms.out -o curt.out
# cat curt.out
...(lines omitted)...
The system call, hypervisor call, NFS call, and pthread call reports in the preceding example have the
following fields in addition to the default System Calls Summary, Hypervisor Calls Summary, System
NFS Calls Summary, and Pthread Calls Summary:
Item Descriptor
Tot ETime (msec) The total amount of time from when each instance of the call was started until it completed. This
time will include any time spent servicing interrupts, running other processes, and so forth.
Avg ETime (msec) The average amount of time from when the call was started until it completed. This time will
include any time spent servicing interrupts, running other processes, and so forth.
Min ETime (msec) The minimum amount of time from when the call was started until it completed. This time will
include any time spent servicing interrupts, running other processes, and so forth.
Max ETime (msec) The maximum amount of time from when the call was started until it completed. This time will
include any time spent servicing interrupts, running other processes, and so forth.
Accumulated ETime (msec) The total amount of time from when the pending call was started until the end of the trace. This
time will include any time spent servicing interrupts, running other processes, and so forth.
The preceding example report shows that the maximum elapsed time for the kwrite system call was
422.2323 msec, but the maximum CPU time was 4.5626 msec. If this amount of overhead time is unusual
for the device being written to, further analysis is needed.
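The gap between elapsed time and CPU time is easy to reproduce outside of curt. The following Python sketch (illustrative only; curt derives these values from AIX trace events) contrasts wall-clock elapsed time with processor time for a call that spends most of its time blocked, much like a kwrite to a slow device:

```python
import time

def timed_call(fn):
    """Return (elapsed, cpu) seconds for one call, analogous to curt's
    ETime versus CPU time columns."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

# A call that mostly sleeps is like a write blocked on a slow device:
# elapsed time is large while CPU time stays small.
elapsed, cpu = timed_call(lambda: time.sleep(0.2))
print(f"elapsed={elapsed * 1000:.1f} msec, cpu={cpu * 1000:.3f} msec")
```

A large elapsed-to-CPU ratio, as in the kwrite example, points at time spent blocked or preempted rather than computing.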
The report generated with the -s flag includes the data shown in the default report, and data on errors
returned by system calls.
# curt -s -i trace.r -m trace.nm -n gensyms.out -o curt.out
# cat curt.out
...(lines omitted)...
If a large number of errors of a specific type or on a specific system call point to a system or application
problem, other debug measures can be used to determine and fix the problem.
The report generated with the -t flag includes the data shown in the default report, and also includes a
detailed report on thread status that includes the amount of time the thread was in application and
system call mode, what system calls the thread made, processor affinity, the number of times the thread
was dispatched, and to which CPUs it was dispatched.
The report also includes dispatch wait time and details of interrupts:
...(lines omitted)...
--------------------------------------------------------------------------------
Report for Thread Id: 48841 (hex bec9) Pid: 143984 (hex 23270)
Process Name: oracle
---------------------
Total Application Time (ms): 70.324465
Total System Call Time (ms): 53.014910
Total Hypervisor Call Time (ms): 0.077000
Count Total Time Avg Time Min Time Max Time SVC (Address)
(msec) (msec) (msec) (msec)
======== =========== =========== =========== =========== ================
69 34.0819 0.4939 0.1666 1.2762 kwrite(169ff8)
77 12.0026 0.1559 0.0474 0.2889 kread(16a01c)
510 4.9743 0.0098 0.0029 0.0467 times(f1e14)
73 1.2045 0.0165 0.0105 0.0306 select(1d1704)
68 0.6000 0.0088 0.0023 0.0445 lseek(16a094)
12 0.1516 0.0126 0.0071 0.0241 getrusage(f1be0)
...(lines omitted)...
If the thread belongs to an NFS kernel process, the report will include information on NFS operations
instead of System calls:
Report for Thread Id: 1966273 (hex 1e00c1) Pid: 1007854 (hex f60ee)
Process Name: nfsd
---------------------
Total Kernel Time (ms): 3.198998
Total Operation Time (ms): 28.839927
Total Hypervisor Call Time (ms): 0.000000
The report generated with the -p flag includes the data shown in the default report and also includes a
detailed report for each process that includes the Process ID and name, a count and list of the thread IDs,
and the count and list of the pthread IDs belonging to the process. The total application time, the system
call time, and the application time details for all the threads of the process are given. Lastly, it includes
summary reports of all the completed and pending system calls, and pthread calls for the threads of the
process.
The following example shows the report generated for the router process (PID 129190):
Process Details for Pid: 129190
7 Tids for this Pid: 245889 245631 244599 82843 78701 75347 28941
9 Ptids for this Pid: 2057 1800 1543 1286 1029 772 515 258 1
Count Total Time % sys Avg Time Min Time Max Time SVC (Address)
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ================
93 3.6829 0.05% 0.0396 0.0060 0.3077 kread(19731c)
23 2.2395 0.03% 0.0974 0.0090 0.4537 kwrite(1972f8)
30 0.8885 0.01% 0.0296 0.0073 0.0460 select(208c5c)
1 0.5933 0.01% 0.5933 0.5933 0.5933 fsync(1972a4)
106 0.4902 0.01% 0.0046 0.0035 0.0105 klseek(19737c)
13 0.3285 0.00% 0.0253 0.0130 0.0387 semctl(2089e0)
6 0.2513 0.00% 0.0419 0.0238 0.0650 semop(2089c8)
3 0.1223 0.00% 0.0408 0.0127 0.0730 statx(2086d4)
1 0.0793 0.00% 0.0793 0.0793 0.0793 send(11e1ec)
9 0.0679 0.00% 0.0075 0.0053 0.0147 fstatx(2086c8)
4 0.0524 0.00% 0.0131 0.0023 0.0348 kfcntl(22aa14)
5 0.0448 0.00% 0.0090 0.0086 0.0096 yield(11dbec)
3 0.0444 0.00% 0.0148 0.0049 0.0219 recv(11e1b0)
1 0.0355 0.00% 0.0355 0.0355 0.0355 open(208674)
1 0.0281 0.00% 0.0281 0.0281 0.0281 close(19728c)
...(lines omitted)...
Count Total Time % sys Avg Time Min Time Max Time Pthread Routine
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ================
19 0.0477 0.00% 0.0025 0.0017 0.0104 pthread_join
1 0.0065 0.00% 0.0065 0.0065 0.0065 pthread_detach
1 0.6208 0.00% 0.6208 0.6208 0.6208 pthread_kill
6 0.1261 0.00% 0.0210 0.0077 0.0779 pthread_cancel
21 0.7080 0.01% 0.0337 0.0226 0.1222 pthread_create
If the process is an NFS kernel process, the report will include information on NFS operations instead of
System and Pthread calls:
Process Details for Pid: 1007854
Process Name: nfsd
252 Tids for this Pid: 2089213 2085115 2081017 2076919 2072821 2068723
2040037 2035939 2031841 2027743 2023645 2019547
The report generated with the -P flag includes the data shown in the default report and also includes a
detailed report on pthread status.
Count Total Time Avg Time Min Time Max Time SVC (Address)
(msec) (msec) (msec) (msec)
======== =========== ======== ======== ======== ================
1 3.3898 3.3898 3.3898 3.3898 _exit(409e50)
61 0.8138 0.0133 0.0089 0.0254 kread(5ffd78)
11 0.4616 0.0420 0.0262 0.0835 thread_create(407360)
22 0.2570 0.0117 0.0062 0.0373 mprotect(6d5bd8)
12 0.2126 0.0177 0.0100 0.0324 thread_setstate(40a660)
115 0.1875 0.0016 0.0012 0.0037 klseek(5ffe38)
12 0.1061 0.0088 0.0032 0.0134 sbrk(6d4f90)
23 0.0803 0.0035 0.0018 0.0072 trcgent(4078d8)
...(lines omitted)...
Count Total Time % sys Avg Time Min Time Max Time Pthread Routine
(msec) time (msec) (msec) (msec)
======== =========== ====== ======== ======== ======== ================
11 0.9545 0.01% 0.0868 0.0457 0.1833 pthread_create
8 0.0725 0.00% 0.0091 0.0064 0.0205 pthread_join
1 0.0553 0.00% 0.0553 0.0553 0.0553 pthread_detach
1 0.0341 0.00% 0.0341 0.0341 0.0341 pthread_cancel
1 0.0229 0.00% 0.0229 0.0229 0.0229 pthread_kill
The information in the application time details report includes the following:
Item Descriptor
Total Pthread Call Time The amount of time, expressed in milliseconds, that the pthread spent in traced pthread library calls.
Total Pthread Dispatch Time The amount of time, expressed in milliseconds, that the pthread spent in libpthreads dispatch code.
Total Pthread Idle Dispatch Time The amount of time, expressed in milliseconds, that the pthread spent in libpthreads vp_sleep code.
Total Other Time The amount of time, expressed in milliseconds, that the pthread spent in non-traced user mode code.
Total number of pthread dispatches The total number of times a pthread belonging to the process was dispatched by the libpthreads dispatcher.
Total number of pthread idle dispatches The total number of times a thread belonging to the process was in the libpthreads vp_sleep code.
The splat tool is not currently equipped to analyze the behavior of the Virtual Memory Manager (VMM)
and PMAP locks used in the AIX kernel.
splat [-i file] [-n file] [-o file] [-d [bfta]] [-l address] [-c class] [-s [acelmsS]] [-C#] [-S#] [-t start] [-T stop] [-p]
splat -h [topic]
splat -j
Flags
The flags of the splat command are:
Item Descriptor
-i inputfile Specifies the AIX trace log file input.
-n namefile Specifies the file containing output of the gensyms command.
-o outputfile Specifies an output file (default is stdout).
-d detail Specifies the level of detail of the report.
-c class Specifies class of locks to be reported.
-l address Specifies the address for which activity on the lock will be reported.
-s criteria Specifies the sort order of the lock, function, and thread.
-C CPUs Specifies the number of processors on the MP system that the trace was drawn from. The default is 1. This
value is overridden if more processors are observed in the trace.
-S count Specifies the number of items to report on for each section. The default is 10. This gives the number of locks
to report in the Lock Summary and Lock Detail reports, as well as the number of functions to report in the
Function Detail and threads to report in the Thread detail (the -s option specifies how the most significant
locks, threads, and functions are selected).
-t starttime Overrides the start time from the first event recorded in the trace. This flag forces the analysis to begin at an
event that occurs starttime seconds after the first event in the trace.
Parameters
The parameters associated with the splat command are:
Item Descriptor
inputfile The AIX trace log file input. This file can be a merge trace file generated using the trcrpt -r command.
namefile File containing output of the gensyms command.
outputfile File to write reports to.
detail The detail level of the report. It can be one of the following:
basic Lock summary plus lock detail (the default)
function Basic plus function detail
m Miss rate
s Spin count
The procedure for generating these files is shown in the trace section. When you run trace, you will
usually use the flag -J splat to capture the events analyzed by splat (or without the -J flag, to capture all
events). The significant trace hooks are shown in the following table:
The execution interval is the entire time that a workload runs. This interval is arbitrarily long for server
workloads that run continuously. The trace interval is the time actually captured in the trace log file by
trace. The length of this trace interval is limited by how large a trace log file will fit on the file system.
In contrast, the analysis interval is the portion of the trace interval that is analyzed by the splat
command. The -t and -T flags indicate to the splat command to start and finish analysis some number of
seconds after the first event in the trace. By default, the splat command analyzes the entire trace, so this
analysis interval is the same as the trace interval.
Note: As an optimization, the splat command stops reading the trace when it finishes its analysis, so it
indicates that the trace and analysis intervals end at the same time even if they do not.
To most accurately estimate the effect of lock activity on the computation, you will usually want to
capture the longest trace interval that you can, and analyze that entire interval with the splat command.
The -t and -T flags are usually used for debugging purposes to study the behavior of the splat command
across a few events in the trace.
As a rule, either use large buffers when collecting a trace, or limit the captured events to the ones you
need to run the splat command.
Trace discontinuities
The splat command uses the events in the trace to reconstruct the activities of threads and locks in the
original system.
If part of the trace is missing, it is because one of the following situations exists:
v Tracing was stopped at one point and restarted at a later point.
v One processor fills its trace buffer and stops tracing, while other processors continue tracing.
v Event records in the trace buffer were overwritten before they could be copied into the trace log file.
Some versions of the AIX kernel or PThread library might be incompletely instrumented, so the traces
will be missing events. The splat command might not provide correct results in this case.
Data addresses are used to identify locks; instruction addresses are used to identify the point of
execution. These addresses are captured in the event records in the trace, and used by the splat command
to identify the locks and the functions that operate on them.
However, these addresses are not of much use to the programmer, who would rather know the names of
the lock and function declarations so that they can be located in the program source files. The conversion
of names to addresses is determined by the compiler and loader, and can be captured in a file using the
gensyms command. The gensyms command also captures the contents of the /usr/include/sys/lockname.h
file, which declares classes of kernel locks.
The gensyms output file is passed to the splat command with the -n flag. When splat reports on a kernel
lock, it provides the best identification that it can.
Kernel locks that are declared are resolved by name. Locks that are created dynamically are identified by
class if their class name is given when they are created. The libpthreads.a instrumentation is not
equipped to capture names or classes of PThread synchronizers, so they are always identified by address
only.
Execution summary
The execution summary report is generated by default when you use the splat command.
start stop
-------------------- --------------------
trace interval (absolute tics) 967436752 969072535
(relative tics) 0 1635783
(absolute secs) 58.057947 58.156114
(relative secs) 0.000000 0.098167
analysis interval (absolute tics) 967436752 969072535
(trace-relative tics) 0 1635783
(self-relative tics) 0 1635783
(absolute secs) 58.057947 58.156114
(trace-relative secs) 0.000000 0.098167
(self-relative secs) 0.000000 0.098167
**************************************************************************************
From the example above, you can see that the execution summary consists of the following elements:
v The splat version and build information, disclaimer, and copyright notice.
v The command used to run splat.
v The trace command used to collect the trace.
v The host on which the trace was taken.
v The date that the trace was taken.
v A sentence specifying whether the PURR register was used to calculate CPU times.
v The real-time duration of the trace, expressed in seconds.
v The maximum number of processors that were observed in the trace (the number specified in the trace
conditions information, and the number specified on the splat command line).
v The cumulative processor time, equal to the duration of the trace in seconds times the number of
processors that represents the total number of seconds of processor time consumed.
v A table containing the start and stop times of the trace interval, measured in tics and seconds, both as
absolute timestamps from the trace records and relative to the first event in the trace.
v The start and stop times of the analysis interval, measured in tics and seconds, as absolute timestamps,
as well as relative to the beginning of the trace interval and the beginning of the analysis interval.
The following example shows a sample of the gross lock summary report.
***************************************************************************************
Unique Acquisitions Acq. or Passes Total System
Total Addresses (or Passes) per Second Spin Time
--------- --------- ------------ -------------- ------------
AIX (all) Locks: 523 523 1323045 72175.7768 0.003986
RunQ: 2 2 487178 26576.9121 0.000000
Simple: 480 480 824898 45000.4754 0.003986
Transformed: 22 18 234 352.3452
Krlock: 50 21 76876 32.6548 0.000458
Complex: 41 41 10969 598.3894 0.000000
PThread CondVar: 7 6 160623 8762.4305 0.000000
Mutex: 128 116 1927771 105165.2585 10.280745 *
RWLock: 0 0 0 0.0000 0.000000
The gross lock summary report table consists of the following columns:
Per-lock summary
The per-lock summary report is generated by default when you use the splat command.
T Acqui- Wait
y sitions or Locks or Percent Holdtime
Lock Names, p or Trans- Passes Real Real Comb
Class, or Address e Passes Spins form %Miss %Total / CSec CPU Elapse Spin
********************** * ****** ***** **** ***** ****** ********* ******* ****** *******
PROC_INT_CLASS.0003 Q 486490 0 0 0.0000 36.7705 26539.380 5.3532 100.000 0.0000
THREAD_LOCK_CLASS.0012 S 323277 0 9468 0.0000 24.4343 17635.658 6.8216 6.8216 0.0000
THREAD_LOCK_CLASS.0118 D 323094 0 4568 0.0000 24.4205 17625.674 6.7887 6.7887 0.0000
ELIST_CLASS.003C S 80453 0 201 0.0000 6.0809 4388.934 1.0564 1.0564 0.0000
ELIST_CLASS.0044 S 80419 0 110 0.0000 6.0783 4387.080 1.1299 1.1299 0.0000
tod_lock C 10229 0 0 0.0000 0.7731 558.020 0.2212 0.2212 0.0000
LDATA_CONTROL_LOCK.0000 D 1833 0 10 0.0000 0.1385 99.995 0.0204 0.0204 0.0000
U_TIMER_CLASS.0014 S 1514 0 23 0.0000 0.1144 82.593 0.0536 0.0536 0.0000
The first line indicates the maximum number of locks to report (100 in this case, but we show only 14 of
the entries here) as specified by the -S 100 flag. The report also indicates that the entries are sorted by the
total number of acquisitions or passes, as specified by the -sa flag. The various Kernel locks and PThread
synchronizers are distinguished by a one-letter type code; for example:
V A PThread condition-variable
The RunQ lock is a special case of the simple lock, although its pattern of usage will differ markedly
from other lock types. The splat command distinguishes it from the other simple locks to ease its
analysis.
In an AIX SIMPLE Lock report, the first line starts with either [AIX SIMPLE Lock] or [AIX RunQ lock].
Acqui- Miss Spin Transf. Busy Percent Held of Total Time Process
ThreadID sitions Rate Count Count Count CPU Elapse Spin Transf. ProcessID Name
~~~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~~ ~~~~~~~~~~~~~
775 11548 0.34 39 0 0 0.06 0.10 0.00 0.00 774 wait
35619 3 25.00 1 0 0 0.00 0.00 0.00 0.00 18392 sleep
31339 21 4.55 1 0 0 0.00 0.00 0.00 0.00 7364 java
35621 2 0.00 0 0 0 0.00 0.00 0.00 0.00 18394 locktrace
Elapsed The total number of elapsed seconds that the lock was held by any thread, whether
running or suspended.
Real Wait
The percentage of elapsed real time that any thread was waiting to acquire this lock. If
two or more threads are waiting simultaneously, this wait time will only be charged once.
To determine how many threads were waiting simultaneously, look at the WaitQ Depth
statistics.
Total Acquisitions The number of times that the lock was acquired in the analysis interval. This includes successful
simple_lock_try calls.
Acq. holding krlock The number of acquisitions made by threads holding a Krlock.
Transform count The number of Krlocks that have been used (allocated and freed) by the simple lock.
SpinQ The minimum, maximum, and average number of threads spinning on the lock, whether executing
or suspended, across the analysis interval.
Krlocks SpinQ The minimum, maximum, and average number of threads spinning on a Krlock allocated by the
simple lock, across the analysis interval.
PROD The associated Krlocks prod calls count.
CONFER SELF The confer to self calls count for the simple lock and the associated Krlocks.
CONFER TARGET The confer to target calls count for the simple lock and the associated Krlocks.
CONFER ALL The confer to all calls count for the simple lock and the associated Krlocks.
HANDOFF The associated Krlocks handoff calls count.
The Lock Activity with Interrupts Enabled (milliseconds) and Lock Activity with Interrupts Disabled
(milliseconds) sections contain information on the time that each lock state is used by the locks.
The states that a thread can be in (with respect to a given simple or complex lock) are as follows:
Item Descriptor
(no lock reference) The thread is running, does not hold this lock, and is not attempting to acquire this lock.
LOCK The thread has successfully acquired the lock and is currently executing.
LOCK with KRLOCK The thread has successfully acquired the lock, while holding the associated Krlock, and is currently
executing.
SPIN The thread is executing and unsuccessfully attempting to acquire the lock.
KRLOCK LOCK The thread has successfully acquired the associated Krlock and is currently executing.
KRLOCK SPIN The thread is executing and unsuccessfully attempting to acquire the associated Krlock.
TRANSFORM The thread has successfully allocated a Krlock that it associates itself to and is executing.
The Lock Activity sections of the report measure the intervals of time (in milliseconds) that each thread
spends in each of the states for this lock. The columns report the number of times that a thread entered
the given state, followed by the maximum, minimum, and average time that a thread spent in the state
once entered, followed by the total time that all threads spent in that state. These sections distinguish
whether interrupts were enabled or disabled at the time that the thread was in the given state.
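The same style of per-state accounting can be sketched in a few lines of Python (a toy analogue only; splat derives these statistics from trace events, not from instrumented locks, and the names below are invented). The wrapper counts how many times a thread entered the wait and hold states for a lock and the total time spent in each:

```python
import threading
import time

class TimedLock:
    """Toy analogue of splat's lock-activity accounting: counts entries
    into the wait and hold states and accumulates the time spent in each."""
    def __init__(self):
        self._lock = threading.Lock()
        self.stats = {"wait": [0, 0.0], "hold": [0, 0.0]}  # [entries, total secs]
        self._acquired_at = 0.0

    def acquire(self):
        t0 = time.perf_counter()
        self._lock.acquire()                      # may spin or sleep here
        self.stats["wait"][0] += 1
        self.stats["wait"][1] += time.perf_counter() - t0
        self._acquired_at = time.perf_counter()   # only the holder writes this

    def release(self):
        self.stats["hold"][0] += 1
        self.stats["hold"][1] += time.perf_counter() - self._acquired_at
        self._lock.release()

lock = TimedLock()

def worker():
    for _ in range(3):
        lock.acquire()
        time.sleep(0.01)   # hold the lock briefly
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(lock.stats)
```

As in the splat report, each state carries an entry count and a cumulative time; minimum, maximum, and average per entry could be derived the same way.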
A thread can acquire a lock prior to the beginning of the analysis interval and release the lock during the
analysis interval. When the splat command observes the lock being released, it recognizes that the lock
had been held during the analysis interval up to that point and counts the time as part of the
state-machine statistics. For this reason, the state-machine statistics might report that a lock state was
entered more times than the lock was observed to be acquired in the analysis interval.
The Lock Activity sections of the report measure the intervals of time (in milliseconds) that each thread
spends in each of the states for this lock. The columns report the number of times that a thread entered
the given state, followed by the maximum, minimum, and average time that a thread spent in the state
once entered, followed by the total time that all threads spent in that state.
These sections of the report distinguish whether interrupts were enabled or disabled at the time that the
thread was in the given state.
Acqui- Miss Spin Wait Busy Percent Held of Total Time Process
ThreadID sitions Rate Count Count Count CPU Elapse Spin Wait ProcessID Name
~~~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~~~~ ~~~~~~~~~~~~~
775 11548 0.34 39 0 0 0.06 0.10 0.00 0.00 774 wait
35619 3 25.00 1 0 0 0.00 0.00 0.00 0.00 18392 sleep
31339 21 4.55 1 0 0 0.00 0.00 0.00 0.00 7364 java
35621 2 0.00 0 0 0 0.00 0.00 0.00 0.00 18394 locktrace
Elapsed The total number of elapsed seconds that the lock was held by any thread, whether
running or suspended.
Percent Held This field contains the following sub-fields:
Real CPU
The percentage of the cumulative processor time that the lock was held by an executing
thread.
Real Elapsed
The percentage of the elapsed real time that the lock was held by any thread at all, either
running or suspended.
Comb(ined) Spin
The percentage of the cumulative processor time that running threads spent spinning
while trying to acquire this lock.
Real Wait
The percentage of elapsed real time that any thread was waiting to acquire this lock. If
two or more threads are waiting simultaneously, this wait time will only be charged once.
To determine how many threads were waiting simultaneously, look at the WaitQ Depth
statistics.
SpinQ The minimum, maximum, and average number of threads spinning on the lock, whether executing
or suspended, across the analysis interval.
WaitQ The minimum, maximum, and average number of threads waiting on the lock, across the analysis
interval.
The Lock Activity with Interrupts Enabled (milliseconds) and Lock Activity with Interrupts Disabled
(milliseconds) sections contain information on the time that each lock state is used by the locks.
The states that a thread can be in (with respect to a given simple or complex lock) are as follows:
Item Descriptor
(no lock reference) The thread is running, does not hold this lock, and is not attempting to acquire this lock.
LOCK The thread has successfully acquired the lock and is currently executing.
SPIN The thread is executing and unsuccessfully attempting to acquire the lock.
UNDISP The thread has become undispatched while unsuccessfully attempting to acquire the lock.
WAIT The thread has been suspended until the lock comes available. It does not necessarily acquire the lock at
that time, but instead returns to a SPIN state.
PREEMPT The thread is holding this lock and has become undispatched.
A thread can acquire a lock prior to the beginning of the analysis interval and release the lock during the
analysis interval. When the splat command observes the lock being released, it recognizes that the lock
had been held during the analysis interval up to that point and counts the time as part of the
state-machine statistics. For this reason, the state-machine statistics can report that a lock state was
entered more times than the lock was observed to be acquired in the analysis interval.
RunQ locks are used to protect resources in the thread management logic. These locks are acquired a
large number of times and are only held briefly each time. A thread need not be executing to acquire or release a RunQ lock.
Function detail:
The function detail report is obtained by using the -df or -da options of splat.
Elapse(d)
The percentage of the elapsed real time that the lock was held by any thread at all,
whether running or suspended, that had acquired the lock through a call to this function.
Spin The percentage of cumulative processor time that executing threads spent spinning on the
lock while trying to acquire the lock through a call to this function.
Wait The percentage of elapsed real time that executing threads spent waiting for the lock while
trying to acquire the lock through a call to this function.
Return Address The return address to this calling function, in hexadecimal.
Start Address The start address to this calling function, in hexadecimal.
Offset The offset from the function start address to the return address, in hexadecimal.
The functions are ordered by the same sorting criterion as the locks, controlled by the -s option of splat.
Further, the number of functions listed is controlled by the -S parameter. The default is the top ten
functions.
Thread Detail:
The Thread Detail report is obtained by using the -dt or -da options of splat.
At any point in time, a single thread is either running or it is not. When a single thread runs, it only runs
on one processor. Some of the composite statistics are measured relative to the cumulative processor time
when they measure activities that can happen simultaneously on more than one processor, and the
magnitude of the measurements can be proportional to the number of processors in the system. In
contrast, the thread statistics are generally measured relative to the elapsed real time, which is the
amount of time that a single processor spends processing and the amount of time that a single thread
spends in an executing or suspended state.
Elapse(d)
The percentage of the elapsed real time that this thread held the lock while running or
suspended.
Spin The percentage of elapsed real time that this thread executed while spinning on the lock.
Wait The percentage of elapsed real time that this thread spent waiting on the lock.
Process ID The Process identifier (only for simple and complex lock report).
Process Name Name of the process using the lock (only for simple and complex lock report).
Complex-Lock report:
The AIX complex lock supports recursive locking, where a thread can acquire the lock more than once
before releasing it, and it differentiates write-locking, which is exclusive, from read-locking, which is not
exclusive.
This report begins with [AIX COMPLEX Lock]. Most of the entries are identical to the simple lock report,
while some of them are differentiated by read/write/upgrade. For example, the SpinQ and WaitQ
statistics include the minimum, maximum, and average number of threads spinning or waiting on the
lock. They also include the minimum, maximum, and average number of threads attempting to acquire
the lock for reading versus writing. Because an arbitrary number of threads can hold the lock for reading,
the report includes the minimum, maximum, and average number of readers in the LockQ that holds the
lock.
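The reader-writer semantics described above can be sketched with a minimal lock built on a condition variable. This is an illustrative Python sketch of the general technique, not the AIX complex-lock implementation: any number of readers may hold the lock at once, while a writer must wait for exclusive access.

```python
import threading

class RWLock:
    """Minimal reader-writer lock: concurrent readers, exclusive writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # threads currently holding a read-lock
        self._writer = False     # True while a thread holds the write-lock

    def acquire_read(self):
        with self._cond:
            while self._writer:           # readers only wait for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:        # last reader may unblock a writer
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:  # writers need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = RWLock()
results = []

def reader(i):
    lock.acquire_read()
    results.append(("read", i))
    lock.release_read()

def writer():
    lock.acquire_write()
    results.append(("write", 0))
    lock.release_write()

threads = [threading.Thread(target=reader, args=(i,)) for i in range(3)]
threads.append(threading.Thread(target=writer))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))
```

In these terms, an upgrade corresponds to waiting in acquire_write until the remaining readers drain, while a downgrade requires no waiting at all.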
A thread might hold a lock for writing; this is exclusive and prevents any other thread from securing the
lock for reading or for writing. The thread downgrades the lock by simultaneously releasing it for writing
and acquiring it for reading; this permits other threads to also acquire the lock for reading. The reverse of
this operation is an upgrade; if the thread holds the lock for reading and no other thread holds it as well,
the thread simultaneously releases the lock for reading and acquires it for writing. The upgrade operation
might require that the thread wait until other threads release their read-locks. The downgrade operation
does not.
A thread might acquire the lock to some recursive depth; it must release the lock the same number of
times to free it. This is useful in library code where a lock must be secured at each entry-point to the
library; a thread will secure the lock once as it enters the library, and internal calls to the library
entry-points simply re-secure the lock, and release it when returning from the call. The minimum,
maximum, and average recursion depths of any thread holding this lock are reported in the table.
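The entry-point pattern described above can be shown with a recursive mutex; Python's threading.RLock stands in here for a recursive pthread mutex, and the function names are invented for illustration:

```python
import threading

_lib_lock = threading.RLock()   # recursive: the holding thread may re-acquire

def _internal_helper(x):
    # Internal entry point: re-secures the lock (recursion depth 2 when
    # reached through public_entry) and releases it on return.
    with _lib_lock:
        return x * 2

def public_entry(x):
    # Public entry point: secures the lock once on entry (depth 1).
    # The lock is fully released only when this outermost holder exits.
    with _lib_lock:
        return _internal_helper(x) + 1

print(public_entry(10))   # a non-recursive lock would deadlock here
```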
A thread holding a recursive write-lock is not permitted to downgrade it because the downgrade is
intended to apply to only the last write-acquisition of the lock, and the prior acquisitions had a real
reason to keep the acquisition exclusive. Instead, the lock is marked as being in the downgraded state,
which is erased when this latest acquisition is released or upgraded. A thread holding a recursive
read-lock can only upgrade the latest acquisition of the lock, in which case the lock is marked as being
upgraded. The thread will have to wait until the lock is released by any other threads holding it for
reading. The minimum, maximum, and average recursion-depths of any thread holding this lock in an
upgraded or downgraded state are reported in the table.
No time is reported to perform a downgrade because this is performed without any contention. The
upgrade state is only reported for the case where a recursive read-lock is upgraded. Otherwise, the
thread activity is measured as releasing a read-lock and acquiring a write-lock.
The function and thread details also break down the acquisition, spin, and wait counts by whether the
lock is to be acquired for reading or writing.
The mutex and read/write lock are related to the AIX complex lock. You can view the similarities in the
lock detail reports. The condition-variable differs significantly from a lock, and this is reflected in the
report details.
The PThread library instrumentation does not provide names or classes of synchronizers, so the
addresses are the only way we have to identify them. Under certain conditions, the instrumentation can
capture the return addresses of the function call stack, and these addresses are used with the gensyms
output to identify the call chains when these synchronizers are created. The creation and deletion times of
the synchronizer can sometimes be determined as well, along with the ID of the PThread that created
them.
Mutex reports:
The PThread mutex is similar to an AIX simple lock in that only one thread can acquire the lock, and is
like an AIX complex lock in that it can be held recursively.
[PThread MUTEX] ADDRESS: 00000000F0154CD0
Parent Thread: 0000000000000001 creation time: 26.232305
Pid: 18396 Process Name: trcstop
Creation call-chain ==================================================================
00000000D268606C .pthread_mutex_lock
00000000D268EB88 .pthread_once
00000000D01FE588 .__libs_init
00000000D01EB2FC ._libc_inline_callbacks
00000000D01EB280 ._libc_declare_data_functions
00000000D269F960 ._pth_init_libc
00000000D268A2B4 .pthread_init
00000000D01EAC08 .__modinit
000000001000014C .__start
======================================================================================
| | | Percent Held ( 26.235284s )
Acqui- | Miss Spin Wait Busy | Secs Held | Real Real Comb Real
sitions | Rate Count Count Count |CPU Elapsed | CPU Elapsed Spin Wait
1 | 0.000 0 0 0 |0.000006 0.000006 | 0.00 0.00 0.00 0.00
--------------------------------------------------------------------------------------
Depth Min Max Avg
SpinQ 0 0 0
WaitQ 0 0 0
Recursion 0 1 0
In addition to the common header information and the [PThread MUTEX] identifier, this report lists the
following lock details:
Elapse(d)
The total number of elapsed seconds that the lock was held, whether the thread was
running or suspended.
Percent Held This field contains the following sub-fields:
Real CPU
The percentage of the cumulative processor time that the lock was held by an executing
thread.
Real Elapsed
The percentage of the elapsed real time that the lock was held by any thread, either
running or suspended.
Comb(ined) Spin
The percentage of the cumulative processor time that running threads spent spinning
while trying to acquire this lock.
Real Wait
The percentage of elapsed real time that any thread was waiting to acquire this lock. If two
or more threads are waiting simultaneously, this wait time will only be charged once. To
learn how many threads were waiting simultaneously, look at the WaitQ Depth statistics.
Depth This field contains the following sub-fields:
SpinQ The minimum, maximum, and average number of threads spinning on the lock, whether
executing or suspended, across the analysis interval.
WaitQ The minimum, maximum, and average number of threads waiting on the lock, across the
analysis interval.
Recursion
The minimum, maximum, and average recursion depth to which each thread held the lock.
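The rule that simultaneous waits are charged only once can be illustrated by merging overlapping per-thread wait intervals into a union before summing. This is a sketch of the idea, not splat's actual algorithm; the names are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open wait interval [start, end) in seconds, sorted by start. */
typedef struct { double start, end; } interval_t;

/* Total length of the union of the intervals: overlapping waits by
 * different threads contribute their overlap only once, matching the
 * "charged once" rule for Real Wait. */
static double union_length(const interval_t *iv, size_t n) {
    double total = 0.0, cur_start, cur_end;
    size_t i;
    if (n == 0)
        return 0.0;
    cur_start = iv[0].start;
    cur_end = iv[0].end;
    for (i = 1; i < n; i++) {
        if (iv[i].start <= cur_end) {   /* overlap: extend current run */
            if (iv[i].end > cur_end)
                cur_end = iv[i].end;
        } else {                        /* gap: flush and restart */
            total += cur_end - cur_start;
            cur_start = iv[i].start;
            cur_end = iv[i].end;
        }
    }
    return total + (cur_end - cur_start);
}
```

Two threads waiting over [0, 2) and [1, 3) are charged 3 seconds of Real Wait, not 4; the WaitQ depth statistics are what reveal that two threads overlapped.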
If the -dt or -da options are used, the splat command reports the following pthread details.
Item Descriptor
PThreadID The PThread identifier.
Acquisitions The number of times that this pthread acquired the mutex.
Miss Rate The percentage of acquisition attempts by the pthread that failed to secure the mutex.
Spin Count The number of unsuccessful attempts by this pthread to secure the mutex.
Wait Count The number of times that this pthread was forced to wait until the mutex came available.
Busy Count The number of trylock calls that returned busy.
Elapse(d)
The percentage of the elapsed real time that this pthread held the mutex while running or
suspended.
Spin The percentage of elapsed real time that this pthread executed while spinning on the
mutex.
Wait The percentage of elapsed real time that this pthread spent waiting on the mutex.
If the -df or -da options are used, the splat command reports the function details.
The splat command reports the following function details:
Item Descriptor
PThreadID The PThread identifier.
Acquisitions The number of times that this function acquired the mutex.
Miss Rate The percentage of acquisition attempts by the function that failed to secure the mutex.
Spin Count The number of unsuccessful attempts by this function to secure the mutex.
Wait Count The number of times that this function was forced to wait until the mutex came available.
Busy Count The number of trylock calls that returned busy.
Percent Held of Total Time This field contains the following sub-fields:
CPU The percentage of the elapsed real time that this function executed while holding the
mutex.
Elapse(d)
The percentage of the elapsed real time that this function held the mutex while running
or suspended.
Spin The percentage of elapsed real time that this function executed while spinning on the
mutex.
Wait The percentage of elapsed real time that this function spent waiting for the mutex.
Return Address The return address to this calling function, in hexadecimal.
Start Address The start address to this calling function, in hexadecimal.
Offset The offset from the function start address to the return address, in hexadecimal.
Read/write lock reports:
The PThread read/write lock is similar to an AIX complex lock in that it can be acquired for reading or
writing.
Writing is exclusive in that a single thread can only acquire the lock for writing, and no other thread can
hold the lock for reading or writing at that point. Reading is not exclusive, so more than one thread can
hold the lock for reading. Reading is recursive in that a single thread can hold multiple read-acquisitions
on the lock. Writing is not recursive.
[PThread RWLock] ADDRESS: 000000002FF228E0
Parent Thread: 0000000000000001 creation time: 5.236585 deletion time: 6.090511
Pid: 7362 Process Name: /home/testrwlock
Creation call-chain ==================================================================
0000000010000458 .main
00000000100001DC .__start
=============================================================================
| | | Percent Held ( 26.235284s )
Acqui- | Miss Spin Wait | Secs Held | Real Real Comb Real
sitions | Rate Count Count |CPU Elapsed | CPU Elapsed Spin Wait
1150 |40.568 785 0 |21.037942 12.0346 |80.19 99.22 30.45 46.29
--------------------------------------------------------------------------------------
Readers Writers Total
Depth Min Max Avg Min Max Avg Min Max Avg
LockQ 0 2 0 0 1 0 0 2 0
Acquisitions Miss Spin Count Wait Count Busy Percent Held of Total Time
PthreadID Write Read Rate Write Read Write Read Count CPU Elapse Spin Wait
~~~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~
772 0 207 78.70 0 765 0 796 0 11.58 15.13 29.69 23.21
515 765 0 1.80 14 0 14 0 0 80.10 80.19 49.76 23.08
258 0 178 3.26 0 6 0 5 0 12.56 17.10 10.00 20.02
Acquisitions Miss Spin Count Wait Count Busy Percent Held of Total Time
Function Name Write Read Rate Write Read Write Read Count CPU Elapse Spin Wait Return Address Start Address Offset
^^^^^^^^^^^^^^^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
._pthread_body 765 385 40.57 14 771 0 0 0 1.55 3.10 1.63 0.00 00000000D268944C 00000000D2684180 000052CC
In addition to the common header information and the [PThread RWLock] identifier, this report lists the
following lock details:
Item Descriptor
Parent Thread Pthread id of the parent pthread.
creation time Elapsed time in seconds after the first event recorded in trace (if available).
deletion time Elapsed time in seconds after the first event recorded in trace (if available).
PID Process identifier.
Process Name Name of the process using the lock.
Call-chain Stack of called methods (if available).
Acquisitions The number of times that the lock was acquired in the analysis interval.
Miss Rate The percentage of attempts that failed to acquire the lock.
Spin Count The number of unsuccessful attempts to acquire the lock.
Wait Count The current PThread implementation does not force pthreads to wait for read/write locks. This reports
the number of times a thread, spinning on this lock, is undispatched.
Seconds Held This field contains the following sub-fields:
CPU The total number of processor seconds that the lock was held by an executing pthread. If the
lock is held multiple times by the same pthread, only one hold interval is counted.
Elapse(d)
The total number of elapsed seconds that the lock was held by any pthread, whether the
pthread was running or suspended.
Percent Held This field contains the following sub-fields:
Real CPU
The percentage of the cumulative processor time that the lock was held by any executing
pthread.
Real Elapsed
The percentage of the elapsed real time that the lock was held by any pthread, either
running or suspended.
Comb(ined) Spin
The percentage of the cumulative processor time that running pthreads spent spinning while
trying to acquire this lock.
Real Wait
The percentage of elapsed real time that any pthread was waiting to acquire this lock. If two
or more threads are waiting simultaneously, this wait time will only be charged once. To
learn how many pthreads were waiting simultaneously, look at the WaitQ Depth statistics.
Depth This field contains the following sub-fields:
LockQ The minimum, maximum, and average number of pthreads holding the lock, whether
executing or suspended, across the analysis interval. This is broken down by
read-acquisitions, write-acquisitions, and total acquisitions.
SpinQ The minimum, maximum, and average number of pthreads spinning on the lock, whether
executing or suspended, across the analysis interval. This is broken down by
read-acquisitions, write-acquisitions, and total acquisitions.
WaitQ The minimum, maximum, and average number of pthreads in a timed-wait state for the
lock, across the analysis interval. This is broken down by read-acquisitions,
write-acquisitions, and total acquisitions.
Condition-Variable report:
The PThread condition-variable is a synchronizer, but not a lock. A PThread is suspended until a signal
indicates that the condition now holds.
[PThread CondVar] ADDRESS: 0000000020000A18
Parent Thread: 0000000000000001 creation time: 0.216301
Pid: 7360 Process Name: /home/splat/test/condition
Creation call-chain ========================================================
00000000D26A0EE8 .pthread_cond_timedwait
0000000010000510 .main
00000000100001DC .__start
=========================================================================
| | Spin / Wait Time ( 26.235284s )
| Fail Spin Wait | Comb Comb
Passes | Rate Count Count | Spin Wait
1 |50.000 1 0 | 26.02 0.00
-------------------------------------------------------------------------
Depth Min Max Avg
SpinQ 0 1 1
WaitQ 0 0 0
Fail Spin Wait % Total Time
PThreadID Passes Rate Count Count Spin Wait
~~~~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~
1 1 50.0000 1 0 99.1755 0.0000
In addition to the common header information and the [PThread CondVar] identifier, this report lists the
following details:
Item Descriptor
Passes The number of times that the condition was signaled to hold during the analysis interval.
Fail Rate The percentage of times that the condition was tested and was not found to be true.
Spin Count The number of times that the condition was tested and was not found to be true.
Wait Count The number of times that a pthread was forced into a suspended wait state waiting for the condition
to be signaled.
Spin / Wait Time This field contains the following sub-fields:
Comb Spin
The total number of processor seconds that pthreads spun while waiting for the condition.
Comb Wait
The total number of elapsed seconds that pthreads spent in a wait state for the condition.
Depth This field contains the following sub-fields:
SpinQ The minimum, maximum, and average number of pthreads spinning while waiting for the
condition, across the analysis interval.
WaitQ The minimum, maximum, and average number of pthreads waiting for the condition,
across the analysis interval.
If the -dt or -da options are used, the splat command reports the following pthread details.
Wait The percentage of elapsed real time that this pthread spent waiting for the condition to
hold.
If the -df or -da options are used, the splat command reports the following function details.
Item Descriptor
Function Name The name of the function that passed or attempted to pass this condition.
Passes The number of times that this function was notified that the condition passed.
Fail Rate The percentage of times that the function checked the condition and did not find it to be true.
Spin Count The number of times that the function checked the condition and did not find it to be true.
Wait Count The number of times that this function was forced to wait until the condition became true.
Percent Total Time This field contains the following sub-fields:
Spin The percentage of elapsed real time that this function spun while testing the condition.
Wait The percentage of elapsed real time that this function spent waiting for the condition to
hold.
Return Address The return address to this calling function, in hexadecimal.
Start Address The start address to this calling function, in hexadecimal.
Offset The offset from the function start address to the return address, in hexadecimal.
Note: The APIs and the events available on each of the supported processors have been completely
separated by design. The events available, their descriptions, and their current testing status (which are
different on each processor) are in separately installable tables, and are not described here because none
of the API calls depend on the availability or status of any of the events.
The status of an event, as returned by the pm_initialize API initialization routine, can be verified,
unverified, caveat, broken, group-only, thresholdable, or shared (see “Performance monitor accuracy” about
testing status and event accuracy).
An event filter (which is any combination of the status bits) must be passed to the pm_initialize routine
to force the return of events with status matching the filter. If no filter is passed to the pm_initialize
routine, no events will be returned.
Events marked unverified have undefined accuracy. Use caution with unverified events. The Performance
Monitor API is essentially providing a service to read hardware registers that might not have any
meaningful content.
Users can experiment with unverified event counters and determine for themselves if they can be used for
specific tuning situations.
These contexts are an extension to the regular processor and thread contexts and include one 64-bit
counter per hardware counter and a set of control words. The control words define which events are
counted and when counting is on or off.
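As a rough illustration, such a context could be laid out as follows. This is an assumed sketch, not the actual kernel structure; the counter count and field names are invented, and the number of hardware counters varies by processor.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_HW_COUNTERS 8   /* assumed; the real count is per-processor */

/* Illustrative layout of a Performance Monitor context as described
 * above: one 64-bit accumulation counter per hardware counter, plus
 * control words selecting the counted events and gating counting. */
typedef struct {
    uint64_t counters[MAX_HW_COUNTERS];     /* one per hardware counter */
    uint32_t event_select[MAX_HW_COUNTERS]; /* which event each counts  */
    uint32_t counting_on;                   /* counting on or off       */
} pm_context_sketch_t;
```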
Thread context
Optional Performance Monitor contexts can also be associated with each thread. The AIX operating
system and the Performance Monitor kernel extension automatically maintain sets of 64-bit counters for
each of these contexts.
A thread group is defined as all the threads created by a common ancestor thread. By definition, all the
threads in a thread group count the same set of events, and, with one exception described below, the
group must be created before any of the descendant threads are created. This restriction is due to the fact
that, after descendant threads are created, it is impossible to determine a list of threads with a common
ancestor.
A counting state is associated with each group. When the group is created, its counting state is inherited
from the initial thread in the group. For thread members of a group, the effective counting state is the
result of AND-ing their own counting state with the group counting state. This provides a way to
effectively control the counting state for all threads in a group. Simply manipulating the group-counting
state will affect the effective counting state of all the threads in the group. Threads inherit their complete
Performance Monitor state from their parents when the thread is created. A thread Performance Monitor
context data (the value of the 64-bit counters) is not inherited, that is, newly created threads start with
counters set to zero.
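The AND-ing rule above can be sketched as follows; the names are illustrative, not pmapi types.

```c
#include <assert.h>

/* A group member's effective counting state is its own counting state
 * AND-ed with the group state, so clearing the group state stops
 * counting for every member at once. */
typedef struct {
    int thread_counting;   /* this thread's own counting state (0 or 1) */
    int group_counting;    /* the group's counting state (0 or 1) */
} count_state_t;

static int effective_counting(const count_state_t *s) {
    return s->thread_counting & s->group_counting;
}
```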
The following are the main components of the performance monitoring agent:
xmtopas
The data-supplier daemon, which permits a system where this daemon runs to supply
performance statistics to data-consumer programs on the local or remote hosts. This daemon also
provides the interface to SNMP.
POWERCOMPAT events
The POWERCOMPAT events provide a list of hardware events that are available for processor
compatibility modes and are used as a subset of the actual processor events.
You can use the processor compatibility modes to move logical partitions between systems that have
different processor types without upgrading the operating system environments in the logical partition.
The processor compatibility mode allows the destination system to provide the logical partition with a
subset of processor capabilities that are supported by the operating systems environment in the logical
partition.
The following hardware events are supported in the POWERCOMPAT compatibility mode for different
versions of the AIX operating system.
Table 1. POWERCOMPAT events
Counter   Event name          Supported AIX version
1         PM_1PLUS_PPC_CMPL   AIX 6.1 with 6100-04, or later; AIX 7.1, or later
1         PM_CYC              AIX 6.1 with 6100-04, or later; AIX 7.1, or later
1         PM_DATA_FROM_L1.5   AIX 6 with 6100-07, or earlier; AIX 7 with 7100-01, or earlier
1         PM_FLOP             AIX 6.1 with 6100-04, or later; AIX 7.1, or later
1         PM_GCT_NOSLOT_CYC   AIX 6.1 with 6100-04, or later; AIX 7.1, or later
1         PM_IERAT_MISS       AIX 6.1 with 6100-04, or later; AIX 7.1, or later
1         PM_INST_CMPL        AIX 6 with 6100-07, or earlier; AIX 7 with 7100-01, or earlier
Similarly, when a thread stops counting or reads its Performance Monitor data, its 64-bit accumulation
counters are also updated by adding the current value of the Performance Monitor hardware counters to
them. Again, if the thread is a member of a group, the group accumulation counters are also updated,
regardless of whether the counter read or stop was for the thread or for the thread group.
The group-level accumulation data is kept consistent with the individual Performance Monitor data for
the thread members of the group, whenever possible. When a thread voluntarily leaves a group, that is,
deletes its Performance Monitor context, its accumulated data is automatically subtracted from the
group-level accumulated data. Similarly, when a thread member in a group resets its own data, the data
in question is subtracted from the group level accumulated data. When a thread dies, no action is taken
on the group-accumulated data.
The only situation where the group-level accumulation is not consistent with the sum of the data for each
of its members is when the group-level accumulated data has been reset, and the group has more than
one member. This situation is detected and marked by a bit returned when the group data is read.
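A minimal sketch of these accumulation rules follows, with made-up names rather than the kernel's actual data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the rules described above: stopping or reading
 * adds the hardware counter value to both the thread and the group
 * accumulators; a thread that voluntarily leaves the group subtracts
 * its accumulated data so the group total stays consistent. */
typedef struct {
    uint64_t thread_acc;   /* this thread's accumulated count */
    uint64_t group_acc;    /* the group-level accumulated count */
} pm_acc_t;

static void on_stop_or_read(pm_acc_t *a, uint64_t hw_counter) {
    a->thread_acc += hw_counter;
    a->group_acc += hw_counter;   /* updated for thread or group reads */
}

static void on_leave_group(pm_acc_t *a) {
    a->group_acc -= a->thread_acc;   /* thread's data subtracted */
}
```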
Security considerations
The system-level API calls are available only to the root user, except when the process tree option is
used. In that case, a locking mechanism prevents calls being made from more than one process. This
mechanism ensures ownership of the API and exclusive access by one process from the time that the
system-level contexts are created until they are deleted.
Enabling the process tree option results in counting for only the calling process and its descendants; the
default is to count all activities on each processor.
Because the system-level APIs would report bogus data if thread contexts were in use, system-level API
calls are not enabled at the same time as thread-level API calls. The allocation of the first thread context
will take the system-level API lock, which will not be released until the last context has been deallocated.
When using first party calls, a thread is only permitted to modify its own Performance Monitor context.
The only exception to this rule is when making group level calls, which obviously affect the group
context, but can also affect other threads' context. Deleting a group deletes all the contexts associated with
the group, that is, the caller context, the group context, and all the contexts belonging to all the threads in
the group.
Access to a Performance Monitor context not belonging to the calling thread or its group is available only
from the target process's debugger program. The third party API calls are only permitted when the target
process is either being ptraced by the API caller, that is, the caller is already attached to the target
process, and the target process is stopped or the target process is stopped on a /proc file system event
and the caller has the privilege required to open its control file.
Some processors support two threshold multipliers; others support none, meaning that thresholding is not
supported at all. You cannot use the pm_init routine with processors newer than POWER4. You must
use the pm_initialize routine for newer processors.
For each event returned, in addition to the testing status, the pm_init routine also returns the identifier to
be used in subsequent API calls, a short name, and a long name. The short name is a mnemonic name in
the form PM_MNEMONIC. Events that are the same on different processors will have the same
mnemonic name. For instance, PM_CYC and PM_INST_CMPL are respectively the number of processor
cycles and the number of completed instructions, and should exist on all processors. For each event returned, a
thresholdable flag is also returned. This flag indicates whether an event can be used with a threshold. If
so, then specifying a threshold defers counting until a number of cycles equal to the threshold multiplied
by the processor's selected threshold multiplier has been exceeded.
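A worked example of the thresholding rule, under the stated assumption that counting is deferred while the elapsed cycles have not yet exceeded threshold × multiplier; the values are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero while counting is still deferred: the event has not
 * yet lasted more than threshold * multiplier processor cycles. */
static int counting_deferred(uint64_t elapsed_cycles,
                             uint64_t threshold,
                             uint64_t multiplier) {
    return elapsed_cycles <= threshold * multiplier;
}
```

With a threshold of 10 and a processor threshold multiplier of 1000, only events lasting longer than 10,000 cycles are counted.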
The Performance Monitoring API enables the specification of event groups instead of individual events.
Event groups are predefined sets of events. Rather than each event being individually specified, a single
group ID is specified. The interface to the pm_init routine has been enhanced to return the list of
supported event groups in a structure of type pm_groups_info_t pointed to by a new optional third
parameter. To preserve binary compatibility, the third parameter must be explicitly announced by OR-ing
the PM_GET_GROUPS bitflag into the filter. Some events on some platforms can only be used from
within a group. This is indicated in the threshold flag associated with each event returned. The following
convention is used:
On some platforms, use of event groups is required because all the events are marked g or G. Each of the
event groups that are returned includes a short name, a long name, and a description similar to those
associated with events, as well as a group identifier to be used in subsequent API calls and the events
contained in the group (in the form of an array of event identifiers).
The testing status of a group is defined as the lowest common denominator among the testing status of
the events that it includes. If at least one event has a testing status of caveat, the group testing status is at
best caveat, and if at least one event has a status of unverified, then the group status is unverified. This is
not returned as a group characteristic, but it is taken into account by the filter. Like events, only groups
with status matching the filter are returned.
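The lowest-common-denominator rule can be sketched as follows; the enum and its ordering are illustrative, not pmapi definitions.

```c
#include <assert.h>

/* Illustrative ordering: a group is verified only if every event in it
 * is verified; one caveat event drops the group to caveat, and one
 * unverified event drops it to unverified. */
enum status { UNVERIFIED = 0, CAVEAT = 1, VERIFIED = 2 };

static enum status group_status(const enum status *ev, int n) {
    enum status s = VERIFIED;
    int i;
    for (i = 0; i < n; i++)
        if (ev[i] < s)
            s = ev[i];   /* keep the weakest status seen so far */
    return s;
}
```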
For each event a status is returned, indicating the event status: validated, unvalidated, or validated with
caveat. The status also indicates if the event can be used in a group or not, if it is a thresholdable event
and if it is a shared event.
Some events on some platforms can be used only within a group; these are defined as group-only events,
and an event group must be specified for them instead of individual events.
For each returned event, a thresholdable state is also returned. It indicates whether an event can be used
with a threshold. If so, specifying a threshold defers counting until it exceeds a number of cycles equal to
the threshold multiplied by the selected processor threshold multiplier.
Some processors support two hardware threads per physical processing unit. Each thread implements a
set of counters, but some events defined for those processors are shared events. A shared event is
controlled by a signal not specific to a particular thread's activity and sent simultaneously to both sets of
hardware counters, one for each thread. Those events are marked by the shared status.
For each returned event, in addition to the testing status, the pm_initialize routine returns the identifier
to be used in subsequent API calls, as a short name and a long name. The short name is a mnemonic
name in the form PM_MNEMONIC. The same events on different processors will have the same
mnemonic name. For instance, PM_CYC and PM_INST_CMPL are respectively the number of processor
cycles and the number of completed instructions, and should exist on all processors.
The Performance Monitoring API enables the specification of event groups instead of individual events.
Event groups are predefined sets of events. Rather than specifying each event individually, a single group
ID can be specified. The interface to the pm_initialize routine returns the list of supported event groups
in a structure of type pm_groups_info_t whose address is returned in the third parameter.
On some platforms, the use of event groups is required because all events are marked as group-only.
Each event group that is returned includes a short name, a long name, and a description similar to those
associated with events, as well as a group identifier to be used in subsequent API calls and the events
contained in the group (in the form of an array of event identifiers).
The testing status of a group is defined as the lowest common denominator among the testing status of
the events that it includes. If the testing status of at least one event is caveat, then the group testing status
is at best caveat; if the testing status of at least one event is unverified, then the group status is unverified.
If the proctype parameter is not set to PM_CURRENT, the Performance Monitor APIs library is not
initialized and the subroutine only returns information about the specified processor in its parameters,
pm_info2_t and pm_groups_info_t, taking into account the filter. If the proctype parameter is set to
PM_CURRENT, in addition to returning the information described, the Performance Monitor APIs library
is initialized and ready to accept other calls.
In multiplexing mode, the PMAPI library periodically changes which event set is counted, accumulating values and
counting time for multiple sets of events. The time each event set is counted before switching to the next
set can be in the range of 10 ms to 30 s. The default value is 100 ms.
The values returned include the number of times all sets of events have been counted, and for each set,
the accumulated counter values and the accumulated time the set was counted. The accumulated time is
measured in up to three different ways: using the Time Base and, when available, using the PURR time and
the SPURR time. These times are stored in a timebase format that can be converted to time by using the
time_base_to_time function. These times are meant to be used to normalize the results across the
complete measurement interval.
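Normalizing a multiplexed count across the complete interval amounts to scaling the raw count by the fraction of time the set was actually counted. The following sketch uses plain doubles in place of timebase values converted with time_base_to_time; the function name is invented.

```c
#include <assert.h>

/* Each event set is counted for only part of the measurement interval,
 * so its raw count is scaled by total_time / counted_time to estimate
 * the count over the whole interval. */
static double normalize_count(double raw_count,
                              double counted_time,
                              double total_time) {
    if (counted_time <= 0.0)
        return 0.0;
    return raw_count * (total_time / counted_time);
}
```

A set counted for 0.5 s of a 2 s run with a raw count of 100 is estimated at 400 events over the full interval.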
Several basic pmapi calls have the following multiplexing mode variations indicated by the _mx suffix:
Counter multi-mode
Counter multi-mode is similar to multiplexing mode. The counting mode in multiplexing mode is
common to all the event sets.
The multi-mode allows you to associate a counting mode with each event set, but because the counting
mode differs from one event set to another, the results of the counting cannot be normalized over the
complete measurement interval.
Several basic pmapi calls have the following multi-mode variations indicated by the _mm suffix:
pm_set_program_mm
Sets the counting configuration. It differs from the pm_set_program_mx function in that it accepts
a set of groups and associated counting mode to be counted.
pm_get_program_mm
Retrieves the current Performance Monitor settings. It differs from the pm_get_program_mx
function in that it accepts a set of groups and associated counting mode.
WPAR counting
It is possible to monitor the system-wide activity of a specific WPAR from the Global WPAR. In this case,
only the activity of the processes running in this WPAR will be monitored.
Several basic pmapi calls have the following per-WPAR variations indicated by the _wp suffix:
pm_set_program_wp, pm_set_program_wp_mm
Same as the pm_set_program subroutine or the pm_set_program_mm subroutine, except that the
counting is restricted to the WPAR specified by its identifier.
pm_init(filter, &pminfo)
pm_set_program_mythread(&prog);
pm_start_mythread();
Get the information about all the event-groups for a specific processor example:
The following example displays how to obtain all the event-groups that are supported for a specific
processor.
#include <stdio.h>
#include <stdlib.h>
#include <pmapi.h>
int main()
{
int rc = 0;
pm_info2_t events;
pm_groups_info_t groups;
int filter = 0;
/*
* Get the events and groups supported for POWER4.
* To get the events and groups supported for the current processor,
* use PM_CURRENT.
*/
int processor_type = PM_POWER4;
int group_idx = 0;
int counter_idx = 0;
int ev_count = 0;
int event_found = 0;
/*
* PM_VERIFIED - To get list of verified events
* PM_UNVERIFIED - To get list of unverified events
* PM_CAVEAT - To get list of events that are usable but with caveats
*/
The following example illustrates how to look at the performance monitor data while the program is
executing.
from a debugger at breakpoint (1)
pm_initialize(filter);
(2) pm_get_program_pthread(pid, tid, ptid, &prog);
... display PM programming ...
continue program
The following program is an example of a count of a single WPAR from the global WPAR.
main ()
{
pm_prog_t prog;
pm_wpar_ctx_info_t wp_list;
int nwpars = 1;
cid_t cid;
pm_start_wp(cid);
... workload ...
pm_stop_wp(cid);
pm_delete_program_wp(cid);
}
Count all active WPARs from the Global WPAR and retrieve per-WPAR data:
The following program is an example of a count of all active WPARS from the global WPAR and also
retrieves per-WPAR data.
main ()
{
pm_prog_t prog;
pm_wpar_ctx_info_t *wp_list;
int nwpars;
/* set programming */
...
prog.mode.b.wpar_all = 1; /* collect per-WPAR data */
pm_set_program(&prog);
pm_start();
... workload ...
pm_stop();
/* retrieve the number of WPARs that were active during the counting */
nwpars = 0;
pm_get_wplist(NULL, NULL, &nwpars);
/* allocate an array large enough to retrieve WPARs contexts */
wp_list = malloc(nwpars * sizeof (pm_wpar_ctx_info_t));
/* retrieve WPARs contexts */
pm_get_wplist(NULL, wp_list, &nwpars);
pm_delete_program();
}
The following is a simple multi-threaded example with independent threads counting the same set of
events.
# include <pmapi.h>
pm_data_t data2;
void *
doit(void *)
{
(1) pm_start_mythread();
pm_stop_mythread();
pm_get_data_mythread(&data2);
}
main()
{
pthread_t threadid;
pthread_attr_t attr;
pthread_addr_t status;
pm_set_program_mythread(&prog);
pthread_create(&threadid, &attr, doit, NULL);
(2) pm_start_mythread();
pm_stop_mythread();
pm_get_data_mythread(&data);
pthread_join(threadid, &status);
In the preceding example, counting starts at (1) and (2) for the main and auxiliary threads respectively
because the initial counting state was off and it was inherited by the auxiliary thread from its creator.
The following example has two threads in a counting-group. The body of the auxiliary thread's
initialization routine is the same as in the previous example.
main()
{
... same initialization as in previous example ...
pm_set_program_mygroup(&prog);
(1) pm_start_mygroup();
pthread_create(&threadid, &attr, doit, NULL);
(2) pm_start_mythread();
pm_stop_mythread();
pm_get_data_mythread(&data)
pthread_join(threadid, &status);
pm_get_data_mygroup(&data)
In the preceding example, the call in (2) is necessary because the call in (1) only turns on counting for the
group, not the individual threads in it. At the end, the group results are the sum of both threads results.
The following example uses the counter-multiplexing mode, counting two event groups by alternating
between them with a slice duration of 200 ms.
main()
{
pm_info2_t pminfo;
pm_groups_info_t pmginfo;
pm_prog_mx_r prog;
pm_events_prog_t event_set[2];
pm_data_mx_t data;
int filter = PM_VERIFIED; /* get list of verified events */
pm_initialize(filter, &pminfo, &pmginfo, PM_CURRENT )
prog.mode.w = 0; /* start with clean mode */
prog.mode.b.user = 1; /* count only user mode */
prog.mode.b.is_group = 1; /* specify event group */
prog.events_set = event_set;
prog.nb_events_prog = 2; /* two event group counted */
prog.slice_duration = 200; /* slice duration for each event group is 200ms */
for (i = 0; i < pminfo.maxpmcs; i++) {
event_set[0][i] = COUNT_NOTHING;
event_set[1][i] = COUNT_NOTHING;
}
This example is similar to the previous one except that it uses the multi-mode functionality, and
associates a mode with each group counted.
main()
{
pm_info2_t pminfo;
pm_groups_info_t pmginfo;
pm_prog_mm_t prog;
pm_data_mx_t data;
pm_prog_t prog_set[2];
int filter = PM_VERIFIED; /* get list of verified events */
pm_initialize(filter, &pminfo, &pmginfo, PM_CURRENT );
prog.prog_set = prog_set;
prog.nb_set_prog = 2; /* two groups counted */
prog.slice_duration = 200; /* slice duration for each event group is 200ms */
prog_set[0].mode.w = 0; /* start with clean mode */
prog_set[0].mode.b.user = 1; /* grp 0: count only user mode */
prog_set[0].mode.b.is_group = 1; /* specify event group */
prog_set[0].mode.b.proctree = 1; /* turns process tree counting on:
this option is common to all counted groups */
prog_set[1].mode.w = 0; /* start with clean mode */
prog_set[1].mode.b.kernel = 1; /* grp 1: count only kernel mode */
prog_set[1].mode.b.is_group = 1; /* specify event group */
for (i = 0; i < pminfo.maxpmcs; i++) {
prog_set[0].events[i] = COUNT_NOTHING;
prog_set[1].events[i] = COUNT_NOTHING;
}
prog_set[0].events[0] = 1; /* count events in group 1 in the first set */
prog_set[1].events[0] = 3; /* count events in group 3 in the second set */
pm_set_program_mygroup_mm(&prog); /* create counting group */
pm_start_mygroup();
pthread_create(&threadid, &attr, doit, NULL);
pm_start_mythread();
... useful work ....
pm_stop_mythread();
pm_get_data_mythread_mx(&data);
printf ("Main thread results:\n");
for (i = 0; i < 2 ; i++) {
group_number = prog_set[i].events[0];
printf ("Group #%d: %s\n", group_number, pmginfo.event_groups[group_number].short_name);
printf (" counting time: %d ms\n", data.accu_set[i].accu_time);
printf (" counting values:\n");
The following example with a reset call illustrates the impact on the group data. The body of the
auxiliary thread is the same as before, except for the pm_start_mythread call, which is not necessary in
this case.
main()
{
... same initialization as in previous example...
pm_stop_mythread()
pm_reset_data_mythread()
pthread_join(threadid, &status);
pm_get_data_mygroup(&data)
In the preceding example, the main thread and the group counting state are both on before the auxiliary
thread is created, so the auxiliary thread will inherit that state and start counting immediately.
At the end, data1 is equal to data because the pm_reset_data_mythread automatically subtracted the
main thread data from the group data to keep it consistent. In fact, the group data remains equal to the
sum of the auxiliary and the main thread data, but in this case, the main thread data is null.
A libpmapi pragma is a lightweight subroutine that is exported through the libpmapi library and
provides access to the PMU registers. A libpmapi pragma uses the mtspr and mfspr instructions instead
of the pmsvcs kernel extension, which avoids system calls.
In the following scenarios, if you use the libpmapi pragmas for read and write access to the PMU
registers, -1 is returned, which indicates that the option is not available. Therefore, you cannot access the
PMU registers from a user application in the following scenarios:
v When a system starts:
– MMCR0[PMCC] is set to 00.
– PMCs 1-6, MMCR0, MMCRA, and MMCR2 registers are read only.
– Access using pmc_read_1to4, pmc_read_5to6, and mmcr_read returns 0.
– Access using pmc_write and mmcr_write returns -1.
v When another PMU-based profiler is used:
– MMCR0[PMCC] is set to 00.
– PMCs 1-6, MMCR0, MMCRA, and MMCR2 registers are read only.
– Access using pmc_read_1to4, pmc_read_5to6, and mmcr_read returns 0.
– Access using pmc_write and mmcr_write returns -1.
v During LPM:
– Prior to the Mobility operation, any running PMU counting is stopped and MMCR0[PMCC] is set to 00.
– Post Mobility operation, PMCs 1-6, MMCR0, MMCRA, and MMCR2 registers are read only.
– Access using pmc_read_1to4, pmc_read_5to6, and mmcr_read returns 0.
– Access using pmc_write and mmcr_write returns -1.
Instead of using the libpmapi pragmas, if you use the mtspr and the mfspr instructions to access the
PMU registers, a SIGILL signal is generated for any write operations.
When nested instrumentation is used, exclusive duration is generated for the outer sections. Average and
standard deviation are provided when an instrumented section is activated multiple times.
The libraries support OpenMP and threaded applications, which require linking with the thread-safe
version of the library, libhpm_r. Both 32-bit and 64-bit library modules are provided.
The libraries collect information and perform Hardware Performance Monitor summarization at run
time, so there can be considerable overhead if instrumentation sections are inserted inside inner loops.
By default, argument passing from Fortran applications to the hpm libraries is done by reference, or
pointer, not by value. Also, there is an extra length argument following character strings. You can modify
the default argument passing method by using the %VAL and %REF built-in functions.
A second source of overhead is due to run-time accumulation and storage of performance data. The hpm
libraries collect information and perform summarization during run-time. Hence, there could be a
considerable amount of overhead if instrumentation sections are inserted inside inner loops.
The hpm library uses hardware counters during the initialization and finalization of the library, retaining
the minimum of the two for each counter as an estimate of the cost of one call to the start and stop
functions. The estimated overhead is subtracted from the values obtained on each instrumented code
section, which brings the measurement error close to zero. However, because this is a statistical
approximation, the approach fails in situations where the estimated overhead is larger than a measured
count for the application. When the approach fails, you might get the following error message, which
indicates that the estimated overhead was not subtracted from the measured values:
WARNING: Measurement error for <event name> not removed
You can deactivate the procedure that attempts to remove measurement errors by setting the
HPM_WITH_MEASUREMENT_ERROR environment variable to TRUE (1).
Threaded applications
The hpmTstart and hpmTstop functions (f_hpmtstart and f_hpmtstop in Fortran) start and stop the
counters independently on each thread. If two distinct threads use the same instID parameter, the output
indicates multiple calls; however, the counts are accumulated.
The instID parameter must be a constant or an integer; it cannot be an expression. This is because the
declarations in the libhpm.h, f_hpm.h, and f_hpm_i8.h header files contain #define statements that are
evaluated during the compiler preprocessing phase, which permits the collection of line numbers and
source file names.
For the hpm libraries, you can select the event set to be used by any of the following methods:
v The HPM_EVENT_SET environment variable, which is either explicitly set in the environment or
specified in the HPM_flags.env file.
v The content of the libHPMevents file.
For the hpmcount and hpmstat commands, you can specify which event types you want to be monitored
and the associated hardware performance counters by any of the following methods:
v Using the -s option
v The HPM_EVENT_SET environment variable, which you can set directly or define in the
HPM_flags.env file
v The content of the libHPM_events file
In all cases, the HPM_flags.env file takes precedence over the explicit setting of the HPM_EVENT_SET
environment variable and the content of the libHPMevents or libHPM_events file takes precedence over
the HPM_EVENT_SET environment variable.
An event group can be specified instead of an event set, using any of the following methods:
v The -g option
v The HPM_EVENT_GROUP environment variable that you can set directly or define in the
HPM_flags.env file
In all cases, the HPM_flags.env file takes precedence over the explicit setting of the
HPM_EVENT_GROUP environment variable. The HPM_EVENT_GROUP environment variable takes
precedence over the explicit setting of the HPM_EVENT_SET environment variable. The
HPM_EVENT_GROUP is a comma separated list of group names or group numbers.
A list of derived metric groups to be evaluated can be specified using any of the following methods:
v The -m option
v The HPM_PMD_GROUP environment variable that you can set directly or define in the
HPM_flags.env file
In all cases, the HPM_flags.env file takes precedence over the explicit setting of the HPM_PMD_GROUP
environment variable. The HPM_PMD_GROUP environment variable is a comma-separated list of
derived metric group names.
Each set, group or derived metric group can be qualified by a counting mode. The allowed counting
modes are:
v u: user mode
v k: kernel mode
v h: hypervisor mode
v r: runlatch mode
v n: nointerrupt mode
The counting mode qualifier is separated from the set or group by a colon ":". For example:
HPM_EVENT_GROUP=pm_utilization:uk,pm_completion:u
To use the time slice functionality, specify a comma-separated list of sets instead of a single set number.
By default, the time slice duration for each set is 100 ms, but this can be modified with the
HPM_MX_DURATION environment variable. This value must be expressed in ms, and in the range 10
ms to 30000 ms.
The libHPMevents and libHPM_events files are both supplied by the user and have the same format.
For POWER3 or PowerPC 604 RISC Microprocessor systems, the file contains the counter number and the
event name, like in the following example:
0 PM_LD_MISS_L2HIT
1 PM_TAG_BURSTRD_L2MISS
2 PM_TAG_ST_MISS_L2
3 PM_FPU0_DENORM
4 PM_LSU_IDLE
5 PM_LQ_FULL
6 PM_FPU_FMA
7 PM_FPU_IDLE
For POWER4 and later systems, the file contains the event group name, like in the following example:
pm_hpmcount1
The HPM_flags.env file contains environment variables that are used to specify the event set and for the
computation of derived metrics.
Example
HPM_L2_LATENCY 12
HPM_EVENT_SET 5
You can also generate an XML output file by setting the HPM_VIZ_OUTPUT=TRUE environment
variable. The generated output files are named either <progName>_<pid>_<taskID>.viz or
HPM_OUTPUT_NAME_<taskID>.viz.
An alternative time base for the result normalization can be selected using any of the following methods:
v The -b time|purr|spurr option
v The HPM_NORMALIZE environment variable that you can set directly or define in the
HPM_flags.env file
You can list the globally supported metrics for a given processor with the pmlist -D -1 [-p
Processor_name] command.
You can supply the following environment variables to specify estimations of memory, cache, and TLB
miss latencies for the computation of related derived metrics:
v HPM_MEM_LATENCY
v HPM_L3_LATENCY
v HPM_L35_LATENCY
v HPM_AVG_L3_LATENCY
v HPM_AVG_L2_LATENCY
v HPM_L2_LATENCY
You can use the HPM_DIV_WEIGHT environment variable to compute the weighted flips on systems
that are POWER4 and later.
The following C program contains two instrumented sections which perform a trivial floating point
operation, print the results, and then launch the command interpreter to execute the ls -R / 2>&1
>/dev/null command.
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <libhpm.h>
void
do_work()
{
pid_t p, wpid;
int i, status;
float f1 = 9.7641, f2 = 2.441, f3 = 0.0;
f3 = f1 / f2;
printf("f3=%f\n", f3);
p = fork();
if (p == -1) {
perror("Mike fork error");
exit(1);
}
if (p == 0) {
i = execl("/usr/bin/sh", "sh", "-c", "ls -R / 2>&1 >/dev/null", (char *)0);
perror("Mike execl error");
exit(2);
}
else
wpid = waitpid(p, &status, WUNTRACED | WCONTINUED);
if (wpid == -1) {
perror("Mike waitpid error");
exit(3);
}
}
int
main(void)
{
int taskID = 1; /* task identifier; value chosen for illustration */
hpmInit(taskID, "my_program");
hpmStart(1, "outer call");
do_work();
hpmStart(2, "inner call");
do_work();
hpmStop(2);
hpmStop(1);
hpmTerminate(taskID);
}
The following declaration is required on all source files that have instrumentation calls.
#include "f_hpm.h"
Fortran programs call functions that include the f_ prefix, as you can see in the following example:
call f_hpminit( taskID, "my_program" )
call f_hpmstart( 1, "Do Loop" )
do ...
call do_work()
call f_hpmstart( 5, "computing meaning of life" );
call do_more_work();
call f_hpmstop( 5 );
end do
call f_hpmstop( 1 )
call f_hpmterminate( taskID )
When placing instrumentation inside parallel regions, use a different instID for each thread. The library
accepts the use of the same instID for different threads, but the counters are accumulated for all instances
with the same instID.
System component information is also retrieved from the Object Data Manager (ODM) and returned with
the performance metrics.
The API supports extensions, so binary compatibility is maintained across all releases of AIX. This is
accomplished by using one of the parameters in every API call to specify the size of the data structure to
be returned. The interface permits the library to determine which version is in use, even as the structures
grow, and frees the user from dependence on specific versions. For the list of extensions in earlier
versions of AIX, see the Change History section.
The perfstat API subroutines are in the libperfstat.a library, which is part of the bos.perf.libperfstat file
set, installable from the AIX base installation media. It requires that the bos.perf.perfstat file set is
installed; the latter contains the kernel extension and is automatically installed with AIX.
The /usr/include/libperfstat.h file contains the interface declarations and type definitions of the data
structures to use when calling the interfaces. The include file is also part of the bos.perf.libperfstat file
set. Sample source code is provided with bos.perf.libperfstat file set and is present in the
/usr/samples/libperfstat directory.
Related information:
libperfstat.h file
API characteristics
Five types of APIs are available. Global types return global metrics related to a set of components, while
individual types return metrics related to individual components. Both types of interfaces have similar
signatures, but slightly different behavior.
AIX supports additional types of APIs, such as WPAR and RSET. WPAR types return usage metrics
related to a set of components or individual components specific to a workload partition (WPAR). RSET
types return usage metrics of processors that belong to an RSET. With AIX Version 6.1 Technology Level
(TL) 6, a new type of API, called NODE, is available. The NODE types return usage metrics that are
related to a set of components or individual components specific to a remote node in a cluster. The
perfstat_config (PERFSTAT_ENABLE | PERFSTAT_CLUSTER_STATS, NULL) subroutine must be used to enable
remote node statistics collection (which is available in a cluster environment).
All the interfaces return raw data; that is, values of running counters. Multiple calls must be made at
regular intervals to calculate rates.
Several interfaces return data retrieved from the ODM (object data manager) database. This information is
automatically cached into a dictionary that is assumed to be "frozen" after it is loaded. The perfstat_reset
subroutine must be called to clear the dictionary whenever the system configuration has changed. In
order to do a more selective reset, you can use the perfstat_partial_reset function. For more details, see
the “Cached metrics interfaces” on page 181 section.
Most types returned are unsigned long long; that is, unsigned 64-bit data.
Excessive and redundant calls to Perfstat APIs in a short time span can have a performance impact
because time-consuming statistics collected by them are not cached.
For examples of API characteristics, see the sample programs in the /usr/samples/libperfstat directory.
All of the sample programs can be compiled using the provided makefile (/usr/samples/libperfstat/
Makefile.samples).
Global interfaces
Global interfaces report metrics related to a set of components on a system (such as processors, disks, or
memory).
The return value is -1 in case of errors. Otherwise, the number of structures copied is returned. This is
always 1.
The following sections provide examples of the type of data returned and code using each of the
interfaces.
input statistics:
number of packets : 306688
number of errors : 0
number of bytes : 24852688
output statistics:
number of packets : 63005
number of bytes : 11518591
number of errors : 0
The preceding program emulates ifstat's behavior and also shows how perfstat_netinterface_total is
used.
perfstat_cpu_total Interface
The perfstat_cpu_total interface returns a perfstat_cpu_total_t structure, which is defined in the
libperfstat.h file.
Note: Page coalescing is a transparent operation wherein the hypervisor detects duplicate pages, directs
all user reads to a single copy, and reclaims the other duplicate physical memory pages.
Several other processor-related counters (such as the numbers of system calls, reads, writes, forks, and
execs, and the load average) are also returned. For a complete list, see the perfstat_cpu_total_t section of
the libperfstat.h header file.
The following program emulates lparstat's behavior and also shows an example of how the
perfstat_cpu_total interface is used:
#include <stdio.h>
#include <sys/time.h>
#include <sys/errno.h>
#include <sys/proc.h>
#include <wpars/wparcfg.h>
#include <libperfstat.h>
#include <stdlib.h>
#define INTERVAL_DEFAULT 1
#define COUNT_DEFAULT 1
#define ACTIVE 0
#define NOTACTIVE 1
perfstat_id_wpar_t wparid;
perfstat_wpar_total_t wparinfo;
perfstat_wpar_total_t *wparlist;
cid_t cid;
/*
*Name: do_cleanup
* free all allocated data structures
*/
void do_cleanup(void)
{
if (wparlist)
free(wparlist);
}
/*
*Name: display_global_sysinfo_stat
* Function used when called from global.
* Gets all the system metrics using perfstat APIs and displays them
*
*/
void display_global_sysinfo_stat(void)
{
perfstat_cpu_total_t *cpustat,*cpustat_last;
perfstat_id_t first;
/* allocate memory for data structures and check for any error */
printf ("%10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", "cswch", "scalls", "sread", "swrite", "fork", "exec",
"rchar", "wchar", "deviceint", "bwrite", "bread", "phread");
printf ("%10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", "=====", "======", "=====", "======", "====", "====",
"=====", "=====", "=========", "======", "=====", "======");
while (count > 0){
sleep(interval);
if (perfstat_cpu_total(NULL ,cpustat, sizeof(perfstat_cpu_total_t), 1) <= 0){
perror("perfstat_cpu_total ");
exit(1);
}
/* print the difference between the old structure and new structure */
printf("%10llu %10llu %10llu %10llu %10llu %10llu %10llu %10llu %10llu %10llu %10llu %10llu\n",(cpustat->pswitch - cpustat_last->pswitch),
(cpustat->syscall - cpustat_last->syscall), (cpustat->sysread - cpustat_last->sysread ),
(cpustat->syswrite - cpustat_last->syswrite),(cpustat->sysfork - cpustat_last->sysfork),
(cpustat->sysexec - cpustat_last->sysexec ), (cpustat->readch - cpustat_last->readch),
(cpustat->writech - cpustat_last->writech ),(cpustat->devintrs - cpustat_last->devintrs),
(cpustat->bwrite - cpustat_last->bwrite), (cpustat->bread - cpustat_last->bread ),
(cpustat->phread - cpustat_last->phread ));
count--;
/*
*Name: display_wpar_sysinfo_stat
* Displays both wpar and global metrics
*
*/
void display_wpar_sysinfo_stat(void)
{
perfstat_wpar_total_t wparinfo;
perfstat_cpu_total_wpar_t cinfo_wpar, cinfo_wpar_last;
perfstat_cpu_total_t sysinfo, sysinfo_last;
/* display the difference between the current and old structure for the current wpar and system wide values*/
printf("%10s %10llu %10llu %10llu %10llu %10llu %10llu %10llu\n",wparinfo.name, (cinfo_wpar.pswitch - cinfo_wpar_last.pswitch),
(cinfo_wpar.syscall - cinfo_wpar_last.syscall), (cinfo_wpar.sysfork - cinfo_wpar_last.sysfork),
(cinfo_wpar.runque - cinfo_wpar_last.runque), (cinfo_wpar.swpque - cinfo_wpar_last.swpque),
(cinfo_wpar.runocc - cinfo_wpar_last.runocc), (cinfo_wpar.swpocc - cinfo_wpar_last.swpocc));
printf("%10s %10llu %10llu %10llu %10llu %10llu %10llu %10llu\n\n", "Global", (sysinfo.pswitch - sysinfo_last.pswitch),
(sysinfo.syscall - sysinfo_last.syscall), (sysinfo.sysfork - sysinfo_last.sysfork),
(sysinfo.runque - sysinfo_last.runque), (sysinfo.swpque - sysinfo_last.swpque),
(sysinfo.runocc - sysinfo_last.runocc), (sysinfo.swpocc - sysinfo_last.swpocc));
count--;
/* Name: display_wpar_total_sysinfo_stat
* displays statistics of individual wpar
*
*/
int display_wpar_total_sysinfo_stat(void)
{
int i, *status;
perfstat_wpar_total_t *wparinfo;
perfstat_cpu_total_wpar_t *cinfo_wpar, *cinfo_wpar_last;
/* allocate memory for the datastructures and check for any error */
status = (int *) calloc(totalwpar ,sizeof(int));
CHECK_FOR_MALLOC_NULL(status);
/*
*Name: showusage
* displays the usage message
*
*/
void showusage()
{
if (!cid)
printf("Usage:simplesysinfo [-@ { ALL | WPARNAME }] [interval] [count]\n ");
else
printf("Usage:simplesysinfo [interval] [count]\n");
exit(1);
}
/* NAME: main
* This function determines the interval, iteration count.
* Then it calls the corresponding functions to display
* the corresponding metrics
*/
if (argc > 2)
showusage();
if (argc){
if ((interval = atoi(argv[0])) <= 0)
showusage();
argc--;
}
if (argc){
if ((count = atoi(argv[1])) <= 0)
showusage();
}
}
do_cleanup();
return(0);
}
The program displays an output that is similar to the following example output:
cswch scalls sread swrite fork exec rchar wchar deviceint bwrite bread phread
===== ====== ===== ====== ==== ==== ===== ===== ========= ====== ===== ======
83 525 133 2 0 1 1009462 264 27 0 0 0
perfstat_memory_total Interface
The perfstat_memory_total interface returns a perfstat_memory_total_t structure, which is defined in the
libperfstat.h file.
Note: Page coalescing is a transparent operation wherein the hypervisor detects duplicate pages, directs
all user reads to a single copy, and can reclaim other duplicate physical memory pages.
Several other memory-related metrics (such as amount of paging space paged in and out, and amount of
system memory) are also returned. For a complete list, see the perfstat_memory_total_t section of the
libperfstat.h header file in Files Reference.
The following program emulates vmstat's behavior and also shows an example of how the
perfstat_memory_total interface is used:
#include <stdio.h>
#include <libperfstat.h>
The preceding program emulates vmstat's behavior and also shows how perfstat_memory_total is used.
perfstat_disk_total Interface
The perfstat_disk_total interface returns a perfstat_disk_total_t structure, which is defined in the
libperfstat.h file.
Several other disk-related metrics, such as number of blocks read from and written to disk, are also
returned. For a complete list, see the perfstat_disk_total_t section in the libperfstat.h header file in Files
Reference.
The preceding program emulates iostat's behavior and also shows how perfstat_disk_total is used.
perfstat_netinterface_total Interface
The perfstat_netinterface_total interface returns a perfstat_netinterface_total_t structure, which is
defined in the libperfstat.h file.
Several other network interface-related metrics (such as the number of bytes sent and received) are also
returned. For a complete list, see the perfstat_netinterface_total_t section in the libperfstat.h header file in
Files Reference.
Note: Page coalescing is a transparent operation wherein the hypervisor detects duplicate pages, directs
all user reads to a single copy, and reclaims duplicate physical memory pages.
For a complete list, see the perfstat_partition_total_t section in the libperfstat.h header file.
The following code shows examples of how to use the perfstat_partition_total function.
perfstat_partition_total_t pinfo;
int rc;
The following example demonstrates emulating the lparstat command in default mode:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libperfstat.h>
#include <sys/systemcfg.h>
#define INTERVAL_DEFAULT 2
#define COUNT_DEFAULT 10
#ifdef UTIL_AUTO
#define UTIL_MS 1
#define UTIL_PCT 0
#define UTIL_CORE 2
#define UTIL_PURR 0
#define UTIL_SPURR 1
void display_lpar_util_auto(int mode,int cpumode,int count,int interval);
#endif
void display_lpar_util(void);
if(collect_remote_node_stats)
{ /* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
#ifdef UTIL_AUTO
printf("Enter CPU mode.\n");
printf(" 0 PURR \n 1 SPURR \n");
scanf("%d",&cpumode);
printf("Enter print mode.\n");
printf(" 0 PERCENTAGE\n 1 MILLISECONDS\n 2 CORES \n");
scanf("%d",&mode);
if((mode>2)&& (cpumode>1))
{
#else
/* Iterate "count" times */
while (count > 0)
{
display_lpar_util();
sleep(interval);
count--;
}
#endif
if(collect_remote_node_stats)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return(0);
}
last_pcpu_user = lparstats->puser;
last_pcpu_sys = lparstats->psys;
last_pcpu_idle = lparstats->pidle;
last_pcpu_wait = lparstats->pwait;
last_lcpu_user = cpustats->user;
last_lcpu_sys = cpustats->sys;
last_lcpu_idle = cpustats->idle;
last_lcpu_wait = cpustats->wait;
last_busy_donated = lparstats->busy_donated_purr;
last_idle_donated = lparstats->idle_donated_purr;
last_busy_stolen = lparstats->busy_stolen_purr;
last_idle_stolen = lparstats->idle_stolen_purr;
}
printf("\n%5s %5s %6s %6s %5s %5s %5s %5s %4s %5s",
"-----", "----", "-----", "-----", "-----", "-----", "-----", "---", "----", "-----");
} else {
printf("\n%5s %5s %6s %6s %5s %5s %5s %4s %5s",
"%user", "%sys", "%wait", "%idle", "physc", "%entc", "lbusy", "vcsw", "phint");
disp_util_header = 0;
/* first iteration, we only read the data, print the header and save the data */
save_last_values(&cpustats, &lparstats);
return;
}
/* calculate physical processor tics during the last interval in user, system, idle and wait mode */
delta_pcpu_user = lparstats.puser - last_pcpu_user;
delta_pcpu_sys = lparstats.psys - last_pcpu_sys;
delta_pcpu_idle = lparstats.pidle - last_pcpu_idle;
delta_pcpu_wait = lparstats.pwait - last_pcpu_wait;
/* calculate clock tics during the last interval in user, system, idle and wait mode */
delta_lcpu_user = cpustats.user - last_lcpu_user;
delta_lcpu_sys = cpustats.sys - last_lcpu_sys;
delta_lcpu_idle = cpustats.idle - last_lcpu_idle;
delta_lcpu_wait = cpustats.wait - last_lcpu_wait;
/* calculate entitlement for this partition - entitled physical processors for this partition */
entitlement = (double)lparstats.entitled_proc_capacity / 100.0 ;
/* distribute unused physical processor tics among wait and idle proportionally to wait and idle in clock tics */
/* for SPLPAR, consider the entitled physical processor tics as the actual delta physical processor tics */
pcputime = entitled_purr;
}
else if (lparstats.type.b.donate_enabled) { /* if donation is enabled for this DLPAR */
/* calculate busy stolen and idle stolen physical processor tics during the last interval */
/* these physical processor tics are stolen from this partition by the hypervisor
* which will be used by wanting partitions */
delta_busy_stolen = lparstats.busy_stolen_purr - last_busy_stolen;
delta_idle_stolen = lparstats.idle_stolen_purr - last_idle_stolen;
/* calculate busy donated and idle donated physical processor tics during the last interval */
/* these physical processor tics are voluntarily donated by this partition to the hypervisor
* which will be used by wanting partitions */
delta_busy_donated = lparstats.busy_donated_purr - last_busy_donated;
delta_idle_donated = lparstats.idle_donated_purr - last_idle_donated;
/* add busy donated and busy stolen to the kernel bucket, as cpu
* cycles were donated / stolen when this partition is busy */
delta_pcpu_sys += delta_busy_donated;
delta_pcpu_sys += delta_busy_stolen;
/* distribute idle stolen to wait and idle proportionally to the logical wait and idle in clock tics, as
* cpu cycles were stolen when this partition is idle or in wait */
delta_pcpu_wait += delta_idle_stolen *
((double)delta_lcpu_wait / (double)(delta_lcpu_wait + delta_lcpu_idle));
delta_pcpu_idle += delta_idle_stolen *
((double)delta_lcpu_idle / (double)(delta_lcpu_wait + delta_lcpu_idle));
/* distribute idle donated to wait and idle proportionally to the logical wait and idle in clock tics, as
* cpu cycles were donated when this partition is idle or in wait */
delta_pcpu_wait += delta_idle_donated *
((double)delta_lcpu_wait / (double)(delta_lcpu_wait + delta_lcpu_idle));
delta_pcpu_idle += delta_idle_donated *
((double)delta_lcpu_idle / (double)(delta_lcpu_wait + delta_lcpu_idle));
/* add donated to the total physical processor tics for CPU usage calculation, as they were
* distributed to respective buckets accordingly */
pcputime += (delta_idle_donated + delta_busy_donated);
/* add stolen to the total physical processor tics for CPU usage calculation, as they were
* distributed to respective buckets accordingly */
pcputime += (delta_idle_stolen + delta_busy_stolen);
if (lparstats.type.b.pool_util_authority) {
/* Available physical Processor units available in the shared pool (app) */
printf("%5.2f ", (double)(lparstats.pool_idle_time - last_pit) /
(XINTFRAC * (double)delta_time_base));
}
save_last_values(&cpustats, &lparstats);
}
#ifdef UTIL_AUTO
void display_lpar_util_auto(int mode,int cpumode,int count,int interval)
{
float user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr;
float user_core_spurr,kern_core_spurr,wait_core_spurr,idle_core_spurr,sum_core_spurr;
u_longlong_t user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,sum_ms;
u_longlong_t user_ms_spurr,kern_ms_spurr,wait_ms_spurr,idle_ms_spurr;
perfstat_rawdata_t data;
disp_util_header = 0;
/* first iteration, we only read the data, print the header and save the data */
}
while(count)
{
collect_metrics (&oldt, &lparstats);
sleep(interval);
collect_metrics (&newt, &lparstats);
data.type = UTIL_CPU_TOTAL;
data.curstat = &newt; data.prevstat= &oldt;
data.sizeof_data = sizeof(perfstat_cpu_total_t);
data.cur_elems = 1;
data.prev_elems = 1;
rc = perfstat_cpu_util(&data, &util,sizeof(perfstat_cpu_util_t), 1);
if(rc <= 0)
{
perror("Error in perfstat_cpu_util");
exit(-1);
}
delta_time_base = util.delta_time;
switch(mode)
{
case UTIL_PCT:
printf(" %5.1f %5.1f %5.1f %5.1f %5.4f \n",util.user_pct,util.kern_pct,util.wait_pct,util.idle_pct,util.physical_consumed);
break;
case UTIL_MS:
user_ms_purr=((util.user_pct*delta_time_base)/100.0);
kern_ms_purr=((util.kern_pct*delta_time_base)/100.0);
wait_ms_purr=((util.wait_pct*delta_time_base)/100.0);
idle_ms_purr=((util.idle_pct*delta_time_base)/100.0);
if(cpumode==UTIL_PURR)
{
printf(" %llu %llu %llu %llu %5.4f\n",user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,util.physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
user_ms_spurr=(user_ms_purr*util.freq_pct)/100.0;
kern_ms_spurr=(kern_ms_purr*util.freq_pct)/100.0;
wait_ms_spurr=(wait_ms_purr*util.freq_pct)/100.0;
sum_ms=user_ms_spurr+kern_ms_spurr+wait_ms_spurr;
idle_ms_spurr=delta_time_base-sum_ms;
/* print SPURR-based milliseconds, mirroring the PURR case */
printf(" %llu %llu %llu %llu %5.4f\n",user_ms_spurr,kern_ms_spurr,wait_ms_spurr,idle_ms_spurr,util.physical_consumed);
}
break;
case UTIL_CORE:
user_core_purr=((util.user_pct*util.physical_consumed)/100.0);
kern_core_purr=((util.kern_pct*util.physical_consumed)/100.0);
wait_core_purr=((util.wait_pct*util.physical_consumed)/100.0);
idle_core_purr=((util.idle_pct*util.physical_consumed)/100.0);
user_core_spurr=((user_core_purr*util.freq_pct)/100.0);
kern_core_spurr=((kern_core_purr*util.freq_pct)/100.0);
wait_core_spurr=((wait_core_purr*util.freq_pct)/100.0);
if(cpumode==UTIL_PURR)
{
printf("%5.4f %5.4f %5.4f %5.4f %5.4f\n",user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr,util.physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
sum_core_spurr=user_core_spurr+kern_core_spurr+wait_core_spurr;
idle_core_spurr=util.physical_consumed-sum_core_spurr;
/* print SPURR-based core consumption, mirroring the PURR case */
printf("%5.4f %5.4f %5.4f %5.4f %5.4f\n",user_core_spurr,kern_core_spurr,wait_core_spurr,idle_core_spurr,util.physical_consumed);
}
break;
default:
printf("Incorrect usage\n");
return;
The program displays an output that is similar to the following example output:
%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ---- ----- ----- ----- ----- ----- ---- -----
0.1 0.4 0.0 99.5 0.01 1.2 0.2 278 0
0.0 0.3 0.0 99.7 0.01 0.8 0.2 271 0
0.0 0.2 0.0 99.8 0.01 0.5 0.1 180 0
0.0 0.2 0.0 99.8 0.01 0.6 0.1 184 0
0.0 0.2 0.0 99.7 0.01 0.6 0.1 181 0
0.0 0.2 0.0 99.8 0.01 0.6 0.1 198 0
0.0 0.2 0.0 99.8 0.01 0.7 0.2 189 0
2.1 3.3 0.0 94.6 0.09 8.7 2.1 216 0
0.0 0.2 0.0 99.8 0.01 0.7 0.1 265 0
perfstat_tape_total Interface
The perfstat_tape_total interface returns a perfstat_tape_total_t structure, which is defined in the
libperfstat.h file.
Several other tape-related metrics (such as the number of bytes sent and received) are also returned. For a
complete list, see the perfstat_tape_total section in the libperfstat.h header file.
The following code shows examples of how to use the perfstat_tape_total function.
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_tape_total_t *tinfo;
int rc,i;
tinfo = (perfstat_tape_total_t *)malloc(sizeof(perfstat_tape_total_t));
rc = perfstat_tape_total(NULL, tinfo, sizeof(perfstat_tape_total_t), 1);
if(rc<=0){
perror("perfstat_tape_total");
exit(-1);
}
for(i=0;i<rc;i++){
printf("Total number of tapes=%d\n",tinfo[i].number);
printf("Total size of all tapes (in MB)=%lld\n",tinfo[i].size);
printf("Free portion of all tapes(in MB)=%lld\n",tinfo[i].free);
printf("Number of read transfers to/from tape=%lld\n",tinfo[i].rxfers);
printf("Total number of transfers to/from tape=%lld\n",tinfo[i].xfers);
printf("Blocks written to all tapes=%lld\n",tinfo[i].wblks);
printf("Blocks read from all tapes=%lld\n",tinfo[i].rblks);
printf("Amount of time tapes are active=%lld\n",tinfo[i].time);
}
return(0);
}
The preceding program emulates diskstat behavior and also shows how perfstat_tape_total is used.
perfstat_partition_config interface
The perfstat_partition_config interface returns a perfstat_partition_config_t structure, which is
defined in the libperfstat.h file.
For a complete list, see the perfstat_partition_config_t section in the libperfstat.h header file.
==================Hardware Configuration==================
==================Software Configuration==================
OS Name = AIX
OS Version = 7.1
OS Build = Feb 17 2011 15:57:15 1107A_71D
====================LPAR Configuration====================
Number of Logical CPUs = 2
Number of SMT Threads = 2
Number of Drives = 2
Number of NW Adapters = 2
Component-Specific interfaces
Component-specific interfaces report metrics related to individual components on a system (such as a
processor, disk, network interface, or paging space).
The common signature used by all the component interfaces except perfstat_memory_page and
perfstat_hfistat_window is as follows:
int perfstat_subsystem(perfstat_id *name,
perfstat_subsystem_t * userbuff,
int sizeof_struct,
int desired_number);
The return value is -1 in case of error. Otherwise, the number of structures copied is returned. The field
name is either set to NULL or to the name of the next structure available.
An exception to this scheme: when name=NULL, userbuff=NULL, and desired_number=0, the total
number of structures available is returned.
To retrieve all structures of a given type, find the number of structures and allocate the required memory
to hold the structures. You must then call the appropriate API to retrieve all structures in one call.
Another method is to allocate a fixed set of structures and repeatedly call the API to get the next set of
structures, each time passing the name returned by the previous call. Start the process with the name set
to "" or FIRST_SUBSYSTEM, and repeat the process.
Minimizing the number of API calls, and consequently the number of system calls, leads to more efficient
code, so the two-call approach is preferred. Some of the examples in the following sections illustrate the
two-call approach. Because the two-call approach can require a large amount of memory to be allocated at
once, the multiple-call approach is sometimes used instead, and is also illustrated in the following examples.
The following sections provide examples of the type of data returned and the code used for each of the
interfaces.
perfstat_cpu interface
The perfstat_cpu interface returns a set of structures of type perfstat_cpu_t, which is defined in the
libperfstat.h file.
Several other CPU-related metrics (such as number of forks, read, write, and execs) are also returned. For
a complete list, see the perfstat_cpu_t section in the libperfstat.h header.
The following code shows an example of how the perfstat_cpu interface is used:
The program displays an output that is similar to the following example output:
Statistics for CPU : cpu0
------------------
CPU user time (raw ticks) : 2585
CPU sys time (raw ticks) : 25994
CPU idle time (raw ticks) : 7688458
CPU wait time (raw ticks) : 3207
number of syscalls : 6051122
In an environment where dynamic logical partitioning is used, the number of perfstat_cpu_t structures
available is equal to the ncpus_high field in the perfstat_cpu_total_t structure. This number represents the
highest index of any active processor since the last reboot. Kernel data structures holding performance
metrics for processors are not deallocated when processors are taken offline or moved to a different
partition; they simply stop being updated. The ncpus field of the perfstat_cpu_total_t structure represents
the number of active processors, but the perfstat_cpu interface returns ncpus_high structures.
Applications can detect offline or moved processors by checking clock-tick increments. If the sum of the
user, sys, idle, and wait fields is identical for a given processor between two perfstat_cpu calls, that
processor has been offline for the complete interval. If the sum multiplied by 10 ms (the value of a clock
tick) does not match the time interval, the processor has not been online for the complete interval.
The preceding program emulates mpstat behavior and also shows how perfstat_cpu is used.
perfstat_cpu_util interface
The perfstat_cpu_util interface returns a set of structures of type perfstat_cpu_util_t, which is defined
in the libperfstat.h file.
Both system-wide utilization and per-CPU utilization can be obtained by using the perfstat_cpu_util
interface, by setting the type field of the perfstat_rawdata_t data structure to UTIL_CPU_TOTAL or
UTIL_CPU, respectively. UTIL_CPU_TOTAL and UTIL_CPU are macros, which are described in the
definition of the perfstat_rawdata_t data structure.
oldt = (perfstat_cpu_total_t*)malloc(sizeof(perfstat_cpu_total_t)*1);
if(oldt==NULL){
perror ("malloc");
exit(-1);
}
newt = (perfstat_cpu_total_t*)malloc(sizeof(perfstat_cpu_total_t)*1);
if(newt==NULL){
perror ("malloc");
exit(-1);
}
util = (perfstat_cpu_util_t*)malloc(sizeof(perfstat_cpu_util_t)*1);
if(util==NULL){
perror ("malloc");
exit(-1);
}
The following example code calculates both system-wide utilization and per-CPU utilization by using the
perfstat_cpu_util interface:
void main()
{
perfstat_rawdata_t data;
perfstat_cpu_util_t *util;
perfstat_cpu_t *newt,*oldt;
perfstat_id_t id;
int i,cpu_count,rc;
data.cur_elems = cpu_count;
if(data.prev_elems != data.cur_elems)
{
perror("The number of CPUs changed during the measurement interval");
exit(-1);
}
/* allocate enough memory */
newt = (perfstat_cpu_t *)calloc(cpu_count,sizeof(perfstat_cpu_t));
util = (perfstat_cpu_util_t *)calloc(cpu_count,sizeof(perfstat_cpu_util_t));
if(newt == NULL || util == NULL)
{
perror("Memory Allocation Error");
exit(-1);
}
data.curstat = newt;
#define INTERVAL_DEFAULT 2
#define COUNT_DEFAULT 10
#ifdef UTIL_AUTO
#define UTIL_MS 1
#define UTIL_PCT 0
#define UTIL_CORE 2
#define UTIL_PURR 0
#define UTIL_SPURR 1
void display_lpar_util_auto(int mode,int cpumode,int count,int interval);
#endif
void display_lpar_util(void);
if(collect_remote_node_stats)
{ /* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
#ifdef UTIL_AUTO
if((mode>2)&& (cpumode>1))
{
#else
/* Iterate "count" times */
while (count > 0)
{
display_lpar_util();
sleep(interval);
count--;
}
#endif
if(collect_remote_node_stats)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return(0);
}
last_pcpu_user = lparstats->puser;
last_pcpu_sys = lparstats->psys;
last_pcpu_idle = lparstats->pidle;
last_pcpu_wait = lparstats->pwait;
last_lcpu_user = cpustats->user;
last_lcpu_sys = cpustats->sys;
last_lcpu_idle = cpustats->idle;
last_lcpu_wait = cpustats->wait;
last_busy_donated = lparstats->busy_donated_purr;
last_idle_donated = lparstats->idle_donated_purr;
last_busy_stolen = lparstats->busy_stolen_purr;
last_idle_stolen = lparstats->idle_stolen_purr;
}
printf("\n%5s %5s %6s %6s %5s %5s %5s %5s %4s %5s",
"-----", "----", "-----", "-----", "-----", "-----", "-----", "---", "----", "-----");
} else {
printf("\n%5s %5s %6s %6s %5s %5s %5s %4s %5s",
"%user", "%sys", "%wait", "%idle", "physc", "%entc", "lbusy", "vcsw", "phint");
disp_util_header = 0;
/* first iteration, we only read the data, print the header and save the data */
save_last_values(&cpustats, &lparstats);
return;
}
/* calculate physical processor tics during the last interval in user, system, idle and wait mode */
delta_pcpu_user = lparstats.puser - last_pcpu_user;
delta_pcpu_sys = lparstats.psys - last_pcpu_sys;
delta_pcpu_idle = lparstats.pidle - last_pcpu_idle;
delta_pcpu_wait = lparstats.pwait - last_pcpu_wait;
/* calculate clock tics during the last interval in user, system, idle and wait mode */
delta_lcpu_user = cpustats.user - last_lcpu_user;
delta_lcpu_sys = cpustats.sys - last_lcpu_sys;
delta_lcpu_idle = cpustats.idle - last_lcpu_idle;
delta_lcpu_wait = cpustats.wait - last_lcpu_wait;
/* calculate entitlement for this partition - entitled physical processors for this partition */
entitlement = (double)lparstats.entitled_proc_capacity / 100.0 ;
/* distribute unused physical processor tics among wait and idle proportionally to wait and idle in clock tics */
delta_pcpu_wait += unused_purr * ((double)delta_lcpu_wait / (double)(delta_lcpu_wait + delta_lcpu_idle));
delta_pcpu_idle += unused_purr * ((double)delta_lcpu_idle / (double)(delta_lcpu_wait + delta_lcpu_idle));
/* for SPLPAR, consider the entitled physical processor tics as the actual delta physical processor tics */
pcputime = entitled_purr;
}
else if (lparstats.type.b.donate_enabled) { /* if donation is enabled for this DLPAR */
/* calculate busy stolen and idle stolen physical processor tics during the last interval */
/* these physical processor tics are stolen from this partition by the hypervisor
* which will be used by wanting partitions */
delta_busy_stolen = lparstats.busy_stolen_purr - last_busy_stolen;
delta_idle_stolen = lparstats.idle_stolen_purr - last_idle_stolen;
/* calculate busy donated and idle donated physical processor tics during the last interval */
/* these physical processor tics are voluntarily donated by this partition to the hypervisor
* which will be used by wanting partitions */
delta_busy_donated = lparstats.busy_donated_purr - last_busy_donated;
delta_idle_donated = lparstats.idle_donated_purr - last_idle_donated;
/* add busy donated and busy stolen to the kernel bucket, as cpu
* cycles were donated / stolen when this partition is busy */
delta_pcpu_sys += delta_busy_donated;
delta_pcpu_sys += delta_busy_stolen;
/* distribute idle stolen to wait and idle proportionally to the logical wait and idle in clock tics, as
* cpu cycles were stolen when this partition is idle or in wait */
delta_pcpu_wait += delta_idle_stolen *
((double)delta_lcpu_wait / (double)(delta_lcpu_wait + delta_lcpu_idle));
delta_pcpu_idle += delta_idle_stolen *
((double)delta_lcpu_idle / (double)(delta_lcpu_wait + delta_lcpu_idle));
/* distribute idle donated to wait and idle proportionally to the logical wait and idle in clock tics, as
* cpu cycles were donated when this partition is idle or in wait */
delta_pcpu_wait += delta_idle_donated *
((double)delta_lcpu_wait / (double)(delta_lcpu_wait + delta_lcpu_idle));
delta_pcpu_idle += delta_idle_donated *
((double)delta_lcpu_idle / (double)(delta_lcpu_wait + delta_lcpu_idle));
/* add donated to the total physical processor tics for CPU usage calculation, as they were
* distributed to respective buckets accordingly */
pcputime += (delta_idle_donated + delta_busy_donated);
/* add stolen to the total physical processor tics for CPU usage calculation, as they were
* distributed to respective buckets accordingly */
pcputime += (delta_idle_stolen + delta_busy_stolen);
if (lparstats.type.b.pool_util_authority) {
/* Available physical Processor units available in the shared pool (app) */
printf("%5.2f ", (double)(lparstats.pool_idle_time - last_pit) /
(XINTFRAC*(double)delta_time_base));
}
save_last_values(&cpustats, &lparstats);
}
#ifdef UTIL_AUTO
void display_lpar_util_auto(int mode,int cpumode,int count,int interval)
{
float user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr;
float user_core_spurr,kern_core_spurr,wait_core_spurr,idle_core_spurr,sum_core_spurr;
u_longlong_t user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,sum_ms;
u_longlong_t user_ms_spurr,kern_ms_spurr,wait_ms_spurr,idle_ms_spurr;
perfstat_rawdata_t data;
u_longlong_t delta_purr, delta_time_base;
double phys_proc_consumed, entitlement, percent_ent, delta_sec;
perfstat_partition_total_t lparstats;
static perfstat_cpu_total_t oldt,newt;
perfstat_cpu_util_t util;
int rc;
disp_util_header = 0;
/* first iteration, we only read the data, print the header and save the data */
}
while(count)
{
data.type = UTIL_CPU_TOTAL;
data.curstat = &newt; data.prevstat= &oldt;
data.sizeof_data = sizeof(perfstat_cpu_total_t);
data.cur_elems = 1;
data.prev_elems = 1;
rc = perfstat_cpu_util(&data, &util,sizeof(perfstat_cpu_util_t), 1);
if(rc <= 0)
{
perror("Error in perfstat_cpu_util");
exit(-1);
}
delta_time_base = util.delta_time;
switch(mode)
{
case UTIL_PCT:
printf(" %5.1f %5.1f %5.1f %5.1f %5.4f \n",util.user_pct,util.kern_pct,util.wait_pct,util.idle_pct,util.physical_consumed);
break;
case UTIL_MS:
user_ms_purr=((util.user_pct*delta_time_base)/100.0);
kern_ms_purr=((util.kern_pct*delta_time_base)/100.0);
wait_ms_purr=((util.wait_pct*delta_time_base)/100.0);
idle_ms_purr=((util.idle_pct*delta_time_base)/100.0);
if(cpumode==UTIL_PURR)
{
printf(" %llu %llu %llu %llu %5.4f\n",user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,util.physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
user_ms_spurr=(user_ms_purr*util.freq_pct)/100.0;
kern_ms_spurr=(kern_ms_purr*util.freq_pct)/100.0;
wait_ms_spurr=(wait_ms_purr*util.freq_pct)/100.0;
sum_ms=user_ms_spurr+kern_ms_spurr+wait_ms_spurr;
idle_ms_spurr=delta_time_base-sum_ms;
}
break;
case UTIL_CORE:
user_core_purr=((util.user_pct*util.physical_consumed)/100.0);
kern_core_purr=((util.kern_pct*util.physical_consumed)/100.0);
wait_core_purr=((util.wait_pct*util.physical_consumed)/100.0);
idle_core_purr=((util.idle_pct*util.physical_consumed)/100.0);
user_core_spurr=((user_core_purr*util.freq_pct)/100.0);
kern_core_spurr=((kern_core_purr*util.freq_pct)/100.0);
wait_core_spurr=((wait_core_purr*util.freq_pct)/100.0);
if(cpumode==UTIL_PURR)
{
printf("%5.4f %5.4f %5.4f %5.4f %5.4f\n",user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr,util.physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
sum_core_spurr=user_core_spurr+kern_core_spurr+wait_core_spurr;
idle_core_spurr=util.physical_consumed-sum_core_spurr;
}
break;
default:
printf("Incorrect usage\n");
return;
}
count--;
}
}
#endif
The program displays an output that is similar to the following example output:
%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ---- ----- ----- ----- ----- ----- ---- -----
0.1 0.3 0.0 99.6 0.01 1.1 0.2 285 0
0.0 0.3 0.0 99.7 0.01 0.8 0.0 229 0
0.0 0.2 0.0 99.8 0.01 0.6 0.1 181 0
0.1 0.2 0.0 99.7 0.01 0.8 0.1 189 0
0.0 0.3 0.0 99.7 0.01 0.7 0.0 193 0
0.0 0.2 0.0 99.8 0.01 0.7 0.2 204 0
0.1 0.3 0.0 99.7 0.01 0.9 1.0 272 0
0.0 0.3 0.0 99.7 0.01 0.9 0.1 304 0
0.0 0.3 0.0 99.7 0.01 0.9 0.0 212 0
/* #define UTIL_AUTO */
#ifdef UTIL_AUTO
#define UTIL_MS 1
#define UTIL_PCT 0
#define UTIL_CORE 2
#define UTIL_PURR 0
#define UTIL_SPURR 1
void display_metrics_global_auto(int mode,int cpumode,int count,int interval);
#endif
/* Convert 4K pages to MB */
#define AS_MB(X) ((X) * 4096/1024/1024)
/* For WPAR, use NULL else use the actual WPAR ID (for global) */
#define WPAR_ID ((cid)?NULL:&wparid)
#define INTERVAL_DEFAULT 1
#define COUNT_DEFAULT 1
void initialise(void)
{
totalcinfo = (perfstat_cpu_total_t *)malloc(sizeof(perfstat_cpu_total_t));
CHECK_FOR_MALLOC_NULL(totalcinfo);
/*
* NAME: display_metrics_global
* used to display the metrics when called from global
*
*/
void display_metrics_global(void)
{
int i;
perfstat_id_t first;
strcpy(first.name, FIRST_CPU);
while(count)
{
sleep(interval);
if(nflag){
if (perfstat_cpu_total_node(&nodeid, totalcinfo, sizeof(perfstat_cpu_total_t), 1) <= 0){
perror("perfstat_cpu_total_node:");
exit(1);
}
printf("%s\t%#4.1f\t%#4.1f\t%#4.1f\t%#4.1f\t%4d\n",cinfo[i].name,
((double)(delta_user)/(double)(delta_total) * 100.0),
((double)(delta_sys)/(double)(delta_total) * 100.0),
((double)(delta_wait)/(double)(delta_total) * 100.0),
((double)(delta_idle)/(double)(delta_total) * 100.0),
cinfo[i].state);
}
printf("%s\t%#4.1f\t%#4.1f\t%#4.1f\t%#4.1f\n\n","ALL",((double)(delta_user)/(double)(delta_total) * 100.0),
((double)(delta_sys)/(double)(delta_total) * 100.0),
((double)(delta_wait)/(double)(delta_total) * 100.0),
((double)(delta_idle)/(double)(delta_total) * 100.0));
count--;
save_last_values();
}
}
/*
*NAME: display_metrics_wpar
* used to display the metrics when called from wpar
*
*/
void display_metrics_wpar(void)
{
int i;
char last[5];
perfstat_id_wpar_t first;
/*first.spec = WPARNAME;*/
strcpy(first.name, "");
if (perfstat_wpar_total( NULL, &winfo, sizeof(perfstat_wpar_total_t), 1) <= 0){
perror("perfstat_wpar_total:");
exit(1);
}
while(count)
{
sleep(interval);
printf("%s\t%#4.1f\t%#4.1f\t%#4.1f\t%#4.1f\n",cinfo[i].name,((double)(delta_user)/(double)(delta_total) * 100.0),
((double)(delta_sys)/(double)(delta_total) * 100.0),
((double)(delta_wait)/(double)(delta_total) * 100.0),
((double)(delta_idle)/(double)(delta_total) * 100.0));
}
printf("%s\t%#4.1f\t%#4.1f\t%#4.1f\t%#4.1f\n\n",last,((double)(delta_user)/(double)(delta_total) * 100.0),
((double)(delta_sys)/(double)(delta_total) * 100.0),
((double)(delta_wait)/(double)(delta_total) * 100.0),
((double)(delta_idle)/(double)(delta_total) * 100.0));
count--;
save_last_values();
}
/*
* NAME: display_metrics_wpar_from_global
* display metrics of wpar when called from global
*
*/
void display_metrics_wpar_from_global(void)
{
char last[5];
int i;
if (perfstat_wpar_total( &wparid, &winfo, sizeof(perfstat_wpar_total_t), 1) <= 0){
perror("perfstat_wpar_total:");
exit(1);
}
if (winfo.type.b.cpu_rset)
strcpy(last,"RST");
else
strcpy(last,"ALL");
strcpy(wparid.u.wparname,wpar);
printf("\n cpu\tuser\tsys\twait\tidle\n\n");
while(count)
{
sleep(interval);
printf("%s\t%#4.1f\t%#4.1f\t%#4.1f\t%#4.1f\n",cinfo[i].name,((double)(delta_user)/(double)(delta_total) * 100.0),
((double)(delta_sys)/(double)(delta_total) * 100.0),
((double)(delta_wait)/(double)(delta_total) * 100.0),
((double)(delta_idle)/(double)(delta_total) * 100.0));
}
count--;
save_last_values();
}
#ifdef UTIL_AUTO
void display_metrics_global_auto(int mode,int cpumode,int count,int interval)
{
float user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr;
float user_core_spurr,kern_core_spurr,wait_core_spurr,idle_core_spurr,sum_core_spurr;
u_longlong_t user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,sum_ms;
u_longlong_t user_ms_spurr,kern_ms_spurr,wait_ms_spurr,idle_ms_spurr;
perfstat_rawdata_t data;
u_longlong_t delta_purr;
double phys_proc_consumed, entitlement, percent_ent, delta_sec;
perfstat_partition_total_t lparstats;
static perfstat_cpu_t *oldt,*newt;
perfstat_cpu_util_t *util;
int rc,cpu_count,i;
perfstat_id_t id;
while(count) {
/* first iteration, we only read the data, print the header and save the data */
cpu_count = perfstat_cpu(NULL, NULL,sizeof(perfstat_cpu_t),0);
data.type = UTIL_CPU;
data.prevstat= oldt;
data.sizeof_data = sizeof(perfstat_cpu_t);
data.prev_elems = cpu_count;
sleep(interval);
/* Check how many perfstat_cpu_t structures are available after a defined period */
data.cur_elems = cpu_count;
if(data.prev_elems != data.cur_elems)
{
perror("The number of CPUs changed during the measurement interval");
exit(-1);
}
switch(mode)
{
case UTIL_PCT:
for(i=0;i<cpu_count;i++)
printf("%d %5.1f %5.1f %5.1f %5.1f %5.7f \n",i,util[i].user_pct,util[i].kern_pct,util[i].wait_pct,util[i].idle_pct,util[i].physical_consumed);
break;
case UTIL_MS:
for(i=0;i<cpu_count;i++)
{
user_ms_purr=((util[i].user_pct*util[i].delta_time)/100.0);
kern_ms_purr=((util[i].kern_pct*util[i].delta_time)/100.0);
wait_ms_purr=((util[i].wait_pct*util[i].delta_time)/100.0);
idle_ms_purr=((util[i].idle_pct*util[i].delta_time)/100.0);
if(cpumode==UTIL_PURR)
{
printf("%d\t %llu\t %llu\t %llu\t %llu\t %5.4f\n",i,user_ms_purr,kern_ms_purr,wait_ms_purr,idle_ms_purr,util[i].physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
user_ms_spurr=(user_ms_purr*util[i].freq_pct)/100.0;
kern_ms_spurr=(kern_ms_purr*util[i].freq_pct)/100.0;
wait_ms_spurr=(wait_ms_purr*util[i].freq_pct)/100.0;
sum_ms=user_ms_spurr+kern_ms_spurr+wait_ms_spurr;
idle_ms_spurr=util[i].delta_time-sum_ms;
}
}
break;
case UTIL_CORE:
for(i=0;i<cpu_count;i++)
{
user_core_purr=((util[i].user_pct*util[i].physical_consumed)/100.0);
kern_core_purr=((util[i].kern_pct*util[i].physical_consumed)/100.0);
wait_core_purr=((util[i].wait_pct*util[i].physical_consumed)/100.0);
idle_core_purr=((util[i].idle_pct*util[i].physical_consumed)/100.0);
user_core_spurr=((user_core_purr*util[i].freq_pct)/100.0);
kern_core_spurr=((kern_core_purr*util[i].freq_pct)/100.0);
wait_core_spurr=((wait_core_purr*util[i].freq_pct)/100.0);
if(cpumode==UTIL_PURR)
{
printf("%d %5.4f %5.4f %5.4f %5.4f %5.4f\n",i,user_core_purr,kern_core_purr,wait_core_purr,idle_core_purr,util[i].physical_consumed);
}
else if(cpumode==UTIL_SPURR)
{
sum_core_spurr=user_core_spurr+kern_core_spurr+wait_core_spurr;
idle_core_spurr=util[i].physical_consumed-sum_core_spurr;
}
}
break;
default:
printf("Incorrect usage\n");
return;
}
count--;
}
}
#endif
/*
*NAME: main
*
*/
cid = corral_getcid();
initialise();
display_configuration();
if(atflag)
display_metrics_wpar_from_global();
else if (cid)
display_metrics_wpar();
else
#ifdef UTIL_AUTO
if((mode>2)&& (cpumode>1))
{
printf("Error: Invalid Input\n");
exit(0);
}
display_metrics_global_auto(mode,cpumode,count,interval);
#else
display_metrics_global();
#endif
if(nflag)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return(0);
}
The program displays an output that is similar to the following example output:
Purr counter value = 54500189780
Spurr counter value = 54501115744
Free memory = 760099
Available memory = 758179
perfstat_diskadapter Interface
The perfstat_diskadapter interface returns a set of structures of type perfstat_diskadapter_t, which is
defined in the libperfstat.h file.
Several other disk adapter-related metrics (such as the number of blocks read from and written to the
adapter) are also returned. For a complete list, see the perfstat_diskadapter_t section in the libperfstat.h
header file.
The following program emulates the diskadapterstat behavior and also shows an example of how the
perfstat_diskadapter interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <libperfstat.h>
#include <errno.h>
#include <wpars/wparcfg.h>
/* Function prototypes */
/*
* NAME: do_initialization
* This function initializes the data structures.
* It also collects initial set of values.
*
* RETURNS:
* On successful completion:
* - returns 0.
* In case of error
* - exit with code 1.
*/
if (num_adapt == 0) {
printf("There are no disk adapters.\n");
exit(0);
}
if (num_adapt < 0) {
perror("perfstat_diskadapter: ");
exit(1);
}
return (0);
}
/*
* NAME: do_cleanup
* This function frees the memory allocated for the perfstat structures.
*
*/
if (statq) {
free(statq);
}
}
/*
* NAME: collect_diskadapter_metrics
* This function collects the raw values in to
* the specified structures and derive the metrics from the
* raw values
*
*/
void collect_diskadapter_metrics(void)
{
perfstat_id_t first;
unsigned long long delta_read, delta_write,delta_xfers, delta_xrate;
if(collect_remote_node_stats) {
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
strcpy(nodeid.name, FIRST_DISKADAPTER);
rc = perfstat_diskadapter_node(&nodeid ,statq, sizeof(perfstat_diskadapter_t),num_adapt);
}
else {
strcpy(first.name, FIRST_DISKADAPTER);
rc = perfstat_diskadapter(&first ,statq, sizeof(perfstat_diskadapter_t),num_adapt);
}
printf("\n%-8s %7s %8s %8s %8s %8s\n", " Name ", " Disks ", " Size ", " Free ", " ARS ", " AWS ");
printf("%-8s %7s %8s %8s %8s %8s\n", "======", "======", "======", "======", "=====", "=====");
if(collect_remote_node_stats) {
rc = perfstat_diskadapter_node(&nodeid, statp, sizeof(perfstat_diskadapter_t), num_adapt);
}
else {
rc = perfstat_diskadapter(&first ,statp, sizeof(perfstat_diskadapter_t),num_adapt);
}
/*
*NAME: main
*
*/
if(collect_remote_node_stats)
{ /* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
do_initialization();
/* call the functions to collect the metrics and display them */
collect_diskadapter_metrics();
if(collect_remote_node_stats)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return (0);
}
The program displays an output that is similar to the following example output:
Name Disks Size Free ARS AWS
====== ====== ====== ====== ===== =====
vscsi0 1 25568 19616 1 9
perfstat_disk Interface
The perfstat_disk interface returns a set of structures of type perfstat_disk_t, which is defined in the
libperfstat.h file.
Several other disk-related metrics (such as number of blocks read from and written to disk, and adapter
names) are also returned. For a complete list, see the perfstat_disk_t section in the libperfstat.h header
file in Files Reference.
The following program emulates diskstat behavior and also shows an example of how the perfstat_disk
interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
perfstat_diskpath Interface
The perfstat_diskpath interface returns a set of structures of type perfstat_diskpath_t, which is defined
in the libperfstat.h file.
Several other disk path-related metrics (such as the number of blocks read from and written through the
path) are also returned. For a complete list, see the perfstat_diskpath_t section in the libperfstat.h header
file.
The following code shows an example of how the perfstat_diskpath interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
perror("perfstat_diskpath");
exit(-1);
}
if (tot == 0)
{
perror("perfstat_diskpath");
exit(-1);
}
if (ret <= 0)
{
perror("perfstat_diskpath");
exit(-1);
}
The program displays an output that is similar to the following example output:
Statistics for disk path : hdisk0_Path0
----------------------
number of blocks read : 335354
number of blocks written : 291416
adapter name : vscsi0
perfstat_fcstat Interface
The perfstat_fcstat interface returns a set of structures of type perfstat_fcstat_t, which is defined in the
libperfstat.h file.
* NAME: do_initialization
* This function initializes the data structures.
* It also collects the initial set of values.
*
* RETURNS:
* On successful completion:
* - returns 0.
* In case of error
* - exits with code 1.
*/
int do_initialization(void)
{
/* check how many perfstat_fcstat_t structures are available */
if(collect_remote_node_stats) {
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
tot = perfstat_fcstat_node(&nodeid, NULL, sizeof(perfstat_fcstat_t), 0);
}
else if(fc_flag == 1 && wwpn_flag == 1)
{
tot = perfstat_fcstat_wwpn(NULL, NULL, sizeof(perfstat_fcstat_t), 0);
if(tot >= 1)
{
tot = 1;
}
else
{
printf("There is no FC adapter \n");
exit(-1);
}
}
else
{
tot = perfstat_fcstat(NULL, NULL, sizeof(perfstat_fcstat_t), 0);
}
if (tot <= 0) {
printf("There is no FC adapter\n");
exit(0);
}
/*
*Name: display_metrics
if(collect_remote_node_stats) {
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
strcpy(nodeid.name , FIRST_NETINTERFACE);
ret = perfstat_fcstat_node(&nodeid, statq, sizeof(perfstat_fcstat_t), tot);
} else if((fc_flag == 1) && (wwpn_flag == 1)) {
strcpy(wwpn.name , fcadapter_name);
wwpn.initiator_wwpn_name = wwpn_id;
ret = perfstat_fcstat_wwpn( &wwpn, statq, sizeof(perfstat_fcstat_t), tot);
}
else
{
strcpy(first.name , FIRST_NETINTERFACE);
ret = perfstat_fcstat( &first, statq, sizeof(perfstat_fcstat_t), tot);
}
if (ret < 0)
{
free(statp);
free(statq);
perror("perfstat_fcstat: ");
exit(1);
}
while (count)
{
sleep (interval);
if(collect_remote_node_stats) {
ret = perfstat_fcstat_node(&nodeid, statp, sizeof(perfstat_fcstat_t), tot);
}
if((fc_flag == 1) && (wwpn_flag == 1))
{
strcpy(wwpn.name , fcadapter_name);
wwpn.initiator_wwpn_name = wwpn_id;
ret = perfstat_fcstat_wwpn(&wwpn, statp, sizeof(perfstat_fcstat_t), tot);
}
else
{
ret = perfstat_fcstat(&first, statp, sizeof(perfstat_fcstat_t), tot);
}
/* print statistics for the Fibre Channel adapter */
for (i = 0; i < ret; i++) {
printf(" FC Adapter name: %s \n", statp[i].name);
printf(" ======================== Traffic Statistics ============================\n");
printf(" Number of Input Requests: %lld \n",
statp[i].InputRequests - statq[i].InputRequests);
printf(" Number of Output Requests: %lld \n",
statp[i].OutputRequests - statq[i].OutputRequests);
printf(" Number of Input Bytes : %lld \n",
statp[i].InputBytes - statq[i].InputBytes);
printf(" Number of Output Bytes : %lld \n",
statp[i].OutputBytes - statq[i].OutputBytes);
printf(" ======================== Transfer Statistics ============================\n");
printf(" Adapter's Effective Maximum Transfer Value : %lld \n",
statp[i].EffMaxTransfer - statq[i].EffMaxTransfer);
printf(" ======================== Driver Statistics ============================\n");
printf(" Count of DMA failures: %lld \n",
statp[i].NoDMAResourceCnt - statq[i].NoDMAResourceCnt);
printf(" No command resource available : %lld \n",
statp[i].NoCmdResourceCnt - statq[i].NoCmdResourceCnt);
}
memcpy(statq, statp, (tot * sizeof(perfstat_fcstat_t)));
count--;
}
}
/*
*Name: main
*
*/
if((fc_flag == 1))
{
if(fcadapter_name == NULL )
{
fprintf(stderr, "FC adapter Name should not be NULL");
exit(-1);
}
}
if(wwpn_flag == 1)
{
if(wwpn_id < 0 )
{
fprintf(stderr, "WWPN id should not be negavite ");
exit(-1);
}
}
if(collect_remote_node_stats)
{ /* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
do_initialization();
display_metrics();
if(collect_remote_node_stats)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return (0);
}
perfstat_hfistat_window Interface
The perfstat_hfistat_window interface returns a set of structures of type perfstat_hfistat_window_t,
which is defined in the libperfstat.h file.
perfstat_hfistat Interface
The perfstat_hfistat interface returns a set of structures of type perfstat_hfistat_t, which is defined in the
libperfstat.h file.
perfstat_logicalvolume Interface
The perfstat_logicalvolume interface returns a set of structures of type perfstat_logicalvolume_t, which
is defined in the libperfstat.h file.
Several other logical-volume-related metrics are also returned. For a complete list, see the
perfstat_logicalvolume_t section in the libperfstat.h header file in Files Reference.
Note: The perfstat_config (PERFSTAT_ENABLE | PERFSTAT_LV, NULL) must be used to enable the
logical volume statistical collection.
The following code shows an example of how the perfstat_logicalvolume interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
strcpy(first.name, "");
for(i=0;i<lv_count;i++){
printf("\n");
printf("Logical volume name=%s\n",lv[i].name);
printf("Volume group name=%s\n",lv[i].vgname);
printf("Physical partition size in MB=%lld\n",lv[i].ppsize);
printf("total number of logical partitions configured for this logical volume=%lld\n",lv[i].logical_partitions);
printf("number of physical mirrors for each logical partition=%lu\n",lv[i].mirrors);
printf("Number of read and write requests=%lu\n",lv[i].iocnt);
printf("Number of Kilobytes read=%lld\n",lv[i].kbreads);
printf("Number of Kilobytes written=%lld\n",lv[i].kbwrites);
}
The program displays an output that is similar to the following example output:
Logical volume name=hd5
Volume group name=rootvg
Physical partition size in MB=32
total number of logical partitions configured for this logical volume=1
number of physical mirrors for each logical partition=1
Number of read and write requests=0
Number of Kilobytes read=0
Number of Kilobytes written=0
The preceding program shows how perfstat_logicalvolume is used to report per-logical-volume statistics.
perfstat_memory_page Interface
The perfstat_memory_page interface returns a set of structures of type perfstat_memory_page_t, which is
defined in the libperfstat.h file.
Several other memory-page-related metrics (such as the number of page faults and the number of pages
paged in and out for each page size) are also returned. For a complete list of memory-page-related
metrics, see the perfstat_memory_page_t section in the libperfstat.h header file.
The following program shows an example of how the perfstat_memory_page interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
pagesize.psize = FIRST_PSIZE;
avail_psizes = perfstat_memory_page(&pagesize, psize_mem_values, sizeof(perfstat_memory_page_t),
total_psizes);
if(avail_psizes < 1)
{
perror("display_psize_memory_stats: Unable to retrieve memory "
"statistics for the available page sizes.");
exit(-1);
}
for(i=0;i<avail_psizes;i++){
printf("Page size in bytes=%llu\n",psize_mem_values[i].psize);
printf("Number of real memory frames of this page size=%lld\n",psize_mem_values[i].real_total);
printf("Number of pages on free list=%lld\n",psize_mem_values[i].real_free);
printf("Number of pages pinned=%lld\n",psize_mem_values[i].real_pinned);
printf("Number of pages in use=%lld\n",psize_mem_values[i].real_inuse);
printf("Number of page faults =%lld\n",psize_mem_values[i].pgexct);
printf("Number of pages paged in=%lld\n",psize_mem_values[i].pgins);
printf("Number of pages paged out=%lld\n",psize_mem_values[i].pgouts);
printf("\n");
}
return 0;
}
The program displays an output that is similar to the following example output:
Page size in bytes=4096
Number of real memory frames of this page size=572640
Number of pages on free list=364101
Number of pages pinned=171770
Number of pages in use=208539
Number of page faults =1901334
Number of pages paged in=40569
Number of pages paged out=10381
perfstat_netbuffer Interface
The perfstat_netbuffer interface returns a set of structures of type perfstat_netbuffer_t, which is defined
in the libperfstat.h file.
Several other allocation-related metrics (such as high-water mark and freed) are also returned. For a
complete list of other allocation-related metrics, see the perfstat_netbuffer_t section in the libperfstat.h
header file.
The following code shows an example of how the perfstat_netbuffer interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
The program displays an output that is similar to the following example output:
By size inuse calls failed delayed free hiwat freed
64 598 12310 14 682 0 10480 0
128 577 8457 16 287 0 7860 0
256 1476 287157 88 716 0 15720 0
512 2016 1993915 242 808 0 32750 0
1024 218 8417 81 158 0 7860 0
2048 563 2077 277 307 0 19650 0
4096 39 127 15 143 0 1310 0
8192 4 16 4 0 0 327 0
16384 128 257 19 4 0 163 0
32768 25 55 9 4 0 81 0
65536 59 121 35 5 0 81 0
131072 3 7 0 217 0 204 0
perfstat_netinterface Interface
The perfstat_netinterface interface returns a set of structures of type perfstat_netinterface_t, which is
defined in the libperfstat.h file.
Several other network-interface related metrics (such as number of bytes sent and received, type, and
bitrate) are also returned. For a complete list of other network-interface-related metrics, see the
perfstat_netinterface_t section in the libperfstat.h header file in Files Reference.
char *
decode(uchar type) {
switch(type) {
case IFT_LOOP:
return("loopback");
case IFT_ETHER:
return("ethernet");
}
return("other");
}
perror("perfstat_netinterface");
exit(-1);
}
perror("perfstat_netinterface");
exit(-1);
}
input statistics:
number of packets : 306352
number of errors : 0
number of bytes : 24831776
output statistics:
number of packets : 62669
number of bytes : 11497679
number of errors : 0
input statistics:
number of packets : 336
number of errors : 0
number of bytes : 20912
output statistics:
number of packets : 336
number of bytes : 20912
number of errors : 0
The preceding program shows how perfstat_netinterface is used to report per-interface statistics.
perfstat_netadapter Interface
The perfstat_netadapter interface returns a set of structures of type perfstat_netadapter_t, which is
defined in the libperfstat.h file.
Note: The perfstat_netadapter interface returns only the network Ethernet adapter statistics, similar to
the entstat command.
The following program shows an example of how the perfstat_netadapter interface is used:
/* The sample program displays the metrics related to
 * every individual network adapter in the LPAR. */
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
#include <net/if_types.h>
/* define default interval and count values */
#define INTERVAL_DEFAULT 1
#define COUNT_DEFAULT 1
/*
 * NAME: showusage
 */
/*
* NAME: do_initialization
* This function initializes the data structures.
* It also collects the initial set of values.
*
* RETURNS:
* On successful completion:
* - returns 0.
* In case of error
* - exits with code 1.
*/
int do_initialization(void)
{
/* check how many perfstat_netadapter_t structures are available */
if(collect_remote_node_stats) {
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
tot = perfstat_netadapter_node(&nodeid, NULL, sizeof(perfstat_netadapter_t), 0);
}
else
{
tot = perfstat_netadapter(NULL, NULL, sizeof(perfstat_netadapter_t), 0);
}
if (tot == 0)
{
printf("There are no network adapters\n");
exit(0);
}
if (tot < 0)
{
perror("perfstat_netadapter: ");
exit(1);
} /* allocate enough memory for all the structures */
return(0);
}
/*
*Name: display_metrics
* collect the metrics and display them
*
*/
void display_metrics()
{
perfstat_id_t first;
int ret, i;
if(collect_remote_node_stats) {
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
strcpy(nodeid.name , FIRST_NETINTERFACE);
ret = perfstat_netadapter_node(&nodeid, statq, sizeof(perfstat_netadapter_t), tot);
}
else {
strcpy(first.name , FIRST_NETINTERFACE);
ret = perfstat_netadapter( &first, statq, sizeof(perfstat_netadapter_t), tot);
}
if (ret < 0){
free(statp);
free(statq);
perror("perfstat_netadapter: ");
exit(1);
}
while (count)
{
sleep (interval);
if(collect_remote_node_stats)
{
ret = perfstat_netadapter_node(&nodeid, statp, sizeof(perfstat_netadapter_t), tot);
}
}
memcpy(statq, statp, (tot * sizeof(perfstat_netadapter_t)));
count--;
}
}
/*
 * Name: main
 */
if(collect_remote_node_stats)
{ /* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
do_initialization();
display_metrics();
if(collect_remote_node_stats)
{ /* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
free(statp);
free(statq);
return 0;
}
perfstat_protocol Interface
The perfstat_protocol interface returns a set of structures of type perfstat_protocol_t, which consists of a
set of unions to accommodate the different sets of fields needed for each protocol, as defined in the
libperfstat.h file.
Many other network-protocol related metrics are also returned. For a complete list of network-protocol
related metrics, see the perfstat_protocol_t section in the libperfstat.h header file.
The following code shows an example of how the perfstat_protocol interface is used:
#include <stdio.h>
#include <string.h>
#include <libperfstat.h>
if (ret < 0)
{
perror("perfstat_protocol");
exit(-1);
}
retrieved += ret;
do {
printf("\nStatistics for protocol : %s\n", pinfo.name);
printf("-----------------------\n");
if (!strcmp(pinfo.name,"ip")) {
printf("number of input packets : %llu\n", pinfo.u.ip.ipackets);
printf("number of input errors : %llu\n", pinfo.u.ip.ierrors);
printf("number of output packets : %llu\n", pinfo.u.ip.opackets);
The program displays an output that is similar to the following example output:
number of protocol usage structures available : 11
server statistics:
number of connection-oriented RPC requests : 0
number of rejected connection-oriented RPCs : 0
number of connectionless RPC requests : 0
number of rejected connectionless RPCs : 0
The preceding program emulates protocolstat behavior and also shows how perfstat_protocol is used.
perfstat_pagingspace Interface
The perfstat_pagingspace interface returns a set of structures of type perfstat_pagingspace_t, which is
defined in the libperfstat.h file.
Several other paging-space-related metrics (such as name, type, and active) are also returned. For a
complete list of other paging-space-related metrics, see the perfstat_pagingspace_t section in the
libperfstat.h header file in Files Reference.
pinfo = calloc(tot,sizeof(perfstat_pagingspace_t));
strcpy(first.name, FIRST_PAGINGSPACE);
void main()
{
perfstat_process_t *proct;
perfstat_id_t id;
int i,rc,proc_count;
strcpy(id.name,"");
The program displays an output that is similar to the following example output:
Number of Processes = 77
Credential Information
Owner Info = 0
WLM Class Name = 257
perfstat_process_util interface
The perfstat_process_util interface returns a set of structures of type perfstat_process_t, which is
defined in the libperfstat.h file.
void main()
{
perfstat_process_t *cur, *prev;
bzero(&buf, sizeof(perfstat_rawdata_t));
buf.type = UTIL_PROCESS;
buf.curstat = cur;
buf.prevstat = prev;
buf.sizeof_data = sizeof(perfstat_process_t);
buf.cur_elems = cur_proc_count;
buf.prev_elems = prev_proc_count;
The program displays an output that is similar to the following example output:
=======Process Related Utilization Metrics =======
Process ID = 0
User Mode CPU time = 0.000000
System Mode CPU time = 0.013752
Bytes Written to Disk = 0
Bytes Read from Disk = 0
In Operations from Disk = 0
Out Operations from Disk = 0
=====================================
Process ID = 1
User Mode CPU time = 0.000000
System Mode CPU time = 0.000000
Bytes Written to Disk = 0
Bytes Read from Disk = 0
In Operations from Disk = 0
Out Operations from Disk = 0
=====================================
Process ID = 196614
User Mode CPU time = 0.000000
System Mode CPU time = 0.000000
Bytes Written to Disk = 0
Bytes Read from Disk = 0
In Operations from Disk = 0
Out Operations from Disk = 0
=====================================
Process ID = 262152
User Mode CPU time = 0.000000
System Mode CPU time = 0.000000
Bytes Written to Disk = 0
Bytes Read from Disk = 0
In Operations from Disk = 0
Out Operations from Disk = 0
=====================================
perfstat_processor_pool_util interface
The perfstat_processor_pool_util interface returns a set of structures of type
perfstat_processor_pool_util_t, which is defined in the libperfstat.h file.
The use of the perfstat_processor_pool_util API for the system-level utilization follows:
#include <libperfstat.h>
#include <sys/dr.h>
#include <sys/types.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#define COUNT 2
#define INTERVAL 2
void main(int argc, char **argv)
{
perfstat_rawdata_t data;
perfstat_partition_total_t oldt,newt;
perfstat_processor_pool_util_t util,*uti;
static int once=0;
int rc;
u_longlong_t x=0;
int iInter=0,iCount=0;
int c;
while( (c = getopt(argc,argv,"i:c:"))!= EOF ){
switch(c) {
case 'i':
iInter=atoi(optarg);
break;
case 'c':
iCount=atoi(optarg);
break;
}
}
}
}
perfstat_tape Interface
The perfstat_tape interface returns a set of structures of type perfstat_tape_t, which is defined in the
libperfstat.h file.
Several other tape-related metrics (such as the number of blocks read and written) are also returned. For
a complete list of tape-related metrics, see the perfstat_tape_t section in the libperfstat.h header file in
Files Reference.
The following code shows an example of how the perfstat_tape interface is used:
int main(){
int ret, tot, i;
perfstat_tape_t *statp;
perfstat_id_t first;
for(i=0;i<ret;i++){
void main()
{
perfstat_thread_t *threadt;
perfstat_id_t id;
int i,rc,thread_count;
strcpy(id.name,"");
rc = perfstat_thread(&id,threadt,sizeof(perfstat_thread_t),thread_count);
if(rc <= 0)
{
free(threadt);
perror("Error in perfstat_thread");
exit(-1) ;
}
The program displays an output that is similar to the following example output:
Process ID = 6553744
Thread ID = 12345
perfstat_thread_util interface
The perfstat_thread_util interface returns a set of structures of type perfstat_thread_t, which is
defined in the libperfstat.h file.
void main()
{
perfstat_thread_t *cur, *prev;
perfstat_rawdata_t buf;
perfstat_thread_t *thread_util;
perfstat_id_t id;
int cur_thread_count,prev_thread_count;
int i,rc;
prev_thread_count = perfstat_thread(NULL, NULL,sizeof(perfstat_thread_t),0);
if(prev_thread_count <= 0)
{
perror("Error in perfstat_thread");
exit(-1) ;
}
prev = (perfstat_thread_t *)calloc(prev_thread_count,sizeof(perfstat_thread_t));
if(prev == NULL)
{
perror("Memory Allocation Error");
exit(-1) ;
}
strcpy(id.name,"");
prev_thread_count = perfstat_thread(&id,prev,sizeof(perfstat_thread_t),prev_thread_count);
if(prev_thread_count <= 0)
{
free(prev);
perror("Error in perfstat_thread");
exit(-1) ;
}
sleep(PERIOD);
bzero(&buf, sizeof(perfstat_rawdata_t));
buf.type = UTIL_PROCESS;
buf.curstat = cur;
buf.prevstat = prev;
buf.sizeof_data = sizeof(perfstat_thread_t);
buf.cur_elems = cur_thread_count;
buf.prev_elems = prev_thread_count;
/* Calculate Thread Utilization. This returns the number of thread_util structures that are filled */
rc = perfstat_thread_util(&buf,thread_util,sizeof(perfstat_thread_t),cur_thread_count);
if(rc <= 0)
{
free(prev);
free(cur);
free(thread_util);
perror("Error in perfstat_thread_util");
exit(-1);
}
The program displays an output that is similar to the following example output:
Process ID = 6160532
Thread ID = 123456
User Mode CPU time = 21.824531
System Mode CPU time = 0.000000
Bound CPU Id = 1
Related information:
libperfstat.h file
perfstat_volumegroup Interface
The perfstat_volumegroup interface returns a set of structures of type perfstat_volumegroup_t, which is
defined in the libperfstat.h file.
The following code shows an example of how the perfstat_volumegroup interface is used:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
int vg_count, rc,i;
perfstat_id_t first;
perfstat_volumegroup_t *vg;
strcpy(first.name, "");
return 0;
}
The preceding program shows how perfstat_volumegroup is used to report per-volume-group statistics.
WPAR Interfaces
There are two types of WPAR interfaces:
v The metrics related to a set of components for a WPAR (such as processors or memory).
v The specific metrics related to individual components on a WPAR (such as a processor, network
interface, or memory page).
All of the following WPAR interfaces use the naming convention perfstat_subsystem_total_wpar, and use
a common signature:
Item Descriptor
perfstat_cpu_total_wpar Retrieves WPAR processor summary usage metrics
perfstat_memory_total_wpar Retrieves WPAR memory summary usage metrics
perfstat_wpar_total Retrieves WPAR information metrics
perfstat_memory_page_wpar Retrieves WPAR memory page usage metrics
On successful completion, these interfaces return the number of structures that were copied. If there are
errors, the return value is -1.
The following sections provide examples of the type of data returned and code using each of the
interfaces.
perfstat_wpar_total Interface
The perfstat_wpar_total interface returns a set of structures of type perfstat_wpar_total_t, which is
defined in the libperfstat.h file.
Several other WPAR-related metrics (such as the number of system calls, reads, writes, forks, and execs,
and the load average) are also returned. For a complete list of WPAR-related metrics, see the
perfstat_wpar_total_t section in the libperfstat.h header file in Files Reference.
The following program emulates wparstat behavior and also shows an example of how
perfstat_wpar_total is used from the global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_wpar_total_t *winfo;
perfstat_id_wpar_t wparid;
int tot, rc, i;
if (tot < 0) {
perror("Error in perfstat_wpar_total");
exit(-1);
}
if (tot == 0) {
printf("No WPARs found in the system\n");
exit(-1);
}
if (rc < 0) {
perror("Error in perfstat_wpar_total");
exit(-1);
}
for(i=0;i<tot;i++){
printf("Name of the Workload Partition=%s\n",winfo[i].name);
printf("Workload partition identifier=%u\n",winfo[i].wpar_id);
printf("Number of Virtual CPUs in partition rset=%d\n",winfo[i].online_cpus);
printf("Amount of memory currently online in Global Partition=%lld\n",winfo[i].online_memory);
printf("Number of processor units this partition is entitled to receive=%d\n",winfo[i].entitled_proc_capacity);
printf("\n");
}
return(0);
}
The program displays an output that is similar to the following example output:
Name of the Workload Partition=test
Workload partition identifier=1
Number of Virtual CPUs in partition rset=2
Amount of memory currently online in Global Partition=4096
Number of processor units this partition is entitled to receive=100
The following code shows an example of how perfstat_wpar_total is used from the WPAR environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_wpar_total_t *winfo;
perfstat_id_wpar_t wparid;
int tot, rc, i;
if (tot < 0) {
perror("Error in perfstat_wpar_total");
exit(-1);
}
if (tot == 0) {
printf("No WPARs found in the system\n");
exit(-1);
}
for(i=0;i<tot;i++){
printf("Name of the Workload Partition=%s\n",winfo[i].name);
printf("Workload partition identifier=%u\n",winfo[i].wpar_id);
printf("Number of Virtual CPUs in partition rset=%d\n",winfo[i].online_cpus);
printf("Amount of memory currently online in Global Partition=%lld\n",winfo[i].online_memory);
printf("Number of processor units this partition is entitled to receive=%d\n",winfo[i].entitled_proc_capacity);
printf("\n");
}
return(0);
}
perfstat_cpu_total_wpar Interface
The perfstat_cpu_total_wpar interface returns a set of structures of type perfstat_cpu_total_wpar_t,
which is defined in the libperfstat.h file.
Several other processor-related metrics (such as the number of system calls, reads, writes, forks, and
execs, and the load average) are also returned. For a complete list of processor-related metrics, see the
perfstat_cpu_total_wpar_t section in the libperfstat.h header file.
The following program emulates wparstat behavior and also shows an example of how
perfstat_cpu_total_wpar_t is used from the global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_cpu_total_wpar_t *cpustats;
perfstat_id_wpar_t wparid;
perfstat_wpar_total_t *winfo;
int i,j,rc,totwpars;
perror("Error in perfstat_wpar_total");
exit(-1);
}
if (totwpars == 0) {
printf("No WPARs found in the system\n");
exit(-1);
}
if (rc <= 0) {
perror("Error in perfstat_wpar_total");
exit(-1);
}
cpustats=calloc(1,sizeof(perfstat_cpu_total_wpar_t));
rc = perfstat_cpu_total_wpar(&wparid, cpustats, sizeof(perfstat_cpu_total_wpar_t), 1);
if (rc != 1) {
perror("perfstat_cpu_total_wpar");
exit(-1);
}
for(j=0;j<rc;j++){
printf("Number of active logical processors in Global=%d\n",cpustats[j].ncpus);
printf("Processor description=%s\n",cpustats[j].description);
printf("Processor speed in Hz=%lld\n",cpustats[j].processorHZ);
printf("Number of process switches=%lld\n",cpustats[j].pswitch);
printf("Number of forks system calls executed=%lld\n",cpustats[j].sysfork);
printf("Length of the run queue=%lld\n",cpustats[j].runque);
printf("Length of the swap queue=%lld\n",cpustats[j].swpque);
}
}
}
The program displays an output that is similar to the following example output:
Number of active logical processors in Global=8
Processor description=PowerPC_POWER7
Processor speed in Hz=3304000000
Number of process switches=1995
Number of forks system calls executed=322
Length of the run queue=3
Length of the swap queue=1
The following code shows an example of how perfstat_cpu_total_wpar is used from the WPAR
environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_cpu_total_wpar_t *cpustats;
perfstat_id_wpar_t wparid;
perfstat_wpar_total_t *winfo;
int i,j,rc,totwpars;
perror("Error in perfstat_wpar_total");
exit(-1);
}
if (totwpars == 0) {
printf("No WPARs found in the system\n");
exit(-1);
}
if (rc <= 0) {
perror("Error in perfstat_wpar_total");
exit(-1);
}
cpustats=calloc(1,sizeof(perfstat_cpu_total_wpar_t));
rc = perfstat_cpu_total_wpar(NULL, cpustats, sizeof(perfstat_cpu_total_wpar_t), 1);
if (rc != 1) {
perror("perfstat_cpu_total_wpar");
exit(-1);
}
for(j=0;j<rc;j++){
printf("Number of active logical processors in Global=%d\n",cpustats[j].ncpus);
printf("Processor description=%s\n",cpustats[j].description);
printf("Processor speed in Hz=%lld\n",cpustats[j].processorHZ);
printf("Number of process switches=%lld\n",cpustats[j].pswitch);
printf("Number of forks system calls executed=%lld\n",cpustats[j].sysfork);
printf("Length of the run queue=%lld\n",cpustats[j].runque);
printf("Length of the swap queue=%lld\n",cpustats[j].swpque);
}
}
}
perfstat_memory_total_wpar Interface
The perfstat_memory_total_wpar interface returns a set of structures of type
perfstat_memory_total_wpar_t, which is defined in the libperfstat.h file.
Several other memory-related metrics (such as the amount of real memory that is free, pinned, or in use,
and counts of paging activity) are also returned. For a complete list of memory-related metrics, see the
perfstat_memory_total_wpar_t section in the libperfstat.h header file.
The following program emulates wparstat behavior and also shows an example of how
perfstat_memory_total_wpar is used from the global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
perror("Error in perfstat_wpar_total");
exit(-1);
}
for(i=0; i < totwpars; i++)
{
bzero(&wparid, sizeof(perfstat_id_wpar_t));
wparid.spec = WPARID;
wparid.u.wpar_id = winfo[i].wpar_id;
memstats=calloc(1,sizeof(perfstat_memory_total_wpar_t));
rc = perfstat_memory_total_wpar(&wparid, memstats, sizeof(perfstat_memory_total_wpar_t), 1);
if (rc != 1) {
perror("perfstat_memory_total_wpar");
exit(-1);
}
for(j=0;j<rc;j++){
printf("Global total real memory=%lld\n",memstats[j].real_total);
printf("Global free real memory=%lld\n",memstats[j].real_free);
printf("Real memory which is pinned=%lld\n",memstats[j].real_pinned);
printf("Real memory which is in use=%lld\n",memstats[j].real_inuse);
printf("Number of page faults=%lld\n",memstats[j].pgexct);
printf("Number of pages paged in=%lld\n",memstats[j].pgins);
printf("Number of pages paged out=%lld\n",memstats[j].pgouts);
}
}
}
The following code shows an example of how perfstat_memory_total_wpar is used from the WPAR
environment:
perror("Error in perfstat_wpar_total");
exit(-1);
}
for(i=0; i < totwpars; i++)
{
bzero(&wparid, sizeof(perfstat_id_wpar_t));
wparid.spec = WPARID;
wparid.u.wpar_id = winfo[i].wpar_id;
memstats=calloc(1,sizeof(perfstat_memory_total_wpar_t));
rc = perfstat_memory_total_wpar(NULL, memstats, sizeof(perfstat_memory_total_wpar_t), 1);
if (rc != 1) {
perror("perfstat_memory_total_wpar");
exit(-1);
}
for(j=0;j<rc;j++){
printf("Global total real memory=%lld\n",memstats[j].real_total);
printf("Global free real memory=%lld\n",memstats[j].real_free);
printf("Real memory which is pinned=%lld\n",memstats[j].real_pinned);
printf("Real memory which is in use=%lld\n",memstats[j].real_inuse);
printf("Number of page faults=%lld\n",memstats[j].pgexct);
printf("Number of pages paged in=%lld\n",memstats[j].pgins);
printf("Number of pages paged out=%lld\n",memstats[j].pgouts);
}
}
}
perfstat_memory_page_wpar Interface
The perfstat_memory_page_wpar interface returns a set of structures of type
perfstat_memory_page_wpar_t, which is defined in the libperfstat.h file.
Several other memory-page-related metrics (such as the number of pages paged in, paged out, scanned,
and stolen for each page size) are also returned. For a complete list of memory-page-related metrics, see
the perfstat_memory_page_wpar_t section in the libperfstat.h header file.
The following program emulates vmstat behavior and also shows an example of how
perfstat_memory_page_wpar is used from the global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
int i, psizes, rc;
perfstat_memory_page_wpar_t *pageinfo;
perfstat_id_wpar_t wparid;
wparid.spec = WPARNAME;
strcpy(wparid.u.wparname,"test");
perfstat_psize_t psize;
psize.psize = FIRST_PSIZE;
/* Get the number of page sizes */
psizes = perfstat_memory_page_wpar(&wparid, NULL, NULL, sizeof(perfstat_memory_page_wpar_t),0);
/*check for error */
if (psizes <= 0 ){
perror("perfstat_memory_page_wpar ");
exit(-1);
}
for(i=0;i<psizes;i++){
printf("Page size in bytes=%lld\n",pageinfo[i].psize);
printf("Number of real memory frames of this page size=%lld\n",pageinfo[i].real_total);
printf("Number of pages pinned=%lld\n",pageinfo[i].real_pinned);
printf("Number of pages in use=%lld\n",pageinfo[i].real_inuse);
printf("Number of page faults=%lld\n",pageinfo[i].pgexct);
printf("Number of pages paged in=%lld\n",pageinfo[i].pgins);
printf("Number of pages paged out=%lld\n",pageinfo[i].pgouts);
printf("Number of page ins from paging space=%lld\n",pageinfo[i].pgspins);
printf("Number of page outs from paging space=%lld\n",pageinfo[i].pgspouts);
The following code shows an example of how perfstat_memory_page_wpar is used from the WPAR
environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
int i, psizes, rc;
perfstat_memory_page_wpar_t *pageinfo;
perfstat_id_wpar_t wparid;
perfstat_psize_t psize;
psize.psize = FIRST_PSIZE;
/* Get the number of page sizes */
for(i=0;i<psizes;i++){
printf("Page size in bytes=%lld\n",pageinfo[i].psize);
printf("Number of real memory frames of this page size=%lld\n",pageinfo[i].real_total);
printf("Number of pages pinned=%lld\n",pageinfo[i].real_pinned);
printf("Number of pages in use=%lld\n",pageinfo[i].real_inuse);
printf("Number of page faults=%lld\n",pageinfo[i].pgexct);
printf("Number of pages paged in=%lld\n",pageinfo[i].pgins);
printf("Number of pages paged out=%lld\n",pageinfo[i].pgouts);
printf("Number of page ins from paging space=%lld\n",pageinfo[i].pgspins);
printf("Number of page outs from paging space=%lld\n",pageinfo[i].pgspouts);
printf("Number of page scans by clock=%lld\n",pageinfo[i].scans);
printf("Number of page steals=%lld\n",pageinfo[i].pgsteals);
}
}
RSET Interfaces
The RSET interface reports processor metrics related to an RSET.
All of the following AIX 6.1 RSET interfaces use the naming convention perfstat_subsystem[_total]_rset,
and use a common signature:
Item Descriptor
perfstat_cpu_total_rset Retrieves processor summary metrics of the processors in an RSET
perfstat_cpu_rset Retrieves per processor metrics of the processors in an RSET
int perfstat_cpu_rset(perfstat_id_wpar_t *name,
    perfstat_cpu_t *userbuff,
    int sizeof_struct,
    int desired_number);
int perfstat_cpu_total_rset(perfstat_id_wpar_t *name,
    perfstat_cpu_total_t *userbuff,
    int sizeof_struct,
    int desired_number);
On successful completion, the return value is the number of structures that were copied; if there are
errors, the return value is -1. The name field is set either to NULL or to the name of the next structure
available. As an exception, when name=NULL, userbuff=NULL, and desired_number=0, the total
number of structures available is returned.
To retrieve all structures of a given type, either first ask for their number, allocate enough memory to
hold them all, and then call the appropriate API to retrieve them in one call; or allocate a fixed set of
structures and call the API repeatedly to get the next batch, each time passing the name returned by the
previous call. Start the process with the name set to "" or FIRST_CPU, and repeat until the name
returned is equal to "".
The following sections provide examples of the type of data returned and code using each of the
interfaces.
perfstat_cpu_rset interface
The perfstat_cpu_rset interface returns a set of structures of type perfstat_cpu_t, which is defined in the
libperfstat.h file.
Several other processor-related metrics (such as the number of forks, reads, writes, and execs) are also
returned. For a complete list of processor-related metrics, see the perfstat_cpu_t section in the
libperfstat.h header file.
The following code shows an example of how perfstat_cpu_rset is used from the global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
int i, retcode, rsetcpus;
if (rsetcpus < 0 ){
perror("perfstat_cpu_rset");
exit(-1);
}
if(!statp){
perror("calloc");
exit(-1);
}
for(i=0;i<retcode;i++){
printf("Logical processor name=%s\n",statp[i].name);
printf("Raw number of clock ticks spent in user mode=%lld\n",statp[i].user);
printf("Raw number of clock ticks spent in system mode=%lld\n",statp[i].sys);
printf("Raw number of clock ticks spent in idle mode=%lld\n",statp[i].idle);
printf("Raw number of clock ticks spent in wait mode=%lld\n",statp[i].wait);
}
return 0;
}
The program displays an output that is similar to the following example output:
Logical processor name=cpu0
Raw number of clock ticks spent in user mode=2050
Raw number of clock ticks spent in system mode=22381
Raw number of clock ticks spent in idle mode=6863114
Raw number of clock ticks spent in wait mode=3002
Logical processor name=cpu1
Raw number of clock ticks spent in user mode=10
Raw number of clock ticks spent in system mode=651
Raw number of clock ticks spent in idle mode=6876627
Raw number of clock ticks spent in wait mode=42
Logical processor name=cpu2
Raw number of clock ticks spent in user mode=0
Raw number of clock ticks spent in system mode=610
Raw number of clock ticks spent in idle mode=6876712
Raw number of clock ticks spent in wait mode=0
Logical processor name=cpu3
Raw number of clock ticks spent in user mode=0
Raw number of clock ticks spent in system mode=710
Raw number of clock ticks spent in idle mode=6876612
Raw number of clock ticks spent in wait mode=0
Logical processor name=cpu4
Raw number of clock ticks spent in user mode=243
Raw number of clock ticks spent in system mode=1659
Raw number of clock ticks spent in idle mode=6875427
The following code shows an example of how perfstat_cpu_rset is used from the WPAR environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
int i, retcode, rsetcpus;
perfstat_cpu_t *statp;
/* From within a WPAR, the name parameter is NULL */
rsetcpus = perfstat_cpu_rset(NULL, NULL, sizeof(perfstat_cpu_t), 0);
if (rsetcpus < 0 ){
perror("perfstat_cpu_rset");
exit(-1);
}
statp = calloc(rsetcpus, sizeof(perfstat_cpu_t));
if(!statp){
perror("calloc");
exit(-1);
}
retcode = perfstat_cpu_rset(NULL, statp, sizeof(perfstat_cpu_t), rsetcpus);
if (retcode < 0){
perror("perfstat_cpu_rset");
exit(-1);
}
for(i=0;i<retcode;i++){
printf("Logical processor name=%s\n",statp[i].name);
printf("Raw number of clock ticks spent in user mode=%lld\n",statp[i].user);
printf("Raw number of clock ticks spent in system mode=%lld\n",statp[i].sys);
printf("Raw number of clock ticks spent in idle mode=%lld\n",statp[i].idle);
printf("Raw number of clock ticks spent in wait mode=%lld\n",statp[i].wait);
}
return 0;
}
perfstat_cpu_total_rset interface
The perfstat_cpu_total_rset interface returns a set of structures of type perfstat_cpu_total_t, which is
defined in the libperfstat.h file.
Several other processor-related metrics (such as the number of forks, reads, writes, and execs) are also
returned. For a complete list of these metrics, see the perfstat_cpu_total_t section in the libperfstat.h
header file.
The following code shows an example of how the perfstat_cpu_total_rset interface is used from the
global environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_cpu_total_t *cpustats;
perfstat_id_wpar_t wparid;
int rc,i;
wparid.spec = WPARNAME;
rc = perfstat_cpu_total_rset(NULL,NULL,sizeof(perfstat_cpu_total_t),0);
if (rc <= 0) {
perror("perfstat_cpu_total_rset");
exit(-1);
}
cpustats=calloc(rc,sizeof(perfstat_cpu_total_t));
if(cpustats==NULL){
perror("MALLOC error:");
exit(-1);
}
strcpy(wparid.u.wparname,"test");
rc = perfstat_cpu_total_rset(&wparid, cpustats, sizeof(perfstat_cpu_total_t), rc);
if (rc <= 0) {
perror("perfstat_cpu_total_rset");
exit(-1);
}
for(i=0;i<rc;i++){
printf("Number of active logical processors=%d\n",cpustats[i].ncpus);
printf("Number of configured processors=%d\n",cpustats[i].ncpus_cfg);
printf("Processor description=%s\n",cpustats[i].description);
printf("Processor speed in Hz=%lld\n",cpustats[i].processorHZ);
printf("Raw total number of clock ticks spent in user mode=%lld\n",cpustats[i].user);
printf("Raw total number of clock ticks spent in system mode=%lld\n",cpustats[i].sys);
printf("Raw total number of clock ticks spent idle=%lld\n",cpustats[i].idle);
printf("Raw total number of clock ticks spent wait=%lld\n",cpustats[i].wait);
}
return 0;
}
The following code shows an example of how perfstat_cpu_total_rset is used from the WPAR
environment:
#include <stdio.h>
#include <stdlib.h>
#include <libperfstat.h>
int main(){
perfstat_cpu_total_t *cpustats;
perfstat_id_wpar_t wparid;
int rc,i;
rc = perfstat_cpu_total_rset(NULL,NULL,sizeof(perfstat_cpu_total_t),0);
if (rc <= 0) {
perror("perfstat_cpu_total_rset");
exit(-1);
}
cpustats=calloc(rc,sizeof(perfstat_cpu_total_t));
if(cpustats==NULL){
perror("MALLOC error:");
exit(-1);
}
/* From within a WPAR, the name parameter is NULL */
rc = perfstat_cpu_total_rset(NULL, cpustats, sizeof(perfstat_cpu_total_t), rc);
if (rc <= 0) {
perror("perfstat_cpu_total_rset");
exit(-1);
}
for(i=0;i<rc;i++){
printf("Number of active logical processors=%d\n",cpustats[i].ncpus);
printf("Number of configured processors=%d\n",cpustats[i].ncpus_cfg);
printf("Processor description=%s\n",cpustats[i].description);
printf("Processor speed in Hz=%lld\n",cpustats[i].processorHZ);
printf("Raw total number of clock ticks spent in user mode=%lld\n",cpustats[i].user);
printf("Raw total number of clock ticks spent in system mode=%lld\n",cpustats[i].sys);
printf("Raw total number of clock ticks spent idle=%lld\n",cpustats[i].idle);
printf("Raw total number of clock ticks spent wait=%lld\n",cpustats[i].wait);
}
return 0;
}
You can use the perfstat_partial_reset interface to refresh the cached metrics. Its parameters are used as follows:
Parameter Usage
char *name Identifies the name of the component of the cached metric that must be reset
from the libperfstat API cache. If the value of the parameter is NULL, this
signifies all of the components.
u_longlong_t resetmask Identifies the category of the component if the value of the name parameter is
not NULL. The possible values are:
v FLUSH_CPUTOTAL
v FLUSH_DISK
v RESET_DISK_MINMAX
v FLUSH_DISKADAPTER
v FLUSH_DISKPATH
v FLUSH_NETINTERFACE
v FLUSH_PAGINGSPACE
v FLUSH_LOGICALVOLUME
v FLUSH_VOLUMEGROUP
If the value of the name parameter is NULL, the resetmask parameter value
consists of a combination of values. For example:
RESET_DISK_MINMAX|FLUSH_CPUTOTAL|FLUSH_DISK
The perfstat_partial_reset interface can also reset the system's minimum and maximum counters related
to disks and paths. The following table summarizes the various actions of the perfstat_partial_reset
interface:
FLUSH_CPUTOTAL
v When the value of name is NULL: Flushes the speed and description values in the perfstat_cputotal_t structure.
v When the value of name is not NULL and a single resetmask value is set: Error. The value of errno is set to EINVAL.
FLUSH_DISK
v When the value of name is NULL: Flushes the description, adapter, size, free, and vgname values in every perfstat_disk_t structure. Flushes the list of disk adapters. Flushes the size, free, and description values in every perfstat_diskadapter_t structure.
v When the value of name is not NULL: Flushes the description, adapter, size, free, and vgname values in the specified perfstat_disk_t structure. Flushes the adapter value in every perfstat_diskpath_t structure that matches the disk name followed by the _Path identifier. Flushes the size, free, and description values of each perfstat_diskadapter_t structure that is linked to a path leading to the disk or to the disk itself.
RESET_DISK_MINMAX
v When the value of name is NULL: Resets the wq_min_time, wq_max_time, min_rserv, max_rserv, min_wserv, and max_wserv values in every perfstat_diskadapter_t structure.
v When the value of name is not NULL: Error. The value of errno is set to ENOTSUP.
FLUSH_DISKADAPTER
v In both cases: Flushes the list of disk adapters. Flushes the size, free, and description values in every perfstat_diskadapter_t structure. Flushes the adapter value in every perfstat_diskpath_t structure. Flushes the description and adapter values in every perfstat_disk_t structure.
FLUSH_DISKPATH
v When the value of name is NULL: Flushes the adapter value in every perfstat_diskpath_t structure.
v When the value of name is not NULL: Flushes the adapter value in the specified perfstat_diskpath_t structure.
FLUSH_PAGINGSPACE
v When the value of name is NULL: Flushes the list of paging spaces. Flushes the automatic, type, lpsize, mbsize, hostname, filename, and vgname values in every perfstat_pagingspace_t structure.
v When the value of name is not NULL: Flushes the list of paging spaces. Flushes the automatic, type, lpsize, mbsize, hostname, filename, and vgname values in the specified perfstat_pagingspace_t structure.
FLUSH_NETINTERFACE
v When the value of name is NULL: Flushes the description value in every perfstat_netinterface_t structure.
v When the value of name is not NULL: Flushes the description value in the specified perfstat_netinterface_t structure.
FLUSH_LOGICALVOLUME
v In both cases: Flushes the description value in every perfstat_logicalvolume_t structure.
FLUSH_VOLUMEGROUP
v In both cases: Flushes the description value in every perfstat_volumegroup_t structure.
You can see how to use the perfstat_partial_reset interface in the following example code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libperfstat.h>
int main(){
int i, retcode;
perfstat_id_t first;
perfstat_disk_t *statp;
/* Get the number of disks, then allocate and fill the disk structures */
retcode = perfstat_disk(NULL, NULL, sizeof(perfstat_disk_t), 0);
statp = calloc(retcode, sizeof(perfstat_disk_t));
strcpy(first.name, FIRST_DISK);
retcode = perfstat_disk(&first, statp, sizeof(perfstat_disk_t), retcode);
/* At this point, we assume the disk free part changes due to chfs for example */
/* if we get disk metrics here, the free field will be wrong as it was
 * cached by the libperfstat.
 */
/* Flush the cached disk metrics, then fetch fresh values */
perfstat_partial_reset(NULL, FLUSH_DISK);
strcpy(first.name, FIRST_DISK);
retcode = perfstat_disk(&first, statp, sizeof(perfstat_disk_t), retcode);
for(i=0;i<retcode;i++){
printf("Name of the disk=%s\n",statp[i].name);
printf("Disk description=%s\n",statp[i].description);
printf("Volume group name=%s\n",statp[i].vgname);
printf("Size of the disk=%lld\n",statp[i].size);
printf("Free portion of the disk=%lld\n",statp[i].free);
printf("Disk block size=%lld\n",statp[i].bsize);
}
return 0;
}
The program displays an output that is similar to the following example output:
Name of the disk=hdisk0
Disk description=Virtual SCSI Disk Drive
Volume group name=rootvg
Size of the disk=25568
Free portion of the disk=18752
Disk block size=512
Node interfaces
Node interfaces report metrics related to a set of components or individual components of a remote node
in the cluster. The components include processors or memory, and individual components include a
processor, network interface, or memory page of the remote node in the cluster.
The remote node must belong to one of the clusters of the current node, which uses the perfstat API.
The node interfaces use perfstat_subsystem_node as their naming convention and share a common
signature. The following signature is used by each perfstat_subsystem_node interface except the
perfstat_memory_page_node interface:
int perfstat_subsystem_node(perfstat_id_node_t *name,
perfstat_subsystem_t *userbuff,
int sizeof_struct,
int desired_number);
The following table describes the usage of the parameters of the perfstat_subsystem_node interface:
Item Descriptor
perfstat_id_node_t *name Specifies the name of the node in the name->u.nodename format. The name field must contain the name
of the first component. For example, hdisk2 for perfstat_disk_node(), where hdisk2 is the name of
the disk for which you require the statistics.
Note: When you specify a nodename, the spec field must be initialized to NODENAME.
perfstat_subsystem_t *userbuff Points to a memory area that has enough space for the returned structure.
int sizeof_struct Sets this parameter to the size of perfstat_subsystem_t.
int desired_number Specifies the number of structures of type perfstat_subsystem_t to return to a userbuff field.
The perfstat_subsystem_node interface returns -1 in case of error. Otherwise, it returns the number of
structures copied. The name field is set to the name of the next available structure. As an exceptional
case, when userbuff equals NULL and desired_number equals 0, the total number of structures available
is returned.
The following example code fragment shows the usage of the perfstat_disk_node interface to retrieve
the disk statistics of a remote node:
#define INTERVAL_DEFAULT 2
#define COUNT_DEFAULT 10
if(collect_remote_node_stats)
{
/* perfstat_config needs to be called to enable cluster statistics collection */
ret = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (ret == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
if(collect_remote_node_stats)
{
/* Remember nodename is already set */
/* Now set name to first interface */
strcpy(nodeid.name, FIRST_DISK);
}
if(collect_remote_node_stats) {
/* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
The program displays an output that is similar to the following example output:
Statistics for disk : hdisk0
----------------------------
description : Virtual SCSI Disk Drive
volume group name : rootvg
adapter name : vscsi0
size : 25568 MB
free space : 19616 MB
number of blocks read : 315130 blocks of 512 bytes
number of blocks written : 228352 blocks of 512 bytes
The following program shows the usage of the vmstat command and an example of using the
perfstat_memory_total_node interface to retrieve the virtual memory details of the remote node:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/param.h>
#include <libperfstat.h>
#define INTERVAL_DEFAULT 2
#define COUNT_DEFAULT 10
int main(int argc, char *argv[])
{
perfstat_memory_total_t minfo;
perfstat_id_node_t nodeid;
char nodename[MAXHOSTNAMELEN];
int collect_remote_node_stats = 0;
int rc;
/* nodename and collect_remote_node_stats are assumed to be set from the
 * command-line arguments (argument parsing is not shown here) */
if(collect_remote_node_stats)
{
/* perfstat_config needs to be called to enable cluster statistics collection */
rc = perfstat_config(PERFSTAT_ENABLE|PERFSTAT_CLUSTER_STATS, NULL);
if (rc == -1)
{
perror("cluster statistics collection is not available");
exit(-1);
}
}
if(collect_remote_node_stats)
{
strncpy(nodeid.u.nodename, nodename, MAXHOSTNAMELEN);
nodeid.spec = NODENAME;
rc = perfstat_memory_total_node(&nodeid, &minfo, sizeof(perfstat_memory_total_t), 1);
}
else
{
rc = perfstat_memory_total(NULL, &minfo, sizeof(perfstat_memory_total_t), 1);
}
if(collect_remote_node_stats) {
/* Now disable cluster statistics by calling perfstat_config */
perfstat_config(PERFSTAT_DISABLE|PERFSTAT_CLUSTER_STATS, NULL);
}
return 0;
}
The program displays an output that is similar to the following example output:
Memory statistics
-----------------
real memory size : 4096 MB
reserved paging space : 512 MB
virtual memory size : 4608 MB
number of free pages : 768401
number of pinned pages : 237429
number of pages in file cache : 21473
total paging space pages : 131072
free paging space pages : 128821
used paging space : 1.72%
number of paging space page ins : 0
number of paging space page outs : 0
number of page ins : 37301
number of page outs : 9692
For a complete list of parameters related to the perfstat_cluster_total_t structure, see the libperfstat.h
header file.
The following code example shows the usage of the perfstat_cluster_total interface:
#include <stdio.h>
#include <libperfstat.h>
typedef enum {
DISPLAY_DEFAULT = 0,
DISPLAY_NODE_DATA = 1,
DISPLAY_DISK_DATA = 2
} display_t;
The perfstat_node_list interface is used to retrieve the list of nodes in the perfstat_node_t structure,
which is defined in the libperfstat.h file. The following selected fields are from the perfstat_node_t
structure:
v nodeid
v nodename
The perfstat_cluster_disk interface is used to retrieve the list of disks in the perfstat_disk_data_t
structure. The perfstat_cluster_disk interface is defined in the libperfstat.h file.
The following example code shows the usage of the perfstat_cluster_disk subroutine:
typedef enum {
DISPLAY_NODE_DATA = 1,
DISPLAY_DISK_DATA = 2,
} display_t;
nodeid.spec = NODENAME;
/*Get the number of disks for that node */
num_of_disks = perfstat_cluster_disk(&nodeid,NULL, sizeof(perfstat_disk_data_t), 0);
if (num_of_disks == -1)
{
perror("perfstat_cluster_disk failed");
exit(-1);
}
Interface changes
Beginning with the following file sets, the rblks and wblks fields of libperfstat are expressed in blocks of
512 bytes in the perfstat_disk_total_t, perfstat_diskadapter_t, and perfstat_diskpath_t structures,
regardless of the actual block size used by the device for which metrics are being retrieved.
v bos.perf.libperfstat 4.3.3.4
v bos.perf.libperfstat 5.1.0.50
v bos.perf.libperfstat 5.2.0.10
Interface additions
Review the specific interfaces that are available for a fileset.
The following interfaces were added in the bos.perf.libperfstat 5.2.0 file set:
v perfstat_netbuffer
v perfstat_protocol
v perfstat_pagingspace
v perfstat_diskadapter
v perfstat_reset
The perfstat_diskpath interface was added in the bos.perf.libperfstat 5.2.0.10 file set.
The perfstat_partition_total interface was added in the bos.perf.libperfstat 5.3.0.0 file set.
The following interfaces were added in the bos.perf.libperfstat 6.1.2 file set:
v perfstat_cpu_total_wpar
v perfstat_memory_total_wpar
v perfstat_cpu_total_rset
v perfstat_cpu_rset
v perfstat_wpar_total
v perfstat_tape
v perfstat_tape_total
v perfstat_memory_page
Field additions
The following additions have been made to the specified file set levels.
The name field, which returns the logical processor name, is now of the form cpu0, cpu1, and so on,
instead of proc0, proc1 as it was in previous releases.
In addition, the xrate field in the following data structures has been renamed to _rxfers and contains the
number of read transactions when used with selected device drivers, or zero otherwise:
perfstat_disk_t
perfstat_disk_total_t
perfstat_diskadapter_t
perfstat_diskpath_t
Structure additions
Review the specific structure additions that are available for different file sets.
The following structures are added in the bos.perf.libperfstat 6.1.2.0 file set:
perfstat_cpu_total_wpar_t
perfstat_cpu_total_rset_t
perfstat_cpu_rset_t
perfstat_wpar_total_t
perfstat_tape_t
perfstat_tape_total_t
perfstat_memory_page_t
perfstat_memory_page_wpar_t
perfstat_logicalvolume_t
perfstat_volumegroup_t
The following structures are added in the bos.perf.libperfstat 6.1.6.0 file set:
perfstat_id_node_t
perfstat_node_t
perfstat_cluster_total_t
perfstat_cluster_type_t
perfstat_node_data_t
perfstat_disk_data_t
perfstat_disk_status_t
perfstat_ip_addr_t
The following structures are added in the bos.perf.libperfstat 6.1.7.0 file set:
Kernel tuning
You can make permanent kernel-tuning changes without having to edit any rc files. This is achieved by
centralizing the reboot values for all tunable parameters in the /etc/tunables/nextboot stanza file. When a
system is rebooted, the values in the /etc/tunables/nextboot file are automatically applied.
The following commands are used to manipulate the nextboot file and other files containing a set of
tunable parameter values:
v The tunchange command is used to change values in a stanza file.
v The tunsave command is used to save values to a stanza file.
v The tunrestore command is used to apply a file; that is, to change all tunable parameter values to
those listed in a file.
v The tuncheck command must be used to validate a file created manually.
v The tundefault command is available to reset tunable parameters to their default values.
The preceding commands work on both current and reboot values.
All six tuning commands (no, nfso, vmo, ioo, raso, and schedo) use a common syntax and are available
to directly manipulate the tunable parameter values. Available options include making permanent
changes and displaying detailed help on each of the parameters that the command manages. A large
majority of tunable parameter values are not modifiable when the login session is initiated outside of the
global WPAR partition. Attempts to modify such a read-only tunable parameter value are refused by the
command, and a diagnostic message is written to standard error output.
A SMIT panel is also available to manipulate the current and reboot values for all tuning parameters, as
well as the files in the /etc/tunables directory.
Related information:
bosboot command
no command
tunables command
Most of the information in this section does not apply to compatibility mode. For more information, see
compatibility mode in Files Reference.
When a machine is initially installed with AIX, it is automatically set to run in the tuning mode that is
described in this chapter. The tuning mode is controlled by the sys0 attribute called pre520tune, which
can be set to enable to run in compatibility mode, or to disable to run in the tuning mode.
To retrieve the current setting of the pre520tune attribute, run the following command:
lsattr -E -l sys0
To change the current setting of the pre520tune attribute, run one of the following commands:
chdev -l sys0 -a pre520tune=enable
OR
chdev -l sys0 -a pre520tune=disable
The files in the /etc/tunables directory contain parameter=value pairs specifying tunable parameter
changes, classified into six stanzas corresponding to the six tuning commands: schedo, vmo, ioo, no,
raso, and nfso. Additional information about the level of AIX, when the file was created, and a
user-provided description of file usage is stored in a special stanza in the file. For detailed information
about the file format, see the tunables file.
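As an illustration, a file following this layout might contain stanzas such as the following sketch; the parameter values shown are examples only, and the exact fields of the info stanza are described in the tunables file format:

```
info:
	AIX_level = "7.2.0.0"
	Kernel_type = "MP64"

schedo:
	pacefork = "10"

vmo:
	minfree = "DEFAULT"
```

Each stanza name matches the tuning command that manages its parameters, and a parameter set to DEFAULT is enforced at its default value during the reboot tuning procedure.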
The main file in the tunables directory is called nextboot. It contains all the tunable parameter values to
be applied at the next reboot. The lastboot file in the tunables directory contains all the tunable values
that were set at the last machine reboot, a timestamp for the last reboot, and checksum information about
the matching lastboot.log file, which is used to log any changes made, or any error messages
encountered, during the last rebooting. The lastboot and lastboot.log files are set to be read-only and are
owned by the root user, as are the directory and all of the other files.
Users can create as many /etc/tunables files as needed, but only the nextboot file is ever automatically
applied. Manually created files must be validated using the tuncheck command. Parameters and stanzas
can be missing from a file. Only tunable parameters present in the file will be changed when the file is
applied with the tunrestore command. Missing tunables will simply be left at their current or default
values. To force a tunable to be reset to its default value with tunrestore, DEFAULT can be specified
(presumably to force other tunables to known values; otherwise the tundefault command, which sets all
parameters to their default values, could have been used). Specifying DEFAULT for a tunable in the
nextboot file is the same as not listing it in the file at all, because the reboot tuning procedure enforces
default values for missing parameters. This guarantees that all tunable parameters are set to the values
specified in the nextboot file after each reboot.
Tunable files can have a special stanza named info, containing the parameters AIX_level, Kernel_type,
and Last_validation. Those parameters are automatically set to the level of AIX and to the type of kernel
(MP64) running when the tuncheck or tunsave command is run on the file. Both commands
automatically update those fields; however, the tuncheck command updates them only if no error was
detected.
The lastboot file always contains values for every tunable parameter. Tunables set to their default value
will be marked with the comment DEFAULT VALUE. Restricted tunables modified from their default value
are marked, after the value, with an additional comment # RESTRICTED not at default value. The
Logfile_checksum parameter only exists in that file and is set by the tuning reboot process (which also
sets the rest of the info stanza) after closing the log file.
Tunable files can be created and modified using one of the following options:
v Using SMIT to modify the next reboot value for tunable parameters, or to ask to save all current values
for next boot, or to ask to use an existing tunable file at the next reboot. All those actions will update
the /etc/tunables/nextboot file. A new file in the /etc/tunables directory can also be created to save all
current or all nextboot values.
v Using the tuning commands (ioo, raso, vmo, schedo, no, or nfso) with the -p or -r options, which
update the /etc/tunables/nextboot file.
v A new file can also be created directly with an editor or copied from another machine. Running
tuncheck [-r | -p] -f must then be done on that file.
v Using the tunsave command to create or overwrite files in the /etc/tunables directory
v Using the tunrestore -r command to update the nextboot file.
All the tunable parameters manipulated by the tuning commands (no, nfso, vmo, ioo, raso, and schedo)
have been classified into the following categories:
v Dynamic: if the parameter can be changed at any time
v Static: if the parameter can never be changed
v Reboot: if the parameter can only be changed during reboot
v Bosboot: if the parameter can only be changed by running bosboot and rebooting the machine
v Mount: if changes to the parameter are only effective for future file systems or directory mounts
v Incremental: if the parameter can only be incremented, except at boot time
v Connect: if changes to the parameter are only effective for future socket connections
v Deprecated: if changing this parameter is no longer supported by the current release of AIX
For parameters of type Bosboot, whenever a change is performed, the tuning commands automatically
prompt the user to ask if they want to execute the bosboot command. When specifying a restricted
tunable for modification in association with the option -p or -r, you are also prompted to confirm the
change. For parameters of type Connect, the tuning commands automatically restart the inetd daemon.
The tunables classified as restricted-use tunables exist primarily for specialized intervention by the
support or development teams and are not recommended for end-user modification. For this reason, they
are not displayed by default and require the force option (-F) on the command line. When a restricted
tunable is modified, a warning message is displayed, and confirmation is required if the change is
specified for reboot or is permanent.
The no, nfso, vmo, ioo, raso, and schedo tuning commands all support the following syntax:
command [-p|-r] {-o tunable[=newvalue]}
command [-p|-r] {-d tunable}
command [-p|-r] -D
command [-p|-r] [-F]-a
command -h [tunable]
command [-F] -L [tunable]
command [-F] -x [tunable]
When -r is used in combination without a new value, the nextboot value for tunable is
displayed. When -p is used in combination without a new value, a value is displayed only if the
current and next boot values for tunable are the same. Otherwise, NONE is displayed as the value.
If a tunable is not supported by the running kernel or the current platform, "n/a" is displayed as
the value.
-p When used in combination with -o, -d or -D, makes changes apply to both current and reboot
values; that is, turns on the updating of the /etc/tunables/nextboot file in addition to the
updating of the current value. This flag cannot be used on Reboot and Bosboot type parameters
because their current value cannot be changed.
When used with -a or -o flag without specifying a new value, values are displayed only if the
current and next boot values for a parameter are the same. Otherwise, NONE is displayed as the
value.
-r When used in combination with -o, -d or -D flags, makes changes apply to reboot values only;
that is, turns on the updating of the /etc/tunables/nextboot file. If any parameter of type
Bosboot is changed, the user will be prompted to run bosboot.
When used with -a or -o without specifying a new value, next boot values for tunables are
displayed instead of current values.
-x [tunable] Lists the characteristics of one or all tunables, one per line, using the following format:
tunable,current,default,reboot, min,max,unit,type,{dtunable }
where:
current = current value
default = default value
reboot = reboot value
min = minimal value
max = maximum value
unit = tunable unit of measure
type = parameter type: D(for Dynamic), S(for Static),
R(for Reboot), B(for Bosboot), M(for Mount),
I(for Incremental), C (for Connect), and
d (for Deprecated)
dtunable = space separated list of dependent tunable
parameters
-L [tunable] Lists the characteristics of one or all tunables, one per line, using the following format:
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
-----------------------------------------------------------------------
memory_frames 128K 128K 4KB pages S
-----------------------------------------------------------------------
maxfree 128 128 128 16 200K 4KB pages D
minfree
memory_frames
----------------------------------------------------------------------
where:
CUR =
current value
DEF =
default value
BOOT =
reboot value
MIN =
minimal value
MAX =
maximum value
UNIT =
tunable unit of measure
TYPE =
parameter type: D (for Dynamic),S (for Static),
R (for Reboot),B (for Bosboot),
M (for Mount), I (for Incremental),
C (for Connect), and d (for Deprecated)
DEPENDENCIES = list of dependent tunable parameters,
one per line
Any change (with -o, -d or -D flags) to a parameter of type Mount results in a message being displayed
to warn the user that the change is only effective for future mountings.
Any change (with -o, -d or -D flags) to a parameter of type Connect results in the inetd daemon being
restarted, and a message is displayed to warn the user that the change is only effective for future socket
connections.
Any attempt to change (with -o, -d or -D flags) a parameter of type Bosboot or Reboot without -r results
in an error message.
Any attempt to change (with -o, -d or -D flags, but without -r) the current value of a parameter of type
Incremental to a value smaller than the current value results in an error message.
To guarantee the consistency of their content, all the files are locked before any updates are made. The
commands tunsave, tuncheck (only if successful), and tundefault -r all update the info stanza.
tunchange Command
The tunchange command is used to update one or more tunable stanzas in a file.
The following is an example of how to update the pacefork parameter in the /etc/tunables/mytunable
file:
tunchange -f mytunable -t schedo -o pacefork=10
The following is an example of how to unconditionally update the pacefork parameter in the
/etc/tunables/nextboot file. This must be done with caution because no warning is printed if a
parameter of type Bosboot is changed.
tunchange -f nextboot -t schedo -o pacefork=10
The following is an example of how to clear the schedo stanza in the nextboot file.
tunchange -f nextboot -t schedo -D
The following is an example of how to merge the /home/admin/schedo_conf file with the current
nextboot file. If the file to merge contains multiple entries for a parameter, only the first entry will be
applied. If both files contain an entry for the same tunable, the entry from the file to merge will replace
the current nextboot file's value.
tunchange -f nextboot -m /home/admin/schedo_conf
tuncheck Command
The tuncheck command is used to validate a file.
The following is an example of how to validate the /etc/tunables/mytunable file for usage on current
values.
tuncheck -f mytunable
The following is an example of how to validate the /etc/tunables/nextboot file or my_nextboot file for
usage during reboot. Note that the -r flag is the only valid option when the file to check is the nextboot
file.
tuncheck -r -f nextboot
tuncheck -r -f /home/bill/my_nextboot
All parameters in the nextboot or my_nextboot file are checked for range and dependencies. If a problem
is detected, a message similar to "Parameter X is out of range" or "Dependency problem between
parameter A and B" is issued. The -r and -p options control the values used in dependency checking for
parameters not listed in the file, and the handling of proposed changes to parameters of type
Incremental, Bosboot, and Reboot.
Except when used with the -r option, checking is performed on parameters of type Incremental to make
sure the values in the file are not less than the current values. If one or more parameters of type Bosboot
are listed in the file with values different from their current values, the user is either prompted to run
bosboot (when -r is used) or an error message is displayed.
Parameters having dependencies are checked for compatible values. When one or more parameters in a
set of interdependent parameters are not listed in the file being checked, their values are assumed to be
either their current values (when the tuncheck command is called without -p or -r) or their default
values. This is because when called without -r, the file is validated to be applicable on the current values,
while with -r, it is validated to be used during reboot when parameters not listed in the file will be left at
their default value. Calling this command with -p is the same as calling it twice; once with no argument,
and once with the -r flag. This checks whether a file can be used both immediately, and at reboot time.
Note: Users creating a file with an editor, or copying a file from another machine, must run the tuncheck
command to validate their file.
tunrestore Command
The tunrestore command is used to restore all the parameters from a file.
For example, either of the following commands changes the current values for all tunable parameters
present in the file if ranges, dependencies, and incremental parameter rules are all satisfied:
tunrestore -f mytunable
tunrestore -f /etc/tunables/mytunable
If changes to parameters of type Bosboot are detected, the user will be prompted to run the bosboot
command.
The following command can only be called from the /etc/inittab file and changes tunable parameters to
values from the /etc/tunables/nextboot file.
tunrestore -R
Any problem found or change made is logged in the /etc/tunables/lastboot.log file. A new
/etc/tunables/lastboot file is always created with the list of current values for all parameters. Any change
to restricted tunables from their default values will cause the addition of an error log entry identifying
the list of these modified tunables.
If filename does not exist, an error message is displayed. If the nextboot file does not exist, an error
message is displayed if -r was used. If -R was used, all the tuning parameters of a type other than
Bosboot are set to their default values, and a nextboot file containing only an info stanza is created. A
warning is also logged in the lastboot.log file.
Except when -r is used, parameters requiring a call to bosboot and a reboot are not changed, but an error
message is displayed to indicate they could not be changed. When -r is used, if any parameter of type
Bosboot needs to be changed, the user will be prompted to run bosboot. Parameters missing from the
file are simply left unchanged, except when -R is used, in which case missing parameters are set to their
default values. If the file contains multiple entries for a parameter, only the first entry will be applied,
and a warning will be displayed or logged (if called with -R).
tunsave Command
The tunsave command is used to save current tunable parameter values into a file.
For example, the following saves all of the current tunable parameter values that are different from their
default into the /etc/tunables/mytunable file.
tunsave -f mytunable
If the file already exists, an error message is printed instead. The -F flag must be used to overwrite an
existing file.
For example, the following saves all of the current tunable parameter values different from their default
into the /etc/tunables/nextboot file.
tunsave -f nextboot
If necessary, the tunsave command will prompt the user to run bosboot.
For example, the following saves all of the current tunable parameters values (including parameters for
which default is their value) into the mytunable file.
tunsave -A -f mytunable
This permits you to save the current setting. This setting can be reproduced at a later time, even if the
default values have changed (default values can change when the file is used on another machine or
when running another version of AIX).
For example, the following command also saves the current tunable parameter values into the
./mytunable file:
tunsave -a -f ./mytunable
However, for the parameters that are set to their default values, a line using the keyword DEFAULT is put
in the file instead of the numerical value. This essentially saves only the current changed values, while
forcing all the other parameters to their default values. This permits you to return to a known setup later
using the tunrestore command.
tundefault Command
The tundefault command is used to force all tuning parameters to be reset to their default value. The -p
flag makes changes permanent, while the -r flag defers changes until the next reboot.
For example, the following command resets all tunable parameters to their default values, except the
parameters of type Bosboot and Reboot, and the parameters of type Incremental that are set to values
larger than their default values.
tundefault
Error messages will be displayed for any parameter change that is not permitted.
For example, the following command resets all the tunable parameters to their default values. It also
updates the /etc/tunables/nextboot file, and if necessary, offers to run bosboot and displays a message
warning that a reboot is needed for all the changes to be effective.
tundefault -p
This command permanently resets all tunable parameters to their default values, returning the system to
a consistent state and making sure the state is preserved after the next reboot.
For example, the following command clears all the command stanzas in the /etc/tunables/nextboot file,
and proposes bosboot if necessary.
tundefault -r
Initial setup
Installing the bos.perf.tune fileset automatically creates an initial /etc/tunables/nextboot file.
When you install the bos.perf.tune fileset the following line is added at the beginning of the /etc/inittab
file:
tunable:23456789:wait:/usr/bin/tunrestore -R > /dev/console 2>&1
This entry sets the reboot value of all tunable parameters to their default. For more information about
migration from a previous version of AIX and the compatibility mode automatically setup in case of
migration, see the Files Reference guide.
Recovery Procedure
If the machine becomes unstable with a given nextboot file, users should put the system into
maintenance mode, make sure the sys0 pre520tune attribute is set to disable, delete the nextboot file, run
the bosboot command and reboot. This action will guarantee that all tunables are set to their default
value.
Select Save/Restore All Kernel & Network Parameters to manipulate all tuning parameter values at the
same time. To individually change tuning parameters managed by one of the tuning commands, select
any of the other lines.
The main panel to manipulate all tunable parameters by sets looks similar to the following:
Save/Restore All Kernel Tuning Parameters
Each of the options in this panel is explained in the following sections.
1. View Last Boot Parameters
All last boot parameters are listed stanza by stanza, retrieved from the /etc/tunables/lastboot file.
2. View Last Boot Log File
Displays the content of the /etc/tunables/lastboot.log file.
3. Save All Current Parameters for Next Boot
Save All Current Kernel Tuning Parameters for Next Boot
After selecting yes and pressing ENTER, all the current tuning parameter values are saved in the
/etc/tunables/nextboot file. Bosboot will be offered if necessary.
4. Save All Current Parameters
Save All Current Kernel Tuning Parameters
File name []
Description []
5. Restore All Current Parameters from Last Boot Values
After selecting yes and pressing ENTER, all the tuning parameters will be set to values from the
/etc/tunables/lastboot file. Error messages will be displayed if any parameter of type Bosboot or
Reboot would need to be changed, which can only be done when changing reboot values.
6. Restore All Current Parameters from Saved Values
A select menu shows the existing files in the /etc/tunables directory, except the nextboot, lastboot,
and lastboot.log files, which all have special purposes. After pressing ENTER, the parameters present
in the selected file will be set to the values listed, if possible. Error messages will be displayed if any
parameter of type Bosboot or Reboot would need to be changed, which cannot be done on the current
values. Error messages will also be displayed for any parameter of type Incremental when the value in
the file is smaller than the current value, and for out-of-range and incompatible values present in the
file. All possible changes will be made.
7. Reset All Current Parameters To Default Value
Reset All Current Kernel Tuning Parameters To Default Value
After pressing ENTER, each tunable parameter will be reset to its default value. Parameters of type
Bosboot and Reboot are never changed, but error messages are displayed if they would have to be
changed to get back to their default values.
8. Save All Next Boot Parameters
Save All Next Boot Kernel Tuning Parameters
File name []
Type or select a value for the entry field. Pressing F4 displays a list of existing files. This is the list
of all files in the /etc/tunables directory except the nextboot, lastboot, and lastboot.log files, which all
have special purposes. File names entered cannot be any of those three reserved names. After
pressing ENTER, the nextboot file is copied to the specified /etc/tunables file if it can be
successfully tunchecked.
9. Restore All Next Boot Parameters from Last Boot Values
Restore All Next Boot Kernel Tuning Parameters from Last Boot Values
After selecting yes and pressing ENTER, all values from the lastboot file will be copied to the
nextboot file. If necessary, the user will be prompted to run bosboot, and warned that for all the
changes to be effective, the machine must be rebooted.
10. Restore All Next Boot Parameters from Saved Values
Restore All Next Boot Kernel Tuning Parameters from Saved Values
A select menu shows existing files in the /etc/tunables directory, except the files nextboot, lastboot
and lastboot.log which all have special purposes. After selecting a file and pressing ENTER, all
the values from the selected file will be copied to the /etc/tunables/nextboot file.
11. Clear All Next Boot Parameters
After pressing ENTER, the /etc/tunables/nextboot file will be cleared. If necessary, bosboot will be
proposed and a message indicating that a reboot is needed will be displayed.
Here is the main panel to manipulate parameters managed by the schedo command:
Tuning Scheduler and Memory Load Control Parameters
The following table shows the interaction between parameter types and the different SMIT sub-panels:
The behavior of each sub-panel is explained in the following sections, using the scheduler and
memory load control sub-panels as examples:
1. List All Characteristics of Tuning Parameters
The output of schedo -L is displayed.
2. Change/Show Current Scheduler and Memory Load Control Parameters
[Entry Field]
affinity_lim [7]
idle_migration_barrier [4]
fixed_pri_global [0]
maxspin [1]
pacefork [10]
sched_D [16]
sched_R [16]
timeslice [1]
%usDelta [100]
v_exempt_secs [2]
v_min_process [2]
v_repage_hi [2]
v_repage_proc [6]
v_sec_wait [4]
This panel is initialized with the current schedo values (output from the schedo -a command). Any
parameter of type Bosboot, Reboot, or Static is displayed with no surrounding square brackets,
indicating that it cannot be changed. Type or select (from the F4 list) values for the entry fields
corresponding to the parameters to be changed. Clearing a value results in resetting the parameter to its
default value. The F4 list also shows the minimum, maximum, and default values, the unit of the
parameter, and its type. Pressing F1 displays the help associated with the selected parameter. The text
displayed will be identical to what is displayed by the tuning commands when called with the -h
option. Press ENTER after making all the required changes. Doing so will launch the schedo
command to make the changes. Any error message generated by the command, for values out of
range, incompatible values, or lower values for parameters of type Incremental, will be displayed to
the user.
3. The following is an example of the Change / Show Scheduler and Memory Load Control Parameters
for next boot panel.
Change / Show Scheduler and Memory Load Control Parameters for next boot
[Entry Field]
affinity_lim [7]
idle_migration_barrier [4]
fixed_pri_global [0]
maxspin [1]
pacefork [10]
sched_D [16]
sched_R [16]
timeslice [1]
%usDelta [100]
v_exempt_secs [2]
v_min_process [2]
v_repage_hi [2]
v_repage_proc [6]
v_sec_wait [4]
This panel is similar to the previous panel in that any parameter value can be changed, except for
parameters of type Static. It is initialized with the values listed in the /etc/tunables/nextboot file,
completed with default values for the parameters not listed in the file. Type or select (from the F4 list)
values for the entry field corresponding to the parameters to be changed. Clearing a value results in
resetting the parameter to its default value. The F4 list also shows minimum, maximum, and default
values, the unit of the parameter and its type. Pressing F1 displays the help associated with the
selected parameter. The text displayed will be identical to what is displayed by the tuning commands
when called with the -h option. Press ENTER after making all desired changes. Doing so will result in
the schedo command being called to make the changes in the /etc/tunables/nextboot file.
4. The following is an example of the Save Current Scheduler and Memory Load Control Parameters for
next boot panel.
After pressing ENTER on this panel, all the current schedo parameter values will be saved in the
/etc/tunables/nextboot file. If any parameter of type Bosboot needs to be changed, the user will be
prompted to run bosboot.
5. The following is an example of the Reset Current Scheduler and Memory Load Control Parameters to
Default Values
Reset Current Scheduler and Memory Load Control Parameters to Default Value
After selecting yes and pressing ENTER on this panel, all the tuning parameters managed by the
schedo command will be reset to their default value. If any parameter of type Incremental, Bosboot,
or Reboot should have been changed, an error message will be displayed instead.
6. The following is an example of the Reset Scheduler and Memory Load Control Next Boot Parameters
To Default Values
Reset Next Boot Parameters To Default Value
After pressing ENTER, the schedo stanza in the /etc/tunables/nextboot file will be cleared. This will
defer changes until next reboot. If necessary, bosboot will be proposed.
The procmon tool
The procmon tool enables you to view and manage the processes running on a system. The procmon tool
has a graphical interface and displays a table of process metrics that you can sort on the different fields
that are provided. The default number of processes listed in the table is 20, but you can change the value
in the Table Properties panel from the main menu. Only the top processes based on the sorting metric
are displayed and the default sorting key is CPU consumption.
The default value of the refresh rate for the table of process metrics is 5 seconds, but you can change the
refresh rate by either using the Table Properties panel in the main menu or by clicking on the Refresh
button.
You can choose other metrics to display from the Table Properties panel in the main menu. For more
information, see “The process table of the procmon tool.”
You can filter any of the processes that are displayed. For more information, see “Filtering processes” on
page 214.
You can also perform certain AIX performance commands on these processes. For more information, see
“Performing AIX commands on processes” on page 214.
The procmon tool is a Performance Workbench plugin, so you can only launch the procmon tool from
within the Performance Workbench framework. You must install the bos.perf.gtools fileset by either
using the smitty tool or the installp command. You can then access the Performance Workbench by
running the /usr/bin/perfwb script.
You can refresh the statistics data by either clicking on the Refresh button in the menu bar or by
activating the automatic refresh option through the menu bar. To save the statistics information, you can
export the table to any of the following file formats:
v XML
v HTML
v CSV
The default value of the number of processes listed in the process table is 20, but you can change this
value from the Table Properties panel from the main menu.
The yellow arrow key in the column header indicates the sort key for the process table. The arrow points
either up or down, depending on whether the sort order is ascending or descending, respectively. You
can change the sort key by clicking on any of the column headers.
You can customize the process table, modify the information on the various processes, and run
commands on the displayed processes. By default, the procmon tool displays the following columns:
Item Descriptor
PPID Parent process identifier
NICE Nice value for the process
PRI Priority of the process
DRSS Data resident set size
TRSS Text resident set size
STARTTIME Time when the command started
EUID Effective user identifier
RUID Real user identifier
EGID Effective group identifier
RGID Real group identifier
THCOUNT Number of threads used
CLASSID Identifier of the WLM class to which the process belongs
CLASSNAME Name of the WLM class to which the process belongs
TOTDISKIO Disk I/O for that process
NVCSW Number of voluntary context switches
NIVCSW Number of involuntary context switches
MINFLT Minor page faults
MAJFLT Major page faults
INBLK Input blocks
OUBLK Output blocks
MSGSEND Messages sent
MSGRECV Messages received
EGROUP Effective group name
RGROUP Real group name
You can use either the table properties or the preferences to display the metrics you are interested in. If
you choose to change the table properties, the new configuration values are set for the current session
only. If you change the preferences, the new configuration values are set for the next session of the
procmon tool.
Real values are retrieved from the kernel and displayed in the process table. Examples of real values are
the PID, PPID, and TTY.
Below the process table, there is another table that displays the sum of the values for each column of the
process table. For example, this table might provide a good idea of the percentage of total CPU used by
the top 20 CPU-consuming processes.
You can refresh the data by either clicking on the Refresh button in the menu bar or by activating the
automatic refresh option through the menu bar. To save the statistics information, you can export the
table to any of the following file formats:
v XML
v HTML
v CSV
Item Descriptor
Name WPAR name
Hostname WPAR hostname
Type WPAR type, either System or Application
State WPAR state, which can have one of the following values: Active, Defined,
Transitional, Broken, Paused, Loaded, Error
Directory WPAR root directory
Nb. virtual PIDs Number of virtual PIDs running in this WPAR
Filtering processes
You can filter processes based on the various criteria that are displayed in the process table. To create a
filter, select Table Filters from the menu bar. A new window opens and displays a list of filters.
You can run the following AIX commands on the processes you select in the process table:
v The svmon command
v The renice command
v The kill command
v The following proctools commands:
– The procfiles command
Profiling tools
You can use profiling tools to identify which portions of the program are executed most frequently or
where most of the time is spent.
Profiling tools are typically used after a basic tool, such as the vmstat or iostat commands, shows that a
CPU bottleneck is causing a performance problem.
Before you begin locating hot spots in your program, you need a fully functional program and realistic
data values.
The output from the time command is in minutes and seconds, as follows:
real 0m26.72s
user 0m26.53s
sys 0m0.03s
Comparing the user+sys CPU time to the real time will give you an idea if your application is
CPU-bound or I/O-bound.
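The comparison can be sketched with a small shell helper (the function name and the sample figures are illustrative; the figures are taken from the time output above):

```shell
# Compute the fraction of wall-clock time spent on the CPU.
# A value near 1.0 suggests a CPU-bound run; a much smaller value
# suggests the process spent most of its time waiting on I/O.
cpu_fraction() {
  # $1 = user seconds, $2 = sys seconds, $3 = real seconds
  awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.2f\n", (u + s) / r }'
}

cpu_fraction 26.53 0.03 26.72   # the sample above: prints 0.99 (CPU-bound)
```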
Note: Be careful when you do this on an SMP system. For more information, see time and timex
cautions.
The timex command is also available through the SMIT command on the Analysis Tools menu, found
under Performance and Resource Scheduling. The -p and -s options of the timex command enable data
from accounting (-p) and the sar command (-s) to be accessed and reported. The -o option reports on
blocks read or written.
To use the prof command, use the -p option to compile a source program in C, FORTRAN, or COBOL.
This inserts a special profiling startup function into the object file that calls the monitor() subroutine to
track function calls. When the program is executed, the monitor() subroutine creates a mon.out file to
track execution time. Therefore, only programs that explicitly exit or return from the main program cause
the mon.out file to be produced. Also, the -p flag causes the compiler to insert a call to the mcount()
subroutine into the object code generated for each recompiled function of your program. While the
program runs, each time a parent calls a child function, the child calls the mcount() subroutine to
increment a distinct counter for that parent-child pair. This counts the number of calls to a function.
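The bookkeeping that mcount() performs can be pictured with a toy shell model (illustrative only, not the actual mon.out format): each parent-child call is an event, and a distinct counter is kept per pair:

```shell
# Toy model of mcount() bookkeeping: one counter per parent-child pair.
# Input lines are "parent child" call events; output lists each pair
# with the number of times that parent called that child.
count_call_pairs() {
  awk '{ pair[$1 " -> " $2]++ } END { for (p in pair) print p, pair[p] }' | sort
}

printf '%s\n' "main mod8" "main mod8" "main mod9" | count_call_pairs
# main -> mod8 2
# main -> mod9 1
```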
Note: You cannot use the prof command for profiling optimized code.
By default, the displayed report is sorted by decreasing percentage of CPU time. This is the same as
when specifying the -t option.
The -c option sorts by decreasing number of calls and the -n option sorts alphabetically by symbol name.
If the -s option is used, a summary file mon.sum is produced. This is useful when more than one profile
file is specified with the -m option (the -m option specifies files containing monitor data).
The -z option includes all symbols, even if there are zero calls and time associated.
Other options are available and explained in the prof command in the Files Reference.
The following example shows the first part of the prof command output for a modified version of the
Whetstone benchmark (Double Precision) program.
# cc -o cwhet -p -lm cwhet.c
# cwhet > cwhet.out
# prof
Name %Time Seconds Cumsecs #Calls msec/call
.main 32.6 17.63 17.63 1 17630.
.__mcount 28.2 15.25 32.88
.mod8 16.3 8.82 41.70 8990000 0.0010
.mod9 9.9 5.38 47.08 6160000 0.0009
.cos 2.9 1.57 48.65 1920000 0.0008
.exp 2.4 1.32 49.97 930000 0.0014
.log 2.4 1.31 51.28 930000 0.0014
.mod3 1.9 1.01 52.29 140000 0.0072
.sin 1.2 0.63 52.92 640000 0.0010
.sqrt 1.1 0.59 53.51
.atan 1.1 0.57 54.08 640000 0.0009
.pout 0.0 0.00 54.08 10 0.0
.exit 0.0 0.00 54.08 1 0.
.free 0.0 0.00 54.08 2 0.
.free_y 0.0 0.00 54.08 2 0.
In this example, we see many calls to the mod8() and mod9() routines. As a starting point, examine the
source code to see why they are used so much. Another starting point could be to investigate why a
routine requires so much time.
Note: If the program you want to monitor uses a fork() system call, be aware that the parent and the
child create the same file (mon.out). To avoid this problem, change the current directory of the child
process.
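The collision can be pictured with a portable simulation (plain shell standing in for the profiled program; no actual profiling is involved): two writers targeting the same file name clobber each other unless the child changes directory first:

```shell
# Simulate the mon.out collision between a parent and a forked child.
# The file names stand in for prof's output files.
workdir=$(mktemp -d)
(
  cd "$workdir"
  mkdir child_dir
  # The child runs in its own directory, so its mon.out does not
  # clobber the parent's file of the same name.
  ( cd child_dir && echo "child profile" > mon.out )
  echo "parent profile" > mon.out
  cat mon.out            # parent data intact
  cat child_dir/mon.out  # child data kept separately
)
rm -rf "$workdir"
```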
The gprof command
The statistics of called subroutines are included in the profile of the calling program. The gprof command
is useful in identifying how a program consumes CPU resources. It is roughly a superset of the prof
command, giving additional information and providing more visibility to active sections of code.
To use the gprof command, compile and link the source program with the -pg option. This links in
versions of library routines compiled for profiling, and the gprof command reads the symbol table in the
named object file (a.out by default), correlating it with the call graph profile file (gmon.out by default).
This means that the compiler inserts a call to the mcount() function into the object code generated for
each recompiled function of your program. The mcount() function counts each time a parent calls a child
function. Also, the monitor() function is enabled to estimate the time spent in each routine.
Each report section begins with an explanatory part describing the output columns. You can suppress
these pages by using the -b option.
When the program is executed, statistics are collected in the gmon.out file. These statistics include the
following:
v The names of the executable program and shared library objects that were loaded
v The virtual memory addresses assigned to each program segment
v The mcount() data for each parent-child pair
v The number of milliseconds accumulated for each program segment
Later, when the gprof command is issued, it reads the a.out and gmon.out files to generate the two
reports. The call-graph profile is generated first, followed by the flat profile. It is best to redirect the gprof
output to a file, because browsing the flat profile first might answer most of your usage questions.
The following example shows the profiling for the cwhet benchmark program. This example is also used
in “The prof command” on page 215:
# cc -o cwhet -pg -lm cwhet.c
# cwhet > cwhet.out
# gprof cwhet > cwhet.gprof
called/total parents
index %time self descendents called+self name index
called/total children
-----------------------------------------------
<spontaneous>
[2] 64.6 0.00 40.62 .__start [2]
19.44 21.18 1/1 .main [1]
0.00 0.00 1/1 .exit [37]
-----------------------------------------------
Usually the call graph report begins with a description of each column of the report, but it has been
deleted in this example. The column headings vary according to type of function (current, parent of
current, or child of current function). The current function is indicated by an index in brackets at the
beginning of the line. Functions are listed in decreasing order of CPU time used.
To read this report, look at the first index [1] in the left-hand column. The .main function is the current
function. It was started by .__start (the parent function is on top of the current function), and it, in turn,
calls .mod8 and .mod9 (the child functions are beneath the current function). All the accumulated time of
.main is propagated to .__start. The self and descendents columns of the children of the current
function add up to the descendents entry for the current function. The current function can have more
than one parent. Execution time is allocated to the parent functions based on the number of times they
are called.
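The attribution rule described above can be sketched numerically (the function name and figures are made up for illustration, not taken from the cwhet run): a child's time is divided among its parents in proportion to their call counts:

```shell
# Split a child's CPU seconds between two parents in proportion to
# how many times each parent called the child (gprof's attribution rule).
apportion() {
  # $1 = child's total seconds, $2 = calls from parent A, $3 = calls from parent B
  awk -v t="$1" -v a="$2" -v b="$3" 'BEGIN {
    printf "A=%.2f B=%.2f\n", t * a / (a + b), t * b / (a + b)
  }'
}

apportion 8.0 3 1   # A called 3 times, B once: prints A=6.00 B=2.00
```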
Flat profile:
The flat profile sample is the second part of the cwhet.gprof file.
Normally, the top functions on the list are candidates for optimization, but you should also consider how
many calls are made to the function. Sometimes it can be easier to make slight improvements to a
frequently called function than to make extensive changes to a piece of code that is called once.
A cross reference index is the last item produced and looks similar to the following:
Index by function name
Note: If the program you want to monitor uses a fork() system call, be aware that by default, the parent
and the child create the same file, gmon.out. To avoid this problem, use the GPROF environment
variable. You can also use the GPROF environment variable to profile multi-threaded applications.
The tprof command
You can determine which particular statements or subroutines to examine with the tprof command.
The tprof command is a versatile profiler that provides a detailed profile of CPU usage by every process
ID and name. It further profiles at the application level, routine level, and even to the source statement
level and provides both a global view and a detailed view. In addition, the tprof command can profile
kernel extensions, stripped executable programs, and stripped libraries. It does subroutine-level profiling
for most executable programs on which the stripnm command produces a symbol table. The tprof
command can profile any program produced by any of the following compilers:
v C
v C++
v FORTRAN
v Java™
The tprof command only profiles CPU activity. It does not profile other system resources, such as
memory or disks.
The tprof command can profile Java programs by using the JPA profiling agent (-x java -Xrunjpa) to
collect Java Just-in-Time (JIT) source line numbers and instructions, if the following parameters are added
to -Xrunjpa:
Time-based profiling
Time-based profiling is the default profiling mode and it is triggered by the decrementer interrupt, which
occurs every 10 milliseconds.
With time-based profiling, the tprof command cannot determine the address of a routine when interrupts
are disabled. While interrupts are disabled, all ticks are charged to the unlock_enable() routines.
Event-based profiling
Event-based profiling is triggered by any one of the software-based events or any Performance Monitor
event that occurs on the processor.
The primary advantages of event-based profiling over time-based profiling are the following:
v The routine addresses are visible when interrupts are disabled.
v The profiling event can be varied.
v The sampling frequency can be varied.
With event-based profiling, ticks that occur while interrupts are disabled are charged to the proper
routines. Also, you can select the profiling event and sampling frequency. The profiling event determines
the trigger for the interrupt and the sampling frequency determines how often the interrupt occurs. After
the specified number of occurrences of the profiling event, an interrupt is generated and the executing
instruction is recorded.
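The sampling scheme above can be modeled in a few lines of shell (a toy illustration, with a hypothetical helper name): out of a stream of event occurrences, only every Nth one triggers a recorded sample:

```shell
# Toy model of event-based sampling: record every Nth event occurrence.
# Input is one line per event; output is the positions that were sampled.
sample_every() {
  awk -v n="$1" 'NR % n == 0 { print NR }'
}

seq 1 10 | sample_every 3   # prints 3, 6, 9 on separate lines
```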
The default type of profiling event is processor cycles. The following are various types of software-based
events:
v Emulation interrupts (EMULATION)
v Alignment interrupts (ALIGNMENT)
v Instruction Segment Lookaside Buffer misses (ISLBMISS)
v Data Segment Lookaside Buffer misses (DSLBMISS)
The sampling frequency for the software-based events is specified in milliseconds and the supported
range is 1 to 500 milliseconds. The default sampling frequency is 10 milliseconds.
The following command generates an interrupt every 5 milliseconds and retrieves the record for the last
emulation interrupt:
# tprof -E EMULATION -f 5
The following command generates an interrupt every 100 milliseconds and records the contents of the
Sampled Instruction Address Register, or SIAR:
# tprof -E -f 100
Event-based profiling uses the SIAR, which contains the address of an instruction close to the executing
instruction. For example, if the profiling event is PM_FPU0_FIN, which means the floating point unit 0
produces a result, the SIAR might not contain that floating point instruction but might contain another
instruction close to it. This is more relevant for profiling based on Performance Monitor events. For this
proximity reason, on systems based on POWER4 and later processors, it is recommended that the
Performance Monitor profiling event be one of the marked events. Marked events have the PM_MRK prefix.
Certain combinations of profiling event, sampling frequency, and workload might cause interrupts to
occur at such a rapid rate that the system spends most of its time in the interrupt handler. The tprof
command detects this condition by keeping track of the number of completed instructions between two
consecutive interrupts. When the tprof command detects five occurrences of the count falling below the
acceptable limit, the trace collection stops. Reports are still generated and an error message is displayed.
The default threshold is 1,000 instructions.
Large Page Analysis uses the information in the trace to project translation buffer performance when
mapping any of the following four application memory regions to a different page size:
v static application data (initialized and uninitialized data)
v application heap (dynamically allocated data)
v stack
v application text
The performance projections are provided for each of the page sizes supported by the operating system.
The first performance projection is a baseline projection for mapping all four memory regions to the
default 4 KB pages. Subsequent projections map one region at a time to a different page size. The
statistics reported for each projection include: the page size, the number of pages needed to back all four
regions, a translation miss score, and a cold translation miss score.
The summary section lists the processes profiled, and the statistics reported include the number and
percentage of memory references, modeled memory references, malloc calls, and free calls.
The translation miss score is an indicator of the translation miss rate and ranges from 0 (no translation
misses) to 1 (every reference results in a translation miss).
The translation miss score differs from the actual translation miss rate because it is based on sampled
references. The score is computed as the number of translation misses divided by the number of
translation buffer accesses, and sampling reduces the denominator (the number of translation buffer
accesses) faster than the numerator (the number of translation misses). As a result, the translation miss
score tends to overestimate the actual translation miss rate at increasing sampling rates.
Thus, the translation score should be interpreted as a relative measure for comparing the effectiveness of
different projections rather than as a predictor of actual translation miss rates.
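As a minimal numeric sketch (the function name and the counts are hypothetical), the score is the ratio of translation misses to translation buffer accesses:

```shell
# Translation miss score: misses divided by translation buffer accesses.
# 0 means no misses; 1 means every sampled reference missed.
miss_score() {
  awk -v m="$1" -v a="$2" 'BEGIN { printf "%.4f\n", m / a }'
}

miss_score 150 100000     # prints 0.0015
miss_score 100000 100000  # worst case: prints 1.0000
```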
The translation miss score is directly affected by larger page sizes: growing the page size reduces the
translation miss score. The performance projection report includes both a cold translation miss score (such
The performance projection for a process would appear similar to the following:
Modeled region for the process ./workload [661980]
Data profiling:
The tprof -b command turns on basic data profiling and collects data access information.
The summary section reports access information across the kernel data, library data, user global data,
and stack/heap sections for each process, as shown in the following example:
Table 3. Data profiling of the tprof -b command
Process Freq Total Kernel User Shared Other
tlbref 1 60.49 0.07 59.71 0.38 0.00
/usr/bin/dd 1 39.30 26.75 11.82 0.73 0.00
tprof 2 0.21 0.21 0.00 0.33 0.00
Total 20 100.00 27.03 71.53 1.44 0.00
When used with the -s, -u, -k, and -e flags, the tprof command's data profiling reports the most-used data
structures (exported data symbols) in shared library, binary, kernel and kernel extensions. The -B flag also
reports the functions that use data structures.
The second table shown is an example of the data profiling report for the /usr/bin/dd process. The
example report shows that the __start data structure is the most used data structure in the /usr/bin/dd
process, based on the samples collected. Each data structure is reported with the list of functions (right
aligned) that use it, along with their share and source, as shown in the following example:
Total % For /usr/bin/dd[323768] (/usr/bin/dd) = 11.69
Subroutine % Source
.noconv 11.29 /usr/bin/dd
.main 0.14 /usr/bin/dd
.read 0.07 glink.s
.setobuf 0.05 /usr/bin/dd
.rpipe 0.04 /usr/bin/dd
.flsh 0.04 /usr/bin/dd
.write 0.04 glink.s
.wbuf 0.02 /usr/bin/dd
.rbuf 0.02 /usr/bin/dd
Data % Source
__start 7.80 /usr/bin/dd
.noconv 6.59 /usr/bin/dd
.main 0.14 /usr/bin/dd
.read 0.04 glink.s
.wbuf 0.02 /usr/bin/dd
.write 0.02 glink.s
.flsh 0.102 /usr/bin/dd
When a program is profiled, the trace facility is activated and instructed to collect data from the trace
hook with hook ID 234 that records the contents of the Instruction Address Register, or IAR, when a
system-clock interrupt occurs (100 times a second per processor). Several other trace hooks are also
activated to enable the tprof command to track process and dispatch activity. The trace records are not
written to a disk file. They are written to a pipe that is read by a program that builds a table of the
unique program addresses that have been encountered and the number of times each one occurred.
When the workload being profiled is complete, the table of addresses and their occurrence counts are
written to disk. The data-reduction component of the tprof command then correlates the instruction
addresses that were encountered with the ranges of addresses occupied by the various programs and
reports the distribution of address occurrences, or ticks, across the programs involved in the workload.
The distribution of ticks is roughly proportional to the CPU time spent in each program, which is 10
milliseconds per tick. After the high-use programs are identified, you can take action to restructure the
hot spots or minimize their use.
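Conceptually, the accumulation step described above builds a histogram of sampled instruction addresses. The following Python sketch illustrates the idea with made-up addresses; it is not the actual tprof implementation:

```python
from collections import Counter

def accumulate_ticks(sampled_addresses):
    """Build a table of unique instruction addresses and the number of
    times each one occurred, as the tprof collection step does conceptually."""
    return Counter(sampled_addresses)

# Hypothetical stream of IAR samples, one per system-clock interrupt:
samples = [0x10001000, 0x10001000, 0x10002040, 0x10001000]
ticks = accumulate_ticks(samples)
print(ticks[0x10001000])  # 3
```

The data-reduction step then maps each address in this table to the program whose address range contains it.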
The following example demonstrates how to collect a CPU tick profile of a program using the tprof
command. The example was executed on a 4-way SMP system, and because the system is fast, the
command completed in less than a second. To make the program run longer, the array size (Asize) was
changed from 1024 to 4096.
Upon running the following command, the version1.prof file is created in the current directory:
# tprof -z -u -p version1 -x version1
The version1.prof file reports how many CPU ticks for each of the programs that were running on the
system while the version1 program was running.
Profile: ./version1
Total Ticks For All Processes (./version1) = 1637
Profile: ./version1
Total Ticks For ./version1[245974] (./version1) = 1637
The first section of the report summarizes the results by program, regardless of the process ID, or PID. It
shows the number of different processes, or Freq, that ran each program at some point.
The second section of the report displays the number of ticks consumed by, or on behalf of, each process.
In the example, the version1 program used 1637 ticks itself and 35 ticks occurred in the kernel on behalf
of the version1 process.
The third section breaks down the user ticks associated with the executable program being profiled. It
reports the number of ticks used by each function in the executable program and the percentage of the
total run's CPU ticks (7504) that each function's ticks represent. Since the system's CPUs were mostly idle,
most of the 7504 ticks are idle ticks.
To see what percentage of the busy time this program took, subtract the wait threads' CPU ticks, which
are the idle CPU ticks, from the total, and then divide the program's ticks by that difference:
Program ticks / (Total ticks - Idle CPU ticks) = % busy time of program
1637 / (7504 - 5810) =
1637 / 1694 = 0.97
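The calculation above can be checked with a few lines of Python, using the tick counts from the example report:

```python
def busy_share(program_ticks, total_ticks, idle_ticks):
    """Fraction of non-idle (busy) CPU time consumed by the profiled program."""
    return program_ticks / (total_ticks - idle_ticks)

# 1637 program ticks, 7504 total ticks, 5810 idle (wait-thread) ticks:
share = busy_share(1637, 7504, 5810)
print(round(share, 2))  # 0.97
```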
As the root user, you can tune the sampling frequency with the following raso tunables:
v tprof_cyc_mult
v tprof_evt_mult
For example, for events based on processor cycles, setting the tprof_cyc_mult tunable to 50 and
specifying the -f flag as 100 is equivalent to specifying a sampling frequency of 100/50 milliseconds.
For other Performance Monitor events, setting the tprof_evt_mult tunable to 100 and specifying the -f
flag as 20,000 is equivalent to specifying a sampling frequency of 20,000/100 occurrences.
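The effect of both tunables is a simple division of the -f value by the multiplier; the two examples above work out as follows. This is a sketch of the arithmetic only, not an interface to the raso command:

```python
def effective_sampling(f_value, multiplier):
    """Effective sampling interval or event count: the -f value divided by
    the raso multiplier (tprof_cyc_mult or tprof_evt_mult)."""
    return f_value / multiplier

print(effective_sampling(100, 50))     # 2.0 milliseconds (cycle-based)
print(effective_sampling(20000, 100))  # 200.0 occurrences (event-based)
```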
To ensure that the trace file contains sufficient information to be post-processed by the tprof command,
the trace command line must include the -M and -j tprof flags.
If you name the rootstring trace1, you can collect a trace by using the trace command with all of the
hooks, or at least the following hooks:
# trace -af -M -T 1000000 -L 10000000 -o trace1.trc -j tprof
# workload
# trcoff
# gensyms > trace1.syms
# trcstop
# tprof -r trace1 -k -u -s -z
The example above creates a trace1.prof file, which gives you a CPU profile of the system while the trace
command was running.
The svmon command captures a snapshot of the current state of memory; however, it is not a true
snapshot because it runs at the user level with interrupts enabled.
If an interval is indicated with the -i flag, statistics are displayed until the command is killed or until
the number of intervals, which is specified with the -i flag, is reached.
You can generate the following different reports to analyze the memory consumption of your machine:
v command report (-C)
v detailed report (-D)
v global report (-G)
v process report (-P)
v segment report (-S)
v user report (-U)
v workload management class report (-W)
v workload management tier report (-T)
v XML report (-X)
Security
Any user of the machine can run the svmon command. It uses two different mechanisms to allow two
different views for a non-root user.
You can view the complete details of the RBAC in Files Reference.
For example, the following .svmonrc file sets svmon to generate the default report format that was used
before the -O options were introduced:
# cat .svmonrc
summary=basic
segment=category
pgsz=on
Note:
v When an option is not recognized in the file, it is ignored.
v When an option is defined more than once, only the last value will be used.
The svmon command can generate two types of reports for the -G, -P, -U, -C, and -W options:
v Compact report, which is a one-line-per-entity report.
v Long report, which uses several lines per entity.
For the -G option, you can switch from the standard report to the compact report with the -O
summary=longreal option. For the -P, -U, -C, and -W options, a compact report is produced when the
-O summary=basic option is set and the -O segment=off option is set (the default value).
Example:
In this example, the command line specifies to run svmon 3 times every 5 seconds. The timestamp and
command line are set with the .svmonrc file.
v -O commandline=[on|off]: when set to on, this option adds the command line that was used to
produce the report to the report header.
# svmon -G -i 5 3
Command line : svmon -G -i 5 3
.svmonrc: -O timestamp=on,commandline=on
Unit: page Timestamp: 11:23:02
-------------------------------------------------------------------------------
size inuse free pin virtual available
memory 262144 227471 34673 140246 223696 53801
pg space 131072 39091
Example:
# svmon -G -O commandline=on
Command line : svmon -G -O commandline=on
Unit: page
-------------------------------------------------------------------------------
size inuse free pin virtual available
memory 262144 227312 34832 140242 223536 53961
pg space 131072 39091
When auto, KB, MB, or GB is used, only the 3 most significant digits are displayed, so be careful when
interpreting results reported in a unit other than page. When the auto setting is selected, the
abbreviated unit is appended immediately after each metric (K for kilobytes, M for megabytes, or G for
gigabytes).
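The 3-significant-digit behavior described above can be sketched as follows. This is an illustrative Python approximation; svmon's exact rounding rules may differ:

```python
def format_auto(pages, page_bytes=4096):
    """Render a 4 KB page count with a K/M/G suffix and at most
    3 significant digits, approximating svmon's unit=auto display."""
    n = pages * page_bytes  # total bytes
    for suffix, scale in (("G", 1024**3), ("M", 1024**2), ("K", 1024)):
        if n >= scale:
            value = n / scale
            decimals = max(0, 3 - len(str(int(value))))  # 3 significant digits
            return f"{value:.{decimals}f}{suffix}"
    return str(n)

print(format_auto(262144))  # 1.00G  (262144 pages of 4 KB = 1 GB)
print(format_auto(131072))  # 512M
```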
Examples:
# svmon -G -O unit=GB
Unit: GB
==============================================================================
size inuse free pin virtual available
memory 4.00 0.84 3.16 0.43 0.74 3.13
pg space 0.50 0
# svmon -G -O unit=auto
Unit: auto
==============================================================================
Segment details can be added to the user, command, process, and class reports after the summary when
the -O segment=on or -O segment=category option is set:
v -O segment=on, the list of segments is displayed for each entity.
v -O segment=category, the segments are grouped into the following three categories for each entity:
– system: used by the system
– exclusive: used only by one entity, except for shared memory (shm) segments
– shared: used by two or more entities, except for shared memory (shm) segments
The following table contains the description of the items that the svmon reports for segment information.
Table 5. Description table
Segment type  Segment usage                                        Description
persistent    log files                                            IO space mapping
persistent    files and directories                                device name : inode number
persistent    large files                                          large file device name : inode number
mapping       files mapping                                        mapped to sid source sid; no longer mapped
working       data areas of processes and shared memory segments   dependent on the role of the segment based on the VSID and ESID
client        NFS and CD-ROM files                                 dependent on the role of the segment based on the VSID and ESID
client        JFS2 files                                           device name : inode number
rmapping      I/O space mapping                                    dependent on the role of the segment based on the VSID and ESID
In these examples, the mapping option adds or removes the mapping source segments which are not in
the address space of the process number 266414. There is a difference of four pages (three pages from
segment 191338, and one page from segment 131332) in the Inuse consumption between -O mapping=off
and -O mapping=on.
v -O sortseg=[inuse | pin | pgsp | virtual]: by default, all segments are sorted in decreasing order of
real memory usage (the Inuse metric) for each entity (user, process, command, segment). Sorting
options for the report include the following:
– Inuse: real memory used
– Pin: pinned memory used
– Pgsp: paging space memory used
– Virtual: virtual memory used
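The sort semantics can be illustrated with a small Python sketch over made-up segment records; the field names here are hypothetical, chosen to mirror the report columns:

```python
# Made-up segment records mirroring the Inuse/Pin/Pgsp/Virtual columns:
segments = [
    {"vsid": "a1b2c", "inuse": 112,  "pin": 12,   "pgsp": 0,  "virtual": 111},
    {"vsid": "d3e4f", "inuse": 7088, "pin": 6288, "pgsp": 64, "virtual": 7104},
    {"vsid": "9f0a1", "inuse": 9056, "pin": 0,    "pgsp": 16, "virtual": 9056},
]

def sort_segments(segs, key="inuse"):
    """Sort segments in decreasing order of the chosen metric,
    as -O sortseg=[inuse | pin | pgsp | virtual] does."""
    return sorted(segs, key=lambda s: s[key], reverse=True)

print([s["vsid"] for s in sort_segments(segments)])         # inuse order
print([s["vsid"] for s in sort_segments(segments, "pin")])  # pin order
```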
Examples:
# svmon -P 1 -O unit=KB,segment=on
Unit: KB
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
1 init 67752 32400 0 67688
# svmon -P 1 -O unit=KB,segment=on,sortseg=pin
Unit: KB
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
1 init 67752 32400 0 67688
Unit: page
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
221326 java 20619 6326 9612 27584
Additional -O options
Review the additional -O options for the svmon command.
All reports containing two or more entities can be filtered and/or sorted with the following options:
v -O sortentity=[inuse |...]: specifies the summary metric used to sort the entities (process, user, and so
on) when several entities are printed in a report.
The list of metrics permitted in the report depends on the type of summary (-O summary option)
chosen. Any of the metrics used in a summary can be used as a sort key.
Examples:
# svmon -P -t 5 -O summary=off -O segment=off -O sortentity=pin
Command line : svmon -P -t 5 -O summary=off -O segment=off -O sortentity=pin
Unit: page
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
127044 dog 9443 8194 0 9443
0 swapper 9360 8176 0 9360
8196 wait 9360 8176 0 9360
53274 wait 9360 8176 0 9360
237700 rpc.lockd 9580 8171 0 9580
v -O filtercat=[off | exclusive | kernel | shared | unused | unattached]: this option filters the output
by segment category. You can specify more than one filter at a time.
Note: Use the unattached filter value with the -S report because unattached segments cannot be
owned by a process or command.
Examples:
# svmon -P 1 -O unit=KB,segment=on,sortseg=pin,filtercat=shared
Unit: KB
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
1 init 58684 28348 0 58616
Unit: page
...............................................................................
SYSTEM segments Inuse Pin Pgsp Virtual
7088 6288 64 7104
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
112 12 0 111
...............................................................................
SHARED segments Inuse Pin Pgsp Virtual
9056 0 16 9056
Unit: page
===============================================================================
Command Inuse Pin Pgsp Virtual
yes 1 0 0 0
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
1 0 0 0
Unit: page
===============================================================================
Command Inuse Pin Pgsp Virtual
yes 16255 6300 80 16271
...............................................................................
SYSTEM segments Inuse Pin Pgsp Virtual
7088 6288 64 7104
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
111 12 0 111
...............................................................................
SHARED segments Inuse Pin Pgsp Virtual
9056 0 16 9056
Unit: page
===============================================================================
User Inuse Pin Pgsp Virtual
root 12288 12288 0 12288
Reports details
Review the output for the svmon command reports.
To display a compact report of memory expansion information (on a system with Active Memory
Expansion enabled), enter:
# svmon -G -O summary=longame
Unit: page
-----------------------------------------------------------------------------------------------------
Active Memory Expansion
-----------------------------------------------------------------------------------------------------
Size Inuse Free DXMSz UCMInuse CMInuse TMSz TMFr CPSz
262144 152625 43055 67640 98217 54408 131072 6787 26068
Global report
To print the Global report, specify the -G flag. The Global report displays a detailed system-wide view
of the machine's real memory. This report contains various summaries; only the memory and inuse
summaries are always displayed.
When the -O summary option is not used, or when it is set to -O summary=basic, the column headings
used in global reports summaries are:
memory
Specifies statistics describing the use of memory, including:
size Number of frames (size of real memory)
Tip: This does not include the free frames that have been made unusable by the memory
sizing tool, the rmss command.
inuse Number of frames containing pages
Tip: On a system where a reserved pool is defined (such as the 16 MB page pool), this
value includes the frames reserved for any of these reserved pools.
virtual
Number of pages allocated in the system virtual space
available
Amount of memory available for computational data. This metric is calculated based on
the size of the file cache and the amount of free memory.
stolen Displayed only when rmss runs on the machine. Number of frames stolen by rmss and
marked unusable by the VMM
mmode
Indicates the memory mode the system is running in.
The following are the current possible values for mmode:
Ded Neither Active Memory Sharing nor Active Memory Expansion is enabled.
Shar Only Active Memory Sharing is enabled; Expansion is not enabled.
Ded-E
Active Memory Sharing is not enabled, but Expansion is enabled.
Shar-E Both Active Memory Sharing and Active Memory Expansion are enabled.
ucomprsd
This gives a breakdown of expanded memory statistics in the uncompressed pool,
including: inuse Number of uncompressed pages that are in use.
comprsd
This gives a breakdown of expanded memory statistics in the compressed pool, including:
inuse Number of compressed pages in the compressed pool.
pg space
Specifies statistics describing the use of paging space.
size Size of paging space
inuse Number of paging space pages used
ucomprsd
This gives a breakdown of expanded memory statistics of working pages in the
uncompressed pool, including: inuse Number of uncompressed pages that are in use.
comprsd
This gives a breakdown of expanded memory statistics of working pages in the
compressed pool, including: inuse Number of compressed pages in the compressed pool.
Pin Specifies statistics on the subset of real memory containing pinned pages, including:
work Number of frames containing working segment in use pages
pers Number of frames containing persistent segment in use pages
clnt Number of frames containing client segment in use pages
Note: The ucomprsd and comprsd metrics are available only on systems with Active Memory Expansion
enabled. The -O summary=ame option is needed to show these expanded memory statistics.
When the -O summary=ame option is used on a system with Active Memory Expansion enabled, the
following memory information (true memory snapshot) is displayed in the global report summary at the
end of the regular report.
True Memory
True memory size.
Note: The true memory section of the expanded memory statistics above can be turned off by using the
-O tmem=off option.
When the -O summary=longreal option is set with -G, the compact report header contains the following
metrics:
Size Number of frames (size of real memory)
Tip: This includes any free frames that have been made unusable by the memory sizing tool, the
rmss command.
Inuse Number of frames containing pages
Tip: On a system where a reserved pool is defined (such as the 16 MB page pool), this value
includes the frames reserved for any of these reserved pools.
Free Number of frames free in all memory pools. There may be more memory available depending on
the file cache (see: available)
Pin Number of frames containing pinned pages
Tip: On a system where a reserved pool is defined (such as the 16 MB page pool), this value
includes the frames reserved for any of these reserved pools.
Note:
v If you specify the -@ flag without a list, the flag has no effect, except when the -O summary option is
used; in that case, the WPAR name is added in the last column.
If a list is provided after the -@ flag, the svmon command report includes one section per WPAR listed.
If ALL is specified, a system-wide and a global section will also be present. Any metric not available
on a per WPAR basis is either replaced by the corresponding global value (in the case of -@ WparList)
or by a "-" (in the case of -@ ALL).
v Global values are displayed instead of per-WPAR metrics. They are flagged by the presence of an @ in
the report.
v Some of the metrics are only available on a per WPAR basis if the WLM is used to restrict the WPAR
memory usage.
When the -O summary=longame option is set with -G, the compact report header contains the following
Active Memory Expansion metrics:
Size Expanded memory size
Inuse Number of pages in use (expanded form).
Free Size of freelist (expanded form).
DXMSz
Deficit memory to reach the target memory expansion
UCMInuse
Number of uncompressed pages in use.
CMInuse
Number of compressed pages in the compressed pool.
TMSz True memory size
TMFr True number of free page frames
CPSz Size of Compressed pool.
CPFr Number of free pages in the compressed pool.
txf Target Memory Expansion Factor
cxf Current Memory Expansion Factor
CR Compression Ratio.
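Assuming the usual definition of the expansion factor (expanded memory size divided by true memory size; this document does not spell out svmon's exact formula), the figures from the longame example earlier are consistent:

```python
def expansion_factor(expanded_size, true_size):
    """Memory expansion factor: expanded memory size divided by true
    (physical) memory size. Assumed definition, for illustration only."""
    return expanded_size / true_size

# Size=262144 and TMSz=131072 from the -O summary=longame example:
print(expansion_factor(262144, 131072))  # 2.0
```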
Examples
v To display the default svmon report, with automatic unit selection, enter:
# svmon -G -O unit=auto,pgsz=on
Unit: auto
-------------------------------------------------------------------------------
size inuse free pin virtual available
memory 31.0G 2.85G 28.1G 1.65G 2.65G 27.3G
pg space 512.00M 13.4M
# svmon -G -O unit=MB,pgsz=on,affinity=on
Unit: MB
-------------------------------------------------------------------------------
size inuse free pin virtual available
memory 31744.00 3055.36 28688.64 1838.84 2859.78 27911.33
pg space 512.00 14.7
# svmon -G -O unit=MB,pgsz=on,affinity=on
Unit: MB
-------------------------------------------------------------------------------
size inuse free pin virtual available
memory 4096.00 811.59 3284.41 421.71 715.08 3248.66
pg space 512.00 6.23
# svmon -O summary=longreal
Unit: page
------------------------------------------------------------------------
Memory
-----------------------------------------------------------------------
Size Inuse Free Pin Virtual Available Pgsp
262144 187219 74925 82515 149067 101251 131072
The metrics reported here are identical to the metrics in the basic format. There is a memory size of
262144 frames with 187219 frames inuse and 74925 remaining frames. 149067 pages are allocated in the
virtual memory and 101251 frames are available.
# svmon -G -O unit=MB,summary=shortreal -i 60 5
Unit: MB
-------------------------------------------------------------------------------
Size Inuse Free Pin Virtual Available Pgsp
1024.00 709.69 314.31 320.89 590.74 387.95 512.00
1024.00 711.55 312.39 320.94 592.60 386.02 512.00
1024.00 749.10 274.89 322.89 630.15 348.53 512.00
1024.00 728.08 295.93 324.57 609.11 369.57 512.00
1024.00 716.79 307.21 325.66 597.50 381.16 512.00
This example shows how to monitor the whole system by taking a memory snapshot every 60 seconds
for 5 minutes.
v To display detailed memory expansion information (in a system with Active Memory Expansion
enabled), enter:
# svmon -G -O summary=ame
Unit: page
--------------------------------------------------------------------------------------
size inuse free pin virtual available mmode
memory 262144 152619 43061 73733 154779 41340 Ded-E
ucomprsd - 98216 -
comprsd - 54403 -
pg space 131072 1212
# svmon -G -O summary=ame,tmem=off
Unit: page
--------------------------------------------------------------------------------------
size inuse free pin virtual available mmode
memory 262144 152619 43061 73733 154779 41340 Ded-E
ucomprsd - 98216 -
comprsd - 54403 -
pg space 131072 1212
User report
The User report displays the memory usage statistics for each specified login name or, when no
argument is specified, for all users.
If processes owned by this user use pages of a size other than the base 4 KB page size, and the -O
pgsz=on option is set, these statistics are followed by breakdown statistics for each page size. The metrics
reported in this per-page size summary are reported in the page size unit by default.
Note:
v If you specify the -@ flag without an argument, these statistics will be followed by the users
assignments to WPARs. This information is shown with an additional WPAR column displaying the
WPAR name where the user was found.
v If you specify the -O activeusers=on option, users that do not use memory (Inuse memory is 0 pages)
are not shown in the report.
Examples
1. To display per user memory consumption statistics, enter:
# svmon -U
Unit: page
===============================================================================
User Inuse Pin Pgsp Virtual
root 56007 16070 0 54032
daemon 14864 7093 0 14848
guest 14705 7087 0 14632
bin 0 0 0 0
sys 0 0 0 0
adm 0 0 0 0
uucp 0 0 0 0
nobody 0 0 0 0
This command gives a summary of all the users using memory on the system. This report uses the
default sorting key: the Inuse column. Since no -O option was specified, the default unit (page) is
used. Each page is 4 KB.
The Inuse column, which is the total number of pages in real memory from segments that are used by
all the processes of the root user, shows 56007 pages. The Pin column, which is the total number of
pages pinned from segments that are used by all the processes of the root user, shows 16070 pages.
The Pgsp column, which is the total number of paging-space pages that are used by all the processes
of the root user, shows 0 pages. The Virtual column (total number of pages in the process virtual
space) shows 54032 pages for the root user.
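Since the default unit is the 4 KB page, converting the report's figures to kilobytes is a simple multiplication:

```python
PAGE_KB = 4  # default svmon unit: one 4 KB page

def pages_to_kb(pages):
    """Convert a page count from an svmon report to kilobytes."""
    return pages * PAGE_KB

# root user's Inuse and Virtual columns from the report above:
print(pages_to_kb(56007))  # 224028 KB
print(pages_to_kb(54032))  # 216128 KB
```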
2. To display per WPAR per active user memory consumption statistics, enter:
###############################################################################
######## WPAR : Global
###############################################################################
===============================================================================
User Inuse Pin Pgsp Virtual
root 155.49M 49.0M 0K 149.99M
daemon 69.0M 34.8M 0K 68.9M
###############################################################################
######## WPAR : wp0
###############################################################################
===============================================================================
User Inuse Pin Pgsp Virtual
root 100.20M 35.4M 0K 96.4M
###############################################################################
######## WPAR : wp2
###############################################################################
===============================================================================
User Inuse Pin Pgsp Virtual
root 100.14M 35.4M 0K 96.3M
In this case, we run in each WPAR context and want details about every user in all the WPARs running
on the system. Because some users are not active, we keep only the active users by adding the
-O activeusers=on option on the command line. Each WPAR has a root user, which in this example
consumes the same amount of memory because each one runs exactly the same list of processes. The
root user of the Global WPAR uses more memory because more processes are running in the Global
WPAR than in a workload partition.
Command report
The Command report displays the memory usage statistics for the specified command names. To print
the command report, specify the -C flag.
This report contains all the columns detailed in the common summary metrics as well as its own defined
here:
Command
Indicates the command name.
If processes running this command use pages of a size other than the base 4 KB page size, and the -O
pgsz=on option is set, these statistics are followed by breakdown statistics for each page size. The metrics
reported in this per-page size summary are reported in the page size unit by default.
Examples:
1. To display memory statistics about the yes command, with breakdown by process and categorized
detailed statistics by segment, enter:
...............................................................................
SYSTEM segments Inuse Pin Pgsp Virtual
6336 5488 0 6336
...............................................................................
EXCLUSIVE segments Inuse Pin Pgsp Virtual
37 4 0 36
...............................................................................
SHARED segments Inuse Pin Pgsp Virtual
8032 0 0 8032
This report contains all the columns detailed in the common summary metrics as well as its own defined
here:
Pid Indicates the process ID.
Command
Indicates the command the process is running.
If processes use pages of a size other than the base 4 KB page size, and the -O pgsz=on option is set,
these statistics are followed by breakdown statistics for each page size. The metrics reported in this
per-page size summary are reported in the page size unit by default.
After process information is displayed, svmon displays information about all the segments that the
process used. Information about segments is described in the Segment report section.
Note:
v If you specify the -@ flag, the svmon command displays two additional lines that show the virtual pid
and the WPAR name of the process. If the virtual pid is not valid, a dash sign (-) is displayed.
v The -O affinity flag, supported by the -P option, gives details on domain affinity for the process when
set to on, and for each of the segments when set to detail. Note that memory affinity information is
not available for shared partitions.
Examples:
1. To display the top 10 list of processes in terms of real memory usage in KB unit, enter:
# svmon -P -O unit=KB,summary=basic,sortentity=inuse -t 10
Unit: KB
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
344254 java 119792 22104 0 102336
209034 xmwlm 68612 21968 0 68256
262298 IBM.CSMAgentR 60852 22032 0 60172
270482 rmcd 60844 21996 0 60172
336038 IBM.ServiceRM 59588 22032 0 59344
225432 IBM.DRMd 59408 22040 0 59284
204900 sendmail 59240 21968 0 58532
266378 rpc.statd 59000 21980 0 58936
168062 snmpdv3ne 58700 21968 0 58508
131200 errdemon 58496 21968 0 58108
This example gives the top 10 processes consuming the most real memory. The report is sorted by the
inuse count, 119792 KB for the java process, 68612 KB for the xmwlm daemon and so on. The other
metrics are: KB pinned in memory, KB of paging space and virtual memory.
2. To display information about all the non-empty segments of a process, enter:
Unit: page
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
221326 java 20619 6326 9612 27584
Unit: page
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
221326 java 20619 6326 9612 27584
Domain affinity Npages
0 29345
1 11356
This report contains all the columns detailed in the common summary metrics as well as its own defined
here:
Class or Superclass
Indicates the class or superclass name.
Examples:
1. To display memory statistics about all WLM classes in the system, enter:
# svmon -W -O unit=page,commandline=on,timestamp=on
Command line : svmon -W -O unit=page,commandline=on,timestamp=on
Unit: page Timestamp: 10:41:20
===============================================================================
Superclass Inuse Pin Pgsp Virtual
System 121231 94597 19831 135505
Unclassified 27020 8576 67 8659
Default 17691 12 1641 16491
Shared 15871 0 0 13584
Unmanaged 0 0 0 0
In this example, all the WLM classes of the system are reported. Since no sort option was specified,
the Inuse metric (real memory usage) is the sorting key. The class System uses 121231 pages in real
memory. 94597 frames are pinned. The number of pages reserved or used in paging space is 19831.
The number of pages allocated in the virtual space is 135505.
This report contains all the columns detailed in the common summary metrics as well as its own defined
here:
Tier Indicates the tier number
Superclass
The optional column heading indicates the superclass name when tier applies to a superclass
(when the -a flag is used).
The -O subclass=on option can be added to display the list of subclasses. The -a <supclassname> option
allows reporting only the details of a given superclass.
Examples:
1. To display memory statistics about all WLM tiers and superclasses in the system, enter:
# svmon -T -O unit=page
Unit: page
===============================================================================
Tier Inuse Pin Pgsp Virtual
0 137187 61577 2282 110589
===============================================================================
Superclass Inuse Pin Pgsp Virtual
System 81655 61181 2282 81570
Unclassified 26797 384 0 2107
Default 16863 12 0 15040
Shared 11872 0 0 11872
Unmanaged 0 0 0 0
1 9886 352 0 8700
===============================================================================
Superclass Inuse Pin Pgsp Virtual
myclass 9886 352 0 8700
All the superclasses of all the defined tiers are reported. Each Tier has a summary header with the
Inuse, Pin, Paging space, and Virtual memory, and then the list of all its classes.
2. To display memory statistics about all WLM tiers, superclasses and classes in the system, enter:
Segment report
To print the segment report, specify the -S flag.
Note:
v Mapping device name and inode number to file names can be a lengthy operation for deeply nested
file systems. Because of that, the -O filename=on option should be used with caution.
v If the segment is a persistent segment and is associated with a log, then the string log displays. If the
segment is a working segment, then the svmon command attempts to determine the role of the
segment. For instance, special working segments such as the kernel and shared library are recognized
by the svmon command. If the segment is the private data segment for a process, then private prints
out. If the segment is the code segment for a process, and the segment report prints out in response to
the -P flag, then the string code is prepended to the description.
v If the segment is mapped by several processes and used in different ways (that is, a process private
segment mapped as shared memory by another process), then the description is empty. The exact
description can be obtained through -P flag applied on each process identifier using the segment.
v If a segment description is too large to fit in the description space, then the description is truncated. If
you need to enlarge the output, you can use the -O format flag. When set to -O format=160, the report
is displayed in 160 columns, which means more room for the description field. When set to -O
format=nolimit, the description is fully printed even if it breaks the column alignment.
Restriction:
v Segment reports can only be generated for primary segments.
Examples:
1. To display information about a list of segments including the list of processes using them, enter:
# svmon -S -O filtercat=unattached
Unit: page
# svmon -S -t 10 -O unit=auto,filterprop=text,filename=on
Unit: auto
When a WPAR has been checkpointed and restarted, some shared library areas might be local to the
WPAR. The name of the WPAR is printed after the name of the area. Note that using Named Shared
Library Areas in a WPAR does not mean that the area is for that WPAR only. For more information, see
the documentation on NSLA.
In all other examples, the area is system-wide; therefore, the WPAR name is omitted.
Examples:
-------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual
381050 yes 11309 9956 0 11308
Detailed report
The detailed report (-D) displays information about the pages owned by a segment and, on-demand, it
can display the frames these pages are mapped to. To print the detailed report, specify the -D flag.
Several fields are presented before the listing of the pages used:
Segid The segment identifier.
Type The type of the segment.
Note:
v The -@ flag has no effect on the -D option.
v This option only supports the additional -O frame option, which shows additional frame-level details.
v This report is formatted in 160 columns.
Examples:
# svmon -D b9015
Segid: b9015
Type: client
PSize: s (4 KB)
Address Range: 0..9 : 122070..122070
Page Psize Frame Pin Ref Mod ExtSegid ExtPage Pincount State Swbits
65483 s 72235 Y N N - - 1/0 Hidden 88000000
65353 s 4091 Y N N - - 1/0 Hidden 88000000
65352 s 4090 Y N N - - 1/0 Hidden 88000000
65351 s 4089 Y N N - - 1/0 Hidden 88000000
65350 s 1010007 N N N - - 0/0 In-Use 88020000
65349 s 1011282 N N N - - 0/0 In-Use 88020000
65354 s 992249 N N N - - 0/0 In-Use 88020000
65494 s 1011078 N N N - - 0/0 In-Use 88020000
0 s 12282 N N N - - 0/0 In-Use 88820000
1 s 12281 N N N - - 0/0 In-Use 88820000
2 s 64632 N N N - - 0/0 In-Use 88a20000
3 s 64685 N N N - - 0/0 In-Use 88a20000
4 s 64630 N N N - - 0/0 In-Use 88a20000
5 s 64633 N N N - - 0/0 In-Use 88820000
The segment b9015 is a client segment with 11 pages. None of them are pinned.
The page 122070 is physically the page dcd6 in the extended segment 208831.
The frame 72235 is pinned, not referenced, and not modified; it is in the Hidden state and does not
belong to an extended segment or a large page segment.
XML report
To print the XML report, specify the -X option.
By default the report is printed to standard output. The -o filename flag allows you to redirect the report
to a file. When the -O affinity option is used, affinity information is added to the report.
The extension of XML reports is .svm. To prevent overwriting a report, the -O overwrite=off option
can be specified (by default, this option is set to on).
This XML file uses an XML Schema Definition (XSD), which can be found in the /usr/lib/perf/
svmon_measurement.xsd file. This schema is self-documented and can therefore be used to build
custom applications that use the XML data provided in these reports.
The data provided in this file is a snapshot view of the whole machine. It contains enough data to build
an equivalent of the -G, -P, -S, -W, -U, and -C options.
Use the RSI interface API to write programs that access one or more xmtopas daemons. It allows you to
develop programs that print, post-process, or otherwise manipulate the raw statistics provided by the
xmtopas daemons. Such programs are known as Data-Consumer programs. AIX Version 7.1 Technical
Reference: Communications, Volume 2 must be installed to see the RSi subroutines.
Makefile
The include files depend on define directives, which must be properly set. They are defined with
the -D preprocessor flag.
v _AIX specifies that the include files generate code for AIX.
v _BSD is required for proper BSD compatibility.
RsiCons: RsiCons.c
$(CC) -o RsiCons RsiCons.c $(CFLAGS) $(LIBS)
RsiCons1: RsiCons1.c
$(CC) -o RsiCons1 RsiCons1.c $(CFLAGS) $(LIBS)
chmon: chmon.c
$(CC) -o chmon chmon.c $(CFLAGS) $(LIBS) -lcurses
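The rules above assume that CC, CFLAGS, and LIBS are defined earlier in the Makefile. A plausible minimal set of definitions, for illustration only (the library name is an assumption; the authoritative Makefile ships with the samples in /usr/samples/perfmgr):

```make
# Illustrative only -- see the Makefile shipped in /usr/samples/perfmgr.
CC     = cc
CFLAGS = -D_AIX -D_BSD        # add -D_NO_PROTO on pre-ANSI compilers
LIBS   = -lSpmi               # SPMI/RSi library (name assumed)
```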
If the system that is used to compile does not support ANSI function prototypes, include the -D_NO_PROTO
flag.
The Remote Statistics Interface (RSI) application programming interface (API) is used to create
data-consumer programs that access statistics from any host's xmtopas daemon.
To start using the RSI interface API you must be aware of the format and use of the RSI interface data
structures.
RSI handle
An RSI handle is a pointer to a data structure of type RsiHandleStructx. Prior to using any other RSI call,
a data-consumer program must use the RSiInit subroutine to allocate a table of RSI handles. An RSI
handle from the table is initialized when you open the logical connection to a host and that RSI handle
must be specified as an argument on all subsequent subroutines to the same host. Only one of the
internal fields of the RSI handle should be used by the data-consumer program, namely the pointer to
received network packets, pi. Only in very special cases will you ever need to use this pointer, which is
initialized by RSiOpenx and must never be modified by a data-consumer program. If your program
changes any field in the RSI handle structure, results are highly unpredictable. The RSI handle is defined
in /usr/include/sys/Rsi.h.
SpmiStatVals
Note: The two value fields are defined as union Value, which means that the actual data fields may be
long or float, depending on flags in the corresponding SpmiStat structure. The SpmiStat structure cannot
be accessed directly from the StatVals structure (the pointer is not valid, as previously mentioned).
Therefore, to determine the type of data in the val and val_change fields, you must have saved the
SpmiStat structure as returned by the RSiPathAddSetStatx subroutine. This is rather clumsy, so the
RSiGetValuex subroutine does everything for you and you do not need to keep track of SpmiStat
structures.
The SpmiStat structure is used to describe a statistic. It is defined in the /usr/include/sys/Spmidef.h file
as struct SpmiStat. If you ever need information from this data structure (apart from information
that can be returned by the RSiStatGetPathx subroutine), be sure to save it as it is returned by the
RSiPathAddSetStatx subroutine.
The RSiGetValuex subroutine provides another way of getting access to an SpmiStat structure but can
only do so while a data feed packet is being processed.
The xmtopas daemon accepts the definition of sets of statistics that are to be extracted simultaneously
and sent to the data-consumer program in a single data packet. The structure that describes such a set of
statistics is defined in the /usr/include/sys/Spmidef.h file as struct SpmiStatSet.
When returned in a data feed packet, the SpmiStatSet structure holds the actual time the data feed packet
was created (according to the remote host's clock) and the elapsed time since the latest previous data feed
packet for the same SpmiStatSet was created.
The SpmiHotSet structure represents another set of access structures that allow an application program to
define an alternative way of extracting and processing metrics. They are used to extract data values for
the most or least active statistics for a group of peer contexts. For example, it can be used to define that
the program wants to receive information about the two highest loaded disks, optionally subject to the
load exceeding a specified threshold.
When the SPMI receives a read request for an SpmiHotSet, the SPMI reads the latest value for all the peer
sets of statistics in the hotset in one operation. This action reduces the system overhead caused by access
of kernel structures and other system areas, and ensures that all data values for the peer sets of statistics
within a hotset are read at the same time. The hotset may consist of one or many sets of peer statistics.
SpmiHotVals
One SpmiHotVals structure is created for each set of peer statistics selected for the hotset.
When the SPMI executes a request from the application program to read the data values for a hotset, all
SpmiHotVals structures in the set are updated. The RSi application program can then traverse the list of
SpmiHotVals structures by using the RSiGetHotItemx subroutine call.
The SpmiHotVals structure carries the data values from the SPMI to the application program. Its data
carrying fields are:
Item Descriptor
error Returns a zero value if the SPMI's last attempt to read the data
values for a set of peer statistics was successful. Otherwise, this
field contains an error code as defined in the sys/Spmidef.h file.
avail_resp Used to return the number of peer statistic data values that meet
the selection criteria (threshold). The field max_responses
determines the maximum number of entries actually returned.
count Contains the number of elements returned in the array items.
This number is the number of data values that met the selection
criteria (threshold), capped at max_responses.
items The array used to return count elements. This array is defined in
the SpmiHotItems data structure. Each element in the
SpmiHotItems array has the following fields:
name
The name of the peer context for which the values are
returned.
val
Returns the value of the counter or level field for the peer
statistic. This field returns the statistic's value as maintained
by the original supplier of the value. However, the val field
is converted to an SPMI data format.
val_change
Returns the difference between the previous reading of the
counter and the current reading when the statistic contains
counter data. When this value is divided by the elapsed
time returned in the SpmiHotSet Structure, an event
rate-per-time-unit can be calculated.
If neither a communication error nor a timeout error occurred, a packet is available in the receive buffer
pointed to by the pi pointer in the RSI handle. The packet includes a status code that tells whether the
subroutine was successful at the xmtopas daemon. If the exact status code matters, you must check it in
the packet yourself, because the RSI interface only places the RSiBadStat constant in the RSiErrno field to
indicate to your program that a bad status code was received.
You can use the indication of error or success as defined for each subroutine to determine whether the
subroutine succeeded, or you can test the external integer RSiErrno. If this field is RSiOkay, the subroutine
succeeded; otherwise it did not. The error codes returned in RSiErrno are defined in the RSiErrorType
enum.
All the library functions use the request-response interface, except for RSiMainLoop (which uses a network
driven interface) and RSiInitx, RSiGetValuex, and RSiGetRawValuex (that do not involve network traffic).
The request packet types are the still_alive, the data_feed, and the except_rec packets. The
still_alive packets are handled internally in the RSI interface and require no programming in the
data-consumer program.
The data_feed packets are received asynchronously with any packets produced by the request-response
type subroutines. If a data_feed packet is received when processing a request-response function, control
is passed to a callback function, which must be named when the RSI handle is initialized with the
RSiOpenx subroutine.
When the data-consumer program is not using the request-response functions, it still needs to be able to
receive and process data_feed packets. This is done with the RSiMainLoopx function, which invokes the
callback function whenever a packet is received.
Actually, the data feed callback function is invoked for all received packets that cannot be identified as a
response to the latest request sent, except packets of type i_am_back, still_alive, or except_rec. Note
that this means that responses to “request-response” packets that arrive after a timeout are passed to the
callback function. It is the responsibility of your callback function to test for the packet type received.
The except_rec packets are received asynchronously with any packets produced by the request-response
type subroutines. If an except_rec packet is received when processing a request-response function,
control is passed to a callback function, which must be named when the RSI handle is initialized with the
RSiOpenx subroutine.
When the data-consumer program is not using the request-response functions, it still needs to be able to
receive and process except_rec packets. This is done with the RSiMainLoopx function which invokes the
callback function whenever a packet is received.
Note: The API discards except_rec messages from a remote host unless a callback function to process the
message type was specified on the RSiOpenx subroutine call for that host.
In the case of the xmtopas protocol, such situations usually result in one or more of the following:
v Missing packets
v Resynchronizing requests
Missing packets
Responses to outstanding requests are not received, which generates a timeout. That's fairly easy to cope
with because the data-consumer program has to handle other error return codes anyway. It also results in
expected data feeds not being received. Your program may want to test for this happening. The proper
way to handle this situation is to use the RSiClosex function to release all memory related to the dead
host and to free the RSI handle. After this is done, the data-consumer program may attempt another
RSiOpenx to the remote system or may simply exit.
Resynchronizing requests
Whenever an xmtopas daemon hears from a given data-consumer program on a particular host for the
first time, it responds with a packet of i_am_back type, effectively prompting the data-consumer program
to resynchronize with the daemon. Also, when the daemon attempts to reconnect to data-consumer
programs that it talked to when it was killed or died, it sends an i_am_back packet.
It is important that you understand how the xmtopas daemon handles “first time contacted.” It is based
upon tables internal to the daemon. Those tables identify all the data-consumers that the daemon knows
about. Be aware that a data-consumer program is known by the host name of the host where it executes,
suffixed by the IP port number used to talk to the daemon. Each running data-consumer program is
identified uniquely, as are multiple running copies of the same data-consumer program.
Whenever a data-consumer program exits in an orderly manner, it alerts the daemon that it intends to exit and the
daemon removes it from the internal tables. If, however, the data-consumer program decides to not
request data feeds from the daemon for some time, the daemon detects that the data consumer has lost
interest and removes the data consumer from its tables as described in Life and Death of xmtopas. If the
data-consumer program decides later that it wants to talk to the xmtopas daemon again, the daemon
responds with an i_am_back packet.
The i_am_back packets are given special treatment by the RSI interface. Each time one is received, a
resynchronizing callback function is invoked. This function must be defined on the RSiOpenx subroutine.
Note: All data-consumer programs can expect to have this callback invoked once during execution of the
RSiOpenx subroutine because the remote xmtopas does not know the data consumer. This is usual and
should not cause your program to panic. If the resynchronize callback is invoked twice during processing
of the RSiOpenx function, the open failed and can be retried, if appropriate.
Example:
portrange 3001 3003
When the RSI communication starts, it uses port 3001, 3002, or 3003 from the specified range. Only three RSI
agents can listen on these ports; any subsequent RSI communication fails.
The first version accesses only CPU-related statistics. It assumes that you want to get your statistics from
the local host unless you specify a host name on the command line. The program continues to display the
statistics until it is killed. The source code for the sample program can be found in the
/usr/samples/perfmgr/RsiCons1.c file.
Finally, lines 34 through 36 prepare an initial value path name for the main processing loop of the
data-consumer program. This is the method followed to create the value path names. Then, the main
processing loop in the internal lststats function is called. If this function returns, an RSiClosex call is
issued and the program exits.
Defining a Statset
Eventually, you want the sample of the data-consumer program to receive data feeds from the xmtopas
daemon. Thus, start preparing the SpmiStatSet, which defines the set of statistics with which you are
interested. This is done with the RSiCreateStatSetx subroutine.
[01] void lststats(char *basepath)
[02] {
[03] struct SpmiStatSet *ssp;
[04] char tmp[128];
[05]
[06] if (!(ssp = RSiCreateStatSetx(rsh)))
[07] {
[08] fprintf(stderr, "RsiCons1 can't create StatSet\n");
[09] exit(62);
[10] }
[11]
[12] strcpy(tmp, basepath);
[13] strcat(tmp, "CPU/cpu0");
[14] if ((tix = addstat(tix, ssp, tmp, "cpu0")) == -1)
[15] {
[16] if (strlen(RSiEMsg))
[17] fprintf(stderr, "%s", RSiEMsg);
[18] exit(63);
[19] }
[20]
[21] RSiStartFeedx(rsh, ssp, 1000);
[22] while(TRUE)
[23] RSiMainLoopx(499);
[24] }
In the sample program, the SpmiStatSet is created in the local lststats function shown previously in lines
6 through 10.
Lines 12 through 19 invoke the local function addstat (Adding Statistics to the Statset), which finds all
the CPU-related statistics in the context hierarchy and initializes the arrays to collect and print the
information. The first two lines expand the value path name passed to the function by appending
CPU/cpu0. The resulting string is the path name of the context where all CPU-related statistics for cpu0 are
held. The path name has the hosts/hostname/CPU/cpu0 format without a terminating slash, which is what
is expected by the subroutines that take a value path name as an argument. The addstat function is
shown in the next section. It uses three of the traversal functions to access the CPU-related statistics.
The only part of the main processing function in the main section yet to explain consists of lines 21
through 23. The first line simply tells the xmtopas daemon to start feeding observations of statistics for
an SpmiStatSet by issuing the RSiStartFeedx subroutine call. The next two lines define an infinite loop
that calls the RSiMainLoopx function to check for incoming data_feed packets.
There are two more subroutines concerned with controlling the flow of data feeds from the xmtopas daemon.
Neither is used in the sample program. These subroutines are RSiChangeFeedx and
RSiStopFeedx.
The use of RSiPathGetCxx by the sample program is shown in lines 8 through 12. Following that, in
lines 14 through 30, two subroutines are used to get all the statistics values defined for the CPU context.
This is done by using RSiFirstStatx and RSiNextStatx subroutines.
In lines 20-21, the short name of the context (“cpu0”) and the short name of the statistic are saved in two
arrays for use when printing the column headings. Lines 22-24 construct the full path name of the
statistics value by concatenating the full context path name and the short name of the value. This is
necessary before adding the value to the SpmiStatSet with the RSiPathAddSetStatx subroutine. The value
is added in lines 25 and 26.
Actual processing of received statistics values is done in lines 20-24. It involves the use of the library's
RSiGetValuex subroutine. The following is an example of output from the sample program RsiCons1:
$ RsiCons1 umbra
Traversing contexts
The adddisk function in the following list shows how the RSiFirstCxx, RSiNextCxx, and the
RSiInstantiatex subroutines are combined with RSiPathGetCxx to make sure all subcontexts are accessed.
The sample program's addstat internal function is used to add the statistics of each subcontext to the
SpmiStatSet structure. A programmer who wanted to traverse all levels of subcontexts below a start
context could easily create a recursive function to do this.
[01] int adddisk(int ix, struct SpmiStatSet *ssp, char *path)
[02] {
[03] int i = ix;
[04] char tmp[128];
[05] cx_handle *cxh;
[06] struct SpmiStatLink *statlink;
[07] struct SpmiCxLink *cxlink;
[08]
[09] cxh = RSiPathGetCxx(rsh, path);
[10] if ((!cxh) || (!cxh->cxt))
[11] {
The output from the RsiCons program when run on the xmtopas daemon on an AIX operating system
host is shown in the following example.
$ RsiCons encee
The $HOME/Rsi.hosts file has a simple layout. Only one keyword is recognized, and only if placed in
column one of a line. That keyword is nobroadcast, which means that the are_you_there message should not
be broadcast using method 1 shown previously. This option is useful in situations where a large number
of hosts are on the network and only a well-defined subset should be remotely monitored. To say that
you don't want broadcasts but want direct contact to three hosts, your $HOME/Rsi.hosts file might look
like this:
nobroadcast
birte.austin.ibm.com
gatea.almaden.ibm.com
umbra
This example shows that the hosts to monitor do not necessarily have to be in the same domain or on a
local network. However, doing remote monitoring across a low-speed communications line is unlikely to
be popular; neither with other users of that communications line nor with yourself.
Be aware that whenever you want to monitor remote hosts that are not on the same subnet as the
data-consumer host, you must specify the broadcast address of the other subnets or all the host names of
those hosts in the $HOME/Rsi.hosts file. The reason is that IP broadcasts do not propagate through IP
routers or gateways.
The following example illustrates a situation where you want to do broadcasting on all local interfaces,
want to broadcast on the subnet identified by the broadcast address 129.49.143.255, and also want to
invite the host called umbra. (The subnet mask corresponding to the broadcast address in this example is
255.255.240.0 and the range of addresses covered by the broadcast is 129.49.128.0 - 129.49.143.255.)
129.49.143.255
If the RSiInvitex subroutine detects that the name server is not operational or has an abnormally long
response time, it returns the IP addresses of hosts rather than the host names. If the name server fails
after the list of hosts is partly built, the same host may appear twice, once with its IP address and once
with its host name.
The execution time of the RSiInvitex subroutine depends primarily on the number of broadcast addresses
you place in the $HOME/Rsi.hosts file. Each broadcast address increases the execution time by roughly
50 milliseconds plus the time required to process the responses. The minimum execution time of the
subroutine is roughly 1.5 seconds, during which time your application only gets control if callback
functions are specified and if packets arrive that must be given to those callback functions.
Another sample program written to the data-consumer API is the chmon program. Source code for the
program is in the /usr/samples/perfmgr/chmon.c file. The chmon program is also stored as an executable
during the installation of the Manager component. Example output follows:
Data-Consumer API Remote Monitor for host Tue Apr 14 09:09:05 1992
CHMON Sample Program *** birte *** Interval: 5 seconds
Item Descriptor
seconds_interval Is the interval between observations. Must be specified in
seconds. No blanks must be entered between the flag and the
interval. Defaults to 5 seconds.
no_of_processes Is the number of “hot” processes to be shown. A process is
considered “hotter” the more CPU it uses. No blanks must be
entered between the flag and the count field. Defaults to 0 (no)
processes.
hostname Is the host name of the host to be monitored. Default is the local
host. The sample program exits after 2,000 observations have
been taken, or when you type the letter “q” in its window.
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program,
or service that does not infringe any IBM intellectual property right may be used instead. However, it is
the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can send
license inquiries, in writing, to:
For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided
by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or
any equivalent agreement between us.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice,
and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without
notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to actual people or business enterprises is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.
Each copy or any portion of these sample programs or any derivative work must include a copyright
notice as follows:
Portions of this code are derived from IBM Corp. Sample Programs.
This Software Offering does not use cookies or other technologies to collect personally identifiable
information.
If the configurations deployed for this Software Offering provide you as the customer the ability to collect
personally identifiable information from end users via cookies and other technologies, you should seek
your own legal advice about any laws applicable to such data collection, including any requirements for
notice and consent.
For more information about the use of various technologies, including cookies, for these purposes, see
IBM’s Privacy Policy at http://www.ibm.com/privacy and IBM’s Online Privacy Statement at
http://www.ibm.com/privacy/details the section entitled “Cookies, Web Beacons and Other
Technologies” and the “IBM Software Products and Software-as-a-Service Privacy Statement” at
http://www.ibm.com/software/info/product-privacy.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at
Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Index
A
API calls
  pm_delete_program 60
  pm_get_data 60
  pm_get_program 60
  pm_get_tdata 60
  pm_get_Tdata 60
  pm_reset_data 60
  pm_set_program 60
  pm_start 60
  pm_stop 60
  pm_tstart 60
  pm_tstop 60

B
bos.perf.libperfstat 5.2.0 file set 195

C
commands
  gprof 217
  prof 215
  tprof 219
counter multiplexing mode 61
  pm_get_data_mx 62
  pm_get_program_mx 62
  pm_get_tdata_mx 62
  pm_set_program_mx 62
CPU Utilization Reporting Tool
  see curt 2
curt 2
  Application Pthread Summary (by PID) Report 14
  Application Summary (by process type) Report 13
  Application Summary by Process ID (PID) Report 12
  Application Summary by Thread ID (Tid) Report 11
  basic syntax 2
  default reports 6
  Event Explanation 3
  Event Name 3
  examples 4
  flags 2
  FLIH Summary Report 19
  FLIH types 20
  General Information 6
  Global SLIH Summary Report 21
  Hook ID 3
  Kproc Summary (by Tid) Report 13
  measurement and sampling 3
  parameters 2
  Pending Pthread Calls Summary Report 19
  Pending System Calls Summary Report 15
  Processor Summary Report 9
  Pthread Calls Summary Report 19
  report overview 5
  sample report
    -e flag 22
    -p flag 26
    -P flag 29
    -s flag 23
    -t flag 24
  System Calls Summary Report 15
  System Summary Report 7

E
event list
  POWERCOMPAT 53
examples
  performance monitor APIs 63

G
gennames utility 35
global interfaces
  perfstat_cpu_util interface 105
  perfstat_partition_config interface 97
  perfstat_process 154
  perfstat_process_util 156
  perfstat_processor_pool_util 158

I
info stanza 199

K
kernel tuning 198
  attributes
    pre520tune 198
  commands 198
    flags 200
    tunchange 202
    tuncheck 203
    tundefault 205
    tunrestore 203
    tunsave 204
  commands syntax 200
  file manipulation commands 202
  initial setup 205
  introduction 198
  migration and compatibility 198
  reboot tuning procedures 206
  recovery procedure 206
  SMIT interface 206
  tunable parameters 198
  tunables file directory 199
  tunables parameters
    type 200

L
lastboot 199
lastboot.log 199
Printed in USA