|
|
Oracle to capture UNIX/Linux vmstat information
Don Burleson
|
It is a simple matter to create an Oracle table
to store this information and use a script to populate the table.
Creating the automated
vmstat monitor begins by creating an Oracle table to
contain the
vmstat output:
cr_vmstat_tab.sql
connect perfstat/perfstat;
drop table stats$vmstat;
create table stats$vmstat
(
start_date date,
duration number,
server_name varchar2(20),
runque_waits number,
page_in number,
page_out number,
user_cpu number,
system_cpu number,
idle_cpu number,
wait_cpu number
)
tablespace perfstat
storage (initial 10m
next 1m
pctincrease 0)
;
Now that we have defined an Oracle table to
capture the vmstat information, we need to write a UNIX script
that will execute vmstat, capture the vmstat output, and place it into
the Oracle table.
The main script to collect the vmstat
information is a Korn shell script called get_vmstat.ksh. As we
noted earlier, each dialect of UNIX displays vmstat information in
different columns, so we need slightly different scripts for each type
of UNIX.
The idea is to
write a script that continually runs the vmstat utility and then
directs the results into our Oracle table.
The script shows
the vmstat capture utility script for the Linux operating system.
Note that you
must change this script in several places to make it work for you:
·
You must set the ORACLE_HOME to your directory:
ORACLE_HOME=/usr/app/oracle/admin/product/8/1/6
·
You must set your ORACLE_SID in the sqlplus command:
$ORACLE_HOME/bin/sqlplus -s perfstat/perfstat@testsys1<<EOF
·
You can change the duration of samples by re-setting
SAMPLE_TIME UNIX variable:
get_vmstat.ksh (Linux version)
#!/bin/ksh
# This is the Linux version
ORACLE_HOME=/usr/app/oracle/admin/product/8/1/6
export ORACLE_HOME
PATH=$ORACLE_HOME/bin:$PATH
export PATH
SERVER_NAME=`uname -a|awk '{print $2}'`
typeset -u SERVER_NAME
export SERVER_NAME
# sample every five minutes (300 seconds) . . . .
SAMPLE_TIME=300
while true
do
vmstat ${SAMPLE_TIME} 2 > /tmp/msg$$
# run vmstat and direct the output into the Oracle table . . .
cat /tmp/msg$$|sed 1,3d | awk '{ printf("%s %s %s %s %s %s\n", $1,
$8, $9, SEE CODE DEPOT FOR
FULL SCRIPT
14, $15, $16) }' | while read RUNQUE PAGE_IN PAGE_OUT USER_CPU
SYSTEM_CPU
DLE_CPU
do
$ORACLE_HOME/bin/sqlplus -s perfstat/perfstat@testsys1<<EOF
insert into perfstat.stats\$vmstat
values (
sysdate,
$SAMPLE_TIME,
'$SERVER_NAME',
$RUNQUE,
$PAGE_IN,
$PAGE_OUT,
$USER_CPU,
$SYSTEM_CPU,
$IDLE_CPU,
0
);
EXIT
EOF
done
done
rm /tmp/msg$$
Because of the differences in implementations
of vmstat, the first task is to identify the columns of the vmstat
output that contain the information that we want to capture. Once we
know the columns that we want to capture, we can add these columns to
the vmstat script to put the output into our table.
|
|
Run queue |
Page-in |
Page-out |
|
User |
System |
Idle |
Wait |
Dialect |
|
column |
column |
column |
|
column |
column |
column |
column |
HP/UX |
|
1 |
8 |
9 |
|
16 |
17 |
18 |
NA |
AIX |
|
1 |
6 |
7 |
|
14 |
15 |
16 |
17 |
Solaris |
|
1 |
8 |
9 |
|
20 |
21 |
22 |
NA |
Linux |
|
1 |
8 |
9 |
|
14 |
15 |
16 |
NA |
Using this table, you can adjust the capture script
according to your operating system. You customize the script by
changing the line in the script that reads the vmstat output and
places it into the stats$vmstat table. Here is a summary of the UNIX
dialect changes to this line.
HP/UX vmstat Columns
cat
/tmp/msg$$|sed 1,3d |\
awk
'{ printf("%s %s %s %s %s %s\n", $1, $8, $9, $16, $17, $18) }' |\
while read RUNQUE PAGE_IN PAGE_OUT USER_CPU SYSTEM_CPU IDLE_CPU
IBM AIX vmstat Columns
cat
/tmp/msg$$|sed 1,3d |\
awk
'{ printf("%s %s %s %s %s %s\n", $1, $6, $7, $14, $15, $16, $17) }'
|\
while read RUNQUE PAGE_IN PAGE_OUT USER_CPU SYSTEM_CPU IDLE_CPU
WAIT_CPU
Sun Solaris vmstat Columns
cat
/tmp/msg$$|sed 1,3d |\
awk
'{ printf("%s %s %s %s %s %s\n", $1, $8, $9, $20, $21, $22) }' |\
while read RUNQUE PAGE_IN PAGE_OUT USER_CPU SYSTEM_CPU IDLE_CPU
Linux vmstat columns
cat
/tmp/msg$$|sed 1,3d |\
awk
'{ printf("%s %s %s %s %s %s\n", $1, $8, $9, $14, $15, $16) }' |\
while read RUNQUE PAGE_IN PAGE_OUT USER_CPU SYSTEM_CPU IDLE_CPU
Internals of the vmstat Capture Script
It is important to understand how the
get_vmstat.ksh script functions, so let's examine the steps in
this script:
1. It
executes the vmstat utility for the specified elapsed-time interval (SAMPLE_TIME=300).
2.
The output of the vmstat is directed into the /tmp directory
3.
The output is then parsed using the awk utility, and the values are
inserted into the mon_vmstats table.
Once started, the get_vmstat.ksh script
will run continually and capture the vmstats into your stats$vmstat
table. This script is an example of a UNIX daemon process, and it will
run continually to sample the server status. However, the script may
be terminated if you server is rebooted, so it is a good idea to place
a crontab entry to make sure that the get_vmstat script is always
running. Below is a script called run_vmstat.ksh that will
ensure that the vmstat utility is always running on your server.
Note that you must make the following changes
to this script:
·
Set the file location variable vmstat to the directory
that contains your get_vmstat.ksh script:
vmstat=`echo ~oracle/vmstat`
·
Create a small file in your UNIX file directory ($vmstat)
called mysid. This file will contain one line and specify the
name of your ORACLE_SID.
ORACLE_SID=`cat ${vmstat}/mysid`
run_vmstat.ksh
#!/bin/ksh
# First, we must set the environment . . . .
vmstat=`echo ~oracle/vmstat`
export vmstat
ORACLE_SID=`cat ${vmstat}/mysid`
export ORACLE_SID
ORACLE_HOME=`cat /etc/oratab|grep $ORACLE_SID:|cut -f2 -d':'`
export ORACLE_HOME
PATH=$ORACLE_HOME/bin:$PATH
export PATH
#----------------------------------------
# If it is not running, then start it . . .
#----------------------------------------
check_stat=`ps -ef|grep get_vmstat|grep -v grep|wc -l`;
SEE CODE DEPOT FOR FULL
SCRIPT
oracle_num=`expr $check_stat`
if [ $oracle_num -le 0 ]
then nohup $vmstat/get_vmstat_linux.ksh > /dev/null 2>&1 &
fi
The run_vmstat.ksh script can be
scheduled to run hourly on the server. As we can see by examining the
code, this script checks to see if the get_vmstat.ksh script is
executing. If it is not executing, the script resubmits it for
execution. In practice, the get_vmstat.ksh script will not
abort, but if the server is shut down and restarted, the script will
need to be restarted.
Here is an example of the UNIX crontab file.
For those not familiar with cron, the cron facility is a UNIX
scheduling facility that allows tasks to be submitted at specific
times. Note that it schedules the run_vmstat.ksh script every
hour, and runs a vmstat exception report every day at 7:00
a.m.
00 * * * * /home/vmstat/run_vmstat.ksh
> /home/vmstat/r.lst
00 7 * * * /home/vmstat/run_vmstat_alert.ksh prodb1 > /home/vmstat/v.lst
Now that we see how to monitor the Oracle
database server, let's examine how we can use this technique to report
on other Oracle-related server. This technique is very handy for
reporting on Oracle Web servers and application servers.
Reporting vmstat Information on Other Oracle Servers
To get a complete picture of the performance of
your total Oracle system, you must also monitor the behavior of all of
the servers that communicate with Oracle. For example, many Oracle
environments have other servers:
·
Oracle Applications In
Oracle Applications products, you generally have separate application
servers communicating with the database server.
·
SAP with Oracle In
SAP, you have separate application servers that communicate with
Oracle.
·
Real Application Clusters (Oracle Parallel Server) With
RAC, you have multiple Oracle database servers, all sharing the same
database.
·
Oracle Web Applications When
using Oracle databases on the Web, you have separate WebServers that
direct the communications into the database.
This technique in get_vmstat.ksh can
easily be extended to measure the performance of other servers in your
Oracle environment. Note that the stats$vmstat table has a column to
store the server name. Since we can separate vmstat metrics by server,
we simply need to create a remote vmstat script that will capture the
performance of the other servers and send the data to a central
database. Because only the database server contains an Oracle
database, the vmstat data will be sent to the database from the remote
server using database links.
Any server that has a Net8 client can be used to capture vmstat
information.
If we take a close look at the get_vmstat
script from above, we see that this script can be executed on a remote
server. The script will send the vmstat data to the server that
contains our Oracle database using a database link. Note where the
script enters sqlplus using "sqlplus perfstat/perfstat@prod".
By collecting the data remotely, we can capture
a complete picture of the performance of all of the components of the
Oracle environment, not just the database server. This is important in
cases where you need to track slow performance of ecommerce systems.
Using this vmstat information, you can go back to the time of the
slowdown and see which Web servers may have been overloaded and also
examine the load on the database server.
vmstat Reporting
The SQL script vmstat_alert.sql can
quickly give a complete exception report on all of the servers in our
Oracle environment. This report will display times when the CPU and
RAM memory exceed your predefined thresholds:
set lines
80;
set pages 999;
set feedback off;
set verify off;
column my_date heading 'date hour' format a20
column c2 heading runq format 999
column c3 heading pg_in format 999
column c4 heading pg_ot format 999
column c5 heading usr format 999
column c6 heading sys format 999
column c7 heading idl format 999
column c8 heading wt format 999
ttitle 'run queue > 2|May indicate an overloaded CPU|When runqueue
exceeds
the number of CPUs| on the server, tasks are waiting for service.';
select
server_name,
to_char(start_date,'YY/MM/DD HH24') my_date,
avg(runque_waits) c2,
avg(page_in) c3,
avg(page_out) c4,
avg(user_cpu) c5,
avg(system_cpu) c6,
avg(idle_cpu) c7
from
perfstat.stats$vmstat
WHERE
runque_waits > 2
and start_date > sysdate-&&1
group by
server_name,
to_char(start_date,'YY/MM/DD HH24')
ORDER BY
server_name,
to_char(start_date,'YY/MM/DD HH24')
;
ttitle 'page_in > 1|May indicate overloaded memory|Whenever Unix
performs
a page-in, the RAM memory | on the server has been exhausted and
swap pages are being used.';
select
server_name,
to_char(start_date,'YY/MM/DD HH24') my_date,
avg(runque_waits) c2,
avg(page_in) c3,
avg(page_out) c4,
avg(user_cpu) c5,
avg(system_cpu) c6,
avg(idle_cpu) c7
from
perfstat.stats$vmstat
WHERE
page_in > 1
and start_date > sysdate-&&1
group by
server_name,
to_char(start_date,'YY/MM/DD HH24')
ORDER BY
server_name,
to_char(start_date,'YY/MM/DD HH24')
;
ttitle 'user+system CPU > 70%|Indicates periods with a fully-loaded
CPU subssystem.|Periods of 100% utilization are only a | concern
when runqueue values exceeds the number of CPs on the server.';
select
server_name,
to_char(start_date,'YY/MM/DD HH24') my_date,
avg(runque_waits) c2,
avg(page_in) c3,
avg(page_out) c4,
avg(user_cpu) c5,
avg(system_cpu) c6,
avg(idle_cpu) c7
from
perfstat.stats$vmstat
WHERE
(user_cpu + system_cpu) > 70
and start_date > sysdate-&&1
group by
server_name,
to_char(start_date,'YY/MM/DD HH24')
ORDER BY
server_name,
to_char(start_date,'YY/MM/DD HH24')
;
The standard vmstat alert report is used to
alert the Oracle DBA and systems administrator to out-of-bounds
conditions on each Oracle server. These conditions include:
·
CPU waits > 40% (AIX version only) This
may indicate I/O-based contention. The solution is to spread files
across more disks or add buffer memory.
·
Run queue > xxx - (where xxx is the number of
CPUs on the server, 2 in this example) This
indicates an overloaded CPU. The solution is to add additional
processors to the server.
·
Page_in > 2 with correlated scan rates Page-in
operations can indicate overloaded memory. The solution is to reduce
the size of the Oracle SGA, PGA, or add additional RAM memory to the
server.
·
User CPU + System CPU > 90% This
indicates periods where the CPU is highly utilized.
While the SQL here is self-explanatory, let's
look at a sample report and see how it will help our systems
administrator monitor the server's behavior:
SQL> @vmstat_alert
7
Wed Dec 20
page 1
run queue > 2
May indicate an overloaded CPU.
When runqueue exceeds the number of CPUs
on the server, tasks are waiting for service.
SERVER_NAME date hour runq pg_in pg_ot usr sys idl
--------------- -------------------- ---- ----- ----- ---- ----
----
AD-01 01/13/04 17 3 0 0 87 5
8
Wed Dec
20
page 1
page_in > 1
May indicate overloaded memory.
Whenever Unix performs a page-in, the RAM memory
on the server has been exhausted and swap pages are being
used.
SERVER_NAME date hour runq pg_in pg_ot usr
sys idl
-------------------- -------------------- ---- ----- ----- ---- ----
----
AD-01 01/12/13 16 0 5 0 1
1 98
AD-01 01/12/14 09 0 5 0 10
2 88
AD-01 01/12/15 16 0 6 0 0
0 100
AD-01 01/12/19 20 0 29 2 1
2 98
PROD1DB 01/12/13 14 0 3 43 4
4 93
PROD1DB 01/12/19 07 0 2 0 1
3 96
PROD1DB 01/12/19 11 0 3 0 1
3 96
PROD1DB 01/12/19 12 0 6 0 1
3 96
PROD1DB 01/12/19 16 0 3 0 1
3 96
PROD1DB 01/12/19 17 0 47 68 5
5 91
Wed Dec
20
page 1
user+system > 70%
Indicates periods with a fully-loaded CPU sub-system.
Periods of 100% utilization are only a
concern when runqueue values exceeds the number of CPUs on the
server.
SERVER_NAME date hour runq pg_in pg_ot usr
sys idl
-------------------- -------------------- ---- ----- ----- ---- ----
----
AD-01 01/12/13 14 0 0 2 75
2 22
AD-01 01/12/13 17 3 0 0 87
5 8
AD-01 01/12/15 15 0 0 0 50
29 22
AD-01 01/12/15 16 0 0 0 48
33 20
AD-01 01/12/19 07 0 0 0 77
4 19
AD-01 01/12/19 10 0 0 0 70
5 24
AD-01 01/12/19 11 1 0 0 60
17 24
PROD1 01/12/19 12 0 0 1 52
30 18
PROD1 01/12/19 13 0 0 0 39
59 2
PROD1 01/12/19 14 0 0 0 39
55 6
PROD1 01/12/19 15 1 0 0 57
23 20
You may notice that this exception report gives
the hourly average for the vmstat information. If you look at the
get_vmstat.ksh script, you will see that the data is captured in
intervals of every 300 elapsed seconds (5-minute intervals). Hence, if
you see an hour where your server is undergoing stress, you can modify
your script to show the vmstat changes every five minutes. You can
also run this report in conjunction with other STATSPACK reports to
identify what tasks may have precipitated the server problem. The
stats$sql_summary table is especially useful for this purpose.
|