Archive for the ‘HA & HPC’ Category

map/reduce framework definition and introduction

March 27th, 2014 No comments

MapReduce is a parallel programming model that allows distributed processing on large data sets on a cluster of computers. The MapReduce framework is patented (http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331) by Google, but the ideas are freely shared and adopted in a number of open-source implementations.

MapReduce derives its ideas and inspiration from functional programming, where map and reduce are commonly used functions. A map function applies an operation or a function to each element in a list. For example, a multiply-by-two function on a list [1, 2, 3, 4] would generate another list as follows: [2, 4, 6, 8]. When such functions are applied, the original list is not altered. Functional programming favors keeping data immutable and avoids sharing mutable data among multiple processes or threads. This means the map function just illustrated, trivial as it may be, could be run by two or more threads on the same list without those threads stepping on each other, because the list itself is not altered.

In addition to map, functional programming has the concept of a reduce function, more commonly known as a fold function and sometimes called an accumulate, compress, or inject function. A reduce or fold function applies a function to all elements of a data structure, such as a list, and produces a single result. So applying a reduce function such as summation to the list generated by the map function, that is, [2, 4, 6, 8], would produce 20.

So map and reduce functions could be used in conjunction to process lists of data, where a function is first applied to each member of a list and then an aggregate function is applied to the transformed and generated list.
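As a minimal sketch of the two steps just described, the following shell snippet doubles each element of a list (map) and then sums the result (reduce). The variable names are illustrative and do not come from the book; note that the original list is left untouched.

```shell
#!/bin/sh
# Original list; it is never modified below.
numbers="1 2 3 4"

# map: apply "multiply by two" to each element, building a new list.
doubled=""
for n in $numbers; do
  doubled="$doubled $(( $n * 2 ))"
done
doubled=${doubled# }   # drop the leading space

# reduce (fold): combine all elements into a single value by summation.
sum=0
for n in $doubled; do
  sum=$(( $sum + $n ))
done

echo "original: $numbers"
echo "mapped:   $doubled"
echo "reduced:  $sum"
```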

This same simple idea of map and reduce has been extended to work on large data sets. The idea is slightly modified to work on collections of tuples or key/value pairs. The map function applies a function on every key/value pair in the collection and generates a new collection. Then the reduce function works on the new generated collection and applies an aggregate function to compute a final output. This is better understood through an example, so let me present a trivial one to explain the flow. Say you have a collection of key/value pairs as follows:

[{ "94303": "Tom"}, {"94303": "Jane"}, {"94301": "Arun"}, {"94302": "Chen"}]

This is a collection of key/value pairs where the key is the zip code and the value is the name of a person who resides within that zip code. A simple map function on this collection could get the names of all those who reside in a particular zip code. The output of such a map function is as follows:

[{"94303": ["Tom", "Jane"]}, {"94301": ["Arun"]}, {"94302": ["Chen"]}]

Now a reduce function could work on this output to simply count the number of people who belong to a particular zip code. The final output then would be as follows:

[{"94303": 2}, {"94301": 1}, {"94302": 1}]

This example is extremely simple and a MapReduce mechanism seems too complex for such a manipulation, but I hope you get the core idea behind the concepts and the flow.
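The zip code flow can be sketched on a single machine with standard shell tools. This is only an illustration of the data flow, not a distributed MapReduce job: the input is written as one "zip name" pair per line, and the variable names are assumptions made for this example.

```shell
#!/bin/sh
# Input collection: one "zip name" pair per line.
pairs='94303 Tom
94303 Jane
94301 Arun
94302 Chen'

# Map/group step: gather the names recorded under each zip code.
grouped=$(printf '%s\n' "$pairs" |
  awk '{ if ($1 in names) names[$1] = names[$1] "," $2; else names[$1] = $2 }
       END { for (zip in names) print zip, names[zip] }' |
  sort)

# Reduce step: count the names under each zip code.
counts=$(printf '%s\n' "$grouped" |
  awk '{ print $1, split($2, parts, ",") }')

echo "$grouped"
echo "$counts"
```

The first echo prints each zip code with its list of names, and the second prints each zip code with its count, mirroring the two outputs shown above.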

PS:

This article is from the book Professional NoSQL.

Common storage multi path Path-Management Software

December 12th, 2013 No comments
Vendor | Path-Management Software | URL
Hewlett-Packard | AutoPath, SecurePath | www.hp.com
Microsoft | MPIO | www.microsoft.com
Hitachi | Dynamic Link Manager | www.hds.com
EMC | PowerPath | www.emc.com
IBM | RDAC, MultiPath Driver | www.ibm.com
Sun | MPXIO | www.sun.com
VERITAS | Dynamic Multipathing (DMP) | www.veritas.com
Categories: HA, Hardware, IT Architecture, SAN, Storage Tags:

Configuring Active/Passive Clustering for Apache Tomcat in Oracle RAC

October 1st, 2013 1 comment

Note: this is from the book Pro Oracle Database 11g RAC on Linux.

A slightly more complex example involves making Apache Tomcat or another web-accessible
application highly available. The difference in this setup compared to the database setup described in
the previous chapter lies in the fact that you need to use a floating virtual IP address. Floating in this
context means that the virtual IP address moves jointly with the application. Oracle calls its
implementation of a floating VIP an application VIP. Application VIPs were introduced in Oracle
Clusterware 10.2. Previous versions only had a node VIP.
The idea behind application VIPs is that, in the case of a node failure, both VIP and the application
migrate to the other node. The example that follows makes Apache Tomcat highly available, which is
accomplished by installing the binaries for version 6.0.26 in /u01/tomcat on two nodes in the cluster. The
rest of this section outlines the steps you must take to make Apache Tomcat highly available.
Oracle Grid Infrastructure does not provide an application VIP by default, so you have to create one.
A new utility, called appvipcfg, can be used to set up an application VIP, as in the following example:

[root@london1 ~]# appvipcfg
Production Copyright 2007, 2008, Oracle.All rights reserved

Usage: appvipcfg create -network=<network_number> -ip=<ip_address> -vipname=<vipname>
-user=<user_name>[-group=<group_name>]
delete -vipname=<vipname>
[root@london1 ~]# appvipcfg create -network=1 \
> -ip 172.17.1.108 -vipname httpd-vip -user=root
Production Copyright 2007, 2008, Oracle.All rights reserved
2010-06-18 16:07:12: Creating Resource Type
2010-06-18 16:07:12: Executing cmd: /u01/app/crs/bin/crsctl add type app.appvip.type -basetype
cluster_resource -file /u01/app/crs/crs/template/appvip.type
2010-06-18 16:07:13: Create the Resource
2010-06-18 16:07:13: Executing cmd: /u01/app/crs/bin/crsctl add resource httpd-vip -type
app.appvip.type -attr USR_ORA_VIP=172.17.1.104,START_DEPENDENCIES=hard(ora.net1.network)
pullup(ora.net1.network),STOP_DEPENDENCIES=hard(ora.net1.network),ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:root:r-x'

The preceding output shows that the new resource has been created, and it is owned by root
exclusively. You could use crsctl setperm to change the ACL, but this is not required for this process.
Bear in mind that no account other than root can start the resource at this time. You can verify the result
of this operation by querying the resource just created. Note how the httpd-vip does not have an ora.
prefix:

[root@london1 ~]# crsctl status resource httpd-vip
NAME=httpd-vip
TYPE=app.appvip.type
TARGET=OFFLINE
STATE=OFFLINE

Checking the resource profile reveals that it matches the output of the appvipcfg command; the
output has been shortened for readability, and it focuses only on the most important keys (the other
keys were removed for the sake of clarity):

[root@london1 ~]# crsctl stat res httpd-vip -p
NAME=httpd-vip
TYPE=app.appvip.type
ACL=owner:root:rwx,pgrp:root:r-x,other::r--,user:root:r-x
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=restore
CARDINALITY=1
CHECK_INTERVAL=1
DEGREE=1
DESCRIPTION=Application VIP
RESTART_ATTEMPTS=0
SCRIPT_TIMEOUT=60
SERVER_POOLS=*
START_DEPENDENCIES=hard(ora.net1.network) pullup(ora.net1.network)
STOP_DEPENDENCIES=hard(ora.net1.network)
USR_ORA_VIP=172.17.1.108
VERSION=11.2.0.1.0

The dependencies on the network ensure that, if the network is not started, it will be started as part
of the VIP start. The resource is controlled by the CRSD orarootagent because changes to the network
configuration require root privileges in Linux. The status of the resource revealed it was stopped; you
can use the following command to start it:

[root@london1 ~]# crsctl start res httpd-vip
CRS-2672: Attempting to start 'httpd-vip' on 'london2'
CRS-2676: Start of 'httpd-vip' on 'london2' succeeded
[root@london1 ~]#

In this case, Grid Infrastructure decided to start the resource on server london2.

[root@london1 ~]# crsctl status resource httpd-vip
NAME=httpd-vip
TYPE=app.appvip.type
TARGET=ONLINE
STATE=ONLINE on london2

You can verify this by querying the network setup, which has changed. The following output is again
shortened for readability:

[root@london2 source]# ifconfig

eth0:3 Link encap:Ethernet HWaddr 00:16:36:2B:F2:F6
inet addr:172.17.1.108 Bcast:172.17.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

Next, you need an action script that controls the Tomcat resource. Again, the requirement is to
implement start, stop, clean, and check functions in the action script. The Oracle documentation lists
C, C++, and shell scripts as candidate languages for an action script. We think that the action script can
be any executable, as long as it returns 0 or 1, as required by Grid Infrastructure. A sample action script
that checks for the Tomcat webserver could be written in plain bash, as in the following example:

#!/bin/bash

export CATALINA_HOME=/u01/tomcat
export ORA_CRS_HOME=/u01/app/crs
# use the JDK found under the Grid Infrastructure home
export JAVA_HOME=$ORA_CRS_HOME/jdk
export CHECKURL="http://172.17.1.108:8080/tomcat-power.gif"

case $1 in
'start')
    $CATALINA_HOME/bin/startup.sh
    RET=$?
    ;;
'stop')
    $CATALINA_HOME/bin/shutdown.sh
    RET=$?
    ;;
'clean')
    $CATALINA_HOME/bin/shutdown.sh
    RET=$?
    ;;
'check')
    # download a simple, small image from the tomcat server
    /usr/bin/wget -q --delete-after $CHECKURL
    RET=$?
    ;;
*)
    RET=0
    ;;
esac
# A 0 indicates success, return 1 for an error.
if [ $RET -eq 0 ]; then
    exit 0
else
    exit 1
fi

In our installation, we created a $GRID_HOME/hadaemon/ directory on all nodes in the cluster to save
the Tomcat action script, tomcat.sh.
The next step is to ensure that the file is executable and to test it to see whether it works as
expected. Once you are confident that the script is working, you can add the Tomcat resource.
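One way to exercise an action script before registering it is to call each action by hand and look at the exit code. The sketch below shows the pattern against a small stand-in script so that it runs anywhere; the stub, its marker file, and the run_action helper are hypothetical and are not part of the book's example.

```shell
#!/bin/sh
# Hypothetical stand-in for tomcat.sh so the sketch runs anywhere: it keeps
# its "running" state in a marker file instead of starting Tomcat.
workdir=$(mktemp -d)
script="$workdir/action.sh"
cat > "$script" <<'EOF'
#!/bin/sh
marker="$(dirname "$0")/running"
case "$1" in
  start)      touch "$marker" ;;
  stop|clean) rm -f "$marker" ;;
  check)      [ -f "$marker" ] ;;
  *)          exit 1 ;;
esac
EOF
chmod +x "$script"

# Run one action and print its exit code (0 = success, 1 = error).
run_action() {
  rc=0
  "$script" "$1" || rc=$?
  echo "$1 -> $rc"
}

results=$(run_action check; run_action start; run_action check
          run_action stop; run_action check)
echo "$results"
rm -rf "$workdir"
```

Against the real script the same sequence applies: make tomcat.sh executable with chmod +x, then call it with check, start, check, stop, and check in turn, confirming that check fails only while Tomcat is down.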
The easiest way to configure the new resource is by creating a text file with the required attributes,
as in this example:

[root@london1 hadaemon]# cat tomcat.profile
ACTION_SCRIPT=/u01/app/crs/hadaemon/tomcat.sh
PLACEMENT=restricted
HOSTING_MEMBERS=london1 london2
CHECK_INTERVAL=30
RESTART_ATTEMPTS=2
START_DEPENDENCIES=hard(httpd-vip)
STOP_DEPENDENCIES=hard(httpd-vip)

The following command registers the resource tomcat in Grid Infrastructure:

[root@london1 ~]# crsctl add resource tomcat -type cluster_resource -file tomcat.profile

Again, the profile registered matches what has been defined in the tomcat.profile file, plus the
default values:

[root@london1 hadaemon]# crsctl status resource tomcat -p
NAME=tomcat
TYPE=cluster_resource
ACL=owner:root:rwx,pgrp:root:r-x,other::r--
ACTION_SCRIPT=/u01/app/crs/hadaemon/tomcat.sh
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/scriptagent
AUTO_START=restore
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION=
ENABLED=1
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=london1 london2
LOAD=1
LOGGING_LEVEL=1

NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=restricted
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=2
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(httpd-vip)
START_TIMEOUT=0
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(httpd-vip)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1h

This example includes a hard dependency on the httpd-vip resource, which is started now. If you
try to start the Tomcat resource, you will get the following error:

[root@london1 hadaemon]# crsctl start resource tomcat
CRS-2672: Attempting to start 'tomcat' on 'london1'
CRS-2674: Start of 'tomcat' on 'london1' failed
CRS-2527: Unable to start 'tomcat' because it has a 'hard' dependency
on 'httpd-vip'
CRS-2525: All instances of the resource 'httpd-vip' are already running;
relocate is not allowed because the force option was not specified
CRS-4000: Command Start failed, or completed with errors.

To get around this problem, you need to begin by shutting down httpd-vip and then try again:

[root@london1 hadaemon]# crsctl stop res httpd-vip
CRS-2673: Attempting to stop 'httpd-vip' on 'london1'
CRS-2677: Stop of 'httpd-vip' on 'london1' succeeded
[root@london1 hadaemon]# crsctl start res tomcat
CRS-2672: Attempting to start 'httpd-vip' on 'london1'
CRS-2676: Start of 'httpd-vip' on 'london1' succeeded
CRS-2672: Attempting to start 'tomcat' on 'london1'
CRS-2676: Start of 'tomcat' on 'london1' succeeded

The Tomcat servlet and JSP container is now highly available. However, please bear in mind that the
session state of an application will not fail over to the passive node in the case of a node failure. The
preceding example could be further enhanced by using a shared cluster logical ACFS volume to store the
web applications used by Tomcat, as well as the Tomcat binaries themselves.

Categories: HA, Oracle DB Tags:

oracle ocfs2 cluster filesystem best practise

May 21st, 2013 No comments
  • To check current settings of o2cb, check files under /sys/kernel/config/cluster/ocfs2/
  • To set new value for o2cb:

service o2cb unload
service o2cb configure

heartbeat dead threshold 151 #Iterations before a node is considered dead
network idle timeout 120000 #Time in ms before a network connection is considered dead
network keepalive delay 5000 #Max time in ms before a keepalive packet is sent
network reconnect delay 5000 #Min time in ms between connection attempts

service o2cb load

service o2cb status #will show new configuration if OVS in server pool; or it will show offline
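For the first bullet above, a small helper can print every setting file under the cluster's configfs directory together with its value. This is a sketch: the show_settings function is a name made up for this example, and it assumes the cluster is called ocfs2, as in the path given above.

```shell
#!/bin/sh
# Print "relative/path = value" for each regular file directly under the
# given directory and one level below it.
show_settings() {
  dir=$1
  for f in "$dir"/* "$dir"/*/*; do
    [ -f "$f" ] || continue
    value=$(cat "$f" 2>/dev/null) || value="(unreadable)"
    printf '%s = %s\n' "${f#"$dir"/}" "$value"
  done
}

# Path from the post; prints nothing if the cluster stack is not loaded.
show_settings /sys/kernel/config/cluster/ocfs2
```

Running it before and after the reconfiguration makes it easy to confirm that the new timeouts took effect.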

PS:

o2cb is the default cluster stack for the OCFS2 file system. It includes:
  • a node manager (o2nm) to keep track of the nodes in the cluster,
  • a heartbeat agent (o2hb) to detect live nodes
  • a network agent (o2net) for intra-cluster node communication
  • a distributed lock manager (o2dlm) to keep track of lock resources
  • All these components are in-kernel.
  • It also includes an in-memory file system, dlmfs, to allow userspace to access the in-kernel dlm
  • main conf files: /etc/ocfs2/cluster.conf, /etc/sysconfig/o2cb
  • more info here https://oss.oracle.com/projects/ocfs2-tools/dist/documentation/v1.4/o2cb.html
Categories: Clouding, HA, HA & HPC, Oracle Cloud Tags:

How HA is achieved in Oracle Exadata

November 27th, 2012 No comments
  1. Each Exadata Database Machine has completely redundant hardware including redundant InfiniBand networking, redundant Power Distribution Units (PDU), redundant power supplies, and redundant database and storage servers.
  2. Oracle RAC protects against database server failure.
  3. ASM provides data mirroring to protect against disk or storage server failures.
  4. Oracle RMAN provides extremely fast and efficient backups to disk or tape.
  5. Oracle’s Flashback technology allows backing out user errors at the database, table or even row level.
  6. Using Oracle Data Guard, a second Exadata Database Machine can be configured to maintain a real-time copy of the database at a remote site to provide full protection against site failures and disasters.

resolved – opcmsgm isn’t running

October 22nd, 2012 No comments

If you encounter the problem of opcmsgm not running, you can do the following to resolve the issue:

  • check current status:

Control Manager opcctlm (22616) is running
Action Manager opcactm (22628) is running
Message Manager opcmsgm isn’t running
TT & Notify Mgr opcttnsm (22630) is running
Forward Manager opcforwm (22631) is running
Service Engine opcsvcm (22636) is running
Cert. Srv Adapter opccsad (22634) is running
BBC config adapter opcbbcdist (22635) is running
Display Manager opcdispm (22632) is running
Distrib. Manager opcdistm (22633) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted

  • restart opcsv
testserver:root root # opcsv -stop
testserver:root root # opcsv -start
  • check status again

testserver:root root # opcsv -status
OVO Management Server status:
-----------------------------
Control Manager opcctlm (15575) is running
Action Manager opcactm (15602) is running
Message Manager opcmsgm (15603) is running
TT & Notify Mgr opcttnsm (15606) is running
Forward Manager opcforwm (15607) is running
Service Engine opcsvcm (15620) is running
Cert. Srv Adapter opccsad (15612) is running
BBC config adapter opcbbcdist (15613) is running
Display Manager opcdispm (15610) is running
Distrib. Manager opcdistm (15611) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted

Categories: HA, IT Architecture Tags:

veritas vcs 5.1 on solaris 5.10 changes of restarting procedure

July 26th, 2012 No comments

With VCS 5.1 on Solaris 10, starting and stopping VCS is no longer controlled by the /etc/rc*.d/S* scripts; it is under SMF control.
In addition, if VCS is set up manually, the start and stop settings in the files under /etc/default (gab, llt, vcs, vxfen, and so on) need to be set to 1.
For example:

VCS_START=1
VCS_STOP=1

More interestingly, in a one-node VCS cluster the SMF service for VCS is not system/vcs:default; it is system/vcs-onenode:default.
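To review these settings in one pass, a short loop can print the start and stop lines from each file. This is a sketch: check_vcs_defaults is a hypothetical helper name, and it assumes the other files follow the same NAME_START / NAME_STOP pattern as the VCS_START and VCS_STOP lines shown above.

```shell
#!/bin/sh
# Print the *_START= and *_STOP= lines of each VCS-related defaults file,
# prefixed with the file name. The directory defaults to /etc/default.
check_vcs_defaults() {
  dir=${1:-/etc/default}
  for name in llt gab vxfen vcs; do
    file="$dir/$name"
    if [ -f "$file" ]; then
      sed -n -e "/^[A-Z_]*_START=/s/^/$name: /p" \
             -e "/^[A-Z_]*_STOP=/s/^/$name: /p" "$file"
    else
      echo "$name: no $file"
    fi
  done
}

check_vcs_defaults
```

Any line that still shows 0 on a manually configured node is a candidate for the change described above.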

Categories: HA, HA & HPC, IT Architecture Tags:

vcs commands hang consistently

June 8th, 2012 No comments

Today we encountered an issue where Veritas VCS commands hung consistently. Commands such as haconf -dump -makero got stuck for so long that we had to terminate them from the console. Tracing system calls and signals with truss (on Solaris; strace is the Linux equivalent) produced the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)

After some digging on the internet, we found the following solution to this weird problem:

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfig GAB & llt.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload GAB & llt modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart llt.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart gab.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS :
# hastart -force
# VCS: starting on: <node_name>

Categories: HA, HA & HPC, IT Architecture Tags:

impact of restart vxconfigd on solaris and linux – VxVM Configuration Daemon

May 30th, 2012 No comments

Stopping and restarting the VxVM Configuration Daemon, vxconfigd, may cause your VxVA, VMSA and/or VEA session to exit. It may also cause a momentary stoppage of any VxVM configuration actions. This should not harm any data; however, it may cause some configuration operations (e.g. moving subdisks, plex resynchronization) to abort unexpectedly. Complete any VxVM configuration changes before restarting vxconfigd.

If you are using EMC PowerPath devices with Veritas Volume Manager, you must run the EMC command(s) 'powervxvm setup' (or 'safevxvm setup') and/or 'powervxvm online' (or 'safevxvm online') if this script terminates abnormally. Also, if VCS service groups are running on the host, restarting vxconfigd may cause a failover, so it is best to freeze the service groups beforehand. For details, see: http://www.doxer.org/learn-linux/differences-between-freezing-vcs-system-and-freezing-service-group/
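The freeze-then-restart order can be sketched as a dry run that only prints the commands. The helper name and the service group names below are hypothetical; the sketch assumes a temporary (non-persistent) freeze with hagrp -freeze and a restart with vxconfigd -k, so check both against the documentation for your VCS and VxVM versions before running anything.

```shell
#!/bin/sh
# Print (not run) the commands for a vxconfigd restart with the given
# service groups frozen around it. Group names are supplied by the caller.
plan_vxconfigd_restart() {
  for group in "$@"; do
    echo "hagrp -freeze $group"
  done
  echo "vxconfigd -k"
  for group in "$@"; do
    echo "hagrp -unfreeze $group"
  done
}

# Hypothetical service group names, for illustration only.
plan=$(plan_vxconfigd_restart websg dbsg)
echo "$plan"
```

Printing the plan first gives a chance to review the group list before any group is actually frozen.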

Categories: HA, HA & HPC Tags:

vcs service group and resource attributes dictionary page

May 22nd, 2012 No comments

Here are all the Veritas VCS service group and resource attributes with their explanations, as a crib sheet/cheat sheet (this is actually the content of /etc/VRTSvcs/conf/attributes/cluster_attrs.xml):

Administrators Contains list of users with Administrator privileges.
 AllowNativeCliUsers If user does not have root privileges, and if this attribute is set to 0 (false), user is prompted for a password when issuing ha-xxx commands. If this attribute is set to 1(true), the user is not prompted; instead, VCS validates OS user’s login against VCS’ list of user IDs and assigns appropriate privileges. Default = 0(false).
 ClusterLocation Specifies the location of the cluster.
 ClusterName Arbitrary string containing the name of cluster.
 ClusterOwner This attribute is used for VCS notification; specifically, VCS sends notifications to persons designated in this attribute when something goes wrong with the cluster.
 CompareRSM Indicates if VCS engine is to verify that Replicated State Machine is consistent. This can be set by using the hadebug command.
 CounterInterval Intervals counted by the attribute GlobalCounter indicating approximately how often a broadcast will happen that will cause the GlobalCounter attribute to increase. The default value of the GlobalCounter increment can be modified by changing CounterInterval. If you increase this attribute to exceed five seconds, consider increasing the default value of the ShutdownTimeout attribute
 DumpingMembership Indicates that the engine is writing to disk.
 EngineClass Indicates the scheduling class for the VCS engine (had).
 EnginePriority Indicates the priority in which had runs. This attribute has no effect for windows environment.
 GlobalCounter This counter increases incrementally by one for each counter interval. It increases when the broadcast is received. VCS uses the GlobalCounter attribute to measure the time it takes to shut down a system. By default, the GlobalCounter attribute is updated every five seconds. This default value, combined with the 60-second default value of ShutdownTimeout, means if system goes down within twelve increments of GlobalCounter, it is treated as a fault. The default value of GlobalCounter increment can be modified by changing the CounterInterval attribute.
 GroupLimit Maximum number of service groups.
 HacliUserLevel This attribute has three, case-sensitive values:
 LockMemory Controls the locking of VCS engine pages in memory. This attribute has three values: ALL: Locks all current and future pages. CURRENT: Locks current pages.
 LogSize Size of the log file. Minimum value 64 KB  Maximum value 128 MB.
 MajorVersion Major version of system’s join protocol.
 MinorVersion Minor version of system’s join protocol.
 Notifier Indicates the status of the notifier in the cluster; specifically:
 Operators Contains list of users with Operator privileges.
 ProcessClass Indicates the scheduling class for had processes (for example, triggers).
 ProcessPriority Indicates the priority of had processes (for example, triggers). This attribute has no effect for windows environment.
 PrintMsg If set to 1 (true) , enables logging TagM messages in engine log.
 ReadOnly Indicates the mode of cluster configuration.
 ResourceLimit Maximum number of resources.
 SourceFile File from which the configuration was read.
 TypeLimit Maximum number of resource types.
 UserNames List of VCS user names.
 VCSMode Denotes the mode for which VCS is licensed, including VCS, VCS_QUICKSTART, and VCS_OPS.
 LinkMonitoring Enables link monitoring.
 NotifyList Stores notification list consisting of recipient’s email addresses, separated by spaces.
 MaxFactor For internal use only.
 LoadSampling For internal use only.
 Factor For internal use only.
 Stewards Specifies the IP address/hostname of systems running the steward process.
 ClusterAddress Specifies the cluster’s virtual IP address (used by a remote cluster when connecting to the local cluster).
 ClusterUUID Indicates the unique cluster identification assigned to the cluster by the Availability Manager.
 ClusState Indicates the current state of the cluster
 AutoStartTimeout If the local cluster cannot communicate with one or more remote clusters, this attribute specifies the number of seconds the VCS engine waits before initiating the AutoStart process for an AutoStart global service group.
 PanicOnNoMem For internal use only.
 UseFence Indicates whether the cluster uses SCSI III I/O fencing.
 VCSFeatures Indicates which VCS features are enabled.
 ClusterTime The number of seconds since January 1, 1970. This is defined by the lowest node in running state.
 WACPort The TCP port on which the WAC (Wide Area Connector) process on the local cluster listens for connection from remote clusters. The attribute can take a value from 0 to 65535.
 SecureClus Indicates whether the cluster is secured. VCS runs in the Secure mode using VxSS; all public network communications use SSL. VCS users belong to the platform user base and VCS does not store user passwords. This value cannot be changed while the cluster is running.
 AvailableCapacity Available Capacity = Capacity – Current System Load
 Capacity Value expressing total load capacity of system. This value is relative to other systems in the cluster and does not reflect any real value associated with the system. Default=100
 ConfigBlockCount Number of 512-byte blocks in configuration when the system joined the cluster.
 ConfigCheckSum Sixteen-bit checksum of configuration identifying when the system joined the cluster.
 ConfigDiskState State of configuration on the disk when the system joined the cluster.
 ConfigFile Directory containing the configuration files.
 ConfigModDate Last modification date of configuration when the system joined the cluster.
 CurrentLimits System-maintained calculation of current value of Limits. CurrentLimits = Limits – (additive value of all service group Prerequisites).
 CPUUsage Indicates the CPUUsage of the system in the form of CPU percentage utilization. This attribute’s value is valid if the Enabled value in CPUUsageMonitoring attribute equals 1. This value is updated when there is a change of 5 percent since the last indicated value.
 CPUUsageMonitoring Monitors the system’s CPU usage using various factors. The default value for this attribute is CPUUsageMonitoring = {Enabled = 0, NotifyThreshold = 0, NotifyTimeLimit = 0, ActionThreshold = 0, ActionTimeLimit = 0, Action = NONE}
 DiskHbStatus Indicates status of communication disks on the system.
 DynamicLoad System-maintained value of current dynamic load. The value is set external to VCS with the hasys -load command.
 Frozen Indicates if service groups can be brought online or taken offline on the system. Groups cannot be brought online or taken offline if the attribute value is 1 (true).
 GUIIPAddr Determines the local IP address that VCS uses to accept connections. Incoming connections over other IP addresses are dropped. If GUIIPAddr is not set, the default behavior is to accept external connections over all configured local IP addresses.
 Limits An unordered set of name=value pairs denoting specific resources available on a system. Names are arbitrary and are set by the administrator for any value; names are not obtained from the system. The format for Limits is:Limits() = { Name=Value, Name2=Value2 }.
 Location Denotes the location of the system.
 LinkHbStatus Indicates status of private network links on any system.
 LoadTimeCounter System-maintained internal counter of how many seconds the system load has been above LoadWarningLevel. This value resets to zero anytime system load drops below the value in LoadWarningLevel.
 LoadTimeThreshold Indicates length of time a system must remain at or above LoadWarningLevel before the loadwarning trigger is fired. Default = 600 seconds.
 LoadWarningLevel Defines, as a percentage of system Capacity, the level at which load has reached a critical limit. Default = 80 percent.
 MajorVersion Major version of system’s join protocol.
 MinorVersion Minor version of system’s join protocol.
 NodeId System ID specified in “/etc/llttab”.
 OnGrpCnt Number of groups that are online, or about to go online.
 ShutdownTimeout Determines whether to treat system reboot as a fault for service groups running on the system. On many systems, when a reboot occurs the processes are killed first, then the system goes down. When the VCS engine is killed, service groups that include the failed system in their SystemList attributes are autodisabled. However, if the system goes down within the number of seconds designated in ShutdownTimeout, service groups previously online on the failed system are treated as faulted and failed over. If you do not want to treat the system reboot as a fault, set the value for this attribute to 0. Default = 120 seconds
 SourceFile File from which the configuration was read.
 SysInfo Provides platform-specific information, including the name, version, and release of the operating system, the name of the system on which it is running, and the hardware type.
 SystemOwner This attribute is used for VCS email notification and logging. VCS sends email notification to the person designated in this attribute when an event occurs related to the system.
 SysState System state such as running , faulted , exited.
 TFrozen Indicates if a group can be brought online or taken offline on the system.
 TRSE Indicates in seconds the time to Regular State Exit. Time is calculated as the duration between the events of VCS losing port h membership and of VCS losing port a membership of GAB.
 UpDownState This attribute has four values DOWN, UP BUT NOT IN CLUSTER MEMBERSHIP, UP AND IN JEOPARDY, UP.
 UserInt Stores a system’s integer value.
 UserStr Stores a system’s String value.
 DiskHbDown Indicates if communication disks are down on any system. Enabled by the LinkMonitoring attribute.
 LinkHbDown Indicates if private network links are down on any system. Enabled by the LinkMonitoring attribute.
 Load Normalized value of system load used to compare systems in load balancing. Value is determined by dividing the raw values of LoadRaw by the values of Factor.
 LoadRaw List of load-calculation criteria and their associated raw values over the last five seconds.
 SysName The name of the system. The name must begin with a letter and must only contain letters, numbers, dashes (-), and underscores (_).
 SystemLocation Denotes the location of the system.
 LLTNodeId Displays the node ID (as defined in llthosts.txt) for the node.
 ConfigInfoCnt For internal use only.
 AgentsStopped The attribute is set to 1 for a system when all agents running on that system are stopped.
 NoAutoDisable When set to 0, this attribute autodisables service groups when the VCS engine is taken down. Groups remain autodisabled until the engine is brought up (regular membership). Setting this attribute to 1 bypasses the autodisable feature.
 LicenseKey LicenseKey
 VCSFeatures Indicates which VCS features are enabled.
 LicenseType Indicates the license type of the base VCS key used by the system. Possible values are:
 VCSMode Denotes the mode for which VCS is licensed, including VCS, Traffic Director, and VCS_OPS.
 CPUBinding Binds the HAD process to the specified CPU.
 EngineRestarted Indicates whether the VCS engine (HAD) was restarted by the hashadow process on a node in the cluster. The value 1 indicates that the engine was restarted; 0 indicates it was not restarted.
 ConnectorState Indicates the state of the wide-area connector (WAC). If 0, WAC is not running. If 1, WAC is running and communicating with the VCS engine.
 ActiveCount Number of resources in a service group that are active (online or waiting to go online). When the number of resources drops to zero, the service group is considered offline.
 Administrators List of VCS users with privileges to administer the group.
 AutoDisabled Indicates that VCS does not know the status of a service group (or specified system for parallel service groups). This is due to: 1) Group not probed (on specified system for parallel groups) in the SystemList attribute. 2) VCS engine is not running on a node designated in the SystemList attribute, but the node is visible.
 AutoFailOver Indicates whether VCS initiates an automatic failover if the service group faults.
 AutoRestart Restarts a service group after a faulted persistent resource becomes online. This attribute applies to persistent resources only. Default value is 1(true).
 AutoStart Designates whether a service group is automatically started when VCS is started. Default value is 1(true).
 AutoStartIfPartial Indicates whether to initiate bringing a service group online if the group is probed and discovered to be in a PARTIAL state when VCS is started. Default =1.
 AutoStartList List of systems on which, under specific conditions, the service group will be started with  VCS (usually at system boot). For example, if a system is a member of a failover service group’s AutoStartList attribute, and if it is not already running on another system in the cluster, the group is brought online when the system is started.
 AutoStartPolicy Sets the policy VCS uses to determine on which system to bring a service group online if multiple systems are available.
 CurrentCount Number of systems on which the service group is active.
 Enabled Indicates if a group can be failed over or brought online. If any of the local values are disabled, the group is disabled. Default value is 1(true).
 Evacuate Indicates if VCS initiates an automatic failover when user issues hastop -local -evacuate. Default value is 1(true).
 Evacuating Indicates the node ID from which the service group is being evacuated.
 Failover Indicates service group is in the process of failing over.
 FailOverPolicy Sets the policy VCS uses to determine which system a group fails over to if multiple systems exist. The values are Priority (default), Load, RoundRobin.
 FromQ Indicates the system name from which the service is failing over. This attribute is specified when service group failover is a direct consequence of the group event, such as a resource fault within the group or a group switch.
 Frozen Disables all actions, including autostart, online and offline, and failover, except for monitor actions performed by agents. Default value is 0(false).
 GroupOwner This attribute is used for VCS email notification and logging. VCS sends email notification to the person designated in this attribute when an event occurs related to the service group.
 IntentOnline Indicates whether to keep service groups online or offline. It is set to 1 by VCS if an attempt has been made, successful or not, to online the service group. For failover groups, this attribute is set to 0 by VCS when the group is taken offline. For parallel groups, it is set to 0 for the system when the group is taken offline or when the group faults and can fail over to another system.
 LastSuccess For internal use only.
 Load Integer value expressing total system load this group will put on a system.
 ManualOps Indicates if manual operations are allowed on the service group.
 MigrateQ Indicates the system from which the service group is migrating. This attribute is specified when group failover is an indirect consequence, such as system shutdown, another group faulted and is linked to this group, etc.
 NumRetries Indicates the number of times attempts are made to bring a service group online. This attribute is used only if the attribute OnlineRetryLimit is set for the service group.
 OnlineRetryInterval Indicates the interval, in seconds, during which a service group that has successfully restarted on the same system and faults again should be failed over, even if the attribute OnlineRetryLimit is non-zero. This prevents a group from continuously faulting and restarting on the same system. Default=0
 OnlineRetryLimit If non-zero, specifies the number of times the VCS engine tries to restart a faulted service group on the same system on which the group faulted, before it gives up and tries to fail over the group to another system. Default = 0.
 Operators List of VCS users with privileges to operate the group. A Group Operator can only perform  online/offline, and temporary freeze/unfreeze operations pertaining to a specific group.
 Parallel This indicates if service group is failover (0), parallel (1) or hybrid (2).
 PathCount Number of resources in path not yet taken offline. When this number drops to zero, the engine may take the entire service group offline if critical fault has occurred.
 PreOnline Indicates that the VCS engine should not online a service group in response to a manual group online, group autostart, or group failover. The engine should instead call a user-defined script that checks for external conditions before bringing the group online. Default value is 0(false).
 PreOnlining Indicates that VCS engine invoked the preonline script; however, the script has not yet returned with group online.
 Prerequisites An unordered set of name=value pairs denoting specific resources required by a service group. If prerequisites are not met, the group cannot go online. The format for Prerequisites is: Prerequisites() = { Name=Value, name2=value2 }. Names used in setting Prerequisites are arbitrary and not obtained from the system. Coordinate name=value pairs listed in Prerequisites with the same name=value pairs in Limits().
 Priority Enables users to designate and prioritize the service group. VCS does not interpret the value; rather, this attribute enables the user to configure the priority of a service group and the sequence of actions required in response to a particular event. Default=0
 PrintTree Indicates whether or not the resource dependency tree is written to the configuration file.
 Probed Indicates whether all enabled resources in the group have been detected by their respective agents.
 ProbesPending The number of resources that remain to be detected by the agent on each system.
 Responding Indicates VCS engine is responding to a failover event and is in the process of bringing the service group online or failing over the node.
 SourceFile File from which the configuration was read.
 State Group state on each system. Group states are OFFLINE, ONLINE, FAULTED, PARTIAL, STARTING, STOPPING, OFFLINE | FAULTED, OFFLINE | STARTED, PARTIAL | FAULTED, PARTIAL | STARTING, PARTIAL | STOPPING, ONLINE | STOPPING.
 SystemList List of systems on which the service group is configured to run and their priorities. Lower numbers indicate a preference for the system as a failover target.
 SystemZones Indicates the virtual sublists within the SystemList attribute that grant priority in failing over. Values are string/integer pairs. The string key is the name of a system in the SystemList attribute, and the integer is the number of the zone. Systems with the same zone number are members of the same zone. If a service group faults on one system in a zone, it is granted priority to fail over to another system within the same zone, despite the policy granted by the FailOverPolicy attribute.
 Tag Identifies special-purpose service groups created for specific VCS products.
 TargetCount Indicates the number of target systems on which the service group should be brought online.
 TFrozen Indicates if service groups can be brought online on the system. Groups cannot be brought online if the attribute value is 1(true). Default value is 0 (false).
 ToQ Indicates the node name to which the service is failing over. This attribute is specified when service group failover is a direct consequence of the group event, such as a resource fault within the group or a group switch.
 TriggerResStateChange Determines whether or not to invoke the resstatechange trigger if resource state changes.
 UserIntGlobal Use this attribute for any purpose. It is not used by VCS.
 UserStrGlobal Use this attribute for any purpose. It is not used by VCS.
 TypeDependencies Creates a dependency between resource types specified in the service group list, and all instances of the respective resource type.
 UserIntLocal Use this attribute for any purpose. It is not used by VCS.
 UserStrLocal Use this attribute for any purpose. It is not used by VCS.
 Dependencies Creates a dependency between resource types specified in the service group list, and all instances of the respective resource type.
 PostOffline Setting this attribute to 1 executes the PostOffline event trigger on the system where the group went offline from a partial or fully online state.
 PostOnline Setting this attribute to 1 executes the PostOnline event trigger on the system where the group went online from a partial or fully offline state.
 ExtMonApp For internal use only.
 ExtMonArgs For internal use only.
 PreOffline For internal use only.
 PreOfflining For internal use only.
 Restart For internal use only.
 TriggerEvent For internal use only.
 ClusterList Specifies the list of clusters on which the service group is configured to run.
 Authority Indicates whether or not the local cluster is allowed to bring the service group online. If set to 0, it is not, if set to 1, it is.
 ClusterFailOverPolicy Determines how a global service group behaves when a cluster faults.
 ManageFaults Specifies if VCS manages resource failures within the service group by calling clean entry point for
 FaultPropagation Specifies if VCS should propagate the fault up to parent resources and take the entire service group
 PreonlineTimeout Defines the maximum amount of time the preonline script takes to run the command hagrp -online -nopre for the group. Note that HAD uses this timeout during evacuation only.
 DeferAutoStart Indicates whether HAD defers the auto-start of a local group in case the global cluster is not fully connected.
 VCSi3Info Enables VCS service groups to be mapped to VERITAS i3 applications. This attribute is managed solely by the i3 product and should not be set or modified by the user.
 AgentClass Indicates the scheduling class for the VCS agent process.
 AgentFailedOn A list of systems on which the agent for the resource type has failed.
 AgentPriority Indicates the priority in which the agent process runs. This attribute has no effect for windows environment. Default = 0.
 AgentReplyTimeout The number of seconds the engine waits to receive a heartbeat from the agent before restarting the agent. Default = 130 seconds.
 AgentStartTimeout The number of seconds after starting the agent that the engine waits for the initial agent “handshake” before restarting the agent. Default = 60 seconds.
 ArgList An ordered list of attributes whose values are passed to the open, close, online, offline, monitor, and clean entry points.
 AttrChangedTimeout Maximum time (in seconds) within which the attr_changed entry point must complete or be terminated. Default = 60 seconds.
 CleanTimeout Maximum time (in seconds) within which the clean entry point must complete or else be terminated. Default = 60 seconds.
 CloseTimeout Maximum time (in seconds) within which the close entry point must complete or else be terminated. Default = 60 seconds.
 ConfInterval When a resource has remained online for the specified time (in seconds), previous faults and restart attempts are ignored by the agent.
 FaultOnMonitorTimeouts When a monitor times out as many times as the value specified, the corresponding resource is brought down by calling the clean entry point. The resource is then marked FAULTED, or it is restarted, depending on the value set in the Restart Limit attribute. When FaultOnMonitorTimeouts is set to 0, monitor failures are not considered indicative of a resource fault. A low value may lead to spurious resource faults, especially on heavily loaded systems.
 LogFileSize Specifies the size (in bytes) of the agent log file. Minimum value is 65536 bytes. Maximum value is 134217728 bytes (128MB). Default = 33554432 (32MB)
 MonitorInterval Duration (in seconds) between two consecutive monitor calls for an ONLINE or transitioning resource. A lower value could impact performance if many resources of the same type exist. A higher value could delay detection of a faulted resource.
 MonitorTimeout Maximum time (in seconds) within which the monitor entry point must complete or else be terminated. Default = 60 seconds
 OfflineMonitorInterval Duration (in seconds) between two consecutive monitor calls for an OFFLINE resource. If set to 0, OFFLINE resources are not monitored.
 NumThreads Number of threads used within the agent process for managing resources. This number does not include the three threads used for other internal purposes. Increasing to a significantly large value can degrade system performance. Decreasing to 1 prevents multiple threads. Default = 10.
 OfflineTimeout Maximum time (in seconds) within which the offline entry point must complete or else be terminated. Default = 300 seconds
 OnlineRetryLimit Number of times to retry online, if the attempt to online a resource is unsuccessful. This parameter is meaningful only if clean is implemented. Default = 0.
 OnlineTimeout Maximum time (in seconds) within which the online entry point must complete or else be terminated. Default = 300 seconds
 OnlineWaitLimit Number of monitor intervals to wait after completing the online procedure, and before the resource becomes online. Default = 2.
 OpenTimeout Maximum time (in seconds) within which the open entry point must complete or else be terminated. Default = 60 seconds.
 Operations Indicates valid operations of resources of the resource type. Values are OnOnly (can online only), OnOff (can online and offline), None (cannot online or offline).
 RestartLimit Number of times to retry bringing a resource online when it is taken offline unexpectedly and before VCS declares it FAULTED. Default = 0
 ScriptClass Indicates the scheduling class of the script processes (for example, online) created by the agent. This attribute has no effect for windows environment.
 ScriptPriority Indicates the priority of the script processes created by the agent. This attribute has no effect for windows environment. Default = 0.
 SourceFile File from which the configuration was read.
 ToleranceLimit Number of times the monitor entry point should return OFFLINE before declaring the resource FAULTED. A large value could delay detection of a genuinely faulted resource. Default = 0
 MonitorIfOffline Indicates whether resources are monitored when offline (value 1), or not (value 0).
 Type File system type, such as vxfs, ufs, etc.
 RestartLimits The number of times the agent should try to restart the resources.
 FireDrill Specifies whether or not fire drill is enabled for resource type. If set to 1, fire drill is enabled. If set to 0, it is disabled.
 LogDbg Indicates the debug severities enabled for the resource type or agent framework. Debug severities used by the agent entry points are in the range of DBG_1 to DBG_21. The debug messages from the agent framework are logged with the severities DBG_AGINFO, DBG_AGDEBUG and DBG_AGTRACE, representing the least to most verbose.
 MonitorStatsParam Designates the values governing the monitor interval. Valid keys include:
 InfoInterval Determines when info entry point is invoked by the agent framework. If set to 0, the entry point is not invoked. Set this attribute to a non-zero value to invoke the entry point periodically.
 InfoTimeout Timeout value for info entry point. If entry point does not complete by the designated time, the agent framework cancels the entry point’s thread.
 ActionTimeout Timeout value for action entrypoint. Default is 40s
 SupportedActions Valid action tokens for this resource type. Default is an
 LogLevel LogLevel
 LogTags LogTags
 ArgListValues List of arguments passed to the resource’s agent on each system. This attribute is resource- and system-specific, meaning that the list of values passed to the agent depends on which system and which resource they are for.
 AutoStart Indicates that the resource is brought online when the service group is brought online. Default value is 1(true).
 ConfidenceLevel Indicates the level of confidence in an online resource. Values range from 0 – 100. Note that some VCS agents may not take advantage of this attribute and may always set it to 0. Set the level to 100 if the attribute is not used.
 Critical Indicates that the service group is faulted when the resource, or any resource it depends on, faults. Default value is 1(true).
 Enabled Indicates agents monitor the resource. If a resource is created dynamically while VCS is running, you must enable the resource before VCS monitors it. When Enabled is set to 0(false), it implies a disabled resource. VCS will not bring a disabled resource, nor its children online, even if the children are enabled. If you specify the resource in main.cf prior to starting VCS, the default value for this attribute is 0(false).
 Flags Additional information relating to the state of a resource. Possible values are : RESTARTING, STATUS UNKNOWN, MONITOR TIMEDOUT, UNABLE TO OFFLINE and ADMIN WAIT.
 Group String name of the service group to which the resource belongs.
 LastOnline Indicates the system name on which the resource was last online. This attribute is automatically set by the VCS engine (had).
 MonitorOnly Indicates if the resource can be brought online or taken offline. If set to 0(false), the resource can be brought online or taken offline. If set to 1(true), the resource can be monitored only. Default value is 0(false).
 IState Indicates internal state of a resource. In addition to the State attribute, this attribute shows to which state the resource is transitioning. Possible values are : NOT WAITING, WAITING TO GO ONLINE, WAITING FOR CHILDREN ONLINE, WAITING TO GO OFFLINE, WAITING TO GO OFFLINE (propagate), WAITING TO GO ONLINE (reverse), WAITING TO GO OFFLINE (reverse/propagate).
 Path The number of parent resources in the path up to the top of the resource graph. This attribute is used when an online resource faults.
 Probed Indicates whether the resource has been detected by the agent.
 ResourceOwner This attribute is used for VCS email notification and logging. VCS sends email notification to the person designated in this attribute when an event occurs related to the resource. VCS also logs the owner name when an event occurs. If ResourceOwner is not specified in main.cf, the default value is “unknown.”
 Signaled Indicates whether a resource has been traversed. Used when bringing a service group online or taking it offline.
 Start Indicates whether a resource was started (the process of bringing it online was initiated) on a system.
 State Resource state on each system. Possible values are : ONLINE, OFFLINE, FAULTED, ONLINE | STATE UNKNOWN, ONLINE | MONITOR TIMEDOUT, ONLINE | UNABLE TO OFFLINE, OFFLINE | STATE UNKNOWN, FAULTED | RESTARTING. A faulted resource is physically offline, though unintentionally.
 AgentDebug A flag that defines whether the agent logs additional debug messages. The value 1(true) indicates that the agent will log additional debug messages. The value 0(false) indicates that it will not. Default value is 0(false).
 TriggerEvent For internal use only.
 ResourceInfo This attribute has three predefined keys: State (values are Valid, Invalid, or Stale); Msg (output of the info entry point captured on stdout by the agent framework); TS (timestamp indicating when the ResourceInfo attribute was updated by the agent framework). Defaults: State = Valid, Msg = “”, TS = “”.
 ComputeStats The attribute indicates to the agfw whether or not to calculate monitor time statistics for the resource. By default this is set to FALSE.
 MonitorTimeStats The valid keys for this attribute are: Average, TS. Average is the average time taken by the monitor EP over the last “Frequency” number of monitor cycles. TS is the timestamp of when the engine last updated the Average for the resource. Default values are:
 Name For internal use only.
 Enabled Indicates if SNMP traps are enabled.
 IPAddr IP address of the host where the SNMP Manager resides.
 Port Port of SNMP server.
 SourceFile File from which the configuration was read.
 TrapList List of traps and their descriptions.
 Clusterlist List of clusters whose health is determined by this heartbeat.
 AgentState State of the heartbeat agent.
 State This is the state of the heartbeat. This state is used to determine the health of the remote cluster.
 AYAInterval This is the ‘Are You Alive Interval’. This is the interval after which the local cluster heartbeats the remote cluster.
 InitTimeout Timeout value for the ‘init’ entry point.
 StartTimeout Timeout value for the ‘start’ entry point.
 CleanTimeout This is the timeout value for the ‘clean’ entry point.
 StopTimeout This is the timeout value for the Stop entry point.
 AYATimeout This is the timeout value for the aya entry point.
 AYARetryLimit Number of times to call the aya entry point before giving up.
 Arguments Extra generic information that can be passed to the heartbeat agent.
 LogDbg This is used for log messages.

PS:
1. You can download cluster_attrs.xml here for more information on vcs service group and resource attributes, such as whether an attribute is editable/important/mustconfigure, its displayname, etc.

vcs-cluster_attrs.zip

2. Some vcs attributes are not listed here because they are dedicated to apps such as oracle. You can import the vcs attributes configuration file for such agents, as detailed for example in this article: http://sfdoccentral.symantec.com/sf/5.0/solaris64/html/vcs_agents_oracle/ch_vha_oracle_configagent9.html
Categories: HA, HA & HPC Tags:

awstats installation and configuration guide on linux centos

May 21st, 2012 1 comment

Here’s a howto/guide about awstats installation and configuration on linux:

yum -y install awstats

Here are the main things installed:

/var/www/awstats
/etc/awstats
/etc/cron.hourly/00awstats
/etc/httpd/conf.d/awstats.conf
/usr/bin/awstats_buildstaticpages.pl
/usr/bin/awstats_exportlib.pl
/usr/bin/awstats_updateall.pl
/usr/bin/logresolvemerge.pl
/usr/bin/maillogconvert.pl
/usr/bin/urlaliasbuilder.pl

mv /etc/awstats/awstats.localhost.localdomain.conf /etc/awstats/awstats.mysite.conf
vi /etc/awstats/awstats.mysite.conf

LogFile="/usr/bin/logresolvemerge.pl /var/log/httpd/*-access.log|"

#there's a way to add gzipped log file for analyzing
#LogFile="gzip -d </var/log/apache/access.log.gz|"
LogType=W #W is for analyzing web log files
LogFormat=1 #or use a custom log format if you don't use the combined log format
SiteDomain="www.yoursite.com"
AllowToUpdateStatsFromBrowser=1
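
LogFormat=1 tells awstats to expect the Apache combined log format. Before running an update, it can help to confirm that the log lines really have that shape. The following is a rough sketch in Python, not part of awstats; the regular expression and the sample lines are illustrative assumptions, and unusual lines (for example, escaped quotes inside the user agent) are not handled:

```python
import re

# Rough pattern for the Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def looks_like_combined(line):
    """Return True if the line matches the combined log format pattern."""
    return COMBINED_RE.match(line.strip()) is not None

# Hypothetical sample lines, for illustration only
sample = ('127.0.0.1 - - [21/May/2012:10:00:00 +0800] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://www.yoursite.com/" "Mozilla/5.0"')
common_only = '127.0.0.1 - - [21/May/2012:10:00:00 +0800] "GET / HTTP/1.1" 200 2326'

print(looks_like_combined(sample))       # has referer and user-agent fields
print(looks_like_combined(common_only))  # common log format only, lacks them
```

If most lines fail the check, the logs are probably not in the combined format and a custom LogFormat is needed instead of 1.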

cd /var/www/awstats/

chmod -R 755 /var/log/httpd #The web server user needs read access to these log files and x (search) permission on the directory; otherwise you’ll see the error message below when you click “Update now” in the browser:

awstats Couldn’t open server log file xxxx: Permission denied

perl ./awstats.pl -config=mysite -update #or update from browser, through logrotate (http://awstats.sourceforge.net/docs/awstats_faq.html#ROTATE), or through crontab (http://awstats.sourceforge.net/docs/awstats_faq.html#CRONTAB)
perl ./awstats.pl -config=mysite -output -staticlinks > awstats.mysite.html
Now visit http://www.yoursite.com/awstats-html/awstats.mysite.html for the static report, or http://www.yoursite.com/awstats/awstats.pl?config=mysite for the dynamic one (for example http://www.yoursite.com/awstats/awstats.pl?month=MM&year=YYYY&output=unknownos). Dynamic reports are generated in real time from the statistics database. If this is slow, or puts too much load on your server, consider generating static reports instead.
Here’s the httpd configuration file for awstats:

[root@doxer awstats]# cat /etc/httpd/conf.d/awstats.conf
Alias /awstats/icon/ /var/www/awstats/icon/
Alias /awstats-html/ /var/www/awstats/
ScriptAlias /awstats/ /var/www/awstats/
<Directory /var/www/awstats/>
AllowOverride All
DirectoryIndex awstats.pl
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
#Alias /css/ /var/www/awstats/css/
#Alias /js/ /var/www/awstats/js/

NB:

1. If you encounter a 500 internal server error, this article may help you troubleshoot it: http://www.doxer.org/learn-linux/resolved-awstats-500-internal-server-error-after-installation-on-centos-linux/

2. For more info, you can refer to the official site here: http://awstats.sourceforge.net/docs/index.html

resolved awstats 500 internal server error after installation on centos linux

May 21st, 2012 No comments

After installing awstats on centos according to the official installation guide, the dynamic view in the browser rendered fine, i.e. http://www.mysite.com/awstats/awstats.pl?config=mysite worked and I could see the statistics with no problem. However, when I tried to view the static html page generated by perl ./awstats.pl -config=mysite -output -staticlinks > awstats.mysite.html, there was a 500 internal server error when visiting this page: http://www.mysite.com/awstats/awstats.mysite.html.

This was quite weird, because 500 internal server errors usually come from dynamically generated pages such as php pages or perl cgi scripts. Here, only the static html page gave a 500 internal server error, while the dynamically generated pages rendered fine. When I moved the html file to another virtualhost, it rendered without the horrible 500 internal server error: statistics looked good and all icons were ok.

The configuration of awstats in httpd conf file was like this:

[root@doxer awstats]# cat /etc/httpd/conf.d/awstats.conf
Alias /awstats/icon/ /var/www/awstats/icon/
ScriptAlias /awstats/ /var/www/awstats/
<Directory /var/www/awstats/>
AllowOverride All
DirectoryIndex awstats.pl
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
#Alias /css/ /var/www/awstats/css/
#Alias /js/ /var/www/awstats/js/

Pay attention to the ScriptAlias line. It took me a whole morning to find the root cause (there was no useful error log for this awstats 500 internal server error). As detailed in the httpd documentation:

The ScriptAlias directive has the same behavior as the Alias directive, except that in addition it marks the target directory as containing CGI scripts that will be processed by mod_cgi’s cgi-script handler.

This makes it clear that files under the directory named in a ScriptAlias directive are treated as CGI scripts. Because the static html file was placed in a directory that should only contain CGI scripts, httpd handled it as a CGI script too, and a 500 internal server error was thrown when visiting that static html file.

To fix this awstats 500 internal server error, change the configuration file as follows:

[root@doxer awstats]# cat /etc/httpd/conf.d/awstats.conf
Alias /awstats/icon/ /var/www/awstats/icon/
Alias /awstats-html/ /var/www/awstats/
ScriptAlias /awstats/ /var/www/awstats/
<Directory /var/www/awstats/>
AllowOverride All
DirectoryIndex awstats.pl
Options ExecCGI
Order allow,deny
Allow from all
</Directory>
#Alias /css/ /var/www/awstats/css/
#Alias /js/ /var/www/awstats/js/

After this, you should be able to see the awstats static html file with no problem (use http://www.mysite.com/awstats-html/awstats.mysite.html instead of http://www.mysite.com/awstats/awstats.mysite.html).
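
To see why the extra Alias helps, here is a simplified model in Python of how the three mappings in the fixed configuration are matched. It assumes that rules in the same context are tried in the order they appear and that the first matching URL prefix wins; the "static" and "cgi-script" labels are a simplification of httpd's handler selection, and the function name is made up for this sketch:

```python
# (directive, url_prefix, filesystem_dir) in the order they appear in awstats.conf
RULES = [
    ("Alias", "/awstats/icon/", "/var/www/awstats/icon/"),
    ("Alias", "/awstats-html/", "/var/www/awstats/"),
    ("ScriptAlias", "/awstats/", "/var/www/awstats/"),
]

def resolve(url_path, rules=RULES):
    """Return (handler, file_path) for the first rule whose prefix matches."""
    for directive, prefix, directory in rules:
        if url_path.startswith(prefix):
            # ScriptAlias marks everything under the target as a CGI script
            handler = "cgi-script" if directive == "ScriptAlias" else "static"
            return handler, directory + url_path[len(prefix):]
    return None, None

print(resolve("/awstats/awstats.mysite.html"))       # same file, treated as CGI
print(resolve("/awstats-html/awstats.mysite.html"))  # same file, served statically
print(resolve("/awstats/awstats.pl"))                # the real CGI script
```

Both URLs point at the same file on disk; only the prefix used to reach it decides whether httpd tries to execute it.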

NB:

Here’s an article about awstats installation on linux howto:  http://www.doxer.org/learn-linux/awstats-installation-steps-on-linux-centos/

what is fence or fencing device

May 16th, 2012 No comments

To understand what a fencing device is, you first need to know something about the split-brain condition. Read here for info: http://linux-ha.org/wiki/Split_Brain

Here’s something about what a fence device is:

Fencing is the disconnection of a node from shared storage. Fencing cuts off I/O from shared storage, thus ensuring data integrity. A fence device is a hardware device that can be used to cut a node off from shared storage. This can be accomplished in a variety of ways: powering off the node via a remote power switch, disabling a Fibre Channel switch port, or revoking a host’s SCSI 3 reservations. A fence agent is a software program that connects to a fence device in order to ask the fence device to cut off access to a node’s shared storage (via powering off the node or removing access to the shared storage by other means).

To check whether a LUN has SCSI-3 Persistent Reservation, run the following:

root@doxer# symdev -sid 369 show 2040|grep SCSI
SCSI-3 Persistent Reserve: Enabled

And here’s an article about I/O fencing using SCSI-3 Persistent Reservations in the configuration of SF Oracle RAC: http://sfdoccentral.symantec.com/sf/5.0/solaris64/html/sf_rac_install/sfrac_intro13.html

Categories: HA & HPC, Hardware, NAS, SAN, Storage Tags:

differences between freezing vcs system and freezing service group

May 16th, 2012 No comments

In veritas vcs, freezing a system prevents service groups from coming online on that system if they fail over from another node in the cluster. However, it does not prevent faults from failing any service group that is already online on the system.

To prevent veritas intervention on faults caused by expected changes (even if the symptoms are unexpected) we would usually freeze the service group. This prevents any online/clean or restart operation kicking in on detection of faults.

After your modification on vcs, you need to check that resources are not autodisabled and make sure that the config is made read-only (ro) again.

Here are the steps to freeze service group(s) in vcs:
/opt/VRTS/bin/haconf -makerw
mkdir /var/tmp/veritas_config_backup_`date +%F`
cp -R /etc/VRTSvcs /var/tmp/veritas_config_backup_`date +%F`
# $i is the name of the service group to freeze; repeat this command for each group
/opt/VRTS/bin/hagrp -freeze $i -persistent
/opt/VRTS/bin/haconf -dump -makero
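
When several groups need freezing, the order of the commands matters: open the config once, freeze each group, then dump and close it. Here is a small sketch in Python that only builds and prints that sequence (it does not run anything and leaves out the backup step). The function name and the group names are hypothetical:

```python
def freeze_commands(groups, vcs_bin="/opt/VRTS/bin"):
    """Build the command sequence: open config, freeze each group, save and close."""
    cmds = [[f"{vcs_bin}/haconf", "-makerw"]]
    for group in groups:
        cmds.append([f"{vcs_bin}/hagrp", "-freeze", group, "-persistent"])
    cmds.append([f"{vcs_bin}/haconf", "-dump", "-makero"])
    return cmds

# "websg" and "dbsg" are made-up service group names
for cmd in freeze_commands(["websg", "dbsg"]):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running them by hand on the cluster node.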

Categories: HA, HA & HPC Tags: ,

tips about nagios notes_url action_url

March 16th, 2012 No comments

nagios has two useful parameters, i.e. notes_url & action_url.

Firstly, you can modify notes_url in template configuration file:
# Generic service definition template – This is NOT a real service, just a template!
define service{
name generic-service ; The ‘name’ of this service template
notes_url http://www.yoursite.com/mediawiki/index.php/nagios#$SERVICEDESC$
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service ‘freshness’
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
normal_check_interval 10 ; Check the service every 10 minutes under normal conditions
retry_check_interval 2 ; Re-check the service every two minutes until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the ‘admins’ group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 60 ; Re-notify about service problems every hour
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION – ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

Then, when you define a new service check and want to use that notes_url in combination with action_url, define the service check like this (inline comments must start with a semicolon; a # in the middle of a line is not treated as a comment):
define service{
use generic-service ; using this template automatically enables notes_url; the $SERVICEDESC$ macro in the template is replaced with the service_description below
host_name yoursite.com
servicegroups HttpCheck
service_description aboutus ; this replaces the $SERVICEDESC$ macro in the template definition
check_command check_http!-u "/aboutus/" -t 60 -w 15 -c 30 -f follow -l -r 010
action_url http://$HOSTADDRESS$/aboutus/ ; there will be a “cloud icon” in the Nagios web GUI when you add this line
}
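
To see which URL a given service ends up with, the macro substitution can be imitated in the shell. This is a rough illustration rather than what Nagios does internally: the function name expand_servicedesc is made up, it handles only the $SERVICEDESC$ macro, and it skips any URL encoding. Descriptions containing | or & would confuse the sed expression.

```sh
# expand_servicedesc URL DESCRIPTION
# Replace every $SERVICEDESC$ in URL with DESCRIPTION.
expand_servicedesc() {
    printf '%s\n' "$1" | sed 's|\$SERVICEDESC\$|'"$2"'|g'
}

# Usage:
# expand_servicedesc 'http://www.yoursite.com/mediawiki/index.php/nagios#$SERVICEDESC$' aboutus
# prints http://www.yoursite.com/mediawiki/index.php/nagios#aboutus
```

With the template above, every service that inherits from generic-service gets its own anchor on the same wiki page, named after its service_description.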

With action_url and notes_url added, the Nagios web GUI shows an icon for each of them. Try clicking each icon to see where it takes you.

Here’s the snapshot:

[Image: nagios notes_url action_url]

add oracle under vcs control howto – using veritas vxvm filesystem

February 14th, 2012 No comments

In this example, I’m going to add Oracle and the Oracle listener under VCS control.

haconf -makerw #change VCS to read-write mode

hagrp -add SG_myoracle #add service group

hagrp -modify SG_myoracle SystemList  host3 0 host4 1 host2 2 host1 3

hagrp -modify SG_myoracle AutoStartList  host1 host2 host3 host4 #List of systems on which, under specific conditions, the service group will be started with VCS (usually at system boot). For example, if a system is a member of a failover service group’s AutoStartList attribute, and if the service group is not already running on another system in the cluster, the group is brought online when the system is started.

hagrp -modify SG_myoracle SourceFile "./main.cf"

hares -add dg_myoracle DiskGroup SG_myoracle #add disk group
hares -modify dg_myoracle Critical 0
hares -modify dg_myoracle DiskGroup myoracle
hares -modify dg_myoracle PanicSystemOnDGLoss 0
hares -modify dg_myoracle StartVolumes 1
hares -modify dg_myoracle StopVolumes 1
hares -modify dg_myoracle MonitorReservation 0
hares -modify dg_myoracle tempUseFence INVALID
hares -modify dg_myoracle DiskGroupType private
hares -modify dg_myoracle Enabled 1
hares -add vip_myoracle IP SG_myoracle #add vip
hares -modify vip_myoracle Critical 0
hares -local vip_myoracle Device
hares -modify vip_myoracle Device bond0 -sys host1
hares -modify vip_myoracle Device bond0 -sys host2
hares -modify vip_myoracle Device bond0 -sys host3
hares -modify vip_myoracle Device bond0 -sys host4
hares -modify vip_myoracle Address "192.168.0.7"
hares -modify vip_myoracle NetMask "255.255.255.0"
hares -modify vip_myoracle Enabled 1
hares -add mnt_myoracle Mount SG_myoracle #add mount point resource
hares -modify mnt_myoracle Critical 0
hares -modify mnt_myoracle MountPoint "/myoracle"
hares -modify mnt_myoracle BlockDevice "/dev/vx/dsk/myoracle/myoracleroot"
hares -modify mnt_myoracle FSType vxfs
hares -modify mnt_myoracle MountOpt largefiles
hares -modify mnt_myoracle FsckOpt "%-y"
hares -modify mnt_myoracle SnapUmount 0
hares -modify mnt_myoracle CkptUmount 1
hares -modify mnt_myoracle SecondLevelMonitor 0
hares -modify mnt_myoracle SecondLevelTimeout 30
hares -modify mnt_myoracle VxFSMountLock 0
hares -modify mnt_myoracle Enabled 1
hares -add mnt_myoracle-ora Mount SG_myoracle #add another mount point resource
hares -modify mnt_myoracle-ora Critical 0
hares -modify mnt_myoracle-ora MountPoint "/myoracle/ora"
hares -modify mnt_myoracle-ora BlockDevice "/dev/vx/dsk/myoracle/myoracle-ora"
hares -modify mnt_myoracle-ora FSType vxfs
hares -modify mnt_myoracle-ora MountOpt largefiles
hares -modify mnt_myoracle-ora FsckOpt "%-y"
hares -modify mnt_myoracle-ora SnapUmount 0
hares -modify mnt_myoracle-ora CkptUmount 1
hares -modify mnt_myoracle-ora SecondLevelMonitor 0
hares -modify mnt_myoracle-ora SecondLevelTimeout 30
hares -modify mnt_myoracle-ora VxFSMountLock 0
hares -modify mnt_myoracle-ora Enabled 1
hares -add lsnr_myoracle Netlsnr SG_myoracle #add listener resource
hares -modify lsnr_myoracle Critical 0
hares -modify lsnr_myoracle Owner oracle
hares -modify lsnr_myoracle Home "/ora/product/11.2.0.2a"
hares -modify lsnr_myoracle TnsAdmin "/myoracle/ora/admin/etc"
hares -modify lsnr_myoracle Listener LISTENER_myoracle
hares -modify lsnr_myoracle MonScript "./bin/Netlsnr/LsnrTest.pl"
hares -modify lsnr_myoracle AgentDebug 0
hares -modify lsnr_myoracle Enabled 1

hares -add myoracle Oracle SG_myoracle #add oracle resource
hares -modify myoracle Critical 0
hares -modify myoracle Sid myoracle
hares -modify myoracle Owner oracle
hares -modify myoracle Home "/ora/product/11.2.0.2a"
hares -modify myoracle Pfile "/myoracle/ora/admin/myoracle/pfile/initmyoracle.ora"
hares -modify myoracle StartUpOpt STARTUP
hares -modify myoracle ShutDownOpt IMMEDIATE
hares -modify myoracle AutoEndBkup 1
hares -modify myoracle MonScript "./bin/Oracle/SqlTest.pl"
hares -modify myoracle AgentDebug 0
hares -modify myoracle Enabled 1

hares -add proxy_mnic_myoracle Proxy SG_myoracle #add proxy resource
hares -modify proxy_mnic_myoracle Critical 0
hares -modify proxy_mnic_myoracle TargetResName mnic
hares -modify proxy_mnic_myoracle Enabled 1

#Now set up the dependencies (hares -link <parent> <child>: the parent depends on the child)

hares -link mnt_myoracle dg_myoracle

hares -link mnt_myoracle-ora mnt_myoracle

hares -link myoracle mnt_myoracle-ora

hares -link vip_myoracle proxy_mnic_myoracle

hares -link lsnr_myoracle vip_myoracle

hares -link lsnr_myoracle mnt_myoracle-ora

haconf -dump -makero #Write the configuration to disk and remove the designation stale. -makero changes the VCS mode to read-only.
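
The per-host Device lines above differ only in the host name, so on larger clusters they can be generated rather than typed. Here is a hedged sketch: the helper name device_cmds is invented for this example, and it only prints hares commands of the same form as the ones above; it does not run anything itself.

```sh
# device_cmds RESOURCE DEVICE HOST...
# Print the commands that localize the Device attribute of RESOURCE and
# set it to DEVICE on every HOST.
device_cmds() {
    res=$1
    dev=$2
    shift 2
    echo "hares -local $res Device"
    for host in "$@"; do
        echo "hares -modify $res Device $dev -sys $host"
    done
}

# Usage: review the output first, then pipe it to sh to apply it.
# device_cmds vip_myoracle bond0 host1 host2 host3 host4
# device_cmds vip_myoracle bond0 host1 host2 host3 host4 | sh
```

Printing the commands instead of executing them keeps the helper harmless and makes it easy to paste the result into a change record.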

NB:

If your system already has other service groups configured, then hacf -cftocmd is your friend: it converts an existing configuration into the commands that rebuild it.

Categories: HA & HPC Tags:

vcs architecture overview

October 28th, 2011 No comments

[Image: vcs architecture overview]

hp openview monitoring solution – enable disable policies

August 18th, 2011 2 comments

This document is part of the “ESM – How To” guides and gives instructions on how to disable and enable monitoring policies on servers managed by Openview for Unix (OVOU), Openview for Windows (OVOW) and Opsview, for blackout or maintenance purposes.

The process for disabling and enabling monitoring may differ depending on which solution monitors the servers involved in the DR test.

This document lists the procedures to disable and enable monitoring in each monitoring solution.

1. Monitoring changes to servers monitored in OVOU (Unix/Linux)
1.1. Disabling monitoring templates
Disable monitoring templates on the monitored server.
Log on to the server and execute the following command as the root user:
/opt/OV/bin/opctemplate -d -all

The above command disables all monitoring policies on the server.

Now disable Openview agent heartbeat polling from the Openview servers (server side).
Execute the following command to disable OVO agent heartbeat polling:
/opt/OV/bin/OpC/opchbp -stop <name of server>

1.2. Enabling monitoring templates
Enable monitoring templates on the monitored server.
Log on to the server and execute the following command as the root user:
/opt/OV/bin/opctemplate -e -all

The above command enables all monitoring policies on the server.

Now enable Openview agent heartbeat polling from the Openview servers (server side).
Execute the following command to enable OVO agent heartbeat polling:
/opt/OV/bin/OpC/opchbp -start <name of server>

2. Disabling monitoring for servers monitored in OVOW (Windows)
First, access the Web Console of OV:

http://yourserver/OVOWeb/

2.1 Disabling policy(s)
Select the Policies icon on the left-hand panel, then expand the nodes by selecting the plus sign next to the nodes in the center panel. Once you have located the server on which you want to disable policies, select it; the policy inventory is shown in the right-hand panel. Select the policies to be disabled using the check boxes next to the policy names, or, to disable all the policies, select the check box next to the name, which selects all of them. Then select the Disable tab and the policies will be disabled. The display shows Active and Pending until all the policies have been disabled; use the Refresh tab to update the screen.

2.2 Enabling policy(s)
Please refer to 2.1 Disabling policy(s)

3. Monitoring changes for servers monitored in Opsview (Nagios)
Log on to Opsview through the web interface, then search for and locate the host you want to operate on. Click the grey arrow on the right side of the node and choose “Schedule downtime”. You can then choose the start and end time of the downtime for this server.

Categories: HA & HPC Tags:

Nagios Check_nrpe/check_disk with error message DISK CRITICAL – /apps/yourxxx is not accessible: Permission denied

June 21st, 2011 1 comment

This sometimes happens because the user that Nagios runs as has no read permission on the file systems that check_disk is going to check.

For example, if you received this alert:

DISK CRITICAL – /apps/vcc/logs/way is not accessible: Permission denied

You can then log on to your server and, as root, run:

/usr/local/nagios/libexec/check_disk -p /apps/vcc/logs/way

You may see:

DISK OK – free space: /apps/vcc/logs/way 823 MB (21% inode=90%);| /=2938MB;;3966;0;3966

But when you run this under user nagios, you may see DISK CRITICAL again.

Resolution:

Grant the nagios user read and execute (search) permission on the filesystem/directory that had the problem, including its parent directories.
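
To find out which directory along the path is actually blocking the nagios user, a small helper can walk the path upwards. This is a sketch under assumptions: the function name first_blocked_dir is made up, and it only tests the search (execute) bit on directories with the shell's own -x test, which is one common cause of "Permission denied" but not the only one.

```sh
# first_blocked_dir PATH
# Walk from PATH up to / and print the top-most existing directory that the
# current user cannot search (no execute permission). Prints nothing if
# every directory that can be inspected is searchable.
first_blocked_dir() {
    p=$1
    blocked=
    while :; do
        if [ -d "$p" ] && [ ! -x "$p" ]; then
            blocked=$p
        fi
        case $p in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
    if [ -n "$blocked" ]; then
        printf '%s\n' "$blocked"
    fi
}

# Usage, from a shell running as the nagios user:
# first_blocked_dir /apps/vcc/logs/way
```

Run it as the nagios user rather than as root, since root passes the permission test on every directory and the helper would print nothing.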

 

remove mnt resource of vcs

May 3rd, 2011 No comments

Target: remove mnt_prd-ora-grdctl from group SG_dbPRD

Here are the steps:

haconf -makerw
hares -delete mnt_prd-ora-grdctl
haconf -dump -makero

Categories: HA & HPC, Storage Tags:

openview Error getting OvCore ID solution

April 17th, 2011 No comments

To resolve the issue, I had to do the following (the ./ commands are run from /opt/OV/bin):

/opt/OV/bin/opcagt -stop
/opt/OV/bin/opcagt -kill
./ovcoreid -version
./ovcoreid -create
./ovcert -check
./ovcert -list
./ovcert -remove 62647e9a-b574-753b-023c-c23175d68bad
./ovcert -remove CA_1a50fab8-f092-750d-1d73-ff3268150d67
./ovcert -remove CA_3f49eabc-9514-752f-15c8-ff1a32bb2ec3
./ovcert -remove CA_ebc2af64-3bbc-750d-0528-c9520bf60692
/opt/OV/bin/OpC/install/opcactivate -srv test-cert_srv test

I then got this error:

(xpl-273) Error occurred when loading configuration file ‘/var/opt/OV/conf/xpl/config/local_settings.ini’.
(xpl-272) Syntax error in line 0.
(xpl-95) Conversion to UTF16 failed.
(xpl-278) Processing update jobs skipped.

Then, do the following steps:

cd /var/opt/OV/conf/xpl/config/
mv local_settings.ini local_settings.ini_corrupted
/opt/OV/bin/OpC/install/opcactivate -srv test-cert_srv test
/opt/OV/bin/ovcert -certreq

bash-3.00# /opt/OV/bin/ovcert -list
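+---------------------------------------------------------+
| Keystore Content                                        |
+---------------------------------------------------------+
| Certificates:                                           |
+---------------------------------------------------------+
| Trusted Certificates:                                   |
+---------------------------------------------------------+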
bash-3.00# /opt/OV/bin/ovc -status
ovcd OV Control CORE (17140) Running
ovbbccb OV Communication Broker CORE (17141) Running
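
When this check has to be repeated on many agents, the status output can be tested in a script. The following is only a sketch: the function name all_running is invented, and it assumes every component line of ovc -status ends with the state, as in the two lines shown above.

```sh
# all_running: read `ovc -status` output on stdin and succeed only if
# there is at least one component line and every one ends in "Running".
all_running() {
    awk 'NF { n++; if ($NF != "Running") bad++ }
         END { if (n > 0 && bad == 0) exit 0; exit 1 }'
}

# Usage:
# if /opt/OV/bin/ovc -status | all_running; then
#     echo "agent core processes are up"
# else
#     echo "at least one agent process is not running"
# fi
```

Empty output counts as a failure here, so an agent that prints nothing at all is not mistaken for a healthy one.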

Categories: HA & HPC Tags:

ERROR: Unable to read from logfile ‘/var/opt/perf/datafiles/logxxx’ – corrupted data

April 17th, 2011 No comments

Try this:
1. # mwa stop all
Wait for all processes to terminate.
2. # ttd -k
3. Move the log files (/var/opt/perf/datafiles/logxxx) into a different directory for later access, or remove them.
4. Restart mwa to create the new logfile set:
# mwa start all
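
Step 3 can be scripted so that the old log files are kept rather than deleted. This is a minimal sketch with assumptions: the function name archive_logs and the destination directory are made up, and it treats every file in the data directory whose name starts with "log" as part of the logfile set.

```sh
# archive_logs SRC_DIR DEST_DIR
# Move every file whose name starts with "log" from SRC_DIR to DEST_DIR,
# creating DEST_DIR if needed.
archive_logs() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$src"/log*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv "$f" "$dest"/ || return 1
    done
}

# Usage, between "ttd -k" and "mwa start all" (the destination is a placeholder):
# archive_logs /var/opt/perf/datafiles "/var/tmp/perf_logs_$(date +%F)"
```

Moving instead of removing leaves the corrupted files available for later inspection.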

NB:
What is mwa?
mwa is the performance tool script for starting and stopping data collection and alarms. It is used to start, stop, and re-initialize HP Performance Agent processes.
On Sun Solaris, IBM AIX and Tru64 UNIX systems, you can choose a protocol selector (ncs or dce), depending on which communication protocol is installed and running.

Categories: HA & HPC Tags:

Veritas Cluster Brown Bag

April 17th, 2011 3 comments

Terms

Cluster - a set of one or more nodes running resource group(s) in cooperation
Active-Active cluster - one resource group running on several nodes at the same time
Active-Passive cluster - one resource group running on a single node at a time
Resource group - a logical set of resources that can’t be split and must work together to provide some service
Resource - an instance of some resource type :)
Resource type - IP address, VxVM volume, VxFS file system, Oracle instance, etc.
Heartbeat - a dedicated channel that nodes use to communicate with each other (via Ethernet, storage, COM port, etc.)
Cluster panic - the situation where one node loses its connection to the other nodes via heartbeat and hence doesn’t know what to do with the resource groups
VIP - the virtual IP address associated with a resource group

Commands

# Status of cluster
hastatus -sum

# Failover some resource group from one node to another
hagrp -switch ResourceGroup1 -to Server1

# List of resources in resource group
hagrp -resources ResourceGroup1

# Offline group
hagrp -offline ResourceGroup1 -sys test

# Online group
hagrp -online ResourceGroup1 -sys Server1

# Get state of group
hagrp -state ResourceGroup1

# Display group parameters (attributes)
hagrp -display ResourceGroup1

# Freeze resource group
hagrp -freeze ResourceGroup1

# Unfreeze resource group
hagrp -unfreeze ResourceGroup1

# Get list of supported resource types
hatype -list

# How to get commands which can be used to re-build VCS configuration
hacf -cftocmd /etc/VRTSvcs/conf/config -dest .
less main.cmd

# Stop all nodes in cluster
hastop -all

# Stop local node in cluster
hastop -local

# Stop local node in cluster w/o stopping resource groups and keep them running
hastop -local -force

# Start cluster (starts VCS on the local node)
hastart

Practice

Task: Create new resource group ResourceGroup1 with VIP address associated with it.
Environment: Server1/Server2

# make cluster configuration read/write
haconf -makerw

# Create new resource group
hagrp -add ResourceGroup1
hagrp -modify ResourceGroup1 SystemList  Server1 0 Server2 1
hagrp -modify ResourceGroup1 AutoStartList  Server1 Server2
hagrp -modify ResourceGroup1 SourceFile "./main.cf"

# Add VIP (Virtual IP) to resource group
hares -add Server1_IP IP ResourceGroup1
hares -local Server1_IP Device
hares -modify Server1_IP Device ce5 -sys Server1
hares -modify Server1_IP Device ce1 -sys Server2
hares -modify Server1_IP Address "xxx.xxx.xxx.xxx"
hares -modify Server1_IP NetMask "255.255.255.0"
hares -modify Server1_IP IfconfigTwice 1
hares -modify Server1_IP ArpDelay 1
hares -modify Server1_IP Enabled 1

# make cluster configuration read/only
haconf -dump -makero

# Check state of resource group ResourceGroup1
hagrp -state ResourceGroup1

# Bring it online on Server2
hagrp -online ResourceGroup1 -sys Server2

# Check state of resource group ResourceGroup1 again
hagrp -state ResourceGroup1

# Ping IP address xxx.xxx.xxx.xxx, check ifconfig (on Server2):
ping xxx.xxx.xxx.xxx
ifconfig -a

# Failover ResourceGroup1 to Server1
hagrp -switch ResourceGroup1 -to Server1

# Check how it is moving
hastatus -sum

# Ping IP address xxx.xxx.xxx.xxx, check ifconfig (on Server1):
ping xxx.xxx.xxx.xxx
ifconfig -a

# Stop ResourceGroup1
hagrp -offline ResourceGroup1 -sys Server1

# Ping IP address xxx.xxx.xxx.xxx, check ifconfig (on both machines):
ping xxx.xxx.xxx.xxx
ifconfig -a

# Bring back online on Server1
hagrp -online ResourceGroup1 -sys Server1

# Manually unplumb IP:
ifconfig ce5:6 unplumb

# Wait until VCS notices that the IP is missing. At this point VCS will fail over resource group ResourceGroup1 to Server2 and mark resource Server1_IP as faulted on Server1
hastatus -sum

# Try to switch resource group ResourceGroup1 back to Server1. You’ll get an error saying that some resources in this group have failed on Server1 and need to be cleared.
hagrp -switch ResourceGroup1 -to Server1

# Check which resources are failed
hares -state | grep FAULT

# Clear failed resources
hares -clear Server1_IP -sys Server1

# Try to switch resource group ResourceGroup1 back to Server1 again after clearing of failed resources
hagrp -switch ResourceGroup1 -to Server1

# Stop ResourceGroup1
hagrp -offline ResourceGroup1 -sys Server1

# Remove ResourceGroup1 resources and resource group
haconf -makerw
hares -delete Server1_IP
hagrp -delete ResourceGroup1
haconf -dump -makero
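
The "check state" steps in the practice above can be scripted by filtering the hagrp -state output. This is a sketch resting on an assumption about the output layout: it expects one line per system with the columns group, the word State, system and a value such as |ONLINE|. The function name online_systems is made up, so compare it with the real output on your cluster before relying on it.

```sh
# online_systems: read `hagrp -state <group>` output on stdin and print
# the systems on which the group is reported as |ONLINE|.
online_systems() {
    awk '$2 == "State" && $4 == "|ONLINE|" { print $3 }'
}

# Usage:
# hagrp -state ResourceGroup1 | online_systems
# if [ "$(hagrp -state ResourceGroup1 | online_systems)" = Server1 ]; then
#     echo "ResourceGroup1 is online on Server1"
# fi
```

Header lines and systems in any other state are skipped, so the output is just the list of systems where the group is online.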

Categories: HA & HPC Tags:

OpsView check_http usage

April 17th, 2011 No comments

check_http v1.4.14-20100630 (nagios-plugins 1.4.14)
Copyright (c) 1999 Ethan Galstad <[email protected]>
Copyright (c) 1999-2008 Nagios Plugin Development Team
<[email protected]>

This plugin tests the HTTP service on the specified host. It can test
normal (http) and secure (https) servers, follow redirects, search for
strings and regular expressions, check connection times, and report on
certificate expiration times.

Usage:
check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
[-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-a auth]
[-b proxy_auth] [-f <ok|warning|critcal|follow|sticky|stickyport>]
[-e <expect>] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
[-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
[-A string] [-k string] [-S] [--sni] [-C <age>] [-T <content-type>]
[-j method]
NOTE: One or both of -H and -I must be specified

Options:
-h, --help
Print detailed help screen
-V, --version
Print version information
-H, --hostname=ADDRESS
Host name argument for servers using host headers (virtual host)
Append a port to include it in the header (eg: example.com:5000)
-I, --IP-address=ADDRESS
IP address or name (use numeric address if possible to bypass DNS lookup).
-p, --port=INTEGER
Port number (default: 80)
-4, --use-ipv4
Use IPv4 connection
-6, --use-ipv6
Use IPv6 connection
-S, --ssl
Connect via SSL. Port defaults to 443
--sni
Enable SSL/TLS hostname extension support (SNI)
-C, --certificate=INTEGER
Minimum number of days a certificate has to be valid. Port defaults to 443
(when this option is used the URL is not checked.)

-e, --expect=STRING
Comma-delimited list of strings, at least one of them is expected in
the first (status) line of the server response (default: HTTP/1.)
If specified skips all other status line logic (ex: 3xx, 4xx, 5xx processing)
-s, --string=STRING
String to expect in the content
-u, --url=PATH
URL to GET or POST (default: /)
-P, --post=STRING
URL encoded http POST data
-j, --method=STRING (for example: HEAD, OPTIONS, TRACE, PUT, DELETE)
Set HTTP method.
-N, --no-body
Don't wait for document body: stop reading after headers.
(Note that this still does an HTTP GET or POST, not a HEAD.)
-M, --max-age=SECONDS
Warn if document is more than SECONDS old. the number can also be of
the form "10m" for minutes, "10h" for hours, or "10d" for days.
-T, --content-type=STRING
specify Content-Type header media type when POSTing

-l, --linespan
Allow regex to span newlines (must precede -r or -R)
-r, --regex, --ereg=STRING
Search page for regex STRING
-R, --eregi=STRING
Search page for case-insensitive regex STRING
--invert-regex
Return CRITICAL if found, OK if not

-a, --authorization=AUTH_PAIR
Username:password on sites with basic authentication
-b, --proxy-authorization=AUTH_PAIR
Username:password on proxy-servers with basic authentication
-A, --useragent=STRING
String to be sent in http header as "User Agent"
-k, --header=STRING
Any other tags to be sent in http header. Use multiple times for additional headers
-L, --link
Wrap output in HTML link (obsoleted by urlize)
-f, --onredirect=<ok|warning|critical|follow|sticky|stickyport>
How to handle redirected pages. sticky is like follow but stick to the
specified IP address. stickyport also ensure post stays the same.
-m, --pagesize=INTEGER<:INTEGER>
Minimum page size required (bytes) : Maximum page size required (bytes)
-w, --warning=DOUBLE
Response time to result in warning status (seconds)
-c, --critical=DOUBLE
Response time to result in critical status (seconds)
-t, --timeout=INTEGER
Seconds before connection times out (default: 10)
-v, --verbose
Show details for command-line debugging (Nagios may truncate output)

Notes:
This plugin will attempt to open an HTTP connection with the host.
Successful connects return STATE_OK, refusals and timeouts return STATE_CRITICAL
other errors return STATE_UNKNOWN. Successful connects, but incorrect reponse
messages from the host result in STATE_WARNING return values. If you are
checking a virtual server that uses ‘host headers’ you must supply the FQDN
(fully qualified domain name) as the [host_name] argument.

This plugin can also check whether an SSL enabled web server is able to
serve content (optionally within a specified time) or whether the X509
certificate is still valid for the specified number of days.

Examples:
CHECK CONTENT: check_http -w 5 -c 10 --ssl -H www.verisign.com

When the ‘www.verisign.com’ server returns its content within 5 seconds,
a STATE_OK will be returned. When the server returns its content but exceeds
the 5-second threshold, a STATE_WARNING will be returned. When an error occurs,
a STATE_CRITICAL will be returned.

CHECK CERTIFICATE: check_http -H www.verisign.com -C 14

When the certificate of ‘www.verisign.com’ is valid for more than 14 days,
a STATE_OK is returned. When the certificate is still valid, but for less than
14 days, a STATE_WARNING is returned. A STATE_CRITICAL will be returned when
the certificate is expired.

Send email to [email protected] if you have questions
regarding use of this software. To submit patches or suggest improvements,
send email to [email protected]
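
The STATE_* names in the notes above correspond to the plugin's exit code, which is handy when running check_http by hand. The sketch below uses the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name state_name, the plugin path and the host in the usage comment are assumptions for illustration.

```sh
# state_name CODE: translate a plugin exit code into its Nagios state.
# Anything other than 0, 1 or 2 is reported as UNKNOWN (normally code 3).
state_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Usage (plugin path and host are placeholders):
# /usr/local/nagios/libexec/check_http -H www.example.com -u /aboutus/ -w 15 -c 30
# echo "check_http returned: $(state_name $?)"
```

Testing a check from the command line this way shows the state Nagios would record before the service definition is deployed.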

Categories: HA & HPC Tags: