Archive for the ‘HA & HPC’ Category

How HA is achieved in Oracle Exadata

November 27th, 2012
  1. Each Exadata Database Machine has completely redundant hardware, including redundant InfiniBand networking, redundant Power Distribution Units (PDUs), redundant power supplies, and redundant database and storage servers.
  2. Oracle RAC protects against database server failure.
  3. ASM provides data mirroring to protect against disk or storage server failures.
  4. Oracle RMAN provides extremely fast and efficient backups to disk or tape.
  5. Oracle’s Flashback technology allows backing out user errors at the database, table or even row level.
  6. Using Oracle Data Guard, a second Exadata Database Machine can be configured to maintain a real-time copy of the database at a remote site to provide full protection against site failures and disasters.

resolved – opcmsgm isn’t running

October 22nd, 2012

If you encounter the problem of opcmsgm not running, you can do the following to resolve the issue:

  • check current status:

testserver:root root # opcsv -status
OVO Management Server status:
-----------------------------
Control Manager opcctlm (22616) is running
Action Manager opcactm (22628) is running
Message Manager opcmsgm isn’t running
TT & Notify Mgr opcttnsm (22630) is running
Forward Manager opcforwm (22631) is running
Service Engine opcsvcm (22636) is running
Cert. Srv Adapter opccsad (22634) is running
BBC config adapter opcbbcdist (22635) is running
Display Manager opcdispm (22632) is running
Distrib. Manager opcdistm (22633) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted

  • restart opcsv
testserver:root root # opcsv -stop
testserver:root root # opcsv -start
  • check status again

testserver:root root # opcsv -status
OVO Management Server status:
-----------------------------
Control Manager opcctlm (15575) is running
Action Manager opcactm (15602) is running
Message Manager opcmsgm (15603) is running
TT & Notify Mgr opcttnsm (15606) is running
Forward Manager opcforwm (15607) is running
Service Engine opcsvcm (15620) is running
Cert. Srv Adapter opccsad (15612) is running
BBC config adapter opcbbcdist (15613) is running
Display Manager opcdispm (15610) is running
Distrib. Manager opcdistm (15611) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted
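
With this many components in the listing, a stopped one is easy to miss. As a minimal sketch, the helper below filters a status listing down to the components that are not running. The function name opc_not_running is made up for this example, and the patterns assume the wording seen in the listings above ("isn't running", "aborted"); other versions may word it differently.

```shell
# Hypothetical helper: keep only the status lines of components that are down.
# Reads a status listing on stdin. The matched wording is taken from the
# listings in this post and is an assumption for any other version.
opc_not_running() {
    # "isn.*t" tolerates either a straight or a typographic apostrophe.
    # "|| true" keeps the exit status at 0 when every component is running.
    grep -E "isn.*t running|not running|aborted" || true
}
```

Typical use would be `opcsv -status | opc_not_running`. Against the second listing above it would still report ovcs, which remains aborted after the restart.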

Categories: HA, IT Architecture

veritas vcs 5.1 on solaris 5.10 changes of restarting procedure

July 26th, 2012

In VCS 5.1 on Solaris 10, starting and stopping VCS is no longer controlled by the /etc/rc*.d/S* scripts; it is under SMF control.
In addition, the files under /etc/default (gab, llt, vcs, vxfen, etc.) contain variables that must be set to 1 if VCS is set up manually.
For example:

VCS_START=1
VCS_STOP=1

More interestingly, on a one-node VCS cluster the SMF service for VCS is not system/vcs:default but system/vcs-onenode:default.
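
One way to check these settings on a manually configured node is to scan the defaults files for start/stop flags that are not set to 1. This is only a sketch: the function name flags_not_enabled is hypothetical, and it assumes the other files follow the same NAME_START= / NAME_STOP= pattern as the VCS_START / VCS_STOP example, which is not spelled out above.

```shell
# Hypothetical helper: list *_START / *_STOP settings that are not set to 1.
# Prints "file: line" for each offending setting, nothing if all are 1.
# Assumption: every defaults file uses the NAME_START= / NAME_STOP= pattern.
flags_not_enabled() {
    for f in "$@"; do
        # Skip files that do not exist or cannot be read.
        [ -r "$f" ] || continue
        grep -E '^[[:space:]]*[A-Z_]+_(START|STOP)=' "$f" |
            grep -Ev '=1[[:space:]]*$' |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

For example, `flags_not_enabled /etc/default/llt /etc/default/gab /etc/default/vxfen /etc/default/vcs` should print nothing on a correctly prepared node. Running `svcs -a | grep vcs` afterwards shows which of the two SMF instances is actually present.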

Categories: HA, HA & HPC, IT Architecture

vcs commands hang consistently

June 8th, 2012

Today we encountered an issue where Veritas VCS commands hung consistently. Commands such as haconf -dump -makero stayed stuck for so long that we had to terminate them from the console. Tracing system calls and signals with truss (on Solaris) or strace (on Linux) gave the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)
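
In a long trace the interesting part is usually the call the process is blocked in, which truss marks as sleeping. A minimal sketch for pulling that out of a saved trace follows. The function name blocked_calls and the file path in the usage note are made up for illustration.

```shell
# Hypothetical helper: show where a traced process is blocked.
# $1 is a file containing saved truss output; prints the calls that
# truss marked as sleeping, or nothing if there are none.
blocked_calls() {
    # -F treats the pattern as a fixed string, so "(" needs no escaping.
    grep -F '(sleeping' "$1" || true
}
```

A possible workflow is to save the trace with `truss -o /tmp/haconf.truss haconf -dump -makero`, interrupt it once it hangs, and then run `blocked_calls /tmp/haconf.truss`. For the trace above this would print the final poll() line, whose timeout of -1 means the command waits indefinitely.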

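The trace ends with haconf sleeping in poll() with a timeout of -1, that is, waiting indefinitely for a reply on its socket. After some digging on the internet, we found the following solution to this weird problem: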

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfigure GAB & LLT.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload GAB & llt modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart llt.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart gab.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS :
# hastart -force
# VCS: starting on: <node_name>
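
Step 3 reads the module ID off the modinfo output by hand. As a small sketch, the ID can be extracted by module name instead. The function name module_id is hypothetical, and the column positions are an assumption based on the modinfo lines shown above (ID first, module name sixth).

```shell
# Hypothetical helper: print the ID of a loaded kernel module by name.
# Reads modinfo output on stdin; $1 is the module name, e.g. gab or llt.
# Assumes the layout shown above: ID in column 1, module name in column 6.
# Note: on Solaris use nawk or /usr/xpg4/bin/awk, since the legacy
# /usr/bin/awk does not accept -v.
module_id() {
    awk -v name="$1" '$6 == name { print $1 }'
}
```

With this, the unload step could be written as `modunload -i "$(modinfo | module_id gab)"`. It is worth confirming that the substitution is non-empty before passing it to modunload.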

Categories: HA, HA & HPC, IT Architecture

impact of restart vxconfigd on solaris and linux – VxVM Configuration Daemon

May 30th, 2012

Stopping and restarting the VxVM Configuration Daemon, vxconfigd, may cause your VxVA, VMSA and/or VEA session to exit. It may also cause a momentary stoppage of any VxVM configuration actions. This should not harm any data; however, it may cause some configuration operations (e.g. moving subdisks, plex resynchronization) to abort unexpectedly. Any VxVM configuration changes should be completed before restarting the daemon.

If you are using EMC PowerPath devices with Veritas Volume Manager, you must run the EMC command(s) 'powervxvm setup' (or 'safevxvm setup') and/or 'powervxvm online' (or 'safevxvm online') if the restart terminates abnormally. Also, if VCS service groups are running on the host, restarting vxconfigd may cause a failover, so it is best to freeze the service groups first. You can refer to the following for details: http://www.doxer.org/learn-linux/differences-between-freezing-vcs-system-and-freezing-service-group/
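
The order of operations described above can be sketched as a dry run that only prints the commands for review. The function name restart_plan and the group name websg in the usage note are hypothetical. hagrp -freeze, hagrp -unfreeze and vxconfigd -k are the usual VCS/VxVM commands for these steps, but check them against your version before running anything.

```shell
# Hypothetical dry-run helper: print the commands for restarting vxconfigd
# with the given VCS service groups frozen. Nothing is executed.
restart_plan() {
    # Freeze each group first so VCS takes no action during the restart.
    for g in "$@"; do
        printf 'hagrp -freeze %s\n' "$g"
    done
    # Kill the running vxconfigd and start a new one.
    printf 'vxconfigd -k\n'
    # Unfreeze the groups once the daemon is back.
    for g in "$@"; do
        printf 'hagrp -unfreeze %s\n' "$g"
    done
}
```

For example, `restart_plan websg` prints three lines: the freeze, the restart, and the unfreeze. Printing rather than executing is a deliberate choice here, since each step deserves a check (for instance that the group really shows as frozen) before moving on.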

Categories: HA, HA & HPC

vcs service group and resource attributes dictionary page

May 22nd, 2012

Here are all the Veritas VCS service group and resource attributes with their explanations, as a crib sheet/cheat sheet (this is actually the content of the file /etc/VRTSvcs/conf/attributes/cluster_attrs.xml):

Read more…

Categories: HA, HA & HPC