Archive for the ‘HA & HPC’ Category

Oracle OCFS2 cluster filesystem best practice

May 21st, 2013
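  • To check the current o2cb settings, read the files under /sys/kernel/config/cluster/ocfs2/ (ocfs2 here is the cluster name).
  • To set new values for o2cb, unload the stack, run the interactive configure step, then load it again: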

service o2cb unload
service o2cb configure

At the configure prompts, enter the new values, for example:
heartbeat dead threshold: 151 #Iterations before a node is considered dead
network idle timeout: 120000 #Time in ms before a network connection is considered dead
network keepalive delay: 5000 #Max time in ms before a keepalive packet is sent
network reconnect delay: 5000 #Min time in ms between connection attempts

service o2cb load

service o2cb status #shows the new configuration if this OVS (Oracle VM Server) is in a server pool; otherwise it shows the cluster as offline
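
The heartbeat dead threshold is counted in iterations, not seconds. A minimal sketch of the conversion, assuming the commonly documented 2-second heartbeat interval (timeout = (threshold - 1) * 2), plus a helper that prints the live values from configfs. The function names are made up for illustration, and the attribute file names are my assumption about the usual o2cb configfs layout, so check them against your kernel:

```shell
#!/bin/sh
# Sketch: convert an o2cb heartbeat dead threshold (iterations) to seconds,
# assuming a 2-second heartbeat interval:
#   timeout_seconds = (threshold - 1) * 2
o2cb_threshold_to_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

# Sketch: print the live o2cb settings found under a cluster's configfs
# directory. The attribute file names are assumptions; a missing file is
# reported rather than treated as an error.
o2cb_show_settings() {
    dir=${1:-/sys/kernel/config/cluster/ocfs2}
    for f in heartbeat/dead_threshold idle_timeout_ms \
             keepalive_delay_ms reconnect_delay_ms; do
        if [ -r "$dir/$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s = (not available)\n' "$f"
        fi
    done
}

o2cb_threshold_to_seconds 151   # prints 300, i.e. a 5 minute heartbeat timeout
```

Under that assumption, a threshold of 151 means a node is declared dead after about 300 seconds without a heartbeat.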

PS:

o2cb is the default cluster stack for the OCFS2 file system. It includes:
  • a node manager (o2nm) to keep track of the nodes in the cluster,
  • a heartbeat agent (o2hb) to detect live nodes
  • a network agent (o2net) for intra-cluster node communication
  • a distributed lock manager (o2dlm) to keep track of lock resources
  • All these components are in-kernel.
  • It also includes an in-memory file system, dlmfs, to allow userspace to access the in-kernel dlm
  • main conf files: /etc/ocfs2/cluster.conf, /etc/sysconfig/o2cb
  • more info here https://oss.oracle.com/projects/ocfs2-tools/dist/documentation/v1.4/o2cb.html
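
For reference, the timeout values end up in /etc/sysconfig/o2cb. A fragment with the values from this post might look roughly like the sketch below. The variable names are as I remember them from ocfs2-tools, and the O2CB_ENABLED and O2CB_BOOTCLUSTER values are assumptions, so compare with the file on your own host rather than copying this:

```shell
# /etc/sysconfig/o2cb (illustrative fragment, not copied from a real host)
O2CB_ENABLED=true                # assumption: start o2cb at boot
O2CB_BOOTCLUSTER=ocfs2           # assumption: cluster name, as in the configfs path
O2CB_HEARTBEAT_THRESHOLD=151     # iterations before a node is considered dead
O2CB_IDLE_TIMEOUT_MS=120000      # ms before a network connection is considered dead
O2CB_KEEPALIVE_DELAY_MS=5000     # max ms before a keepalive packet is sent
O2CB_RECONNECT_DELAY_MS=5000     # min ms between connection attempts
```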
Categories: Clouding, HA, HA & HPC, Oracle Cloud Tags:

How HA is achieved in Oracle Exadata

November 27th, 2012
  1. Each Exadata Database Machine has completely redundant hardware including redundant InfiniBand networking, redundant Power Distribution Units (PDU), redundant power supplies, and redundant database and storage servers.
  2. Oracle RAC protects against database server failure.
  3. ASM provides data mirroring to protect against disk or storage server failures.
  4. Oracle RMAN provides extremely fast and efficient backups to disk or tape.
  5. Oracle’s Flashback technology allows backing out user errors at the database, table or even row level.
  6. Using Oracle Data Guard, a second Exadata Database Machine can be configured to maintain a real-time copy of the database at a remote site to provide full protection against site failures and disasters.

resolved – opcmsgm isn’t running

October 22nd, 2012

If you encounter the problem of opcmsgm not running, you can do the following to resolve the issue:

  • check the current status with opcsv -status:

Control Manager opcctlm (22616) is running
Action Manager opcactm (22628) is running
Message Manager opcmsgm isn’t running
TT & Notify Mgr opcttnsm (22630) is running
Forward Manager opcforwm (22631) is running
Service Engine opcsvcm (22636) is running
Cert. Srv Adapter opccsad (22634) is running
BBC config adapter opcbbcdist (22635) is running
Display Manager opcdispm (22632) is running
Distrib. Manager opcdistm (22633) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted

  • restart the management server processes with opcsv:
testserver:root root # opcsv -stop
testserver:root root # opcsv -start
  • check status again

testserver:root root # opcsv -status
OVO Management Server status:
-----------------------------
Control Manager opcctlm (15575) is running
Action Manager opcactm (15602) is running
Message Manager opcmsgm (15603) is running
TT & Notify Mgr opcttnsm (15606) is running
Forward Manager opcforwm (15607) is running
Service Engine opcsvcm (15620) is running
Cert. Srv Adapter opccsad (15612) is running
BBC config adapter opcbbcdist (15613) is running
Display Manager opcdispm (15610) is running
Distrib. Manager opcdistm (15611) is running

Open Agent Management status:
-----------------------------
Request Sender ovoareqsdr (2738) is running
Request Handler ovoareqhdlr (2847) is running
Message Receiver (HTTPS) opcmsgrb (2849) is running
Message Receiver (DCE) opcmsgrd (2850) is running

OV Control Core components status:
----------------------------------
OV Control ovcd (1621) is running
OV Communication Broker ovbbccb (1626) is running
OV Certificate Server ovcs aborted
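
Comparing two long status listings by eye is error-prone. In the second listing above, opcmsgm is running again while ovcs is still reported as aborted. A small sketch that filters the report down to the components that are not running; the function name is made up for illustration, and the match relies only on the wording shown in the output above:

```shell
#!/bin/sh
# Sketch: print only the status lines of components that are not running.
# Reads `opcsv -status` style text on stdin and keeps lines ending in
# "isn't running" or "aborted".
not_running() {
    # Normalise the typographic apostrophe some terminals and blogs produce.
    sed "s/’/'/g" | grep -E "(isn't running|aborted)[[:space:]]*$"
}

# Usage on the management server (requires opcsv on PATH):
#   opcsv -status | not_running
```

An empty result means every listed component reported "is running".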

Categories: HA, IT Architecture Tags:

Veritas VCS 5.1 on Solaris 10 (SunOS 5.10): changes to the restart procedure

July 26th, 2012

For VCS 5.1 on Solaris 10, starting and stopping VCS is no longer controlled by the /etc/rc*.d/S* scripts; it is under SMF control.
In addition, the files /etc/default/gab, /etc/default/llt, /etc/default/vcs, /etc/default/vxfen, etc. contain lines that need to be set to 1 if VCS is set up manually.
For example:

VCS_START=1
VCS_STOP=1

More interestingly, on a one-node VCS cluster the SMF service for VCS is not system/vcs:default but system/vcs-onenode:default.
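
A quick way to verify a manual setup is to look for start/stop flags that are not set to 1. The sketch below assumes the flags follow the NAME_START / NAME_STOP pattern of the VCS_START example above. The helper names are made up for illustration, and svcs is the standard Solaris SMF status tool:

```shell
#!/bin/sh
# Sketch: report which *_START / *_STOP flags in an /etc/default style file
# are not set to 1. Prints one "NAME=value" line per offending flag.
flags_not_enabled() {
    grep -E '^[A-Z_]+_(START|STOP)=' "$1" | grep -v '=1$'
}

# Sketch: pick the SMF service name described above.
# $1 is the number of nodes in the cluster.
vcs_smf_service() {
    if [ "$1" -eq 1 ]; then
        echo system/vcs-onenode:default
    else
        echo system/vcs:default
    fi
}

# Usage on a Solaris 10 host:
#   for f in /etc/default/llt /etc/default/gab /etc/default/vxfen /etc/default/vcs; do
#       flags_not_enabled "$f"
#   done
#   svcs "$(vcs_smf_service 2)"
```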

Categories: HA, HA & HPC, IT Architecture Tags:

vcs commands hang consistently

June 8th, 2012

Today we encountered an issue where Veritas VCS commands hung consistently. Commands such as haconf -dump -makero stalled for so long that we had to terminate them from the console. Tracing system calls and signals with truss (on Solaris) or strace (on Linux) gave the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)
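
The last line is the telling one: haconf has sent 42 bytes on descriptor 4 and then sits in poll() with a timeout of -1 (wait forever), presumably for a reply that never arrives. A small sketch for pulling the blocked call out of a saved trace; the function name and the file name in the usage note are made up for illustration:

```shell
#!/bin/sh
# Sketch: print the system call a traced process is blocked in, i.e. the
# last trace line that truss marks as "(sleeping...)".
# Reads saved truss output on stdin.
blocked_call() {
    grep 'sleeping' | tail -n 1 | sed 's/[[:space:]]*(sleeping.*$//'
}

# Usage (hypothetical file name):
#   truss -o /tmp/haconf.truss haconf -dump -makero   # interrupt once it hangs
#   blocked_call < /tmp/haconf.truss
```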

After some searching on the internet, we found the following solution to this weird problem:

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfigure GAB & LLT.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload the GAB & LLT modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart LLT.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart GAB.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS:
# hastart -force
# VCS: starting on: <node_name>
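
The module IDs in step 3 (100 and 84) are specific to that host, so they have to be looked up each time. A minimal sketch that extracts the ID from modinfo output, assuming the column layout shown in the listings above (ID first, module name sixth); the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: extract the module ID (first column) for a named module from
# Solaris `modinfo` output read on stdin. The module name is matched against
# the sixth column, as in the listings above.
module_id() {
    awk -v name="$1" '$6 == name { print $1 }'
}

# Usage on the affected node:
#   modunload -i "$(modinfo | module_id gab)"
#   modunload -i "$(modinfo | module_id llt)"
```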

Categories: HA, HA & HPC, IT Architecture Tags:

impact of restarting vxconfigd on Solaris and Linux – VxVM Configuration Daemon

May 30th, 2012

Stopping and restarting the VxVM Configuration Daemon, vxconfigd, may cause your VxVA, VMSA and/or VEA session to exit. It may also momentarily stop any VxVM configuration actions. This should not harm any data; however, it may cause some configuration operations (e.g. moving subdisks, plex resynchronization) to abort unexpectedly. Complete any VxVM configuration changes before restarting the daemon.

If you are using EMC PowerPath devices with Veritas Volume Manager, you must run the EMC command(s) ‘powervxvm setup’ (or ‘safevxvm setup’) and/or ‘powervxvm online’ (or ‘safevxvm online’) if the restart terminates abnormally. Also, if VCS service groups are running on the host, restarting vxconfigd may cause a failover, so freeze the service groups before restarting it. For details, see: http://www.doxer.org/learn-linux/differences-between-freezing-vcs-system-and-freezing-service-group/
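
Put together, the order of operations could be sketched as follows. This is a dry run by default and only prints the commands. It assumes hagrp -freeze / -unfreeze for a temporary freeze and vxconfigd -k to kill and restart the daemon. The helper names and group names are made up for illustration, so check the commands against your VCS and VxVM versions before setting RUN=1:

```shell
#!/bin/sh
# Sketch: freeze VCS service groups, restart vxconfigd, then unfreeze.
# Runs as a dry run (commands are only printed) unless RUN=1 is set.
# Group names are passed as arguments; none are assumed.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

restart_vxconfigd() {
    for g in "$@"; do run hagrp -freeze "$g"; done     # temporary freeze
    run vxconfigd -k                                    # kill and restart the daemon
    run vxdctl mode                                     # check the daemon is enabled again
    for g in "$@"; do run hagrp -unfreeze "$g"; done
}

# Usage (hypothetical group names):
#   restart_vxconfigd app_sg db_sg          # dry run
#   RUN=1 restart_vxconfigd app_sg db_sg    # really run the commands
```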

Categories: HA, HA & HPC Tags: