H/W under test during POST on SUN T2000 Series

June 12th, 2012

We got the following error messages during POST on a SUN T2000 Series server:

0:0:0>ERROR: TEST = Queue Block Mem Test
0:0:0>H/W under test = MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>Repair Instructions: Replace items in order listed by 'H/W under
test' above.
0:0:0>MSG = Pin 236 failed on MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>END_ERROR
ERROR: The following devices are disabled:
MB/CMP0/CH0/R1/D1
Aborting auto-boot sequence.

To resolve this issue, we can disable the faulty components from ALOM/ILOM, power the server off and on, and then try to boot it again. Here are the steps:

If you use ALOM :
=============
sc> disablecomponent component
sc> poweroff
sc> poweron
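
Example (the component path below is the one reported in the POST error above; substitute the path from your own output):
========
sc> disablecomponent MB/CMP0/CH0/R1/D1
sc> poweroff
sc> poweron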

If you use ILOM :
=============
-> set /SYS/component component_state=disabled
-> stop /SYS
-> start /SYS
Example :
========
-> set /SYS/MB/CMP0/CH0/R1/D1 component_state=disabled

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS
-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

After you have disabled the components, you should clear the SC error log and the FMA logs:

Clearing faults from SC:
----------------------------------

a) Show the faults on the system controller
sc> showfaults -v

b) For each fault listed, run
sc> clearfault <uuid>

c) To re-enable the disabled components, run
sc> clearasrdb

d) Clear ereports
sc> setsc sc_servicemode true
sc> clearereports -y

To clear the FMA faults and error logs from Solaris:
a) Show faults in FMA
# fmadm faulty

b) For each fault listed in the 'fmadm faulty' output, run
# fmadm repair <uuid>

c) Clear ereports and resource cache
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

d) Reset the fmd SERD modules
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire

Categories: Hardware, Servers Tags:

vcs commands hang consistently

June 8th, 2012

Today we encountered an issue where Veritas VCS commands hang consistently. Commands like haconf -dump -makero just get stuck for so long that we have to terminate them from the console. When using truss (on Solaris) or strace (on Linux) to trace system calls and signals, we found the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)

After some digging around on the Internet, we found the following solution to this weird problem:

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfigure GAB and LLT.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload the GAB and LLT kernel modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart LLT.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart GAB.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS:
# hastart -force
VCS: starting on: <node_name>

Categories: Clouding, HA, HA & HPC, IT Architecture Tags:

using oracle materialized view with one hour refresh interval to reduce high concurrency

June 8th, 2012

If your Oracle DB is under very high concurrency and you find that the top SQL statements are queries against some views, then there's a quick way to mitigate this: use an Oracle materialized view. You may consider setting the refresh interval to one hour, which means the view's result set will be refreshed every hour instead of being recomputed on every query. After the change goes live, you should see performance return to normal.

For more information about Oracle materialized views, you can visit http://en.wikipedia.org/wiki/Materialized_view
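
As a rough sketch (the view name mv_order_summary, the orders table and its columns are made up for illustration, not taken from this post), an hourly-refreshing materialized view can be created like this, where NEXT SYSDATE + 1/24 schedules a complete refresh every hour:

CREATE MATERIALIZED VIEW mv_order_summary
BUILD IMMEDIATE
REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/24
AS SELECT customer_id, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;

The hot queries then read the precomputed rows from mv_order_summary (or you can add ENABLE QUERY REWRITE so the optimizer can transparently redirect queries to it) instead of re-running the expensive SQL on every request.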

Here's an image showing high Oracle concurrency:

oracle high concurrency

useful sed single line examples when clearing embedded trojans or embedded links

June 7th, 2012

When somebody maliciously embeds links or trojans into your site, the first thing you'll want to do is clear those malicious links/trojans out. sed is a useful line-oriented stream editor, so it is a natural tool for the cleaning job.

Usually, the embedded code is several lines of HTML like the following:

<div class="trojans">
<a href="http://www.malicous-site-url.com">malicous site's name</a>
blablabla...
</div>

To clear this HTML, you can use the following sed one-liner:

sed '/<div class="trojans">/,/<\/div>/d' injected.htm

But usually the injected files are spread across several directories, or even across your whole website's document root. You can combine find and sed to clean these annoying trojans:

find /var/www/html/yoursite.com/ -type f \( -name '*.htm' -o -name '*.html' -o -name '*.php' \) -exec sed -i.bak '/<div class="trojans">/,/<\/div>/d' {} \;

Please note that I use -i.bak to back up each file before doing the replacement (you should also back up your data before cleaning trojans!).
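
If a cleanup pass turns out to be too aggressive, the .bak copies can be put back with something like the following (a sketch that assumes no file already ended in .bak before the cleanup):

find /var/www/html/yoursite.com/ -type f -name '*.bak' -exec sh -c 'mv "$1" "${1%.bak}"' _ {} \;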

PS:

For more sed examples and tutorials, you may refer to the following two resources:

1. http://sed.sourceforge.net/sed1line.txt

2. http://www.grymoire.com/Unix/Sed.html

differences between Server Connection Time, Server Response Time, Page Load Time and Page Download Time

May 31st, 2012

Here's an excerpt from Google Analytics:

Avg. Server Connection Time (sec): 0.12 The average amount of time (in seconds) spent in establishing TCP connection for this page.

Avg. Server Response Time (sec): 0.80 The average amount of time (in seconds) your server takes to respond to a user request, including the network time from user’s location to your server.

Avg. Page Load Time (sec): 7.85 Avg. Page Load Time is the average amount of time (in seconds) it takes for pages from the sample set to load, from initiation of the pageview (e.g. click a page link) to load completion in the browser. If you see zero (0) as a value, please refer to the Site Speed article.

Avg. Page Download Time (sec): 2.08 The average amount of time (in seconds) to download this page.

For example, my site is like this:

Server Response Time

PS:

1. You can read more about how to use Google Analytics Site Speed at http://support.google.com/analytics/bin/answer.py?hl=en-us&topic=1120718&answer=1205784

2. If you want to break down your site's loading time by digging into individual resources such as JS/CSS/HTML/CGI/PHP files, Firebug is your friend. You can refer to the following two links for how to use Firebug:

http://www.softwareishard.com/blog/firebug/firebug-net-panel-timings/

http://www.softwareishard.com/blog/firebug/page-load-analysis-using-firebug/

Categories: IT Architecture Tags:

difference between SCSI, iSCSI, FCP, FCoE, FCIP, NFS, CIFS, DAS, NAS, SAN and iFCP

May 30th, 2012

Here are some differences between SCSI, iSCSI, FCP, FCoE, FCIP, NFS, CIFS, DAS, NAS, SAN and iFCP (excerpted from the Internet):

Most storage networks use the SCSI protocol for communication between servers and disk drive devices. A mapping layer to other protocols is used to form a network: Fibre Channel Protocol (FCP), the most prominent one, is a mapping of SCSI over Fibre Channel; Fibre Channel over Ethernet (FCoE); iSCSI, mapping of SCSI over TCP/IP.

A storage area network (SAN) is a dedicated network that provides access to consolidated, block level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system.

Historically, data centers first created "islands" of SCSI disk arrays as direct-attached storage (DAS), each dedicated to an application, and visible as a number of "virtual hard drives" (i.e. LUNs). Operating systems maintain their own file systems on their own dedicated, non-shared LUNs, as though they were local to themselves. If multiple systems were simply to attempt to share a LUN, these would interfere with each other and quickly corrupt the data. Any planned sharing of data on different computers within a LUN requires advanced solutions, such as SAN file systems or clustered computing.

Despite such issues, SANs help to increase storage capacity utilization, since multiple servers consolidate their private storage space onto the disk arrays. Sharing storage usually simplifies storage administration and adds flexibility since cables and storage devices do not have to be physically moved to shift storage from one server to another. SANs also tend to enable more effective disaster recovery processes. A SAN could span a distant location containing a secondary storage array. This enables storage replication either implemented by disk array controllers, by server software, or by specialized SAN devices. Since IP WANs are often the least costly method of long-distance transport, the Fibre Channel over IP (FCIP) and iSCSI protocols have been developed to allow SAN extension over IP networks. The traditional physical SCSI layer could only support a few meters of distance - not nearly enough to ensure business continuance in a disaster.

More about FCIP is here: http://en.wikipedia.org/wiki/Fibre_Channel_over_IP (it still uses the FC protocol).

A competing technology to FCIP is known as iFCP. It uses routing instead of tunneling to enable connectivity of Fibre Channel networks over IP.

IP SAN uses TCP as a transport mechanism for storage over Ethernet, and iSCSI encapsulates SCSI commands into TCP packets, thus enabling the transport of I/O block data over IP networks.
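
For instance, on a Linux host with the open-iscsi initiator installed, discovering and logging in to an iSCSI target looks roughly like this (the portal address and target IQN below are made up for illustration):

# iscsiadm -m discovery -t sendtargets -p 192.168.10.50
# iscsiadm -m node -T iqn.2012-06.com.example:storage.lun1 -p 192.168.10.50 --login

After login, the LUN shows up as an ordinary block device (e.g. /dev/sdX) that you can partition and format like a local disk.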

Network-attached storage (NAS), in contrast to SAN, uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block. The key difference between direct-attached storage (DAS) and NAS is that DAS is simply an extension to an existing server and is not necessarily networked. NAS is designed as an easy and self-contained solution for sharing files over the network.

FCoE works with standard Ethernet cards, cables and switches to handle Fibre Channel traffic at the data link layer, using Ethernet frames to encapsulate, route, and transport FC frames across an Ethernet network from one switch with Fibre Channel ports and attached devices to another, similarly equipped switch.

When an end user or application sends a request, the operating system generates the appropriate SCSI commands and data request, which then go through encapsulation and, if necessary, encryption procedures. A packet header is added before the resulting IP packets are transmitted over an Ethernet connection. When a packet is received, it is decrypted (if it was encrypted before transmission), and disassembled, separating the SCSI commands and request. The SCSI commands are sent on to the SCSI controller, and from there to the SCSI storage device. Because iSCSI is bi-directional, the protocol can also be used to return data in response to the original request.

Fibre Channel is more flexible; devices can be as far as ten kilometers (about six miles) apart if optical fiber is used as the physical medium. Optical fiber is not required for shorter distances, however, because Fibre Channel also works using coaxial cable and ordinary telephone twisted pair.

Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems in 1984, allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. CIFS, in contrast, is its Windows-based counterpart used for file sharing.
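
As a quick illustration of the file-level access model (the server names and paths below are made up), mounting an NFS export and a CIFS share on a Linux client looks like this:

# mount -t nfs filer01:/export/data /mnt/data
# mount -t cifs //winsrv01/share /mnt/cifs -o username=myuser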

Categories: Hardware, NAS, SAN, Storage Tags: