Resolved – yum return error No module named sqlite

June 18th, 2012

Today I met an error when I tried running yum commands(like 'yum list' etc) on centos linux, the error message looked like below:


[root@doxer_#1]# yum list
There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:No module named sqlitePlease install a package which provides this module, or
verify that the module is installed correctly.

It's possible that the above module doesn't match the
current version of Python, which is:
2.4.3 (#1, Sep 21 2011, 19:55:41)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)]

If you cannot solve this problem yourself, please go to
the yum faq at:

http://wiki.linux.duke.edu/YumFaq

To resolve this issue, we need install sqlite and python-sqlite manually(can not install it by yum now cause it's not working):

[root@doxer_#1]# wget 'ftp://ftp.pbone.net/mirror/ftp.centos.org/5.8/os/x86_64/CentOS/python-sqlite-1.1.7-1.2.1.x86_64.rpm'


[root@doxer_#1]# wget 'ftp://ftp.muug.mb.ca/mirror/centos/5.8/os/x86_64/CentOS/sqlite-3.3.6-5.x86_64.rpm'
[root@doxer_#1]# rpm -Uvh python-sqlite-1.1.7-1.2.1.x86_64.rpm
[root@doxer_#1}# ldconfig
[root@doxer_#1}# rpm -Uvh sqlite-3.3.6-5.x86_64.rpm

After the installation of these two packages, yum commands worked ok.
Categories: IT Architecture, Linux, Systems Tags:

McAfee Solidcore agent and McAfee agent management

June 14th, 2012

The File Integrity Monitoring (FIM) agents on Solaris and Redhat servers is made up of 2 components, a Solidcore Agent and a McAfee Agent.

  • The Solidcore agent is the element that performs the file monitoring. It runs as a kernel module and needs a kernel restart ( reboot ) to disable it.
  • The McAfee agent is responsible for communication back to a central McAfee Enterprise Policy Orchestrator ( EPO ) server. It runs as a service (cma) that can be stopped with minimal impact to the running server. The software can also easily be removed or reinstalled without and impact.

With both Solidocre and a Mcafee agent running, the centralised ePO will control the 'policy' of files to be monitored on the host. this can be overridden in the OS if need be using commands in the Additional Tasks section below.

Status check
To query the status of solidcore on a server ( as root ) run
# sadmin status

To query the policy a server is running with, the local config needs to be 'unlocked' To do this, 'recover' the config and query the policy.
# sadmin recover #( password required, available from the epo administrator )
# sadmin mon list
# sadmin lockdown

Restart
The McAfee agent has an associated service 'cma' which can be stopped and restarted while the server is running.
- Stopping the service
service cma stop

- Starting the service
service cma start

The Solidcore agent has an associated service 'scsrvc'
- Stopping the service
service scsrvc stop

- Starting the service
service scsrvc start

Note:
The solidcore agent runs as part of the UNIX kernel. Stopping the 'scsrvc' service doesn't fully disable the solidcore software.
To do this :
- Open the local configuration for editing
sadmin recover #{ password needed from the ePO Administrator )

- Set the agent to be disabled at next reboot
sadmin disable

- Close the local configuration for edits
sadmin lockdown

'REBOOT'

when the server comes back the agent will be disabled. This can be confiremd by running :
sadmin status

error system dram available – some DIMMs not probed by solaris

June 12th, 2012

We encountered error message as the following after we disabled some components on a SUN T2000 server:

SEP 29 19:34:35 ERROR: System DRAM  Available: 004096 MB  Physical: 008192 MB

This means that only 4Gb memory available although 8G is physically installed on the server. We can confirm the memory size from the following:

# prtdiag
System Configuration: Sun Microsystems sun4v Sun Fire T200
Memory size: 3968 Megabytes

 

1. Why the server panic :
======================
We don't have clear data for that . But normally Solaris 10 on T2000 Systems may Panic When Low on Memory. The system panics with the following panic string:

hypervisor call 0x21 returned an unexpected error 2

But in our case , that's also not happened .

2. The error which we can see from Alom:
======================

sc> showcomponent
Disabled Devices
MB/CMP0/CH0/R1/D1 : [Forced DIAG fail (POST)]

DIMMs with CEs are being unnecessarily flagged by POST as faulty. When POST encounters a single CE, the associated DIMM is declared faulty and half of system's memory is deconfigured and unavailable for Solaris. Since PSH (Predictive Self-Healing) is the primary means for detecting errors and diagnosing faults on the Niagara platforms, this policy is too aggressive (reference bug 6334560).
3.What action we can take now :
==========================

a) clear the SC log .
b)enable the component in SC .
c)Monitor the server .

if again same fault reports , we will replace the DIMM.

PS:

For more information about DIMM, you can refer to http://en.wikipedia.org/wiki/DIMM

Categories: Hardware, Servers Tags: ,

H/W under test during POST on SUN T2000 Series

June 12th, 2012

We got the following error messages during POST on a SUN T2000 Series server:

0:0:0>ERROR: TEST = Queue Block Mem Test
0:0:0>H/W under test = MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>Repair Instructions: Replace items in order listed by 'H/W under
test' above.
0:0:0>MSG = Pin 236 failed on MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>END_ERROR
ERROR: The following devices are disabled:
MB/CMP0/CH0/R1/D1
Aborting auto-boot sequence.

To resolve this issue, we can disable the components in ALOM/ILOM and power off /on then try to reboot the machine. Here's the steps:

If you use ALOM :
=============
disablecomponent component
poweroff
poweron

If you use ILOM :
=============
-> set /SYS/component component_state=disabled
-> stop /SYS
-> start /SYS
Example :
========
-> set /SYS/MB/CMP0/CH0/R1/D1 component_state=disabled

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS
-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

After you disabled the components, you should clear SC error log and FMA logs:

Clearing faults from SC:
----------------------------------

a) Show the faults on the system controller
sc> showfaults -v

b) For each fault listed run
sc> clearfault <uuid>

c) re-enable the disabled components run
sc> clearasrdb

d) Clear ereports
sc> setsc sc_servicemode true
sc> clearereports -y

To clear the FMA faults and error logs from Solaris:
a) Show faults in FMA
# fmadm faulty

b) For each fault listed in the 'fmadm faulty' run
# fmadm repair <uuid>

c) Clear ereports and resource cache
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

d) Reset the fmd serd modules
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire

Categories: Hardware, Servers Tags:

vcs commands hang consistently

June 8th, 2012

Today we encounter an issue that veritas vcs commands hang in a consistent manner. The commands like haconf -dump -makero just stuck there for a long time that we have to terminate it from console. When using truss(on solaris) or strace(on linux) to trace system calls and signals, we found the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)

After some digging into the internet, we found the following solution to this weird problem:

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfig GAB & llt.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload GAB & llt modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart llt.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart gab.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS :
# hastart -force
# VCS: starting on: <node_name>

Categories: Clouding, HA, HA & HPC, IT Architecture Tags:

using oracle materialized view with one hour refresh interval to reduce high concurrency

June 8th, 2012

If your oracle DB is at a very high concurrency and you find that the top sqls are some views, then there's a quick way to resolve this: using oracle materialized view. You may consider setting the refresh interval to one hour which means the view will refresh every hour. After the setting go live, you'll find the normal performance will appear.

For more information about oracle materialized view, you can visit http://en.wikipedia.org/wiki/Materialized_view

Here's a image with high oracle concurrency:

oracle high concurrency