McAfee Solidcore agent and McAfee agent management

June 14th, 2012

The File Integrity Monitoring (FIM) agents on Solaris and Redhat servers is made up of 2 components, a Solidcore Agent and a McAfee Agent.

  • The Solidcore agent is the element that performs the file monitoring. It runs as a kernel module and needs a kernel restart ( reboot ) to disable it.
  • The McAfee agent is responsible for communication back to a central McAfee Enterprise Policy Orchestrator ( EPO ) server. It runs as a service (cma) that can be stopped with minimal impact to the running server. The software can also easily be removed or reinstalled without and impact.

With both Solidocre and a Mcafee agent running, the centralised ePO will control the 'policy' of files to be monitored on the host. this can be overridden in the OS if need be using commands in the Additional Tasks section below.

Status check
To query the status of solidcore on a server ( as root ) run
# sadmin status

To query the policy a server is running with, the local config needs to be 'unlocked' To do this, 'recover' the config and query the policy.
# sadmin recover #( password required, available from the epo administrator )
# sadmin mon list
# sadmin lockdown

Restart
The McAfee agent has an associated service 'cma' which can be stopped and restarted while the server is running.
- Stopping the service
service cma stop

- Starting the service
service cma start

The Solidcore agent has an associated service 'scsrvc'
- Stopping the service
service scsrvc stop

- Starting the service
service scsrvc start

Note:
The solidcore agent runs as part of the UNIX kernel. Stopping the 'scsrvc' service doesn't fully disable the solidcore software.
To do this :
- Open the local configuration for editing
sadmin recover #{ password needed from the ePO Administrator )

- Set the agent to be disabled at next reboot
sadmin disable

- Close the local configuration for edits
sadmin lockdown

'REBOOT'

when the server comes back the agent will be disabled. This can be confiremd by running :
sadmin status

error system dram available – some DIMMs not probed by solaris

June 12th, 2012

We encountered error message as the following after we disabled some components on a SUN T2000 server:

SEP 29 19:34:35 ERROR: System DRAM  Available: 004096 MB  Physical: 008192 MB

This means that only 4Gb memory available although 8G is physically installed on the server. We can confirm the memory size from the following:

# prtdiag
System Configuration: Sun Microsystems sun4v Sun Fire T200
Memory size: 3968 Megabytes

 

1. Why the server panic :
======================
We don't have clear data for that . But normally Solaris 10 on T2000 Systems may Panic When Low on Memory. The system panics with the following panic string:

hypervisor call 0x21 returned an unexpected error 2

But in our case , that's also not happened .

2. The error which we can see from Alom:
======================

sc> showcomponent
Disabled Devices
MB/CMP0/CH0/R1/D1 : [Forced DIAG fail (POST)]

DIMMs with CEs are being unnecessarily flagged by POST as faulty. When POST encounters a single CE, the associated DIMM is declared faulty and half of system's memory is deconfigured and unavailable for Solaris. Since PSH (Predictive Self-Healing) is the primary means for detecting errors and diagnosing faults on the Niagara platforms, this policy is too aggressive (reference bug 6334560).
3.What action we can take now :
==========================

a) clear the SC log .
b)enable the component in SC .
c)Monitor the server .

if again same fault reports , we will replace the DIMM.

PS:

For more information about DIMM, you can refer to http://en.wikipedia.org/wiki/DIMM

Categories: Hardware, Servers Tags: ,

H/W under test during POST on SUN T2000 Series

June 12th, 2012

We got the following error messages during POST on a SUN T2000 Series server:

0:0:0>ERROR: TEST = Queue Block Mem Test
0:0:0>H/W under test = MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>Repair Instructions: Replace items in order listed by 'H/W under
test' above.
0:0:0>MSG = Pin 236 failed on MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>END_ERROR
ERROR: The following devices are disabled:
MB/CMP0/CH0/R1/D1
Aborting auto-boot sequence.

To resolve this issue, we can disable the components in ALOM/ILOM and power off /on then try to reboot the machine. Here's the steps:

If you use ALOM :
=============
disablecomponent component
poweroff
poweron

If you use ILOM :
=============
-> set /SYS/component component_state=disabled
-> stop /SYS
-> start /SYS
Example :
========
-> set /SYS/MB/CMP0/CH0/R1/D1 component_state=disabled

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS
-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

After you disabled the components, you should clear SC error log and FMA logs:

Clearing faults from SC:
----------------------------------

a) Show the faults on the system controller
sc> showfaults -v

b) For each fault listed run
sc> clearfault <uuid>

c) re-enable the disabled components run
sc> clearasrdb

d) Clear ereports
sc> setsc sc_servicemode true
sc> clearereports -y

To clear the FMA faults and error logs from Solaris:
a) Show faults in FMA
# fmadm faulty

b) For each fault listed in the 'fmadm faulty' run
# fmadm repair <uuid>

c) Clear ereports and resource cache
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

d) Reset the fmd serd modules
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire

Categories: Hardware, Servers Tags:

vcs commands hang consistently

June 8th, 2012

Today we encounter an issue that veritas vcs commands hang in a consistent manner. The commands like haconf -dump -makero just stuck there for a long time that we have to terminate it from console. When using truss(on solaris) or strace(on linux) to trace system calls and signals, we found the following output:

test# truss haconf -dump -makero

execve("/opt/VRTSvcs/bin/haconf", 0xFFBEF21C, 0xFFBEF22C) argc = 3
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

open("//.vcspwd", O_RDONLY) Err#2 ENOENT
getuid() = 0 [0]
getuid() = 0 [0]
so_socket(1, 2, 0, "", 1) = 4
fcntl(4, F_GETFD, 0x00000004) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
connect(4, 0xFFBE7E1E, 110, 1) = 0
fstat64(4, 0xFFBE7AF8) = 0
getsockopt(4, 65535, 8192, 0xFFBE7BF8, 0xFFBE7BF4, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE7BF8, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
brk(0x000F6F28) = 0
brk(0x000F8F28) = 0
poll(0xFFBE8A60, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t15\0\0\0".., 57, 0) = 57
poll(0xFFBE8AA0, 1, -1) = 1
poll(0xFFBE68B8, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 55
poll(0xFFBE8B10, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f 1\0\0\0".., 58, 0) = 58
poll(0xFFBE8B50, 1, -1) = 1
poll(0xFFBE6968, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 49
getpid() = 10386 [10385]
poll(0xFFBE99B8, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\f A\0\0\0".., 130, 0) = 130
poll(0xFFBE99F8, 1, -1) = 1
poll(0xFFBE7810, 0, 0) = 0
recv(4, " G\0\0\0 $\0\0\r02\0\0\0".., 8192, 0) = 62
fstat64(4, 0xFFBE9BB0) = 0
getsockopt(4, 65535, 8192, 0xFFBE9CB0, 0xFFBE9CAC, 0) = 0
setsockopt(4, 65535, 8192, 0xFFBE9CB0, 4, 0) = 0
fcntl(4, F_SETFL, 0x00000084) = 0
getuid() = 0 [0]
door_info(3, 0xFFBE78C8) = 0
door_call(3, 0xFFBE78B0) = 0
open("//.vcspwd", O_RDONLY) Err#2 ENOENT
poll(0xFFBEE370, 1, 0) = 1
send(4, " G\0\0\0 $\0\0\t13\0\0\0".., 42, 0) = 42
poll(0xFFBEE3B0, 1, -1) (sleeping...)

After some digging into the internet, we found the following solution to this weird problem:

1. Stop VCS on all nodes in the cluster by manually killing both had & hashadow processes on each node.
# ps -ef | grep had
root 27656 1 0 10:24:02 ? 0:00 /opt/VRTSvcs/bin/hashadow
root 27533 1 0 10:22:01 ? 0:02 /opt/VRTSvcs/bin/had -restart

# kill 27656 27533
GAB: Port h closed

2. Unconfig GAB & llt.
# gabconfig -U
GAB: Port a closed
GAB unavailable

# lltconfig -U
lltconfig: this will attempt to stop and reset LLT. Confirm (y/n)? y

3. Unload GAB & llt modules.
# modinfo | grep gab
100 60ea8000 38e9b 136 1 gab (GAB device)

# modunload -i 100
GAB unavailable

# modinfo | grep llt
84 60c6a000 fd74 137 1 llt (Low Latency Transport device)
# modunload -i 84
LLT Protocol unavailable

4. Restart llt.
# /etc/rc2.d/S70llt start
Starting LLT
LLT Protocol available

5. Restart gab.
# /etc/gabtab
GAB available
GAB: Port a registration waiting for seed port membership

6. Restart VCS :
# hastart -force
# VCS: starting on: <node_name>

Categories: Clouding, HA, HA & HPC, IT Architecture Tags:

using oracle materialized view with one hour refresh interval to reduce high concurrency

June 8th, 2012

If your oracle DB is at a very high concurrency and you find that the top sqls are some views, then there's a quick way to resolve this: using oracle materialized view. You may consider setting the refresh interval to one hour which means the view will refresh every hour. After the setting go live, you'll find the normal performance will appear.

For more information about oracle materialized view, you can visit http://en.wikipedia.org/wiki/Materialized_view

Here's a image with high oracle concurrency:

oracle high concurrency

useful sed single line examples when clearing embedded trojans or embedded links

June 7th, 2012

When your site is embedded with some links/trojans by somebody maliciously, the first thing you could think of would mostly like to clear these malicious links/trojans. sed is a useful stream editor based on line, and you would of course think of using sed to do the cleaning job.

Usually, the embedded codes would be several lines of html codes like the following:

<div class="trojans">
<a href="http://www.malicous-site-url.com">malicous site's name</a>
blablabla...
</div>

To clear these html codes, you can use the following sed line:

sed  '/<div class=\"trojans\">/,/<\/div>/d' injected.htm

But usually the injected files are spread across several directories or even your whole website's directory. You can combine using find and sed together to clean these annoying trojans:

find /var/www/html/yoursite.com/ -type f \( -name *.htm -o -name *.html -o -name *.php \) -exec sed  -i.bak' /<div class=\"trojans\">/,/<\/div>/d' {} \;

Please note I use -i.bak to backup file before doing the replacement.(you should also backup your data before cleaning trojans!)

PS:

For more info about sed examples/tutorials, you may refer to the following two resources:

1.http://sed.sourceforge.net/sed1line.txt

2.http://www.grymoire.com/Unix/Sed.html