resolved - net/core/dev.c:1894 skb_gso_segment+0x298/0x370()

Today one of our servers logged a lot of errors like the following in /var/log/messages:

Apr 14 21:50:25 test01 kernel: WARNING: at net/core/dev.c:1894 skb_gso_segment+0x298/0x370()
Apr 14 21:50:25 test01 kernel: Hardware name: SUN FIRE X4170 M3
Apr 14 21:50:25 test01 kernel: : caps=(0x60014803, 0x0) len=255 data_len=215 ip_summed=1
Apr 14 21:50:25 test01 kernel: Modules linked in: dm_nfs nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc 8021q garp bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp dm_round_robin libiscsi dm_multipath scsi_transport_iscsi xenfs xen_privcmd dm_mirror video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport sr_mod cdrom ixgbe hwmon dca snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support pcspkr ghes i2c_i801 hed i2c_core dm_region_hash dm_log dm_mod usb_storage ahci libahci sg shpchp megaraid_sas sd_mod crc_t10dif ext3 jbd mbcache
Apr 14 21:50:25 test01 kernel: Pid: 0, comm: swapper Tainted: G W 2.6.39-400.264.4.el5uek #1
Apr 14 21:50:25 test01 kernel: Call Trace:
Apr 14 21:50:25 test01 kernel: <IRQ> [<ffffffff8143dab8>] ? skb_gso_segment+0x298/0x370
Apr 14 21:50:25 test01 kernel: [<ffffffff8106f300>] warn_slowpath_common+0x90/0xc0
Apr 14 21:50:25 test01 kernel: [<ffffffff8106f42e>] warn_slowpath_fmt+0x6e/0x70
Apr 14 21:50:25 test01 kernel: [<ffffffff810d73a7>] ? irq_to_desc+0x17/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff812faf0c>] ? notify_remote_via_irq+0x2c/0x40
Apr 14 21:50:25 test01 kernel: [<ffffffff8100a820>] ? xen_clocksource_read+0x20/0x30
Apr 14 21:50:25 test01 kernel: [<ffffffff812faf4c>] ? xen_send_IPI_one+0x2c/0x40
Apr 14 21:50:25 test01 kernel: [<ffffffff81011f10>] ? xen_smp_send_reschedule+0x10/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff81056e0b>] ? ttwu_queue_remote+0x4b/0x60
Apr 14 21:50:25 test01 kernel: [<ffffffff81509a7e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dab8>] skb_gso_segment+0x298/0x370
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dba6>] dev_gso_segment+0x16/0x50
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dfb5>] dev_hard_start_xmit+0x3d5/0x530
Apr 14 21:50:25 test01 kernel: [<ffffffff8145a074>] sch_direct_xmit+0xc4/0x1d0
Apr 14 21:50:25 test01 kernel: [<ffffffff8143e811>] dev_queue_xmit+0x161/0x410
Apr 14 21:50:25 test01 kernel: [<ffffffff815099de>] ? _raw_spin_lock+0xe/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffffa045820c>] br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81076e77>] ? local_bh_enable+0x27/0xa0
Apr 14 21:50:25 test01 kernel: [<ffffffffa045e7ba>] br_nf_dev_queue_xmit+0x2a/0x90 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045f668>] br_nf_post_routing+0x1f8/0x2e0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130
Apr 14 21:50:25 test01 kernel: [<ffffffffa04581a0>] ? br_forward_finish+0x70/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04581a0>] ? br_forward_finish+0x70/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458186>] br_forward_finish+0x56/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045eba4>] br_nf_forward_finish+0xb4/0x180 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045f36f>] br_nf_forward_ip+0x26f/0x370 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] ? nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04582c8>] __br_forward+0x88/0xc0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458356>] br_forward+0x56/0x60 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04591fc>] br_handle_frame_finish+0x1ac/0x240 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045ee1b>] br_nf_pre_routing_finish+0x1ab/0x350 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff8115bfe9>] ? kmem_cache_alloc_trace+0xc9/0x1a0
Apr 14 21:50:25 test01 kernel: [<ffffffffa045fc55>] br_nf_pre_routing+0x305/0x370 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130

To fix this, we should disable LRO (large receive offload) first:

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i lro off;done

And if the NICs are Intel 10G ones, then we should disable GRO (generic receive offload) too:

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i gro off;done

Here's the command to disable both LRO and GRO:

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i gro off;/sbin/ethtool -K $i lro off;done
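
Note that settings made with ethtool this way do not survive a reboot. A minimal way to make them persistent is simply to rerun the same commands at boot, for example from /etc/rc.local (interface names are examples):

# /etc/rc.local - reapply the offload settings on every boot
for i in eth0 eth1 eth2 eth3; do
    /sbin/ethtool -K $i gro off
    /sbin/ethtool -K $i lro off
done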

 

create shared iscsi LUNs from local disk on Linux

We can use iscsitarget (iSCSI Enterprise Target, IET) to share local disks as iSCSI LUNs to clients. Below are the brief steps.

First, install some packages:

yum install kernel-devel iscsi-initiator-utils -y #'kernel-uek-devel' if you are using oracle linux with UEK
cd iscsitarget-1.4.20.2 #download the package from here
make
make install

And below are some useful tips about iscsitarget:

- The iSCSI target consists of a kernel module (/lib/modules/`uname -r`/extra/iscsi/iscsi_trgt.ko); the kernel modules are installed into the module directory of the running kernel.

- The userspace parts are the daemon (/usr/sbin/ietd) and the control tool (/usr/sbin/ietadm).

- The service is managed through the init script, e.g. /etc/init.d/iscsi-target status.

- The configuration files are /etc/iet/{ietd.conf, initiators.allow, targets.allow}.

Later, we can modify IET(iSCSI Enterprise Target) config file:

vi /etc/iet/ietd.conf

    Target iqn.2001-04.org.doxer:server1-shared-local-disks
        Lun 0 Path=/dev/sdb1,Type=fileio,ScsiId=0,ScsiSN=doxerorg
        Lun 1 Path=/dev/sdb2,Type=fileio,ScsiId=1,ScsiSN=doxerorg
        Lun 2 Path=/dev/sdb3,Type=fileio,ScsiId=2,ScsiSN=doxerorg
        Lun 3 Path=/dev/sdb4,Type=fileio,ScsiId=3,ScsiSN=doxerorg
        Lun 4 Path=/dev/sdb5,Type=fileio,ScsiId=4,ScsiSN=doxerorg
        Lun 5 Path=/dev/sdb6,Type=fileio,ScsiId=5,ScsiSN=doxerorg
        Lun 6 Path=/dev/sdb7,Type=fileio,ScsiId=6,ScsiSN=doxerorg
        Lun 7 Path=/dev/sdb8,Type=fileio,ScsiId=7,ScsiSN=doxerorg

chkconfig iscsi-target on
/etc/init.d/iscsi-target start
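
Once the service is running, you can verify on the server side that the target and its LUNs are actually exported; IET exposes its state under /proc/net/iet:

cat /proc/net/iet/volume   #targets and the LUNs/backing paths behind them
cat /proc/net/iet/session  #initiators currently logged in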

Assume the server sharing its local disks as iSCSI LUNs has the IP 192.168.10.212; we can then do the following on the client hosts to discover and log in to the LUNs:

[root@client01 ~]# iscsiadm -m discovery -t st -p 192.168.10.212
Starting iscsid:                                           [  OK  ]
192.168.10.212:3260,1 iqn.2001-04.org.doxer:server1-shared-local-disks

[root@client01 ~]# iscsiadm -m node -T iqn.2001-04.org.doxer:server1-shared-local-disks -p 192.168.10.212 -l
Logging in to [iface: default, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260] (multiple)
Login to [iface: default, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260] successful.

[root@client01 ~]# iscsiadm -m session --rescan #or iscsiadm -m session -r SID --rescan
Rescanning session [sid: 1, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260]

[root@client01 ~]# iscsiadm -m session -P 3
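
After the login and rescan, the LUNs show up as ordinary SCSI disks on the client. A quick way to check (device names will differ from host to host):

fdisk -l 2>/dev/null | grep '^Disk /dev/sd'   #the new LUNs appear as regular /dev/sd* devices
cat /proc/scsi/scsi                           #the kernel's view of attached SCSI devices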

You can also scan for and log in to the LUNs on the server that shares the local disks itself, but then you should make sure the iscsi-target service starts between network and iscsi during boot:

mv /etc/rc3.d/S39iscsi-target /etc/rc3.d/S12iscsi-target

PS:

  1. The iSCSI target port defaults to 3260. You can check iSCSI connection info in /var/lib/iscsi/send_targets/ - <iscsi portal ip, port>, and /var/lib/iscsi/nodes/ - <target iqn>/<iscsi portal ip, port>.
  2. If there are multiple targets to log in to, we can use "iscsiadm -m node --loginall=all". Use "iscsiadm -m node -T iqn.2001-04.org.doxer:server1-shared-local-disks -p 192.168.10.212 -u" to log out.
  3. More info is here (includes windows iscsi operation), and here is about creating iSCSI LUNs on an Oracle ZFS appliance.

 

resolved - /etc/rc.local not executed on boot in linux

When you find that the scripts in /etc/rc.local are not executed when the system boots, one possibility is that a previous subsys init script takes too long (or hangs), since /etc/rc.local is usually the last script to run (S99local). To find out which subsys script gets stuck, you can instrument /etc/rc.d/rc (which is invoked from /etc/inittab):

[root@host1 tmp] vi /etc/rc.d/rc
# Now run the START scripts.
for i in /etc/rc$runlevel.d/S* ; do
        check_runlevel "$i" || continue

        # Check if the subsystem is already up.
        subsys=${i#/etc/rc$runlevel.d/S??}
        [ -f /var/lock/subsys/$subsys -o -f /var/lock/subsys/$subsys.init ] \
                && continue

        # If we're in confirmation mode, get user confirmation
        if [ -f /var/run/confirm ]; then
                confirm $subsys
                test $? = 1 && continue
        fi

        update_boot_stage "$subsys"
        # Bring the subsystem up.
        if [ "$subsys" = "halt" -o "$subsys" = "reboot" ]; then
                export LC_ALL=C
                exec $i start
        fi
        if LC_ALL=C egrep -q "^..*init.d/functions" $i \
                        || [ "$subsys" = "single" -o "$subsys" = "local" ]; then
                echo $i>>/var/tmp/process.txt
                $i start
                echo $i>>/var/tmp/process_end.txt
        else
                echo $i>>/var/tmp/process_self.txt
                action $"Starting $subsys: " $i start
                echo $i>>/var/tmp/process_self_end.txt
        fi
done

Then reboot the system and check the files /var/tmp/{process.txt,process_end.txt,process_self.txt,process_self_end.txt}. On one of the hosts, I found the entries below:

[root@host1 tmp]# tail process.txt
/etc/rc3.d/S85gpm
/etc/rc3.d/S90crond
/etc/rc3.d/S90xfs
/etc/rc3.d/S91vncserver
/etc/rc3.d/S95anacron
/etc/rc3.d/S95atd
/etc/rc3.d/S95emagent_public
/etc/rc3.d/S97rhnsd
/etc/rc3.d/S98avahi-daemon
/etc/rc3.d/S98gcstartup

[root@host1 tmp]# tail process_end.txt
/etc/rc3.d/S85gpm
/etc/rc3.d/S90crond
/etc/rc3.d/S90xfs
/etc/rc3.d/S91vncserver
/etc/rc3.d/S95anacron
/etc/rc3.d/S95atd
/etc/rc3.d/S95emagent_public
/etc/rc3.d/S97rhnsd
/etc/rc3.d/S98avahi-daemon

From this we can see that /etc/rc3.d/S98gcstartup was started but never finished: it appears in process.txt but not in process_end.txt. To make sure the scripts in /etc/rc.local get executed while the stuck script still runs (as the very last step), we can rename it with a lowercase "s" so it falls out of the S* glob in /etc/rc.d/rc, and then call it explicitly at the end of /etc/rc.local:

[root@host1 tmp]# mv /etc/rc3.d/S98gcstartup /etc/rc3.d/s98gcstartup
[root@host1 tmp]# vi /etc/rc.local

#!/bin/sh

touch /var/lock/subsys/local

#put your scripts here - begin

#put your scripts here - end

#put the stuck script here and make sure it's the last line
/etc/rc3.d/s98gcstartup start

After this, reboot the host and check whether scripts in /etc/rc.local got executed.

resolved - xend error: (98, 'Address already in use')

Today one OVS server had an issue with ovs-agent and needed a reboot. As there were VMs running on it, I tried to live migrate the Xen based VMs with "xm migrate -l", but the error below occurred:

-bash-3.2# xm migrate -l vm1 server1
Error: can't connect: (111, 'Connection refused')
Usage: xm migrate  

Migrate a domain to another machine.

Options:

-h, --help           Print this help.
-l, --live           Use live migration.
-p=portnum, --port=portnum
                     Use specified port for migration.
-n=nodenum, --node=nodenum
                     Use specified NUMA node on target.
-s, --ssl            Use ssl connection for migration.

Xen live migration talks to the xend-relocation-server listening on xend-relocation-port, so this "Connection refused" issue was most likely related to that. Below is the configuration in /etc/xen/xend-config.sxp:

-bash-3.2# egrep -v '^#|^$' /etc/xen/xend-config.sxp
(xend-unix-server yes)
(xend-relocation-server yes)
(xend-relocation-ssl-server no)
(xend-unix-path /var/lib/xend/xend-socket)
(xend-relocation-port 8002)
(xend-relocation-server-ssl-key-file /etc/ovs-agent/cert/key.pem)
(xend-relocation-server-ssl-cert-file /etc/ovs-agent/cert/certificate.pem)
(xend-relocation-address '')
(xend-relocation-hosts-allow '')
(vif-script vif-bridge)
(dom0-min-mem 0)
(enable-dom0-ballooning no)
(dom0-cpus 0)
(vnc-listen '0.0.0.0')
(vncpasswd '')
(xend-domains-lock-path /opt/ovs-agent-2.3/utils/dlm.py)
(domain-shutdown-hook /opt/ovs-agent-2.3/utils/hook_vm_shutdown.py)

And to check the processes related to these:

-bash-3.2# lsof -i :8002
COMMAND   PID     USER   FD   TYPE    DEVICE SIZE NODE NAME
xend    12095 root    5u  IPv4 146473964       TCP *:teradataordbms (LISTEN)

-bash-3.2# ps auxww|egrep '/opt/ovs-agent-2.3/utils/dlm.py|/opt/ovs-agent-2.3/utils/hook_vm_shutdown.py'
root  3501  0.0  0.0   3924   740 pts/0    S+   08:37   0:00 egrep /opt/ovs-agent-2.3/utils/dlm.py|/opt/ovs-agent-2.3/utils/hook_vm_shutdown.py
root 19007  0.0  0.0  12660  5840 ?        D    03:44   0:00 python /opt/ovs-agent-2.3/utils/dlm.py --lock --name vm1 --uuid 56f17372-0a86-4446-8603-d82423c54367
root 27446  0.0  0.0  12664  5956 ?        D    05:11   0:00 python /opt/ovs-agent-2.3/utils/dlm.py --lock --name vm2 --uuid eb1a4e84-3572-4543-8b1d-685b856d98c7

When processes go into D state (uninterruptible sleep), things get troublesome, as such processes can usually only be cleared by rebooting the whole system. However, this server had many VMs running, and live migration/relocation was now blocked by the very problem it was supposed to work around, so we were effectively deadlocked. It seemed a reboot was the only way to "resolve" the issue.
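
As a side note, a quick way to spot processes stuck in D state (a generic one-liner, not specific to this incident):

ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'   #uninterruptible processes and the kernel function they wait in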

Firstly, I tried bouncing xend (/etc/init.d/xend restart), but met the error below in /var/log/messages:

[2015-11-04 04:39:43 24026] INFO (SrvDaemon:227) Xend stopped due to signal 15.
[2015-11-04 04:39:43 24115] INFO (SrvDaemon:332) Xend Daemon started
[2015-11-04 04:39:43 24115] INFO (SrvDaemon:336) Xend changeset: unavailable.
[2015-11-04 04:40:14 24115] ERROR (SrvDaemon:349) Exception starting xend ((98, 'Address already in use'))
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/xen/xend/server/SrvDaemon.py", line 339, in run
    relocate.listenRelocation()
  File "/usr/lib/python2.4/site-packages/xen/xend/server/relocate.py", line 159, in listenRelocation
    hosts_allow = hosts_allow)
  File "/usr/lib/python2.4/site-packages/xen/web/tcp.py", line 36, in __init__
    connection.SocketListener.__init__(self, protocol_class)
  File "/usr/lib/python2.4/site-packages/xen/web/connection.py", line 89, in __init__
    self.sock = self.createSocket()
  File "/usr/lib/python2.4/site-packages/xen/web/tcp.py", line 49, in createSocket
    sock.bind((self.interface, self.port))
  File "", line 1, in bind
error: (98, 'Address already in use')

Later I realized that we could try changing xend-relocation-port, so I made the change below in /etc/xen/xend-config.sxp:

(xend-relocation-port 8003)

And later, bounced xend:

/etc/init.d/xend stop; /etc/init.d/xend start
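
You can verify that xend came back up and is now listening on the new relocation port with the same kind of check as before:

lsof -i :8003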

PS: bouncing xend will not affect running VMs, which I confirmed by comparing the qemu processes (ps -ef|grep qemu) before and after. A tip here: when Xen related commands (xm list and so on) stop working, checking for the "qemu" device model processes will help you get the list of running VMs.

After this, "xm migrate -l vm1 server1" still failed with the same "can't connect: (111, 'Connection refused')" error. I resolved this by specifying the port explicitly (you may need to stop iptables too):

-bash-3.2# xm migrate -l -p 8002 vm1 server1

Now the live migration went on smoothly, and after all VMs were migrated, I changed xend-relocation-port back to 8002 and rebooted the server to fix the D state (uninterruptible sleep) issue.

PS:

If you still find the error "Error: can't connect: (111, 'Connection refused')" even after the above workaround, you can change the port back from 8003 to 8002, or even from 8003 to 8004, restart iptables, and try again.

resolved - mountd Cannot export /scratch, possibly unsupported filesystem or fsid= required

Today when I tried to access an automounted (/net) filesystem exported from another host, the client reported an error:

[root@server01 ~]# cd /net/client01/scratch/
-bash: cd: scratch/: No such file or directory

From the exporting server's side, we can see it's exported and writable:

[root@client01 ~]# cat /etc/exports
/scratch *(rw,no_root_squash)

[root@client01 ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
nas01:/export/generic/share_scratch
                      200G  103G   98G  52% /scratch

So I tried mounting it manually from the client side, but an error was still reported:

[root@server01 ~]# mount client01:/scratch /media
mount: client01:/scratch failed, reason given by server: Permission denied

Here's the log on the server side:

[root@client01 scratch]# tail -f /var/log/messages
Nov  2 03:41:58 client01 mountd[2195]: Caught signal 15, un-registering and exiting.
Nov  2 03:41:58 client01 kernel: nfsd: last server has exited, flushing export cache
Nov  2 03:41:59 client01 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Nov  2 03:41:59 client01 kernel: NFSD: starting 90-second grace period
Nov  2 03:42:11 client01 mountd[16046]: authenticated mount request from 192.162.100.137:1002 for /scratch (/scratch)
Nov  2 03:42:11 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required
Nov  2 03:43:05 client01 mountd[16046]: authenticated mount request from 192.162.100.137:764 for /scratch (/scratch)
Nov  2 03:43:05 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required
Nov  2 03:44:11 client01 mountd[16046]: authenticated mount request from 192.165.28.40:670 for /scratch (/scratch)
Nov  2 03:44:11 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required

After some debugging, the reason turned out to be that the exported filesystem was itself an NFS mount on the server side, and an NFS filesystem cannot be re-exported:

[root@client01 ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
nas01:/export/generic/share_scratch
                      200G  103G   98G  52% /scratch

To work around this, just mount the original NFS share directly instead of going through the re-export:

[root@server01 ~]# mount nas01:/export/generic/share_scratch /media
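
If the mount needs to persist across reboots, a minimal /etc/fstab entry on the client could look like the following (mount point and options are examples):

# /etc/fstab - mount the NAS export directly instead of re-exporting it through another host
nas01:/export/generic/share_scratch  /media  nfs  defaults  0 0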

resolved - Yum Error: Cannot retrieve repository metadata (repomd.xml) for repository Please verify its path and try again

Today when I tried to install a package with yum on one Linux host, the error below appeared even though I had set up the right proxy:

[root@testvm1 yum.repos.d]# export http_proxy=http://right-proxy.example.com:80
[root@testvm1 yum.repos.d]# export ftp_proxy=http://right-proxy.example.com:80
[root@testvm1 yum.repos.d]# export https_proxy=http://right-proxy.example.com:80
[root@testvm1 yum.repos.d]# yum install xinetd
Loaded plugins: refresh-packagekit
http://public-yum.oracle.com/repo/OracleLinux/OL6/UEK/latest/x86_64/repodata/repomd.xml: [Errno 12] Timeout on http://public-yum.oracle.com/repo/OracleLinux/OL6/UEK/latest/x86_64/repodata/repomd.xml: (28, 'connect() timed out!')
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: ol6_UEK_latest. Please verify its path and try again

After some debugging, I found that there was a wrong proxy setting in /etc/yum.conf (it takes precedence over the proxy settings in the shell environment). After commenting out the wrong proxy, yum worked.

From:

[root@testvm1 yum.repos.d]# grep proxy /etc/yum.conf
proxy=http://wrong-proxy.example.com:80

To:

[root@testvm1 yum.repos.d]# grep proxy /etc/yum.conf
#proxy=http://wrong-proxy.example.com:80
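
After correcting the proxy (either in the shell or in /etc/yum.conf), it can also help to clear any metadata yum cached through the bad proxy; these are standard yum subcommands:

yum clean all      #drop cached repo metadata
yum makecache      #rebuild the metadata cache through the working proxy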

resolved - sar -d failed with Requested activities not available in file

Today when I tried to get a report of activity for each block device using "sar -d", the error "Requested activities not available in file" appeared:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d
Requested activities not available in file

To fix this, I did the following:

[root@test01 ~]# cat /etc/cron.d/sysstat
# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 -d 1 1 #add -d. It was */10 * * * * root /usr/lib64/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

Then move the old /var/log/sa/sa11 aside and run sa1 with "-d" to generate a new one:

[root@test01 ~]# mv /var/log/sa/sa11{,.bak}
[root@test01 ~]# /usr/lib64/sa/sa1 -d 1 1 #this generated /var/log/sa/sa11
[root@test01 ~]# /usr/lib64/sa/sa1 -d 1 1 #this put data into /var/log/sa/sa11

After this, the disk activity data could be retrieved:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d
Linux 2.6.18-238.0.0.0.1.el5xen (slc03nsv) 09/11/15

09:26:04 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:26:22 dev202-0 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 dev202-1 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 dev202-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: dev202-0 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: dev202-1 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: dev202-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

For the "DEV" column, you can check the mapping against the major/minor numbers in /dev (dev202-2 is /dev/xvda2):

[root@test01 ~]# ls -l /dev/xvda2
brw-r----- 1 root disk 202, 2 Jan 26 2015 /dev/xvda2
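
In general, the devM-N names map to block device major/minor numbers, so you can look a name up like this (a small helper using the numbers from the example above):

ls -l /dev | awk '$5 == "202," && $6 == "2"'   #find the device node with major 202, minor 2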

Or you can add "-p" to sar, which is simpler:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d -p
Linux 2.6.18-238.0.0.0.1.el5xen (slc03nsv) 09/11/15

09:26:04 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:26:22 xvda 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 root 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: xvda 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: root 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

PS:

Here is more info about sysstat in Linux.

resolved - ORA-12578: TNS:wallet open failed

If you meet an error like "ORA-12578: TNS:wallet open failed", one possibility is that the Oracle RAC database is using a local wallet (created with the -auto_login_local parameter, available from the 11.2 release; a local wallet is usually used in highly confidential systems) but the wallet was migrated from another server.

The migrated local wallet can be opened and read without problems on the new host, but the information inside does not match the hostname, and this leads to the error ORA-12578: TNS:wallet open failed. Note that even on the original host, the wallet cannot be used by another OS user.

The master encryption key is stored in the wallet in TDE (transparent data encryption); it's the key that wraps (encrypts) the Oracle TDE column and tablespace encryption keys. The wallet must be open before you can create an encrypted tablespace and before you can store or retrieve encrypted data. Also, when recovering a database with encrypted tablespaces (for example after a SHUTDOWN ABORT or a catastrophic error that brings down the database instance), you must open the Oracle wallet after database mount and before database open, so the recovery process can decrypt data blocks and redo. When you open the wallet, it is available to all sessions, and it remains open until you explicitly close it or until the database is shut down.
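
For the recovery scenario just described, the sequence looks roughly like this (a sketch using 11g syntax; the wallet password is a placeholder):

# open the wallet between mount and open so encrypted tablespaces can be recovered
sqlplus / as sysdba <<'EOF'
STARTUP MOUNT
ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY "wallet_password";
ALTER DATABASE OPEN;
EOF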

Tablespace encryption encrypts at the physical block level and can perform better than encrypting many individual columns. When using column encryption, there is only one table key regardless of the number of encrypted columns in a table, and the table key is stored in the data dictionary. When using tablespace encryption, the tablespace key is stored in the header of each datafile of the encrypted tablespace.

Below is from here:

TDE uses a two tier key mechanism. When TDE column encryption is applied to an existing application table column, a new table key is created and stored in the Oracle data dictionary. When TDE tablespace encryption is used, the individual tablespace keys are stored in the header of the underlying OS file(s). The table and tablespace keys are encrypted using the TDE master encryption key. The master encryption key is generated when TDE is initialized and stored outside the database in the Oracle Wallet. Both the master key and table keys can be independently changed (rotated, re-keyed) based on company security policies. Tablespace keys cannot be re-keyed (rotated); work around is to move the data into a new encrypted tablespace. Oracle recommends backing up the wallet before and after each master key change.

resolved - nfsv4 Warning: rpc.idmapd appears not to be running. All uids will be mapped to the nobody uid

Today when we tried to mount an NFS share as NFSv4 (mount -t nfs4 testnas:/export/testshare01 /media), the following message appeared:

Warning: rpc.idmapd appears not to be running.
All uids will be mapped to the nobody uid.

Checking file ownership under the mount point, everything was owned by nobody as the warning indicated:

[root@testvm ~]# ls -l /u01/local
total 8
drwxr-xr-x 2 nobody nobody 2 Dec 18 2014 SMKIT
drwxr-xr-x 4 nobody nobody 4 Dec 19 2014 ServiceManager
drwxr-xr-x 4 nobody nobody 4 Mar 31 08:47 ServiceManager.15.1.5
drwxr-xr-x 4 nobody nobody 4 May 13 06:55 ServiceManager.15.1.6

However, as I checked, rpcidmapd was running:

[root@testvm ~]# /etc/init.d/rpcidmapd status
rpc.idmapd (pid 11263) is running...

After some checking, I found it was caused by an old nfs-utils version and some missing NFSv4 related packages on the OEL5 boxes. You can run the following to fix it:

yum -y update nfs-utils nfs-utils-lib nfs-utils-lib-devel sblim-cmpi-nfsv4 nfs4-acl-tools
/etc/init.d/nfs restart
/etc/init.d/rpcidmapd restart
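
Also make sure the NFSv4 identity domain on the client matches the one on the server; on the client this is set in /etc/idmapd.conf (the domain below is an example), followed by an rpcidmapd restart as above:

# /etc/idmapd.conf (excerpt)
[General]
Domain = example.com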

If you are using an Oracle Sun ZFS appliance, make sure on the ZFS side to set the anonymous user mapping to "root" and the custom NFSv4 identity domain to the one used in your environment (e.g. example.com), to avoid the nobody owner issue on NFS clients.

resolved - yum Error performing checksum Trying other mirror and finally No more mirrors to try

Today when I was installing a package on Linux, the error below appeared:

[root@testhost yum.repos.d]# yum list --disablerepo=* --enablerepo=yumpaas
Loaded plugins: rhnplugin, security
This system is not registered with ULN.
ULN support will be disabled.
yumpaas | 2.9 kB 00:00
yumpaas/primary_db | 30 kB 00:00
http://yumrepo.example.com/paas_oel5/repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2: [Errno -3] Error performing checksum
Trying other mirror.
yumpaas/primary_db | 30 kB 00:00
http://yumrepo.example.com/paas_oel5/repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2: [Errno -3] Error performing checksum
Trying other mirror.
Error: failure: repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2 from yumpaas: [Errno 256] No more mirrors to try.

The repo "yumpaas" is hosted on an OEL 6 VM, where createrepo uses sha2 (sha256) checksums by default. However, on OEL5 VMs (the VMs running yum), yum only understands sha1 out of the box. So I worked around this by installing python-hashlib, which extends yum's ability to handle sha2 (python-hashlib comes from the external EPEL repo).

[root@testhost yum.repos.d]# yum install python-hashlib

After this, the problematic repo could be used. But to resolve the issue permanently without a workaround on the OEL5 VMs, the repo should be recreated with sha1 as the checksum algorithm (createrepo -s sha1).
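
A sketch of that permanent fix on the repo server (the repo path is an example):

# rebuild the repo metadata with sha1 checksums so older (OEL5) yum clients can parse it
createrepo -s sha1 /var/www/html/paas_oel5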