Archive for the ‘Unix’ Category

VM shutdown stuck in “mount: you must specify the filesystem type, please stand by while rebooting the system”

November 16th, 2016

When you issue "shutdown" or "reboot" on a Linux box and see "mount: you must specify the filesystem type, please stand by while rebooting the system":

One possible reason is that you have specified wrong mount options for NFS shares in /etc/fstab. For NFSv3, make sure to use the following options when you mount shares:

<share name> <mount dir> nfs rsize=32768,wsize=32768,hard,nolock,timeo=14,noacl,intr,mountvers=3,vers=3 0 0

Using the options below instead will make the VM shutdown hang at "mount: you must specify the filesystem type". Do NOT use these:

<share name> <mount dir> nfs vers=3,rsize=32768,wsize=32768,hard,nolock,timeo=14,noacl,intr 0 0
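A quick way to audit /etc/fstab for this is a small awk check, following the rule above — a sketch that just looks for mountvers=3 in the options field of nfs entries:

```shell
#!/bin/sh
# Flag nfs entries in an fstab whose mount options lack mountvers=3.
check_fstab_nfs() {   # usage: check_fstab_nfs <fstab-file>
  awk '$3 == "nfs" && $4 !~ /mountvers=3/ { print "check options for: " $1 }' "$1"
}
```

Run it as `check_fstab_nfs /etc/fstab`; any share it prints uses the problematic option style.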


TCP wrappers /etc/hosts.allow /etc/hosts.deny

June 2nd, 2016

A simple example on linux box:

[root@test ~]# cat /etc/hosts.allow
sshd : ALL EXCEPT host1.example.com
snmpd : ALL EXCEPT host1.example.com
ALL : localhost

[root@test ~]# cat /etc/hosts.deny
ALL:ALL

And here's the explanation:

sshd and snmpd will accept connections from all hosts except host1.example.com. All services will accept connections from localhost. Everything else is denied by the ALL:ALL entry in hosts.deny.
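The lookup order (hosts.allow first, then hosts.deny, then default-allow) can be illustrated with a small simulation. The function below hardcodes the example rules above rather than parsing the real files — it is just a sketch of tcpd's first-match semantics:

```shell
#!/bin/sh
# Simulate tcpd's first-match evaluation for the example policy:
#   hosts.allow: sshd/snmpd : ALL EXCEPT host1.example.com ; ALL : localhost
#   hosts.deny : ALL : ALL
check_access() {   # usage: check_access <daemon> <client>
  daemon=$1 client=$2
  # hosts.allow rules 1-2: sshd/snmpd from anywhere except host1.example.com
  case $daemon in
    sshd|snmpd)
      [ "$client" != "host1.example.com" ] && { echo allow; return; } ;;
  esac
  # hosts.allow rule 3: any daemon from localhost
  [ "$client" = "localhost" ] && { echo allow; return; }
  # fell through to hosts.deny's ALL:ALL
  echo deny
}
```

So `check_access sshd host2.example.com` prints allow, while `check_access sshd host1.example.com` falls through to hosts.deny and prints deny.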

 


resolved – net/core/dev.c:1894 skb_gso_segment+0x298/0x370()

April 19th, 2016

Today on one of our servers, there were a lot of errors in /var/log/messages like below:

Apr 14 21:50:25 test01 kernel: WARNING: at net/core/dev.c:1894 skb_gso_segment+0x298/0x370()
Apr 14 21:50:25 test01 kernel: Hardware name: SUN FIRE X4170 M3
Apr 14 21:50:25 test01 kernel: : caps=(0x60014803, 0x0) len=255 data_len=215 ip_summed=1
Apr 14 21:50:25 test01 kernel: Modules linked in: dm_nfs nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc 8021q garp bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp dm_round_robin libiscsi dm_multipath scsi_transport_iscsi xenfs xen_privcmd dm_mirror video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport sr_mod cdrom ixgbe hwmon dca snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support pcspkr ghes i2c_i801 hed i2c_core dm_region_hash dm_log dm_mod usb_storage ahci libahci sg shpchp megaraid_sas sd_mod crc_t10dif ext3 jbd mbcache
Apr 14 21:50:25 test01 kernel: Pid: 0, comm: swapper Tainted: G W 2.6.39-400.264.4.el5uek #1
Apr 14 21:50:25 test01 kernel: Call Trace:
Apr 14 21:50:25 test01 kernel: <IRQ> [<ffffffff8143dab8>] ? skb_gso_segment+0x298/0x370
Apr 14 21:50:25 test01 kernel: [<ffffffff8106f300>] warn_slowpath_common+0x90/0xc0
Apr 14 21:50:25 test01 kernel: [<ffffffff8106f42e>] warn_slowpath_fmt+0x6e/0x70
Apr 14 21:50:25 test01 kernel: [<ffffffff810d73a7>] ? irq_to_desc+0x17/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff812faf0c>] ? notify_remote_via_irq+0x2c/0x40
Apr 14 21:50:25 test01 kernel: [<ffffffff8100a820>] ? xen_clocksource_read+0x20/0x30
Apr 14 21:50:25 test01 kernel: [<ffffffff812faf4c>] ? xen_send_IPI_one+0x2c/0x40
Apr 14 21:50:25 test01 kernel: [<ffffffff81011f10>] ? xen_smp_send_reschedule+0x10/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff81056e0b>] ? ttwu_queue_remote+0x4b/0x60
Apr 14 21:50:25 test01 kernel: [<ffffffff81509a7e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dab8>] skb_gso_segment+0x298/0x370
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dba6>] dev_gso_segment+0x16/0x50
Apr 14 21:50:25 test01 kernel: [<ffffffff8143dfb5>] dev_hard_start_xmit+0x3d5/0x530
Apr 14 21:50:25 test01 kernel: [<ffffffff8145a074>] sch_direct_xmit+0xc4/0x1d0
Apr 14 21:50:25 test01 kernel: [<ffffffff8143e811>] dev_queue_xmit+0x161/0x410
Apr 14 21:50:25 test01 kernel: [<ffffffff815099de>] ? _raw_spin_lock+0xe/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffffa045820c>] br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81076e77>] ? local_bh_enable+0x27/0xa0
Apr 14 21:50:25 test01 kernel: [<ffffffffa045e7ba>] br_nf_dev_queue_xmit+0x2a/0x90 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045f668>] br_nf_post_routing+0x1f8/0x2e0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130
Apr 14 21:50:25 test01 kernel: [<ffffffffa04581a0>] ? br_forward_finish+0x70/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04581a0>] ? br_forward_finish+0x70/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458186>] br_forward_finish+0x56/0x70 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045eba4>] br_nf_forward_finish+0xb4/0x180 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045f36f>] br_nf_forward_ip+0x26f/0x370 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] ? nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458130>] ? br_flood_deliver+0x20/0x20 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04582c8>] __br_forward+0x88/0xc0 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa0458356>] br_forward+0x56/0x60 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa04591fc>] br_handle_frame_finish+0x1ac/0x240 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffffa045ee1b>] br_nf_pre_routing_finish+0x1ab/0x350 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff8115bfe9>] ? kmem_cache_alloc_trace+0xc9/0x1a0
Apr 14 21:50:25 test01 kernel: [<ffffffffa045fc55>] br_nf_pre_routing+0x305/0x370 [bridge]
Apr 14 21:50:25 test01 kernel: [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
Apr 14 21:50:25 test01 kernel: [<ffffffff81467428>] nf_iterate+0x78/0x90
Apr 14 21:50:25 test01 kernel: [<ffffffff8146777c>] nf_hook_slow+0x7c/0x130

To fix this, first disable LRO (large receive offload):

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i lro off;done

And if the NICs are Intel 10G, disable GRO (generic receive offload) too:

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i gro off;done

Here's a one-liner to disable both LRO and GRO:

for i in eth0 eth1 eth2 eth3;do /sbin/ethtool -K $i gro off;/sbin/ethtool -K $i lro off;done
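Note that ethtool settings do not survive a reboot. On RHEL/OEL-style systems, one way to persist them is ETHTOOL_OPTS in the interface config file — support for multiple options in one line varies with the initscripts version, so treat this as a sketch and repeat it for each ethN:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0
ETHTOOL_OPTS="-K eth0 lro off; -K eth0 gro off"
```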

 

resolved – /etc/rc.local not executed on boot in linux

November 11th, 2015

If the scripts in /etc/rc.local are not executed when the system boots, one possibility is that a preceding subsys init script takes too long to run, since /etc/rc.local is usually the last one executed (S99local). To find out which subsys script gets stuck, you can instrument /etc/rc.d/rc (which is invoked from /etc/inittab) with some logging, such as the echo lines below:

[root@host1 tmp]# vi /etc/rc.d/rc
# Now run the START scripts.
for i in /etc/rc$runlevel.d/S* ; do
        check_runlevel "$i" || continue

        # Check if the subsystem is already up.
        subsys=${i#/etc/rc$runlevel.d/S??}
        [ -f /var/lock/subsys/$subsys -o -f /var/lock/subsys/$subsys.init ] \
                && continue

        # If we're in confirmation mode, get user confirmation
        if [ -f /var/run/confirm ]; then
                confirm $subsys
                test $? = 1 && continue
        fi

        update_boot_stage "$subsys"
        # Bring the subsystem up.
        if [ "$subsys" = "halt" -o "$subsys" = "reboot" ]; then
                export LC_ALL=C
                exec $i start
        fi
        if LC_ALL=C egrep -q "^..*init.d/functions" $i \
                        || [ "$subsys" = "single" -o "$subsys" = "local" ]; then
                echo $i>>/var/tmp/process.txt
                $i start
                echo $i>>/var/tmp/process_end.txt
        else
                echo $i>>/var/tmp/process_self.txt
                action $"Starting $subsys: " $i start
                echo $i>>/var/tmp/process_self_end.txt
        fi
done

Then reboot the system and check the files /var/tmp/{process.txt,process_end.txt,process_self.txt,process_self_end.txt}. On one host, I found these entries:

[root@host1 tmp]# tail process.txt
/etc/rc3.d/S85gpm
/etc/rc3.d/S90crond
/etc/rc3.d/S90xfs
/etc/rc3.d/S91vncserver
/etc/rc3.d/S95anacron
/etc/rc3.d/S95atd
/etc/rc3.d/S95emagent_public
/etc/rc3.d/S97rhnsd
/etc/rc3.d/S98avahi-daemon
/etc/rc3.d/S98gcstartup

[root@host1 tmp]# tail process_end.txt
/etc/rc3.d/S85gpm
/etc/rc3.d/S90crond
/etc/rc3.d/S90xfs
/etc/rc3.d/S91vncserver
/etc/rc3.d/S95anacron
/etc/rc3.d/S95atd
/etc/rc3.d/S95emagent_public
/etc/rc3.d/S97rhnsd
/etc/rc3.d/S98avahi-daemon

So /etc/rc3.d/S98gcstartup started but never finished. To make sure the scripts in /etc/rc.local get executed, and the stuck /etc/rc3.d/S98gcstartup still gets run as well, we can do this:

[root@host1 tmp]# mv /etc/rc3.d/S98gcstartup /etc/rc3.d/s98gcstartup
[root@host1 tmp]# vi /etc/rc.local

#!/bin/sh

touch /var/lock/subsys/local

#put your scripts here - begin

#put your scripts here - end

#put the stuck script here and make sure it's the last line
/etc/rc3.d/s98gcstartup start

After this, reboot the host and check whether scripts in /etc/rc.local got executed.
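The culprit can also be picked out of those logs mechanically by diffing the started/finished lists. The helper below is a sketch (comm wants sorted input, which also makes it ignore ordering differences):

```shell
#!/bin/sh
# Print scripts that appear in the "started" log but not the "finished" log;
# whatever it prints is the init script that started but never completed.
stuck_scripts() {   # usage: stuck_scripts <started-log> <finished-log>
  sort "$1" > /tmp/.started.$$
  sort "$2" > /tmp/.finished.$$
  comm -23 /tmp/.started.$$ /tmp/.finished.$$
  rm -f /tmp/.started.$$ /tmp/.finished.$$
}
# e.g. stuck_scripts /var/tmp/process.txt /var/tmp/process_end.txt
```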


resolved – sar -d failed with Requested activities not available in file

September 11th, 2015

Today when I tried to get a per-block-device activity report using "sar -d", the error "Requested activities not available in file" appeared:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d
Requested activities not available in file

To fix this, I did the following:

[root@test01 ~]# cat /etc/cron.d/sysstat
# run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 -d 1 1 #add -d. It was */10 * * * * root /usr/lib64/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

Then move the old /var/log/sa/sa11 aside and run sa1 with "-d" to generate a new one:

[root@test01 ~]# mv /var/log/sa/sa11{,.bak}
[root@test01 ~]# /usr/lib64/sa/sa1 -d 1 1 #this generated /var/log/sa/sa11
[root@test01 ~]# /usr/lib64/sa/sa1 -d 1 1 #this put data into /var/log/sa/sa11

After this, the disk activity data could be retrieved:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d
Linux 2.6.18-238.0.0.0.1.el5xen (slc03nsv) 09/11/15

09:26:04 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:26:22 dev202-0 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 dev202-1 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 dev202-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: dev202-0 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: dev202-1 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: dev202-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

For the "DEV" column, you can check the mapping in /dev/* (dev202-2 is /dev/xvda2):

[root@test01 ~]# ls -l /dev/xvda2
brw-r----- 1 root disk 202, 2 Jan 26 2015 /dev/xvda2
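If you'd rather resolve a "devM-N" name programmatically, you can match major/minor numbers against the nodes in /dev. A sketch (uses GNU stat, scans /dev non-recursively):

```shell
#!/bin/sh
# Print device nodes in /dev whose major:minor matches the given numbers,
# e.g. dev_name 202 2 -> /dev/xvda2 on the box above.
dev_name() {   # usage: dev_name <major> <minor>
  want=$(printf '%x:%x' "$1" "$2")   # stat reports major/minor in hex
  for d in /dev/*; do
    [ -b "$d" ] || [ -c "$d" ] || continue
    [ "$(stat -c '%t:%T' "$d")" = "$want" ] && echo "$d"
  done
}
```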

Or you can add "-p" to sar, which is simpler:

[root@test01 ~]# sar -f /var/log/sa/sa11 -d -p
Linux 2.6.18-238.0.0.0.1.el5xen (slc03nsv) 09/11/15

09:26:04 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:26:22 xvda 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 root 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
09:26:22 xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: xvda 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: root 10.39 0.00 133.63 12.86 0.00 0.06 0.06 0.07
Average: xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

PS:

Here is more info about sysstat in Linux.


resolved – nfsv4 Warning: rpc.idmapd appears not to be running. All uids will be mapped to the nobody uid

August 31st, 2015

Today when we tried to mount an NFS share as NFSv4 (mount -t nfs4 testnas:/export/testshare01 /media), the following message appeared:

Warning: rpc.idmapd appears not to be running.
All uids will be mapped to the nobody uid.

And a check of file permissions under the mount point showed everything owned by nobody, as the warning indicated:

[root@testvm~]# ls -l /u01/local
total 8
drwxr-xr-x 2 nobody nobody 2 Dec 18 2014 SMKIT
drwxr-xr-x 4 nobody nobody 4 Dec 19 2014 ServiceManager
drwxr-xr-x 4 nobody nobody 4 Mar 31 08:47 ServiceManager.15.1.5
drwxr-xr-x 4 nobody nobody 4 May 13 06:55 ServiceManager.15.1.6

However, as I checked, rpcidmapd was running:

[root@testvm ~]# /etc/init.d/rpcidmapd status
rpc.idmapd (pid 11263) is running...

After some checking, I found it was caused by an old NFS version and some missing NFSv4 packages on the OEL5 boxes. The following fixes it:

yum -y update nfs-utils nfs-utils-lib nfs-utils-lib-devel sblim-cmpi-nfsv4 nfs4-acl-tools
/etc/init.d/nfs restart
/etc/init.d/rpcidmapd restart

If you are using an Oracle Sun ZFS appliance, make sure to set, on the ZFS side, anonymous user mapping to "root" and the custom NFSv4 identity domain to the one used in your environment (e.g. example.com), to avoid the nobody-owner issue on NFS clients.
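On the client side, the matching knob is the Domain setting in /etc/idmapd.conf, which must agree with the server's NFSv4 identity domain (example.com below is a placeholder); restart rpcidmapd after changing it:

```
# /etc/idmapd.conf on the NFS client
[General]
Domain = example.com
```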

resolved – yum Error performing checksum Trying other mirror and finally No more mirrors to try

August 27th, 2015

Today when I was installing a package on Linux, the following error appeared:

[root@testhost yum.repos.d]# yum list --disablerepo=* --enablerepo=yumpaas
Loaded plugins: rhnplugin, security
This system is not registered with ULN.
ULN support will be disabled.
yumpaas | 2.9 kB 00:00
yumpaas/primary_db | 30 kB 00:00
http://yumrepo.example.com/paas_oel5/repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2: [Errno -3] Error performing checksum
Trying other mirror.
yumpaas/primary_db | 30 kB 00:00
http://yumrepo.example.com/paas_oel5/repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2: [Errno -3] Error performing checksum
Trying other mirror.
Error: failure: repodata/b8e385ebfdd7bed69b7619e63cd82475c8bacc529db7b8c145609b64646d918a-primary.sqlite.bz2 from yumpaas: [Errno 256] No more mirrors to try.

The "yumpaas" repo is hosted on an OEL6 VM, where createrepo uses SHA-256 checksums by default. However, on OEL5 VMs (where yum runs), yum only understands SHA-1 out of the box. I worked around this by installing python-hashlib (from the external EPEL repo) to extend yum's ability to handle SHA-256.

[root@testhost yum.repos.d]# yum install python-hashlib

After this, the problematic repo works. To resolve the issue permanently without a workaround on the OEL5 VMs, recreate the repo with SHA-1 as the checksum algorithm (createrepo -s sha1).
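You can tell which checksum algorithm a repo's metadata declares by looking at its repomd.xml. A grep-based sketch — point it at a downloaded copy of the repo's repodata/repomd.xml:

```shell
#!/bin/sh
# List the distinct checksum algorithms declared in a repomd.xml file.
detect_checksum() {   # usage: detect_checksum <repomd.xml>
  grep -o 'checksum type="[^"]*"' "$1" | sort -u
}
```

If it prints checksum type="sha256" and the client is OEL5, you'll hit the error above unless python-hashlib is installed or the repo is rebuilt with createrepo -s sha1.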

resolved – passwd: User not known to the underlying authentication module

August 12th, 2015

Today we met the following error when we tried to change a user's password:

[root@test ~]# echo 2cool|passwd --stdin test
Changing password for user test.
passwd: User not known to the underlying authentication module

After some searching, we found it was caused by a missing /etc/shadow file:

[root@test ~]# ls -l /etc/shadow
ls: /etc/shadow: No such file or directory

To generate the /etc/shadow file, use the pwconv command:

[root@test ~]# pwconv

[root@test ~]# ls -l /etc/shadow
-r-------- 1 root root 1254 Aug 11 12:13 /etc/shadow

After this, we can reset password without issue:

[root@test ~]# echo mypass|passwd --stdin test
Changing password for user test.
passwd: all authentication tokens updated successfully.


remove usb disk from LVM

July 29th, 2015

On some servers, a USB stick may become part of an LVM volume group. USB sticks are more prone to failure, which causes big problems when they do fail, and they work at a different speed than the drives, which causes performance issues as well.

For example, on one server, you can see below:

[root@test ~]# vgs
  VG      #PV #LV #SN Attr   VSize VFree
  DomUVol   2   1   0 wz--n- 3.77T    0

[root@test~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T    0 #all PE allocated
  /dev/sdc1  DomUVol lvm2 a--  3.59G    0 #this is usb device

[root@test~]# lvs
  LV      VG      Attr   LSize Origin Snap%  Move Log Copy%  Convert
  scratch DomUVol -wi-ao 3.77T

[root@test ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/DomUVol-scratch
                      3.7T  257G  3.3T   8% /scratch

[root@test~]# pvdisplay /dev/sdc1
  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               DomUVol
  PV Size               3.61 GB / not usable 14.61 MB
  Allocatable           yes (but full)
  PE Size (KByte)       32768
  Total PE              115
  Free PE               0
  Allocated PE          115 #so physical extents are allocated on this usb device
  PV UUID               a8a0P5-AlCz-Cu5e-acC2-ldEQ-NPCn-kc4Du0

As you can see above, the USB device has physical extents allocated, so before removing it from the VG we first need to move its extents to the other PV. But the other PV has all its space allocated (also confirmed by vgs above):

[root@test~]# pvdisplay /dev/sdb1
  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               DomUVol
  PV Size               3.76 TB / not usable 30.22 MB
  Allocatable           yes (but full)
  PE Size (KByte)       32768
  Total PE              123360
  Free PE               0
  Allocated PE          123360
  PV UUID               5IyCgh-JsiV-EnpO-XKj4-yxNq-pRjI-d7LKGy
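A quick sanity check on the sizes: the USB PV holds 115 extents of 32 MiB each, so reducing the LV by 5 GiB frees more than enough extents for all of them to move off the stick:

```shell
#!/bin/sh
# 115 PEs * 32768 KiB per PE = 3680 MiB (~3.59 GiB), comfortably under the
# 5 GiB we free with lvreduce.
pe_count=115
pe_size_kb=32768
echo "$(( pe_count * pe_size_kb / 1024 )) MiB"   # prints: 3680 MiB
```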

Here are the steps to take the USB device out of the VG:

umount /scratch
fsck -f /dev/mapper/DomUVol-scratch
# Shrink the filesystem and the LV together: lvreduce -r (--resizefs) calls
# resize2fs in the correct order. Running a plain lvreduce before resize2fs
# would truncate the filesystem.
lvreduce -r --size -5G /dev/mapper/DomUVol-scratch

[root@test ~]# vgs
  VG      #PV #LV #SN Attr   VSize VFree
  DomUVol   2   1   0 wz--n- 3.77T 5.00G

[root@test ~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T 1.41G
  /dev/sdc1  DomUVol lvm2 a--  3.59G 3.59G #PEs on the usb device are all freed; if not, run pvmove /dev/sdc1 first

[root@test ~]# pvdisplay /dev/sdc1
  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               DomUVol
  PV Size               3.61 GB / not usable 14.61 MB
  Allocatable           yes
  PE Size (KByte)       32768
  Total PE              115
  Free PE               115
  Allocated PE          0
  PV UUID               a8a0P5-AlCz-Cu5e-acC2-ldEQ-NPCn-kc4Du0

[root@test ~]# vgreduce DomUVol /dev/sdc1
  Removed "/dev/sdc1" from volume group "DomUVol"

[root@test ~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T 1.41G
  /dev/sdc1          lvm2 a--  3.61G 3.61G #VG column is empty for the usb device, this confirms the usb device is taken out of VG. You can run pvremove /dev/sdc1 to remove the pv.

PS:

  1. If you want to shrink an LVM volume (lvreduce) on /, you'll need to boot into Linux rescue mode. Select "Skip" when the system offers to mount / to /mnt/sysimage. Run "lvm vgchange -a y" first; the remaining steps are more or less the same as above, except that you must prefix every LVM command with "lvm", e.g. "lvm lvs", "lvm pvs", "lvm lvreduce".
  2. You can refer to this article about using vgcfgrestore to restore vg config from /etc/lvm/archive/.

resolved – yum install Error: Protected multilib versions

June 30th, 2015

Today when I tried to install firefox.i686 on Linux using yum, the following error occurred:

Protected multilib versions: librsvg2-2.26.0-14.el6.i686 != librsvg2-2.26.0-5.el6_1.1.x86_64
Error: Protected multilib versions: devhelp-2.28.1-6.el6.i686 != devhelp-2.28.1-3.el6.x86_64
Error: Protected multilib versions: ImageMagick-6.5.4.7-7.el6_5.i686 != ImageMagick-6.5.4.7-6.el6_2.x86_64
Error: Protected multilib versions: vte-0.25.1-9.el6.i686 != vte-0.25.1-8.el6_4.x86_64
Error: Protected multilib versions: polkit-gnome-0.96-4.el6.i686 != polkit-gnome-0.96-3.el6.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

To resolve this, run yum update <package names> for the x86_64 packages listed in the errors; the problem will then go away.


resolved – Checking for glibc-devel-2.12-1.7-i686; Not found. Failed

May 5th, 2015

Today when I tried to install Oracle EM Cloud Control 12c, the following error appeared during the prerequisite check:

pre-check failed

So it's complaining about the missing package "glibc-devel-2.12-1.7-i686" ("Checking for glibc-devel-2.12-1.7-i686; Not found. Failed"). There were glibc-related packages on the system:

[root@testvm ~]# rpm -qa|grep glibc
glibc-common-2.12-1.149.el6_6.7.x86_64
glibc-devel-2.12-1.149.el6_6.7.x86_64
glibc-headers-2.12-1.149.el6_6.7.x86_64
glibc-2.12-1.149.el6_6.7.x86_64

But they were all x86_64 versions, not the missing i686 one. So I installed the i686 packages:

[root@testvm]# yum install -y glibc.i686 glibc-devel.i686 glibc-static.i686

After this, I pressed "Rerun" and the check succeeded.

Categories: IT Architecture, Linux, Systems, Unix Tags:

resolved – file filelists.xml.gz [Errno 5] OSError: [Errno 2] No such file or directory [Errno 256] No more mirrors to try

April 8th, 2015

Today the following error appeared when running yum install for some packages on Linux:

file://localhost/tmp/common1/x86_64/redhat/50/base/ga/Server/repodata/filelists.xml.gz: [Errno 5] OSError: [Errno 2] No such file or directory: '/tmp/common1/x86_64/redhat/50/base/ga/Server/repodata/filelists.xml.gz'
Trying other mirror.
Error: failure: repodata/filelists.xml.gz from base: [Errno 256] No more mirrors to try.
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest

After some checking (yum clean all, re-downloading the repo file to /etc/yum.repos.d, etc.), I finally found it was caused by the following entries in /etc/yum.conf:

[base]
name=Red Hat Linux - Base
baseurl=file://localhost/tmp/common1/x86_64/redhat/50/base/ga/Server

After I commented them out, yum install worked again.

 


resolved – ext3: No journal on filesystem on disk

March 23rd, 2015

Today I met the following error when trying to mount a disk:

[root@testvm ~]# mount /scratch
mount: wrong fs type, bad option, bad superblock on /dev/xvdb1,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

First I ran fsck -y /dev/xvdb1, but after it finished the issue was still there (sometimes fsck -y alone does resolve this). So, as suggested, I ran dmesg | tail:

[root@testvm scratch]# dmesg | tail
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1

So the root cause of the mount failure was "ext3: No journal on filesystem on xvdb1". Since fsck alone hadn't helped, I added an ext3 journal to the disk:

[root@testvm qgomsdc1]# tune2fs -j /dev/xvdb1
tune2fs 1.39 (29-May-2006)
Creating journal inode:

done
This filesystem will be automatically checked every 20 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

After this, the mount succeeded.


resolved – su: cannot set user id: Resource temporarily unavailable

January 12th, 2015

When I tried to log on as user "test", an error occurred:

su: cannot set user id: Resource temporarily unavailable

I checked limits.conf first:

[root@testvm ~]# cat /etc/security/limits.conf|egrep -v '^$|^#'
oracle   soft   nofile    131072
oracle   hard   nofile    131072
oracle   soft   nproc    131072
oracle   hard   nproc    131072
oracle   soft   core    unlimited
oracle   hard   core    unlimited
oracle   soft   memlock    50000000
oracle   hard   memlock    50000000
@svrtech    soft    memlock         500000
@svrtech    hard    memlock         500000
*   soft   nofile    131072
*   hard   nofile    131072
*   soft   nproc    131072
*   hard   nproc    131072
*   soft   core    unlimited
*   hard   core    unlimited
*   soft   memlock    50000000
*   hard   memlock    50000000

Then I compared the user's number of processes/threads with the maximum number of processes, to see whether the limit was being crossed:

[root@c9qa131-slcn03vmf0293 ~]# ps -eLF | grep test | wc -l
1026

So that limit was not exceeded. Then I checked open files:

[root@testvm ~]# lsof | grep aime | wc -l
6059

That's not exceeding 131072 either, so why the error "su: cannot set user id: Resource temporarily unavailable"? The culprit turned out to be /etc/security/limits.d/90-nproc.conf:

[root@testvm ~]# cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

* soft nproc 1024
root soft nproc unlimited

After I changed 1024 to 131072, the issue went away immediately.
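To verify what a fresh session actually gets after editing limits.d (pam_limits applies the limits at login), check /proc/self/limits from a new shell for that user:

```shell
#!/bin/sh
# "Max processes" here is the effective nproc limit for the current process;
# after the fix above, a new login for the user should show 131072 (or the
# limits.conf value), not 1024.
grep -i 'max processes' /proc/self/limits
```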


resolved – cssh installation on linux server

December 29th, 2014

ClusterSSH controls a number of xterm windows via a single graphical console window, letting you run commands interactively on multiple servers over an SSH connection. This guide shows how to install ClusterSSH on a Linux box from a tarball.

First, download the cssh tarball App-ClusterSSH-4.03_04.tar.gz from SourceForge. You may need to export proxy settings if your environment requires them:

export https_proxy=http://my-proxy.example.com:80/
export http_proxy=http://my-proxy.example.com:80/
export ftp_proxy=http://my-proxy.example.com:80/

After the proxy setting, you can now get the package:

wget 'http://sourceforge.net/projects/clusterssh/files/latest/download'
tar zxvf App-ClusterSSH-4.03_04.tar.gz
cd App-ClusterSSH-4.03_04
cat README

Before installing, let's install some prerequisite packages:

yum install gcc libX11-devel gnome* -y
yum groupinstall "X Window System" -y
yum groupinstall "GNOME Desktop Environment" -y
yum groupinstall "Graphical Internet" -y
yum groupinstall "Graphics" -y

Now run "perl Build.PL" as indicated by README:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Can't locate Module/Build.pm in @INC (@INC contains: /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at Build.PL line 5.
BEGIN failed--compilation aborted at Build.PL line 5.

As it complains, you need to install Module::Build first. Let's use cpan to install that module.

Run "cpan" and enter "follow" when the prompt below appears:

Policy on building prerequisites (follow, ask or ignore)? [ask] follow

If you have already run cpan before, you can configure the policy like this:

cpan> o conf prerequisites_policy follow
cpan> o conf commit

Now let's install Module::Build:

cpan> install Module::Build

After the installation, let's run "perl Build.PL" again:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Checking prerequisites...
  requires:
    !  Exception::Class is not installed
    !  Tk is not installed
    !  Try::Tiny is not installed
    !  X11::Protocol is not installed
  build_requires:
    !  CPAN::Changes is not installed
    !  File::Slurp is not installed
    !  File::Which is not installed
    !  Readonly is not installed
    !  Test::Differences is not installed
    !  Test::DistManifest is not installed
    !  Test::PerlTidy is not installed
    !  Test::Pod is not installed
    !  Test::Pod::Coverage is not installed
    !  Test::Trap is not installed

ERRORS/WARNINGS FOUND IN PREREQUISITES.  You may wish to install the versions
of the modules indicated above before proceeding with this installation

Run 'Build installdeps' to install missing prerequisites.

Created MYMETA.yml and MYMETA.json
Creating new 'Build' script for 'App-ClusterSSH' version '4.03_04'

As the output says, run "./Build installdeps" to install the missing modules. Make sure you're in a GUI environment (through vncserver, perhaps), as the build has a step that tests the GUI.

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

......

Running Mkbootstrap for Tk::Xlib ()
chmod 644 "Xlib.bs"
"/usr/bin/perl" "/usr/lib/perl5/5.8.8/ExtUtils/xsubpp" -typemap "/usr/lib/perl5/5.8.8/ExtUtils/typemap" -typemap "/root/.cpan/build/Tk-804.032/Tk/typemap" Xlib.xs > Xlib.xsc && mv Xlib.xsc Xlib.c
make[1]: *** No rule to make target `pTk/tkInt.h', needed by `Xlib.o'. Stop.
make[1]: Leaving directory `/root/.cpan/build/Tk-804.032/Xlib'
make: *** [subdirs] Error 2
/usr/bin/make -- NOT OK
Running make test
Can't test without successful make
Running make install
make had returned bad status, install seems impossible

Errors again; it's complaining about something Tk-related. To resolve this, I manually installed the latest perl-tk module as below:

wget --no-check-certificate 'https://github.com/eserte/perl-tk/archive/master.zip'
unzip master
cd perl-tk-master
perl Makefile.PL
make
make install

After this, let's run "./Build installdeps" and "perl Build.PL" again, which both went through fine:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL

And let's run ./Build now:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build
Building App-ClusterSSH
Generating: ccon
Generating: crsh
Generating: cssh
Generating: ctel

And now "./Build install", the last step:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build install

After installation, let's have a test:

[root@centos-32bits App-ClusterSSH-4.03_04]# echo 'svr testserver1 testserver2' > /etc/clusters

Now run 'cssh svr', and you'll get the charm!


PS: 

If you met error like below:

Can't connect to display `unix:1': No such file or directory at /usr/local/share/perl5/X11/Protocol.pm line 2264.

And you are connecting to vnc session like below:

root 3291 1 0 07:36 ? 00:00:02 /usr/bin/Xvnc :1 -desktop Yue-test:1 (root) -auth /root/.Xauthority -geometry 1600x900 -rfbwait 30000 -rfbauth /root/.vnc/passwd -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -pn

Then make sure to do below:

export DISPLAY=localhost:1.0


resolved – openssl error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm

December 17th, 2014

Today when I tried to fetch a URL with curl, an error occurred:

[root@centos-doxer ~]# curl -i --user username:password -H "Content-Type: application/json" -X POST --data @/u01/shared/addcredential.json https://testserver.example.com/actions -v

* About to connect() to testserver.example.com port 443
*   Trying 10.242.11.201... connected
* Connected to testserver.example.com (10.242.11.201) port 443
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSLv2, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS alert, Server hello (2):
* error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm
* Closing connection #0

After some searching, I found the cause: the current version of openssl (openssl-0.9.8e) does not support the SHA256 signature algorithm. To resolve this, there are a few ways:

1. add -k parameter to curl to ignore the SSL error

2. update openssl to "OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008", just try "yum update openssl".

# openssl version
OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008

3. upgrade openssl to at least openssl-0.9.8o. Here's the way to upgrade openssl: --- this may not be needed, try method 2 above

wget --no-check-certificate 'https://www.openssl.org/source/old/0.9.x/openssl-0.9.8o.tar.gz'
tar zxvf openssl-0.9.8o.tar.gz
cd openssl-0.9.8o
./config --prefix=/usr --openssldir=/usr/openssl
make
make test
make install

After this, run openssl version to confirm:

[root@centos-doxer openssl-0.9.8o]# /usr/bin/openssl version
OpenSSL 0.9.8o 01 Jun 2010
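To see for yourself which signature algorithm a certificate carries (the field that old openssl builds choke on), inspect it with "openssl x509". The sketch below generates a throwaway SHA-256 self-signed certificate instead of contacting a live server, so the paths and the CN are just examples:

```shell
# Create a throwaway self-signed cert signed with SHA-256:
openssl req -x509 -newkey rsa:2048 -nodes -sha256 -days 1 \
  -subj "/CN=sigalg-demo" -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Print its signature algorithm; openssl-0.9.8e cannot verify sha256WithRSAEncryption:
openssl x509 -in /tmp/demo.crt -noout -text | grep 'Signature Algorithm' | head -1
```

For a live server, the same grep works on the output of "echo | openssl s_client -connect host:443 | openssl x509 -noout -text".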

PS:

  • If you originally installed openssl from an rpm package, the rpm database will still report the old version even after you compile and install the new one. This is expected, so don't rely on rpm alone here; query the binary itself:

[root@centos-doxer openssl-0.9.8o]# /usr/bin/openssl version
OpenSSL 0.9.8o 01 Jun 2010

Even after rebuilding rpm DB(rpm --rebuilddb), it's still the old version:

[root@centos-doxer openssl-0.9.8o]# rpm -qf /usr/bin/openssl
openssl-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1

[root@centos-doxer openssl-0.9.8o]# rpm -qa|grep openssl
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1

  • If you met error "Unknown SSL protocol error in connection to xxx" (curl) or "write:errno=104" (openssl) on OEL5, then it's due to curl/openssl comes with OEL5 does not support TLSv1.2 (you can run "curl --help|grep -i tlsv1.2" to confirm). You need manually compile & install curl/openssl to fix this. More info here. (you can softlink /usr/local/bin/curl to /usr/bin/curl, but can leave openssl as it is)

output analysis of linux last command

December 9th, 2014 Comments off

Here's the output of "last|less" on my linux host:

root     pts/9        remote.example   Tue Dec  9 14:51   still logged in
testuser pts/2        :3               Tue Dec  9 14:49   still logged in
aime     pts/1        :2               Tue Dec  9 14:49   still logged in
root     pts/0        :1               Tue Dec  9 14:49   still logged in
testuser pts/13       remote.example   Tue Dec  9 10:48 - 10:52  (00:02)
reboot   system boot  2.6.23           Tue Dec  9 10:11          (04:39)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:20  (00:06)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:14  (00:00)
root     pts/10       10.182.120.179   Thu Dec  4 15:55 - 15:55  (00:00)
testuser pts/14       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/12       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/13       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/15       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/11       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/16       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
root     pts/10       10.182.120.179   Tue Dec  2 11:20 - 11:20  (00:00)
root     pts/7        10.182.120.179   Tue Dec  2 10:15 - down  (6+07:39)
root     pts/6        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/5        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/4        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/3        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/2        :1               Tue Dec  2 10:00 - down  (6+07:55)
aime     pts/1        :2               Tue Dec  2 10:00 - down  (6+07:55)
testuser pts/0        :3               Tue Dec  2 10:00 - down  (6+07:55)
reboot   system boot  2.6.23           Tue Dec  2 09:58         (6+07:56)

Here's some analysis:

  • User "reboot" is a pseudo-user for system reboot. Entries between two reboots are users who log on the system during two reboots. For info about login shells(.bash_profile) and interactive non-login shells(.bashrc), you can refer to here.
  • Here're columns meanings:

Column 1: User logged on

Column 2: The tty name after logging on

Column 3: Remote IP or hostname from which the user logged on. You can see ":1", ":2", ":3", that's vnc port number which vncserver are rendering against.

Column 4: begin/end time of the session. "still logged in" means the user is still logged on; a value in parentheses is the total duration of the session. For the most recent "reboot" entry, that value is the uptime up to now; for the earlier "reboot" entry, it's the uptime between the two reboots. Note however that this time is not always accurate, for example after a system crash and an unusual restart sequence: last calculates it as the time between that boot and the next reboot/shutdown.
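The "(D+HH:MM)" durations in column 4 can be totalled per user with a little awk. Here's a sketch with sample lines inlined to keep it self-contained; in practice, pipe the output of "last" itself into the awk script (lines without a parenthesized duration, such as "still logged in", are skipped by the pattern):

```shell
printf '%s\n' \
  'root     pts/7   10.182.120.179  Tue Dec  2 10:15 - down  (6+07:39)' \
  'testuser pts/13  remote.example  Tue Dec  9 10:48 - 10:52  (00:02)' |
awk '$NF ~ /^\([0-9+:]+\)$/ {
  d = $NF; gsub(/[()]/, "", d)              # strip the parentheses
  days = 0
  if (d ~ /\+/) { split(d, a, "+"); days = a[1]; d = a[2] }
  split(d, t, ":")
  mins[$1] += days*24*60 + t[1]*60 + t[2]   # total minutes per user
}
END { for (u in mins) print u, mins[u], "minutes" }'
```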


Categories: IT Architecture, Linux, Systems, Unix Tags:

resolved – switching from Unbreakable Enterprise Kernel Release 2(UEKR2) to UEKR3 on Oracle Linux 6

November 24th, 2014 Comments off

As we can see from here, the available kernels include the following 3 for Oracle Linux 6:

3.8.13 Unbreakable Enterprise Kernel Release 3 (x86_64 only)
2.6.39 Unbreakable Enterprise Kernel Release 2**
2.6.32 (Red Hat compatible kernel)

On one of our OEL6 VM, we found that it's using UEKR2:

[root@testbox aime]# cat /etc/issue
Oracle Linux Server release 6.4
Kernel \r on an \m

[root@testbox aime]# uname -r
2.6.39-400.211.1.el6uek.x86_64

So how can we switch the kernel to UEKR3(3.8)?

If your linux version is 6.4, first do a "yum update -y" to upgrade to 6.5 or later, then reboot the host and follow the steps below.

[root@testbox aime]# ls -l /etc/grub.conf
lrwxrwxrwx. 1 root root 22 Aug 21 18:24 /etc/grub.conf -> ../boot/grub/grub.conf

[root@testbox aime]# yum update -y

If your linux version reached 6.5 or later via "yum update", you'll find /etc/grub.conf and /boot/grub/grub.conf are two different files (on a host freshly installed as OEL6.5, /etc/grub.conf should still be a softlink):

[root@testbox ~]# ls -l /etc/grub.conf
-rw------- 1 root root 2356 Oct 20 05:26 /etc/grub.conf

[root@testbox ~]# ls -l /boot/grub/grub.conf
-rw------- 1 root root 1585 Nov 23 21:46 /boot/grub/grub.conf

In /etc/grub.conf, you'll see entry like below:

title Oracle Linux Server Unbreakable Enterprise Kernel (3.8.13-44.1.3.el6uek.x86_64)
root (hd0,0)
kernel /vmlinuz-3.8.13-44.1.3.el6uek.x86_64 ro root=/dev/mapper/vg01-lv_root rd_LVM_LV=vg01/lv_root rd_NO_LUKS rd_LVM_LV=vg01/lv_swap LANG=en_US.UTF-8 KEYTABLE=us console=hvc0 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_DM rhgb quiet
initrd /initramfs-3.8.13-44.1.3.el6uek.x86_64.img

All you need to do is copy the entries above from /etc/grub.conf to /boot/grub/grub.conf (make sure /boot/grub/grub.conf is not a softlink, or you may meet the error "Boot loader didn't return any data"), and then reboot the VM.

After rebooting, you'll find the kernel is now at UEKR3(3.8).
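While editing grub files, also make sure the "default=" line points at the stanza you expect, since grub numbers the "title" entries from 0. A quick way to list them with their index (run against an inlined sample file here; use /boot/grub/grub.conf for real):

```shell
cat > /tmp/grub.conf.sample <<'EOF'
default=0
timeout=5
title Oracle Linux Server Unbreakable Enterprise Kernel (3.8.13-44.1.3.el6uek.x86_64)
title Oracle Linux Server (2.6.39-400.211.1.el6uek.x86_64)
EOF

# Number each title entry the way grub's default= counts them (from 0):
awk '/^title/ { print i++ ": " $0 }' /tmp/grub.conf.sample
```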

PS:

If you find the VM is OEL6.5 and /etc/grub.conf is a softlink to /boot/grub/grub.conf, then you could do the following to upgrade kernel to UEKR3:

1. add the following lines to /etc/yum.repos.d/public-yum-ol6.repo:

[public_ol6_UEKR3]
name=UEKR3 for Oracle Linux 6 ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/UEKR3/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

2. List and install UEKR3:

[root@testbox aime]# yum list|grep kernel-uek|grep public_ol6_UEKR3
kernel-uek.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-doc.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-firmware.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-headers.x86_64 3.8.13-26.2.4.el6uek public_ol6_UEKR3

[root@testbox aime]# yum install -y kernel-uek* --disablerepo=* --enablerepo=public_ol6_UEKR3

3. Reboot

resolved – auditd STDERR: Error deleting rule Error sending enable request (Operation not permitted)

September 19th, 2014 Comments off

Today when I try to restart auditd, the following error message prompted:

[2014-09-18T19:26:41+00:00] ERROR: service[auditd] (cookbook-devops-kernelaudit::default line 14) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of /sbin/service auditd restart ----
STDOUT: Stopping auditd: [  OK  ]
Starting auditd: [FAILED]
STDERR: Error deleting rule (Operation not permitted)
Error sending enable request (Operation not permitted)
---- End output of /sbin/service auditd restart ----
Ran /sbin/service auditd restart returned 1

After some reading of the auditctl man page, I realized that when the audit "enabled" flag is set to 2 (locked), any attempt to change the configuration is audited and denied. That may well be the reason for "STDERR: Error deleting rule (Operation not permitted)" and "Error sending enable request (Operation not permitted)". Here's the relevant part of the auditctl man page:

-e [0..2] Set enabled flag. When 0 is passed, this can be used to temporarily disable auditing. When 1 is passed as an argument, it will enable auditing. To lock the audit configuration so that it can't be changed, pass a 2 as the argument. Locking the configuration is intended to be the last command in audit.rules for anyone wishing this feature to be active. Any attempt to change the configuration in this mode will be audited and denied. The configuration can only be changed by rebooting the machine.

You can run auditctl -s to check the current setting:

[root@centos-doxer ~]# auditctl -s
AUDIT_STATUS: enabled=1 flag=1 pid=3154 rate_limit=0 backlog_limit=320 lost=0 backlog=0

And you can run auditctl -e <0|1|2> to change this attribute on the fly, or you can add -e <0|1|2> in /etc/audit/audit.rules. Please note after you modify this, a reboot is a must to make this into effect.
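To see which "-e" value will take effect at the next boot, pull the last "-e" line out of the rules file, since the last one wins. A self-contained sketch with a sample rules file; check /etc/audit/audit.rules for real:

```shell
cat > /tmp/audit.rules.sample <<'EOF'
-D
-b 320
-w /etc/passwd -p wa -k passwd-watch
-e 2
EOF

# The last -e line wins; 2 means the config is locked until reboot:
awk '$1 == "-e" { flag = $2 } END { print "enabled flag:", flag }' /tmp/audit.rules.sample
```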

PS:

Here's more about linux audit.

resolved – Permission denied even after chmod 777 world readable writable

September 19th, 2014 Comments off

Several team members asked me why, when they tried to change into some directories or read some files, the system reported "Permission denied". Even after the files were made world readable/writable (chmod 777), the error was still there:

-bash-3.2$ cd /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs
-bash: cd: /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs: Permission denied

-bash-3.2$ cat /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out
cat: /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out: Permission denied

-bash-3.2$ ls -l /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out
-rwxrwxrwx 1 oracle oinstall 1100961066 Sep 19 07:37 /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out

In summary, if you want to read a file (e.g. wls_sdi1.out) under some directory (e.g. /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs), then besides the "read bit" being set on the file itself (chmod +r wls_sdi1.out), every parent directory of that file (/u01, /u01/local, /u01/local/wls, ......, /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs) also needs the "execute bit" set so the path can be traversed, plus the "read bit" if you want to list its contents (you can check with ls -ld <dir name>):

chmod +r wls_sdi1.out #first set "read bit" on the file
chmod +r /u01; chmod +x /u01; chmod +r /u01/local; chmod +x /u01/local; <...skipped...>chmod +r /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs; chmod +x /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs; #then set both "read bit" & "execute bit" on all parent directories

And at last, if you can log on as the file owner, then everything will be smooth. For /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out, it's owned by oracle user. So you can try log on as oracle user and do the operations.
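To spot which parent directory is missing the needed bits, walk the path upward and print each component's mode ("namei -l <path>" does the same where available). A self-contained sketch using a demo tree under /tmp:

```shell
# Build a demo tree, then print the mode of the file and of every parent
# directory; each directory needs the "x" bit for the file below to be reachable.
mkdir -p /tmp/permdemo/a/b && echo hi > /tmp/permdemo/a/b/f.txt
p=/tmp/permdemo/a/b/f.txt
while [ "$p" != "/" ]; do
  stat -c '%A %n' "$p"     # %A = rwx mode string, %n = name (GNU stat)
  p=$(dirname "$p")
done
```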

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags:

crontab cronjob failed with date single apostrophe date +%d-%b-%Y-%H-%M on linux

August 4th, 2014 Comments off

I tried to create a linux cron job today, and wanted to note down the date & time when the job ran. Here's the content:

echo '10 10 * * 1 root cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +%m-%d-%y`.tar.gz oc4j.log;echo "">/var/log/ovm-manager/oc4j.log' > /etc/cron.d/oc4j

However, this entry failed to run, and when check log in /var/log/cron:

Aug 4 06:24:01 testhost crond[1825]: (root) RELOAD (cron/root)
Aug 4 06:24:01 testhost crond[1825]: (root.bak) ORPHAN (no passwd entry)
Aug 4 06:25:01 testhost crond[28376]: (root) CMD (cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +)

So the command was truncated at the "%" character, and that's the reason for the failure.

Eventually, I figured out that cron treats the % character specially (it is turned into a newline in the command). You must precede all % characters with a \ in a crontab file, which tells cron to just put a % in the command. And here's the updated version:

echo '10 10 * * 1 root cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +\%m-\%d-\%y`.tar.gz oc4j.log;echo "">/var/log/ovm-manager/oc4j.log' > /etc/cron.d/oc4j

This time, the job got ran successfully:

Aug 4 06:31:01 testhost crond[1825]: (root) RELOAD (cron/root)
Aug 4 06:31:01 testhost crond[1825]: (root.bak) ORPHAN (no passwd entry)
Aug 4 06:31:01 testhost crond[28503]: (root) CMD (cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +%m-%d-%y`.tar.gz oc4j.log;echo "">/var/log/ovm-manager/oc4j.log)
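cron's handling of "%" can be imitated in plain bash with string operations: everything after the first unescaped "%" becomes stdin for the job, while "\%" is passed through as a literal "%". A small demo (no cron involved; the backticks sit inside single quotes, so nothing is executed):

```shell
unescaped='cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +%m-%d-%y`.tar.gz oc4j.log'
# cron cuts the command line at the first unescaped %:
echo "cron runs: ${unescaped%%\%*}"

escaped='cd /var/log/ovm-manager/;tar zcvf oc4j.log.`date +\%m-\%d-\%y`.tar.gz oc4j.log'
# with \% escapes, cron strips the backslashes and keeps literal % signs:
echo "cron runs: ${escaped//\\%/%}"
```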

PS:

  1. In cron, use "su - oracle -c <cmd>" if the script must run as a specific user
  2. More on here http://stackoverflow.com/questions/1486088/cron-fails-on-single-apostrophe
Categories: IT Architecture, Linux, Systems, Unix Tags:

linux process accounting set up

July 8th, 2014 Comments off

Ensure the psacct package is installed and enable it to start at boot:

rpm -qa|grep -i psacct
chkconfig psacct on
service psacct start

Here are some useful commands:

[root@qg-dc2-tas_sdi ~]# ac -p #Display time totals for each user
emcadm 0.00
test1 2.57
aime 37.04
oracle 32819.22
root 12886.86
testuser 1.47
total 45747.15

[root@qg-dc2-tas_sdi ~]# lastcomm testuser #Display command executed by user testuser
top testuser pts/5 0.02 secs Fri Jul 4 03:59
df testuser pts/5 0.00 secs Fri Jul 4 03:59

[root@qg-dc2-tas_sdi ~]# lastcomm top #Search the accounting logs by command name
top testuser pts/5 0.03 secs Fri Jul 4 04:02

[root@qg-dc2-tas_sdi ~]# lastcomm pts/5 #Search the accounting logs by terminal name pts/5
top testuser pts/5 0.03 secs Fri Jul 4 04:02
sleep X testuser pts/5 0.00 secs Fri Jul 4 04:02

[root@qg-dc2-tas_sdi ~]# sa |head #Use sa command to print summarizes information(e.g. the number of times the command was called and the system resources used) about previously executed commands.
332 73.36re 0.03cp 8022k
33 8.76re 0.02cp 7121k ***other*
14 0.02re 0.01cp 26025k perl
7 0.00re 0.00cp 16328k ps
49 0.00re 0.00cp 2620k find
42 0.00re 0.00cp 13982k grep
32 0.00re 0.00cp 952k tmpwatch
11 0.01re 0.00cp 13456k sh
11 0.00re 0.00cp 2179k makewhatis*
8 0.01re 0.00cp 2683k sort

[root@qg-dc2-tas_sdi ~]# sa -u |grep testuser #Display output per-user
testuser 0.00 cpu 14726k mem sleep
testuser 0.03 cpu 4248k mem top
testuser 0.00 cpu 22544k mem sshd *
testuser 0.00 cpu 4170k mem id
testuser 0.00 cpu 2586k mem hostname

[root@qg-dc2-tas_sdi ~]# sa -m | grep testuser #Display the number of processes and number of CPU minutes on a per-user basis
testuser 22 8.18re 0.00cp 7654k
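lastcomm output also lends itself to quick awk summaries, for example total CPU seconds per user. A sketch with sample lines inlined (pipe real "lastcomm" output into the same awk; note the field positions assume no flag column such as F or X between the command and the user, so adjust if your lines carry flags):

```shell
printf '%s\n' \
  'top   testuser pts/5  0.02 secs Fri Jul  4 03:59' \
  'df    testuser pts/5  0.01 secs Fri Jul  4 03:59' \
  'perl  oracle   pts/2  1.50 secs Fri Jul  4 04:10' |
awk '{ cpu[$2] += $4 }       # field 2 = user, field 4 = CPU seconds
     END { for (u in cpu) printf "%s %.2f secs\n", u, cpu[u] }'
```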

Categories: IT Architecture, Linux, Systems, Unix Tags:

Enable NIS client on linux host

July 2nd, 2014 1 comment

After you set up the NIS server, you need to set up the NIS clients. Here are the steps for enabling an NIS client on a linux box.

Ensure required packages are installed

rpm -qa|egrep 'yp-tools|ypbind|portmap'

Edit /etc/sysconfig/network

NISDOMAIN=example.com

Edit /etc/yp.conf
domain example.com server 10.229.169.88
domain example.com server 10.229.192.99

Set NIS domain-name

domainname example.com
ypdomainname example.com

Set /etc/nsswitch.conf

passwd: files nis
shadow: files nis
group: files nis
hosts: files dns nis
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files
netgroup: nisplus
publickey: nisplus
automount: files nisplus
aliases: files nisplus
sudoers: files nis

Make sure the portmap service is running:

service portmap start

chkconfig portmap on

Start ypbind service:

service ypbind start
chkconfig ypbind on

Test it out:

rpcinfo -u localhost ypbind

ypcat passwd|egrep 'username'
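If ypcat works but NIS users still don't resolve, double-check that nsswitch.conf really lists "nis" for the maps you care about. A grep sketch (a sample file is used here so the snippet is self-contained; point it at /etc/nsswitch.conf on the client):

```shell
cat > /tmp/nsswitch.sample <<'EOF'
passwd: files nis
shadow: files nis
group: files nis
hosts: files dns nis
EOF

# Every map that should consult NIS must carry "nis" after "files":
grep -E '^(passwd|shadow|group):.*nis' /tmp/nsswitch.sample
```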

If you want to set up sudo privileges for NIS users, then you can refer to this article resolved – /etc/sudoers: syntax error near line 10

PS:

  • If there's a firewall between the Linux NIS clients and the NIS servers, then you should not start ypbind (chkconfig ypbind off; service ypbind stop). If you do start it, the box will keep retrying the unreachable NIS servers, get stuck, and take a long time to log on even as root. This is a rule of thumb.
  • Here's about ypwhich:

#a list of available maps

[root@testvm ~]# ypwhich -m

networks.byaddr dcsun2
netgroup.byuser dcsun2
ethers.byaddr dcsun2-new
ypservers dcsun2
hosts.byaddr dcsun2
auto.utils ap101nis
mail.byaddr dcsun2
passwd.byname dcsun2
passwd.byuid dcsun2
protocols.bynumber dcsun2-new
stnis st-yp
group.bygid dcsun2
networks.byname dcsun2
tnsnames dcsun2-new
ethers.byname dcsun2-new
netmasks.byaddr st-yp
hosts.byname dcsun2
auto_home st-yp
auto_home_slc st-yp
printers.conf.byname st-yp
mail.aliases dcsun2
protocols.byname dcsun2-new
bootparams dcsun2-new
rpc.bynumber dcsun2-new
timezone.byname st-yp-refresh
netid.byname dcsun2
group.byname dcsun2
auto_ade st-yp-refresh
netgroup dcsun2
publickey.byname dcsun2-new
netgroup.byhost dcsun2
services.byname dcsun2-new
auto_home_appsdev st-yp

#Display the map nickname translation table, in /var/yp/nicknames

[root@testvm ~]# ypwhich -x

Use "ethers" for map "ethers.byname"
Use "aliases" for map "mail.aliases"
Use "services" for map "services.byname"
Use "protocols" for map "protocols.bynumber"
Use "hosts" for map "hosts.byname"
Use "networks" for map "networks.byaddr"
Use "group" for map "group.byname"
Use "passwd" for map "passwd.byname"

  • TBC
Categories: IT Architecture, Linux, Systems, Unix Tags:

resolved – /etc/sudoers: syntax error near line 10

July 2nd, 2014 Comments off

When using /usr/sbin/visudo, errors occurred after the modification (you can run visudo -c to check manually):

>>> /etc/sudoers: syntax error near line 10 <<<

Here's line 10:

User_Alias Users_SDITAS = username1, username2

Then I changed it as following:

User_Alias USERS_SDITAS = username1, username2

And now everything is ok. So the lesson is that alias names must be all uppercase (uppercase letters, digits and underscores only, starting with a letter).
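visudo -c remains the authoritative syntax check, but the specific mistake above (a lowercase alias name) can also be caught with a quick grep before you ever touch the live file. A sketch against a sample file; the alias and user names are just examples:

```shell
cat > /tmp/sudoers.sample <<'EOF'
User_Alias Users_SDITAS = username1, username2
User_Alias USERS_OK = username3
EOF

# Flag alias definitions whose name contains lowercase letters (visudo
# rejects these); only the first line should match:
grep -E '^(User|Cmnd|Host|Runas)_Alias +[^=]*[a-z]' /tmp/sudoers.sample
```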

PS:
1. Here's the explanation about User_Alias Users_SDITAS = username1, username2

The first part is the user,
The second is the host(s) on which the user may use the sudo command,
The third part is which users he may act as,
The last one, is which commands he may run when using sudo.
For example, root ALL=(ALL) ALL means the root user can execute from ALL hosts, acting as ALL (any) users, and run ALL (any) commands. And USERS_SDITAS ALL=(oracle) NOPASSWD:SETENV: CMD_MIGRATIONDC1DC3 means users in alias USERS_SDITAS can execute from ALL hosts, acting as the oracle user, and run the commands in alias CMD_MIGRATIONDC1DC3. (sudo -E -u oracle <command>; -E will pass the invoking user's env variables to the target user if the SETENV tag is added to the sudo commands in /etc/sudoers. You'll get the error message "sudo: sorry, you are not allowed to preserve the environment" if you did not add the SETENV tag in /etc/sudoers. You can run sudo -l or sudo -ll to get a list of privileged commands for yourself, or for others if you run sudo -l -U <username>.)

2. One sample of /etc/sudoers configuration in linux (use visudo to edit, as visudo checks for errors after modification; you may need to run "echo 'export PATH=/usr/bin:$PATH' >> /etc/profile" in some circumstances so that sudo resolves to /usr/bin/sudo):

Defaults logfile=/var/log/sudo.log

Defaults always_set_home #switched to target user's home directory when running sudo. Note that HOME is already set when the the env_reset option is enabled, so always_set_home is only effective for configurations where either env_reset is disabled(Defaults !env_reset) or HOME is present in the env_keep list(Defaults env_keep += HOME). This flag is off by default.
Host_Alias HOSTS_MIGRATIONDC1DC3 = slcn06vmf0012, slcn06vmf0013
Cmnd_Alias CMD_MIGRATIONDC1DC3 = /u01/local/wls/user_projects/domains/base_domain/bin/tasctl, /u01/shared/wls/Oracle_SDI1/sdictl/sdictl.sh
User_Alias USERS_SDITAS =username1, username2
USERS_SDITAS ALL=(ALL) NOPASSWD: /bin/su - oracle #users in USERS_SDITAS group can now sudo su - oracle without asking for a password
oracle ALL=(ALL) NOPASSWD:SETENV: CMD_MIGRATIONDC1DC3 #oracle user can run all commands in commands group CMD_MIGRATIONDC1DC3.

3. To check whether some NIS users are using the /bin/false shell (meaning they cannot log on to the host via ssh), use the following command:

ypcat passwd|awk -F: '{if($1 ~ /^username1$|^username2$/) print}'|grep false

 4. To disallow some commands (here, disabling su to root), you can check below:

username='user1'
password=`date +%s | sha256sum | base64 | head -c 9 ; echo`
echo "Username is $username and Password is $password"
useradd $username; mkdir -p /home/$username; chown $username:$username /home/$username; chmod 755 /home/$username; echo $password|passwd --stdin $username
set +H
echo "$username ALL=(ALL) NOPASSWD:SETENV:ALL, !/usr/bin/passwd, !/bin/su root , !/bin/su - root, !/bin/su -, !/bin/bash, !/bin/sh, !/bin/tcsh, !/bin/csh, !/bin/ksh" >> /etc/sudoers
visudo -c

Categories: IT Architecture, Linux, Systems, Unix Tags: ,

Resolved – rm cannot remove some files with error message “Device or resource busy”

June 11th, 2014 1 comment

If you meet a problem when removing a file on linux, with the below error message:

[root@test-host ~]# rm -rf /u01/shared/*
rm: cannot remove `/u01/shared/WLS/oracle_common/soa/modules/oracle.soa.mgmt_11.1.1/.nfs0000000000004abf00000001': Device or resource busy
rm: cannot remove `/u01/shared/WLS/oracle_common/modules/oracle.jrf_11.1.1/.nfs0000000000005c7a00000002': Device or resource busy
rm: cannot remove `/u01/shared/WLS/OracleHome/soa/modules/oracle.soa.fabric_11.1.1/.nfs0000000000006bcf00000003': Device or resource busy

Then it means that some processes are still referring to these files. You have to stop those processes before removing the files. You can use the linux command lsof to find the processes using a specific file:

[root@test-host ~]# lsof |grep nfs0000000000004abf00000001
java 2956 emcadm mem REG 0,21 1095768 19135 /u01/shared/WLS/oracle_common/soa/modules/oracle.soa.mgmt_11.1.1/.nfs0000000000004abf00000001 (slce49sn-nas:/export/C9QA123_DC1/tas_central_shared)
java 2956 emcadm 88r REG 0,21 1095768 19135 /u01/shared/WLS/oracle_common/soa/modules/oracle.soa.mgmt_11.1.1/.nfs0000000000004abf00000001 (slce49sn-nas:/export/C9QA123_DC1/tas_central_shared)

So from here you can see that the process with PID 2956 is still using file /u01/shared/WLS/oracle_common/soa/modules/oracle.soa.mgmt_11.1.1/.nfs0000000000004abf00000001.

However, some systems have no lsof installed by default. You can either install it or use the alternative, "fuser":

[root@test-host ~]# fuser -cu /u01/shared/WLS/oracle_common
/u01/shared/WLS/oracle_common: 2956m(emcadm) 7358c(aime)

Here too you can see that processes with PIDs 2956 and 7358 are referring to the directory /u01/shared/WLS/oracle_common.

So you'll need to stop the process first by killing it (or, better, stop it via the service's own stop mechanism if one is defined):

kill -9 2956

After that, you can try removing the files again; it should be ok this time.
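The whole situation can be reproduced safely under /tmp: hold a file open in the background, locate the holder through /proc (the same information lsof reads), then kill it. Note that the ".nfsXXXX / Device or resource busy" behaviour above is NFS-specific; on a local filesystem rm succeeds even while the file is open, so this sketch only demonstrates how to find the holding process (Linux-only, since it relies on /proc/<pid>/fd):

```shell
echo data > /tmp/busy.txt
tail -f /tmp/busy.txt &      # background process holding the file open
holder=$!
sleep 1

# /proc/<pid>/fd lists every file the process holds open:
ls -l /proc/$holder/fd | grep busy.txt

kill $holder
rm -f /tmp/busy.txt && echo removed
```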

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags:

avoid putty ssh connection sever or disconnect

January 17th, 2014 2 comments

After some idle time, the ssh session will disconnect itself. If you want to avoid this, you can try running the following command:

while [ 1 ];do echo hi;sleep 60;done &

This will print message "hi" every 60 seconds on the standard output.

PS:

You can also set some parameters in /etc/ssh/sshd_config, you can refer to http://www.doxer.org/make-ssh-on-linux-not-to-disconnect-after-some-certain-time/
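A cleaner alternative to the busy loop, at least for OpenSSH clients (PuTTY has an equivalent keepalive setting under Connection), is a client-side keepalive in ssh_config. The stanza is written to a sample path below so the snippet is self-contained; append it to ~/.ssh/config for real use:

```shell
# ServerAliveInterval makes the client send a keepalive probe every 60s;
# after 3 unanswered probes (ServerAliveCountMax) the session is dropped.
cat > /tmp/ssh_config.sample <<'EOF'
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF
cat /tmp/ssh_config.sample
```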

hpux tips

June 30th, 2013 Comments off
tusc #like truss or strace
lsdev
swapinfo -tm #memory usage
/var/adm/syslog/syslog.log, /etc/shutdownlog,
/etc/rc.log
/var/adm/syslog/syslog.log #like /var/log/messages
/etc/shutdownlog
/var/adm/crash/crash.X
/etc/rc.config.d/netconf #the interfaces which are started at boot up
/opt/fcms/bin/fcdutil /dev/fcd0 #HP-ux HBA and driver info
swlist -l product | grep "Fibre Channel Driver" #HP-UX
/usr/sbin/swlist and swinstall
bdf #HP, report number of free disk blocks
Extend the logical volume: lvextend -L <new LV size in MB> /dev/vgxx/lvolXX
Extend the filesystem to use the space added: fsadm -F vxfs -b <new size in 1 KB sectors> <mount point>
/opt/fcms/bin/fcdutil /dev/fcd0 | grep "World Wide Name"
dlmsetconf, dlmcfgmgr
hrdconf #HP
model #find out what model of machine we're on, like 9000/800/rp4440
/opt/ignite/bin/print_manifest #To display system information and configuration(model, memory, CPU, Storage, partitions, I/O devices, software installed, kernel parameters, ip address)
/usr/sbin/lanscan #lists all network adapters, deprecated(after this, use ifconfig <name of NIC> to check details). related commands lanadmin, linkloop, lan, nwmgr<this is recommended>
lanadmin -x 0 #PPA number
ioscan #HP, scan I/O system, scan newly added disks, check processor type etc
ioscan -f #all devices
ioscan –fknC fc #list HBA devices
/opt/fcms/bin/fcmsutil #Fibre Channel Mass Storage Utility Command, fcmsutil /dev/tdX -> Display HBA details
/sbin/rc3.d #run levels, All the scripts should take the appropriate action depending on the argument given. The stop script for a subsystem should be in the rc directory one run level below its start script, e.g. if the start script is in rc3.d then the stop script should be in rc2.d
/stand   #kernel and kernel configuration files
/usr/bin/bdf -l #FS, like df -k
/usr/sbin/sam #system admin tool
/usr/sbin/fsadm #linux belongs to lvm2(JBOD - Raid0)
/usr/sbin/{cstm,xstm,mstm} #Support Tools Manager,
/sbin/ipf #rules in /etc/opt/ipf/ipf.conf, ipf –Fa –f /etc/opt/ipf/ipf.conf to re-read rules file
/usr/lbin/modprpw #To unlock the account (if TCB is used) use: /usr/lbin/modprpw -l -k <loginid>
/opt/perf/bin/extract #performance monitoring
/etc/pam.conf
/etc/nsswitch.conf
/etc/opt/ldapux/ldapux_client.conf
/opt/ldapux/config/setup
nsquery passwd liandy
A Guide to HP-UX Document Collections HP documents
Categories: IT Architecture, Systems, Unix Tags:

AIX tips

June 30th, 2013 Comments off
fuser -cuxk /oracle #kill all the process using filesystem /oracle
procstack #show current stack of a process
bootlist -m normal -o          # Lists the current bootlist
bootlist -m normal cd0 hdisk0  # To set cd0 and hdisk0 as first and second boot devices
bootlist -m service cd0 rmt0   # To change the bootlist for service mode
alog -L
alog -o -t boot
alog -L -t boot #find out the properties of boot log file
##Device Configuration Database(Predefined, Customized)
01. Available  - Device is ready and can be used
02. Defined    - Device is unavailable
03. Unknown    - Undefined
04. Stopped    - Configured but unavailable
lsdev
      -C  to list customized database
      -P  to list predefined database
      -c (class)
      -t (type)
      -s (subtype)
To list all customised devices ie installed
 # lsdev -C
To list all the Hard Drives in a system
 # lsdev -Cc disk
To list all the adapters in a sytem
 # lsdev -Cc adapter
lscfg -v  #list all installed devices in detail
lscfg -vpl fcs0<ent0> #find out the WWN, FRU #, firmware level of fibre adapter fcs0
entstat -d ent0 #link status, link speed and mac address and statistics of an Ethernet adapter ent0
##Setting multiple IP address for a single network card
 # ifconfig lo0 alias 195.60.60.1
 # ifconfig en0 alias <IPadress> netmask <net_mask>
/etc/rc.net, /etc/rc.tcpip #make the above permanent
lsattr -El ent0 -a media_speed -R #find out the possible media_speed values for ethernet card ent0
lsattr -El mem0 #find out the effective attribute of a device "mem0"
lsattr -El sys0 #list the defaults in the pre-defined db for device ent0
To change the maximum number of processes allowed per user
Find out the valid range of values using lsattr command
 # lsattr -l sys0 -a maxuproc -R
 40...131072 (+1)
Change the maxuproc value using chdev command
 # chdev -l sys0 -a maxuproc=10000
rmdev -l (device) -d #delete the device
To delete a static route manually
Syntax:- chdev -l inet0 -a delroute=<net>,<destination_address>,<Gate_way_address>,<Subnet_mask>
 # chdev -l inet0 -a delroute='net','0.0.0.0','172.26.160.2'
To change the IP address of an interface manually
 # chdev -l en0 -a netaddr=192.168.123.1 -a netmask=255.255.255.0 -a state=up
To set the IP address initially
 # mktcpip -h <hostname> -a <ipaddress> -m <subnet_mask> -i <if_name> -n <NameServer_address>
   -d <domain_name> -g <gateway_address> -A no
##add device to system
To define a tape device
 # mkdev -d -c tape -t 8mm -s scsi -p scsi0  -w 5,0
To make the predefined rmt0 tape to available status
 # mkdev -l rmt0
##configure new devices using cfgmgr
cfgmgr -l fcs0 #configure detected devices attached to the fcs0 adapter
cfgmgr -i /tmp/drivers #install device driver software from /tmp/drivers while configuring detected devices
getconf -a
prtconf -c/m/s
bootinfo -K
mksysb/savevg/restore
sysdumpdev/sysdumpstart/snap/kdb
swapon/swapoff/lsps/chps/mkps/rmps, /etc/swapspaces, /etc/filesystems
LVM - lsvg/lspv/mkvg/mklv/logform/crfs/chfs/extendvg/mklvcopy/syncvf/bosboot/synclvodm/chpv
NIM - Network Installation Management
/etc/netsvc.conf #Name resolution order
##no command is used to change the network tuning parameters. ioo for IO tuning(aio, asynchronous IO), vmo for virtual memory manager parameters
To list the current network parameters / network options
 # no  -a
To enable IP forwarding
 # no -o "ipforwarding=1"
To make ipforwarding=1 permanent now and after reboot
 # no -p -o ipforwarding=1
###/etc/tunables/xxx, tuncheck/tunsave/tunrestore/tundefault
startsrc/lssrc, iptrace, tcpdump #The startsrc command sends the System Resource Controller (SRC) a request to start a subsystem or a group of subsystems, or to pass on a packet to the subsystem that starts a subserver.
##ODM, object data manager
/etc/objrepos
/usr/lib/objrepos
/usr/share/lib/objrepos
odmget CuDv #list all records with an Object Class CuDv
odmget -q "name=sys0 and attribute=maxuproc" CuAt
svmon -G #memory. pin(frames that cannot be swapped), pg space(paging space, ie swap)
pagesize
svmon -P 13548 -i 1 2 #monitor memory leak by  looking for processes whose working segment continually grows
trcon/filemon/trcstop #Most Active Logical/physical Volumes, most active Files
rmss #a means to simulate different sizes of real memory that are smaller than your actual machine
netpmon #network monitoring
##package management
oslevel -r/s/l xxxx/-g/-rq
###PATH
/etc/objrepos
/usr/lib/objrepos
/usr/share/lib/objrepos
lslpp -l [software name]
lslpp -f <fileset name> #display the names of all the files of fileset
lslpp -w /usr/sbin/nfsd #which fileset a file belongs to
##NFS
service portmap start
service nfs start
showmount -e localhost
/var/lib/nfs/
/etc/exports #/backup/downloads *(sync,ro,root_squash,wdelay), exportfs -a, exportfs *:/backup/downloads
mount -fv -t nfs <xx> <dir> #check ports used
lslpp -ha #installation history of filesets
To list all installable software in media /dev/cd0
 installp [-L|-l] -d /dev/cd0
To clean up all failed installations
 installp -C
To install the bos.net package (apply and commit) with all prerequisites from directory /tmp/net
 installp -acgx -d /tmp/net bos.net
To commit the applied updates
 installp -cgx all
To remove bos.net package
 installp -ug bos.net
To find out whether a Fix is installed or not
 # instfix -i -k <APAR Number>
To list all the fixes that are installed on your system
 # instfix -i -v
To list filesets lower than the specified maintenance level
 # instfix -ciqk 5100-04_AIX_ML | grep ":-:"
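The `grep ":-:"` filter above can be turned into a quick count of below-level filesets. The sample `instfix -ciq` lines below are made up for illustration, with the field layout assumed to be fix:fileset:required-level:installed-level:status:abstract:

```shell
# Sketch: count filesets flagged as below the requested ML.
# Status field "-" marks a fileset under the required level, "=" means it matches.
sample_instfix='5100-04_AIX_ML:bos.rte:5.1.0.35:5.1.0.35:=:AIX 5100-04 Update
5100-04_AIX_ML:bos.net.tcp.client:5.1.0.35:5.1.0.25:-:AIX 5100-04 Update'

below=$(printf '%s\n' "$sample_instfix" | grep -c ':-:')
echo "$below fileset(s) below the requested ML"   # 1 fileset(s) below the requested ML
```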
To install all filesets associated with fix Ix38794 from the tape
 # instfix  -k Ix38794  -d /dev/rmt0
To Display the entire list of fixes present on the media
 # instfix -T -d /dev/cd0
To confirm the AIX preventive maintenance level on your system
 # instfix -i | grep ML
 All filesets for 5.0.0.0_AIX_ML were found.
 All filesets for 5.1.0.0_AIX_ML were found.
 All filesets for 5100-01_AIX_ML were found.
 All filesets for 5100-02_AIX_ML were found.
Updating the software to the latest level
01. Using smit
    # smit update_all
02. To update all filesets in a system using command line
    a. Create the list of filesets installed
       # lslpp -Lc | awk -F: '{print $2}'| tail -n +2 > /tmp/lslpp
    b. Update the software using the installp command
       # installp -agxYd /dev/cd0 -e /tmp/<exclude_list> -f /tmp/lslpp
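Step 2a's pipeline can be tried on canned output. The sample `lslpp -Lc` lines below are illustrative only; the real command emits one colon-separated record per fileset with a leading header line:

```shell
# Demo of the fileset-list step: field 2 of `lslpp -Lc` output is the
# fileset name; tail -n +2 drops the header line.
sample_lslpp='#Package Name:Fileset:Level:State:Type:Description
bos:bos.rte:5.3.0.0:C:F:Base Operating System Runtime
bos.net:bos.net.tcp.client:5.3.0.0:C:F:TCP/IP Client Support'

printf '%s\n' "$sample_lslpp" | awk -F: '{print $2}' | tail -n +2 > /tmp/lslpp.demo
cat /tmp/lslpp.demo
# bos.rte
# bos.net.tcp.client
```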
Another way of updating all the filesets
 # /usr/lib/instl/sm_inst installp_cmd  -acgNXY -d <location_of_updates> -f '_update_all'
For not committing and saving all replaced files
 # /usr/lib/instl/sm_inst installp_cmd  -agX -d <location_of_updates> -f '_update_all'
To list all the installed efixes on a system
 # emgr -l
To install an efix IY93496.070302.epkg.Z from the /mnt directory
 # emgr -e /mnt/IY93496.070302.epkg.Z
inutoc
The inutoc command creates the .toc file in Directory. If a .toc file already exists, it is recreated with new information. The inutoc command adds table of contents entries in the .toc file for every installation image in Directory.
The installp command and the bffcreate command call this command automatically upon the creation or use of an installation image in a directory without a .toc file.
To create a .toc file for the /tmp/images directory, enter:
 # inutoc /tmp/images
bffcreate
The bffcreate command creates an installation image file in backup file format (bff) to support software installation operations. It creates an installation image file from an installation image file on the specified installation media.
To create an installation image file from the bos.net software package on the tape in the /dev/rmt0 tape drive and use /var/tmp as the working directory, type:
 # bffcreate  -d /dev/rmt0.1 -w /var/tmp bos.net
##security
/etc/security/environ
/etc/security/group
/etc/security/lastlog
/etc/security/limits
/etc/security/login.cfg
/usr/lib/security/mkuser.default
/etc/security/passwd
/etc/security/portlog
/etc/security/user
/etc/security/failedlogin
/etc/security/ldap/ldap.cnf
chsec -f /etc/security/limits -s joe -a cpu=3600 #change the CPU time limit of user joe to 1 hour
chuser rlogin=true smith #enable user smith to access this system remotely
pwdadm -c user1 # To reset the ADMCHG flag for the user user1<forces the user to change the password the next time a login command or an su command is given for the user>
who -a /etc/security/failedlogin # read failed login attempts
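The /etc/security files above use AIX's stanza format: an unindented "user:" header followed by indented "attr = value" lines. A minimal sketch of reading one attribute back out (get_limit is a hypothetical helper, and the sample stanza mirrors what the chsec example above would set):

```shell
# Hypothetical helper: read one attribute for one user from an
# /etc/security/limits-style stanza file supplied on stdin.
get_limit() {
    # $1 = user stanza name, $2 = attribute
    awk -v u="$1:" -v a="$2" '
        $1 == u         { in_s = 1; next }   # entered the requested stanza
        /^[^ \t]/       { in_s = 0 }         # any new stanza header ends it
        in_s && $1 == a { print $3 }         # "attr = value" line
    '
}

# Sample stanza matching chsec -f /etc/security/limits -s joe -a cpu=3600
get_limit joe cpu <<'EOF'
default:
        fsize = 2097151
        cpu = -1

joe:
        cpu = 3600
EOF
# prints 3600
```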
##LPAR and HMC<logical partition and hardware management console>
lsslot/hmcshutdown/chsysstate/lsrsrc/smtctl/vtmenu/mkvterm/rmvterm/lssysconn/rmsysconn/lssyscfg/lsled/chled/lparstat -i/chhmc/mkvdev
##HACMP
HACMP Daemon
01. clstrmgr
02. clinfo
03. clmuxpd
04. cllockd
cllsgrp/clshowres/clRGmove/clfindres/clRGinfo/cldump/varyonvg/importvg/chvg/cllsappmon/clclear
##Storage
fcstat -D fcs0 | grep Attention #To find out the fiber channel link status
lsvpcfg #List all Vpath devices and their states
dpovgfix vg00 #fix a DPO Vpath volume group that has mixed vpath and hdisk volumes
###EMC powerpath
To configure all the emc hdisks, run emc_cfgmgr script. This script invokes the AIX cfgmgr tool to probe each adapter bus separately
To remove the Symmetrix hdisks
 # lsdev -CtSYMM* -Fname | xargs -n1 rmdev -dl
To remove hdisks corresponding to CLARiiON devices
 # lsdev -CtCLAR* -Fname | xargs -n1 rmdev -dl
To probe all emc disks
 # inq
To set up multipathing to the root device
 # pprootdev on
To Remove all hdiskpower devices
 # lsdev -Ct power -c disk -F name | xargs -n1 rmdev -l
To find out which hdiskpower device contains hdisk132
 # powermt display dev=hdisk132
###HP Autopath
dlnkmgr view -drv
dlmrmdev #remove all the DLM drivers
dlmpr -a -c #To clear the SCSI reserves on the disks
/usr/DynamicLinkManager/drv/dlmfdrv.conf
###MPIO
To list all the paths which are in Enabled status
 # lspath -s ena -Fname -p fscsi0
To enable a path (the command prints "paths Enabled" on success)
 # chpath -s ena -l hdisk0
To list all available disks and their paths
  # lspath | sort +1
To list all disks which paths are in failed state
 # lspath -s failed
To list all disks which paths are in Defined state
 # lspath -s defined
To remove a path
 rmpath -dl <disk_name> -p <parent> -w <connection>
 rmpath -dl hdisk3 -p fscsi0 -w 5005076801105daf,1000000000000
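The failed-path listing and rmpath examples above combine naturally into a cleanup loop. This is a hedged sketch: lspath and rmpath are mocked with sample output so the loop logic can be exercised anywhere; delete the two mock functions before running it on a live AIX system:

```shell
# Mock the AIX commands so the loop is runnable off-AIX.  Real
# `lspath -s failed` prints "Failed <disk> <parent>" per path.
lspath() { printf 'Failed hdisk3 fscsi0\nFailed hdisk7 fscsi1\n'; }   # mock
rmpath() { echo "rmpath $*"; }                                        # mock

# Remove every failed path reported by lspath.
lspath -s failed | while read -r _status disk parent; do
    rmpath -dl "$disk" -p "$parent"
done
```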
Categories: IT Architecture, Systems, Unix Tags:

solaris tips

June 30th, 2013 Comments off
pkginfo -l SUNWcsu
mdb & kmdb http://docs.oracle.com/cd/E19082-01/817-2543/index.html
echo ::memstat | mdb -k #memory usage profile
echo ::kmastat | mdb -k
echo "::threadlist -v" | mdb -k #collect the stack trace of all threads in mdb
truss -p <pid>/ndd/etc/system
pstack /var/core/core_doxerorg_vxconfigd_0_0_1343173375_140 #print a stack trace of running processes, or <pid>, /var/crash
/var/cron/log #solaris 10 cronlog,
###EFI SMI label http://www.chinaunix.net/old_jh/6/955384.html
###solaris proc tools
cd /proc/ ; for i in *; do echo --- process $i ---; pfiles $i | grep -i "port: 11961"; done
pldd 578 #dynamic libraries linked into the process
eeprom use-nvramrc? #whether nvramrc is enabled
eeprom nvramrc #aliases. halt drops to the OK prompt; type sync there to flush disks and generate a static coredump <if that fails, try savecore -L (live) at the OK prompt or in the OS>
eeprom auto-boot? #auto boot
mpathadm list mpath-support
/etc/dfs/dfstab #like /etc/mtab, also in HPUX
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2 #solaris,copy disk head info from c1t0d0 to c1t1d0
isainfo -kv #64bits or 32bits, pagesize,
getconf PAGESIZE #linux
grep -i network /etc/path_to_inst #find out the network cards available on the system, e.g. egrep -i 'eri|ge|ce|qfe|hme' /etc/path_to_inst
sysdef #solaris, output system definition
mpstat 2 #solaris, per-processor,cpu
prtdiag -v #solaris hardware type, prtconf
dladm #solaris administer data links
dladm show-dev/show-link/show-aggr/ #two dev can be a link aggr<bonding, on one switch, increase link speed>
dladm show-dev -s e1000g1
ifconfig e1000g1 plumb
ifconfig vmxnet0 10.180.3.218 netmask 255.255.255.0 up #at last, config /etc/hostname.vmxnet0
ifconfig vmxnet0 down
dladm create-aggr -l passive -d e1000g2 -d e1000g3 1 #first unplumb e1000g2/3
ifconfig aggr1 plumb
ifconfig aggr1 10.180.3.220 netmask 255.255.255.0 up
/usr/sbin/cfgadm -c configure /dev/cfg/c2::5006048452a72687
/usr/sbin/fcinfo hba-port -l #solaris, hba, like Qlogic<chibrat5>, Emulex<upora06a>
fcinfo remote-port -l -p 210000e08b18024f #Lists the remote-port information
/{usr/,}opt/FJSV*
/opt/FJSVmadm/sbin/{madmin,hrdconf -l } #madmin is a menu-driven interactive utility that allows you to perform various hardware-related diagnostics and maintenance.
/opt/FJSVsnap/bin/fjsnap -a output #like sun explorer, add -C to include crashdump information
/opt/FJSVhwr/sbin/fjprtdiag -v
/opt/FJSVcsl/bin//mainmenu #hardware etc on SMC and partition poweron/poweroff(PrimePower System Management Controller)
/opt/FJSVcsl/bin/get_console -w -n <partition_name> #partition_name in /etc/hosts #maybe /opt/scripts/bin/console.sh 0 FORCE, if you know the number(through mainmenu)
##to get into OK prompt
a. ctrl+] to get the telnet prompt
b. From telnet prompt, type "send break"  to get OK prompt
ndd #change UDP parameters etc
solaris IPMP configuration solaris-IPMP.pdf

zpool create tank c4t0d0
zpool list
zpool list tank
zpool get autoexpand tank
zpool replace tank c4t0d0 c1t13d0
zpool list tank
zpool set autoexpand=on tank
zpool list tank
zfs userspace tank
zfs groupspace tank
zpool status
zpool status -x #all pools are healthy
zpool history
zpool history -l
zfs mount -a
zfs get mountpoint,compression tank
zfs create -o compression=gzip tank/home
zfs create -o compression=gzip tank/home/firsttry
zfs create -o compression=gzip -o mountpoint=/export/secondtry tank/home/secondtry
##the below is equal to:
##zfs create tank/home/secondtry
##zfs set mountpoint=/export/secondtry tank/home/secondtry
zfs get -s local all
zpool get all tank
zfs list
zfs mount zones/test #Mount the ZFS
zfs unmount zones/test/
zpool status -x
zpool status
zpool clear tank
/usr/lib/pool/poold #start poold manually
pgrep -l poold #1333 poold

/etc/svc/volatile #logs related to current services
fmadm faulty #fault management. fmd, fmdump, fmstat. FRU(Field replaceable unit)
fmadm #fault management configuration tool, fmadm faulty -a
fmadm config
fmadm faulty #show faults in fma
fmstat
fmstat -m zfs-retire 2 5
fmdump
fmdump -vv -u 177b4b48-8ed1-ea7a-e6f3-feed10dd4c38
fmdump -Vu 6252dd23-4397-cbda-8c72-8774fd175bc1
fmdump /var/fm/fmd/errlog
svcs -a | grep -i cron
svcs cron
svcs -l ipfilter #dependency, dependent
svcs -D ipfilter #dependent
svcs -d ipfilter #dependency
svcs -a|grep lrc #smf can monitor init.d scripts but can not manage them
svcadm enable -r ipfilter #boot cascade
svcadm enable -rt ipfilter #single user mode
svcadm restart cron
svcadm refresh #make snapshot working
##recover
svcs -p telnet #check relationship between services and processes, may need pkill -9
svcadm clear telnet #check /var/svc/log
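The clear step above can be applied in bulk to every instance stuck in maintenance. A hedged sketch: svcs and svcadm are mocked with sample output so the loop runs anywhere; remove the two mock functions on a real Solaris box:

```shell
# Mocks standing in for the Solaris commands (remove on a live system).
svcs() { printf 'maintenance     10:02:01 svc:/network/telnet:default\n'; }   # mock
svcadm() { echo "svcadm $*"; }                                                # mock

# Clear every SMF instance in maintenance state; field 3 of `svcs -a`
# output is the FMRI.
svcs -a | awk '$1 == "maintenance" { print $3 }' | while read -r fmri; do
    svcadm clear "$fmri"     # then check /var/svc/log for the root cause
done
```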
#recover snapshot
svccfg ->select network/ipfilter:default -> listsnap ->help ->help revert ->revert start ->quit
svcadm refresh network/ipfilter:default #refresh /etc/svc/repository.db
svcadm restart network/ipfilter:default #restart
svcadm clear svc:/system/filesystem/local:default
svcs -xv FMRI #check for reason
svccfg -s network/ipfilter:default #unselect,quit
svccfg export pfil >/tmp/pfil.desc
pargs -e `pgrep -f cron`
svccfg -s system/cron setenv LD_PRELOAD libumem.so
svccfg -s system/cron setenv UMEM_DEBUG default
svcadm refresh system/cron
svcadm restart system/cron
pargs -e `pgrep -f cron`
inetadm -l telnet | grep tcp_trace
inetadm -m telnet tcp_trace=TRUE
inetadm -l ftp|grep exec
inetadm -m ftp exec="/usr/sbin/in.ftpd -a -l"
inetconv -e -i /etc/inet/inetd.conf
pkill -HUP inetd
/lib/svc/method/sshd start #man smf_method
svccfg import
/lib/svc/bin/restore_repository
/var/svc/profile
svcs |grep milestone
svcadm -v milestone -d multi-user-server:default #/var/svc/manifest/milestone/multi-user-server.xml
RBAC #man smf_security
svcadm enable apache2 # manipulate service instances
inetadm - observe and configure services that are controlled by inetd. inetadm -?, inetconv
svcprop - retrieve values of service configuration properties
FMRI:Fault Management Resource Identifier
svccfg delete /network/http:apache2
ps auxww|grep fmd
-bash-3.00# svccfg -s smtp
svc:/network/smtp> list
:properties
sendmail
svc:/network/smtp> select sendmail
svc:/network/smtp:sendmail> list
:properties
svc:/network/smtp:sendmail> listprop *exec
start/exec astring "/lib/svc/method/smtp-sendmail start"
stop/exec astring "/lib/svc/method/smtp-sendmail stop %{restarter/contract}"
refresh/exec astring "/lib/svc/method/smtp-sendmail refresh"
More http://www.princeton.edu/~unix/Solaris/troubleshoot/smf.html
/lib/svc/method/fs-*

http://www.sunfreeware.com
http://www.opencsw.org #/opt/csw/bin:/opt/csw/sbin
http://wesunsolve.net/
http://www.unixpackages.com/
/usr/sbin/pkgchk -l -p /usr/sbin/fcinfo #SUNWfcprt
pkginfo #/var/sadm/pkg/pkgname/pkginfo
pkgmap #/var/sadm/install/contents
pkgtrans
pkginfo -d ./top-3.5-sol10-intel-local
pkgadd -d . topxxx
pkgadd -d ./xxx
pkgadd -d ./top-xxx -s /var/spool/pkg SMCtop
pkgtrans ./topxxx /var/spool/pkg
pkgadd -d . -s spool
pkginfo -d spool SMCtop
pkgrm -s spool SMCtop
grep showrev /var/sadm/install/contents
pkginfo|grep -i top
root@beisoltest02 ~ # pkgadd
pkgadd: ERROR: no packages were found in </var/spool/pkg>
less /etc/apache/README.Solaris
pkgrm SMCtop
pkgchk SMCtop
pkgchk -p /usr/local/doc/top/README
pkgparam SMCtop PATCHLIST
root:/usr/local/src# wget http://www.sunfreeware.com/BOLTpget.pkg
root:/usr/local/src# pkgadd -d BOLTpget.pkg all

zoneadm list -civ #-v,verbose,zfs list
zonecfg -z andyred #interactive configuration
zoneadm -z andyred boot #boot
zlogin andyred shutdown -i5 -g0 -y #shutdown
zoneadm -z andyred halt #halt,no shutdown scripts will be run
zoneadm -z andyred uninstall -F #delete
zoneadm -z test detach #detach the zone
zoneadm -z test attach -u
zlogin -C andyred #login zone from global zone
zonename #which zone am I in
netstat -p #non-global zone to get global zone name

Categories: IT Architecture, Systems, Unix Tags:

resolved – how to check nfs version in linux

September 11th, 2012 Comments off

To check the NFS version in Linux/Solaris:

  • On the nfs server side, you can run nfsstat -s to check. The NFS version in use will show call counts and percentages other than 0%, as in the following:

root@doxer.org# nfsstat -s
Server rpc stats:
calls badcalls badauth badclnt xdrcall
28 0 0 0 0

Server nfs v3:
null getattr setattr lookup access readlink
3 11% 4 14% 0 0% 1 3% 4 14% 0 0%
read write create mkdir symlink mknod
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 2 7%
fsstat fsinfo pathconf commit
9 33% 4 14% 0 0% 0 0%

  • On the nfs server, we can also check which NFS versions (2/3/4) and transport protocols (tcp/udp) are supported, with the command "rpcinfo -p localhost|grep nfs":

 

root@doxer# rpcinfo -p localhost|grep nfs
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
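The version list can also be derived from that output programmatically. This sketch parses a canned copy of the rpcinfo output above (on a live host, pipe `rpcinfo -p localhost` in directly); field 2 is the program version, field 5 the service name:

```shell
# Canned sample mirroring the rpcinfo output shown above.
sample_rpcinfo='100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 2 tcp 2049 nfs
100003 4 tcp 2049 nfs'

# Collect the distinct version numbers registered for "nfs".
versions=$(printf '%s\n' "$sample_rpcinfo" | awk '$5 == "nfs" { print $2 }' | sort -u | xargs)
echo "NFS versions registered: $versions"   # NFS versions registered: 2 3 4
```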

 

  • On the nfs client hosts, you can run nfsstat -c to check the version the client is using. As before, the version in use will show call counts and percentages other than 0%, as in the following:

root@doxer.org# nfsstat -c

Client rpc:
Connection oriented:
calls badcalls badxids timeouts newcreds badverfs
1219760 322812 0 0 0 0
timers cantconn nomem interrupts
0 322808 0 0
Connectionless:
calls badcalls retrans badxids timeouts newcreds
0 0 0 0 0 0
badverfs timers nomem cantsend
0 0 0 0

Client nfs:
calls badcalls clgets cltoomany
753081 28 753081 0
Version 2: (0 calls)
null getattr setattr root lookup readlink
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
Version 3: (748700 calls)
null getattr setattr lookup access readlink
0 0% 140588 18% 61939 8% 184611 24% 150266 20% 8 0%
read write create mkdir symlink mknod
35415 4% 58540 7% 11703 1% 562 0% 248 0% 0 0%
remove rmdir rename link readdir readdirplus
3264 0% 0 0% 9 0% 0 0% 1165 0% 1219 0%
fsstat fsinfo pathconf commit
33435 4% 7160 0% 3309 0% 55259 7%

Client nfs_acl:
Version 2: (0 calls)
null getacl setacl getattr access
0 0% 0 0% 0 0% 0 0% 0 0%
Version 3: (4382 calls)
null getacl setacl
0 0% 4382 100% 0 0%

  • Also, you can run nfsstat -m on nfs client hosts to print information about each of the mounted NFS file systems (the output also indicates the NFS version):

root@doxer.org # nfsstat -m
/apps/uriman/tmp from doxer:/export/was/trncsc_cell_urimantmp
Flags: vers=3,proto=tcp,sec=none,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
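A small sketch for pulling the version out of that Flags line; the sample line mirrors the output above, and on a live client you would feed it `nfsstat -m` output instead:

```shell
# Extract the vers= mount option from an `nfsstat -m` Flags line.
flags='Flags: vers=3,proto=tcp,sec=none,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600'

vers=$(printf '%s\n' "$flags" | sed -n 's/.*vers=\([0-9]*\).*/\1/p')
echo "mounted with NFS v$vers"   # mounted with NFS v3
```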

PS:

  • Here's more about interpreting the output of nfsstat:

The client- and server-side implementations of NFS compile per-call statistics of NFS service usage at both the RPC and application layers. nfsstat -c displays the client-side statistics while nfsstat -s shows the server tallies. With no arguments, nfsstat prints out both sets of statistics:


% nfsstat -s
Server rpc:
Connection oriented:
calls badcalls nullrecv badlen xdrcall dupchecks
10733943 0 0 0 0 1935861
dupreqs
0
Connectionless:
calls badcalls nullrecv badlen xdrcall dupchecks
136499 0 0 0 0 0
dupreqs
0

Server nfs:
calls badcalls
10870161 14
Version 2: (1716 calls)
null getattr setattr root lookup readlink
48 2% 0 0% 0 0% 0 0% 1537 89% 13 0%
read wrcache write create remove rename
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 0 0% 0 0% 111 6% 7 0%
Version 3: (10856042 calls)
null getattr setattr lookup access readlink
136447 1% 4245200 39% 95412 0% 1430880 13% 2436623 22% 74093 0%
read write create mkdir symlink mknod
376522 3% 277812 2% 165838 1% 25497 0% 24480 0% 0 0%
remove rmdir rename link readdir readdirplus
359460 3% 33293 0% 8211 0% 69484 0% 69898 0% 876367 8%
fsstat fsinfo pathconf commit
1579 0% 7698 0% 4253 0% 136995 1%
Server nfs_acl:
Version 2: (2357 calls)
null getacl setacl getattr access
0 0% 5 0% 0 0% 2170 92% 182 7%
Version 3: (10046 calls)
null getacl setacl
0 0% 10039 99% 7 0%

 

The server-side RPC fields indicate if there are problems removing the packets from the NFS service end point. The kernel reports statistics on connection-oriented RPC and connectionless RPC separately. The fields detail each kind of problem:

calls
The NFS calls value represents the total number of NFS Version 2, NFS Version 3, NFS ACL Version 2 and NFS ACL Version 3 RPC calls made to this server from all clients. The RPC calls value represents the total number of NFS, NFS ACL, and NLM RPC calls made to this server from all clients. RPC calls made for other services, such as NIS, are not included in this count.
badcalls
These are RPC requests that were rejected out of hand by the server's RPC mechanism, before the request was passed to the NFS service routines in the kernel. An RPC call will be rejected if there is an authentication failure, where the calling client does not present valid credentials.
nullrecv
Not used in Solaris. Its value is always 0.
badlen/xdrcall
The RPC request received by the server was too short (badlen) or the XDR headers in the packet are malformed (xdrcall). Most likely this is due to a malfunctioning client. It is rare, but possible, that the packet could have been truncated or damaged by a network problem. On a local area network, it's rare to have XDR headers damaged, but running NFS over a wide-area network could result in malformed requests. We'll look at ways of detecting and correcting packet damage on wide-area networks in Section 18.4.
dupchecks/dupreqs
The dupchecks field indicates the number of RPC calls that were looked up in the duplicate request cache. The dupreqs field indicates the number of RPC calls that were actually found to be duplicates. Duplicate requests occur as a result of client retransmissions. A large number of dupreqs usually indicates that the server is not replying fast enough to its clients. Idempotent requests can be replayed without ill effects, therefore not all RPCs have to be looked up on the duplicate request cache. This explains why the dupchecks field does not match the calls field.

The statistics for each NFS version are reported independently, showing the total number of NFS calls made to this server using each version of the protocol. A version-specific breakdown by procedure of the calls handled is also provided. Each of the call types corresponds to a procedure within the NFS RPC and NFS_ACL RPC services.

The null procedure is included in every RPC program for pinging the RPC server. The null procedure returns no value, but a successful return from a call to null ensures that the network is operational and that the server host is alive. rpcinfo calls the null procedure to check RPC server health. The automounter (see Chapter 9) calls the null procedure of all NFS servers in parallel when multiple machines are listed for a single mount point. The automounter and rpcinfo should account for the total null calls reported by nfsstat.

Client-side RPC statistics include the number of calls of each type made to all servers, while the client NFS statistics indicate how successful the client machine is in reaching NFS servers:


% nfsstat -c
Client rpc:
Connection oriented:
calls badcalls badxids timeouts newcreds badverfs
1753584 1412 18 64 0 0
timers cantconn nomem interrupts
0 1317 0 18
Connectionless:
calls badcalls retrans badxids timeouts newcreds
12443 41 334 80 166 0
badverfs timers nomem cantsend
0 4321 0 206

Client nfs:
calls badcalls clgets cltoomany
1661217 23 1661217 3521
Version 2: (234258 calls)
null getattr setattr root lookup readlink
0 0% 37 0% 0 0% 0 0% 184504 78% 811 0%
read wrcache write create remove rename
49 0% 0 0% 24301 10% 3 0% 2 0% 0 0%
link symlink mkdir rmdir readdir statfs
0 0% 0 0% 12 0% 12 0% 24500 10% 27 0%
Version 3: (1011525 calls)
null getattr setattr lookup access readlink
0 0% 417691 41% 14598 1% 223609 22% 47438 4% 695 0%
read write create mkdir symlink mknod
56347 5% 221334 21% 1565 0% 106 0% 48 0% 0 0%
remove rmdir rename link readdir readdirplus
807 0% 14 0% 676 0% 24 0% 475 0% 5204 0%
fsstat fsinfo pathconf commit
8 0% 10612 1% 95 0% 10179 1%

Client nfs_acl:
Version 2: (411477 calls)
null getacl setacl getattr access
0 0% 181399 44% 0 0% 185858 45% 44220 10%
Version 3: (3957 calls)
null getacl setacl
0 0% 3957 100% 0 0%

 

In addition to the total number of NFS calls made and the number of rejected NFS calls (badcalls), the client-side statistics indicate if NFS calls are being delayed due to a lack of client RPC handles. Client RPC handles are opaque pointers used by the kernel to hold server connection information. In SunOS 4.x, the number of client handles was fixed, causing the NFS call to block until client handles became available. In Solaris, client handles are allocated dynamically. The kernel maintains a cache of up to 16 client handles, which are reused to speed up communication with the server. The clgets count indicates the number of times a client handle has been requested. If the NFS call cannot find an unused client handle in the cache, it will not block until one frees up. Instead, it will create a brand new client handle and proceed. This count is reflected by cltoomany. The client handle is destroyed when the reply to the NFS call arrives. This count is of little use to system administrators since nothing can be done to increase the cache size and reduce the number of misses.

Included in the client RPC statistics are counts for various failures experienced while trying to send NFS requests to a server:

calls
Total number of calls made to all NFS servers.
badcalls
Number of RPC calls that returned an error. The two most common RPC failures are timeouts and interruptions, both of which increment the badcalls counter. The connection-oriented RPC statistics also increment the interrupts counter. There is no equivalent counter for connectionless RPC statistics. If a server reply is not received within the RPC timeout period, an RPC error occurs. If the RPC call is interrupted, as it may be if a filesystem is mounted with the intr option, then an RPC interrupt code is returned to the caller. nfsstat also reports the badcalls count in the NFS statistics. NFS call failures do not include RPC timeouts or interruptions, but do include other RPC failures such as authentication errors (which will be counted in both the NFS and RPC level statistics).
badxids
The number of bad XIDs. The XID in an NFS request is a serial number that uniquely identifies the request. When a request is retransmitted, it retains the same XID through the entire timeout and retransmission cycle. With the Solaris multithreaded kernel, it is possible for the NFS client to have several RPC requests outstanding at any time, to any number of NFS servers. When a response is received from an NFS server, the client matches the XID in the response to an RPC call in progress. If an XID is seen for which there is no active RPC call — because the client already received a response for that XID — then the client increments badxid. A high badxid count, therefore, indicates that the server is receiving some retransmitted requests, but is taking a long time to reply to all NFS requests. This scenario is explored in Section 18.1.
timeouts
Number of calls that timed out waiting for a server's response. For hard-mounted filesystems, calls that time out are retransmitted, with a new timeout period that may be longer than the previous one. However, calls made on soft-mounted filesystems may eventually fail if the retransmission count is exceeded, so that the call counts obey the relationship:

timeout + badcalls >= retrans

 

The final retransmission of a request on a soft-mounted filesystem increments badcalls (as previously explained). For example, if a filesystem is mounted with retrans=5, the client reissues the same request five times before noting an RPC failure. All five requests are counted in timeout, since no replies are received. Of the failed attempts, four are counted in the retrans statistic and the last shows up in badcalls.
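The relationship can be sanity-checked with simple shell arithmetic; the numbers below come straight from the retrans=5 example in the text (five timed-out requests, four counted as retrans, the last counted in badcalls):

```shell
# Worked check of: timeout + badcalls >= retrans
timeouts=5 badcalls=1 retrans=4

if [ $((timeouts + badcalls)) -ge "$retrans" ]; then
    echo "relationship holds: $timeouts + $badcalls >= $retrans"
fi
```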

newcreds
Number of times client authentication information had to be refreshed. This statistic only applies if a secure RPC mechanism has been integrated with the NFS service.
badverfs
Number of times server replies could not be authenticated. The number of times the client could not guarantee that the server was who it says it was. These are likely due to packet retransmissions more than security breaches, as explained later in this section.
timers
Number of times the starting RPC call timeout value was greater than or equal to the minimum specified timeout value for the call. Solaris attempts to dynamically tune the initial timeout based on the history of calls to the specific server. If the server has been sluggish in its response to this type of RPC call, the timeout will be greater than if the server had been replying normally. It makes sense to wait longer before retransmitting for the first time, since history indicates that this server is slow to reply. Most client implementations use an exponential back-off strategy that doubles or quadruples the timeout after each retransmission up to an implementation-specific limit.
cantconn
Number of times a connection-oriented RPC call failed due to a failure to establish a connection to the server. The reasons why connections cannot be created are varied; one example is the server may not be running the nfsd daemon.
nomem
Number of times a call failed due to lack of resources. The host is low in memory and cannot allocate enough temporary memory to handle the request.
interrupts
Number of times a connection-oriented RPC call was interrupted by a signal before completing. This counter applies to connection-oriented RPC calls only. Interrupted connection and connectionless RPC calls also increment badcalls.
retrans
Number of calls that were retransmitted because no response was received from the NFS server within the timeout period. This is only reported for RPC over connectionless transports. An NFS client that is experiencing poor server response will have a large number of retransmitted calls.

cantsend
Number of times a request could not be sent. This counter is incremented when network plumbing problems occur. This will mostly occur when no memory is available to allocate buffers in the various network layer modules, or the request is interrupted while the client is waiting to queue the request downstream. The nomem and interrupts counters report statistics encountered in the RPC software layer, while the cantsend counter reports statistics gathered in the kernel TLI layer.

The statistics shown by nfsstat are cumulative from the time the machine was booted, or the last time they were zeroed using nfsstat -z:

nfsstat -z Resets all counters.
nfsstat -sz Zeros server-side RPC and NFS statistics.
nfsstat -cz Zeros client-side RPC and NFS statistics.

nfsstat -crz Zeros client-side RPC statistics only.

 

Only the superuser can reset the counters.

nfsstat provides a very coarse look at NFS activity and is limited in its usefulness for resolving performance problems. Server statistics are collected for all clients, while in many cases it is important to know the distribution of calls from each client. Similarly, client-side statistics are aggregated for all NFS servers.

However, you can still glean useful information from nfsstat. Consider the case where a client reports a high number of bad verifiers. The high badverfs count is most likely an indication that the client is having to retransmit its secure RPC requests. As explained in Section 12.1, every secure RPC call has a unique credential and verifier with a unique timestamp (in the case of AUTH_DES) or a unique sequence number (in the case of RPCSEC_GSS). The client expects the server to include this verifier (or some form of it) in its reply, so that the client can verify that it is indeed obtaining the reply from the server it called.

Consider the scenario where the client makes a secure RPC call using AUTH_DES, using timestamp T1 to generate its verifier. If no reply is received within the timeout period, the client retransmits the request, using timestamp T1+delta to generate its verifier (bumping up the retrans count). In the meantime, the server replies to the original request using timestamp T1 to generate its verifier:


RPC call (T1) --->
** time out **
RPC call (retry: T1+delta) --->
<--- Server reply to first RPC call (T1 verifier)

 

The reply to the client's original request will cause the verifier check to fail because the client now expects T1+delta in the verifier, not T1. This consequently bumps up the badverf count. Fortunately, the Solaris client will wait for more replies to its retransmissions and, if the reply passes the verifier test, an NFS authentication error will be avoided. Bad verifiers are not a big problem, unless the count gets too high, especially when the system starts experiencing NFS authentication errors. Increasing the NFS timeo on the mount or automounter map may help alleviate this problem. Note also that this is less of a problem with TCP than UDP. Analysis of situations such as this will be the focus of Section 16.1, Chapter 17, and Chapter 18.

For completeness, we should mention that verifier failures can also be caused when the security content expires before the response is received. This is rare but possible. It usually occurs when you have a network partition that is longer than the lifetime of the security context. Another cause might be a significant time skew between the client and server, as well as a router with a ghost packet stored, that fires after being delayed for a very long time. Note that this is not a problem with TCP.