resolved – su: cannot set user id: Resource temporarily unavailable

January 12th, 2015

When i try to log on as user "test", error occurred:

su: cannot set user id: Resource temporarily unavailable

I had a check of limits.conf:

[root@testvm ~]# cat /etc/security/limits.conf|egrep -v '^$|^#'
oracle   soft   nofile    131072
oracle   hard   nofile    131072
oracle   soft   nproc    131072
oracle   hard   nproc    131072
oracle   soft   core    unlimited
oracle   hard   core    unlimited
oracle   soft   memlock    50000000
oracle   hard   memlock    50000000
@svrtech    soft    memlock         500000
@svrtech    hard    memlock         500000
*   soft   nofile    131072
*   hard   nofile    131072
*   soft   nproc    131072
*   hard   nproc    131072
*   soft   core    unlimited
*   hard   core    unlimited
*   soft   memlock    50000000
*   hard   memlock    50000000

Then I had a check of the number of processes/threads with the maximum number of processes to see whether it's coming over the line:

[root@c9qa131-slcn03vmf0293 ~]# ps -eLF | grep test | wc -l
1026

So it's not exceeding. Then I had a check of open files:

[root@testvm ~]# lsof | grep aime | wc -l

6059

It's not exceeding 131072 either, then why the error "su: cannot set user id: Resource temporarily unavailable" was there? Actually the culprit was in file /etc/security/limits.d/90-nproc.conf:

[root@testvm ~]# cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

* soft nproc 1024
root soft nproc unlimited

After I modified 1024 to 131072, the issue gone away immediately.

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags:

resolved – Error: Unable to connect to xend: Connection reset by peer. Is xend running?

January 7th, 2015

Today I met some issue when trying to run xm commands on a XEN server:

[root@xenhost1 ~]# xm list
Error: Unable to connect to xend: Connection reset by peer. Is xend running?

I had a check, and found xend was actually running:

[root@xenhost1 ~]# /etc/init.d/xend status
xend daemon running (pid 8329)

After some debugging, I found it's caused by libvirtd & xend corrupted. And then I did a bounce of them:

[root@xenhost1 ~]# /etc/init.d/libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]

[root@xenhost1 ~]# /etc/init.d/xend restart #this may not be needed 
restarting xend...
xend daemon running (pid 19684)

Later, the xm commands went good.

PS:

  • If you met issue like "resolved - xend error: (98, 'Address already in use')" when restart xend or "can't connect: (111, 'Connection refused')" when doing xm live migrate, then you can refer this article.
  • For more information about libvirt, you can check here.

 

Categories: Clouding, IT Architecture, Oracle Cloud Tags:

remove entries in perl array with specified value

December 30th, 2014

Assume that in array @array_filtered:

my @array_filtered = ("your", "array", "here", 1, 3, 8, "here", 2, 5, 9, "sit", "here",3, 4, 7,"yes","now",8,1,7,6); #or my @array_filtered=qw(your array here 1 3 8 here 2 5 9 sit here 3 4 7 yes now 8 1 7 6) which uses Alternative Quotes(q, qq, qw, qx)

You want to remove entries that have value "here" or "now" and it's following 3 entries, you can use splice:

#!/usr/bin/perl
my @array_filtered = ("your", "array", "here", 1, 3, 8, "here", 2, 5, 9, "sit", "here",3, 4, 7,"yes","now",8,1,7,6);
my @search_for = ("here","now");
#return keys that have specified value, =~/!~ for regular expression, eq/ne for string, ==/!= for number. or use unless()/if(not()). use m{} instead of // if there's too much / in the expression and you're tired of using \/ to escape them.

$search_for_s=join('|',@search_for);
@index_all = grep { $array_filtered[$_] =~ /$search_for_s/ } 0..$#array_filtered;

for($i=0;$i<=$#index_all;$i++) {
@index_all_one = grep { $array_filtered[$_] =~ /$search_for_s/ } 0..$#array_filtered;
splice(@array_filtered,$index_all_one[0],4);
#print $indexone."\n"
}

print "@array_filtered"."\n";

The output is "your array sit yes 6".

PS:

  • For more info about perl regular expression(such as operators<m, s, tr> and their modifiers, complex regular expression cheat sheet<.\s\S\d\D\w\W[aeiou][^aeiou](foo|bar), \G, $, $&, $`, $'> and more), you can refer to this article.
  • The following is about perl alternative quotes:

q// is generally the same thing as using single quotes - meaning it doesn't interpolate values inside the delimiters.
qq// is the same as double quoting a string. It interpolates.
qw// return a list of white space delimited words. @q = qw/this is a test/ is functionally the same as @q = ('this', 'is', 'a', 'test')
qx// is the same thing as using the backtick operators.

Categories: IT Architecture, Perl, Programming Tags:

resolved – cssh installation on linux server

December 29th, 2014

ClusterSSH can be used if you need controls a number of xterm windows via a single graphical console window, and you want to run commands interactively on multiple servers over an ssh connection. This guide will show the process to install clusterssh on a linux box from tarball.

At the very first, you should download cssh tarball App-ClusterSSH-4.03_04.tar.gz from sourceforge. You may need export proxy settings if it's needed in your env:

export https_proxy=http://my-proxy.example.com:80/
export http_proxy=http://my-proxy.example.com:80/
export ftp_proxy=http://my-proxy.example.com:80/

After the proxy setting, you can now get the package:

wget 'http://sourceforge.net/projects/clusterssh/files/latest/download'
tar zxvf App-ClusterSSH-4.03_04.tar.gz
cd App-ClusterSSH-4.03_04
cat README

Before installing, let's install some prerequisites packages:

yum install gcc libX11-devel gnome* -y
yum groupinstall "X Window System" -y
yum groupinstall "GNOME Desktop Environment" -y
yum groupinstall "Graphical Internet" -y
yum groupinstall "Graphics" -y

Now run "perl Build.PL" as indicated by README:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Can't locate Module/Build.pm in @INC (@INC contains: /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at Build.PL line 5.
BEGIN failed--compilation aborted at Build.PL line 5.

As it challenged, you need install Module::Build.pm first. Let's use cpan to install that module.

Run "cpan" and enter "follow" when below info occurred:

Policy on building prerequisites (follow, ask or ignore)? [ask] follow

If you had already ran cpan before, then you can configure the policy as below:

cpan> o conf prerequisites_policy follow
cpan> o conf commit

Now Let's install Module::Build:

cpan> install Module::Build

After the installation, let's run "perl Build.PL" again:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Checking prerequisites...
  requires:
    !  Exception::Class is not installed
    !  Tk is not installed
    !  Try::Tiny is not installed
    !  X11::Protocol is not installed
  build_requires:
    !  CPAN::Changes is not installed
    !  File::Slurp is not installed
    !  File::Which is not installed
    !  Readonly is not installed
    !  Test::Differences is not installed
    !  Test::DistManifest is not installed
    !  Test::PerlTidy is not installed
    !  Test::Pod is not installed
    !  Test::Pod::Coverage is not installed
    !  Test::Trap is not installed

ERRORS/WARNINGS FOUND IN PREREQUISITES.  You may wish to install the versions
of the modules indicated above before proceeding with this installation

Run 'Build installdeps' to install missing prerequisites.

Created MYMETA.yml and MYMETA.json
Creating new 'Build' script for 'App-ClusterSSH' version '4.03_04'

As the output says, run "./Build installdeps" to install the missing packages. Make sure you're in GUI env(through vncserver maybe), as "perl Build.PL" has a step to test GUI.

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

......

Running Mkbootstrap for Tk::Xlib ()
chmod 644 "Xlib.bs"
"/usr/bin/perl" "/usr/lib/perl5/5.8.8/ExtUtils/xsubpp" -typemap "/usr/lib/perl5/5.8.8/ExtUtils/typemap" -typemap "/root/.cpan/build/Tk-804.032/Tk/typemap" Xlib.xs > Xlib.xsc && mv Xlib.xsc Xlib.c
make[1]: *** No rule to make target `pTk/tkInt.h', needed by `Xlib.o'. Stop.
make[1]: Leaving directory `/root/.cpan/build/Tk-804.032/Xlib'
make: *** [subdirs] Error 2
/usr/bin/make -- NOT OK
Running make test
Can't test without successful make
Running make install
make had returned bad status, install seems impossible

Errors again, we can see it's complaining something about TK related thing. To resolve this, I manully installed the latest perl-tk module as below:

wget --no-check-certificate 'https://github.com/eserte/perl-tk/archive/master.zip'
unzip master
cd perl-tk-master
perl Makefile.PL
make
make install

After this, let's run "./Build installdeps" and "perl Build.PL" again which all went through good:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL

And let's run ./Build now:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build
Building App-ClusterSSH
Generating: ccon
Generating: crsh
Generating: cssh
Generating: ctel

And now "./Build install" which is the last step:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build install

After installation, let's have a test:

[root@centos-32bits App-ClusterSSH-4.03_04]# echo 'svr testserver1 testserver2' > /etc/clusters

Now run 'cssh svr', and you'll get the charm!

clusterssh

PS: 

If you met error like below:

Can't connect to display `unix:1': No such file or directory at /usr/local/share/perl5/X11/Protocol.pm line 2264.

And you are connecting to vnc session like below:

root 3291 1 0 07:36 ? 00:00:02 /usr/bin/Xvnc :1 -desktop Yue-test:1 (root) -auth /root/.Xauthority -geometry 1600x900 -rfbwait 30000 -rfbauth /root/.vnc/passwd -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -pn

Then make sure to do below:

export DISPLAY=localhost:1.0

Categories: Clouding, IT Architecture, Linux, Systems, Unix Tags:

resolved – error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm

December 17th, 2014

Today when I tried using curl to get url info, error occurred like below:

[root@centos-doxer ~]# curl -i --user username:password -H "Content-Type: application/json" -X POST --data @/u01/shared/addcredential.json https://testserver.example.com/actions -v

* About to connect() to testserver.example.com port 443

*   Trying 10.242.11.201... connected

* Connected to testserver.example.com (10.242.11.201) port 443

* successfully set certificate verify locations:

*   CAfile: /etc/pki/tls/certs/ca-bundle.crt

  CApath: none

* SSLv2, Client hello (1):

SSLv3, TLS handshake, Server hello (2):

SSLv3, TLS handshake, CERT (11):

SSLv3, TLS alert, Server hello (2):

error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm

* Closing connection #0

After some searching, I found that it's caused by the current version of openssl(openssl-0.9.8e) does not support SHA256 Signature Algorithm. To resolve this, there are two ways:

1. add -k parameter to curl to ignore the SSL error

2. update openssl to "OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008", just try "yum update openssl".

# openssl version
OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008

3. upgrade openssl to at least openssl-0.9.8o. Here's the way to upgrade openssl: --- this may not be needed, try method 2 above

wget --no-check-certificate 'https://www.openssl.org/source/old/0.9.x/openssl-0.9.8o.tar.gz'
tar zxvf openssl-0.9.8o.tar.gz
cd openssl-0.9.8o
./config --prefix=/usr --openssldir=/usr/openssl
make
make test
make install

After this, run openssl version to confirm:

[root@centos-doxer openssl-0.9.8o]# /usr/bin/openssl version
OpenSSL 0.9.8o 01 Jun 2010

PS:

If you installed openssl from rpm package, then you'll find the openssl version is still the old one even after you install the new package. This is expected so don't rely too much on rpm:

[root@centos-doxer openssl-0.9.8o]# /usr/bin/openssl version
OpenSSL 0.9.8o 01 Jun 2010

Even after rebuilding rpm DB(rpm --rebuilddb), it's still the old version:

[root@centos-doxer openssl-0.9.8o]# rpm -qf /usr/bin/openssl
openssl-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1

[root@centos-doxer openssl-0.9.8o]# rpm -qa|grep openssl
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1

 

output analysis of linux last command

December 9th, 2014

Here's the output of "last|less" on my linux host:

root     pts/9        remote.example   Tue Dec  9 14:51   still logged in
testuser pts/2        :3               Tue Dec  9 14:49   still logged in
aime     pts/1        :2               Tue Dec  9 14:49   still logged in
root     pts/0        :1               Tue Dec  9 14:49   still logged in
testuser pts/13       remote.example   Tue Dec  9 10:48 - 10:52  (00:02)
reboot   system boot  2.6.23           Tue Dec  9 10:11          (04:39)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:20  (00:06)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:14  (00:00)
root     pts/10       10.182.120.179   Thu Dec  4 15:55 - 15:55  (00:00)
testuser pts/14       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/12       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/13       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/15       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/11       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/16       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
root     pts/10       10.182.120.179   Tue Dec  2 11:20 - 11:20  (00:00)
root     pts/7        10.182.120.179   Tue Dec  2 10:15 - down  (6+07:39)
root     pts/6        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/5        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/4        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/3        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/2        :1               Tue Dec  2 10:00 - down  (6+07:55)
aime     pts/1        :2               Tue Dec  2 10:00 - down  (6+07:55)
testuser pts/0        :3               Tue Dec  2 10:00 - down  (6+07:55)
reboot   system boot  2.6.23           Tue Dec  2 09:58         (6+07:56)

Here's some analysis:

  • User "reboot" is a pseudo-user for system reboot. Entries between two reboots are users who log on the system during two reboots. For info about login shells(.bash_profile) and interactive non-login shells(.bashrc), you can refer to here.
  • Here're columns meanings:

Column 1: User logged on

Column 2: The tty name after logging on

Column 3: Remote IP or hostname from which the user logged on. You can see ":1", ":2", ":3", that's vnc port number which vncserver are rendering against.

Column 4: Begin/End time of the session. If "still logged in", then means the user is still logged on; if there's value in parenthesis, then that's the total time of the logged on. For the latest "reboot"(red line 1), means the uptime till now; For the second "reboot"(red line 2), means the uptime between two reboots. Note however that this time is not always accurate, for example after system crash and unusual restart sequence. last calculates it as time between it and next reboot/shutdown.

 

Categories: IT Architecture, Linux, Systems, Unix Tags:

ORA-12154 – TNS:could not resolve the connect identifier specified

December 2nd, 2014

Today I try to connect to one Db service named pditui using the following easy connect method:

export ORACLE_HOME=/u01/app/oracle/product/11.2.0/client_1
export PATH=$ORACLE_HOME/bin:$PATH

sqlplus "sys/password@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = scanname.test.example.com)(PORT = 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = pditui)))" as sysdba

However, the following error messages prompted:

SQL*Plus: Release 11.2.0.3.0 Production on Tue Dec 2 14:07:35 2014

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified

Enter user-name:
ERROR:
ORA-12162: TNS:net service name is incorrectly specified

The username/password and service name were all correct, but the error was there. After some checking, I found that it was caused by wrong configuration of NAMES.DIRECTORY_PATH in file $ORACLE_HOME/network/admin/sqlnet.ora:

[root@centos-doxer ~]# cat /u01/app/oracle/product/11.2.0/client_1/network/admin/sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/app/oracle/product/11.2.0/client_1/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.

#NAMES.DIRECTORY_PATH= (TNSNAMES)
NAMES.DIRECTORY_PATH= (TNSNAMES,ezconnect) -- add ezconnect methond here

ADR_BASE = /u01/app/oracle

After this, the connection was ok.

PS: 

You can read more about NAMES.DIRECTORY_PATH in file $ORACLE_HOME/network/admin/sqlnet.ora here.

Categories: Databases, Oracle DB Tags:

resolved – switching from Unbreakable Enterprise Kernel Release 2(UEKR2) to UEKR3 on Oracle Linux 6

November 24th, 2014

As we can see from here, the available kernels include the following 3 for Oracle Linux 6:

3.8.13 Unbreakable Enterprise Kernel Release 3 (x86_64 only)
2.6.39 Unbreakable Enterprise Kernel Release 2**
2.6.32 (Red Hat compatible kernel)

On one of our OEL6 VM, we found that it's using UEKR2:

[root@testbox aime]# cat /etc/issue
Oracle Linux Server release 6.4
Kernel \r on an \m

[root@testbox aime]# uname -r
2.6.39-400.211.1.el6uek.x86_64

So how can we switch the kernel to UEKR3(3.8)?

If your linux version is 6.4, first do a "yum update -y" to upgrade to 6.5 and uppper, and then reboot the host, and follow steps below.

[root@testbox aime]# ls -l /etc/grub.conf
lrwxrwxrwx. 1 root root 22 Aug 21 18:24 /etc/grub.conf -> ../boot/grub/grub.conf

[root@testbox aime]# yum update -y

If your linux version is 6.5 and upper, you'll find /etc/grub.conf and /boot/grub/grub.conf are different files(for yum update one. If your host is OEL6.5 when installed, then /etc/grub.conf should be softlink too):

[root@testbox ~]# ls -l /etc/grub.conf
-rw------- 1 root root 2356 Oct 20 05:26 /etc/grub.conf

[root@testbox ~]# ls -l /boot/grub/grub.conf
-rw------- 1 root root 1585 Nov 23 21:46 /boot/grub/grub.conf

In /etc/grub.conf, you'll see entry like below:

title Oracle Linux Server Unbreakable Enterprise Kernel (3.8.13-44.1.3.el6uek.x86_64)
root (hd0,0)
kernel /vmlinuz-3.8.13-44.1.3.el6uek.x86_64 ro root=/dev/mapper/vg01-lv_root rd_LVM_LV=vg01/lv_root rd_NO_LUKS rd_LVM_LV=vg01/lv_swap LANG=en_US.UTF-8 KEYTABLE=us console=hvc0 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_DM rhgb quiet
initrd /initramfs-3.8.13-44.1.3.el6uek.x86_64.img

What you'll need to do is just copying the entries above from /etc/grub.conf to /boot/grub/grub.conf(make sure /boot/grub/grub.conf not be a softlink, else you may met error "Boot loader didn't return any data"), and then reboot the VM.

After rebooting, you'll find the kernel is now at UEKR3(3.8).

PS:

If you find the VM is OEL6.5 and /etc/grub.conf is a softlink to /boot/grub/grub.conf, then you could do the following to upgrade kernel to UEKR3:

1. add the following lines to /etc/yum.repos.d/public-yum-ol6.repo:

[public_ol6_UEKR3]
name=UEKR3 for Oracle Linux 6 ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/UEKR3/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

2. List and install UEKR3:

[root@testbox aime]# yum list|grep kernel-uek|grep public_ol6_UEKR3
kernel-uek.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-doc.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-firmware.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-headers.x86_64 3.8.13-26.2.4.el6uek public_ol6_UEKR3

[root@testbox aime]# yum install -y kernel-uek* --disablerepo=* --enablerepo=public_ol6_UEKR3

3. Reboot

resolved – ORA-27300 OS system dependent operation:fork failed with status: 11

November 18th, 2014

Today we observed all our DB were in down status, and in the trace file:

Errors in file /u01/database/diag/rdbms/oimdb/OIMDB/trace/OIMDB_psp0_3173.trc:

ORA-27300: OS system dependent operation:fork failed with status: 11

ORA-27301: OS failure message: Resource temporarily unavailable

ORA-27302: failure occurred at: skgpspawn5

After some searching for ORA-27300, I found this article, which suggested that it's the issue of user processes used up and system could not spawn new one at the time. As the problem happened at Mon Nov 17 02:08:51 2014, so I did some check using sysstat sar:

[root@test sa]# sar -f /var/log/sa/sa17 -s 00:00:00 -e 03:20:00
Linux 2.6.32-300.27.1.el5uek (slcn11vmf0029) 11/17/14

00:00:01 CPU %user %nice %system %iowait %steal %idle
00:10:01 all 1.16 0.12 0.48 0.71 0.18 97.35
00:20:02 all 1.30 0.00 0.47 0.95 0.19 97.10
00:30:01 all 1.88 0.00 0.63 1.98 0.19 95.32
00:40:01 all 1.00 0.00 0.35 2.15 0.18 96.32
00:50:01 all 1.09 0.00 0.40 0.47 0.18 97.87
01:00:01 all 1.03 0.00 0.34 0.25 0.16 98.23
01:10:01 all 3.98 0.02 1.72 4.26 0.22 89.80
01:20:01 all 9.98 0.13 5.99 47.40 0.31 36.19
01:30:01 all 1.86 0.00 1.24 48.72 0.16 48.01
01:40:01 all 1.08 0.00 0.82 48.77 0.18 49.15
01:50:01 all 1.54 0.00 0.97 49.32 0.18 47.98
02:00:01 all 1.05 0.00 0.85 48.74 0.18 49.19 --- problem occurred at Mon Nov 17 02:08:51 2014
02:10:01 all 10.14 0.14 8.95 44.75 0.34 35.68
02:20:01 all 0.06 0.00 0.21 1.87 0.07 97.78
02:30:01 all 0.08 0.00 0.29 2.81 0.08 96.74
02:40:01 all 0.09 0.00 0.31 3.08 0.08 96.44
02:50:01 all 0.05 0.00 0.13 0.96 0.06 98.81
03:00:01 all 0.07 0.00 0.26 2.38 0.07 97.22
03:10:01 all 0.06 0.12 0.21 1.52 0.07 98.02
Average: all 1.85 0.03 1.20 15.89 0.16 80.88

[root@test sa]# sar -f /var/log/sa/sa17 -s 01:10:00 -e 02:11:00 -A
......
......
01:10:01 kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
01:20:01 259940 15482728 98.35 2004 11703696 0 2104504 100.00 194056 -- even all SWAP spaces were used up
01:30:01 398584 15344084 97.47 904 11703152 0 2104504 100.00 191728
01:40:01 409104 15333564 97.40 984 11716924 0 2104504 100.00 191404
01:50:01 452844 15289824 97.12 1004 11711548 0 2104504 100.00 189076
02:00:01 440780 15301888 97.20 1424 11757600 0 2104504 100.00 189364
02:10:01 14602712 1139956 7.24 19548 382588 1978020 126484 6.01 3096
Average: 2760661 12982007 82.46 4311 9829251 329670 1774834 84.34 159787

So this proved that system was very busy during that time. I then increased oracle user's user process number to 131072 in /etc/security/limits.conf with the following:

* soft nproc 131072
* hard nproc 131072

And also set kernel.pid_max to 139264(which is 131072 plus 8192 which is recommended for OS stability) in /etc/sysctl.conf.

[root@test ~]# sysctl -a|grep pid_max
kernel.pid_max = 139264

Then increased memory from 16G to 32G of the box, and reboot.

resolved – high value of RX overruns in ifconfig

November 13th, 2014

Today we tried ssh to one host, it stuck soon there. And we observed that RX overruns in ifconfig output was high:

[root@test /]# ifconfig bond0
bond0 Link encap:Ethernet HWaddr 00:10:E0:0D:AD:5E
inet6 addr: fe80::210:e0ff:fe0d:ad5e/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:140234052 errors:0 dropped:0 overruns:12665 frame:0
TX packets:47259596 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:34204561358 (31.8 GiB) TX bytes:21380246716 (19.9 GiB)

Receiver overruns usually occur when packets come in faster than the kernel can service the last interrupt. But in our case, we are seeing increasing inbound errors on interface Eth105/1/7 of device bcd-c1z1-swi-5k07a/b, did shut and no shut but no change. And after some more debugging, we found one bad SFP/cable(uplink port, e.g. WAN port, more info here). After replaced that, the server came back to normal.

example-swi-5k07a# sh int Eth105/1/7 | i err
4065 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 output error 0 collision 0 deferred 0 late collision

example-swi-5k07a# sh int Eth105/1/7 | i err
4099 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 output error 0 collision 0 deferred 0 late collision

example-swi-5k07a# sh int Eth105/1/7 counters errors

--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth105/1/7 3740 483 0 4223 0 0

--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth105/1/7 0 0 0 0 0 3740

--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth105/1/7 0 -- 0 0 0 0

example-swi-5k07a# sh int Eth105/1/7 counters errors

--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth105/1/7 4386 551 0 4937 0 0

--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth105/1/7 0 0 0 0 0 4386

--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth105/1/7 0 -- 0 0 0 0

PS:

During debugging, we also found on server side, the interface eth0 is half duplex and speed at 100Mb/s:

-bash-3.2# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Half
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000003 (3)
Link detected: yes

However, it should be full duplex and 1000Mb/s. So we also changed the speed, duplex to auto auto on switch and after that, the OS side is now showing the expected value:

-bash-3.2# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000003 (3)
Link detected: yes