Archive

Author Archive

raid10 and raid01

April 21st, 2015 No comments

RAID 0 over RAID 1 (RAID 1+0, RAID 10, stripe of mirrors; the better choice, as it survives more multi-disk failure combinations and rebuilds faster)

(RAID 1) A = Drive A1 + Drive A2 (Mirrored)
(RAID 1) B = Drive B1 + Drive B2 (Mirrored)
RAID 0 = (RAID 1) A + (RAID 1) B (Striped)

[diagram: stripe of mirrors (RAID 10)]


RAID 1 over RAID 0 (RAID 0+1, RAID 01, mirror of stripes)

(RAID 0) A = Drive A1 + Drive A2 (Striped)
(RAID 0) B = Drive B1 + Drive B2 (Striped)
RAID 1 = (RAID 0) A + (RAID 0) B (Mirrored)
[diagram: mirror of stripes (RAID 01)]
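
For reference, Linux md's raid10 personality gives the stripe-of-mirrors layout in a single array; a minimal sketch with mdadm (device names are just examples):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1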

Categories: Hardware, IT Architecture, Storage, Systems Tags:

printtbl8.sql – oracle sqlplus print output vertically

April 17th, 2015 No comments

Save the following as $ORACLE_HOME/rdbms/admin/printtbl8.sql:

set serveroutput on
set linesize 200
declare
    l_theCursor   integer default dbms_sql.open_cursor;
    l_columnValue varchar2(4000);
    l_status      integer;
    l_descTbl     dbms_sql.desc_tab;
    l_colCnt      number;
    procedure execute_immediate( p_sql in varchar2 )
    is
    begin
        dbms_sql.parse(l_theCursor, p_sql, dbms_sql.native);
        l_status := dbms_sql.execute(l_theCursor);
    end;
begin
    execute_immediate( 'alter session set nls_date_format=
                        ''dd-mon-yyyy hh24:mi:ss'' ');
    dbms_sql.parse( l_theCursor,
                    replace( '&1', '"', ''''),
                    dbms_sql.native );
    dbms_sql.describe_columns( l_theCursor,
                               l_colCnt, l_descTbl );
    for i in 1 .. l_colCnt loop
        dbms_sql.define_column( l_theCursor, i,
                                l_columnValue, 4000 );
    end loop;
    l_status := dbms_sql.execute(l_theCursor);
    while ( dbms_sql.fetch_rows(l_theCursor) > 0 ) loop
        for i in 1 .. l_colCnt loop
            dbms_sql.column_value( l_theCursor, i,
                                   l_columnValue );
            dbms_output.put_line( rpad( l_descTbl(i).col_name, 35 ) || ': ' || l_columnValue );
        end loop;
        dbms_output.put_line( '-----------------' );
    end loop;
    execute_immediate( 'alter session set nls_date_format=
                        ''dd-MON-yy'' ');
exception
    when others then
        execute_immediate( 'alter session set
                        nls_date_format=''dd-MON-yy'' ');
        raise;
end;
/

Now you can test it in oracle sqlplus:

SQL> @?/rdbms/admin/printtbl8.sql 'select name,LOG_MODE,OPEN_MODE from v$database'
old 17: replace( '&1', '"', ''''),
new 17: replace( 'select name,LOG_MODE,OPEN_MODE from v$database', '"', ''''),

NAME : TEST
LOG_MODE : ARCHIVELOG
OPEN_MODE : READ WRITE
-----------------

PL/SQL procedure successfully completed.

Cool, right?

Categories: Databases, IT Architecture, Oracle DB Tags:

resolved – file filelists.xml.gz [Errno 5] OSError: [Errno 2] No such file or directory [Errno 256] No more mirrors to try

April 8th, 2015 No comments

Today the error below appeared when running yum install for some packages on Linux:

file://localhost/tmp/common1/x86_64/redhat/50/base/ga/Server/repodata/filelists.xml.gz: [Errno 5] OSError: [Errno 2] No such file or directory: '/tmp/common1/x86_64/redhat/50/base/ga/Server/repodata/filelists.xml.gz'
Trying other mirror.
Error: failure: repodata/filelists.xml.gz from base: [Errno 256] No more mirrors to try.
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest

After some checking (yum clean all, downloading the repo file to /etc/yum.repos.d, etc.), I finally found it was caused by the following entries in /etc/yum.conf:

[base]
name=Red Hat Linux - Base
baseurl=file://localhost/tmp/common1/x86_64/redhat/50/base/ga/Server

After I commented these entries out, yum install worked again.
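
For reference, the fix was simply this (same paths as above), followed by a yum clean all before retrying the install:

#[base]
#name=Red Hat Linux - Base
#baseurl=file://localhost/tmp/common1/x86_64/redhat/50/base/ga/Server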

 

Categories: IT Architecture, Linux, Systems, Unix Tags:

resolved – Starting MySQL.The server quit without updating PID file (/var/lib/mysql/testvm.pid).

April 3rd, 2015 No comments
Today when I tried to start mysql, it failed with the error below:

    [root@testvm ~]# /etc/init.d/mysql start
    Starting MySQL.The server quit without updating PID file (/var/lib/mysql/testvm.pid).

First I checked /var/lib/mysql/testvm.err, which had these entries:

    2015-04-03 00:11:39 2925 [Note] InnoDB: Using CPU crc32 instructions
    /usr/sbin/mysqld: Can't create/write to file '/tmp/ibDvk6bb' (Errcode: 13 - Permission denied)
    2015-04-03 00:11:39 7f28af6c6720  InnoDB: Error: unable to create temporary file; errno: 13
    2015-04-03 00:11:39 2925 [ERROR] Plugin 'InnoDB' init function returned error.
    2015-04-03 00:11:39 2925 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
    2015-04-03 00:11:39 2925 [ERROR] Unknown/unsupported storage engine: InnoDB
    2015-04-03 00:11:39 2925 [ERROR] Aborting

I checked the permissions on /tmp, and they were not correct:

    [root@testvm ~]# ls -ld /tmp
    drwx------ 19 root root 4096 Apr  3 07:15 /tmp

So I changed the permissions on /tmp to 777 with the sticky bit:

    [root@testvm ~]# chmod 1777 /tmp

    [root@testvm ~]# ls -ld /tmp
    drwxrwxrwt 19 root root 4096 Apr  3 07:15 /tmp

However, when I tried to start mysql again, it failed with the errors below in /var/lib/mysql/testvm.err:


    2015-04-03 00:20:42 18724 [ERROR] InnoDB: auto-extending data file ./ibdata1 is of a different size 640 pages (rounded down to MB) than specified in the .cnf file: initial 768 pages, max 0 (relevant if non-zero) pages!
    2015-04-03 00:20:42 18724 [ERROR] InnoDB: Could not open or create the system tablespace. If you tried to add new data files to the system tablespace, and it failed here, you should now edit innodb_data_file_path in my.cnf back to what it was, and remove the new ibdata files InnoDB created in this failed attempt. InnoDB only wrote those files full of zeros, but did not yet use them in any way. But be careful: do not remove old data files which contain your precious data!
    2015-04-03 00:20:42 18724 [ERROR] Plugin 'InnoDB' init function returned error.
    2015-04-03 00:20:42 18724 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
    2015-04-03 00:20:42 18724 [ERROR] Unknown/unsupported storage engine: InnoDB
    2015-04-03 00:20:42 18724 [ERROR] Aborting

So it was all about the InnoDB engine. As InnoDB was not required in our environment, I decided to disable it:

    [root@testvm ~]# vi /etc/my.cnf
    [mysqld]
    innodb=OFF
    ignore-builtin-innodb
    skip-innodb
    default-storage-engine=myisam
    default-tmp-storage-engine=myisam

After that, mysql started successfully.
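
To confirm which engines are active after startup, a quick check like this works (assuming local root access to mysql):

    [root@testvm ~]# mysql -e "SHOW ENGINES;" | egrep -i 'innodb|myisam'
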
Categories: Databases, IT Architecture, MySQL DB Tags:

change NIC configuration to make new VLAN tag take effect

April 2nd, 2015 No comments

Sometimes you may want to add a new VLAN tag to an existing NIC. After the addition, you'll need to re-point the DNS names bound to the old tag to new IPs in the newly added VLAN. Once these two steps are done, you'll need to make changes on the hosts (take Linux for example) to put the changes into effect.

In this example, I'm going to move the old v118_FE to the new VLAN v117_FE.

ifconfig v118_FE down
ifconfig bond0.118 down
cd /etc/sysconfig/network-scripts
mv ifcfg-bond0.118 ifcfg-bond0.117
vi ifcfg-bond0.117
    DEVICE=bond0.117
    BOOTPROTO=none
    USERCTL=no
    ONBOOT=yes
    BRIDGE=v117_FE
    VLAN=yes
mv ifcfg-v118_FE ifcfg-v117_FE
vi ifcfg-v117_FE
    DEVICE=v117_FE
    BOOTPROTO=none
    USERCTL=no
    ONBOOT=yes
    STP=off
    TYPE=Bridge
    IPADDR=10.119.236.13
    NETMASK=255.255.248.0
    NETWORK=10.119.232.0
    BROADCAST=10.119.239.255
ifup v117_FE
ifup bond0.117
reboot
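
Once the host is back, you can verify the new VLAN tag and bridge (assuming the 8021q module is loaded):

cat /proc/net/vlan/config
brctl show v117_FE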

resolved – VPN Service not available, The VPN agent service is not responding. Please restart this application after a while.

March 30th, 2015 No comments

Today when I tried to connect to VPN through Cisco AnyConnect Secure Mobility Client, the following error dialog appeared:

 

[dialog: VPN Service not available]

After I clicked the "OK" button, another dialog appeared:

[dialog: The VPN agent service is not responding]

Both dialogs were complaining about the VPN service being unavailable or unresponsive. So I ran "services.msc" from the Windows Run box and found the following:

[screenshot: services list showing the Cisco AnyConnect service]

The service "Cisco AnyConnect Secure Mobility Agent" was stopped, and its "Startup type" was "Manual". So I changed "Startup type" to "Automatic", clicked "Start", then "OK" to save.
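
The same change can be made from an elevated command prompt; a sketch, assuming the service short name is vpnagent (verify with "sc query" first):

sc config vpnagent start= auto
sc start vpnagent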

After this, Cisco AnyConnect Secure Mobility Client ran fine and I could connect to the VPN through it.

resolved – ORA-01013: user requested cancel of current operation

March 24th, 2015 No comments

ORA-01013: user requested cancel of current operation may occur on the following occasions:

  • If the events blocking a shutdown do not complete within one hour, the shutdown operation aborts with the following message: ORA-01013: user requested cancel of current operation.
  • This message is also displayed if you interrupt the shutdown process, for example by pressing CTRL-C.
Categories: Databases, IT Architecture, Oracle DB Tags:

resolved – ext3: No journal on filesystem on disk

March 23rd, 2015 No comments

Today I met the error below when trying to mount a disk:

[root@testvm ~]# mount /scratch
mount: wrong fs type, bad option, bad superblock on /dev/xvdb1,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

First I ran fsck -y /dev/xvdb1, but the issue remained after it finished (though sometimes fsck -y alone can resolve this). So, as the message suggested, I ran dmesg | tail:

[root@testvm scratch]# dmesg | tail
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1
ext3: No journal on filesystem on xvdb1

So from here we can see that the root cause of the mount failure was "ext3: No journal on filesystem on xvdb1". Since fsck had not fixed it, I tried adding an ext3 journal to the disk:

[root@testvm qgomsdc1]# tune2fs -j /dev/xvdb1
tune2fs 1.39 (29-May-2006)
Creating journal inode:

done
This filesystem will be automatically checked every 20 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

After this, the mount succeeded.
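
You can confirm the journal now exists by checking the filesystem features (an ext3 filesystem with a journal lists has_journal):

[root@testvm ~]# tune2fs -l /dev/xvdb1 | grep -i features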

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags:

resolved – ORA-00020: maximum number of processes (1000) exceeded

March 18th, 2015 No comments

Today I encountered an ORA-12516 error when trying to access an Oracle database:

[root@client-doxer ~]# sqlplus tauser/password1@rac0102-r.example.com:1521/qainfac1

SQL*Plus: Release 11.2.0.3.0 Production on Tue Mar 17 07:31:04 2015

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol
stack

Enter user-name:

Then I tried connecting using the VIP instead of the SCAN name, but that failed too:

[root@client-doxer ~]# sqlplus tauser/password1@rac0102-v.example.com:1521/qainfac1

SQL*Plus: Release 11.2.0.3.0 Production on Tue Mar 17 07:37:22 2015

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12516: TNS:listener could not find available handler with matching protocol
stack

Enter user-name:

Then, on the database server, I checked the service qainfac1:

[root@rac01 crsd]# /u01/app/11.2.0.4/grid/bin/crsctl status res -t|grep -A5 ora.qainf1.db
ora.qainf1.db
1 ONLINE ONLINE rac01 Open
2 OFFLINE OFFLINE Instance Shutdown
ora.qainf1.qainfac1.svc
1 ONLINE ONLINE rac01
2 OFFLINE OFFLINE

So one instance was running fine. I then tried a sqlplus connection from the local server:

[oracle@rac01 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Tue Mar 17 07:45:39 2015

Copyright (c) 1982, 2013, Oracle. All rights reserved.

ERROR:
ORA-00020: maximum number of processes (1000) exceeded

Enter user-name: ^C

That's it, "ORA-00020: maximum number of processes (1000) exceeded". Then it's going to be a question of adjusting parameter PROCESSES. As parameter PROCESSES cannot be changed with ALTER SYSTEM unless a server parameter file was used to start the instance and the change takes effect in subsequent instances, so a bounce of instance is needed to activiate the new setting:

SQL> set lines 200
SQL> col NAME for a30
SQL> col VALUE for a40
SQL> select NAME,VALUE,ISSES_MODIFIABLE,ISSYS_MODIFIABLE,ISINSTANCE_MODIFIABLE from v$parameter where name='processes';
NAME VALUE ISSES ISSYS_MOD ISINS
------------------------------ ---------------------------------------- ----- --------- -----
processes 1500 FALSE FALSE FALSE

SQL> show parameter processes;

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 1
db_writer_processes integer 3
gcs_server_processes integer 2
global_txn_processes integer 1
job_queue_processes integer 1000
log_archive_max_processes integer 4
processes integer 1000

In the alert log /u01/app/oracle/diag/rdbms/qainf1/qainf12/trace/alert_qainf12.log, I could see the errors below:

Unable to allocate flashback log of 51094 blocks from
current recovery area of size 214748364800 bytes.
Recovery Writer (RVWR) is stuck until more space
is available in the recovery area.
Unable to write Flashback database log data because the
recovery area is full, presence of a guaranteed
restore point and no reusable flashback logs.

Here's Fast Recovery Area info:

SQL> show parameter db_recovery_file_dest;

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest string +DATA
db_recovery_file_dest_size big integer 200G
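
To see what is actually consuming the FRA, you can also query the standard view V$FLASH_RECOVERY_AREA_USAGE:

SQL> select file_type, percent_space_used, percent_space_reclaimable, number_of_files from v$flash_recovery_area_usage;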

And here's ASM diskgroup info:

[oracle@rac01 ~]$ export ORACLE_SID=+ASM2
[oracle@rac01 ~]$ export ORACLE_HOME=/u01/app/11.2.0.4/grid
[oracle@rac01 ~]$ export PATH=$ORACLE_HOME/bin:$PATH
[oracle@rac01 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.4.0 Production on Wed Mar 18 02:27:13 2015

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> set lines 200;
SQL> select name, total_mb, free_mb, total_mb-free_mb used_mb from v$asm_diskgroup;

NAME TOTAL_MB FREE_MB USED_MB
------------------------------ ---------- ---------- ----------
DATA 4681689 533392 4148297

I then checked restore points:

SQL> col NAME for a20
SQL> col time for a40
SQL> col SCN for 999999999999999
SQL> col STORAGE_SIZE for 999999999999999
SQL> SELECT NAME, SCN, TIME, DATABASE_INCARNATION#, GUARANTEE_FLASHBACK_DATABASE,STORAGE_SIZE FROM V$RESTORE_POINT;

NAME SCN TIME DATABASE_INCARNATION# GUA STORAGE_SIZE
-------------------- -------------- ---------------------------------------- --------------------- --- ------------
GRPT_BF_UPGR 14035000000000 03-MAR-15 04.16.12.000000000 PM 2 YES 214310000000

As the guaranteed restore point was no longer needed, I dropped it to free the space:

SQL> drop restore point GRPT_BF_UPGR;

After this, about 214G of space was released from the FRA, and I could start the DB and set the processes parameter to 1500 (if sqlplus won't connect even locally, find and kill some of the instance's processes as root via "ps -ef|grep <sid>"):

SQL> alter system set processes=1500 scope=spfile;
SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database flashback on;
SQL> alter database open;
SQL> select LOG_MODE,flashback_on from v$database;
LOG_MODE FLASHBACK_ON
------------ ------------------
ARCHIVELOG NO

Categories: Databases, IT Architecture, Oracle DB Tags:

sendmail DSN: Data format error

March 5th, 2015 No comments

If you hit errors when sending mail using sendmail (or the Linux mail/mailx commands), check /var/log/maillog for details. For example:

Mar 5 02:39:10 testhost1 sendmail[15281]: t252dAZr015281: from=root, size=78, class=0, nrcpts=1, msgid=<201503050239.t252dAZr015281@testhost1.us.example.com>, relay=root@localhost
Mar 5 02:39:10 testhost1 sendmail[15282]: t252dA8Z015282: from=<root@testhost1.us.example.com>, size=393, class=0, nrcpts=1, msgid=<201503050239.t252dAZr015281@testhost1.us.example.com>, proto=ESMTP, daemon=MTA, relay=localhost.localdomain [127.0.0.1]
Mar 5 02:39:10 testhost1 sendmail[15281]: t252dAZr015281: to=user1@example.com, ctladdr=root (0/0), delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=30078, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (t252dA8Z015282 Message accepted for delivery)
Mar 5 02:39:10 testhost1 sendmail[15284]: t252dA8Z015282: to=<user1@example.com>, ctladdr=<root@testhost1.us.example.com> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=esmtp, pri=120393, relay=smtpserver1.example.com. [192.151.231.4], dsn=5.6.0, stat=Data format error
Mar 5 02:39:10 testhost1 sendmail[15284]: t252dA8Z015282: t252dA8Z015284: DSN: Data format error
Mar 5 02:39:10 testhost1 sendmail[15284]: t252dA8Z015284: to=<root@testhost1.us.example.com>, delay=00:00:00, xdelay=00:00:00, mailer=local, pri=31660, dsn=2.0.0, stat=Sent

From here, you can see that after relaying, the mail finally failed with DSN code 5.6.0 (DSN is the Delivery Status Notification extension of SMTP). So you should check the code details:

5.x.x - Permanent or fatal error. This can be caused by a non-existent email address, a DNS problem, or your email being blocked by the receiving server.

X.6.0 - Other or undefined media error. Something about the content of the message caused it to be considered undeliverable, and the problem cannot be well expressed with any of the other provided detail codes.
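
For reference, a quick way to generate such a test mail and watch the result in /var/log/maillog (the address is a placeholder):

echo "test body" | mailx -s "test subject" user1@example.com
tail /var/log/maillog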

PS:

For more info about DSN codes, you can check http://www.inmotionhosting.com/support/email/email-troubleshooting/smtp-and-esmtp-error-code-list or http://tools.ietf.org/rfc/rfc3463.txt or http://www.iana.org/assignments/smtp-enhanced-status-codes/smtp-enhanced-status-codes.xml for details.

Categories: IT Architecture, Linux, Systems Tags:

TCP Window Scaling – values about TCP buffer size

February 4th, 2015 No comments

TCP Window Scaling(TCP socket buffer size, TCP window size)

/proc/sys/net/ipv4/tcp_window_scaling
/proc/sys/net/ipv4/tcp_rmem - memory reserved for TCP receive buffers (minimum, initial, and maximum buffer size)
/proc/sys/net/ipv4/tcp_wmem - memory reserved for TCP send buffers (same three values)
/proc/sys/net/core/rmem_max - maximum receive window
/proc/sys/net/core/wmem_max - maximum send window

The following values (which are the defaults for 2.6.17 with more than 1 GByte of memory) would be reasonable for all paths with a 4MB BDP or smaller:

echo 1 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf #autotuning enabled. The receiver buffer size (and TCP window size) is dynamically updated (autotuned) for each connection. (Sender side autotuning has been present and unconditionally enabled for many years now).
echo 108544 > /proc/sys/net/core/wmem_max
echo 108544 > /proc/sys/net/core/rmem_max
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 16384 4194304" > /proc/sys/net/ipv4/tcp_wmem

Advanced TCP features

cat /proc/sys/net/ipv4/tcp_timestamps
cat /proc/sys/net/ipv4/tcp_window_scaling
cat /proc/sys/net/ipv4/tcp_sack

Here is some background knowledge:

  • The throughput of a communication is limited by two windows: the congestion window and the receive window. The former tries not to exceed the capacity of the network (congestion control) and the latter tries not to exceed the capacity of the receiver to process data (flow control). The receiver may be overwhelmed by data if for example it is very busy (such as a Web server). Each TCP segment contains the current value of the receive window. If for example a sender receives an ack which acknowledges byte 4000 and specifies a receive window of 10000 (bytes), the sender will not send packets after byte 14000, even if the congestion window allows it.
  • TCP uses what is called the "congestion window", or CWND, to determine how many packets can be sent at one time. The larger the congestion window size, the higher the throughput. The TCP "slow start" and "congestion avoidance" algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket, there is a default value for the buffer size, which can be changed by the program using a system library call just before opening the socket. There is also a kernel enforced maximum buffer size. The buffer size can be adjusted for both the send and receive ends of the socket.
  • To get maximal throughput it is critical to use optimal TCP send and receive socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never fully open up. If the receiver buffers are too large, TCP flow control breaks and the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host. Overly large windows on the sending side are not usually a problem as long as you have excess memory; note that every TCP socket has the potential to request this amount of memory even for short connections, making it easy to exhaust system resources.
  • More about TCP Buffer Sizing is here.
  • More about /proc/sys/net/ipv4/* Variables is here.

resolved – TNS:listener does not currently know of service requested in connect descriptor

February 3rd, 2015 No comments

Today we found errors in the weblogic log about a datasource connection:

TNS:listener does not currently know of service requested in connect descriptor

In our configuration, the data source was using the info below:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=testrac-r.example.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=testservice)))

This was weird, as it had worked before. After some debugging, we found that the 3 IPs of the SCAN name testrac-r.example.com behaved abnormally on the RAC nodes:

[root@rac1 ~]# /sbin/ifconfig|egrep -B1 '192.168.20.5|192.168.20.6|192.168.20.7'
v115_FE:3 Link encap:Ethernet HWaddr 00:21:28:F0:30:4C
inet addr:192.168.20.5 Bcast:10.245.87.255 Mask:255.255.248.0
--
v115_FE:4 Link encap:Ethernet HWaddr 00:21:28:F0:30:4C
inet addr:192.168.20.7 Bcast:10.245.87.255 Mask:255.255.248.0
--
v115_FE:5 Link encap:Ethernet HWaddr 00:21:28:F0:30:4C
inet addr:192.168.20.6 Bcast:10.245.87.255 Mask:255.255.248.0

[root@rac2 ~]# /sbin/ifconfig|egrep -B1 '192.168.20.5|192.168.20.6|192.168.20.7'
v115_FE:6 Link encap:Ethernet HWaddr 00:21:28:E8:3C:16
inet addr:192.168.20.7 Bcast:10.245.87.255 Mask:255.255.248.0
--
v115_FE:7 Link encap:Ethernet HWaddr 00:21:28:E8:3C:16
inet addr:192.168.20.6 Bcast:10.245.87.255 Mask:255.255.248.0

As shown above, 192.168.20.6 and 192.168.20.7 were up on both nodes at the same time, which indicated the SCAN was in a bad state. So we bounced the SCAN services, and after that the issue was gone.
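
On 11gR2 Grid Infrastructure, bouncing the SCAN resources looks roughly like this (a sketch; run as the Grid Infrastructure owner):

srvctl stop scan_listener
srvctl stop scan
srvctl start scan
srvctl start scan_listener
srvctl status scan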

Categories: Databases, IT Architecture, Oracle DB Tags:

Close Putty sessions without exit confirmation dialog

January 14th, 2015 No comments

You can set this in Putty under "Change Settings" -> "Window" -> "Behaviour": uncheck "Warn before closing window". Save the config under "Session", and from then on all windows can be closed without any exit confirmation dialog.

[screenshot: Putty session settings]

Categories: Misc Tags:

resolved – su: cannot set user id: Resource temporarily unavailable

January 12th, 2015 No comments

When I tried to log on as user "test", an error occurred:

su: cannot set user id: Resource temporarily unavailable

I checked limits.conf:

[root@testvm ~]# cat /etc/security/limits.conf|egrep -v '^$|^#'
oracle   soft   nofile    131072
oracle   hard   nofile    131072
oracle   soft   nproc    131072
oracle   hard   nproc    131072
oracle   soft   core    unlimited
oracle   hard   core    unlimited
oracle   soft   memlock    50000000
oracle   hard   memlock    50000000
@svrtech    soft    memlock         500000
@svrtech    hard    memlock         500000
*   soft   nofile    131072
*   hard   nofile    131072
*   soft   nproc    131072
*   hard   nproc    131072
*   soft   core    unlimited
*   hard   core    unlimited
*   soft   memlock    50000000
*   hard   memlock    50000000

Then I compared the number of processes/threads against the maximum number of processes to see whether the limit was being exceeded:

[root@c9qa131-slcn03vmf0293 ~]# ps -eLF | grep test | wc -l
1026

So that was not exceeded. Then I checked open files:

[root@testvm ~]# lsof | grep aime | wc -l

6059

It wasn't exceeding 131072 either, so why was the error "su: cannot set user id: Resource temporarily unavailable" there? The actual culprit was the file /etc/security/limits.d/90-nproc.conf:

[root@testvm ~]# cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

* soft nproc 1024
root soft nproc unlimited

After I changed 1024 to 131072, the issue went away immediately.
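
You can verify the effective limit for the user afterwards; it should now report 131072:

[root@testvm ~]# su - test -c 'ulimit -u'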

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags:

resolved – Error: Unable to connect to xend: Connection reset by peer. Is xend running?

January 7th, 2015 No comments

Today I met an issue when trying to run xm commands on a Xen server:

[root@xenhost1 ~]# xm list
Error: Unable to connect to xend: Connection reset by peer. Is xend running?

I checked and found xend was actually running:

[root@xenhost1 ~]# /etc/init.d/xend status
xend daemon running (pid 8329)

After some debugging, I found it was caused by libvirtd and xend getting into a bad state, so I bounced both:

[root@xenhost1 ~]# /etc/init.d/libvirtd restart
Stopping libvirtd daemon: [ OK ]
Starting libvirtd daemon: [ OK ]

[root@xenhost1 ~]# /etc/init.d/xend restart #this may not be needed 
restarting xend...
xend daemon running (pid 19684)

Later, the xm commands went good.

PS:

For more information about libvirt, you can check here.

 

Categories: Clouding, IT Architecture, Oracle Cloud Tags:

remove entries in perl array with specified value

December 30th, 2014 No comments

Assume the following array @array_filtered:

my @array_filtered = ("your", "array", "here", 1, 3, 8, "here", 2, 5, 9, "sit", "here",3, 4, 7,"yes","now",8,1,7,6); #or my @array_filtered=qw(your array here 1 3 8 here 2 5 9 sit here 3 4 7 yes now 8 1 7 6) which uses Alternative Quotes(q, qq, qw, qx)

If you want to remove each entry whose value is "here" or "now", together with its following 3 entries, you can use splice:

#!/usr/bin/perl
my @array_filtered = ("your", "array", "here", 1, 3, 8, "here", 2, 5, 9, "sit", "here",3, 4, 7,"yes","now",8,1,7,6);
my @search_for = ("here","now");
#return keys that have specified value, =~/!~ for regular expression, eq/ne for string, ==/!= for number. or use unless()/if(not()). use m{} instead of // if there's too much / in the expression and you're tired of using \/ to escape them.

$search_for_s=join('|',@search_for);
@index_all = grep { $array_filtered[$_] =~ /$search_for_s/ } 0..$#array_filtered;

for($i=0;$i<=$#index_all;$i++) {
@index_all_one = grep { $array_filtered[$_] =~ /$search_for_s/ } 0..$#array_filtered;
splice(@array_filtered,$index_all_one[0],4);
#print $indexone."\n"
}

print "@array_filtered"."\n";

The output is "your array sit yes 6".

PS:

  • For more info about perl regular expression(such as operators<m, s, tr> and their modifiers, complex regular expression cheat sheet<.\s\S\d\D\w\W[aeiou][^aeiou](foo|bar), \G, $, $&, $`, $'> and more), you can refer to this article.
  • The following is about perl alternative quotes:

q// is generally the same thing as using single quotes - meaning it doesn't interpolate values inside the delimiters.
qq// is the same as double quoting a string. It interpolates.
qw// return a list of white space delimited words. @q = qw/this is a test/ is functionally the same as @q = ('this', 'is', 'a', 'test')
qx// is the same thing as using the backtick operators.

Categories: IT Architecture, Perl, Programming Tags:

resolved – cssh installation on linux server

December 29th, 2014 No comments

ClusterSSH can be used when you need to control a number of xterm windows via a single graphical console window, so that you can run commands interactively on multiple servers over an ssh connection. This guide shows how to install clusterssh on a Linux box from the tarball.

First, download the cssh tarball App-ClusterSSH-4.03_04.tar.gz from sourceforge. You may need to export proxy settings if your environment requires them:

export https_proxy=http://my-proxy.example.com:80/
export http_proxy=http://my-proxy.example.com:80/
export ftp_proxy=http://my-proxy.example.com:80/

With the proxy set, you can now get the package:

wget 'http://sourceforge.net/projects/clusterssh/files/latest/download'
tar zxvf App-ClusterSSH-4.03_04.tar.gz
cd App-ClusterSSH-4.03_04
cat README

Before installing, let's install some prerequisite packages:

yum install gcc libX11-devel gnome* -y
yum groupinstall "X Window System" -y
yum groupinstall "GNOME Desktop Environment" -y
yum groupinstall "Graphical Internet" -y
yum groupinstall "Graphics" -y

Now run "perl Build.PL" as indicated by README:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Can't locate Module/Build.pm in @INC (@INC contains: /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/site_perl/5.8.8 /usr/lib/perl5/site_perl /usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi /usr/lib/perl5/vendor_perl/5.8.8 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.8.8/i386-linux-thread-multi /usr/lib/perl5/5.8.8 .) at Build.PL line 5.
BEGIN failed--compilation aborted at Build.PL line 5.

As it complained, you need to install Module::Build first. Let's use cpan to install that module.

Run "cpan" and enter "follow" when below info occurred:

Policy on building prerequisites (follow, ask or ignore)? [ask] follow

If you have already run cpan before, you can configure the policy as below:

cpan> o conf prerequisites_policy follow
cpan> o conf commit

Now let's install Module::Build:

cpan> install Module::Build

After the installation, let's run "perl Build.PL" again:

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL
Checking prerequisites...
  requires:
    !  Exception::Class is not installed
    !  Tk is not installed
    !  Try::Tiny is not installed
    !  X11::Protocol is not installed
  build_requires:
    !  CPAN::Changes is not installed
    !  File::Slurp is not installed
    !  File::Which is not installed
    !  Readonly is not installed
    !  Test::Differences is not installed
    !  Test::DistManifest is not installed
    !  Test::PerlTidy is not installed
    !  Test::Pod is not installed
    !  Test::Pod::Coverage is not installed
    !  Test::Trap is not installed

ERRORS/WARNINGS FOUND IN PREREQUISITES.  You may wish to install the versions
of the modules indicated above before proceeding with this installation

Run 'Build installdeps' to install missing prerequisites.

Created MYMETA.yml and MYMETA.json
Creating new 'Build' script for 'App-ClusterSSH' version '4.03_04'

As the output says, run "./Build installdeps" to install the missing prerequisites. Make sure you're in a GUI environment (through vncserver, maybe), as the build has a step that tests the GUI.

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

......

Running Mkbootstrap for Tk::Xlib ()
chmod 644 "Xlib.bs"
"/usr/bin/perl" "/usr/lib/perl5/5.8.8/ExtUtils/xsubpp" -typemap "/usr/lib/perl5/5.8.8/ExtUtils/typemap" -typemap "/root/.cpan/build/Tk-804.032/Tk/typemap" Xlib.xs > Xlib.xsc && mv Xlib.xsc Xlib.c
make[1]: *** No rule to make target `pTk/tkInt.h', needed by `Xlib.o'. Stop.
make[1]: Leaving directory `/root/.cpan/build/Tk-804.032/Xlib'
make: *** [subdirs] Error 2
/usr/bin/make -- NOT OK
Running make test
Can't test without successful make
Running make install
make had returned bad status, install seems impossible

Errors again; we can see it's complaining about something Tk-related. To resolve this, I manually installed the latest perl-tk module as below:

wget --no-check-certificate 'https://github.com/eserte/perl-tk/archive/master.zip'
unzip master
cd perl-tk-master
perl Makefile.PL
make
make install

After this, let's run "./Build installdeps" and "perl Build.PL" again, which both went through fine:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build installdeps

[root@centos-32bits App-ClusterSSH-4.03_04]# perl Build.PL

And let's run ./Build now:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build
Building App-ClusterSSH
Generating: ccon
Generating: crsh
Generating: cssh
Generating: ctel

And now "./Build install" which is the last step:

[root@centos-32bits App-ClusterSSH-4.03_04]# ./Build install

After installation, let's have a test:

[root@centos-32bits App-ClusterSSH-4.03_04]# echo 'svr testserver1 testserver2' > /etc/clusters

Now run 'cssh svr', and you'll get the charm!

[screenshot: clusterssh controlling multiple xterm windows]

 

Categories: Clouding, IT Architecture, Linux, Systems, Unix Tags:

resolved – error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm

December 17th, 2014 No comments

Today when I tried using curl to fetch a URL, an error occurred as below:

[root@centos-doxer ~]# curl -i --user username:password -H "Content-Type: application/json" -X POST --data @/u01/shared/addcredential.json https://testserver.example.com/actions -v
* About to connect() to testserver.example.com port 443
*   Trying 10.242.11.201... connected
* Connected to testserver.example.com (10.242.11.201) port 443
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* SSLv2, Client hello (1):
SSLv3, TLS handshake, Server hello (2):
SSLv3, TLS handshake, CERT (11):
SSLv3, TLS alert, Server hello (2):
error:0D0C50A1:asn1 encoding routines:ASN1_item_verify:unknown message digest algorithm
* Closing connection #0

After some searching, I found it was caused by the installed version of openssl (openssl-0.9.8e) not supporting the SHA256 signature algorithm. There are two ways to resolve this:

1. add the -k parameter to curl to ignore the SSL error

2. upgrade openssl to at least openssl-0.9.8o. Here's how to upgrade openssl from source:

wget --no-check-certificate 'https://www.openssl.org/source/old/0.9.x/openssl-0.9.8o.tar.gz'
tar zxvf openssl-0.9.8o.tar.gz
cd openssl-0.9.8o
./config --prefix=/usr --openssldir=/usr/openssl
make
make test
make install

After this, run openssl version to confirm:

[root@centos-doxer openssl-0.9.8o]# /usr/bin/openssl version
OpenSSL 0.9.8o 01 Jun 2010
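
With the newer openssl in place, you can also confirm that the server certificate is indeed SHA256-signed (hostname as above):

echo | openssl s_client -connect testserver.example.com:443 2>/dev/null | openssl x509 -noout -text | grep 'Signature Algorithm'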

PS:

If you previously installed openssl from an rpm package, you'll find that rpm still reports the old version even though the binary itself is now the new one built from source. This is expected, so don't rely too much on rpm here. Even after rebuilding the rpm DB (rpm --rebuilddb), rpm still shows the old version:

[root@centos-doxer openssl-0.9.8o]# rpm -qf /usr/bin/openssl
openssl-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1

[root@centos-doxer openssl-0.9.8o]# rpm -qa|grep openssl
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1
openssl-0.9.8e-26.el5_9.1
openssl-devel-0.9.8e-26.el5_9.1

 

output analysis of linux last command

December 9th, 2014 No comments

Here's the output of last on my linux host:

root     pts/9        remote.example   Tue Dec  9 14:51   still logged in
testuser pts/2        :3               Tue Dec  9 14:49   still logged in
aime     pts/1        :2               Tue Dec  9 14:49   still logged in
root     pts/0        :1               Tue Dec  9 14:49   still logged in
testuser pts/13       remote.example   Tue Dec  9 10:48 - 10:52  (00:02)
reboot   system boot  2.6.23           Tue Dec  9 10:11          (04:39)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:20  (00:06)
root     pts/11       10.182.120.179   Thu Dec  4 17:14 - 17:14  (00:00)
root     pts/10       10.182.120.179   Thu Dec  4 15:55 - 15:55  (00:00)
testuser pts/14       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/12       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/13       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/15       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/11       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
testuser pts/16       :3.0             Tue Dec  2 15:44 - 15:46  (00:01)
root     pts/10       10.182.120.179   Tue Dec  2 11:20 - 11:20  (00:00)
root     pts/7        10.182.120.179   Tue Dec  2 10:15 - down  (6+07:39)
root     pts/6        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/5        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/4        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/3        10.182.120.179   Tue Dec  2 10:15 - 17:55 (6+07:39)
root     pts/2        :1               Tue Dec  2 10:00 - down  (6+07:55)
aime     pts/1        :2               Tue Dec  2 10:00 - down  (6+07:55)
testuser pts/0        :3               Tue Dec  2 10:00 - down  (6+07:55)
reboot   system boot  2.6.23           Tue Dec  2 09:58         (6+07:56)

Here's some analysis:

  • User "reboot" is a pseudo-user for system reboot. Entries between two reboots are users who log on the system during two reboots. For info about login shells(.bash_profile) and interactive non-login shells(.bashrc), you can refer to here.
  • Here are the column meanings:

Column 1: User logged on

Column 2: The tty name after logging on

Column 3: The remote IP or hostname from which the user logged on. Values like ":1", ":2", ":3" are VNC display numbers that vncserver is rendering against.

Column 4: Begin/end time of the session. "still logged in" means the user is still logged on; a value in parentheses is the total duration of the session. For the most recent "reboot" entry, it is the uptime up to now; for an earlier "reboot" entry, it is the uptime between that boot and the next reboot. Note, however, that this time is not always accurate, for example after a system crash and an unusual restart sequence; last calculates it as the time between the boot and the next reboot/shutdown. If you only care about boot history, see the example right after this list.
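
For boot history only, last can filter on the pseudo-user directly:

[root@testvm ~]# last reboot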

 

Categories: IT Architecture, Linux, Systems, Unix Tags:

ORA-12154 – TNS:could not resolve the connect identifier specified

December 2nd, 2014 No comments

Today I tried to connect to a DB service named pditui using the following connect string:

export ORACLE_HOME=/u01/app/oracle/product/11.2.0/client_1
export PATH=$ORACLE_HOME/bin:$PATH

sqlplus "sys/password@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = scanname.test.example.com)(PORT = 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = pditui)))" as sysdba

However, the following error messages appeared:

SQL*Plus: Release 11.2.0.3.0 Production on Tue Dec 2 14:07:35 2014

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified

Enter user-name:
ERROR:
ORA-12162: TNS:net service name is incorrectly specified

The username/password and service name were all correct, but the error persisted. After some checking, I found it was caused by a wrong NAMES.DIRECTORY_PATH setting in the file $ORACLE_HOME/network/admin/sqlnet.ora:

[root@centos-doxer ~]# cat /u01/app/oracle/product/11.2.0/client_1/network/admin/sqlnet.ora
# sqlnet.ora Network Configuration File: /u01/app/oracle/product/11.2.0/client_1/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.

#NAMES.DIRECTORY_PATH= (TNSNAMES)
NAMES.DIRECTORY_PATH= (TNSNAMES,ezconnect) -- add the ezconnect method here

ADR_BASE = /u01/app/oracle

After this, the connection was ok.
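
With ezconnect enabled, the compact EZCONNECT syntax also works, e.g. (same hypothetical host/service as above):

sqlplus "sys/password@//scanname.test.example.com:1521/pditui" as sysdba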

PS: 

You can read more about NAMES.DIRECTORY_PATH in file $ORACLE_HOME/network/admin/sqlnet.ora here.

Categories: Databases, Oracle DB Tags:

resolved – switching from Unbreakable Enterprise Kernel Release 2(UEKR2) to UEKR3 on Oracle Linux 6

November 24th, 2014 No comments

As we can see from here, the available kernels include the following 3 for Oracle Linux 6:

3.8.13 Unbreakable Enterprise Kernel Release 3 (x86_64 only)
2.6.39 Unbreakable Enterprise Kernel Release 2**
2.6.32 (Red Hat compatible kernel)

On one of our OEL6 VMs, we found it was using UEKR2:

[root@testbox aime]# cat /etc/issue
Oracle Linux Server release 6.4
Kernel \r on an \m

[root@testbox aime]# uname -r
2.6.39-400.211.1.el6uek.x86_64

So how can we switch the kernel to UEKR3(3.8)?

If your Linux version is 6.4, first do a "yum update -y" to upgrade to 6.5 or later, then reboot the host and follow the steps below.

[root@testbox aime]# ls -l /etc/grub.conf
lrwxrwxrwx. 1 root root 22 Aug 21 18:24 /etc/grub.conf -> ../boot/grub/grub.conf

[root@testbox aime]# yum update -y

If your Linux version is 6.5 or later, you'll find /etc/grub.conf and /boot/grub/grub.conf are different files (this applies to hosts upgraded via yum update; if your host was installed as OEL6.5, /etc/grub.conf should still be a softlink):

[root@testbox ~]# ls -l /etc/grub.conf
-rw------- 1 root root 2356 Oct 20 05:26 /etc/grub.conf

[root@testbox ~]# ls -l /boot/grub/grub.conf
-rw------- 1 root root 1585 Nov 23 21:46 /boot/grub/grub.conf

In /etc/grub.conf, you'll see an entry like the one below:

title Oracle Linux Server Unbreakable Enterprise Kernel (3.8.13-44.1.3.el6uek.x86_64)
root (hd0,0)
kernel /vmlinuz-3.8.13-44.1.3.el6uek.x86_64 ro root=/dev/mapper/vg01-lv_root rd_LVM_LV=vg01/lv_root rd_NO_LUKS rd_LVM_LV=vg01/lv_swap LANG=en_US.UTF-8 KEYTABLE=us console=hvc0 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_NO_DM rhgb quiet
initrd /initramfs-3.8.13-44.1.3.el6uek.x86_64.img

What you need to do is simply copy the entry above from /etc/grub.conf to /boot/grub/grub.conf, and then reboot the VM.

After rebooting, you'll find the kernel is now at UEKR3 (3.8).
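
A quick check; given the grub entry above, it should report something like:

[root@testbox ~]# uname -r
3.8.13-44.1.3.el6uek.x86_64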

PS:

If the VM is OEL6.5 and /etc/grub.conf is a softlink to /boot/grub/grub.conf, then you can do the following to upgrade the kernel to UEKR3:

1. add the following lines to /etc/yum.repos.d/public-yum-ol6.repo:

[public_ol6_UEKR3]
name=UEKR3 for Oracle Linux 6 ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/UEKR3/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

2. List and install UEKR3:

[root@testbox aime]# yum list|grep kernel-uek|grep public_ol6_UEKR3
kernel-uek.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-debug-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-devel.x86_64 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-doc.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-firmware.noarch 3.8.13-44.1.5.el6uek public_ol6_UEKR3
kernel-uek-headers.x86_64 3.8.13-26.2.4.el6uek public_ol6_UEKR3

[root@testbox aime]# yum install -y kernel-uek* --disablerepo=* --enablerepo=public_ol6_UEKR3

3. Reboot

resolved – ORA-27300 OS system dependent operation:fork failed with status: 11

November 18th, 2014 No comments

Today we observed that all our DBs were down, and the trace file showed:

Errors in file /u01/database/diag/rdbms/oimdb/OIMDB/trace/OIMDB_psp0_3173.trc:

ORA-27300: OS system dependent operation:fork failed with status: 11

ORA-27301: OS failure message: Resource temporarily unavailable

ORA-27302: failure occurred at: skgpspawn5

After some searching for ORA-27300, I found this article, which suggested that the user process limit had been exhausted, so the system could not fork new processes at the time. As the problem happened at Mon Nov 17 02:08:51 2014, I did some checking using sysstat's sar:

[root@test sa]# sar -f /var/log/sa/sa17 -s 00:00:00 -e 03:20:00
Linux 2.6.32-300.27.1.el5uek (slcn11vmf0029) 11/17/14

00:00:01 CPU %user %nice %system %iowait %steal %idle
00:10:01 all 1.16 0.12 0.48 0.71 0.18 97.35
00:20:02 all 1.30 0.00 0.47 0.95 0.19 97.10
00:30:01 all 1.88 0.00 0.63 1.98 0.19 95.32
00:40:01 all 1.00 0.00 0.35 2.15 0.18 96.32
00:50:01 all 1.09 0.00 0.40 0.47 0.18 97.87
01:00:01 all 1.03 0.00 0.34 0.25 0.16 98.23
01:10:01 all 3.98 0.02 1.72 4.26 0.22 89.80
01:20:01 all 9.98 0.13 5.99 47.40 0.31 36.19
01:30:01 all 1.86 0.00 1.24 48.72 0.16 48.01
01:40:01 all 1.08 0.00 0.82 48.77 0.18 49.15
01:50:01 all 1.54 0.00 0.97 49.32 0.18 47.98
02:00:01 all 1.05 0.00 0.85 48.74 0.18 49.19 --- problem occurred at Mon Nov 17 02:08:51 2014
02:10:01 all 10.14 0.14 8.95 44.75 0.34 35.68
02:20:01 all 0.06 0.00 0.21 1.87 0.07 97.78
02:30:01 all 0.08 0.00 0.29 2.81 0.08 96.74
02:40:01 all 0.09 0.00 0.31 3.08 0.08 96.44
02:50:01 all 0.05 0.00 0.13 0.96 0.06 98.81
03:00:01 all 0.07 0.00 0.26 2.38 0.07 97.22
03:10:01 all 0.06 0.12 0.21 1.52 0.07 98.02
Average: all 1.85 0.03 1.20 15.89 0.16 80.88

[root@test sa]# sar -f /var/log/sa/sa17 -s 01:10:00 -e 02:11:00 -A
......
......
01:10:01 kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
01:20:01 259940 15482728 98.35 2004 11703696 0 2104504 100.00 194056 -- even all SWAP spaces were used up
01:30:01 398584 15344084 97.47 904 11703152 0 2104504 100.00 191728
01:40:01 409104 15333564 97.40 984 11716924 0 2104504 100.00 191404
01:50:01 452844 15289824 97.12 1004 11711548 0 2104504 100.00 189076
02:00:01 440780 15301888 97.20 1424 11757600 0 2104504 100.00 189364
02:10:01 14602712 1139956 7.24 19548 382588 1978020 126484 6.01 3096
Average: 2760661 12982007 82.46 4311 9829251 329670 1774834 84.34 159787

So this proved that the system was very busy during that time. I then raised the user process limit (nproc) to 131072 in /etc/security/limits.conf with the following:

* soft nproc 131072
* hard nproc 131072

I also set kernel.pid_max to 139264 (131072 plus 8192, the extra headroom being recommended for OS stability) in /etc/sysctl.conf.
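
To load the new value without a reboot, re-read /etc/sysctl.conf:

[root@test ~]# sysctl -p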

[root@test ~]# sysctl -a|grep pid_max
kernel.pid_max = 139264

Then increased memory from 16G to 32G of the box, and reboot.

resolved – high value of RX overruns in ifconfig

November 13th, 2014 No comments

Today we tried to ssh to one host and the session soon got stuck. We observed that the RX overruns counter in the ifconfig output was high:

[root@test /]# ifconfig bond0
bond0 Link encap:Ethernet HWaddr 00:10:E0:0D:AD:5E
inet6 addr: fe80::210:e0ff:fe0d:ad5e/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:140234052 errors:0 dropped:0 overruns:12665 frame:0
TX packets:47259596 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:34204561358 (31.8 GiB) TX bytes:21380246716 (19.9 GiB)

Receiver overruns usually occur when packets come in faster than the kernel can service the last interrupt. In our case, though, we were also seeing increasing inbound errors on interface Eth105/1/7 of switch bcd-c1z1-swi-5k07a/b; a shut and no shut on the port brought no change. After some more debugging, we found one bad SFP/cable. After replacing it, the server came back to normal.

ucf-c1z1-swi-5k07a# sh int Eth105/1/7 | i err
4065 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 output error 0 collision 0 deferred 0 late collision

ucf-c1z1-swi-5k07a# sh int Eth105/1/7 | i err
4099 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 output error 0 collision 0 deferred 0 late collision

ucf-c1z1-swi-5k07a# sh int Eth105/1/7 counters errors

--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth105/1/7 3740 483 0 4223 0 0

--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth105/1/7 0 0 0 0 0 3740

--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth105/1/7 0 -- 0 0 0 0

ucf-c1z1-swi-5k07a# sh int Eth105/1/7 counters errors

--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth105/1/7 4386 551 0 4937 0 0

--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth105/1/7 0 0 0 0 0 4386

--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth105/1/7 0 -- 0 0 0 0

PS:

During debugging, we also found that on the server side the interface eth0 was at half duplex and 100Mb/s:

-bash-3.2# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Half
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000003 (3)
Link detected: yes

However, it should be full duplex at 1000Mb/s. So we also changed the speed and duplex to auto/auto on the switch, and after that the OS side showed the expected values:

-bash-3.2# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000003 (3)
Link detected: yes
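
If the server side ever needs the same nudge, re-enabling auto-negotiation with ethtool is the usual way (a sketch; here the actual fix was on the switch):

ethtool -s eth0 autoneg on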

resolved – Exception: ha_check_cpu_compatibility failed:

Today when I tried to add an OVS server into an OVMM server pool, the following error messages appeared:

2014-11-12 06:25:18.083 WARNING failed:errcode=00000, errmsg=Unexpected error: <Exception: ha_check_cpu_compatibility
failed:<Exception: CPU not compatible! {'new_ovs_03': 'vendor_id=GenuineIntel;cpu_family=6;model=45', 'slce27vmf1002': 'vendor_id=GenuineIntel;cpu_family=6;model=44', 'new_ovs_03': 'vendor_id=GenuineIntel;cpu_family=6;model=45'}>

StackTrace:
File "/opt/ovs-agent-2.3/OVSSiteHA.py", line 248, in ha_check_cpu_compatibility
raise Exception("CPU not compatible! %s" % repr(d))
>

StackTrace:
File "/opt/ovs-agent-2.3/OVSSiteCluster.py", line 609, in cluster_check_prerequisite
raise Exception(msg)

StackTrace:
File "/opt/ovs-agent-2.3/OVSSiteCluster.py", line 646, in _cluster_setup
#_check(ret)
File "/opt/ovs-agent-2.3/OVSXCluster.py", line 340, in _check
raise OVSException(error=ret["error"])

2014-11-12 06:25:18.083 NOTIFICATION Failed setup cluster for agent 2.2.0...
2014-11-12 06:25:18.083 ERROR Cluster Setup when adding server
2014-11-12 06:25:18.087 ERROR [Server Pool Management][Server Pool][test_serverpool]:During adding servers ([new_ovs_03]) to server pool (test_serverpool), Cluster setup failed: (OVM-1011 OVM Manager communication with new_ovs_03 for operation HA Setup for Oracle VM Agent 2.2.0 failed:
errcode=00000, errmsg=Unexpected error: <Exception: ha_check_cpu_compatibility
failed:<Exception: CPU not compatible! {'new_ovs_03': 'vendor_id=GenuineIntel;cpu_family=6;model=45', 'slce27vmf1002': 'vendor_id=GenuineIntel;cpu_family=6;model=44', 'new_ovs_03': 'vendor_id=GenuineIntel;cpu_family=6;model=45'}>

)

As stated in the error message, the add operation failed at the CPU compatibility check. To work around this, we can comment out the code where the CPU check occurs.

In file /opt/ovs-agent-2.3/OVSSiteCluster.py around line 646, comment out these two lines on each OVS server in the server pool:

#ret = cluster_check_prerequisite(ha_enable=ha_enable)
#_check(ret)

Then bounce ovs-agent on each OVS server and try the add again.
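
Assuming the standard init script location on an OVS 2.x server, the bounce would look like:

/etc/init.d/ovs-agent restart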

Categories: Clouding, IT Architecture, Oracle Cloud Tags:

resolved – /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

November 7th, 2014 No comments

One of our scripts failed with an error when we ran it today:

[root@testhost01 ~]# su - user1
[user1@testhost01 ~]$ /home/testuser/run_as_root 'su'
-bash: /usr/local/packages/aime/ias/run_as_root: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

From the output, we can see it's complaining about not finding the file /lib/ld-linux.so.2:

[user1@testhost01 ~]$ ls -l /lib/ld-linux.so.2
ls: cannot access /lib/ld-linux.so.2: No such file or directory

I then checked on another host and found /lib/ld-linux.so.2 belonged to package glibc:

[root@centos-doxer ~]# ls -l /lib/ld-linux.so.2
lrwxrwxrwx 1 root root 9 May 9 2013 /lib/ld-linux.so.2 -> ld-2.5.so
[root@centos-doxer ~]# rpm -qf /lib/ld-linux.so.2
glibc-2.5-107.el5_9.4

However, on the problematic host, glibc was installed:

[root@testhost01 user1]# rpm -qa|grep glibc
glibc-headers-2.12-1.149.el6.x86_64
glibc-common-2.12-1.149.el6.x86_64
glibc-devel-2.12-1.149.el6.x86_64
glibc-2.12-1.149.el6.x86_64

I then tried making a soft link from /lib64/ld-2.12.so to /lib/ld-linux.so.2:

[root@testhost01 ~]# ln -s /lib64/ld-2.12.so /lib/ld-linux.so.2
[root@testhost01 ~]# su - user1
[user1@testhost01 ~]$ /usr/local/packages/aime/ias/run_as_root su
-bash: /usr/local/packages/aime/ias/run_as_root: Accessing a corrupted shared library

Hmmm, so now it complained about a corrupted shared library. Maybe we needed the 32-bit glibc? So I removed the softlink and installed glibc.i686 instead:

rm -rf /lib/ld-linux.so.2
yum -y install glibc.i686

After installation, I found /lib/ld-linux.so.2 was there already:

[root@testhost01 user1]# ls -l /lib/ld-linux.so.2
lrwxrwxrwx 1 root root 10 Nov 7 03:46 /lib/ld-linux.so.2 -> ld-2.12.so
[root@testhost01 user1]# rpm -qf /lib/ld-linux.so.2
glibc-2.12-1.149.el6.i686

And when I ran the command again, it returned ok:

[root@testhost01 user1]# su - user1
[user1@testhost01 ~]$ /home/testuser/run_as_root 'su'
[root@testhost01 user1]#

So from this, we can see the issue was caused by /usr/local/packages/aime/ias/run_as_root being a 32-bit binary, which requires the 32-bit glibc.
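
A quick way to confirm that a binary is 32-bit is the file command; for this case it should report an "ELF 32-bit" executable:

[root@testhost01 ~]# file /usr/local/packages/aime/ias/run_as_root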

Categories: IT Architecture, Kernel, Linux, Systems Tags:

Resolved – AttributeError: ‘NoneType’ object has no attribute ‘_imgName’

November 6th, 2014 No comments

Today when I tried to list Virtual Machines on one Oracle OVMM, an error appeared:

[root@ovmm_test ~]# ovm -uadmin -ppassword vm ls
Traceback (most recent call last):
  File "/usr/bin/ovm", line 43, in ?
    ovmcli.ovmmain.main(sys.argv[1:])
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmmain.py", line 122, in main
    return ovm.ovmcli.runcmd(args)
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmcli.py", line 147, in runcmd
    return method(options)
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmcli.py", line 1578, in do_vm_ls
    result.append((serverpool._serverPoolName, vm._imgName))
AttributeError: 'NoneType' object has no attribute '_imgName'

Then I tried list VMs by server pool:

[root@ovmm_test ~]# ovm -uadmin -ppassword vm ls -s Pool1_test
Name                 Size(MB) Mem   VCPUs Status  Server_Pool
testvm1              17750    8196  4     Running Pool1_test
testvm2               50518    8196  4     Running Pool1_test
testvm3          19546    8192  2     Running Pool1_test
testvm4          50518    20929 4     Running Pool1_test
testvm5          19546    8192  2     Running Pool1_test
[root@ovmm_test ~]# ovm -uadmin -ppassword vm ls -s Pool1_test_A
Traceback (most recent call last):
  File "/usr/bin/ovm", line 43, in ?
    ovmcli.ovmmain.main(sys.argv[1:])
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmmain.py", line 122, in main
    return ovm.ovmcli.runcmd(args)
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmcli.py", line 147, in runcmd
    return method(options)
  File "/usr/lib/python2.4/site-packages/ovmcli/ovmcli.py", line 1578, in do_vm_ls
    result.append((serverpool._serverPoolName, vm._imgName))
AttributeError: 'NoneType' object has no attribute '_imgName'

One pool was working and the other was not, so the problematic VMs must reside in pool Pool1_test_A.

Another symptom was that, although ovmcli wouldn't work, the OVMM GUI worked as expected and returned all the VMs.

As ovmcli reads entries from the Oracle DB (SID XE) on the OVMM host, the issue was likely caused by an inconsistency between that DB and the OVS agent DB.

I got the list of all VMs in the problematic server pool from the OVMM GUI, and then ran the following query to get all entries in the DB:

select IMG_NAME from OVS_VM_IMG where SITE_ID=110 and length(IMG_NAME)>50; #Pool1_test_A had SITE_ID 110, taken from table OVS_SITE. length() is used because in the problematic server pool all VMs should have an IMG_NAME longer than 50 characters; anything shorter is a VM template, which should have no issue

Comparing the output from the OVMM GUI with the OVM DB, I found some entries that existed only in the DB. All of these entries had "Status" set to "Creating", so they could be listed with:

select IMG_NAME from OVS_VM_IMG where STATUS='Creating';

Then I removed these stale entries:

create table OVS_VM_IMG_bak20141106 as select * from OVS_VM_IMG; #back up the table first
delete from OVS_VM_IMG where STATUS='Creating'; #if this step fails because of foreign keys or other reasons, you can instead rename OVS_VM_IMG away (alter table TBL1 rename to TBL2; drop table TBL1), remove the stale entries from the backup table OVS_VM_IMG_bak20141106, and then rename the backup table to OVS_VM_IMG

After this, the issue got resolved.
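For clarity, that fallback swap would look roughly like this in sqlplus (a sketch only, using the table names from this post; whether the drop succeeds depends on the constraints in your OVM DB):

SQL> alter table OVS_VM_IMG rename to OVS_VM_IMG_old;
SQL> delete from OVS_VM_IMG_bak20141106 where STATUS='Creating';
SQL> alter table OVS_VM_IMG_bak20141106 rename to OVS_VM_IMG;
SQL> drop table OVS_VM_IMG_old;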

PS:

After you remove the entries with STATUS "Creating", if you find more entries of this kind appearing in the OVM DB, it may be caused by broken VM templates or a corrupted DB table. In that case, you'll need to recover OVMM by rolling back to a previous backup, and then re-import the VM templates, VM images, etc.

resolved – nfs share chown: changing ownership of ‘blahblah': Invalid argument

October 28th, 2014 No comments

Today I encountered the following error when trying to change ownership of some files:

[root@test webdav]# chown -R apache:apache ./bigfiles/
chown: changing ownership of `./bigfiles/opcmessaging': Invalid argument
chown: changing ownership of `./bigfiles/': Invalid argument

This host was running CentOS 6.2, and on this version of the OS, NFSv4 is used by default:

[root@test webdav]# cat /proc/mounts |grep u01
nas-server.example.com:/export/share01/ /u01 nfs4 rw,relatime,vers=4,rsize=32768,wsize=32768

However, the NFS server did not support NFSv4 well, so I modified the mount entry to force NFSv3:

nas-server.example.com:/export/share01/ /u01 nfs rsize=32768,wsize=32768,hard,nolock,timeo=14,noacl,intr,mountvers=3,nfsvers=3

After umount/mount, the issue was resolved!
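The remount itself is just the following (assuming the new options are already in /etc/fstab for /u01), and /proc/mounts should afterwards report type nfs with vers=3:

umount /u01
mount /u01
cat /proc/mounts | grep u01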

PS:

If the NAS server is a Sun ZFS appliance, then the following settings should also be checked, or the issue may occur even on CentOS/RedHat Linux 5.x:

protocol_anonymous_user_mapping

root_directory_access

Categories: Hardware, IT Architecture, Linux, NAS, Storage, Systems Tags:

Sun ZFS storage stuck due to incorrect LACP configuration

October 24th, 2014 No comments

Today we hit an issue with a Sun ZFS Storage 7320. NFS shares provisioned from the ZFS appliance were not responding to requests; even a "df -h" would hang for a very long time. When we checked from the ZFS storage side, we found the following statistics:

1-high-io-before-stuck

 

While we were still checking for the traffic source, the ZFS appliance went back to normal by itself:

2-recovered-by-itself

 

As we had configured LACP on this ZFS appliance just the day before, we suspected the issue was caused by an incorrect network configuration. Here's the network config:

1-wrong-configuration

For "Policy", we should match with switch setup to even balance incoming/outgoing data flow.  Otherwise, we might experience uneven load balance. Our switch was set to L3, so L3 should be ok. We'll get better load spreading if the policy is L3+4 if the switch supports it.  With L3, all connections from any one IP will only use a single member of the aggregation.  With L3+4, it will load spread by UDP or TCP port too. More is here.

For "Mode", it should be set according to switch. If the switch is "passive" mode then server/storage needs to be on "active" mode, and vice versa.

For "Timer", it's regarding how often to check LACP status.

After checking the switch settings, we found that the switch was in "active" mode; as the ZFS appliance was also in "active" mode, that was the culprit. So we changed the appliance to the following setting:

2-right-configuration

After this, we kept the appliance under observation for a while, and the ZFS appliance is now operating normally.

PS:

You should also check the disk operations; if there are timeout errors on the disks, try replacing them. Sometimes a single disk may hang the SCSI bus. Ideally the system should fail the disk automatically, but that didn't happen here, so you may need to fail the disk manually to resolve the issue.

The ZFS Storage Appliance core analysis (previous note) confirmed that the disk was the cause of the issue.

It was hanging up communication on the SCSI bus, but once it was removed the issue was resolved.

It is uncommon for a single disk to hang up the bus; however, since the disks share the SCSI path (each drive does not have its own dedicated cabling and controller), it is sometimes seen.

You can check the ZFS appliance version and last boot time (and hence its uptime) by running "version show" in the console:

zfs-test:configuration> version show
Appliance Name: zfs-test
Appliance Product: Sun ZFS Storage 7320
Appliance Type: Sun ZFS Storage 7320
Appliance Version: 2013.06.05.2.2,1-1.1
First Installed: Sun Jul 22 2012 10:02:24 GMT+0000 (UTC)
Last Updated: Sun Oct 26 2014 22:11:03 GMT+0000 (UTC)
Last Booted: Wed Dec 10 2014 10:03:08 GMT+0000 (UTC)
Appliance Serial Number: d043d335-ae15-4350-ca35-b05ba2749c94
Chassis Serial Number: 1225FMM0GE
Software Part Number: Oracle 000-0000-00
Vendor Product ID: urn:uuid:418bff40-b518-11de-9e65-080020a9ed93
Browser Name: aksh 1.0
Browser Details: aksh
HTTP Server: Apache/2.2.24 (Unix)
SSL Version: OpenSSL 1.0.0k 5 Feb 2013
Appliance Kit: ak/SUNW,maguro_plus@2013.06.05.2.2,1-1.1
Operating System: SunOS 5.11 ak/generic@2013.06.05.2.2,1-1.1 64-bit
BIOS: American Megatrends Inc. 08080102 05/23/2011
Service Processor: 3.0.16.10

Categories: Hardware, NAS, Storage Tags:

resolved – auditd STDERR: Error deleting rule Error sending enable request (Operation not permitted)

September 19th, 2014 No comments

Today when I tried to restart auditd, the following error message appeared:

[2014-09-18T19:26:41+00:00] ERROR: service[auditd] (cookbook-devops-kernelaudit::default line 14) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of /sbin/service auditd restart ----
STDOUT: Stopping auditd: [  OK  ]
Starting auditd: [FAILED]
STDERR: Error deleting rule (Operation not permitted)
Error sending enable request (Operation not permitted)
---- End output of /sbin/service auditd restart ----
Ran /sbin/service auditd restart returned 1

After some reading of the auditd man page, I realized that when the audit "enabled" flag is set to 2 (locked), any attempt to change the configuration is audited and denied. That is likely the reason for "STDERR: Error deleting rule (Operation not permitted)" and "Error sending enable request (Operation not permitted)". Here's the relevant part from the man page of auditctl:

-e [0..2] Set enabled flag. When 0 is passed, this can be used to temporarily disable auditing. When 1 is passed as an argument, it will enable auditing. To lock the audit configuration so that it can't be changed, pass a 2 as the argument. Locking the configuration is intended to be the last command in audit.rules for anyone wishing this feature to be active. Any attempt to change the configuration in this mode will be audited and denied. The configuration can only be changed by rebooting the machine.

You can run auditctl -s to check the current setting:

[root@centos-doxer ~]# auditctl -s
AUDIT_STATUS: enabled=1 flag=1 pid=3154 rate_limit=0 backlog_limit=320 lost=0 backlog=0

And you can run auditctl -e <0|1|2> to change this flag on the fly, or add -e <0|1|2> to /etc/audit/audit.rules. Please note that once the configuration is locked (enabled=2), a reboot is required for the change to take effect.
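So if auditd fails to restart because the configuration is locked, the recovery looks roughly like this (a sketch; the rules file path may differ on your distribution):

grep '^-e' /etc/audit/audit.rules #check whether the config is locked with -e 2
sed -i 's/^-e 2/-e 1/' /etc/audit/audit.rules #keep auditing enabled but unlocked
reboot #required, as a locked configuration can only be changed by rebooting
service auditd status #verify after the reboot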

PS:

Here's more about Linux audit.

resolved – Permission denied even after chmod 777 world readable writable

September 19th, 2014 No comments

Several team members asked me about this: when they wanted to change into certain directories or read certain files, the system reported "Permission denied". Even after setting the files world readable/writable (chmod 777), the error was still there:

-bash-3.2$ cd /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs
-bash: cd: /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs: Permission denied

-bash-3.2$ cat /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out
cat: /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out: Permission denied

-bash-3.2$ ls -l /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out
-rwxrwxrwx 1 oracle oinstall 1100961066 Sep 19 07:37 /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out

In summary, if you want to read a file (e.g. wls_sdi1.out) under some directory (e.g. /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs), then besides the "read bit" being set on the file itself (chmod +r wls_sdi1.out), every parent directory of that file (/u01, /u01/local, ..., /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs) needs the "execute bit" (search permission) set; the "read bit" on a directory is only needed if you also want to list its contents. You can check each directory with ls -ld <dir name>:

chmod +r wls_sdi1.out #first set the "read bit" on the file itself
chmod +rx /u01; chmod +rx /u01/local; <...skipped...> chmod +rx /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs #then set the "read bit" & "execute bit" on all parent directories
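Rather than running chmod on each parent directory by hand, a small shell loop can walk up the tree (a sketch; set the starting directory to your own path):

d=/u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs
while [ "$d" != "/" ]; do
    chmod o+rx "$d" #grant read & execute to others on this component
    d=$(dirname "$d") #move one level up towards /
done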

Finally, if you can log on as the file owner, everything will work smoothly. /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out is owned by the oracle user, so you can log on as oracle and perform the operations from there.
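Also, to quickly spot which component along a path is blocking access, the namei utility (from util-linux) prints the owner and permissions of every path component:

namei -l /u01/local/config/m_domains/tasdc1_domain/servers/wls_sdi1/logs/wls_sdi1.out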

Categories: IT Architecture, Kernel, Linux, Systems, Unix Tags: