Author Archive

resolved – /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

April 1st, 2014 No comments

When I ran a perl command today, I hit the problem below:

[root@test01 bin]# /usr/local/bin/perl5.8
-bash: /usr/local/bin/perl5.8: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory

Now let's check which package /lib/ld-linux.so.2 belongs to on a healthy Linux box:

[root@test02 ~]# rpm -qf /lib/ld-linux.so.2
glibc-2.5-118.el5_10.2

So here's the resolution to the issue: /lib/ld-linux.so.2 is the 32-bit dynamic linker, so the 32-bit glibc packages need to be installed as well:

[root@test01 bin]# yum install -y glibc.x86_64 glibc.i686 glibc-devel.i686 glibc-devel.x86_64 glibc-headers.x86_64
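After installing, you can verify the 32-bit loader is in place before re-running the binary. A quick check (the last command assumes the same perl path as above):

ls -l /lib/ld-linux.so.2 #the 32-bit dynamic linker should now exist
rpm -qf /lib/ld-linux.so.2 #and should be owned by glibc i686
file /usr/local/bin/perl5.8 #should report a 32-bit ELF executable, which is why it needs this interpreter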

Categories: IT Architecture, Kernel, Linux, Systems Tags:

resolved – sudo: sorry, you must have a tty to run sudo

April 1st, 2014 3 comments

The error message below may occur when you run sudo <command>:

sudo: sorry, you must have a tty to run sudo

To resolve this, you may comment out "Defaults requiretty" in /etc/sudoers (edit it by running visudo). Here is more info about this method.

However, sometimes it's not convenient or even not possible to modify /etc/sudoers, then you can consider the following:

echo -e "<password>\n"|sudo -S <sudo command>

For -S parameter of sudo, you may refer to sudo man page:

-S  The -S (stdin) option causes sudo to read the password from the standard input instead of the terminal device. The password must be followed by a newline character.

So here -S bypasses the tty (terminal device) and reads the password from standard input. And by this, we can pipe the password to sudo.
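If you cannot touch /etc/sudoers and don't want the password on the command line either, another workaround is to allocate a pseudo-terminal on the ssh side, since requiretty only complains when there is no tty at all:

ssh -t user@remotehost "sudo <command>" #-t forces pseudo-terminal allocation, so sudo sees a tty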

Resolved – print() on closed filehandle $fh at ./perl.pl line 6.

March 19th, 2014 No comments

You may find that print sometimes won't work as expected in perl. For example:

[root@centos-doxer test]# cat perl.pl
#!/usr/bin/perl
use warnings;
open($fh,"test.txt");
select $fh;
close $fh;
print "test";

You may expect "test" to be printed, but instead you get this error message:

print() on closed filehandle $fh at ./perl.pl line 6.

So how did this happen? Please see my explanation:

[root@centos-doxer test]# cat perl.pl
#!/usr/bin/perl
use warnings;
open($fh,"test.txt");
select $fh;
close $fh; #here you closed the $fh filehandle, but print still writes to the selected (now closed) handle; you should reset the default filehandle to STDOUT
print "test";

Now here's the updated script:

#!/usr/bin/perl
use warnings;
open($fh,"test.txt");
select $fh;
close $fh;
select STDOUT;
print "test";

This way, you'll get "test" as expected!

 

Categories: IT Architecture, Perl, Programming Tags:

set vnc not asking for OS account password

March 18th, 2014 No comments

As you may know, vncpasswd (part of the vnc-server package) is used to set the password users enter when connecting to VNC with a VNC client (such as TightVNC). When you connect to vnc-server, it'll ask for that password:

(screenshot: vnc-0)

After you connect to the host using VNC, you may also find that the remote server asks again for the OS password (this is the one set by passwd):

(screenshot: vnc-01)

In some cases, you may not want this second prompt. So here's the way to cancel this behavior:

(screenshots: vnc-1, vnc-2)

 

 

Categories: IT Architecture, Linux, Systems Tags: ,

stuck in PXE-E51: No DHCP or proxyDHCP offers were received, PXE-M0F: Exiting Intel Boot Agent, Network boot canceled by keystroke

March 17th, 2014 No comments

If you installed your OS and tried booting it up but got stuck with the following messages:

(screenshot: stuck_pxe)

Then one possibility is that the configuration of your host's storage array is wrong. For instance, it should be JBOD but you had configured it as RAID6.

Please note that this is only one possibility for this error; you may search for the PXE error codes you encountered for more details.

PS:

  • Sometimes, DHCP snooping may prevent PXE from functioning; you can read more at http://en.wikipedia.org/wiki/DHCP_snooping. A packet capture like the one shown after this list can confirm whether DHCP offers reach the host at all.
  • STP (Spanning Tree Protocol) makes each port wait up to 50 seconds before data is allowed to be sent on the port. This delay in turn can cause problems with some applications/protocols (PXE, Bootworks, etc.). To alleviate the problem, PortFast was implemented on Cisco devices; the terminology might differ between vendor devices. You can read more at http://www.symantec.com/business/support/index?page=content&id=HOWTO6019
  • ARP caching http://www.networkers-online.com/blog/2009/02/arp-caching-and-timeout/
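Since PXE-E51 means no DHCP or proxyDHCP offers arrived at all, one quick check is to watch for DHCP traffic from another box on the same VLAN (or on the DHCP server itself) while the failing host PXE-boots. A minimal sketch, assuming eth0 faces that network:

tcpdump -i eth0 -nn -e -v udp port 67 or udp port 68 #DHCP uses UDP 67 (server) and 68 (client); -e shows MAC addresses

If you see a DISCOVER from the client's MAC but no OFFER coming back, suspect the DHCP server/relay or the snooping configuration; no DISCOVER at all points at the switch port or cabling.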

Oracle BI Publisher reports – send mail when filesystems getting full

March 17th, 2014 No comments

Let's assume you have one Oracle BI Publisher report for filesystem checking, and now you want to write a script that checks that report page and sends mail to the system admins when filesystems are getting full. The default output of an Oracle BI Publisher report needs javascript to render, and as you may know wget/curl cannot execute javascript. So after logging on, the next step is to find the HTML version's URL of that report to use in your script (the HTML page also has all records, while the javascript one has only part of them):

(screenshot: BI_report_login)

(screenshot: BI_export_html)

 

Let's assume that the HTML version's URL is "http://www.example.com:9703/report.html", and its display was like the following:

(screenshot: bi report)

Then here goes the script that checks this page for hosts that have less than 10% available space and sends mail to the system admins:

#!/usr/bin/perl
use HTML::Strip;
system("rm -f spacereport.html");
system("wget -q --no-proxy --no-check-certificate --post-data 'id=admin&passwd=password' 'http://www.example.com:9703/report.html' -O spacereport.html");
open($fh,"spacereport.html");

#or just @spacereport=<$fh>;
foreach(<$fh>){
push(@spacereport,$_);
}

#change array to hash
$index=0;
map {$pos{$index++}=$_} @spacereport;

#get location of <table> and </table>
#sort numerically ascending
for $char (sort {$a<=>$b} (keys %pos))
{
if($pos{$char} =~ /<table class="c27">/)
{
$table_start=$char;
}

if($pos{$char} =~ /<\/table>/)
{
$table_end=$char;
}

}

#get contents between <table> and </table>
for($i=$table_start;$i<=$table_end;$i++){
push(@table_array,$spacereport[$i]);
}
$table_htmlstr=join("",@table_array);

#get clear text between <table> and </table>
my $hs=HTML::Strip->new();
my $clean_text = $hs->parse($table_htmlstr);
$hs->eof;

@array_filtered=split("\n",$clean_text);

#remove empty array element
@array_filtered=grep { !/^\s+$/ } @array_filtered;
system("rm -f space_mail_warning.txt");
open($fh_mail_warning,">","space_mail_warning.txt");
select $fh_mail_warning;
for($j=4;$j<=$#array_filtered;$j=$j+4){
#put lines that has free space lower than 10% to space_mail_warning.txt
if($array_filtered[$j+2] <= 10){
print "Host: ".$array_filtered[$j]."\n";
print "Part: ".$array_filtered[$j+1]."\n";
print "Free(%): ".$array_filtered[$j+2]."\n";
print "Free(GB): ".$array_filtered[$j+3]."\n";
print "============\n\n";
}
}
close $fh_mail_warning;

system("rm -f space_mail_info.txt");
open($fh_mail_info,">","space_mail_info.txt");
select $fh_mail_info;
for($j=4;$j<=$#array_filtered;$j=$j+4){
#put lines that has free space lower than 15% to space_mail_info.txt
if($array_filtered[$j+2] <= 15){
print "Host: ".$array_filtered[$j]."\n";
print "Part: ".$array_filtered[$j+1]."\n";
print "Free(%): ".$array_filtered[$j+2]."\n";
print "Free(GB): ".$array_filtered[$j+3]."\n";
print "============\n\n";
}
}
close $fh_mail_info;

#send mail
#select STDOUT;
if(-s "space_mail_warning.txt"){
system('cat space_mail_warning.txt | /bin/mailx -s "Space Warning - please work with component owners to free space" [email protected]');
} elsif(-s "space_mail_info.txt"){
system('cat space_mail_info.txt | /bin/mailx -s "Space Info - Space checking mail" [email protected]');
}
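To run this check regularly, you can put the script into cron. A minimal sketch, assuming it is saved executable as /root/bin/space_check.pl (a hypothetical path):

# check space and mail admins every day at 07:30
30 7 * * * /root/bin/space_check.pl >/dev/null 2>&1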

Categories: IT Architecture, Perl, Programming Tags:

wget and curl tips

March 14th, 2014 No comments

Imagine you want to download all files under http://www.example.com/2013/downloads, but no files under http://www.example.com/2013 outside of the directory 'downloads'. Then you can do this:

wget -r --level 100 -nd --no-proxy --no-parent --reject "index.htm*" --reject "*gif" 'http://www.example.com/2013/downloads/' #--level 100 is large enough, as I've seen no site has more than 100 levels of sub-directories so far.

wget -p -k --no-proxy --no-check-certificate --post-data 'id=username&passwd=password' <url> -O output.html #log in via POST; -p fetches page requisites, -k converts links for local viewing

wget --no-proxy --no-check-certificate --save-cookies cookies.txt <url> #save cookies to a file for later reuse

wget --no-proxy --no-check-certificate --load-cookies cookies.txt <url> #reuse the saved cookies

curl -k -u 'username:password' <url> #HTTP basic authentication; -k skips certificate verification

curl -k -L -d id=username -d passwd=password <url> #POST login fields; -L follows redirects

curl --data "loginform:id=username&loginform:passwd=password" -k -L <url> #POST a pre-encoded form body

Here's one curl example to get SSL certs info on LTM:

#!/bin/bash
path="/var/tmp"
path_root="/var/tmp"

agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2)"

curl -v -L -k -A "$agent" -c ${path}/cookie "https://ltm-url/tmui/login.jsp?msgcode=1&"

curl -v -L -k -A "$agent" -b ${path}/cookie -c ${path}/cookie -e "https://ltm-url/tmui/login.jsp?msgcode=1&" -d "username=myusername&passwd=mypassword" "https://ltm-url/tmui/logmein.html?msgcode=1&"

curl -v -L -k -A "$agent" -b ${path}/cookie -c ${path}/cookie -o ${path_root}/certs-env.html "https://ltm-url/tmui/Control/jspmap/tmui/locallb/ssl_certificate/list.jsp?&startListIndex=0&showAll=true"

Now you can check /var/tmp/certs-env.html for the SSL cert info of the BIG-IP VIPs.

resolved – ssh Read from socket failed: Connection reset by peer and Write failed: Broken pipe

March 13th, 2014 No comments

If you meet the following errors when you ssh to a Linux box:

Read from socket failed: Connection reset by peer

Write failed: Broken pipe

Then one possibility is that the Linux box's filesystem is corrupted. In my case there was also output like this on stdout:

EXT3-fs error ext3_lookup: deleted inode referenced

To resolve this, you need to boot Linux into single user mode and run fsck -y <filesystem>. You can get the corrupted filesystem names from the boot messages:

[/sbin/fsck.ext3 (1) -- /usr] fsck.ext3 -a /dev/xvda2
/usr contains a file system with errors, check forced.
/usr: Directory inode 378101, block 0, offset 0: directory corrupted

/usr: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)

[/sbin/fsck.ext3 (1) -- /oem] fsck.ext3 -a /dev/xvda5
/oem: recovering journal
/oem: clean, 8253/1048576 files, 202701/1048233 blocks
[/sbin/fsck.ext3 (1) -- /u01] fsck.ext3 -a /dev/xvdb
u01: clean, 36575/14548992 files, 2122736/29081600 blocks
[FAILED]

So in this case, I did fsck -y /dev/xvda2 && fsck -y /dev/xvda5, then rebooted the host, and everything went well.
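For reference, the whole recovery sequence looks roughly like below, assuming the corrupted devices are the ones reported during boot:

#boot into single user mode first, e.g. by appending 'single' to the kernel line in grub
fsck -y /dev/xvda2 #answer yes to all repair prompts
fsck -y /dev/xvda5
reboot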

PS:

If two copies of a VM are booted on two hypervisors and share the same backing filesystem (e.g. over NFS), then after you fsck -y the FS and boot one copy, the FS will soon corrupt again, because the other running copy is still using it. So you need to first make sure that only one copy of the VM is running on the hypervisors of the same server pool.

Categories: IT Architecture, Kernel, Linux, Systems Tags:

tcpdump & wireshark tips

March 13th, 2014 No comments

tcpdump [ -AdDefIKlLnNOpqRStuUvxX ] [ -B buffer_size ] [ -c count ]

[ -C file_size ] [ -G rotate_seconds ] [ -F file ]
[ -i interface ] [ -m module ] [ -M secret ]
[ -r file ] [ -s snaplen ] [ -T type ] [ -w file ]
[ -W filecount ]
[ -E spi@ipaddr algo:secret,... ]
[ -y datalinktype ] [ -z postrotate-command ] [ -Z user ] [ expression ]

#general format of a tcp protocol line

src > dst: flags data-seqno ack window urgent options
Src and dst are the source and destination IP addresses and ports.
Flags are some combination of S (SYN), F (FIN), P (PUSH), R (RST), W (ECN CWR) or E (ECN-Echo), or a single '.'(means no flags were set)
Data-seqno describes the portion of sequence space covered by the data in this packet.
Ack is sequence number of the next data expected the other direction on this connection.
Window is the number of bytes of receive buffer space available the other direction on this connection.
Urg indicates there is 'urgent' data in the packet.
Options are tcp options enclosed in angle brackets (e.g., <mss 1024>).

tcpdump -D #list of the network interfaces available
tcpdump -e #Print the link-level header on each dump line
tcpdump -S #Print absolute, rather than relative, TCP sequence numbers
tcpdump -s <snaplen> #Snarf snaplen bytes of data from each packet rather than the default of 65535 bytes
tcpdump -i eth0 -S -nn -XX vlan
tcpdump -i eth0 -S -nn -XX arp
tcpdump -i bond0 -S -nn -vvv udp dst port 53
tcpdump -i bond0 -S -nn -vvv host testhost
tcpdump -nn -S -vvv "dst host host1.example.com and (dst port 1521 or dst port 6200)"

tcpdump -nn -S udp dst port 111 #note that telnet is based on the tcp protocol, NOT udp. So if you want to test a UDP connection (udp is connection-less), you must start up the app that listens on the UDP port, then use tcpdump to verify; see the example below.

tcpdump -nn -S udp dst portrange 1-1023
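Here's a minimal sketch of such a UDP test, assuming nc (netcat) is available on the sending box; testhost and port 111 are just placeholders:

tcpdump -nn -S -i eth0 udp dst port 111 #on the receiving box: watch for the datagram arriving

echo "ping" | nc -u -w1 testhost 111 #on the sending box: fire one UDP datagram at the target

If the capture shows the packet arriving, the network path is fine; getting no reply back does not mean failure, since UDP is connectionless.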

Wireshark Capture Filters (in Capture -> Options)

Wireshark Display Filters (in toolbar)

 

EVENT DIAGRAM

Host A sends a TCP SYNchronize packet to Host B
Host B receives A's SYN
Host B sends a SYNchronize-ACKnowledgement
Host A receives B's SYN-ACK
Host A sends ACKnowledge
Host B receives ACK.
TCP socket connection is ESTABLISHED.

(image: 3-way-handshake)
TCP Three Way Handshake (SYN, SYN-ACK, ACK)

(image: TCP-CLOSE_WAIT)

 

The upper part shows the states on the end-point initiating the termination.

The lower part the states on the other end-point.

So the initiating end-point (i.e. the client) sends a termination request to the server and waits for an acknowledgement in state FIN-WAIT-1. The server sends an acknowledgement and goes in state CLOSE_WAIT. The client goes into FIN-WAIT-2 when the acknowledgement is received and waits for an active close. When the server actively sends its own termination request, it goes into LAST-ACK and waits for an acknowledgement from the client. When the client receives the termination request from the server, it sends an acknowledgement and goes into TIME_WAIT and after some time into CLOSED. The server goes into CLOSED state once it receives the acknowledgement from the client.

PS:

You can refer to this article for a detailed explanation of the TCP three-way handshake establishing/terminating a connection. For the tcpdump view of it, you can check below:

[c9sysdba@host2 ~]# telnet host1 14100
Trying 10.240.249.139...
Connected to host1.us.oracle.com (10.240.249.139).
Escape character is '^]'.
^]
telnet> quit
Connection closed.

[root@host1 ~]# tcpdump -vvv -S host host2
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
03:16:39.188951 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: TCP (6), length: 60) host1.us.oracle.com.14100 > host2.us.oracle.com.18890: S, cksum 0xa806 (correct), 3445765853:3445765853(0) ack 3946095098 win 5792 <mss 1460,sackOK,timestamp 854077220 860674218,nop,wscale 7> #2. host1 acks the SYN packet from host2, adding 1 to its sequence number to identify this connection (ack 3946095098), and sends its own SYN (3445765853).
03:16:41.233807 IP (tos 0x0, ttl 64, id 6650, offset 0, flags [DF], proto: TCP (6), length: 52) host1.us.oracle.com.14100 > host2.us.oracle.com.18890: F, cksum 0xdd48 (correct), 3445765854:3445765854(0) ack 3946095099 win 46 <nop,nop,timestamp 854079265 860676263> #5. host1 acks the FIN (ack 3946095099), and then sends a FIN of its own just as host2 did (3445765854 unchanged).

[c9sysdba@host2 ~]# tcpdump -vvv -S host host1
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
03:16:39.188628 IP (tos 0x10, ttl 64, id 31059, offset 0, flags [DF], proto: TCP (6), length: 60) host2.us.oracle.com.18890 > host1.us.oracle.com.14100: S, cksum 0x265b (correct), 3946095097:3946095097(0) win 5792 <mss 1460,sackOK,timestamp 860674218 854045985,nop,wscale 7> #1. host2 sends a SYN packet to host1 (seq 3946095097)
03:16:39.188803 IP (tos 0x10, ttl 64, id 31060, offset 0, flags [DF], proto: TCP (6), length: 52) host2.us.oracle.com.18890 > host1.us.oracle.com.14100: ., cksum 0xed44 (correct), 3946095098:3946095098(0) ack 3445765854 win 46 <nop,nop,timestamp 860674218 854077220> #3. host2 acks the SYN sent by host1, adding 1 to identify this connection. The tcp connection is now established (3946095098 unchanged, ack 3445765854).
03:16:41.233397 IP (tos 0x10, ttl 64, id 31061, offset 0, flags [DF], proto: TCP (6), length: 52) host2.us.oracle.com.18890 > host1.us.oracle.com.14100: F, cksum 0xe546 (correct), 3946095098:3946095098(0) ack 3445765854 win 46 <nop,nop,timestamp 860676263 854077220> #4. host2 sends a FIN with an ack; the FIN informs host1 that no more data needs to be sent (3946095098 unchanged), and the ack identifies the previously established connection (3445765854 unchanged)
03:16:41.233633 IP (tos 0x10, ttl 64, id 31062, offset 0, flags [DF], proto: TCP (6), length: 52) host2.us.oracle.com.18890 > host1.us.oracle.com.14100: ., cksum 0xdd48 (correct), 3946095099:3946095099(0) ack 3445765855 win 46 <nop,nop,timestamp 860676263 854079265> #6. host2 acks host1's FIN (ack 3445765855), with no flags set ('.'), to identify the connection (3946095099 unchanged).

psftp through a proxy

March 5th, 2014 No comments

As you may know, we can set a proxy in PuTTY when ssh'ing to a remote host, as shown below:

(screenshot: putty_proxy)

And if you want to copy files from a remote site to your local box, you can use PuTTY's psftp.exe. There are many options for psftp.exe:

C:\Users\test>d:\PuTTY\psftp.exe -h
PuTTY Secure File Transfer (SFTP) client
Release 0.62
Usage: psftp [options] [user@]host
Options:
-V print version information and exit
-pgpfp print PGP key fingerprints and exit
-b file use specified batchfile
-bc output batchfile commands
-be don't stop batchfile processing if errors
-v show verbose messages
-load sessname Load settings from saved session
-l user connect with specified username
-P port connect to specified port
-pw passw login with specified password
-1 -2 force use of particular SSH protocol version
-4 -6 force use of IPv4 or IPv6
-C enable compression
-i key private key file for authentication
-noagent disable use of Pageant
-agent enable use of Pageant
-batch disable all interactive prompts

Although there's a proxy setting option in putty.exe, there's no proxy setting for psftp.exe! So what should you do if you want to copy files back to your local box, but a firewall blocks you from doing this directly and you must use a proxy?

As you may notice, there's a "-load sessname" option in psftp.exe:

-load sessname Load settings from saved session

This option means that if you have a session saved with putty.exe, you can use psftp.exe -load <session name> to copy files from the remote site through that session's settings. For example, suppose you saved a session named mysession in putty.exe with the proxy configured; then you can use "psftp.exe -load mysession" to copy files from the remote site (no need for username/password, as you already entered those in the putty.exe session):

C:\Users\test>d:\PuTTY\psftp.exe -load mysession
Using username "root".
Remote working directory is /root
psftp> ls
Listing directory /root
drwx------ 3 ec2-user ec2-user 4096 Mar 4 09:27 .
drwxr-xr-x 3 root root 4096 Dec 10 23:47 ..
-rw------- 1 ec2-user ec2-user 388 Mar 5 05:07 .bash_history
-rw-r--r-- 1 ec2-user ec2-user 18 Sep 4 18:23 .bash_logout
-rw-r--r-- 1 ec2-user ec2-user 176 Sep 4 18:23 .bash_profile
-rw-r--r-- 1 ec2-user ec2-user 124 Sep 4 18:23 .bashrc
drwx------ 2 ec2-user ec2-user 4096 Mar 4 09:21 .ssh
psftp> help
! run a local command
bye finish your SFTP session
cd change your remote working directory
chmod change file permissions and modes
close finish your SFTP session but do not quit PSFTP
del delete files on the remote server
dir list remote files
exit finish your SFTP session
get download a file from the server to your local machine
help give help
lcd change local working directory
lpwd print local working directory
ls list remote files
mget download multiple files at once
mkdir create directories on the remote server
mput upload multiple files at once
mv move or rename file(s) on the remote server
open connect to a host
put upload a file from your local machine to the server
pwd print your remote working directory
quit finish your SFTP session
reget continue downloading files
ren move or rename file(s) on the remote server
reput continue uploading files
rm delete files on the remote server
rmdir remove directories on the remote server
psftp>

Now you can get/put files as usual.
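You can also combine -load with a batch file to make the proxied transfer non-interactive. A minimal sketch, reusing the saved mysession and a batch script like the one shown in the PS below:

d:\PuTTY\psftp.exe -load mysession -b d:\PuTTY\script.scr -bc -be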

PS:

If you do not need proxy connecting to remote site, then you can use psftp.exe CLI to get remote files directly. For example:

d:\PuTTY\psftp.exe [email protected] -i d:\PuTTY\aws.ppk -b d:\PuTTY\script.scr -bc -be -v

And in d:\PuTTY\script.scr is script for put/get files:

cd /backup
lcd c:\
mget *.tar.gz
close

Categories: IT Architecture, Linux, Systems Tags: ,

checking MTU or Jumbo Frame settings with ping

February 14th, 2014 No comments

You may set your Linux box's MTU to a jumbo-frame size of 9000 bytes or larger, but if the switch your box connects to does not have jumbo frames enabled, then your Linux box may meet problems when sending & receiving packets.

So how can we get an idea of whether jumbo frames are enabled on the switch or the Linux box?

Of course you can log on to the switch and check, but we can also verify this from the Linux box that connects to the switch.

On linux box, you can see the MTU settings of each interface using ifconfig:

[root@centos-doxer ~]# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 08:00:27:3F:C5:08
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:50502 errors:0 dropped:0 overruns:0 frame:0
TX packets:4579 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9835512 (9.3 MiB) TX bytes:1787223 (1.7 MiB)
Base address:0xd010 Memory:f0000000-f0020000

As stated above, the 9000 here doesn't mean jumbo frames actually work between your box and the switch. You can verify that with the command below:

[root@testbox ~]# ping -c 2 -M do -s 1472 testbox2
PING testbox2.example.com (192.168.29.184) 1472(1500) bytes of data. #so here 1500 bytes go through the network
1480 bytes from testbox2.example.com (192.168.29.184): icmp_seq=1 ttl=252 time=0.319 ms
1480 bytes from testbox2.example.com (192.168.29.184): icmp_seq=2 ttl=252 time=0.372 ms

--- testbox2.example.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.319/0.345/0.372/0.032 ms
[root@testbox ~]#
[root@testbox ~]#
[root@testbox ~]# ping -c 2 -M do -s 1473 testbox2
PING testbox2.example.com (192.168.29.184) 1473(1501) bytes of data. #so here 1501 bytes can not go through. From here we can see that MTU for this box is 1500, although ifconfig says it's 9000
From testbox.example.com (192.168.28.40) icmp_seq=1 Frag needed and DF set (mtu = 1500)
From testbox.example.com (192.168.28.40) icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- testbox2.example.com ping statistics ---
0 packets transmitted, 0 received, +2 errors
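By the same logic, if jumbo frames really were enabled end-to-end, a 9000-byte MTU should let an 8972-byte ICMP payload through (9000 minus 20 bytes of IP header and 8 bytes of ICMP header). A quick check against the same peer:

ping -c 2 -M do -s 8972 testbox2 #succeeds only if every hop supports 9000-byte frames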

Also, if your switch is a Cisco one, you can verify whether the switch port connecting the server has jumbo frames enabled by sniffing a CDP (Cisco Discovery Protocol) packet. Here's one example:

-bash-4.1# tcpdump -i eth0 -nn -v -c 1 ether[20:2] == 0x2000 #ether[20:2] == 0x2000 means capture only packets that have a 2 byte value of hex 2000 starting at byte 20
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
03:44:14.221022 CDPv2, ttl: 180s, checksum: 692 (unverified), length 287
Device-ID (0x01), length: 46 bytes: 'ucf-c1z3-swi-5k01b.ucf.oracle.com(SSI16010QJH)'
Address (0x02), length: 13 bytes: IPv4 (1) 192.168.0.242
Port-ID (0x03), length: 16 bytes: 'Ethernet111/1/12'
Capability (0x04), length: 4 bytes: (0x00000228): L2 Switch, IGMP snooping
Version String (0x05), length: 66 bytes:
Cisco Nexus Operating System (NX-OS) Software, Version 5.2(1)N1(4)
Platform (0x06), length: 11 bytes: 'N5K-C5548UP'
Native VLAN ID (0x0a), length: 2 bytes: 123
AVVID trust bitmap (0x12), length: 1 byte: 0x00
AVVID untrusted ports CoS (0x13), length: 1 byte: 0x00
Duplex (0x0b), length: 1 byte: full
MTU (0x11), length: 4 bytes: 1500 bytes #so here MTU size was set to 1500 bytes
System Name (0x14), length: 18 bytes: 'ucf-c1z3-swi-5k01b'
System Object ID (not decoded) (0x15), length: 14 bytes:
0x0000: 060c 2b06 0104 0109 0c03 0103 883c
Management Addresses (0x16), length: 13 bytes: IPv4 (1) 10.131.144.17
Physical Location (0x17), length: 13 bytes: 0x00/snmplocation
1 packets captured
1 packets received by filter
0 packets dropped by kernel
110 packets dropped by interface

PS:

  1. As for "-M do" parameter for ping, you may refer to man ping for more info. And as for DF(don't fragment) and Path MTU Discovery mentioned in the manpage, you may read more on http://en.wikipedia.org/wiki/Path_MTU_discovery and http://en.wikipedia.org/wiki/IP_fragmentation
  2. Here's more on tcpdump tips http://dazdaztech.wordpress.com/2013/05/17/using-tcpdump-to-see-cdp-or-lldp-packets/ and http://the-welters.com/professional/tcpdump.html
  3. Maximum packet size is the MTU plus the data-link header length. Packets are not always transmitted at the maximum packet size, as we can see from the output of iptraf -z eth0.
  4. Here's more about MTU:

The link layer, which is typically Ethernet, sends information into the network as a series of frames. Even though the layers above may have pieces of information much larger than the frame size, the link layer breaks everything up into frames (whose payload encloses an IP packet carrying TCP/UDP/ICMP) to send them over the network. This maximum size of data in a frame is known as the maximum transfer unit (MTU). You can use network configuration tools such as ip or ifconfig to set the MTU.

The size of the MTU has a direct impact on the efficiency of the network. Each frame in the link layer has a small header, so using a large MTU increases the ratio of user data to overhead (header). When using a large MTU, however, each frame of data has a higher chance of being corrupted or dropped. For clean physical links, a high MTU usually leads to better performance because it requires less overhead; for noisy links, however, a smaller MTU may actually enhance performance because less data has to be re-sent when a single frame is corrupted.

Here's one image of the layers of network frames:

(image: layers-of-network-frames)

 

Oracle VM operations – poweron, poweroff, status, stat -r

January 27th, 2014 No comments

Here's the script:

#!/usr/bin/perl
#1.OVM must be running before operations
#2.run ovm_vm_operation.pl status before running ovm_vm_operation.pl poweroff or poweron
use Net::SSH::Perl;
$host = $ARGV[0];
$operation = $ARGV[1];
$user = 'root';
$password = 'password';

if($host eq "help") {
print "$0 OVM-name status|poweron|poweroff|stat-r\n";
exit;
}

$ssh = Net::SSH::Perl->new($host);
$ssh->login($user,$password);

if($operation eq "status") {
($stdout,$stderr,$exit) = $ssh->cmd("ovm -uadmin -pwelcome1 vm ls|grep -v VM_test");
open($host_fd,'>',"/var/tmp/${host}.status");
select $host_fd;
print $stdout;
close $host_fd;
} elsif($operation eq "poweroff") {
open($poweroff_fd,'<',"/var/tmp/${host}.status");
foreach(<$poweroff_fd>){
if($_ =~ "Server_Pool|OVM|Powered") {
next;
}
if($_ =~ /(.*?)\s+([0-9]{1,})\s+([0-9]{1,})\s+([0-9]{1,})\s+([a-zA-Z]{1,})\s+(.*)/){
$ssh->cmd("ovm -uadmin -pwelcome1 vm poweroff -n $1 -s $6");
sleep 12;
}
}
} elsif($operation eq "poweron") {
open($poweron_fd,'<',"/var/tmp/${host}.status");
foreach(<$poweron_fd>){
if($_ =~ "Server_Pool|OVM|Running") {
next;
}
if($_ =~ /(.*?)\s+([0-9]{1,})\s+([0-9]{1,})\s+([0-9]{1,})\s+([a-zA-Z]{1,})\s+Off(.*)/){
$ssh->cmd("ovm -uadmin -pwelcome1 vm poweron -n $1 -s $6");
#print "ovm -uadmin -pwelcome1 vm poweron -n $1 -s $6";
sleep 20;
}
}
} elsif($operation eq "stat-r") {
open($poweroff_fd,'<',"/var/tmp/${host}.status");
foreach(<$poweroff_fd>){
if($_ =~ /(.*?)\s+([0-9]{1,})\s+([0-9]{1,})\s+([0-9]{1,})\s+(Shutting\sDown|Initializing)\s+(.*)/){
#print "ovm -uadmin -pwelcome1 vm stat -r -n $1 -s $6";
$ssh->cmd("ovm -uadmin -pwelcome1 vm stat -r -n $1 -s $6");
sleep 1;
}
}
}

You can use the following to make the script run in parallel:

for i in <all OVMs>;do (./ovm_vm_operation.pl $i status &);done
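If you keep the OVM server names in a file (a hypothetical /var/tmp/ovm_list.txt, one name per line), the same loop becomes:

for i in $(cat /var/tmp/ovm_list.txt);do (./ovm_vm_operation.pl $i status &);done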

avoid putty ssh connection sever or disconnect

January 17th, 2014 2 comments

After some time, an idle ssh session will disconnect itself. If you want to avoid this, you can try running the following command:

while [ 1 ];do echo hi;sleep 60;done &

This will print the message "hi" to standard output every 60 seconds, keeping the connection active.

PS:

You can also set some parameters in /etc/ssh/sshd_config (see the example below); you can refer to http://www.doxer.org/make-ssh-on-linux-not-to-disconnect-after-some-certain-time/
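For reference, the usual server-side keepalive knobs in /etc/ssh/sshd_config look like this (standard OpenSSH options; restart sshd after changing them):

# probe the client every 60 seconds, and drop the session only after 3 unanswered probes
ClientAliveInterval 60
ClientAliveCountMax 3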

“Include snapshots” made NFS shares from ZFS appliance shrink

January 17th, 2014 No comments

Today I met one weird issue when checking an NFS share mounted from a ZFS appliance. Space on that filesystem was getting low, so I removed some files, but what confused me was that the filesystem's reported size kept getting smaller as I deleted! Shouldn't the free space get larger while the size stays unchanged?

After some debugging, I found that this was caused by the share's "Include snapshots" setting on the ZFS appliance. When I unchecked "Include snapshots", the issue was gone!

(screenshot: zfs-appliance)

Categories: Hardware, NAS, Storage Tags:

resolved – ESXi Failed to lock the file

January 13th, 2014 No comments

When I was powering on a VM in ESXi, an error occurred:

An error was received from the ESX host while powering on VM doxer-test.
Cannot open the disk '/vmfs/volumes/4726d591-9c3bdf6c/doxer-test/doxer-test_1.vmdk' or one of the snapshot disks it depends on.
Failed to lock the file

And also:

unable to access file since it is locked

This apparently was caused by some storage issue. I first googled around and found most of the posts telling stories about ESXi's locking mechanism; I tried some of the suggestions but with no luck.

Then I remembered that our datastore was on NFS/ZFS, and NFS has its well-known file locking issues. So I mounted the NFS share that the datastore was using and removed a file named lck-c30d000000000000. After this, the VM booted up successfully! (Alternatively, we can log on to the ESXi host and remove the lock file there.)
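If you hit the same error, you can look for leftover lock files next to the VM's disks before deleting anything. A sketch using the datastore path from the error above (lock file names vary; on NFS datastores they often start with .lck-):

find /vmfs/volumes/4726d591-9c3bdf6c/doxer-test -name '*lck*' -ls #list lock files under the VM's directory
#after confirming no other host legitimately holds the lock:
#rm /vmfs/volumes/4726d591-9c3bdf6c/doxer-test/lck-c30d000000000000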

install java jdk on linux

January 7th, 2014 No comments

Here are the steps if you want to install Java on Linux:

wget <path to jre-7u25-linux-x64.rpm> -P /tmp #download the JRE rpm to /tmp
rpm -ivh /tmp/jre-7u25-linux-x64.rpm #install it
mkdir -p /root/.mozilla/plugins #prepare the firefox plugin directory
rm -f /root/.mozilla/plugins/libnpjp2.so
ln -s /usr/java/jre1.7.0_25/lib/amd64/libnpjp2.so /root/.mozilla/plugins/libnpjp2.so #link the java browser plugin into firefox
ll /root/.mozilla/plugins/libnpjp2.so #verify the symlink
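Afterwards you can confirm the runtime is installed and on the PATH:

java -version #should report the newly installed version, e.g. 1.7.0_25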

add another root user and set password

January 7th, 2014 No comments

In Linux, do the following to add another root user (uid 0) and set its password:

mkdir -p /home/root2
useradd -u 0 -o -g root -G root -s /bin/bash -d /home/root2 root2
echo password | passwd --stdin root2
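You can then verify that the new account really shares uid 0 with root:

id root2 #should show uid=0
grep '^root2:' /etc/passwd #the third field (the uid) should be 0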

Categories: IT Architecture, Linux, Systems Tags:

oracle database tips – management

December 30th, 2013 No comments
(image: oracle_instances_and_database)
###General

dbca
netmgr
netca #su - grid, can change listener port with this
oifcfg #Oracle Interface Configuration Tool, used for adding new public/private interfaces
appvipcfg #appvipcfg create -network=1 -ip 172.17.1.108 -vipname httpd-vip -user=root
lsnrctl #su - grid first; change_password, save_config
asmca
asmcmd -p #asmcmd -p ls -l; ls --permission; lsdg; find -t datafile DATA/ sys*;pwd;lsct<asm client>;help cp

v$parameter(for current session), v$system_parameter(for new sessions), v$spparameter

sqlnet.ora #su - grid, Profiles, define sequence of naming method;access control(through netmgr)

orapwd #sync dictionary/password file after upgrading oracle db
/etc/oratab #which DBs are installed, and control whether dbstart/dbshut is used to start/stop DB

pmon registers to listener, alter system register. PMON worked with dispatcher, shared server architecture; dedicated server architecture(for restricted operations)

DB modes

startup nomount #read spfile or init.ora and start up oracle memory structures/background processes. instance is started but db is not associate with instance. may recreate control files in this mode.
alter database mount #the instance mounts the database. Control files(contains name of datafiles and redo logs) are read, but datafiles and redo logs still not open
startup force #restart instance(first shutdown abort then startup. if not shutdown properly before and cannot startup now)
startup mount

alter database open [read only]; #datafiles and redo logs open, ready for use

startup restrict #only DBA can use the DB
alter system quiesce restrict #The activities of other users continue until they become inactive
alter system unquiesce
shutdown normal(all connections quit)/immediate(rollback first)/transactional(after commit)/abort

SQL> select open_mode from v$database; #read write

segment, extents, blocks

Each segment is a single instance of a table, partition, cluster, index, or temporary or undo segment. So, for example, a table with two indexes is implemented as three segments in the schema.

As data is added to Oracle, it will first fill the blocks in the allocated extents, and once those extents are full, new extents can be added to the segment as long as space allows.

EM

emctl status/start/stop dbconsole(management agent)/agent/oms
emca -config dbcontrol db -cluster #EM configuration assistant
emca -reconfig dbcontrol -cluster #reconfigure the Console to start on a different node (or on more than one if you desire)
emca -displayConfig dbcontrol -cluster #current config, Management Server's location
emca -addInstdb/-deleteInst

###Performance
AWR(Automatic Workload Repository) is used for storing database statistics that are used for performance tuning. A set of tables is created for the AWR under the SYS schema in the SYSAUX tablespace.

MMON(Manageability Monitor) captures base statistics every 60 minutes; these snapshots plus the in-memory statistics make up the AWR (ADDM runs automatically after each AWR snapshot)

MMNL(Manageability Monitor Light) performing tasks related to the Active Session History (ASH), ASH refresh every second,record what the sessions are waiting for

ADDM(automatic database diagnostic monitor) diagnoses AWR report and suggest potential solutions
SQL>select * from v$sql where cpu_time>200000 #cpu_time is in microseconds, so this is 0.2s
explain plan for select count(*) from lineitem; #and then issue select * from table(DBMS_XPLAN.DISPLAY);
###LOG
show parameter dump_dest; #/u01/app/oracle/diag/rdbms/devdb/devdb1/{trace,alert}
$ORACLE_BASE/cfgtoollogs/dbca #for DBCA trace and log files
$ORACLE_BASE/admin/devdb

 

adump,  audit files
dpdump, Data Pump Files
hdump, High availability trace files
pfile, Initialization file;

 

$GRID_HOME/log/hostname is $LOG_HOME

 

ohasd/ #ohasd.bin's booting log
acfs/crsd/cssd/evmd/diskmon/gipcd/gnsd/gpnpd/mdnsd/racg/srvm
agent
client

 

ADR(automatical diagnotics repository)

 

adrci #su - oracle/grid, manage alert/trace files
select name, value from gv$diag_info; #ADR  directory structure for all instances, e.g. Diag Enabled/ADR Base/ADR Home/Diag Trace/Diag Alert/Health Monitor/Active Problem Count
adrci exec="show home";
show home;
set base xxx;<show parameter diagnostic_dest>
set homepath diag/rdbms/orcl6/orcl6;<or export ADR_HOME='xxx'>
show alert -p "message_text like '%start%'";
show alert -tail 5 -f;
show incident;
ips create package incident <incident id>;
ips generate package 1 in /tmp;
show tracefile;
show tracefile -I <incident id>;

 

show trace <tracefile name from show tracefile -I, like PROD1_lmhb_7430_i27729.trc>;
HM(health check)

 

SQL> select name from v$hm_check; #health check names
SQL> exec dbms_hm.run_check('DB Structure Integrity Check');
adrci> show hm_run;  #to check HM details

adrci> create report hm_run <RUN_NAME from show hm_run>;

adrci> show report hm_run <RUN_NAME from show hm_run>;

SQL>  select description, damage_description from v$hm_finding where run_id = 62; #run_id is from show hm_run

 

DBMS_MONITOR and trace(sql_trace is deprecated)

 

DBMS_MONITOR traces for a specific session(session_trace_enable or database_trace_enable for all sessions), module(serv_mod_act_trace_enable), action(serv_mod_act_trace_enable), or client identifier

 

SQL> SELECT sid, serial# FROM v$session WHERE username = 'TPCC'; # you may need to join V$SESSION to other dynamic performance views, such as V$SQL, to identify the session of interest
SQL> EXECUTE dbms_monitor.session_trace_enable (session_id=>164); #enable tracing for a specific session, the result will be appended to the current trace file
SQL> EXECUTE dbms_monitor.session_trace_disable (session_id=>164); #disable tracing for the session
tkprof #view the activities of a session with finer granularity after trace files generated
###Patch
Patch types

 

Interim patch #cannot wait until the next patch set to receive the product fix. use opatch to install
CPU(Critical Patch Update, overall release of security fixes each quarter)
PS(Patch Set, minor version upgrade, 11.1.0.6 -> 11.1.0.7. use OUI<Oracle Universal Installer> to apply)
PSU(Patch Set Updates, cumulative patches,  low risk and RAC rolling installable, include the latest CPU)
BP(Bundle Patch,  PSU for exadata)
Major release update(11.1 -> 11.2)
Opatch
opatch query -is_rolling_patch <patch path/> #four modes: all node patch mode/rolling patch mode/minimum downtime patch mode/local patch mode
opatch query <patch path> -all
$ORACLE_HOME/OPatch/opatch lsinventory -detail -oh $ORACLE_HOME #ensure inventory is not corrupted and see what patches have been applied #su - oracle/grid
/etc/oraInst.loc #inventory_loc(/u01/app/oraInventory)
/u01/app/oraInventory/ContentsXML/inventory.xml #stores oracle software products & their oracle_homes location
Opatch example
srvctl stop home -o $ORACLE_HOME -s /var/tmp/home_stop.txt -n node1 #su - oracle
srvctl stop home -o $ORACLE_HOME -s /var/tmp/home_stop_grid.txt -n node1 #su - grid
$GRID_HOME/crs/install/rootcrs.pl -unlock -crshome $GRID_HOME #su - root, Unlock CRS home
/u01/app/crs/OPatch/opatch apply #su - grid and patch
$GRID_HOME/crs/install/rootcrs.pl -patch #su - root, boot up local HAS that has been patched
crsctl check crs #check status, or crsctl check cluster -all. -n node1 for single node
srvctl start home -o $ORACLE_HOME -s /var/tmp/home_stop_grid.txt -n node1 #su - grid
srvctl start home -o $ORACLE_HOME -s /var/tmp/home_stop.txt -n node1 #su - oracle
now patch node2

 

###RAC
clusterware

server pools #Policy-managed databases, add/remove instance to cluster automatically

$GRID_HOME/racg/usrco #server side callouts

alter diskgroup diskgroupName online disk diskName #re-activate asm disk in DISK_REPAIR_TIME

select name,value from v$asm_attribute where group_number=3 and name not like 'template%'; #disk group attributes

Add RAC nodes

 

network/user ids/asmlib and asm module/disks
$ORACLE_HOME/bin/cluvfy stage -post hwos -n london4 -verbose #su - grid
cluvfy stage -pre nodeadd -n london4  -fixup -fixupdir /tmp
$GRID_HOME/oui/bin/addNode.sh -silent "CLUSTER_NEW_NODES={london4}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={london4-vip}"  #with or without GNS, then execute scripts prompted
cluvfy stage -post nodeadd -n london4  #done for adding node to grid
$ORACLE_HOME/oui/bin/addNode.sh -silent "CLUSTER_NEW_NODES={london4}" #add RDBMS software, then execute scripts prompted

 

Remove RAC nodes
ensure that no database instance or other custom resource type uses that node
ocrconfig -manualbackup
dbca #remove database instance from a node
srvctl config listener -a #detailed listener configuration
srvctl disable listener -l LISTENER -n london2

srvctl stop listener -l LISTENER -n london2

./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME CLUSTER_NODES={london2} -local #on node2(to be deleted), pay attention to -local

 

$ORACLE_HOME/deinstall/deinstall -local #remove RDBMS home. if RDBMS binaries was installed on shared storage, then $ORACLE_HOME/oui/bin/runInstaller -detachHome ORACLE_HOME=$ORACLE_HOME
$ORACLE_HOME/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={london1}"  #on node1
$GRID_HOME/crs/install/rootcrs.pl -deconfig -force # update the OCR and remove the node from Grid, more on http://goo.gl/CDnQdZ
crsctl delete node -n london2  #su - root, on node1
/u01/app/crs/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES=london2" CRS=TRUE -local  #on node2, remove grid software
./deinstall -local

 

./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME"CLUSTER_NODES={london1,london3}" CRS=TRUE #on node1

RAC startup/shutdown

crsctl start crs #su - root, on each node.crsctl start cluster will boot up daemons not running.  -n node1 to specified one node

srvctl start asm -n node1/node2 #su - oracle

srvctl start database -d devdb #su - oracle

ohasd(oracle restart, will respawn) will startup crs(through reading contents of /etc/oracle/scls_scr/<hostname>/root/ohasdrun, more on http://goo.gl/VH4F3T). oracle restart can startup ons/eons daemons: srvctl add/start ons/eons on single instance DB

crs(clusterware OR grid infrastructure) manages asm/db/vip/listener/ons<oracle notification services, which send out FAN(fast application notification) events> cluster resource(startup/stop/monitor/failover).

Oracle Cluster Registry

ocrconfig [-local] -showbackup #su - grid. Configuration tool for, saves info for cluster resource, so that crs will use this for managing cluster resources. OLR(Oracle Local Registry) mainly stores info about OHASD locally.

ocrdump /var/tmp/ocrdump.txt #run as root

ocrcheck #run as grid/oracle/root

restore OLR

ocrconfig -local -restore /u01/app/crs/cdata/london1/backup_20091010_211006.olr #init 2 first. check log in $GRID_HOME/log/hostname/client
#if OLR file is lost
touch /u01/app/crs/cdata/london1.olr #init 2 first
chown grid:oinstall london1.olr
ocrconfig -local -restore /u01/app/crs/cdata/london1/backup_20100625_085111.olr

restore OCR

crsctl stop cluster -all -f # stop the Clusterware stack on all nodes
crsctl start crs -excl # start the Clusterware stack in exclusive mode (start the required background processes that make up a local ASM instance). use crsctl disable crs to disable the automatic starting of the Clusterware stack. reboot the node. Then crsctl start crs -excl again.
If the diskgroup containing the voting files was lost # create a new one with exactly the same name and mount it. set the ASM compatibility to 11.2 for the diskgroup(alter diskgroup OCRVOTE set attribute 'compatible.asm'='11.2'). execute /etc/init.d/oracleasm scandisks on the nodes for which you did not create the disk
ocrconfig -restore backup_file_name #log files under $GRID_HOME/log/hostname/client. if errors saying crs is running, then crsctl stop res ora.crsd -init
crsctl stop crs #at this time, ohasdrun's content will be 'stop', although init.ohasd is still running
crsctl start crs
Start the cluster on the remaining nodes. If you have disabled the automatic start of the Clusterware stack in Step 2, re-enable it using crsctl enable crs

ocssd #cluster synchronization server

control membership of RAC nodes(join/leave). update voting disk every second with the status of node itself. can work with thirdparty HA like IBM hacmp except for clusterware

GPnP #grid plug and play

gpnptool get #su - grid first, info about public/private NIC/voting disk path can be seen

CTSS #Cluster time synchronization (CTSS)

observer<if ntp>/active mode<if not ntp>

GNS #grid naming service

manages VIP instead of using name server

mDNS, multicast DNS

restore voting disk

if OCR corrupted, then recover OCR first

crsctl stop cluster -all -f # stop the Clusterware stack on all nodes
crsctl start crs -excl # start the Clusterware stack in exclusive mode (start the required background processes that make up a local ASM instance). use crsctl disable crs to disable the automatic starting of the Clusterware stack. reboot the node. Then crsctl start crs -excl again.
If the diskgroup containing the voting files was lost # create a new one with exactly the same name and mount it. set the ASM compatibility to 11.2 for the diskgroup(alter diskgroup OCRVOTE set attribute 'compatible.asm'='11.2'). execute /etc/init.d/oracleasm scandisks on the nodes for which you did not create the disk. compatible.rdbms = 11.2  access_control.enabled = true
crsctl replace votedisk + disk_group_name
crsctl stop crs
crsctl start crs
Start the cluster on the remaining nodes. If you have disabled the automatic start of the Clusterware stack in Step 2, re-enable it using crsctl enable crs

ASM permission

alter diskgroup data set ownership owner = 'orareport' for file '+DATA/PROD/DATAFILE/users.259.679156903';

alter diskgroup data set permission owner = read write, group = read only, other = none  for file '+DATA/FIN/datafile/example.279.723575017';

select name, permissions,user_number,usergroup_number from v$asm_file f natural join v$asm_alias t where group_number = 3 and name = 'EXAMPLE.279.723575017' ;

asm template

select redundancy,stripe,name,primary_region,mirror_region from v$asm_template  where group_number = 3; #stripe has coarse/fine types.variable extent sizes can be used for coarse striping, the first 20,000 extents always equal the allocation unit (AU) size. The next 20,000 extents are 4 times the size of the AU. fine striping is used only for control files, online redo logs, and flashback logs. The stripe size is 128KB. Also by default, eight stripes are created for each file; therefore, the optimum number of disks in a disk group is a multiple of eight.

alter diskgroup data add template allhot attributes (hot);
create tablespace hottbs datafile '+DATA(allhot)' size 10M;
select bytes,type,redundancy,primary_region,mirror_region, hot_reads,hot_writes,cold_reads,cold_writes from v$asm_file where file_number = 284;

add/remove asm disk

For device-mapper-multipath

1.format the underlying block device, usually /dev/sd* on one node

2.With the device partitioned, the administrator can use either partprobe or kpartx to re-read the partition table on the other cluster nodes

3.A restart of the multipath daemon (“service multipathd restart”) should show the new partition in /dev/mapper(may use multipath -f first)

4.With the new block devices detected on all nodes, you could use ASMLib to mark the disk as an ASM disk on one node of the cluster

alter diskgroup data drop disk 'DATA11' rebalance power 3;

alter DISKGROUP DGNORM1 add DISK '/dev/rdsk/disk5' name disk5, '/dev/rdsk/disk6' name disk6;
alter diskgroup DG_DEST_DF add FAILGROUP FailgroupB disk '/asmdisks/asmdiskB48';
select group_number,name from v$asm_diskgroup;
select DISK_NUMBER, name, failgroup, group_number from v$asm_disk where group_number=3 order by name; #or order by 2
alter diskgroup DG_DEST_DF add FAILGROUP FailgroupA disk '/asmdisks/asmdiskA28';
alter diskgroup DG_SRC_DF drop disk asmdiskA28;

Then  check V$ASM_OPERATION for information about a time remaining estimate and later physically remove the disk(It is safe to remove the ASM disk physically from the cluster only when the HEADER_STATUS of the V$ASM_DISK view shows “FORMER” for the disk you dropped.)

SQL> alter diskgroup data add disk 'ORCL:NEWSAN01', 'ORCL:NEWSAN02', 'ORCL:NEWSAN03',
drop disk 'ORCL:OLDSAN01', 'ORCL:OLDSAN02', 'ORCL:OLDSAN03' rebalance power 11; #add and drop in parallel, with a single rebalance

kfed read /dev/oracleasm/disks/VOL1 #show ASM header info, su - grid

Startup status

STARTUP FORCE/nomount(Starts the ASM instance but does not mount any disk groups)

mount<or open,mounts all disks registered in the ASM_DISKGROUPS initialization parameter>

ASM operations

show parameter ASM_DISKGROUPS;
select name,state from v$asm_diskgroup ; #su - grid

SQL> select * from v$asm_disks;
SQL> select type from V$ASM_FILE group by TYPE #having amount > 300

select db_name,status,instance_name from v$asm_client; #connected clients. when OCR is stored in ASM, the asm instance itself will be its client

alter diskgroup DATA check;  #repair

srvctl start diskgroup -g diskgroupName #rather than using 'alter diskgroup mount all'

/etc/init.d/oracleasm createdisk VOL4 /dev/loop1 #dd and losetup first

create diskgroup FILEDG external redundancy disk 'ORCL:VOL4';

SQL> show parameter asm_diskstring; #search path for candidate disks

SQL> create diskgroup DGEXT1 external redundancy disk '/dev/rdsk1/disk1';
SQL> create diskgroup DGNORM1 normal redundancy disk FAILGROUP controller1 DISK '/dev/rdsk/disk1' name disk1, '/dev/rdsk/disk2' name disk2 FAILGROUP controller2 DISK '/dev/rdsk/disk3' name disk3, '/dev/rdsk/disk4' name disk4; #mirroring based on AU(1M or Exadata's 4M). In Normal,one mirror for each extent(2 fail groups at least); In high, two mirrors for each extent(3 fail groups at least). For each extent written to disk, another extent will be written into another failure group to provide redundancy.

asm options

compatible.asm, compatible.rdbms, compatible.advm, au_size, sector_size, disk_repair_time, access_control.enabled, access_control.umask

CRSCTL resource

crsctl status resource -t #crsctl status resource -h for help. crs_stat -t

crsctl status resource -t -init #check status of ohasd stack

crsctl status resource ora.devdb.db -p #get detail info about one resource(for example dependency relationships), resource profile

crsctl add resource TEST.db -type cluster_resource -file TEST.db.config #register the resource in Grid. TEST.db.config is output from -p

crsctl getperm resource TEST.db  #resource permission

crsctl setperm resource TEST.db -o oracle #oracle can startup this resource after this

crsctl delete resource ora.gsd #crsctl start/stop resource <resource name>

/u01/app/11.2.0/grid/bin/scriptagent #used to protect user defined resources, start/stop/check/clean/abort/<relocate>(combined with user action scripts)

SRVCTL scan

srvctl config scan #get scan info

srvctl status database -d devdb ##instance running status

srvctl config scan_listener #srvctl stop scan_listener , srvctl modify scan_listener -p  1526, change scan listener Endpoint(port); srvctl start scan_listener ; srvctl status scan_listener

show parameter local_listener/remote_listener;

#add scan ip

srvctl stop scan_listener
srvctl stop scan #scan vip
srvctl status scan_listener
srvctl status scan
srvctl modify scan -n cluster1.example.com #su - root
srvctl config scan
srvctl modify scan_listener -u #Update SCAN listeners to match the number of SCAN VIPs
srvctl config scan_listener
srvctl start scan
srvctl start scan_listener

SRVCTL service

srvctl add service -d PROD -s REPORTING -r PROD3 -a PROD1 -P BASIC -e SESSION # add a service named reporting to your four-node administrator-managed database that normally uses the third node, but can alternatively run on the first node
srvctl start service -d PROD -s REPORTING -i PROD1
srvctl config service -d PROD -s reporting #check status
srvctl status service -d PROD -s reporting
srvctl relocate service -d PROD -s reporting -i PROD1 -t PROD3 #different parameters for admin managed/policy managed

(images: srvctl_1, srvctl_2)

###Backup and Recovery
datapump
imp/exp scott/tiger file=scott.exp #logical export, client based. run @?/rdbms/admin/catexp.sql first
sql*loader(non-oracle DB) #sqlldr
expdp, impdp #data pump, can not be used when db is read only
expdp/impdp help=y
SQL>create directory backup_dir as '/backup/'; #default is dpump_dir
SQL>grant read,write on directory backup_dir to scott;
expdp scott/tiger dumpfile=scott.dmp directory=backup_dir (tables=scott.emp); #imp_full_database role
expdp \"/ as sysdba\" schemas=scott dumpfile=scott.dmp directory=backup_dir; #full/schemas/tables/tablespaces/transport_tablespaces<only metadata>
expdp system/manager DUMPFILE=expdat.dmp FULL=y LOGFILE=export.log COMPRESSION=ALL
expdp \"/ as sysdba\" schemas=SH ESTIMATE_ONLY=y ESTIMATE=BLOCKS;
expdp sh/sh parfile=exp.par
DIRECTORY=DPUMP_DIR
DUMPFILE=testsh.dmp
CONTENT=DATA_ONLY
EXCLUDE=TABLE:"in ('PROMOTIONS')"
QUERY=customers:"where cust_id=1"
create user "SHNEW" identified by "Testpass";
grant CREATE SESSION to SHNEW;
impdp \"/ as sysdba\" dumpfile=sh.dmp directory=backup_dir SCHEMAS=SH REMAP_SCHEMA=SH:SHNEW
impdp \"/ as sysdba\" dumpfile=sh.dmp directory=backup_dir REMAP_SCHEMA=SH:SHNEW TABLES=SH.PRODUCTS TABLE_EXISTS_ACTION=SKIP
undo
all use space in undotbs<id>
before commit, the status is active, and it may use a small portion of expired data; some unexpired will become expired. most of the space used will be the free space in the undo tablespace
after commit, active will become unexpired, and expired won't change
SQL>select status,sum(bytes/1024/1024) from dba_undo_extents where tablespace_name='UNDOTBS1' group by status;
SQL>select tablespace_name,sum(bytes/1024/1024) from dba_free_space where tablespace_name='UNDOTBS1' group by tablespace_name; #undo free space, dba_data_files
undo data is subsequently logged as redo (when you commit, lgwr will write the redo into the redo log; the changes to the data files remain in the buffer and can be written out at a later time <dbw0 writes to datafile>)
SQL>show parameter undo_retention;
guarantee 900 seconds of data (may exceed 900). if auto-extend is on and the tablespace is used up, it autoextends; if maxsize is reached, then unexpired undo will be overridden (only when noguarantee; sql will fail if guarantee is set)
if fixed size, undo_retention is ignored, oracle will tune the maximum retention time automatically
noguarantee #default, unexpired undo can also be overridden, so it may not guarantee <undo_retention> seconds of read consistency
SQL>alter tablespace xxx retention guarantee #to make sure long-running queries will succeed. if tablespace is used up, then sql will fail
SQL>select tablespace_name, retention from dba_tablespaces; #guarantee or noguarantee
SQL>select to_char(begin_time, 'DD-MON-RR HH24:MI') begin_time, to_char(end_time, 'DD-MON-RR HH24:MI') end_time, tuned_undoretention from v$undostat order by end_time; #retention time calculated by system every 10 minutes; data more than 4 days ago are stored in DBA_HIST_UNDOSTAT
rollback #delete from ...; rollback

SQL>flashback table emp to before drop;

redo

 

as opposite to undo, redo is for recovering already commited data

 

RMAN enable archivelog mode
shutdown immediate;
startup mount
back up your database
RMAN> backup as copy database tag="db_cold_090721";
RMAN> list copy;
Update init params as needed:
log_archive_dest || db_recovery_file_dest
log_archive_dest_n || db_recovery_file_dest
log_archive_format
alter database archivelog NOTE: apparently, you can only do this from sqlplus, not rman
alter database open
SQL> select LOG_MODE from v$database;
SQL> archive log list #su - oracle. some configurations, and whether archive log mode is enabled or not; archive log destination

RMAN & cold backup

RMAN Offline backup: This performs an immediate or normal shutdown, followed by a startup mount. This does not require archivelog mode.
RMAN Online backup:For this to be done, the database must be open and in archivelog mode.

startup nomount/restore control files/mount/recover/open
SQL>recover database until cancel; #until change 1234567; until time '2004-04-15:14:33:00', recovery from a hot backup.
alter database open resetlogs #restore using control file/will reset log sequence number to 1<v$log, v$logfile> and throw away all archive log

alter tablespace <name> begin backup #or alter database begin backup
[cp or ocopy (windows)] #no need to backup online redo logs(but you should archive the current redo logs and back those up)
alter tablespace <name> end backup #or alter database end backup, select * from v$backup<scn>

using recovery catalog instead of target database control file
SQL>create tablespace tbs_rman datafile '+DATA' size 200m autoextend on;
SQL>create user rman identified by rman temporary tablespace temp default tablespace tbs_rman quota unlimited on tbs_rman;
SQL>grant recovery_catalog_owner to rman;
RMAN> connect catalog rman/rman@devdb --connect to recovery catalog
RMAN> create catalog tablespace tbs_rman; --create recovery catalog
rman target sys/Te\$tpass@devdb catalog rman/rman@devdb #connect to target database and recovery catalog
RMAN > connect catalog rman/rman@devdb
RMAN > connect target sys/Te\$tpass@devdb
RMAN> register database; #register target database to recovery catalog
Commands frequently used
rman target /
RMAN> show all; #rman configurations
rman> backup current controlfile;
RMAN> backup database plus archivelog;
SQL> alter database backup controlfile to '/u01/app/oracle/control.bak' [reuse]; #binarybackup, reuse for override
SQL> alter database backup controlfile to trace as '/u01/app/oracle/control.trace'; #text backup
rman> backup tablespace new_tbs; #then datafile lost and error occurs while querying/create table
SQL>col NAME for a60
SQL> select FILE#, STATUS, NAME from v$datafile;
rman>restore datafile <10>;
rman>recover datafile 10(apply redo log);
SQL>alter tablespace new_tbs online;
RMAN>crosscheck backup; #checks that the RMAN catalog is in sync with the backup files on disk or the media management catalog. Missing backups will be marked as “Expired.”
alter system set max_dump_file_size=1000 scope=both;
show parameter log_archive_duplex_dest; #not have a single point of failure
show parameter archive_dest;
select dest_name,status,destination from V$ARCHIVE_DEST;
SQL> show parameter log_archive_format;
SQL> alter system set log_archive_dest_1='location=+DA_SLCM07' scope=both; #After this, you can remove old archive logs on the filesystem(usually under $ORACLE_HOME/dbs). more on http://goo.gl/kNdX91 and http://goo.gl/OcviNx
SQL> alter system set log_archive_dest_1 = 'LOCATION=USE_DB_RECOVERY_FILE_DEST'; #use setting of DB_RECOVERY_FILE_DEST
SQL> show parameter db_recovery_file_dest #alter system set db_recovery_file_dest_size=9G scope=both; #startup mount first. <Flash Recovery Area>
RMAN> connect target sys/Testpass;
RMAN> delete noprompt ARCHIVELOG UNTIL TIME 'sysdate-4';
RMAN>DELETE noprompt BACKUP COMPLETED BEFORE 'sysdate-2';

RMAN> list incarnation; #version
RMAN> crosscheck copy;
RMAN> delete expired copy; --remove expired copy, delete archivelog all is for remove all
RMAN> resync catalog
RMAN> list backup summary;
RMAN> list backup by file; #list backup sets and their detailed files and pieces
RMAN> report obsolete; # report on backups that are no longer needed because they exceed the retention policy
RMAN> restore database preview summary; #preview the restore and see the summary for the restore
RMAN> list backupset tag=tbs;
RMAN> configure controlfile autobackup; #backup control file automatically
RMAN> configure retention policy to redundancy 4; #to start deleting backups after four backups have been taken
RMAN> configure retention policy to recovery window of 15 days; # make point-in-time recovery possible up to the last 15 days and to make backups taken more than 14 days ago obsolete
RMAN> host 'echo "start `date`"';

RMAN> validate database;
RMAN> validate backupset 7;
RMAN> validate datafile 10;
RMAN> backup validate database archivelog all;
RMAN> restore database validate;
RMAN> restore archivelog all validate;

SQL> select name,completion_time from v$archived_log;

RMAN> crosscheck archivelog all;
RMAN> sql 'alter system switch logfile'; #force a log switch so the current redo log gets archived(only for current thread<instance>)

SQL> alter system archive log current #This command ensures that all redo logs have been archived, and it waits for the archiving to complete(on RAC this is the best practice; slower than switch logfile). later, dbwr will write the checkpoint(SGA dirty data) to the datafiles/control file, and update the SCN
RMAN> list archivelog all;
SQL>select dbms_flashback.get_system_change_number from dual; #current SCN. select CURRENT_SCN from v$database;

rman>list failure; #Data recovery advisor, not supported on RAC
rman>advise failure <362> details;
rman>repair failure;

alter system checkpoint #sync redo and datafile, http://www.orafaq.com/wiki/Checkpoint
drop logfile group n #remove redo log

Backup policies

DataGuard
SQL> alter database set standby database to maximize protection/availability/performance; #data guard modes
SQL> select force_logging from v$database;
alter database [NO] force logging; #forces all changes to be logged even if nologging. NOLOGGING/LOGGING(default)/FORCE LOGGING
ALTER TABLESPACE tablespace_name [no] FORCE LOGGING;
SQL> select tablespace_name,logging,force_logging from dba_tablespaces;
select table_name,logging from user_tables; #object level
alter table tb_a nologging;
RMAN scripts
RMAN> run{ #full backup
2> allocate channel ch1 device type disk;
3> backup as compressed backupset
4> database plus archivelog delete input #remove archived log after backed up
5> format='/u01/app/oracle/whole_%d_%U'
6> tag='whole_bak';
7> release channel ch1;}
RMAN> run{ #0-level incremental backup(differential by default)
2> allocate channel ch1 device type disk;
3> allocate channel ch2 device type disk;
4> backup as compressed backupset
5> incremental level 0
6> database plus archivelog delete input
7> format='/u01/app/oracle/inc_0_%d_%U'
8> tag='Inc_0';
9> release channel ch1;
10> release channel ch2;}

RMAN> run{ #1-level incremental backup(differential by default)
2> allocate channel ch1 device type disk;
3> allocate channel ch2 device type disk;
4> backup as compressed backupset
5> incremental level 1 database
6> format='/u01/app/oracle/Inc_1_%d_%U'
7> tag='Inc_1';
8> release channel ch1;
9> release channel ch2;}

RMAN> run{ #1-level incremental backup(cumulative)
2> allocate channel ch1 device type disk;
3> backup as compressed backupset
4> incremental level 1 cumulative database
5> format '/u01/app/oracle/Cum_1_%d_%U'
6> tag='Cul_1';
7> release channel ch1;}

RMAN> run{ #backup tablespaces
2> allocate channel ch1 device type disk;
3> backup as compressed backupset
4> tablespace EXAMPLE,USERS
5> format='/u01/app/oracle/tbs_%d_%U'
6> tag='tbs';}

RMAN> run{ #backup datafile
2> allocate channel ch1 device type disk;
3> backup as compressed backupset
4> datafile 3
5> format='/u01/app/oracle/df_%d_%U'
6> tag='df';
7> release channel ch1;}

RMAN> run{ #backup archived logs through SCN
2> allocate channel ch1 device type disk;
3> backup as compressed backupset
4> archivelog from scn 9214472
5> format='/u01/app/oracle/arc_%d_%U'
6> tag='arc';
7> release channel ch1;}

RMAN> run{ #Image Copy backup
2> allocate channel ch1 device type disk;
3> backup as copy datafile 1,4
4> format '/u01/app/oracle/df_2_%d_%U'
5> tag 'copyback';
6> release channel ch1;}

RMAN > run { allocate channel c1 type disk; #Image Copy backup
RMAN > copy datafile 1 to '/u01/back/system.dbf';}
RMAN> replace script BackupTEST1 { #scripts are stored in the catalog
configure backup optimization on;
configure channel device type disk;
sql 'alter system archive log current';
backup incremental level 2 cumulative database;
release channel d1;
}
run {execute script BackupTEST1;}

RMAN> replace script fullRestoreTEST1 { #recover
allocate channel ch1 type disk;
# Set a new location for logs
set archivelog destination to '/TD70/sandbox/TEST1/arch';
startup nomount;
restore controlfile;
alter database mount;
restore database; #file level restore. restore database/tablespace/datafile/controlfile/archivelog
recover database; #data level recover, applying redo log, and keep SCN consistent. recover database/tablespace/datafile
alter database open resetlogs;
release channel ch1;
}
host 'echo "start `date`"';
run {execute script fullRestoreTEST1;}
host 'echo "stop `date`"';
exit
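To run such a script file non-interactively from the shell, rman's cmdfile and log options come in handy (the file names here are just examples):

rman target / catalog rman/rman@devdb cmdfile=fullRestoreTEST1.rcv log=restore.log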

###oracle SQL tips is here http://goo.gl/moTLnJ

self defined timeout for telnet on Linux

December 26th, 2013 No comments

telnet's default timeout value is relatively high, so you may want to lower it to something like 5 seconds. Here's one way to do it:

#!/bin/bash

timeout()
{
waitfor=5
command=$*
$command & #run the command in the background
commandpid=$!
( sleep $waitfor ; kill -9 $commandpid > /dev/null 2>&1 ) & #watchdog: kill the command after $waitfor seconds
watchdog=$!
wait $commandpid > /dev/null 2>&1 #wait for the command itself
kill -9 $watchdog > /dev/null 2>&1 #command finished in time, so cancel the watchdog
}

timeout telnet slcc29-scan1.us.oracle.com 1521 >> $output
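By the way, if your box ships a reasonably recent GNU coreutils, there's a ready-made timeout(1) command that does the same thing in one line:

timeout 5 telnet slcc29-scan1.us.oracle.com 1521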

Alternatively, we can wrap telnet in expect and rely on expect's own timeout setting:

#!/usr/bin/expect

set timeout 30

spawn telnet <host> <port>
expect {
"Connected" { send "<put telnet command here>\r" }
timeout { puts "telnet timed out"; exit 1 }
}

Add static routes in linux which will survive reboot and network bouncing

December 24th, 2013 No comments

We can see that in linux, the file /etc/sysconfig/static-routes is read by /etc/init.d/network:

[root@test-linux ~]# grep static-routes /etc/init.d/network
# Add non interface-specific static-routes.
if [ -f /etc/sysconfig/static-routes ]; then
grep "^any" /etc/sysconfig/static-routes | while read ignore args ; do

So we can add rules in /etc/sysconfig/static-routes to let network routes survive reboot and network bouncing. The format of /etc/sysconfig/static-routes is like:

any net 10.247.17.0 netmask 255.255.255.192 gw 10.247.10.1
any net 10.247.11.128 netmask 255.255.255.192 gw 10.247.10.1

To make route in effect immediately, you can use route add:

route add -net 192.168.62.0 netmask 255.255.255.0 gw 192.168.1.1
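On boxes with iproute2, the equivalent is:

ip route add 192.168.62.0/24 via 192.168.1.1
ip route show #verify the new route is there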

But remember that to change the default gateway, we need to modify /etc/sysconfig/network(modify GATEWAY=).

After the modification, bounce the network with service network restart to make the changes take effect.

PS: 

You need to make sure a network ID follows -net, or you'll see the error "route: netmask doesn't match route address".

remove duplicate images using fdupes and expect in linux

December 13th, 2013 No comments

I've got several thousand pictures, but most of them had several exact copies. So at first I removed the duplicates by hand.

Later, I recalled that in linux we have md5sum, which gives the same string for files with exactly the same contents. I tried to write a program around that, and it took me quite a while.

Then I searched google and found that linux has fdupes, which does the job very well. fdupes finds duplicate files based on file size/md5 value, and with the -d parameter it prompts you to preserve one copy or all copies of each set of duplicates and removes the others. You can read more about fdupes here http://linux.die.net/man/1/fdupes

As all the pictures were on a windows machine, I installed cygwin along with fdupes and expect. Then I wrote a small script to keep only one copy of each duplicate picture for me (without expect you would have to enter your option, preserving one copy or all copies, by hand for every set, as fdupes offers no option to always keep just one copy). Here's my program:

$ cat fdupes.expect
#!/usr/bin/expect
set timeout 1000000
spawn /home/andy/fdupes.sh
expect "preserve files" {
send "1\r";exp_continue
}

$ cat /home/andy/fdupes.sh
fdupes.exe -d /cygdrive/d/pictures #yup, my pictures are all in this directory on windows, i.e. d:\pictures

After this, you can just run fdupes.expect, and it will reserve only one copy and remove other duplicates for you.

PS: Here's man page of fdupes https://github.com/adrianlopezroche/fdupes

Common storage multi path Path-Management Software

December 12th, 2013 No comments
Vendor Path-Management Software URL
Hewlett-Packard AutoPath, SecurePath www.hp.com
Microsoft MPIO www.microsoft.com
Hitachi Dynamic Link Manager www.hds.com
EMC PowerPath www.emc.com
IBM RDAC, MultiPath Driver www.ibm.com
Sun MPXIO www.sun.com
VERITAS Dynamic Multipathing (DMP) www.veritas.com

resolved – mount clntudp_create: RPC: Program not registered

December 2nd, 2013 No comments

When I ran showmount -e localhost, an error occurred:

[root@centos-doxer ~]# showmount -e localhost
mount clntudp_create: RPC: Program not registered

So I checked which RPC program number showmount uses:

[root@centos-doxer ~]# grep showmount /etc/rpc
mountd 100005 mount showmount

As this indicated, the mountd daemon needs to be running for showmount -e localhost to work. And mountd is part of nfs, so I started up nfs:

[root@centos-doxer ~]# /etc/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]

Now that mountd was running, showmount -e localhost worked.
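To have nfs (and with it mountd) start on boot as well, the usual way on a RHEL/CentOS-style init is:

chkconfig nfs on
chkconfig --list nfs #confirm it's on for runlevels 3 and 5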

 

quick configuration of python httpd server

November 28th, 2013 No comments

Let's assume that you want to copy files from server A to server B, and you've found that scp is not available, but wget is there for use. Then you can run one python command on server A and use wget to download the files from it.

Here's the steps:

On server A:

cd <directory of files you want to copy>

python -m SimpleHTTPServer 8000 #notice the output of this command, for example, "Serving HTTP on 0.0.0.0 port 8000 ..."

Now you can open a browser and visit http://<hostname of server A>:8000. You will notice the files are listed there.

On server B:

wget http://<hostname of server A>:8000/<files to copy>

After you've copied the files, you can press ctrl+c to terminate that python httpd server on server A. (Or you can press ctrl+z, and then %<job id> & to make the python httpd server run in the background.)
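In case server A runs python 3 instead, the module was renamed there, so the equivalent command is:

python3 -m http.server 8000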

VLAN in windows hyper-v

November 26th, 2013 No comments

Briefly, a virtual LAN (VLAN) can be regarded as a broadcast domain. It operates on OSI network layer 2. The exact protocol definition is known as 802.1Q. Each network packet belonging to a VLAN has an identifier. This is just a number between 0 and 4095, with both 0 and 4095 reserved for other uses. Let's assume a VLAN with an identifier of 10. A NIC configured with the VLAN ID of 10 will pick up network packets with the same ID and will ignore all other IDs. The point of VLANs is that switches and routers enabled for 802.1Q can present VLANs to different switch ports in the network. In other words, where a normal IP subnet is limited to a set of ports on a physical switch, a subnet defined in a VLAN can be present on any switch port, if so configured, of course.

Getting back to the VLAN functionality in Hyper-V: both virtual switches and virtual NICs can detect and use VLAN IDs. Both can accept and reject network packets based on VLAN ID, which means that the VM does not have to do it itself. The use of VLANs enables Hyper-V to participate in more advanced network designs. One limitation in the current implementation is that a virtual switch can have just one VLAN ID, although that should not matter too much in practice. The default setting is to accept all VLAN IDs.
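For comparison, this is what putting a NIC on VLAN ID 10 looks like on a linux box (a quick sketch, assuming the 8021q kernel module and iproute2 are available; eth0, the VLAN ID and the address are just example values):

modprobe 8021q #load the 802.1Q tagging module
ip link add link eth0 name eth0.10 type vlan id 10 #tagged sub-interface that only picks up VLAN 10 packets
ip addr add 192.168.10.5/24 dev eth0.10
ip link set eth0.10 up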

Difference between Computer Configuration settings and User Configuration settings in Active Directory Policy Editor

November 22nd, 2013 No comments
  • Computer Configuration settings are applied to computer accounts at startup and during the background refresh interval.
  • User Configuration settings are applied to user accounts at logon and during the background refresh interval.
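If you don't want to wait for startup/logon or the background refresh interval, both sets can be pushed immediately on the target machine with the standard gpupdate tool:

gpupdate /force #reapplies all Computer and User Configuration settings right away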

resolved – sshd: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost= user=

November 20th, 2013 No comments

Today when I tried to log on to one linux server with a normal account, these errors showed up in /var/log/secure:

Nov 20 07:43:39 test_linux sshd[11200]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.182.120.188 user=testuser
Nov 20 07:43:39 test_linux sshd[11200]: pam_ldap: error trying to bind (Invalid credentials)
Nov 20 07:43:42 test_linux sshd[11200]: nss_ldap: failed to bind to LDAP server ldaps://test.com:7501: Invalid credentials
Nov 20 07:43:42 test_linux sshd[11200]: nss_ldap: failed to bind to LDAP server ldap://test.com: Invalid credentials
Nov 20 07:43:42 test_linux sshd[11200]: nss_ldap: could not search LDAP server - Server is unavailable
Nov 20 07:43:42 test_linux sshd[11200]: nss_ldap: failed to bind to LDAP server ldaps://test.com:7501: Invalid credentials
Nov 20 07:43:43 test_linux sshd[11200]: nss_ldap: failed to bind to LDAP server ldap://test.com: Invalid credentials
Nov 20 07:43:43 test_linux sshd[11200]: nss_ldap: could not search LDAP server - Server is unavailable
Nov 20 07:43:55 test_linux sshd[11200]: pam_ldap: error trying to bind (Invalid credentials)
Nov 20 07:43:55 test_linux sshd[11200]: Failed password for testuser from 10.182.120.188 port 34243 ssh2
Nov 20 07:43:55 test_linux sshd[11201]: fatal: Access denied for user testuser by PAM account configuration

After some attempts at linux PAM(sshd, system-auth), I still got nothing. Later, I compared /etc/ldap.conf with the one on another box, and found the configuration on the problematic host was not right.

I copied over the correct ldap.conf, tried logging on again, and the issue was resolved.

PS:

You can read more about linux PAM here http://www.linux-pam.org/Linux-PAM-html/ (I recommend reading the System Administrators' Guide, as that may be the only part linux administrators need. You can also get detailed info on some commonly used PAM modules such as pam_tally2.so, pam_unix.so, pam_cracklib, etc.)

Here's one configuration in /etc/pam.d/sshd:

#%PAM-1.0
auth required pam_tally2.so deny=3 onerr=fail unlock_time=1200 #lock account after 3 failed logins. The accounts will be automatically unlocked after 20 minutes
auth include system-auth
account required pam_nologin.so
account include system-auth
password include system-auth
session optional pam_keyinit.so force revoke
session include system-auth
session required pam_loginuid.so

You'll get the error message "pam_tally2(sshd:auth): user test (502) tally 4, deny 3" in /var/log/secure when you try to log on after the third time you entered a wrong password. And "pam_tally2 --user test" will show 0 failures again after the 20 minutes you configured.
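To check or clear the counter by hand without waiting for unlock_time to pass (test being the example user from above):

pam_tally2 --user test #show the current failure count
pam_tally2 --user test --reset #reset the count and unlock the account immediately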

resolved – how to show all results in one page when searching your wordpress blog

November 13th, 2013 No comments

Assume that you have your own wordpress blog, and you note down everything you run into in daily work.

Now you run into some trouble at work again, and remember that you've noted down a similar issue before. So you search your wordpress blog with a keyword such as "trouble", and wordpress returns a result of 30 pages, each page with 10 articles. Now you scroll and click "next page" a lot, and that really frustrates you. What if you had all the search results in one page? Then you'd just need to scroll, with no waiting for the next, next, next page to load. (You may worry that the page load time will disappoint other guys searching your blog, but this proves to be little to worry about, as nobody will search your blog except yourself. Believe me buddy!)

Here goes the way to fulfill this functionality:

  1. Go to wordpress admin page, then click "Appearance" -> "Editor";
  2. Click archive.php on the right to edit it (search.php refers to archive.php, so archive.php is the file to edit);
  3. Search for "have_posts()", and add one line above it. The line to add is: <?php query_posts($query_string . '&showposts=30'); ?> You may change 30 here to any number you want. As you guessed, this is the number of results that will be shown after searching.
  4. Save the change and try searching again. You'll notice the change.

PS:

  1. Note that every time you upgrade wordpress or your wordpress theme, you may need to do the above steps again;
  2. The idea is from http://wordpress.org/support/topic/show-all-content-on-search-page

resolved – kernel panic not syncing: Fatal exception Pid: comm: not Tainted

November 13th, 2013 No comments

We were installing IDM OAM today, and the linux server panicked every time we ran the startup script. The server panic info was like this:

Pid: 4286, comm: emdctl Not tainted 2.6.32-300.29.1.el5uek #1
Process emdctl (pid: 4286, threadinfo ffff88075bf20000, task ffff88073d0ac480)
Stack:
ffff88075bf21958 ffffffffa02b1769 ffff88075bf21948 ffff8807cdcce500
<0> ffff88075bf95cc8 ffff88075bf95ee0 ffff88075bf21998 ffffffffa01fd5c6
<0> ffffffffa02b1732 ffff8807bc2543f0 ffff88075bf95cc8 ffff8807bc2543f0
Call Trace:
[<ffffffffa02b1769>] nfs3_xdr_writeargs+0x37/0x7a [nfs]
[<ffffffffa01fd5c6>] rpcauth_wrap_req+0x7f/0x8b [sunrpc]
[<ffffffffa02b1732>] ? nfs3_xdr_writeargs+0x0/0x7a [nfs]
[<ffffffffa01f612a>] call_transmit+0x199/0x21e [sunrpc]
[<ffffffffa01fc8ba>] __rpc_execute+0x85/0x270 [sunrpc]
[<ffffffffa01fcae2>] rpc_execute+0x26/0x2a [sunrpc]
[<ffffffffa01f5546>] rpc_run_task+0x57/0x5f [sunrpc]
[<ffffffffa02abd86>] nfs_write_rpcsetup+0x20b/0x22d [nfs]
[<ffffffffa02ad1e8>] nfs_flush_one+0x97/0xc3 [nfs]
[<ffffffffa02a86b4>] nfs_pageio_doio+0x37/0x60 [nfs]
[<ffffffffa02a87c5>] nfs_pageio_complete+0xe/0x10 [nfs]
[<ffffffffa02ac264>] nfs_writepages+0xa7/0xe4 [nfs]
[<ffffffffa02ad151>] ? nfs_flush_one+0x0/0xc3 [nfs]
[<ffffffffa02acd2e>] nfs_write_mapping+0x63/0x9e [nfs]
[<ffffffff810f02fe>] ? __pmd_alloc+0x5d/0xaf
[<ffffffffa02acd9c>] nfs_wb_all+0x17/0x19 [nfs]
[<ffffffffa029f6f7>] nfs_do_fsync+0x21/0x4a [nfs]
[<ffffffffa029fc9c>] nfs_file_flush+0x67/0x70 [nfs]
[<ffffffff81117025>] filp_close+0x46/0x77
[<ffffffff81059e6b>] put_files_struct+0x7c/0xd0
[<ffffffff81059ef9>] exit_files+0x3a/0x3f
[<ffffffff8105b240>] do_exit+0x248/0x699
[<ffffffff8100e6a1>] ? xen_force_evtchn_callback+0xd/0xf
[<ffffffff8106898a>] ? freezing+0x13/0x15
[<ffffffff8105b731>] sys_exit_group+0x0/0x1b
[<ffffffff8106bd03>] get_signal_to_deliver+0x303/0x328
[<ffffffff8101120a>] do_notify_resume+0x90/0x6d7
[<ffffffff81459f06>] ? kretprobe_table_unlock+0x1c/0x1e
[<ffffffff8145ac6f>] ? kprobe_flush_task+0x71/0x7c
[<ffffffff8103164c>] ? paravirt_end_context_switch+0x17/0x31
[<ffffffff81123e8f>] ? path_put+0x22/0x27
[<ffffffff8101207e>] int_signal+0x12/0x17
Code: 55 48 89 e5 0f 1f 44 00 00 48 8b 06 0f c8 89 07 48 8b 46 08 0f c8 89 47 04 c9 48 8d 47 08 c3 55 48 89 e5 0f 1f 44 00 00 48 0f ce <48> 89 37 c9 48 8d 47 08 c3 55 48 89 e5 53 0f 1f 44 00 00 f6 06
RIP [<ffffffffa02b03c3>] xdr_encode_hyper+0xc/0x15 [nfs]
RSP <ffff88075bf21928>
---[ end trace 04ad5382f19cf8ad ]---
Kernel panic - not syncing: Fatal exception
Pid: 4286, comm: emdctl Tainted: G D 2.6.32-300.29.1.el5uek #1
Call Trace:
[<ffffffff810579a2>] panic+0xa5/0x162
[<ffffffff81450075>] ? threshold_create_device+0x242/0x2cf
[<ffffffff8100ed2f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff814574b0>] ? _spin_unlock_irqrestore+0x16/0x18
[<ffffffff810580f5>] ? release_console_sem+0x194/0x19d
[<ffffffff810583be>] ? console_unblank+0x6a/0x6f
[<ffffffff8105766f>] ? print_oops_end_marker+0x23/0x25
[<ffffffff814583a6>] oops_end+0xb7/0xc7
[<ffffffff8101565d>] die+0x5a/0x63
[<ffffffff81457c7c>] do_trap+0x115/0x124
[<ffffffff81013731>] do_alignment_check+0x99/0xa2
[<ffffffff81012cb5>] alignment_check+0x25/0x30
[<ffffffffa02b03c3>] ? xdr_encode_hyper+0xc/0x15 [nfs]
[<ffffffffa02b06be>] ? xdr_encode_fhandle+0x15/0x17 [nfs]
[<ffffffffa02b1769>] nfs3_xdr_writeargs+0x37/0x7a [nfs]
[<ffffffffa01fd5c6>] rpcauth_wrap_req+0x7f/0x8b [sunrpc]
[<ffffffffa02b1732>] ? nfs3_xdr_writeargs+0x0/0x7a [nfs]
[<ffffffffa01f612a>] call_transmit+0x199/0x21e [sunrpc]
[<ffffffffa01fc8ba>] __rpc_execute+0x85/0x270 [sunrpc]
[<ffffffffa01fcae2>] rpc_execute+0x26/0x2a [sunrpc]
[<ffffffffa01f5546>] rpc_run_task+0x57/0x5f [sunrpc]
[<ffffffffa02abd86>] nfs_write_rpcsetup+0x20b/0x22d [nfs]
[<ffffffffa02ad1e8>] nfs_flush_one+0x97/0xc3 [nfs]
[<ffffffffa02a86b4>] nfs_pageio_doio+0x37/0x60 [nfs]
[<ffffffffa02a87c5>] nfs_pageio_complete+0xe/0x10 [nfs]
[<ffffffffa02ac264>] nfs_writepages+0xa7/0xe4 [nfs]
[<ffffffffa02ad151>] ? nfs_flush_one+0x0/0xc3 [nfs]
[<ffffffffa02acd2e>] nfs_write_mapping+0x63/0x9e [nfs]
[<ffffffff810f02fe>] ? __pmd_alloc+0x5d/0xaf
[<ffffffffa02acd9c>] nfs_wb_all+0x17/0x19 [nfs]
[<ffffffffa029f6f7>] nfs_do_fsync+0x21/0x4a [nfs]
[<ffffffffa029fc9c>] nfs_file_flush+0x67/0x70 [nfs]
[<ffffffff81117025>] filp_close+0x46/0x77
[<ffffffff81059e6b>] put_files_struct+0x7c/0xd0
[<ffffffff81059ef9>] exit_files+0x3a/0x3f
[<ffffffff8105b240>] do_exit+0x248/0x699
[<ffffffff8100e6a1>] ? xen_force_evtchn_callback+0xd/0xf
[<ffffffff8106898a>] ? freezing+0x13/0x15
[<ffffffff8105b731>] sys_exit_group+0x0/0x1b
[<ffffffff8106bd03>] get_signal_to_deliver+0x303/0x328
[<ffffffff8101120a>] do_notify_resume+0x90/0x6d7
[<ffffffff81459f06>] ? kretprobe_table_unlock+0x1c/0x1e
[<ffffffff8145ac6f>] ? kprobe_flush_task+0x71/0x7c
[<ffffffff8103164c>] ? paravirt_end_context_switch+0x17/0x31
[<ffffffff81123e8f>] ? path_put+0x22/0x27
[<ffffffff8101207e>] int_signal+0x12/0x17

We tried a lot(application coredump, kdump etc) but still got no solution, until we noticed that there were a lot of nfs related messages in the kernel panic info(marked in red above).

As our linux server was not supposed to be using NFS or autofs, we upgraded the nfs client(nfs-utils) and disabled autofs:

yum update nfs-utils

chkconfig autofs off

After this, the startup of IDM succeeded, and no server panic was seen anymore!

make ssh on linux not disconnect after a certain time

November 1st, 2013 No comments

You connect to a linux box through ssh, and sometimes you find ssh just "hangs" there or gets disconnected. It's the ssh configuration on the server that makes this happen.

You can do the following to make the disconnection time long enough that you get around this annoying issue:

cp /etc/ssh/sshd_config{,.bak30}
sed -i '/ClientAliveInterval/ s/^/# /' /etc/ssh/sshd_config
sed -i '/ClientAliveCountMax/ s/^/# /' /etc/ssh/sshd_config
echo 'ClientAliveInterval 30' >> /etc/ssh/sshd_config
echo 'TCPKeepAlive yes' >> /etc/ssh/sshd_config
echo 'ClientAliveCountMax 99999' >> /etc/ssh/sshd_config
/etc/init.d/sshd restart
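By the way, if you can't touch the server, the same keepalive effect can be achieved from the client side (assuming an OpenSSH client). Put this into ~/.ssh/config:

Host *
    ServerAliveInterval 30
    ServerAliveCountMax 99999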

Enjoy!

Categories: IT Architecture, Linux, Systems Tags: