Archive

Archive for the ‘Storage’ Category

stuck in PXE-E51: No DHCP or proxyDHCP offers were received, PXE-M0F: Exiting Intel Boot Agent, Network boot canceled by keystroke

March 17th, 2014 No comments

If you installed your OS and tried booting it up but got stuck with the following messages:

stuck_pxe

Then one possibility is that the configuration of your host's storage array is not right. For instance, it should be JBOD but you had configured it as RAID6.

Please note that this is only one possibility for this error; you may search for the PXE error codes you encountered for more details.

PS:

  • Sometimes DHCP snooping may prevent PXE from functioning; you can read more at http://en.wikipedia.org/wiki/DHCP_snooping.
  • STP (Spanning Tree Protocol) makes each port wait up to 50 seconds before data is allowed to be sent on the port. This delay in turn can cause problems with some applications/protocols (PXE, Bootworks, etc.). To alleviate the problem, PortFast was implemented on Cisco devices (the terminology might differ between vendors); a minimal config sketch follows this list. You can read more at http://www.symantec.com/business/support/index?page=content&id=HOWTO6019
  • ARP caching http://www.networkers-online.com/blog/2009/02/arp-caching-and-timeout/
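For reference, here is a minimal sketch of enabling PortFast on a Cisco IOS access port (the interface name is only an example; other vendors usually call the equivalent an "edge port"). With PortFast the port starts forwarding immediately, so the PXE client's DHCP requests are not dropped during the STP listening/learning delay:

switch# configure terminal
switch(config)# interface GigabitEthernet0/1
switch(config-if)# spanning-tree portfast
switch(config-if)# end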
Categories: Hardware, Storage, Systems Tags:

“Include snapshots” made NFS shares from ZFS appliance shrinking

January 17th, 2014 No comments

Today I met a weird issue when checking an NFS share mounted from a ZFS appliance. As the space on that filesystem was getting low I removed some files, but the NFS filesystem mounted on the client kept shrinking. What confused me was that the filesystem's total size kept getting lower! Shouldn't the free space get larger while the size stays unchanged?

After some debugging, I found that this was caused by the ZFS appliance share setting "Include snapshots". When I unchecked "Include snapshots", the issue was gone!

zfs-appliance
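For reference, on a general-purpose ZFS system (outside the appliance BUI) you can check how much space snapshots are holding with the commands below; the pool and share names are hypothetical. When "Include snapshots" is enabled, snapshot usage is counted in the share's space reporting, which is likely why the size visible over NFS kept changing:

zfs list -t snapshot -r pool1/export/share1        # list the share's snapshots and their space
zfs get usedbysnapshots pool1/export/share1        # total space pinned by snapshots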

Categories: Hardware, NAS, Storage Tags:

resolved – ESXi Failed to lock the file

January 13th, 2014 No comments

When I was powering on a VM in ESXi, an error occurred:

An error was received from the ESX host while powering on VM doxer-test.
Cannot open the disk ‘/vmfs/volumes/4726d591-9c3bdf6c/doxer-test/doxer-test_1.vmdk’ or one of the snapshot disks it depends on.
Failed to lock the file

And also:

unable to access file since it is locked

This apparently was caused by some storage issue. I first googled and found that most of the posts discussed ESXi's locking mechanism; I tried some of the suggestions but with no luck.

Then I remembered that our datastore was using NFS/ZFS, and NFS has file locking issues as you know. So I mounted the NFS share that the datastore was using and removed a file named lck-c30d000000000000. After this, the VM booted up successfully! (Alternatively, we can log on to the ESXi host and remove the lock file there.)
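A minimal sketch of that workaround from a Linux box; the NFS export path and VM folder below are hypothetical, and you should make sure the VM is powered off and nothing else legitimately holds the lock before removing it:

mount -t nfs nas-host:/export/datastore1 /mnt/ds      # mount the NFS share backing the datastore
ls -la /mnt/ds/doxer-test/ | grep lck                 # locate stale lock files (they may be hidden, e.g. .lck-xxxx)
rm /mnt/ds/doxer-test/lck-c30d000000000000            # remove the stale lock file
umount /mnt/ds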

Categories: NAS, Oracle Cloud, Storage Tags:

Common storage multi path Path-Management Software

December 12th, 2013 No comments
Vendor Path-Management Software URL
Hewlett-Packard AutoPath, SecurePath www.hp.com
Microsoft MPIO www.microsoft.com
Hitachi Dynamic Link Manager www.hds.com
EMC PowerPath www.emc.com
IBM RDAC, MultiPath Driver www.ibm.com
Sun MPXIO www.sun.com
VERITAS Dynamic Multipathing (DMP) www.veritas.com
Categories: HA, Hardware, IT Architecture, SAN, Storage Tags:

SAN ports

December 10th, 2013 No comments

Basic SAN port modes of operation

The port’s mode of operation depends on what’s connected to the other side
of the port. Here are two general examples:
✓ All hosts (servers) and all storage ports operate as nodes (that is, places
where the data either originates or ends up), so their ports are called
N_Ports (node ports).
✓ All hub ports operate as loops (that is, places where the data travels in a
small Fibre Channel loop), so they’re called L_Ports (loop ports).
Switch ports are where it gets tricky. That’s because switch ports have
multiple personalities: They become particular types of ports depending on what
gets plugged into them (check out Table 2-2 to keep all these confusing port
types straight). Here are some ways a switch port changes its function to
match what’s connected to it:
✓ Switch ports usually hang around as G_Ports (global ports) when nothing is
plugged into them. A G_Port doesn’t get a mode of operation until
something is plugged into it.
✓ If you plug a host into a switch port, it becomes an F_Port (fabric port).
The same thing happens if you plug in a storage array that’s running the
Fibre Channel-Switched (FC-SW) Protocol (more about this protocol in
the next section).
✓ If you plug a hub into a switch port, you get an FL_Port (fabric-to-loop
port); hub ports by themselves are always L_Ports (loop ports).
✓ When two switch ports are connected together, they become their own
small fabric, known as an E_Port (switch-to-switch expansion port) or a
T_Port ( Trunk port).
✓ A host port is always an N_Port (node port) — unless it’s attached to a
hub, in which case it’s an NL_port (node-to-loop port).
✓ A storage port, like a host port, is always an N_Port — unless it’s
connected to a hub, in which case it’s an NL_Port.
If that seems confusing, it used to be worse. Believe it or not, different switch
vendors used to name their ports differently, which confused everyone. Then
the Storage Network Industry Association (SNIA) came to save the day and
standardized the names you see in Figure 2-19.
If you want to get a good working handle on what’s going on in your SAN, use
Table 2-2 to find out what the port names mean after all the plugging-in is done.

Protocols used in a Fibre Channel SAN

Protocols are, in effect, an agreed-upon set of terms that different computer
devices use to communicate with one another. A protocol can be thought of
as the common language used by different types of networks. You’ll
encounter three basic protocols in the Fibre Channel world:
✓ FC-AL: Fibre Channel-Arbitrated Loop Protocol is used by two devices
communicating within a Fibre Channel loop (created by plugging the
devices into a hub). Fibre Channel loops use hubs for the cable
connections among all the SAN devices. Newer storage arrays that have internal
fiber disks use Fibre Channel loops to connect the disks to the array,
which is why they can have so many disks inside: Each loop can handle
126 disks, and you can have many loops in the array. The array uses the
FC-AL protocol to talk to the disks.
Each of the possible 126 devices on a Fibre Channel loop takes a turn
communicating with another device on the loop. Only one conversation
can occur at a time; the protocol determines who gets to talk when.
Every device connected to the loop gets a loop address (loop ID) that
determines its priority when it uses the loop to talk.
✓ FC-SW: Fibre Channel-Switched Protocol is used by two devices
communicating on a Fibre Channel switch. Switch ports are connected over a
backplane, which allows any device on the switch to talk to any other
device on the switch at the same time. Many conversations can occur
simultaneously through the switch. A switched fabric is created by
connecting Fibre Channel switches; such a fabric can have thousands of
devices connected to it.
Each device in a fabric has an address called a World Wide Name (WWN)
that’s hard-coded at the factory onto the host bus adapter (HBA) that
goes into every server and every storage port. The WWN is like the
telephone number of a device in the fabric (or like the MAC address of
a network card). When the device is connected to the fabric, it logs in to
the fabric port, and its WWN registers in the name server so the switch

knows it’s connected to that port. The WWN is also sometimes called a
WWPN, or World Wide Port Name.
The WWN and a WWPN are the exact same thing, the actual address
for a Fibre Channel port. In some cases, large storage arrays can also
have what is known as a WWNN, or World Wide Node Name. Some Fibre
Channel storage manufacturers use the WWNN for the entire array, and
then use an offset of the WWN for each port in the array for the WWPN.
I guess this is the Fibre Channel storage manufacturers’ way of making the
World Wide Names they were given by the standards bodies last longer.
You can think of the WWNN as the device itself, and the WWPN as the
actual port within the device, but in the end, it’s all just a WWN.
The name server is like a telephone directory. When one device wants
to talk to another in the fabric, it uses the other device’s phone number
to call it up. The switch protocol acts like the telephone operator. The
first device asks the operator what the other device’s phone number is.
The operator locates the number in the directory (the name server) in
the switch, and then routes the call to the port where the other device is
located.
There is a trick you can use to determine whether the WWN refers to a
server on the fabric or a storage port on the fabric. Most storage ports’
WWNs always start with the number 5, and most host bus adapters’ start
with either a 10 or a 21 as the first hexadecimal digits in the WWN. Think
of it like the area code for the phone number. If you see a number like
50:06:03:81:D6:F3:10:32, it’s probably a port on a storage array. A
number like 10:00:00:01:a9:42:fc:06 will be a server’s HBA WWN (a quick
host-side check is sketched after this list).
✓ SCSI: The SCSI protocol is used by a computer application to talk to its
disk-storage devices. In a SAN, the SCSI protocol is layered on top of
either the FC-AL or FC-SW protocol to enable the application to get to
the disk drives within the storage arrays in a Fibre Channel SAN. This
makes Fibre Channel backward-compatible with all the existing applications
that still use the SCSI protocol to talk to disks inside servers. If the
SCSI protocol was not used, all existing applications would have needed
to be recompiled to use a different method of talking to disk drives.
SCSI works a bit differently in a SAN from the way it does when it talks to
a single disk drive inside a server. SCSI inside a server runs over copper
wires, and data is transmitted in parallel across the wires. In a SAN, the
SCSI protocol is serialized, so each bit of data can be transmitted as a
pulse of light over a fiber-optic cable. If you want to connect older parallel
SCSI-based devices in a SAN, you have to use a data router, which acts as a
bridge between the serial SCSI used in a SAN and the parallel SCSI used in
the device. (See “Data routers,” earlier in this chapter, for the gory details.)
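As a quick illustration of that numbering pattern, on a Linux host with Fibre Channel HBAs you can read the HBA WWNs from sysfs (this assumes the fc_host class is present; storage-side WWNs are normally checked on the array or the switch instead):

cat /sys/class/fc_host/host*/port_name     # HBA WWPNs, typically starting with 0x10... or 0x21...
cat /sys/class/fc_host/host*/node_name     # HBA WWNNs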
Although iSCSI and Infiniband protocols can also be used in storage networks,
the iSCSI protocol is used over an IP network and then usually bridged into

a Fibre Channel SAN. Infiniband, on the other hand, is used over a dedicated
Infiniband network as a server interconnect, and then bridged into a Fibre
Channel SAN for storage access. But the field is always changing: Infiniband
and iSCSI storage arrays are now becoming available, but they still use either
an IP or IB interface rather than FC.

Fabric addressing

The addressing scheme used in SAN fabrics is quite different than that in SAN
loops. A fabric can contain thousands of devices rather than the maximum
127 in a loop. Each device in the fabric must have a unique address, just as
every phone number in the world is unique. This is done by assigning every
device in a SAN fabric a World Wide Name (WWN).
What in the world is a World Wide Name?
Each device on the network has a World Wide Name, a 64-bit hexadecimal
number coded into it by its manufacturer. The WWN is often assigned via a
standard block of addresses made available for manufacturers to use. Thus
every device in a SAN fabric has a built-in address assigned by a central
naming authority — in this case, one of the standard-setting organizations
that control SAN standards — the Institute of Electrical and Electronics
Engineers (IEEE, pronounced eye triple-e). The WWN is sometimes referred to
by its IEEE address. A typical WWN in a SAN will look something like this:
20000000C8328FE6
On some devices, such as large storage arrays, the storage array itself is
assigned the WWN and the manufacturer then uses the assigned WWN as the
basis for virtual WWNs, which add sequential numbers to identify ports.
The WWN of the storage array is known as the World Wide Node Name or
WWNN. The resulting WWN of the port on the storage array is known as the

World Wide Port Name or WWPN. If the base WWN is (say) 20000000C8328F00
and the storage array has four ports, the array manufacturer could use the
assigned WWN as the base, and then use offsets to create the WWPN for each
port, like this:
20000000C8328F01 for port 1
20000000C8328F02 for port 2
20000000C8328F03 for port 3
20000000C8328F04 for port 4
The manufacturers can use offsets to create World Wide Names as long as
the offsets used do not overlap with any other assigned WWNs from the
block of addresses assigned to the manufacturer.
When it comes to Fibre Channel addressing, the term WWN always refers
to the WWPN of the actual ports, which are like the MAC addresses of an
Ethernet network card. The WWPN (now forever referred to as the WWN for
short) is always used in the name server in the switch to identify devices on
the SAN.

The name server

The name server is a logical service (a specialized program that runs in the
SAN switches) used by the devices connected to the SAN to locate other
devices. The name server in the switched fabric acts like a telephone
directory listing. When a device is plugged into a switch, it logs in to the switch (a
process like logging in to your PC) and registers itself with the name server.
The name server uses its own database to store the WWN information for
every device connected to the fabric, as well as the switch port information
and the associated WWN of each device. When one device wants to talk to
another in the fabric, it looks up that device’s address (its WWN) in the name
server, finds out which port the device is located on, and communication is
then routed between the two devices.
Figure 3-5 shows the name server’s lookup operation in action. The arrows
show how the server on Switch 1 (with address 20000000C8328FE6)
locates the address of the storage device on Switch 2 (at address
50000000B2358D34). After it finds the storage device’s address in the name
server, it knows which switch it’s located on and how to get to the device.
When a network gets big enough to have a few hundred devices connected to a
bunch of switches, the use of a directory listing inside the fabric makes sense.

The switches’ name server information can be used to troubleshoot problems
in a SAN. If your device is connected to a switch but doesn’t get registered in
the name server table, then you know that the problem is somewhere between
the server and the switch; you may have a bad cable. (See Chapter 12 for more
SAN troubleshooting tips.)
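For example, on a Brocade switch you can compare the physical port state with the name server registrations (the command names assume Brocade Fabric OS; other switch vendors have their own equivalents):

switch:admin> switchshow       # per-port state and the WWN logged in on each port
switch:admin> nsshow           # local name server entries (WWPN, WWNN, FC address)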

fabric name server

Note: this article is from book <Storage Area Networks For Dummies®>.

Categories: Hardware, SAN, Storage Tags:

Understanding the Benefits of a SAN

December 10th, 2013 No comments

The typical benefits of using a SAN are a very high return on investment (ROI),
a reduction in the total cost of ownership (TCO) of computing capabilities, and
a pay-back period (PBP) of months rather than years. Here are some specific
ways you can expect a SAN to be beneficial:

✓ Removes the distance limits of SCSI-connected disks: The maximum
length of a SCSI bus is around 25 meters. Fibre Channel SANs allow you
to connect your disks to your servers over much greater distances.
✓ Greater performance: Current Fibre Channel SANs allow connection
to disks at hundreds of megabytes per second; the near future will see
speeds in multiple gigabytes to terabytes per second.
✓ Increased disk utilization: SANs enable more than one server to access
the same physical disk, which lets you allocate the free space on those
disks more effectively.
✓ Higher availability to storage by use of multiple access paths: A SAN
allows for multiple physical connections to disks from a single or
multiple servers.
✓ Deferred disk procurement: That’s business-speak for not having to buy
disks as often as you used to before getting a SAN. Because you can use
disk space more effectively, no space goes to waste.
✓ Reduced data center rack/floor space: Because you don’t need to buy big
servers with room for lots of disks, you can buy fewer, smaller servers —
an arrangement that takes up less room.
✓ New disaster-recovery capabilities: This is a major benefit. SAN devices
can mirror the data on the disks to another location. This thorough
backup capability can make your data safe if a disaster occurs.
✓ Online recovery: By using online mirrors of your data in a SAN device,
or new continuous data protection solutions, you can instantly recover
your data if it becomes lost, damaged, or corrupted.

✓ Better staff utilization: SANs enable fewer people to manage much
more data.
✓ Reduction of management costs as a percentage of storage costs:
Because you need fewer people, your management costs go down.
✓ Improved overall availability: This is another big one. SAN storage is
much more reliable than internal, server-based disk storage. Things
break a lot less often.
✓ Reduction of servers: You won’t need as many file servers with a SAN.
And because SANs are so fast, even your existing servers run faster
when connected to the SAN. You get more out of your current servers
and don’t need to buy new ones as often.
✓ Improved network performance and fewer network upgrades: You can
back up all your data over the SAN (which is dedicated to that purpose)
rather than over the LAN (which has other duties). Since you use less
bandwidth on the LAN, you can get more out of it.
✓ Increased input/output (I/O) performance and bulk data movement:
Yup, SANs are fast. They move data much faster than do internal drives
or devices attached to the LAN. In high-performance computing
environments, for example, IB (Infiniband) storage-network technology can
move a single data stream at multiple gigabytes per second.
✓ Reduced/eliminated backup windows: A backup window is the time it
takes to back up all your data. When you do your backups over the SAN
instead of over the LAN, you can do them at any time, day or night. If
you use CDP (Continuous Data Protection) solutions over the SAN, you
can pretty much eliminate backup as a separate process (it just happens
all the time).
✓ Protected critical data: SAN storage devices use advanced technology
to ensure that your critical data remains safe and available.
✓ Nondisruptive scalability: Sounds impressive, doesn’t it? It means you
can add storage to a storage network at any time without affecting the
devices currently using the network.
✓ Easier development and testing of applications: By using SAN-based
mirror copies of production data, you can easily use actual production
data to test new applications while the original application stays online.
✓ Support for server clusters: Server clustering is a method of making two
individual servers look like one and guard each other’s back. If one of
them has a heart attack, the other one takes over automatically to keep
the applications running. Clusters require access to a shared disk drive;
a SAN makes this possible.
✓ Storage on demand: Because SAN disks are available to any server in
the storage network, free storage space can be allocated on demand to
any server that needs it, any time. Storage virtualization can simplify
storage provisioning across storage arrays from multiple vendors.

Note:

This is from book <Storage Area Networks For Dummies®>.

Categories: Hardware, SAN, Storage Tags:

debugging nfs problem with snoop in solaris

December 3rd, 2013 No comments

Network analyzers are ultimately the most useful tools available when it comes to debugging NFS problems. The snoop network analyzer bundled with Solaris was introduced in Section 13.5. This section presents an example of how to use snoop to resolve NFS-related problems.

Consider the case where the NFS client rome attempts to access the contents of the filesystems exported by the server zeus through the /net automounter path:

rome% ls -la /net/zeus/export
total 5
dr-xr-xr-x   3 root     root           3 Jul 31 22:51 .
dr-xr-xr-x   2 root     root           2 Jul 31 22:40 ..
drwxr-xr-x   3 root     other        512 Jul 28 16:48 eng
dr-xr-xr-x   1 root     root           1 Jul 31 22:51 home
rome% ls /net/zeus/export/home
/net/zeus/export/home: Permission denied

 

The client is not able to open the contents of the directory /net/zeus/export/home, although the directory gives read and execute permissions to all users:

rome% df -k /net/zeus/export/home
filesystem            kbytes    used   avail capacity  Mounted on
-hosts                     0       0       0     0%    /net/zeus/export/home

 

The df command shows the -hosts automap mounted on the path of interest. This means that the NFS filesystem zeus:/export/home has not yet been mounted. To investigate the problem further, snoop is invoked while the problematic ls command is rerun:

 rome# snoop -i /tmp/snoop.cap rome zeus
  1   0.00000      rome -> zeus      PORTMAP C GETPORT prog=100003 (NFS) vers=3 
proto=UDP
  2   0.00314      zeus -> rome      PORTMAP R GETPORT port=2049
  3   0.00019      rome -> zeus      NFS C NULL3
  4   0.00110      zeus -> rome      NFS R NULL3 
  5   0.00124      rome -> zeus      PORTMAP C GETPORT prog=100005 (MOUNT) vers=1 
proto=TCP
  6   0.00283      zeus -> rome      PORTMAP R GETPORT port=33168
  7   0.00094      rome -> zeus      TCP D=33168 S=49659 Syn Seq=1331963017 Len=0 
Win=24820 Options=<nop,nop,sackOK,mss 1460>
  8   0.00142      zeus -> rome      TCP D=49659 S=33168 Syn Ack=1331963018 
Seq=4025012052 Len=0 Win=24820 Options=<nop,nop,sackOK,mss 1460>
  9   0.00003      rome -> zeus      TCP D=33168 S=49659     Ack=4025012053 
Seq=1331963018 Len=0 Win=24820
 10   0.00024      rome -> zeus      MOUNT1 C Get export list
 11   0.00073      zeus -> rome      TCP D=49659 S=33168     Ack=1331963062 
Seq=4025012053 Len=0 Win=24776
 12   0.00602      zeus -> rome      MOUNT1 R Get export list 2 entries
 13   0.00003      rome -> zeus      TCP D=33168 S=49659     Ack=4025012173 
Seq=1331963062 Len=0 Win=24820
 14   0.00026      rome -> zeus      TCP D=33168 S=49659 Fin Ack=4025012173 
Seq=1331963062 Len=0 Win=24820
 15   0.00065      zeus -> rome      TCP D=49659 S=33168     Ack=1331963063 
Seq=4025012173 Len=0 Win=24820
 16   0.00079      zeus -> rome      TCP D=49659 S=33168 Fin Ack=1331963063 
Seq=4025012173 Len=0 Win=24820
 17   0.00004      rome -> zeus      TCP D=33168 S=49659     Ack=4025012174 
Seq=1331963063 Len=0 Win=24820
 18   0.00058      rome -> zeus      PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 
proto=UDP
 19   0.00412      zeus -> rome      PORTMAP R GETPORT port=34582
 20   0.00018      rome -> zeus      MOUNT3 C Null
 21   0.00134      zeus -> rome      MOUNT3 R Null 
 22   0.00056      rome -> zeus      MOUNT3 C Mount /export/home
 23   0.23112      zeus -> rome      MOUNT3 R Mount Permission denied

 

Packet 1 shows the client rome requesting the port number of the NFS service (RPC program number 100003, Version 3, over the UDP protocol) from the server’s rpcbind (portmapper). Packet 2 shows the server’s reply indicating nfsd is running on port 2049. Packet 3 shows the automounter’s call to the server’s nfsd daemon to verify that it is indeed running. The server’s successful reply is shown in packet 4. Packet 5 shows the client’s request for the port number for RPC program number 100005, Version 1, over TCP (the RPC MOUNT program). The server replies with packet 6 with port=33168. Packets 7 through 9 are TCP handshaking between our NFS client and the server’s mountd. Packet 10 shows the client’s call to the server’s mountd daemon (which implements the MOUNT program) currently running on port 33168. The client is requesting the list of exported entries. The server replies with packet 12 including the names of the two entries exported. Packets 18 and 19 are similar to packets 5 and 6, except that this time the client is asking for the port number of the MOUNT program version 3 running over UDP. Packets 20 and 21 show the client verifying that version 3 of the MOUNT service is up and running on the server. Finally, the client issues the Mount /export/home request to the server in packet 22, requesting the filehandle of the /export/home path. The server’s mountd daemon checks its export list, determines that the host rome is not present in it, and replies to the client with a “Permission Denied” error in packet 23.

The analysis indicates that the “Permission Denied” error returned to the ls command came from the MOUNT request made to the server, not from problems with directory mode bits on the client. Having gathered this information, we study the exported list on the server and quickly notice that the filesystem /export/home is exported only to the host verona:

rome$ showmount -e zeus
export list for zeus:
/export/eng  (everyone)
/export/home verona

 

We could have obtained the same information by inspecting the contents of packet 12, which contains the export list requested during the transaction:

rome# snoop -i /tmp/cap -v -p 10,12
...
      Packet 10 arrived at 3:32:47.73
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 40
RPC:  Transaction id = 965581102
RPC:  Type = 0 (Call)
RPC:  RPC version = 2
RPC:  Program = 100005 (MOUNT), version = 1, procedure = 5
RPC:  Credentials: Flavor = 0 (None), len = 0 bytes
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  
MOUNT:----- NFS MOUNT -----
MOUNT:
MOUNT:Proc = 5 (Return export list)
MOUNT:
...
       Packet 12 arrived at 3:32:47.74
RPC:  ----- SUN RPC Header -----
RPC:  
RPC:  Record Mark: last fragment, length = 92
RPC:  Transaction id = 965581102
RPC:  Type = 1 (Reply)
RPC:  This is a reply to frame 10
RPC:  Status = 0 (Accepted)
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  Accept status = 0 (Success)
RPC:  
MOUNT:----- NFS MOUNT -----
MOUNT:
MOUNT:Proc = 5 (Return export list)
MOUNT:Directory = /export/eng
MOUNT:Directory = /export/home
MOUNT: Group = verona
MOUNT:

 

For simplicity, only the RPC and NFS Mount portions of the packets are shown. Packet 10 is the request for the export list, packet 12 is the reply. Notice that every RPC packet contains the transaction ID (XID), the message type (call or reply), the status of the call, and the credentials. Notice that the RPC header includes the string “This is a reply to frame 10”. This is not part of the network packet. Snoop keeps track of the XIDs it has processed and attempts to match calls with replies and retransmissions. This feature comes in very handy during debugging. The Mount portion of packet 12 shows the list of directories exported and the group of hosts to which they are exported. In this case, we can see that /export/home was only exported with access rights to the host verona. The problem can be fixed by adding the host rome to the export list on the server.
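A sketch of that fix on the Solaris server zeus (the exact options depend on how /export/home is already shared; this simply adds rome alongside verona):

zeus# share -F nfs -o rw=verona:rome /export/home     # re-share to both clients on the running system
zeus# vi /etc/dfs/dfstab                              # make the same change persistent across reboots
zeus# shareall                                        # re-share everything listed in dfstab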

PS:

Troubleshooting NFS locking problems in solaris

November 29th, 2013 No comments

Lock problems will be evident when an NFS client tries to lock a file, and it fails because someone has it locked. For applications that share access to files, the expectation is that locks will be short-lived. Thus, the pattern your users will notice when something is awry is that yesterday an application started up quite quickly, but today it hangs. Usually it is because an NFS/NLM client holds a lock on a file that your application needs to lock, and the holding client has crashed.

11.3.1. Diagnosing NFS lock hangs

On Solaris, you can use tools like pstack and truss to verify that processes are hanging in a lock request:

client1% ps -eaf | grep SuperApp
     mre 23796 10031  0 11:13:22 pts/6    0:00 SuperApp
client1% pstack 23796
23796:  SuperApp
 ff313134 fcntl    (1, 7, ffbef9dc)
 ff30de48 fcntl    (1, 7, ffbef9dc, 0, 0, 0) + 1c8
 ff30e254 lockf    (1, 1, 0, 2, ff332584, ff2a0140) + 98
 0001086c main     (1, ffbefac4, ffbefacc, 20800, 0, 0) + 1c
 00010824 _start   (0, 0, 0, 0, 0, 0) + dc
client1% truss -p 23796
fcntl(1, F_SETLKW, 0xFFBEF9DC)  (sleeping...)

 

This verifies that the application is stuck in a lock request. We can use pfiles to see what is going on with the files of process 23796:

client1% pfiles 23796
pfiles 23796
23796:  SuperApp
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:136,0 ino:37990 uid:466 gid:7 rdev:24,37
      O_RDWR
   1: S_IFREG mode:0644 dev:208,1823 ino:5516985 uid:466 gid:300 size:0
      O_WRONLY|O_LARGEFILE
      advisory write lock set by process 3242
   2: S_IFCHR mode:0620 dev:136,0 ino:37990 uid:466 gid:7 rdev:24,37
      O_RDWR

 

That we are told that there is an advisory lock set on file descriptor 1 that is set by another process, process ID 3242, is useful, but unfortunately it doesn’t tell us if 3242 is a local process or a process on another NFS client or NFS server. We also aren’t told if the file mapped to file descriptor 1 is a local file, or an NFS file. We are, however, told that the major and minor device numbers of the filesystem are 208 and 1823 respectively. If you run the mount command without any arguments, this dumps the list of mounted file systems. You should see a display similar to:

/ on /dev/dsk/c0t0d0s0 read/write/setuid/intr/largefiles/onerror=panic/dev=2200000 
on Thu Dec 21 11:13:33 2000
/usr on /dev/dsk/c0t0d0s6 read/write/setuid/intr/largefiles/onerror=panic/
dev=2200006 on Thu Dec 21 11:13:34 2000
/proc on /proc read/write/setuid/dev=31c0000 on Thu Dec 21 11:13:29 2000
/dev/fd on fd read/write/setuid/dev=32c0000 on Thu Dec 21 11:13:34 2000
/etc/mnttab on mnttab read/write/setuid/dev=3380000 on Thu Dec 21 11:13:35 2000
/var on /dev/dsk/c0t0d0s7 read/write/setuid/intr/largefiles/onerror=panic/
dev=2200007 on Thu Dec 21 11:13:40 2000
/home/mre on spike:/export/home/mre remote/read/write/setuid/intr/dev=340071f on 
Thu Dec 28 08:51:30 2000

 

The numbers after dev= are in hexadecimal. Device numbers are constructed by taking the major number, shifting it left several bits, and then adding the minor number. Convert the minor number 1823 to hexadecimal, and look for it in the mount table:

client1% printf "%x\n" 1823
71f
client1% mount | grep 'dev=.*71f'
/home/mre on spike:/export/home/mre remote/read/write/setuid/intr/dev=340071f on 
Thu Dec 28 08:51:30 2000

 

We now know four things:

  • This is an NFS file we are blocking on.
  • The NFS server name is spike.
  • The filesystem on the server is /export/home/mre.
  • The inode number of the file is 5516985.

One obvious cause you should first eliminate is whether the NFS server spike has crashed or not. If it hasn’t crashed, then the next step is to examine the server.

11.3.2. Examining lock state on NFS/NLM servers

Solaris and other System V-derived systems have a useful tool called crash for analyzing system state. Crash actually reads the Unix kernel’s memory and formats its data structures in a more human readable form. Continuing with the example from Section 11.3.1, assuming /export/home/mre is a directory on a UFS filesystem, which can be verified by doing:

spike# df -F ufs | grep /export
/export               (/dev/dsk/c0t0d0s7 ):  503804 blocks   436848 files

 

then you can use crash to get more lock state.

The crash command is like a shell, but with internal commands for examining kernel state. The internal command we will be using is lck :

spike# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> lck
Active and Sleep Locks:
INO         TYP  START END     PROC  PID  FLAGS STATE   PREV     NEXT     LOCK
30000c3ee18  w   0      0       13   136   0021 3       48bf0f8  ae9008   6878d00 
30000dd8710  w   0      MAXEND  17   212   0001 3       8f1a48   8f02d8   8f0e18  
30001cce1c0  w   193    MAXEND  -1   3242  2021 3       6878850  c43a08   2338a38 

Summary From List:
 TOTAL    ACTIVE  SLEEP
   3      3       0
>

 

An important field is PROC. PROC is the “slot” number of the process. If it is -1, that indicates that the lock is being held by a nonlocal (i.e., an NFS client) process, and the PID field thus indicates the process ID, relative to the NFS client. In the sample display, we see one such entry:

30001cce1c0  w   193    MAXEND  -1   3242  2021 3       6878850  c43a08   2338a38

 

Note that the process id, 3242, is equal to that which the pfiles command displayed earlier in this example. We can confirm that this lock is for the file in question via crash’s uinode command:

> uinode 30001cce1c0
UFS INODE MAX TABLE SIZE = 34020
ADDR         MAJ/MIN   INUMB  RCNT LINK   UID   GID    SIZE    MODE  FLAGS
30001cce1c0  136,  7   5516985   2    1   466   300    403  f---644  mt rf
>

 

The inode numbers match what pfiles earlier displayed on the NFS client. However, inode numbers are unique per local filesystem. We can make doubly sure this is the file by comparing the major and minor device numbers from the uinode command, 136 and 7, with that of the filesystem that is mounted on /export :

spike# ls -lL /dev/dsk/c0t0d0s7
brw-------   1 root     sys      136,  7 May  6  2000 /dev/dsk/c0t0d0s7
spike#

11.3.3. Clearing lock state

Continuing with our example from Section 11.3.2, at this point we know that the file is locked by another NFS client. Unfortunately, we don’t know which client it is, as crash won’t give us that information. We do however have a potential list of clients in the server’s /var/statmon/sm directory:

spike# cd /var/statmon/sm
spike# ls
client1       ipv4.10.1.0.25  ipv4.10.1.0.26  gonzo      java

 

The entries prefixed with ipv4 are just symbolic links to other entries. The non-symbolic link entries identify the hosts we want to check for.

The most likely cause of the lock not getting released is that the holding NFS client has crashed. You can take the list of hosts from the /var/statmon/sm directory and check if any are dead, or not responding due to a network partition. Once you determine which are dead, you can use Solaris’s clear_locks command to clear lock state. Let’s suppose you determine that gonzo is dead. Then you would do:

spike# clear_locks gonzo

 

If clearing the lock state of dead clients doesn’t fix the problem, then perhaps a now-live client crashed, but for some reason after it rebooted, its status monitor did not send a notification to the NLM server’s status monitor. You can log onto the live clients and check if they are currently mounting the filesystem from the server (in our example, spike:/export). If they are not, then you should consider using clear_locks to clear any residual lock state those clients might have had.

Ultimately, you may be forced to reboot your server. Short of that, there are other things you could do. Since you know the inode number and filesystem of the file in question, you can determine the file’s name:

spike# cd /export
find . -inum 5516985 -print
./home/mre/database

 

You could rename the file database to something else, and copy it back to a file named database. Then kill and restart the SuperApp application on client1. Of course, such an approach requires intimate knowledge of or experience with the application to know whether this will be safe.
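A sketch of that rename-and-copy step on the server, using the paths from the example above (stop SuperApp on the clients first, and only do this if you are sure it is safe for the application):

spike# cd /export/home/mre
spike# mv database database.locked      # the stale NLM lock stays attached to the old inode
spike# cp -p database.locked database   # fresh inode with no locks held on it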

PS:

This article is from book <Managing NFS and NIS, Second Edition>.

Categories: NAS, Storage Tags:

nfs null map – white out any map entry affecting directory

November 25th, 2013 No comments

The automounter also has a map “white-out” feature, via the -null special map. It is used after a directory to effectively delete any map entry affecting that directory from the automounter’s set of maps. It must precede the map entry being deleted. For example:

/tools -null

This feature is used to override auto_master or direct map entries that may have been inherited from an NIS map. If you need to make per-machine changes to the automounter maps, or if you need local control over a mount point managed by the automounter, white-out the conflicting map entry with the -null map.

PS: this is from book <>

Categories: NAS, Storage Tags:

nfs direct map vs indirect map

November 25th, 2013 No comments
  • Indirect maps

Here is an indirect automounter map for the /tools directory, called auto_tools:

deskset         -ro      mahimahi:/tools2/deskset 
sting                    mahimahi:/tools2/sting 
news                     thud:/tools3/news 
bugview                  jetstar:/usr/bugview

 

The first field is called the map key and is the final component of the mount point. The map name suffix and the mount point do not have to share the same name, but adopting this convention makes it easy to associate map names and mount points. This four-entry map is functionally equivalent to the /etc/vfstab excerpt:

mahimahi:/tools2/deskset - /tools/deskset  nfs - - ro 
mahimahi:/tools2/sting   - /tools/sting    nfs - -  
thud:/tools3/news       - /tools/news     nfs - -  
jetstar:/usr/bugview    - /tools/bugview  nfs - -
  • Direct maps

Direct maps define point-specific, nonuniform mount points. The best example of the need for a direct map entry is /usr/man. The /usr directory contains numerous other entries, so it cannot be an indirect mount point. Building an indirect map for /usr/man that uses /usr as a mount point will “cover up” /usr/bin and /usr/etc. A direct map allows the automounter to complete mounts on a single directory entry.

The key in a direct map is a full pathname, instead of the last component found in the indirect map. Direct maps also follow the /etc/auto_contents naming scheme. Here is a sample /etc/auto_direct:
/usr/man wahoo:/usr/share/man
/usr/local/bin mahimahi:/usr/local/bin.sun4

A major difference in behavior is that the real direct mount points are always visible to ls and other tools that read directory structures. The automounter treats direct mounts as individual directory entries, not as a complete directory, so the automounter gets queried whenever the directory containing the mount point is read. Client performance is affected in a marked fashion if direct mount points are used in several well-traveled directories. When a user reads a directory containing a number of direct mounts, the automounter initiates a flurry of mounting activity in response to the directory read requests. Section 9.5.3 describes a trick that lets you use indirect maps instead of direct maps. By using this trick, you can avoid mount storms caused by multiple direct mount points.

Contents from /etc/auto_master:
# Directory Map NFS Mount Options
/tools /etc/auto_tools -ro #this is indirect map
/- /etc/auto_direct #this is direct map
PS:
This article is mostly from book <Managing NFS and NIS, Second Edition>

Categories: NAS, Storage Tags:

SAN Terminology

September 13th, 2013 No comments
Term
Description
SCSI Target
A SCSI Target is a storage system end-point that provides a service of processing SCSI commands and I/O requests from an initiator. A SCSI Target is created by the storage system’s administrator, and is identified by unique addressing methods. A SCSI Target, once configured, consists of zero or more logical units.
SCSI Initiator
A SCSI Initiator is an application or production system end-point that is capable of initiating a SCSI session, sending SCSI commands and I/O requests. SCSI Initiators are also identified by unique addressing methods (See SCSI Targets).
Logical Unit
A Logical Unit is a term used to describe a component in a storage system. Uniquely numbered, this creates what is referred to as a Logical Unit Number, or LUN. A storage system, being highly configurable, may contain many LUNs. These LUNs, when associated with one or more SCSI Targets, form a unique SCSI device, a device that can be accessed by one or more SCSI Initiators.
iSCSI
Internet SCSI, a protocol for sharing SCSI based storage over IP networks.
iSER
iSCSI Extension for RDMA, a protocol that maps the iSCSI protocol over a network that provides RDMA services (i.e. InfiniBand). The iSER protocol is transparently selected by the iSCSI subsystem, based on the presence of correctly configured IB hardware. In the CLI and BUI, all iSER-capable components (targets and initiators) are managed as iSCSI components.
FC
Fibre Channel, a protocol for sharing SCSI based storage over a storage area network (SAN), consisting of fiber-optic cables, FC switches and HBAs.
SRP
SCSI RDMA Protocol, a protocol for sharing SCSI based storage over a network that provides RDMA services (i.e. InfiniBand).
IQN
An iSCSI qualified name, the unique identifier of a device in an iSCSI network. iSCSI uses the form iqn.date.authority:uniqueid for IQNs. For example, the appliance may use the IQN: iqn.1986-03.com.sun:02:c7824a5b-f3ea-6038-c79d-ca443337d92c to identify one of its iSCSI targets. This name shows that this is an iSCSI device built by a company registered in March of 1986. The naming authority is just the DNS name of the company reversed, in this case, “com.sun”. Everything following is a unique ID that Sun uses to identify the target.
Target portal
When using the iSCSI protocol, the target portal refers to the unique combination of an IP address and TCP port number by which an initiator can contact a target.
Target portal group
When using the iSCSI protocol, a target portal group is a collection of target portals. Target portal groups are managed transparently; each network interface has a corresponding target portal group with that interface’s active addresses. Binding a target to an interface advertises that iSCSI target using the portal group associated with that interface.
CHAP
Challenge-handshake authentication protocol, a security protocol which can authenticate a target to an initiator, an initiator to a target, or both.
RADIUS
A system for using a centralized server to perform CHAP authentication on behalf of storage nodes.
Target group
A set of targets. LUNs are exported over all the targets in one specific target group.
Initiator group
A set of initiators. When an initiator group is associated with a LUN, only initiators from that group may access the LUN.
Categories: Hardware, SAN, Storage Tags: ,

make label for swap device using mkswap and blkid

August 6th, 2013 No comments

If you want to label a swap partition in Linux, you should not use e2label for this purpose, as e2label is for changing the label on an ext2/ext3/ext4 filesystem, which does not include the swap filesystem.

If you use e2label for this, you will get the following error messages:

[root@node2 ~]# e2label /dev/xvda3 SWAP-VM
e2label: Bad magic number in super-block while trying to open /dev/xvda3
Couldn’t find valid filesystem superblock.

We should use mkswap instead, as mkswap has an option -L:

-L label    Specify a label, to allow swapon by label. (Only for new style swap areas.)

So let’s see example below:

[root@node2 ~]# mkswap -L SWAP-VM /dev/xvda3
Setting up swapspace version 1, size = 2335973 kB
LABEL=SWAP-VM, no uuid

[root@node2 ~]# blkid
/dev/xvda1: LABEL="/boot" UUID="6c5ad2ad-bdf5-4349-96a4-efc9c3a1213a" TYPE="ext3"
/dev/xvda2: LABEL="/" UUID="76bf0aaa-a58e-44cb-92d5-098357c9c397" TYPE="ext3"
/dev/xvdb1: LABEL="VOL1" TYPE="oracleasm"
/dev/xvdc1: LABEL="VOL2" TYPE="oracleasm"
/dev/xvdd1: LABEL="VOL3" TYPE="oracleasm"
/dev/xvde1: LABEL="VOL4" TYPE="oracleasm"
/dev/xvda3: LABEL="SWAP-VM" TYPE="swap"

[root@node2 ~]# swapon /dev/xvda3

[root@node2 ~]# swapon -s
Filename Type Size Used Priority
/dev/xvda3 partition 2281220 0 -1

So now we can add swap to /etc/fstab using LABEL=SWAP-VM:

LABEL=SWAP-VM           swap                    swap    defaults        0 0

Categories: Linux, Storage Tags: ,

iostat dm- mapping to physical device

July 30th, 2013 No comments

-bash-3.2# iostat -xn 2

avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.48 0.00 0.21 99.29

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 1949.00 0.00 129648.00 66.52 30.66 15.77 0.51 100.20
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 1139.00 0.00 88752.00 77.92 22.92 20.09 0.83 95.00

Device: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s rops/s wops/s
nas-host:/export/test/repo 0.00 0.00 0.00 218444.00 0.00 218332.00 3084.50 3084.50

Then how can we know which physical device dm-3 maps to?

-bash-3.2# cat /sys/block/dm-3/dev
253:3 #this is major, minor number of dm-3

 

-bash-3.2# dmsetup ls
dmnfs6 (253, 6)
dmnfs5 (253, 5)
dmnfs4 (253, 4)
dmnfs3 (253, 3)
dmnfs2 (253, 2)
dmnfs1 (253, 1)
dmnfs0 (253, 0)

-bash-3.2# ls -l /dev/mapper/dmnfs3
brw-rw—- 1 root disk 253, 3 Jul 30 08:28 /dev/mapper/dmnfs3

-bash-3.2# dmsetup status
dmnfs6: 0 27262976 nfs nfs 0
dmnfs5: 0 476938240 nfs nfs 0
dmnfs4: 0 29543535 nfs nfs 0
dmnfs3: 0 1536000000 nfs nfs 0 #device mapper of nfs, dm_nfs
dmnfs2: 0 29543535 nfs nfs 0
dmnfs1: 0 169508864 nfs nfs 0
dmnfs0: 0 29543535 nfs nfs 0

So now we know that it is NFS that caused the busy I/O.

PS:

device mapper(dm_mod module, dmsetup ls) http://en.wikipedia.org/wiki/Device_mapper

multiple devices(software raid, mdraid, /proc/mdstat) http://linux.die.net/man/8/mdadm and https://raid.wiki.kernel.org/index.php/Linux_Raid

dmraid(fake raid) https://wiki.archlinux.org/index.php/Installing_with_Fake_RAID

DM-MPIO(DM-Multipathing, multipath/multipathd, dm_multipath module, combined with SAN) http://en.wikipedia.org/wiki/Linux_DM_Multipath

Categories: Storage Tags:

resolved – differences between zfs ARC L2ARC ZIL

January 31st, 2013 No comments
  • ARC

zfs ARC(adaptive replacement cache) is a very fast cache located in the server’s memory.

For example, our ZFS server with 12GB of RAM has 11GB dedicated to ARC, which means our ZFS server will be able to cache 11GB of the most accessed data. Any read requests for data in the cache can be served directly from the ARC memory cache instead of hitting the much slower hard drives. This creates a noticeable performance boost for data that is accessed frequently.

  • L2ARC

As a general rule, you want to install as much RAM into the server as you can to make the ARC as big as possible. At some point, adding more memory is just cost prohibitive. That is where the L2ARC becomes important. The L2ARC is the second level adaptive replacement cache. The L2ARC is often called “cache drives” in the ZFS systems.

L2ARC is a new layer between Disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include 10K/15K RPM disks like short-stroked disks, solid state disks (SSD), and other media with substantially faster read latency than disk.

  • ZIL

ZIL (ZFS Intent Log) exists for performance improvement on synchronous writes. A synchronous write is much slower than an asynchronous write, but it is more reliable. Essentially, the intent log of a file system is nothing more than an insurance against power failures, a to-do list if you will, that keeps track of the stuff that needs to be updated on disk, even if the power fails (or something else happens that prevents the system from updating its disks).

To get better performance, use separate disks (SSDs) for the ZIL, for example: zpool add pool log c2d0.
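As a sketch, adding a separate log device (for the ZIL) and cache devices (for the L2ARC) to an existing pool looks like this on a plain ZFS system; the pool and device names are hypothetical:

zpool add tank log mirror c2d0 c2d1     # mirrored SSD log devices for synchronous writes
zpool add tank cache c2d2 c2d3          # SSD cache devices used as L2ARC
zpool status tank                       # the new vdevs show up under "logs" and "cache"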

Now here is a real example of ZFS ZIL/L2ARC/ARC on a Sun ZFS 7320 head:

test-zfs# zpool iostat -v exalogic
capacity operations bandwidth
pool alloc free read write read write
————————- —– —– —– —– —– —–
exalogic 6.78T 17.7T 53 1.56K 991K 25.1M
mirror 772G 1.96T 6 133 111K 2.07M
c0t5000CCA01A5FDCACd0 – - 3 36 57.6K 2.07M #these are the physical disks
c0t5000CCA01A6F5CF4d0 – - 2 35 57.7K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A6F5D00d0 – - 2 36 56.2K 2.07M
c0t5000CCA01A6F64F4d0 – - 2 35 57.3K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A76A7B8d0 – - 2 36 56.3K 2.07M
c0t5000CCA01A746CCCd0 – - 2 36 56.8K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A749A88d0 – - 2 35 56.7K 2.07M
c0t5000CCA01A759E90d0 – - 2 35 56.1K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A767FDCd0 – - 2 35 56.1K 2.07M
c0t5000CCA01A782A40d0 – - 2 35 57.1K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A782D10d0 – - 2 35 57.2K 2.07M
c0t5000CCA01A7465F8d0 – - 2 35 56.3K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A7597FCd0 – - 2 35 57.6K 2.07M
c0t5000CCA01A7828F4d0 – - 2 35 56.2K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A7829ACd0 – - 2 35 57.1K 2.07M
c0t5000CCA01A78278Cd0 – - 2 35 57.4K 2.07M
mirror 772G 1.96T 6 133 111K 2.07M
c0t5000CCA01A736000d0 – - 3 35 57.3K 2.07M
c0t5000CCA01A738000d0 – - 2 35 57.3K 2.07M
c0t5000A72030061B82d0 224M 67.8G 0 98 1 1.62M #ZIL(SSD write cache, ZFS Intent Log)
c0t5000A72030061C70d0 224M 67.8G 0 98 1 1.62M
c0t5000A72030062135d0 223M 67.8G 0 98 1 1.62M
c0t5000A72030062146d0 224M 67.8G 0 98 1 1.62M
cache – - – - – -
c2t2d0 334G 143G 15 6 217K 652K #L2ARC(SSD cache drives)
c2t3d0 332G 145G 15 6 215K 649K
c2t4d0 333G 144G 11 6 169K 651K
c2t5d0 333G 144G 13 6 192K 650K
c2t2d0 – - 0 0 0 0
c2t3d0 – - 0 0 0 0
c2t4d0 – - 0 0 0 0
c2t5d0 – - 0 0 0 0

And as for ARC:

test-zfs:> status memory show
Memory:
Cache 63.4G bytes #ARC
Unused 17.3G bytes
Mgmt 561M bytes
Other 491M bytes
Kernel 14.3G bytes

Categories: Kernel, NAS, SAN, Storage Tags: ,

sun zfs firmware upgrade howto

January 29th, 2013 No comments

This article is going to talk about upgrading firmware for the Sun ZFS 7320 (you may find it works for other series of Sun ZFS heads too):

Categories: NAS, SAN, Storage Tags:

perl script for monitoring sun zfs memory usage

January 16th, 2013 No comments

On zfs’s aksh, I can check memory usage with the following:

test-zfs:> status memory show
Memory:
Cache 719M bytes
Unused 15.0G bytes
Mgmt 210M bytes
Other 332M bytes
Kernel 7.79G bytes

So now I want to collect this memory usage information automatically for SNMP's use. Here are the steps:

cpan> o conf prerequisites_policy follow
cpan> o conf commit

Since the host uses a proxy to get on the internet, set the following in /etc/wgetrc:

http_proxy = http://www-proxy.us.example.com:80/
ftp_proxy = http://www-proxy.us.example.com:80/
use_proxy = on

Now install the Net::SSH::Perl perl module:

PERL_MM_USE_DEFAULT=1 perl -MCPAN -e 'install Net::SSH::Perl'

And to confirm that Net::SSH::Perl was installed, run the following command:

perl -e 'use Net::SSH::Perl' #no output is good, as it means the package was installed successfully

Now here goes the perl script to get the memory usage of sun zfs head:

[root@test-centos ~]# cat /var/tmp/mrtg/zfs-test-zfs-memory.pl
#!/usr/bin/perl
use strict;
use warnings;
use Net::SSH::Perl;
my $host = 'test-zfs';
my $user = 'root';
my $password = 'password';

my $ssh = Net::SSH::Perl->new($host);
$ssh->login($user,$password);
# run the aksh command on the ZFS head and capture its output
my ($stdout,$stderr,$exit) = $ssh->cmd("status memory show");
$ssh->cmd("exit");
if($stderr){
    print "ErrorCode:$exit\n";
    print "ErrorMsg:$stderr";
} else {
    my @std_arr = split(/\n/, $stdout);
    shift @std_arr;    # drop the "Memory:" header line
    foreach(@std_arr) {
        # normalize every value to GB: convert "...M bytes" figures, keep "...G bytes" as-is
        if ($_ =~ /.+\b\s+(.+)M\sbytes/){
            $_=$1/1024;
        }
        elsif($_ =~ /.+\b\s+(.+)G\sbytes/){
            $_=$1;
        }
        else{}
    }
    foreach(@std_arr) {
        print $_."\n";
    }
}
exit $exit;

PS:
If you get the following error messages during installation of a perl module:

[root@test-centos ~]# perl -MCPAN -e 'install SOAP::Lite'
CPAN: Storable loaded ok
CPAN: LWP::UserAgent loaded ok
Fetching with LWP:
ftp://ftp.perl.org/pub/CPAN/authors/01mailrc.txt.gz
LWP failed with code[500] message[LWP::Protocol::MyFTP: connect: Connection timed out]
Fetching with Net::FTP:
ftp://ftp.perl.org/pub/CPAN/authors/01mailrc.txt.gz

Trying with "/usr/bin/links -source" to get
ftp://ftp.perl.org/pub/CPAN/authors/01mailrc.txt.gz
ELinks: Connection timed out

Then you should check whether you are using a proxy to get on the internet (run cpan> o conf init to re-configure cpan; afterwards set http_proxy, ftp_proxy, and use_proxy in /etc/wgetrc as shown above).

 

Categories: NAS, Perl, Storage Tags: ,

zfs iops on nfs iscsi disk

January 5th, 2013 No comments

On the ZFS Storage 7000 series BUI, you may find statistics like the following:

This may seem quite weird: NFSv3 (3052) + iSCSI (1021) is larger than Disk (1583). Since IOPS for the NFSv3/iSCSI protocols eventually go to disk, why is the combined IOPS of the two protocols larger than the Disk IOPS?

Here’s the reason:

Disk operations for NFSv3 and iSCSI are logical operations. These logical operations are combined/optimized by the Sun ZFS storage and only then turn into physical Disk operations.

PS:

1. When access to disks is sequential (like VOD), disk throughput becomes the performance bottleneck rather than IOPS. In contrast, IOPS limits disk performance when access to the disks is random.

2. For NAS performance analysis, here are two good articles (in Chinese): http://goo.gl/Q2M7JE http://www.storageonline.com.cn/storage/nas/the-nas-performance-analysis-overview/

3. You may also wonder how Disk IOPS can be as high as 1583. This is because the number is the sum across all disk controllers of the ZFS storage system. Here are some ballpark numbers for HDD IOPS:

 

Categories: Hardware, NAS, SAN, Storage Tags:

zfs shared lun stoage set up for oracle RAC

January 4th, 2013 No comments
  • create iSCSI Target Group

Open zfs BUI, navigate through “Configuration” -> “SAN” -> “iSCSI Targets”. Then create new iSCSI Target by clicking plus sign. Give it an alias, and then select the Network interface(may be bond or LACP) you want to use. After creating this iSCSI target, drag the newly created target to the right side “iSCSI Target Groups” to create one iSCSI Target Group. You can give that iSCSI target group an name too. Note down the iSCSI Target Group’s iqn, this is important for later operations.(Network interfaces:use NAS interface. You can select multiple interfaces)

  • create iSCSI Initiator Group

Before going on to the next step, we first need to get the iSCSI initiator IQN of each host we want a LUN allocated to. On each host, execute the following command to get the iqn for iSCSI on the Linux platform (you can edit this file beforehand, for example to make the iqn name end with the `hostname` so it is easier to identify in later LUN operations; do a /etc/init.d/iscsi restart after modifying initiatorname.iscsi):

[root@test-host ~]# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=<your host’s iqn name>

Now go back to the zfs BUI and navigate through "Configuration" -> "SAN" -> "Initiators". On the left side, click "iSCSI Initiators", then click the plus sign. Enter the IQN you got from the previous step and give it a name (do this for each host you want an iSCSI LUN allocated to). After this, drag the newly created iSCSI initiator(s) from the left side to form new iSCSI Initiator Groups on the right side (drag two items from the left to the same item on the right to form a group).


  • create shared LUNs for iSCSI Initiator Group

After this, we now need to create LUNs for the iSCSI Initiator Group (so that shared LUNs can be allocated; for example, Oracle RAC needs shared storage). Click the diskette sign on the just-created iSCSI Initiator Group, select the project you want the LUN allocated from, give it a name, and assign the volume size. Select the right target group you created before (you can also create a new one, e.g. RAC, in Shares).

PS: You can also now go to "Shares" -> "Luns" and create LUN(s) using the target group you created and the default initiator group. Note that one LUN needs one iSCSI target, so you should create more iSCSI targets and add them to the iSCSI target group if you want more LUNs.

  • scan shared LUNs from hosts

Now we’re going to operate on linux hosts. On each host you want iSCSI LUN allocated, do the following steps:

iscsiadm -m discovery -t st -p <ip address of your zfs storage>(use cluster’s ip if there’s zfs cluster)
iscsiadm -m node -T <variable, iSCSI Target Group iqn> -p <ip address of your zfs storage> -l #or use output from above command ($1 is --portal, $3 is --targetname, -l is --login)
service iscsi restart

After these steps, your host(s) should now see the newly allocated iSCSI LUN(s); you can run fdisk -l to confirm.
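If you want the iSCSI logins to survive a reboot, and want to double-check what was discovered, a small sketch like the following may help (the target IQN and portal IP are the same placeholders as above; on many open-iscsi builds node.startup already defaults to automatic):

iscsiadm -m session -P 3   #list active sessions and the disks attached to them
iscsiadm -m node -T <variable, iSCSI Target Group iqn> -p <ip address of your zfs storage> --op update -n node.startup -v automatic   #log in automatically at boot
fdisk -l   #confirm the kernel created block devices for the new LUN(s)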

Good luck!

Categories: NAS, Storage Tags:

how to turn on hba flags connected to EMC arrays

October 3rd, 2012 No comments

As per EMC’s recommendation, the following flags should be enabled for VMware ESX hosts; if they are not, there will be performance issues:

Common_Serial_Number(C)
SCSI_3(SC3)
SPC2_Protocol_Version(SPC2)

Here’s the command that’ll do the trick:

sudo symmask -sid <sid> set hba_flags on C,SPC2,SC3 -enable -wwn <port wwn> -dir <dir number> -p <port number>
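If the same flags need to be set for several HBAs, a small loop saves typing. A minimal sketch, where the two WWNs are hypothetical examples and <sid>, <dir number> and <port number> are the same placeholders as above; the final refresh makes the updated masking database take effect:

for wwn in 10000000c9111111 10000000c9222222   #hypothetical host HBA port WWNs
do
sudo symmask -sid <sid> set hba_flags on C,SPC2,SC3 -enable -wwn $wwn -dir <dir number> -p <port number>
done
sudo symmask -sid <sid> refresh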

Categories: Hardware, NAS, SAN, Storage Tags:

Resolved – Errors found during scanning of LUN allocated from IBM XIV array

October 2nd, 2012 No comments

So here’s the story:
After the LUN (on an IBM XIV array) was allocated, we ran ‘xiv_fc_admin -R’ to make the LUN visible to the OS (testhost-db-clstr-vol_37 is the new LUN’s volume name):
root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
——————————————————————-
Device Vol Name Vol Id
——————————————————————-
/dev/dsk/c2t500173804EE40140d19s2 testhost-db-clstr-vol_37 1974
——————————————————————-
/dev/dsk/c2t500173804EE40150d19s2 testhost-db-clstr-vol_37 1974
——————————————————————-
/dev/dsk/c4t500173804EE40142d19s2 testhost-db-clstr-vol_37 1974
——————————————————————-
/dev/dsk/c4t500173804EE40152d19s2 testhost-db-clstr-vol_37 1974
——————————————————————-



/dev/vx/dmp/xiv0_16 testhost-db-clstr-vol_17 1922



Non-XIV Devices
——————–
Device
——————–
/dev/vx/dmp/disk_0
——————–
/dev/vx/dmp/disk_1
——————–
/dev/vx/dmp/disk_2
——————–
/dev/vx/dmp/disk_3
——————–

Then I ran ‘vxdctl enable’ in order to make the DMP device visible to the OS, but an error message appeared:
root@testhost01 # vxdctl enable
VxVM vxdctl ERROR V-5-1-0 Data Corruption Protection Activated – User Corrective Action Needed
VxVM vxdctl INFO V-5-1-0 To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
VxVM vxdctl INFO V-5-1-0 Then, execute ‘vxdisk rm’ on the following devices before reinitiating device discovery:
xiv0_18, xiv0_18, xiv0_18, xiv0_18

After this, the new LUN disappeared from the output of ‘xiv_devlist -o device,vol_name,vol_id’ (testhost-db-clstr-vol_37 disappeared), and xiv0_18 (the DMP device of the new LUN) turned into an ‘Unreachable device’, see below:

root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
—————————————————–
Device Vol Name Vol Id
—————————————————–



Non-XIV Devices
——————–
Device
——————–
/dev/vx/dmp/disk_0
——————–
/dev/vx/dmp/disk_1
——————–
/dev/vx/dmp/disk_2
——————–
/dev/vx/dmp/disk_3
——————–
Unreachable devices: /dev/vx/dmp/xiv0_18
Also, ‘vxdisk list’ showed:
root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
flags: error private autoconfig
pubpaths: block=/dev/vx/dmp/xiv0_18s2 char=/dev/vx/rdmp/xiv0_18s2
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=disabled
c4t500173804EE40152d19s2 state=disabled
c2t500173804EE40150d19s2 state=disabled
c2t500173804EE40140d19s2 state=disabled

I tried to format the new DMP device (xiv0_18), but it failed with the info below:
root@testhost01 # format -d /dev/vx/dmp/xiv0_18
Searching for disks…done

c2t500173804EE40140d19: configured with capacity of 48.06GB
c2t500173804EE40150d19: configured with capacity of 48.06GB
c4t500173804EE40142d19: configured with capacity of 48.06GB
c4t500173804EE40152d19: configured with capacity of 48.06GB
Unable to find specified disk ‘/dev/vx/dmp/xiv0_18′.

Also, ‘vxdisksetup -i’ failed with info below:
root@testhost01 # vxdisksetup -i /dev/vx/dmp/xiv0_18
prtvtoc: /dev/vx/rdmp/xiv0_18: No such device or address

And, ‘xiv_fc_admin -R’ failed with info below:
root@testhost01 # xiv_fc_admin -R
ERROR: Error during command execution: vxdctl enabled
====================================================
OK, that’s all of the symptoms and the headache, here’s the solution:
====================================================

1. Run ‘xiv_fc_admin -R’ (the error “ERROR: Error during command execution: vxdctl enabled” will be printed; ignore it. This step scans for the new LUN). You can also run devfsadm -c disk (not actually needed).
2. Now exclude problematic paths of the DMP device(you can check the paths from vxdisk list xiv0_18)
root@testhost01 # vxdmpadm exclude vxvm path=c4t500173804EE40142d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c4t500173804EE40152d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c2t500173804EE40150d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c2t500173804EE40140d19s2
3. Now run ‘vxdctl enable’; this time the following error message will NOT be shown:
VxVM vxdctl ERROR V-5-1-0 Data Corruption Protection Activated – User Corrective Action Needed
VxVM vxdctl INFO V-5-1-0 To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
VxVM vxdctl INFO V-5-1-0 Then, execute ‘vxdisk rm’ on the following devices before reinitiating device discovery:
xiv0_18, xiv0_18, xiv0_18, xiv0_18
4. Now include the problematic paths:
root@testhost01 # vxdmpadm include vxvm path=c4t500173804EE40142d19s2
root@testhost01 # vxdmpadm include vxvm path=c4t500173804EE40152d19s2
root@testhost01 # vxdmpadm include vxvm path=c2t500173804EE40150d19s2
root@testhost01 # vxdmpadm include vxvm path=c2t500173804EE40140d19s2

5. Run ‘vxdctl enable’. After this, you should now see the DMP device in output of ‘xiv_devlist -o device,vol_name,vol_id’
root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
—————————————————–
Device Vol Name Vol Id
—————————————————–



—————————————————–
/dev/vx/dmp/xiv0_18 testhost-db-clstr-vol_37 1974
—————————————————–



Non-XIV Devices
——————–
Device
——————–
/dev/vx/dmp/disk_0
——————–
/dev/vx/dmp/disk_1
——————–
/dev/vx/dmp/disk_2
——————–
/dev/vx/dmp/disk_3
——————–

6. ‘vxdisk list’ will now show the DMP device(xiv0_18) as ‘auto – - nolabel’, obviously we should now label the DMP device:
root@testhost01 # format -d xiv0_18
Searching for disks…done

c2t500173804EE40140d19: configured with capacity of 48.06GB
c2t500173804EE40150d19: configured with capacity of 48.06GB
c4t500173804EE40142d19: configured with capacity of 48.06GB
c4t500173804EE40152d19: configured with capacity of 48.06GB
Unable to find specified disk ‘xiv0_18′.

root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
flags: nolabel private autoconfig
pubpaths: block=/dev/vx/dmp/xiv0_18 char=/dev/vx/rdmp/xiv0_18
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
errno: Disk is not usable
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=enabled
c4t500173804EE40152d19s2 state=enabled
c2t500173804EE40150d19s2 state=enabled
c2t500173804EE40140d19s2 state=enabled

root@testhost01 # vxdisksetup -i /dev/vx/dmp/xiv0_18
prtvtoc: /dev/vx/rdmp/xiv0_18: Unable to read Disk geometry errno = 0x16

Not again! But don’t panic this time. Now run format for each subpath of the DMP device(can be found in output of vxdisk list xiv0_18), for example:
root@testhost01 # format c4t500173804EE40142d19s2

c4t500173804EE40142d19s2: configured with capacity of 48.06GB
selecting c4t500173804EE40142d19s2
[disk formatted]
FORMAT MENU:
disk – select a disk
type – select (define) a disk type
partition – select (define) a partition table
current – describe the current disk
format – format and analyze the disk
repair – repair a defective sector
label – write label to the disk
analyze – surface analysis
defect – defect list management
backup – search for backup labels
verify – read and display labels
save – save new disk/partition definitions
inquiry – show vendor, product and revision
volname – set 8-character volume name
!<cmd> – execute <cmd>, then return
quit
format> label
Ready to label disk, continue? yes

format> save
Saving new disk and partition definitions
Enter file name["./format.dat"]:
format> quit

7. After the subpaths are labelled, run ‘vxdctl enable’ again. After this, you’ll find the DMP device has changed its state from ‘auto - - nolabel’ to ‘auto:none - - online invalid’, and vxdisk list no longer shows the DMP device as ‘Disk is not usable’:
root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
info: format=none
flags: online ready private autoconfig invalid
pubpaths: block=/dev/vx/dmp/xiv0_18s2 char=/dev/vx/rdmp/xiv0_18s2
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=enabled
c4t500173804EE40152d19s2 state=enabled
c2t500173804EE40150d19s2 state=enabled
c2t500173804EE40140d19s2 state=enabled

8. To add the new DMP device to Disk Group, the following steps should be followed:
/usr/lib/vxvm/bin/vxdisksetup -i xiv0_18
vxdg -g <dg_name> adddisk <disk_name>=<device name>
/usr/sbin/vxassist -g <dg_name> maxgrow <vol name> alloc=<newly-add-luns>
/etc/vx/bin/vxresize -g <dg_name> -bx <vol name> <new size>
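As a purely hypothetical worked example of step 8 (the disk group datadg, disk name datadg07 and volume datavol are made-up names; the target size depends on your own layout):

/usr/lib/vxvm/bin/vxdisksetup -i xiv0_18
vxdg -g datadg adddisk datadg07=xiv0_18
/usr/sbin/vxassist -g datadg maxgrow datavol alloc=datadg07   #reports the maximum size datavol can grow to using the new disk
/etc/vx/bin/vxresize -g datadg -bx datavol 96g   #grow the volume and its filesystem to the new size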

 

Categories: Hardware, SAN, Storage Tags:

lvm snapshot backup

September 20th, 2012 No comments

Here are the updated steps for LVM snapshot backup. I will also put these into the work info of the changes.

 

# determine available space.

sudo vgs | awk '/VolGroup/ {print $NF}'

 

# create snapshots. 1GB should be enough, but if you are really short on space you can try even smaller, since the snapshot only has to hold the delta between now and when you destroy it.

sudo lvcreate -L 1G -s -n rootLVsnap VolGroup00/rootLV

sudo lvcreate -L 1G -s -n varLVsnap VolGroup00/varLV

 

# create the FS for backups. allocate as much space as you have. last-resort: use NFS.

sudo lvcreate -n OSbackup -L 5G VolGroup00

sudo mkfs -t ext3 /dev/VolGroup00/OSbackup

sudo mkdir /OSbackup

sudo mount /dev/VolGroup00/OSbackup /OSbackup

 

# create a backup of root

# gzip is important. if you are really tight on space, try gzip -9 or even bzip2

sudo dd if=/dev/VolGroup00/rootLVsnap bs=1M | sudo sh -c 'gzip -c > /OSbackup/root.dd.gz'

 

# now, remove root snapshot and extend backup fs

sudo lvremove VolGroup00/rootLVsnap

sudo lvextend -L +1G VolGroup00/OSbackup

sudo resize2fs /dev/VolGroup00/OSbackup

 

# backup var

sudo dd if=/dev/VolGroup00/varLVsnap bs=1M | sudo sh -c 'gzip -c > /OSbackup/var.dd.gz'

sudo lvremove VolGroup00/varLVsnap

 

# backup boot

cd /boot; sudo tar -pczf /OSbackup/boot.tar.gz .

 

# unmount the fs and destroy mountpoint

sudo umount /OSbackup

sudo rmdir /OSbackup
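For completeness, restoring from these images later is roughly the reverse. A sketch, assuming the OSbackup LV is mounted again at /OSbackup, the target LVs are not in use, and the root LV is restored while booted from rescue media:

# write the compressed images back onto the logical volumes
sudo sh -c 'gunzip -c /OSbackup/root.dd.gz | dd of=/dev/VolGroup00/rootLV bs=1M'
sudo sh -c 'gunzip -c /OSbackup/var.dd.gz | dd of=/dev/VolGroup00/varLV bs=1M'

# restore /boot from the tarball
cd /boot; sudo tar -xpzf /OSbackup/boot.tar.gz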

Categories: Storage Tags:

thin provisioning aka virtual provisioning on EMC Symmetrix

July 28th, 2012 No comments

For basic information about thin provisioning, here are some excerpts from Wikipedia and the HDS site:

Thin provisioning is the act of using virtualization technology to give the appearance of more physical resource than is actually available. It relies on on-demand allocation of blocks of data versus the traditional method of allocating all the blocks up front. This methodology eliminates almost all whitespace which helps avoid the poor utilization rates, often as low as 10%, that occur in the traditional storage allocation method where large pools of storage capacity are allocated to individual servers but remain unused (not written to). This traditional model is often called “fat” or “thick” provisioning.

Thin provisioning simplifies application storage provisioning by allowing administrators to draw from a central virtual pool without immediately adding physical disks. When an application requires more storage capacity, the storage system automatically allocates the necessary physical storage. This just-in-time method of provisioning decouples the provisioning of storage to an application from the physical addition of capacity to the storage system.

The term thin provisioning is applied to disk later in this article, but could refer to an allocation scheme for any resource. For example, real memory in a computer is typically thin provisioned to running tasks with some form of address translation technology doing the virtualization. Each task believes that it has real memory allocated. The sum of the allocated virtual memory assigned to tasks is typically greater than the total of real memory.

The article below shows the steps for creating a thin pool, adding and removing components from the pool, and deleting a thin pool:

http://software-cluster.blogspot.co.uk/2011/09/create-emc-symmetrix-thin-devices.html
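To give a rough idea of what the thin-device part of that article boils down to, here is a hedged sketch; the SID 1234, the pool name Thin_Pool and the device size are placeholders, and the exact symconfigure syntax can vary between Solutions Enabler releases, so treat this as an outline rather than a recipe:

symconfigure -sid 1234 -cmd "create dev count=4, size=54614, emulation=FBA, config=TDEV, binding to pool=Thin_Pool;" commit
symcfg -sid 1234 list -pool -thin -detail   #check thin pool utilization after the devices are bound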

And for more information about thin provisioning on EMC Symmetrix V-Max  with Veritas Storage Foundation, the following PDF file may help you.

EMC Symmetrix V-Max with Veritas Storage Foundation.pdf

PS:

1. symcfg -sid 1234 list -datadev #list all TDAT devices (thin data devices, which make up the thin pool; the thin pool provides the actual physical storage behind thin devices)
2. symcfg -sid 1234 list -tdev #list all TDEV devices (thin devices)

3. The following article may be useful if you encounter problems when performing storage reclamation (VxVM vxdg ERROR V-5-1-16063 Disk d1 is used by one or more subdisks which are pending to be reclaimed):

http://www.symantec.com/business/support/index?page=content&id=TECH162709

 

 

Categories: Hardware, SAN, Storage Tags: ,

Resolved – VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

July 25th, 2012 2 comments

When I tried to import a Veritas disk group today using ‘vxdg -C import doxerdg’, the following error messages were shown:

VxVM vxdg ERROR V-5-1-684 IPC failure: Configuration daemon is not accessible
return code of vxdg import command is 768

VxVM vxconfigd DEBUG V-5-1-0 IMPORT: Trying to import the disk group using configuration database copy from emc5_0490
VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

Then I used pstack to print the stack trace of the dumped file:

root # pstack /var/core/core_doxerorg_vxconfigd_0_0_1343173375_140
core ‘core_doxerorg_vxconfigd_0_0_1343173375_14056′ of 14056: vxconfigd
ff134658 strcmp (fefc04e8, 103fba8, 0, 0, 31313537, 31313737) + 238
001208bc da_find_diskid (103fba8, 0, 0, 0, 0, 0) + 13c
002427dc dm_get_da (58f068, 103f5f8, 0, 0, 68796573, 0) + 14c
0023f304 ssb_check_disks (58f068, 0, f37328, fffffffc, 4, 0) + 3f4
0018f8d8 dg_import_start (58f068, 9c2088, ffbfed3c, 4, 0, 0) + 25d8
00184ec0 dg_reimport (0, ffbfedf4, 0, 0, 0, 0) + 288
00189648 dg_recover_all (50000, 160d, 3ec1bc, 1, 8e67c8, 447ab4) + 2a8
001f2f5c mode_set (2, ffbff870, 0, 0, 0, 0) + b4c
001e0a80 setup_mode (2, 3e90d4, 4d5c3c, 0, 6c650000, 6c650000) + 18
001e09a0 startup (4d0da8, 0, 0, fffffffc, 0, 4d5bcc) + 3e0
001e0178 main (1, ffbffa7c, ffbffa84, 44f000, 0, 0) + 1a98
000936c8 _start (0, 0, 0, 0, 0, 0) + b8

Then I tried restarting vxconfigd, but it failed as well:

root@doxer#/sbin/vxconfigd -k -x syslog

VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

After reading the vxconfigd man page, I decided to use ‘-r reset’ to reset all Veritas Volume Manager configuration information stored in the kernel as part of startup processing. But before doing this, we need to unmount all VxVM volumes, as stated in the man page:

The reset fails if any volume devices are in use, or if an imported shared disk group exists.
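To see what still has to be unmounted before the reset, something like this is usually enough (a sketch for a Solaris host, where VxFS mounts are listed with filesystem type vxfs):

df -F vxfs   #list mounted VxFS filesystems
mount -v | grep vxfs   #alternative view, shows mount options too
umount <mount point>   #repeat for each VxFS mount point shown above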

After unmounting all VxVM filesystems, I ran the following command:

vxconfigd -k -r reset

After this, the importing of DGs succeeded.

Categories: SAN, Storage Tags: ,

resolved – df Input/output error from veritas vxfs

July 10th, 2012 No comments

If you get errors like the following when running df on filesystems that have Veritas VxFS as the underlying FS:

df: `/BCV/testdg’: Input/output error
df: `/BCV/testdg/ora’: Input/output error
df: `/BCV/testdg/ora/archivelog01′: Input/output error
df: `/BCV/testdg/ora/gg’: Input/output error

And when you run ‘vxdg list’, you find the DGs are in disabled status:

testarc_PRD disabled 1275297639.26.doxer
testdb_PRD disabled 1275297624.24.doxer

Don’t panic; to resolve this, you need to do the following (a hedged command sketch follows the list):

1) Force umount of the failed fs’s
2) deporting and importing failed disk groups.
3) Fixing plexes which were in the DISABLED FAILED state.
4) Fsck.vxfs of failed fs’s
5) Remounting of the needable fs’s
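Here is that hedged command sketch, using the testdb_PRD group from the output above; the volume name and mount point are placeholders, and on Solaris you would use -F vxfs instead of -t vxfs:

umount -f /BCV/testdg/ora   #1) force umount of the failed filesystems
vxdg deport testdb_PRD && vxdg import testdb_PRD   #2) deport and re-import the disabled disk group
vxrecover -g testdb_PRD -s   #3) start volumes and recover DISABLED FAILED plexes
fsck.vxfs /dev/vx/rdsk/testdb_PRD/<volume>   #4) check each filesystem
mount -t vxfs /dev/vx/dsk/testdb_PRD/<volume> /BCV/testdg/ora   #5) remount what you need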

Categories: SAN, Storage Tags:

difference between SCSI ISCSI FCP FCoE FCIP NFS CIFS DAS NAS SAN iFCP

May 30th, 2012 No comments

Here are some differences between SCSI, iSCSI, FCP, FCoE, FCIP, NFS, CIFS, DAS, NAS and SAN (excerpted from the Internet):

Most storage networks use the SCSI protocol for communication between servers and disk drive devices. A mapping layer to other protocols is used to form a network: Fibre Channel Protocol (FCP), the most prominent one, is a mapping of SCSI over Fibre Channel; Fibre Channel over Ethernet (FCoE); iSCSI, mapping of SCSI over TCP/IP.

 

A storage area network (SAN) is a dedicated network that provides access to consolidated, block level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system. Historically, data centers first created “islands” of SCSI disk arrays as direct-attached storage (DAS), each dedicated to an application, and visible as a number of “virtual hard drives” (i.e. LUNs). Operating systems maintain their own file systems on their own dedicated, non-shared LUNs, as though they were local to themselves. If multiple systems were simply to attempt to share a LUN, these would interfere with each other and quickly corrupt the data. Any planned sharing of data on different computers within a LUN requires advanced solutions, such as SAN file systems or clustered computing. Despite such issues, SANs help to increase storage capacity utilization, since multiple servers consolidate their private storage space onto the disk arrays. Sharing storage usually simplifies storage administration and adds flexibility since cables and storage devices do not have to be physically moved to shift storage from one server to another. SANs also tend to enable more effective disaster recovery processes. A SAN could span a distant location containing a secondary storage array. This enables storage replication either implemented by disk array controllers, by server software, or by specialized SAN devices. Since IP WANs are often the least costly method of long-distance transport, the Fibre Channel over IP (FCIP) and iSCSI protocols have been developed to allow SAN extension over IP networks. The traditional physical SCSI layer could only support a few meters of distance – not nearly enough to ensure business continuance in a disaster.

More about FCIP is here: http://en.wikipedia.org/wiki/Fibre_Channel_over_IP (it still uses the FC protocol).

A competing technology to FCIP is known as iFCP. It uses routing instead of tunneling to enable connectivity of Fibre Channel networks over IP.

IP SAN uses TCP as a transport mechanism for storage over Ethernet, and iSCSI encapsulates SCSI commands into TCP packets, thus enabling the transport of I/O block data over IP networks.

Network-attached storage (NAS), in contrast to SAN, uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block. The key difference between direct-attached storage (DAS) and NAS is that DAS is simply an extension to an existing server and is not necessarily networked. NAS is designed as an easy and self-contained solution for sharing files over the network.

 

FCoE works with standard Ethernet cards, cables and switches to handle Fibre Channel traffic at the data link layer, using Ethernet frames to encapsulate, route, and transport FC frames across an Ethernet network from one switch with Fibre Channel ports and attached devices to another, similarly equipped switch.

 

When an end user or application sends a request, the operating system generates the appropriate SCSI commands and data request, which then go through encapsulation and, if necessary, encryption procedures. A packet header is added before the resulting IP packets are transmitted over an Ethernet connection. When a packet is received, it is decrypted (if it was encrypted before transmission), and disassembled, separating the SCSI commands and request. The SCSI commands are sent on to the SCSI controller, and from there to the SCSI storage device. Because iSCSI is bi-directional, the protocol can also be used to return data in response to the original request.

 

Fibre channel is more flexible; devices can be as far as ten kilometers (about six miles) apart if optical fiber is used as the physical medium. Optical fiber is not required for shorter distances, however, because Fibre Channel also works using coaxial cable and ordinary telephone twisted pair.

 

Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems in 1984,[1] allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. CIFS is its Windows-based counterpart, used for file sharing on Windows.

Categories: NAS, SAN, Storage Tags:

check lun0 is the first mapped LUN before rescan-scsi-bus.sh(sg3_utils) on centos linux

May 26th, 2012 No comments

rescan-scsi-bus.sh from package sg3_utils scans all the SCSI buses on the system, updating the SCSI layer to reflect new devices on the bus. But in order for this to work, LUN0 must be the first mapped logical unit. Here’s some excerpt from wiki page:

LUN 0: There is one LUN which is required to exist in every target: zero. The logical unit with LUN zero is special in that it must implement a few specific commands, most notably Report LUNs, which is how an initiator can find out all the other LUNs in the target. But LUN zero need not provide any other services, such as a storage volume.
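On the Linux host you can also ask a target directly for its LUN inventory via the Report LUNs command with sg_luns, which ships in the same sg3_utils package (a sketch; /dev/sg2 is a placeholder for one of the array paths):

sg_luns /dev/sg2   #LUN 0 should appear in the reported list
lsscsi -g   #if installed, maps sd/sg names to H:C:T:L addresses, where L is the LUN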

To confirm that LUN0 is the first mapped LUN on each FA port, you can run the following check if the storage is an EMC Symmetrix and Solutions Enabler (SYMCLI) is installed on the host:

# prints any Sym/FA:port that has no device mapped at LUN address 000 (ksh syntax; use echo instead of print under bash)
syminq -pdevfile | awk '!/^#/ {print $1,$4,$5}' | sort -n | uniq | while read _sym _FA _port
do
if [[ -z "$(symcfg -sid $_sym -fa $_FA -p $_port -addr list | awk '$NF=="000"')" ]]
then
print Sym $_sym, FA $_FA:$_port
fi
done
If the loop prints nothing, every FA port seen by the host already has a device mapped at LUN address 000, which proves that lun0 is the first mapped LUN, and you can continue with the rescan-scsi-bus.sh script to scan the new LUN. For a healthy FA port, the underlying ‘symcfg -addr list’ output contains a line with address 000, like the following:

Symmetrix ID: 000287890217

Director Device Name Attr Address
———————- —————————– —- ————–
Ident Symbolic Port Sym Physical VBUS TID LUN
—— ——– —- —- ———————– —- — —

FA-4A 04A 0 0000 c1t600604844A56CA43d0s* VCM 0 00 000

PS:

For more information on what a Logical Unit Number (LUN) is, you may refer to:

http://en.wikipedia.org/wiki/Logical_Unit_Number

Categories: SAN, Storage Tags:

solaris format disk label Changing a disk label (EFI / SMI)

May 24th, 2012 No comments

I had inserted a drive into a V440 and after running devfsadm, I ran format on the disk. I was presented with the following partition table:

partition> p
Current partition table (original):
Total disk sectors available: 143358320 + 16384 (reserved sectors)

Part Tag Flag First Sector Size Last Sector
0 usr wm 34 68.36GB 143358320
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 143358321 8.00MB 143374704

This disk was used in a zfs pool and, as a result, uses an EFI label. The more familiar label that is used is an SMI label (8 slices; numbered 0-7 with slice 2 being the whole disk). The advantage of the EFI label is that it supports LUNs over 1TB in size and prevents overlapping partitions by providing a whole-disk device called cxtydz rather than using cxtydzs2.
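If you only want to know which label a disk currently carries, without entering format, prtvtoc usually gives it away (a sketch; c1t1d0 is a placeholder):

prtvtoc /dev/rdsk/c1t1d0s0
#an EFI label reports plain sector counts and includes the small reserved slice 8 seen above;
#an SMI label reports cylinder/sector geometry and the familiar whole-disk slice 2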

However, I want to use this disk for UFS partitions, which means I need to put the SMI label back onto the device. Here’s how it’s done:

# format -e

partition> label
[0] SMI Label
[1] EFI Label
Specify Label type[1]: 0
Warning: This disk has an EFI label. Changing to SMI label will erase all
current partitions.
Continue? y
Auto configuration via format.dat[no]?
Auto configuration via generic SCSI-2[no]?
partition> q

format> q
#

Running format again will show that the SMI label was placed back onto the disk:

partition> p
Current partition table (original):
Total disk cylinders available: 14087 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 – 25 129.19MB (26/0/0) 264576
1 swap wu 26 – 51 129.19MB (26/0/0) 264576
2 backup wu 0 – 14086 68.35GB (14087/0/0) 143349312
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 52 – 14086 68.10GB (14035/0/0) 142820160
7 unassigned wm 0 0 (0/0/0) 0

partition>

PS:
  1. Keep in mind that changing disk labels will destroy any data on the disk.
  2. Here’s more info about EFI & SMI disk label -  http://docs.oracle.com/cd/E19082-01/819-2723/disksconcepts-14/index.html
  3. More on UEFI and BIOS - http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface
Categories: Storage Tags:

what is fence or fencing device

May 16th, 2012 No comments

To understand what a fencing device is, you first need to know something about the split-brain condition; read here for more info: http://linux-ha.org/wiki/Split_Brain

Here is something about what a fence device is:

Fencing is the disconnection of a node from shared storage. Fencing cuts off I/O from shared storage, thus ensuring data integrity. A fence device is a hardware device that can be used to cut a node off from shared storage. This can be accomplished in a variety of ways: powering off the node via a remote power switch, disabling a Fibre Channel switch port, or revoking a host’s SCSI 3 reservations. A fence agent is a software program that connects to a fence device in order to ask the fence device to cut off access to a node’s shared storage (via powering off the node or removing access to the shared storage by other means).

To check whether a LUN has SCSI-3 Persistent Reservation, run the following:

root@doxer# symdev -sid 369 show 2040|grep SCSI
SCSI-3 Persistent Reserve: Enabled
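On a Linux host you can inspect the same thing from the initiator side with sg_persist from sg3_utils (a sketch; /dev/sdc is a placeholder for one path to the LUN):

sg_persist --in --read-keys /dev/sdc   #lists the registered PR keys, if any
sg_persist --in --read-reservation /dev/sdc   #shows the current SCSI-3 persistent reservation holder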

And here’s an article about I/O fencing using SCSI-3 Persistent Reservations in the configuration of SF Oracle RAC: http://sfdoccentral.symantec.com/sf/5.0/solaris64/html/sf_rac_install/sfrac_intro13.html

Categories: HA & HPC, Hardware, NAS, SAN, Storage Tags:

extend lvm

May 10th, 2012 No comments

We added a new hard disk and need to add it to an existing volume group (VG).

Take sdb as the newly added disk.

Step1: Partition the new disk and set the partition type to Linux LVM (8e)
fdisk /dev/sdb
n -> p -> 1 -> return twice -> t -> 8e -> w

Step2: Create a PV on the new partition (do not run mkfs on /dev/sdb1 here; LVM uses the raw partition, and the existing filesystem is grown online in the last step)
pvcreate /dev/sdb1

Step3: Extend the existing VG onto the new PV
vgextend VolGroup00 /dev/sdb1
then you will have more free space in the VG; check with:
vgs

Step4: Extend the LV, either to a new total size or by the amount of space to add
lvextend -L <new_total_size_in_G>G /dev/VolGroup00/LogVol00
lvextend -L +<space_to_add_in_G>G /dev/VolGroup00/LogVol00

Step5: Resize the filesystem online
resize2fs /dev/VolGroup00/LogVol00
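On reasonably recent LVM versions, Steps 4 and 5 can be collapsed into a single command with the -r/--resizefs option, which calls the filesystem resize for you (a sketch):

lvextend -r -L +<space_to_add_in_G>G /dev/VolGroup00/LogVol00   #extend the LV and resize the ext3 filesystem in one step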

Categories: Storage Tags:

Retrieve the contents of a hard disk

May 8th, 2012 No comments


A disk may produce non-recoverable hardware read errors; for example, /var/log/kern.log says:

Oct 30 11:22:52 ipbox kernel: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct 30 11:22:52 ipbox kernel: hdc: dma_intr: error=0x40 { UncorrectableError }, LBAsect=11012416, sector=11012416
Oct 30 11:22:52 ipbox kernel: end_request: I/O error, dev hdc, sector 11012416
Oct 30 11:22:52 ipbox kernel: Buffer I/O error on device hdc, logical block 1376552
You can attempt to recover whatever is recoverable with:

dd if=/dev/hdc1 of=/home/ipbox/hdc1.dmp conv=noerror,sync
fsck.ext3 -f /home/ipbox/hdc1.dmp
mount -o loop,ro -t ext3 /home/ipbox/hdc1.dmp /mnt
The first command reads the entire partition, dumping it to a file; it does not stop on errors and writes zeros in place of unreadable sectors. The second command tries to repair the filesystem contained in the saved image, and the third command mounts the image file as a filesystem. The example is of course an ext3 filesystem.

ddrescue

The utility ddrescue (Debian package of the same name) can replace dd in more difficult cases. The program keeps a log file of the read operation:

ddrescue --no-split /dev/sdb /tmp/sdb.img /tmp/sdb.log
To retry only those sectors that previously had problems, run:

ddrescue --direct --max-retries=3 /dev/sdb /tmp/sdb.img /tmp/sdb.log
A further parameter to force reading is --retrim.

Categories: Hardware, Storage Tags: