Archive for the ‘Hardware’ Category

general networking tips

April 18th, 2014 No comments

How Autonegotiation Works

First, let’s cover what autonegotiation does not do: when autonegotiation is enabled on a port, it does not automatically determine the configuration of the port on the other side of the Ethernet cable and then match it. This is a common misconception that often leads to problems.

Autonegotiation is a protocol and, as with any protocol, it only works if it’s running on both sides of the link. In other words, if one side of a link is running autonegotiation and the other side of the link is not, autonegotiation cannot determine the speed and duplex configuration of the other side. If autonegotiation is running on the other side of the link, the two devices decide together on the best speed and duplex mode. Each interface advertises the speeds and duplex modes at which it can operate, and the best match is selected (higher speeds and full duplex are preferred).

The confusion exists primarily because autonegotiation always seems to work. This is because of a feature called parallel detection, which kicks in when the autonegotiation process fails to find autonegotiation running on the other end of the link. Parallel detection works by sending the signal being received to the local 10Base-T, 100Base-TX, and 100Base-T4 drivers. If any one of these drivers detects the signal, the interface is set to that speed.
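The selection logic can be sketched in Python (a simplified model of the behavior described above; real PHYs implement this in hardware, and all names here are invented):

```python
# Simplified model of autonegotiation with parallel-detection fallback.
# All names are invented for illustration; real PHYs do this in hardware.

# Capabilities ordered from most to least preferred: higher speed first,
# full duplex before half duplex.
PREFERENCE = [
    (1000, "full"), (1000, "half"),
    (100, "full"), (100, "half"),
    (10, "full"), (10, "half"),
]

def negotiate(local_caps, remote_caps, detected_speed=None):
    """Return the (speed, duplex) the link settles on.

    remote_caps is None when the far side is not running autonegotiation;
    in that case parallel detection supplies only a speed, and the safe
    choice is half duplex.
    """
    if remote_caps is None:
        return (detected_speed, "half")
    common = set(local_caps) & set(remote_caps)
    for cap in PREFERENCE:          # first match is the best common mode
        if cap in common:
            return cap
    return None                     # no common mode: link will not come up

# Both sides negotiate: best common mode wins.
assert negotiate({(1000, "full"), (100, "full"), (100, "half")},
                 {(100, "full"), (100, "half")}) == (100, "full")

# Far side hard-set with autonegotiation off: parallel detection sees
# 100 Mbps but must assume half duplex -- the classic duplex mismatch.
assert negotiate({(100, "full"), (100, "half")},
                 None, detected_speed=100) == (100, "half")
```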

Parallel detection determines only the link speed, not the supported duplex modes. This is an important consideration because the common modes of Ethernet have differing levels of duplex support:

10Base-T was originally designed without full-duplex support. Some implementations of 10Base-T support full duplex, but many do not.

100Base-T has long supported full duplex, which has been the preferred method for connecting 100 Mbps links for as long as the technology has existed. However, the default behavior of 100Base-T is usually half duplex, and full-duplex support must be set manually.

Gigabit Ethernet has a much more robust autonegotiation protocol than 10M or 100M Ethernet. Gigabit interfaces should be left to autonegotiate in most situations.

10 Gigabit
10 Gigabit (10G) connections are generally dependent on fiber transceivers or special copper connections that differ from the RJ-45 connections seen on other Ethernet types. The hardware usually dictates how 10G connects. On a 6500, 10G interfaces usually require XENPAKs, which only run at 10G. On a Nexus 5000 switch, some of the ports are 1G/10G and can be changed with the speed command.

Because of the lack of widespread full-duplex support on 10Base-T and the typical default behavior of 100Base-T, when autonegotiation falls through to the parallel detection phase (which only detects speed), the safest thing for the driver to do is to choose half-duplex mode for the link.
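When autonegotiation can't be trusted on older gear, both sides of the link can be hard-coded instead. A minimal IOS sketch (the interface name is an example, and both ends of the link must be set to match):

```
! Hard-code speed and duplex -- do the same on the far side!
interface FastEthernet0/1
 speed 100
 duplex full
!
! Or return the port to autonegotiation:
interface FastEthernet0/1
 speed auto
 duplex auto
```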

As networks and networking hardware evolve, higher-speed links with more robust negotiation protocols will likely make negotiation problems a thing of the past. That being said, I still see 20-year-old routers in service, so knowing how autonegotiation works will be a valuable skill for years to come.

 When Autonegotiation Fails

In a half-duplex environment, the receiving (RX) line is monitored. If a frame is present on the RX link, no frames are sent until the RX line is clear. If a frame is received on the RX line while a frame is being sent on the transmitting (TX) line, a collision occurs. Collisions cause the collision error counter to be incremented—and the sending frame to be retransmitted—after a random back-off delay. This may seem counterintuitive in a modern switched environment, but remember that Ethernet was originally designed to work over a single wire. Switches and twisted pair came along later.


In full-duplex operation, the RX line is not monitored, and the TX line is always considered available. Collisions do not occur in full-duplex mode because the RX and TX lines are completely independent.

When one side of the link is full duplex and the other side is half duplex, a large number of collisions will occur on the half-duplex side. The issue may not be obvious, because a half-duplex interface normally shows collisions, while a full-duplex interface does not. Since full duplex means never having to test for a clear-to-send condition, a full-duplex interface will not record any errors in this situation. The problem should present itself as excessive collisions, but only on the half-duplex side.

Gigabit Ethernet uses a substantially more robust autonegotiation mechanism than the one described in this chapter. Gigabit Ethernet should thus always be set to autonegotiation, unless there is a compelling reason not to do so (such as an interface that will not properly negotiate). Even then, this should be considered a temporary workaround until the misbehaving part can be replaced.


When expanding a network using VLANs, you face the same limitations. If you connect another switch to a port that is configured for VLAN 20, the new switch will be able to forward frames only to or from VLAN 20. If you wanted to connect two switches, each containing four VLANs, you would need four links between the switches: one for each VLAN. A solution to this problem is to deploy trunks between switches. Trunks are links that carry frames for more than one VLAN.
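As a sketch, the IOS configuration for such a trunk port might look like this (interface and VLAN numbers are examples; some platforms also require the encapsulation to be set explicitly):

```
! On each switch: make the inter-switch link a trunk carrying all four VLANs
interface GigabitEthernet0/24
 switchport trunk encapsulation dot1q   ! on switches that also support ISL
 switchport mode trunk
 switchport trunk allowed vlan 10,20,30,40
```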


Another way to route between VLANs is commonly known as the router-on-a-stick configuration. Instead of running a link from each VLAN to a router interface, you can run a single trunk from the switch to the router. All the VLANs will then pass over a single link.

Deploying a router on a stick saves a lot of interfaces on both the switch and the router. The downside is that the trunk is only one link, so the total bandwidth available is that of a single link (10 Mbps in this example). In contrast, when each VLAN has its own link, each VLAN has 10 Mbps to itself. Also, don’t forget that the router is passing traffic between VLANs, so chances are each frame will be seen twice on the same link: once to get to the router, and once to get back to the destination VLAN.
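A hedged sketch of the router side of a router-on-a-stick setup, using 802.1Q subinterfaces (interface numbers and addresses are invented):

```
interface FastEthernet0/0
 no shutdown
!
! One subinterface per VLAN, each tagged with its VLAN ID
interface FastEthernet0/0.10
 encapsulation dot1Q 10
 ip address 10.0.10.1 255.255.255.0
!
interface FastEthernet0/0.20
 encapsulation dot1Q 20
 ip address 10.0.20.1 255.255.255.0
```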

Jack is connected to VLAN 20 on Switch B, and Diane is connected to VLAN 20 on Switch A. Because there is a trunk connecting these two switches together, assuming the trunk is allowed to carry traffic for all configured VLANs, Jack will be able to communicate with Diane. Notice that the ports to which the trunk is connected are not assigned VLANs. These ports are trunk ports and, as such, do not belong to a single VLAN.



 Possible switch port modes related to trunking



 VTP (VLAN Trunking Protocol)


VTP allows VLAN configurations to be managed on a single switch. Those changes are then propagated to every switch in the VTP domain. A VTP domain is a group of connected switches with the same VTP domain string configured. Interconnected switches with differently configured VTP domains will not share VLAN information. A switch can only be in one VTP domain; the VTP domain is null by default. Switches with mismatched VTP domains will not negotiate trunk protocols. If you wish to establish a trunk between switches with mismatched VTP domains, you must have their trunk ports set to mode trunk.


The main idea of VTP is that changes are made on VTP servers. These changes are then propagated to VTP clients, and any other VTP servers in the domain. Switches can be configured manually as VTP servers, VTP clients, or the third possibility, VTP transparent. A VTP transparent switch receives and forwards VTP updates but does not update its configuration to reflect the changes they contain. Some switches default to VTP server, while others default to VTP transparent. VLANs cannot be locally configured on a switch in client mode.

There is actually a fourth state for a VTP switch: off. A switch in VTP mode off will not accept VTP packets, and therefore will not forward them either. This can be handy if you want to stop the forwarding of VTP updates at some point in the network.
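For reference, a typical IOS VTP configuration might look something like this (the domain name and password are invented; defaults vary by platform):

```
vtp domain MYDOMAIN
vtp mode server          ! or: client, transparent, off
vtp password s3cret
vtp pruning              ! enabling pruning on a server enables it domain-wide
```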


SW1 and SW2 are both VTP servers. SW3 is set to VTP transparent, and SW4 is a VTP client. Any changes to the VLAN information on SW1 will be propagated to SW2 and SW4. The changes will be passed through SW3 but will not be acted upon by that switch. Because the switch does not act on VTP updates, its VLANs must be configured manually if users on that switch are to interact with the rest of the network.

When a switch receives a VTP update, the first thing it does is compare the VTP domain name in the update to its own. If the domains are different, the update is ignored. If they are the same, the switch compares the update’s configuration revision number to its own. If the revision number of the update is lower than or equal to the switch’s own revision number, the update is ignored. If the update has a higher revision number, the switch sends an advertisement request. The response to this request is another summary advertisement, followed by subset advertisements. Once it has received the subset advertisements, the switch has all the information necessary to implement the required changes in the VLAN configuration.
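The decision a switch makes on receiving a summary advertisement can be sketched like this (a simplified model; field names are invented, and real VTP also checks passwords, versions, and timestamps):

```python
def handle_vtp_summary(my_domain, my_revision, adv_domain, adv_revision):
    """Return what a switch does with a received VTP summary advertisement.

    Simplified model of the comparison described above.
    """
    if adv_domain != my_domain:
        return "ignore"                  # different domain: not for us
    if adv_revision <= my_revision:
        return "ignore"                  # nothing newer here
    return "send advertisement request"  # fetch the subset advertisements

assert handle_vtp_summary("CORP", 5, "LAB", 9) == "ignore"
assert handle_vtp_summary("CORP", 5, "CORP", 5) == "ignore"
assert handle_vtp_summary("CORP", 5, "CORP", 9) == "send advertisement request"
```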

When a switch’s VTP domain is null, if it receives a VTP advertisement over a trunk link, it will inherit the VTP domain and VLAN configuration from the switch on the other end of the trunk. This will happen only over manually configured trunks, as DTP negotiations cannot take place unless a VTP domain is configured. Be careful of this behavior, as it can cause serious heartache, nausea, and potential job loss if you’re not (or the person before you wasn’t).

VTP Pruning
On large or congested networks, VTP can create a problem when excess traffic is sent across trunks needlessly. The switches in the gray box all have ports assigned to VLAN 100, while the rest of the switches do not. With VTP active, all of the switches will have VLAN 100 configured, and as such will receive broadcasts initiated on that VLAN. However, those without ports assigned to VLAN 100 have no use for the broadcasts.

On a busy VLAN, broadcasts can amount to a significant percentage of traffic. In this case, all that traffic is being needlessly sent over the entire network, and is taking up valuable bandwidth on the interswitch trunks.
VTP pruning prevents traffic originating from a particular VLAN from being sent to switches on which that VLAN is not active (i.e., switches that do not have ports connected and configured for that VLAN). With VTP pruning enabled, the VLAN 100 broadcasts will be restricted to switches on which VLAN 100 is actively in use.

VTP pruning must be enabled or disabled throughout the entire VTP domain. Failure to configure VTP pruning properly can result in instability in the network. By default, all VLANs up to VLAN 1001 are eligible for pruning, except VLAN 1, which can never be pruned. VTP does not support the extended VLANs above VLAN 1001, so VLANs higher than 1001 cannot be pruned. If you enable VTP pruning on a VTP server, VTP pruning will automatically be enabled for the entire domain.

Dangers of VTP
Remember that many switches are VTP servers by default. Remember, also, that when a switch participating in VTP receives an update that has a higher revision number than its own configuration’s revision number, the switch will implement the new scheme. In our scenario, the lab’s 3750s had been functioning as a standalone network with the same VTP domain as the regular network. Multiple changes were made to their VLAN configurations, resulting in a high configuration revision number. When these switches, which were VTP servers, were connected to the more stable production network, they automatically sent out updates. Each switch on the main network, including the core 6509s, received an update with a higher revision number than its current configuration. Consequently, they all requested the VLAN configuration from the rogue 3750s and implemented that design.

 Link Aggregation

EtherChannel is the Cisco term for the technology that enables the bonding of up to eight physical Ethernet links into a single logical link. Outside the Cisco world, the technology is generally called link aggregation, or LAG for short.


The default behavior is to assign one of the physical links to each packet that traverses the EtherChannel, based on the packet’s destination MAC address. This means that if one workstation talks to one server over an EtherChannel, only one of the physical links will be used. In fact, all of the traffic destined for that server will traverse a single physical link in the EtherChannel. This means that a single user will only ever get 1 Gbps from the EtherChannel at a time. This behavior can be changed to send each packet over a different physical link, but as you’ll see, there are limits to how well this works for applications like VoIP. The benefit arises when there are multiple destinations, which can each use a different path.

You can change the method the switch uses to determine which path to assign. The default behavior is to use the destination MAC address. However, depending on the version of the software and hardware in use, the options may include:

The source MAC address
The destination MAC address
The source and destination MAC addresses
The source IP address
The destination IP address
The source and destination IP addresses
The source port
The destination port
The source and destination ports
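The effect of the default method can be sketched in Python (a toy model: real switches hash low-order address bits in hardware, so the exact function here is invented):

```python
def choose_link(dst_mac: str, n_links: int) -> int:
    """Pick a physical link for a frame by hashing the destination MAC.

    Toy model: Cisco switches XOR low-order address bits; here we
    just take the last octet modulo the number of links.
    """
    last_octet = int(dst_mac.replace(":", "")[-2:], 16)
    return last_octet % n_links

# Every frame to the same server takes the same physical link...
server = "00:1a:2b:3c:4d:5e"
assert {choose_link(server, 4) for _ in range(100)} == {choose_link(server, 4)}

# ...but traffic to many destinations spreads across the bundle.
macs = ["00:00:00:00:00:%02x" % i for i in range(16)]
assert {choose_link(m, 4) for m in macs} == {0, 1, 2, 3}
```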

There is another terminology problem that can create many headaches for network administrators. While a group of physical Ethernet links bonded together is called an EtherChannel in Cisco parlance, Unix admins sometimes refer to the same configuration as a trunk. Of course, in the Cisco world the term “trunk” refers to something completely different: a link that labels frames with VLAN information so that multiple VLANs can traverse it. Some modern Unixes create a bond interface when performing link aggregation, and Windows admins often use the term teaming when combining links.
EtherChannel protocols


EtherChannel can negotiate with the device on the other side of the link. Two protocols are supported on Cisco devices. The first is the Link Aggregation Control Protocol (LACP), which is defined in IEEE specification 802.3ad. LACP is used when you’re connecting to non-Cisco devices, such as servers. The other protocol used in negotiating EtherChannel links is the Port Aggregation Control Protocol (PAgP). Since PAgP is Cisco-proprietary, it is used only when you’re connecting two Cisco devices via an EtherChannel. Each protocol supports two modes: a passive mode (auto in PAgP and passive in LACP), and an active mode (desirable in PAgP and active in LACP). Alternatively, you can set the mode to on, thus forcing the creation of the EtherChannel.
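As a sketch, the negotiation protocol is selected per port with the channel-group command (interface and group numbers are examples):

```
! LACP (802.3ad) toward a non-Cisco device such as a server
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode active      ! "passive" waits for the other side
!
! PAgP toward another Cisco switch
interface range GigabitEthernet0/3 - 4
 channel-group 2 mode desirable   ! "auto" waits for the other side
!
! Or force the bundle up with no negotiation at all:
! channel-group 3 mode on
```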

 Spanning Tree

The Spanning Tree Protocol (STP) is used to ensure that no Layer-2 loops exist in a LAN. Spanning tree is designed to prevent loops among bridges. A bridge is a device that connects multiple segments within a single collision domain. Switches are considered bridges—hubs are not.


When a switch receives a broadcast, it repeats the broadcast on every port (except the one on which it was received). In a looped environment, the broadcasts are repeated forever. The result is called a broadcast storm, and it will quickly bring a network to a halt. Spanning tree is an automated mechanism used to discover and break loops of this kind.

A useful tool when you’re troubleshooting a broadcast storm is the show processes cpu history command.
Here is the output from the show process cpu history command on switch B, which shows 0–3 percent CPU utilization over the course of the last minute:


The numbers on the left side of the graph are the CPU utilization percentages. The numbers on the bottom are seconds in the past (0 = the time of command execution). The numbers on the top of the graph show the integer values of CPU utilization for that time period on the graph. For example, according to the preceding graph, CPU utilization was normally 0 percent, but increased to 1 percent 5 seconds ago and to 3 percent 20 seconds ago. When the values exceed 10 percent, you’ll see visual peaks in the graph itself.

Another problem caused by a looped environment is MAC address tables (CAM tables in CatOS) being constantly updated:

3550-IOS#sho mac-address-table | include 0030.1904.da60

Spanning tree elects a root bridge (switch) in the network. The root bridge is the bridge that all other bridges need to reach via the shortest path possible. Spanning tree calculates the cost for each path from each bridge in the network to the root bridge. The path with the lowest cost is kept intact, while all others are broken. Spanning tree breaks paths by putting ports into a blocking state.
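The root bridge election can be sketched in Python (a simplified model; a real bridge ID is a 16-bit priority concatenated with the MAC address):

```python
def elect_root(bridges):
    """Elect the STP root bridge: lowest bridge ID wins.

    A bridge ID is (priority, MAC address); priority is compared first,
    with the MAC address as the tiebreaker. Simplified model.
    """
    return min(bridges, key=lambda b: (b["priority"], b["mac"]))

bridges = [
    {"name": "SW1", "priority": 32768, "mac": "00:0a:00:00:00:01"},
    {"name": "SW2", "priority": 32768, "mac": "00:0a:00:00:00:02"},
    {"name": "SW3", "priority": 4096,  "mac": "00:0a:00:00:00:03"},
]
# SW3 wins on priority even though its MAC is the highest.
assert elect_root(bridges)["name"] == "SW3"
```

Lowering a switch's priority is the usual way to force it to become root, rather than relying on the MAC-address tiebreaker.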





Q: What’s the difference between bandwidth and speed?

A: Bandwidth is a capacity; speed is a rate. Bandwidth tells you the maximum amount of data your network can transmit per unit of time. Speed tells you the rate at which the data actually travels. A CAT-5 cable is rated for 10/100Base-T bandwidth, but the actual speed over a CAT-5 cable changes depending on conditions.

Q: What is Base-T?
A: Base-T refers to the different standards for Ethernet transmission rates. The 10 Base-T standard transfers data at 10 megabits per second (Mbps). The 100 Base-T standard transfers data at 100 Mbps. The 1000 Base-T standard transfers data at a massive 1000 Mbps.

Q: What is a crossover cable used for?
A: Suppose you want to connect a laptop to a desktop computer. One way of doing this is to use a switch or a hub to connect the two devices; another way is to use a crossover cable, a cable in which the transmit and receive wire pairs are crossed so that two similar devices (such as two computers) can be connected directly. A straight-through cable, by contrast, wires each pin straight across; it is meant to connect a device to a hub or switch, which performs the crossover internally.

Q: Aren’t packets and frames really the same thing?
A: No. We call data transmitted over Ethernet frames. Inside those frames, in the data field, are packets. Generally, frames have to do with the transmission protocol, i.e., Ethernet, ATM, Token Ring, etc. But, as you read more about networking, you will see that there is some confusion on this.

Q: A guy in my office calls packets datagrams. Are they the same?
A: Not really. Packet is the general term for any chunk of data sent across a network, whereas datagram usually refers to a packet sent by an unreliable protocol such as UDP or ICMP.

Q: What’s the difference between megabits per second (Mbps) and megabytes per second (MBps)?
A: Megabits per second (Mbps) is a bandwidth rate used in the telecommunications and computer networking field. One megabit equals one million bits. Megabytes per second (MBps) is a data transfer rate used in computing. One megabyte equals 1,048,576 bytes, and one byte equals 8 binary digits (aka bits).
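A quick sanity check of the difference, in Python:

```python
def mbps_to_mbytes_per_sec(mbps: float) -> float:
    """Convert a line rate in megabits/s to megabytes/s (8 bits per byte)."""
    return mbps / 8

# A "100 Mbps" link moves at most 12.5 megabytes of data per second
# (before protocol overhead).
assert mbps_to_mbytes_per_sec(100) == 12.5
assert mbps_to_mbytes_per_sec(8) == 1.0
```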
The order of the wires in an RJ-45 connector conforms to one of two standards. These standards are 568A and 568B.






We can convert to ASCII using hex

Once you learn to use hexadecimal, you realize just how cool it is. Hex and binary make great partners, which simplifies conversions between binary and ASCII. Hex is like a bridge between the weird world of binary and our world (the human, readable world).
Here’s what we do:

  • Break the byte in half.

Each half-byte is called a nibble. [Note from Editor: you’re kidding, right?]


  • Convert each half into its hexadecimal equivalent.

Because the binary number is broken into halves, the highest number you can get is 15 (which is “F” in hex).

  • Concatenate the two numbers.

Concatenate is a programmer’s word that simply means “put them beside each other from left to right.”

  • Look the number up in an ASCII table. (You can run man ascii to see the full ASCII hex character table.)
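The four steps above can be sketched in Python:

```python
def byte_to_hex(byte: int) -> str:
    """Convert one byte to hex the 'nibble' way described above."""
    high, low = byte >> 4, byte & 0x0F     # break the byte in half
    digits = "0123456789ABCDEF"
    return digits[high] + digits[low]      # convert each half, concatenate

# 0b01000001 -> "41" -> "A" in the ASCII table
assert byte_to_hex(0b01000001) == "41"
assert chr(0x41) == "A"
```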



Hubs – Switches – Routers


A hub receives incoming signals and sends them out on all the other ports. When several devices start sending signals, the hub’s incessant repetition creates heavy traffic and collisions. A collision happens when two signals run into one another, creating an error. The sending network device has to back off and wait to send the signal again.

A hub contains no processors, and this means that a hub has no real understanding of network data. It doesn’t understand MAC addresses or frames. It sees an incoming networking signal as a purely electrical signal, and passes it on.

A hub is really just an electrical repeater. It takes whatever signal comes in, and sends it out on all the other ports.


  • The source workstation sends a frame.

A frame carries the payload of data and keeps track of the time sent, as well as the MAC address of the source and the MAC address of the target.

  • The switch updates its MAC address table with the MAC address and the port it’s on.

Switches maintain MAC address tables. As frames come in, the switch’s knowledge of the traffic becomes more complete. The switch matches ports with MAC addresses.

  • The switch forwards the frame to its target MAC address using information from its table.

It does this by sending the frame out the port where the MAC address table indicates that MAC address is located.

Switches avoid collisions by storing and forwarding frames on the intranet. Switches are able to do this by using the MAC address of the frame. Instead of repeating the signal on all ports, it sends it on to the device that needs it.

A switch reads the signal as a frame and uses the frame’s information to send it where it’s supposed to go.



How the router moves data across networks

  • The sending device sends an ARP request for the MAC address of its default gateway.


  • The router responds with its MAC address.


  • The sending device sends its traffic to the router.


  • The router sends an ARP request for the device with the correct IP address on a different IP network.


  • The receiving device responds with its MAC address.


The router changes the MAC address in the frame and sends the data to the receiving device.


  • The source workstation sends a frame to the router.

It sends the frame to the router because the workstation the traffic is meant for is behind the router.

  • The router changes the source MAC address to its MAC address and changes the destination MAC address to the workstation the traffic is meant for.

If network traffic comes from a router, we can only see the router’s MAC address. All the workstations behind that router make up what we call an IP subnet. All a switch needs to look at to get frames to their destination is the MAC address. A router looks at the IP address from the incoming packet and forwards it if it is intended for a workstation located on the other network. Routers have far fewer network ports because they tend to connect to other routers or to switches. Computers are generally not connected directly to a router.

The switch decides where to send traffic based on the MAC address, whereas the router decides based on the IP address.
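That hop-by-hop MAC rewrite can be sketched in Python (a toy model with invented field names):

```python
def route_frame(frame, router_mac, next_hop_mac):
    """Model the MAC rewrite a router performs when forwarding a frame.

    Simplified: the IP addresses ride through unchanged, while the
    Ethernet header is rewritten hop by hop.
    """
    out = dict(frame)
    out["src_mac"] = router_mac     # router becomes the new sender
    out["dst_mac"] = next_hop_mac   # destination workstation's MAC
    return out

frame = {"src_mac": "aa:aa", "dst_mac": "rr:rr",   # sent to the gateway
         "src_ip": "10.0.1.5", "dst_ip": "10.0.2.9"}
forwarded = route_frame(frame, router_mac="rr:rr", next_hop_mac="bb:bb")

assert forwarded["dst_mac"] == "bb:bb"          # MACs rewritten
assert forwarded["src_mac"] == "rr:rr"
assert forwarded["dst_ip"] == frame["dst_ip"]   # IPs untouched
```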

Q: But I have a DSL router at home, and my computer is directly connected to it. What is that all about?
A: Good observation. There are switches that have routing capability and routers that have switched ports. There is not a real clear line between the two devices. It is more about their primary function. Now, in large networks, there are switching routers. These have software that allow them to work as routers on switched ports. They are great to use and make building large sophisticated networks straightforward, but they are very expensive.

Q: So the difference between my home DSL router and an enterprise switching router is the software?
A: The big difference is the hardware horsepower. Your home DSL router probably uses a small embedded processor or microcontroller that does all the processing. Switching routers and heavy-duty routers have specialized processors, with individual processors on each port. The name of the game is the speed at which it can move packets. Your home DSL router probably has a throughput of about 20 Mbps (megabits per second), whereas a high-end switching router can have a throughput of hundreds of Gbps (gigabits per second) or more.

Hook Wireshark up to the switch

  • Connect your computer to the switch with a serial cable.

You will use this to communicate with the switch.

  • Open a terminal program such as Hyperterminal and get to the command prompt of the switch. Type in the commands below.
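The commands referred to here would set up a SPAN (port-mirroring) session. A hedged example for a Cisco Catalyst switch (session and interface numbers are assumptions; adjust them to the ports you actually use):

```
! Mirror traffic from port 2 to port 1, where Wireshark is listening
monitor session 1 source interface FastEthernet0/2
monitor session 1 destination interface FastEthernet0/1
```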


  • Hook up your computer to port 1 on the switch with an Ethernet cable.

You will use this to capture network traffic.

  • Start up Wireshark and capture some network traffic.






stuck in PXE-E51: No DHCP or proxyDHCP offers were received, PXE-M0F: Exiting Intel Boot Agent, Network boot canceled by keystroke

March 17th, 2014 No comments

If you installed your OS and tried booting it up but got stuck with the following messages:


Then one possibility is that the configuration of your host’s storage array is not right. For instance, it should be JBOD but was configured as RAID6.

Please note that this is only one possibility for this error; you may search for the PXE error codes you encountered for more details.


  • Sometimes, DHCP snooping may prevent PXE from functioning; you can read more
  • STP (Spanning Tree Protocol) makes each port wait up to 50 seconds before data is allowed to be sent on the port. This delay can in turn cause problems with some applications/protocols (PXE, Bootworks, etc.). To alleviate the problem, PortFast was implemented on Cisco devices; the terminology might differ between vendors. You can read more
  • ARP caching
Categories: Hardware, Storage, Systems Tags:

“Include snapshots” made NFS shares from ZFS appliance shrinking

January 17th, 2014 No comments

Today I met a weird issue when checking an NFS share mounted from a ZFS appliance. As the space on the filesystem was getting low, I removed files, and the NFS filesystem mounted on the client shrank. What confused me was that the filesystem’s size itself kept getting smaller. Shouldn’t the free space get larger while the size stays unchanged?

After some debugging, I found that this was caused by the ZFS appliance share’s “Include snapshots” option. When I unchecked “Include snapshots”, the issue was gone!


Categories: Hardware, NAS, Storage Tags:

resolved – ESXi Failed to lock the file

January 13th, 2014 No comments

When I was powering on a VM in ESXi, an error occurred:

An error was received from the ESX host while powering on VM doxer-test.
Cannot open the disk ‘/vmfs/volumes/4726d591-9c3bdf6c/doxer-test/doxer-test_1.vmdk’ or one of the snapshot disks it depends on.
Failed to lock the file

And also:

unable to access file since it is locked

This apparently was caused by some storage issue. I first googled and found that most of the posts told stories about ESXi’s working mechanism; I tried some of their suggestions, but with no luck.

Then I remembered that our datastore was using NFS/ZFS, and NFS has file-locking issues, as you know. So I mounted the NFS share the datastore was using and removed a file named lck-c30d000000000000. After this, the VM booted up successfully! (Alternatively, you can log on to the ESXi host and remove the lock file there.)

Categories: NAS, Oracle Cloud, Storage Tags:

Common storage multi path Path-Management Software

December 12th, 2013 No comments
Vendor: Path-Management Software
Hewlett-Packard: AutoPath, SecurePath
Microsoft: MPIO
Hitachi: Dynamic Link Manager
EMC: PowerPath
IBM: RDAC, MultiPath Driver
VERITAS: Dynamic Multipathing (DMP)
Categories: HA, Hardware, IT Architecture, SAN, Storage Tags:

SAN ports

December 10th, 2013 No comments

Basic SAN port modes of operation

The port’s mode of operation depends on what’s connected to the other side of the port. Here are two general examples:

✓ All hosts (servers) and all storage ports operate as nodes (that is, places where the data either originates or ends up), so their ports are called N_Ports (node ports).

✓ All hub ports operate as loops (that is, places where the data travels in a small Fibre Channel loop), so they’re called L_Ports (loop ports).

Switch ports are where it gets tricky. That’s because switch ports have multiple personalities: They become particular types of ports depending on what gets plugged into them (check out Table 2-2 to keep all these confusing port types straight). Here are some ways a switch port changes its function to match what’s connected to it:

✓ Switch ports usually hang around as G_Ports (generic ports) when nothing is plugged into them. A G_Port doesn’t get a mode of operation until something is plugged into it.

✓ If you plug a host into a switch port, it becomes an F_Port (fabric port). The same thing happens if you plug in a storage array that’s running the Fibre Channel-Switched (FC-SW) Protocol (more about this protocol in the next section).

✓ If you plug a hub into a switch port, you get an FL_Port (fabric-to-loop port); hub ports by themselves are always L_Ports (loop ports).

✓ When two switch ports are connected together, they become their own small fabric, known as an E_Port (switch-to-switch expansion port) or a T_Port (trunk port).

✓ A host port is always an N_Port (node port), unless it’s attached to a hub, in which case it’s an NL_Port (node-to-loop port).

✓ A storage port, like a host port, is always an N_Port, unless it’s connected to a hub, in which case it’s an NL_Port.

If that seems confusing, it used to be worse. Believe it or not, different switch vendors used to name their ports differently, which confused everyone. Then the Storage Network Industry Association (SNIA) came to save the day and standardized the names you see in Figure 2-19. If you want to get a good working handle on what’s going on in your SAN, use Table 2-2 to find out what the port names mean after all the plugging-in is done.

Protocols used in a Fibre Channel SAN

Protocols are, in effect, an agreed-upon set of terms that different computer devices use to communicate with one another. A protocol can be thought of as the common language used by different types of networks. You’ll encounter three basic protocols in the Fibre Channel world:

✓ FC-AL: Fibre Channel-Arbitrated Loop Protocol is used by two devices communicating within a Fibre Channel loop (created by plugging the devices into a hub). Fibre Channel loops use hubs for the cable connections among all the SAN devices. Newer storage arrays that have internal fiber disks use Fibre Channel loops to connect the disks to the array, which is why they can have so many disks inside: Each loop can handle 126 disks, and you can have many loops in the array. The array uses the FC-AL protocol to talk to the disks.

Each of the possible 126 devices on a Fibre Channel loop takes a turn communicating with another device on the loop. Only one conversation can occur at a time; the protocol determines who gets to talk when. Every device connected to the loop gets a loop address (loop ID) that determines its priority when it uses the loop to talk.
✓ FC-SW: Fibre Channel-Switched Protocol is used by two devices commu-
nicating on a Fibre Channel switch. Switch ports are connected over a
backplane, which allows any device on the switch to talk to any other
device on the switch at the same time. Many conversations can occur
simultaneously through the switch. A switched fabric is created by con-
necting Fibre Channel switches; such a fabric can have thousands of
devices connected to it.
Each device in a fabric has an address called a World Wide Name (WWN)
that’s hard-coded at the factory onto the host bus adapter (HBA) that
goes into every server and every storage port. The WWN is like the
telephone number of a device in the fabric (or like the MAC address of
a network card). When the device is connected to the fabric, it logs in to
the fabric port, and its WWN is registered in the name server so the switch

knows it’s connected to that port. The WWN is also sometimes called a
WWPN, or World Wide Port Name.
The WWN and the WWPN are the exact same thing: the actual address
for a Fibre Channel port. In some cases, large storage arrays can also
have what is known as a WWNN, or World Wide Node Name. Some Fibre
Channel storage manufacturers use the WWNN for the entire array, and
then use an offset of the WWN for each port in the array as the WWPN.
I guess this is the Fibre Channel storage manufacturers’ way of making the
World Wide Names they were given by the standards bodies last longer.
You can think of the WWNN as the device itself, and the WWPN as the
actual port within the device, but in the end, it’s all just a WWN.
The name server is like a telephone directory. When one device wants
to talk to another in the fabric, it uses the other device’s phone number
to call it up. The switch protocol acts like the telephone operator. The
first device asks the operator what the other device’s phone number is.
The operator locates the number in the directory (the name server) in
the switch, and then routes the call to the port where the other device is located.
There is a trick you can use to determine whether a WWN refers to a
server on the fabric or a storage port on the fabric. Most storage ports’
WWNs start with the number 5, and most host bus adapters’ WWNs start
with either 10 or 21 as the first hexadecimal digits. Think of it like the
area code of a phone number. If you see a number like
50:06:03:81:D6:F3:10:32, it’s probably a port on a storage array. A
number like 10:00:00:01:a9:42:fc:06 will be a server’s HBA WWN.
✓ SCSI: The SCSI protocol is used by a computer application to talk to its
disk-storage devices. In a SAN, the SCSI protocol is layered on top of
either the FC-AL or FC-SW protocol to enable the application to get to
the disk drives within the storage arrays in a Fibre Channel SAN. This
makes Fibre Channel backward-compatible with all the existing applica-
tions that still use the SCSI protocol to talk to disks inside servers. If the
SCSI protocol were not used, all existing applications would have needed
to be recompiled to use a different method of talking to disk drives.
SCSI works a bit differently in a SAN from the way it does when it talks to
a single disk drive inside a server. SCSI inside a server runs over copper
wires, and data is transmitted in parallel across the wires. In a SAN, the
SCSI protocol is serialized, so each bit of data can be transmitted as a
pulse of light over a fiber-optic cable. If you want to connect older parallel
SCSI-based devices in a SAN, you have to use a data router, which acts as a
bridge between the serial SCSI used in a SAN and the parallel SCSI used in
the device. (See “Data routers,” earlier in this chapter, for the gory details.)
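The WWN “area code” trick described in the FC-SW entry above can be turned into a small sketch. The function below is a made-up helper, and the rule itself is only a heuristic, not a guarantee:

```python
def classify_wwn(wwn: str) -> str:
    """Guess what a WWN belongs to, using the rule of thumb that most
    storage ports' WWNs start with 5 and most HBAs' with 10 or 21.
    Heuristic only -- not every vendor follows this convention."""
    digits = wwn.replace(":", "").lower()
    if digits.startswith("5"):
        return "storage port"
    if digits.startswith(("10", "21")):
        return "server HBA"
    return "unknown"

print(classify_wwn("50:06:03:81:D6:F3:10:32"))  # storage port
print(classify_wwn("10:00:00:01:a9:42:fc:06"))  # server HBA
```

The two sample WWNs are the ones used in the text above; anything that matches neither prefix is simply reported as unknown.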
Although the iSCSI and InfiniBand protocols can also be used in storage
networks, the iSCSI protocol is used over an IP network and then usually
bridged into a Fibre Channel SAN. InfiniBand, on the other hand, is used
over a dedicated InfiniBand network as a server interconnect, and then
bridged into a Fibre Channel SAN for storage access. But the field is always
changing: InfiniBand and iSCSI storage arrays are now becoming available,
but they still use either an IP or IB interface rather than FC.

Fabric addressing

The addressing scheme used in SAN fabrics is quite different from that in SAN
loops. A fabric can contain thousands of devices rather than the maximum
127 in a loop. Each device in the fabric must have a unique address, just as
every phone number in the world is unique. This is done by assigning every
device in a SAN fabric a World Wide Name (WWN).
What in the world is a World Wide Name?
Each device on the network has a World Wide Name, a 64-bit hexadecimal
number coded into it by its manufacturer. The WWN is often assigned via a
standard block of addresses made available for manufacturers to use. Thus
every device in a SAN fabric has a built-in address assigned by a central
naming authority — in this case, one of the standard-setting organizations
that control SAN standards — the Institute of Electrical and Electronics
Engineers (IEEE, pronounced eye triple-e). The WWN is sometimes referred to
by its IEEE address. A typical WWN in a SAN will look something like this: 20000000C8328F00.
On some devices, such as large storage arrays, the storage array itself is
assigned the WWN and the manufacturer then uses the assigned WWN as the
basis for virtual WWNs, which add sequential numbers to identify ports.
The WWN of the storage array is known as the World Wide Node Name or
WWNN. The resulting WWN of the port on the storage array is known as the

World Wide Port Name or WWPN. If the base WWN is (say) 20000000C8328F00
and the storage array has four ports, the array manufacturer could use the
assigned WWN as the base, and then use offsets to create the WWPN for each
port, like this:
20000000C8328F01 for port 1
20000000C8328F02 for port 2
20000000C8328F03 for port 3
20000000C8328F04 for port 4
The manufacturers can use offsets to create World Wide Names as long as
the offsets used do not overlap with any other assigned WWNs from the
block of addresses assigned to the manufacturer.
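The offset scheme above is just hexadecimal arithmetic on the base WWN. Here is a minimal sketch; the function name is invented, and real arrays may use different offsets:

```python
def wwpn_for_port(base_wwn: str, offset: int) -> str:
    """Derive a port's WWPN by adding an offset to the array's base
    WWN (the WWNN), as in the four-port example above."""
    return format(int(base_wwn, 16) + offset, "016X")

base = "20000000C8328F00"   # base WWN from the example
for port in range(1, 5):
    print("port", port, "->", wwpn_for_port(base, port))
# port 1 -> 20000000C8328F01 ... port 4 -> 20000000C8328F04
```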
When it comes to Fibre Channel addressing, the term WWN always refers
to the WWPN of the actual ports, which are like the MAC addresses of an
Ethernet network card. The WWPN (now forever referred to as the WWN for
short) is always used in the name server in the switch to identify devices on
the SAN.

The name server

The name server is a logical service (a specialized program that runs in the
SAN switches) used by the devices connected to the SAN to locate other
devices. The name server in the switched fabric acts like a telephone direc-
tory listing. When a device is plugged into a switch, it logs in to the switch (a
process like logging in to your PC) and registers itself with the name server.
The name server uses its own database to store the WWN information for
every device connected to the fabric, as well as the switch port information
and the associated WWN of each device. When one device wants to talk to
another in the fabric, it looks up that device’s address (its WWN) in the name
server, finds out which port the device is located on, and communication is
then routed between the two devices.
Figure 3-5 shows the name server’s lookup operation in action. The arrows
show how the server on Switch 1 (with address 20000000C8328FE6)
locates the address of the storage device on Switch 2 (at address
50000000B2358D34). After it finds the storage device’s address in the name
server, it knows which switch it’s located on and how to get to the device.
When a network gets big enough to have a few hundred devices connected to a
bunch of switches, the use of a directory listing inside the fabric makes sense.

The switches’ name server information can be used to troubleshoot problems
in a SAN. If your device is connected to a switch but doesn’t get registered in
the name server table, then you know that the problem is somewhere between
the server and the switch; you may have a bad cable. (See Chapter 12 for more
SAN troubleshooting tips.)
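As a rough illustration of the directory behavior just described, here is a toy name server in Python. The class and method names are invented; the WWNs are the ones from the Figure 3-5 example:

```python
class NameServer:
    """Toy model of a fabric name server: devices register their WWN
    against the switch port they logged in on, and peers look up a WWN
    to find out where to route. Only an illustration, not the real
    FC name-server service."""

    def __init__(self):
        self._directory = {}            # WWN -> (switch, port)

    def register(self, wwn, switch, port):
        self._directory[wwn] = (switch, port)

    def lookup(self, wwn):
        return self._directory.get(wwn)  # None if not registered

ns = NameServer()
ns.register("20000000C8328FE6", "switch1", 3)   # server HBA logs in
ns.register("50000000B2358D34", "switch2", 7)   # storage port logs in
print(ns.lookup("50000000B2358D34"))            # ('switch2', 7)
```

A device that is cabled up but absent from the directory matches the troubleshooting case above: the lookup returns nothing, which points at a problem between the device and the switch.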

fabric name server

Note: this article is from book <Storage Area Networks For Dummies®>.

Categories: Hardware, SAN, Storage Tags:

Understanding the Benefits of a SAN

December 10th, 2013 No comments

The typical benefits of using a SAN are a very high return on investment (ROI),
a reduction in the total cost of ownership (TCO) of computing capabilities, and
a pay-back period (PBP) of months rather than years. Here are some specific
ways you can expect a SAN to be beneficial:

✓ Removes the distance limits of SCSI-connected disks: The maximum
length of a SCSI bus is around 25 meters. Fibre Channel SANs allow you
to connect your disks to your servers over much greater distances.
✓ Greater performance: Current Fibre Channel SANs allow connection
to disks at hundreds of megabytes per second; the near future will see
speeds in multiple gigabytes to terabytes per second.
✓ Increased disk utilization: SANs enable more than one server to access
the same physical disk, which lets you allocate the free space on those
disks more effectively.
✓ Higher availability to storage by use of multiple access paths: A SAN
allows for multiple physical connections to disks from a single or mul-
tiple servers.
✓ Deferred disk procurement: That’s business-speak for not having to buy
disks as often as you used to before getting a SAN. Because you can use
disk space more effectively, no space goes to waste.
✓ Reduced data center rack/floor space: Because you don’t need to buy big
servers with room for lots of disks, you can buy fewer, smaller servers —
an arrangement that takes up less room.
✓ New disaster-recovery capabilities: This is a major benefit. SAN devices
can mirror the data on the disks to another location. This thorough
backup capability can make your data safe if a disaster occurs.
✓ Online recovery: By using online mirrors of your data in a SAN device,
or new continuous data protection solutions, you can instantly recover
your data if it becomes lost, damaged, or corrupted.

✓ Better staff utilization: SANs enable fewer people to manage much
more data.
✓ Reduction of management costs as a percentage of storage costs:
Because you need fewer people, your management costs go down.
✓ Improved overall availability: This is another big one. SAN storage is
much more reliable than internal, server-based disk storage. Things
break a lot less often.
✓ Reduction of servers: You won’t need as many file servers with a SAN.
And because SANs are so fast, even your existing servers run faster
when connected to the SAN. You get more out of your current servers
and don’t need to buy new ones as often.
✓ Improved network performance and fewer network upgrades: You can
back up all your data over the SAN (which is dedicated to that purpose)
rather than over the LAN (which has other duties). Since you use less
bandwidth on the LAN, you can get more out of it.
✓ Increased input/output (I/O) performance and bulk data movement:
Yup, SANs are fast. They move data much faster than do internal drives
or devices attached to the LAN. In high-performance computing envi-
ronments, for example, IB (Infiniband) storage-network technology can
move a single data stream at multiple gigabytes per second.
✓ Reduced/eliminated backup windows: A backup window is the time it
takes to back up all your data. When you do your backups over the SAN
instead of over the LAN, you can do them at any time, day or night. If
you use CDP (Continuous Data Protection) solutions over the SAN, you
can pretty much eliminate backup as a separate process (it just happens
all the time).
✓ Protected critical data: SAN storage devices use advanced technology
to ensure that your critical data remains safe and available.
✓ Nondisruptive scalability: Sounds impressive, doesn’t it? It means you
can add storage to a storage network at any time without affecting the
devices currently using the network.
✓ Easier development and testing of applications: By using SAN-based
mirror copies of production data, you can easily use actual production
data to test new applications while the original application stays online.
✓ Support for server clusters: Server clustering is a method of making two
individual servers look like one and guard each other’s back. If one of
them has a heart attack, the other one takes over automatically to keep
the applications running. Clusters require access to a shared disk drive;
a SAN makes this possible.
✓ Storage on demand: Because SAN disks are available to any server in
the storage network, free storage space can be allocated on demand to
any server that needs it, any time. Storage virtualization can simplify
storage provisioning across storage arrays from multiple vendors.


This is from book <Storage Area Networks For Dummies®>.

Categories: Hardware, SAN, Storage Tags:

debugging nfs problem with snoop in solaris

December 3rd, 2013 No comments

Network analyzers are ultimately the most useful tools available when it comes to debugging NFS problems. The snoop network analyzer bundled with Solaris was introduced in Section 13.5. This section presents an example of how to use snoop to resolve NFS-related problems.

Consider the case where the NFS client rome attempts to access the contents of the filesystems exported by the server zeus through the /net automounter path:

rome% ls -la /net/zeus/export
total 5
dr-xr-xr-x   3 root     root           3 Jul 31 22:51 .
dr-xr-xr-x   2 root     root           2 Jul 31 22:40 ..
drwxr-xr-x   3 root     other        512 Jul 28 16:48 eng
dr-xr-xr-x   1 root     root           1 Jul 31 22:51 home
rome% ls /net/zeus/export/home
/net/zeus/export/home: Permission denied


The client is not able to open the contents of the directory /net/zeus/export/home, although the directory gives read and execute permissions to all users:

rome% df -k /net/zeus/export/home
filesystem            kbytes    used   avail capacity  Mounted on
-hosts                     0       0       0     0%    /net/zeus/export/home


The df command shows the -hosts automap mounted on the path of interest. This means that the NFS filesystem zeus:/export/home has not yet been mounted. To investigate the problem further, snoop is invoked while the problematic ls command is rerun:

 rome# snoop -i /tmp/snoop.cap rome zeus
  1   0.00000      rome -> zeus      PORTMAP C GETPORT prog=100003 (NFS) vers=3 
  2   0.00314      zeus -> rome      PORTMAP R GETPORT port=2049
  3   0.00019      rome -> zeus      NFS C NULL3
  4   0.00110      zeus -> rome      NFS R NULL3 
  5   0.00124      rome -> zeus      PORTMAP C GETPORT prog=100005 (MOUNT) vers=1 
  6   0.00283      zeus -> rome      PORTMAP R GETPORT port=33168
  7   0.00094      rome -> zeus      TCP D=33168 S=49659 Syn Seq=1331963017 Len=0 
Win=24820 Options=<nop,nop,sackOK,mss 1460>
  8   0.00142      zeus -> rome      TCP D=49659 S=33168 Syn Ack=1331963018 
Seq=4025012052 Len=0 Win=24820 Options=<nop,nop,sackOK,mss 1460>
  9   0.00003      rome -> zeus      TCP D=33168 S=49659     Ack=4025012053 
Seq=1331963018 Len=0 Win=24820
 10   0.00024      rome -> zeus      MOUNT1 C Get export list
 11   0.00073      zeus -> rome      TCP D=49659 S=33168     Ack=1331963062 
Seq=4025012053 Len=0 Win=24776
 12   0.00602      zeus -> rome      MOUNT1 R Get export list 2 entries
 13   0.00003      rome -> zeus      TCP D=33168 S=49659     Ack=4025012173 
Seq=1331963062 Len=0 Win=24820
 14   0.00026      rome -> zeus      TCP D=33168 S=49659 Fin Ack=4025012173 
Seq=1331963062 Len=0 Win=24820
 15   0.00065      zeus -> rome      TCP D=49659 S=33168     Ack=1331963063 
Seq=4025012173 Len=0 Win=24820
 16   0.00079      zeus -> rome      TCP D=49659 S=33168 Fin Ack=1331963063 
Seq=4025012173 Len=0 Win=24820
 17   0.00004      rome -> zeus      TCP D=33168 S=49659     Ack=4025012174 
Seq=1331963063 Len=0 Win=24820
 18   0.00058      rome -> zeus      PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 
 19   0.00412      zeus -> rome      PORTMAP R GETPORT port=34582
 20   0.00018      rome -> zeus      MOUNT3 C Null
 21   0.00134      zeus -> rome      MOUNT3 R Null 
 22   0.00056      rome -> zeus      MOUNT3 C Mount /export/home
 23   0.23112      zeus -> rome      MOUNT3 R Mount Permission denied


Packet 1 shows the client rome requesting the port number of the NFS service (RPC program number 100003, Version 3, over the UDP protocol) from the server’s rpcbind (portmapper). Packet 2 shows the server’s reply indicating nfsd is running on port 2049. Packet 3 shows the automounter’s call to the server’s nfsd daemon to verify that it is indeed running, and the server’s successful reply is shown in packet 4.

Packet 5 shows the client’s request for the port number of RPC program number 100005, Version 1, over TCP (the RPC MOUNT program). The server replies in packet 6 with port=33168. Packets 7 through 9 are the TCP handshake between our NFS client and the server’s mountd. Packet 10 shows the client’s call to the server’s mountd daemon (which implements the MOUNT program) currently running on port 33168; the client is requesting the list of exported entries. The server replies with packet 12, which includes the names of the two entries exported.

Packets 18 and 19 are similar to packets 5 and 6, except that this time the client is asking for the port number of the MOUNT program Version 3 running over UDP. Packets 20 and 21 show the client verifying that Version 3 of the MOUNT service is up and running on the server. Finally, the client issues the Mount /export/home request to the server in packet 22, requesting the filehandle of the /export/home path. The server’s mountd daemon checks its export list, determines that the host rome is not present in it, and replies to the client with a “Permission Denied” error in packet 23.

The analysis indicates that the “Permission Denied” error returned to the ls command came from the MOUNT request made to the server, not from problems with directory mode bits on the client. Having gathered this information, we study the exported list on the server and quickly notice that the filesystem /export/home is exported only to the host verona:

rome$ showmount -e zeus
export list for zeus:
/export/eng  (everyone)
/export/home verona


We could have obtained the same information by inspecting the contents of packet 12, which contains the export list requested during the transaction:

rome# snoop -i /tmp/snoop.cap -v -p 10,12
      Packet 10 arrived at 3:32:47.73
RPC:  ----- SUN RPC Header -----
RPC:  Record Mark: last fragment, length = 40
RPC:  Transaction id = 965581102
RPC:  Type = 0 (Call)
RPC:  RPC version = 2
RPC:  Program = 100005 (MOUNT), version = 1, procedure = 5
RPC:  Credentials: Flavor = 0 (None), len = 0 bytes
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
MOUNT:----- NFS MOUNT -----
MOUNT:Proc = 5 (Return export list)
       Packet 12 arrived at 3:32:47.74
RPC:  ----- SUN RPC Header -----
RPC:  Record Mark: last fragment, length = 92
RPC:  Transaction id = 965581102
RPC:  Type = 1 (Reply)
RPC:  This is a reply to frame 10
RPC:  Status = 0 (Accepted)
RPC:  Verifier   : Flavor = 0 (None), len = 0 bytes
RPC:  Accept status = 0 (Success)
MOUNT:----- NFS MOUNT -----
MOUNT:Proc = 5 (Return export list)
MOUNT:Directory = /export/eng
MOUNT:Directory = /export/home
MOUNT: Group = verona


For simplicity, only the RPC and NFS Mount portions of the packets are shown. Packet 10 is the request for the export list; packet 12 is the reply. Notice that every RPC packet contains the transaction ID (XID), the message type (call or reply), the status of the call, and the credentials. Notice also that the RPC header includes the string “This is a reply to frame 10”. This is not part of the network packet: snoop keeps track of the XIDs it has processed and attempts to match calls with replies and retransmissions. This feature comes in very handy during debugging. The Mount portion of packet 12 shows the list of directories exported and the group of hosts to which they are exported. In this case, we can see that /export/home was exported with access rights only to the host verona. The problem can be fixed by adding the host rome to the export list on the server.
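Snoop’s matching of calls to replies via the transaction ID can be sketched roughly as follows. The frame tuples are invented for illustration, with the XID taken from packets 10 and 12 above:

```python
def match_rpc_frames(frames):
    """Pair RPC replies with the calls they answer by XID, the way
    snoop annotates 'This is a reply to frame N'. Each frame is a
    (frame_no, xid, kind) tuple, kind being 'call' or 'reply'."""
    pending = {}    # xid -> frame number of the outstanding call
    matches = {}    # reply frame -> call frame it answers
    for frame_no, xid, kind in frames:
        if kind == "call":
            pending[xid] = frame_no     # a repeat here = retransmission
        elif xid in pending:
            matches[frame_no] = pending.pop(xid)
    return matches

trace = [(10, 965581102, "call"), (12, 965581102, "reply")]
print(match_rpc_frames(trace))   # {12: 10}
```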


Troubleshooting NFS locking problems in solaris

November 29th, 2013 No comments

Lock problems will be evident when an NFS client tries to lock a file, and it fails because someone has it locked. For applications that share access to files, the expectation is that locks will be short-lived. Thus, the pattern your users will notice when something is awry is that yesterday an application started up quite quickly, but today it hangs. Usually it is because an NFS/NLM client holds a lock on a file that your application needs to lock, and the holding client has crashed.

11.3.1. Diagnosing NFS lock hangs

On Solaris, you can use tools like pstack and truss to verify that processes are hanging in a lock request:

client1% ps -eaf | grep SuperApp
     mre 23796 10031  0 11:13:22 pts/6    0:00 SuperApp
client1% pstack 23796
23796:  SuperApp
 ff313134 fcntl    (1, 7, ffbef9dc)
 ff30de48 fcntl    (1, 7, ffbef9dc, 0, 0, 0) + 1c8
 ff30e254 lockf    (1, 1, 0, 2, ff332584, ff2a0140) + 98
 0001086c main     (1, ffbefac4, ffbefacc, 20800, 0, 0) + 1c
 00010824 _start   (0, 0, 0, 0, 0, 0) + dc
client1% truss -p 23796
fcntl(1, F_SETLKW, 0xFFBEF9DC)  (sleeping...)


This verifies that the application is stuck in a lock request. We can use pfiles to see what is going on with the files of process 23796:

client1% pfiles 23796
23796:  SuperApp
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:136,0 ino:37990 uid:466 gid:7 rdev:24,37
   1: S_IFREG mode:0644 dev:208,1823 ino:5516985 uid:466 gid:300 size:0
      advisory write lock set by process 3242
   2: S_IFCHR mode:0620 dev:136,0 ino:37990 uid:466 gid:7 rdev:24,37


That we are told that there is an advisory lock set on file descriptor 1 that is set by another process, process ID 3242, is useful, but unfortunately it doesn’t tell us if 3242 is a local process or a process on another NFS client or NFS server. We also aren’t told if the file mapped to file descriptor 1 is a local file, or an NFS file. We are, however, told that the major and minor device numbers of the filesystem are 208 and 1823 respectively. If you run the mount command without any arguments, this dumps the list of mounted file systems. You should see a display similar to:

/ on /dev/dsk/c0t0d0s0 read/write/setuid/intr/largefiles/onerror=panic/dev=2200000 
on Thu Dec 21 11:13:33 2000
/usr on /dev/dsk/c0t0d0s6 read/write/setuid/intr/largefiles/onerror=panic/
dev=2200006 on Thu Dec 21 11:13:34 2000
/proc on /proc read/write/setuid/dev=31c0000 on Thu Dec 21 11:13:29 2000
/dev/fd on fd read/write/setuid/dev=32c0000 on Thu Dec 21 11:13:34 2000
/etc/mnttab on mnttab read/write/setuid/dev=3380000 on Thu Dec 21 11:13:35 2000
/var on /dev/dsk/c0t0d0s7 read/write/setuid/intr/largefiles/onerror=panic/
dev=2200007 on Thu Dec 21 11:13:40 2000
/home/mre on spike:/export/home/mre remote/read/write/setuid/intr/dev=340071f on 
Thu Dec 28 08:51:30 2000


The numbers after dev= are in hexadecimal. Device numbers are constructed by taking the major number, shifting it left several bits, and then adding the minor number. Convert the minor number 1823 to hexadecimal, and look for it in the mount table:

client1% printf "%x\n" 1823
71f
client1% mount | grep 'dev=.*71f'
/home/mre on spike:/export/home/mre remote/read/write/setuid/intr/dev=340071f on 
Thu Dec 28 08:51:30 2000

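The dev= arithmetic can be sketched in Python. The 18-bit minor field is an assumption (it matches the 32-bit Solaris dev_t encoding); with it, 0x340071f splits back into major 208 and minor 1823:

```python
NBITSMINOR = 18   # assumed minor-field width for 32-bit Solaris dev_t

def split_dev(dev):
    """Split a dev_t into (major, minor): the major number occupies the
    high bits above the minor field, as described in the text."""
    return dev >> NBITSMINOR, dev & ((1 << NBITSMINOR) - 1)

major, minor = split_dev(0x340071F)   # dev= value from the mount table
print(major, minor, hex(minor))       # 208 1823 0x71f
```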

We now know four things:

  • This is an NFS file we are blocking on.
  • The NFS server name is spike.
  • The filesystem on the server is /export/home/mre.
  • The inode number of the file is 5516985.

One obvious cause you should first eliminate is whether the NFS server spike has crashed or not. If it hasn’t crashed, then the next step is to examine the server.

11.3.2. Examining lock state on NFS/NLM servers

Solaris and other System V-derived systems have a useful tool called crash for analyzing system state. Crash actually reads the Unix kernel’s memory and formats its data structures in a more human readable form. Continuing with the example from Section 11.3.1, assuming /export/home/mre is a directory on a UFS filesystem, which can be verified by doing:

spike# df -F ufs | grep /export
/export               (/dev/dsk/c0t0d0s7 ):  503804 blocks   436848 files


then you can use crash to get more lock state.

The crash command is like a shell, but with internal commands for examining kernel state. The internal command we will be using is lck :

spike# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> lck
Active and Sleep Locks:
30000c3ee18  w   0      0       13   136   0021 3       48bf0f8  ae9008   6878d00 
30000dd8710  w   0      MAXEND  17   212   0001 3       8f1a48   8f02d8   8f0e18  
30001cce1c0  w   193    MAXEND  -1   3242  2021 3       6878850  c43a08   2338a38 

Summary From List:
   3      3       0


An important field is PROC. PROC is the “slot” number of the process. If it is -1, that indicates that the lock is being held by a nonlocal (i.e., an NFS client) process, and the PID field thus indicates the process ID, relative to the NFS client. In the sample display, we see one such entry:

30001cce1c0  w   193    MAXEND  -1   3242  2021 3       6878850  c43a08   2338a38


Note that the process id, 3242, is equal to that which the pfiles command displayed earlier in this example. We can confirm that this lock is for the file in question via crash’s uinode command:

> uinode 30001cce1c0
30001cce1c0  136,  7   5516985   2    1   466   300    403  f---644  mt rf


The inode numbers match what pfiles displayed earlier on the NFS client. However, inode numbers are unique only within a local filesystem. We can make doubly sure this is the file by comparing the major and minor device numbers from the uinode command, 136 and 7, with those of the filesystem that is mounted on /export:

spike# ls -lL /dev/dsk/c0t0d0s7
brw-------   1 root     sys      136,  7 May  6  2000 /dev/dsk/c0t0d0s7

11.3.3. Clearing lock state

Continuing with our example from Section 11.3.2, at this point we know that the file is locked by another NFS client. Unfortunately, we don’t know which client it is, as crash won’t give us that information. We do however have a potential list of clients in the server’s /var/statmon/sm directory:

spike# cd /var/statmon/sm
spike# ls
client1       ipv4.  ipv4.  gonzo      java


The entries prefixed with ipv4 are just symbolic links to other entries. The non-symbolic link entries identify the hosts we want to check for.

The most likely cause of the lock not getting released is that the holding NFS client has crashed. You can take the list of hosts from the /var/statmon/sm directory and check if any are dead, or not responding due to a network partition. Once you determine which are dead, you can use Solaris’s clear_locks command to clear lock state. Let’s suppose you determine that gonzo is dead. Then you would do:

spike# clear_locks gonzo


If clearing the lock state of dead clients doesn’t fix the problem, then perhaps a now-live client crashed, but for some reason after it rebooted, its status monitor did not send a notification to the NLM server’s status monitor. You can log onto the live clients and check if they are currently mounting the filesystem from the server (in our example, spike:/export). If they are not, then you should consider using clear_locks to clear any residual lock state those clients might have had.

Ultimately, you may be forced to reboot your server. Short of that, there are other things you could do. Since you know the inode number and filesystem of the file in question, you can determine the file’s name:

spike# cd /export
spike# find . -inum 5516985 -print


You could rename the file (database, in this example) to something else, and copy it back to a file named database. Then kill and restart the SuperApp application on client1. Of course, such an approach requires intimate knowledge of, or experience with, the application to know whether this will be safe.


This article is from book <Managing NFS and NIS, Second Edition>.

Categories: NAS, Storage Tags:

nfs null map – white out any map entry affecting directory

November 25th, 2013 No comments

The automounter also has a map “white-out” feature, via the -null special map. It is used after a directory to effectively delete any map entry affecting that directory from the automounter’s set of maps. It must precede the map entry being deleted. For example:

/tools -null

This feature is used to override auto_master or direct map entries that may have been inherited from an NIS map. If you need to make per-machine changes to the automounter maps, or if you need local control over a mount point managed by the automounter, white-out the conflicting map entry with the -null map.

PS: this is from book <>

Categories: NAS, Storage Tags:

nfs direct map vs indirect map

November 25th, 2013 No comments
  • Indirect maps

Here is an indirect automounter map for the /tools directory, called auto_tools:

deskset         -ro      mahimahi:/tools2/deskset 
sting                    mahimahi:/tools2/sting 
news                     thud:/tools3/news 
bugview                  jetstar:/usr/bugview


The first field is called the map key and is the final component of the mount point. The map name suffix and the mount point do not have to share the same name, but adopting this convention makes it easy to associate map names and mount points. This four-entry map is functionally equivalent to the /etc/vfstab excerpt:

mahimahi:/tools2/deskset - /tools/deskset  nfs - - ro 
mahimahi:/tools2/sting  - /tools/sting    nfs - -  
thud:/tools3/news       - /tools/news     nfs - -  
jetstar:/usr/bugview    - /tools/bugview  nfs - -
  • Direct maps

Direct maps define point-specific, nonuniform mount points. The best example of the need for a direct map entry is /usr/man. The /usr directory contains numerous other entries, so it cannot be an indirect mount point. Building an indirect map for /usr/man that uses /usr as a mount point will “cover up” /usr/bin and /usr/etc. A direct map allows the automounter to complete mounts on a single directory entry.

The key in a direct map is a full pathname, instead of the last component found in the indirect map. Direct maps also follow the /etc/auto_contents naming scheme. Here is a sample /etc/auto_direct:
/usr/man wahoo:/usr/share/man
/usr/local/bin mahimahi:/usr/local/bin.sun4

A major difference in behavior is that the real direct mount points are always visible to ls and other tools that read directory structures. The automounter treats direct mounts as individual directory entries, not as a complete directory, so the automounter gets queried whenever the directory containing the mount point is read. Client performance is affected in a marked fashion if direct mount points are used in several well-traveled directories. When a user reads a directory containing a number of direct mounts, the automounter initiates a flurry of mounting activity in response to the directory read requests. Section 9.5.3 describes a trick that lets you use indirect maps instead of direct maps. By using this trick, you can avoid mount storms caused by multiple direct mount points.

Contents from /etc/auto_master:
# Directory Map NFS Mount Options
/tools /etc/auto_tools -ro #this is indirect map
/- /etc/auto_direct #this is direct map
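As a sketch of the indirect-map/vfstab equivalence shown earlier, the following hypothetical translator expands map entries under their mount point. The parsing is deliberately simplified (one optional options field); real automounter maps support much more syntax:

```python
def expand_indirect_map(mount_point, entries):
    """Translate indirect automounter map entries (key, optional
    options, server:path) into vfstab-style mount specs, mirroring
    the equivalence shown in the text above."""
    specs = []
    for line in entries.strip().splitlines():
        fields = line.split()
        key, location = fields[0], fields[-1]
        options = fields[1] if len(fields) == 3 else "-"
        specs.append((location, f"{mount_point}/{key}", "nfs", options))
    return specs

auto_tools = """
deskset  -ro  mahimahi:/tools2/deskset
sting         mahimahi:/tools2/sting
"""
for spec in expand_indirect_map("/tools", auto_tools):
    print(spec)
```

Note how the map key becomes the final component of the mount point, exactly as the text describes.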
This article is mostly from book <Managing NFS and NIS, Second Edition>

Categories: NAS, Storage Tags:

SAN Terminology

September 13th, 2013 No comments
SCSI Target
A SCSI Target is a storage system end-point that provides a service of processing SCSI commands and I/O requests from an initiator. A SCSI Target is created by the storage system’s administrator, and is identified by unique addressing methods. A SCSI Target, once configured, consists of zero or more logical units.
SCSI Initiator
A SCSI Initiator is an application or production system end-point that is capable of initiating a SCSI session, sending SCSI commands and I/O requests. SCSI Initiators are also identified by unique addressing methods (See SCSI Targets).
Logical Unit
A Logical Unit is a term used to describe a component in a storage system. Uniquely numbered, this creates what is referred to as a Logical Unit Number, or LUN. A storage system, being highly configurable, may contain many LUNs. These LUNs, when associated with one or more SCSI Targets, form a unique SCSI device, a device that can be accessed by one or more SCSI Initiators.
iSCSI
Internet SCSI, a protocol for sharing SCSI-based storage over IP networks.
iSER
iSCSI Extensions for RDMA, a protocol that maps the iSCSI protocol over a network that provides RDMA services (e.g. InfiniBand). The iSER protocol is transparently selected by the iSCSI subsystem, based on the presence of correctly configured IB hardware. In the CLI and BUI, all iSER-capable components (targets and initiators) are managed as iSCSI components.
FC
Fibre Channel, a protocol for sharing SCSI-based storage over a storage area network (SAN) consisting of fiber-optic cables, FC switches, and HBAs.
SRP
SCSI RDMA Protocol, a protocol for sharing SCSI-based storage over a network that provides RDMA services (e.g. InfiniBand).
IQN
An iSCSI qualified name, the unique identifier of a device in an iSCSI network. IQNs take the form iqn.yyyy-mm.naming-authority:unique-name. For example, an appliance target IQN beginning with iqn.1986-03.com.sun shows that the device was built by a company whose domain was registered in March of 1986. The naming authority is just the DNS name of the company reversed, in this case "com.sun". Everything following is a unique ID that Sun uses to identify the target.
Target portal
When using the iSCSI protocol, the target portal refers to the unique combination of an IP address and TCP port number by which an initiator can contact a target.
Target portal group
When using the iSCSI protocol, a target portal group is a collection of target portals. Target portal groups are managed transparently; each network interface has a corresponding target portal group with that interface’s active addresses. Binding a target to an interface advertises that iSCSI target using the portal group associated with that interface.
CHAP
Challenge-handshake authentication protocol, a security protocol which can authenticate a target to an initiator, an initiator to a target, or both.
RADIUS
A system for using a centralized server to perform CHAP authentication on behalf of storage nodes.
Target group
A set of targets. LUNs are exported over all the targets in one specific target group.
Initiator group
A set of initiators. When an initiator group is associated with a LUN, only initiators from that group may access the LUN.
Categories: Hardware, SAN, Storage Tags: ,

make label for swap device using mkswap and blkid

August 6th, 2013 No comments

If you want to label a swap partition in Linux, you should not use e2label: e2label changes the label on an ext2/ext3/ext4 filesystem, and swap is not one of those.

If you use e2label for this, you will get the following error messages:

[root@node2 ~]# e2label /dev/xvda3 SWAP-VM
e2label: Bad magic number in super-block while trying to open /dev/xvda3
Couldn’t find valid filesystem superblock.

Instead, we should use mkswap, which has a -L option:

-L label     Specify a label, to allow swapon by label. (Only for new-style swap areas.)

So let’s see example below:

[root@node2 ~]# mkswap -L SWAP-VM /dev/xvda3
Setting up swapspace version 1, size = 2335973 kB
LABEL=SWAP-VM, no uuid

[root@node2 ~]# blkid
/dev/xvda1: LABEL=”/boot” UUID=”6c5ad2ad-bdf5-4349-96a4-efc9c3a1213a” TYPE=”ext3″
/dev/xvda2: LABEL=”/” UUID=”76bf0aaa-a58e-44cb-92d5-098357c9c397″ TYPE=”ext3″
/dev/xvdb1: LABEL=”VOL1″ TYPE=”oracleasm”
/dev/xvdc1: LABEL=”VOL2″ TYPE=”oracleasm”
/dev/xvdd1: LABEL=”VOL3″ TYPE=”oracleasm”
/dev/xvde1: LABEL=”VOL4″ TYPE=”oracleasm”
/dev/xvda3: LABEL=”SWAP-VM” TYPE=”swap”

[root@node2 ~]# swapon /dev/xvda3

[root@node2 ~]# swapon -s
Filename Type Size Used Priority
/dev/xvda3 partition 2281220 0 -1

So now we can add swap to /etc/fstab using LABEL=SWAP-VM:

LABEL=SWAP-VM           swap                    swap    defaults        0 0
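If you want to try the label workflow without touching a real partition, you can run it against a throwaway file (the path and label below are examples). One caution worth noting: mkswap re-creates the swap area, so run swapoff first if you are relabeling a swap device that is currently in use.

```shell
# Demo on a scratch file: create it, make it swap with a label, check it.
dd if=/dev/zero of=/tmp/swapdemo bs=1M count=16 2>/dev/null
mkswap -L SWAP-DEMO /tmp/swapdemo
blkid -o value -s LABEL /tmp/swapdemo    # prints: SWAP-DEMO
```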

Categories: Linux, Storage Tags: ,

iostat dm- mapping to physical device

July 30th, 2013 No comments

-bash-3.2# iostat -xn 2

avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.48 0.00 0.21 99.29

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 1949.00 0.00 129648.00 66.52 30.66 15.77 0.51 100.20
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 1139.00 0.00 88752.00 77.92 22.92 20.09 0.83 95.00

Device: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s rops/s wops/s
nas-host:/export/test/repo 0.00 0.00 0.00 218444.00 0.00 218332.00 3084.50 3084.50

Then how can we know which physical device dm-3 is mapping to?

-bash-3.2# cat /sys/block/dm-3/dev
253:3 #this is major, minor number of dm-3


-bash-3.2# dmsetup ls
dmnfs6 (253, 6)
dmnfs5 (253, 5)
dmnfs4 (253, 4)
dmnfs3 (253, 3)
dmnfs2 (253, 2)
dmnfs1 (253, 1)
dmnfs0 (253, 0)

-bash-3.2# ls -l /dev/mapper/dmnfs3
brw-rw—- 1 root disk 253, 3 Jul 30 08:28 /dev/mapper/dmnfs3

-bash-3.2# dmsetup status
dmnfs6: 0 27262976 nfs nfs 0
dmnfs5: 0 476938240 nfs nfs 0
dmnfs4: 0 29543535 nfs nfs 0
dmnfs3: 0 1536000000 nfs nfs 0 #device mapper of nfs, dm_nfs
dmnfs2: 0 29543535 nfs nfs 0 >
dmnfs1: 0 169508864 nfs nfs 0
dmnfs0: 0 29543535 nfs nfs 0

So now we know that it was NFS that was keeping the I/O busy.
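The lookup above can be scripted: match the major:minor pair from /sys/block/dm-N/dev against the output of dmsetup ls. This sketch uses the sample data from the listing above as a literal string so it runs anywhere; on a live box you would pipe dmsetup ls into the same awk.

```shell
# "253:3" is what /sys/block/dm-3/dev reported above.
majmin="253:3"
# Sample `dmsetup ls` output from this post (replace with the real command).
dmsetup_out='dmnfs6 (253, 6)
dmnfs5 (253, 5)
dmnfs3 (253, 3)'
major=${majmin%:*}
minor=${majmin#*:}
# dmsetup prints: name (major, minor) -- so match field 2 and 3.
name=$(printf '%s\n' "$dmsetup_out" | \
  awk -v maj="($major," -v min="$minor)" '$2 == maj && $3 == min {print $1}')
echo "dm-3 is $name"    # prints: dm-3 is dmnfs3
```

On reasonably recent kernels you can skip all this and read the name directly from /sys/block/dm-3/dm/name.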


For reference, several different layers can create these aggregated block devices on Linux:

device mapper (dm_mod module; list with dmsetup ls)

multiple devices, i.e. software RAID (mdraid; see /proc/mdstat)

dmraid (fake RAID)

DM-MPIO (DM-Multipathing; multipath/multipathd, dm_multipath module, usually combined with a SAN)

Categories: Storage Tags:

resolved – differences between zfs ARC L2ARC ZIL

January 31st, 2013 No comments
  • ARC

The ZFS ARC (adaptive replacement cache) is a very fast cache located in the server's memory.

For example, our ZFS server with 12GB of RAM has 11GB dedicated to ARC, which means our ZFS server will be able to cache 11GB of the most accessed data. Any read requests for data in the cache can be served directly from the ARC memory cache instead of hitting the much slower hard drives. This creates a noticeable performance boost for data that is accessed frequently.

  • L2ARC

As a general rule, you want to install as much RAM into the server as you can to make the ARC as big as possible. At some point, adding more memory is just cost prohibitive. That is where the L2ARC becomes important. The L2ARC is the second level adaptive replacement cache. The L2ARC is often called “cache drives” in the ZFS systems.

L2ARC is a new layer between Disk and the cache (ARC) in main memory for ZFS. It uses dedicated storage devices to hold cached data. The main role of this cache is to boost the performance of random read workloads. The intended L2ARC devices include 10K/15K RPM disks like short-stroked disks, solid state disks (SSD), and other media with substantially faster read latency than disk.

  • ZIL

The ZIL (ZFS Intent Log) exists to improve the performance of synchronous writes. Synchronous writes are much slower than asynchronous writes, but they are safer. Essentially, the intent log of a file system is nothing more than insurance against power failures, a to-do list if you will, that keeps track of the stuff that needs to be updated on disk, even if the power fails (or something else happens that prevents the system from updating its disks).

To get better performance, use separate devices (typically SSDs) for the ZIL, e.g. zpool add pool log c2d0.

Now here is a real example of ZIL/L2ARC/ARC on a SUN ZFS 7320 head:

test-zfs# zpool iostat -v exalogic
capacity operations bandwidth
pool alloc free read write read write
————————- —– —– —– —– —– —–
exalogic 6.78T 17.7T 53 1.56K 991K 25.1M
mirror 772G 1.96T 6 133 111K 2.07M
c0t5000CCA01A5FDCACd0 – - 3 36 57.6K 2.07M #these are the physical disks
c0t5000CCA01A6F5CF4d0 – - 2 35 57.7K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A6F5D00d0 – - 2 36 56.2K 2.07M
c0t5000CCA01A6F64F4d0 – - 2 35 57.3K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A76A7B8d0 – - 2 36 56.3K 2.07M
c0t5000CCA01A746CCCd0 – - 2 36 56.8K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A749A88d0 – - 2 35 56.7K 2.07M
c0t5000CCA01A759E90d0 – - 2 35 56.1K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A767FDCd0 – - 2 35 56.1K 2.07M
c0t5000CCA01A782A40d0 – - 2 35 57.1K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A782D10d0 – - 2 35 57.2K 2.07M
c0t5000CCA01A7465F8d0 – - 2 35 56.3K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A7597FCd0 – - 2 35 57.6K 2.07M
c0t5000CCA01A7828F4d0 – - 2 35 56.2K 2.07M
mirror 772G 1.96T 5 133 110K 2.07M
c0t5000CCA01A7829ACd0 – - 2 35 57.1K 2.07M
c0t5000CCA01A78278Cd0 – - 2 35 57.4K 2.07M
mirror 772G 1.96T 6 133 111K 2.07M
c0t5000CCA01A736000d0 – - 3 35 57.3K 2.07M
c0t5000CCA01A738000d0 – - 2 35 57.3K 2.07M
c0t5000A72030061B82d0 224M 67.8G 0 98 1 1.62M #ZIL(SSD write cache, ZFS Intent Log)
c0t5000A72030061C70d0 224M 67.8G 0 98 1 1.62M
c0t5000A72030062135d0 223M 67.8G 0 98 1 1.62M
c0t5000A72030062146d0 224M 67.8G 0 98 1 1.62M
cache – - – - – -
c2t2d0 334G 143G 15 6 217K 652K #L2ARC(SSD cache drives)
c2t3d0 332G 145G 15 6 215K 649K
c2t4d0 333G 144G 11 6 169K 651K
c2t5d0 333G 144G 13 6 192K 650K
c2t2d0 – - 0 0 0 0
c2t3d0 – - 0 0 0 0
c2t4d0 – - 0 0 0 0
c2t5d0 – - 0 0 0 0

And as for ARC:

test-zfs:> status memory show
Cache 63.4G bytes #ARC
Unused 17.3G bytes
Mgmt 561M bytes
Other 491M bytes
Kernel 14.3G bytes

Categories: Kernel, NAS, SAN, Storage Tags: ,

sun zfs firmware upgrade howto

January 29th, 2013 No comments

This article is about upgrading the firmware on a Sun ZFS 7320 (the steps may work for other Sun ZFS appliance models too):

Categories: NAS, SAN, Storage Tags:

perl script for monitoring sun zfs memory usage

January 16th, 2013 No comments

On the ZFS appliance's aksh CLI, I can check memory usage with the following:

test-zfs:> status memory show
Cache 719M bytes
Unused 15.0G bytes
Mgmt 210M bytes
Other 332M bytes
Kernel 7.79G bytes

So now I want to collect this memory usage information automatically for SNMP use. Here are the steps:

cpan> o conf prerequisites_policy follow
cpan> o conf commit

Since the host uses a proxy to reach the internet, set the following in /etc/wgetrc:

http_proxy =
ftp_proxy =
use_proxy = on

Now install the Net::SSH::Perl perl module:

PERL_MM_USE_DEFAULT=1 perl -MCPAN -e 'install Net::SSH::Perl'

And to confirm that Net::SSH::Perl was installed, run the following command:

perl -e 'use Net::SSH::Perl' #no output is good, as it means the package was installed successfully

Now here goes the perl script to get the memory usage of sun zfs head:

[root@test-centos ~]# cat /var/tmp/mrtg/
use strict;
use warnings;
use Net::SSH::Perl;

# Reconstructed from the original (partly truncated) script: log in,
# run "status memory show", normalize every value to MB, print the lines.
my $host     = 'test-zfs';
my $user     = 'root';
my $password = 'password';

my $ssh = Net::SSH::Perl->new($host);
$ssh->login($user, $password);
my ($stdout, $stderr, $exit) = $ssh->cmd("status memory show");
if ($exit) {
    print "ErrorCode:$exit\n";
    print "ErrorMsg:$stderr";
} else {
    my @std_arr = split(/\n/, $stdout);
    shift @std_arr;    # drop the echoed command line
    foreach (@std_arr) {
        if ($_ =~ /.+\b\s+(.+)M\sbytes/) {
            $_ = $1;             # already in MB
        } elsif ($_ =~ /.+\b\s+(.+)G\sbytes/) {
            $_ = $1 * 1024;      # convert GB to MB
        }
    }
    foreach (@std_arr) {
        print $_ . "\n";
    }
}
exit $exit;

If you get the following error messages during installation of a perl module:

[root@test-centos ~]# perl -MCPAN -e ‘install SOAP::Lite’
CPAN: Storable loaded ok
CPAN: LWP::UserAgent loaded ok
Fetching with LWP:
LWP failed with code[500] message[LWP::Protocol::MyFTP: connect: Connection timed out]
Fetching with Net::FTP:

Trying with “/usr/bin/links -source” to get
ELinks: Connection timed out

Then check whether the host needs a proxy to reach the internet (run cpan> o conf init to re-configure CPAN; then set http_proxy, ftp_proxy and use_proxy in /etc/wgetrc).


Categories: NAS, Perl, Storage Tags: ,

zfs iops on nfs iscsi disk

January 5th, 2013 No comments

On the ZFS Storage 7000 series BUI, you may see statistics like the following:

This may seem odd: NFSv3 (3052) + iSCSI (1021) is larger than Disk (1583). Since I/O for the NFSv3 and iSCSI protocols ultimately goes to disk, why is the combined protocol IOPS higher than the disk IOPS?

Here’s the reason:

The operations reported for NFSv3 and iSCSI are logical operations. These logical operations are combined and optimized by the ZFS storage software before being issued as physical disk operations.


1. For sequential disk access (like VOD streaming), disk throughput, not IOPS, becomes the performance bottleneck. In contrast, IOPS limits disk performance when the access pattern is random.

2. For NAS performance analysis, here are two good articles (in Chinese):

3. You may also wonder how disk IOPS can be as high as 1583: this number is the sum across all disk controllers in the ZFS storage system. Here are some ballpark numbers for HDD IOPS:
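Ballpark per-disk figures come from simple arithmetic: a single HDD's random IOPS is roughly 1 / (average seek time + average rotational latency). A quick sketch (the 15K RPM drive numbers below are illustrative assumptions, not measurements from this appliance):

```shell
# Rough random IOPS for one HDD = 1 / (avg seek + avg rotational latency).
awk 'BEGIN {
  rpm     = 15000
  seek_ms = 3.5                # typical average seek for a 15K RPM drive
  rot_ms  = 60000 / rpm / 2    # half a revolution: 2 ms at 15K RPM
  printf "~%d IOPS per disk\n", 1000 / (seek_ms + rot_ms)
}'
# prints: ~181 IOPS per disk
```

At roughly 180 IOPS per 15K drive, a pool with a couple dozen spindles across several controllers can plausibly sum to the ~1583 seen above.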


Categories: Hardware, NAS, SAN, Storage Tags:

zfs shared lun stoage set up for oracle RAC

January 4th, 2013 No comments
  • create iSCSI Target Group

Open the ZFS BUI and navigate to "Configuration" -> "SAN" -> "iSCSI Targets". Create a new iSCSI target by clicking the plus sign. Give it an alias and select the network interface (possibly a bond or LACP aggregation) you want to use. After creating the target, drag it to the right side under "iSCSI Target Groups" to create an iSCSI target group. You can give that target group a name too. Note down the target group's IQN; this is important for later operations. (Network interfaces: use the NAS interface; you can select multiple interfaces.)

  • create iSCSI Initiator Group

Before going on to the next step, we first need to get the iSCSI initiator IQN of each host we want a LUN allocated to. On each Linux host, read the following file to get the IQN (you can edit this file first, e.g. make the IQN end with the hostname so later LUN operations are easier; do a /etc/init.d/iscsi restart after modifying initiatorname.iscsi):

[root@test-host ~]# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=<your host’s iqn name>

Now go back to the ZFS BUI and navigate to "Configuration" -> "SAN" -> "Initiators". On the left side, click "iSCSI Initiators", then click the plus sign. Enter the IQN you got from the previous step and give it a name (do this for each host you want an iSCSI LUN allocated to). After this, drag the newly created iSCSI initiator(s) from the left side to form new iSCSI initiator groups on the right side (drag two items from the left onto the same item on the right to form one group).

  • create shared LUNs for iSCSI Initiator Group

Now we create LUNs for the iSCSI initiator group (so that shared LUNs can be allocated; oracle RAC, for example, needs shared storage). Click the diskette icon on the just-created iSCSI initiator group, select the project you want the LUN allocated from, give it a name, and assign the volume size. Select the right target group you created before (you can also create a new one, e.g. RAC, in Shares).

PS: you can also go to "Shares" -> "Luns" and create LUN(s) using the target group you created with the default initiator group. Note that one LUN needs one iSCSI target, so create more iSCSI targets and add them to the iSCSI target group if you want more LUNs.

  • scan shared LUNs from hosts

Now we’re going to operate on linux hosts. On each host you want iSCSI LUN allocated, do the following steps:

iscsiadm -m discovery -t st -p <ip address of your zfs storage> # use the cluster IP if there is a ZFS cluster
iscsiadm -m node -T <iSCSI Target Group iqn> -p <ip address of your zfs storage> -l # or use the output of the discovery command ($1 is --portal, $3 is --targetname, -l is --login)
service iscsi restart

After these steps, your host(s) should see the newly allocated iSCSI LUN(s); run fdisk -l to confirm.

Good luck!

Categories: NAS, Storage Tags:

resolved – error: conflicting types for ‘pci_pcie_cap’ – Infiniband driver OFED installation

December 10th, 2012 No comments

OFED stands for "OpenFabrics Enterprise Distribution". When installing OFED- on a CentOS/RHEL 5.8 Linux system, I met the following problem:

In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-
/var/tmp/OFED_topdir/BUILD/ofa_kernel- error: conflicting types for ‘pci_pcie_cap’
include/linux/pci.h:1015: error: previous definition of ‘pci_pcie_cap’ was here
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-] Error 1
make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-] Error 2
make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-] Error 2
make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.18-308.′
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.70159 (%build)

After googling and experimenting, I can tell you the definite resolution for this problem: install another version of OFED, OFED-. Go to the official site to download it:


1. Here's the OFED installation howto: OFED installation README.txt

2. Some InfiniBand commands that may be useful:

service openibd status/start/stop
lspci |grep -i infini
ibv_devinfo #Verify the status of ports by using ibv_devinfo: all connected ports should report a “PORT_ACTIVE” state

how to turn on hba flags connected to EMC arrays

October 3rd, 2012 No comments

Per EMC's recommendation, the following flags should be enabled for VMware ESX hosts; otherwise there will be performance issues:


Here’s the commands that’ll do the trick:

sudo symmask -sid <sid> set hba_flags on C,SPC2,SC3 -enable -wwn <port wwn> -dir <dir number> -p <port number>

Categories: Hardware, NAS, SAN, Storage Tags:

Resolved – Errors found during scanning of LUN allocated from IBM XIV array

October 2nd, 2012 No comments

So here’s the story:
After the LUN (from an IBM XIV array) was allocated, we ran 'xiv_fc_admin -R' to make the LUN visible to the OS (testhost-db-clstr-vol_37 is the new LUN's volume name):
root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
Device Vol Name Vol Id
/dev/dsk/c2t500173804EE40140d19s2 testhost-db-clstr-vol_37 1974
/dev/dsk/c2t500173804EE40150d19s2 testhost-db-clstr-vol_37 1974
/dev/dsk/c4t500173804EE40142d19s2 testhost-db-clstr-vol_37 1974
/dev/dsk/c4t500173804EE40152d19s2 testhost-db-clstr-vol_37 1974

/dev/vx/dmp/xiv0_16 testhost-db-clstr-vol_17 1922

Non-XIV Devices

Then I ran 'vxdctl enable' to make the DMP device visible to the OS, but it returned errors:
root@testhost01 # vxdctl enable
VxVM vxdctl ERROR V-5-1-0 Data Corruption Protection Activated – User Corrective Action Needed
VxVM vxdctl INFO V-5-1-0 To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
VxVM vxdctl INFO V-5-1-0 Then, execute ‘vxdisk rm’ on the following devices before reinitiating device discovery:
xiv0_18, xiv0_18, xiv0_18, xiv0_18

After this, the new LUN disappeared from the output of 'xiv_devlist -o device,vol_name,vol_id' (testhost-db-clstr-vol_37 was gone), and xiv0_18 (the DMP device of the new LUN) became an 'unreachable device'; see below:

root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
Device Vol Name Vol Id

Non-XIV Devices
Unreachable devices: /dev/vx/dmp/xiv0_18
Also, ‘vxdisk list’ showed:
root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
flags: error private autoconfig
pubpaths: block=/dev/vx/dmp/xiv0_18s2 char=/dev/vx/rdmp/xiv0_18s2
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=disabled
c4t500173804EE40152d19s2 state=disabled
c2t500173804EE40150d19s2 state=disabled
c2t500173804EE40140d19s2 state=disabled

I tried to format the new DMP device (xiv0_18), but that failed too:
root@testhost01 # format -d /dev/vx/dmp/xiv0_18
Searching for disks…done

c2t500173804EE40140d19: configured with capacity of 48.06GB
c2t500173804EE40150d19: configured with capacity of 48.06GB
c4t500173804EE40142d19: configured with capacity of 48.06GB
c4t500173804EE40152d19: configured with capacity of 48.06GB
Unable to find specified disk ‘/dev/vx/dmp/xiv0_18′.

Also, ‘vxdisksetup -i’ failed with info below:
root@testhost01 # vxdisksetup -i /dev/vx/dmp/xiv0_18
prtvtoc: /dev/vx/rdmp/xiv0_18: No such device or address

And, ‘xiv_fc_admin -R’ failed with info below:
root@testhost01 # xiv_fc_admin -R
ERROR: Error during command execution: vxdctl enabled
OK, those are all of the symptoms and the headache; here's the solution:

1. Run 'xiv_fc_admin -R' ("ERROR: Error during command execution: vxdctl enabled" will appear; ignore it. This step scans for the new LUN). You can also run devfsadm -c disk (not actually needed).
2. Now exclude problematic paths of the DMP device(you can check the paths from vxdisk list xiv0_18)
root@testhost01 # vxdmpadm exclude vxvm path=c4t500173804EE40142d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c4t500173804EE40152d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c2t500173804EE40150d19s2
root@testhost01 # vxdmpadm exclude vxvm path=c2t500173804EE40140d19s2
3. Now run 'vxdctl enable'; this time the following error message will NOT be shown:
VxVM vxdctl ERROR V-5-1-0 Data Corruption Protection Activated – User Corrective Action Needed
VxVM vxdctl INFO V-5-1-0 To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
VxVM vxdctl INFO V-5-1-0 Then, execute ‘vxdisk rm’ on the following devices before reinitiating device discovery:
xiv0_18, xiv0_18, xiv0_18, xiv0_18
4. Now include the problematic paths:
root@testhost01 # vxdmpadm include vxvm path=c4t500173804EE40142d19s2
root@testhost01 # vxdmpadm include vxvm path=c4t500173804EE40152d19s2
root@testhost01 # vxdmpadm include vxvm path=c2t500173804EE40150d19s2
root@testhost01 # vxdmpadm include vxvm path=c2t500173804EE40140d19s2

5. Run ‘vxdctl enable’. After this, you should now see the DMP device in output of ‘xiv_devlist -o device,vol_name,vol_id’
root@testhost01 # xiv_devlist -o device,vol_name,vol_id
XIV Devices
Device Vol Name Vol Id

/dev/vx/dmp/xiv0_18 testhost-db-clstr-vol_37 1974

Non-XIV Devices

6. ‘vxdisk list’ will now show the DMP device(xiv0_18) as ‘auto – - nolabel’, obviously we should now label the DMP device:
root@testhost01 # format -d xiv0_18
Searching for disks…done

c2t500173804EE40140d19: configured with capacity of 48.06GB
c2t500173804EE40150d19: configured with capacity of 48.06GB
c4t500173804EE40142d19: configured with capacity of 48.06GB
c4t500173804EE40152d19: configured with capacity of 48.06GB
Unable to find specified disk ‘xiv0_18′.

root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
flags: nolabel private autoconfig
pubpaths: block=/dev/vx/dmp/xiv0_18 char=/dev/vx/rdmp/xiv0_18
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
errno: Disk is not usable
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=enabled
c4t500173804EE40152d19s2 state=enabled
c2t500173804EE40150d19s2 state=enabled
c2t500173804EE40140d19s2 state=enabled

root@testhost01 # vxdisksetup -i /dev/vx/dmp/xiv0_18
prtvtoc: /dev/vx/rdmp/xiv0_18: Unable to read Disk geometry errno = 0x16

Not again! But don't panic this time. Run format against each subpath of the DMP device (the subpaths are listed in the output of vxdisk list xiv0_18), for example:
root@testhost01 # format c4t500173804EE40142d19s2

c4t500173804EE40142d19s2: configured with capacity of 48.06GB
selecting c4t500173804EE40142d19s2
[disk formatted]
disk – select a disk
type – select (define) a disk type
partition – select (define) a partition table
current – describe the current disk
format – format and analyze the disk
repair – repair a defective sector
label – write label to the disk
analyze – surface analysis
defect – defect list management
backup – search for backup labels
verify – read and display labels
save – save new disk/partition definitions
inquiry – show vendor, product and revision
volname – set 8-character volume name
!<cmd> – execute <cmd>, then return
format> label
Ready to label disk, continue? yes

format> save
Saving new disk and partition definitions
Enter file name["./format.dat"]:
format> quit

7. After the subpaths are labelled, run 'vxdctl enable' again. You'll find the DMP device changed its state from 'auto - - nolabel' to 'auto:none - - online invalid', and vxdisk list no longer shows the DMP device as 'Disk is not usable':
root@testhost01 # vxdisk list xiv0_18
Device: xiv0_18
devicetag: xiv0_18
type: auto
info: format=none
flags: online ready private autoconfig invalid
pubpaths: block=/dev/vx/dmp/xiv0_18s2 char=/dev/vx/rdmp/xiv0_18s2
guid: -
udid: IBM%5F2810XIV%5F4EE4%5F07B6
site: -
Multipathing information:
numpaths: 4
c4t500173804EE40142d19s2 state=enabled
c4t500173804EE40152d19s2 state=enabled
c2t500173804EE40150d19s2 state=enabled
c2t500173804EE40140d19s2 state=enabled

8. To add the new DMP device to a disk group, follow these steps:
/usr/lib/vxvm/bin/vxdisksetup -i xiv0_18
vxdg -g <dg_name> adddisk <disk_name>=<device name>
/usr/sbin/vxassist -g <dg_name> maxgrow <vol name> alloc=<newly-add-luns>
/etc/vx/bin/vxresize -g <dg_name> -bx <vol name> <new size>


Categories: Hardware, SAN, Storage Tags:

lvm snapshot backup

September 20th, 2012 No comments

Here are the updated steps for an LVM snapshot backup; I'll also put them into the work info for the change.


# determine available space.

sudo vgs | awk '/VolGroup/ {print $NF}'


# create snapshots. 1GB should be enough, but if you're really short on space you can try even smaller, since a snapshot only has to hold the delta between now and when you destroy it.

sudo lvcreate -L 1G -s -n rootLVsnap VolGroup00/rootLV

sudo lvcreate -L 1G -s -n varLVsnap VolGroup00/varLV


# create the FS for backups. allocate as much space as you have. last-resort: use NFS.

sudo lvcreate -n OSbackup -L 5G VolGroup00

sudo mkfs -t ext3 /dev/VolGroup00/OSbackup

sudo mkdir /OSbackup

sudo mount /dev/VolGroup00/OSbackup /OSbackup


# create a backup of root

# gzip is important. if you are really tight on space, try gzip -9 or even bzip2

sudo dd if=/dev/VolGroup00/rootLVsnap bs=1M | sudo sh -c 'gzip -c > /OSbackup/root.dd.gz'


# now, remove root snapshot and extend backup fs

sudo lvremove VolGroup00/rootLVsnap

sudo lvextend -L +1G VolGroup00/OSbackup

sudo resize2fs /dev/VolGroup00/OSbackup


# backup var

sudo dd if=/dev/VolGroup00/varLVsnap bs=1M | sudo sh -c 'gzip -c > /OSbackup/var.dd.gz'

sudo lvremove VolGroup00/varLVsnap


# backup boot

cd /boot; sudo tar -pczf /OSbackup/boot.tar.gz .


# unmount the fs and destroy mountpoint

sudo umount /OSbackup

sudo rmdir /OSbackup
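The dd | gzip backup-and-restore pattern used above can be sanity-checked on a scratch file, with no LVM or root access needed (paths below are examples, not the real volumes):

```shell
# Back up a "volume" (a plain file here) with dd | gzip, restore it, verify.
dd if=/dev/zero of=/tmp/lv.img bs=1M count=4 2>/dev/null
dd if=/tmp/lv.img bs=1M 2>/dev/null | gzip -c > /tmp/lv.dd.gz
gzip -dc /tmp/lv.dd.gz > /tmp/lv.restored
cmp -s /tmp/lv.img /tmp/lv.restored && echo "restore matches original"
# prints: restore matches original
```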

Categories: Storage Tags:

thin provisioning aka virtual provisioning on EMC Symmetrix

July 28th, 2012 No comments

For basic information about thin provisioning, here’s some excerpts from wikipedia/HDS site:

Thin provisioning is the act of using virtualization technology to give the appearance of more physical resource than is actually available. It relies on on-demand allocation of blocks of data versus the traditional method of allocating all the blocks up front. This methodology eliminates almost all whitespace which helps avoid the poor utilization rates, often as low as 10%, that occur in the traditional storage allocation method where large pools of storage capacity are allocated to individual servers but remain unused (not written to). This traditional model is often called “fat” or “thick” provisioning.

Thin provisioning simplifies application storage provisioning by allowing administrators to draw from a central virtual pool without immediately adding physical disks. When an application requires more storage capacity, the storage system automatically allocates the necessary physical storage. This just-in-time method of provisioning decouples the provisioning of storage to an application from the physical addition of capacity to the storage system.

The term thin provisioning is applied to disk later in this article, but could refer to an allocation scheme for any resource. For example, real memory in a computer is typically thin provisioned to running tasks with some form of address translation technology doing the virtualization. Each task believes that it has real memory allocated. The sum of the allocated virtual memory assigned to tasks is typically greater than the total of real memory.
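Thin allocation can be seen in miniature with a sparse file: the file advertises a large logical size while consuming almost no physical blocks until data is written, just like a thin LUN before the application writes to it (the path below is an example):

```shell
# A sparse file: 1 GiB apparent (provisioned) size, ~0 KB actually allocated.
truncate -s 1G /tmp/thin.img
du -k --apparent-size /tmp/thin.img   # logical size, as seen by consumers
du -k /tmp/thin.img                   # physical blocks actually in use
```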

The following article shows the steps to create a thin pool, add and remove components from the pool, and delete a thin pool:

And for more information about thin provisioning on EMC Symmetrix V-Max with Veritas Storage Foundation, the following PDF may help:

EMC Symmetrix V-Max with Veritas Storage Foundation.pdf


1. symcfg -sid 1234 list -datadev # list all TDAT devices (thin data devices, which make up the thin pool; the thin pool provides the actual physical storage behind thin devices)
2. symcfg -sid 1234 list -tdev # list all TDEV (thin) devices

3. The following article may be useful if you run into problems during storage reclamation (VxVM vxdg ERROR V-5-1-16063 Disk d1 is used by one or more subdisks which are pending to be reclaimed):



Categories: Hardware, SAN, Storage Tags: ,

Resolved – VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

July 25th, 2012 2 comments

When I tried to import a Veritas disk group today using vxdg -C import doxerdg, the following error messages appeared:

VxVM vxdg ERROR V-5-1-684 IPC failure: Configuration daemon is not accessible
return code of vxdg import command is 768

VxVM vxconfigd DEBUG V-5-1-0 IMPORT: Trying to import the disk group using configuration database copy from emc5_0490
VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

Then I used pstack to print the stack trace from the core file:

root # pstack /var/core/core_doxerorg_vxconfigd_0_0_1343173375_140
core ‘core_doxerorg_vxconfigd_0_0_1343173375_14056′ of 14056: vxconfigd
ff134658 strcmp (fefc04e8, 103fba8, 0, 0, 31313537, 31313737) + 238
001208bc da_find_diskid (103fba8, 0, 0, 0, 0, 0) + 13c
002427dc dm_get_da (58f068, 103f5f8, 0, 0, 68796573, 0) + 14c
0023f304 ssb_check_disks (58f068, 0, f37328, fffffffc, 4, 0) + 3f4
0018f8d8 dg_import_start (58f068, 9c2088, ffbfed3c, 4, 0, 0) + 25d8
00184ec0 dg_reimport (0, ffbfedf4, 0, 0, 0, 0) + 288
00189648 dg_recover_all (50000, 160d, 3ec1bc, 1, 8e67c8, 447ab4) + 2a8
001f2f5c mode_set (2, ffbff870, 0, 0, 0, 0) + b4c
001e0a80 setup_mode (2, 3e90d4, 4d5c3c, 0, 6c650000, 6c650000) + 18
001e09a0 startup (4d0da8, 0, 0, fffffffc, 0, 4d5bcc) + 3e0
001e0178 main (1, ffbffa7c, ffbffa84, 44f000, 0, 0) + 1a98
000936c8 _start (0, 0, 0, 0, 0, 0) + b8

Then I tried restarting vxconfigd, but that failed as well:

root@doxer#/sbin/vxconfigd -k -x syslog

VxVM vxconfigd ERROR V-5-1-0 Segmentation violation – core dumped

After reading the vxconfigd man page, I decided to use '-r reset' to reset all Veritas Volume Manager configuration information stored in the kernel as part of startup processing. But before doing this, we need to unmount all VxVM volumes, as stated in the man page:

The reset fails if any volume devices are in use, or if an imported shared disk group exists.

After unmounting all VxVM partitions, I ran the following command:

vxconfigd -k -r reset

After this, importing the DGs succeeded.

Categories: SAN, Storage Tags: ,

resolved – df Input/output error from veritas vxfs

July 10th, 2012 No comments

If df reports errors like the following on filesystems backed by Veritas VxFS:

df: `/BCV/testdg’: Input/output error
df: `/BCV/testdg/ora’: Input/output error
df: `/BCV/testdg/ora/archivelog01′: Input/output error
df: `/BCV/testdg/ora/gg’: Input/output error

And vxdg list shows the disk groups in disabled status:

testarc_PRD disabled 1275297639.26.doxer
testdb_PRD disabled 1275297624.24.doxer

Don’t panic. To resolve this, do the following:

1) Force-unmount the failed filesystems
2) Deport and re-import the failed disk groups
3) Fix plexes that are in the DISABLED FAILED state
4) Run fsck.vxfs on the failed filesystems
5) Remount the needed filesystems
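The five steps above can be sketched as shell commands. A hedged sketch only: the disk group, volume, and mount point names are made-up examples, and depending on the failure, plex repair may need vxmend rather than a plain vxrecover.

```shell
# Sketch of the 5-step recovery for one disk group / filesystem.
# dg, volume, and mount point are example names -- substitute your own.
recover_vxfs() {
    dg="$1"; vol="$2"; mp="$3"
    umount -f "$mp"                              # 1) force-unmount the failed FS
    vxdg deport "$dg"                            # 2) deport ...
    vxdg import "$dg"                            #    ... and re-import the DG
    vxrecover -g "$dg" -s                        # 3) start volumes / recover plexes
    fsck -F vxfs "/dev/vx/rdsk/$dg/$vol"         # 4) check the filesystem
    mount -F vxfs "/dev/vx/dsk/$dg/$vol" "$mp"   # 5) remount
}
```

For example: `recover_vxfs testdg vol01 /BCV/testdg`.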

Categories: SAN, Storage Tags:


July 5th, 2012 No comments

SIM - Systems Insight Manager (and the SIM agent); uses port 50000 and HTTPS.

PSP - ProLiant Support Pack; config files at /etc/hp-snmp-agents.conf and /etc/snmp/snmpd.conf. It talks to the SIM server.

Categories: Hardware, Servers Tags:

error system dram available – some DIMMs not probed by solaris

June 12th, 2012 No comments

We encountered the following error message after disabling some components on a Sun T2000 server:

SEP 29 19:34:35 ERROR: System DRAM  Available: 004096 MB  Physical: 008192 MB

This means that only 4 GB of memory is available although 8 GB is physically installed on the server. We can confirm the memory size with prtdiag:

# prtdiag
System Configuration: Sun Microsystems sun4v Sun Fire T200
Memory size: 3968 Megabytes


1. Why the server panicked:

We don’t have clear data on that. But Solaris 10 on T2000 systems may panic when low on memory, with the following panic string:

hypervisor call 0x21 returned an unexpected error 2

In our case, however, that did not happen either.

2. The error we can see from the ALOM:

sc> showcomponent
Disabled Devices
MB/CMP0/CH0/R1/D1 : [Forced DIAG fail (POST)]

DIMMs with CEs are being unnecessarily flagged by POST as faulty. When POST encounters a single CE, the associated DIMM is declared faulty and half of the system’s memory is deconfigured and unavailable to Solaris. Since PSH (Predictive Self-Healing) is the primary means for detecting errors and diagnosing faults on the Niagara platforms, this policy is too aggressive (reference bug 6334560).

3. What action we can take now:

a) Clear the SC log.
b) Enable the component in the SC.
c) Monitor the server.

If the same fault is reported again, we will replace the DIMM.


For more information about DIMM, you can refer to

Categories: Hardware, Servers Tags: ,

H/W under test during POST on SUN T2000 Series

June 12th, 2012 No comments

We got the following error messages during POST on a SUN T2000 Series server:

0:0:0>ERROR: TEST = Queue Block Mem Test
0:0:0>H/W under test = MB/CMP0/CH0/R1/D1/S0 (J0901)
0:0:0>Repair Instructions: Replace items in order listed by ‘H/W under
test’ above.
0:0:0>MSG = Pin 236 failed on MB/CMP0/CH0/R1/D1/S0 (J0901)
ERROR: The following devices are disabled:
Aborting auto-boot sequence.

To resolve this issue, disable the failing component in ALOM/ILOM, power the machine off and on, and then try rebooting. Here are the steps:

If you use ALOM :
disablecomponent component

If you use ILOM :
-> set /SYS/component component_state=disabled
-> stop /SYS
-> start /SYS
Example :
-> set /SYS/MB/CMP0/CH0/R1/D1 component_state=disabled

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS
-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

After you have disabled the components, you should clear the SC error log and the FMA logs:

Clearing faults from SC:

a) Show the faults on the system controller
sc> showfaults -v

b) For each fault listed run
sc> clearfault <uuid>

c) re-enable the disabled components run
sc> clearasrdb

d) Clear ereports
sc> setsc sc_servicemode true
sc> clearereports -y

To clear the FMA faults and error logs from Solaris:
a) Show faults in FMA
# fmadm faulty

b) For each fault listed in the ‘fmadm faulty’ run
# fmadm repair <uuid>

c) Clear ereports and resource cache
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*

d) Reset the fmd serd modules
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire
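Step b) above can also be scripted rather than copying UUIDs by hand. A hedged sketch: it assumes GNU grep (for -oE) and that UUIDs in the `fmadm faulty` output use the standard 8-4-4-4-12 hex form.

```shell
# Repair every fault fmadm reports, instead of pasting UUIDs one by one.
# Assumes GNU grep (-oE) and standard UUID formatting in `fmadm faulty`.
fma_repair_all() {
    fmadm faulty |
    grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' |
    sort -u |
    while read -r uuid; do
        fmadm repair "$uuid"
    done
}
```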

Categories: Hardware, Servers Tags:


May 30th, 2012 No comments

Here are some differences between SCSI, iSCSI, FCP, FCoE, FCIP, NFS, CIFS, DAS, NAS, and SAN (excerpted from the Internet):

Most storage networks use the SCSI protocol for communication between servers and disk drive devices. A mapping layer to other protocols is used to form a network: Fibre Channel Protocol (FCP), the most prominent one, is a mapping of SCSI over Fibre Channel; Fibre Channel over Ethernet (FCoE); iSCSI, mapping of SCSI over TCP/IP.


A storage area network (SAN) is a dedicated network that provides access to consolidated, block level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system.

Historically, data centers first created “islands” of SCSI disk arrays as direct-attached storage (DAS), each dedicated to an application, and visible as a number of “virtual hard drives” (i.e. LUNs). Operating systems maintain their own file systems on their own dedicated, non-shared LUNs, as though they were local to themselves. If multiple systems were simply to attempt to share a LUN, these would interfere with each other and quickly corrupt the data. Any planned sharing of data on different computers within a LUN requires advanced solutions, such as SAN file systems or clustered computing.

Despite such issues, SANs help to increase storage capacity utilization, since multiple servers consolidate their private storage space onto the disk arrays. Sharing storage usually simplifies storage administration and adds flexibility since cables and storage devices do not have to be physically moved to shift storage from one server to another. SANs also tend to enable more effective disaster recovery processes. A SAN could span a distant location containing a secondary storage array. This enables storage replication implemented either by disk array controllers, by server software, or by specialized SAN devices.
Since IP WANs are often the least costly method of long-distance transport, the Fibre Channel over IP (FCIP) and iSCSI protocols have been developed to allow SAN extension over IP networks. The traditional physical SCSI layer could only support a few meters of distance – not nearly enough to ensure business continuance in a disaster.

More about FCIP: it still uses the FC protocol, tunneling Fibre Channel frames over IP.

A competing technology to FCIP is known as iFCP. It uses routing instead of tunneling to enable connectivity of Fibre Channel networks over IP.

IP SAN uses TCP as a transport mechanism for storage over Ethernet, and iSCSI encapsulates SCSI commands into TCP packets, thus enabling the transport of I/O block data over IP networks.

Network-attached storage (NAS), in contrast to SAN, uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block. The key difference between direct-attached storage (DAS) and NAS is that DAS is simply an extension to an existing server and is not necessarily networked. NAS is designed as an easy and self-contained solution for sharing files over the network.
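The file-level vs. block-level difference shows up directly in how a client attaches each kind of storage. An illustrative sketch only: the server name, IQN, device name, and paths are made-up, and the iscsiadm syntax shown is the open-iscsi one.

```shell
# Illustration only -- hostnames, IQN, device, and paths are invented.
attach_nas_and_san() {
    # NAS (file level): mount a remote *filesystem*; the filer owns the FS.
    mount -t nfs filer:/export/data /mnt/nas
    # SAN via iSCSI (block level): log in to get a raw *disk*,
    # then create and mount our own filesystem on it.
    iscsiadm -m node -T iqn.2012-07.org.example:target1 -p 192.168.1.50 --login
    mkfs -t ext3 /dev/sdb
    mount /dev/sdb /mnt/san
}
```

The point of the sketch: with NFS the server keeps the filesystem and hands out files; with iSCSI the client gets raw blocks and must put its own filesystem on them.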


FCoE works with standard Ethernet cards, cables and switches to handle Fibre Channel traffic at the data link layer, using Ethernet frames to encapsulate, route, and transport FC frames across an Ethernet network from one switch with Fibre Channel ports and attached devices to another, similarly equipped switch.


When an end user or application sends a request, the operating system generates the appropriate SCSI commands and data request, which then go through encapsulation and, if necessary, encryption procedures. A packet header is added before the resulting IP packets are transmitted over an Ethernet connection. When a packet is received, it is decrypted (if it was encrypted before transmission), and disassembled, separating the SCSI commands and request. The SCSI commands are sent on to the SCSI controller, and from there to the SCSI storage device. Because iSCSI is bi-directional, the protocol can also be used to return data in response to the original request.


Fibre channel is more flexible; devices can be as far as ten kilometers (about six miles) apart if optical fiber is used as the physical medium. Optical fiber is not required for shorter distances, however, because Fibre Channel also works using coaxial cable and ordinary telephone twisted pair.


Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems in 1984,[1] allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. CIFS, by contrast, is its Windows-based counterpart used for file sharing.

Categories: NAS, SAN, Storage Tags: