Sun ZFS storage stuck due to incorrect LACP configuration
Today we met issue with Sun ZFS storage 7320. NFS shares provisioned from the ZFS appliance were not responding to requests, even a "df -h" will stuck there for a long long time. And when we checked from ZFS storage side, we found the following statistics:
And during our checking for the traffic source, the ZFS appliance backed to normal by itself:
As we just configured LACP on this ZFS appliance the day before, so we doubted the issue was caused by incorrect network configuration. Here's the network config:
For "Policy", we should match with switch setup to even balance incoming/outgoing data flow. Otherwise, we might experience uneven load balance. Our switch was set to L3, so L3 should be ok. We'll get better load spreading if the policy is L3+4 if the switch supports it. With L3, all connections from any one IP will only use a single member of the aggregation. With L3+4, it will load spread by UDP or TCP port too. More is here.
For "Mode", it should be set according to switch. If the switch is "passive" mode then server/storage needs to be on "active" mode, and vice versa.
For "Timer", it's regarding how often to check LACP status.
After checking switch setting, we found that the switch is in "Active" mode, and as ZFS appliance was also on "Active" mode, so that's the culprit. So we changed the setting to the following:
You should also have a check of disk operations, if there are timeout errors on the disks, then you should try replace them. Sometimes, a single disk may hang the SCSI bus. Ideally, the system should fail the disk but it didn't happen. You should manually failed the disk to resolve the issue.
The ZFS Storage Appliance core analysis (previous note) confirms that the disk was the cause of the issue.
It was hanging up communication on the SCSI bus but once it was removed the issue was resolved.
It is uncommon for a single disk to hang up the bus, however; since the disks share the SCSI path (each drive does not have its own dedicated cabling and controller) it is sometimes seen.
You can check the ZFS appliance uptime by running "version show" in the console.
zfs-test:configuration> version show
Appliance Name: zfs-test
Appliance Product: Sun ZFS Storage 7320
Appliance Type: Sun ZFS Storage 7320
Appliance Version: 2013.06.05.2.2,1-1.1
First Installed: Sun Jul 22 2012 10:02:24 GMT+0000 (UTC)
Last Updated: Sun Oct 26 2014 22:11:03 GMT+0000 (UTC)
Last Booted: Wed Dec 10 2014 10:03:08 GMT+0000 (UTC)
Appliance Serial Number: d043d335-ae15-4350-ca35-b05ba2749c94
Chassis Serial Number: 1225FMM0GE
Software Part Number: Oracle 000-0000-00
Vendor Product ID: urn:uuid:418bff40-b518-11de-9e65-080020a9ed93
Browser Name: aksh 1.0
Browser Details: aksh
HTTP Server: Apache/2.2.24 (Unix)
SSL Version: OpenSSL 1.0.0k 5 Feb 2013
Appliance Kit: ak/SUNW,email@example.com,1-1.1
Operating System: SunOS 5.11 firstname.lastname@example.org,1-1.1 64-bit
BIOS: American Megatrends Inc. 08080102 05/23/2011
Service Processor: 22.214.171.124