create shared iscsi LUNs from local disk on Linux

We can use iscsitarget to share local disks as iscsi LUNs for clients. Below are brief steps.

First, install some packages:

yum install kernel-devel iscsi-initiator-utils -y #'kernel-uek-devel' if you are using oracle linux with UEK
cd iscsitarget-1.4.20.2 #download and extract the iscsitarget-1.4.20.2 source tarball first
make
make install

And below are some useful tips about iscsitarget:

  • The iSCSI target consists of a kernel module (/lib/modules/`uname -r`/extra/iscsi/iscsi_trgt.ko), which is installed into the running kernel's module directory.
  • The daemon (/usr/sbin/ietd) and the control tool (/usr/sbin/ietadm).
  • The service status can be checked with "/etc/init.d/iscsi-target status".
  • The configuration files are /etc/iet/{ietd.conf, initiators.allow, targets.allow}.
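If you need to restrict which clients can see or log in to a target, the two allow files above can be used. Below is a minimal sketch assuming the target IQN from the ietd.conf example further down and a 192.168.10.0/24 client subnet; check the sample files shipped with your iscsitarget version for the exact syntax:

vi /etc/iet/initiators.allow

    # allow only initiators from this subnet to log in to the target
    iqn.2001-04.org.doxer:server1-shared-local-disks 192.168.10.0/24

vi /etc/iet/targets.allow

    # only show this target to initiators from this subnet during discovery
    iqn.2001-04.org.doxer:server1-shared-local-disks 192.168.10.0/24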

Next, we can modify the IET (iSCSI Enterprise Target) config file:

vi /etc/iet/ietd.conf

    Target iqn.2001-04.org.doxer:server1-shared-local-disks
        Lun 0 Path=/dev/sdb1,Type=fileio,ScsiId=0,ScsiSN=doxerorg
        Lun 1 Path=/dev/sdb2,Type=fileio,ScsiId=1,ScsiSN=doxerorg
        Lun 2 Path=/dev/sdb3,Type=fileio,ScsiId=2,ScsiSN=doxerorg
        Lun 3 Path=/dev/sdb4,Type=fileio,ScsiId=3,ScsiSN=doxerorg
        Lun 4 Path=/dev/sdb5,Type=fileio,ScsiId=4,ScsiSN=doxerorg
        Lun 5 Path=/dev/sdb6,Type=fileio,ScsiId=5,ScsiSN=doxerorg
        Lun 6 Path=/dev/sdb7,Type=fileio,ScsiId=6,ScsiSN=doxerorg
        Lun 7 Path=/dev/sdb8,Type=fileio,ScsiId=7,ScsiSN=doxerorg

chkconfig iscsi-target on
/etc/init.d/iscsi-target start
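After starting the service, you can verify on the target server itself that the LUNs are exported and see which initiators are connected through IET's proc interface; a quick sketch:

cat /proc/net/iet/volume #lists each target with its LUNs and backing paths
cat /proc/net/iet/session #lists the initiators currently logged in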

Assume the server sharing its local disks as iSCSI LUNs has IP 192.168.10.212. We can then do the following on client hosts to discover and log in to the iSCSI LUNs:

[root@client01 ~]# iscsiadm -m discovery -t st -p 192.168.10.212
Starting iscsid:                                           [  OK  ]
192.168.10.212:3260,1 iqn.2001-04.org.doxer:server1-shared-local-disks

[root@client01 ~]# iscsiadm -m node -T iqn.2001-04.org.doxer:server1-shared-local-disks -p 192.168.10.212 -l
Logging in to [iface: default, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260] (multiple)
Login to [iface: default, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260] successful.

[root@client01 ~]# iscsiadm -m session --rescan #or iscsiadm -m session -r SID --rescan
Rescanning session [sid: 1, target: iqn.2001-04.org.doxer:server1-shared-local-disks, portal: 192.168.10.212,3260]

[root@client01 ~]# iscsiadm -m session -P 3
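To see which local block devices the LUNs were attached as on the client, the verbose session output above already contains that information; a quick sketch:

iscsiadm -m session -P 3 | grep -i "attached scsi disk" #shows the sdX device attached for each LUN
lsblk #or "fdisk -l" on older releases, to double-check the new devices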

You can also scan for the LUNs on the server that shares the local disks itself (using the server as its own initiator), but then you should make sure the iscsi-target service starts up between the network and iscsi services at boot:

mv /etc/rc3.d/S39iscsi-target /etc/rc3.d/S12iscsi-target

PS:

  1. The iSCSI target port defaults to 3260. You can check iSCSI connection info under /var/lib/iscsi/send_targets/ - <iscsi portal ip, port>, and /var/lib/iscsi/nodes/ - <target iqn>/<iscsi portal ip, port>.
  2. If there are multiple targets to log in to, we can use "iscsiadm -m node --loginall=all". Use "iscsiadm -m node -T iqn.2001-04.org.doxer:server1-shared-local-disks -p 192.168.10.212 -u" to log out. To have the client log back in automatically after a reboot, see the sketch after this list.
  3. More info is here (including Windows iSCSI operation), and here is about creating iSCSI LUNs on an Oracle ZFS appliance.
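Below is a small sketch for making the client log back in to the target automatically after a reboot, assuming the same target and portal as above (node.startup is often already "automatic" by default, so treat this as a double-check):

iscsiadm -m node -T iqn.2001-04.org.doxer:server1-shared-local-disks -p 192.168.10.212 --op update -n node.startup -v automatic
chkconfig iscsi on
chkconfig iscsid on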


resolved - mountd Cannot export /scratch, possibly unsupported filesystem or fsid= required

Today when I tried to access a filesystem exported by another host through autofs (/net), an error was reported:

[root@server01 ~]# cd /net/client01/scratch/
-bash: cd: scratch/: No such file or directory

From the server side (client01, which exports /scratch), we can see it's exported and writable:

[root@client01 ~]# cat /etc/exports
/scratch *(rw,no_root_squash)

[root@client01 ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
nas01:/export/generic/share_scratch
                      200G  103G   98G  52% /scratch

So I tried mounting it manually on the client side, but an error was still reported:

[root@server01 ~]# mount client01:/scratch /media
mount: client01:/scratch failed, reason given by server: Permission denied

Here's the log on the server side:

[root@client01 scratch]# tail -f /var/log/messages
Nov  2 03:41:58 client01 mountd[2195]: Caught signal 15, un-registering and exiting.
Nov  2 03:41:58 client01 kernel: nfsd: last server has exited, flushing export cache
Nov  2 03:41:59 client01 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Nov  2 03:41:59 client01 kernel: NFSD: starting 90-second grace period
Nov  2 03:42:11 client01 mountd[16046]: authenticated mount request from 192.162.100.137:1002 for /scratch (/scratch)
Nov  2 03:42:11 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required
Nov  2 03:43:05 client01 mountd[16046]: authenticated mount request from 192.162.100.137:764 for /scratch (/scratch)
Nov  2 03:43:05 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required
Nov  2 03:44:11 client01 mountd[16046]: authenticated mount request from 192.165.28.40:670 for /scratch (/scratch)
Nov  2 03:44:11 client01 mountd[16046]: Cannot export /scratch, possibly unsupported filesystem or fsid= required

After some debugging, the cause turned out to be that the exported filesystem was itself an NFS mount on the server side, and an NFS-mounted filesystem cannot simply be re-exported:

[root@client01 ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
nas01:/export/generic/share_scratch
                      200G  103G   98G  52% /scratch

To work around this, just mount the original NFS share directly on the client instead of going through the re-exporting server:

[root@server01 ~]# mount nas01:/export/generic/share_scratch /media
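If the mount needs to persist across reboots, a normal /etc/fstab entry on the client pointing at the original share does the same job; a sketch (the mount options here are just reasonable defaults, not taken from the original setup):

nas01:/export/generic/share_scratch  /media  nfs  defaults,hard,intr  0 0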

resolved - nfsv4 Warning: rpc.idmapd appears not to be running. All uids will be mapped to the nobody uid

Today when we tried to mount an NFS share as NFSv4 (mount -t nfs4 testnas:/export/testshare01 /media), the following message was printed:

Warning: rpc.idmapd appears not to be running.
All uids will be mapped to the nobody uid.

And when I checked the file permissions under the mount point, the files were indeed owned by nobody as the warning indicated:

[root@testvm~]# ls -l /u01/local
total 8
drwxr-xr-x 2 nobody nobody 2 Dec 18 2014 SMKIT
drwxr-xr-x 4 nobody nobody 4 Dec 19 2014 ServiceManager
drwxr-xr-x 4 nobody nobody 4 Mar 31 08:47 ServiceManager.15.1.5
drwxr-xr-x 4 nobody nobody 4 May 13 06:55 ServiceManager.15.1.6

However, when I checked, rpcidmapd was in fact running:

[root@testvm ~]# /etc/init.d/rpcidmapd status
rpc.idmapd (pid 11263) is running...

After some checking, I found it was caused by an outdated NFS stack and some missing NFSv4 packages on the OEL5 boxes. You can do the following to fix this:

yum -y update nfs-utils nfs-utils-lib nfs-utils-lib-devel sblim-cmpi-nfsv4 nfs4-acl-tools
/etc/init.d/nfs restart
/etc/init.d/rpcidmapd restart

If you are using an Oracle Sun ZFS appliance, then please make sure on the ZFS side to set the anonymous user mapping to "root" and the custom NFSv4 identity domain to the one used in your environment (e.g. example.com), to avoid the nobody-owner issue on NFS clients.
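On the Linux client side, the NFSv4 identity domain lives in /etc/idmapd.conf and must match the domain configured on the server/appliance; a minimal sketch (example.com is just a placeholder for your own domain):

vi /etc/idmapd.conf

    [General]
    Domain = example.com

/etc/init.d/rpcidmapd restart #pick up the new domain setting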

remove usb disk from LVM

On some servers, a USB stick may end up being part of an LVM volume group. USB devices are more prone to failure, which causes big problems when they do fail, and they also run at a different speed than the internal drives, which hurts performance.

For example, on one server, you can see below:

[root@test ~]# vgs
  VG      #PV #LV #SN Attr   VSize VFree
  DomUVol   2   1   0 wz--n- 3.77T    0

[root@test~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T    0 #all PE allocated
  /dev/sdc1  DomUVol lvm2 a--  3.59G    0 #this is usb device

[root@test~]# lvs
  LV      VG      Attr   LSize Origin Snap%  Move Log Copy%  Convert
  scratch DomUVol -wi-ao 3.77T

[root@test ~]# df -h /scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/DomUVol-scratch
                      3.7T  257G  3.3T   8% /scratch

[root@test~]# pvdisplay /dev/sdc1
  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               DomUVol
  PV Size               3.61 GB / not usable 14.61 MB
  Allocatable           yes (but full)
  PE Size (KByte)       32768
  Total PE              115
  Free PE               0
  Allocated PE          115 #so physical extents are allocated on this usb device
  PV UUID               a8a0P5-AlCz-Cu5e-acC2-ldEQ-NPCn-kc4Du0

As you can see from the above, the USB device has physical extents allocated, so before taking it out of the VG (vgreduce) we first need to move those extents to another PV. But the other PV has all of its space allocated (as can also be confirmed from vgs above):

[root@test~]# pvdisplay /dev/sdb1
  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               DomUVol
  PV Size               3.76 TB / not usable 30.22 MB
  Allocatable           yes (but full)
  PE Size (KByte)       32768
  Total PE              123360
  Free PE               0
  Allocated PE          123360
  PV UUID               5IyCgh-JsiV-EnpO-XKj4-yxNq-pRjI-d7LKGy

Here are the steps for taking the USB device out of the VG. Since the other PV has no free extents, we first shrink the LV and its filesystem by 5G (a bit more than the USB PV's size) so that the extents on the USB device can be freed or moved off it:

umount /scratch
e2fsck -f /dev/mapper/DomUVol-scratch
lvreduce --resizefs --size -5G /dev/mapper/DomUVol-scratch #--resizefs (-r) shrinks the ext filesystem via fsadm/resize2fs before reducing the LV; never run lvreduce first and resize2fs afterwards, or the filesystem may get truncated

[root@test ~]# vgs
  VG      #PV #LV #SN Attr   VSize VFree
  DomUVol   2   1   0 wz--n- 3.77T 5.00G

[root@test ~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T 1.41G
  /dev/sdc1  DomUVol lvm2 a--  3.59G 3.59G #PEs on the usb device are all freed; if not, run "pvmove /dev/sdc1" first (see the pvmove sketch at the end of this section)

[root@test ~]# pvdisplay /dev/sdc1
  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               DomUVol
  PV Size               3.61 GB / not usable 14.61 MB
  Allocatable           yes
  PE Size (KByte)       32768
  Total PE              115
  Free PE               115
  Allocated PE          0
  PV UUID               a8a0P5-AlCz-Cu5e-acC2-ldEQ-NPCn-kc4Du0

[root@test ~]# vgreduce DomUVol /dev/sdc1
  Removed "/dev/sdc1" from volume group "DomUVol"

[root@test ~]# pvs
  PV         VG      Fmt  Attr PSize PFree
  /dev/sdb1  DomUVol lvm2 a--  3.76T 1.41G
  /dev/sdc1          lvm2 a--  3.61G 3.61G #the VG column is empty for the usb device, which confirms it has been taken out of the VG; you can now run pvremove /dev/sdc1 to remove the PV label

PS:

  1. If you want to shrink an LVM volume (lvreduce) on /, then you'll need to go into Linux rescue mode. Select "Skip" when the system prompts for the option of mounting / to /mnt/sysimage. Run "lvm vgchange -a y" first; the other steps are more or less the same as above, but you'll need to type "lvm" before every LVM command, such as "lvm lvs", "lvm pvs", "lvm lvreduce" etc.
  2. You can use vgcfgrestore to restore the VG config from /etc/lvm/archive/ if something goes wrong during the resize.
  3. If there is enough free space on another PV, use pvmove instead of shrinking the LV (see the sketch after this list); refer to http://www.tldp.org/HOWTO/LVM-HOWTO/removeadisk.html
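Below is a minimal sketch of the pvmove route mentioned in item 3, assuming /dev/sdc1 is the USB PV and the remaining PVs in the VG have enough free extents to take over its data:

pvmove /dev/sdc1 #migrate all allocated extents off the USB PV to other PVs in the same VG
vgreduce DomUVol /dev/sdc1 #take the now-empty PV out of the volume group
pvremove /dev/sdc1 #wipe the LVM label from the USB device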

raid10 and raid01

RAID 0 over RAID 1(raid 1+0, raid 10, stripe of mirrors, better)

(RAID 1) A = Drive A1 + Drive A2 (Mirrored)
(RAID 1) B = Drive B1 + Drive B2 (Mirrored)
RAID 0 = (RAID 1) A + (RAID 1) B (Striped)

stripe-of-mirrors-raid10


RAID 1 over RAID 0(raid 0+1, raid 01, mirror of stripes)

(RAID 0) A = Drive A1 + Drive A2 (Striped)
(RAID 0) B = Drive B1 + Drive B2 (Striped)
RAID 1 = (RAID 0) A + (RAID 0) B (Mirrored)
mirror-of-stripes
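For reference, a stripe of mirrors can be created in one step with Linux software RAID; a sketch assuming four spare disks /dev/sdb through /dev/sde (mdadm's default "near" layout on four devices corresponds to the RAID 10 layout above):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm --detail /dev/md0 #verify the layout and that all four devices are active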

PS:

For write performance: raid0 > raid10 > raid5

For read performance: raid1, since reads can be served from either mirror copy

For data protection: raid1, as each disk holds a full copy of the data

Raid2 - bit/byte-level striping with dedicated ECC (Hamming code) disks; rarely used in practice.

Raid3 - byte-level striping with a single dedicated parity disk. Good for sequential workloads, but for random access the parity disk becomes a bottleneck.

Raid4 - block/record-level striping with a single dedicated parity disk, so the parity disk can still become a bottleneck for random writes.

Raid5 - block-level striping with parity distributed across all disks. Good for small/random access, but has a write penalty (one write generates two reads for the old data/parity and two writes for the new data/parity).

Raid6 - adds a second parity block, so it can tolerate two disks failing, at the cost of an even bigger write penalty.

resolved - nfs share chown: changing ownership of 'blahblah': Invalid argument

Today I encountered the following error when trying to change ownership of some files:

[root@test webdav]# chown -R apache:apache ./bigfiles/
chown: changing ownership of `./bigfiles/opcmessaging': Invalid argument
chown: changing ownership of `./bigfiles/': Invalid argument

This host runs CentOS 6.2, and in this version of the OS, NFSv4 is used by default:

[root@test webdav]# cat /proc/mounts |grep u01
nas-server.example.com:/export/share01/ /u01 nfs4 rw,relatime,vers=4,rsize=32768,wsize=32768

However, the NFS server does not support NFSv4 well, so I forced the mount to use NFSv3 instead:

nas-server.example.com:/export/share01/ /u01 nfs rsize=32768,wsize=32768,hard,nolock,timeo=14,noacl,intr,mountvers=3,nfsvers=3

After umount/mount, the issue was resolved!
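For a one-off test before touching fstab, the same can be done with a manual mount; a sketch reusing the options from the line above:

umount /u01
mount -t nfs -o nfsvers=3,rsize=32768,wsize=32768,hard,nolock,timeo=14,noacl,intr nas-server.example.com:/export/share01/ /u01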

PS:

If the NAS server is a Sun ZFS appliance, then the following settings should be checked, or the issue may occur even on CentOS/RedHat Linux 5.x:

protocol_anonymous_user_mapping

root_directory_access

Sun ZFS storage stuck due to incorrect LACP configuration

Today we hit an issue with a Sun ZFS Storage 7320 appliance. NFS shares provisioned from the ZFS appliance were not responding to requests; even a "df -h" would hang for a very long time. When we checked from the ZFS storage side, we found the following statistics:

1-high-io-before-stuck


And while we were still checking for the source of the traffic, the ZFS appliance went back to normal by itself:

2-recovered-by-itself


As we had configured LACP on this ZFS appliance just the day before, we suspected the issue was caused by an incorrect network configuration. Here's the network config:

1-wrong-configuration

For "Policy", we should match with switch setup to even balance incoming/outgoing data flow.  Otherwise, we might experience uneven load balance. Our switch was set to L3, so L3 should be ok. We'll get better load spreading if the policy is L3+4 if the switch supports it.  With L3, all connections from any one IP will only use a single member of the aggregation.  With L3+4, it will load spread by UDP or TCP port too. More is here.

For "Mode", it should be set according to switch. If the switch is "passive" mode then server/storage needs to be on "active" mode, and vice versa.

For "Timer", it's regarding how often to check LACP status.

After checking the switch settings, we found that the switch was in "Active" mode, and since the ZFS appliance was also in "Active" mode, we took that as the culprit. So we changed the appliance setting to the following:

2-right-configuration

After this change, we kept the system under observation for a while and the ZFS appliance has been operating normally.

PS:

You should also have a check of disk operations: if there are timeout errors on any disks, try replacing them. Sometimes a single disk can hang the whole SCSI bus. Ideally the system should fail such a disk automatically, but that did not happen in our case, so you may have to fail the disk manually to resolve the issue.

The ZFS Storage Appliance core analysis (previous note) confirms that the disk was the cause of the issue.

It was hanging up communication on the SCSI bus but once it was removed the issue was resolved.

It is uncommon for a single disk to hang up the bus; however, since the disks share the SCSI path (each drive does not have its own dedicated cabling and controller), it is sometimes seen.

You can check the ZFS appliance uptime (see the "Last Booted" field) by running "version show" in the console:

zfs-test:configuration> version show
Appliance Name: zfs-test
Appliance Product: Sun ZFS Storage 7320
Appliance Type: Sun ZFS Storage 7320
Appliance Version: 2013.06.05.2.2,1-1.1
First Installed: Sun Jul 22 2012 10:02:24 GMT+0000 (UTC)
Last Updated: Sun Oct 26 2014 22:11:03 GMT+0000 (UTC)
Last Booted: Wed Dec 10 2014 10:03:08 GMT+0000 (UTC)
Appliance Serial Number: d043d335-ae15-4350-ca35-b05ba2749c94
Chassis Serial Number: 1225FMM0GE
Software Part Number: Oracle 000-0000-00
Vendor Product ID: urn:uuid:418bff40-b518-11de-9e65-080020a9ed93
Browser Name: aksh 1.0
Browser Details: aksh
HTTP Server: Apache/2.2.24 (Unix)
SSL Version: OpenSSL 1.0.0k 5 Feb 2013
Appliance Kit: ak/SUNW,maguro_plus@2013.06.05.2.2,1-1.1
Operating System: SunOS 5.11 ak/generic@2013.06.05.2.2,1-1.1 64-bit
BIOS: American Megatrends Inc. 08080102 05/23/2011
Service Processor: 3.0.16.10

resolved - fsinfo ERROR: Stale NFS file handle POST

Today when I tried to mount an NFS share from one NFS server, it timed out with "mount.nfs: Connection timed out".

I tried to find something in /var/log/messages, but no useful info was there. So I ran tcpdump on the NFS client:

[root@dcs-hm1-qa132 ~]# tcpdump -nn -vvv host 10.120.33.90 #server is 10.120.33.90, client is 10.120.33.130
23:49:11.598407 IP (tos 0x0, ttl 64, id 26179, offset 0, flags [DF], proto TCP (6), length 96)
10.120.33.130.1649240682 > 10.120.33.90.2049: 40 null
23:49:11.598741 IP (tos 0x0, ttl 62, id 61186, offset 0, flags [DF], proto TCP (6), length 80)
10.120.33.90.2049 > 10.120.33.130.1649240682: reply ok 24 null
23:49:11.598812 IP (tos 0x0, ttl 64, id 26180, offset 0, flags [DF], proto TCP (6), length 148)
10.120.33.130.1666017898 > 10.120.33.90.2049: 92 fsinfo fh Unknown/0100010000000000000000000000000000000000000000000000000000000000
23:49:11.599176 IP (tos 0x0, ttl 62, id 61187, offset 0, flags [DF], proto TCP (6), length 88)
10.120.33.90.2049 > 10.120.33.130.1666017898: reply ok 32 fsinfo ERROR: Stale NFS file handle POST:
23:49:11.599254 IP (tos 0x0, ttl 64, id 26181, offset 0, flags [DF], proto TCP (6), length 148)
10.120.33.130.1682795114 > 10.120.33.90.2049: 92 fsinfo fh Unknown/010001000000000000002FFF000002580000012C0007B0C00000000A00000000
23:49:11.599627 IP (tos 0x0, ttl 62, id 61188, offset 0, flags [DF], proto TCP (6), length 88)
10.120.33.90.2049 > 10.120.33.130.1682795114: reply ok 32 fsinfo ERROR: Stale NFS file handle POST:

The reason of "ERROR: Stale NFS file handle POST" may caused by the following reasons:

1.The NFS server is no longer available
2.Something in the network is blocking
3.In a cluster during failover of NFS resource the major & minor numbers on the secondary server taking over is different from that of the primary.

To resolve the issue, you can try bouncing the NFS service on the NFS server using /etc/init.d/nfs restart.
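A quick sketch of the server-side refresh plus a client-side sanity check (the hostname and paths below are placeholders):

/etc/init.d/nfs restart #on the NFS server; or "exportfs -ra" to just re-read /etc/exports
showmount -e nfs-server #from the client, confirm the export list is reachable again
mount nfs-server:/export/share /mnt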

stuck in PXE-E51: No DHCP or proxyDHCP offers were received, PXE-M0F: Exiting Intel Boot Agent, Network boot canceled by keystroke

If you have installed your OS and tried to boot it up, but it got stuck with the following messages:

stuck_pxe

Then one possibility is that the configuration of your host's storage array is not right. For instance, it should be JBOD but you had configured it as RAID 6.

Please note that this is only one possibility for this error; you may search for the PXE error codes you encountered for more details.

PS:

  • Sometimes DHCP snooping may prevent PXE from functioning; you can read more at http://en.wikipedia.org/wiki/DHCP_snooping.
  • STP (Spanning-Tree Protocol) makes each port wait up to 50 seconds before data is allowed to be sent on the port. This delay can in turn cause problems with some applications/protocols (PXE, Bootworks, etc.). To alleviate the problem, PortFast was implemented on Cisco devices; the terminology might differ between vendors. You can read more at http://www.symantec.com/business/support/index?page=content&id=HOWTO6019
  • ARP caching: http://www.networkers-online.com/blog/2009/02/arp-caching-and-timeout/
  • If you want to disable the network boot protocol, press Ctrl + S during the boot agent prompt and disable it.

pxe-intel-boot-agent-xe

"Include snapshots" made NFS shares from ZFS appliance shrinking

Today I met a weird issue while checking an NFS share mounted from a ZFS appliance. As space on that filesystem was getting low, I removed some files, but the filesystem mounted on the client kept shrinking: its total size was getting smaller! Shouldn't the free space get larger while the size stays unchanged?

After some debugging, I found that this was caused by the share's "Include snapshots" setting on the ZFS appliance. When I unchecked "Include snapshots", the issue was gone!

zfs-appliance