
Resolved Intel e1000e driver bug on 82574L Ethernet controller causing network blipping

April 1st, 2012

Earlier I posted a question about CentOS 6.2 intermittently losing its internet connection. I have finally found the right way to fix it.

First, this is a known bug in the Intel e1000e driver on Linux: a driver problem with the Intel 82574L (an MSI/MSI-X interrupt issue). The network connection drops now and then, and nothing is logged about it, which makes troubleshooting very hard.
You can find more bug reports about this at https://bugzilla.redhat.com/show_bug.cgi?id=632650

Fortunately, we can resolve this by installing the kmod-e1000e package from ELRepo.org. To do so, follow the steps below (ignore lines with strikeouts):

  • Install kmod-e1000e offered by ELRepo

Import the public key:
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org

To install ELRepo for RHEL-5, SL-5 or CentOS-5:
rpm -Uvh http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm

To install ELRepo for RHEL-6, SL-6 or CentOS-6:
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
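
If you script this across several machines, the choice between the two release packages can be made from the major version in /etc/redhat-release. Here is a minimal sketch, assuming that file exists on your system; the function names el_major and elrepo_release_url are made up for illustration, and the URLs are simply the two quoted above (they may have changed since).

```shell
# Hypothetical helpers: pick the ELRepo release RPM for this EL major version.

# Print the major version found in a release string such as
# "CentOS release 6.2 (Final)".
el_major() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Map a major version to the release RPM URLs quoted in this post.
elrepo_release_url() {
    case "$1" in
        5) echo "http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm" ;;
        6) echo "http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm" ;;
        *) echo "no ELRepo URL known in this post for EL$1" >&2; return 1 ;;
    esac
}

# Usage (as root):
#   rpm -Uvh "$(elrepo_release_url "$(el_major "$(cat /etc/redhat-release)")")"
```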

Before installing the new driver, let's see our old one:
[root@doxer sites]# lspci |grep -i ethernet
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

[root@doxer modprobe.d]# lsmod|grep e100
e1000e 219500 0

[root@doxer modprobe.d]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
version: 1.4.4-k
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation, <linux.nics@intel.com>
srcversion: 6BD7BCA22E0864D9C8B756A

Now let's install the new kmod-e1000e offered by elrepo:
[root@doxer yum.repos.d]# yum list|grep -i e1000
kmod-e1000.x86_64 8.0.35-1.el6.elrepo elrepo
kmod-e1000e.x86_64 1.9.5-1.el6.elrepo elrepo

[root@doxer yum.repos.d]# yum -y install kmod-e1000e.x86_64

After installation, reboot your machine, and you'll find the driver updated:
[root@doxer ~]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/weak-updates/e1000e/e1000e.ko
version: 1.9.5-NAPI
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation, <linux.nics@intel.com>
srcversion: 16A9E37B9207620F5453F5E

[root@doxer ~]# lsmod|grep e100
e1000e 229197 0
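
To check the version from a script rather than by eye, the modinfo output can be filtered. A small sketch; driver_version is a name I made up, and it only reads text from stdin, so it reports whatever module modinfo finds on disk. As far as I know, the version of the module actually loaded can also be read from /sys/module/e1000e/version.

```shell
# Hypothetical helper: print the "version:" field from modinfo output on stdin.
# The first field is compared exactly, so the "srcversion:" line is not matched.
driver_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Usage on a live system:
#   modinfo e1000e | driver_version
```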

  • Change kernel parameters
Append the following parameters to the kernel line in grub.conf:

pcie_aspm=off e1000e.IntMode=1,1 e1000e.InterruptThrottleRate=10000,10000 acpi=off
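
After the reboot it is worth confirming that the parameters really reached the kernel. A rough sketch, assuming /proc/cmdline is readable; missing_params is a made-up helper that prints every wanted parameter it cannot find as a whole word.

```shell
# Hypothetical helper: list which of the wanted parameters are absent from a
# kernel command line. $1 is the command line, the rest are parameters.
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;                 # present as a whole word
            *) printf '%s\n' "$p" ;;     # absent
        esac
    done
}

# Usage after reboot:
#   missing_params "$(cat /proc/cmdline)" pcie_aspm=off e1000e.IntMode=1,1 \
#       e1000e.InterruptThrottleRate=10000,10000 acpi=off
```

No output means all of the listed parameters are present.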

  • Change NIC parameters (you should add these lines to /etc/rc.local)

#disable pause-frame autonegotiation and link autonegotiation
/sbin/ethtool -A eth0 autoneg off
/sbin/ethtool -s eth0 autoneg off
#change tx ring buffer
/sbin/ethtool -G eth0 tx 4096 #maybe too large (consider 512)
#to set the interrupt rate: ethtool -C eth0 rx-usecs 100 (1000000/100 = 10000 interrupts per second)
#change rx ring buffer
/sbin/ethtool -G eth0 rx 128
#disable Wake-on-LAN
/sbin/ethtool -s eth0 wol d
#turn off offload
/sbin/ethtool -K eth0 tx off rx off sg off tso off gso off gro off
#enable TX pause
/sbin/ethtool -A eth0 tx on
#disable ASPM
/sbin/setpci -s 02:00.0 CAP_EXP+10.b=40
/sbin/setpci -s 00:19.0 CAP_EXP+10.b=40
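
The two setpci lines write 0x40 to the low byte of the PCIe Link Control register (offset 0x10 in the PCI Express capability); the ASPM setting lives in its two lowest bits, so this value clears them. To read the byte back and see what it means, something like the following could be used. It is only a sketch, and aspm_state is a name I made up.

```shell
# Hypothetical helper: decode the ASPM Control field (bits 1:0) of the low
# byte of the PCIe Link Control register, given as hex (e.g. 40 or 0x42).
aspm_state() {
    byte=$(( 0x${1#0x} ))
    case $(( byte & 3 )) in
        0) echo "disabled" ;;
        1) echo "L0s" ;;
        2) echo "L1" ;;
        3) echo "L0s+L1" ;;
    esac
}

# Usage (as root):
#   aspm_state "$(setpci -s 02:00.0 CAP_EXP+10.b)"
```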

PS:

  1. ASPM, as in pcie_aspm, stands for Active-State Power Management, a PCIe power-saving mechanism; you can get more info here.
  2. ACPI stands for Advanced Configuration and Power Interface; you can refer to here.
  3. APIC stands for Advanced Programmable Interrupt Controller and is related to IRQs (interrupt requests). It is one of several kinds of PIC, found on Intel and other x86 platforms. You can read more about it here.

Now reboot your machine, and you should have steadier networking!

PS2:

The reason there are so many strikeouts in this article is that I struggled a lot with this problem. At first I thought it was caused by a bug in the e1000e kernel driver, and after some searching I installed the kmod-e1000e driver and modified the kernel parameters. Things got better for a short time. Later I found the issue was still there, so I tried compiling the latest e1000e driver from Intel, but that did not work either.

Later, I ran a script that monitored the network around the times the NIC went down. After the NIC had failed several times, I found that Tx traffic was very high each time it failed (TX bytes went up by something like 5Gb in a very short time). Based on this, I realized that there might be a DoS attack on the server. Using ntop & tcpdump, I found that DNS traffic was very large, even though my host was not providing DNS services at all!
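
A monitor of this kind can be very small. The sketch below is not the script I used at the time, just an illustration of the idea: it samples the kernel's tx_bytes counter under /sys/class/net and logs the average rate. The names tx_rate and watch_tx, the 5-second interval and the log path are all arbitrary.

```shell
# Hypothetical helper: average bytes per second between two counter samples.
# $1 = earlier tx_bytes, $2 = later tx_bytes, $3 = seconds between them.
tx_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Hypothetical monitor loop: print a timestamped TX rate every interval.
# $1 = interface (default eth0), $2 = interval in seconds (default 5).
watch_tx() {
    ifc=${1:-eth0}
    interval=${2:-5}
    prev=$(cat "/sys/class/net/$ifc/statistics/tx_bytes")
    while sleep "$interval"; do
        cur=$(cat "/sys/class/net/$ifc/statistics/tx_bytes")
        echo "$(date '+%F %T') $ifc tx $(tx_rate "$prev" "$cur" "$interval") B/s"
        prev=$cur
    done
}

# Usage:
#   watch_tx eth0 5 >> /var/log/tx-rate.log &
```

A sudden jump in the logged rate just before the NIC drops points at traffic rather than at the driver.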

Then I wrote some iptables rules to disallow DNS queries etc., and after that the host became steady again! Traffic went back down to normal, and everything is now back on track. I'm so happy and excited about this, as it is the first time I've stopped a DoS attack!
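
For a host that runs no DNS server, rules along these lines would drop inbound DNS queries. This is an illustrative sketch, not the exact rule set I applied; dns_drop_rules is a made-up name, and it only prints the commands so they can be reviewed first.

```shell
# Hypothetical sketch: print (not apply) iptables rules that drop inbound
# DNS queries (port 53, UDP and TCP) on a host that runs no DNS server.
dns_drop_rules() {
    for proto in udp tcp; do
        echo "iptables -A INPUT -p $proto --dport 53 -j DROP"
    done
}

# Review the output, then apply as root with:
#   dns_drop_rules | sh
```

Replies to the host's own lookups arrive on an ephemeral destination port, so these rules should not affect them.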

This problem is due to a bug in the Intel NICs' MSI and/or MSI-X interrupts. To solve it, you need to download the latest Intel 82574L driver here. After downloading the source tarball to your server, follow these steps from the driver's README file:

  1. unzip: tar zxf e1000e-x.x.x.tar.gz
  2. cd e1000e-x.x.x/src/
  3. make CFLAGS_EXTRA=-DDISABLE_PCI_MSI install #this step is critical
  4. rmmod e1000e; modprobe e1000e
  5. add e1000e to /etc/modprobe.conf
  6. reboot server
After that, when you check the Intel e1000e driver module, you should now see:

[root@doxer ~]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
version: 1.10.6-NAPI
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation, <linux.nics@intel.com>

.....blablabla.....

vermagic:       2.6.32-220.7.1.el6.x86_64 SMP mod_unload modversions

parm: copybreak:Maximum size of packet that is copied to a new buffer on receive (uint)
parm: TxIntDelay:Transmit Interrupt Delay (array of int)
parm: TxAbsIntDelay:Transmit Absolute Interrupt Delay (array of int)
parm: RxIntDelay:Receive Interrupt Delay (array of int)
parm: RxAbsIntDelay:Receive Absolute Interrupt Delay (array of int)
parm: InterruptThrottleRate:Interrupt Throttling Rate (array of int)
parm: IntMode:Interrupt Mode (array of int)
parm: SmartPowerDownEnable:Enable PHY smart power down (array of int)
parm: KumeranLockLoss:Enable Kumeran lock loss workaround (array of int)
parm: CrcStripping:Enable CRC Stripping, disable if your BMC needs the CRC (array of int)
parm: EEE:Enable/disable on parts that support the feature (array of int)
parm: Node:[ROUTING] Node to allocate memory on, default -1 (array of int)
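
A driver built with DISABLE_PCI_MSI should leave the NIC on legacy interrupts, and /proc/interrupts shows which kind each interface is really using. A rough sketch; irq_type is a made-up helper, and the exact label text (IO-APIC-fasteoi, PCI-MSI-edge and so on) varies between kernel versions.

```shell
# Hypothetical helper: from /proc/interrupts text on stdin, print the
# interrupt type of each line whose last field starts with the given name.
irq_type() {
    awk -v ifc="$1" 'index($NF, ifc) == 1 {
        for (i = 2; i <= NF; i++)
            if ($i ~ /MSI|APIC|XT-PIC/) { print $NF, $i; break }
    }'
}

# Usage:
#   irq_type eth0 < /proc/interrupts
```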

Also, you may need to add pcie_aspm=off to the kernel command line in /boot/grub/menu.lst to disable Active-State Power Management, which may cause problems.

Those are all the steps to fix the Intel e1000e driver bug on the 82574L Ethernet controller.

NOTE: Please do not follow the steps below; they proved unable to solve this 82574L driver bug!

Fortunately, we can resolve this by installing the kmod-e1000e package from ELRepo.org. Here are all the steps you need:
Import the public key:
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org

To install ELRepo for RHEL-5, SL-5 or CentOS-5:
rpm -Uvh http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm

To install ELRepo for RHEL-6, SL-6 or CentOS-6:
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm

Before installing the new driver, let's see our old one:
[root@doxer sites]# lspci |grep -i ethernet
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

[root@doxer modprobe.d]# lsmod|grep e100
e1000e 219500 0

[root@doxer modprobe.d]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
version: 1.4.4-k
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation, <linux.nics@intel.com>
srcversion: 6BD7BCA22E0864D9C8B756A

Now let's install the new kmod-e1000e offered by elrepo:
[root@doxer yum.repos.d]# yum list|grep -i e1000
kmod-e1000.x86_64 8.0.35-1.el6.elrepo elrepo
kmod-e1000e.x86_64 1.9.5-1.el6.elrepo elrepo

[root@doxer yum.repos.d]# yum -y install kmod-e1000e.x86_64

After installation, reboot your machine, and you'll find the driver updated:
[root@doxer ~]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/weak-updates/e1000e/e1000e.ko
version: 1.9.5-NAPI
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation, <linux.nics@intel.com>
srcversion: 16A9E37B9207620F5453F5E

[root@doxer ~]# lsmod|grep e100
e1000e 229197 0

Also, you may need to add pcie_aspm=off to the kernel command line in /boot/grub/menu.lst to disable Active-State Power Management, which may cause problems.

 

 

You should get better networking on Linux now. Enjoy!

PS:

 

Actually, there is a lot of discussion on the internet about this problem, so I know I'm not the only one who was annoyed by it!

 

http://www.google.com.hk/search?hl=en&newwindow=1&safe=strict&q=Intel+e1000e+driver+bug&oq=Intel+e1000e+driver+bug&aq=f&aqi=&aql=&gs_l=serp.3...9108l9252l0l9707l2l2l0l0l0l0l0l0ll0l0.frgbld.



  1. khapota
    May 16th, 2012 at 03:24 | #1

    Thanks for your topic. I get the same bug with the 82574L driver. One question: do I need to do both steps, “change kernel parameter” and “change NIC parameters (you should add these lines to /etc/rc.local)”?

    • doxerorg
      May 16th, 2012 at 06:12 | #2

      Hi,
      yep, I would vote for this. I set both of them and it’s running as expected.

  2. Mark
    July 27th, 2012 at 16:23 | #3

    Thanks a LOT for taking the time to document this! We ran into this problem on our new squid proxy servers, but the problem didn’t show up until it was under load (we had run it for months under a light load with no problems). Fortunately, one of the NICs failed yesterday (under no load), this time with some errors in /var/log/messages (unlike before). After searching a bit on the error messages, I started running across comments by others about similar problems and eventually found this link with detailed instructions on how to fix the bug. I went ahead and made the changes and will now decide how much user traffic to send thru to see if the problem re-occurs. Thanks again!!

  3. August 16th, 2012 at 15:45 | #4

    Hello all, this may be related to something I’m experiencing. I have a Debian Lenny install and the same thing happens to me all the time. My server runs Clonezilla-SE and e1000 always crashes randomly, but never while sending the multicast over the network. Also, nothing appears in /var/log/messages but an “e1000 PCI INT A disabled”. I’m gonna try to re-install the NIC driver and see what I get.

  4. September 8th, 2012 at 01:38 | #5

    Given that this fix has been posted on April 1 – could someone confirm that it actually is the right way to fix the problem and that it actually worked. I am having an issue like this too and would like a confirmation from someone…

  5. Qf Yang
    May 29th, 2013 at 05:41 | #6

    Intel's website (http://downloadmirror.intel.com/15817/eng/README.txt) says there is an option, IntMode, that can be used in /etc/modprobe.conf:
    IntMode
    Valid Range: 0-2 (0=legacy, 1=MSI, 2=MSI-X)
    Default Value: 2

    Maybe this is an easy way to solve the issue:
    put the lines below in /etc/modprobe.d/myoption.conf
    alias eth0 e1000e
    alias eth1 e1000e
    options e1000e IntMode=0,0

  6. Kyrian
    October 2nd, 2013 at 23:19 | #7

    Using Debian, elrepo is not an option (unless I use tools to convert to .deb). Thus far I’ve had “success” (I await being disproven by ifconfig in due course!) by compiling the module with “CFLAGS_EXTRA=-DE1000E_NO_NAPI make” (with bash as a shell, other docs suggest different incantations, but that didn’t work, so make sure environment variable passing for your shell is done correctly!) for version 2.5.4-NAPI, and using an rx queue length of 4096 with rx pause off. I lose track of what exactly fixed it, but using different IntMode values to force each of eth0 and eth1 onto different interrupt “types” combined with the aforementioned seemed to be finally what got it nailed. I think it’s the *different types* that’s the key here, not what type you use, so make sure you comma-separate the list so each ethX interface uses a different type to avoid clashes (/proc/interrupts btw!).

    One weird thing I noticed was that the number of dropped packets was exactly the sum of the xon/xoff flow control errors (which you can get with e.g. “/sbin/ethtool -S ethX | grep xo”), even if the flow control stuff concurred with what was happening on the switch port, and (although I could be wrong on this) the rx pause was turned off!
    The above is not to say that the configuration options and ethtool tweaks applied wouldn’t have worked with the “in situ” driver version, though, so maybe try those first before forcing a module version upgrade and rebooting.

    Frankly, this is pot luck, because these drivers seem so broken in so many different ways that work for different people and I think the vendor should probably up their game with documentation (which to their credit they do seem to have done in the README for the 2.5.4 driver, but that’s no good to many people if Google can’t find it, or you can’t hit the right keyword combinations to find it…).

    K.

  7. October 24th, 2013 at 20:44 | #8

    Have you tried running a newer kernel on any of these machines lately?

    I am testing 3.8.0 built by Ubuntu, and while it removed the crazy latency variation, I am still seeing some dropped RX packets in ifconfig.

    Is it reasonable to expect 0 dropped RX packets on one of these e1000e interfaces?

    • October 25th, 2013 at 09:06 | #9

      I think you'd better check the uplink as well as the local interface, e.g. routers, switches etc.

  8. Igor
    November 1st, 2013 at 06:44 | #10

    I am having the same problem here. The outgoing connection is strong and I normally get 200mbit/sec data tx. But receiving (rx) is slow as hell, barely 5mbit. Might the options described here help with my situation? Did they help anyone?
    Thank you.

    • November 1st, 2013 at 07:21 | #11

      I think you can try this, and also use some networking diagnostic tools to monitor the networking.
