Due to what may be a hardware problem, we have discovered some
code paths in the ixgbe Intel 10G network driver that can hold an
interface's ixgbe->gen_lock mutex for anywhere from 10 msec upwards to
just over two seconds. In the process I have discovered what I think is
a register access bug in the driver.
The code path is:
- ixgbe_link_timer() fires every second for every ixgbe interface and
takes the interface's ixgbe->gen_lock mutex.
- it calls ixgbe_driver_link_check(). If the link is up, this
unconditionally calls ixgbe_fc_enable(). This is generally going
to call ixgbe_fc_enable_generic(), which calls ixgbe_fc_autoneg(),
which on 10G-T ports will call ixgbe_fc_autoneg_copper(), which
makes two hw->phy.ops.read_reg() calls.
The driver register access bug is here; ixgbe_fc_autoneg_copper()
does not check if either read actually succeeds, and they actually
can fail.
- these calls will generally or perhaps always call
ixgbe_read_phy_reg_generic(), which calls
hw->mac.ops.acquire_swfw_sync() and hw->mac.ops.release_swfw_sync().
On X540 chipsets, this will call X540-specific swfw sync operations.
ixgbe_release_swfw_sync_X540() has an unconditional msec_delay(5) call and
ixgbe_acquire_swfw_sync_X540() has a couple of conditional msec_delay(5)
and a 200-iteration timeout loop full of them.
This means that on X540 ports, ixgbe_link_timer() will hold the gen_lock
mutex for a minimum of 10 msec (since there are two releases, one for
each PHY register read). On our hardware we always see an additional
5 msec delay happen on each acquire, so the mutex is always held for
at least 20 msec.
On other chipsets, ixgbe_acquire_swfw_sync() has a 200-iteration timeout
loop with msec_delay(5). We have a dual-port 82599EB TN card where
access to the second port on the card currently fails to acquire swfw
sync; this makes every ixgbe_link_timer() tick take just over 2 seconds
(since there are two acquires attempted, one for each PHY register read)
and it holds the gen_lock mutex for all of this.
(We also have at least one other such card where the second port is
in use and doesn't show this problem; however, there may be traffic
volume/type issues involved.)
This is not just really bad because of a long mutex hold in general; we
see real lockups and stalls due to this. Under load the higher level
Illumos networking can frequently call ixgbe_rx_ring_intr_disable().
ixgbe_rx_ring_intr_disable() tries to take the interface ->gen_lock
mutex, so it is locked out while ixgbe_link_timer() is twiddling its
thumbs. On the hardware where the SWFW sync is failing, I have DTrace
traces showing *1.8 second* lockouts on three out of four CPU cores
(with ixgbe_link_timer() on the fourth core), sometimes with significant
spinlock time consumption[*]. Based on the user-level symptoms on the
machine (eg networking on all ports seems to stutter and stall badly),
I believe that this has significant consequences overall.
(Perhaps the general mac code path that calls ixgbe RX ring interrupt
disabling is holding broader, more high level locks. I don't know.
I've also seen rare issues with ixgbe_rx_ring_intr_enable().)
In addition on all of our X540-using machines I have DTrace traces
showing periodic 20+msec lockouts. These are of course much less common
than lockouts on the machine where ixgbe_link_timer() is holding the
mutex for two seconds since the contention window is smaller.
As if that wasn't enough: on the machine with the dual-port
82599EB TN card that isn't causing SWFW sync problems, I have seen
a high traffic volume on both ports at once cause ~40 millisecond
ixgbe_rx_ring_intr_disable() lockouts due to ixgbe_driver_link_check()
holding the mutex for that long (as well as shorter lockouts, eg
10msec). So I think this whole setup is kind of problematic.
Please let me know if people have questions or comments or if I
didn't manage to explain everything clearly.
At some point I expect to file this in the Illumos bug system,
perhaps with less words or a pointer to the mailing list archive.
- cks
[*: I have lockstat traces that show average spin time on this mutex
being 0.05 seconds.
]