RHEL6.5 has a recent AoE driver!

Version 6.5 of Red Hat Enterprise Linux includes version 83 (released May 2013) of Sam Hopkins’ canonical ATA-over-Ethernet kernel module.

Since RHEL6.4 was shipping version 47, which was six years out of date (and somewhat incompatible with current generations of AoE SAN appliances like the Coraid ESM and VSX), this is a huge improvement.

aoe/dkms/rhel6 redux

Situation as it stands:

The aoe v22i kernel module distributed with RHEL5 is not a major problem. It occasionally has boot issues, where it’s unable to find and mount volumes, but a reboot usually clears this right out, so I think it’s a minor timing bug, probably a race condition; the protocol uses fixed intervals which are not based on primes. We can ignore all that for the moment since it only happens occasionally at boot.

However, the aoe v47 kernel module distributed with RHEL6 causes an infinite loop at boot when used with the Coraid VSX appliances. The only way out of the loop appears to be hard power off. This is obviously a major issue!

Dell’s Dynamic Kernel Module Support (DKMS) offers a way to use updated drivers that should prevent the system from blowing up every time a sysadmin types “yum update”. So we should, in theory, be able to use the latest greatest aoe module with dkms and be happy… provided both DKMS and the l.g.a.m. actually work. Hilarity ensues.

The aoe v6-79 kernel module currently available on the coraid and sourceforge sites works reasonably well. It spits out a screen or two of “unsupported ioctl” warnings shortly after boot, but these do not appear to affect function. It has another bug that will never affect disk I/O, but which is a major problem for DKMS. If you do a “modinfo aoe” the output is formatted incorrectly, and DKMS uses that output to determine kernel module version.

With help from others, I made a little patch set that fixes the modinfo problem with aoe6-79 and Ed Cashin at coraid.com is receptive to including it in the next version release. In the meantime I have built the patched version and integrated it with DKMS.

The DKMS package that comes with Acronis, which we have installed on most of our machines, is very broken. We need to replace it. I don’t know how to backfeed changes to Acronis, but for the moment I’m just going to make replacing Acronis’s package a requirement for installing my aoe6-79 package.

The DKMS package currently being distributed by Dell is also broken, at least on current Red Hat. I am trying to figure out how to patch that as well. I’ve already patched the dkms-autoinstaller init script, but now I need to figure out why --autoinstall does not work. The way DKMS reverses normal unix program output conventions is irritating – DKMS is chatty when it works, and silent when it breaks. This transgression of one of the most basic rules of *nix makes the bearded Dennis Ritchie sad.

Work is being done, progress is being made, and a breakthrough is inevitable, as Mr. Z. would say.

dkms rpm for aoe v79 working… sorta…

Sunil Gupta at Dell spotted the reason that DKMS didn’t like v79 of Coraid’s ATA-over-Ethernet driver – the module info was buggered up. Looking at the sources, it appears that the guys over at Coraid ran into some compiler warnings they wanted to get rid of that were coming out of Rusty Russell’s MODULE_VERSION() primitive, so they commented it out and stuffed the version string into the parameter list. That doesn’t affect the normal operation of the driver module at all, and since generally the only thing that uses the output of modinfo is the Mark I Eyeball, most people (including me) didn’t even notice. But it blows DKMS right up, since the install function parses the output of modinfo to test module versions.
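To make the failure mode concrete, here is a minimal Python sketch of the general idea. It is not DKMS's actual code (DKMS is a shell script), and both sample outputs are invented for illustration rather than captured from a real aoe build. The point is only that a parser looking for a top-level "version:" field comes up empty when the version string lives in a parameter description instead.

```python
import re
from typing import Optional

# Hypothetical modinfo-style output, invented for illustration only.
GOOD_MODINFO = """\
filename:       /lib/modules/example/extra/aoe.ko
version:        79
license:        GPL
description:    AoE block/char driver
"""

# Same module, but with the version tucked into a parameter description
# instead of a proper "version:" field (again, a made-up sample).
BAD_MODINFO = """\
filename:       /lib/modules/example/extra/aoe.ko
license:        GPL
description:    AoE block/char driver
parm:           version:aoe module version 79
"""


def module_version(modinfo_text: str) -> Optional[str]:
    """Return the value of the top-level 'version:' field, or None."""
    for line in modinfo_text.splitlines():
        # Only a line that *starts* with "version:" counts as the field.
        match = re.match(r"^version:\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(module_version(GOOD_MODINFO))
    print(module_version(BAD_MODINFO))
```

With the first sample the function finds a version; with the second it returns None, which is roughly the position a version-comparing installer ends up in.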

Sunil made a patch which worked (thanks Sunil!) but I didn’t like the way it broke Sam Hopkins’s versioning scheme, so I made my own. Then I noticed another bug in the module info, a spurious newline that doesn’t actually hurt DKMS, and I figured what the hell and patched that too.

So the incompatibilities between the driver and DKMS are resolved, and by using the Dell version of the DKMS package I’ve solved the --mkrpm problem (that was due to the broken DKMS package shipped by Acronis)… but unfortunately it still doesn’t completely work.

Tomorrow I’ll figure out why I don’t get the new kernel module automagically compiled for me whenever I load a new kernel RPM. That is, after all, the whole point of this exercise. If I didn’t want fresh compiles I’d be using kmod, not DKMS.

new version of RHEL aoe initscript is up

Turns out Red Hat Enterprise Linux v6 is using udev 147, so the aoe udev rules as distributed by coraid & others are obsolete(ish). The kernel throws errors when it hits the NAME="%k" clause, and the udev folks say “It is and always was completely superfluous. It will break kernel supplied DEVNAMEs and therefore it needs to be removed… Kernel 2.6.31 supplies the needed names if they are not the default.” The RHEL6 kernel is 2.6.32 so I have revised the rules accordingly.
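For anyone cleaning up their own rules by hand, the revision amounts to dropping the bare NAME="%k" key and leaving everything else alone. Here is a small Python sketch of that edit. The sample rule line is only illustrative (it is not copied from my revised rules file), and the function assumes one rule per line with no commas inside any value.

```python
def drop_name_k(rule: str) -> str:
    """Remove a bare NAME="%k" assignment from a single udev rule line."""
    # Naive split on commas: assumes no key's value contains a comma.
    keys = [key.strip() for key in rule.split(",")]
    # Keep every key except the superfluous NAME="%k".
    kept = [key for key in keys if key != 'NAME="%k"']
    return ", ".join(kept)


if __name__ == "__main__":
    # Illustrative sample rule, not taken from any particular rules file.
    sample = 'KERNEL=="etherd*", NAME="%k", GROUP="disk"'
    print(drop_name_k(sample))
```

A NAME key with a different value, such as one that puts a node in a subdirectory, is left untouched, since only the exact NAME="%k" form is the one the udev folks call superfluous.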

(Remember, if you use zaoe you have to modify the mknode setting in the script to suit your version of Red Hat linux. Currently the default is RHEL6, although it’s mostly being used in RHEL5.)

Another, more worrisome issue is that using the Red Hat supplied kernel module causes problems with the new VSX and ESM hardware, but using the latest coraid module makes the kernel throw tons of ‘aoe: unknown ioctl 0x1’ errors. Damned if you do and if you don’t!

Red Hat just doesn’t get AOE

This post edited to correct my mistakes

ATA-over-Ethernet, the high performance low-cost SAN protocol developed by Sam Hopkins over at Coraid, never gets any love from Red Hat Enterprise Linux. The AoE kernel modules included with any given release are always egregiously out of date, and don’t even seem to be contemporary with the kernels they’ve been distributed with. If you want to use up-to-date drivers, you have to either wrap up coraid’s kernel module sources with DKMS and support a complete build chain, or create kernel-version independent modules, or else you’ll get system breakage from routine yum updates. Either way you have to muck about building RPMs so you won’t break package dependency and inventory tracking, and you have to set things up to get rid of the old Red Hat module each time you get a kernel update.

And despite the inclusion of an (elderly) aoe module in Red Hat, they don’t provide any officially blessed aoe-tools package. Coraid maintains a set of simple aoe management utilities, a remote console app for their aoe devices, and a throughput testing app, all free open source software. The basic aoe-tools package has been a part of Fedora for some time now, and lately coraid has been bundling the sources for the tools with the driver sources.

In my shop, we’ve been running more than 12 terabytes of AoE storage infrastructure on Red Hat EL 3, 4, and 5 for eight years or more now, and we currently have more than 150 TB in production. I had some pretty major problems with AoE on RHEL6, and openly blamed Red Hat’s out of date drivers for them, but those problems have been resolved. We had a bad VLAN trunk, basically, so it was really our fault (my sincere apologies to those I mistakenly accused).

You know, it’s always seemed to me that ATA-over-Ethernet should be a natural win for Red Hat. The simplicity of the protocol, when compared to Red Hat’s approved iSCSI, reminds me strongly of the value proposition that linux represented ten years ago when compared to proprietary unixes. You can build an AoE SAN that’s faster and more reliable than an iSCSI SAN for considerably less money; why would anyone purposely choose the less cost-effective solution? That’s why so many IT shops left HP-UX, SunOS, and MVS for Red Hat linux – because linux delivered the capabilities we needed for less cash.

Strangely, though, Red Hat treats AoE like an unloved and ugly stepchild, at best neglecting it entirely. Whenever a Fedora release is repackaged for publication as a Red Hat Enterprise Linux release, the aoe-tools package is removed; bleeding-edge Fedora probably provides a more stable AoE SAN platform than the flagship product. It’s deeply weird behavior and counterproductive for Red Hat.

I wrote an initscript for AoE under Red Hat that works under RHEL 3 through 6. It’s in the software section.

There’s an interesting discussion that references this blog post here.

udev under RHEL6 on a Dell m1000e blade server

The udev subsystem dynamically responds to udev events generated by the kernel and creates device nodes on the fly. This solves the old *nix problem of too many files in /dev that don’t have anything to do with the hardware you actually have. If you see a file named /dev/mouse, the system really truly has a mouse attached to it. Or at least that’s the plan. The kernel calls udev whenever you activate or deactivate any device, not just USB and PCMCIA cards, and this lets linux support absolutely any kind of hot-swappable hardware, even processors.

This is not the first dynamic hardware mapping scheme to hit linux – I think Red Hat EL4 used devfs and there was at least one other one before that. And people have tried to do it using the HAL daemon too.

The udev daemon reads “rules” that tell it how you want your devices named. This is so that different distributions that have historically used different names can all use udev, they just have their own rule sets. This also means that hardware invented tomorrow is trivial to add – just make a new rule.

In Red Hat Enterprise Linux v6, udev rules are read from multiple places in order to be maximally confusing and infuriating.

/lib/udev/rules.d is a “library” of standard rules that udev uses by default.

/etc/udev/rules.d is a place where you can put your own system-specific rules that will override the library. This is based on the names of the files, not the rules themselves.

/dev/.udev/rules.d is a secret hidden directory that seems to exist mostly to make you angry. I haven’t found any clear documentation of precedence of rule files found here. You have to experiment on a test system. The rules in here are usually referred to as “temporary rules” which may very well mean “permanent overrides” in normal English.
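Here is a rough Python sketch of the file-name override between the first two directories, as I understand the udev documentation of that era: a file in /etc/udev/rules.d replaces a same-named file in /lib/udev/rules.d, and the surviving files are then processed in lexical order of their names regardless of which directory they came from. The sketch deliberately leaves out /dev/.udev/rules.d, since I have no clear picture of its precedence. The function name and the sample file names are made up for illustration.

```python
from typing import Dict, List

LIB_DIR = "/lib/udev/rules.d"
ETC_DIR = "/etc/udev/rules.d"


def effective_rules(lib_files: List[str], etc_files: List[str]) -> Dict[str, str]:
    """Map each rules file name to the path that would actually be used.

    A file in ETC_DIR shadows a file of the same name in LIB_DIR.
    The result is ordered lexically by file name.
    """
    chosen = {name: f"{LIB_DIR}/{name}" for name in lib_files}
    # Same-named files from /etc replace the library copies.
    chosen.update({name: f"{ETC_DIR}/{name}" for name in etc_files})
    return {name: chosen[name] for name in sorted(chosen)}


if __name__ == "__main__":
    # Sample listings, assumed for the sake of the example.
    lib = ["50-udev-default.rules", "71-biosdevname.rules"]
    etc = ["71-biosdevname.rules", "60-aoe.rules"]
    for name, path in effective_rules(lib, etc).items():
        print(name, "->", path)
```

Note that the override is per file, not per rule: shadowing a library file replaces every rule in it, whether you meant to or not.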

udev automatically detects changes to rules files, so changes take effect immediately without requiring udev to be restarted. However, the rules are not re-triggered automatically on already existing devices, so you might have to learn to use the udevadm command (or just reboot) if you write any new rules.

But wait! There’s more! It was starting to make sense, and we can’t have that!

If you have a Dell system, a program called biosdevname will be invoked by rules in the library file /lib/udev/rules.d/71-biosdevname.rules whenever the system becomes aware that a network interface exists. The Dell white papers on the subject tell me that this program will rename your network devices and disk drives in a way that’s both counter-intuitive and inaccurately documented, but in reality it just renames the ethernet interfaces on a specific set of Dell systems.

Now, honestly, on some models of Dell server this behaviour makes a decent amount of sense; it labels the embedded ethernet ports on the system’s motherboard as “em1” and “em2” which is a good idea since Dell labeled the ports 1 & 2 (instead of the expected 0 and 1 – Real Computer Scientists can’t count past nine on their fingers). Then it will label any additional interfaces by the PCI bus positions, which is sort of uselessly informative but at least consistent.

On a Dell M1000e blade server, though, running biosdevname is worse than doing nothing. Instead of following their reasonably sane idea of making the labels on the chassis match the labels on the interfaces, the emn and pxpy naming scheme is mapped pseudo-randomly across the network ports. Net fabric A is arbitrarily designated as “embedded” (remember, on the M1000e none of the interfaces are embedded in the blades themselves), P3 is considered fabric B, and P1 is assigned to fabric C – even the order doesn’t match. So, on blade five, the interface names end up looking like this:

Blade Chassis Name -> Linux Interface Name
A1-port5 -> em1
A2-port5 -> em2
B1-port5 -> p3p1
B2-port5 -> p3p2
C1-port5 -> p1p1
C2-port5 -> p1p2

This is like purposeful obfuscation of location, and much worse than just nailing down eth0 through eth5 to specific MAC addresses (which RHEL6 would happily do if Dell’s biosdevname software wasn’t sticking its member into the pie).

The best solution I’ve found so far is to create a file named /etc/udev/rules.d/71-biosdevname.rules (thus overriding /lib/udev/rules.d/71-biosdevname.rules) and put a series of udev rules in it that associate specific names with specific MAC addresses. I map the actual Dell chassis labels like so:

Blade Chassis Name -> Linux Interface Name
A1-port5 -> ethA1p5
A2-port5 -> ethA2p5
B1-port5 -> ethB1p5
B2-port5 -> ethB2p5
C1-port5 -> aoeC1p5
C2-port5 -> aoeC2p5

With this setup, you will know which wire the telco guy unplugged as soon as the system starts screaming about something gone wrong on ethA1p5 – it’s the Cat 6 ethernet wire in slot A1 port 5, obviously. You can also see which interfaces are intended to be used for our high-speed ATA-over-Ethernet SAN infrastructure.
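Typing those rules out for a full chassis gets tedious, so here is a small Python sketch that generates them. Treat it as an illustration rather than my actual rules file: the MAC addresses are placeholders from the documentation range, the helper names are invented, and the rule format is a trimmed-down version of the usual persistent-net style (match on SUBSYSTEM, ACTION and ATTR{address}, then assign NAME). Fabric C gets the "aoe" prefix because that is where our SAN traffic lives; adjust to taste.

```python
from typing import Dict, Tuple


def interface_name(chassis_label: str, san_fabrics: Tuple[str, ...] = ("C",)) -> str:
    """Turn a chassis label like 'A1-port5' into 'ethA1p5' (or 'aoeC1p5')."""
    slot, port = chassis_label.split("-port")
    # Fabrics listed in san_fabrics carry AoE traffic and get the 'aoe' prefix.
    prefix = "aoe" if slot[0] in san_fabrics else "eth"
    return f"{prefix}{slot}p{port}"


def udev_rule(mac: str, name: str) -> str:
    """Build one udev rule that pins an interface name to a MAC address."""
    # sysfs reports MAC addresses in lower case, so normalise before matching.
    return (f'SUBSYSTEM=="net", ACTION=="add", '
            f'ATTR{{address}}=="{mac.lower()}", NAME="{name}"')


if __name__ == "__main__":
    # Placeholder MAC addresses; substitute the real ones for each blade.
    blade5: Dict[str, str] = {
        "A1-port5": "00:00:5E:00:53:01",
        "A2-port5": "00:00:5E:00:53:02",
        "C1-port5": "00:00:5E:00:53:05",
    }
    for label, mac in blade5.items():
        print(udev_rule(mac, interface_name(label)))
```

The output lines would go into the /etc/udev/rules.d/71-biosdevname.rules override described above, one rule per interface.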

That’s enough for today…