[ed. This should have been posted on Thursday, but werk eight my brane.]
@colin_clive(it's alive . . . It's Alive . . . IT'S ALIVE!)
A few months ago I procured two 3Ware hardware RAID cards. One for slackers (a nice SATA unit, with caching & acceleration), and one for myself (an older, lower-end unit, but probably adequate for my needs). My intention was to use mine to build a fileserver on which to store backups (primarily of my photos, but also of other important data to which I want semi-convenient, networked access).
As I was thinking about doing this, it occurred to me that I'd never named a machine "Sarnath." This is surprising, since as names go, it hits a trifecta: it is a city that is both mythical and destroyed (either one would be adequate for Slackers' Network inclusion) AND it exists(-ed) in the Dreamlands (my own personal subset of the Slackers' Network scheme).
For all of those reasons, it seemed the obvious choice for the new machine. And yet, I decided that I should not tempt fate that way. (The name of the story is "The Doom That Came to Sarnath," after all!) Naming my shiny new Athlon64 fileserver "sarnath" would be hubris: basically asking for it to be struck by lightning. (Technically, flooded and dragged to the deep by lizard-men, but that's not really likely in Northern California! Though I suppose swallowed up by an earthquake would be close enough...)
So, when I finally got around to ordering the other parts, I rejected "Sarnath" in favor of "Kiran[*]." Little did I know that its fate had already been sealed.
What follows is a blow-by-blow of my attempt to get this machine up and running:
- Putting the machine together took several hours, because, in my excitement, I kept doing things out of order, or just generally rushing.
- Installing CentOS 4.2 x86_64 from both a CD-ROM & DVD-ROM causes MANY errors. I assume the issue is the new (OEM) DVD drive.
- After many hours patiently retrying failed package reads, I manage to get the machine installed. The installer doesn't recognize the on-board GigE NIC. (That NIC was one of the reasons I bought that particular motherboard; bad research on my part.)
- I download the latest (binary *grumble*) network driver from nVidia, burn it to a CD, and install it.
- Attempting to update the packages to the latest versions, it appears that sustained network activity causes a kernel panic (with a binary network driver, fan-fucking-tastic!). As it is now 3:30am, I opt to put this off until Monday evening.
- Working on the assumption that the optical disk errors are caused by a faulty DVD drive, I exchange it at Central Computer (where I'd paid a small premium to have it on Sunday, when Surplus Computer is closed).
- That evening I try: the replacement DVD drive, a DVD drive Jonah had lying around, and a CD-ROM drive I had grabbed as a sanity check. All of them fail in the same way with different cables.
- Resigned to the fact that I will have to RMA something (the motherboard most likely, but possibly memory), I don't spend a lot of time fiddling with the network. Plus, I'm hoping (against hope) that the network problem is in the chipset (even though it's probably a software issue). I do start a memtest86 to run for a few hours. The memory checks out.
- I apply for an RMA for the motherboard. This will take a minimum of 3-5 business days. They'll ship it back to me UPS Ground. So glad I paid for a rush delivery. (I know this is SOP for most vendors, but I've known a few that ship RMAs with the same shipping as the original order, and really, this whole entry is just kvetching)
- It takes a week.
(It did spend an overnight stay in South San Francisco. I thought long and hard about asking them if I could come get it, but I didn't have time to work on it that night anyway.)
- Without installing the motherboard I test an install with a spare disk and as little connected as possible (in an attempt to minimize variables). The IDE/DMA issues appear to be fixed.
- Reinstall most of the hardware, and reinstall the OS on the RAID array. Everything seems fine.
- During my 9 day wait for my motherboard, I did some research on the nForce network chipset, and it looks like the forcedeth driver should be able to drive the interface. But after some futzing, I give up on that.
- In attempting to update the packages to the latest versions, the 3w-xxxx driver loses track of the array with the OS on it. It never recovers, and a hardware reset is necessary.
- I upgrade the firmware and BIOS on the 3Ware card in a vain attempt to make this problem go away.
- Several more tests confirm that it appears to happen most consistently when reading from the network & writing to the array.
- I begin to swear like a sailor. Skimming the source for 3w-xxxx (since I have source for it, mad props to 3Ware on that front) seems to indicate that it's missing interrupts, but I'm hardly a kernel hacker. To me, the most likely culprit at this point is the proprietary nForce driver.
- Further research suggests that only the forcedeth driver from a later kernel (CentOS ships with 2.6.9, which is starting to show its age) will drive the nForce gigE NIC on my motherboard.
- I download the kernel packages (and dependencies) from Fedora Core 4, and attempt to install them. Another freeze occurs during the install. In hindsight I realize I should have booted single user before installing to prevent this. The kernel panics on boot.
- I give up, and plan to buy an e1000.
- I buy an Intel e1000 card to use as a replacement, since that family of NICs has been supported under Linux since, well, effectively forever. I find it mildly funny that I'm using an Intel NIC in my new AMD64 server. Not funny "ha-ha." Running with the e1000 (and the nForce driver never even loaded) does not help the situation.
- I open a tech support ticket with 3Ware on this issue.
- For another datapoint, I install the i386 version of CentOS 4 on this machine to see if it exhibits the same behavior. The console log for the installer shows the error message once (before the entire console disappears), but it is not fatal. This does not bode well. I update the ticket with this information.
- 3Ware support answers a trouble ticket on a second-hand card with a solution that works. (Turn off ACPI). Double-plus mega mad props to 3Ware.
- Warily, I reinstall the x86_64 distribution, making sure ACPI is turned off from the outset. After installation I successfully upgrade the packages to the latest versions.
- With more testing, and more data moving around, I become more confident that the solution has been found.
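For anyone wanting to double-check the same fix: one common way to turn ACPI off on Linux is the acpi=off kernel boot parameter. That route is an assumption here (a BIOS setting is another way to do it). Under that assumption, a minimal Python sketch, with a hypothetical helper name, for confirming that the running kernel really was booted with it:

```python
from pathlib import Path


def acpi_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line contains the token acpi=off.

    Hypothetical helper for illustration; it only checks the boot
    parameter, not any BIOS-level ACPI setting.
    """
    # Kernel parameters are whitespace-separated tokens, so compare whole
    # tokens rather than substrings.
    return "acpi=off" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("ACPI disabled at boot:", acpi_disabled(proc_cmdline.read_text()))
    else:
        print("No /proc/cmdline here; is this a Linux box?")
```

Matching whole tokens keeps an unrelated parameter that merely ends in the same text from counting as a hit.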
So, in the end, there is a happy ending, and that's what is most important.
[*] Kiran contains a beautiful temple that a distant king visits once a year to pray to singing gods, and only he is allowed to enter the temple. Can you think of a better name for a backup server?