I can’t believe it, but it’s been almost a month since my last post. And what a month it’s been at work. This has been one of the busiest and most difficult months that I can remember with the company. I have my hands in several different technologies; VMware and our blades are just two of my primary responsibilities. Over the past month, though, we’ve experienced a catastrophic failure of one of our blade enclosures. The failure only occurred once, but the fallout has taken almost a month to work through. And honestly, we’re still not done working out the kinks.
Of course, my story has to begin on Friday the 13th… Sometime around 9:00am, we started getting calls about both our SQL 2005 database cluster and our Exchange cluster. After investigating, we found that the active nodes for both were in the same enclosure, and that a third ESX host in the same enclosure was experiencing problems, too. The problems were affecting both network and disk IO on the blades. All of our blades boot from SAN, so the IO trouble had to be a Fibre Channel issue.
Several hours later, we were finally able to get enough response out of the nodes to force a failover of services for Exchange, shortly followed by SQL 2005. As I worked with HP support, nothing improved on the affected servers. HP eventually diagnosed the problem as a faulty mid-plane in the enclosure.
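For anyone who hasn’t had to do it under pressure, forcing the failover itself is just a matter of moving the clustered resource groups to a node in the healthy enclosure. With cluster.exe it looks roughly like the lines below; the group and node names are placeholders, not ours, and if you’re running Exchange 2007 CCR the supported route is Move-ClusteredMailboxServer instead:

    rem List the resource groups and which node currently owns each one
    cluster group

    rem Move the Exchange and SQL groups to nodes in the other enclosure
    cluster group "Exchange Group" /moveto:NODE-B1
    cluster group "SQL Group" /moveto:NODE-B2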
While waiting for the mid-plane to be dispatched to the field service folks, I requested that we go ahead and do a complete power-down on the enclosure and bring it back up clean. This required physically removing power from the enclosure after powering down everything that I could from the Onboard Administrator.
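For reference, the “everything I could” part can be driven from the OA’s CLI. From memory, the relevant commands look something like the lines below; vet them against the documentation for your firmware revision before trusting them:

    show enclosure status        (overall health of fans, power supplies, OAs)
    show server list             (which bays are populated and their power state)
    poweroff server all          (graceful power-off request to every blade)
    poweroff server all force    (hard power-off for anything that ignores the request)

The interconnects and the OAs themselves don’t have a soft power-off, which is why the last step was pulling the power feeds from the back of the enclosure.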
After the reboot, everything looked much healthier. The blades came back to life and everything began operating as expected. After intense discussions on the HP side, we also reseated our OAs and the sleeve they plug into on the back of the enclosure. The net outcome was the same: everything kept operating well. Neither the OAs nor the sleeve had been loose, though, so we doubted that was the cause.
One nugget I learned from HP support (please vet this information on your own) is that the Virtual Connect interconnect modules require communication with the Onboard Administrators (OAs). I’m still not sure I fully understand it, but HP support did tell us that if Virtual Connect loses communication with the OA, it’s possible that could have caused our problems. If that’s so, it smells like very, very bad engineering and design…
Continued investigation on HP’s part has pointed us back to the original diagnosis: a faulty mid-plane. We only arrived back at that conclusion by process of elimination, however, since the mid-plane is the only piece of hardware common to all of the affected blades. Our only other explanation was that this was a very bad “hiccup,” which obviously buys us no real peace of mind…
So, sometime soon, we will be replacing the mid-plane of our enclosure. I have, of course, lost some faith in the HP blade ecosystem. We have plans to migrate our corporate VMware cluster onto blades, as well as some Citrix and other servers, and losing an enclosure like this has us second-guessing those plans. We were fortunate to have dragged our feet: only 3 blades were populated and serving anything at the time this happened. I will post updates as we move forward…