
Lessons learned from initial rollout of ESX4

by Philip Sellers

Following my November upgrade of Flex-10 VirtualConnect on my blade enclosure, I have begun my rollout of ESX4 on a new blade cluster as well as upgrades to one existing cluster. There are quite a few lessons that I've learned along the way.

Lesson # 1: vCenter Server doesn’t cluster well with Microsoft Cluster Services
Well, it's not so much vCenter Server as Update Manager that gave me problems. I can get Update Manager to work on one node, but it doesn't want to cooperate on the other node of my Microsoft cluster.
This is not a major issue for us, since we plan to implement vCenter Server Heartbeat on our vCenter install. We view Heartbeat as the official clustering solution for vCenter, so I believe it is the better option.

Lesson #2: Use the upgrade path in vCenter Update Manager to upgrade your ESX hosts – makes life really easy!
I was pretty surprised at how easily my four nodes upgraded to ESX4 using Update Manager. It is well worth the bit of time it takes to get things set up and push the upgrade out. Upgrades are a nice new feature of Update Manager in vSphere that I had not heard much real information about. You are able to upgrade virtual appliances and ESX hosts with the latest installs. On the ESX side, you load the ISO distribution from VMware onto the Update Manager server; Update Manager then captures the host's configuration information, handles a scripted install using the hosted ISO, and loads your configuration back on top of the upgraded server. Given the architecture changes between ESX 3.5 and vSphere ESX, it is effectively a clean install, which was one of my concerns. My boxes were 3.0.0 boxes, upgraded through 3.0.2 and then to 3.5 and its updates.
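Once Update Manager finishes remediating, it's worth confirming that every host actually reports the new build before you start moving workloads back. Below is a minimal sketch of that check using the pyVmomi Python SDK; the vCenter hostname, credentials, and the unverified SSL context are placeholders for illustration, not our actual setup.

    # Sketch: list the product version reported by every ESX host in vCenter,
    # to confirm an Update Manager upgrade actually took on each node.
    # Assumes pyVmomi is installed; hostname and credentials are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.local",
                      user="administrator",
                      pwd="changeme",
                      sslContext=ssl._create_unverified_context())  # demo only
    try:
        content = si.RetrieveContent()
        hosts = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in hosts.view:
            # host.config.product carries the full build string, e.g. "VMware ESX 4.0.0 ..."
            print(f"{host.name}: {host.config.product.fullName}")
    finally:
        Disconnect(si)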

Lesson #3: Storage behavior in ESX4 and vCenter4 has changed…
One nice change, and I learned part of this through trial by fire (a separate post on that later), is how storage behavior has changed in vCenter 4.
Storage is a more unified system in vCenter 4. If a host makes or detects a change in the storage LUNs, vCenter automatically kicks off storage rescans on the other nodes so they recognize the change you just made on a single host. ESX4 also seems to do periodic rescans on its own to pick up LUN changes. These are welcome administrative changes to the operation of vCenter and ESX.
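For comparison, here is roughly the manual equivalent of what vCenter 4 now does on its own: walk every host in a cluster and kick off an HBA and VMFS rescan. This is a sketch built on pyVmomi, assuming the si connection from the earlier example and a placeholder cluster name.

    # Sketch: manually rescan storage on every host in one cluster.
    # Assumes `si` is the connected ServiceInstance from the previous sketch;
    # "Blade-Cluster-01" is a placeholder cluster name.
    from pyVmomi import vim

    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in clusters.view if c.name == "Blade-Cluster-01")

    for host in cluster.host:
        storage = host.configManager.storageSystem
        storage.RescanAllHba()   # detect newly presented LUNs on every HBA
        storage.RescanVmfs()     # detect new or changed VMFS volumes
        print(f"Rescanned storage on {host.name}")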

Lesson #4: Be very careful about presenting LUNs to two different ESX clusters when using HP StorageWorks EVAs.
Speaking of trial by fire, we experienced a pretty major failure with our storage a couple of weekends ago. ESX4 is ALUA (asymmetric logical unit access) aware, meaning it should determine its optimal paths automatically and use them. The EVA storage should respond with the controller that ‘owns’ that LUN as the optimal path. So, this should present little trouble.

While this sounds great, mixing the ESX4 behavior with ESX3.5 caused us to experience controller ping-pong, where the controllers were flipping back and forth for particular LUNs. It even happened on LUNs not used by the ESX4 nodes. EVA storage will automatically move control of a LUN from one controller to another depending on where it is receiving the majority of requests. In a shared storage environment like ESX, that can be somewhat problematic if your two clusters are using different paths, because the majority of requests can shift every few minutes. While it's my belief that we have a sick EVA and that is more the problem, it did bring back the notion of planned storage pathing for our Fibre Channel, but that's for a longer post.
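When we get to that planned pathing work, the first step will be auditing which paths each host is actually using for each LUN. The sketch below, again assuming the pyVmomi si connection from earlier, dumps the path selection policy and path states per LUN so you can see whether two clusters are favoring different controllers; it is an illustration, not the tooling we actually used.

    # Sketch: report the multipath policy and path states for every LUN on every host.
    # Assumes `si` is the connected ServiceInstance from the first sketch.
    from pyVmomi import vim

    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)

    for host in hosts.view:
        mp = host.configManager.storageSystem.storageDeviceInfo.multipathInfo
        for lun in mp.lun:
            states = [p.pathState for p in lun.path]  # e.g. active / standby / dead
            print(f"{host.name}  {lun.id}  policy={lun.policy.policy}  paths={states}")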

Lesson #5: Rolling out slow, especially around the holidays, is SMART…
I mentioned earlier that we were rolling out ESX4 slowly. We have installed ESX4 on our new blade cluster, we only moved test and development virtual servers over to this new environment, and we are glad we did. As part of our migration, we moved the backend storage for test and development to our disaster recovery storage array, and that appears to be the sick EVA. It caused a couple of complete outages for these virtual servers, but because we'd done a slow migration, production (running safely on our tried and true ESX 3.5 cluster) didn't see the problems.

We don't plan on touching much this week with Christmas coming Thursday. It gives me time to update the ole blog with some of the work that has been occupying my time. This migration has a lot more to go, so there will be more meat as I move through the process.
