VMware Fault Tolerance: What is it? What does it do?

With the advent of vSphere, VMware has released a host of new features. Today I am going to talk about VMware Fault Tolerance. I’ll give you an overview and talk to you about the requirements. Next I’ll walk you through the setup and configuration, and finally, we will discuss both the benefits and pitfalls of Fault Tolerance. Oh, and I will provide you with some links to documentation, both throughout the blog and again at the end. Just a little light reading for a rainy day, in case you get bored. I almost forgot! I will also show you a demo of Fault Tolerance as I test failover. * Note: to see the video, please open this in a full window.
Overview:
Officially, VMware states the following:

“Maximize uptime in your datacenter and reduce downtime management costs by enabling VMware Fault Tolerance for your virtual machines. VMware Fault Tolerance, based on vLockstep technology, provides zero downtime, zero data loss continuous availability for your applications, without the cost and complexity of traditional hardware or software clustering solutions.” Source http://www.vmware.com/products/fault-tolerance/overview.html

But what does that mean? Basically, Fault Tolerance protects you by running a backup copy of your VM on a second host. Possible? Yes, but as you will see, there are some stiff requirements. However, for those that need Continuous Availability and are running VMware, it is a great offering. And for the rest of you, it gives you a compelling reason to take a hard look at VMware and vSphere.
VMware has a technical description here: How Fault Tolerance Works, and use case examples here: Fault Tolerance Use Cases.
My first thought is that Fault Tolerance will allow you to achieve incredible uptime with pre-existing smaller legacy applications, and that it could give smaller shops Continuous Availability without having to know or configure Microsoft Clustering.
Requirements, Interoperability and Preparation:
There is an extensive amount of documentation around Fault Tolerance Configuration Requirements, Fault Tolerance Interoperability, and Preparing Your Cluster and Hosts for Fault Tolerance. Therefore, I will quote the following from the above links.
Fault Tolerance Configuration Requirements


Cluster Prerequisites
Unlike VMware HA which, by default, protects every virtual machine in the cluster, VMware Fault Tolerance is enabled on individual virtual machines. For a cluster to support VMware Fault Tolerance, the following prerequisites must be met:

■     VMware HA must be enabled on the cluster. Host Monitoring should also be enabled. If it is not, when Fault Tolerance uses a Secondary VM to replace a Primary VM no new Secondary VM is created and redundancy is not restored.
■     Host certificate checking must be enabled for all hosts that will be used for Fault Tolerance. See Enable Host Certificate Checking.
■     Each host must have a VMotion and a Fault Tolerance Logging NIC configured. See Configure Networking for Host Machines.
■     At least two hosts must have processors from the same compatible processor group. While Fault Tolerance supports heterogeneous clusters (a mix of processor groups), you get the maximum flexibility if all hosts are compatible. See the VMware knowledge base article at http://kb.vmware.com/kb/1008027 for information on supported processors.
■     All hosts must have the same ESX/ESXi version and patch level.
■     All hosts must have access to the virtual machines’ datastores and networks.”
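
To make that checklist a little more concrete, here is a minimal Python sketch that walks through the cluster prerequisites quoted above. It does not talk to vCenter at all. The `Host` and `Cluster` records, their field names, and the `cluster_ft_problems` function are all names I made up for illustration, and you would have to fill them in from your own inventory data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Host:
    # Hypothetical inventory record; these fields are not a VMware API.
    name: str
    build: str                    # ESX/ESXi version plus patch level
    cpu_group: str                # label for the host's processor group
    cert_checking: bool           # host certificate checking enabled
    vmotion_nic: bool
    ft_logging_nic: bool
    datastores: FrozenSet[str]
    networks: FrozenSet[str]


@dataclass
class Cluster:
    ha_enabled: bool
    host_monitoring: bool
    hosts: List[Host]


def cluster_ft_problems(cluster, vm_datastores, vm_networks):
    """Return a list of unmet cluster prerequisites (empty means none found)."""
    problems = []
    if not cluster.ha_enabled:
        problems.append("VMware HA is not enabled on the cluster")
    if not cluster.host_monitoring:
        problems.append("Host Monitoring is not enabled")
    for host in cluster.hosts:
        if not host.cert_checking:
            problems.append(f"{host.name}: host certificate checking is off")
        if not (host.vmotion_nic and host.ft_logging_nic):
            problems.append(f"{host.name}: needs VMotion and FT Logging NICs")
        # Every host must see the VM's datastores and networks.
        if not (vm_datastores <= host.datastores and vm_networks <= host.networks):
            problems.append(f"{host.name}: cannot reach all VM datastores/networks")
    # All hosts must share one version and patch level.
    if len({host.build for host in cluster.hosts}) > 1:
        problems.append("hosts differ in ESX/ESXi version or patch level")
    # At least two hosts must come from the same processor group.
    group_sizes = Counter(host.cpu_group for host in cluster.hosts)
    if not any(size >= 2 for size in group_sizes.values()):
        problems.append("no two hosts share a compatible processor group")
    return problems
```

An empty list only means this sketch found nothing wrong. The official compliance check in the vSphere Client is still the authority.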


Host Prerequisites
A host can support fault tolerant virtual machines if it meets the following requirements.

■     A host must have processors from the FT-compatible processor group. See the VMware knowledge base article at http://kb.vmware.com/kb/1008027.
■     A host must be certified by the OEM as FT-capable. Refer to the current Hardware Compatibility List (HCL) for a list of FT-supported servers (see http://www.vmware.com/resources/compatibility/search.php).
■     The host configuration must have Hardware Virtualization (HV) enabled in the BIOS. Some hardware manufacturers ship their products with HV disabled. The process for enabling HV varies among BIOSes. See the documentation for your hosts’ BIOSes for details on how to enable HV. If HV is not enabled, attempts to power on a fault tolerant virtual machine produce an error and the virtual machine does not power on.

Before Fault Tolerance can be turned on, a virtual machine must meet minimum requirements.
■     Virtual machine files must be stored on shared storage. Acceptable shared storage solutions include Fibre Channel, (hardware and software) iSCSI, NFS, and NAS.
■     Virtual machines must be stored in virtual RDM or virtual machine disk (VMDK) files that are thick provisioned with the Cluster Features option. If a virtual machine is stored in a VMDK file that is thin provisioned or thick provisioned without clustering features enabled and an attempt is made to enable Fault Tolerance, a message appears indicating that the VMDK file must be converted. Users can accept this automatic conversion (which requires the virtual machine to be powered off), allowing the disk to be converted and the virtual machine to be protected with Fault Tolerance. The amount of time needed for this conversion process can vary depending on the size of the disk and the host’s processor type.
■     Virtual machines must be running on one of the supported guest operating systems. See the VMware knowledge base article at http://kb.vmware.com/kb/1008027 for more information.”
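
The virtual machine requirements above boil down to three questions: is the VM on shared storage, are its disks in a supported format, and is the guest OS on the supported list? Here is a rough Python sketch of that logic. The `Disk` and `VM` records, the `"vmdk"` and `"virtual_rdm"` labels, and the function names are my own assumptions for illustration. The set of supported guests is something you would build yourself from the knowledge base article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disk:
    # Hypothetical disk record; "kind" labels are illustrative only.
    kind: str                      # "vmdk" or "virtual_rdm"
    thick: bool = True
    cluster_features: bool = True


@dataclass
class VM:
    name: str
    guest_os: str
    on_shared_storage: bool
    disks: List[Disk]


def disk_needs_conversion(disk):
    """A VMDK that is thin, or thick without cluster features, must be converted."""
    return disk.kind == "vmdk" and not (disk.thick and disk.cluster_features)


def vm_ft_report(vm, supported_guests):
    """Return (blockers, indexes of disks needing conversion)."""
    blockers = []
    if not vm.on_shared_storage:
        blockers.append("virtual machine files are not on shared storage")
    if vm.guest_os not in supported_guests:
        blockers.append(f"guest OS {vm.guest_os!r} is not on the supported list")
    for index, disk in enumerate(vm.disks):
        if disk.kind not in ("vmdk", "virtual_rdm"):
            blockers.append(f"disk {index}: unsupported disk type {disk.kind!r}")
    # Conversion is offered automatically, but the VM must be powered off for it.
    to_convert = [i for i, d in enumerate(vm.disks) if disk_needs_conversion(d)]
    return blockers, to_convert
```

I kept conversion candidates separate from blockers because, per the quoted text, a thin or non-clustered VMDK does not stop you outright. You are offered a conversion, which requires powering the VM off.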

Basically, you need to make sure your cluster is set up correctly, that your CPUs and NICs are supported, and that the networking is set up for FT. Additionally, you need to make sure your VM guest is supported and configured with supported virtual hardware, such as checking the Fault Tolerance support option on the hard drives as you create them. I provided the full list above so that everyone is on the same page, with a full understanding of the necessary requirements. Here are a couple of screenshot examples. The first is of the network configuration, and the next is of the VM’s hard drive.
[Screenshot ft1: network configuration]
[Screenshot ft2: VM hard drive settings]
Fault Tolerance Interoperability


The following vSphere features are not supported for fault tolerant virtual machines.
■     Snapshots. Snapshots must be removed or committed before Fault Tolerance can be enabled on a virtual machine. In addition, it is not possible to take snapshots of virtual machines on which Fault Tolerance is enabled.
■     Storage VMotion. You cannot invoke Storage VMotion for virtual machines with Fault Tolerance turned on. To migrate the storage, you should temporarily turn off Fault Tolerance, and perform the storage VMotion action. When this is complete, you can turn Fault Tolerance back on.
■     DRS features. A fault tolerant virtual machine is automatically configured as DRS-disabled. DRS does initially place a Secondary VM, however, DRS does not make recommendations or load balance Primary or Secondary VMs when load balancing the cluster. The Primary and Secondary VMs can be manually migrated during normal operation.”
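
The Storage VMotion workaround above is really an ordering rule: turn Fault Tolerance off, migrate, then turn it back on. Here is a small Python sketch of that sequence. `FakeVM` is a stand-in I invented that only records what was asked of it, and its method names are not a real VMware API. Turning FT back on even when the migration fails is my own design choice, not something the documentation prescribes.

```python
class FakeVM:
    """Stand-in for a VM handle; it records operations instead of calling vCenter."""

    def __init__(self, name, datastore, snapshots=0, ft_on=False):
        self.name = name
        self.datastore = datastore
        self.snapshots = snapshots
        self.ft_on = ft_on
        self.log = []

    def turn_on_ft(self):
        # Snapshots must be removed or committed before FT can be enabled.
        if self.snapshots:
            raise RuntimeError("remove or commit snapshots before turning on FT")
        self.ft_on = True
        self.log.append("ft_on")

    def turn_off_ft(self):
        self.ft_on = False
        self.log.append("ft_off")

    def take_snapshot(self):
        if self.ft_on:
            raise RuntimeError("snapshots are not supported on FT virtual machines")
        self.snapshots += 1
        self.log.append("snapshot")

    def storage_vmotion(self, target):
        if self.ft_on:
            raise RuntimeError("Storage VMotion is not supported with FT turned on")
        self.datastore = target
        self.log.append(f"svmotion:{target}")


def migrate_storage(vm, target):
    """Temporarily turn FT off, move the storage, then restore FT protection."""
    was_protected = vm.ft_on
    if was_protected:
        vm.turn_off_ft()
    try:
        vm.storage_vmotion(target)
    finally:
        # Assumption: restore protection whether or not the move succeeded.
        if was_protected:
            vm.turn_on_ft()
```

Keep in mind that the VM runs unprotected for the whole migration window, so pick your timing accordingly.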

Here is a link with some other features that are incompatible; it is worth the read.
That covers most of the requirements. Additionally, you want to make sure host certificate checking is enabled, and that you have configured everything correctly as stated above.

Here are the steps from Preparing Your Cluster and Hosts for Fault Tolerance.

The tasks you should complete before attempting to enable Fault Tolerance for your cluster include:
■     Enable host certificate checking (if you are upgrading from a previous version of Virtual Infrastructure)
■     Configure networking for each host
■     Create the VMware HA cluster, add hosts, and check compliance

Whew, that section was fairly boring, my apologies.

Setup and Configuration:

Let’s hear the drum roll, & let’s enable Fault Tolerance! Here are the steps.
Turn On Fault Tolerance for Virtual Machines

  1. Right-click on the VM and select Fault Tolerance.
  2. Click Turn On Fault Tolerance.


Yep, too easy. Yes, I made those steps up by enabling Fault Tolerance without following the directions. Please check the above link for official directions.
Wait, if you run into problems while powering on, where do you start?
VMware has a page of answers titled Turning On Fault Tolerance for Virtual Machines, which begins: “The option to turn on Fault Tolerance is unavailable (grayed out) if any of these conditions apply”
I chose not to list all the possibilities here, but I wanted to provide the link for those of you who want to try this out and run into problems. The link also talks about validation checks, and how Fault Tolerance behaves with a powered-on VM versus a powered-off VM.
VMware also has Fault Tolerance Best Practices, VMware Fault Tolerance Configuration Recommendations, and Troubleshooting Fault Tolerance documentation.

Benefits vs. Pitfalls:
VMware Fault Tolerance lets you protect your VM by basically running a second copy. This allows for great protection as you need it, or on a 24/7 basis. That alone may be well worth ANY pitfalls. However, there are some system requirements, it does take two hosts, and it has some performance impact. See VMware vSphere 4 Fault Tolerance: Architecture and Performance & Comparing Fault Tolerance Performance & Overhead Utilizing VMmark v1.1.1
Overall, I would recommend you take a look at meeting the requirements if you see a use or need for Fault Tolerance and are architecting a new VMware vSphere environment.

Update: a new VMware blog post titled Comparing Performance of 1vCPU Nehalem VM with 2vCPU Harpertown VM compares performance and is worth a read. I quoted a tidbit from the end.

“Conclusion

A 1vCPU Xeon X5500 series based Exchange Server VM can support 50% more users per core than a 2vCPU VM based on previous generation processors while maintaining the same level of performance in terms of Sendmail latency.  This is accomplished while the VM’s CPU utilization remains below 50%, allowing plenty of capacity for peaks in workload and making an FT VM practical for use with Exchange Server 2007”

Failover demo:
Here I am showing an Exchange 2010 VM running loadgen, and a VMware VDI client pinging the Exchange VM while I do a test failover and as I restart the Secondary VM. * Note: this is a fresh Exchange install with no configuration done, hence the exceptions in loadgen. And no, I don’t normally use Facebook or play poker within VMs.

Thanks for Reading!
I want to say thanks to Jason Boche for letting me have access to this lab for my testing. Thanks again! Without people like him, what would the world be like?
Roger L.