Warning: SolidFire Element OS 10.1 + vSphere Round Robin PSP + VVOLs = problems
During our recent Element OS 10.1 upgrade, we experienced a catastrophic failure between vSphere and SolidFire. Although the SolidFire array stayed online, IO between vSphere and the array slowed to a crawl, leading to non-responsive hosts and VM workloads. In short, nothing worked in this cluster after the upgrade. SolidFire support said they had seen this only once before our case, and although there is a known issue with Round Robin and VVOLs, it had not been this bad. The solution is to set up a Fixed path selection policy (PSP) rule in vSphere, moving away from SolidFire’s best practice of Round Robin. This is only a mitigation, however; the fix will be in a later version of Element OS.
Our vSphere environment follows the SolidFire best practices, which call for the Round Robin path selection policy (PSP) on all SolidFire paths in vSphere. In this configuration with VVOLs, the upgrade to 10.1 is a problem. We ran for about a year on 9.2 without issue using VVOLs and Round Robin PSP. We use VVOLs exclusively and experienced extended downtime until we switched to a Fixed PSP.
SolidFire has a known issue documenting the problem, but because VVOL adoption is still quite low, so is the number of affected environments. According to SolidFire support, traditional VMFS datastores are not affected; it is solely VVOLs.
Mitigation
Before the Element OS 10.1 upgrade, you’ll need to create a rule that assigns the Fixed PSP to all SolidFire paths. If you previously created a rule with the Round Robin PSP, you’ll have to remove that rule first. If you try to create a Fixed rule with ESXCLI while the Round Robin rule still exists, you get an ambiguous error in ESXi.
Check for the Round Robin rule with:
[code]esxcli storage nmp satp rule list[/code]
or
[code]esxcli storage nmp satp rule list | grep Solid[/code]
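If you would rather script the check than read the output by eye, a minimal sketch might look like the following. It assumes the vendor string `SolidFir` used in the commands in this post and that the `grep` in your ESXi shell supports `-o`; the function name `solidfire_rule_psp` is made up for illustration.

```shell
#!/bin/sh
# Sketch: report which PSP the SolidFire SATP claim rule currently assigns.
# Assumption: the rule's vendor string is "SolidFir", as in this post.
# solidfire_rule_psp is a hypothetical helper name, not part of ESXCLI.

solidfire_rule_psp() {
    # Keep only the rule lines that mention the vendor string,
    # then pull out the PSP name (e.g. VMW_PSP_RR or VMW_PSP_FIXED).
    esxcli storage nmp satp rule list \
        | grep 'SolidFir' \
        | grep -o 'VMW_PSP_[A-Z_]*' \
        | sort -u
}
```

Calling `solidfire_rule_psp` on a host prints the PSP named in any SolidFire rule and prints nothing if no such rule exists.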
Your actual remove command may vary (if, for instance, you do not have the IOPS=10 option). To remove an existing rule, use a version of the following ESXCLI command:
[code]esxcli storage nmp satp rule remove -s VMW_SATP_DEFAULT_AA -P VMW_PSP_RR -V "SolidFir" -M "SSD SAN" -e "SolidFire custom SATP rule" -O iops=10[/code]
To add the Fixed PSP rule for Solidfire Element OS 10.1 + VVOLs config:
[code]esxcli storage nmp satp rule add -s VMW_SATP_DEFAULT_AA -P VMW_PSP_FIXED -V "SolidFir" -M "SSD SAN" -e "SolidFire custom SATP rule"[/code]
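Because the order matters (the Round Robin rule has to be gone before the Fixed rule is added), the two steps can be wrapped together. The following is only a sketch under the assumption that your existing rule was created with exactly the options shown above, including `iops=10`; the function and variable names are hypothetical, and you should adjust the remove arguments to match your own rule.

```shell
#!/bin/sh
# Sketch: swap the SolidFire claim rule from Round Robin to Fixed on one host.
# Assumption: the existing rule matches these options exactly (incl. iops=10).
# swap_solidfire_rule_to_fixed and the variables below are illustrative names.

VENDOR="SolidFir"
MODEL="SSD SAN"
DESC="SolidFire custom SATP rule"

swap_solidfire_rule_to_fixed() {
    # Remove the old Round Robin rule; stop here if the removal fails.
    esxcli storage nmp satp rule remove -s VMW_SATP_DEFAULT_AA -P VMW_PSP_RR \
        -V "$VENDOR" -M "$MODEL" -e "$DESC" -O iops=10 || return 1

    # Add the Fixed rule only after the Round Robin rule is gone.
    esxcli storage nmp satp rule add -s VMW_SATP_DEFAULT_AA -P VMW_PSP_FIXED \
        -V "$VENDOR" -M "$MODEL" -e "$DESC" || return 1

    # Print the resulting SolidFire rule so the change can be confirmed.
    esxcli storage nmp satp rule list | grep "$VENDOR"
}
```

Stopping at the first failure avoids attempting the add while the Round Robin rule is still in place, which is the case that produces the ambiguous error described above.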
It is imperative to note that the command above is a MITIGATION CONFIGURATION ONLY and is not the recommended best-practice configuration outside of this specific situation.
For general best practices, check out Josh Atwell and Aaron Patten’s PowerShell script for setting up SolidFire best practices: https://github.com/solidfire/PowerShell/blob/master/VMware/SolidFire-VMware-Best-Practices.ps1. Their code also has examples of how to remove a rule in PowerShell, if you prefer that method to ESXCLI on a host.