vSAN Component Failure State - Degraded vs Absent

vSAN Component Failure State - Degraded vs Absent - Part II

Cause and Recovery of the degraded components

Scenario 1

Cause :

A capacity-tier SSD or magnetic disk drive fails in a Virtual SAN cluster. Because the failure is permanent, the disk and all the components stored on it are marked as DEGRADED.

Behind the scenes:

If the VM has a policy that includes NumberOfFailuresToTolerate=1 or greater, the VM’s objects will still be accessible.

The disk state is marked as DEGRADED and can be verified via vSphere web client UI.

At this point, all in-flight I/O is halted while Virtual SAN reevaluates the availability of the object without the failed component as part of the active set of components.

If Virtual SAN concludes that the object is still available (based on an available full mirror copy and witness), all in-flight I/O is restarted.
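The availability check described above can be sketched in a few lines of Python. This is only an illustrative model, not Virtual SAN's actual implementation: the `Component` class, its fields, and the `object_available` function are hypothetical names. It assumes that each mirror copy is a single component (no striping), and that an object stays available when at least one full mirror copy is healthy and more than half of the component votes are still present.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical model of one component of a Virtual SAN object."""
    name: str
    kind: str       # "replica" (full mirror copy) or "witness"
    votes: int
    healthy: bool


def object_available(components):
    """Return True if the modeled object is still accessible.

    Assumption: availability needs one healthy full mirror copy
    plus a strict majority of the votes.
    """
    total_votes = sum(c.votes for c in components)
    healthy_votes = sum(c.votes for c in components if c.healthy)
    has_full_copy = any(c.kind == "replica" and c.healthy for c in components)
    return has_full_copy and healthy_votes * 2 > total_votes


# NumberOfFailuresToTolerate=1: two mirror copies and a witness, one copy lost.
ftt1 = [
    Component("replica-a", "replica", 1, healthy=False),
    Component("replica-b", "replica", 1, healthy=True),
    Component("witness", "witness", 1, healthy=True),
]

# NumberOfFailuresToTolerate=0: a single copy, and it is on the failed disk.
ftt0 = [Component("replica-a", "replica", 1, healthy=False)]
```

In this model `object_available(ftt1)` is true, so I/O would resume, while `object_available(ftt0)` is false, which matches the inaccessible VMDK case.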

The typical time from physical removal of the drive, through Virtual SAN processing the event and marking the component DEGRADED, to halting and restoring the I/O flow is approximately 5-7 seconds.

Virtual SAN now looks for any hosts and disks that can satisfy the object requirements. This includes adequate free disk space and placement rules (e.g. 2 mirrors may not share the same hosts/fault domains). If such resources are found, Virtual SAN will create new components on them and start the recovery process immediately.
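As a rough illustration of that search, the Python sketch below filters candidate disks by free space and by the placement rule. The `Disk` class and the `find_rebuild_targets` function are hypothetical names used only for this example, and the real placement logic in Virtual SAN takes more factors into account.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical view of a capacity disk as a rebuild candidate."""
    host: str
    fault_domain: str
    free_gb: float
    healthy: bool = True


def find_rebuild_targets(disks, needed_gb, used_fault_domains):
    """Return the disks that could hold the replacement component.

    A candidate must be healthy, have enough free space, and must not
    sit in a fault domain that already holds a surviving component
    of the same object.
    """
    return [
        d for d in disks
        if d.healthy
        and d.free_gb >= needed_gb
        and d.fault_domain not in used_fault_domains
    ]


disks = [
    Disk("esx01", "fd1", free_gb=500),                  # holds the surviving mirror
    Disk("esx02", "fd2", free_gb=40),                   # too little free space
    Disk("esx03", "fd3", free_gb=800),                  # valid target
    Disk("esx04", "fd4", free_gb=900, healthy=False),   # failed disk
]

targets = find_rebuild_targets(disks, needed_gb=100, used_fault_domains={"fd1"})
```

Here only the disk on `esx03` qualifies. If the list came back empty, the component would stay DEGRADED until resources become available.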

If the VM Storage Policy has NumberOfFailuresToTolerate=0, the VMDK will be inaccessible if any one of the VMDK components (think one component of a stripe) resides on the failed disk. This will require a restore of the VM from a known good backup.

Solution :

Replace the failed capacity disk drive on the host.

Scenario 2

Cause :
A cache-tier SSD fails in a Virtual SAN cluster, so the entire disk group (including its capacity disks) is marked as DEGRADED. Virtual SAN interprets the failure of a single flash caching device as a failure of the entire disk group, as the failure is permanent (disk is offline, no longer visible, etc.).
Behind the scenes:
The sequence is the same as in Scenario 1; only the second step changes for the cache disk:
The disk group and all the disks under it are marked as DEGRADED, which can be verified via the vSphere web client UI.
Solution :
Replace the failed cache disk drive on the host.
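
The difference between Scenario 1 and Scenario 2 is the scope of the failure. The minimal Python sketch below models it under the behavior described in this post; the `DiskGroup` class, the `degraded_devices` function, and the device names are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DiskGroup:
    """Hypothetical model: one cache device plus its capacity devices."""
    cache_device: str
    capacity_devices: list


def degraded_devices(group, failed_device):
    """Return the devices marked DEGRADED after a permanent failure."""
    if failed_device == group.cache_device:
        # Scenario 2: losing the cache device takes the whole group down.
        return [group.cache_device, *group.capacity_devices]
    if failed_device in group.capacity_devices:
        # Scenario 1: only the failed capacity device is affected.
        return [failed_device]
    # The device does not belong to this disk group.
    return []


group = DiskGroup("cache-ssd-0", ["cap-0", "cap-1", "cap-2"])
capacity_failure = degraded_devices(group, "cap-1")
cache_failure = degraded_devices(group, "cache-ssd-0")
```

With these inputs, `capacity_failure` contains only `cap-1`, while `cache_failure` contains the cache device and all three capacity devices.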

Scenario 3

Cause:
When a storage I/O controller fails, Virtual SAN behaves much as if all flash cache devices and all disks in all disk groups had failed. Because this is a permanent error, the components are marked as DEGRADED and component rebuilding should start immediately.

If there are multiple disk groups on a host with a single controller, and all devices in all of those disk groups are impacted, then you might assume that the common controller is the root cause.
If there is a single disk group on a host with a single controller, and all devices in that disk group are impacted, additional research will be necessary to determine whether the storage I/O controller is the culprit or the flash cache device is at fault.
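
That reasoning can be written down as a small triage helper. This is a sketch under stated assumptions, not a diagnostic tool: the `suspect` function and its return strings are hypothetical, it assumes a host with a single controller, and the last branch (only some disk groups impacted) is an inference from Scenario 2 rather than something stated above.

```python
def suspect(total_groups, fully_impacted_groups):
    """Suggest where to look first on a host with a single controller.

    total_groups: number of disk groups behind the controller.
    fully_impacted_groups: disk groups in which every device is impacted.
    """
    if total_groups < 1 or not 0 <= fully_impacted_groups <= total_groups:
        raise ValueError("inconsistent disk group counts")
    if fully_impacted_groups == 0:
        return "no disk group fully impacted"
    if total_groups == 1:
        # A single group cannot tell the two causes apart.
        return "ambiguous: controller or cache device, research further"
    if fully_impacted_groups == total_groups:
        return "controller suspected"
    # Inference: healthy groups on the same controller point away from it.
    return "cache device suspected"


multi_group_result = suspect(total_groups=2, fully_impacted_groups=2)
single_group_result = suspect(total_groups=1, fully_impacted_groups=1)
```

The result is only a starting point for troubleshooting; the controller and device logs on the host still have to confirm the actual cause.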
