Problem Investigation: Virtual Machines Intermittently Hang at Blank Screen on Reboot

Estimated reading time: 8 minutes

Troubleshooting virtual machines (VMs) can often be a complex task, particularly when dealing with intermittent issues that don’t consistently reproduce. The problem could originate from various sources – be it the host system, the virtual machines themselves, the guest operating systems, or external factors. This blog post details a real-world problem investigation where VMs experienced sporadic freezes during reboot, resulting in Production impact. In addition, we will outline the investigative steps taken to pinpoint the root cause and the solution implemented to resolve the problem effectively.

Symptoms

During scheduled reboot activities, a given VM gracefully terminates services and completes the “power-down” phase of the reboot operation, but may not return to an online state. The result is a blank/black screen, with the VM remaining in a halted state. Note that a reboot operation differs from a full power cycle: the VM is never truly “powered off” during a graceful reboot.
To restore service, a privileged user must authenticate to the VMware vCenter portal (or use the CLI or PowerCLI scripts) and initiate a hard power cycle of the VM. This typically happens only after alerts have been dispatched to the responsible team(s), which extends the service disruption.
The symptoms surface intermittently: the same VM(s) that hit the issue will reboot successfully without incident following a full power cycle.

Related Factors

Each of the affected virtual machines runs Windows Server 2016 and is fully patched.
- The scope of this investigation is limited to an environment which is predominantly served by Windows Server 2016 VMs, although it is possible that other operating system editions would be equally susceptible. This is not intended to be a limiting factor on its own, simply an observation.
The scheduled reboot activities are sometimes related to routine operating system patching, but the issue has also been observed during reboot cycles for unrelated activities.

Research and Review of Documentation

Following a broad search online, we located VMware KB 90493, which accurately describes the symptoms and provides a set of criteria to help administrators and engineers gauge applicability. Per the VMware KB, a given virtual machine may halt at a black screen upon reboot if all of the following criteria are met:

The Guest Operating System (OS) operates on 64-bit CPU architecture.
The VM object and Guest Operating System (OS) are configured to boot with the Extensible Firmware Interface (EFI), rather than the legacy BIOS selection.
The VM is configured with more than 1 vCPU.
The VM is configured with VM Hardware Version 13 or older.
- Note that in this sense, the term “older” implies a version less than or equal to version 13.
The VM was started on a Host operating vSphere/ESXi version 7.0 U2 or 7.0 U3.
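
The criteria above can be expressed as a simple all-must-match check. The following is a minimal sketch rather than anything taken from the KB: the `VmProfile` class, its field names, and the version strings are hypothetical, and you would need to populate them from your own inventory data.

```python
from dataclasses import dataclass

# Host releases named in the criteria above (string format is an assumption).
AFFECTED_HOST_VERSIONS = {"7.0 U2", "7.0 U3"}


@dataclass
class VmProfile:
    """Hypothetical container for the attributes the criteria refer to."""
    guest_is_64bit: bool
    firmware: str            # "efi" or "bios"
    num_vcpus: int
    hardware_version: int    # e.g. 13 for vmx-13
    host_esxi_version: str   # e.g. "7.0 U2"


def matches_kb_criteria(vm: VmProfile) -> bool:
    """Return True only if every criterion listed above applies to the VM."""
    return (
        vm.guest_is_64bit
        and vm.firmware.lower() == "efi"
        and vm.num_vcpus > 1
        and vm.hardware_version <= 13
        and vm.host_esxi_version in AFFECTED_HOST_VERSIONS
    )
```

A VM that fails any single check (for example, one already at VM Hardware version 14) falls outside the scope described by the KB.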

At the time of investigation, every criterion applied to the impacted environment. Per the KB, we set out to validate further by reviewing logs, specifically the “vmware.log” file for each VM. You can locate this file by browsing the respective datastore in vCenter and opening the directory named after the scoped VM(s), or by selecting a given VM in vCenter, clicking Actions > Export System Logs, and browsing the downloaded contents.

Per the KB, the BIOS UUID should be missing from the vmware.log file at the time when the VM rebooted and proceeded to hang at the blank screen. As you may suspect, the BIOS UUID is accurately recorded within the log file when the VM reboots successfully without issue.

Predictably, this behavior was present in our environment and the BIOS UUID was not recorded in the logs when the VM rebooted unsuccessfully. Therefore, the symptoms described in the KB match and have been validated through log review.

When reviewing logs, there are a couple of things to keep in mind and search for:

Timestamps in the log are recorded in Zulu time (UTC), which is unlikely to align with your native time zone. Keep this in mind when correlating timestamps with events.
The first marker to search for is “vcpu-0 - CPU reset: hard (mode Emulation)”, which indicates a reboot action. Search the log file for this string and correlate it with the timestamp. You want to find an instance of this log record from a time when the VM rebooted and froze at a blank screen.
The second marker will read similar to “vcpu-0 - Guest: EFI ROM version: VMW71.00V.18227214.B64.2106252220 (64-bit RELEASE)”. Since version numbers can vary, it is recommended to search for “Guest: EFI ROM version” instead. In our case, this particular string was located 29 lines below the first marker described in the point above.
The line directly beneath the “EFI ROM version” line should reflect the BIOS UUID, which will appear similar to “vcpu-0 - BIOS-UUID is 42 25 b2 56 f1 fd 5a 2f-5b 76 88 67 eb 85 54 2b”, although your specific value will be unique.
- If the BIOS UUID line is missing, then you have accurately identified a specific instance of this issue. If the BIOS UUID appears in the log, proceed to additional instances, again searching for dates and times when the VM froze on reboot, and review logs once again.
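
For larger log files, the search described above can be scripted. The sketch below is one possible approach under a few assumptions: it matches only on the marker substrings quoted in this post, it assumes each log line starts with an ISO-8601 timestamp ending in “Z”, and the function names are my own rather than part of any VMware tooling.

```python
import re
from datetime import datetime, timezone

RESET_MARKER = "CPU reset: hard"
EFI_MARKER = "Guest: EFI ROM version"
UUID_MARKER = "BIOS-UUID is"

# Assumed line prefix, e.g. "2023-01-01T10:00:00.123Z| vcpu-0| ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def parse_utc(line):
    """Return the line's leading timestamp as an aware UTC datetime, or None."""
    match = TIMESTAMP_RE.match(line)
    if not match:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def find_suspect_reboots(lines):
    """Return (line_number, utc_timestamp) for each hard reset whose
    EFI ROM version line is not directly followed by a BIOS-UUID line."""
    suspects = []
    resets = [i for i, line in enumerate(lines) if RESET_MARKER in line]
    for n, start in enumerate(resets):
        # Each reset "owns" the lines up to the next reset (or end of file).
        end = resets[n + 1] if n + 1 < len(resets) else len(lines)
        efi = next((i for i in range(start, end) if EFI_MARKER in lines[i]), None)
        if efi is None:
            continue  # no EFI ROM version line logged; nothing to compare
        following = lines[efi + 1] if efi + 1 < end else ""
        if UUID_MARKER not in following:
            suspects.append((start + 1, parse_utc(lines[start])))
    return suspects
```

To use it, read the file with something like `open("vmware.log", errors="replace").read().splitlines()` and pass the result to `find_suspect_reboots`. Calling `.astimezone()` on a returned timestamp converts it from Zulu to the local time zone of the machine running the script, which helps when correlating with alert times.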

Solution Assessment

Per the VMware KB, this issue is resolved in vSphere/ESXi 8.0. That said, upgrading between major releases of the hypervisor is seldom the path of least resistance, especially when considering factors such as hardware compatibility and licensing. If upgrading to vSphere 8.0 is a near-term goal for your environment, then you may find this solution acceptable and simply choose to address future instances of this issue as needed (via hard power cycle) until the upgrade is complete.

Assuming your environment serves Production workloads, it is likely not feasible to wait until the hosts are upgraded to address the issue. In this case, you have two supported options:

Option One (Preferred)

Upgrade the Virtual Hardware version of the scoped virtual machines to version 14 or greater. Although this is generally a quick and low-impact activity, it is highly recommended to capture backups (recall that snapshots are not backups) and to perform the upgrade only during planned maintenance windows.

The most straightforward (ironically also the slowest) method of performing the VM Hardware upgrade is to open the scoped VM in vCenter and upgrade it there. PowerCLI can orchestrate the upgrade more efficiently across multiple VMs, at the cost of added complexity. Note that a given VM will be upgraded to the highest level supported by the parent hypervisor when the upgrade is performed.

Option Two (Work-around)

Modify the .vmx file for the scoped VM(s) to include the two configuration parameters specified in the VMware KB. Although this option carries lower risk than upgrading VM Hardware versions, you are unlikely to recall at a later time that this override was set, which can cause complications down the road. I would only recommend this option if upgrading between VM Hardware versions is unsuccessful, or if your hypervisors are not operating above vSphere/ESXi 6.5. Note that the supported VM Hardware versions are directly linked to the hypervisor version: while backward compatibility is supported, you cannot upgrade to or operate a VM with a VM Hardware version higher than the parent hypervisor supports. The relationship between hypervisor and VM Hardware versions is documented in VMware KB 1003746.

Considerations for Upgrading VM Hardware

Assuming you plan to implement “Option One” above by upgrading the VM Hardware version for each VM, there are a few factors to consider:

Upgrading the VM Hardware version can only be achieved while the VM is in a powered-off state. Although it is possible to “schedule” the upgrade to happen automatically on the next power cycle, the point remains – the VM must be powered-off in order to upgrade.
VM backups should be captured before the upgrade as a safety net. Snapshots are not backups!
Remain cognizant of the lowest common denominator between host environments. For example, if you have two environments, be they clusters or sites, and the hypervisor version differs between them (e.g., Site A operates ESXi 6.7 and Site B operates ESXi 7.0), upgrading the VM Hardware version to 17 would leave the VM incompatible with the hosts in Site A, where it would be unable to power on. Upgrading the VM Hardware version to 14, however, is acceptable thanks to backward compatibility.
Validation is important, especially when upgrading VM Hardware versions in bulk. Ensure each VM boots normally, and hard disks map as expected, especially on systems which are sensitive to disk ordering.
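
The “lowest common denominator” consideration above amounts to taking the minimum of the highest VM Hardware version each host release supports. Here is a minimal sketch of that arithmetic. The version table reflects my understanding of VMware's published compatibility data and should be verified against VMware KB 1003746 before you rely on it; the function names and version string format are hypothetical.

```python
# Highest VM Hardware version per ESXi release.
# Assumption: values as I understand them; confirm against VMware KB 1003746.
MAX_HW_VERSION = {
    "6.5": 13,
    "6.7": 14,
    "6.7 U2": 15,
    "7.0": 17,
    "7.0 U1": 18,
    "7.0 U2": 19,
    "7.0 U3": 19,
    "8.0": 20,
}


def highest_common_hw_version(host_versions):
    """Highest VM Hardware version that every listed host release can run."""
    if not host_versions:
        raise ValueError("at least one host version is required")
    return min(MAX_HW_VERSION[version] for version in host_versions)


def safe_upgrade_target(host_versions, minimum=14):
    """Return a target version that fixes the issue (>= minimum) and still
    runs on every host, or None if no such version exists."""
    ceiling = highest_common_hw_version(host_versions)
    return ceiling if ceiling >= minimum else None
```

For the Site A / Site B example, `highest_common_hw_version(["6.7", "7.0"])` yields 14, which matches the conclusion above: version 14 is safe for both sites, while version 17 is not.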
