
Aria Operations (formerly vRealize Operations, or vROps) provides tremendous insight into your infrastructure by centralizing access to, and visualization of, seemingly countless performance metrics and attributes. Like similar platforms, the intelligence and depth of performance data can be expanded by installing and configuring Management Packs (MPs), which may be provided by VMware directly or offered by a trusted hardware/software partner. Connecting Management Packs to the various components of your environment gives Operations the access it needs to collect metrics, enable alerts, and enrich the “single pane of glass” experience.
One of my favorite Operations Management Packs is the “NVIDIA Virtual GPU Management Pack for VMware Aria Operations”, which can be downloaded from the NVIDIA Enterprise Portal. This Management Pack provides additional insight into each supported GPU installed in your vSphere Hosts, which is highly desirable for troubleshooting and trending utilization. I’ll cover the features in greater detail in a forthcoming blog post – stay tuned.
Today’s post investigates an issue where the Management Pack is installed and configured, but only one vSphere Host is reporting data according to the provided NVIDIA dashboards for Operations.
Symptoms
After uploading and installing the “NVIDIA Virtual GPU Management Pack for VMware Aria Operations” and supplying known-good credentials for a privileged user account, the provided NVIDIA Dashboards only show performance metrics and attributes for a subset of vSphere Hosts equipped with a supported GPU, or none at all.
Troubleshooting
First, ensure you have downloaded the latest version of the Management Pack from the NVIDIA Enterprise Portal and that your version of Aria Operations is supported. The release notes and other version-specific details can be obtained from the NVIDIA Documentation Library, specifically: https://docs.nvidia.com/grid/vrops/index.html
Second, ensure that the user account provided when creating the Adapter Instance (Data Sources > Integrations > Accounts Tab) has adequate permissions on the vSphere Host(s) equipped with one or more GPUs. It is highly recommended to configure a Service Account (vSphere Local or LDAP) with least-privilege access. If you use a custom Role in vCenter, the only privilege required is “CIM Interaction”, located at: Host > CIM > CIM Interaction. Of course, you should confirm that the credentials are indeed valid and that the role is assigned to the user. You may assign the permission at various levels of the inventory; which level is most appropriate depends entirely on your environment.
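If you prefer to script this step, a CLI tool such as govc can create the custom role and assign it. The snippet below is only a minimal sketch: the role name, service account, and inventory path are hypothetical placeholders, and you should verify the exact flags against govc’s built-in help before running it in your environment.
# Create a custom role containing only the CIM Interaction privilege (role name is an example)
govc role.create NVIDIA-vGPU-MP Host.Cim.CimInteraction
# Grant the role to the service account on the GPU host (principal and inventory path are placeholders)
govc permissions.set -principal 'MYDOMAIN\svc-vrops-gpu' -role NVIDIA-vGPU-MP /MyDatacenter/host/MyCluster/myvspherehost.mydomain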
Generally, once the user account is configured on the “Integrations” page, metrics will begin populating within the first few collection cycles, which run on five-minute intervals by default. It may take a few collection cycles for the data to display correctly, so don’t be discouraged if the results are not immediately evident.
In my recent case, I double-checked all of the above, but Operations continued to supply GPU metrics for only one vSphere Host. Let’s dig a bit deeper…
If your Aria Operations environment is clustered, we need to identify which node is collecting/polling using the NVIDIA Adapter.
- Within the Aria Operations web UI, navigate to Data Sources > Integrations > Accounts Tab.
- Expand the “NVIDIA vGPU Adapter” line, and make note of the value displayed inside the “Collector” column.

- Navigate to Administration > Support Logs.
- Click on the “Group By” dropdown menu located beneath the “Support Logs” heading, then select “Node” from the short list of options. Notice that each node in your cluster will display in the tree view pane.
- Expand the node identified in the steps above – we want to focus on logs from that particular collector. Once expanded, a number of directories appear.
- Expand the “COLLECTOR” directory, then expand “adapters”, and finally, expand the “NvVGPUAdapter” directory. One or more log files will appear inside.

- Double-click on the most recent log file to view the contents. If you aren’t sure which is the most recent, start with the first log file and look for timestamps inside the log content.
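If you prefer the command line, the same adapter log can usually be read directly from the collector node over SSH. The path below assumes the typical Aria Operations log layout; adjust it if your deployment stores logs elsewhere.
# View the most recent NVIDIA adapter log entries on the collector node (path assumed)
tail -n 200 /storage/log/vcops/log/adapters/NvVGPUAdapter/*.log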
In this particular case, I observed the following log entries:
2024-02-23T18:43:04,478+0000 ERROR [Collector worker thread 17] (4025) com.nvidia.nvvgpu.adapter.NvVGPUAdapter.retryCollection - Error collecting data for host: myvspherehost.mydomain
com.nvidia.nvvgpu.adapter.exception.NvVGPUAdapterException: Received invalid response from CIM Provider. Error code: -6
at com.nvidia.nvvgpu.adapter.client.CimClient.invokeCimMethod(CimClient.java:121) ~[NvVGPUAdapter.jar:?]
at com.nvidia.nvvgpu.adapter.client.DcgmClient.getGroupInfo(DcgmClient.java:861) ~[NvVGPUAdapter.jar:?]
at com.nvidia.nvvgpu.adapter.client.DcgmClient.getHostConfig(DcgmClient.java:277) ~[NvVGPUAdapter.jar:?]
at com.nvidia.nvvgpu.adapter.NvVGPUAdapter.retryCollection(NvVGPUAdapter.java:471) ~[NvVGPUAdapter.jar:?]
at com.nvidia.nvvgpu.adapter.NvVGPUAdapter.onCollect(NvVGPUAdapter.java:294) ~[NvVGPUAdapter.jar:?]
at com.integrien.alive.common.adapter3.AdapterBase.collectBase(AdapterBase.java:767) ~[vrops-adapters-sdk.jar:?]
at com.integrien.alive.common.adapter3.AdapterBase.collect(AdapterBase.java:553) ~[vrops-adapters-sdk.jar:?]
at com.integrien.alive.collector.CollectorWorkItem3.run(CollectorWorkItem3.java:47) ~[vcops-collector-1.0-SNAPSHOT.jar:?]
at com.integrien.alive.common.util.ThreadPool$WorkerItem.run(ThreadPool.java:275) ~[vrops-adapters-sdk.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
This log entry implies that while authentication was successful, the Operations Collector did not receive a valid response from the CIM Provider, which is really an NVIDIA software component installed on the vSphere Hosts equipped with GPUs. In particular, we want to focus on the “nv-hostengine” process, the daemon behind NVIDIA’s Data Center GPU Manager (DCGM). It is installed and configured automatically when the GPU drivers are installed on the vSphere Host, and NVIDIA positions DCGM as a suite of tools for managing and monitoring GPUs rather than part of the graphics stack itself. Therefore, it may not be evident when the “nv-hostengine” process is not working correctly, as it is not responsible for the ongoing operations of the GPU: rendering images, etc. You can read more about NVIDIA DCGM here: https://developer.nvidia.com/dcgm
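Before chasing the process itself, it is worth a quick sanity check that the NVIDIA vGPU host driver (which delivers nv-hostengine) is installed and that the host’s CIM/WBEM service is enabled, since the adapter depends on the CIM Provider. From an SSH session on the host, something like the following should do; the exact VIB name varies by driver release, so treat the grep pattern as an assumption.
# Confirm the NVIDIA vGPU host driver VIB is installed (name varies by release)
esxcli software vib list | grep -i nvidia
# Confirm the driver is loaded and can see the GPU(s)
nvidia-smi
# Confirm the CIM/WBEM service is enabled on the host
esxcli system wbem get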
Let’s see if the “nv-hostengine” process is running on each host, as it should be. Checking the status is quite simple.
- Launch an SSH terminal session to the Host(s) in question.
- Type or paste the command below, then press the Enter key.
ps | grep nv-hostengine
If the process is running correctly, the output should be similar to the below:
[root@myvspherehost:~] ps | grep nv-hostengine
20487340 20487340 nv-hostengine
20487341 20487340 nv-hostengine
20487342 20487340 nv-hostengine
20487348 20487340 nv-hostengine
20487349 20487340 nv-hostengine
20487350 20487340 nv-hostengine
If the process is not running, the command returns no output at all; the shell simply presents the next prompt.
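If you have several GPU-equipped hosts to check, you can sweep them from a workstation with SSH access rather than logging in to each one individually. The loop below is just a convenience sketch: the hostnames are hypothetical and it assumes SSH is enabled on each host.
# Check each GPU host for a running nv-hostengine process (hostnames are examples)
for h in esxi01.mydomain esxi02.mydomain esxi03.mydomain; do
  echo "== $h =="
  ssh root@"$h" 'ps | grep nv-hostengine || echo "nv-hostengine NOT running"'
done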
In my case, only one host returned the expected output, whereas the others did not return anything. This indicates that the “nv-hostengine” process is not running as it should. To resolve, start the service using an SSH terminal session:
- Launch an SSH terminal session to the Host(s) in question (if you already have a session open from the steps above, you can simply use the same session).
- Type or paste the command below, then press the Enter key.
nv-hostengine -d
- Wait a moment, then check whether the process is now running by executing the same command as before:
ps | grep nv-hostengine
The output should now reflect lines similar to those above, although the numbers on the left are expected to differ. Wait for a few Aria Operations collection cycles to complete, then check the NVIDIA Dashboards within Aria Operations to confirm the additional hosts/GPUs are reflected accordingly. In my case, this was all that was required, and metrics are now populating reliably!