Questions for the NCP-AIN were updated on Nov 21, 2025
[InfiniBand Security]
You are configuring the Unified Fabric Manager (UFM) for an InfiniBand fabric in a multi-tenant
environment. You need to implement a solution that can detect potential security threats.
Which UFM feature uses analytics to detect security threats and predict network failures in
InfiniBand data centers?
Answer: C
Explanation:
The UFM Cyber-AI platform is an advanced feature of NVIDIA's Unified Fabric Manager designed to
enhance security and reliability in InfiniBand data centers. It leverages AI-powered analytics and
machine learning techniques to detect security threats, operational anomalies, and predict potential
network failures. By analyzing real-time and historical telemetry data, UFM Cyber-AI can identify
abnormal system behaviors, performance degradations, and usage profile changes. This proactive
approach enables administrators to address issues before they escalate, ensuring the integrity and
uptime of the data center.
Reference Extracts from NVIDIA Documentation:
"The NVIDIA Unified Fabric Manager (UFM) Cyber-AI platform offers enhanced and real-time
network telemetry, combined with AI-powered intelligence and advanced analytics. It enables IT
managers to discover operational anomalies and even predict network failures."
"UFM Cyber-AI uses machine learning (ML) techniques and AI models for anomaly detection and
prediction to learn the lifecycle patterns of data center network components."
“The NVIDIA UFM platforms revolutionize data center networking management by combining
enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to support
scale-out InfiniBand data centers. ... The UFM Cyber-AI platform takes fabric management to the
next level by adding an analytics layer powered by artificial intelligence. It enables data center
operators to proactively monitor and manage the InfiniBand fabric, predicting and preventing
potential failures, optimizing performance, and enhancing security. By analyzing telemetry data and
historical patterns, UFM Cyber-AI can detect anomalies that may indicate security threats or
operational issues, providing actionable insights to prevent downtime.”
[Spectrum-X Optimization]
You are troubleshooting a Spectrum-X network and need to ensure that the network remains
operational in case of a link failure. Which feature of Spectrum-X ensures that the fabric continues to
deliver high performance even if there is a link failure?
Answer: B
Explanation:
RoCE Adaptive Routing is a key feature of NVIDIA Spectrum-X that ensures high performance and
resiliency in the network, even in the event of a link failure. This technology dynamically reroutes
traffic to the least congested and operational paths, effectively mitigating the impact of link failures.
By continuously evaluating the network's egress queue loads and receiving status notifications from
neighboring switches, Spectrum-X can adaptively select optimal paths for data transmission. This
ensures that the network maintains high throughput and low latency, crucial for AI workloads, even
when certain links are down.
Reference Extracts from NVIDIA Documentation:
"Spectrum-X employs global adaptive routing to quickly reroute traffic during link failures,
minimizing disruptions and preserving optimal storage fabric utilization."
"RoCE Adaptive Routing avoids congestion by dynamically routing large AI flows away from
congestion points. This approach improves network resource utilization, leaf/spine efficiency, and
performance."
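As a concrete illustration, adaptive routing on Cumulus Linux (Spectrum) switches is enabled through NVUE. The following is a minimal sketch, assuming a recent Cumulus Linux 5.x release where these adaptive-routing commands are available (verify against your release; the port name is a placeholder):
# Enable adaptive routing globally on the switch
nv set router adaptive-routing enable on
# Enable adaptive routing on a fabric-facing uplink port
nv set interface swp51 router adaptive-routing enable on
# Commit the staged configuration
nv config apply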
[Spectrum-X Optimization]
Which service on Cumulus switches can monitor layer 1, layer 2, layer 3, tunnel, buffer, and ACL
related issues?
Answer: A
Explanation:
The "What Just Happened" (WJH) service on Cumulus switches provides real-time visibility into
network problems by monitoring various layers and components, including layer 1, layer 2, layer 3,
tunnel, buffer, and Access Control List (ACL) related issues. WJH streams detailed and contextual
telemetry data, enabling administrators to diagnose and troubleshoot network problems effectively.
Reference Extracts from NVIDIA Documentation:
"WJH can monitor layer 1, layer 2, layer 3, tunnel, buffer and ACL related issues."
"The WJH service enables you to diagnose network problems by looking at dropped packets."
[InfiniBand Security]
Which of the following options correctly describes the difference between UFM Telemetry, UFM
Enterprise, and UFM Cyber AI?
Answer: A
Explanation:
UFM Telemetry: Provides real-time monitoring and analysis of network performance, collecting data
such as port counters and cable information to assess the health and efficiency of the network.
UFM Enterprise: Focuses on comprehensive network management and optimization, enabling
administrators to monitor, operate, and optimize InfiniBand scale-out computing environments
effectively.
UFM Cyber AI: Detects and mitigates network security threats by analyzing telemetry data to identify
anomalies and potential security issues within the network infrastructure.
Reference Extracts from NVIDIA Documentation:
"UFM Telemetry provides real-time monitoring and analysis of network performance."
"UFM Enterprise is a powerful platform for managing InfiniBand scale-out computing environments."
"UFM Cyber-AI enhances the benefits of UFM Telemetry and UFM Enterprise services by detecting
and mitigating network security threats."
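As a concrete example of the management layer, UFM Enterprise exposes a REST API that can be queried once the service is running. A minimal sketch, assuming the documented /ufmRest resource paths and the factory-default credentials (replace both for any real deployment):
# List the systems (switches and hosts) discovered by UFM Enterprise
curl -k -u admin:123456 https://ufm-server/ufmRest/resources/systems
# List port-level information collected by UFM
curl -k -u admin:123456 https://ufm-server/ufmRest/resources/ports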
[InfiniBand Configuration]
You are configuring an InfiniBand network for an AI cluster and need to install the appropriate
software stack. Which NVIDIA software package provides the necessary drivers and tools for
InfiniBand configuration in Linux environments?
Answer: D
Explanation:
MLNX_OFED (Mellanox OpenFabrics Enterprise Distribution) is an NVIDIA-tested and packaged
version of the OpenFabrics Enterprise Distribution (OFED) for Linux. It provides the necessary drivers
and tools to support InfiniBand and Ethernet interconnects using the same RDMA (Remote Direct
Memory Access) and kernel bypass APIs. MLNX_OFED enables high-performance networking
capabilities essential for AI clusters, including support for up to 400Gb/s InfiniBand and RoCE (RDMA
over Converged Ethernet).
Reference Extracts from NVIDIA Documentation:
"MLNX_OFED is an NVIDIA tested and packaged version of OFED that supports two interconnect
types using the same RDMA (remote DMA) and kernel bypass APIs called OFED verbs – InfiniBand
and Ethernet."
"Up to 400Gb/s InfiniBand and RoCE (based on the RDMA over Converged Ethernet standard) over
10/25/40/50/100/200/400GbE are supported."
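For illustration, a typical MLNX_OFED installation on a Linux host follows this pattern. A minimal sketch; the ISO file name is a placeholder for the image matching your distribution, and flags should be verified against your MLNX_OFED version:
# Mount the MLNX_OFED ISO and run the installer
mount -o ro,loop MLNX_OFED_LINUX-<version>-<distro>-x86_64.iso /mnt
/mnt/mlnxofedinstall --without-fw-update
# Restart the InfiniBand driver stack and verify the installation
/etc/init.d/openibd restart
ofed_info -s
ibstat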
[InfiniBand Troubleshooting]
You suspect there might be connectivity issues in your InfiniBand fabric and need to perform a
comprehensive check. Which tool should you use to run a full fabric diagnostic and generate a
report?
Answer: C
Explanation:
The ibdiagnet utility is a fundamental tool for InfiniBand fabric discovery, error detection, and
diagnostics. It provides comprehensive reports on the fabric's health, including error reporting,
switch and Host Channel Adapter (HCA) configuration dumps, various counters reported by the
switches and HCAs, and parameters of devices such as switch fans, power supply units, cables, and
PCI lanes. Additionally, ibdiagnet performs validation for Unicast Routing, Adaptive Routing, and
Multicast Routing to ensure correctness and a credit-loop-free routing environment.
Reference Extracts from NVIDIA Documentation:
"The ibdiagnet utility is one of the basic tools for InfiniBand fabric discovery, error detection and
diagnostic. The output files of the ibdiagnet include error reporting, switch and HCA configuration
dumps, various counters reported by the switches and the HCAs."
"ibdiagnet also performs Unicast Routing, Adaptive Routing and Multicast Routing validation for
correctness and credit-loop free routing."
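For example, a full fabric sweep with routing validation can be run as follows. A minimal sketch; the -r (routing validation) and -o (output directory) flags reflect common ibdiagnet usage and should be confirmed with ibdiagnet --help on your version:
# Discover the fabric, collect counters, and validate routing
ibdiagnet -r -o /var/tmp/ibdiagnet2
# Review the summary log and per-check reports in the output directory
less /var/tmp/ibdiagnet2/ibdiagnet2.log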
[InfiniBand Configuration]
What are the necessary steps to upgrade MLNX-OS on InfiniBand switches?
Answer: A
Explanation:
To upgrade the MLNX-OS on InfiniBand switches, the recommended procedure is as follows:
Connect to the switch via SSH: Establish a secure shell connection to the switch using its
management IP address.
Fetch the MLNX-OS software image: Obtain the appropriate MLNX-OS software image from the
official source or repository.
Use the 'install' command to perform the upgrade: Execute the 'install' command on the switch to
initiate the upgrade process with the fetched software image.
This method provides a straightforward remote upgrade without physical intervention; note that the switch must be reloaded to boot the new image, so plan the upgrade within a maintenance window.
Reference Extracts from NVIDIA Documentation:
"Click on Systems → MLNX-OS Upgrade. Select the desired upgrade method (e.g. 'Install from local
file'). Select your image and click 'Install Image'."
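For reference, the CLI flow described in the explanation maps to MLNX-OS commands along these lines. A minimal sketch; the SCP path and image file name are placeholders, and syntax should be verified against your MLNX-OS release:
enable
configure terminal
# Fetch the new software image onto the switch
image fetch scp://admin@10.0.0.10/var/images/<mlnx-os-image>.img
# Install the image and mark it as the next boot partition
image install <mlnx-os-image>.img
image boot next
# Save the configuration and reboot into the new image
configuration write
reload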
[Spectrum-X Security]
How does Spectrum-X achieve network isolation for multiple tenants?
Answer: B
Explanation:
Spectrum-X achieves network isolation in multi-tenant environments by implementing Layer 3
Virtual Network Identifiers (L3VNIs) per Virtual Routing and Forwarding (VRF) instance. This approach
allows each tenant to have a separate routing table and network segment, ensuring that traffic is
isolated and secure between tenants.
Reference Extracts from NVIDIA Documentation:
"Spectrum-X enhances multi-tenancy with performance isolation to ensure tenants' AI workloads
perform optimally and consistently."
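As an illustration of the L3VNI-per-VRF model on Cumulus Linux switches, each tenant VRF is mapped to its own layer-3 VNI. A minimal sketch, assuming an EVPN/VXLAN fabric is already in place (VRF name, port, and VNI number are placeholders):
# Bind tenant VRF TENANT-A to a dedicated layer-3 VNI
nv set vrf TENANT-A evpn vni 104001
# Place the tenant-facing interface into the VRF
nv set interface swp1 ip vrf TENANT-A
nv config apply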
[InfiniBand Configuration]
You need to configure a bond in Cumulus Linux. Which command should you use?
Answer: D
Explanation:
In Cumulus Linux, configuring a bond interface with Link Aggregation Control Protocol (LACP)
involves setting the bond mode to 'lacp'. The correct command to achieve this is:
nv set interface bond1 bond mode lacp
This command sets the bonding mode of 'bond1' to LACP, enabling dynamic link aggregation for
increased bandwidth and redundancy.
Reference Extracts from NVIDIA Documentation:
"To reset the link aggregation mode for bond1 to the default value of 802.3ad, run the nv set
interface bond1 bond mode lacp command."
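A fuller NVUE sequence for creating the bond might look like the following. A minimal sketch; the member ports are placeholders:
# Create bond1 with two member ports and LACP (802.3ad) mode
nv set interface bond1 bond member swp1-2
nv set interface bond1 bond mode lacp
nv config apply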
[AI Network Architecture]
A major cloud provider is designing a new data center to support large-scale AI workloads,
particularly for training large language models. They want to optimize their network architecture for
maximum performance and efficiency.
Why is a rail-optimized topology considered a best practice for AI network architecture in this
scenario?
Answer: C
Explanation:
A rail-optimized topology is designed to enhance GPU-to-GPU communication by connecting each
GPU's Network Interface Card (NIC) to a dedicated rail switch. This configuration ensures predictable
traffic patterns and minimizes network interference between data flows, which is crucial for the
performance of large-scale AI workloads, such as training large language models. By reducing
contention and latency, this topology supports efficient and scalable AI training environments.
Reference Extracts from NVIDIA Documentation:
"Rail-optimized network topology helps maximize all-reduce performance while minimizing network
interference between flows."
"A Rail Optimized Stripe Architecture provides efficient data transfer between GPUs, especially
during computationally intensive tasks such as AI Large Language Models (LLM) training workloads,
where seamless data transfer is necessary to complete the tasks within a reasonable timeframe."
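To make the pattern concrete, a rail-optimized fabric can be sketched in the same DOT notation that NVIDIA AIR uses for custom topologies (see the simulation question below). This is purely illustrative; all node and port names are hypothetical. NIC n on every GPU server connects to rail switch n, so same-rail GPU traffic stays one hop away:
cat > rail-topology.dot <<'EOF'
graph "rail-optimized" {
  /* two rails: NIC eth1 of every server lands on rail-leaf-01,
     NIC eth2 of every server lands on rail-leaf-02 */
  "gpu-server-01":"eth1" -- "rail-leaf-01":"swp1"
  "gpu-server-01":"eth2" -- "rail-leaf-02":"swp1"
  "gpu-server-02":"eth1" -- "rail-leaf-01":"swp2"
  "gpu-server-02":"eth2" -- "rail-leaf-02":"swp2"
}
EOF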
[Spectrum-X Configuration]
When creating a simulation in NVIDIA AIR, what syntax would you use to define a link between port
1 on spine-01 and port 41 on gpu-leaf-01?
Answer: A
Explanation:
NVIDIA AIR (AI-Ready Infrastructure) is a cloud-based simulation platform designed to model and
validate data center network deployments, including Spectrum-X Ethernet networks, using realistic
topologies and configurations. When creating a custom topology in NVIDIA AIR, users can define
network links between devices (e.g., spine and leaf switches) using a DOT file format, which is based
on the Graphviz graph visualization software. The question asks for the correct syntax to define a link
between port 1 on a spine switch (spine-01) and port 41 on a leaf switch (gpu-leaf-01) in an NVIDIA AIR simulation.
According to NVIDIA’s official NVIDIA AIR documentation, the DOT file format is used to specify
network topologies, including nodes (devices) and links (connections between ports). The syntax for
defining a link in a DOT file uses a double dash (--) to indicate a connection between two ports, with
each port specified in the format "<node>":"<port>". For Spectrum-X networks, which typically use
Cumulus Linux or SONiC on NVIDIA Spectrum switches, ports are commonly labeled as swpX (switch
port X) rather than ethX (Ethernet interface), especially for switch-to-switch connections in a leaf-spine topology. The correct syntax for the link between port 1 on spine-01 and port 41 on gpu-leaf-01
is:
"spine-01":"swp01" -- "gpu-leaf-01":"swp41"
This syntax uses swp01 and swp41 to denote switch ports, consistent with Cumulus Linux
conventions, and the double dash (--) to indicate the link, as required by the DOT file format.
Exact Extract from NVIDIA Documentation:
“You can create custom topologies in Air using a DOT file, which is the file type used with the open-source graph visualization software, Graphviz. DOT files define nodes, attributes, and connections for
generating a topology for a network. The following is an example of a link definition in a DOT file:
"leaf01":"swp31" -- "spine01":"swp1"
This specifies a connection between port swp31 on leaf01 and port swp1 on spine01. Port names
typically follow the switch port naming convention (e.g., swpX) for Cumulus Linux-based switches.”
— NVIDIA Air Custom Topology Guide
This extract confirms that option A is the correct answer, as it uses the proper DOT file syntax with
swp01 and swp41 for port names and the double dash (--) for the link, aligning with NVIDIA AIR’s
topology definition process for Spectrum-X simulations.
Analysis of Other Options:
B . "spine-01":"swp1" to "gpu-leaf-01":"swp41": This option uses the correct port naming convention
(swp1 and swp41) but incorrectly uses the word to as the connector instead of the double dash (--).
The DOT file format requires -- to define links, making this syntax invalid for NVIDIA AIR.
C . "spine-01":"eth1" to "gpu-leaf-01":"eth41": This option uses ethX port names, which are typically
used for host interfaces (e.g., servers) rather than switch ports in Cumulus Linux or SONiC
environments. Switch ports in Spectrum-X topologies are labeled swpX. Additionally, the use of to
instead of -- is incorrect for DOT file syntax, making this option invalid.
D . "spine-01":"eth1" - "gpu-leaf-01":"eth41": This option uses a single dash (-) instead of the
required double dash (--) and incorrectly uses ethX port names instead of swpX. The ethX naming is
not standard for switch ports in Spectrum-X, and the single dash is not valid DOT file syntax, making
this option incorrect.
Why "spine-01":"swp01" -- "gpu-leaf-01":"swp41" is the Correct Answer:
Option A correctly adheres to the DOT file syntax used in NVIDIA AIR for defining network links:
Node and Port Naming: The nodes spine-01 and gpu-leaf-01 are specified with their respective ports
swp01 and swp41, following the swpX convention for switch ports in Cumulus Linux-based Spectrum-X switches.
Link Syntax: The double dash (--) is the standard connector in DOT files to indicate a link between two
ports, as required by Graphviz and NVIDIA AIR.
Spectrum-X Context: In a Spectrum-X leaf-spine topology, connections between spine and leaf
switches (e.g., Spectrum-4 switches) use switch ports labeled swpX, making swp01 and swp41
appropriate for this simulation.
This syntax ensures that the NVIDIA AIR simulation accurately models the physical connection
between spine-01 port 1 and gpu-leaf-01 port 41, enabling validation of the Spectrum-X network
topology. The DOT file can be uploaded to NVIDIA AIR to generate the topology, as described in the
documentation.
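Putting the pieces together, a complete minimal DOT file containing this link could be written and uploaded to NVIDIA AIR as follows. A sketch limited to the two nodes in the question; optional node attributes (such as the OS image) are omitted rather than guessed:
cat > topology.dot <<'EOF'
graph "spectrum-x-sim" {
  "spine-01":"swp01" -- "gpu-leaf-01":"swp41"
}
EOF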
[InfiniBand Configuration]
What are the two general user account types in MLNX-OS?
Pick the 2 correct responses below:
Answer: B, C
Explanation:
MLNX-OS, the operating system for NVIDIA's networking devices, defines two primary user account
types: admin and monitor. The admin account has full administrative privileges, allowing for
complete configuration and management of the system. The monitor account, on the other hand, is
designed for users who need to view system configurations and statuses without making any
changes. This separation ensures a clear distinction between users who manage the system and
those who monitor its operations.
Reference Extracts from NVIDIA Documentation:
"There are two user roles or account types: admin and monitor. As 'admin', the user is privileged to
run all the available commands. As 'monitor', the user can run commands that show system
configuration and status, or set terminal settings."
For additional context, MLNX-OS is the network operating system used on NVIDIA (formerly Mellanox) switch systems, providing a command-line interface (CLI) for configuring and managing switch operations; user accounts control access to the available commands and functions. The two general account types define the system's primary privilege levels:
monitor: Read-only access. Users can view configuration details, system status, and logs, but cannot change settings, which makes this role suitable for monitoring and troubleshooting without risk of unintended changes.
admin: Full read-write access. Users can view and modify all configurations, execute privileged commands, and manage every aspect of the switch; this role is intended for administrators with complete control over the system.
Exact Extract from NVIDIA Documentation:
“MLNX-OS supports two primary user account types for managing switch operations:
monitor: Users with monitor privileges have read-only access to the system. They can view
configuration details, system status, and logs but cannot make changes to the configuration.
admin: Users with admin privileges have full read-write access, allowing them to configure, manage,
and troubleshoot all aspects of the switch, including executing privileged commands.
These account types ensure secure and controlled access to the switch’s management functions.”
— NVIDIA MLNX-OS User Manual
This extract confirms that options B (monitor) and C (admin) are the correct answers. These account
types are the standard privilege levels in MLNX-OS, used to manage access for monitoring and
administrative tasks on Spectrum switches, including those in Spectrum-X deployments.
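For illustration, accounts with these capabilities are created from the MLNX-OS CLI. A minimal sketch; the usernames are placeholders and the syntax should be confirmed in your MLNX-OS release:
enable
configure terminal
# Create a read-only account and a full-privilege account
username viewer capability monitor
username netadmin capability admin
configuration write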
[Spectrum-X Security]
A cloud service provider is deploying the NVIDIA Spectrum-X Ethernet platform in a multi-tenant
environment. To ensure the security and isolation of each tenant's AI workload, the provider wants
to implement a feature that prevents unauthorized access to the network.
Which of the following features of the Spectrum-X platform should the provider implement?
Answer: D
Explanation:
In multi-tenant AI cloud environments, ensuring that each tenant's workloads are isolated and
secure is paramount. The NVIDIA Spectrum-X platform addresses this need through its Traffic
Isolation capabilities. This feature ensures that network resources are partitioned effectively,
preventing unauthorized access and interference between tenants. By implementing Traffic Isolation,
the provider can maintain strict boundaries between different tenant environments, ensuring both
security and performance consistency.
Reference Extracts from NVIDIA Documentation:
"Spectrum-X enhances multi-tenancy with performance isolation to ensure tenants' AI workloads
perform optimally and consistently."
"Spectrum-X utilizes the programmable congestion control function on the BlueField-3 hardware
platform to accurately assess the congestion condition of the traffic path by using in-band telemetry
information... to achieve the goal of performance isolation to ensure that each tenant gets the best
expected performance in the cloud and is not negatively affected by congestion of other tenants."
[Spectrum-X Optimization]
You have implemented adaptive routing in your Spectrum-X network to optimize AI workload
performance. You need to verify the effectiveness of this configuration and monitor its impact on
network congestion. Which tool would be most appropriate for monitoring and analyzing the
adaptive routing performance in your Spectrum-X environment?
Answer: A
Explanation:
NVIDIA NetQ is a comprehensive network operations tool designed to provide real-time visibility into
the health and performance of NVIDIA networking environments, including Spectrum-X. It offers
detailed telemetry and analytics, allowing administrators to monitor adaptive routing behaviors,
detect congestion, and analyze traffic patterns. By leveraging NetQ, you can ensure that adaptive
routing is functioning as intended and that the network is optimized for AI workloads.
Reference Extracts from NVIDIA Documentation:
"The NVIDIA NetQ network validation and ASIC monitoring tool set provide visibility into the
network health and behavior. The NetQ flow telemetry analysis shows the paths that data flows take
as they traverse the network, providing network latency and performance insights."
"By leveraging telemetry from Spectrum Ethernet switches and BlueField-3 SuperNICs, NVIDIA NetQ
can detect network issues proactively and troubleshoot network issues faster for optimal use of
network capacity."
[AI Network Architecture]
In an AI cluster using NVIDIA GPUs, which configuration parameter in the NicClusterPolicy custom
resource is crucial for enabling high-speed GPU-to-GPU communication across nodes?
Answer: A
Explanation:
The RDMA Shared Device Plugin is a critical component in the NicClusterPolicy custom resource for
enabling Remote Direct Memory Access (RDMA) capabilities in Kubernetes clusters. RDMA allows for
high-throughput, low-latency networking, which is essential for efficient GPU-to-GPU
communication across nodes in AI workloads. By deploying the RDMA Shared Device Plugin, the
cluster can leverage RDMA-enabled network interfaces, facilitating direct memory access between
GPUs without involving the CPU, thus optimizing performance.
Reference Extracts from NVIDIA Documentation:
"RDMA Shared Device Plugin: Deploy RDMA Shared device plugin. This plugin enables RDMA
capabilities in the Kubernetes cluster, allowing high-speed GPU-to-GPU communication across
nodes."
"The RDMA Shared Device Plugin is responsible for advertising RDMA-capable network interfaces to
Kubernetes, enabling pods to utilize RDMA for high-performance networking."
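To ground this, the rdmaSharedDevicePlugin stanza is part of the NicClusterPolicy custom resource deployed by the NVIDIA network operator. A minimal sketch; the image version, resource name, and interface selector are placeholders to adapt to your cluster:
kubectl apply -f - <<'EOF'
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  rdmaSharedDevicePlugin:
    repository: ghcr.io/mellanox
    image: k8s-rdma-shared-dev-plugin
    version: v1.4.0
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": { "ifNames": ["ibs1f0"] }
          }
        ]
      }
EOF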