Questions for the NCA-GENL were updated on Nov 21, 2025
How can Retrieval Augmented Generation (RAG) help developers to build a trustworthy AI system?
D
Explanation:
Retrieval-Augmented Generation (RAG) enhances trustworthy AI by generating responses that cite
reference material from an external knowledge base, ensuring transparency and verifiability, as
discussed in NVIDIA’s Generative AI and LLMs course. RAG combines a retriever to fetch relevant
documents with a generator to produce responses, allowing outputs to be grounded in verifiable
sources, reducing hallucinations and improving trust. Option A is incorrect, as RAG does not focus on
security features like confidential computing. Option B is wrong, as RAG is unrelated to energy
efficiency. Option C is inaccurate, as RAG does not align models but integrates retrieved knowledge.
The course notes: “RAG enhances trustworthy AI by generating responses with citations from
external knowledge bases, improving transparency and verifiability of outputs.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
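As a hedged illustration of the retrieve-then-generate pattern described above, the following minimal Python sketch shows a toy retriever scoring a small knowledge base and a stand-in generator that cites the retrieved sources. The knowledge base, the token-overlap scoring function, and generate_answer() are illustrative placeholders, not part of any NVIDIA API.
```python
# Minimal sketch of the RAG pattern: retrieve relevant passages, then generate
# a response grounded in (and citing) those passages. All names are illustrative.

def token_overlap(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, knowledge_base: dict, k: int = 2) -> list:
    """Return the top-k (doc_id, text) pairs most relevant to the query."""
    ranked = sorted(knowledge_base.items(),
                    key=lambda item: token_overlap(query, item[1]),
                    reverse=True)
    return ranked[:k]

def generate_answer(query: str, passages: list) -> str:
    """Stand-in for an LLM call: ground the answer in retrieved passages and
    cite their source IDs so the output is verifiable."""
    citations = ", ".join(doc_id for doc_id, _ in passages)
    context = " ".join(text for _, text in passages)
    return f"Based on {citations}: {context}"

knowledge_base = {
    "doc-1": "RAG pairs a retriever with a generator so answers cite source documents.",
    "doc-2": "Layer fusion is a TensorRT inference optimization.",
}
passages = retrieve("How does RAG improve trust?", knowledge_base)
print(generate_answer("How does RAG improve trust?", passages))
```
In a production system the toy scorer would be replaced by a vector search over embeddings, but the trust property is the same: the response is grounded in, and can cite, retrieved reference material.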
Which of the following principles are widely recognized for building trustworthy AI? (Choose two.)
C, E
Explanation:
In building Trustworthy AI, privacy and nondiscrimination are widely recognized principles, as
emphasized in NVIDIA’s Generative AI and LLMs course. Privacy ensures that AI systems protect user
data and maintain confidentiality, often through techniques like confidential computing or data
anonymization. Nondiscrimination ensures that AI models avoid biases and treat all groups fairly,
mitigating issues like discriminatory outputs. Option A, conversational, is incorrect, as it is a feature
of some AI systems, not a Trustworthy AI principle. Option B, low latency, is a performance goal, not
a trust principle. Option D, scalability, is a technical consideration, not directly related to
trustworthiness. The course states: “Trustworthy AI principles include privacy, ensuring data
protection, and nondiscrimination, ensuring fair and unbiased model behavior, critical for ethical AI
development.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
What is confidential computing?
A
Explanation:
Confidential computing is a technique for securing computer hardware and software from potential
threats by protecting data in use, as covered in NVIDIA’s Generative AI and LLMs course. It ensures
that sensitive data, such as model weights or user inputs, remains encrypted during processing,
using technologies like secure enclaves or trusted execution environments (e.g., NVIDIA H100 GPUs
with confidential computing capabilities). This enhances the security of AI systems. Option B is
incorrect, as it describes Trustworthy AI principles, not confidential computing. Option C is wrong, as
aligning outputs with human beliefs is unrelated to security. Option D is inaccurate, as data
integration is not the focus of confidential computing. The course notes: “Confidential computing
secures AI systems by protecting data in use, leveraging trusted execution environments to safeguard
sensitive information during processing.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
In the development of Trustworthy AI, what is the significance of ‘Certification’ as a principle?
C
Explanation:
In the development of Trustworthy AI, ‘Certification’ as a principle involves verifying that AI models
are fit for their intended purpose according to regional or industry-specific standards, as discussed in
NVIDIA’s Generative AI and LLMs course. Certification ensures that models meet performance,
safety, and ethical benchmarks, providing assurance to stakeholders about their reliability and
appropriateness. Option A is incorrect, as transparency is a separate principle, not certification.
Option B is wrong, as ethical considerations are broader and not specific to certification. Option D is
inaccurate, as compliance with laws is related but distinct from certification’s focus on fitness for
purpose. The course states: “Certification in Trustworthy AI verifies that models meet regional or
industry-specific standards, ensuring they are fit for their intended purpose and reliable.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
Which of the following options best describes the NeMo Guardrails platform?
C
Explanation:
The NVIDIA NeMo Guardrails platform is designed to ensure the ethical and safe use of AI systems,
particularly LLMs, by enforcing predefined rules and regulations, as highlighted in NVIDIA’s
Generative AI and LLMs course. It provides a framework to monitor and control LLM outputs,
preventing harmful or inappropriate responses and ensuring compliance with ethical guidelines.
Option A is incorrect, as NeMo Guardrails focuses on safety, not scalability or performance. Option B
is wrong, as it describes model development, not guardrails. Option D is inaccurate, as it does not
pertain to data factories but to ethical AI enforcement. The course notes: “NeMo Guardrails ensures
the ethical use of AI by monitoring and enforcing compliance with predefined rules, enhancing the
safety and trustworthiness of LLM outputs.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA NeMo Framework User Guide.
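The sketch below shows one hedged way to drive guardrails from Python, assuming the open-source nemoguardrails package is installed; the ./guardrails_config directory and the rule content it would hold are hypothetical.
```python
# Hedged sketch of enforcing predefined rules with NeMo Guardrails, assuming the
# open-source `nemoguardrails` package and a hypothetical ./guardrails_config
# directory containing a config.yml and Colang rail definitions.
from nemoguardrails import LLMRails, RailsConfig

# Load the rail definitions (topical, safety, and moderation rules) from disk.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# The guardrails runtime checks the user message and the model's draft response
# against the configured rules before a reply is returned.
response = rails.generate(messages=[
    {"role": "user", "content": "Tell me how to bypass the safety policy."}
])
print(response["content"])
```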
“Hallucinations” is a term coined to describe when LLM models produce what?
C
Explanation:
In the context of LLMs, “hallucinations” refer to outputs that sound plausible and correct but are
factually incorrect or fabricated, as emphasized in NVIDIA’s Generative AI and LLMs course. This
occurs when models generate responses based on patterns in training data without grounding in
factual knowledge, leading to misleading or invented information. Option A is incorrect, as
hallucinations are not about similarity to input data but about factual inaccuracies. Option B is
wrong, as hallucinations typically refer to text, not image generation. Option D is inaccurate, as
hallucinations are grammatically coherent but factually wrong. The course states: “Hallucinations in
LLMs occur when models produce correct-sounding but factually incorrect outputs, posing challenges
for ensuring trustworthy AI.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
When implementing data parallel training, which of the following considerations needs to be taken
into account?
C
Explanation:
In data parallel training, where a model is replicated across multiple devices with each processing a
portion of the data, synchronizing model weights is critical. As covered in NVIDIA’s Generative AI and
LLMs course, the ring all-reduce algorithm is an efficient method for syncing weights across
processes or devices. It minimizes communication overhead by organizing devices in a ring topology,
allowing gradients to be aggregated and shared efficiently. Option A is incorrect, as weights are
typically synced after each batch, not just at epoch ends, to ensure consistency. Option B is wrong, as
master-worker methods can create bottlenecks and are less scalable than all-reduce. Option D is
inaccurate, as keeping weights independent defeats the purpose of data parallelism, which requires
synchronized updates. The course notes: “In data parallel training, the ring all-reduce algorithm
efficiently synchronizes model weights across devices, reducing communication overhead and
ensuring consistent updates.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
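To make the synchronization step concrete, here is a minimal PyTorch sketch of averaging gradients across replicas with torch.distributed.all_reduce (the NCCL backend implements this with a ring-style all-reduce). It assumes the process group has already been initialized; the model, batch, and optimizer are placeholders.
```python
# Hedged sketch of gradient synchronization in data parallel training.
# Assumes dist.init_process_group(...) has already been called on each process.
import torch
import torch.distributed as dist

def train_step(model, batch, targets, loss_fn, optimizer, world_size):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # Synchronize after every batch: sum gradients across all replicas, then
    # average, so every device applies the identical weight update.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()
```
In practice, torch.nn.parallel.DistributedDataParallel performs this averaging automatically and overlaps the communication with the backward pass, which is why the master-worker alternative in Option B is less scalable.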
Imagine you are training an LLM consisting of billions of parameters and your training dataset is
significantly larger than the available RAM in your system. Which of the following would be an
alternative?
B
Explanation:
When training an LLM with a dataset larger than available RAM, using a memory-mapped file is an
effective alternative, as discussed in NVIDIA’s Generative AI and LLMs course. Memory-mapped files
allow the system to access portions of the dataset directly from disk without loading the entire
dataset into RAM, enabling efficient handling of large datasets. This approach leverages virtual
memory to map file contents to memory, reducing memory bottlenecks. Option A is incorrect, as
moving large datasets in and out of GPU memory via PCI bandwidth is inefficient and not a standard
practice for dataset storage. Option C is wrong, as discarding data reduces model quality and is not a
scalable solution. Option D is inaccurate, as eliminating semantically equivalent sentences is a
specific preprocessing step that does not address memory constraints. The course states: “Memory-
mapped files enable efficient training of LLMs on large datasets by accessing data from disk without
loading it fully into RAM, overcoming memory limitations.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
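A minimal sketch of the memory-mapped approach follows, assuming a preprocessed token array has been saved to a hypothetical tokens.npy file; NumPy maps the file into virtual memory so only the slices actually touched are read from disk.
```python
# Minimal sketch of streaming training samples from a memory-mapped file.
# "tokens.npy" is a hypothetical preprocessed token array on disk.
import numpy as np

# mmap_mode="r" maps the file into virtual memory; pages are read from disk
# on demand instead of loading the whole dataset into RAM.
tokens = np.load("tokens.npy", mmap_mode="r")

def get_batch(step: int, batch_size: int, seq_len: int) -> np.ndarray:
    """Slice one batch of token sequences directly from the mapped array."""
    start = step * batch_size * seq_len
    batch = tokens[start:start + batch_size * seq_len]
    return np.asarray(batch).reshape(batch_size, seq_len)
```
Training frameworks typically wrap such a mapped array in a Dataset so the data loader only ever materializes the batches it needs.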
What is the purpose of the NVIDIA NGC catalog?
D
Explanation:
The NVIDIA NGC catalog is a curated repository of GPU-optimized software for AI, machine learning,
and data science, as highlighted in NVIDIA’s Generative AI and LLMs course. It provides developers
with pre-built containers, pre-trained models, and tools optimized for NVIDIA GPUs, enabling faster
development and deployment of AI solutions, including LLMs. These resources are designed to
streamline workflows and ensure compatibility with NVIDIA hardware. Option A is incorrect, as NGC
is not primarily for testing or debugging but for providing optimized software. Option B is wrong, as it
is not a collaboration platform like GitHub. Option C is inaccurate, as NGC is not a marketplace for
buying and selling but a free resource hub. The course notes: “The NVIDIA NGC catalog offers a
curated collection of GPU-optimized AI and data science software, including containers and models,
to accelerate development and deployment.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA NeMo Framework User Guide.
Which of the following optimizations are provided by TensorRT? (Choose two.)
C, D
Explanation:
NVIDIA TensorRT provides optimizations to enhance the performance of deep learning models
during inference, as detailed in NVIDIA’s Generative AI and LLMs course. Two key optimizations are
multi-stream execution and layer fusion. Multi-stream execution allows parallel processing of
multiple input streams on the GPU, improving throughput for concurrent inference tasks. Layer
fusion combines multiple layers of a neural network (e.g., convolution and activation) into a single
operation, reducing memory access and computation time. Option A, data augmentation, is
incorrect, as it is a preprocessing technique, not a TensorRT optimization. Option B, variable learning
rate, is a training technique, not relevant to inference. Option E, residual connections, is a model
architecture feature, not a TensorRT optimization. The course states: “TensorRT optimizes inference
through techniques like layer fusion, which combines operations to reduce overhead, and multi-
stream execution, which enables parallel processing for higher throughput.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
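As a hedged sketch of where these optimizations are applied, the snippet below builds a TensorRT engine from an ONNX file using a TensorRT 8.x-style Python API (in newer releases explicit batch is the default and the creation flag may be unnecessary); "model.onnx" is a placeholder. Layer fusion happens automatically during the build, and multi-stream execution is achieved at runtime by running multiple execution contexts on separate CUDA streams.
```python
# Hedged sketch of building a TensorRT engine; layer fusion (e.g., conv + bias
# + activation) is applied automatically while the engine is optimized.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder ONNX model
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # optional reduced precision

# The builder fuses compatible layers and selects kernels, then serializes
# the optimized engine for deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```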
Which of the following claims is correct about TensorRT and ONNX?
A
Explanation:
NVIDIA TensorRT is a deep learning inference library used to optimize and deploy models for high-
performance inference, while ONNX (Open Neural Network Exchange) is a format for model
interchange, enabling models to be shared across different frameworks, as covered in NVIDIA’s
Generative AI and LLMs course. TensorRT optimizes models (e.g., via layer fusion and quantization)
for deployment on NVIDIA GPUs, while ONNX ensures portability by providing a standardized model
representation. Option B is incorrect, as ONNX is not used for model creation but for interchange.
Option C is wrong, as TensorRT is not for model creation but optimization and deployment. Option D
is inaccurate, as ONNX is not for deployment but for model sharing. The course notes: “TensorRT
optimizes and deploys deep learning models for inference, while ONNX enables model interchange
across frameworks for portability.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
Which of the following is a feature of the NVIDIA Triton Inference Server?
B
Explanation:
The NVIDIA Triton Inference Server is designed to optimize and deploy machine learning models for
inference, and one of its key features is dynamic batching, as noted in NVIDIA’s Generative AI and
LLMs course. Dynamic batching automatically groups inference requests into batches to maximize
GPU utilization, reducing latency and improving throughput for real-time applications. Option A,
model quantization, is incorrect, as it is typically handled by frameworks like TensorRT, not Triton.
Option C, gradient clipping, is a training technique, not an inference feature. Option D, model
pruning, is a model optimization method, not a Triton feature. The course states: “NVIDIA Triton
Inference Server supports dynamic batching, which optimizes inference by grouping requests to
maximize GPU efficiency and throughput.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
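The client-side sketch below is a hedged illustration of how dynamic batching pays off: many single-sample requests are issued concurrently, and a Triton server whose model configuration enables dynamic batching can merge them into larger GPU batches. The model name "my_model" and the tensor names "INPUT__0"/"OUTPUT__0" are hypothetical.
```python
# Hedged sketch: issue concurrent single-sample requests to a Triton server.
# Assumes a model "my_model" (hypothetical) is deployed with dynamic batching
# enabled in its config.pbtxt, so the server can group these requests.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_input(sample: np.ndarray) -> httpclient.InferInput:
    inp = httpclient.InferInput("INPUT__0", sample.shape, "FP32")
    inp.set_data_from_numpy(sample)
    return inp

samples = [np.random.rand(1, 16).astype(np.float32) for _ in range(32)]
# Send requests without waiting for each response; the server's dynamic
# batcher can merge queued requests to maximize GPU utilization.
pending = [client.async_infer("my_model", inputs=[make_input(s)]) for s in samples]
outputs = [req.get_result().as_numpy("OUTPUT__0") for req in pending]
```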
You are in need of customizing your LLM via prompt engineering, prompt learning, or parameter-
efficient fine-tuning. Which framework helps you with all of these?
D
Explanation:
The NVIDIA NeMo framework is designed to support the development and customization of large
language models (LLMs), including techniques like prompt engineering, prompt learning (e.g.,
prompt tuning), and parameter-efficient fine-tuning (e.g., LoRA), as emphasized in NVIDIA’s
Generative AI and LLMs course. NeMo provides modular tools and pre-trained models that facilitate
these customization methods, allowing users to adapt LLMs for specific tasks efficiently. Option A,
TensorRT, is incorrect, as it focuses on inference optimization, not model customization. Option B,
DALI, is a data loading library for computer vision, not LLMs. Option C, Triton, is an inference server,
not a framework for LLM customization. The course notes: “NVIDIA NeMo supports LLM
customization through prompt engineering, prompt learning, and parameter-efficient fine-tuning,
enabling flexible adaptation for NLP tasks.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA NeMo Framework User Guide.
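To illustrate what parameter-efficient fine-tuning changes inside a model, here is a generic PyTorch sketch of the LoRA idea; it is not NeMo's API, just the underlying technique of freezing pretrained weights and training two small low-rank matrices.
```python
# Generic sketch of LoRA-style parameter-efficient fine-tuning (not NeMo's API):
# the pretrained projection is frozen and only a low-rank correction is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank learned correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # far fewer than the 512*512 base weight
```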
What is the Open Neural Network Exchange (ONNX) format used for?
A
Explanation:
The Open Neural Network Exchange (ONNX) format is an open-standard representation for deep
learning models, enabling interoperability across different frameworks, as highlighted in NVIDIA’s
Generative AI and LLMs course. ONNX allows models trained in frameworks like PyTorch or
TensorFlow to be exported and used in other compatible tools for inference or further development,
ensuring portability and flexibility. Option B is incorrect, as ONNX is not designed to reduce training
time but to standardize model representation. Option C is wrong, as model compression is handled
by techniques like quantization, not ONNX. Option D is inaccurate, as ONNX is unrelated to sharing
literature. The course states: “ONNX is an open format for representing deep learning models,
enabling seamless model exchange and deployment across various frameworks and platforms.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
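A brief hedged sketch of the interchange workflow: export a small PyTorch model to ONNX so another framework or runtime (such as ONNX Runtime or TensorRT) can load it. The file name and tensor names are placeholders.
```python
# Hedged sketch of exchanging a model via ONNX: train or define in one
# framework, export to the open ONNX format, load elsewhere for inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
dummy_input = torch.randn(1, 16)

torch.onnx.export(
    model,
    dummy_input,
    "tiny_model.onnx",                      # placeholder output path
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```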
What metrics would you use to evaluate the performance of a RAG workflow in terms of the
accuracy of responses generated in relation to the input query? (Choose two.)
D, E
Explanation:
In a Retrieval-Augmented Generation (RAG) workflow, evaluating the accuracy of responses relative
to the input query focuses on the quality of the retrieved context and the generated output. As
covered in NVIDIA’s Generative AI and LLMs course, two key metrics are response relevancy and
context precision. Response relevancy measures how well the generated response aligns with the
input query, often assessed through human evaluation or automated metrics like ROUGE or BLEU,
ensuring the output is pertinent and accurate. Context precision evaluates the retriever’s ability to
fetch relevant documents or passages from the knowledge base, typically measured by metrics like
precision@k, which assesses the proportion of retrieved items that are relevant to the query.
Options A (generator latency), B (retriever latency), and C (tokens generated per second) are
incorrect, as they measure performance efficiency (speed) rather than accuracy. The course notes:
“In RAG workflows, response relevancy ensures the generated output matches the query intent,
while context precision evaluates the accuracy of retrieved documents, critical for high-quality
responses.”
Reference: NVIDIA Building Transformer-Based Natural Language Processing Applications course;
NVIDIA Introduction to Transformer-Based Natural Language Processing.
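For the context-precision side of the evaluation, the sketch below computes precision@k: the fraction of the top-k retrieved passages that are actually relevant to the query. The document IDs and relevance labels are hypothetical.
```python
# Minimal sketch of context precision measured as precision@k.
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

retrieved = ["doc-3", "doc-7", "doc-1", "doc-9"]   # retriever output, ranked
relevant = {"doc-1", "doc-3"}                      # hypothetical ground truth
print(precision_at_k(retrieved, relevant, k=3))    # 2 of top 3 relevant -> ~0.667
```
Response relevancy is typically judged separately, by comparing the generated answer against the query intent with human review or automated scoring.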