Questions for the DATABRICKS GENERATIVE AI ENGINEER ASSOCIATE were updated on: Dec 01, 2025
A Generative AI Engineer is ready to deploy an LLM application written using Foundation Model APIs.
They want to follow security best practices for production scenarios.
Which authentication method should they choose?
A
Explanation:
The task is to deploy an LLM application using Foundation Model APIs in a production environment
while adhering to security best practices. Authentication is critical for securing access to Databricks
resources, such as the Foundation Model API. Let’s evaluate the options based on Databricks’
security guidelines for production scenarios.
Option A: Use an access token belonging to service principals
Service principals are non-human identities designed for automated workflows and applications in
Databricks. Using an access token tied to a service principal ensures that the authentication is scoped
to the application, follows least-privilege principles (via role-based access control), and avoids
reliance on individual user credentials. This is a security best practice for production deployments.
Databricks Reference: "For production applications, use service principals with access tokens to
authenticate securely, avoiding user-specific credentials" ("Databricks Security Best Practices," 2023).
Additionally, the "Foundation Model API Documentation" states: "Service principal tokens are
recommended for programmatic access to Foundation Model APIs."
Option B: Use a frequently rotated access token belonging to either a workspace user or a service
principal
Frequent rotation enhances security by limiting token exposure, but tying the token to a workspace
user introduces risks (e.g., user account changes, broader permissions). Including both user and
service principal options dilutes the focus on application-specific security, making this less ideal than
a service-principal-only approach. It also adds operational overhead without clear benefits over
Option A.
Databricks Reference: "While token rotation is a good practice, service principals are preferred over
user accounts for application authentication" ("Managing Tokens in Databricks," 2023).
Option C: Use OAuth machine-to-machine authentication
OAuth M2M (e.g., client credentials flow) is a secure method for application-to-service
communication, often using service principals under the hood. However, Databricks’ Foundation
Model API primarily supports personal access tokens (PATs) or service principal tokens over full
OAuth flows for simplicity in production setups. OAuth M2M adds complexity (e.g., managing refresh
tokens) without a clear advantage in this context.
Databricks Reference: "OAuth is supported in Databricks, but service principal tokens are simpler and
sufficient for most API-based workloads" ("Databricks Authentication Guide," 2023).
Option D: Use an access token belonging to any workspace user
Using a user’s access token ties the application to an individual’s identity, violating security best
practices. It risks exposure if the user leaves, changes roles, or has overly broad permissions, and it’s
not scalable or auditable for production.
Databricks Reference: "Avoid using personal user tokens for production applications due to security
and governance concerns" ("Databricks Security Best Practices," 2023).
Conclusion: Option A is the best choice, as it uses a service principal’s access token, aligning with
Databricks’ security best practices for production LLM applications. It ensures secure, application-
specific authentication with minimal complexity, as explicitly recommended for Foundation Model
API deployments.
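For illustration, here is a minimal sketch of what this looks like in practice, assuming the application reads the service principal's token and the workspace URL from environment variables and calls a chat-style serving endpoint (the endpoint name below is a placeholder):

```python
# Minimal sketch (names are placeholders): call a Foundation Model API serving
# endpoint with a token issued to a service principal rather than a user.
import os
import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
SP_TOKEN = os.environ["DATABRICKS_SP_TOKEN"]     # token minted for the service principal

def query_foundation_model(prompt: str, endpoint: str = "databricks-llm-endpoint") -> str:
    """Send a chat request authenticated as the service principal."""
    response = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/{endpoint}/invocations",
        headers={"Authorization": f"Bearer {SP_TOKEN}"},
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```

The security benefit comes from the identity the token is bound to, not from the request itself, so no application code changes are needed when switching from a user token to a service principal token.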
A Generative AI Engineer is deciding between using LSH (Locality Sensitive Hashing) and HNSW
(Hierarchical Navigable Small World) for indexing their vector database. Their top priority is semantic
accuracy.
Which approach should the Generative Al Engineer use to evaluate these two techniques?
A
Explanation:
The task is to choose between LSH and HNSW for a vector database index, prioritizing semantic
accuracy. The evaluation must assess how well each method retrieves semantically relevant results.
Let’s evaluate the options.
Option A: Compare the cosine similarities of the embeddings of returned results against those of a
representative sample of test inputs
Cosine similarity measures semantic closeness between vectors, directly assessing retrieval accuracy
in a vector database. Comparing returned results’ embeddings to test inputs’ embeddings evaluates
how well LSH or HNSW preserves semantic relationships, aligning with the priority.
Databricks Reference: "Cosine similarity is a standard metric for evaluating vector search accuracy"
("Databricks Vector Search Documentation," 2023).
Option B: Compare the Bilingual Evaluation Understudy (BLEU) scores of returned results for a
representative sample of test inputs
BLEU evaluates text generation (e.g., translations), not vector retrieval accuracy. It’s irrelevant for
indexing performance.
Databricks Reference: "BLEU applies to generative tasks, not retrieval" ("Generative AI Cookbook").
Option C: Compare the Recall-Oriented-Understudy for Gisting Evaluation (ROUGE) scores of
returned results for a representative sample of test inputs
ROUGE is for summarization evaluation, not vector search. It doesn’t measure semantic accuracy in
retrieval.
Databricks Reference: "ROUGE is unsuited for vector database evaluation" ("Building LLM
Applications with Databricks").
Option D: Compare the Levenshtein distances of returned results against a representative sample of
test inputs
Levenshtein distance measures string edit distance, not semantic similarity in embeddings. It’s
inappropriate for vector-based retrieval.
Databricks Reference: No specific support for Levenshtein in vector search contexts.
Conclusion: Option A (cosine similarity) is the correct approach, directly evaluating semantic
accuracy in vector retrieval, as recommended by Databricks for Vector Search assessments.
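A minimal sketch of this evaluation, where `embed` is the application's embedding function and `search_lsh` / `search_hnsw` are hypothetical wrappers around the two candidate indexes:

```python
# Minimal sketch: score each candidate index by the mean cosine similarity
# between a test query's embedding and the embeddings of the results it returns.
# `embed`, `search_lsh`, and `search_hnsw` are hypothetical stand-ins.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_retrieval_similarity(test_queries, embed, search, k: int = 5) -> float:
    scores = []
    for query in test_queries:
        query_vec = embed(query)
        for result_text in search(query, k=k):
            scores.append(cosine_similarity(query_vec, embed(result_text)))
    return float(np.mean(scores))

# lsh_score  = mean_retrieval_similarity(test_queries, embed, search_lsh)
# hnsw_score = mean_retrieval_similarity(test_queries, embed, search_hnsw)
# The index with the higher mean similarity better preserves semantic accuracy.
```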
A Generative AI Engineer is building a production-ready LLM system that replies directly to
customers. The solution makes use of the Foundation Model API via provisioned throughput. They
are concerned that the LLM could potentially respond in a toxic or otherwise unsafe way. They also
wish to perform this with the least amount of effort.
Which approach will do this?
A
Explanation:
The task is to prevent toxic or unsafe responses in an LLM system using the Foundation Model API
with minimal effort. Let’s assess the options.
Option A: Host Llama Guard on Foundation Model API and use it to detect unsafe responses
Llama Guard is a safety-focused model designed to detect toxic or unsafe content. Hosting it via the
Foundation Model API (a Databricks service) integrates seamlessly with the existing system,
requiring minimal setup (just deployment and a check step), and leverages provisioned throughput
for performance.
Databricks Reference: "Foundation Model API supports hosting safety models like Llama Guard to
filter outputs efficiently" ("Foundation Model API Documentation," 2023).
Option B: Add some LLM calls to their chain to detect unsafe content before returning text
Using additional LLM calls (e.g., prompting an LLM to classify toxicity) increases latency, complexity,
and effort (crafting prompts, chaining logic), and lacks the specificity of a dedicated safety model.
Databricks Reference: "Ad-hoc LLM checks are less efficient than purpose-built safety solutions"
("Building LLM Applications with Databricks").
Option C: Add a regex expression on inputs and outputs to detect unsafe responses
Regex can catch simple patterns (e.g., profanity) but fails for nuanced toxicity (e.g., sarcasm, context-
dependent harm), requiring significant manual effort to maintain and update rules.
Databricks Reference: "Regex-based filtering is limited for complex safety needs" ("Generative AI
Cookbook").
Option D: Ask users to report unsafe responses
User reporting is reactive, not preventive, and places burden on users rather than the system. It
doesn’t limit unsafe outputs proactively and requires additional effort for feedback handling.
Databricks Reference: "Proactive guardrails are preferred over user-driven monitoring" ("Databricks
Generative AI Engineer Guide").
Conclusion: Option A (Llama Guard on Foundation Model API) is the least-effort, most effective
approach, leveraging Databricks’ infrastructure for seamless safety integration.
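A minimal sketch of the check step, assuming Llama Guard has been deployed behind a serving endpoint named `llama-guard` and returns a verdict beginning with "safe" or "unsafe" (both the endpoint name and the response parsing are assumptions to adapt to the actual deployment):

```python
# Minimal sketch: post the candidate response to a hosted Llama Guard endpoint
# and block it if the verdict is not "safe".
import requests

def is_safe(text: str, host: str, token: str) -> bool:
    resp = requests.post(
        f"{host}/serving-endpoints/llama-guard/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"messages": [{"role": "user", "content": text}], "max_tokens": 32},
        timeout=30,
    )
    resp.raise_for_status()
    verdict = resp.json()["choices"][0]["message"]["content"]
    return verdict.strip().lower().startswith("safe")

# In the chat flow: return the LLM's answer only if is_safe(answer) is True;
# otherwise substitute a canned fallback message.
```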
A Generative AI Engineer has built an LLM-based system that will automatically translate user text
between two languages. They now want to benchmark multiple LLMs on this task and pick the best
one. They have an evaluation set with known high quality translation examples. They want to
evaluate each LLM using the evaluation set with a performant metric.
Which metric should they choose for this evaluation?
B
Explanation:
The task is to benchmark LLMs for text translation using an evaluation set with known high-quality
examples, requiring a performant metric. Let’s evaluate the options.
Option A: ROUGE metric
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated
and reference texts, primarily for summarization. It’s less suited for translation, where precision and
word order matter more.
Databricks Reference: "ROUGE is commonly used for summarization, not translation evaluation"
("Generative AI Cookbook," 2023).
Option B: BLEU metric
BLEU (Bilingual Evaluation Understudy) evaluates translation quality by comparing n-gram overlap
with reference translations, accounting for precision and brevity. It’s widely used, performant, and
appropriate for this task.
Databricks Reference: "BLEU is a standard metric for evaluating machine translation, balancing
accuracy and efficiency" ("Building LLM Applications with Databricks").
Option C: NDCG metric
NDCG (Normalized Discounted Cumulative Gain) assesses ranking quality, not text generation. It’s
irrelevant for translation evaluation.
Databricks Reference: "NDCG is suited for ranking tasks, not generative output scoring" ("Databricks
Generative AI Engineer Guide").
Option D: RECALL metric
Recall measures retrieved relevant items but doesn’t evaluate translation quality (e.g., fluency,
correctness). It’s incomplete for this use case.
Databricks Reference: No specific extract, but recall alone lacks the granularity of BLEU for text
generation tasks.
Conclusion: Option B (BLEU) is the best metric for translation evaluation, offering a performant and
standard approach, as endorsed by Databricks’ guidance on generative tasks.
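A minimal sketch of the benchmark loop using the sacrebleu package (one common BLEU implementation; the `model_outputs` and `references` variables are placeholders):

```python
# Minimal sketch using the sacrebleu package (`pip install sacrebleu`).
# `model_outputs` maps each candidate LLM to its translations of the evaluation
# inputs; `references` holds the known high-quality translations, in the same order.
import sacrebleu

def bleu_score(candidates: list[str], references: list[str]) -> float:
    """Corpus-level BLEU for one model's outputs against the evaluation set."""
    return sacrebleu.corpus_bleu(candidates, [references]).score

# scores = {name: bleu_score(outputs, references) for name, outputs in model_outputs.items()}
# best_model = max(scores, key=scores.get)
```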
A Generative AI Engineer is helping a cinema extend its website's chatbot to be able to respond to
questions about specific showtimes for movies currently playing at their local theater. They already
have the location of the user provided by location services to their agent, and a Delta table which is
continually updated with the latest showtime information by location. They want to implement this
new capability in their RAG application.
Which option will do this with the least effort and in the most performant way?
A
Explanation:
The task is to extend a cinema chatbot to provide movie showtime information using a RAG
application, leveraging user location and a continuously updated Delta table, with minimal effort and
high performance. Let’s evaluate the options.
Option A: Create a Feature Serving Endpoint from a FeatureSpec that references an online store
synced from the Delta table. Query the Feature Serving Endpoint as part of the agent logic / tool
implementation
Databricks Feature Serving provides low-latency access to real-time data from Delta tables via an
online store. Syncing the Delta table to a Feature Serving Endpoint allows the chatbot to query
showtimes efficiently, integrating seamlessly into the RAG agent’s tool logic. This leverages
Databricks’ native infrastructure, minimizing effort and ensuring performance.
Databricks Reference: "Feature Serving Endpoints provide real-time access to Delta table data with
low latency, ideal for production systems" ("Databricks Feature Engineering Guide," 2023).
Option B: Query the Delta table directly via a SQL query constructed from the user's input using a
text-to-SQL LLM in the agent logic / tool
Using a text-to-SQL LLM to generate queries adds complexity (e.g., ensuring accurate SQL
generation) and latency (LLM inference + SQL execution). While feasible, it’s less performant and
requires more effort than a pre-built serving solution.
Databricks Reference: "Direct SQL queries are flexible but may introduce overhead in real-time
applications" ("Building LLM Applications with Databricks").
Option C: Write the Delta table contents to a text column, then embed those texts using an
embedding model and store these in the vector index. Look up the information based on the
embedding as part of the agent logic / tool implementation
Converting structured Delta table data (e.g., showtimes) into text, embedding it, and using vector
search is inefficient for structured lookups. It’s effort-intensive (preprocessing, embedding) and less
precise than direct queries, undermining performance.
Databricks Reference: "Vector search excels for unstructured data, not structured tabular lookups"
("Databricks Vector Search Documentation").
Option D: Set up a task in Databricks Workflows to write the information in the Delta table
periodically to an external database such as MySQL and query the information from there as part of
the agent logic / tool implementation
Exporting to an external database (e.g., MySQL) adds setup effort (workflow, external DB
management) and latency (periodic updates vs. real-time). It’s less performant and more complex
than using Databricks’ native tools.
Databricks Reference: "Avoid external systems when Delta tables provide real-time data natively"
("Databricks Workflows Guide").
Conclusion: Option A minimizes effort by using Databricks Feature Serving for real-time, low-latency
access to the Delta table, ensuring high performance in a production-ready RAG chatbot.
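A minimal sketch of the agent tool, assuming a Feature Serving Endpoint named `showtimes-feature-endpoint` keyed on a `location_id` column (the endpoint name, key column, and response shape are assumptions):

```python
# Minimal sketch of the agent tool: query the Feature Serving Endpoint that is
# backed by an online store synced from the showtimes Delta table.
import requests

def get_showtimes(location_id: str, host: str, token: str) -> dict:
    resp = requests.post(
        f"{host}/serving-endpoints/showtimes-feature-endpoint/invocations",
        headers={"Authorization": f"Bearer {token}"},
        json={"dataframe_records": [{"location_id": location_id}]},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # contains the latest showtime features for that location
```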
A Generative AI Engineer is setting up a Databricks Vector Search that will look up news articles by
topic within 10 days of the date specified. An example query might be "Tell me about monster truck
news around January 5th, 1992". They want to do this with the least amount of effort.
How can they set up their Vector Search index to support this use case?
B
Explanation:
The task is to set up a Databricks Vector Search index for news articles, supporting queries like
“monster truck news around January 5th, 1992,” with minimal effort. The index must filter by topic
and a 10-day date range. Let’s evaluate the options.
Option A: Split articles by 10-day blocks and return the block closest to the query
Pre-splitting articles into 10-day blocks requires significant preprocessing and index management
(e.g., one index per block). It’s effort-intensive and inflexible for dynamic date ranges.
Databricks Reference: "Static partitioning increases setup complexity; metadata filtering is preferred"
("Databricks Vector Search Documentation").
Option B: Include metadata columns for article date and topic to support metadata filtering
Adding date and topic as metadata in the Vector Search index allows dynamic filtering (e.g., date ± 5
days, topic = “monster truck”) at query time. This leverages Databricks’ built-in metadata filtering,
minimizing setup effort.
Databricks Reference: "Vector Search supports metadata filtering on columns like date or category
for precise retrieval with minimal preprocessing" ("Vector Search Guide," 2023).
Option C: Pass the query directly to the vector search index and return the best articles
Passing the full query (e.g., “Tell me about monster truck news around January 5th, 1992”) to Vector
Search relies solely on embeddings, ignoring structured filtering for date and topic. This risks
inaccurate results without explicit range logic.
Databricks Reference: "Pure vector similarity may not handle temporal or categorical constraints
effectively" ("Building LLM Applications with Databricks").
Option D: Create separate indexes by topic and add a classifier model to appropriately pick the best
index
Separate indexes per topic plus a classifier model adds significant complexity (index creation, model
training, maintenance), far exceeding “least effort.” It’s overkill for this use case.
Databricks Reference: "Multiple indexes increase overhead; single-index with metadata is simpler"
("Databricks Vector Search Documentation").
Conclusion: Option B is the simplest and most effective solution, using metadata filtering in a single
Vector Search index to handle date ranges and topics, aligning with Databricks’ emphasis on efficient,
low-effort setups.
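A minimal sketch of the query-time filtering using the databricks-vectorsearch client; the endpoint, index, and column names are placeholders, and the date filter assumes dates are stored in a filterable `article_date` column:

```python
# Minimal sketch using the databricks-vectorsearch client. Endpoint, index, and
# column names are placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="news-endpoint",
    index_name="catalog.schema.news_index",
)

results = index.similarity_search(
    query_text="monster truck news",
    columns=["article_id", "article_text", "article_date"],
    filters={"article_date >=": "1991-12-31", "article_date <=": "1992-01-10"},
    num_results=10,
)
```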
A Generative AI Engineer is using an LLM to classify species of edible mushrooms based on text
descriptions of certain features. The model is returning accurate responses in testing and the
Generative AI Engineer is confident they have the correct list of possible labels, but the output
frequently contains additional reasoning in the answer when the Generative AI Engineer only wants
to return the label with no additional text.
Which action should they take to elicit the desired behavior from this LLM?
D
Explanation:
The LLM classifies mushroom species accurately but includes unwanted reasoning text, and the
engineer wants only the label. Let’s assess how to control output format effectively.
Option A: Use few shot prompting to instruct the model on expected output format
Few-shot prompting provides examples (e.g., input: description, output: label). It can work but
requires crafting multiple examples, which is effort-intensive and less direct than a clear instruction.
Databricks Reference: "Few-shot prompting guides LLMs via examples, effective for format control
but requires careful design" ("Generative AI Cookbook").
Option B: Use zero shot prompting to instruct the model on expected output format
Zero-shot prompting relies on a single instruction (e.g., “Return only the label”) without examples.
It’s simpler than few-shot but may not consistently enforce succinctness if the LLM’s default behavior
is verbose.
Databricks Reference: "Zero-shot prompting can specify output but may lack precision without
examples" ("Building LLM Applications with Databricks").
Option C: Use zero shot chain-of-thought prompting to prevent a verbose output format
Chain-of-Thought (CoT) encourages step-by-step reasoning, which increases verbosity—opposite to
the desired outcome. This contradicts the goal of label-only output.
Databricks Reference: "CoT prompting enhances reasoning but often results in detailed responses"
("Databricks Generative AI Engineer Guide").
Option D: Use a system prompt to instruct the model to be succinct in its answer
A system prompt (e.g., “Respond with only the species label, no additional text”) sets a global
instruction for the LLM’s behavior. It’s direct, reusable, and effective for controlling output style
across queries.
Databricks Reference: "System prompts define LLM behavior consistently, ideal for enforcing concise
outputs" ("Generative AI Cookbook," 2023).
Conclusion: Option D is the most effective and straightforward action, using a system prompt to
enforce succinct, label-only responses, aligning with Databricks’ best practices for output control.
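A minimal sketch of such a system prompt in a chat-style request (the label-list wording and the commented client call are illustrative):

```python
# Minimal sketch: a system prompt pins the output format so the model returns
# only the label.
messages = [
    {
        "role": "system",
        "content": (
            "You are a mushroom species classifier. Respond with exactly one "
            "label from the allowed list and no additional text or reasoning."
        ),
    },
    {"role": "user", "content": "Cap: convex, white gills, ring on stem, grows in grass."},
]
# response = client.chat.completions.create(model="<endpoint-name>", messages=messages)
# print(response.choices[0].message.content)   # e.g. a single species label
```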
A Generative AI Engineer needs to design an LLM pipeline to conduct multi-stage reasoning that
leverages external tools. To be effective at this, the LLM will need to plan and adapt actions while
performing complex reasoning tasks.
Which approach will do this?
B
Explanation:
The task requires an LLM pipeline for multi-stage reasoning with external tools, necessitating
planning, adaptability, and complex reasoning. Let’s evaluate the options based on Databricks’
recommendations for advanced LLM workflows.
Option A: Train the LLM to generate a single, comprehensive response without interacting with any
external tools, relying solely on its pre-trained knowledge
This approach limits the LLM to its static knowledge base, excluding external tools and multi-stage
reasoning. It can’t adapt or plan actions dynamically, failing the requirements.
Databricks Reference: "External tools enhance LLM capabilities beyond pre-trained knowledge"
("Building LLM Applications with Databricks," 2023).
Option B: Implement a framework like ReAct which allows the LLM to generate reasoning traces and
perform task-specific actions that leverage external tools if necessary
ReAct (Reasoning + Acting) combines reasoning traces (step-by-step logic) with actions (e.g., tool
calls), enabling the LLM to plan, adapt, and execute complex tasks iteratively. This meets all
requirements: multi-stage reasoning, tool use, and adaptability.
Databricks Reference: "Frameworks like ReAct enable LLMs to interleave reasoning and external tool
interactions for complex problem-solving" ("Generative AI Cookbook," 2023).
Option C: Encourage the LLM to make multiple API calls in sequence without planning or structuring
the calls, allowing the LLM to decide when and how to use external tools spontaneously
Unstructured, spontaneous API calls lack planning and may lead to inefficient or incorrect tool usage.
This doesn’t ensure effective multi-stage reasoning or adaptability.
Databricks Reference: Structured frameworks are preferred: "Ad-hoc tool calls can reduce reliability
in complex tasks" ("Building LLM-Powered Applications").
Option D: Use a Chain-of-Thought (CoT) prompting technique to guide the LLM through a series of
reasoning steps, then manually input the results from external tools for the final answer
CoT improves reasoning but relies on manual tool interaction, breaking automation and adaptability.
It’s not a scalable pipeline solution.
Databricks Reference: "Manual intervention is impractical for production LLM pipelines" ("Databricks
Generative AI Engineer Guide").
Conclusion: Option B (ReAct) is the best approach, as it integrates reasoning and tool use in a
structured, adaptive framework, aligning with Databricks’ guidance for complex LLM workflows.
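A minimal, framework-agnostic sketch of the ReAct control loop (the `Action: Tool[input]` format, the `llm` callable, and the `tools` dictionary are all assumptions; frameworks such as LangChain provide production-ready equivalents):

```python
# Minimal sketch of a ReAct-style control loop: the model interleaves
# "Thought"/"Action" steps, the pipeline executes the named tool, and the
# observation is fed back until the model emits a final answer.
# `llm` is any text-in/text-out callable; `tools` maps tool names to functions.
import re

def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                     # e.g. "Thought: ...\nAction: Search[showtimes]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match:
            tool_name, tool_input = match.group(1), match.group(2)
            observation = tools[tool_name](tool_input)   # call the external tool
            transcript += f"Observation: {observation}\n"
    return "No answer produced within the step budget."
```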
A Generative AI Engineer is building a system that will answer questions on currently unfolding news
topics. As such, it pulls information from a variety of sources including articles and social media
posts. They are concerned about toxic posts on social media causing toxic outputs from their system.
Which guardrail will limit toxic outputs?
A
Explanation:
The system answers questions on unfolding news topics using articles and social media, with a
concern about toxic outputs from toxic inputs. A guardrail must limit toxicity in the LLM’s responses.
Let’s evaluate the options.
Option A: Use only approved social media and news accounts to prevent unexpected toxic data from
getting to the LLM
Curating input sources (e.g., verified accounts) reduces exposure to toxic content at the data
ingestion stage, directly limiting toxic outputs. This is a proactive guardrail aligned with data quality
control.
Databricks Reference: "Control input data quality to mitigate unwanted LLM behavior, such as
toxicity" ("Building LLM Applications with Databricks," 2023).
Option B: Implement rate limiting
Rate limiting controls request frequency, not content quality. It prevents overload but doesn’t
address toxicity in social media inputs or outputs.
Databricks Reference: Rate limiting is for performance, not safety: "Use rate limits to manage
compute load" ("Generative AI Cookbook").
Option C: Reduce the amount of context items the system will include in consideration for its
response
Reducing context might limit exposure to some toxic items but risks losing relevant information, and
it doesn’t specifically target toxicity. It’s an indirect, imprecise fix.
Databricks Reference: Context reduction is for efficiency, not safety: "Adjust context size based on
performance needs" ("Databricks Generative AI Engineer Guide").
Option D: Log all LLM system responses and perform a batch toxicity analysis monthly
Logging and analyzing responses is reactive, identifying toxicity after it occurs rather than preventing
it. Monthly analysis doesn’t limit real-time toxic outputs.
Databricks Reference: Monitoring is for auditing, not prevention: "Log outputs for post-hoc analysis,
but use input filters for safety" ("Building LLM-Powered Applications").
Conclusion: Option A is the most effective guardrail, proactively filtering toxic inputs from unverified
sources, which aligns with Databricks’ emphasis on data quality as a primary safety mechanism for
LLM systems.
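A minimal sketch of this ingestion-time guardrail, using a hypothetical allowlist of approved accounts:

```python
# Minimal sketch: keep only posts authored by accounts on an approved allowlist
# before they reach the retrieval corpus. The account handles are illustrative.
APPROVED_ACCOUNTS = {"@reuters", "@apnews", "@bbcnews"}

def filter_approved(posts: list[dict]) -> list[dict]:
    """Drop posts from unapproved sources during ingestion."""
    return [p for p in posts if p.get("author", "").lower() in APPROVED_ACCOUNTS]
```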
A Generative AI Engineer wants the fine-tuned LLMs in their prod Databricks workspace to be
available for testing in their dev workspace as well. All of their workspaces are Unity Catalog enabled
and they are currently logging their models into the Model Registry in MLflow.
What is the most cost-effective and secure option for the Generative AI Engineer to accomplish their
goal?
D
Explanation:
The goal is to make fine-tuned LLMs from a production (prod) Databricks workspace available for
testing in a development (dev) workspace, leveraging Unity Catalog and MLflow, while ensuring cost-
effectiveness and security. Let’s analyze the options.
Option A: Use an external model registry which can be accessed from all workspaces
An external registry adds cost (e.g., hosting fees) and complexity (e.g., integration, security
configurations) outside Databricks’ native ecosystem, reducing security compared to Unity Catalog’s
governance.
Databricks Reference: "Unity Catalog provides a centralized, secure model registry within Databricks"
("Unity Catalog Documentation," 2023).
Option B: Setup a script to export the model from prod and import it to dev
Export/import scripts require manual effort, storage for model artifacts, and repeated execution,
increasing operational cost and risk (e.g., version mismatches, unsecured transfers). It’s less efficient
than a native solution.
Databricks Reference: Manual processes are discouraged when Unity Catalog offers built-in sharing:
"Avoid redundant workflows with Unity Catalog’s cross-workspace access" ("MLflow with Unity
Catalog").
Option C: Setup a duplicate training pipeline in dev, so that an identical model is available in dev
Duplicating the training pipeline doubles compute and storage costs, as it retrains the model from
scratch. It’s neither cost-effective nor necessary when the prod model can be reused securely.
Databricks Reference: "Re-running training is resource-intensive; leverage existing models where
possible" ("Generative AI Engineer Guide").
Option D: Use MLflow to log the model directly into Unity Catalog, and enable READ access in the
dev workspace to the model
Unity Catalog, integrated with MLflow, allows models logged in prod to be centrally managed and
accessed across workspaces with fine-grained permissions (e.g., READ for dev). This is cost-effective
(no extra infrastructure or retraining) and secure (governed by Databricks’ access controls).
Databricks Reference: "Log models to Unity Catalog via MLflow, then grant access to other
workspaces securely" ("MLflow Model Registry with Unity Catalog," 2023).
Conclusion: Option D leverages Databricks’ native tools (MLflow and Unity Catalog) for a seamless,
cost-effective, and secure solution, avoiding external systems, manual scripts, or redundant training.
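A minimal sketch of the workflow, where the catalog, schema, model, and group names plus `<run_id>` are placeholders, and the grant shown assumes the standard Unity Catalog EXECUTE privilege on registered models:

```python
# Minimal sketch: register the fine-tuned model to Unity Catalog from the prod
# workspace, then grant read access so the dev workspace can load the same model.
import mlflow

mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    "runs:/<run_id>/model",
    "prod_catalog.llm_models.customer_support_llm",
)

# Grant read access (EXECUTE on the registered model) to the dev team, e.g. in SQL:
#   GRANT EXECUTE ON MODEL prod_catalog.llm_models.customer_support_llm TO `dev-engineers`;
# From the dev workspace, the model can then be loaded directly:
#   model = mlflow.pyfunc.load_model("models:/prod_catalog.llm_models.customer_support_llm/1")
```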
A Generative AI Engineer is developing a RAG application and would like to experiment with
different embedding models to improve the application performance.
Which strategy for picking an embedding model should they choose?
A
Explanation:
The task involves improving a Retrieval-Augmented Generation (RAG) application’s performance by
experimenting with embedding models. The choice of embedding model impacts retrieval accuracy,
which is critical for RAG systems. Let’s evaluate the options based on Databricks Generative AI
Engineer best practices.
Option A: Pick an embedding model trained on related domain knowledge
Embedding models trained on domain-specific data (e.g., industry-specific corpora) produce vectors
that better capture the semantics of the application’s context, improving retrieval relevance. For
RAG, this is a key strategy to enhance performance.
Databricks Reference: "For optimal retrieval in RAG systems, select embedding models aligned with
the domain of your data" ("Building LLM Applications with Databricks," 2023).
Option B: Pick the most recent and most performant open LLM released at the time
LLMs are not embedding models; they generate text, not embeddings for retrieval. While recent
LLMs may be performant for generation, this doesn’t address the embedding step in RAG. This
option misunderstands the component being selected.
Databricks Reference: Embedding models and LLMs are distinct in RAG workflows: "Embedding
models convert text to vectors, while LLMs generate responses" ("Generative AI Cookbook").
Option C: Pick the embedding model ranked highest on the Massive Text Embedding Benchmark
(MTEB) leaderboard hosted by HuggingFace
The MTEB leaderboard ranks models across general tasks, but high overall performance doesn’t
guarantee suitability for a specific domain. A top-ranked model might excel in generic contexts but
underperform on the engineer’s unique data.
Databricks Reference: General performance is less critical than domain fit: "Benchmark rankings
provide a starting point, but domain-specific evaluation is recommended" ("Databricks Generative AI
Engineer Guide").
Option D: Pick an embedding model with multilingual support to support potential multilingual user
questions
Multilingual support is useful only if the application explicitly requires it. Without evidence of
multilingual needs, this adds complexity without guaranteed performance gains for the current use
case.
Databricks Reference: "Choose features like multilingual support based on application requirements"
("Building LLM-Powered Applications").
Conclusion: Option A is the best strategy because it prioritizes domain relevance, directly improving
retrieval accuracy in a RAG system—aligning with Databricks’ emphasis on tailoring models to
specific use cases.
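One hedged way to operationalize this strategy is to score each candidate embedding model on a small, domain-specific evaluation set; the sketch below computes recall@k with hypothetical `embed_fn`, `corpus`, and `eval_pairs` stand-ins:

```python
# Minimal sketch: compare candidate embedding models on the application's own
# domain data by measuring recall@k, i.e. whether the known-relevant document
# appears in the top-k retrieved results.
import numpy as np

def recall_at_k(embed_fn, corpus: list[str], eval_pairs: list[tuple[str, int]], k: int = 5) -> float:
    doc_vecs = np.array([embed_fn(doc) for doc in corpus], dtype=float)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for query, relevant_idx in eval_pairs:   # (query text, index of its relevant document)
        q = np.asarray(embed_fn(query), dtype=float)
        q /= np.linalg.norm(q)
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        hits += int(relevant_idx in top_k)
    return hits / len(eval_pairs)

# Run recall_at_k once per candidate embedding model and keep the best on-domain performer.
```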
A Generative AI Engineer is building an LLM-based application that has an
important transcription (speech-to-text) task. Speed is essential for the success of the application.
Which open Generative AI model should be used?
D
Explanation:
The task requires an open generative AI model for a transcription (speech-to-text) task where speed
is essential. Let’s assess the options based on their suitability for transcription and performance
characteristics, referencing Databricks’ approach to model selection.
Option A: Llama-2-70b-chat-hf
Llama-2 is a text-based LLM optimized for chat and text generation, not speech-to-text. It lacks
transcription capabilities.
Databricks Reference: "Llama models are designed for natural language generation, not audio
processing" ("Databricks Model Catalog").
Option B: MPT-30B-Instruct
MPT-30B is another text-based LLM focused on instruction-following and text generation, not
transcription. It’s irrelevant for speech-to-text tasks.
Databricks Reference: No specific mention, but MPT is categorized under text LLMs in Databricks’
ecosystem, not audio models.
Option C: DBRX
DBRX, developed by Databricks, is a powerful text-based LLM for general-purpose generation. It
doesn’t natively support speech-to-text and isn’t optimized for transcription.
Databricks Reference: "DBRX excels at text generation and reasoning tasks" ("Introducing DBRX,"
2023)—no mention of audio capabilities.
Option D: whisper-large-v3 (1.6B)
Whisper, developed by OpenAI, is an open-source model specifically designed for speech-to-text
transcription. The “large-v3” variant (1.6 billion parameters) balances accuracy and efficiency, with
optimizations for speed via quantization or deployment on GPUs—key for the application’s
requirements.
Databricks Reference: "For audio transcription, models like Whisper are recommended for their
speed and accuracy" ("Generative AI Cookbook," 2023). Databricks supports Whisper integration in
its MLflow or Lakehouse workflows.
Conclusion: Only D. whisper-large-v3 is a speech-to-text model, making it the sole suitable choice. Its
design prioritizes transcription, and its efficiency (e.g., via optimized inference) meets the speed
requirement, aligning with Databricks’ model deployment best practices.
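A minimal sketch using the Hugging Face transformers pipeline (assumes torch, transformers, and accelerate are installed; the audio file path is a placeholder):

```python
# Minimal sketch: run whisper-large-v3 for transcription via the transformers
# automatic-speech-recognition pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device_map="auto",             # place the model on a GPU if one is available
)

result = asr("customer_call.wav")  # path to the audio to transcribe
print(result["text"])
```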
A Generative AI Engineer is developing a RAG system for their company to perform internal
document Q&A for structured HR policies, but the answers returned are frequently incomplete and
unstructured. It seems that the retriever is not returning all relevant context. The Generative AI
Engineer has experimented with different embedding and response-generating LLMs, but that did not
improve results.
Which TWO options could be used to improve the response quality?
Choose 2 answers
A, B
Explanation:
The problem describes a Retrieval-Augmented Generation (RAG) system for HR policy Q&A where
responses are incomplete and unstructured due to the retriever failing to return sufficient context.
The engineer has already tried different embedding and response-generating LLMs without success,
suggesting the issue lies in the retrieval process—specifically, how documents are chunked and
indexed. Let’s evaluate the options.
Option A: Add the section header as a prefix to chunks
Adding section headers provides additional context to each chunk, helping the retriever understand
the chunk’s relevance within the document structure (e.g., “Leave Policy: Annual Leave” vs. just
“Annual Leave”). This can improve retrieval precision for structured HR policies.
Databricks Reference: "Metadata, such as section headers, can be appended to chunks to enhance
retrieval accuracy in RAG systems" ("Databricks Generative AI Cookbook," 2023).
Option B: Increase the document chunk size
Larger chunks include more context per retrieval, reducing the chance of missing relevant
information split across smaller chunks. For structured HR policies, this can ensure entire sections or
rules are retrieved together.
Databricks Reference: "Increasing chunk size can improve context completeness, though it may trade
off with retrieval specificity" ("Building LLM Applications with Databricks").
Option C: Split the document by sentence
Splitting by sentence creates very small chunks, which could exacerbate the problem by fragmenting
context further. This is likely why the current system fails—it retrieves incomplete snippets rather
than cohesive policy sections.
Databricks Reference: No specific extract opposes this, but the emphasis on context completeness in
RAG suggests smaller chunks worsen incomplete responses.
Option D: Use a larger embedding model
A larger embedding model might improve vector quality, but the question states that experimenting
with different embedding models didn’t help. This suggests the issue isn’t embedding quality but
rather chunking/retrieval strategy.
Databricks Reference: Embedding models are critical, but not the focus when retrieval context is the
bottleneck.
Option E: Fine tune the response generation model
Fine-tuning the LLM could improve response coherence, but if the retriever doesn’t provide
complete context, the LLM can’t generate full answers. The root issue is retrieval, not generation.
Databricks Reference: Fine-tuning is recommended for domain-specific generation, not retrieval fixes
("Generative AI Engineer Guide").
Conclusion: Options A and B address the retrieval issue directly by enhancing chunk context—either
through metadata (A) or size (B)—aligning with Databricks’ RAG optimization strategies. C would
worsen the problem, while D and E don’t target the root cause given prior experimentation.
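A minimal sketch combining both fixes, assuming the HR document has already been parsed into (header, body) pairs; the chunk size and splitting logic are illustrative:

```python
# Minimal sketch: prefix each chunk with its section header and use a larger
# chunk size so whole policy sections stay together.
def chunk_policy(sections: list[tuple[str, str]], chunk_size: int = 1500) -> list[str]:
    """sections: (header, body) pairs parsed from the HR policy document."""
    chunks = []
    for header, body in sections:
        for start in range(0, len(body), chunk_size):
            chunks.append(f"{header}\n{body[start:start + chunk_size]}")
    return chunks
```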
A Generative AI Engineer at an automotive company would like to build a question-answering
chatbot for customers to inquire about their vehicles. They have a database containing various
documents of different vehicle makes, their hardware parts, and common maintenance information.
Which of the following components will NOT be useful in building such a chatbot?
B
Explanation:
The task involves building a question-answering chatbot for an automotive company using a
database of vehicle-related documents. The chatbot must efficiently process customer inquiries and
provide accurate responses. Let’s evaluate each component to determine which is not useful, per
Databricks Generative AI Engineer principles.
Option A: Response-generating LLM
An LLM is essential for generating natural language responses to customer queries based on
retrieved information. This is a core component of any chatbot.
Databricks Reference: "The response-generating LLM processes retrieved context to produce
coherent answers" ("Building LLM Applications with Databricks," 2023).
Option B: Invite users to submit long, rather than concise, questions
Encouraging long questions is a user interaction design choice, not a technical component of the
chatbot’s architecture. Moreover, long, verbose questions can complicate intent detection and
retrieval, reducing efficiency and accuracy—counter to best practices for chatbot design. Concise
questions are typically preferred for clarity and performance.
Databricks Reference: While not explicitly stated, Databricks’ "Generative AI Cookbook" emphasizes
efficient query processing, implying that simpler, focused inputs improve LLM performance. Inviting
long questions doesn’t align with this.
Option C: Vector database
A vector database stores embeddings of the vehicle documents, enabling fast retrieval of relevant
information via semantic search. This is critical for a question-answering system with a large
document corpus.
Databricks Reference: "Vector databases enable scalable retrieval of context from large datasets"
("Databricks Generative AI Engineer Guide").
Option D: Embedding model
An embedding model converts text (documents and queries) into vector representations for
similarity search. It’s a foundational component for retrieval-augmented generation (RAG) in
chatbots.
Databricks Reference: "Embedding models transform text into vectors, facilitating efficient matching
of queries to documents" ("Building LLM-Powered Applications").
Conclusion: Option B is not a useful component in building the chatbot. It’s a user-facing suggestion
rather than a technical building block, and it could even degrade performance by introducing
unnecessary complexity. Options A, C, and D are all integral to a Databricks-aligned chatbot
architecture.
Which TWO chain components are required for building a basic LLM-enabled chat application that
includes conversational capabilities, knowledge retrieval, and contextual memory?
B, C
Explanation:
Building a basic LLM-enabled chat application with conversational capabilities, knowledge retrieval,
and contextual memory requires specific components that work together to process queries,
maintain context, and retrieve relevant information. Databricks’ Generative AI Engineer
documentation outlines key components for such systems, particularly in the context of frameworks
like LangChain or Databricks’ MosaicML integrations. Let’s evaluate the required components:
Understanding the Requirements:
Conversational capabilities: The app must generate natural, coherent responses.
Knowledge retrieval: It must access external or domain-specific knowledge.
Contextual memory: It must remember prior interactions in the conversation.
Databricks Reference: "A typical LLM chat application includes a memory component to track
conversation history and a retrieval mechanism to incorporate external knowledge" ("Databricks
Generative AI Cookbook," 2023).
Evaluating the Options:
A. (Q): This appears incomplete or unclear (possibly a typo). Without further context, it's not a valid
component.
B. Vector Stores: These store embeddings of documents or knowledge bases, enabling semantic
search and retrieval of relevant information for the LLM. This is critical for knowledge retrieval in a
chat application.
Databricks Reference: "Vector stores, such as those integrated with Databricks’ Lakehouse, enable
efficient retrieval of contextual data for LLMs" ("Building LLM Applications with Databricks").
C. Conversation Buffer Memory: This component stores the conversation history, allowing the LLM
to maintain context across multiple turns. It’s essential for contextual memory.
Databricks Reference: "Conversation Buffer Memory tracks prior user inputs and LLM outputs,
ensuring context-aware responses" ("Generative AI Engineer Guide").
D. External tools: These (e.g., APIs or calculators) enhance functionality but aren’t required for a
basic chat app with the specified capabilities.
E. Chat loaders: These might refer to data loaders for chat logs, but they’re not a core chain
component for conversational functionality or memory.
F. React Components: These relate to front-end UI development, not the LLM chain’s backend
functionality.
Selecting the Two Required Components:
For knowledge retrieval, Vector Stores (B) are necessary to fetch relevant external data, a
cornerstone of Databricks’ RAG-based chat systems.
For contextual memory, Conversation Buffer Memory (C) is required to maintain conversation
history, ensuring coherent and context-aware responses.
While an LLM itself is implied as the core generator, the question asks for chain components beyond
the model, making B and C the minimal yet sufficient pair for a basic application.
Conclusion: The two required chain components are B. Vector Stores and C. Conversation Buffer
Memory, as they directly address knowledge retrieval and contextual memory, respectively, aligning
with Databricks’ documented best practices for LLM-enabled chat applications.
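A minimal, framework-agnostic sketch of how the two components fit together (the `vector_store` and `llm` objects are hypothetical stand-ins; LangChain's Vector Stores and Conversation Buffer Memory play the same roles):

```python
# Minimal sketch: a vector store provides knowledge retrieval and a conversation
# buffer provides contextual memory; together they feed the response-generating LLM.
class ConversationBufferMemory:
    def __init__(self):
        self.turns: list[str] = []

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append(f"User: {user_msg}\nAssistant: {assistant_msg}")

    def history(self) -> str:
        return "\n".join(self.turns)

def chat_turn(question: str, memory: ConversationBufferMemory, vector_store, llm) -> str:
    context = "\n".join(vector_store.search(question, k=3))     # knowledge retrieval
    prompt = (
        f"Conversation so far:\n{memory.history()}\n\n"
        f"Relevant context:\n{context}\n\n"
        f"Question: {question}"
    )
    answer = llm(prompt)
    memory.add(question, answer)                                # contextual memory
    return answer
```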