Questions for the DATABRICKS CERTIFIED PROFESSIONAL DATA ENGINEER were updated on: Dec 01, 2025
A data engineering team uses Databricks Lakehouse Monitoring to track the percent_null metric for a
critical column in their Delta table.
The profile metrics table (prod_catalog.prod_schema.customer_data_profile_metrics) stores hourly
percent_null values.
The team wants to:
Trigger an alert when the daily average of percent_null exceeds 5% for three consecutive days.
Ensure that notifications are not spammed during sustained issues.
Options:
A.
SELECT percent_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
Alert Condition: percent_null > 5
Notification Frequency: At most every 24 hours
B.
WITH daily_avg AS (
SELECT DATE_TRUNC('DAY', window.end) AS day,
AVG(percent_null) AS avg_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
GROUP BY DATE_TRUNC('DAY', window.end)
)
SELECT day, avg_null
FROM daily_avg
ORDER BY day DESC
LIMIT 3
Alert Condition: ALL avg_null > 5 for the latest 3 rows
Notification Frequency: Just once
C.
SELECT AVG(percent_null) AS daily_avg
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: daily_avg > 5
Notification Frequency: Each time alert is evaluated
D.
SELECT SUM(CASE WHEN percent_null > 5 THEN 1 ELSE 0 END) AS violation_days
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: violation_days >= 3
Notification Frequency: Just once
B
Explanation:
The key requirement is to detect when the daily average of percent_null is greater than 5% for three
consecutive days.
Option A only evaluates raw hourly values from the last 24 hours; it neither computes daily averages
nor requires three consecutive days of violations, so a single hourly spike would trigger an alert.
Option C calculates an average across all records in the last 3 days, but this could be skewed by one
high or low day — it does not ensure consecutive daily violations.
Option D simply counts days where the threshold was exceeded, but it does not guarantee that those
days were consecutive. This could incorrectly trigger on non-adjacent violations.
Option B is correct:
It aggregates hourly values into daily averages.
It checks that the last 3 consecutive days all had averages above 5%.
It avoids redundant alerts by using Notification Frequency: Just once.
This matches Databricks Lakehouse Monitoring best practices, where SQL alerts should be designed
to aggregate metrics to the correct granularity (daily here) and ensure consecutive threshold
violations before triggering.
Reference (Databricks Lakehouse Monitoring, SQL Alerts Best Practices):
Use DATE_TRUNC to compute metrics at the correct time granularity.
To detect consecutive-day issues, filter the last N daily aggregates and check conditions across all
rows.
Always configure alerts with controlled notification frequency to prevent alert fatigue.
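As an illustration, a minimal notebook sketch of the same three-consecutive-day check in PySpark (the alert itself would still be defined on the SQL query from option B; the table and column names follow the question):

from pyspark.sql import functions as F

daily = (
    spark.table("prod_catalog.prod_schema.customer_data_profile_metrics")
    .groupBy(F.date_trunc("DAY", F.col("window.end")).alias("day"))
    .agg(F.avg("percent_null").alias("avg_null"))
    .orderBy(F.col("day").desc())
    .limit(3)
)

# Breached only when all three most recent daily averages exceed 5%
breached = daily.count() == 3 and daily.filter("avg_null <= 5").count() == 0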
A facilities-monitoring team is building a near-real-time Power BI dashboard off the Delta table
device_readings:
Columns:
device_id (STRING, unique sensor ID)
event_ts (TIMESTAMP, ingestion timestamp UTC)
temperature_c (DOUBLE, temperature in °C)
Requirement:
For each sensor, generate one row per non-overlapping 5-minute interval, offset by 2 minutes (e.g.,
00:02–00:07, 00:07–00:12, …).
Each row must include interval start, interval end, and average temperature in that slice.
Downstream BI tools (e.g., Power BI) must use the interval timestamps to plot time-series bars.
Options:
A.
WITH buckets AS (
SELECT device_id,
window(event_ts, '5 minutes', '5 minutes', '2 minutes') AS win,
temperature_c
FROM device_readings
)
SELECT device_id,
win.start AS bucket_start,
win.end AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM buckets
GROUP BY device_id, win
ORDER BY device_id, bucket_start;
B.
SELECT device_id,
event_ts,
AVG(temperature_c) OVER (
PARTITION BY device_id
ORDER BY event_ts
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS avg_temp_5m
FROM device_readings
WINDOW w AS (window(event_ts, '5 minutes', '2 minutes'));
C.
SELECT device_id,
date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 2 MINUTES AS bucket_start,
date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 7 MINUTES AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, date_trunc('minute', event_ts - INTERVAL 2 MINUTES)
ORDER BY device_id, bucket_start;
D.
SELECT device_id,
window.start AS bucket_start,
window.end AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, window(event_ts, '5 minutes', '2 minutes', '5 minutes')
ORDER BY device_id, bucket_start;
A
Explanation:
The correct way to produce non-overlapping windows with an offset in Databricks SQL is to use the
window function with a time column plus three duration parameters: window duration, slide duration, and start offset.
In option A, the function call:
window(event_ts, '5 minutes', '5 minutes', '2 minutes')
creates 5-minute windows that slide every 5 minutes, with a 2-minute offset, which exactly matches
the requirement (intervals like 00:02–00:07, 00:07–00:12, …).
Option B is incorrect because it uses a windowed aggregation with RANGE, which produces
overlapping sliding averages, not discrete non-overlapping buckets.
Option C manually constructs bucket boundaries with date_trunc and offsets, but this is brittle and
less efficient than the built-in window function.
Option D passes the four parameters to window in the wrong order ('5 minutes', '2 minutes', '5
minutes'), so the slide duration becomes 2 minutes and the start offset 5 minutes. This creates
overlapping sliding windows rather than true non-overlapping shifted windows.
Reference (Databricks SQL Windowing Functions):
Databricks documentation specifies that:
window(time_col, windowDuration, slideDuration, startTime)
produces tumbling or sliding windows. When slideDuration = windowDuration, it produces non-
overlapping tumbling windows. The startTime argument allows for offset windows, which is why '2
minutes' ensures alignment at 00:02, 00:07, etc.
Thus, A is the only correct solution as it directly implements non-overlapping, offset-based tumbling
windows.
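As an illustration, a minimal PySpark sketch of the same bucketing using the DataFrame API (assuming the device_readings table is reachable under that name):

from pyspark.sql import functions as F

bucketed = (
    spark.table("device_readings")
    # 5-minute tumbling windows (slide = window duration), offset by 2 minutes
    .groupBy("device_id", F.window("event_ts", "5 minutes", "5 minutes", "2 minutes").alias("win"))
    .agg(F.avg("temperature_c").alias("avg_temp_5m"))
    .select(
        "device_id",
        F.col("win.start").alias("bucket_start"),
        F.col("win.end").alias("bucket_end"),
        "avg_temp_5m",
    )
    .orderBy("device_id", "bucket_start")
)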
Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code
in PySpark?
A.
class Pipeline:
    def transform(self, df):
        return df.withColumn("value_upper", upper(col("value")))

pipeline = Pipeline()
assertDataFrameEqual(pipeline.transform(test_input), expected)
B.
def upper_value(df):
    return df.withColumn("value_upper", upper(col("value")))

def filter_positive(df):
    return df.filter(df["id"] > 0)

pipeline_df = df.transform(upper_value).transform(filter_positive)
C.
def upper_transform(df):
    return df.withColumn("value_upper", upper(col("value")))

actual = test_input.transform(upper_transform)
assertDataFrameEqual(actual, expected)
D.
def transform_data(input_df):
    # transformation logic here
    return output_df

test_input = spark.createDataFrame([(1, "a")], ["id", "value"])
assertDataFrameEqual(transform_data(test_input), expected)
B
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
Databricks and Apache Spark recommend building modular and reusable ETL transformations by
leveraging the DataFrame.transform() API. This method allows you to chain multiple transformation
functions in a clean and testable way.
Option A: Encapsulating the logic in a class (Pipeline) works, but it reduces modularity and flexibility,
and it does not demonstrate the intended use of DataFrame.transform(), which is chaining functional
transformations.
Option B: This is the correct approach. It defines small, reusable functions (upper_value,
filter_positive) that each take a DataFrame and return a transformed DataFrame. By chaining them
with df.transform(func), you can compose ETL pipelines in a clear and declarative manner. This
enables unit testing of individual functions and makes the ETL pipeline modular, testable, and
production-ready.
Option C: This shows a single transformation wrapped in a function and tested, but it lacks pipeline
composition — it is not demonstrating modular chaining across multiple transformations.
Option D: This simply defines a transformation function with hardcoded logic. It does not leverage
DataFrame.transform() nor demonstrate modularity through composition.
Therefore, Option B is the best demonstration of how to use DataFrame.transform() in PySpark ETL
pipelines.
Databricks documentation explicitly highlights that DataFrame.transform() allows developers to
“chain together reusable functions in a readable and modular way, improving testability and
maintainability of ETL code.” This makes B the correct and officially supported pattern.
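For illustration, a self-contained sketch of the option B pattern with a unit test. It assumes PySpark 3.5+ (or a recent Databricks Runtime), where assertDataFrameEqual is available in pyspark.testing; the sample data values are made up:

from pyspark.sql.functions import upper, col
from pyspark.testing import assertDataFrameEqual

def upper_value(df):
    return df.withColumn("value_upper", upper(col("value")))

def filter_positive(df):
    return df.filter(col("id") > 0)

test_input = spark.createDataFrame([(1, "a"), (-2, "b")], ["id", "value"])
expected = spark.createDataFrame([(1, "a", "A")], ["id", "value", "value_upper"])

# Compose the pipeline from the two reusable functions and verify the result
actual = test_input.transform(upper_value).transform(filter_positive)
assertDataFrameEqual(actual, expected)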
A data engineer is tasked with ensuring that a Delta table in Databricks continuously retains deleted
files for 15 days (instead of the default 7 days), in order to permanently comply with the
organization’s data retention policy.
Which code snippet correctly sets this retention period for deleted files?
A
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
In Delta Lake, the property delta.deletedFileRetentionDuration controls how long deleted data files
are retained before being permanently removed during a VACUUM operation.
By default, this retention duration is set to 7 days.
To comply with stricter retention requirements, organizations can explicitly update the table property
using an ALTER TABLE statement.
Option A uses the correct SQL command:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 15 days')
This updates the Delta table metadata so that all future operations respect the 15-day retention
policy for deleted files.
Why not the others?
B: This code incorrectly tries to set the property via the DeltaTable API. Delta’s Python API does not
expose direct attributes like deletedFileRetentionDuration; instead, properties must be set through
ALTER TABLE or DataFrameWriter options.
C: VACUUM ... RETAIN specifies a one-time file cleanup action (e.g., retaining 15 hours of history),
not a persistent retention policy. It cannot be used to set a continuous retention duration.
D: Setting spark.conf applies a session-level configuration and does not permanently update the
table’s retention metadata. Once the session ends, this configuration is lost.
Therefore, Option A is the correct and documented approach for persistently enforcing a 15-day
deleted file retention period in Delta Lake.
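A minimal notebook sketch of applying and verifying the property (my_table is the placeholder table name used above):

spark.sql(
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 15 days')"
)

# Confirm the property is now part of the table metadata
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)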
A data engineer is creating a data ingestion pipeline to understand where customers are taking their
rented bicycles during use. The engineer noticed that, over time, data being transmitted from the
bicycle sensors fail to include key details like latitude and longitude. Downstream analysts need both
the clean records and the quarantined records available for separate processing.
The data engineer already has this code:
import dlt
from pyspark.sql.functions import expr

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)"
}
quarantine_rules = "NOT({})".format(" AND ".join(rules.values()))

@dlt.view
def raw_trips_data():
    return spark.readStream.table("ride_and_go.telemetry.trips")
How should the data engineer meet the requirements to capture good and bad data?
A.
@dlt.table(name="trips_data_quarantine")
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .filter(expr(quarantine_rules))
    )
B.
@dlt.view
@dlt.expect_or_drop("lat_long_present", "(lat IS NOT NULL AND long IS NOT NULL)")
def trips_data_quarantine():
    return spark.readStream.table("ride_and_go.telemetry.trips")
C.
@dlt.table
@dlt.expect_all_or_drop(rules)
def trips_data_quarantine():
    return spark.readStream.table("raw_trips_data")
D.
@dlt.table(partition_cols=["is_quarantined"])
@dlt.expect_all(rules)
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .withColumn("is_quarantined", expr(quarantine_rules))
    )
A
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
The requirement is that both valid (good) and invalid (bad) records must be captured and available
separately for downstream processing. Invalid records should not simply be dropped; they must be
quarantined in a dedicated table.
In Databricks Lakeflow Declarative Pipelines (DLT), this is achieved by creating separate output
tables:
One table for valid records (Silver table) that pass the expectations.
Another quarantine table that explicitly captures records failing the expectations.
Option A correctly implements this by:
Declaring a DLT table trips_data_quarantine.
Using .filter(expr(quarantine_rules)) to isolate invalid records (records where latitude or longitude is
NULL).
This ensures analysts can query both good records (from the main Silver pipeline table) and bad
records (from the quarantine table).
Why not the others?
B: Uses @dlt.expect_or_drop, which drops invalid records instead of quarantining them. This violates
the requirement that quarantined data should be available.
C: Same as B, but applies expectations in bulk with expect_all_or_drop. Again, bad data is dropped,
not quarantined.
D: Adds an is_quarantined flag in the same table. While it marks bad records, it does not separate
them into a distinct quarantine table as required by the business use case.
Therefore, Option A is the only solution aligned with Databricks documentation for quarantining
invalid data into a dedicated table while keeping valid data in the main pipeline.
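For completeness, a minimal sketch of the companion table that keeps only the valid records, reusing the rules dictionary from the question; the table name trips_data_clean is illustrative:

import dlt

@dlt.table(name="trips_data_clean")
@dlt.expect_all_or_drop(rules)
def trips_data_clean():
    return spark.readStream.table("raw_trips_data")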
A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The
join hinges on finding which IPv4 range each event’s address falls into.
Table 1: network_events (≈ 5 billion rows)
Columns: event_id, ip_int (sample ip_int value: 3232235777)
Table 2: ip_ranges (≈ 2 million rows)
Columns: start_ip_int, end_ip_int, country (sample row: 3232235520, 3232236031, US)
The query is currently very slow:
SELECT n.event_id, n.ip_int, r.country
FROM network_events n
JOIN ip_ranges r
ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;
Which change will most dramatically accelerate the query while preserving its logic?
B
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
The query joins billions of rows (network_events) with millions of rows (ip_ranges) using a range
predicate (BETWEEN). Unlike equality joins (=), range joins are not efficiently handled by broadcast
or sort-merge joins because:
Broadcast Join (D): Broadcasting the small table avoids a shuffle, but broadcast hash joins require
equality predicates. With only a range condition, Spark falls back to a broadcast nested loop join,
which still compares every event against every range.
Sort-Merge Join (C): Works for ordered joins but is inefficient on range conditions. Sorting billions of
records adds excessive overhead and will not resolve the bottleneck.
Increasing Shuffle Partitions (A): Only spreads out shuffle work but does not address the fundamental
inefficiency of range-based lookups at scale.
Range Joins in Spark (RANGE_JOIN hint):
Databricks provides a range join optimization specifically for conditions such as BETWEEN. By
applying a RANGE_JOIN hint with an appropriate bin size, Spark bins the IP ranges so that each event
only needs to be compared against the few ranges falling in its bin, avoiding brute-force comparisons
and unnecessary shuffle cost.
Thus, Option B is the correct solution because:
It leverages range-join optimization, which is purpose-built for queries joining massive event logs to
smaller lookup tables with IP ranges.
This ensures Spark can evaluate billions of rows against millions of ranges with optimized matching
logic, drastically improving query performance while preserving correctness.
Reference: Databricks SQL Performance Tuning Guide – Range Joins and Join Hints (RANGE_JOIN,
BROADCAST, MERGE).
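As a sketch, the hinted version of the query might look like the following; the bin size (512 here) is an assumption and should be tuned to the typical width of the IP ranges:

geo_enriched = spark.sql("""
    SELECT /*+ RANGE_JOIN(r, 512) */
           n.event_id, n.ip_int, r.country
    FROM network_events n
    JOIN ip_ranges r
      ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int
""")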
A data engineer is using Lakeflow Declarative Pipelines Expectations feature to track the data quality
of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they
are currently flagging those rows with a warning and writing them to the silver table along with the
good data. They’ve been given a new requirement – the bad rows need to be quarantined in a
separate quarantine table and no longer included in the silver table.
This is the existing code for their silver table:
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
What code will satisfy the requirements?
A.
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
B.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
C.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
D.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
C
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
Lakeflow Declarative Pipelines (DLT) supports data quality enforcement using @dlt.expect,
@dlt.expect_or_drop, and @dlt.expect_all.
@dlt.expect applies a rule and records pass/fail metrics, but it does not drop failing rows; they
continue to flow into the target table.
@dlt.expect_or_drop enforces that only rows passing the condition flow downstream, dropping bad
records automatically.
In this case, the requirement is:
Good rows (reading < 120) go to the silver table.
Bad rows (reading >= 120) go to a quarantine table.
Bad rows should not be included in silver.
The correct implementation is Option C, where:
The silver table uses @dlt.expect_or_drop to keep only rows with reading < 120, so bad readings no
longer reach the silver table.
The quarantine table uses @dlt.expect_or_drop with the inverted condition reading >= 120, so it
retains exactly the bad records for separate processing.
Other options are incorrect:
Option A: Both tables use @dlt.expect, which only records violations; bad rows would still land in the
silver table, and the quarantine table would contain every row.
Option B: The quarantine table repeats the reading < 120 condition, so it would capture the good
rows rather than the bad ones.
Option D: The quarantine table uses @dlt.expect instead of @dlt.expect_or_drop, so it would contain
all rows (good and bad) rather than only the quarantined records.
Thus, Option C meets the business requirement to split good and bad data streams while ensuring
both are captured for auditing and processing.
Reference: Databricks Documentation – Delta Live Tables (DLT) Expectations: @dlt.expect,
@dlt.expect_or_drop, and Quarantine Tables.
A data engineer is configuring a Lakeflow Declarative Pipeline to process CDC (Change Data Capture)
data from a source. The source events sometimes arrive out of order, and multiple updates may
occur with the same update_timestamp but with different update_sequence_id.
What should the data engineer do to ensure events are sequenced correctly?
C
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
When handling CDC data, sequencing is critical because updates may arrive out of order or multiple
changes may occur for the same record at the same timestamp. Databricks’ AUTO CDC APIs provide
built-in constructs to handle ordering logic.
The correct mechanism is to use the SEQUENCE BY clause in the CDC configuration. Specifically,
when both update_timestamp and update_sequence_id exist, the recommended approach is:
SEQUENCE BY STRUCT(update_timestamp, update_sequence_id)
This ensures that within the same record key, the engine applies updates in the exact sequence they
occurred, resolving conflicts where multiple updates share the same timestamp but differ in
sequence ID.
Option A (track_history_column_list) is used for historical tracking and auditing changes, not for
sequencing logic. It ensures lineage but does not enforce correct event order.
Option B (dropDuplicates()) only removes exact duplicates; it cannot guarantee sequencing
correctness when multiple updates exist.
Option C is correct: SEQUENCE BY STRUCT(update_timestamp, update_sequence_id) explicitly
enforces ordering, as recommended by the CDC pipeline guidelines.
Option D (window function) would be a manual approach in Spark Structured Streaming, but
Lakeflow Declarative Pipelines already provide native CDC sequencing support, making this
unnecessary.
Thus, the best practice per Databricks CDC documentation is to use Option C with SEQUENCE BY
STRUCT.
Reference: Databricks Lakeflow Declarative Pipelines — AUTO CDC APIs and Sequencing with
SEQUENCE BY
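A minimal Python sketch of this sequencing in a declarative pipeline; the target, source, and key names are illustrative, and the essential part is the composite sequence_by expression:

import dlt
from pyspark.sql.functions import struct

dlt.create_streaming_table("orders_silver")

dlt.apply_changes(
    target="orders_silver",
    source="orders_cdc_feed",
    keys=["order_id"],
    # Resolve ties on update_timestamp using update_sequence_id
    sequence_by=struct("update_timestamp", "update_sequence_id"),
    stored_as_scd_type=1,
)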
A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The
pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is
not null and amount is greater than zero. Invalid records should be dropped.
Which Lakeflow Declarative Pipelines configurations implement this requirement using Python?
A.
@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )
B.
@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")
C.
@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )
D.
@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")
D
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
Lakeflow Declarative Pipelines (LDP), formerly Delta Live Tables (DLT), supports enforcing data quality
using expectations. Expectations can either:
Track violations (expect) → records that do not meet conditions are flagged but still included in the
pipeline.
Drop violations (expect_or_drop) → records that do not meet conditions are excluded from
downstream tables.
Fail pipeline on violations (expect_or_fail) → records that fail conditions stop the pipeline.
In this scenario, the requirement explicitly states that invalid records (where customer_id is null or
amount ≤ 0) must be dropped. According to the official documentation, this is done by applying
@dlt.expect_or_drop("expectation_name", "SQL_predicate") as a decorator on the dataset definition.
Option D is correct: It stacks @dlt.expect_or_drop decorators on the table function for both rules,
ensuring records that fail either condition are removed before they are written to the silver table.
Option B incorrectly uses @dlt.expect decorators, which only track violations but do not drop invalid
rows.
Option C calls .expect as a DataFrame method, which does not exist; even as a decorator, expect only
flags rows rather than dropping them.
Option A calls .expect_or_drop as a DataFrame method, which is not part of the Python API;
expectations are applied as decorators on the dataset function, not as DataFrame methods, so this
code would fail.
Therefore, the correct solution is Option D, which ensures compliance by enforcing data quality and
dropping invalid rows programmatically during ingestion.
Reference: Databricks Lakeflow Declarative Pipelines Documentation — Expectations (expect,
expect_or_drop, expect_or_fail)
A data engineering team needs to implement a tagging system for their tables as part of an
automated ETL process, and needs to apply tags programmatically to tables in Unity Catalog.
Which SQL command adds tags to a table programmatically?
A
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
Unity Catalog in Databricks provides the ability to attach tags (key-value metadata pairs) to securable
objects such as catalogs, schemas, tables, volumes, and functions. Tags are critical for governance,
compliance, and automation, as they allow organizations to track metadata like sensitivity,
ownership, business purpose, and retention policies directly at the object level.
According to the official Databricks SQL reference for Unity Catalog, the correct way to
programmatically add tags to a table is by using the ALTER TABLE … SET TAGS command. The syntax
is:
ALTER TABLE table_name SET TAGS ('tag_name' = 'tag_value', ...);
This command can be used within ETL workflows or jobs to automatically apply metadata during or
after ingestion, ensuring that governance and compliance rules are embedded in the pipeline itself.
Option A is correct because it uses the supported syntax for applying tags.
Option B (APPLY TAGS) is not valid SQL in Unity Catalog and is not recognized by Databricks.
Option C confuses COMMENT with TAGS. While COMMENT can add descriptive text to a table, it
does not handle tags.
Option D (SET TAGS FOR) is not a valid SQL construct in Databricks for applying tags.
Thus, Option A is the only valid and documented way to programmatically set tags on a table in Unity
Catalog.
Reference: Databricks SQL Language Reference — ALTER TABLE … SET TAGS (Unity Catalog)
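A minimal sketch of applying tags from an automated ETL step; the catalog, schema, table, and tag names are illustrative:

spark.sql("""
    ALTER TABLE main.sales.orders
    SET TAGS ('sensitivity' = 'pii', 'owner' = 'data-engineering')
""")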
A data engineer is configuring a Databricks Asset Bundle to deploy a job with granular permissions.
The requirements are:
• Grant the data-engineers group CAN_MANAGE access to the job.
• Ensure the auditors’ group can view the job but not modify/run it.
• Avoid granting unintended permissions to other users/groups.
How should the data engineer deploy the job while meeting the requirements?
A.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
        - group_name: admin-team
          level: IS_OWNER
B.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
      permissions:
        - group_name: auditors
          level: CAN_VIEW
C.
permissions:
  - group_name: data-engineers
    level: CAN_MANAGE
  - group_name: auditors
    level: CAN_VIEW

resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
D.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
D
Explanation:
Comprehensive and Detailed Explanation from Databricks Documentation:
Databricks Asset Bundles (DABs) allow jobs, clusters, and permissions to be defined as code in YAML
configuration files. According to the Databricks documentation on job permissions and bundle
deployment, when defining permissions within a job resource, they must be scoped directly under
that specific job’s definition. This ensures that permissions are applied only to the intended job
resource and not inadvertently propagated to other jobs or resources.
In this scenario, the data engineer must grant the data-engineers group CAN_MANAGE access,
allowing them to configure, edit, and manage the job, while the auditors group should only have
CAN_VIEW, giving them read-only access to see configurations and results without the ability to
modify or execute. Importantly, no additional groups should be granted permissions, in order to
follow the principle of least privilege.
Option A grants IS_OWNER to an additional admin-team group that was never requested, violating
least privilege. Option B duplicates the permissions key (and includes an invalid job field), so the
configuration is malformed or only one of the grants takes effect. Option C places the permissions
block at the top level of the bundle, where it applies to every resource in the bundle rather than only
this job.
Option D is the correct approach because it defines the job resource my-job with its name, tasks,
clusters, and the exact intended permissions (CAN_MANAGE for data-engineers and CAN_VIEW for
auditors). This aligns with Databricks’ principle of least privilege and ensures compliance with
governance standards in Unity Catalog-enabled workspaces.
Reference: Databricks Asset Bundles documentation — Managing Jobs and Permissions
A data engineer needs to install the PyYAML Python package within an air-gapped Databricks
environment. The workspace has no direct internet access to PyPI. The engineer has downloaded the
.whl file locally and wants it available automatically on all new clusters.
Which approach should the data engineer use?
B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract of Databricks Data Engineer
Documents:
For secure, air-gapped Databricks deployments, the recommended practice is to host dependency
files such as .whl packages in Unity Catalog Volumes — a managed storage layer governed by Unity
Catalog.
Once stored in a volume, these files can be safely referenced from cluster-scoped init scripts, which
automatically execute installation commands (e.g., pip install
/Volumes/catalog/schema/path/PyYAML.whl) during cluster startup.
This ensures consistent environment setup across clusters and compliance with data governance
rules.
User directories (A) lack enterprise security controls; private repositories (C) are not viable in air-
gapped setups; and Git repos (D) do not trigger package installation. Therefore, B is the correct and
officially approved method.
A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream
where late-arriving data is common.
Which approach should the data engineer use?
D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract of Databricks Data Engineer
Documents:
In Structured Streaming, event-time watermarks control how long the engine waits for late-arriving
data before finalizing aggregations. By setting an appropriate watermark, Databricks can handle late
data gracefully — incorporating records that arrive within the defined window while discarding
excessively delayed events.
This approach ensures accurate aggregations, minimizes state size, and prevents memory leaks.
Manual reprocessing (A) or overwriting entire datasets (B) is inefficient and costly, while Auto CDC
(C) is used for change tracking in Delta tables, not for streaming event lateness.
Thus, using watermarking is the recommended and official approach for managing late data in
streaming pipelines.
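A minimal Structured Streaming sketch of the watermarking approach; the source table, column names, and the 10-minute tolerance are assumptions:

from pyspark.sql import functions as F

events = spark.readStream.table("bronze_events")

aggregated = (
    events
    # Wait up to 10 minutes for late events before finalizing a window
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .count()
)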
A data engineering team has a time-consuming data ingestion job with three data sources. Each
notebook takes about one hour to load new data. One day, the job fails because a notebook update
introduced a new required configuration parameter. The team must quickly fix the issue and load the
latest data from the failing source.
Which action should the team take?
A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract of Databricks Data Engineer
Documents:
The repair run capability in Databricks Jobs allows re-execution of failed tasks without re-running
successful ones. When a parameterized job fails due to missing or incorrect task configuration,
engineers can perform a repair run to fix inputs or parameters and resume from the failed state.
This approach saves time, reduces cost, and ensures workflow continuity by avoiding unnecessary
recomputation. Additionally, updating the task definition with the missing parameter prevents future
runs from failing.
Running the job manually (B) loses run context; (C) alone does not prevent recurrence; (D) delays
resolution. Thus, A follows the correct operational and recovery practice.
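For illustration, one way to trigger such a repair programmatically, assuming the Databricks SDK for Python is available; the run ID, task key, and parameter name are placeholders:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
w.jobs.repair_run(
    run_id=123456,
    rerun_tasks=["ingest_source_3"],           # re-run only the failed ingestion task
    notebook_params={"config_version": "v2"},  # supply the newly required parameter
)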
A data engineer is designing an append-only pipeline that needs to handle both batch and streaming
data in Delta Lake. The team wants to ensure that the streaming component can efficiently track
which data has already been processed.
Which configuration should be set to enable this?
C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract of Databricks Data Engineer
Documents:
When working with Delta Lake streaming ingestion, checkpointing is critical for maintaining fault
tolerance and ensuring exactly-once data processing semantics.
The checkpointLocation parameter defines the directory where Spark Structured Streaming stores
progress information, offsets, and metadata. This allows the engine to resume processing from the
last committed offset without reprocessing previously ingested data.
Without checkpointing, each stream restart would reprocess all data, leading to duplicates.
Parameters like partitionBy or schema options (mergeSchema / overwriteSchema) affect table
structure, not data lineage tracking. Therefore, the correct and required configuration for efficient
streaming state management is checkpointLocation.
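A minimal append-only streaming write illustrating this configuration; the table names and checkpoint path are assumptions:

query = (
    spark.readStream.table("bronze_events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/silver_events")
    .outputMode("append")
    .toTable("silver_events")
)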