Databricks Certified Associate Developer for Apache Spark 3.5 Exam Questions

Questions for the Databricks Certified Associate Developer for Apache Spark 3.5 were updated on: Dec 01, 2025

Page 1 out of 9. Viewing questions 1-15 out of 135

Question 1

54 of 55.
What is the benefit of Adaptive Query Execution (AQE)?

  • A. It allows Spark to optimize the query plan before execution but does not adapt during runtime.
  • B. It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
  • C. It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.
  • D. It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.
Answer:

D

Explanation:
Adaptive Query Execution (AQE) is a Spark SQL feature introduced to dynamically optimize queries at
runtime based on actual data statistics collected during execution.
Key benefits include:
Runtime plan adaptation: Spark adjusts the physical plan after some stages complete.
Skew handling: Automatically splits skewed partitions to balance work distribution.
Join strategy optimization: Dynamically switches between shuffle join and broadcast join depending
on partition sizes.
Coalescing shuffle partitions: Reduces the number of small tasks for better performance.
Example configuration:
spark.conf.set("spark.sql.adaptive.enabled", True)
This enables AQE globally in Spark 3.5.
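Related AQE settings (a sketch; these are standard Spark 3.x configuration keys, and enabling them here is optional tuning, not part of the original question):
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)   # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)             # split skewed partitions at join time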
Why the other options are incorrect:
A: AQE adapts during runtime, not only before execution.
B: Task distribution is a base Spark feature, not specific to AQE.
C: AQE specifically addresses runtime skew and join adjustments.
Reference:
Spark SQL Adaptive Query Execution Guide — Runtime optimization, skew handling, and join
strategy adjustment.
Databricks Exam Guide (June 2025): Section “Troubleshooting and Tuning Apache Spark DataFrame
API Applications” — Adaptive Query Execution benefits and configuration.


Question 2

49 of 55.
In the code block below, aggDF contains aggregations on a streaming DataFrame:
aggDF.writeStream \
.format("console") \
.outputMode("???") \
.start()
Which output mode at line 3 ensures that the entire result table is written to the console during each
trigger execution?

  • A. AGGREGATE
  • B. COMPLETE
  • C. REPLACE
  • D. APPEND
Answer:

B

Explanation:
Structured Streaming supports three output modes:
Append: Writes only new rows since the last trigger.
Update: Writes only updated rows.
Complete: Writes the entire result table after every trigger execution.
For aggregations like groupBy().count(), only complete mode outputs the entire table each time.
Example:
aggDF.writeStream \
.outputMode("complete") \
.format("console") \
.start()
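A fuller, self-contained sketch (the rate source and the modulo grouping are illustrative, not from the original question):
from pyspark.sql import functions as F
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
aggDF = stream_df.groupBy((F.col("value") % 10).alias("bucket")).count()
query = aggDF.writeStream \
.outputMode("complete") \
.format("console") \
.start()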
Why the other options are incorrect:
A: “AGGREGATE” is not a valid output mode.
C: “REPLACE” does not exist.
D: “APPEND” writes only new rows, not the full table.
Reference:
PySpark Structured Streaming — Output Modes (append, update, complete).
Databricks Exam Guide (June 2025): Section “Structured Streaming” — output modes and use cases
for aggregations.


Question 3

48 of 55.
A data engineer needs to join multiple DataFrames and has written the following code:
from pyspark.sql.functions import broadcast
data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["id", "val2"])
df3 = spark.createDataFrame(data3, ["id", "val3"])
df_joined = df1.join(broadcast(df2), "id", "inner") \
.join(broadcast(df3), "id", "inner")
What will be the output of this code?

  • A. The code will work correctly and perform two broadcast joins simultaneously to join df1 with df2, and then the result with df3.
  • B. The code will fail because only one broadcast join can be performed at a time.
  • C. The code will fail because the second join condition (df2.id == df3.id) is incorrect.
  • D. The code will result in an error because broadcast() must be called before the joins, not inline.
Answer:

A

Explanation:
Spark supports multiple broadcast joins in a single query plan, as long as each broadcasted
DataFrame is small enough to fit under the configured threshold.
Execution Plan:
Spark broadcasts df2 to all executors.
Joins df1 (big) with broadcasted df2.
Then broadcasts df3 and performs another join with the intermediate result.
The result is efficient and avoids shuffling large data.
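A quick way to confirm this (a sketch; the exact plan text varies by Spark version) is to inspect the physical plan, which should show two BroadcastHashJoin nodes:
df_joined.explain()   # look for BroadcastHashJoin in the printed physical plan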
Why the other options are incorrect:
B: Multiple broadcast joins are supported in Spark 3.x.
C: The join condition is correct since all use id as the key.
D: broadcast() can be used inline; it’s valid syntax.
Reference:
PySpark SQL Functions — broadcast() usage.
Databricks Exam Guide (June 2025): Section “Developing Apache Spark DataFrame/DataSet API
Applications” — multiple broadcast join optimization.

Question 4

47 of 55.
A data engineer has written the following code to join two DataFrames df1 and df2:
df1 = spark.read.csv("sales_data.csv")
df2 = spark.read.csv("product_data.csv")
df_joined = df1.join(df2, df1.product_id == df2.product_id)
The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.
Which join strategy will Spark use?

  • A. Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.
  • B. Shuffle join, because AQE is not enabled, and Spark uses a static query plan.
  • C. Shuffle join because no broadcast hints were provided.
  • D. Broadcast join, as df2 is smaller than the default broadcast threshold.
Answer:

D

Explanation:
Spark automatically uses a broadcast hash join when one side of the join is small enough to fit within
the broadcast threshold.
Default threshold:
spark.sql.autoBroadcastJoinThreshold = 10MB (as of Spark 3.5)
Since df2 is 8 MB, Spark automatically broadcasts it to all executors. This avoids a shuffle on the large
dataset (df1) and speeds up the join.
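A minimal sketch for checking or adjusting the threshold (the 20 MB value is illustrative):
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))              # default is 10 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024)   # raise to 20 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)                 # disable automatic broadcast joins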
Why the other options are incorrect:
A: 8 MB < 10 MB threshold → broadcast join is efficient.
B: AQE is not required for broadcast joins; it’s a static optimization.
C: Broadcast hints are optional — Spark infers automatically.
Reference:
Databricks Exam Guide (June 2025): Section “Troubleshooting and Tuning Apache Spark DataFrame
API Applications” — broadcast joins and optimization.
Spark SQL Join Strategies — Broadcast hash join and shuffle join thresholds.

Question 5

46 of 55.
A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving
records.
The engineer has written the following code:
inputStream \
.withWatermark("event_time", "10 minutes") \
.groupBy(window("event_time", "15 minutes"))
What happens to data that arrives after the watermark threshold?

  • A. Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.
  • B. Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.
  • C. Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.
  • D. The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.
Answer:

A

Explanation:
Watermarking in Structured Streaming defines how late a record can arrive based on event time
before Spark discards it.
Behavior:
.withWatermark("event_time", "10 minutes")
This means Spark keeps aggregation state for events up to 10 minutes behind the maximum event time seen so far.
Any record whose event time falls more than 10 minutes behind that maximum (i.e., behind the watermark) is dropped; it will not be included in the aggregation or output.
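A self-contained sketch of the full pattern (the console sink and update mode are illustrative additions, not from the original snippet):
from pyspark.sql.functions import window
windowedCounts = inputStream \
.withWatermark("event_time", "10 minutes") \
.groupBy(window("event_time", "15 minutes")) \
.count()
query = windowedCounts.writeStream \
.outputMode("update") \
.format("console") \
.start()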
Why the other options are incorrect:
B: Late data beyond the watermark threshold is not included.
C: Late data is not moved to a new window; it’s simply dropped.
D: True for late data within the watermark threshold, not after it.
Reference:
Spark Structured Streaming Guide — withWatermark() behavior and late data handling.
Databricks Exam Guide (June 2025): Section “Structured Streaming” — watermarking and state
cleanup behavior.

Question 6

45 of 55.
Which feature of Spark Connect should be considered when designing an application that plans to
enable remote interaction with a Spark cluster?

  • A. It is primarily used for data ingestion into Spark from external sources.
  • B. It provides a way to run Spark applications remotely in any programming language.
  • C. It can be used to interact with any remote cluster using the REST API.
  • D. It allows for remote execution of Spark jobs.
Answer:

D

Explanation:
Spark Connect enables remote execution of Spark jobs by decoupling the client from the driver using
the Spark Connect protocol (gRPC).
It allows users to run Spark code from different environments (like notebooks, IDEs, or remote
clients) while executing jobs on the cluster.
Key Features:
Enables remote interaction between client and Spark driver.
Supports interactive development and lightweight client sessions.
Improves developer productivity without needing driver resources locally.
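A minimal client-side sketch, assuming a Spark Connect server is already running (the host and port are placeholders):
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.range(10)   # the logical plan is built on the client
print(df.count())      # execution happens remotely on the cluster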
Why the other options are incorrect:
A: Spark Connect is not limited to ingestion tasks.
B: It supports multiple client languages (Python, Scala, etc.) through the Spark Connect API, but not arbitrary code in any programming language.
C: Spark Connect uses the gRPC protocol, not a REST API.
Reference:
Databricks Exam Guide (June 2025): Section “Using Spark Connect to Deploy Applications” —
describes Spark Connect architecture and remote execution model.
Spark 3.5 Documentation — Spark Connect overview and client-server protocol.


Question 7

44 of 55.
A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.
They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.
Which code snippet fulfills this requirement?
A.
query = df.writeStream \
.outputMode("append") \
.trigger(processingTime="5 seconds") \
.start()
B.
query = df.writeStream \
.outputMode("append") \
.trigger(continuous="5 seconds") \
.start()
C.
query = df.writeStream \
.outputMode("append") \
.trigger(once=True) \
.start()
D.
query = df.writeStream \
.outputMode("append") \
.start()

  • A. Option A
  • B. Option B
  • C. Option C
  • D. Option D
Answer:

A

Explanation:
To process data in fixed micro-batch intervals, use the .trigger(processingTime="interval") option in
Structured Streaming.
Correct usage:
query = df.writeStream \
.outputMode("append") \
.trigger(processingTime="5 seconds") \
.start()
This instructs Spark to process available data every 5 seconds.
Why the other options are incorrect:
B: continuous triggers are for continuous processing mode (different execution model).
C: once=True runs the stream a single time (batch mode).
D: Default trigger runs as fast as possible, not fixed intervals.
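For comparison, a sketch of the three trigger variants discussed above (the console sink is illustrative; use one trigger per query):
writer = df.writeStream.outputMode("append").format("console")
q = writer.trigger(processingTime="5 seconds").start()   # fixed 5-second micro-batches
# writer.trigger(once=True).start()                      # process available data once, then stop
# writer.trigger(continuous="5 seconds").start()         # continuous mode with 5-second checkpoints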
Reference:
PySpark Structured Streaming Guide — Trigger types: processingTime, once, continuous.
Databricks Exam Guide (June 2025): Section “Structured Streaming” — controlling streaming triggers
and batch intervals.

Question 8

43 of 55.
An organization has been running a Spark application in production and is considering disabling the
Spark History Server to reduce resource usage.
What will be the impact of disabling the Spark History Server in production?

  • A. Prevention of driver log accumulation during long-running jobs
  • B. Improved job execution speed due to reduced logging overhead
  • C. Loss of access to past job logs and reduced debugging capability for completed jobs
  • D. Enhanced executor performance due to reduced log size
Answer:

C

Explanation:
The Spark History Server provides a web UI for viewing past completed applications, including event
logs, stages, and performance metrics.
If disabled:
Spark jobs still run normally, but users lose the ability to review historical job metrics, DAGs, or logs after completion.
As a result, debugging, performance analysis, and audit capabilities for completed jobs are lost.
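For context, a minimal sketch of the event-log settings the History Server relies on (the log directory is a placeholder and must be set when the application starts):
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.eventLog.enabled", "true") \
.config("spark.eventLog.dir", "hdfs:///spark-events") \
.getOrCreate()
# The History Server reads the same directory via spark.history.fs.logDirectory.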
Why the other options are incorrect:
A: The History Server does not manage or clean up driver logs, so disabling it does not prevent log accumulation.
B/D: The History Server adds minimal overhead; disabling it does not improve runtime speed or executor performance.
Reference:
Databricks Exam Guide (June 2025): Section “Apache Spark Architecture and Components” — Spark
UI, History Server, and event logging.
Spark Administration Docs — History Server functionality and configuration.

Question 9

42 of 55.
A developer needs to write the output of a complex chain of Spark transformations to a Parquet
table called events.liveLatest.
Consumers of this table query it frequently with filters on both year and month of the event_ts
column (a timestamp).
The current code:
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
.withColumn("event_month", F.month("event_ts")) \
.write.bucketBy(42, ["event_year", "event_month"]) \
.saveAsTable("events.liveLatest")
However, consumers report poor query performance.
Which change will enable efficient querying by year and month?

  • A. Replace .bucketBy() with .partitionBy("event_year", "event_month")
  • B. Change the bucket count (42) to a lower number
  • C. Add .sortBy() after .bucketBy()
  • D. Replace .bucketBy() with .partitionBy("event_year") only
Answer:

A

Explanation:
When queries frequently filter on certain columns, partitioning by those columns ensures partition
pruning, allowing Spark to scan only relevant directories instead of the entire dataset.
Correct code:
final.write.partitionBy("event_year", "event_month").parquet("events.liveLatest")
This improves read performance dramatically for filters like:
SELECT * FROM events.liveLatest WHERE event_year = 2024 AND event_month = 5;
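A fuller sketch of the refactor, keeping the saveAsTable target from the question (column derivations as in the original code):
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
.withColumn("event_month", F.month("event_ts"))
final.write \
.partitionBy("event_year", "event_month") \
.format("parquet") \
.saveAsTable("events.liveLatest")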
bucketBy() helps in clustering and joins, not in partition pruning for file-based tables.
Why the other options are incorrect:
B: Bucket count changes parallelism, not query pruning.
C: sortBy organizes data within files, not across partitions.
D: Partitioning by only one column limits pruning benefits.
Reference:
Spark SQL DataFrameWriter — partitionBy() for partitioned tables.
Databricks Exam Guide (June 2025): Section “Using Spark SQL” — partitioning vs. bucketing and
query optimization.

Question 10

41 of 55.
A data engineer is working on the DataFrame df1 and wants the Name with the highest count to
appear first (descending order by count), followed by the next highest, and so on.
The DataFrame has columns:
id | Name    | count | timestamp
---|---------|-------|----------
1  | USA     | 10    |
2  | India   | 20    |
3  | England | 50    |
4  | India   | 50    |
5  | France  | 20    |
6  | India   | 10    |
7  | USA     | 30    |
8  | USA     | 40    |
Which code fragment should the engineer use to sort the data in the Name and count columns?

  • A. df1.orderBy(col("count").desc(), col("Name").asc())
  • B. df1.sort("Name", "count")
  • C. df1.orderBy("Name", "count")
  • D. df1.orderBy(col("Name").desc(), col("count").asc())
Answer:

A

Explanation:
To sort a Spark DataFrame by multiple columns, use .orderBy() (or .sort()) with column expressions.
Correct syntax for descending and ascending mix:
from pyspark.sql.functions import col
df1.orderBy(col("count").desc(), col("Name").asc())
This sorts primarily by count in descending order and secondarily by Name in ascending order
(alphabetically).
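An equivalent form using the desc()/asc() helper functions (a sketch):
from pyspark.sql import functions as F
df1.orderBy(F.desc("count"), F.asc("Name")).show()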
Why the other options are incorrect:
B/C: Default sort order is ascending; won’t place highest counts first.
D: Reverses sorting logic — sorts Name descending, not required.
Reference:
PySpark DataFrame API — orderBy() and col() for sorting with direction.
Databricks Exam Guide (June 2025): Section “Using Spark DataFrame APIs” — sorting, ordering, and
column expressions.

Question 11

40 of 55.
A developer wants to refactor older Spark code to take advantage of built-in functions introduced in
Spark 3.5.
The original code:
from pyspark.sql import functions as F
min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))
Which code block should the developer use to refactor the code?

  • A. result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
  • B. result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()
  • C. result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))
  • D. result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()
Answer:

A

Explanation:
To compare a column value with a Python literal constant in a DataFrame expression, use F.lit() to
convert it into a Spark literal.
Correct refactor:
from pyspark.sql import functions as F
min_price = 110.50
result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
This avoids type mismatches and ensures Spark executes the filter expression on the cluster.
Why the other options are incorrect:
B: where() syntax is valid, but F.lit("price") is incorrect — wraps string literal, not a column.
C: withColumn adds a column, not needed for this aggregation.
D: Comparison logic reversed.
Reference:
PySpark SQL Functions — lit(), col(), and DataFrame filters.
Databricks Exam Guide (June 2025): Section “Developing Apache Spark DataFrame/DataSet API
Applications” — filtering, literals, and aggregations.


Question 12

39 of 55.
A Spark developer is developing a Spark application to monitor task performance across a cluster.
One requirement is to track the maximum processing time for tasks on each worker node and
consolidate this information on the driver for further analysis.
Which technique should the developer use?

  • A. Broadcast a variable to share the maximum time among workers.
  • B. Configure the Spark UI to automatically collect maximum times.
  • C. Use an RDD action like reduce() to compute the maximum time.
  • D. Use an accumulator to record the maximum time on the driver.
Answer:

C

Explanation:
RDD actions like reduce() aggregate values across all partitions and return the result to the driver.
To compute the maximum processing time, reduce() is ideal because it combines results from all
tasks efficiently.
Example:
max_time = rdd_times.reduce(lambda x, y: max(x, y))
This aggregates maximum values from all executors into a single result on the driver.
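A self-contained sketch (the task times below are illustrative):
rdd_times = spark.sparkContext.parallelize([12.4, 8.7, 15.2, 9.9])
max_time = rdd_times.reduce(lambda x, y: max(x, y))
print(max_time)   # 15.2, returned to the driver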
Why the other options are incorrect:
A: Broadcast variables distribute read-only data; they cannot aggregate results.
B: Spark UI provides visualization, not programmatic collection.
D: Built-in accumulators support only additive updates (e.g., counters, sums); tracking a maximum would require a custom accumulator, so reduce() is the more direct choice.
Reference:
Spark RDD API — reduce() for aggregations.
Databricks Exam Guide (June 2025): Section “Apache Spark Architecture and Components” —
actions, accumulators, and broadcast variables.

Question 13

38 of 55.
A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.
The file contains records with varying schemas, and the engineer wants to create an external table in
Spark SQL that:
Reads directly from /data/input.json.
Infers the schema automatically.
Merges differing schemas.
Which code snippet should the engineer use?
A.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');
B.
CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');
C.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');
D.
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');

  • A. Option A
  • B. Option B
  • C. Option C
  • D. Option D
Answer:

A

Explanation:
To handle JSON files with evolving or differing schemas, Spark SQL supports the option mergeSchema
'true', which merges all fields across files into a unified schema.
Correct syntax:
CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');
This creates an external table directly on the JSON data, inferring schema automatically and merging
variations.
Why the other options are incorrect:
B: Missing schema merge configuration — fails with inconsistent files.
C: inferSchema applies to CSV/other file types, not JSON.
D: mergeAll is not a valid Spark SQL option.
Reference:
Spark SQL Data Sources — JSON file options (mergeSchema, path).
Databricks Exam Guide (June 2025): Section “Using Spark SQL” — creating external tables and
schema inference for JSON data.

Question 14

37 of 55.
A data scientist is working with a Spark DataFrame called customerDF that contains customer
information.
The DataFrame has a column named email with customer email addresses.
The data scientist needs to split this column into username and domain parts.
Which code snippet splits the email column into username and domain columns?
A.
customerDF = customerDF \
.withColumn("username", split(col("email"), "@").getItem(0)) \
.withColumn("domain", split(col("email"), "@").getItem(1))
B.
customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))
C.
customerDF = customerDF.select("email").alias("username", "domain")
D.
customerDF = customerDF.withColumn("domain", col("email").split("@")[1])

  • A. Option A
  • B. Option B
  • C. Option C
  • D. Option D
Answer:

A

Explanation:
The split() function in PySpark splits strings into an array based on a given delimiter.
Then, .getItem(index) extracts a specific element from the array.
Correct usage:
from pyspark.sql.functions import split, col
customerDF = customerDF \
.withColumn("username", split(col("email"), "@").getItem(0)) \
.withColumn("domain", split(col("email"), "@").getItem(1))
This creates two new columns derived from the email field:
"username" → text before @
"domain" → text after @
Why the other options are incorrect:
B: regexp_replace only replaces text; does not split into multiple columns.
C: .select() cannot alias multiple derived columns like this.
D: Column objects are not native Python strings; cannot use standard .split().
Reference:
PySpark SQL Functions — split() and getItem().
Databricks Exam Guide (June 2025): Section “Developing Apache Spark DataFrame/DataSet API
Applications” — manipulating and splitting column data.

Question 15

36 of 55.
What is the main advantage of partitioning the data when persisting tables?

  • A. It compresses the data to save disk space.
  • B. It automatically cleans up unused partitions to optimize storage.
  • C. It ensures that data is loaded into memory all at once for faster query execution.
  • D. It optimizes by reading only the relevant subset of data from fewer partitions.
Answer:

D

Explanation:
Partitioning a dataset divides data into separate directories based on partition column values. When
queries filter on partitioned columns, Spark can prune irrelevant partitions — meaning it only reads
files that match the filter criteria.
Advantage:
Reduces I/O and improves performance by scanning only relevant subsets of data.
Example:
/data/sales/year=2023/month=10/...
/data/sales/year=2024/month=01/...
A query filtering WHERE year = 2024 reads only the relevant partition.
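A minimal sketch of the write-then-filter pattern (paths and column names are illustrative):
df.write.partitionBy("year", "month").parquet("/data/sales")
result = spark.read.parquet("/data/sales").filter("year = 2024")   # scans only the year=2024 directories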
Why the other options are incorrect:
A: Compression is independent of partitioning.
B: Spark does not automatically clean partitions unless managed manually.
C: Partitioning does not cause Spark to load entire data into memory.
Reference:
Databricks Exam Guide (June 2025): Section “Using Spark SQL” — partitioning and pruning for
optimized data retrieval.
Spark SQL Documentation — DataFrameWriter partitionBy() and query optimization.