Amazon AWS Certified Data Engineer - Associate exam practice questions
Questions for the Amazon DEA-C01 were updated on: Dec 01, 2025
Page 1 out of 13. Viewing questions 1-15 out of 190
Question 1
A data engineer is configuring an AWS Glue Apache Spark extract, transform, and load (ETL) job. The job contains a sort-merge join of two large and equally sized DataFrames. The job is failing with the following error: "No space left on device." Which solution will resolve the error?
A. Use the AWS Glue Spark shuffle manager.
B. Deploy an Amazon Elastic Block Store (Amazon EBS) volume for the job to use.
C. Convert the sort-merge join in the job to be a broadcast join.
D. Convert the DataFrames to DynamicFrames, and perform a DynamicFrame join in the job.
Answer: D
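The marked answer for Question 1 converts the DataFrames to Glue DynamicFrames and joins them with the DynamicFrame Join transform. A minimal PySpark sketch of that approach, assuming a Glue job environment; the tiny sample DataFrames and the key column id are stand-ins for the two large inputs in the question:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Stand-ins for the two large, equally sized DataFrames described in the question.
orders_df = spark.createDataFrame([(1, "2025-01-01")], ["id", "order_date"])
customers_df = spark.createDataFrame([(1, "Alice")], ["id", "name"])

# Convert the Spark DataFrames to DynamicFrames and join on the shared key.
orders_dyf = DynamicFrame.fromDF(orders_df, glue_context, "orders_dyf")
customers_dyf = DynamicFrame.fromDF(customers_df, glue_context, "customers_dyf")
joined_dyf = Join.apply(orders_dyf, customers_dyf, "id", "id")
joined_dyf.toDF().show()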
Question 2
A data engineer is using the Apache Iceberg framework to build a data lake that contains 100 TB of data. The data engineer wants to run AWS Glue Apache Spark jobs that use the Iceberg framework. Which combination of steps will meet this requirement? (Select TWO.)
A. Create a key named --conf for an AWS Glue job. Set Iceberg as a value for the --datalake-formats job parameter.
B. Specify the path to a specific version of Iceberg by using the --extra-jars job parameter. Set Iceberg as a value for the --datalake-formats job parameter.
C. Set Iceberg as a value for the --datalake-formats job parameter.
D. Set the --enable-auto-scaling parameter to true.
E. Add the --job-bookmark-option: job-bookmark-enable parameter to an AWS Glue job.
Answer: A, E
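Several options in Question 2 reference the --datalake-formats job parameter that loads the Iceberg libraries into a Glue Spark job. A minimal boto3 sketch of creating such a job; the job name, IAM role ARN, script location, and bucket are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="iceberg-etl-job",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/iceberg_etl.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        # Load the Iceberg libraries that ship with AWS Glue.
        "--datalake-formats": "iceberg",
    },
)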
Question 3
A company uses an Amazon Redshift cluster as a data warehouse that is shared across two departments. To comply with a security policy, each department must have unique access permissions. Which solution will meet this requirement?
A. Group tables and views for each department into dedicated schemas. Manage permissions at the schema level.
B. Group tables and views for each department into dedicated databases. Manage permissions at the database level.
C. Update the names of the tables and views to follow a naming convention that contains the department names. Manage permissions based on the new naming convention.
D. Create an IAM user group for each department. Use identity-based IAM policies to grant table and view permissions based on the IAM user group.
Answer: A
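The marked answer for Question 3 groups each department's objects into a dedicated schema and grants permissions at the schema level. A rough sketch of the corresponding SQL for one department, issued through the Redshift Data API and repeated per department; the schema name, user group, cluster identifier, and admin user are made up for illustration:

import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.batch_execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="dev",
    DbUser="admin",                       # placeholder admin user
    Sqls=[
        "CREATE SCHEMA IF NOT EXISTS marketing",
        "GRANT USAGE ON SCHEMA marketing TO GROUP marketing_dept",
        "GRANT SELECT ON ALL TABLES IN SCHEMA marketing TO GROUP marketing_dept",
    ],
)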
Question 4
A data engineer notices slow query performance on a highly partitioned table that is in Amazon Athena. The table contains daily data for the previous 5 years, partitioned by date. The data engineer wants to improve query performance and to automate partition management. Which solution will meet these requirements?
A. Use an AWS Lambda function that runs daily. Configure the function to manually create new partitions in AWS Glue for each day's data.
B. Use partition projection in Athena. Configure the table properties by using a date range from 5 years ago to the present.
C. Reduce the number of partitions by changing the partitioning schema from daily to monthly granularity.
D. Increase the processing capacity of Athena queries by allocating more compute resources.
Answer: B
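The marked answer for Question 4 enables partition projection on the Athena table with a date range covering the last 5 years. A hedged sketch of the table properties involved, applied here as Athena DDL through boto3; the table, database, partition column (dt), bucket, and date range are placeholders:

import boto3

athena = boto3.client("athena")

ddl = """
ALTER TABLE sales_data SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2020-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://example-bucket/sales_data/${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)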
Question 5
A company uses AWS Glue Apache Spark jobs to handle extract, transform, and load (ETL) workloads. The company has enabled logging and monitoring for all AWS Glue jobs. One of the AWS Glue jobs begins to fail. A data engineer investigates the error and wants to examine metrics for all individual stages within the job. How can the data engineer access the stage metrics?
A. Examine the AWS Glue job and stage details in the Spark UI.
B. Examine the AWS Glue job and stage metrics in Amazon CloudWatch.
C. Examine the AWS Glue job and stage logs in AWS CloudTrail logs.
D. Examine the AWS Glue job and stage details by using the run insights feature on the job.
Answer: A
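Viewing per-stage details in the Spark UI (the marked answer for Question 5) requires the Glue job to emit Spark event logs. A minimal sketch of starting a job run with the Spark UI arguments enabled; the job name and log bucket are placeholders:

import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="etl-job",  # placeholder job name
    Arguments={
        # Emit Spark event logs so stages can be inspected in the Spark UI.
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://example-bucket/spark-logs/",  # placeholder
    },
)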
Question 6
A company wants to combine data from multiple software as a service (SaaS) applications for analysis. A data engineering team needs to use Amazon QuickSight to perform the analysis and build dashboards. A data engineer needs to extract the data from the SaaS applications and make the data available for QuickSight queries. Which solution will meet these requirements in the MOST operationally efficient way?
A. Create AWS Lambda functions that call the required APIs to extract the data from the applications. Store the data in an Amazon S3 bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight
B. Use AWS Lambda functions as Amazon Athena data source connectors to run federated queries against the SaaS applications. Create an Athena data source and a dataset in QuickSight.
C. Use Amazon AppFlow to create a flow for each SaaS application. Set an Amazon S3 bucket as the destination. Schedule the flows to extract the data to the bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight.
D. Export the data from the SaaS applications as Microsoft Excel files. Create a data source and a dataset in QuickSight by uploading the Excel files.
Answer: C
Question 7
A company wants to migrate a data warehouse from Teradata to Amazon Redshift. Which solution will meet this requirement with the LEAST operational effort?
A. Use AWS Database Migration Service (AWS DMS) Schema Conversion to migrate the schema. Use AWS DMS to migrate the data.
B. Use the AWS Schema Conversion Tool (AWS SCT) to migrate the schema. Use AWS Database Migration Service (AWS DMS) to migrate the data.
C. Use AWS Database Migration Service (AWS DMS) to migrate the data. Use automatic schema conversion.
D. Manually export the schema definition from Teradata. Apply the schema to the Amazon Redshift database. Use AWS Database Migration Service (AWS DMS) to migrate the data.
Answer: B
Question 8
A company has a data pipeline that uses an Amazon RDS instance, AWS Glue jobs, and an Amazon S3 bucket. The RDS instance and AWS Glue jobs run in a private subnet of a VPC and in the same security group. A user made a change to the security group that prevents the AWS Glue jobs from connecting to the RDS instance. After the change, the security group contains a single rule that allows inbound SSH traffic from a specific IP address. The company must resolve the connectivity issue. Which solution will meet this requirement?
A. Add an inbound rule that allows all TCP traffic on all TCP ports. Set the security group as the source.
B. Add an inbound rule that allows all TCP traffic on all UDP ports. Set the private IP address of the RDS instance as the source.
C. Add an inbound rule that allows all TCP traffic on all TCP ports. Set the DNS name of the RDS instance as the source.
D. Replace the source of the existing SSH rule with the private IP address of the RDS instance. Create an outbound rule with the same source, destination, and protocol as the inbound SSH rule.
Answer: A
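The marked answer for Question 8 restores connectivity by letting the security group reference itself as the traffic source. A minimal boto3 sketch of adding that self-referencing inbound rule; the security group ID is a placeholder:

import boto3

ec2 = boto3.client("ec2")
sg_id = "sg-0123456789abcdef0"  # placeholder: the group shared by the Glue jobs and the RDS instance

# Allow all TCP traffic whose source is the security group itself, so
# resources in the group can reach each other on any TCP port.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)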
Question 9
A company receives marketing campaign data from a vendor. The company ingests the data into an Amazon S3 bucket every 40 to 60 minutes. The data is in CSV format. File sizes are between 100 KB and 300 KB. A data engineer needs to set up an extract, transform, and load (ETL) pipeline to upload the content of each file to Amazon Redshift. Which solution will meet these requirements with the LEAST operational overhead?
A. Create an AWS Lambda function that connects to Amazon Redshift and runs a COPY command. Use Amazon EventBridge to invoke the Lambda function based on an Amazon S3 upload trigger.
B. Create an Amazon Data Firehose stream. Configure the stream to use an AWS Lambda function as a source to pull data from the S3 bucket. Set Amazon Redshift as the destination.
C. Use Amazon Redshift Spectrum to query the S3 bucket. Configure an AWS Glue Crawler for the S3 bucket to update metadata in an AWS Glue Data Catalog.
D. Create an AWS Database Migration Service (AWS DMS) task. Specify an appropriate data schema to migrate. Specify the appropriate type of migration to use.
Answer: D
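Option A in Question 9 has a Lambda function run a Redshift COPY command for each newly arrived S3 file. A hedged sketch of what that COPY call could look like through the Redshift Data API; the cluster, table, bucket path, database user, and IAM role ARN are placeholders:

import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
COPY marketing.campaign_events
FROM 's3://example-bucket/incoming/file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;
"""

redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="dev",
    DbUser="loader",                      # placeholder database user
    Sql=copy_sql,
)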
Question 10
A company uses Amazon Redshift as its data warehouse service. A data engineer needs to design a physical data model. The data engineer encounters a de-normalized table that is growing in size. The table does not have a suitable column to use as the distribution key. Which distribution style should the data engineer use to meet these requirements with the LEAST maintenance overhead?
A. ALL distribution
B. EVEN distribution
C. AUTO distribution
D. KEY distribution
Answer: B
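For Question 10, the distribution style is declared in the table DDL. A short sketch of creating a table with the EVEN distribution style named in the marked answer, again via the Redshift Data API; the table definition and cluster details are illustrative only:

import boto3

ddl = """
CREATE TABLE sales_wide (          -- illustrative denormalized table
    order_id BIGINT,
    customer_name VARCHAR(100),
    order_total DECIMAL(12, 2)
)
DISTSTYLE EVEN;                    -- spread rows round-robin across slices
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="dev",
    DbUser="admin",                       # placeholder database user
    Sql=ddl,
)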
Question 11
A data engineer is troubleshooting an AWS Glue workflow that occasionally fails. The engineer determines that the failures are a result of data quality issues. A business reporting team needs to receive an email notification any time the workflow fails in the future. Which solution will meet this requirement?
A. Create an Amazon Simple Notification Service (Amazon SNS) FIFO topic. Subscribe the team's email account to the SNS topic. Create an AWS Lambda function that initiates when the AWS Glue job state changes to FAILED. Set the SNS topic as the target.
B. Create an Amazon Simple Notification Service (Amazon SNS) standard topic. Subscribe the team's email account to the SNS topic. Create an Amazon EventBridge rule that triggers when the AWS Glue Job state changes to FAILED. Set the SNS topic as the target.
C. Create an Amazon Simple Queue Service (Amazon SQS) FIFO queue. Subscribe the team's email account to the SQS queue. Create an AWS Config rule that triggers when the AWS Glue job state changes to FAILED. Set the SQS queue as the target.
D. Create an Amazon Simple Queue Service (Amazon SQS) standard queue. Subscribe the team's email account to the SQS queue. Create an Amazon EventBridge rule that triggers when the AWS Glue job state changes to FAILED. Set the SQS queue as the target.
Answer: B
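The marked answer for Question 11 wires an EventBridge rule for Glue job state changes to a standard SNS topic with an email subscription. A hedged boto3 sketch, assuming the topic's resource policy also allows EventBridge to publish to it; the topic name, rule name, and email address are placeholders:

import json
import boto3

sns = boto3.client("sns")
events = boto3.client("events")

# Standard SNS topic with an email subscription for the reporting team.
topic_arn = sns.create_topic(Name="glue-job-failures")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="reports@example.com")  # placeholder address

# EventBridge rule that matches Glue job runs entering the FAILED state.
events.put_rule(
    Name="glue-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED"]},
    }),
)
events.put_targets(Rule="glue-job-failed", Targets=[{"Id": "sns-target", "Arn": topic_arn}])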
Question 12
A data engineer needs to run a data transformation job whenever a user adds a file to an Amazon S3 bucket. The job will run for less than 1 minute. The job must send the output through an email message to the data engineer. The data engineer expects users to add one file every hour of the day. Which solution will meet these requirements in the MOST operationally efficient way?
A. Create a small Amazon EC2 instance that polls the S3 bucket for new files. Run transformation code on a schedule to generate the output. Use operating system commands to send email messages.
B. Run an Amazon Elastic Container Service (Amazon ECS) task to poll the S3 bucket for new files. Run transformation code on a schedule to generate the output. Use operating system commands to send email messages.
C. Create an AWS Lambda function to transform the data. Use Amazon S3 Event Notifications to invoke the Lambda function when a new object is created. Publish the output to an Amazon Simple Notification Service (Amazon SNS) topic. Subscribe the data engineer's email account to the topic.
D. Deploy an Amazon EMR cluster. Use EMR File System (EMRFS) to access the files in the S3 bucket. Run transformation code on a schedule to generate the output to a second S3 bucket. Create an Amazon Simple Notification Service (Amazon SNS) topic. Configure Amazon S3 Event Notifications to notify the topic when a new object is created.
Answer: C
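A minimal sketch of the Lambda-based flow in the marked answer for Question 12: the function is invoked by S3 Event Notifications, performs a stand-in transformation, and publishes the output to an SNS topic whose email subscription reaches the data engineer. The TOPIC_ARN environment variable and the line-count "transformation" are assumptions:

import os

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")


def handler(event, context):
    # S3 Event Notifications deliver the bucket and key of the new object.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    line_count = len(body.splitlines())  # stand-in for the real transformation

    sns.publish(
        TopicArn=os.environ["TOPIC_ARN"],  # topic with the engineer's email subscribed
        Subject=f"Processed {key}",
        Message=f"{key} contained {line_count} lines.",
    )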
Question 13
A company manages an Amazon Redshift data warehouse. The data warehouse is in a public subnet inside a custom VPC. A security group allows only traffic from within itself. A network ACL is open to all traffic. The company wants to generate several visualizations in Amazon QuickSight for an upcoming sales event. The company will run QuickSight Enterprise edition in a second AWS account inside a public subnet within a second custom VPC. The new public subnet has a security group that allows outbound traffic to the existing Redshift cluster. A data engineer needs to establish connections between Amazon Redshift and QuickSight. QuickSight must refresh dashboards by querying the Redshift cluster. Which solution will meet these requirements?
A. Configure the Redshift security group to allow inbound traffic on the Redshift port from the QuickSight security group.
B. Assign Elastic IP addresses to the QuickSight visualizations. Configure the QuickSight security group to allow inbound traffic on the Redshift port from the Elastic IP addresses.
C. Confirm that the CIDR ranges of the Redshift VPC and the QuickSight VPC are the same. If CIDR ranges are different, reconfigure one CIDR range to match the other. Establish network peering between the VPCs.
D. Create a QuickSight gateway endpoint in the Redshift VPC. Attach an endpoint policy to the gateway endpoint to ensure only specific QuickSight accounts can use the endpoint.
Answer: A
Question 14
A data engineer is optimizing query performance in Amazon Athena notebooks that use Apache Spark to analyze large datasets that are stored in Amazon S3. The data is partitioned. An AWS Glue crawler updates the partitions. The data engineer wants to minimize the amount of data that is scanned to improve efficiency of Athena queries. Which solution will meet these requirements?
A. Apply partition filters in the queries.
B. Increase the frequency of AWS Glue crawler invocations to update the data catalog more often.
C. Organize the data that is in Amazon S3 by using a nested directory structure.
D. Configure Spark to use in-memory caching for frequently accessed data.
Answer: A
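The marked answer for Question 14 pushes a filter on the partition column into the query so only the matching partitions are scanned. A small PySpark sketch, assuming a hypothetical Glue Data Catalog table named sales.orders partitioned by a dt column and containing order_id and amount columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Filtering on the partition column lets Spark prune partitions
# and read only the matching S3 prefixes.
daily_orders = spark.sql(
    "SELECT order_id, amount FROM sales.orders WHERE dt = '2025-01-15'"
)
daily_orders.show()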
Question 15
A retail company stores order information in an Amazon Aurora table named Orders. The company needs to create operational reports from the Orders table with minimal latency. The Orders table contains billions of rows, and over 100,000 transactions can occur each second. A marketing team needs to join the Orders data with an Amazon Redshift table named Campaigns in the marketing team's data warehouse. The operational Aurora database must not be affected. Which solution will meet these requirements with the LEAST operational effort?
A. Use AWS Database Migration Service (AWS DMS) Serverless to replicate the Orders table to Amazon Redshift. Create a materialized view in Amazon Redshift to join with the Campaigns table.
B. Use the Aurora zero-ETL integration with Amazon Redshift to replicate the Orders table. Create a materialized view in Amazon Redshift to join with the Campaigns table.
C. Use AWS Glue to replicate the Orders table to Amazon Redshift. Create a materialized view in Amazon Redshift to join with the Campaigns table.
D. Use federated queries to query the Orders table directly from Aurora. Create a materialized view in Amazon Redshift to join with the Campaigns table.