Snowflake DSA-C02 Exam Questions

Questions for the DSA-C02 were updated on Dec 01, 2025

Page 1 out of 5. Viewing questions 1-15 out of 65

Question 1

Mark the incorrect statement regarding the usage of Snowflake Streams & Tasks.

  • A. Snowflake automatically resizes and scales the compute resources for serverless tasks.
  • B. Snowflake ensures only one instance of a task with a schedule (i.e. a standalone task or the root task in a DAG) is executed at a given time. If a task is still running when the next scheduled execution time occurs, then that scheduled time is skipped.
  • C. Streams support repeatable read isolation.
  • D. A standard-only stream tracks row inserts only.
Answer:

D

Explanation:
All of the statements are correct except D. A standard (i.e. delta) stream tracks all DML changes to the source object, including inserts, updates, and deletes (including table truncates); it does not track row inserts only.

Question 2

Which tools help data scientists manage the ML lifecycle and model versioning?

  • A. MLFlow
  • B. Pachyderm
  • C. Albert
  • D. CRUX
Answer:

A, B

Explanation:
Model versioning involves tracking the changes made to an ML model that has been previously built. Put differently, it is the process of recording changes made to the configuration of an ML model. From another perspective, we can see model versioning as a practice that helps machine learning engineers, data scientists, and related personnel create and keep multiple versions of the same model.
Think of it as a way of taking notes on the changes you make to the model by tweaking hyperparameters, retraining the model with more data, and so on.
In model versioning, a number of things need to be versioned to help us keep track of important changes:
Implementation code: From the early days of model building through the optimization stages, the model's source code plays an important role. This code undergoes significant changes during the optimization stages, and those changes can easily be lost if not tracked properly. Because of this, code is one of the things taken into consideration during the model versioning process.
Data: In some cases, training data improves significantly from its initial state during the model optimization phases, for example as a result of engineering new features from existing ones. There is also metadata (data about your training data and model) to consider versioning; metadata can change many times without the training data itself changing, and we need to be able to track those changes through versioning.
Model: The model is a product of the two previous entities. As noted above, an ML model changes at different points of the optimization phases through hyperparameter settings, model artifacts, and learned coefficients. Versioning keeps a record of the different versions of a machine learning model.
MLflow and Pachyderm are tools used to manage the ML lifecycle and model versioning.
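As a brief illustration (a minimal sketch using the open-source MLflow tracking API; the experiment name, hyperparameter, metric, and dataset below are invented for the example), a single training run can be logged and versioned like this:

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("admissions-model")  # hypothetical experiment name
    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
        # Log the hyperparameter, a metric, and the model artifact so this exact
        # configuration can be compared against later versions of the model.
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
        mlflow.sklearn.log_model(model, "model")

Each run recorded this way becomes a separately retrievable version of the model, its configuration, and its metrics.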

Question 3

You are training a binary classification model to support admission approval decisions for a college
degree program.
How can you evaluate whether the model is fair and does not discriminate based on ethnicity?

  • A. Evaluate each trained model with a validation dataset and use the model with the highest accuracy score.
  • B. Remove the ethnicity feature from the training dataset.
  • C. Compare disparity between selection rates and performance metrics across ethnicities.
  • D. None of the above.
Answer:

C

Explanation:
By using ethnicity as a sensitive field, and comparing disparity between selection rates and
performance metrics for each ethnicity value, you can evaluate the fairness of the model.
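For instance (a minimal sketch with pandas; the column names and values are invented for illustration), the disparity in selection rates and in a per-group performance metric can be computed directly from the model's predictions:

    import pandas as pd

    # Hypothetical predictions with the sensitive attribute attached.
    df = pd.DataFrame({
        "ethnicity":       ["A", "A", "B", "B", "B", "C", "C", "C"],
        "predicted_admit": [1, 0, 1, 1, 0, 0, 0, 1],
        "actual_admit":    [1, 0, 1, 0, 0, 0, 1, 1],
    })

    # Selection rate: share of positive predictions per group.
    selection_rates = df.groupby("ethnicity")["predicted_admit"].mean()

    # Simple per-group performance metric (accuracy) for comparison.
    accuracy_by_group = (
        df.assign(correct=df["predicted_admit"] == df["actual_admit"])
          .groupby("ethnicity")["correct"].mean()
    )

    print(selection_rates)
    print(accuracy_by_group)
    print("selection-rate disparity:", selection_rates.max() - selection_rates.min())

A large gap between groups in either measure is a signal that the model may be treating ethnicities differently.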

Question 4

You previously trained a model using a training dataset. You want to detect any data drift in the new
data collected since the model was trained.
What should you do?

  • A. Create a new dataset using the new data and a timestamp column and create a data drift monitor that uses the training dataset as a baseline and the new dataset as a target.
  • B. Create a new version of the dataset using only the new data and retrain the model.
  • C. Add the new data to the existing dataset and enable Application Insights for the service where the model is deployed.
  • D. Retrain the model on the training dataset after correcting data outliers; there is no need to introduce new data.
Answer:

A

Explanation:
To track changing data trends, create a data drift monitor that uses the training data as a baseline
and the new data as a target.
Model drift and decay are concepts that describe the process during which the performance of a
model deployed to production degrades on new, unseen data or the underlying assumptions about
the data change.
These are important metrics to track once models are deployed to production. Models must be regularly retrained on new data; this is referred to as refitting the model. Refitting can be done on a periodic basis or, in an ideal scenario, triggered when the performance of the model degrades below a certain pre-defined threshold.
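As a rough sketch of one simple way to flag drift in a single numeric feature (the data below is synthetic, and the two-sample Kolmogorov-Smirnov test from SciPy is just one possible drift test, not the baseline/target monitor described above):

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    baseline_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time distribution
    new_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)       # newly collected data

    # A small p-value suggests the two samples come from different distributions,
    # i.e. the feature has drifted since the model was trained.
    statistic, p_value = ks_2samp(baseline_feature, new_feature)
    if p_value < 0.01:
        print(f"Possible drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")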

Question 5

Which metric is not used for evaluating classification models?

  • A. Recall
  • B. Accuracy
  • C. Mean absolute error
  • D. Precision
Answer:

C

Explanation:
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions
(precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total
actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics
(F1 score = 2 * ((precision * recall) / (precision + recall))).
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are metrics used to evaluate a regression model. These metrics tell us how accurate our predictions are and how much they deviate from the actual values.
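To make the formulas above concrete (a minimal sketch using scikit-learn's metric functions on a tiny made-up set of labels and predictions):

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("accuracy :", accuracy_score(y_true, y_pred))   # correct predictions / all predictions
    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
    print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall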

Question 6

Which of the following metrics are used to evaluate classification models?

  • A. Area under the ROC curve
  • B. F1 score
  • C. Confusion matrix
  • D. All of the above
Answer:

D

Explanation:
Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of
classification and regression. Some metrics, like precision-recall, are useful for multiple tasks.
Classification and regression are examples of supervised learning, which constitutes the majority of machine learning applications. By using different metrics for performance evaluation, we should be able to improve our model's overall predictive power before we roll it out for production on unseen data. Relying only on accuracy, without a proper evaluation of the model using different evaluation metrics, can lead to problems when the model is deployed on unseen data and may end in poor predictions.
Classification metrics are evaluation measures used to assess the performance of a classification
model. Common metrics include accuracy (proportion of correct predictions), precision (true
positives over total predicted positives), recall (true positives over total actual positives), F1 score
(harmonic mean of precision and recall), and area under the receiver operating characteristic curve
(AUC-ROC).
Confusion Matrix
Confusion Matrix is a performance measurement for the machine learning classification problems
where the output can be two or more classes. It is a table with combinations of predicted and actual
values.
It is extremely useful for measuring the Recall, Precision, Accuracy, and AUC-ROC curves.
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions
(precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total
actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics
(F1 score = 2 * ((precision * recall) / (precision + recall))).
These metrics help assess the classifier’s effectiveness in correctly classifying instances of different
classes.
Understanding how well a machine learning model will perform on unseen data is the main purpose
behind working with these evaluation metrics. Metrics like accuracy, precision, recall are good ways
to evaluate classification models for balanced datasets, but if the data is imbalanced then other
methods like ROC/AUC perform better in evaluating the model performance.
The ROC curve is not just a single number but a whole curve that provides nuanced detail about the behavior of the classifier. It is also hard to quickly compare many ROC curves to each other, which is why the curve is often summarized by the single AUC value.
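As an illustration (a minimal sketch with scikit-learn on a synthetic dataset), the confusion matrix and AUC-ROC can be computed as follows:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1_000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

    # Rows are actual classes, columns are predicted classes.
    print(confusion_matrix(y_test, clf.predict(X_test)))

    # AUC-ROC is computed from predicted probabilities, not hard class labels.
    print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))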

Question 7

Which type of Python UDF lets you define Python functions that receive batches of input rows as
Pandas DataFrames and return batches of results as Pandas arrays or Series?

  • A. MPP Python UDFs
  • B. Scalar Python UDFs
  • C. Vectorized Python UDFs
  • D. Hybrid Python UDFs
Answer:

C

Explanation:
Vectorized Python UDFs let you define Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. You call vectorized Python UDFs the same way you call other Python UDFs.
Advantages of using vectorized Python UDFs compared to the default row-by-row processing pattern include:
The potential for better performance if your Python code operates efficiently on batches of rows.
Less transformation logic required if you are calling into libraries that operate on Pandas DataFrames or Pandas arrays.
When you use vectorized Python UDFs:
You do not need to change how you write queries using Python UDFs. All batching is handled by the
UDF framework rather than your own code.
As with non-vectorized UDFs, there is no guarantee of which instances of your handler code will see
which batches of input.
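As a rough sketch of the handler pattern (based on the vectorized Python UDF interface in Snowflake's documentation; the function name is invented, and the code runs only inside a Snowflake Python UDF, where the _snowflake helper module is provided by the platform):

    import pandas
    from _snowflake import vectorized  # available only inside Snowflake's Python runtime

    @vectorized(input=pandas.DataFrame)
    def add_inputs(df: pandas.DataFrame) -> pandas.Series:
        # Columns 0 and 1 hold the UDF's first and second arguments for the whole
        # batch of rows; one vectorized Pandas operation processes the entire batch.
        return df[0] + df[1]

The handler would be registered as the HANDLER of a CREATE FUNCTION ... LANGUAGE PYTHON statement and called from SQL like any other UDF.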

Question 8

All Snowpark ML modeling and preprocessing classes are in the ________ namespace?

  • A. snowpark.ml.modeling
  • B. snowflake.sklearn.modeling
  • C. snowflake.scikit.modeling
  • D. snowflake.ml.modeling
Answer:

D

Explanation:
All Snowpark ML modeling and preprocessing classes are in the snowflake.ml.modeling namespace.
The Snowpark ML modules have the same names as the corresponding modules in the sklearn namespace. For example, the Snowpark ML module corresponding to sklearn.calibration is snowflake.ml.modeling.calibration.
The xgboost and lightgbm modules correspond to snowflake.ml.modeling.xgboost and snowflake.ml.modeling.lightgbm, respectively.
Not all of the classes from scikit-learn are supported in Snowpark ML.
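As a rough sketch (assuming the snowflake-ml-python package, an already-created Snowpark session, and hypothetical table and column names; the exact estimator arguments should be checked against the Snowpark ML reference):

    # Hypothetical example: requires snowflake-ml-python and an active Snowpark
    # session; TRAINING_DATA and its columns are made-up names.
    from snowflake.ml.modeling.preprocessing import StandardScaler
    from snowflake.ml.modeling.xgboost import XGBClassifier

    train_df = session.table("TRAINING_DATA")  # Snowpark DataFrame; session created elsewhere

    scaler = StandardScaler(input_cols=["AGE", "INCOME"], output_cols=["AGE_S", "INCOME_S"])
    train_df = scaler.fit(train_df).transform(train_df)

    clf = XGBClassifier(input_cols=["AGE_S", "INCOME_S"], label_cols=["APPROVED"],
                        output_cols=["PREDICTION"])
    clf.fit(train_df)                    # training runs inside Snowflake
    predictions = clf.predict(train_df)  # returns a Snowpark DataFrame with PREDICTION added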

Question 9

Which of the following are correct rules when using a data science model exposed via an external
function in Snowflake?

  • A. External functions return a value. The returned value can be a compound value, such as a VARIANT that contains JSON.
  • B. External functions can be overloaded.
  • C. An external function can appear in any clause of a SQL statement in which other types of UDF can appear.
  • D. External functions can accept Model parameters.
Answer:

A, B, C, D

Explanation:
From the perspective of a user running a SQL statement, an external function behaves like any other
UDF. External functions follow these rules:
External functions return a value.
External functions can accept parameters.
An external function can appear in any clause of a SQL statement in which other types of UDF can
appear. For example:
    select my_external_function_2(column_1, column_2)
    from table_1;

    select col1
    from table_1
    where my_external_function_3(col2) < 0;

    create view view1 (col1) as
    select my_external_function_5(col1)
    from table9;

An external function can be part of a more complex expression:

    select upper(zipcode_to_city_external_function(zipcode))
    from address_table;
The returned value can be a compound value, such as a VARIANT that contains JSON.
External functions can be overloaded; two different functions can have the same name but different
signatures (different numbers or data types of input parameters).

Question 10

Which of the following is a useful tool for gaining insights into the relationship between features and
predictions?

  • A. numpy plots
  • B. sklearn plots
  • C. Partial dependence plots (PDP)
  • D. FULL dependence plots (FDP)
Answer:

C

Explanation:
Partial dependence plots (PDPs) are a useful tool for gaining insight into the relationship between features and predictions. They help us understand how different values of a particular feature impact the model's predictions.
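For example (a minimal sketch using scikit-learn's partial dependence utilities on a synthetic dataset):

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.inspection import PartialDependenceDisplay

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    # Show how the predicted outcome changes with features 0 and 1,
    # averaging out the effect of all other features.
    PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
    plt.show()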

Question 11

How do you handle missing or corrupted data in a dataset?

  • A. Drop missing rows or columns
  • B. Replace missing values with mean/median/mode
  • C. Assign a unique category to missing values
  • D. All of the above
Answer:

D

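All three strategies are common in practice; a minimal sketch with pandas on an invented DataFrame:

    import pandas as pd

    df = pd.DataFrame({
        "age":    [25, None, 31, 40, None],
        "income": [50_000, 62_000, None, 71_000, 58_000],
        "city":   ["NY", None, "SF", "SF", None],
    })

    # 1. Drop rows (or columns) that contain missing values.
    dropped = df.dropna()

    # 2. Replace missing numeric values with the mean/median/mode.
    imputed = df.copy()
    imputed["age"] = imputed["age"].fillna(imputed["age"].median())
    imputed["income"] = imputed["income"].fillna(imputed["income"].mean())

    # 3. Assign a unique category to missing categorical values.
    imputed["city"] = imputed["city"].fillna("UNKNOWN")

    print(dropped)
    print(imputed)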

Question 12

The most widely used metrics and tools to assess a classification model are:

  • A. Confusion matrix
  • B. Cost-sensitive accuracy
  • C. Area under the ROC curve
  • D. All of the above
Answer:

D

Question 13

Which of the following is a common evaluation metric for binary classification?

  • A. Accuracy
  • B. F1 score
  • C. Mean squared error (MSE)
  • D. Area under the ROC curve (AUC)
Answer:

D

Explanation:
The area under the ROC curve (AUC) is a common evaluation metric for binary classification, which
measures the performance of a classifier at different threshold values for the predicted probabilities.
Other common metrics include accuracy, precision, recall, and F1 score, which are based on the
confusion matrix of true positives, false positives, true negatives, and false negatives.

Question 14

Which of the following cross-validation variants is a suitable, quicker form of cross-validation for very
large datasets with hundreds of thousands of samples?

  • A. k-fold cross-validation
  • B. Leave-one-out cross-validation
  • C. Holdout method
  • D. All of the above
Answer:

C

Explanation:
The holdout method is suitable for very large datasets because it is the simplest and quickest-to-compute form of cross-validation.
Holdout method
In this method, the dataset is divided into two sets, namely the training set and the test set, with the basic property that the training set is bigger than the test set. The model is then trained on the training set and evaluated using the test set.
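For example (a minimal sketch with scikit-learn; the 80/20 split ratio is just a common choice):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    # Holdout: one split into a larger training set and a smaller test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))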

Question 15

Which of the following cross-validation variants may not be suitable for very large datasets with
hundreds of thousands of samples?

  • A. k-fold cross-validation
  • B. Leave-one-out cross-validation
  • C. Holdout method
  • D. All of the above
Answer:

B

Explanation:
Leave-one-out cross-validation (LOO cross-validation) is not suitable for very large datasets because this validation technique requires one model to be created and evaluated for every sample in the training set.
Cross validation
Cross-validation is a technique to evaluate a machine learning model, and it is the basis for a whole class of model evaluation methods. The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it. It works by splitting the dataset into a number of subsets, keeping one subset aside, training the model on the remaining data, and testing the model on the held-out subset.
Leave-one-out cross validation
Leave-one-out cross-validation is k-fold cross-validation taken to its logical extreme, with k equal to N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all of the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross-validation is very expensive to compute.
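To see why, here is a minimal sketch with scikit-learn comparing how many model fits each strategy would require on 100,000 samples:

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X = np.zeros((100_000, 5))  # placeholder data; only the number of rows matters here

    kfold = KFold(n_splits=5)
    loo = LeaveOneOut()

    # Each split corresponds to one model that must be trained and evaluated.
    print("k-fold (k=5) model fits:", kfold.get_n_splits(X))  # 5
    print("leave-one-out model fits:", loo.get_n_splits(X))   # 100000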
