Questions for the PCA were updated on: Nov 21, 2025
Which function would you use to calculate the 95th percentile latency from histogram data?
B
Explanation:
To calculate a percentile (e.g., 95th percentile) from histogram data in Prometheus, the correct
function is histogram_quantile(). It estimates quantiles based on cumulative bucket counts.
Example:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
This computes the 95th percentile request duration across all observed instances over the last 5
minutes.
Reference: Prometheus documentation – histogram_quantile() Function, Working with Histogram
Buckets.
Which Prometheus component handles service discovery?
B
Explanation:
The Prometheus Server is responsible for service discovery, which identifies the list of targets to
scrape. It integrates with multiple service discovery mechanisms such as Kubernetes, Consul, EC2,
and static configurations.
This allows Prometheus to automatically adapt to dynamic environments without manual
reconfiguration.
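As a minimal sketch (the job name and role are illustrative, not from the exam), a scrape configuration using Kubernetes service discovery could look like:
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
Prometheus then discovers pod targets automatically as they are created or removed.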
Reference: Prometheus documentation – Service Discovery Mechanisms, Prometheus Architecture.
What are the four golden signals of monitoring as defined by Google’s SRE principles?
A
Explanation:
The Four Golden Signals—Traffic, Errors, Latency, and Saturation—are key service-level indicators
defined by Google’s Site Reliability Engineering (SRE) discipline.
Traffic: Demand placed on the system (e.g., requests per second).
Errors: Rate of failed requests.
Latency: Time taken to serve requests.
Saturation: How “full” the system resources are (CPU, memory, etc.).
Prometheus and its metrics-based model are ideal for capturing these signals.
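For example, assuming a counter named http_requests_total with a code label (an assumption, not part of the question), the Errors signal could be expressed as a ratio of failed to total requests:
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))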
Reference: Google SRE Book & Prometheus Best Practices – Golden Signals and SLO Monitoring.
Which Alertmanager feature allows you to temporarily stop notifications for a specific alert?
A
Explanation:
The Silence feature in Alertmanager allows operators to mute specific alerts for a defined period.
Each silence includes a matcher (labels), a creator, a comment, and an expiration time.
Silencing is useful during maintenance windows or known outages to prevent alert noise. Unlike
inhibition, silences are manual and explicit.
Reference: Prometheus documentation – Silences in Alertmanager, Managing Alerts During
Maintenance Windows.
What does the rate() function in PromQL return?
B
Explanation:
The rate() function calculates the average per-second rate of increase of a counter over the specified
range. It smooths out short-term fluctuations and adjusts for counter resets.
Example:
rate(http_requests_total[5m])
returns the number of requests per second averaged over the last five minutes. This function is
frequently used in dashboards and alerting expressions.
Reference: Prometheus documentation – rate() Function Definition, Counter Handling in PromQL.
Where does Prometheus store its time series data by default?
B
Explanation:
By default, Prometheus stores its time series data in a local, embedded Time Series Database (TSDB)
on disk. The data is organized in block files under the data/ directory inside Prometheus’s storage
path.
Each block typically covers two hours of data, containing chunks, index, and metadata files. Older
blocks are compacted and deleted based on retention settings.
Reference: Prometheus documentation – Storage, TSDB Internals, Retention and Compaction.
Which Alertmanager feature prevents duplicate notifications from being sent?
C
Explanation:
Deduplication in Alertmanager ensures that identical alerts from multiple Prometheus servers or rule
evaluations do not trigger duplicate notifications.
Alertmanager compares alerts based on their labels and fingerprints; if an alert with identical labels
already exists, it merges or refreshes the existing one instead of creating a new notification.
This mechanism is essential in high-availability setups where multiple Prometheus instances monitor
the same targets.
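For illustration, a high-availability pair of Prometheus servers can point at the same Alertmanager instances (hostnames below are placeholders), so identical alerts are deduplicated rather than delivered twice:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']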
Reference: Prometheus documentation – Alertmanager Architecture, Deduplication, Grouping, and
Routing Logic.
What is the role of the Pushgateway in Prometheus?
B
Explanation:
The Pushgateway is a Prometheus component used to handle short-lived batch jobs that cannot be
scraped directly. These jobs push their metrics to the Pushgateway, which then exposes them for
Prometheus to scrape.
This ensures metrics persist beyond the job’s lifetime. However, it’s not designed for continuously
running services, as metrics in the Pushgateway remain static until replaced.
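A typical setup has Prometheus scrape the Pushgateway itself; a minimal sketch (the target address is a placeholder) sets honor_labels so the job and instance labels pushed by batch jobs are preserved:
scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.com:9091']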
Reference: Prometheus documentation – Pushgateway Overview, Best Practices for Short-Lived Jobs.
What does the increase() function do in PromQL?
B
Explanation:
The increase() function computes the total increase in a counter metric over a specified range vector.
It accounts for counter resets and only measures the net change in the counter’s value during the
time window.
Example:
increase(http_requests_total[5m])
This query returns how many HTTP requests occurred in the last five minutes. Unlike rate(), which
provides a per-second average rate, increase() gives the absolute number of increments.
Reference: Prometheus documentation – PromQL Function Reference, Counter Handling and
increase() vs. rate().
What does the evaluation_interval parameter in the Prometheus configuration control?
B
Explanation:
The evaluation_interval parameter defines how frequently Prometheus evaluates its recording and
alerting rules. It determines the schedule at which the rule engine runs, checking whether alert
conditions are met and generating new time series for recording rules.
For example, setting:
global:
  evaluation_interval: 30s
means Prometheus evaluates all configured rules every 30 seconds. This setting differs from
scrape_interval, which controls how often Prometheus collects data from targets.
Having a proper evaluation interval ensures alerting latency is balanced with system performance.
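To illustrate the distinction, both settings can sit side by side in the global block (the values shown are examples only):
global:
  scrape_interval: 15s
  evaluation_interval: 30s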
Reference: Prometheus documentation – Configuration File Reference, Rule Evaluation Cycle, Global
vs Job-Specific Parameters.
Which field in an alerting rules file indicates the time an alert needs to go from the pending to the firing state?
D
Explanation:
In Prometheus alerting rules, the for field specifies how long a condition must remain true
continuously before the alert transitions from the pending to the firing state. This feature prevents
transient spikes or brief metric fluctuations from triggering false alerts.
Example:
- alert: HighRequestLatency
  expr: http_request_duration_seconds_avg > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Request latency is above 1s for more than 5 minutes."
In this configuration, Prometheus evaluates the expression every rule evaluation cycle. The alert only
fires if the condition (http_request_duration_seconds_avg > 1) remains true for 5 consecutive
minutes. If it returns to normal before that duration, the alert resets and never fires.
This mechanism adds stability and noise reduction to alerting systems by ensuring only sustained
issues generate notifications.
Reference:
Verified from Prometheus documentation – Alerting Rules Configuration Syntax, Pending vs. Firing
States, and Best Practices for Alert Timing and Thresholds sections.
Which of the following signals belongs to symptom-based alerting?
D
Explanation:
Symptom-based alerting focuses on user-visible problems or service-impacting symptoms rather
than low-level resource metrics. In Prometheus and Site Reliability Engineering (SRE) practices, alerts
should signal conditions that affect users’ experience — such as high latency, request failures, or
service unavailability — instead of merely reflecting internal resource states.
Among the options, API latency directly represents the performance perceived by end users. If API
response times increase, it immediately impacts user satisfaction and indicates a possible service
degradation.
In contrast, metrics like disk space, CPU usage, or database memory utilization are cause-based
metrics — they may correlate with problems but do not always translate into observable user
impact.
Prometheus alerting best practices recommend alerting on symptoms (via RED metrics — Rate,
Errors, Duration) while using cause-based metrics for deeper investigation and diagnosis, not for
immediate paging alerts.
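As a sketch of a symptom-based alert (the metric name and threshold are assumptions, not from the exam), a rule could page on the 95th percentile of a request-duration histogram:
- alert: HighAPILatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
  for: 10m
  labels:
    severity: page
  annotations:
    description: "95th percentile API latency has been above 1s for 10 minutes."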
Reference:
Verified from Prometheus documentation – Alerting Best Practices, Symptom vs. Cause Alerting, and
RED/USE Monitoring Principles sections.
What popular open-source project is commonly used to visualize Prometheus data?
B
Explanation:
The most widely used open-source visualization and dashboarding platform for Prometheus data is
Grafana. Grafana provides native integration with Prometheus as a data source, allowing users to
create real-time, interactive dashboards using PromQL queries.
Grafana supports advanced visualization panels (graphs, heatmaps, gauges, tables, etc.) and enables
users to design custom dashboards to monitor infrastructure, application performance, and service-
level objectives (SLOs). It also provides alerting capabilities that can complement or extend
Prometheus’s own alerting system.
While Kibana is part of the Elastic Stack and focuses on log analytics, Thanos extends Prometheus for
long-term storage and high availability, and Loki is a log aggregation system. None of these tools
serve as the primary dashboarding solution for Prometheus metrics the way Grafana does.
Grafana’s seamless Prometheus integration and templating support make it the de facto standard
visualization tool in the Prometheus ecosystem.
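For reference, a minimal Grafana data-source provisioning file for Prometheus might look like the following (the URL is a placeholder for your Prometheus server):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true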
Reference:
Verified from Prometheus documentation – Visualizing Data with Grafana, and Grafana
documentation – Prometheus Data Source Integration and Dashboard Creation Guide.
When can you use the Grafana Heatmap panel?
B
Explanation:
The Grafana Heatmap panel is best suited for visualizing histogram metrics collected from
Prometheus. Histograms provide bucketed data distributions (e.g., request durations, response
sizes), and the heatmap effectively displays these as a two-dimensional density chart over time.
In Prometheus, histogram metrics are exposed as multiple time series with the _bucket suffix and
the label le (less than or equal). Grafana interprets these buckets to create visual bands showing how
frequently different value ranges occurred.
Counters, gauges, and info metrics do not have bucketed distributions, so a heatmap would not
produce meaningful output for them.
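A common query for feeding a heatmap keeps the le label so Grafana can reconstruct the buckets (the metric name is assumed), with the panel's data format set to "heatmap":
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)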
Reference:
Verified from Grafana documentation – Heatmap Panel Overview, Visualizing Prometheus
Histograms, and Prometheus documentation – Understanding Histogram Buckets.
What is a rule group?
A
Explanation:
In Prometheus, a rule group is a logical collection of recording and alerting rules that are evaluated
sequentially at a specified interval. Rule groups are defined in YAML files under the groups: key, with
each group containing a name, an interval, and a list of rules.
For example:
groups:
  - name: example
    interval: 1m
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
All rules in a group share the same evaluation schedule and are executed one after another. This
ensures deterministic order, especially when one rule depends on another’s result.
Reference:
Verified from Prometheus documentation – Rule Configuration, Rule Groups and Evaluation Order,
and Recording & Alerting Rules Guide.