Question 1

How would you define an SLO for a user-facing API, and what would you do when the error budget is 80% consumed halfway through the quarter?

Accepted Answer

Start by distinguishing the SLI (the metric), the SLO (the target), and the error budget (the allowance for unreliability). Define a concrete SLI, e.g., the fraction of requests completing in under 200ms with a 2xx status. Set a 99.5% SLO and state its window explicitly; the question frames the budget per quarter, so use a quarterly window rather than a monthly one. Then walk through the error budget burn scenario: calculate the remaining budget (80% consumed at the halfway point is a burn rate of 1.6, which exhausts the budget at about 62.5% of the quarter if nothing changes), check the burn rate alerts (multi-window, multi-burn-rate Prometheus rules routed through Alertmanager), convene a reliability review, freeze non-critical feature releases, and prioritize toil reduction or bug fixes. Reference the concept of fast-burn vs. slow-burn alerts.

What interviewers look for: Can you move beyond definitions to operational decision-making? Interviewers want to see you treat error budgets as an actual policy lever, not a dashboard metric. They're checking whether you understand burn rate alerting math and whether you can navigate the tension between shipping velocity and reliability without just saying 'slow down releases.'
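
The burn-rate arithmetic behind this answer can be sketched in a few lines. This is a minimal illustration, not a production alerting rule: the function names are hypothetical, and the 2%-of-budget-in-one-hour figure is an assumed example of a fast-burn threshold over a 30-day window.

```python
def error_budget_status(consumed_fraction: float, elapsed_fraction: float) -> dict:
    """Summarize error budget burn for one SLO window.

    Both arguments are fractions between 0 and 1 (budget spent, window elapsed).
    """
    burn_rate = consumed_fraction / elapsed_fraction
    return {
        # 1.0 means the budget runs out exactly at the end of the window.
        "burn_rate": burn_rate,
        "remaining_fraction": 1.0 - consumed_fraction,
        # Point in the window where the budget hits zero if the burn rate holds.
        "exhausted_at_fraction": 1.0 / burn_rate,
    }


def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


# Scenario from the question: 80% of the budget gone at the halfway point.
status = error_budget_status(consumed_fraction=0.8, elapsed_fraction=0.5)

# Assumed fast-burn example: 2% of a 30-day budget spent in 1 hour.
fast_burn = burn_rate_threshold(0.02, alert_window_hours=1, slo_window_hours=30 * 24)

if __name__ == "__main__":
    print(status)
    print(fast_burn)
```

A burn rate above 1 means the budget will not last the window, which is the quantitative trigger for the release freeze and reliability review described in the answer.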

Question 2

Walk me through how you handled a major production incident — what did you do in the first 15 minutes, and what did you change afterward?

Accepted Answer

Use a structured narrative: detection (how you found out: alert, customer report, dashboard), initial triage (what signals you checked first and why: error rate, latency, dependency health), mitigation (rollback, traffic shift, feature flag), communication (status page update, stakeholder ping), and then the post-mortem. For the aftermath, be specific: what action items were created, which were completed, how you verified the fix held. Name actual tools: PagerDuty, Datadog, Grafana, runbooks.

What interviewers look for: Interviewers are evaluating incident ownership and calm under pressure. They want to see a clear mental model for triage (symptoms → causes → blast radius), evidence that you communicate proactively rather than waiting for someone to ask, and that your post-mortems produce durable fixes rather than just documentation.

Question 3

Design a deployment pipeline for a microservice that needs to be deployed to production dozens of times per day with zero downtime and the ability to roll back in under 2 minutes.

Accepted Answer

Walk through stages: code commit → lint/unit test → build container image → push to registry with immutable tag → deploy to staging with integration tests → canary deploy to 5% production traffic → automated promotion or rollback based on SLO metrics → full rollout. Discuss the deployment strategy choice (canary vs. blue-green vs. rolling) with explicit tradeoffs. Explain how rollback works: re-deploy the previous image tag. Mention feature flags for decoupling deploy from release. Address how you handle database migrations safely (expand/contract pattern).

What interviewers look for: Can you design a pipeline end-to-end with real safety gates, not just list CI/CD tool names? They want to see you reason about the rollback mechanism specifically, handle the DB migration problem (a common trap), and understand why canary + automated metric gates are safer than a timed rollout.
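
The "automated promotion or rollback based on SLO metrics" step can be sketched as a small decision function. This is an illustration under stated assumptions: the `CanaryWindow` type, the thresholds (0.5% errors, 200 ms p99, 500 requests minimum), and the three outcomes are all hypothetical, not taken from any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    """Metrics observed for the canary over one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_decision(window: CanaryWindow,
                    max_error_rate: float = 0.005,
                    max_p99_ms: float = 200.0,
                    min_requests: int = 500) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary."""
    # Too little traffic to judge: keep the canary at its current weight.
    if window.requests < min_requests:
        return "wait"
    error_rate = window.errors / window.requests
    # Any SLO breach on the canary triggers an automatic rollback.
    if error_rate > max_error_rate or window.p99_latency_ms > max_p99_ms:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(CanaryWindow(requests=1000, errors=3, p99_latency_ms=150.0)))
```

The "wait" outcome is the design choice worth mentioning in an interview: a gate that promotes on too little data is effectively a timed rollout.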

Question 4

A Kubernetes pod is crash-looping. Walk me through how you diagnose and fix it.

Accepted Answer

Start with kubectl get pods to see status and restart count. Then kubectl describe pod to read Events: look for OOMKilled, failed liveness probe, image pull errors, or scheduling failures. Then kubectl logs --previous to get the last crash output. Branch the diagnosis: if OOMKilled, check resource limits and application memory profile; if liveness probe failure, check probe timing vs. startup time (use startupProbe); if init container failure, check init logs separately. Fix forward: adjust limits, fix probe config, or fix the application bug. Mention that you'd check node pressure (kubectl describe node) if scheduling is involved.

What interviewers look for: This is a real-world debugging exercise. They want to see a systematic, layered approach rather than random guessing. The OOMKilled and liveness probe timing paths are the most common real failures; candidates who know to check --previous logs and describe node signal genuine Kubernetes operational experience.
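
The branching in this answer can be written down as a tiny triage function. It is a sketch only: the inputs are assumed to have been read by hand from kubectl describe pod (the last terminated reason and the event messages), and the function name and return strings are hypothetical.

```python
from typing import List, Optional


def diagnose_crash_loop(last_terminated_reason: Optional[str],
                        event_messages: List[str],
                        init_container_failed: bool = False) -> str:
    """Map what `kubectl describe pod` shows to the next diagnostic step."""
    events = " ".join(event_messages).lower()
    if init_container_failed:
        # Init containers have their own logs; the main container never started.
        return "read the init container logs"
    if last_terminated_reason == "OOMKilled":
        return "check memory limits and the application's memory profile"
    if "liveness probe failed" in events:
        return "compare probe timing with startup time; consider a startupProbe"
    # No platform-level cause found: the application itself is exiting.
    return "read kubectl logs --previous for the application error"


if __name__ == "__main__":
    print(diagnose_crash_loop("OOMKilled", []))
```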

Question 5

What are the risks of running terraform apply in a CI/CD pipeline on a shared production environment, and how do you mitigate them?

Accepted Answer

Identify the real risks: concurrent applies causing state lock contention or corruption, broad IAM permissions in CI context, accidental destroy of resources, apply running on unreviewed code, and blast radius if the pipeline itself is compromised. Mitigations: use remote state with locking (S3 + DynamoDB or Terraform Cloud), require plan review as a PR step (Atlantis or Terraform Cloud PR automation), separate plan from apply jobs with a manual approval gate for production, scope IAM roles minimally per pipeline, and use workspaces or separate state per environment to limit blast radius (targeted applies with -target can help in an emergency but are not intended for routine use). Mention drift detection: Terraform has no "check mode", so run terraform plan -detailed-exitcode on a schedule and treat exit code 2 (changes pending) as drift.

What interviewers look for: They want to see security and operational awareness beyond 'just run it.' The state locking issue and the plan-then-apply separation are the two biggest signals. Candidates who mention the IAM scope and drift detection demonstrate infrastructure security maturity expected at mid-level.
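
Scheduled drift detection can be sketched as a thin wrapper around terraform plan. The sketch relies on the documented -detailed-exitcode behavior (0 = no changes, 1 = error, 2 = changes present); the function names and the idea of calling it from a scheduled CI job are assumptions for illustration.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Translate `terraform plan -detailed-exitcode` exit codes."""
    if code == 0:
        return "in_sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes are pending
    return "error"         # exit code 1 or anything unexpected


def check_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and classify the result.

    Assumes `terraform init` has already run and credentials are read-only.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Running this with a plan-only IAM role keeps the scheduled job from being another path to an unreviewed apply.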

Question 6

Your team's on-call alert volume has tripled in three months, but production incidents haven't increased. What do you do?

Accepted Answer

This is an alert hygiene / toil problem. First, classify the alert volume: which alerts fired most, what percentage were actionable (led to real mitigation) vs. noise (auto-resolved or ignored). Use this to calculate alert precision. Then attack by category: silence or delete alerts that auto-resolve within 5 minutes without action; convert alerts that require investigation but no immediate action into tickets; tighten thresholds on noisy alerts using historical data; add alert deduplication (Alertmanager grouping). Propose a regular alert review cadence. Distinguish symptom-based alerting (good) from cause-based alerting (often noisy).

What interviewers look for: This is a toil recognition and process improvement question. They're checking whether you treat alerts as a system to be engineered, not just a notification stream. The symptom vs. cause alerting distinction and the precision/recall framing signal SRE conceptual depth. The process answer (regular review cadence) signals maturity.
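
Alert precision, as used in this answer, is the share of fired alerts that led to real action. A minimal sketch of the calculation follows; the alert names and the (name, was_actionable) input shape are made up for illustration and would come from your paging tool's export in practice.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def alert_precision(alerts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-alert precision: actionable firings divided by total firings."""
    fired: Counter = Counter()
    actionable: Counter = Counter()
    for name, was_actionable in alerts:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return {name: actionable[name] / fired[name] for name in fired}


# Hypothetical three-month history: (alert name, did someone have to act?)
history = [
    ("DiskFull", True), ("DiskFull", True),
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", False), ("HighCPU", True),
]
precision = alert_precision(history)

if __name__ == "__main__":
    # Lowest-precision alerts first: these are the candidates to delete or demote.
    for name, value in sorted(precision.items(), key=lambda item: item[1]):
        print(name, value)
```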

Question 7

Explain what happens at the network level when a Kubernetes service of type LoadBalancer receives a request from the internet and it reaches a pod.

Accepted Answer

Trace the path: external client → cloud load balancer (usually L4 for a Service of type LoadBalancer, e.g., AWS NLB; an L7 ALB normally comes from an Ingress instead) → NodePort on one of the cluster nodes → kube-proxy iptables/IPVS rules for the Service → selected pod IP via DNAT. Explain that kube-proxy maintains iptables rules that DNAT traffic addressed to the ClusterIP:port or the NodePort to a pod IP:port. Discuss that with externalTrafficPolicy: Local you preserve the client source IP and avoid an extra hop, but traffic is only delivered to pods on the node that received it, so load can be spread unevenly across pods. Mention CNI role (Flannel, Calico, Cilium) in pod networking. If using an ingress controller, explain where that fits (before the Service in the L7 path).

What interviewers look for: This tests whether you actually understand what Kubernetes networking is doing, not just that services exist. The iptables DNAT step and the externalTrafficPolicy tradeoff are where strong candidates differentiate. Interviewers are checking that you can debug connectivity issues from first principles rather than cargo-culting kubectl port-forward.
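
One detail that helps when reading the DNAT rules: in iptables mode, kube-proxy picks an endpoint with a chain of random-match rules whose probabilities are 1/n, 1/(n-1), and so on, with the last rule matching unconditionally. The sketch below only models that arithmetic to show why the result is an even split; the function names are hypothetical and no iptables calls are made.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(endpoints: int) -> List[Fraction]:
    """Per-rule match probability for a chain of `endpoints` DNAT rules."""
    # Rule i is only reached if the earlier rules did not match,
    # so it must claim 1/(remaining endpoints) of the leftover traffic.
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints: int) -> List[Fraction]:
    """Overall share of traffic each endpoint receives."""
    shares = []
    remaining = Fraction(1)
    for probability in rule_probabilities(endpoints):
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```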

Question 8

A Linux server's load average is 40 but CPU utilization is only 15%. What's happening and how do you investigate?

Accepted Answer

High load with low CPU means processes are blocked on I/O or waiting for something other than CPU, most likely disk I/O wait or uninterruptible sleep (D state). Start with top or htop: look at wa% (iowait) and count processes in D state. Use iostat -x 1 to see per-device await (average wait time) and %util. Use iotop to find which processes are generating I/O. Check dmesg for storage errors. Also consider: NFS mounts hanging, memory pressure causing swap thrashing, or a flock/semaphore contention issue. Load average counts D-state processes, which explains the discrepancy.

What interviewers look for: This is a classic systems diagnostic question that filters out candidates who only know CPU-level tools. The D-state explanation is the key insight. They want to see a structured tool chain: top → iostat → iotop → dmesg, not random guessing. NFS hang awareness is a bonus signal.
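
Counting D-state processes can also be done directly from /proc on Linux, which is useful when top is slow to respond on a loaded box. This is a minimal sketch; the helper names are hypothetical, and it only reads the state field of each /proc/<pid>/stat file.

```python
import glob
from typing import Dict


def parse_state(stat_line: str) -> str:
    """Extract the process state letter from a /proc/<pid>/stat line.

    The format is "pid (comm) state ...". The command name may contain
    spaces or parentheses, so locate the last ')' instead of splitting.
    """
    return stat_line[stat_line.rindex(")") + 2]


def count_states(proc_root: str = "/proc") -> Dict[str, int]:
    """Count processes per state letter (R = running, D = uninterruptible sleep, ...)."""
    counts: Dict[str, int] = {}
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as handle:
                state = parse_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    states = count_states()
    # Linux load average counts runnable (R) plus uninterruptible (D) tasks.
    print("R:", states.get("R", 0), "D:", states.get("D", 0))
```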

Question 9

How do you manage secrets in a Kubernetes-native application, and what are the risks of using Kubernetes Secrets naively?

Accepted Answer

Start with the problem: native Kubernetes Secrets are base64-encoded (not encrypted) in etcd by default, readable by anyone whose RBAC role allows kubectl get secret in the namespace, and often exposed through environment variables in pod specs. Mitigations layer: enable etcd encryption at rest, use RBAC to restrict secret access, prefer volume mounts over env vars (env vars can leak in crash dumps and logs), and integrate with an external secrets manager. Discuss options: HashiCorp Vault with the agent injector or Vault Secrets Operator, AWS Secrets Manager via External Secrets Operator, Sealed Secrets for GitOps workflows. Mention that the right answer depends on the threat model and operational complexity tolerance.

What interviewers look for: The base64-not-encryption point is the first filter. Then they want to see you reason about the actual threat vectors (etcd access, env var leakage, RBAC gaps) and know at least one real secrets management integration pattern. The External Secrets Operator or Vault agent patterns signal production experience beyond toy setups.
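
The base64 point is easy to demonstrate: decoding needs no key at all. The sketch below mimics the data field of a Secret with a made-up value; the helper name and the sample secret are hypothetical.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64 values of a Secret's `data` map. No key is required."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# What an attacker with read access to the Secret (or to etcd) would see.
stored = {"db-password": base64.b64encode(b"s3cr3t-password").decode("ascii")}
recovered = decode_secret_data(stored)

if __name__ == "__main__":
    print(stored)
    print(recovered)
```

Anyone who can read the object can reverse the encoding, which is why encryption at rest and RBAC scoping carry the actual protection.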

Question 10

Tell me about a time you identified and reduced toil for your team. How did you measure the impact?

Accepted Answer

Structure the answer as: the situation, the specific toil (manual, repetitive, automatable, and growing with load), what you built to eliminate it, and how you measured before and after. Be concrete: 'We manually promoted releases by SSHing into Jenkins and clicking Build: 45 minutes of engineer time per deploy, 3 deploys per week. I built a Slack-bot-triggered pipeline that reduced this to 2 minutes and zero manual steps, saving ~2 hours/week.' The measurement framing matters: time saved per occurrence × frequency, or reduction in on-call interrupts.

What interviewers look for: They're checking whether you have internalized the SRE concept of toil as something to be actively eliminated, not just endured. The measurement component is critical: mid-level engineers should be able to quantify operational improvements, not just describe them qualitatively. Vague answers ('I automated some stuff') are a flag.
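
The "time saved per occurrence × frequency" framing is simple enough to check with the numbers from the example quote. A minimal worked calculation, with a hypothetical helper name:

```python
def weekly_minutes(minutes_per_occurrence: float, occurrences_per_week: float) -> float:
    """Engineer time spent per week on one recurring task."""
    return minutes_per_occurrence * occurrences_per_week


# Numbers from the example: 45 min per deploy before, 2 min after, 3 deploys per week.
before = weekly_minutes(45, 3)
after = weekly_minutes(2, 3)
saved_hours_per_week = (before - after) / 60

if __name__ == "__main__":
    print(before, after, round(saved_hours_per_week, 2))
```

The result is a little over two hours per week, consistent with the "~2 hours/week" claim in the quote.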

Question 11

Describe a situation where a development team pushed back on a reliability requirement you were enforcing. How did you handle it?

Accepted Answer

The strong answer shows you understand the underlying tension (developer velocity vs. reliability) and that you resolved it through data and alignment, not authority. Walk through: what the requirement was, why the dev team pushed back (context matters: was the timeline tight? was the requirement unclear?), how you translated the requirement into terms that resonated with the dev team (customer impact, incident cost, SLO math), and how you found a middle ground or phased approach. Show that you can be an advocate for reliability without being an obstacle to shipping.

What interviewers look for: This tests cross-functional collaboration and whether you can operate as a partner to engineering rather than a gatekeeper. They want to see empathy for dev team constraints and the ability to make the reliability case with data. Candidates who describe 'winning' the argument via escalation are a flag.

Question 12

Design a global alerting and on-call system for a company with 200 engineers across 3 time zones. What are the key components and failure modes you'd design against?

Accepted Answer

Break into components: ingestion (metrics/logs/traces → alerting rules engine, e.g., Prometheus rules feeding Alertmanager, or Grafana alerting with Grafana OnCall), routing (alert → right team/person based on service ownership registry), escalation policies (primary → secondary → manager after N minutes), notification channels (PagerDuty, Opsgenie, SMS, phone), a status page (Statuspage.io), and a post-mortem workflow. For failure modes: the alerting system itself going down (make it independent of the systems it monitors; use heartbeat/deadman alerts); alert storms (grouping, inhibition rules, rate limiting); timezone fairness (follow-the-sun rotation design); noisy alerts drowning real ones (alert quality review). Discuss tradeoffs of build vs. buy (PagerDuty vs. home-built).

What interviewers look for: This is a system design question with an operational twist. They want to see you reason about the alerting system's own reliability (who watches the watchmen?), not just the happy path. The follow-the-sun rotation design and alert storm mitigation patterns signal real on-call experience. Build vs. buy reasoning with concrete rationale is a strong mid-level signal.
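
Two of the components above, follow-the-sun routing and time-based escalation, can be sketched as pure functions. Everything here is an assumption for illustration: the three region names, the 8-hour UTC shifts, and the 15-minute escalation step are hypothetical and not tied to PagerDuty, Opsgenie, or any other product.

```python
from datetime import datetime, timezone

# Hypothetical rotation: (shift start hour in UTC, region on call).
ROTATION = [(0, "APAC"), (8, "EMEA"), (16, "AMER")]

# Hypothetical escalation chain, one step every `step_minutes`.
ESCALATION = ["primary", "secondary", "manager"]


def on_call_region(now: datetime) -> str:
    """Pick the region whose shift covers `now` (must be timezone-aware)."""
    hour = now.astimezone(timezone.utc).hour
    region = ROTATION[0][1]
    for start_hour, name in ROTATION:
        if hour >= start_hour:
            region = name
    return region


def escalation_target(minutes_unacknowledged: int, step_minutes: int = 15) -> str:
    """Who should be paged after the alert has gone unacknowledged this long."""
    level = min(minutes_unacknowledged // step_minutes, len(ESCALATION) - 1)
    return ESCALATION[level]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(on_call_region(now), escalation_target(20))
```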

Mid-Level DevOps / SRE Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. How would you define an SLO for a user-facing API, and what would you do when the error budget is 80% consumed halfway through the quarter?

2. Walk me through how you handled a major production incident — what did you do in the first 15 minutes, and what did you change afterward?

3. Design a deployment pipeline for a microservice that needs to be deployed to production dozens of times per day with zero downtime and the ability to roll back in under 2 minutes.

4. A Kubernetes pod is crash-looping. Walk me through how you diagnose and fix it.

5. What are the risks of running terraform apply in a CI/CD pipeline on a shared production environment, and how do you mitigate them?

6. Your team's on-call alert volume has tripled in three months, but production incidents haven't increased. What do you do?

7. Explain what happens at the network level when a Kubernetes service of type LoadBalancer receives a request from the internet and it reaches a pod.

8. A Linux server's load average is 40 but CPU utilization is only 15%. What's happening and how do you investigate?

9. How do you manage secrets in a Kubernetes-native application, and what are the risks of using Kubernetes Secrets naively?

10. Tell me about a time you identified and reduced toil for your team. How did you measure the impact?

11. Describe a situation where a development team pushed back on a reliability requirement you were enforcing. How did you handle it?

12. Design a global alerting and on-call system for a company with 200 engineers across 3 time zones. What are the key components and failure modes you'd design against?

Study tips