Mid-Level DevOps / SRE Engineer Interview Questions

Mid-level DevOps/SRE interviews probe whether you can operate production systems with real ownership — not just run tools someone else configured. Expect deep questions on reliability fundamentals, incident response, CI/CD design, and the tradeoffs between automation speed and system safety. Interviewers distinguish mid-level candidates from juniors by whether they can reason about failure modes before they happen and from seniors by whether they own cross-team reliability strategy.

What to expect

A typical loop for this level runs 4–6 rounds: one or two coding/scripting rounds (Linux, Python, or Go — not LeetCode-heavy, but real ops scripting), one or two system design rounds focused on infra or reliability architecture (design a deployment pipeline, design an alerting system), one infrastructure/tools deep-dive where they probe your hands-on knowledge of specific platforms (Kubernetes, Terraform, AWS/GCP), and at least one behavioral round assessing incident ownership and cross-functional collaboration. Some companies add a 'break this system' debugging exercise or live incident simulation. You will be expected to know what SLOs, error budgets, and toil are — not just define them but discuss how you've applied them.

These are the questions every DevOps / SRE Engineer gets.

12 questions, with how to answer them

  1. Reliability & SRE Fundamentals

    1. How would you define an SLO for a user-facing API, and what would you do when the error budget is 80% consumed halfway through the quarter?

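    How to answer: Start by distinguishing the SLI (the metric), the SLO (the target), and the error budget (the allowance for unreliability). Define a concrete SLI, e.g., the fraction of requests completing in under 200ms with a 2xx status, and set a target such as 99.5% over the quarter so the SLO window matches the scenario. Then walk through the burn scenario: 80% consumed at the halfway point is a burn rate of 1.6x, so the budget runs out well before the quarter ends unless something changes. Calculate the remaining budget, review what the multi-window burn-rate alerts (Prometheus alerting rules routed through Alertmanager) have been showing, convene a reliability review, freeze non-critical feature releases, and prioritize reliability work and bug fixes over new features. Reference the concept of fast-burn vs. slow-burn alerts.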

    What they look for: Can you move beyond definitions to operational decision-making? Interviewers want to see you treat error budgets as an actual policy lever, not a dashboard metric. They're checking whether you understand burn rate alerting math and whether you can navigate the tension between shipping velocity and reliability without just saying 'slow down releases.'
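
    The burn scenario in this question reduces to a few lines of arithmetic. The sketch below is illustrative only: the function name and the 90-day quarter are assumptions, not part of any monitoring tool.

```python
def budget_projection(consumed: float, elapsed: float, window_days: float) -> dict:
    """Project error budget exhaustion from consumption so far.

    consumed: fraction of the error budget already spent (0.8 = 80%)
    elapsed: fraction of the SLO window already elapsed (0.5 = halfway)
    window_days: length of the SLO window in days (assumed 90 for a quarter)
    """
    if not 0 < elapsed <= 1 or consumed < 0:
        raise ValueError("elapsed must be in (0, 1] and consumed must be >= 0")
    # A burn rate of 1.0 means the budget lasts exactly the whole window.
    burn_rate = consumed / elapsed
    # Day on which the budget hits zero if the current rate continues.
    exhaustion_day = window_days / burn_rate if burn_rate > 0 else None
    return {
        "burn_rate": burn_rate,
        "remaining": 1.0 - consumed,
        "exhaustion_day": exhaustion_day,
    }


projection = budget_projection(consumed=0.8, elapsed=0.5, window_days=90)
```

    With 80% consumed at the halfway mark, the burn rate is 1.6 and the budget would run out around day 56 of a 90-day quarter, which is the number that justifies a release freeze conversation.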

  2. Incident Response

    2. Walk me through how you handled a major production incident — what did you do in the first 15 minutes, and what did you change afterward?

    How to answer: Use a structured narrative: detection (how you found out — alert, customer report, dashboard), initial triage (what signals you checked first and why — error rate, latency, dependency health), mitigation (rollback, traffic shift, feature flag), communication (status page update, stakeholder ping), and then the post-mortem. For the aftermath, be specific: what action items were created, which were completed, how you verified the fix held. Name actual tools — PagerDuty, Datadog, Grafana, runbooks.

    What they look for: Interviewers are evaluating incident ownership and calm under pressure. They want to see a clear mental model for triage (symptoms → causes → blast radius), evidence that you communicate proactively rather than waiting for someone to ask, and that your post-mortems produce durable fixes rather than just documentation.

  3. CI/CD & Deployment

    3. Design a deployment pipeline for a microservice that needs to be deployed to production dozens of times per day with zero downtime and the ability to roll back in under 2 minutes.

    How to answer: Walk through stages: code commit → lint/unit test → build container image → push to registry with immutable tag → deploy to staging with integration tests → canary deploy to 5% production traffic → automated promotion or rollback based on SLO metrics → full rollout. Discuss the deployment strategy choice (canary vs. blue-green vs. rolling) with explicit tradeoffs. Explain how rollback works: re-deploy the previous image tag. Mention feature flags for decoupling deploy from release. Address how you handle database migrations safely (expand/contract pattern).

    What they look for: Can you design a pipeline end-to-end with real safety gates, not just list CI/CD tool names? They want to see you reason about the rollback mechanism specifically, handle the DB migration problem (a common trap), and understand why canary + automated metric gates are safer than a timed rollout.
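
    An automated metric gate for the canary stage can be sketched as a small decision function. Everything here is hypothetical: the `WindowStats` type, the thresholds, and the three outcomes are assumptions chosen for illustration, and a real pipeline would pull these numbers from its metrics backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    requests: int
    errors: int
    p99_latency_ms: float


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    max_error_ratio_delta: float = 0.005,  # assumed tolerance, tune per SLO
    max_latency_ratio: float = 1.2,        # assumed: canary p99 at most 20% worse
    min_requests: int = 500,               # assumed minimum sample size
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary."""
    # Too little traffic to judge: keep the canary at its current weight.
    if canary.requests < min_requests:
        return "wait"
    canary_err = canary.errors / canary.requests
    base_err = baseline.errors / baseline.requests if baseline.requests else 0.0
    # Compare against the baseline rather than an absolute number, so a
    # platform-wide blip does not get blamed on the new release.
    if canary_err - base_err > max_error_ratio_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

    The "wait" outcome matters in an interview answer: promoting on too few requests is the same mistake as a timed rollout, just with a dashboard attached.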

  4. Kubernetes & Container Orchestration

    4. A Kubernetes pod is crash-looping. Walk me through how you diagnose and fix it.

    How to answer: Start with kubectl get pods to see status and restart count. Then kubectl describe pod to read Events and the last container state: look for OOMKilled, failed liveness probes, or image pull errors. Then kubectl logs --previous to get the output of the last crashed container. Branch the diagnosis: if OOMKilled, check resource limits and the application's memory profile; if the liveness probe is failing, check probe timing vs. startup time (use a startupProbe for slow starters); if an init container is failing, check its logs separately with -c. Fix forward: adjust limits, fix probe config, or fix the application bug. Mention that you'd check node pressure (kubectl describe node) if evictions or scheduling problems are involved.

    What they look for: This is a real-world debugging exercise. They want to see a systematic, layered approach rather than random guessing. The OOMKilled and liveness probe timing paths are the most common real failures — candidates who know to check --previous logs and describe node signal genuine Kubernetes operational experience.
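
    The branching in this answer can be written down as a small triage function over the data kubectl already returns. This is a sketch under assumptions: it takes one entry of `status.containerStatuses` from `kubectl get pod -o json` plus the Event messages for the pod, and the function name and returned labels are hypothetical.

```python
def triage_container(container_status: dict, event_messages: list) -> str:
    """Pick the first diagnosis branch to follow for a crash-looping container."""
    waiting = container_status.get("state", {}).get("waiting") or {}
    last_term = container_status.get("lastState", {}).get("terminated") or {}

    # Image problems show up as a waiting reason; the container never ran.
    if waiting.get("reason") in ("ImagePullBackOff", "ErrImagePull"):
        return "image-pull"
    # The kernel OOM killer ended the previous run: look at memory limits.
    if last_term.get("reason") == "OOMKilled":
        return "oom"
    # Probe failures are reported as Events, not in the container status.
    if any("Liveness probe failed" in msg for msg in event_messages):
        return "liveness-probe"
    # Otherwise a non-zero exit points at the application itself.
    if last_term.get("exitCode", 0) != 0:
        return "app-crash"
    return "unknown"
```

    The ordering mirrors the spoken answer: rule out platform-level causes first, then fall through to `kubectl logs --previous` for an application crash.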

  5. Infrastructure as Code

    5. What are the risks of running terraform apply in a CI/CD pipeline on a shared production environment, and how do you mitigate them?

    How to answer: Identify the real risks: concurrent applies causing state lock contention or corruption, broad IAM permissions in the CI context, accidental destruction of resources, apply running on unreviewed code, and the blast radius if the pipeline itself is compromised. Mitigations: use remote state with locking (S3 + DynamoDB or Terraform Cloud), require plan review as a PR step (Atlantis or Terraform Cloud PR automation), separate plan from apply jobs with a manual approval gate for production and apply the saved plan file rather than re-planning, scope IAM roles minimally per pipeline, and isolate environments with separate state files or workspaces so one apply cannot touch another environment. Mention drift detection: run terraform plan -detailed-exitcode on a schedule and alert when it reports pending changes.

    What they look for: They want to see security and operational awareness beyond 'just run it.' The state locking issue and the plan-then-apply separation are the two biggest signals. Candidates who mention the IAM scope and drift detection demonstrate infrastructure security maturity expected at mid-level.
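
    Scheduled drift detection can be as small as wrapping `terraform plan -detailed-exitcode`, which exits 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The sketch below is a minimal example under assumptions: the function names and labels are hypothetical, and `workdir` is assumed to be an already-initialized Terraform directory with read-only credentials.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANING = {0: "in-sync", 1: "error", 2: "changes-pending"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a label; unexpected codes count as errors."""
    return PLAN_EXIT_MEANING.get(code, "error")


def run_drift_check(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is a normal outcome here, not a failure
    )
    return classify_plan_exit(result.returncode)
```

    Note that "changes-pending" covers both real drift and merged code that has not been applied yet, so the scheduled job is most useful when it runs against the same commit that was last applied.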

  6. Observability

    6. Your team's on-call alert volume has tripled in three months, but production incidents haven't increased. What do you do?

    How to answer: This is an alert hygiene / toil problem. First, classify the alert volume: which alerts fired most, what percentage were actionable (led to real mitigation) vs. noise (auto-resolved or ignored). Use this to calculate alert precision. Then attack by category: silence or delete alerts that auto-resolve within 5 minutes without action; convert alerts that require investigation but no immediate action into tickets; tighten thresholds on noisy alerts using historical data; add alert deduplication (Alertmanager grouping). Propose a regular alert review cadence. Distinguish symptom-based alerting (good) from cause-based alerting (often noisy).

    What they look for: This is a toil recognition and process improvement question. They're checking whether you treat alerts as a system to be engineered, not just a notification stream. The symptom vs. cause alerting distinction and the precision/recall framing signal SRE conceptual depth. The process answer (regular review cadence) signals maturity.
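
    Alert precision per rule is a simple count once each page is labeled actionable or not. This sketch assumes the history has already been exported as `(alert_name, was_actionable)` pairs; the function name and the sample alert names are made up for illustration.

```python
from collections import defaultdict


def alert_precision(history):
    """Return (alert_name, times_fired, precision) rows, noisiest first.

    precision = actionable firings / total firings for that alert.
    """
    fired = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in history:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    rows = [(name, fired[name], actionable[name] / fired[name]) for name in fired]
    # Highest volume first, so the review starts with the biggest offenders.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


history = (
    [("DiskFull", False)] * 8
    + [("HighErrorRate", True)] * 3
    + [("HighErrorRate", False)]
)
report = alert_precision(history)
```

    In this made-up sample, `DiskFull` fired 8 times with precision 0.0 and is the obvious candidate to delete or convert to a ticket, while `HighErrorRate` at 0.75 is worth keeping and tuning.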

  7. Networking & Systems

    7. Explain what happens at the network level when a Kubernetes service of type LoadBalancer receives a request from the internet and it reaches a pod.

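    How to answer: Trace the path: external client → cloud load balancer (typically L4 for a Service of type LoadBalancer, e.g., AWS NLB) → NodePort on one of the cluster nodes → kube-proxy iptables/IPVS rules on that node → DNAT to a selected pod IP:port → CNI routing to the pod, possibly on another node. Explain that kube-proxy programs the rules that DNAT traffic addressed to the NodePort or to the ClusterIP:port onto a backend pod IP:port; the ClusterIP is a virtual IP that exists only in those rules. Discuss that with externalTrafficPolicy: Local you preserve the client source IP and skip the extra hop, but each node forwards only to its own local pods, so load can spread unevenly. Mention the CNI plugin's role (Flannel, Calico, Cilium) in pod networking. If using an ingress controller, explain where that fits: it is an L7 proxy (often fronted by an ALB or its own LoadBalancer Service) that sits in front of the backend Services.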

    What they look for: This tests whether you actually understand what Kubernetes networking is doing, not just that services exist. The iptables DNAT step and the externalTrafficPolicy tradeoff are where strong candidates differentiate. Interviewers are checking that you can debug connectivity issues from first principles rather than cargo-culting kubectl port-forward.

  8. Linux & Systems Administration

    8. A Linux server's load average is 40 but CPU utilization is only 15%. What's happening and how do you investigate?

    How to answer: High load with low CPU means processes are blocked on I/O or waiting for something other than CPU — most likely disk I/O wait or uninterruptible sleep (D state). Start with top or htop: look at wa% (iowait) and count processes in D state. Use iostat -x 1 to see disk utilization, await (average wait time), and %util per device. Use iotop to find which processes are generating I/O. Check dmesg for storage errors. Also consider: NFS mounts hanging, memory pressure causing swap thrashing, or a flock/semaphore contention issue. Load average counts D-state processes, which explains the discrepancy.

    What they look for: This is a classic systems diagnostic question that filters out candidates who only know CPU-level tools. The D-state explanation is the key insight. They want to see a structured tool chain: top → iostat → iotop → dmesg, not random guessing. NFS hang awareness is a bonus signal.
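
    Counting D-state processes is the quickest way to confirm the diagnosis. The sketch below parses the text output of `ps -eo state,pid,comm` on Linux; the function name and the sample output are illustrative assumptions.

```python
def blocked_processes(ps_output: str) -> list:
    """Return (pid, command) for every process in uninterruptible sleep (D)."""
    blocked = []
    # The first line is the column header printed by ps.
    for line in ps_output.splitlines()[1:]:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].startswith("D"):
            blocked.append((int(parts[1]), parts[2]))
    return blocked


# Hypothetical output of: ps -eo state,pid,comm
sample = """S   PID COMMAND
S     1 systemd
D   812 nfsd
R  1044 python3
D  2210 rsync
"""
stuck = blocked_processes(sample)
```

    On a real host the input would come from running the `ps` command itself; a long list of D-state processes that all touch the same mount or device points straight at the storage layer.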

  9. Security & Compliance

    9. How do you manage secrets in a Kubernetes-native application, and what are the risks of using Kubernetes Secrets naively?

    How to answer: Start with the problem: native Kubernetes Secrets are only base64-encoded, and by default they are stored unencrypted in etcd. They are readable by anyone whose RBAC role allows get on Secrets in the namespace (or who can create pods that mount them), and they often leak via environment variables in pod specs. Layer the mitigations: enable etcd encryption at rest, use RBAC to restrict secret access, prefer volume mounts over env vars (env vars can leak in crash dumps and logs), and integrate with an external secrets manager. Discuss options: HashiCorp Vault with the agent injector or Vault Secrets Operator, AWS Secrets Manager via External Secrets Operator, Sealed Secrets for GitOps workflows. Mention that the right answer depends on the threat model and operational complexity tolerance.

    What they look for: The base64-not-encryption point is the first filter. Then they want to see you reason about the actual threat vectors (etcd access, env var leakage, RBAC gaps) and know at least one real secrets management integration pattern. The External Secrets Operator or Vault agent patterns signal production experience beyond toy setups.
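
    The base64 point is easy to demonstrate: decoding needs no key at all. This sketch assumes a `data` mapping shaped like the `data` field of a Secret manifest; the key name and the value are made up.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode every value of a Secret's data field. No key is required."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Hypothetical `data` section as it would appear in `kubectl get secret -o yaml`.
secret_data = {"db-password": "aHVudGVyMg=="}
plaintext = decode_secret_data(secret_data)
```

    Anyone who can read the manifest or the etcd contents can do the same, which is why the mitigations focus on who can read Secrets and how they are stored rather than on the encoding.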

  10. Behavioral / Ownership

    10. Tell me about a time you identified and reduced toil for your team. How did you measure the impact?

    How to answer: Structure the story around the situation, the specific toil (manual, repetitive, automatable, and scaling linearly with load), what you built to eliminate it, and how you measured before and after. Be concrete: 'We manually promoted releases by SSHing into Jenkins and clicking Build, which cost 45 minutes of engineer time per deploy at 3 deploys per week. I built a Slack-bot-triggered pipeline that reduced this to 2 minutes and zero manual steps, saving ~2 hours/week.' The measurement framing matters: time saved per occurrence × frequency, or reduction in on-call interrupts.

    What they look for: They're checking whether you have internalized the SRE concept of toil as something to be actively eliminated, not just endured. The measurement component is critical — mid-level engineers should be able to quantify operational improvements, not just describe them qualitatively. Vague answers ('I automated some stuff') are a flag.
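
    The measurement framing (time saved per occurrence × frequency) is worth checking against your own numbers before the interview. A minimal sketch using the figures from the example answer; the function name is an assumption.

```python
def weekly_minutes_saved(before_min: int, after_min: int, runs_per_week: int) -> int:
    """Time saved per occurrence multiplied by frequency."""
    return (before_min - after_min) * runs_per_week


# 45 minutes per deploy before, 2 minutes after, 3 deploys per week.
saved = weekly_minutes_saved(before_min=45, after_min=2, runs_per_week=3)
hours = saved / 60
```

    That comes to 129 minutes, a little over 2 hours per week, which matches the "~2 hours/week" claim in the story.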

  11. Behavioral / Collaboration

    11. Describe a situation where a development team pushed back on a reliability requirement you were enforcing. How did you handle it?

    How to answer: The strong answer shows you understand the underlying tension (developer velocity vs. reliability) and that you resolved it through data and alignment, not authority. Walk through: what the requirement was, why the dev team pushed back (context matters — was the timeline tight? was the requirement unclear?), how you translated the requirement into terms that resonated with the dev team (customer impact, incident cost, SLO math), and how you found a middle ground or phased approach. Show that you can be an advocate for reliability without being an obstacle to shipping.

    What they look for: This tests cross-functional collaboration and whether you can operate as a partner to engineering rather than a gatekeeper. They want to see empathy for dev team constraints and the ability to make the reliability case with data. Candidates who describe 'winning' the argument via escalation are a flag.

  12. System Design

    12. Design a global alerting and on-call system for a company with 200 engineers across 3 time zones. What are the key components and failure modes you'd design against?

    How to answer: Break the system into components: ingestion and rule evaluation (metrics/logs/traces → alerting rules, e.g., Prometheus rules feeding Alertmanager), routing (alert → right team/person based on a service ownership registry), escalation policies (primary → secondary → manager after N minutes), on-call scheduling and notification (PagerDuty, Opsgenie, or Grafana OnCall, delivering via push, SMS, and phone), a status page (Statuspage.io), and a post-mortem workflow. For failure modes: the alerting system itself going down (make it independent of the systems it monitors; use heartbeat/deadman alerts); alert storms (grouping, inhibition rules, rate limiting); timezone fairness (follow-the-sun rotation design); noisy alerts drowning real ones (alert quality review). Discuss tradeoffs of build vs. buy (PagerDuty vs. home-built).

    What they look for: This is a system design question with an operational twist. They want to see you reason about the alerting system's own reliability (who watches the watchmen?), not just the happy path. The follow-the-sun rotation design and alert storm mitigation patterns signal real on-call experience. Build vs. buy reasoning with concrete rationale is a strong mid-level signal.
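
    Two of the pieces above, time-based escalation and a deadman check, are small enough to sketch directly. The policy steps, the timings, and both function names are hypothetical; hosted tools implement the same ideas with their own configuration.

```python
def escalation_target(minutes_unacked: float, policy: list) -> str:
    """Return who should be notified for an alert unacknowledged this long.

    policy: (after_minutes, target) steps sorted by after_minutes, with the
    first step starting at 0.
    """
    target = policy[0][1]
    for after_minutes, who in policy:
        if minutes_unacked >= after_minutes:
            target = who
    return target


def heartbeat_stale(last_seen: float, now: float, max_silence: float) -> bool:
    """Deadman check: True when the monitoring pipeline has gone quiet.

    The heartbeat is an always-firing alert; silence means the alerting
    path itself is broken, so this must run outside that path.
    """
    return now - last_seen > max_silence


# Assumed policy: primary immediately, secondary at 10 min, manager at 25 min.
policy = [(0, "primary"), (10, "secondary"), (25, "manager")]
```

    The deadman check is the answer to "who watches the watchmen": it only helps if it runs on infrastructure that does not share a failure domain with the alerting system it watches.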

Study tips

  • Practice the 'toil audit' framing: for every tool or process you've worked with, be ready to explain what was manual, what you automated, and how you'd measure the before/after. Interviewers at this level expect you to have concrete numbers, not general claims.
  • Know the Kubernetes networking stack well enough to trace a packet from the internet to a pod. This path — LB → NodePort → kube-proxy iptables → pod — comes up in debugging scenarios constantly, and candidates who can't trace it reveal a surface-level understanding of Kubernetes.
  • Study error budget burn rate alerting math, not just the definition of SLOs. Understand why a 14.4x burn rate over 1 hour is a page-worthy signal even if you haven't violated your monthly budget yet: at that rate, a single hour consumes 2% of a 30-day error budget. The Google SRE Workbook chapter on alerting on SLOs is the canonical reference.
  • For behavioral questions, prepare 3–4 incidents you personally owned end-to-end. For each, know the detection method, your first 15 minutes of actions, the mitigation, the root cause, and what changed in the system or process afterward. Vague incident stories are one of the most common mid-level failure modes.
  • When designing systems (pipelines, alerting, infra), proactively bring up failure modes before the interviewer asks. Saying 'one risk here is X, and I'd mitigate it by Y' signals the engineering judgment that separates mid-level from junior. Waiting to be asked about failure modes signals that you don't think about them naturally.
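
The 14.4x figure in the burn rate tip falls out of one formula: burn rate = budget fraction spent × SLO window / alert window. The sketch below assumes a 30-day (720-hour) SLO window; the function names are illustrative.

```python
def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_hours: float = 720.0,  # assumed 30-day SLO window
) -> float:
    """Burn rate that spends budget_fraction of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio to alert on: the burn rate times the allowed error ratio."""
    return burn_rate * (1.0 - slo)


fast_burn = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1)
slow_burn = burn_rate_threshold(budget_fraction=0.05, alert_window_hours=6)
page_at = error_ratio_threshold(slo=0.995, burn_rate=fast_burn)
```

Spending 2% of the budget in 1 hour gives 14.4, and 5% in 6 hours gives 6. For a 99.5% SLO, the fast-burn page fires when the error ratio over the last hour exceeds about 7.2%.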
