
Senior DevOps / SRE Engineer Interview Questions

Senior SRE and DevOps interviews test your ability to own reliability at scale — not just operate tools, but design systems that fail gracefully, recover automatically, and improve measurably over time. Expect deep dives into incidents you've personally led, architecture decisions you've defended, and tradeoffs you've navigated under pressure. This guide covers the realistic question mix you'll face across system design, reliability engineering, infrastructure, and leadership.

What to expect

A senior SRE/DevOps loop typically runs 4–6 rounds: one coding screen (Linux internals, scripting, or a small automation task — not LeetCode-style algorithms), one or two system design rounds focused on distributed systems reliability and infrastructure architecture, a deep-dive into a past incident or on-call scenario, a behavioral/leadership round probing cross-team influence and production ownership, and often a domain-specific round on observability, CI/CD, or Kubernetes internals. At this level, interviewers care far less about whether you can recite kubectl commands and far more about whether you can reason about failure modes, make sound capacity and cost tradeoffs, and lead a team through ambiguity. Expect to be challenged on your own past decisions — what you'd do differently is as important as what you did.


12 questions, with how to answer them

  1. Incident Management & Reliability

    1. Walk me through a major production incident you owned end-to-end. How did you detect it, coordinate the response, and drive the postmortem to prevent recurrence?

    How to answer: Structure around a timeline: detection (what alerted you and why — was the alert good?), mitigation (immediate actions and why you chose them over alternatives), root cause analysis (5 Whys or Ishikawa, not just 'the deploy broke it'), and postmortem outcomes (specific action items with owners and deadlines, not vague 'we'll monitor it'). Name the tooling: PagerDuty, incident.io, Jira, Slack war rooms, runbooks. Quantify blast radius — affected users, revenue impact, SLO burn rate.

    What they look for: Ownership and calm under pressure, structured thinking during chaos, quality of the postmortem culture you drive, and whether your corrective actions were systemic (fixing the class of problem) vs. symptomatic (patching one bug). Red flag: you describe the incident but minimize your own decisions or agency.

  2. System Design

    2. Design a multi-region active-active deployment for a stateful web application that must achieve 99.99% availability. Walk through data consistency, failover, and traffic management.

    How to answer: Start by clarifying the consistency model the application needs: can it tolerate eventual consistency, or does it require strong consistency? Sketch the topology: two or more regions, each with a full compute stack. Address data replication (Cassandra or DynamoDB global tables for eventual consistency; consensus-based or synchronous replication with leader election, as in CockroachDB, for strong consistency). Traffic layer: Anycast DNS or a global load balancer (Cloudflare, AWS Global Accelerator). Failure modes: split-brain, replication lag, DNS TTL propagation delay. Discuss RPO/RTO targets explicitly. Cover deployment strategy: how do you roll out changes without version skew between regions causing consistency bugs?

    What they look for: Whether you immediately anchor on the CAP theorem tradeoff and force a real decision rather than hand-waving 'we'll use eventual consistency.' Interviewers want to see that you understand the coupling between application semantics and infrastructure design. Senior signal: you proactively raise the hardest problem (split-brain or partial failure) without being prompted.
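    Worked sketch: it helps to put numbers on the 99.99% target before drawing boxes. The Python below is a minimal illustration, not a capacity model: the function names are made up for this example, and the multi-region formula assumes region failures are independent and failover is instant, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def active_active_availability(region_availability: float, regions: int) -> float:
    """Availability when the service is up if at least one region is up.

    Assumption (rarely true in practice): regions fail independently and
    traffic shifts away from a failed region instantly.
    """
    return 1.0 - (1.0 - region_availability) ** regions


if __name__ == "__main__":
    print(f"99.99% budget: {downtime_budget_minutes(0.9999):.1f} min/year")
    print(f"two 99.9% regions: {active_active_availability(0.999, 2):.6f}")
```

    A 99.99% target leaves roughly 52.6 minutes of downtime per year, which is why DNS TTLs and failover detection time belong in the design discussion: a single slow failover can consume a large share of that budget.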

  3. Observability

    3. Your service has SLOs defined on latency and error rate, but engineers keep getting paged for issues that don't actually violate the SLO. How do you fix the alerting strategy?

    How to answer: Diagnose the two classic failure modes: alerts that fire too early (static thresholds instead of error-budget burn rates) and alerts that don't correlate with user impact. Introduce multi-window burn rate alerting (the Google SRE Workbook model: fast burn over a short window pages, slow burn over a long window opens a ticket). Explain how the burn-rate threshold follows from how much of the error budget you are willing to spend in a given window: spending 2% of a 30-day budget in 1 hour corresponds to a burn rate of 14.4. Distinguish symptom-based alerting (what the user experiences) from cause-based alerting (what might be causing it; the latter belongs in dashboards, not pages). Address how alert fatigue compounds: track the alert-to-action ratio and run regular alert audits.

    What they look for: Fluency with SLO math and error budgets beyond surface-level familiarity. The interviewer wants to hear you distinguish between 'alerting on causes' vs. 'alerting on symptoms' and see that you understand the burn rate model quantitatively. Generic answer: 'we'll tune the thresholds.' Strong answer: specific burn rate windows and why.
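    Worked sketch: the burn-rate arithmetic is short enough to derive on a whiteboard. The Python below is a minimal sketch under stated assumptions: a 30-day SLO period, and a paging rule that requires both a long and a short window to exceed the same threshold. The function names are illustrative and do not come from any monitoring tool.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent
    within `window_hours`, for an SLO measured over `slo_period_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    fast = burn_rate_threshold(budget_fraction=0.02, window_hours=1)
    print(f"fast-burn threshold: {fast:.1f}")
    print(should_page(0.02, 0.02, slo_target=0.999, threshold=fast))
```

    With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20, which exceeds the 14.4 fast-burn threshold. The same 2% over the last hour with a recovered short window would not page.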

  4. Infrastructure & IaC

    4. You're migrating a large, manually-provisioned AWS infrastructure to Terraform. The existing infrastructure has drift, undocumented resources, and is partially shared between teams. How do you approach this?

    How to answer: Phase 1: inventory with AWS Config or Steampipe, identify resource relationships and dependencies, map ownership. Phase 2: import strategy. Use terraform import for existing resources, but highlight the risk that the first plan after import shows unexpected changes because the written configuration is missing attributes or relies on different defaults. Use Terraformer for bulk import (or cf2tf to convert existing CloudFormation templates) as a starting scaffold, not a final answer. Phase 3: state management. Keep separate state files per team/service boundary (avoid monolithic state) and use remote state with S3+DynamoDB or Terraform Cloud. Address drift: run terraform plan on a schedule in CI so drift is detected before it becomes a problem. Phase 4: handle shared resources (VPCs, IAM roles) via data sources and outputs rather than duplicating or hard-coding. Governance: module versioning, code review gates on plan output.

    What they look for: Whether you've actually done this and understand the messy reality — particularly the import complexity and drift management. Red flags: treating this as a clean greenfield problem, not mentioning state file strategy, or ignoring team/ownership boundaries. Senior signal: you proactively raise blast-radius risk of shared state.
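    Worked sketch: one way to make drift visible in CI is to summarize the machine-readable plan. The Python below is a rough sketch that assumes you have saved a plan and rendered it with terraform show -json; it reads the resource_changes entries and groups resource addresses by planned action. The resource addresses in the example are hypothetical.

```python
import json
from collections import defaultdict


def summarize_plan(plan_json: str) -> dict:
    """Group resource addresses by planned action, skipping unchanged ones.

    Expects the JSON produced by `terraform show -json <planfile>`.
    """
    plan = json.loads(plan_json)
    summary = defaultdict(list)
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing would change for this resource
        # Replacements appear as two actions, e.g. ["delete", "create"].
        summary["/".join(actions)].append(resource["address"])
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical plan fragment for illustration only.
    example = json.dumps({"resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.app", "change": {"actions": ["delete", "create"]}},
    ]})
    for action, addresses in summarize_plan(example).items():
        print(action, addresses)
```

    A non-empty summary on a branch with no code changes is a reasonable drift signal. Anything listed under a delete action on freshly imported resources deserves a manual look before apply.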

  5. Kubernetes & Container Orchestration

    5. Describe how you would design a Kubernetes cluster upgrade strategy for a production cluster running critical workloads with no maintenance window.

    How to answer: Blue/green cluster upgrade: provision a new cluster at the target version alongside the current one, migrate workloads incrementally using weighted traffic shifting or namespace migration, validate with canary traffic before cutover, decommission the old cluster. Alternative: rolling node group upgrade with PodDisruptionBudgets enforced. Drain nodes one at a time and make sure PDBs are correctly configured. Common failures run in both directions: a PDB that protects nothing (minAvailable: 0, or no PDB at all) and a PDB that blocks the drain entirely (maxUnavailable: 0, or minAvailable equal to the replica count). Upgrade the control plane first, then node groups, never the reverse. Test compatibility: validate admission webhooks, custom resource versions (CRD API versions deprecated between minor versions are a common silent killer), and deprecated API usage with tools like Pluto or kubent. Verify workloads have proper readinessProbes and terminationGracePeriodSeconds before draining.

    What they look for: Practical knowledge of the failure modes specific to k8s upgrades — deprecated CRD API versions and misconfigured PDBs are real landmines that separate candidates with real upgrade experience from those who've only read the docs. Interviewers want to hear you name specific tools and failure scenarios unprompted.
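    Worked sketch: the PDB failure modes are easy to check with arithmetic before a drain starts. The Python below is a simplified model, not the Kubernetes implementation: it handles integer minAvailable and maxUnavailable only (no percentages), and the function names and labels are made up for this example.

```python
from typing import Optional


def allowed_disruptions(replicas: int, healthy: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> Optional[int]:
    """How many pods may be evicted right now; None means no PDB applies."""
    if min_available is not None and max_unavailable is not None:
        raise ValueError("a PDB sets minAvailable or maxUnavailable, not both")
    if min_available is None and max_unavailable is None:
        return None
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = replicas - max_unavailable
    return max(0, healthy - desired_healthy)


def classify_pdb(replicas: int, min_available: Optional[int] = None,
                 max_unavailable: Optional[int] = None) -> str:
    """Label a PDB assuming every replica is currently healthy."""
    allowed = allowed_disruptions(replicas, replicas, min_available, max_unavailable)
    if allowed is None or allowed >= replicas:
        return "no-protection"   # a drain can evict every pod at once
    if allowed == 0:
        return "blocks-drain"    # evictions are refused, so the drain hangs
    return "ok"


if __name__ == "__main__":
    print(classify_pdb(3, min_available=0))
    print(classify_pdb(1, min_available=1))
    print(classify_pdb(3, max_unavailable=1))
```

    The single-replica case with minAvailable: 1 is worth calling out in an interview: it looks safe on paper and stalls every node drain that touches it.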

  6. CI/CD & Deployment

    6. How do you implement progressive delivery for a backend microservice where a bad deploy has historically caused cascading failures across dependent services?

    How to answer: Canary deployment with automated rollback: route 1–5% of traffic to the new version, instrument with SLO-aligned metrics (error rate, p99 latency, downstream error propagation via distributed tracing). Use a release operator (Argo Rollouts or Flagger) to automate the promotion/rollback decision based on metric thresholds rather than manual gates. Address the cascade risk specifically: circuit breakers in the service mesh (Envoy/Istio) prevent upstream failures from propagating — configure with appropriate thresholds and half-open state behavior. Feature flags (LaunchDarkly, Unleash) for decoupling deploy from release. Deployment frequency and MTTR as DORA metrics to benchmark improvement.

    What they look for: Whether you connect the deployment mechanism to the actual problem (cascading failures) rather than just describing a generic canary setup. The strong answer names circuit breakers, explains why they matter for this specific failure mode, and shows awareness of the tooling tradeoffs. Red flag: describing blue/green as the answer without addressing the cascade risk.
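    Worked sketch: the promotion decision itself is a small comparison between baseline and canary metrics. The Python below illustrates the idea only; it is not the Argo Rollouts or Flagger API, and the metric names and default thresholds are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float             # fraction of failed requests
    p99_latency_ms: float
    downstream_error_rate: float  # errors observed by dependent services


def canary_verdict(baseline: Metrics, canary: Metrics,
                   max_error_rate: float = 0.01,
                   max_latency_ratio: float = 1.2,
                   max_downstream_delta: float = 0.002) -> str:
    """Return "rollback" if the canary breaches any guardrail, else "promote"."""
    if canary.error_rate > max_error_rate:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    # The cascade guardrail: the canary looks healthy on its own metrics
    # but dependent services see more errors than they do for the baseline.
    if canary.downstream_error_rate - baseline.downstream_error_rate > max_downstream_delta:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = Metrics(error_rate=0.002, p99_latency_ms=180.0, downstream_error_rate=0.001)
    canary = Metrics(error_rate=0.002, p99_latency_ms=185.0, downstream_error_rate=0.010)
    print(canary_verdict(baseline, canary))
```

    The third check is the one that ties the canary to the cascading-failure history: without a downstream signal, a canary that quietly breaks its callers would be promoted.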

  7. Linux & Systems

    7. A production service is experiencing intermittent latency spikes every few hours with no obvious application-level cause. How do you diagnose this systematically?

    How to answer: Layer the investigation: start at the application (is it GC pauses? connection pool exhaustion? log the latency distribution, not just averages), then move to the OS (perf, strace, iostat, vmstat; look for CPU steal time, iowait, and memory pressure or OOM events in the kernel log, for example dmesg or /var/log/kern.log depending on the distribution), then network (ss -s for socket state, packet loss via mtr, retransmit rate), then infrastructure (a noisy neighbor on a shared hypervisor, where CPU steal is the tell). For periodic spikes specifically, correlate with cron jobs, log rotation, GC cycles, autoscaler activity, or scheduled backup jobs. Use eBPF/bpftrace or Linux perf events to get low-overhead profiling without a restart. Check p99/p999: 'intermittent' often means a small percentile of requests, which average metrics hide.

    What they look for: Systematic, layered debugging methodology — not random tool-throwing. The 'periodic' hint is deliberate: strong candidates immediately think about what repeats on a schedule (cron, GC, backup) and how to correlate timestamps. Red flag: jumping straight to application code without OS-level investigation, or not knowing how to investigate without access to application internals.
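    Worked sketch: two of the steps above, checking tail percentiles and finding the spike period, can be shown on synthetic data. The Python below is a minimal sketch: the sample generator is hypothetical (20 ms baseline, 900 ms for the first 30 seconds of every hour), and the percentile uses the simple nearest-rank method.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def spike_intervals(samples, threshold_ms, min_gap_s=60.0):
    """Seconds between the starts of successive spike episodes.

    `samples` is a list of (timestamp_s, latency_ms). Spiky samples closer
    together than `min_gap_s` count as one episode.
    """
    starts = []
    last_spike_ts = None
    for ts, latency_ms in sorted(samples):
        if latency_ms < threshold_ms:
            continue
        if last_spike_ts is None or ts - last_spike_ts > min_gap_s:
            starts.append(ts)
        last_spike_ts = ts
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def synthetic_samples(hours=4, step_s=10, period_s=3600):
    """Hypothetical data: one sample every `step_s` seconds."""
    return [(t, 900.0 if t % period_s < 30 else 20.0)
            for t in range(0, hours * 3600, step_s)]


if __name__ == "__main__":
    samples = synthetic_samples()
    latencies = [latency for _, latency in samples]
    print("mean:", round(statistics.mean(latencies), 1))
    print("p99:", percentile(latencies, 99), "p99.9:", percentile(latencies, 99.9))
    print("intervals:", spike_intervals(samples, threshold_ms=500))
```

    In this synthetic data the mean stays under 30 ms and p99 sits at the baseline, while p99.9 exposes the 900 ms spike. The constant 3600-second interval is the cue to go looking for something scheduled hourly.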

  8. Security & Compliance

    8. You're responsible for secrets management across 50+ microservices running on Kubernetes. How do you design and enforce a secrets management strategy at scale?

    How to answer: Central secrets store: HashiCorp Vault or AWS Secrets Manager as the source of truth, never Kubernetes Secrets as the primary store (their values are only base64-encoded, and they are not encrypted at rest unless etcd encryption is configured). Dynamic secrets for databases (Vault database secrets engine): short-lived credentials eliminate the rotation burden and limit blast radius. Kubernetes integration: Vault Agent Sidecar Injector or the Secrets Store CSI Driver with provider plugins; CSI is often preferred because it avoids sidecar proliferation. Service-to-Vault authentication: the Kubernetes auth method using the service account JWT. Enforce least privilege: each service gets a Vault policy scoped to exactly its secrets path. Audit: Vault audit log to SIEM. CI/CD: no secrets in environment variables or Docker layers; use OIDC-based short-lived token injection at deploy time. Detect drift and leakage: gitleaks in pre-commit and CI, periodic secret rotation validation.

    What they look for: Whether you immediately flag that Kubernetes Secrets are not actually secret without additional configuration — this is the most common dangerous assumption. Interviewers also look for whether you understand the rotation and blast-radius problem and how dynamic secrets address it. Generic answer: 'use Vault.' Strong answer: specific auth method, policy scoping, and how you prevent secrets sprawl in CI.
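    Worked sketch: the "base64 is not encryption" point takes a few lines to demonstrate. The Python below decodes the data field of a Secret manifest the way anyone with read access to the object could; the key name and value are hypothetical.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode the base64 values found under `data` in a Kubernetes Secret."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


if __name__ == "__main__":
    # Hypothetical Secret contents, as they would appear in
    # `kubectl get secret <name> -o json` under the "data" field.
    secret_data = {"db-password": "aHVudGVyMg=="}
    print(decode_secret_data(secret_data))
```

    No key is needed to reverse the encoding, so the protection has to come from RBAC on the Secret objects and from etcd encryption at rest, not from the Secret format itself.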

  9. Capacity Planning & Cost

    9. Your infrastructure bill increased 40% last quarter with no corresponding increase in traffic. How do you investigate and remediate this?

    How to answer: Start with attribution: break spend down in AWS Cost Explorer by service, tag, and team to isolate where the increase originated. Common culprits: data transfer costs (cross-AZ or cross-region calls that increased due to an architectural change), EBS/S3 storage accumulation (snapshot retention, log accumulation, orphaned volumes), EC2/compute rightsizing drift (over-provisioned auto-scaling groups that weren't updated after traffic patterns changed), or a new service deployed without cost review. Tools: AWS Cost Anomaly Detection for alerting, Infracost in CI to catch cost regressions at PR time, Kubecost for Kubernetes workload attribution. Remediation: Savings Plans and Reserved Instance coverage review, Spot Instance migration for fault-tolerant workloads, S3 Intelligent-Tiering for infrequently accessed data. Governance: tagging enforcement, showback/chargeback to teams.

    What they look for: A structured investigation approach — not jumping to 'buy Reserved Instances.' The 40% jump with no traffic change is a signal of a specific change, not general over-provisioning, and strong candidates orient the investigation toward finding what changed. Senior signal: you mention Infracost or similar shift-left cost tooling to prevent recurrence, not just remediate.
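    Worked sketch: "find what changed" usually starts with ranking the quarter-over-quarter delta per service or tag. The Python below is a minimal sketch that assumes you have already exported spend per category from your billing tool; the category names and dollar figures are invented for the example.

```python
def rank_cost_deltas(previous: dict, current: dict) -> list:
    """Return (category, delta) pairs, largest increase first.

    Categories missing from one period are treated as zero spend, so new
    services and removed services both show up.
    """
    categories = set(previous) | set(current)
    deltas = [(name, current.get(name, 0) - previous.get(name, 0))
              for name in categories]
    # Sort by delta descending, then by name so ties are deterministic.
    return sorted(deltas, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical quarterly spend in dollars.
    last_quarter = {"EC2": 50_000, "S3": 10_000, "DataTransfer": 5_000}
    this_quarter = {"EC2": 52_000, "S3": 11_000, "DataTransfer": 26_000,
                    "NewService": 2_000}
    for name, delta in rank_cost_deltas(last_quarter, this_quarter):
        print(f"{name:>14}: {delta:+,}")
```

    In this made-up data the total rises by about 40% and most of it sits in one category, which is the shape the question hints at: a specific change rather than across-the-board growth.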

  10. Behavioral & Leadership

    10. Tell me about a time you disagreed with engineering leadership about a reliability or infrastructure decision. How did you handle it and what was the outcome?

    How to answer: Use a specific example where the stakes were real — not a minor disagreement. Structure: what the decision was and why leadership favored it (show you understood their perspective), what your concern was and how you quantified it (don't just say 'I felt it was risky' — show data: SLO impact, blast radius, failure probability), how you escalated constructively (written proposal, data-backed, presented alternatives with tradeoffs rather than just objecting), and the actual outcome — including if you lost the argument and how you committed to the decision anyway. Avoid framing where you were obviously right and leadership was obviously wrong.

    What they look for: Whether you can disagree with technical rigor rather than just opinion, and whether you know when to escalate vs. commit. Interviewers at this level are screening for engineers who influence decisions through data and trust-building, not through stubbornness or passive resistance. Red flag: no real stakes, or a story where you were simply ignored.

  11. Behavioral & Leadership

    11. How have you built or improved an on-call culture in a team where alert fatigue and toil were significant problems?

    How to answer: Be specific about the starting state (quantify: how many pages per week, what percentage were actionable, what was engineer morale/attrition signal). Interventions: alert audit and triage to eliminate noise (every alert must have a documented action), runbook quality reviews, toil tracking (categorize toil vs. engineering work in sprint velocity), on-call rotation design (handoff protocols, escalation paths, explicit 'no paging outside business hours without P0 criteria'), blameless postmortem culture. Measure improvement: track actionable alert percentage, time-to-mitigate trend, on-call toil hours per rotation. Address the organizational side: on-call should be compensated, acknowledged, and not treated as background work.

    What they look for: Whether you've actually led this kind of cultural change, not just participated in it. Senior signal: you mention measurement (before/after metrics), you address the organizational/compensation dimension, and you show that you treated it as a systemic problem rather than a 'people just need to respond faster' problem.

  12. Distributed Systems

    12. Explain how you would implement and tune a circuit breaker for a service that makes synchronous calls to three downstream dependencies with very different latency and reliability profiles.

    How to answer: Per-dependency circuit breaker instances, never a single circuit breaker across all downstream calls, because a slow dependency will trip the breaker for a healthy one. Configure each independently: failure threshold (percentage-based, not count-based, to handle variable traffic), minimum request volume before evaluation, half-open probe frequency, and timeout values tuned to each dependency's p95 latency (not p50, because you need to catch tail latency degradation). Discuss the three states (closed/open/half-open) and the transition conditions. Address the timeout budget: for sequential calls, the per-dependency timeouts (including retries) must sum to less than the calling service's own deadline; otherwise the caller's clients time out while it is still waiting on a dependency, and the timeouts cascade. Bulkhead pattern as a complement: separate thread pools or connection pools per dependency so a slow dependency doesn't exhaust shared resources. Observability: circuit state transitions should emit metrics and trigger non-paging alerts. Fallback behavior: what does the application do when a circuit is open? Return cached data, degrade gracefully, or fail fast?

    What they look for: Whether you know that circuit breakers must be tuned per-dependency and that bulkheads solve the resource exhaustion problem that circuit breakers alone don't address. Generic answer: 'use Hystrix/Resilience4j and configure a threshold.' Strong answer: per-dependency configuration rationale, timeout budget math, and explicit discussion of fallback behavior — because a circuit breaker with no graceful degradation plan just changes how you fail, not whether you fail.
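    Worked sketch: the state machine and the timeout-budget check fit in a short Python example. This is a simplified illustration, not Hystrix or Resilience4j: it counts requests cumulatively instead of over a sliding window, allows one probe in half-open, and takes an injectable clock so it can be tested. All names and numbers are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CircuitBreaker:
    failure_ratio: float      # trip when failures / requests reaches this
    min_requests: int         # do not evaluate below this request volume
    open_seconds: float       # time to stay open before allowing a probe
    clock: Callable[[], float] = time.monotonic
    state: str = field(default="closed", init=False)
    _requests: int = field(default=0, init=False)
    _failures: int = field(default=0, init=False)
    _opened_at: float = field(default=0.0, init=False)
    _probe_in_flight: bool = field(default=False, init=False)

    def allow(self) -> bool:
        """Should the caller attempt this request?"""
        if self.state == "open" and self.clock() - self._opened_at >= self.open_seconds:
            self.state = "half-open"
            self._probe_in_flight = False
        if self.state == "open":
            return False
        if self.state == "half-open":
            if self._probe_in_flight:
                return False      # only one probe at a time
            self._probe_in_flight = True
        return True

    def record(self, success: bool) -> None:
        """Report the outcome of a request that allow() let through."""
        if self.state == "open":
            return                # late result from before the trip
        if self.state == "half-open":
            self._probe_in_flight = False
            if success:
                self._close()
            else:
                self._trip()
            return
        self._requests += 1
        if not success:
            self._failures += 1
        if (self._requests >= self.min_requests
                and self._failures / self._requests >= self.failure_ratio):
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self.clock()
        self._requests = self._failures = 0

    def _close(self) -> None:
        self.state = "closed"
        self._requests = self._failures = 0


def fits_deadline(caller_deadline_s: float, dependency_timeouts_s: list) -> bool:
    """For sequential calls, the worst case is the sum of the timeouts."""
    return sum(dependency_timeouts_s) <= caller_deadline_s


# One breaker per dependency, each tuned separately (hypothetical values).
breakers = {
    "payments": CircuitBreaker(failure_ratio=0.2, min_requests=20, open_seconds=30.0),
    "search": CircuitBreaker(failure_ratio=0.5, min_requests=50, open_seconds=5.0),
    "recommendations": CircuitBreaker(failure_ratio=0.5, min_requests=10, open_seconds=60.0),
}
```

    A caller would check breakers[name].allow() before each call, fall back when it returns False, and report the outcome with record(). Keeping the breakers in a per-dependency map is what prevents a flaky recommendations service from opening the circuit for payments.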

Study tips

  • Run a structured postmortem on every significant incident in your current role before interviewing — interviewers will probe for specifics, and having 2–3 deeply analyzed incidents (with data, decision points, and systemic fixes) is more valuable than 10 vague stories.
  • Read the Google SRE Workbook chapters on SLO implementation and alerting before your loop — specifically the burn rate alerting math. Being able to derive error budget consumption and burn rate windows on a whiteboard separates candidates who've internalized SRE principles from those who've just read blog posts about them.
  • For system design rounds, practice explicitly naming the failure mode you're designing against before proposing a solution. Interviewers at this level test whether you enumerate failure modes unprompted — partial failure, split-brain, cascading failure, thundering herd — not just whether you can name AWS services.
  • If you're weak on Kubernetes internals, focus on the failure modes (PDB misconfiguration, deprecated API versions, resource limits vs. requests, scheduler behavior under pressure) rather than surface-level feature knowledge — these are what come up in real senior interviews.
  • Prepare a clear narrative on one infrastructure decision where you owned the tradeoff between reliability, cost, and engineering velocity — with real numbers. 'We chose X over Y because it reduced our error budget spend by Z% at N% higher cost' is the kind of specificity that signals senior-level judgment.
