Question 1

Walk me through a major production incident you owned end-to-end. How did you detect it, coordinate the response, and drive the postmortem to prevent recurrence?

Accepted Answer

Structure around a timeline: detection (what alerted you and why — was the alert good?), mitigation (immediate actions and why you chose them over alternatives), root cause analysis (5 Whys or Ishikawa, not just 'the deploy broke it'), and postmortem outcomes (specific action items with owners and deadlines, not vague 'we'll monitor it'). Name the tooling: PagerDuty, incident.io, Jira, Slack war rooms, runbooks. Quantify blast radius — affected users, revenue impact, SLO burn rate. What interviewers look for: Ownership and calm under pressure, structured thinking during chaos, quality of the postmortem culture you drive, and whether your corrective actions were systemic (fixing the class of problem) vs. symptomatic (patching one bug). Red flag: you describe the incident but minimize your own decisions or agency.

Question 2

Design a multi-region active-active deployment for a stateful web application that must achieve 99.99% availability. Walk through data consistency, failover, and traffic management.

Accepted Answer

Start by clarifying the consistency model the application needs: can it tolerate eventual consistency, or does it require strong consistency? Sketch the topology: two or more regions, each with a full compute stack. Address data replication (Cassandra or DynamoDB global tables for eventual consistency; CockroachDB or another consensus-based, synchronously replicated store for strong consistency). Traffic layer: Anycast DNS or a global load balancer (Cloudflare, AWS Global Accelerator). Failure modes: split-brain, replication lag, DNS TTL propagation delay. Discuss RPO/RTO targets explicitly. Cover deployment strategy: how do you roll out changes without region skew causing consistency bugs? What interviewers look for: Whether you immediately anchor on the CAP theorem tradeoff and force a real decision rather than hand-waving 'we'll use eventual consistency.' Interviewers want to see that you understand the coupling between application semantics and infrastructure design. Senior signal: you proactively raise the hardest problem (split-brain or partial failure) without being prompted.
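
To make the 99.99% target concrete, the arithmetic can be sketched as below. This is a rough illustration only: the function names are made up for this example, and the combined-availability formula assumes regions fail independently and failover is instant, which real systems do not achieve.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability if the service is up whenever at least one region is up.

    Assumption: independent regional failures and perfect, instant failover.
    Correlated failures (bad deploy, shared dependency) make the real
    number lower.
    """
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    print(f"99.99% budget: {downtime_budget_minutes(0.9999):.1f} min/year")
    print(f"two 99.9% regions: {combined_availability(0.999, 2):.6f}")
```

The useful interview point is the gap between the idealized number and reality: correlated failures and failover time eat most of the theoretical gain.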

Question 3

Your service has SLOs defined on latency and error rate, but engineers keep getting paged for issues that don't actually violate the SLO. How do you fix the alerting strategy?

Accepted Answer

Diagnose the two classic failure modes: alerts firing too early (using static thresholds instead of error-budget burn rates) and alerts that don't correlate to user impact. Introduce multi-window burn rate alerting (the Google SRE Workbook model: fast burn on short window + slow burn on long window). Explain how to calculate burn rate multipliers based on your error budget size. Distinguish symptom-based alerting (what the user experiences) from cause-based alerting (what might be causing it — the latter belongs in dashboards, not pages). Address alert fatigue compounding: tracking alert-to-action ratio, regular alert audits. What interviewers look for: Fluency with SLO math and error budgets beyond surface-level familiarity. The interviewer wants to hear you distinguish between 'alerting on causes' vs. 'alerting on symptoms' and see that you understand the burn rate model quantitatively. Generic answer: 'we'll tune the thresholds.' Strong answer: specific burn rate windows and why.
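
The burn rate math can be sketched as follows. The 30-day SLO period and the "2% of budget in 1 hour" and "5% of budget in 6 hours" pairs are the commonly cited SRE Workbook examples and are used here as assumptions; the function names are illustrative and not taken from any monitoring tool.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO window


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent
    within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Multi-window rule: page only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    print(burn_rate_threshold(0.02, 1))   # fast-burn multiplier
    print(burn_rate_threshold(0.05, 6))   # slow-burn multiplier
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of roughly 20, which clears the fast-burn threshold; the same long-window ratio with a recovered short window does not page.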

Question 4

You're migrating a large, manually-provisioned AWS infrastructure to Terraform. The existing infrastructure has drift, undocumented resources, and is partially shared between teams. How do you approach this?

Accepted Answer

Phase 1: inventory with AWS Config or Steampipe, identify resource relationships and dependencies, map ownership. Phase 2: import strategy. Use terraform import for existing resources, but highlight the risk that imported resources may show unexpected diffs or fail plan/apply on first run because the hand-written configuration is missing attributes or defaults. Use Terraformer for bulk import (or cf2tf where parts of the estate turn out to be defined in CloudFormation) as a starting scaffold, not a final answer. Phase 3: state management. Use separate state files per team/service boundary (avoid monolithic state) and remote state with S3+DynamoDB or Terraform Cloud. Address drift: run terraform plan in CI to detect drift before it becomes a problem. Phase 4: handle shared resources (VPCs, IAM roles) via data sources and outputs rather than duplicating or hard-coding. Governance: module versioning, code review gates on plan output. What interviewers look for: Whether you've actually done this and understand the messy reality, particularly the import complexity and drift management. Red flags: treating this as a clean greenfield problem, not mentioning state file strategy, or ignoring team/ownership boundaries. Senior signal: you proactively raise blast-radius risk of shared state.
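
One way to wire drift detection into CI is to rely on the exit codes of terraform plan -detailed-exitcode (0 for an empty diff, 1 for an error, 2 for a non-empty diff). The wrapper below is a minimal sketch: the function and enum names are invented for illustration, and it assumes the terraform binary is on PATH and the working directory is already initialized. Note that a non-empty diff on a branch can also mean unapplied code changes, so this check is most meaningful when run on a schedule against the main branch.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = 0  # plan succeeded, empty diff
    ERROR = 1       # plan failed
    DRIFT = 2       # plan succeeded, non-empty diff


def classify_exit_code(code: int) -> PlanResult:
    """Map a `terraform plan -detailed-exitcode` exit code to a result."""
    try:
        return PlanResult(code)
    except ValueError:
        # Anything unexpected is treated as an error (assumption).
        return PlanResult.ERROR


def check_drift(workdir: str) -> PlanResult:
    """Run a read-only plan in `workdir` and classify the outcome."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode",
         "-input=false", "-no-color", "-lock=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_exit_code(proc.returncode)
```

A scheduled job would call check_drift once per state directory and open a ticket or post to chat on DRIFT rather than paging.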

Question 5

Describe how you would design a Kubernetes cluster upgrade strategy for a production cluster running critical workloads with no maintenance window.

Accepted Answer

Blue/green cluster upgrade: provision a new cluster at the target version alongside the current one, migrate workloads incrementally using weighted traffic shifting or namespace migration, validate with canary traffic before cutover, decommission the old cluster. Alternative: rolling node group upgrade with PodDisruptionBudgets enforced. Drain nodes one at a time and ensure PDBs are correctly configured (common failures: a PDB set to minAvailable: 0 or not set at all, which gives no protection, or the opposite extreme, maxUnavailable: 0, which blocks the drain entirely). Control plane upgrade first, then node groups, never the reverse. Test compatibility: validate admission webhooks, custom resource versions (API versions deprecated or removed between minor versions are a common silent killer), and deprecated API usage with tools like Pluto or kubent. Verify workloads have proper readinessProbes and terminationGracePeriodSeconds before draining. What interviewers look for: Practical knowledge of the failure modes specific to k8s upgrades. Deprecated CRD API versions and misconfigured PDBs are real landmines that separate candidates with real upgrade experience from those who've only read the docs. Interviewers want to hear you name specific tools and failure scenarios unprompted.
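
The PDB pitfalls are easier to reason about with the eviction arithmetic written out. The function below is a simplified model, not the Kubernetes implementation: it handles integer values only (real PDBs also accept percentages) and the name is illustrative.

```python
from typing import Optional


def disruptions_allowed(healthy_pods: int, expected_pods: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """How many pods may be evicted right now under a PDB.

    Simplified model with integer values only. Exactly one of
    `min_available` or `max_unavailable` must be set, as with a real PDB.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available/max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    # minAvailable: 0 -> every pod can be evicted at once (no protection).
    print(disruptions_allowed(3, 3, min_available=0))
    # minAvailable equal to the replica count -> drain blocks indefinitely.
    print(disruptions_allowed(3, 3, min_available=3))
    # maxUnavailable: 1 -> one pod at a time, the usual safe setting.
    print(disruptions_allowed(3, 3, max_unavailable=1))
```

Both extremes show up in practice: a budget that allows everything makes the PDB decorative, and a budget that allows nothing stalls the node drain until someone overrides it.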

Question 6

How do you implement progressive delivery for a backend microservice where a bad deploy has historically caused cascading failures across dependent services?

Accepted Answer

Canary deployment with automated rollback: route 1–5% of traffic to the new version, instrument with SLO-aligned metrics (error rate, p99 latency, downstream error propagation via distributed tracing). Use a release operator (Argo Rollouts or Flagger) to automate the promotion/rollback decision based on metric thresholds rather than manual gates. Address the cascade risk specifically: circuit breakers in the service mesh (Envoy/Istio) prevent upstream failures from propagating — configure with appropriate thresholds and half-open state behavior. Feature flags (LaunchDarkly, Unleash) for decoupling deploy from release. Deployment frequency and MTTR as DORA metrics to benchmark improvement. What interviewers look for: Whether you connect the deployment mechanism to the actual problem (cascading failures) rather than just describing a generic canary setup. The strong answer names circuit breakers, explains why they matter for this specific failure mode, and shows awareness of the tooling tradeoffs. Red flag: describing blue/green as the answer without addressing the cascade risk.
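
The promotion/rollback decision can be sketched as a pure function over metrics for one analysis window. This is a hypothetical illustration of the logic only; it is not the Argo Rollouts or Flagger API, and the 1% error-rate and 1.25x latency thresholds are placeholder assumptions that would come from the service's SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    error_rate: float      # failed requests / total requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency over the window


def canary_verdict(baseline: WindowMetrics, canary: WindowMetrics,
                   max_error_rate: float = 0.01,
                   max_latency_ratio: float = 1.25) -> str:
    """Return "rollback" or "promote" for one analysis window.

    Thresholds are illustrative assumptions, not recommended defaults.
    """
    if canary.error_rate > max_error_rate:
        return "rollback"
    # Compare against the live baseline so a platform-wide slowdown does
    # not get blamed on the canary.
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    stable = WindowMetrics(error_rate=0.002, p99_latency_ms=180.0)
    print(canary_verdict(stable, WindowMetrics(0.003, 190.0)))
    print(canary_verdict(stable, WindowMetrics(0.003, 400.0)))
```

For the cascading-failure scenario, the same comparison would also be run on the error rates of the dependent services, since the canary itself can look healthy while its callers degrade.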

Question 7

A production service is experiencing intermittent latency spikes every few hours with no obvious application-level cause. How do you diagnose this systematically?

Accepted Answer

Layer the investigation: start at the application (is it GC pauses? connection pool exhaustion? log the latency distribution, not just averages), then move to the OS (perf, strace, iostat, vmstat — look for CPU steal time, iowait, memory pressure/OOM events in /var/log/kern.log), then network (ss -s for socket state, packet loss via mtr, retransmit rate), then infrastructure (noisy neighbor on shared hypervisor — CPU steal is the tell). For periodic spikes specifically, correlate with cron jobs, log rotation, GC cycles, autoscaler activity, or scheduled backup jobs. Use eBPF/bpftrace or Linux perf events to get low-overhead profiling without restart. Check p99/p999 — 'intermittent' often means a small percentile of requests, which average metrics hide. What interviewers look for: Systematic, layered debugging methodology — not random tool-throwing. The 'periodic' hint is deliberate: strong candidates immediately think about what repeats on a schedule (cron, GC, backup) and how to correlate timestamps. Red flag: jumping straight to application code without OS-level investigation, or not knowing how to investigate without access to application internals.
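
The point about averages hiding tail latency can be shown with a few lines of arithmetic. The sample data below is synthetic and chosen purely for illustration (2% of requests hitting a slow path); the percentile helper uses the nearest-rank method and takes integer percentiles.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer `pct` in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean(samples: Sequence[float]) -> float:
    return sum(samples) / len(samples)


# Synthetic window: 980 requests at 20 ms, 20 requests stalled at 2000 ms.
LATENCIES_MS = [20] * 980 + [2000] * 20

if __name__ == "__main__":
    print("mean:", mean(LATENCIES_MS))
    print("p50: ", percentile(LATENCIES_MS, 50))
    print("p99: ", percentile(LATENCIES_MS, 99))
```

The mean moves by a few tens of milliseconds and the median does not move at all, while p99 jumps by two orders of magnitude, which is why the investigation should start from the latency distribution.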

Question 8

You're responsible for secrets management across 50+ microservices running on Kubernetes. How do you design and enforce a secrets management strategy at scale?

Accepted Answer

Central secrets store: HashiCorp Vault or AWS Secrets Manager as source of truth, never Kubernetes Secrets as the primary store (their values are only base64-encoded, and they are not encrypted at rest unless etcd encryption is configured). Dynamic secrets for databases (Vault database secrets engine): short-lived credentials eliminate rotation burden and limit blast radius. Kubernetes integration: Vault Agent Sidecar Injector or the Secrets Store CSI Driver with provider plugins; CSI is preferred as it avoids the sidecar proliferation problem. Service-to-Vault authentication: Kubernetes auth method using the service account JWT. Enforce least-privilege: each service gets a Vault policy scoped to exactly its secrets path. Audit: Vault audit log to SIEM. CI/CD: no secrets in environment variables or Docker layers; use OIDC-based short-lived token injection at deploy time. Detect drift/leakage: gitleaks in pre-commit and CI, periodic secret rotation validation. What interviewers look for: Whether you immediately flag that Kubernetes Secrets are not actually secret without additional configuration, which is the most common dangerous assumption. Interviewers also look for whether you understand the rotation and blast-radius problem and how dynamic secrets address it. Generic answer: 'use Vault.' Strong answer: specific auth method, policy scoping, and how you prevent secrets sprawl in CI.
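
The base64 point is easy to demonstrate: anyone who can read a Secret object can recover the value with a single standard-library call. The helper name and the sample value below are illustrative only.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one value from the `data` field of a Kubernetes Secret.

    base64 is an encoding, not encryption: no key is needed.
    """
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # "cGFzc3dvcmQ=" is what a manifest would hold for the value "password".
    print(decode_secret_value("cGFzc3dvcmQ="))
```

This is why read access to Secrets through RBAC, etcd backups, or a manifest committed to Git has to be treated as access to the plaintext.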

Question 9

Your infrastructure bill increased 40% last quarter with no corresponding increase in traffic. How do you investigate and remediate this?

Accepted Answer

Start with the FinOps taxonomy: an AWS Cost Explorer breakdown by service, tag, and team to isolate where the spike originated. Common culprits: data transfer costs (cross-AZ or cross-region calls that increased due to an architectural change), EBS/S3 storage accumulation (snapshot retention, log accumulation, orphaned volumes), EC2/compute rightsizing drift (over-provisioned auto-scaling groups that weren't updated after traffic patterns changed), or a new service deployed without cost review. Tools: AWS Cost Anomaly Detection for alerting, Infracost in CI to catch cost regressions at PR time, Kubecost for Kubernetes workload attribution. Remediation: Savings Plans and Reserved Instance coverage review, Spot Instance migration for fault-tolerant workloads, S3 Intelligent-Tiering for infrequently accessed data. Governance: tagging enforcement, showback/chargeback to teams. What interviewers look for: A structured investigation approach, not jumping to 'buy Reserved Instances.' The 40% jump with no traffic change is a signal of a specific change, not general over-provisioning, and strong candidates orient the investigation toward finding what changed. Senior signal: you mention Infracost or similar shift-left cost tooling to prevent recurrence, not just remediate.
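
The "find what changed" step amounts to ranking per-service deltas between two billing periods. The sketch below assumes the cost data has already been exported into plain dictionaries; the service names and dollar figures are invented for illustration and do not come from any real bill.

```python
from typing import Dict, List, Tuple


def top_cost_increases(previous: Dict[str, float],
                       current: Dict[str, float],
                       limit: int = 3) -> List[Tuple[str, float]]:
    """Rank services by absolute cost increase between two periods.

    Services absent from `previous` count as starting from zero, which
    surfaces newly deployed services.
    """
    deltas = {svc: cost - previous.get(svc, 0.0)
              for svc, cost in current.items()}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical quarterly totals in USD (a 40% overall increase).
LAST_QUARTER = {"EC2": 60_000, "S3": 15_000, "DataTransfer": 10_000,
                "RDS": 15_000}
THIS_QUARTER = {"EC2": 62_000, "S3": 18_000, "DataTransfer": 44_000,
                "RDS": 16_000}

if __name__ == "__main__":
    for service, delta in top_cost_increases(LAST_QUARTER, THIS_QUARTER):
        print(f"{service}: +{delta:,.0f}")
```

In this made-up example almost the entire increase sits in one line item, which is the usual shape of a jump caused by a specific change rather than by general over-provisioning.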

Question 10

Tell me about a time you disagreed with engineering leadership about a reliability or infrastructure decision. How did you handle it and what was the outcome?

Accepted Answer

Use a specific example where the stakes were real — not a minor disagreement. Structure: what the decision was and why leadership favored it (show you understood their perspective), what your concern was and how you quantified it (don't just say 'I felt it was risky' — show data: SLO impact, blast radius, failure probability), how you escalated constructively (written proposal, data-backed, presented alternatives with tradeoffs rather than just objecting), and the actual outcome — including if you lost the argument and how you committed to the decision anyway. Avoid framing where you were obviously right and leadership was obviously wrong. What interviewers look for: Whether you can disagree with technical rigor rather than just opinion, and whether you know when to escalate vs. commit. Interviewers at this level are screening for engineers who influence decisions through data and trust-building, not through stubbornness or passive resistance. Red flag: no real stakes, or a story where you were simply ignored.

Question 11

How have you built or improved an on-call culture in a team where alert fatigue and toil were significant problems?

Accepted Answer

Be specific about the starting state (quantify: how many pages per week, what percentage were actionable, what was engineer morale/attrition signal). Interventions: alert audit and triage to eliminate noise (every alert must have a documented action), runbook quality reviews, toil tracking (categorize toil vs. engineering work in sprint velocity), on-call rotation design (handoff protocols, escalation paths, explicit 'no paging outside business hours without P0 criteria'), blameless postmortem culture. Measure improvement: track actionable alert percentage, time-to-mitigate trend, on-call toil hours per rotation. Address the organizational side: on-call should be compensated, acknowledged, and not treated as background work. What interviewers look for: Whether you've actually led this kind of cultural change, not just participated in it. Senior signal: you mention measurement (before/after metrics), you address the organizational/compensation dimension, and you show that you treated it as a systemic problem rather than a 'people just need to respond faster' problem.

Question 12

Explain how you would implement and tune a circuit breaker for a service that makes synchronous calls to three downstream dependencies with very different latency and reliability profiles.

Accepted Answer

Per-dependency circuit breaker instances, never a single circuit breaker across all downstream calls, because a slow dependency will trip the breaker for a healthy one. Configure each independently: failure threshold (percentage-based, not count-based, to handle variable traffic), minimum request volume before evaluation, half-open probe frequency, and timeout values tuned to each dependency's p95 latency (not p50, because you need to catch tail latency degradation). Discuss the three states (closed/open/half-open) and the transition conditions. Address timeout budget: when calls are sequential, the worst-case timeouts of the downstream calls (including any retries) must sum to less than the calling service's own deadline; otherwise the caller's clients time out first and the timeouts cascade. Bulkhead pattern as a complement: separate thread pools or connection pools per dependency so a slow dependency doesn't exhaust shared resources. Observability: circuit state transitions should emit metrics and trigger non-paging alerts. Fallback behavior: what does the application do when a circuit is open? Return cached data, degrade gracefully, or fail fast? What interviewers look for: Whether you know that circuit breakers must be tuned per-dependency and that bulkheads solve the resource exhaustion problem that circuit breakers alone don't address. Generic answer: 'use Hystrix/Resilience4j and configure a threshold.' Strong answer: per-dependency configuration rationale, timeout budget math, and explicit discussion of fallback behavior, because a circuit breaker with no graceful degradation plan just changes how you fail, not whether you fail.
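
The three states and the percentage-based threshold can be sketched as a small class. This is a teaching sketch under stated assumptions, not a replacement for Hystrix or Resilience4j: it is single-threaded, lets any request through while half-open rather than limiting probes, and all parameter values and dependency names are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Minimal closed/open/half-open breaker over a sliding result window."""

    def __init__(self, failure_ratio: float, min_requests: int,
                 open_seconds: float, window: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_ratio = failure_ratio  # trip at this share of failures
        self.min_requests = min_requests    # do not evaluate below this volume
        self.open_seconds = open_seconds    # wait before a half-open probe
        self.clock = clock
        self.results: Deque[bool] = deque(maxlen=window)  # True = failure
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"  # let a probe through
                return True
            return False                  # fail fast; caller uses its fallback
        return True

    def record(self, failed: bool) -> None:
        if self.state == "open":
            return                        # ignore late results while open
        if self.state == "half-open":
            if failed:
                self._trip()              # probe failed: back to open
            else:
                self.state = "closed"     # probe succeeded: resume traffic
                self.results.clear()
            return
        self.results.append(failed)
        if (len(self.results) >= self.min_requests
                and sum(self.results) / len(self.results) >= self.failure_ratio):
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()


# One breaker per dependency, each tuned to its own profile (made-up values).
BREAKERS = {
    "payments": CircuitBreaker(failure_ratio=0.2, min_requests=20, open_seconds=30),
    "search": CircuitBreaker(failure_ratio=0.5, min_requests=10, open_seconds=10),
    "recommendations": CircuitBreaker(failure_ratio=0.7, min_requests=5, open_seconds=5),
}
```

The per-dependency dictionary is the design point: a flaky, non-critical dependency gets a loose threshold and a short open period, while a critical one gets a strict threshold and a longer cool-down, and neither can trip the other.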

Senior DevOps / SRE Engineer Interview Questions

What to expect

12 questions, with how to answer them

Study tips