New Grad DevOps / SRE Engineer Interview Questions
New grad DevOps/SRE interviews test foundational Linux, networking, and scripting knowledge alongside a basic understanding of reliability concepts like SLOs, on-call, and incident response. You won't be expected to have designed production systems at scale, but you must show genuine curiosity about systems, comfort with the command line, and the ability to reason through operational problems you haven't seen before. Expect interviewers to probe how you think, not just what you've memorized.
What to expect
A typical new grad SRE/DevOps loop includes a recruiter screen, one or two technical phone screens covering Linux fundamentals and scripting, a coding round (usually easier LeetCode-style problems or scripting tasks in Python/Bash), and a virtual or onsite loop with a systems/debugging round, a design-lite round (design a monitoring setup or a simple deployment pipeline), and one behavioral interview. Some companies replace the coding round with a take-home automation task. The bar is calibrated to internship and academic project experience: you're not expected to have run production Kubernetes clusters, but you should know why you'd want one.
12 questions, with how to answer them
Linux & OS Fundamentals
1. A process on a Linux server is consuming 100% CPU. Walk me through how you'd identify it and decide what to do next.
How to answer: Start with `top` or `htop` to find the PID and process name. Use `ps aux --sort=-%cpu` for a snapshot. Drill into the process with `strace -p <PID>` to see system calls, or `lsof -p <PID>` to see open files. Check `/proc/<PID>/status` for context. Decide: is this expected load, a runaway loop, or a stuck process? Explain the tradeoff between killing it immediately vs. capturing a core dump first for debugging.
What they look for: Can the candidate navigate core Unix diagnostic tools without needing to Google every flag? Do they think before acting — i.e., do they understand that blindly killing a process might lose important state? Interviewers want to see a structured triage mindset, not just a list of commands.
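A minimal sketch of the first triage step, assuming you have captured the text output of something like `ps -eo pid,pcpu,comm`. The function name `top_cpu` and the sample output are made up for illustration; the point is only to rank processes by CPU before deciding what to inspect further.

```python
from typing import List, Tuple


def top_cpu(ps_output: str, limit: int = 3) -> List[Tuple[int, float, str]]:
    """Return (pid, cpu_percent, command) tuples, highest CPU first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, %cpu, then the rest as the command
        if len(parts) < 3:
            continue
        pid, pcpu, comm = parts
        rows.append((int(pid), float(pcpu), comm))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Hypothetical sample in the assumed "PID %CPU COMMAND" layout.
SAMPLE = """\
  PID %CPU COMMAND
    1  0.0 systemd
 4242 99.7 python3
  812  3.1 nginx
"""

print(top_cpu(SAMPLE, limit=2))
```

In an interview, the follow-up matters more than the ranking: once you have the PID, say what you would look at next and why, before reaching for `kill`.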
Networking
2. Explain what happens at each layer of the network stack when you type `curl https://example.com` in a terminal.
How to answer: Walk through: DNS resolution (resolver cache → recursive resolver → authoritative NS), TCP three-way handshake to port 443, TLS handshake (certificate validation, cipher negotiation, session key exchange), HTTP GET request over the encrypted channel, server response, TCP teardown. Mention tools like `dig`, `tcpdump`, `openssl s_client` that map to each layer. Name the OSI or TCP/IP layers you're referencing.
What they look for: This is a classic depth-probe. Interviewers want to see how far down the stack you can go and whether you understand where common failures occur (DNS timeouts, TLS cert errors, TCP resets). A new grad who can fluently reach TLS details signals strong fundamentals.
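The same sequence can be sketched with Python's standard `socket` and `ssl` modules, one step per layer. This is an illustration, not a replacement for `curl`: `build_request` and `fetch` are hypothetical helper names, the request is a bare HTTP/1.1 GET, and redirects, chunked decoding, and error handling are left out.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Application layer: a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Resolve, connect, negotiate TLS, send a GET, and return the raw response."""
    # 1. DNS resolution: hostname -> socket address.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as raw:
        raw.settimeout(timeout)
        # 2. TCP three-way handshake happens inside connect().
        raw.connect(sockaddr)
        # 3. TLS handshake: certificate and hostname are validated here.
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # 4. HTTP request and response over the encrypted channel.
            tls.sendall(build_request(host))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("example.com")` needs network access. Each numbered comment marks a place where a real failure can surface: a resolution error at step 1, a timeout or reset at step 2, a certificate error at step 3.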
Scripting & Automation
3. Write a Bash script that monitors a log file and sends an alert (print to stdout is fine) whenever the word 'ERROR' appears more than 5 times in a 60-second window.
How to answer: Use `tail -F` to follow the log in real time (it keeps following across log rotation). Pipe into a `while read` loop that increments a counter for each line containing 'ERROR' and resets it every 60 seconds with a timestamp check; `grep -c` alone won't work on a live stream because it only prints its count at end of input. A cleaner approach: buffer lines in a sliding window using an array and `date +%s` for timestamps. Mention that in production you'd use a real alerting tool (Alertmanager, Datadog), but the exercise shows scripting fluency.
What they look for: Can the candidate write working, readable Bash without scaffolding? Do they handle edge cases (log rotation, buffering)? Bonus points for mentioning `inotifywait` or for offering to rewrite it in Python for robustness. Interviewers look for practical judgment, not just syntax recall.
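If you offer the Python rewrite, the sliding-window logic might look like the sketch below. It assumes events arrive as `(timestamp, line)` pairs so the logic can be tested without a live log, and it counts lines containing 'ERROR' rather than individual occurrences; `error_bursts` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def error_bursts(
    events: Iterable[Tuple[float, str]],
    threshold: int = 5,
    window: float = 60.0,
) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, count) whenever more than `threshold` ERROR lines
    fall inside the trailing `window` seconds."""
    recent: Deque[float] = deque()  # timestamps of recent ERROR lines
    for ts, line in events:
        if "ERROR" not in line:
            continue
        recent.append(ts)
        # Drop timestamps that have aged out of the window.
        while recent and ts - recent[0] >= window:
            recent.popleft()
        if len(recent) > threshold:
            yield ts, len(recent)


# Six errors within six seconds, then one isolated error much later.
events = [(float(i), "ERROR disk full") for i in range(6)]
events += [(30.0, "INFO ok"), (100.0, "ERROR late")]
print(list(error_bursts(events)))  # [(5.0, 6)]
```

For a live file you would feed it pairs of `time.time()` and each new line. As written it yields on every ERROR line while the count stays above the threshold, so a real alert would also need deduplication.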
CI/CD & Deployment
4. Describe the stages you'd include in a CI/CD pipeline for a simple web service, and explain why each stage exists.
How to answer: Stages: source checkout → dependency install → unit tests → static analysis/linting → build artifact (Docker image) → push to registry → deploy to staging → integration/smoke tests → deploy to production (with a manual gate or canary). Explain the purpose of each gate: catching bugs early is cheaper than finding them in prod. Mention rollback strategy (image tags, blue/green or canary) and why you don't deploy directly from a developer's laptop.
What they look for: Interviewers want to see that you understand CI/CD as a reliability tool, not just automation for its own sake. Does the candidate reason about failure modes at each stage? Do they know what a canary deployment is and why it reduces blast radius? This separates candidates who've used pipelines from those who understand them.
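The gating idea can be shown with a toy runner: stages run in order and the first failure stops everything after it. The stage names and lambdas below are placeholders, not a real CI system's API.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stop at the first failure, return the stages that passed."""
    passed = []
    for name, step in stages:
        if not step():
            print(f"FAILED at {name}; later stages skipped")
            break
        passed.append(name)
    return passed


# Placeholder stages; each callable stands in for a real job.
stages = [
    ("unit tests", lambda: True),
    ("lint", lambda: True),
    ("build image", lambda: True),
    ("staging smoke tests", lambda: False),  # simulated failure
    ("deploy to production", lambda: True),
]

print(run_pipeline(stages))  # ['unit tests', 'lint', 'build image']
```

The production deploy never runs because the staging gate failed. That is the property worth explaining: each stage limits how far a bad change can travel.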
Containers & Orchestration
5. What is the difference between a Docker image and a container, and what problem does Kubernetes solve that Docker alone doesn't?
How to answer: Image = immutable filesystem snapshot + metadata (layers, entrypoint). Container = a running process with an isolated filesystem, network namespace, and PID namespace instantiated from an image. Docker alone handles single-host container lifecycle. Kubernetes adds: scheduling across a cluster, desired-state reconciliation (if a container crashes, it is restarted or replaced), service discovery, load balancing, rolling updates, and resource quotas. Mention the control plane (API server, scheduler, controller manager, etcd) vs. worker node (kubelet, container runtime, and the pods they run) distinction if you know it.
What they look for: New grads often conflate image and container — correctly distinguishing them is table stakes. The Kubernetes answer reveals whether the candidate has thought about *why* orchestration exists (single points of failure, manual scaling, no self-healing) rather than just knowing the brand name.
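Desired-state reconciliation is easier to explain with a toy model. The sketch below is not Kubernetes code: `reconcile` is a hypothetical function that compares desired replica counts with what is running and lists the actions a controller would take.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], running: Dict[str, int]) -> List[str]:
    """Return the actions needed to move `running` replica counts toward `desired`."""
    actions = []
    for name in sorted(set(desired) | set(running)):
        want = desired.get(name, 0)
        have = running.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
    return actions


# Two web replicas have crashed and a leftover job is still running.
print(reconcile({"web": 3, "worker": 2}, {"web": 1, "worker": 2, "old-job": 1}))
# ['stop 1 x old-job', 'start 2 x web']
```

An orchestrator runs a comparison like this continuously. Self-healing follows from that loop, without anyone restarting containers by hand.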
Reliability & SRE Concepts
6. What is an SLO, and how does it differ from an SLA? Give a concrete example of each for a hypothetical API service.
How to answer: SLA (Service Level Agreement): a contractual commitment to customers with consequences (refunds, penalties). E.g., 'We guarantee 99.9% monthly uptime; breaches trigger service credits.' SLO (Service Level Objective): an internal target that drives engineering decisions, usually stricter than the SLA. E.g., 'Our internal target is 99.95% success rate on /search requests over a 28-day window.' Error budget = 1 - SLO. Mention SLIs (the actual measured metrics, e.g., request success rate) as the measurement layer.
What they look for: Interviewers want to see that you understand the SLI→SLO→SLA hierarchy and that SLOs are engineering tools, not just management theater. A candidate who can articulate error budgets and why they exist (to make tradeoffs between reliability and feature velocity) signals SRE-specific maturity.
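It helps to turn the two example targets into concrete numbers. The sketch below assumes the SLA is measured as uptime over a 30-day month and the SLO as a request success ratio. The function names and the figure of 10 million requests are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability an availability target allows over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def allowed_failures(slo: float, total_requests: int) -> int:
    """Failed requests a request-based SLO allows for a given request volume."""
    return round((1.0 - slo) * total_requests)


print(round(error_budget_minutes(0.999, 30), 2))  # 43.2 minutes per 30-day month
print(allowed_failures(0.9995, 10_000_000))       # 5000 failed /search requests
```

Quoting the budget as "about 43 minutes a month" or "5,000 failures per 10 million requests" shows you can use the concept, which is more convincing than reciting the definition.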
Incident Response
7. You're on call and receive an alert that p99 latency on your API has spiked from 80ms to 3 seconds. Walk me through your response.
How to answer: Immediately check the scope: is it one endpoint, one region, or all traffic? Pull dashboards: look at error rates alongside latency (latency spike + low error rate → requests are slow but still succeeding, which points to saturation or a slow dependency; latency spike + high error rate → requests are likely timing out or failing). Check recent deploys, config changes, or cron jobs that fired recently. Look at the dependencies your API calls (database slow query logs, health of backing services). Mitigate first (roll back if a recent deploy correlates), then diagnose. Communicate status to stakeholders. After resolution: write a postmortem draft.
What they look for: Interviewers want to see structured thinking under pressure. The key signals: Do they separate correlation from causation? Do they check for recent changes early? Do they think about mitigation before root cause? Do they mention communication and postmortem? A new grad who structures their answer (scope → data → hypotheses → mitigate → communicate) stands out.
Version Control & GitOps
8. What is a merge conflict, how does one occur, and how do you resolve it safely in a team environment?
How to answer: A merge conflict occurs when two branches modify the same lines of a file (or one deletes a file the other modifies) and Git can't auto-merge. Resolution: `git status` to identify conflicted files, open them and look for `<<<<<<<`, `=======`, `>>>>>>>` markers, choose the correct version (or combine both), mark resolved with `git add`, then `git commit`. Safe practices: communicate with the author of the conflicting branch, run tests after resolution, avoid resolving conflicts by blindly accepting 'ours' or 'theirs' without understanding both changes.
What they look for: This is a hygiene check. Interviewers want to confirm you've actually used Git collaboratively and understand that resolving a conflict is a semantic operation, not just a mechanical one. Candidates who mention running tests after resolution and talking to teammates show team awareness.
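One mechanical safety net is to check for leftover markers before committing. The sketch below is a hypothetical helper that reports lines starting with a conflict marker. It can flag false positives, for example a `=======` underline in a text file, and it does not replace reviewing the merged result. Git's own `git diff --check` performs a similar check.

```python
from typing import List

# Line prefixes Git writes into a conflicted file.
MARKERS = ("<<<<<<<", "=======", ">>>>>>>")


def conflict_marker_lines(text: str) -> List[int]:
    """Return 1-based line numbers that start with a Git conflict marker."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(MARKERS)
    ]


# A file that was saved with the conflict still unresolved.
SAMPLE = "a = 1\n<<<<<<< HEAD\nb = 2\n=======\nb = 3\n>>>>>>> feature\n"
print(conflict_marker_lines(SAMPLE))  # [2, 4, 6]
```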
Infrastructure as Code
9. What is the purpose of Terraform state, and what problems can arise if two engineers run `terraform apply` simultaneously?
How to answer: Terraform state maps your HCL configuration to real infrastructure resources. It stores IDs and metadata so Terraform knows what exists vs. what needs to change. If two engineers apply simultaneously, they can each read stale state, produce conflicting plans, and both apply — resulting in duplicate resources, overwritten changes, or corrupted state. Solution: remote state backends (S3 + DynamoDB for state locking, or Terraform Cloud) that provide a distributed lock so only one apply runs at a time. Mention that state can contain secrets and should be stored securely.
What they look for: Interviewers check whether you understand *why* Terraform state exists (not just that it does) and whether you've thought about real-world collaboration problems. State locking is a concrete operational concern that distinguishes candidates who've run Terraform in teams from those who've only done solo tutorials.
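The locking idea can be illustrated on a single machine with an exclusively created lock file. This is an analogy, not how Terraform backends implement locking: `state_lock` is a hypothetical helper, and a shared remote backend needs a lock every engineer's machine can see.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def state_lock(path: str):
    """Hold an exclusive lock file; raise if another holder already has it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"state is locked: {path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record who holds the lock
        yield
    finally:
        os.close(fd)
        os.remove(path)


lock_path = os.path.join(tempfile.mkdtemp(), "state.lock")
with state_lock(lock_path):
    try:
        with state_lock(lock_path):  # a second "apply" while the first is running
            pass
    except RuntimeError as exc:
        print("second apply refused:", exc)
```

The second attempt fails fast and never gets to act on stale state. That is the behavior you want from a locked backend.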
Observability
10. What are the three pillars of observability, and when would you use each one to debug a production issue?
How to answer: Metrics: aggregated numeric measurements over time (CPU %, request rate, error rate). Use them for alerting and spotting trends. Logs: structured or unstructured text records of discrete events. Use them to understand *what* happened on a specific request or in a specific component. Traces: records of a request's path across multiple services, with timing at each hop. Use them when metrics say something is slow but you can't tell *where* in a distributed system the latency originates. Tools: Prometheus/Grafana (metrics), ELK/Loki (logs), Jaeger/Zipkin (traces).
What they look for: Can the candidate articulate not just what each pillar is but *when* to reach for it? New grads who only know metrics haven't thought about distributed systems. Knowing that traces exist specifically to solve the distributed latency mystery problem shows SRE-relevant depth.
Behavioral
11. Tell me about a time you debugged a technical problem that took you significantly longer than expected. What made it hard, and what did you learn?
How to answer: Use STAR (Situation, Task, Action, Result). Be specific: name the system, the symptom, the misleading signals that sent you down wrong paths, and the actual root cause. Don't invent a story — use a real project, internship, or course assignment. The 'learn' section is critical: show that you extracted a process improvement (e.g., 'I now always check logs before assumptions,' or 'I learned to timebox hypothesis testing').
What they look for: Interviewers are evaluating intellectual honesty, persistence, and learning velocity — all more important than raw knowledge at the new grad level. A candidate who admits they went down wrong paths and explains why shows self-awareness. Candidates who give vague answers or claim they solved it quickly raise flags.
Systems Design (Lite)
12. How would you design a basic health-check monitoring system that alerts an on-call engineer when a web service goes down?
How to answer: Components: a probe service that sends HTTP GET requests to the target URL every 30 seconds and records success/failure + response time. A time-series store (or simple DB) for results. An alerting layer that fires if N consecutive checks fail (to avoid flapping alerts on transient failures). A notification path: PagerDuty or email. Discuss tradeoffs: probe from multiple regions to distinguish regional outages from global ones; avoid single-point-of-failure in the probe itself; alert thresholds to reduce false positives. Mention that this is essentially what Blackbox Exporter + Alertmanager does in the Prometheus ecosystem.
What they look for: Interviewers aren't expecting a full distributed systems design. They want to see whether you can decompose a real operational problem into components, identify failure modes in your own design (what if the probe crashes?), and connect your design to real tools. Structured thinking and awareness of false-positive alerting signal SRE instincts.
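The "N consecutive failures" rule is small enough to sketch. The class below is a hypothetical illustration of that alerting layer only. It takes probe results as booleans and reports state changes, with the probing and notification left out.

```python
from typing import Optional


class ConsecutiveFailureAlert:
    """Fire after `threshold` consecutive failed checks; resolve on the next success."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.firing = False

    def observe(self, ok: bool) -> Optional[str]:
        """Record one probe result; return 'ALERT' or 'RESOLVED' on a state change."""
        if ok:
            self.failures = 0
            if self.firing:
                self.firing = False
                return "RESOLVED"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.firing:
            self.firing = True
            return "ALERT"
        return None


alert = ConsecutiveFailureAlert(threshold=3)
results = [True, False, True, False, False, False, False, True]
print([alert.observe(ok) for ok in results])
# [None, None, None, None, None, 'ALERT', None, 'RESOLVED']
```

The single failure early in the sequence never pages anyone, and the alert fires once instead of on every failed check. Both properties are what keep on-call noise down.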
Study tips
- Get comfortable with `top`, `netstat`/`ss`, `strace`, `lsof`, `journalctl`, and `tcpdump` before your interviews — not just knowing they exist, but being able to run them and interpret their output. Practice in a Linux VM or WSL2 so the muscle memory is real.
- Study the Prometheus data model (labels, metric types: counter vs. gauge vs. histogram) specifically. New grad SRE interviews at companies that use the Prometheus ecosystem will test this, and most candidates show up only knowing 'metrics exist.'
- For Bash and Python scripting rounds, practice writing scripts that process text from stdin/files — log parsing, field extraction with `awk`/`sed`, and JSON manipulation with `jq` or Python's `json` module. These tasks appear constantly in take-home exercises.
- Read the Google SRE book's chapters on error budgets, SLOs, and toil (chapter 3, "Embracing Risk"; chapter 4, "Service Level Objectives"; chapter 5, "Eliminating Toil"). They're free online. Being able to define and discuss these concepts in your own words, not just recite definitions, separates you from candidates who only have software engineering backgrounds.
- When practicing system design questions, explicitly call out failure modes in your own designs. Interviewers at this level care more about whether you can reason about what breaks than whether your design is optimal. Saying 'my probe service is a single point of failure; I'd fix this by...' is a green flag.