Question 1

A process on a Linux server is consuming 100% CPU. Walk me through how you'd identify it and decide what to do next.

Accepted Answer

Start with `top` or `htop` to find the PID and process name. Use `ps aux --sort=-%cpu` for a snapshot. Drill into the process with `strace -p <PID>` to see system calls, or `lsof -p <PID>` to see open files. Check `/proc/<PID>/status` for context. Decide: is this expected load, a runaway loop, or a stuck process? Explain the tradeoff between killing it immediately vs. capturing a core dump first for debugging.

What interviewers look for: Can the candidate navigate core Unix diagnostic tools without needing to Google every flag? Do they think before acting, i.e., do they understand that blindly killing a process might lose important state? Interviewers want to see a structured triage mindset, not just a list of commands.
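The first triage step in the answer above (find the busiest PID) can also be scripted. The following is a minimal sketch, assuming the input looks like the output of `ps -eo pid,pcpu,comm`; the function name `top_cpu` and the sample text are hypothetical, made up for illustration.

```python
def top_cpu(ps_output: str, limit: int = 3) -> list[tuple[int, float, str]]:
    """Parse `ps -eo pid,pcpu,comm` style output and return the busiest processes."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, command = line.split(None, 2)
        rows.append((int(pid), float(pcpu), command.strip()))
    # Sort here so the result does not depend on any ps sort flags.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Hypothetical sample; on a real host the text could come from
# subprocess.run(["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True).stdout
SAMPLE = """\
  PID %CPU COMMAND
    1  0.0 systemd
 4242 99.7 worker
  917  1.3 sshd
"""

for pid, pcpu, command in top_cpu(SAMPLE, limit=2):
    print(pid, pcpu, command)
```

A script like this only identifies the process; the decision about killing it versus capturing state first still needs a human.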

Question 2

Explain what happens at each layer of the network stack when you type `curl https://example.com` in a terminal.

Accepted Answer

Walk through: DNS resolution (resolver cache → recursive resolver → authoritative NS), TCP three-way handshake to port 443, TLS handshake (certificate validation, cipher negotiation, session key exchange), HTTP GET request over the encrypted channel, server response, TCP teardown. Mention tools like `dig`, `tcpdump`, `openssl s_client` that map to each layer. Name the OSI or TCP/IP layers you're referencing.

What interviewers look for: This is a classic depth-probe. Interviewers want to see how far down the stack you can go and whether you understand where common failures occur (DNS timeouts, TLS cert errors, TCP resets). A new grad who can fluently reach TLS details signals strong fundamentals.
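To make the layering concrete, here is a rough sketch of the same sequence using only Python's standard `socket` and `ssl` modules. It is not how `curl` is implemented; the function names `build_get_request` and `fetch_status_line` are hypothetical, and the numbered comments map to the steps in the answer above.

```python
import socket
import ssl


def build_get_request(host: str, path: str = "/") -> bytes:
    """HTTP/1.1 request bytes that travel inside the TLS channel."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(host: str, port: int = 443, timeout: float = 5.0) -> str:
    # 1. DNS: resolve the name to an address (roughly what `dig` shows).
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    # 2. TCP: connect() performs the three-way handshake.
    with socket.socket(family, socktype, proto) as tcp:
        tcp.settimeout(timeout)
        tcp.connect(sockaddr)
        # 3. TLS: handshake, certificate validation, hostname check.
        context = ssl.create_default_context()
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            # 4. HTTP: request and response go over the encrypted channel.
            tls.sendall(build_get_request(host))
            response = b""
            while b"\r\n" not in response:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                response += chunk
    # 5. Leaving the `with` blocks closes the TLS session and the TCP socket.
    return response.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling `fetch_status_line("example.com")` needs network access. Each numbered step is also a place where a failure can surface: `socket.gaierror` for DNS, a timeout or reset at connect, `ssl.SSLCertVerificationError` during the handshake.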

Question 3

Write a Bash script that monitors a log file and sends an alert (print to stdout is fine) whenever the word 'ERROR' appears more than 5 times in a 60-second window.

Accepted Answer

Use `tail -F` to follow the log in real time; `-F` keeps following the file by name, so it survives log rotation. Pipe into a `while read` loop that increments a counter for each line containing 'ERROR' and resets it every 60 seconds with a timestamp check. Avoid `grep -c` on the live stream: it prints its count only when input ends, which never happens under `tail -F`. A cleaner approach: keep a sliding window of match timestamps in an array, using `date +%s`, and drop entries older than 60 seconds before comparing the array length to the threshold. Mention that in production you'd use a real alerting tool (Alertmanager, Datadog), but the exercise shows scripting fluency.

What interviewers look for: Can the candidate write working, readable Bash without scaffolding? Do they handle edge cases (log rotation, buffering)? Bonus points for mentioning `inotifywait` or for offering to rewrite it in Python for robustness. Interviewers look for practical judgment, not just syntax recall.
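Since the answer mentions offering a Python rewrite, here is one possible sketch of the sliding-window idea. The class name `ErrorWindow` and the helper `follow` are hypothetical; the thresholds (more than 5 matches in 60 seconds) come from the question.

```python
import time
from collections import deque


class ErrorWindow:
    """Tracks timestamps of ERROR lines seen in the last `window` seconds."""

    def __init__(self, threshold: int = 5, window: float = 60.0):
        self.threshold = threshold
        self.window = window
        self.hits = deque()

    def observe(self, line: str, now: float) -> bool:
        """Record one log line; return True if the alert condition holds."""
        if "ERROR" in line:
            self.hits.append(now)
        # Evict matches that have aged out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        return len(self.hits) > self.threshold


def follow(stream, detector: ErrorWindow) -> None:
    """Feed lines from any iterable, e.g. sys.stdin fed by `tail -F`."""
    for line in stream:
        if detector.observe(line, time.monotonic()):
            print(f"ALERT: {len(detector.hits)} ERROR lines in the last "
                  f"{detector.window:.0f}s")
```

One way to wire it up, with hypothetical file names, is `tail -F app.log | python3 monitor.py`, where the script ends with `follow(sys.stdin, ErrorWindow())`. Passing the timestamp into `observe` keeps the window logic testable without sleeping.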

Question 4

Describe the stages you'd include in a CI/CD pipeline for a simple web service, and explain why each stage exists.

Accepted Answer

Stages: source checkout → dependency install → unit tests → static analysis/linting → build artifact (Docker image) → push to registry → deploy to staging → integration/smoke tests → deploy to production (with a manual gate or canary). Explain the purpose of each gate: catching bugs early is cheaper than finding them in prod. Mention rollback strategy (image tags, blue/green or canary) and why you don't deploy directly from a developer's laptop.

What interviewers look for: Interviewers want to see that you understand CI/CD as a reliability tool, not just automation for its own sake. Does the candidate reason about failure modes at each stage? Do they know what a canary deployment is and why it reduces blast radius? This separates candidates who've used pipelines from those who understand them.
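The "each stage is a gate" idea can be illustrated with a toy fail-fast runner. This is only a sketch: real pipelines are defined in a CI system's own configuration, and the stage functions below are hypothetical stand-ins that simply return success or failure.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order; stop at the first failure so later stages never run."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # a failed gate blocks everything downstream
    return log


# Hypothetical stand-ins; real stages would invoke test runners, docker, etc.
stages = [
    ("unit tests", lambda: True),
    ("lint", lambda: True),
    ("build image", lambda: True),
    ("staging smoke tests", lambda: False),  # simulated failure
    ("deploy to production", lambda: True),
]

for entry in run_pipeline(stages):
    print(entry)
```

Because the smoke tests fail, the production deploy never appears in the log, which is the point of ordering cheap checks before expensive or risky ones.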

Question 5

What is the difference between a Docker image and a container, and what problem does Kubernetes solve that Docker alone doesn't?

Accepted Answer

Image = immutable filesystem snapshot + metadata (layers, entrypoint). Container = a running process with an isolated filesystem, network namespace, and PID namespace instantiated from an image. Docker alone handles single-host container lifecycle. Kubernetes adds: scheduling across a cluster, desired-state reconciliation (if a container crashes, it is restarted or replaced), service discovery, load balancing, rolling updates, and resource quotas. Mention the control plane (API server, scheduler, etcd) vs. data plane (kubelets, pods) distinction if you know it.

What interviewers look for: New grads often conflate image and container; correctly distinguishing them is table stakes. The Kubernetes answer reveals whether the candidate has thought about *why* orchestration exists (single points of failure, manual scaling, no self-healing) rather than just knowing the brand name.
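Desired-state reconciliation is easier to explain with a tiny model. The sketch below is not Kubernetes code; `reconcile` and the replica dictionaries are hypothetical, and it only shows the compare-and-correct step that a controller would repeat in a loop.

```python
def reconcile(desired: dict[str, int], running: dict[str, int]) -> list[str]:
    """Compare desired replica counts to what is running; return corrective actions."""
    actions = []
    for app in sorted(set(desired) | set(running)):
        want = desired.get(app, 0)
        have = running.get(app, 0)
        if have < want:
            actions.append(f"start {want - have} x {app}")
        elif have > want:
            actions.append(f"stop {have - want} x {app}")
    return actions


desired = {"web": 3, "worker": 2}
running = {"web": 2, "worker": 2, "old-job": 1}  # one web replica has crashed
print(reconcile(desired, running))
```

The caller never says "restart the crashed container"; it only declares the target state, and the loop derives the actions. That is the difference from running `docker run` by hand on a single host.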

Question 6

What is an SLO, and how does it differ from an SLA? Give a concrete example of each for a hypothetical API service.

Accepted Answer

SLA (Service Level Agreement): a contractual commitment to customers with consequences (refunds, penalties). E.g., 'We guarantee 99.9% monthly uptime; breaches trigger service credits.' SLO (Service Level Objective): an internal target that drives engineering decisions, usually stricter than the SLA. E.g., 'Our internal target is 99.95% success rate on /search requests over a 28-day window.' Error budget = 1 - SLO, so a 99.95% SLO leaves a 0.05% budget for failed requests. Mention SLIs (the actual measured metrics, e.g., request success rate) as the measurement layer.

What interviewers look for: Interviewers want to see that you understand the SLI→SLO→SLA hierarchy and that SLOs are engineering tools, not just management theater. A candidate who can articulate error budgets and why they exist (to make tradeoffs between reliability and feature velocity) signals SRE-specific maturity.
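The error-budget arithmetic is worth being able to do on a whiteboard. Here is a small sketch using the two example targets from the answer; the function names and the request volume of one million are assumptions chosen for illustration, and the 30-day month is an approximation.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error budget = 1 - SLO, expressed as a fraction."""
    return 1 - slo_percent / 100


def allowed_failures(slo_percent: float, total_requests: int) -> float:
    """Requests that may fail in the window before a request-based SLO is breached."""
    return error_budget_fraction(slo_percent) * total_requests


def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full outage a time-based target tolerates over the window."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


# SLO from the example: 99.95% success over a 28-day window.
print(f"{allowed_failures(99.95, 1_000_000):.0f} failed requests per million")
print(f"{allowed_downtime_minutes(99.95, 28):.2f} minutes if measured as downtime")
# SLA from the example: 99.9% monthly uptime, approximated as 30 days.
print(f"{allowed_downtime_minutes(99.9, 30):.2f} minutes of downtime allowed")
```

The gap between the two downtime figures is the safety margin that lets the team breach the internal objective without owing customers service credits.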

Question 7

You're on call and receive an alert that p99 latency on your API has spiked from 80ms to 3 seconds. Walk me through your response.

Accepted Answer

Immediately check the scope: is it one endpoint, one region, or all traffic? Pull dashboards: look at error rates alongside latency (latency spike + low error rate → slowness; latency spike + high error rate → likely failures). Check recent deploys, config changes, or cron jobs that fired recently. Look at dependencies (database slow query logs, health of the services your API calls). Mitigate first (roll back if a recent deploy correlates), then diagnose. Communicate status to stakeholders. After resolution: write a postmortem draft.

What interviewers look for: Interviewers want to see structured thinking under pressure. The key signals: Do they separate correlation from causation? Do they check for recent changes early? Do they think about mitigation before root cause? Do they mention communication and postmortem? A new grad who structures their answer (scope → data → hypotheses → mitigate → communicate) stands out.
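It helps to be able to say what a p99 alert actually measures. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring systems often estimate percentiles from histograms instead. The latency samples are hypothetical.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 97 normal requests (80 ms) and 3 slow ones (3000 ms).
latencies = [80.0] * 97 + [3000.0] * 3

print("p50:", percentile(latencies, 50))
print("p99:", percentile(latencies, 99))
print("mean:", sum(latencies) / len(latencies))
```

In this sample only 3% of requests are slow, so the median does not move at all while p99 jumps to 3 seconds. That is why the scope question (which endpoint, which region, which fraction of traffic) comes first.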

Question 8

What is a merge conflict, how does one occur, and how do you resolve it safely in a team environment?

Accepted Answer

A merge conflict occurs when two branches modify the same lines of a file (or one deletes a file the other modifies) and Git can't auto-merge. Resolution: `git status` to identify conflicted files, open them and look for `<<<<<<<`, `=======`, `>>>>>>>` markers, choose the correct version (or combine both), mark resolved with `git add`, then `git commit`. Safe practices: communicate with the author of the conflicting branch, run tests after resolution, avoid resolving conflicts by blindly accepting 'ours' or 'theirs' without understanding both changes.

What interviewers look for: This is a hygiene check. Interviewers want to confirm you've actually used Git collaboratively and understand that resolving a conflict is a semantic operation, not just a mechanical one. Candidates who mention running tests after resolution and talking to teammates show team awareness.
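A common safety net is a check that refuses to proceed while conflict markers are still in a file. The following is a minimal sketch of such a check; `unresolved_conflicts` is a hypothetical helper, and the sample file content and branch name are made up.

```python
def unresolved_conflicts(text: str) -> list[int]:
    """Return 1-based line numbers that still contain Git conflict markers."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(("<<<<<<< ", ">>>>>>> ")) or line == "======="
    ]


# Hypothetical conflicted config file.
SAMPLE = """\
timeout = 30
<<<<<<< HEAD
retries = 3
=======
retries = 5
>>>>>>> feature/retry-policy
"""

print(unresolved_conflicts(SAMPLE))
```

A check like this catches leftover markers only. It cannot tell whether the chosen side is semantically right, which is why running the tests and talking to the other author still matter.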

Question 9

What is the purpose of Terraform state, and what problems can arise if two engineers run `terraform apply` simultaneously?

Accepted Answer

Terraform state maps your HCL configuration to real infrastructure resources. It stores IDs and metadata so Terraform knows what exists vs. what needs to change. If two engineers apply simultaneously, they can each read stale state, produce conflicting plans, and both apply, resulting in duplicate resources, overwritten changes, or corrupted state. Solution: remote state backends (S3 + DynamoDB for state locking, or Terraform Cloud) that provide a distributed lock so only one apply runs at a time. Mention that state can contain secrets and should be stored securely.

What interviewers look for: Interviewers check whether you understand *why* Terraform state exists (not just that it does) and whether you've thought about real-world collaboration problems. State locking is a concrete operational concern that distinguishes candidates who've run Terraform in teams from those who've only done solo tutorials.
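The locking idea can be shown with a toy lock built on atomic file creation. This is an analogy only, not how Terraform or its backends implement locking; the class `StateLock` and the engineer names are hypothetical.

```python
import os
import tempfile


class StateLock:
    """Advisory lock backed by atomic file creation (O_CREAT | O_EXCL)."""

    def __init__(self, path: str):
        self.path = path

    def acquire(self, owner: str) -> bool:
        try:
            # Fails if the file already exists, so only one caller can win.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # someone else holds the lock
        with os.fdopen(fd, "w") as handle:
            handle.write(owner)  # record the holder for debugging
        return True

    def release(self) -> None:
        os.remove(self.path)


with tempfile.TemporaryDirectory() as workdir:
    lock = StateLock(os.path.join(workdir, "state.lock"))
    print(lock.acquire("alice"))  # first apply takes the lock
    print(lock.acquire("bob"))    # concurrent apply is refused
    lock.release()
    print(lock.acquire("bob"))    # succeeds once the first apply is done
    lock.release()
```

The key property is that "check whether locked" and "take the lock" happen as one atomic operation. A remote backend needs the same guarantee from its storage layer, because two engineers' laptops share no local filesystem.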

Question 10

What are the three pillars of observability, and when would you use each one to debug a production issue?

Accepted Answer

Metrics: aggregated numeric measurements over time (CPU %, request rate, error rate). Use them for alerting and spotting trends. Logs: structured or unstructured text records of discrete events. Use them to understand *what* happened on a specific request or in a specific component. Traces: records of a request's path across multiple services, with timing at each hop. Use them when metrics say something is slow but you can't tell *where* in a distributed system the latency originates. Tools: Prometheus/Grafana (metrics), ELK/Loki (logs), Jaeger/Zipkin (traces).

What interviewers look for: Can the candidate articulate not just what each pillar is but *when* to reach for it? New grads who only know metrics haven't thought about distributed systems. Knowing that traces exist specifically to solve the distributed latency mystery problem shows SRE-relevant depth.

Question 11

Tell me about a time you debugged a technical problem that took you significantly longer than expected. What made it hard, and what did you learn?

Accepted Answer

Use STAR (Situation, Task, Action, Result). Be specific: name the system, the symptom, the misleading signals that sent you down wrong paths, and the actual root cause. Don't invent a story; use a real project, internship, or course assignment. The 'learn' section is critical: show that you extracted a process improvement (e.g., 'I now always check logs before assumptions,' or 'I learned to timebox hypothesis testing').

What interviewers look for: Interviewers are evaluating intellectual honesty, persistence, and learning velocity, all more important than raw knowledge at the new grad level. A candidate who admits they went down wrong paths and explains why shows self-awareness. Candidates who give vague answers or claim they solved it quickly raise flags.

Question 12

How would you design a basic health-check monitoring system that alerts an on-call engineer when a web service goes down?

Accepted Answer

Components: a probe service that sends HTTP GET requests to the target URL every 30 seconds and records success/failure + response time. A time-series store (or simple DB) for results. An alerting layer that fires if N consecutive checks fail (to avoid flapping alerts on transient failures). A notification path: PagerDuty or email. Discuss tradeoffs: probe from multiple regions to distinguish regional outages from global ones; avoid single-point-of-failure in the probe itself; alert thresholds to reduce false positives. Mention that this is essentially what Blackbox Exporter + Alertmanager does in the Prometheus ecosystem.

What interviewers look for: Interviewers aren't expecting a full distributed systems design. They want to see whether you can decompose a real operational problem into components, identify failure modes in your own design (what if the probe crashes?), and connect your design to real tools. Structured thinking and awareness of false-positive alerting signal SRE instincts.
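The "N consecutive failures" rule is the piece most worth sketching, since it is what keeps a single dropped probe from paging someone. Below is one possible shape for it; the class name `ConsecutiveFailureAlert` and the default threshold of 3 are assumptions, and the probe results are hard-coded where a real system would use HTTP checks.

```python
from typing import Optional


class ConsecutiveFailureAlert:
    """Fires after `threshold` failed probes in a row; resolves on the next success."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.firing = False

    def record(self, healthy: bool) -> Optional[str]:
        """Feed one probe result; return "ALERT" or "RESOLVED" on state changes."""
        if healthy:
            self.failures = 0
            if self.firing:
                self.firing = False
                return "RESOLVED"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.firing:
            self.firing = True
            return "ALERT"  # notify once, not on every further failure
        return None


# Hypothetical probe history: one transient blip, then a real outage.
history = [True, False, True, False, False, False, False, True]
alert = ConsecutiveFailureAlert(threshold=3)
print([alert.record(result) for result in history])
```

The single failure at the second probe never pages anyone, while the run of failures pages exactly once and then resolves. In a fuller design, `healthy` would come from an HTTP GET with a timeout, and the returned events would be handed to the notification path.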

New Grad DevOps / SRE Engineer Interview Questions

What to expect

12 questions, with how to answer them

Study tips