Question 1

Design a rate-limiting service that can be used by hundreds of microservices across a distributed system. It needs to support per-user, per-endpoint, and per-tenant limits.

Accepted Answer

Start by clarifying requirements: exact vs. approximate counting, latency SLA, burst tolerance. Walk through the algorithm options: fixed window (simple, but allows a burst of up to twice the limit across a window boundary), sliding window log (accurate, memory heavy), sliding window counter (good middle ground), token bucket (good for bursts), leaky bucket (smooth output). For distributed enforcement, compare centralized Redis using atomic Lua scripts (or INCR followed by EXPIRE, which is not atomic unless wrapped in a script or a MULTI/EXEC transaction) with local-first counting and periodic sync for eventual consistency. Address the thundering herd on limit resets, sticky routing as a partial solution, and how you'd surface limit headers back to callers. Mention how configuration (limit values per tenant/endpoint) gets propagated: config service, feature flags, or database.

What interviewers look for: They want to see that you understand the gap between single-node and distributed rate limiting, can name concrete algorithms with their real tradeoffs, and know when approximate counting is acceptable. Red flag: jumping to a solution without asking about consistency vs. latency tradeoffs, or not knowing what 'sliding window counter' means.
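Since the sliding window counter is the algorithm candidates most often cannot explain, here is a minimal single-node sketch of it. The class name, the injectable clock, and the idea of building one string key from tenant, user, and endpoint are assumptions made for illustration; a distributed version would keep the same two counters per key in a shared store such as Redis.

```python
import time
from typing import Callable, Dict, Tuple


class SlidingWindowCounter:
    """Approximate sliding-window limiter (single node, not thread-safe).

    Keeps one counter for the current fixed window and one for the previous
    window, and weights the previous counter by how much of that window still
    overlaps the sliding window.
    """

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # key -> (window index, count in that window, count in previous window)
        self.state: Dict[str, Tuple[int, int, int]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        index = int(now // self.window)
        cur_index, cur, prev = self.state.get(key, (index, 0, 0))
        if index == cur_index + 1:
            prev, cur = cur, 0      # rolled over into the next window
        elif index > cur_index + 1:
            prev, cur = 0, 0        # idle for more than a full window
        # Fraction of the current fixed window that has already elapsed.
        elapsed = (now % self.window) / self.window
        estimated = prev * (1.0 - elapsed) + cur
        if estimated >= self.limit:
            self.state[key] = (index, cur, prev)
            return False
        self.state[key] = (index, cur + 1, prev)
        return True
```

A caller might combine the three limit dimensions into keys such as "tenant:acme", "user:42", and "endpoint:/orders" (hypothetical formats) and require allow() to pass for each of them.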

Question 2

Design the backend for a real-time collaborative document editing system (like Google Docs). Focus on the conflict resolution and persistence layer.

Accepted Answer

Introduce Operational Transformation (OT) and CRDTs as the two dominant approaches: OT typically relies on a central server to order operations, while CRDTs allow peer-to-peer merge but are harder to implement for rich text. Sketch the WebSocket fanout layer (pub/sub per document session), the operation log (append-only, used for replay and catch-up on reconnect), and snapshotting for fast load. Discuss how you'd persist: operation log in Postgres or DynamoDB, snapshots in object storage. Cover presence (lightweight heartbeat + Redis TTL), version vectors, and how you'd handle a client that was offline for 10 minutes reconnecting with 50 pending ops. Mention that the server's merge point needs a single, linearizable order of operations per document.

What interviewers look for: Awareness of CRDTs vs. OT (not necessarily deep implementation), real-time fanout architecture, and durability guarantees. Strong signal: unprompted discussion of the reconnect/catch-up scenario. Weak signal: only describing the happy-path WebSocket flow.
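To make the OT idea concrete, here is a minimal sketch of the transform step for two concurrent inserts into a plain-text document. The Insert type, the site tie-breaker, and both function names are assumptions made for illustration; a real system must also handle deletes, rich-text attributes, and transforming a pending operation against a whole history of server operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int    # character index at which the text is inserted
    text: str
    site: int   # hypothetical client id, used only to break position ties


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` was already applied."""
    inserted_before = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if inserted_before:
        # `against` added text before our position, so shift right.
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


def apply_insert(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]
```

Both application orders converge: applying a and then transform(b, a) yields the same document as applying b and then transform(a, b), which is the property the server relies on when it replays a reconnecting client's pending operations.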

Question 3

Given a stream of log lines where each line is 'timestamp user_id action', implement a function that returns, at any point in time, the top K most active users in the last 5 minutes. Optimize for frequent reads.

Accepted Answer

Use a sliding window over timestamps. Maintain a deque of (timestamp, user_id) entries in arrival order and a HashMap of user_id → count. On each insert, append to the deque and increment the user's count, both O(1). On each query, evict entries older than now - 5min from the front of the deque, decrement their counts, and scan the counts with a min-heap of size K to return the top K. Discuss time complexity: eviction is amortized O(1) per event, and the top-K scan is O(U log K) for U active users. If reads are far more frequent than writes, maintain a sorted structure (for example, an ordered set keyed by count) that updates incrementally on each insert and eviction, so a read only walks the first K entries. Consider thread safety if concurrent access is needed: discuss read-write locks or lock-free structures. Edge case: the same user_id appearing multiple times in the window.

What interviewers look for: This tests the ability to combine multiple data structures for a compound problem: HashMap + heap + deque. The interviewer checks: do you evict lazily or eagerly? Do you handle the case where a user's count drops to 0? Do you discuss time vs. space tradeoffs? Clean, working code is expected; perfect optimization is secondary.
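The deque + HashMap + bounded-heap combination described above can be sketched as follows. This is one possible shape, assuming integer epoch-second timestamps that arrive in non-decreasing order and lazy eviction at query time; the class and method names are illustrative, and ties are broken by user_id only to keep the output deterministic.

```python
import heapq
from collections import Counter, deque
from typing import Deque, List, Tuple


class TopKActiveUsers:
    def __init__(self, window_seconds: int = 300) -> None:
        self.window = window_seconds
        self.events: Deque[Tuple[int, str]] = deque()  # (timestamp, user_id)
        self.counts: Counter = Counter()               # user_id -> events in window

    def record(self, line: str) -> None:
        """Parse one 'timestamp user_id action' line and count it."""
        timestamp, user_id, _action = line.split(maxsplit=2)
        self.events.append((int(timestamp), user_id))
        self.counts[user_id] += 1

    def _evict(self, now: int) -> None:
        cutoff = now - self.window
        # Events are in time order, so expired ones are always at the front.
        while self.events and self.events[0][0] < cutoff:
            _, user_id = self.events.popleft()
            self.counts[user_id] -= 1
            if self.counts[user_id] == 0:
                del self.counts[user_id]  # do not keep zero-count users around

    def top_k(self, k: int, now: int) -> List[Tuple[str, int]]:
        self._evict(now)
        # heapq.nlargest keeps a heap of at most k items: O(U log k).
        return heapq.nlargest(
            k, self.counts.items(), key=lambda item: (item[1], item[0])
        )
```

Evicting lazily keeps writes at O(1); if reads dominate even more heavily, the scan in top_k is the part to replace with an incrementally maintained sorted structure.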

Question 4

Implement a thread-safe in-memory cache with a TTL per key and LRU eviction when capacity is exceeded. Walk through your locking strategy.

Accepted Answer

Core structure: doubly linked list (for O(1) LRU eviction) + HashMap (for O(1) lookup). Each node stores key, value, and expiry timestamp. On get: check expiry first; if expired, remove the entry and return null; otherwise move it to the head. On put: if the key exists, update it in place and move it to the head; if at capacity, evict the tail; then insert at the head. For TTL, either check lazily on access or run a background reaper thread (discuss tradeoffs: lazy checks are simpler but let expired entries occupy capacity until they are touched or evicted). Locking: a single mutex, or the write lock of a ReentrantReadWriteLock, around every operation works for most cases. A read lock is not enough for get, because get mutates the recency order and may remove an expired entry. For high concurrency, discuss striped locking or ConcurrentHashMap + ConcurrentLinkedDeque with atomic operations, noting that removing an arbitrary element from a ConcurrentLinkedDeque is O(n). Mention that Java's LinkedHashMap with accessOrder=true handles the LRU ordering internally (override removeEldestEntry to evict), but it is not thread-safe and still needs external synchronization.

What interviewers look for: The senior signal is in the locking discussion, not just the data structure. Can you articulate why a read lock is insufficient for get() since it mutates the list? Do you know about striped locking? Do you handle the expired-but-not-yet-evicted state correctly? Weak answer: just describing LRU without addressing concurrency.
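A minimal sketch of the single-lock variant, written in Python rather than Java: OrderedDict stands in for the HashMap plus doubly linked list, and one threading.Lock plays the role of the exclusive lock taken by both get and put. The class name and the injectable clock are assumptions for illustration; expiry is checked lazily on access, with no background reaper.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    def __init__(self, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._clock = clock
        self._lock = threading.Lock()
        # key -> (value, expires_at); least recently used entry comes first.
        self._data: OrderedDict = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        # get() reorders entries and may delete one, so it needs the
        # exclusive lock; a shared read lock would not be safe here.
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if self._clock() >= expires_at:
                del self._data[key]  # expired but not yet evicted
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return value

    def put(self, key: Hashable, value: Any, ttl_seconds: float) -> None:
        with self._lock:
            if key not in self._data and len(self._data) >= self._capacity:
                self._data.popitem(last=False)  # evict least recently used
            self._data[key] = (value, self._clock() + ttl_seconds)
            self._data.move_to_end(key)
```

One design choice worth raising in the interview: this sketch evicts the least recently used entry even when some other entry has already expired; scanning for expired entries first would use capacity better at the cost of O(n) puts.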

Question 5

Your Postgres table with 500M rows is experiencing slow queries. Walk me through your complete diagnostic and remediation process.

Accepted Answer

Start with EXPLAIN ANALYZE: identify sequential scans, large gaps between estimated and actual row counts (stale statistics → ANALYZE), and costly nested loops. Check pg_stat_user_tables for dead tuples (n_dead_tup) as a sign of table bloat (→ VACUUM). Examine pg_stat_activity and pg_locks for lock contention. Remediation layers: (1) indexes: B-tree for equality/range, partial indexes for filtered queries, covering indexes to avoid heap fetches; (2) query rewrite: avoid SELECT *, push predicates down, avoid functions on indexed columns; (3) partitioning: range partition by date for time-series data, which enables partition pruning; (4) connection pooling via PgBouncer if connection count is the issue; (5) read replicas for read-heavy workloads; (6) archiving cold data. Discuss when to consider a different storage engine entirely (columnar for analytics).

What interviewers look for: Interviewers want a systematic approach, not a laundry list of tips. Do you start with diagnosis before solutions? Do you know what EXPLAIN ANALYZE actually outputs and what to look for? Strong signal: mentioning stale statistics or bloat, which most candidates skip. Red flag: jumping straight to 'add an index' without diagnosing.

Question 6

Explain how you would model a multi-tenant SaaS application's database. What are the tradeoffs between shared schema, shared database, and isolated database per tenant?

Accepted Answer

Three models: (1) Shared schema with a tenant_id column: lowest cost and simplest ops, but by default tenant data isolation is enforced only in application logic; it also has a noisy neighbor problem and makes tenant-specific customizations hard. (2) Separate schema per tenant (same DB): schema-level isolation and per-tenant pg_dump are possible, but schema migrations become complex at scale (1000 schemas = 1000 ALTER TABLE runs). (3) Separate database per tenant: strongest isolation, easiest fit for regulated industries (HIPAA, SOC 2), and enables per-tenant performance tuning; it is operationally expensive, and connection pooling is harder. Selection criteria: regulatory requirements, expected tenant count, need for customization, ops maturity. Hybrid is common: small tenants share, enterprise tenants get isolation. Discuss row-level security in Postgres as a mechanism to enforce tenant_id isolation in the DB layer for the shared-schema model.

What interviewers look for: They expect you to name all three models without prompting and to discuss the operational dimension, not just the technical one. Strong signal: mentioning row-level security as a DB-layer enforcement mechanism and discussing migration complexity for the separate-schema model. Weak answer: only describing shared vs. isolated without mentioning the hybrid approach or regulatory drivers.

Question 7

Explain how Kafka guarantees message ordering and exactly-once delivery. Where does it break down, and how do you design around it?

Accepted Answer

Ordering: Kafka guarantees ordering within a partition, not across partitions. Partition assignment by key (e.g., user_id) ensures related events are ordered. Breakdown: if you need cross-entity ordering (e.g., order events for user A and payment events), you need either a single-partition topic (a throughput bottleneck) or application-level sequencing. Exactly-once: Kafka's EOS (exactly-once semantics) combines idempotent producers (enable.idempotence=true, which deduplicates retries using sequence numbers), transactional producers (a transactional.id plus beginTransaction/commitTransaction), and read_committed isolation on consumers. These guarantees cover reads and writes inside Kafka (consume, process, produce); they do not make a write to your downstream DB idempotent, so end-to-end EOS also requires idempotent consumers. Practical design: use idempotency keys in your message payload, make consumer processing idempotent regardless of Kafka's guarantees, and store the consumer offset and business state in the same transaction (a database transaction that holds both, or a transactional offset commit for Kafka-to-Kafka pipelines). Use the outbox pattern when a service must update its database and publish an event atomically.

What interviewers look for: Many candidates know Kafka basics but fail on exactly-once nuance. The interviewer looks for an understanding that EOS is broker-side and doesn't cover the consumer's side effects, and for knowledge of the outbox pattern or transactional consumers as the real solution. Red flag: saying 'just enable exactly-once in the config' without addressing consumer idempotency.
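The advice to keep the consumer offset and the business state in one transaction can be sketched without a Kafka client at all. The example below uses SQLite as a stand-in for the downstream database; the table names, the apply_message function, and its parameters are hypothetical, and a real consumer would call it from its poll loop with auto-commit disabled and seek to the stored offset on startup.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS balances (
            account TEXT PRIMARY KEY,
            amount  INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS consumer_offsets (
            topic       TEXT    NOT NULL,
            part        INTEGER NOT NULL,
            next_offset INTEGER NOT NULL,
            PRIMARY KEY (topic, part)
        );
    """)


def apply_message(conn: sqlite3.Connection, topic: str, part: int,
                  offset: int, account: str, delta: int) -> bool:
    """Apply one message; return False if it was already applied (redelivery)."""
    with conn:  # one DB transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT next_offset FROM consumer_offsets WHERE topic = ? AND part = ?",
            (topic, part),
        ).fetchone()
        if row is not None and offset < row[0]:
            return False  # duplicate delivery: skip the side effect
        conn.execute(
            "INSERT OR IGNORE INTO balances (account, amount) VALUES (?, 0)",
            (account,),
        )
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (delta, account),
        )
        # The offset advances in the same transaction as the business write.
        conn.execute(
            "INSERT OR REPLACE INTO consumer_offsets (topic, part, next_offset) "
            "VALUES (?, ?, ?)",
            (topic, part, offset + 1),
        )
    return True
```

Because the offset and the balance change commit together, a crash leaves either both or neither applied, and a redelivered message is detected by its offset.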

Question 8

How does a distributed transaction work in a microservices architecture? Compare 2PC, Saga, and outbox pattern with honest tradeoffs.

Accepted Answer

2PC: the coordinator sends prepare to all participants, waits for votes, then commits or aborts. It guarantees atomicity but blocks if the coordinator fails, and participants must hold locks from the prepare phase until the final commit or abort, which means poor availability and high latency. Saga: decompose the transaction into local transactions with compensating transactions for rollback. Two flavors: choreography (events trigger the next step, hard to debug) and orchestration (central orchestrator, single point of failure but easier to reason about). No distributed locks, eventually consistent. Failure case: compensation logic must be idempotent and must handle partial failures. Outbox pattern: write to a local 'outbox' table in the same DB transaction as your business data, then a separate relay process publishes to the message broker. This solves the dual-write problem. Combine with Saga for end-to-end distributed transactions. Be honest: there is no free lunch. Saga gives up isolation and all-or-nothing atomicity in exchange for availability.

What interviewers look for: Candid tradeoff analysis, not advocacy for one pattern. Strong signal: knowing the choreography vs. orchestration distinction within Saga, and recognizing that the outbox pattern solves dual-write specifically. Weak answer: describing Saga without mentioning compensating transactions or the failure/rollback complexity.
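The outbox pattern described above can be sketched with SQLite standing in for the service's database and a plain callable standing in for the broker client. The table layout, the topic name "orders.created", and the function names are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
from typing import Callable


def init_outbox_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            id          TEXT PRIMARY KEY,
            total_cents INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            topic     TEXT    NOT NULL,
            payload   TEXT    NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    event = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:  # business row and outbox row commit or roll back together
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.created", event),
        )


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, str], None]) -> int:
    """Publish pending outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # a crash right after this line causes a re-send
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The relay publishes before marking a row as sent, so delivery is at-least-once; downstream consumers still need to be idempotent.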

Question 9

Tell me about a time you identified a systemic reliability problem before it caused an outage. How did you build the case to fix it?

Accepted Answer

Use STAR but go beyond the basic format. Describe the signal you noticed (anomaly in metrics, a pattern in incidents, code review concern). Explain your diagnostic process: what data you gathered, what hypothesis you formed. Then focus on the 'build the case' part: how you quantified risk (error budget math, cost of downtime, replication to staging), how you communicated to non-engineers, and how you prioritized it against feature work. Address resistance: what objections you got and how you countered them. Result: what specifically changed (metric improvement, incident reduction, architectural change).

What interviewers look for: Interviewers are evaluating proactive ownership, not just reactive firefighting. They want to see that you can translate technical risk into business language and that you navigated organizational friction. Red flag: the story where you spotted the bug, fixed it, done, with no evidence of influencing others or systemic thinking.

Question 10

Describe a significant technical decision you made that you later regretted. What would you do differently?

Accepted Answer

Pick a real decision with real consequences, not a minor bug. Good examples: technology choice (chose a message queue that couldn't handle the load), over-engineering (built a distributed system when a monolith would have sufficed), under-investing in observability early. Demonstrate that you understand *why* it was wrong: not just in hindsight, but what signals you missed or ignored at the time. Then articulate specifically what you'd do differently: what process, what criteria, what conversations you'd have had earlier. The regret should be calibrated: you shouldn't be defensive, but you also shouldn't flagellate yourself; show you've internalized the lesson.

What interviewers look for: This question is a character check. Interviewers want intellectual honesty and learning agility. Red flags: can't think of any regrets (lack of self-awareness), blames others for the decision, or the 'regret' is clearly a humble-brag ('I worked too hard'). Strong signal: specific technical details in the failure, clear counterfactual thinking, and evidence the lesson changed your behavior.

Question 11

Explain how you would diagnose and fix a service experiencing high tail latency (p99 is 10x the median) under production load.

Accepted Answer

Tail latency is often caused by: (1) lock contention: a slow request holds a mutex, queuing others; (2) GC pauses (JVM): check GC logs, tune heap, consider G1 or ZGC; (3) thread pool exhaustion: requests queue behind a slow downstream; (4) noisy neighbor on shared infrastructure; (5) database query variance: a query that's fast for most users is slow for users with large datasets. Diagnostic approach: use percentile-breakdown metrics (p50/p95/p99/p999 separately), distributed tracing to identify which service/span is slow, thread dump analysis for contention, and flame graphs for CPU hotspots. Fix strategies: async processing to avoid blocking thread pools, circuit breakers to fail fast on slow dependencies, hedged requests (send a duplicate to a second replica after a threshold delay), connection pool tuning, and adding DB indexes for the outlier query pattern.

What interviewers look for: Senior engineers are expected to go beyond 'profile the code.' The interviewer looks for awareness of GC and thread pools as sources, knowledge of hedged requests as a tail-latency-specific technique, and use of distributed tracing rather than just server-side metrics. Red flag: only mentioning caching as a solution.
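Hedged requests are the least familiar of the fixes listed above, so here is a minimal asyncio sketch. The hedged_call function and its parameters are illustrative assumptions: it takes two zero-argument coroutine functions (primary and backup replica) and a hedge delay, which in practice is often set near the observed p95 latency. It assumes the request is idempotent and does not handle the case where the first finisher raised an error.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def hedged_call(replicas: Sequence[Callable[[], Awaitable[T]]],
                      hedge_delay: float) -> T:
    """Call replicas[0]; if it has not answered within hedge_delay seconds,
    also call replicas[1] and return whichever finishes first."""
    primary = asyncio.ensure_future(replicas[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()  # fast path: no duplicate request was sent

    backup = asyncio.ensure_future(replicas[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the slower request so it does not waste capacity
    return done.pop().result()
```

Because the duplicate is only sent for requests that are already slow, the extra load stays small (roughly the fraction of requests that exceed the hedge delay) while the slowest responses are cut short.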

Question 12

Design a secure, versioned public API for a platform that will be used by thousands of external developers. What are the key decisions and how do you handle breaking changes?

Accepted Answer

Authentication: OAuth 2.0 with client credentials for server-to-server, and the authorization code flow with PKCE for user-delegated access; never API keys alone for sensitive operations. Authorization: scope-based (read:orders, write:orders), not role-based at the API level. Versioning strategy: URI versioning (/v1/, /v2/) is the most explicit and cacheable; header versioning is cleaner but harder to test; never break without a deprecation period. Breaking vs. non-breaking changes: adding optional fields is non-breaking; removing fields, changing types, and changing semantics are breaking. Deprecation process: announce in the changelog, add a Deprecation header in responses, provide a migration guide, and sunset after 12–18 months minimum. Rate limiting with tiered quotas. Input validation: treat all external input as adversarial; validate types, ranges, and lengths; use allowlists, not denylists. Pagination: cursor-based for large datasets (stable under insertions), not offset-based. Document with an OpenAPI spec and generate SDKs.

What interviewers look for: Completeness and prioritization: not just listing features but knowing which decisions are the hard ones. Strong signal: distinguishing breaking vs. non-breaking changes specifically, and knowing cursor-based pagination is preferred over offset for external APIs. Weak answer: only discussing authentication without touching versioning or the deprecation lifecycle.
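To show why cursor-based pagination stays stable under insertions, here is a minimal in-memory sketch. The function names, the "after_id" cursor field, and the assumption of positive, monotonically increasing integer ids are all illustrative; a real endpoint would run the equivalent keyset query against the database and would typically sign or encrypt the cursor.

```python
import base64
import json
from typing import Dict, List, Optional, Tuple


def encode_cursor(last_id: int) -> str:
    """Opaque cursor: clients pass it back unchanged and never parse it."""
    raw = json.dumps({"after_id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    return int(json.loads(base64.urlsafe_b64decode(cursor.encode()))["after_id"])


def list_page(rows: List[Dict], limit: int,
              cursor: Optional[str] = None) -> Tuple[List[Dict], Optional[str]]:
    """Return (page, next_cursor); next_cursor is None on the last page."""
    after_id = decode_cursor(cursor) if cursor else 0
    # Equivalent keyset SQL: WHERE id > :after_id ORDER BY id LIMIT :limit + 1
    matching = sorted(
        (row for row in rows if row["id"] > after_id), key=lambda row: row["id"]
    )
    page = matching[:limit]
    has_more = len(matching) > limit
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return page, next_cursor
```

With offset pagination, a row inserted between two requests shifts every later row by one position and can cause a duplicate or a skip; here the cursor pins the position to the last id the client has seen.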

Senior Backend Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Design a rate-limiting service that can be used by hundreds of microservices across a distributed system. It needs to support per-user, per-endpoint, and per-tenant limits.

2. Design the backend for a real-time collaborative document editing system (like Google Docs). Focus on the conflict resolution and persistence layer.

3. Given a stream of log lines where each line is 'timestamp user_id action', implement a function that returns, at any point in time, the top K most active users in the last 5 minutes. Optimize for frequent reads.

4. Implement a thread-safe in-memory cache with a TTL per key and LRU eviction when capacity is exceeded. Walk through your locking strategy.

5. Your Postgres table with 500M rows is experiencing slow queries. Walk me through your complete diagnostic and remediation process.

6. Explain how you would model a multi-tenant SaaS application's database. What are the tradeoffs between shared schema, shared database, and isolated database per tenant?

7. Explain how Kafka guarantees message ordering and exactly-once delivery. Where does it break down, and how do you design around it?

8. How does a distributed transaction work in a microservices architecture? Compare 2PC, Saga, and outbox pattern with honest tradeoffs.

9. Tell me about a time you identified a systemic reliability problem before it caused an outage. How did you build the case to fix it?

10. Describe a significant technical decision you made that you later regretted. What would you do differently?

11. Explain how you would diagnose and fix a service experiencing high tail latency (p99 is 10x the median) under production load.

12. Design a secure, versioned public API for a platform that will be used by thousands of external developers. What are the key decisions and how do you handle breaking changes?

Study tips