Question 1

Design a real-time content ranking system for a social feed serving 50M daily active users. Walk through the full architecture.

Accepted Answer

Start by clarifying objectives and constraints: latency budget (p99 < 100ms?), freshness requirements, feedback signals available. Then structure your answer in layers: (1) candidate generation: ANN retrieval over user/item embeddings using FAISS or ScaNN; (2) ranking: a two-stage funnel (lightweight scorer → full feature model); (3) feature store: online (Redis/Feast) vs. offline (Hive/BigQuery) split; (4) training pipeline: log collection, label construction, training cadence; (5) serving: model versioning, shadow deployment, A/B routing. Name specific tradeoffs: pointwise vs. pairwise vs. listwise loss, explore-exploit balance, position bias correction.

What interviewers look for: Can you decompose a vague product problem into concrete ML subproblems? Do you understand the gap between offline metrics and online metrics? Do you proactively surface failure modes (training-serving skew, feedback loops, cold start)?
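The retrieval-then-ranking funnel in the answer above can be sketched in a few lines. This is a minimal illustration, not a production design: the brute-force dot product stands in for ANN retrieval (FAISS or ScaNN would replace it), and `retrieve_candidates`, `rank_feed`, `light_scorer`, `full_scorer`, and the stage sizes are all hypothetical names and values chosen for the example.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidates(user_vec, item_vecs, k):
    """Brute-force stand-in for ANN retrieval over item embeddings.

    item_vecs maps item_id -> embedding. A real system would query an
    ANN index here instead of scanning every item.
    """
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


def rank_feed(user_vec, item_vecs, light_scorer, full_scorer,
              n_candidates=500, n_prerank=50, n_final=10):
    """Three-step funnel: retrieval, cheap pre-ranking, expensive ranking.

    light_scorer and full_scorer are callables item_id -> score. The
    expensive scorer only ever sees n_prerank items, which is what keeps
    the latency budget under control.
    """
    candidates = retrieve_candidates(user_vec, item_vecs, n_candidates)
    shortlist = heapq.nlargest(n_prerank, candidates, key=light_scorer)
    return heapq.nlargest(n_final, shortlist, key=full_scorer)
```

The point to draw out in an interview is the cost asymmetry: each stage is more expensive per item than the previous one, so each stage must shrink the set it passes on.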

Question 2

How would you design a near-duplicate detection system that scales to 10 billion documents?

Accepted Answer

Frame it as a locality-sensitive hashing (LSH) problem, with MinHash as the LSH family for Jaccard similarity over shingle sets. Walk through: (1) feature representation choices: shingling for text, perceptual hashing for images; (2) LSH banding to control the precision/recall tradeoff: with b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b; (3) approximate vs. exact matching thresholds; (4) scaling strategy: batch vs. streaming deduplication, and how to handle incremental updates without re-hashing the entire corpus; (5) evaluation: constructing a ground-truth dataset, measuring recall at scale. Mention SimHash as an alternative (compact bit fingerprints compared by Hamming distance, approximating cosine similarity) and when you'd prefer one over the other.

What interviewers look for: Depth on approximate algorithms and their mathematical guarantees. Whether you think about scale before correctness: at 10B documents, naive pairwise comparison is O(n²), and proposing it is disqualifying. Practical awareness of evaluation challenges.
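To make the shingling, MinHash, and banding steps concrete, here is a small single-machine sketch. The signature length, band count, shingle size, and the use of salted BLAKE2b as the hash family are assumptions made for the example; a 10B-document system would shard the band buckets across machines rather than hold them in one dictionary.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64            # signature length (assumed for this sketch)
BANDS = 16                 # 16 bands x 4 rows (assumed for this sketch)
ROWS = NUM_HASHES // BANDS


def shingles(text, k=5):
    """Character k-shingles of a lowercased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    """One member of the hash family, selected by using the seed as a salt."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set):
    """Signature = minimum hash value under each of NUM_HASHES hash functions."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures):
    """Documents sharing every row of at least one band become candidates.

    signatures maps doc_id -> MinHash signature.
    """
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Only the candidate pairs need an exact similarity check, which is what removes the quadratic comparison. Raising the number of rows per band makes the filter stricter (fewer false candidates, more misses), and raising the number of bands does the opposite.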

Question 3

Implement gradient descent with momentum from scratch in Python, then explain how Adam differs and when you'd prefer one over the other.

Accepted Answer

Write clean numpy implementations of SGD+momentum (velocity update: v = β*v - lr*grad; param += v) and Adam (first/second moment estimates with bias correction). Explain the intuition: momentum smooths gradient direction; Adam additionally adapts the learning rate per parameter using the second moment. Discuss when Adam can generalize worse than SGD (Wilson et al. 2017 report this for adaptive methods; convergence to sharper minima is one hypothesized explanation). Mention practical tuning: Adam is more forgiving of LR choice, while SGD+momentum often yields better final accuracy with proper LR scheduling.

What interviewers look for: Clean, correct implementation without referencing docs. Genuine understanding of the math, not just API usage. Ability to reason about optimizer behavior in practice: this distinguishes engineers who've debugged training runs from those who only call fit().
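A minimal numpy sketch of the two update rules named above. The function names and default hyperparameters are illustrative choices; the momentum step follows the v = β*v - lr*grad form quoted in the answer, and the Adam step uses the standard bias-corrected moment estimates.

```python
import numpy as np


def sgd_momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    """One SGD+momentum update: v = beta * v - lr * grad; param += v."""
    velocity = beta * velocity - lr * grad
    return param + velocity, velocity


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # correct the bias from zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

A useful detail to state out loud: on the first step the bias-corrected ratio reduces to grad / |grad|, so every parameter moves by roughly lr regardless of gradient scale. That is the per-parameter adaptation in its simplest form, and it is also why Adam tolerates a wider range of learning rates than plain momentum.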

Question 4

Given a large dataset that doesn't fit in memory, write a Python solution to compute exact quantiles and explain the tradeoff with approximate methods like t-digest or GK summaries.

Accepted Answer

Implement exact quantile computation with an external sort; reservoir sampling is an option only when an approximate answer is acceptable. Then discuss: exact quantiles require O(n) memory, or an external sort costing O(n log n) comparisons plus multiple passes over disk; sketch-based methods trade exactness for small space. Greenwald-Khanna gives ε-approximate ranks in O((1/ε) log(εn)) space; t-digest is compact and accurate in the tails but comes with weaker formal guarantees. Explain merge-friendliness as critical for distributed settings (Spark, Flink): t-digests merge easily, whereas classic GK was designed for a single stream and needs extra care to merge. Mention use cases where approximation error is acceptable (monitoring dashboards) vs. not (financial SLAs).

What interviewers look for: Awareness that 'load into pandas' is not always an answer. Understanding of space-time tradeoffs in streaming/distributed computation. Ability to choose and justify the right tool for the constraints given.
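One way to sketch the external-sort approach with only the standard library: sort fixed-size chunks into temporary run files, then stream a k-way merge until the target rank. The function names, the text-file run format, the chunk size, and the nearest-rank quantile definition are assumptions made for this example.

```python
import heapq
import itertools
import math
import os
import tempfile


def _write_sorted_run(values, directory):
    """Sort one in-memory chunk and write it to disk, one value per line."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{v!r}\n" for v in sorted(float(x) for x in values))
    return path


def _read_run(path):
    """Stream a sorted run back from disk without loading it."""
    with open(path) as handle:
        for line in handle:
            yield float(line)


def exact_quantile(stream, q, chunk_size=100_000):
    """Exact q-quantile (nearest-rank) holding at most chunk_size values in memory."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    with tempfile.TemporaryDirectory() as tmp:
        runs, total = [], 0
        iterator = iter(stream)
        while True:
            chunk = list(itertools.islice(iterator, chunk_size))
            if not chunk:
                break
            total += len(chunk)
            runs.append(_write_sorted_run(chunk, tmp))
        if total == 0:
            raise ValueError("empty stream")
        rank = max(1, math.ceil(q * total))   # 1-based nearest rank
        readers = [_read_run(path) for path in runs]
        try:
            # k-way merge yields values in global sorted order, lazily.
            merged = heapq.merge(*readers)
            return next(itertools.islice(merged, rank - 1, None))
        finally:
            for reader in readers:
                reader.close()                # release file handles before cleanup
```

This sketch opens one file per run, so a very large input with a small chunk size would need a multi-level merge to stay under the open-file limit. That limitation is worth naming when contrasting with a sketch-based method.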

Question 5

You train a model that achieves 95% accuracy on your validation set but performs at 70% in production. Walk through your diagnosis and remediation process.

Accepted Answer

Structure diagnostically: (1) distribution shift: covariate shift (P(X) changes), label shift (P(Y) changes), concept drift (P(Y|X) changes). Detect it with MMD or per-feature KS tests on training vs. production data, and by monitoring prediction confidence over time. (2) Data leakage in the training pipeline: check temporal splits and join keys. (3) Training-serving skew: feature computation differences between training and inference paths. (4) Label quality in production: are the ground truth labels correct? (5) Remediation: importance weighting for covariate shift, online learning or retraining triggers for drift, shadow scoring to catch skew early.

What interviewers look for: Structured debugging mindset. Explicit naming of the types of shift and their different remediation strategies. Whether you distinguish between a data problem and a modeling problem. Experience signals come from the specificity of your examples.
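The per-feature KS test mentioned above can be sketched without external libraries. The function names are hypothetical, and the rejection rule uses the standard large-sample critical value, so treat it as an approximation that is unreliable for small samples.

```python
from bisect import bisect_right
from math import log, sqrt


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both CDFs are step functions that only change at observed values,
    # so checking every observed value is enough to find the maximum gap.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def ks_rejects(reference, live, alpha=0.05):
    """True if the samples differ at level alpha (asymptotic approximation)."""
    n, m = len(reference), len(live)
    critical = sqrt(-log(alpha / 2) / 2) * sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > critical
```

Run per feature between a training snapshot and a window of production traffic. With many features, some will cross the threshold by chance, so a correction for multiple comparisons or a ranking by statistic size is usually more useful than a raw pass/fail.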

Question 6

Explain how you'd handle severe class imbalance (1:10000) in a fraud detection model, and what metric you'd optimize.

Accepted Answer

Discuss that accuracy is useless at this ratio. Metric choice: precision-recall AUC or F-beta (β > 1 to weight recall). Techniques, layered by effectiveness: (1) threshold tuning post-training: often the first move; (2) cost-sensitive learning: adjust class weights in the loss; (3) resampling: SMOTE to oversample the minority, random undersampling of the majority. At 1:10000 both have pitfalls: SMOTE interpolates among very few real fraud cases and can synthesize points inside the majority region, while aggressive undersampling discards most legitimate traffic and inflates predicted probabilities until they are recalibrated; (4) anomaly detection framing: isolation forest, autoencoders trained only on the majority class; (5) ensemble methods: BalancedBaggingClassifier (imbalanced-learn). Discuss calibration separately: a model can have good discrimination (AUC) but poor calibration, and in fraud you often need calibrated scores to set business thresholds.

What interviewers look for: Does the candidate know the difference between discrimination and calibration? Do they know why SMOTE fails at extreme imbalance? Do they connect the metric to the business cost structure, not just technical performance?
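The threshold-tuning step can be sketched as a direct search for the score cutoff that maximizes F-beta on a labeled validation set. The function names are hypothetical, and the scan is quadratic in the number of distinct scores, which is fine for illustration but would be replaced by a single sorted pass on real data.

```python
def fbeta_at_threshold(y_true, scores, threshold, beta=2.0):
    """F-beta when everything scoring >= threshold is flagged as fraud."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def best_threshold(y_true, scores, beta=2.0):
    """Observed score that maximizes F-beta; ties go to the higher threshold."""
    return max(
        sorted(set(scores)),
        key=lambda t: (fbeta_at_threshold(y_true, scores, t, beta), t),
    )
```

With β = 2 the search favors a lower cutoff that catches more fraud at the cost of more false alarms; with β = 0.5 it moves the cutoff up. In practice β should come from the relative cost of a missed fraud case versus a manual review, which is the link to business cost structure that the answer calls for.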

Question 7

Tell me about a time you disagreed with a PM or stakeholder about the ML approach and how you resolved it.

Accepted Answer

Use STAR (Situation, Task, Action, Result), with emphasis on the stakes in the Situation and on the Actions you took to resolve the disagreement. Pick an example where the technical disagreement was substantive (e.g., the stakeholder wanted a rule-based system and you knew ML was required, or vice versa). Show that you built a shared understanding of the tradeoff (not just asserted your view), used data or a prototype to make the abstract concrete, involved the right people, and maintained the relationship regardless of outcome. Avoid stories where you simply won or simply capitulated: the best answers show genuine alignment-seeking.

What interviewers look for: Influence without authority. Technical credibility communicated to non-technical stakeholders. Whether you can disagree productively. At senior level, conflict avoidance is a red flag.

Question 8

Describe the most complex ML project you've led end to end. What would you do differently?

Accepted Answer

Structure: scope and constraints, your specific decisions (model choice, data strategy, evaluation design, rollout plan), what went wrong and why, concrete retrospective changes. The 'what would you do differently' is the real interview: it shows self-awareness and growth. Strong answers name a specific architectural or process mistake (e.g., 'I would have instrumented logging before training, not after, because we had to retrain with corrected labels twice') rather than vague regrets.

What interviewers look for: Ownership of the full lifecycle. Honesty about failure. Depth of reflection: not just 'I'd communicate more' but specific technical or process decisions you'd change. Scope of 'complex' should be calibrated: a senior MLE should be talking about multi-quarter, cross-functional projects.

Question 9

How do you design an A/B test when the metric you care about (e.g., long-term retention) takes months to observe?

Accepted Answer

Explain the core tension: you can't wait months per experiment. Strategies: (1) proxy metrics: identify leading indicators (7-day engagement, early retention) validated against the long-term metric through historical analysis (e.g., correlation or Granger-causality tests); (2) surrogate models: train a model to predict the long-term outcome from early signals; (3) interleaving experiments for ranking systems: faster signal with less traffic; (4) sequential testing with always-valid p-values or e-values to avoid peeking problems; (5) power analysis to set the sample size before running. Mention the risk of Goodhart's Law when optimizing proxies.

What interviewers look for: Sophistication beyond 'run an A/B test for 2 weeks.' Understanding of surrogate metrics and their risks. Statistical rigor: awareness of peeking, FWER, sequential testing. Practical experience running experiments with delayed outcomes.
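The power-analysis step can be made concrete with the textbook sample-size formula for a two-sided, two-proportion z-test. The function name and defaults are illustrative, and the lift is an absolute difference in a conversion-style proxy metric such as 7-day retention.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Users needed in each arm to detect an absolute lift in a rate.

    Assumes a two-sided two-proportion z-test with equal arm sizes.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance quantile
    z_beta = NormalDist().inv_cdf(power)            # power quantile
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)
```

For a 10% baseline and a one-point absolute lift this lands a little under 15,000 users per arm. Halving the detectable lift roughly quadruples the requirement, which is usually the argument for picking a more sensitive proxy rather than waiting for the slow metric.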

Question 10

Your new model shows a +2% lift on offline evaluation but a neutral result in the A/B test. How do you diagnose this?

Accepted Answer

Enumerate hypotheses systematically: (1) the offline metric doesn't proxy the online metric well: check which users/behaviors the offline eval over-represents; (2) novelty effect or user adaptation: run a cohort analysis over time; (3) the A/B test is underpowered: recalculate MDE and sample size; (4) implementation bug in the experiment: verify treatment assignment and feature computation in serving vs. training; (5) interaction effects: other concurrent experiments contaminating the result (check experiment isolation); (6) the model genuinely doesn't improve the objective that users respond to. Recommend iterating on the offline-online correlation and instrumenting detailed breakdowns by user segment.

What interviewers look for: Systematic thinking under ambiguity. Awareness of the offline-online gap as a fundamental problem in applied ML, not an anomaly. Experience shows in specificity: candidates who've seen this before name real failure modes quickly.
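For hypothesis (4), one quick way to verify treatment assignment is a sample ratio mismatch check: compare the observed arm sizes with the intended split using a chi-square goodness-of-fit test. The check itself and the function name are additions for illustration, not something the answer above prescribes.

```python
from math import erfc, sqrt


def srm_p_value(control_count, treatment_count, expected_treatment_share=0.5):
    """P-value for the observed split under the intended allocation.

    Chi-square goodness of fit with one degree of freedom, for which the
    upper tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (treatment_count - expected_treatment) ** 2 / expected_treatment
        + (control_count - expected_control) ** 2 / expected_control
    )
    return erfc(sqrt(chi2 / 2))
```

A very small p-value suggests that assignment or logging is broken (for example, one arm dropping users on a failed feature fetch), in which case the neutral result says nothing about the model and the experiment should be fixed and rerun before any other hypothesis is explored.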

Question 11

How would you build a feature store that serves both training and online inference while preventing training-serving skew?

Accepted Answer

Explain the two-path problem: offline features computed by batch jobs (Spark, dbt) and online features served at low latency (Redis, DynamoDB). Key design decisions: (1) a single feature definition layer (e.g., Feast feature views) that generates both paths from the same code: this is the primary skew prevention mechanism; (2) point-in-time correct joins for training to prevent leakage; (3) feature logging at serving time with replay capability; (4) monitoring: statistical tests between the training distribution and the live distribution per feature, alerting on drift; (5) versioning: features need semantic versioning because a feature definition change can silently break a deployed model. Reference existing options and their tradeoffs: Feast and Hopsworks are open source, while Tecton is a commercial managed platform.

What interviewers look for: Understanding that training-serving skew is an architectural problem, not just a testing problem. Point-in-time correctness for temporal features. Practical experience with the operational burden of running a feature store.
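Point-in-time correctness is easier to explain with a small example: for each labeled event, attach the most recent feature value recorded at or before the event time, never a later one. The data layout and function name below are hypothetical, and whether the comparison should be "at or before" or "strictly before" depends on how feature timestamps are assigned.

```python
from bisect import bisect_right


def point_in_time_join(label_events, feature_history):
    """Attach to each event the latest feature value known at event time.

    label_events: iterable of (entity_id, event_ts, label).
    feature_history: dict entity_id -> list of (feature_ts, value),
        sorted by feature_ts ascending.
    Returns a list of (entity_id, event_ts, feature_value, label), with
    feature_value set to None when no value existed yet.
    """
    timestamps = {
        entity_id: [ts for ts, _ in history]
        for entity_id, history in feature_history.items()
    }
    rows = []
    for entity_id, event_ts, label in label_events:
        history = feature_history.get(entity_id, [])
        # Index of the last feature row with feature_ts <= event_ts.
        idx = bisect_right(timestamps.get(entity_id, []), event_ts) - 1
        value = history[idx][1] if idx >= 0 else None
        rows.append((entity_id, event_ts, value, label))
    return rows
```

A naive join on entity ID alone would attach the latest value to every historical event, leaking information from the future into training. That kind of leak inflates offline metrics and only shows up as a gap once the model is serving.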

Question 12

How do you decide when to retrain a production model, and how do you automate that decision safely?

Accepted Answer

Distinguish trigger strategies: (1) scheduled retraining: simple but wasteful, and can miss drift; (2) performance-based triggers: monitor proxy metrics (prediction confidence distribution, CTR, downstream business KPIs) and trigger when they cross thresholds; (3) data-drift triggers: PSI or KL divergence on input feature distributions; (4) concept drift detection: DDM, ADWIN on windowed accuracy estimates. Automation safety: a new model must pass offline evaluation, shadow deployment comparison, and canary rollout before replacing production. Discuss rollback triggers and the need for a model registry with reproducible artifacts. Note the chicken-and-egg problem: you need labels to evaluate, but labels may be delayed.

What interviewers look for: Understanding that retraining is a risk event, not just a maintenance event. Ability to design a retraining pipeline with appropriate guardrails. Awareness of the label delay problem and strategies (proxy labels, partial labels) to work around it.
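A data-drift trigger based on PSI can be sketched as follows. Equal-width bins taken from the reference sample and the small floor on empty bins are assumptions of this example, and the function name is hypothetical.

```python
from math import log


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one feature.

    Bin edges come from the reference (training) sample; live values
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0          # guard against a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb treats values above roughly 0.2 as drift worth acting on, but that cutoff is a convention, not a guarantee. It is safer to calibrate the trigger against past periods where retraining did or did not help, and to route any triggered retrain through the same offline, shadow, and canary gates as a manual one.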

Senior Machine Learning Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Design a real-time content ranking system for a social feed serving 50M daily active users. Walk through the full architecture.

2. How would you design a near-duplicate detection system that scales to 10 billion documents?

3. Implement gradient descent with momentum from scratch in Python, then explain how Adam differs and when you'd prefer one over the other.

4. Given a large dataset that doesn't fit in memory, write a Python solution to compute exact quantiles and explain the tradeoff with approximate methods like t-digest or GK summaries.

5. You train a model that achieves 95% accuracy on your validation set but performs at 70% in production. Walk through your diagnosis and remediation process.

6. Explain how you'd handle severe class imbalance (1:10000) in a fraud detection model, and what metric you'd optimize.

7. Tell me about a time you disagreed with a PM or stakeholder about the ML approach and how you resolved it.

8. Describe the most complex ML project you've led end to end. What would you do differently?

9. How do you design an A/B test when the metric you care about (e.g., long-term retention) takes months to observe?

10. Your new model shows a +2% lift on offline evaluation but a neutral result in the A/B test. How do you diagnose this?

11. How would you build a feature store that serves both training and online inference while preventing training-serving skew?

12. How do you decide when to retrain a production model, and how do you automate that decision safely?

Study tips