Mid-Level Machine Learning Engineer Interview Questions
Mid-level ML engineer interviews test whether you can independently own the full lifecycle of an ML project — from data pipeline design through model training, evaluation, and production deployment — without heavy hand-holding. You're expected to know the theory well enough to debug failure modes, not just call sklearn APIs. The bar is practical depth: have you shipped real models, dealt with real data problems, and made real tradeoffs?
What to expect
Expect a loop of 4-6 rounds covering: one or two coding rounds (ML-flavored LeetCode at medium difficulty, plus ML-specific coding like implementing a loss function, gradient descent, or evaluation metric from scratch), one or two ML conceptual/depth rounds (probing your understanding of model internals, training dynamics, and evaluation), one system design round focused on ML systems (feature stores, training pipelines, serving infrastructure), and one behavioral round weighted toward project ownership and cross-functional collaboration. You won't be expected to architect a planet-scale ML platform, but you must show you've thought carefully about latency, data drift, and model monitoring in real deployments.
12 questions, with how to answer them
ML Coding
1. Implement logistic regression training with gradient descent from scratch in NumPy, including the forward pass, binary cross-entropy loss, and gradient update.
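How to answer: Start by writing the sigmoid function, then the forward pass (predictions), then the loss. Derive the gradient of the mean binary cross-entropy analytically: dL/dw = X^T (y_hat - y) / n, and dL/db = mean(y_hat - y) if you keep a separate bias term. Implement the update loop with a learning rate and show how you'd add an L2 regularization term (a penalty of (lambda/2) * ||w||^2 adds lambda * w to the gradient). Address numerical stability: clip predictions away from 0 and 1 before taking the log so the loss never evaluates log(0).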
What they look for: Can you derive and implement core ML math without a framework? They want to see you treat the gradient derivation as obvious, not struggle through it. Clean vectorized NumPy code signals you understand the matrix shapes and won't write Python loops over samples in production.
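A minimal sketch of the training loop described above, assuming NumPy and a label vector of 0s and 1s. The function names (`sigmoid`, `bce_loss`, `train_logreg`) and the default hyperparameters are illustrative choices, not a reference solution.

```python
import numpy as np


def sigmoid(z):
    # Clip logits so np.exp cannot overflow for extreme values.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 so log(0) never occurs.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def train_logreg(X, y, lr=0.1, n_iters=1000, l2=0.0):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)                 # forward pass, shape (n,)
        losses.append(bce_loss(y, y_hat) + 0.5 * l2 * np.dot(w, w))
        grad_w = X.T @ (y_hat - y) / n + l2 * w    # dL/dw plus the L2 term
        grad_b = np.mean(y_hat - y)                # bias is left unregularized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses
```

Everything is vectorized over samples, so the only Python loop is over iterations. In an interview it is worth stating the shapes out loud: `X` is (n, d), `w` is (d,), and `y_hat - y` is (n,).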
ML Coding
2. Write a function to compute precision, recall, F1, and AUC-ROC from raw model scores and binary labels without using sklearn.
How to answer: Precision and recall require choosing a threshold; show you understand this explicitly. For AUC-ROC, sort by score descending, iterate thresholds, accumulate TPR/FPR pairs, then use the trapezoidal rule. Distinguish macro vs. micro averaging for multi-class extensions. Handle edge cases like all-positive or all-negative label sets.
What they look for: Interviewers want to see that you genuinely understand what these metrics measure, not just that you can call a function. The AUC implementation specifically tests whether you understand the ROC curve construction. Edge-case handling signals production-readiness.
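One way to sketch these metrics in plain NumPy. The function names and the choice to return NaN when AUC is undefined (only one class present) are assumptions made for illustration; an interviewer may prefer raising an error instead.

```python
import numpy as np


def precision_recall_f1(y_true, scores, threshold=0.5):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(scores) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    # Guard the zero-denominator cases (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return float(precision), float(recall), float(f1)


def roc_auc(y_true, scores):
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class
    order = np.argsort(-scores, kind="mergesort")  # descending by score
    y_sorted, s_sorted = y_true[order], scores[order]
    tps = np.cumsum(y_sorted)
    fps = np.cumsum(1 - y_sorted)
    # Keep one point per distinct score so tied scores form a single segment.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], y_sorted.size - 1]
    tpr = np.r_[0.0, tps[last] / n_pos]
    fpr = np.r_[0.0, fps[last] / n_neg]
    # Trapezoidal rule over the (FPR, TPR) points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, and at the default threshold the same inputs give precision 1.0 and recall 0.5.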
ML Concepts
3. Your model achieves 92% accuracy on the test set but performs poorly in production. Walk me through how you diagnose this.
How to answer: Structure around distribution shift first: covariate shift (input X changes), label shift (P(Y) changes), or concept drift (P(Y|X) changes). Then ask: is the test set representative? Was there data leakage inflating test performance? Is the production data preprocessed identically? Use monitoring tools — log input feature distributions, compare them to training distributions with KL divergence or PSI (Population Stability Index). Then check prediction distributions and downstream business metrics.
What they look for: This is a practical debugging question. They want to see a structured, hypothesis-driven approach rather than guessing. Mentioning PSI/KL divergence for drift detection and distinguishing the types of distribution shift shows genuine production ML experience.
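As a rough illustration of the KL-divergence comparison mentioned above, the sketch below bins one numeric feature from training and production onto shared histogram edges. The function name, the bin count, and the smoothing constant `eps` are assumptions; the result is sensitive to both binning and smoothing, so treat it as a relative signal rather than an absolute one.

```python
import numpy as np


def kl_divergence(train_values, prod_values, n_bins=20, eps=1e-9):
    """Approximate KL(production || training) for one numeric feature."""
    train_values = np.asarray(train_values, dtype=float)
    prod_values = np.asarray(prod_values, dtype=float)
    # Shared bin edges so both histograms describe the same intervals.
    lo = min(train_values.min(), prod_values.min())
    hi = max(train_values.max(), prod_values.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(prod_values, bins=edges)
    q, _ = np.histogram(train_values, bins=edges)
    # Smooth empty bins so the log ratio stays finite, then normalize.
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))
```

Running this per feature and sorting by the result is a quick way to shortlist which inputs to inspect first.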
ML Concepts
4. Explain the bias-variance tradeoff and describe a concrete situation where you'd deliberately choose a higher-bias model.
How to answer: Define bias (systematic error from model assumptions) and variance (sensitivity to training data fluctuations). Formally, for squared-error loss: Expected Error = Bias² + Variance + Irreducible Noise. A higher-bias model is preferable when: training data is small and a complex model would overfit badly; the feature set is noisy and a simpler model generalizes better; interpretability is required (e.g., regulatory context); or inference latency is constrained and a linear model is fast enough.
What they look for: They want to see you treat this as an engineering decision, not a textbook definition. Grounding the tradeoff in a real scenario (small data, latency, interpretability) signals that you use this framework to make actual model selection decisions.
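The decomposition can be made concrete with a small simulation. This is a sketch under invented assumptions: the true function is sin(pi * x), the noise level and sample size are arbitrary, and polynomial degree stands in for model complexity. It estimates Bias² and Variance by refitting on many resampled training sets.

```python
import numpy as np


def bias_variance(degree, n_train=15, n_trials=200, noise=0.3, seed=0):
    """Estimate (bias^2, variance) of a polynomial fit on a toy problem."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 50)
    f_test = np.sin(np.pi * x_test)            # noise-free target
    preds = np.empty((n_trials, x_test.size))
    for i in range(n_trials):
        # Draw a fresh small training set each trial.
        x = rng.uniform(-1, 1, n_train)
        y = np.sin(np.pi * x) + rng.normal(0, noise, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_test)
    # Bias^2: squared gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    return float(bias_sq), float(variance)
```

Comparing `bias_variance(1)` with `bias_variance(9)` shows the tradeoff on 15 training points: the straight line carries a large bias term, while the degree-9 fit has much higher variance across resampled training sets.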
ML Concepts
5. How does the Adam optimizer work, and when would you prefer SGD with momentum instead?
How to answer: Adam tracks an exponential moving average of the gradients (the first moment, which acts like momentum) and of the squared gradients (the second moment, or uncentered variance, which gives each parameter its own effective learning rate). The bias-correction terms (dividing by 1-β^t) are important to mention: both averages are initialized at zero, so without correction the early estimates are biased toward zero. Adam converges fast but can generalize worse than SGD in some vision tasks (the 'Adam generalization gap'). Prefer SGD+momentum when you have the compute budget to tune the learning rate and schedule carefully, or when final accuracy matters more than convergence speed and benchmarks on similar tasks favor SGD.
What they look for: Mid-level candidates should know optimizer internals, not just that 'Adam usually works.' Mentioning the generalization gap and bias correction shows depth. They're checking whether you pick optimizers thoughtfully or cargo-cult Adam on everything.
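A single Adam update can be written in a few lines of NumPy. This sketch uses the commonly cited default hyperparameters; the function name `adam_step` and the calling convention (state passed in and returned) are illustrative assumptions.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered)
    # Bias correction: m and v start at zero, so early values are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the corrected moments reduce to `grad` and `grad ** 2`, so the update is approximately `lr * sign(grad)`: the step size is independent of gradient scale, which is a useful property to point out when comparing against SGD.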
ML System Design
6. Design a real-time product recommendation system that serves personalized results in under 100ms for 10M daily active users.
How to answer: Decompose into: (1) offline: train a two-tower embedding model, generate item and user embeddings, and store them in a vector index or vector database (e.g., Faiss, Pinecone); (2) near-line: pre-compute and cache top-K candidates per user, refreshed on a schedule; (3) online: retrieve candidates from the cache, re-rank with a lightweight model (gradient boosted trees, not a transformer) using real-time context features, and enforce business rules. Address: embedding freshness for new users/items (cold start), a feature store for consistent training/serving features, and monitoring for CTR drift.
What they look for: They want to see the offline/near-line/online decomposition, which is the standard industry pattern. Knowing that you can't score the full catalog with a deep model inside a 100ms budget, which is why retrieval and ranking are separate stages, is essential. Cold start handling and feature store mention signal real experience beyond toy projects.
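The retrieve-then-re-rank split can be sketched in NumPy as below. This is a toy stand-in: exact dot-product scoring replaces the ANN index, and `rerank_score` is a hypothetical callable standing in for the lightweight ranking model. The function names and the `blocked` set for business rules are assumptions.

```python
import numpy as np


def retrieve_candidates(user_vec, item_matrix, n_candidates):
    """Stage 1: cheap retrieval by embedding dot product."""
    scores = item_matrix @ user_vec
    n = min(n_candidates, scores.size)
    # argpartition finds the top n without fully sorting every item.
    idx = np.argpartition(-scores, n - 1)[:n]
    return idx[np.argsort(-scores[idx])]        # sort only the shortlist


def rerank(candidates, rerank_score, blocked=frozenset(), k=10):
    """Stage 2: apply business rules, then re-rank the shortlist."""
    allowed = [int(c) for c in candidates if int(c) not in blocked]
    return sorted(allowed, key=rerank_score, reverse=True)[:k]
```

The expensive scorer runs only on the shortlist from stage 1, which is what keeps the online path within a tight latency budget.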
ML System Design
7. How would you design a training data pipeline for a text classification model that needs to be retrained weekly on 50GB of new labeled data?
How to answer: Cover: data ingestion (streaming vs. batch, schema validation at ingest with Great Expectations or similar), storage (partitioned Parquet on S3 by date), processing (Spark or Beam for preprocessing at scale, avoid pandas for 50GB), versioning (DVC or Delta Lake for reproducibility), labeling quality checks (inter-annotator agreement, label distribution monitoring), and retraining triggers (scheduled vs. performance-triggered). Mention that the pipeline must be idempotent and that you should version the dataset used for each model version.
What they look for: This tests whether you think about ML pipelines with software engineering rigor. Dataset versioning and idempotency are non-obvious details that separate engineers who've debugged flaky pipelines from those who haven't. Naming concrete tools is fine but should be backed by rationale.
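Idempotency and dataset versioning can be illustrated with the standard library alone. This is a deliberately simplified sketch: it writes JSON Lines to a local directory instead of Parquet on S3, and the names `write_partition` and `dataset_fingerprint`, the `date=` directory layout, and the `data.jsonl` filename are all hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def write_partition(base_dir, run_date, records):
    """Write one date partition; a re-run overwrites instead of appending."""
    part_dir = Path(base_dir) / f"date={run_date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()
    # Write to a temp file, then atomically swap it in, so readers never
    # observe a half-written partition.
    fd, tmp = tempfile.mkstemp(dir=part_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    target = part_dir / "data.jsonl"
    os.replace(tmp, target)
    return target


def dataset_fingerprint(paths):
    """Content hash over the files, to record alongside each model version."""
    h = hashlib.sha256()
    for p in sorted(str(p) for p in paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()
```

Because the output path depends only on the run date and the write replaces the whole file, retrying a failed job yields the same partition and the same fingerprint. Tools like DVC or Delta Lake provide this kind of guarantee at scale.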
Applied ML
8. You're building a fraud detection model where the positive class (fraud) is 0.1% of transactions. How do you handle this, from training through evaluation?
How to answer: Training: don't just oversample naively. Options include: class-weighted loss (usually the most practical; a weight near the inverse class ratio, about 1000 for fraud here, is a reasonable starting point to tune), SMOTE for tabular data (with caveats), undersampling the majority class, or adjusting the decision threshold after training rather than resampling. Evaluation: accuracy is useless here. Use precision-recall AUC (better than ROC-AUC for extreme imbalance), F-beta score (weight recall higher if missing fraud is costly), and business metrics (dollar value of fraud caught). Calibration matters: if you use the score as a risk score, ensure it's well-calibrated with Platt scaling or isotonic regression, especially because reweighting and resampling distort the predicted probabilities.
What they look for: Knowing that ROC-AUC can look deceptively good under extreme imbalance (the false positive rate is divided by the huge number of negatives, so even many false alarms barely move it) and pivoting to PR-AUC is a key signal. Mentioning threshold tuning and calibration separately shows practical maturity.
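Two of the pieces above, a class-weighted loss and a PR-AUC estimate, can be sketched in NumPy. The function names are illustrative. `average_precision` uses the step-wise average-precision definition and does not treat tied scores specially, which is a simplifying assumption.

```python
import numpy as np


def weighted_bce(y, y_hat, pos_weight, eps=1e-12):
    """Binary cross-entropy with the positive class up-weighted."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    per_sample = -(pos_weight * y * np.log(y_hat)
                   + (1 - y) * np.log(1 - y_hat))
    return float(np.mean(per_sample))


def average_precision(y_true, scores):
    """PR-AUC as the mean of precision@k over the ranks of the positives."""
    y_true = np.asarray(y_true).astype(int)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")
    y_sorted = y_true[order]
    n_pos = y_sorted.sum()
    if n_pos == 0:
        return float("nan")  # undefined without any positives
    tp = np.cumsum(y_sorted)
    precision_at_k = tp / np.arange(1, y_sorted.size + 1)
    return float(np.sum(precision_at_k * y_sorted) / n_pos)
```

A common starting point for `pos_weight` is the negative-to-positive ratio in the training set. Unlike ROC-AUC, the baseline for average precision is the positive rate (0.001 in this scenario), so even modest-looking values can represent a large lift.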
Applied ML
9. How would you detect and handle feature drift in a production model over time?
How to answer: Split into detection and response. Detection: monitor input feature distributions using Population Stability Index (PSI), KL divergence, or Kolmogorov-Smirnov tests on a rolling window. Monitor prediction distribution separately. Set up alerting thresholds (PSI > 0.2 is a common rule of thumb for significant drift). Response options: retrain on recent data, retrain with time-weighted samples, investigate the root cause (upstream data pipeline change, real-world change), or temporarily fall back to a simpler heuristic. Use shadow mode or canary deployments when rolling out a retrained model.
What they look for: They want to see a concrete monitoring framework, not just 'monitor for drift.' PSI as a specific metric with a threshold shows hands-on experience. Distinguishing between detecting drift and deciding how to respond shows engineering maturity.
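PSI for a single numeric feature can be sketched as follows. The bin edges come from quantiles of the reference (training) sample. The function name, the 10-bin default, and the `eps` floor for empty bins are assumptions, and different teams bin differently, so thresholds are only comparable under a fixed binning scheme.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against reference `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Quantile edges from the reference sample; drop duplicate edges.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Use interior edges only, so out-of-range values fall into the end bins.
    inner = edges[1:-1]
    n = inner.size + 1
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=n) / expected.size
    a_frac = np.bincount(a_idx, minlength=n) / actual.size
    # Floor the fractions so empty bins do not produce log(0).
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

In a monitoring job this would run per feature on a rolling window of production data, with the training sample as `expected`, and alert when the value crosses the chosen threshold.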
Coding / Algorithms
10. Given a large dataset of user-item interaction logs that doesn't fit in memory, how do you compute the top-K most similar items to a query item using collaborative filtering?
How to answer: Use locality-sensitive hashing (LSH) or approximate nearest neighbor (ANN) search rather than exact cosine similarity over all items. For the offline step: compute item embeddings from interaction data using implicit feedback matrix factorization (ALS via Spark or the implicit library). Store embeddings in Faiss with an IVF index. At query time: retrieve approximate top-K in sub-linear time. If you must do it from scratch with limited memory: describe a streaming approach that reads item vectors in chunks and keeps a min-heap of size K, so you never hold all similarities at once. Reservoir sampling helps only if you need a uniform subsample of the logs; it does not select the top-K.
What they look for: They're testing whether you understand that brute-force similarity search doesn't scale and that ANN indexes are the standard solution. Mentioning implicit feedback (not just explicit ratings) shows domain realism. The min-heap for top-K streaming is a solid algorithmic detail.
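The streaming fallback can be sketched with the standard-library `heapq` module. This assumes item embeddings already exist and arrive in chunks as `(ids, matrix)` pairs with integer ids. The function name and chunk format are illustrative, and the search is exact rather than approximate.

```python
import heapq

import numpy as np


def top_k_similar(query, chunks, k):
    """Exact top-k cosine similarity over chunks that fit in memory one at a time."""
    q = query / (np.linalg.norm(query) + 1e-12)
    heap = []  # min-heap of (similarity, item_id); root is the weakest kept item
    for ids, mat in chunks:
        norms = np.linalg.norm(mat, axis=1) + 1e-12
        sims = (mat @ q) / norms                # cosine similarity per row
        for item_id, sim in zip(ids, sims):
            entry = (float(sim), int(item_id))
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # Better than the current weakest: swap it in, O(log k).
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)           # best first
```

Memory stays at one chunk plus k heap entries, and total work is O(N log k) comparisons on top of the matrix products. An ANN index trades this exactness for sub-linear query time.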
Behavioral / Project Ownership
11. Tell me about a model you owned that failed in production. What happened, what did you do, and what would you do differently?
How to answer: Use a structured narrative: context (what the model did, stakes), what failure looked like (metric drop, business impact), how you detected it (monitoring, complaint, experiment), the root cause (data pipeline bug, distribution shift, labeling error), how you fixed it, and what safeguards you added. Be specific and honest — fabricated smooth stories are obvious. The 'differently' part should show genuine learning: better monitoring, a staged rollout, a held-out validation set better matched to production.
What they look for: They're evaluating ownership, honesty, and learning velocity — not whether you've never made mistakes. Vague answers ('the model underperformed so we retrained it') score poorly. Specific details about the failure mechanism and what you actually changed afterward score highly.
Behavioral / Collaboration
12. Describe a time you had a technical disagreement with a teammate or manager about a modeling approach. How did you handle it?
How to answer: Pick a real example with genuine stakes (not a trivial preference). Structure: what was the disagreement (e.g., they wanted a complex ensemble, you argued for a simpler model), what was your reasoning, how you made your case (data, experimentation, tradeoffs), what the outcome was. Show that you can advocate for a technically correct position with evidence while remaining open to being wrong. If their approach turned out better, say so and explain what you learned.
What they look for: At mid-level, they expect you to have and defend technical opinions, not defer to everyone. They're watching for intellectual honesty (not 'I was right and convinced everyone'), communication skills, and the ability to resolve disagreement through data and experimentation rather than politics.
Study tips
- Implement the core algorithms from scratch at least once: logistic regression, k-means, a basic neural net with backprop in NumPy. Interviews frequently ask you to modify or debug these, and you can't do that if you've only ever called fit().
- For system design, learn the offline/near-line/online serving pattern and the feature store concept cold — these appear constantly in industry ML design interviews and interviewers expect you to know the vocabulary and rationale.
- Practice articulating model selection decisions as engineering tradeoffs. 'I used XGBoost because it worked' will fail; 'I used XGBoost because the tabular features had nonlinear interactions, training data was 500K rows (not worth the complexity of a neural net), and we needed feature importances for stakeholder trust' will pass.
- Know your evaluation metrics deeply: when ROC-AUC misleads (class imbalance), when to use PR-AUC instead, what calibration means and why it matters for risk scoring. These come up in both coding and conceptual rounds.
- Prepare two or three detailed stories about production ML systems you've worked on, with specific numbers (data size, latency, metric improvements, failure modes). Vague project descriptions are the most common reason mid-level ML candidates fail behavioral rounds.