Question 1

Implement logistic regression training with gradient descent from scratch in NumPy, including the forward pass, binary cross-entropy loss, and gradient update.

Accepted Answer

Start with the sigmoid function, then the forward pass (predictions), then the loss. Derive the gradient analytically: for the mean binary cross-entropy loss, dL/dw = X^T (y_hat - y) / n and dL/db = mean(y_hat - y). Implement the update loop with a learning rate and show how you'd add an L2 regularization term (it adds lambda * w to the weight gradient). Mention numerical stability: clip predictions away from 0 and 1 so the loss never evaluates log(0).

What interviewers look for: Can you derive and implement core ML math without a framework? They want to see you derive the gradient fluently rather than struggle through it. Clean, vectorized NumPy code signals that you understand the matrix shapes and won't write Python loops over samples in production.
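A minimal sketch of the training loop described above, assuming batch gradient descent on the mean binary cross-entropy loss. The function names, default hyperparameters, and the clipping constants are illustrative choices, not part of the question.

```python
import numpy as np

def sigmoid(z):
    # Clip logits so np.exp cannot overflow for very large |z|.
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, y_hat, eps=1e-12):
    # Clip probabilities away from 0 and 1 so log(0) never occurs.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def train_logreg(X, y, lr=0.1, n_steps=1000, l2=0.0):
    """Batch gradient descent; X has shape (n, d), y has shape (n,) with 0/1 values."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        y_hat = sigmoid(X @ w + b)        # forward pass, shape (n,)
        err = y_hat - y                   # dL/dz for mean BCE (before the 1/n factor)
        grad_w = X.T @ err / n + l2 * w   # L2 penalty (l2/2)*||w||^2 adds l2*w
        grad_b = err.mean()               # bias is not regularized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Every operation is vectorized over samples, so the only Python loop is over optimization steps.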

Question 2

Write a function to compute precision, recall, F1, and AUC-ROC from raw model scores and binary labels without using sklearn.

Accepted Answer

Precision, recall, and F1 require choosing a threshold; make that explicit. For AUC-ROC, sort by score descending, sweep the threshold across the distinct scores, accumulate (FPR, TPR) pairs, then integrate with the trapezoidal rule. Distinguish macro vs. micro averaging for multi-class extensions. Handle edge cases: tied scores, and label sets that are all positive or all negative (where AUC-ROC is undefined).

What interviewers look for: They want to see that you understand what these metrics measure, not just that you can call a function. The AUC implementation specifically tests whether you understand how the ROC curve is constructed. Edge-case handling signals production-readiness.
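One way to write this without sklearn is sketched below in plain Python. The function names, the default threshold of 0.5, and the choice to return NaN when only one class is present are assumptions made for illustration.

```python
def precision_recall_f1(labels, scores, threshold=0.5):
    """Threshold-dependent metrics; labels are 0/1, scores are floats."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(labels, scores):
    """Trapezoidal area under the ROC curve; NaN if only one class is present."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(pairs):
        # Consume all samples that share this score so ties form one ROC point.
        s = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return auc
```

Grouping tied scores before emitting a curve point keeps the result independent of the order in which tied samples happen to be sorted.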

Question 3

Your model achieves 92% accuracy on the test set but performs poorly in production. Walk me through how you diagnose this.

Accepted Answer

Structure around distribution shift first: covariate shift (input X changes), label shift (P(Y) changes), or concept drift (P(Y|X) changes). Then ask: is the test set representative? Was there data leakage inflating test performance? Is the production data preprocessed identically? Use monitoring tools: log input feature distributions and compare them to training distributions with KL divergence or PSI (Population Stability Index). Then check prediction distributions and downstream business metrics.

What interviewers look for: This is a practical debugging question. They want to see a structured, hypothesis-driven approach rather than guessing. Mentioning PSI/KL divergence for drift detection and distinguishing the types of distribution shift shows genuine production ML experience.

Question 4

Explain the bias-variance tradeoff and describe a concrete situation where you'd deliberately choose a higher-bias model.

Accepted Answer

Define bias (systematic error from model assumptions) and variance (sensitivity to training data fluctuations). Formally, for squared-error loss: Expected Error = Bias² + Variance + Irreducible Noise. A higher-bias model is preferable when: training data is small and a complex model would overfit badly; the feature set is noisy and a simpler model generalizes better; interpretability is required (e.g., regulatory context); or inference latency is constrained and a linear model is fast enough.

What interviewers look for: They want to see you treat this as an engineering decision, not a textbook definition. Grounding the tradeoff in a real scenario (small data, latency, interpretability) signals that you use this framework to make actual model selection decisions.

Question 5

How does the Adam optimizer work, and when would you prefer SGD with momentum instead?

Accepted Answer

Adam tracks a first moment (an exponential moving average of gradients, which acts as momentum) and a second moment (an exponential moving average of squared gradients, i.e. the uncentered variance, which gives a per-parameter adaptive learning rate). The bias-correction terms (dividing by 1-β^t) are important to mention: both moment estimates start at zero, so without correction they are biased toward zero during the first steps. Adam converges fast but can generalize worse than SGD in some vision tasks (the 'Adam generalization gap'). Prefer SGD+momentum when you have the compute budget to tune the learning rate and schedule carefully, or when final accuracy matters more than convergence speed.

What interviewers look for: Mid-level candidates should know optimizer internals, not just that 'Adam usually works.' Mentioning the generalization gap and bias correction shows depth. They're checking whether you pick optimizers thoughtfully or cargo-cult Adam on everything.
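The update rule can be sketched in a few lines. This is an illustrative plain-Python version over lists of floats, using the commonly cited default hyperparameters; the function name and signature are hypothetical.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count. Returns (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment (momentum)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction: m starts at 0
        v_hat = v_i / (1 - beta2 ** t)             # bias correction: v starts at 0
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

With the corrections in place, the very first step has magnitude close to lr regardless of the gradient's scale, because m_hat equals g and the square root of v_hat equals |g|. Without them, the first step would be scaled by (1 - beta1) / sqrt(1 - beta2) instead.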

Question 6

Design a real-time product recommendation system that serves personalized results in under 100ms for 10M daily active users.

Accepted Answer

Decompose into: (1) offline: train a two-tower embedding model, generate item and user embeddings, and store them in a vector index or database (e.g., Faiss, Pinecone); (2) near-line: pre-compute and cache top-K candidates per user, refreshed on a schedule; (3) online: retrieve candidates from cache, re-rank with a lightweight model (gradient boosted trees, not a transformer) using real-time context features, and enforce business rules. Address: embedding freshness for new users/items (cold start), a feature store for consistent training/serving features, and monitoring for CTR drift.

What interviewers look for: They want to see the offline/near-line/online decomposition, which is the standard industry pattern. Knowing that you can't score the full catalog with a deep model inside a 100ms budget is essential. Cold start handling and a feature store mention signal real experience beyond toy projects.
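The online stage of that decomposition might look roughly like the sketch below. Everything here is hypothetical: the cache is a plain dict standing in for a real key-value store, `score_fn` stands in for the lightweight re-ranker, and the `"__popular__"` key is an assumed cold-start fallback.

```python
def recommend(user_id, candidate_cache, score_fn, context, blocked_items, k=10):
    """Online path: cached candidates -> lightweight re-rank -> business rules.

    candidate_cache: dict mapping user_id to precomputed candidate item ids
                     (hypothetical stand-in for the near-line cache).
    score_fn:        callable (user_id, item_id, context) -> float
                     (hypothetical stand-in for the re-ranking model).
    """
    # Cold start: users without cached candidates fall back to a popularity list.
    candidates = candidate_cache.get(user_id)
    if candidates is None:
        candidates = candidate_cache.get("__popular__", [])
    # Business rules are applied before scoring so no budget is spent on blocked items.
    allowed = [item for item in candidates if item not in blocked_items]
    ranked = sorted(allowed, key=lambda item: score_fn(user_id, item, context), reverse=True)
    return ranked[:k]
```

The latency budget holds because the request path only touches a cache lookup and K model evaluations; the expensive embedding and retrieval work happened offline.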

Question 7

How would you design a training data pipeline for a text classification model that needs to be retrained weekly on 50GB of new labeled data?

Accepted Answer

Cover: data ingestion (streaming vs. batch, schema validation at ingest with Great Expectations or similar), storage (partitioned Parquet on S3 by date), processing (Spark or Beam for preprocessing at scale, avoid pandas for 50GB), versioning (DVC or Delta Lake for reproducibility), labeling quality checks (inter-annotator agreement, label distribution monitoring), and retraining triggers (scheduled vs. performance-triggered). Mention that the pipeline must be idempotent and that you should version the dataset used for each model version.

What interviewers look for: This tests whether you think about ML pipelines with software engineering rigor. Dataset versioning and idempotency are non-obvious details that separate engineers who've debugged flaky pipelines from those who haven't. Naming concrete tools is fine but should be backed by rationale.
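To make the idempotency and dataset-versioning points concrete, here is a small standard-library sketch. It uses local JSONL files and a content hash purely for illustration; a real pipeline would more likely write Parquet to object storage and rely on DVC or Delta Lake. The function names and the `date=...` directory layout are assumptions.

```python
import hashlib
import json
from pathlib import Path

def write_partition(root, run_date, records):
    """Idempotent write: re-running for the same date replaces the same partition."""
    part = Path(root) / f"date={run_date}"
    part.mkdir(parents=True, exist_ok=True)
    out = part / "data.jsonl"
    tmp = part / "data.jsonl.tmp"
    # Write to a temp file, then rename, so readers never see a half-written file.
    tmp.write_text("".join(json.dumps(r, sort_keys=True) + "\n" for r in records))
    tmp.replace(out)
    return out

def dataset_fingerprint(files, chunk_size=1 << 20):
    """SHA-256 over file names and contents, independent of input order."""
    h = hashlib.sha256()
    for path in sorted(Path(p) for p in files):
        h.update(path.name.encode("utf-8"))
        with path.open("rb") as f:
            # Read in chunks so large files are never loaded into memory at once.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    return h.hexdigest()
```

Storing the fingerprint alongside each trained model gives a cheap answer to "exactly which data produced this model version", and a rerun of the same weekly job yields the same partition and the same fingerprint.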

Question 8

You're building a fraud detection model where the positive class (fraud) is 0.1% of transactions. How do you handle this, from training through evaluation?

Accepted Answer

Training: don't just oversample naively. Options include: class-weighted loss (usually the most practical; a common starting point is a positive-class weight near the inverse class ratio, about 1000 here, tuned from there), SMOTE for tabular data (with caveats), undersampling the majority class, or adjusting the decision threshold after training rather than resampling. Evaluation: accuracy is useless here, since predicting "not fraud" for everything scores 99.9%. Use precision-recall AUC (more informative than ROC-AUC under extreme imbalance), F-beta score (weight recall higher if missing fraud is costly), and business metrics (dollar value of fraud caught). Calibration matters: if you use the score as a risk score, ensure it's well-calibrated with Platt scaling or isotonic regression.

What interviewers look for: Knowing that ROC-AUC can be misleading under extreme imbalance (the false positive rate is diluted by the abundant true negatives) and pivoting to PR-AUC is a key signal. Mentioning threshold tuning and calibration separately shows practical maturity.
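The sketch below illustrates two of these pieces in plain Python: a positive-class weight derived from the class ratio, and average precision as a step-wise estimate of PR-AUC. The function names are hypothetical, and the average-precision helper assumes no tied scores.

```python
import math

def inverse_frequency_weight(labels):
    """Positive-class weight = (#negatives / #positives); assumes at least one positive."""
    n_pos = sum(labels)
    return (len(labels) - n_pos) / n_pos

def weighted_bce(labels, probs, pos_weight, eps=1e-12):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def average_precision(labels, scores):
    """Mean of precision at each positive's rank (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / tp if tp else float("nan")
```

At a 0.1% fraud rate the inverse-frequency weight is 999, which is where the "about 1000" figure comes from. Weighting the loss this way distorts the predicted probabilities, which is one reason calibration is treated as a separate step.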

Question 9

How would you detect and handle feature drift in a production model over time?

Accepted Answer

Split into detection and response.

Detection: monitor input feature distributions using Population Stability Index (PSI), KL divergence, or Kolmogorov-Smirnov tests on a rolling window. Monitor prediction distribution separately. Set up alerting thresholds (PSI > 0.2 is a common rule of thumb for significant drift).

Response options: retrain on recent data, retrain with time-weighted samples, investigate the root cause (upstream data pipeline change, real-world change), or temporarily fall back to a simpler heuristic. Use shadow mode or canary deployments when rolling out a retrained model.

What interviewers look for: They want to see a concrete monitoring framework, not just 'monitor for drift.' PSI as a specific metric with a threshold shows hands-on experience. Distinguishing between detecting drift and deciding how to respond shows engineering maturity.
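As an illustration of the detection side, PSI for a single numeric feature can be computed as below. This is a sketch under assumptions: bins are equal-frequency quantiles of the reference sample, empty bins are floored at a small `eps` to keep the logarithm finite, and the function name and defaults are hypothetical.

```python
import math
from bisect import bisect_right

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample."""
    ref = sorted(expected)
    # Interior bin edges at reference quantiles (equal-frequency bins).
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for x in values:
            counts[bisect_right(edges, x)] += 1  # index of the bin containing x
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((a_i - e_i) * math.log(a_i / e_i) for e_i, a_i in zip(e, a))
```

A monitoring job would call this per feature on a rolling window and alert when the value crosses the chosen threshold, for example the 0.2 rule of thumb mentioned above.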

Question 10

Given a large dataset of user-item interaction logs that doesn't fit in memory, how do you compute the top-K most similar items to a query item using collaborative filtering?

Accepted Answer

Use approximate nearest neighbor (ANN) search, such as locality-sensitive hashing (LSH) or an inverted-file index, rather than exact cosine similarity over all items. For the offline step: compute item embeddings from interaction data using implicit feedback matrix factorization (ALS via Spark or the implicit library). Store embeddings in Faiss with an IVF index. At query time: retrieve approximate top-K in sub-linear time. If you must do it from scratch with limited memory: stream over the items and keep a min-heap of size K so you never hold all similarities at once. Reservoir sampling can subsample the interaction logs if an approximation is acceptable, but on its own it does not produce the top-K.

What interviewers look for: They're testing whether you understand that brute-force similarity search doesn't scale and that ANN indexes are the standard solution. Mentioning implicit feedback (not just explicit ratings) shows domain realism. The min-heap for top-K streaming is a solid algorithmic detail.
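The from-scratch fallback can be sketched with the standard library's `heapq`. Note that this is the exact streaming scan, not an ANN index: it still touches every item, but memory stays at O(K). The function names and the `(item_id, vector)` stream format are assumptions.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(query_vec, item_stream, k):
    """item_stream yields (item_id, vector); only k candidates are held in memory."""
    heap = []  # min-heap of (similarity, item_id); heap[0] is the weakest kept item
    for item_id, vec in item_stream:
        sim = cosine(query_vec, vec)
        if len(heap) < k:
            heapq.heappush(heap, (sim, item_id))
        elif sim > heap[0][0]:
            # Evict the current weakest candidate in O(log k).
            heapq.heapreplace(heap, (sim, item_id))
    return sorted(heap, reverse=True)  # best first
```

Because `item_stream` can be any iterator, embeddings can be read from disk in chunks, which gives O(N log K) time overall.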

Question 11

Tell me about a model you owned that failed in production. What happened, what did you do, and what would you do differently?

Accepted Answer

Use a structured narrative: context (what the model did, stakes), what failure looked like (metric drop, business impact), how you detected it (monitoring, complaint, experiment), the root cause (data pipeline bug, distribution shift, labeling error), how you fixed it, and what safeguards you added. Be specific and honest; fabricated smooth stories are obvious. The 'differently' part should show genuine learning: better monitoring, a staged rollout, a held-out validation set better matched to production.

What interviewers look for: They're evaluating ownership, honesty, and learning velocity, not whether you've never made mistakes. Vague answers ('the model underperformed so we retrained it') score poorly. Specific details about the failure mechanism and what you actually changed afterward score highly.

Question 12

Describe a time you had a technical disagreement with a teammate or manager about a modeling approach. How did you handle it?

Accepted Answer

Pick a real example with genuine stakes (not a trivial preference). Structure: what the disagreement was (e.g., they wanted a complex ensemble, you argued for a simpler model), what your reasoning was, how you made your case (data, experimentation, tradeoffs), and what the outcome was. Show that you can advocate for a technically correct position with evidence while remaining open to being wrong. If their approach turned out better, say so and explain what you learned.

What interviewers look for: At mid-level, they expect you to have and defend technical opinions, not defer to everyone. They're watching for intellectual honesty (not 'I was right and convinced everyone'), communication skills, and the ability to resolve disagreement through data and experimentation rather than politics.

Mid-Level Machine Learning Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Implement logistic regression training with gradient descent from scratch in NumPy, including the forward pass, binary cross-entropy loss, and gradient update.

2. Write a function to compute precision, recall, F1, and AUC-ROC from raw model scores and binary labels without using sklearn.

3. Your model achieves 92% accuracy on the test set but performs poorly in production. Walk me through how you diagnose this.

4. Explain the bias-variance tradeoff and describe a concrete situation where you'd deliberately choose a higher-bias model.

5. How does the Adam optimizer work, and when would you prefer SGD with momentum instead?

6. Design a real-time product recommendation system that serves personalized results in under 100ms for 10M daily active users.

7. How would you design a training data pipeline for a text classification model that needs to be retrained weekly on 50GB of new labeled data?

8. You're building a fraud detection model where the positive class (fraud) is 0.1% of transactions. How do you handle this, from training through evaluation?

9. How would you detect and handle feature drift in a production model over time?

10. Given a large dataset of user-item interaction logs that doesn't fit in memory, how do you compute the top-K most similar items to a query item using collaborative filtering?

11. Tell me about a model you owned that failed in production. What happened, what did you do, and what would you do differently?

12. Describe a time you had a technical disagreement with a teammate or manager about a modeling approach. How did you handle it?

Study tips