Question 1

Implement k-means clustering from scratch using only NumPy. Handle edge cases like empty clusters.

Accepted Answer

Write the full loop: random centroid initialization, Euclidean distance computation via broadcasting, cluster assignment with argmin, centroid recomputation, and a convergence check on centroid movement. For empty clusters, re-initialize that centroid to a random data point. Walk through the time complexity: O(n·k·d) per iteration, so O(n·k·d·iterations) overall, where n is the number of points, k the number of clusters, and d the dimensionality.

What interviewers look for: Clean NumPy usage without for-loops over data points, awareness of edge cases, and understanding that you are implementing a real ML primitive rather than pattern-matching to a LeetCode template.
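The loop described in this answer can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans` and the parameters `max_iters`, `tol`, and `seed` are assumptions made for the example, and initialization picks k distinct data points at random.

```python
import numpy as np


def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm on X of shape (n, d). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initialization: k distinct data points become the first centroids.
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Squared Euclidean distances via broadcasting:
        # (n, 1, d) - (1, k, d) -> (n, k, d), summed over d -> (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        new_centroids = centroids.copy()
        for j in range(k):  # loop over clusters, never over data points
            members = X[labels == j]
            if len(members) == 0:
                # Empty cluster: re-initialize to a random data point.
                new_centroids[j] = X[rng.integers(n)]
            else:
                new_centroids[j] = members.mean(axis=0)

        # Convergence check on the largest centroid movement.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment so the labels match the returned centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)
```

The (n, k, d) intermediate array is the memory cost of the broadcasting approach; for large inputs, a candidate might mention computing distances in chunks.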

Question 2

Given a list of training labels and predicted probabilities, implement binary cross-entropy loss and its gradient with respect to the predictions.

Accepted Answer

BCE = -(1/n) * sum(y*log(p) + (1-y)*log(1-p)). Add epsilon to the log arguments (or clip p away from 0 and 1) to avoid log(0). The gradient w.r.t. p is (p - y) / (p*(1-p)*n). If p = sigmoid(z), the gradient w.r.t. the logit z simplifies to (p - y) / n, which is the cleaner and more numerically stable form. Implement in NumPy and test on known values.

What interviewers look for: Whether you can connect the math to code, handle numerical instability, and actually verify your implementation rather than just recite the formula.
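A minimal NumPy sketch of the loss and its gradient with respect to the predictions is shown below. The function name `bce_loss_and_grad` and the default `eps` are assumptions for illustration, and the sketch clips p rather than adding epsilon inside the logs, which is one common variant of the same idea.

```python
import numpy as np


def bce_loss_and_grad(y, p, eps=1e-12):
    """Binary cross-entropy and its gradient w.r.t. the predicted probabilities."""
    y = np.asarray(y, dtype=float)
    # Clip so that log(p) and log(1 - p) stay finite.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    n = y.size
    loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # d(loss)/dp = (p - y) / (p * (1 - p) * n)
    grad = (p - y) / (p * (1.0 - p) * n)
    return loss, grad


# Known value: with p = 0.5 everywhere, the loss is log(2).
loss, grad = bce_loss_and_grad([1, 0], [0.5, 0.5])
print(loss, grad)
```

A finite-difference check (perturb one entry of p by a small h and compare the change in loss to the analytic gradient) is a simple way to verify the implementation during an interview.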

Question 3

Implement a function that performs train/validation/test split with stratification across class labels, without using sklearn.

Accepted Answer

Group indices by class label, shuffle within each class, then apply the split ratios to each class independently and concatenate the results. Handle edge cases: classes with too few samples to appear in every split (for example, a class with only one sample), and non-integer split sizes (pick one rounding rule and assign the remainder explicitly so the three parts always sum to the class size).

What interviewers look for: Systematic thinking about data pipeline code, awareness that a naive random split can produce unbalanced validation sets, and a clean implementation over a dict-of-lists structure.
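One possible shape for this function, using only the standard library and a dict-of-lists grouping, is sketched below. The name `stratified_split`, the default ratios, and the choice to floor the validation and test sizes and give the remainder to train are all assumptions for the example; other rounding policies are equally defensible.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train_idx, val_idx, test_idx) with per-class proportions preserved."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for idx_list in by_class.values():
        rng.shuffle(idx_list)  # shuffle within the class first
        n = len(idx_list)
        # Floor val/test sizes (small offset guards against float error);
        # the remainder goes to train, so a single-sample class lands in train.
        n_val = int(n * ratios[1] + 1e-9)
        n_test = int(n * ratios[2] + 1e-9)
        n_train = n - n_val - n_test
        train += idx_list[:n_train]
        val += idx_list[n_train:n_train + n_val]
        test += idx_list[n_train + n_val:]

    # Shuffle each part so classes are not grouped together.
    for part in (train, val, test):
        rng.shuffle(part)
    return train, val, test
```

With this rounding rule, very small classes never reach validation or test, which is worth stating out loud since it affects how per-class metrics can be reported.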

Question 4

Explain the bias-variance tradeoff and describe one concrete technique that reduces each. How does this manifest when you increase the depth of a decision tree?

Accepted Answer

Bias = error from wrong assumptions in the model; variance = error from sensitivity to fluctuations in the training data. Regularization (L2, dropout, pruning) reduces variance. Better features or more model capacity reduce bias. For a decision tree, increasing depth moves the model from the high-bias end toward the high-variance end: a shallow tree has high bias (underfits), while a fully grown tree has high variance (memorizes the training data). Pruning or ensemble methods (bagging) target the variance side.

What interviewers look for: Precise definitions, not just 'underfitting vs overfitting.' Ability to name specific mechanisms and connect the tradeoff to a real model family.
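For squared-error loss, the definitions above can be made precise with the standard decomposition. The notation here is an assumption for illustration: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and the learned model f̂ depends on a randomly drawn training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Deepening a tree typically shrinks the first term and grows the second; the third term is unaffected by model choice.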

Question 5

Why does gradient descent with a learning rate that's too large fail to converge, and what does the loss curve look like? What are three strategies to set the learning rate?

Accepted Answer

A learning rate that is too large makes each step overshoot the minimum, so the loss oscillates or diverges. The loss curve bounces chaotically or spikes upward instead of decreasing smoothly. Strategies: (1) learning rate range test / LR finder, (2) learning rate schedules (cosine annealing, step decay), (3) adaptive optimizers like Adam that maintain per-parameter learning rates. Mention that Adam is less sensitive to the initial LR but still has a useful range.

What interviewers look for: Mechanistic understanding of gradient descent geometry, ability to diagnose from a loss curve, and awareness of practical tooling beyond 'just try 0.001.'
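The overshooting behavior is easy to see on a one-dimensional quadratic. The sketch below is a toy illustration under stated assumptions: the objective f(x) = x², the helper names `gd_losses` and `cosine_lr`, and the specific learning rates are all chosen for the example. On this objective each update multiplies x by (1 - 2·lr), so any lr above 1 makes the iterate grow in magnitude.

```python
import math


def gd_losses(lr, x0=1.0, steps=20):
    """Run gradient descent on f(x) = x**2 and return the loss after each step."""
    x, losses = x0, []
    for _ in range(steps):
        x -= lr * 2.0 * x  # the gradient of x**2 is 2x
        losses.append(x * x)
    return losses


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: lr_max at step 0, decaying to lr_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Small lr: smooth decrease. lr = 0.9: x flips sign every step (overshoot)
# but still shrinks. lr = 1.1: every step lands farther away, so loss grows.
for lr in (0.1, 0.9, 1.1):
    losses = gd_losses(lr)
    print(f"lr={lr}: first loss {losses[0]:.3g}, last loss {losses[-1]:.3g}")
```

Real loss surfaces are not quadratic, so the divergence threshold is not known in advance, which is the motivation for a range test or a schedule.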

Question 6

You train a model and get 95% accuracy on a binary classification problem, but your stakeholder is unhappy. What questions do you ask and what metrics would you look at instead?

Accepted Answer

First ask about class imbalance: if 95% of samples are class 0, a trivial classifier achieves 95%. Switch to precision, recall, F1, or AUC-ROC depending on the cost asymmetry. Ask what the business cost of a false positive vs a false negative is. Look at the confusion matrix. Discuss calibration if the model outputs probabilities.

What interviewers look for: Whether you reflexively question the metric rather than accepting accuracy at face value. This is one of the most practically important ML instincts, and new grads often miss it.
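The trivial-classifier point can be shown in a few lines. This is an illustrative sketch on made-up labels (95 negatives, 5 positives); the helper name `binary_metrics` and the convention of reporting 0.0 when a ratio is undefined are assumptions for the example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Report 0.0 when a denominator is zero (no predicted or actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(binary_metrics(y_true, y_pred))
```

Accuracy looks strong here while recall on the positive class is zero, which is exactly the mismatch a stakeholder would notice.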

Question 7

Describe what happens forward and backward through a single batch normalization layer during training. How does it behave differently at inference?

Accepted Answer

Forward: compute the batch mean and variance per feature, normalize inputs to zero mean and unit variance, then apply the learned scale (gamma) and shift (beta). Backward: gradients flow through the normalization, and because every example contributes to the batch mean and variance, the chain rule couples all examples in the batch; gamma and beta receive their own gradients. At inference: use the running mean and variance accumulated during training (not batch statistics), so single examples can be processed.

What interviewers look for: Concrete understanding of how BN works mechanically, not just 'it normalizes activations.' The training vs inference distinction is a common gotcha that reveals whether understanding is superficial.
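A compact NumPy sketch of the mechanics for a 2-D input of shape (batch, features) is shown below. The class name `BatchNorm1D`, the `momentum` convention for the running statistics, and the use of the biased batch variance are assumptions for illustration; frameworks differ in these details.

```python
import numpy as np


class BatchNorm1D:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # learned scale
        self.beta = np.zeros(num_features)   # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)  # batch statistics
            # Exponential moving averages, used later at inference.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        self.inv_std = 1.0 / np.sqrt(var + self.eps)
        self.x_hat = (x - mean) * self.inv_std
        return self.gamma * self.x_hat + self.beta

    def backward(self, dout):
        """Gradients for a preceding training-mode forward pass."""
        n = dout.shape[0]
        dgamma = (dout * self.x_hat).sum(axis=0)
        dbeta = dout.sum(axis=0)
        dx_hat = dout * self.gamma
        # The two sums over the batch are where examples get coupled:
        # they come from differentiating through the mean and variance.
        dx = (self.inv_std / n) * (
            n * dx_hat
            - dx_hat.sum(axis=0)
            - self.x_hat * (dx_hat * self.x_hat).sum(axis=0)
        )
        return dx, dgamma, dbeta
```

Calling `forward(x, training=False)` on a single row works because it never touches batch statistics, which is the inference behavior described above.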

Question 8

You're building a text classifier. Walk through your decision of whether to use TF-IDF + logistic regression versus fine-tuning a pre-trained transformer. What factors drive the decision?

Accepted Answer

Consider: dataset size (transformers need more data or strong pre-training signal), latency/compute constraints, interpretability requirements, and iteration speed. TF-IDF + LR is fast to train, easy to debug, and often competitive on short-text classification with limited data. Fine-tuning BERT-class models wins on complex tasks with sufficient data but costs more to serve. Start with the baseline, measure, then move to transformers if there's headroom.

What interviewers look for: Engineering pragmatism, meaning the ability to choose the right tool rather than always reaching for the most complex one. New grads often jump to transformers; interviewers want to see you justify the tradeoff.

Question 9

Walk me through the most technically interesting ML project you've worked on. What was the hardest debugging problem you hit?

Accepted Answer

Structure it: problem → data → model choice → training loop → evaluation → what went wrong and how you diagnosed it. Prepare one concrete failure: e.g., loss wasn't decreasing (found a bug in gradient flow), validation accuracy was much lower than training (found label leakage from a feature computed on the full dataset). Be specific about the fix.

What interviewers look for: Depth over breadth. Interviewers probe one project hard. They want to see that you understand every decision you made, not that you ran a notebook and got a number out. The debugging story is the most revealing part.

Question 10

Design a pipeline to train and serve a spam classifier for email. You have 1M labeled emails. Walk through data, training, evaluation, and deployment.

Accepted Answer

Data: deduplicate, clean HTML, and split train/val/test by time if possible, so the model is never evaluated on emails older than its training data (this avoids future leakage). Features: TF-IDF or embeddings. Model: logistic regression baseline, then gradient boosting or a fine-tuned model. Evaluation: precision/recall tradeoff, AUC. Deployment: containerized REST endpoint or batch scoring, with a threshold tuned to the business tolerance for false positives. Monitoring: track prediction distribution drift and run label spot-checks over time.

What interviewers look for: End-to-end thinking. The goal is not a perfect architecture but a structured walk through each stage. For new grads, temporal train/test split and drift monitoring are differentiators that show practical awareness.
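Two of the less obvious steps, the temporal split and the threshold tuned to a false-positive tolerance, can be sketched in plain Python. Everything named here is hypothetical: the record layout with a `"timestamp"` key, the function names `temporal_split` and `pick_threshold`, and the default fractions and precision target are assumptions for the example rather than part of any particular system.

```python
def temporal_split(emails, train_frac=0.8, val_frac=0.1):
    """Split records into train/val/test by time, oldest first."""
    ordered = sorted(emails, key=lambda e: e["timestamp"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    # Everything in val is newer than train; everything in test is newer than val.
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_val],
        ordered[n_train + n_val:],
    )


def pick_threshold(scores, labels, min_precision=0.99):
    """Lowest threshold whose precision on validation data meets the target.

    Returns None if no threshold reaches the target. Quadratic in the number
    of distinct scores, which is fine for a sketch but not for 1M emails.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        precision = sum(flagged) / len(flagged)  # labels are 0 (ham) or 1 (spam)
        if precision >= min_precision:
            best = t  # keep the lowest qualifying threshold seen so far
    return best
```

Choosing the lowest qualifying threshold maximizes recall subject to the precision constraint, which matches the usual spam-filter priority of rarely flagging legitimate mail.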

Question 11

Tell me about a time you implemented something that didn't work as expected. How did you figure out what was wrong?

Accepted Answer

Use a concrete example with a specific hypothesis-driven debugging process: (1) isolate the failure mode, (2) form a hypothesis, (3) design a minimal experiment to test it, (4) fix and verify. Avoid vague answers like 'I Googled it.' Good examples: wrong data preprocessing, incorrect loss function, shape mismatch that wasn't caught, or mislabeled data.

What interviewers look for: Scientific debugging instinct. The core job of an ML engineer is diagnosing why something doesn't work, and they want to see systematic thinking, not luck.

Question 12

Describe a project where you had to learn something technical outside your coursework quickly. How did you approach it?

Accepted Answer

Be specific about what you didn't know, how you identified credible resources (papers, docs, working code), how you built up understanding incrementally, and what you built to validate your understanding. Avoid 'I watched YouTube videos.' Good answers mention reading source code, reproducing a paper result, or building a small isolated prototype to test an assumption.

What interviewers look for: Self-directed learning ability and intellectual honesty about the limits of your knowledge. ML moves fast; interviewers care deeply that you can ramp on new techniques without hand-holding.

New Grad Machine Learning Engineer Interview Questions

What to expect

12 questions, with how to answer them

Study tips