New Grad Machine Learning Engineer Interview Questions

New grad ML engineer roles sit at the intersection of software engineering and applied machine learning — interviewers expect solid coding fundamentals, a working understanding of core ML concepts from coursework or projects, and the ability to implement models and pipelines from scratch. You won't be expected to have production ML system experience, but you must demonstrate that you can move from theory to working code and reason clearly about model behavior. Preparation should be split roughly equally between coding, ML fundamentals, and your own project work.

What to expect

Expect a 4–6 round loop. One or two rounds are LeetCode-style coding (arrays, strings, trees, graphs; medium difficulty). One round covers ML fundamentals: loss functions, optimization, and model evaluation. One round is an in-depth walkthrough of a project or Kaggle competition you've worked on. There is often a light system design or ML pipeline design round as well; it doesn't expect distributed-systems expertise, but it does expect you to think end-to-end about data → model → serving. Behavioral rounds are shorter than for senior roles, but interviewers will still probe for intellectual curiosity, how you debug when things go wrong, and how you handle ambiguity in research or project settings.

12 questions, with how to answer them

  1. Coding

    1. Implement k-means clustering from scratch using only NumPy. Handle edge cases like empty clusters.

    How to answer: Write the full loop: random centroid initialization, Euclidean distance computation via broadcasting, cluster assignment with argmin, centroid recomputation, and a convergence check on centroid movement. For empty clusters, re-initialize that centroid to a random data point. Walk through the complexity: O(n·k·d·iterations).

    What they look for: Clean NumPy usage without for-loops over data points, awareness of edge cases, and understanding that you're actually implementing a real ML primitive — not just pattern-matching to a LeetCode template.
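
    A minimal sketch of the loop described above, assuming X is an (n, d) array and k is at most n. The function name `kmeans` and the parameters `n_iters`, `tol`, and `seed` are illustrative choices, not part of the question.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Lloyd's algorithm sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        # Squared Euclidean distances via broadcasting: shape (n, k).
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # Empty cluster: re-seed the centroid at a random data point.
                new_centroids[j] = X[rng.integers(len(X))]
            else:
                new_centroids[j] = members.mean(axis=0)
        # Convergence check on total centroid movement.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Recompute labels so they match the returned centroids.
    return centroids, assign(centroids)
```

    The only Python-level loop over data is the per-cluster update, which runs k times per iteration. The broadcasted subtraction builds an (n, k, d) intermediate, so for large inputs one might expand the squared distance algebraically instead.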

  2. Coding

    2. Given a list of training labels and predicted probabilities, implement binary cross-entropy loss and its gradient with respect to the predictions.

    How to answer: BCE = -(1/n) * sum(y*log(p) + (1-y)*log(1-p)). Add epsilon to the log arguments (or clip p to [eps, 1-eps]) to avoid log(0). The gradient w.r.t. each prediction p_i is (p_i - y_i) / (n * p_i * (1 - p_i)). If p = sigmoid(z), the gradient w.r.t. the logit z_i simplifies to (p_i - y_i) / n, which is the cleaner form to derive and implement. Implement in NumPy and test on known values.

    What they look for: Whether you can connect the math to code, handle numerical instability, and actually verify your implementation — not just recite the formula.
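
    One possible NumPy version, assuming `y` holds 0/1 labels and `p` holds probabilities of the same length. It clips `p` rather than adding epsilon inside the log; either approach avoids log(0). The function name and the `eps` default are illustrative.

```python
import numpy as np

def bce_loss_and_grad(y, p, eps=1e-12):
    """Return (mean BCE loss, gradient of that loss w.r.t. p)."""
    y = np.asarray(y, dtype=float)
    # Clip so that log(p) and log(1 - p) stay finite.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    n = y.size
    loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # d(loss)/dp_i = (p_i - y_i) / (n * p_i * (1 - p_i))
    grad = (p - y) / (n * p * (1.0 - p))
    return loss, grad
```

    A quick check on known values: with y = [1, 0] and p = [0.5, 0.5], the loss is log(2) and the gradient is [-1, 1].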

  3. Coding

    3. Implement a function that performs train/validation/test split with stratification across class labels, without using sklearn.

    How to answer: Group indices by class label, shuffle within each class, apply the split ratios to each class independently, and concatenate the results. Handle edge cases: classes with only one sample, non-integer split sizes (use floor/ceil carefully so the three parts still sum to the class size).

    What they look for: Systematic thinking about data pipeline code, awareness that naive random split can produce unbalanced validation sets, and clean implementation over a dict-of-lists structure.
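
    A sketch of one way to do this with NumPy only. The name `stratified_split`, the default ratios, and the choice to give rounding remainders (and single-sample classes) to the training set are assumptions made for illustration.

```python
import numpy as np

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)  # shuffle within the class before slicing
        n = len(idx)
        n_val = int(np.floor(ratios[1] * n))
        n_test = int(np.floor(ratios[2] * n))
        # The remainder goes to train, so a one-sample class lands in train.
        n_train = n - n_val - n_test
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return (np.array(train, dtype=int),
            np.array(val, dtype=int),
            np.array(test, dtype=int))
```

    Flooring the validation and test sizes and assigning the remainder to train guarantees that every index appears in exactly one split.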

  4. ML Fundamentals

    4. Explain the bias-variance tradeoff and describe one concrete technique that reduces each. How does this manifest when you increase the depth of a decision tree?

    How to answer: Bias = error from wrong assumptions in the model; variance = error from sensitivity to training data fluctuations. Regularization (L2, dropout, pruning) reduces variance. Better features or more model capacity reduces bias. A shallow tree has high bias (underfits); a fully-grown tree has high variance (memorizes training data). Pruning or ensemble methods (bagging) target the variance side.

    What they look for: Precise definitions, not just 'underfitting vs overfitting.' Ability to name specific mechanisms and connect the tradeoff to a real model family.
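
    To make the depth effect concrete, the sketch below estimates squared bias and variance empirically. It is not a real decision tree: it uses a piecewise-constant regressor with 2**depth equal-width bins on [0, 1) as a simplified stand-in for a depth-limited regression tree. The target function, noise level, and all names are assumptions chosen for illustration.

```python
import numpy as np

def fit_predict_bins(x_train, y_train, x_test, depth):
    """Predict the mean training target in each of 2**depth equal-width bins."""
    n_bins = 2 ** depth
    train_bin = np.minimum((x_train * n_bins).astype(int), n_bins - 1)
    test_bin = np.minimum((x_test * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(train_bin, weights=y_train, minlength=n_bins)
    counts = np.bincount(train_bin, minlength=n_bins)
    means = sums / np.maximum(counts, 1)  # empty bins predict 0
    return means[test_bin]

def bias_variance(depth, n_train=128, n_repeats=200, noise=0.3, seed=0):
    """Estimate (bias^2, variance) by refitting on many noisy training sets."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)           # assumed true function
    x_train = (np.arange(n_train) + 0.5) / n_train  # fixed, evenly spaced inputs
    x_test = np.linspace(0.0, 1.0, 200, endpoint=False)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0.0, noise, n_train)
        preds[r] = fit_predict_bins(x_train, y_train, x_test, depth)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance
```

    Under these assumptions, comparing `bias_variance(1)` with `bias_variance(5)` should show the shallow model with the larger squared bias and the deep model with the larger variance, since each deep bin averages far fewer noisy points.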

  5. ML Fundamentals

    5. Why does gradient descent with a learning rate that's too large fail to converge, and what does the loss curve look like? What are three strategies to set the learning rate?

    How to answer: Each update moves the parameters by LR × gradient, so a too-large LR steps past the minimum; once the step is large relative to the local curvature, each update lands farther from the minimum than the last, and the loss oscillates or diverges. The loss curve will bounce erratically or spike upward. Strategies: (1) learning rate range test / LR finder, (2) learning rate schedules (cosine annealing, step decay), (3) adaptive optimizers like Adam that maintain per-parameter learning rates. Mention that Adam is less sensitive to initial LR but still has a useful range.

    What they look for: Mechanistic understanding of gradient descent geometry, ability to diagnose from a loss curve, and awareness of practical tooling beyond 'just try 0.001.'
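
    A toy illustration on the one-dimensional quadratic loss 0.5 * a * w**2, where the update is w <- (1 - lr * a) * w and gradient descent converges only if lr < 2 / a. The function name `gd_losses` and the default curvature are assumptions for the example.

```python
import numpy as np

def gd_losses(lr, curvature=10.0, w0=1.0, steps=20):
    """Run gradient descent on 0.5 * curvature * w**2 and record the loss."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w ** 2)
        w -= lr * curvature * w  # gradient of the loss is curvature * w
    return np.array(losses)

stable = gd_losses(lr=0.05)    # |1 - lr * a| = 0.5: loss shrinks every step
diverging = gd_losses(lr=0.25)  # |1 - lr * a| = 1.5: loss grows every step
```

    With curvature 10 the threshold is lr = 0.2. Just below it (for example lr = 0.19) the iterate flips sign on every step but the loss still decreases slowly, which is the oscillating regime interviewers often ask about.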

  6. ML Fundamentals

    6. You train a model and get 95% accuracy on a binary classification problem, but your stakeholder is unhappy. What questions do you ask and what metrics would you look at instead?

    How to answer: First ask about class imbalance — if 95% of samples are class 0, a trivial classifier achieves 95%. Switch to precision, recall, F1, or AUC-ROC depending on the cost asymmetry. Ask what the business cost of a false positive vs false negative is. Look at the confusion matrix. Discuss calibration if the model outputs probabilities.

    What they look for: Whether you reflexively question the metric rather than accepting accuracy at face value — this is one of the most practically important ML instincts and new grads often miss it.
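
    The 95% trap can be shown in a few lines. The helper below is a sketch that computes confusion-matrix counts and the derived metrics from hard 0/1 predictions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / y_true.size,
            "precision": precision, "recall": recall, "f1": f1}

# 95 negatives, 5 positives, and a trivial classifier that always predicts 0.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
metrics = binary_metrics(y_true, y_pred)
```

    Here accuracy is 0.95 while recall is 0.0: the classifier never finds a single positive, which is exactly what the stakeholder is likely to care about.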

  7. ML Depth

    7. Describe what happens forward and backward through a single batch normalization layer during training. How does it behave differently at inference?

    How to answer: Forward: compute the per-feature batch mean and variance, normalize inputs to zero mean/unit variance (with a small epsilon added to the variance for numerical stability), then apply the learned scale (gamma) and shift (beta). Backward: gradients flow through the normalization; because the mean and variance depend on every example, the chain rule couples all examples in the batch. At inference: use the running mean/variance accumulated during training (not batch statistics), so the output for one example does not depend on the rest of the batch and single examples can be processed.

    What they look for: Concrete understanding of how BN works mechanically, not just 'it normalizes activations.' The training vs inference distinction is a common gotcha that reveals whether understanding is superficial.
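
    A compact NumPy sketch of a batch norm layer over (batch, features) inputs. The class name, the exponential-moving-average update with `momentum=0.1`, and the use of the biased batch variance for the running estimate are assumptions; frameworks differ in these details.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # learned scale
        self.beta = np.zeros(num_features)   # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            mean = x.mean(axis=0)
            var = x.var(axis=0)  # biased batch variance
            m = self.momentum
            # Update running statistics for later use at inference.
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        self.std = np.sqrt(var + self.eps)
        self.x_hat = (x - mean) / self.std
        return self.gamma * self.x_hat + self.beta

    def backward(self, dout):
        """Gradient w.r.t. x for the most recent training-mode forward pass."""
        n = dout.shape[0]
        self.dgamma = (dout * self.x_hat).sum(axis=0)
        self.dbeta = dout.sum(axis=0)
        dx_hat = dout * self.gamma
        # The two subtracted terms come from the batch mean and variance,
        # which is what couples every example in the batch.
        return (n * dx_hat - dx_hat.sum(axis=0)
                - self.x_hat * (dx_hat * self.x_hat).sum(axis=0)) / (n * self.std)
```

    One sanity check that follows from the math: because the training-mode output is unchanged when a constant is added to every example, the input gradient sums to zero over the batch dimension.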

  8. ML Depth

    8. You're building a text classifier. Walk through your decision of whether to use TF-IDF + logistic regression versus fine-tuning a pre-trained transformer. What factors drive the decision?

    How to answer: Consider: dataset size (transformers need more data or strong pre-training signal), latency/compute constraints, interpretability requirements, and iteration speed. TF-IDF + LR is fast to train, easy to debug, and often competitive on short-text classification with limited data. Fine-tuning BERT-class models wins on complex tasks with sufficient data but costs more to serve. Start with the baseline, measure, then move to transformers if there's headroom.

    What they look for: Engineering pragmatism — the ability to choose the right tool rather than always reaching for the most complex one. New grads often jump to transformers; interviewers want to see you justify the tradeoff.

  9. Project Deep Dive

    9. Walk me through the most technically interesting ML project you've worked on. What was the hardest debugging problem you hit?

    How to answer: Structure it: problem → data → model choice → training loop → evaluation → what went wrong and how you diagnosed it. Prepare one concrete failure: e.g., loss wasn't decreasing (found a bug in gradient flow), validation accuracy was much lower than training (found label leakage from a feature computed on the full dataset). Be specific about the fix.

    What they look for: Depth over breadth. Interviewers probe one project hard. They want to see that you understand every decision you made, not that you ran a notebook and got a number out. The debugging story is the most revealing part.

  10. ML System Design (Light)

    10. Design a pipeline to train and serve a spam classifier for email. You have 1M labeled emails. Walk through data, training, evaluation, and deployment.

    How to answer: Data: deduplicate, clean HTML, split train/val/test with temporal split if possible (avoid future leakage). Features: TF-IDF or embeddings. Model: logistic regression baseline, then gradient boosting or fine-tuned model. Evaluation: precision/recall tradeoff, AUC. Deployment: containerized REST endpoint or batch scoring, with a threshold tuned to business tolerance for false positives. Monitoring: track prediction distribution drift and label spot-checks over time.

    What they look for: End-to-end thinking — not a perfect architecture, but a structured walk through each stage. For new grads, temporal train/test split and drift monitoring are differentiators that show practical awareness.
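
    Two of the stages above are easy to sketch in NumPy: the temporal split and the threshold choice. Both function names, the 80/10/10 fractions, and the rule "pick the lowest threshold that still meets a target precision on validation data" are assumptions for illustration; the threshold helper also assumes scores are distinct.

```python
import numpy as np

def temporal_split(timestamps, train_frac=0.8, val_frac=0.1):
    """Oldest emails go to train, the next slice to val, the newest to test."""
    order = np.argsort(timestamps, kind="stable")
    n = len(order)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def threshold_for_precision(y_true, scores, min_precision):
    """Lowest score cutoff (predict spam if score >= cutoff) whose precision
    meets min_precision on the given data; None if no cutoff qualifies."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    tp = np.cumsum(y_true[order])
    precision = tp / np.arange(1, len(order) + 1)
    ok = np.flatnonzero(precision >= min_precision)
    if ok.size == 0:
        return None
    return float(scores[order][ok[-1]])
```

    Choosing the lowest qualifying cutoff maximizes recall subject to the precision constraint, which matches the usual spam-filter priority of rarely flagging legitimate mail.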

  11. Behavioral

    11. Tell me about a time you implemented something that didn't work as expected. How did you figure out what was wrong?

    How to answer: Use a concrete example with a specific hypothesis-driven debugging process: (1) isolate the failure mode, (2) form a hypothesis, (3) design a minimal experiment to test it, (4) fix and verify. Avoid vague answers like 'I Googled it.' Good examples: wrong data preprocessing, incorrect loss function, shape mismatch that wasn't caught, or mislabeled data.

    What they look for: Scientific debugging instinct — the core job of an ML engineer is diagnosing why something doesn't work. They want to see systematic thinking, not luck.

  12. Behavioral

    12. Describe a project where you had to learn something technical outside your coursework quickly. How did you approach it?

    How to answer: Be specific about what you didn't know, how you identified credible resources (papers, docs, working code), how you built up understanding incrementally, and what you built to validate your understanding. Avoid 'I watched YouTube videos.' Good answers mention reading source code, reproducing a paper result, or building a small isolated prototype to test an assumption.

    What they look for: Self-directed learning ability and intellectual honesty about the limits of your knowledge. ML moves fast; interviewers care deeply that you can ramp on new techniques without hand-holding.

Study tips

  • Implement core ML algorithms from scratch in NumPy: logistic regression, k-means, a two-layer neural network with backprop. You will be asked to do this live, and reciting sklearn APIs is not sufficient.
  • Prepare one project to defend at depth rather than five projects superficially. Expect 20+ minutes of drilling into a single project — know every hyperparameter choice, every data decision, and every failure you hit.
  • Practice diagnosing ML failures by deliberately breaking your own models: remove normalization, introduce label leakage, use wrong loss functions, then fix them. This builds the debugging intuition interviewers actually test.
  • Understand the math behind the top 3–4 algorithms you claim to know. If your resume says 'transformer,' be ready to explain attention, why softmax is used, and what happens to gradients with and without layer norm.
  • For ML system design, default to a simpler baseline model and justify when complexity is warranted — interviewers at this level are specifically checking whether you have the discipline to not over-engineer.
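
  As a starting point for the from-scratch practice suggested in the first tip, here is a minimal logistic regression trained with full-batch gradient descent. The function names, learning rate, and step count are illustrative assumptions, and the sketch omits regularization and a numerically stable sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Fit weights w and bias b by gradient descent on mean cross-entropy."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        # Gradient of mean BCE w.r.t. the logits is (p - y) / n.
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

  A useful exercise is to break this deliberately, for example by flipping the sign of the update or dropping the division by n, and then diagnose the failure from the loss curve.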
