Interview questions
New Grad Data Engineer Interview Questions
New grad data engineering interviews test foundational SQL, basic pipeline thinking, and Python/coding fundamentals — not distributed systems mastery. Interviewers know you haven't built a production Kafka cluster, but they expect you to reason clearly about data movement, transformation logic, and tradeoffs between simple solutions. The bar is: can you write correct SQL, think about data quality, and learn quickly?
What to expect
Expect a 3–5 round loop: one or two SQL/coding rounds (heavy SQL, some Python), one system design or pipeline design round calibrated to entry-level scope (e.g., design a simple ETL pipeline, not a real-time streaming platform), one behavioral round, and sometimes a take-home involving a small dataset transformation or data quality analysis. The system design round will look more like 'walk me through how you'd move data from a REST API into a reporting table' than full distributed system design. Behavioral questions are lighter than senior roles but still matter — teams want to see intellectual curiosity and honest self-awareness about gaps.
These are the questions every Data Engineer gets.
Get questions tailored to your experience, answer them, and get honest feedback — free, no credit card.
Run a free fit check →12 questions, with how to answer them
SQL
1. Write a query to find the top 3 products by revenue in each category for the past 30 days, given a sales table with columns: sale_id, product_id, category, revenue, sale_date.
How to answer: Use a window function: ROW_NUMBER() or RANK() OVER (PARTITION BY category ORDER BY SUM(revenue) DESC). First aggregate revenue per product/category with a WHERE sale_date >= CURRENT_DATE - 30 GROUP BY, then apply the window function in a subquery or CTE, then filter WHERE rank <= 3. Walk through whether RANK vs DENSE_RANK matters if ties exist.
What they look for: Correct use of window functions and CTEs. Whether you remember to filter by date before ranking, not after. Bonus: noticing the tie-breaking edge case. This is the single most tested SQL pattern for new grad DE roles.
SQL
2. What is the difference between a LEFT JOIN and an INNER JOIN? Give an example where using the wrong one would silently corrupt a downstream metric.
How to answer: Define both precisely. Then give a concrete scenario: a daily_active_users table joined to a revenue table — using INNER JOIN drops users with zero revenue, making your DAU count artificially low in a dashboard. This is a data correctness trap, not just a performance question.
What they look for: Whether you understand join semantics at the 'data correctness' level, not just the 'what rows come back' level. Interviewers want to see you connect join choice to downstream metric integrity, which is core DE responsibility.
SQL
3. You have a table of user events (user_id, event_type, event_timestamp). Write a query to find each user's first and last event timestamp.
How to answer: Use MIN(event_timestamp) and MAX(event_timestamp) with GROUP BY user_id. A stronger answer also uses FIRST_VALUE / LAST_VALUE window functions and explains when you'd prefer one approach over the other (aggregation for simple cases, window functions when you need to retain other columns from the same row).
What they look for: Basic aggregation correctness. Awareness that window functions exist. If you jump straight to a correlated subquery, that signals you haven't internalized set-based thinking, which is a flag.
Python / Coding
4. Write a Python function that reads a CSV file, removes rows where any column is null, and writes the cleaned output to a new CSV. Handle the case where the input file doesn't exist.
How to answer: Use pandas: pd.read_csv(), df.dropna(), df.to_csv(). Wrap in a try/except FileNotFoundError. A stronger answer also discusses what 'null' means in CSV context (empty string vs NaN) and whether dropna should be any or all. Keep it simple and correct — don't over-engineer.
What they look for: Clean, correct Python. Proper error handling without being asked. Awareness of the 'empty string vs NaN' ambiguity shows real data intuition. They are not expecting Spark — pandas is the right tool here.
Python / Coding
5. Given a list of dictionaries representing log records [{user_id, action, timestamp}], write Python code to group them by user_id and count the number of each action per user.
How to answer: Use collections.defaultdict or Counter. Iterate once through the list, building a nested dict: {user_id: {action: count}}. Alternatively use pandas groupby if you'd load this into a DataFrame. Mention time complexity: O(n) for the dict approach.
What they look for: Correct use of Python data structures. O(n) awareness. Whether you can cleanly transform raw records — this mirrors what ETL code actually does at small scale.
Data Modeling
6. You're building a table to track daily user subscription status (active/cancelled/paused). A user's status can change over time. How would you model this?
How to answer: Explain the slowly changing dimension (SCD) concept even if you don't know the term. Option 1: one row per user (overwrite current status — loses history). Option 2: one row per status change with effective_date and end_date — preserves history, enables point-in-time queries. Recommend option 2 and explain how to query 'status as of a given date' with a BETWEEN or date range filter.
What they look for: Whether you intuitively reach for history-preserving design. Knowledge of SCD Type 2 is a bonus but not required at new grad level. The key signal is: do you think about 'what question will this table need to answer in 6 months?'
Data Modeling
7. What is the difference between a fact table and a dimension table in a star schema? Give a concrete example.
How to answer: Fact table: stores measurable, quantitative events (e.g., order_id, customer_id, product_id, amount, order_date). Dimension table: stores descriptive attributes (e.g., customers: customer_id, name, city, segment). Explain how you join them: fact table has foreign keys to dimension tables, enabling slicing metrics by attributes. Give one example query that demonstrates the join.
What they look for: Fundamental data warehouse literacy. New grad DEs are often expected to work inside existing warehouse schemas — you need to know what you're looking at. This is a filter question; a blank answer is a red flag.
Pipeline Design
8. Walk me through how you would build a simple daily ETL pipeline that pulls data from a REST API, transforms it, and loads it into a Postgres table.
How to answer: Structure as Extract → Transform → Load. Extract: call the API with requests, handle pagination and HTTP errors, store raw JSON. Transform: parse response, validate schema, handle nulls/type mismatches. Load: use psycopg2 or SQLAlchemy, use INSERT ... ON CONFLICT for idempotency. Schedule with cron or Airflow. Mention: what happens if the API is down? What happens if the job runs twice? (idempotency).
What they look for: Whether you naturally think about failure modes and idempotency without being prompted. Knowing the ETL pattern end-to-end at a practical level. This is the most common real task for a new grad DE and the interview expects you to have thought it through.
Pipeline Design
9. What is idempotency in the context of a data pipeline, and why does it matter?
How to answer: Define it: running the pipeline multiple times with the same input produces the same output — no duplicate records, no data loss. Explain why: schedulers retry failed jobs; network issues cause double-triggers. Give a concrete fix: use INSERT ... ON CONFLICT DO NOTHING or UPSERT logic, or use a staging table + MERGE. Contrast with a naive append-only approach that creates duplicates on retry.
What they look for: Whether you've internalized that pipelines fail and retry. This is a real-world engineering concern that distinguishes a DE who's thought about production vs one who's only written scripts. Expecting concrete mechanism, not just the abstract concept.
Data Quality
10. You notice that the row count in your daily pipeline output dropped by 40% compared to yesterday. Walk me through how you'd investigate.
How to answer: Structure your investigation: 1) Check if the source data volume actually dropped (upstream issue vs pipeline issue). 2) Check pipeline logs for errors, filter conditions, or schema changes. 3) Check if a JOIN is unexpectedly dropping rows (inner join on a new column with nulls). 4) Check if a date filter has an off-by-one bug. 5) Check if upstream table was truncated or a partition was missing. Prioritize source vs pipeline vs query logic in that order.
What they look for: Systematic debugging instinct. Whether you distinguish between 'the data source changed' and 'my pipeline broke' — these have very different remedies. Organized, step-by-step thinking under ambiguity is the core signal.
Behavioral
11. Tell me about a project (academic, internship, or personal) where you had to work with messy or incomplete data. What did you do?
How to answer: Use STAR structure: Situation (what was the dataset/project), Task (what you needed to produce), Action (how you identified and handled the mess — missing values, schema inconsistencies, duplicates), Result (what you shipped and what you learned). Be specific about the actual data problem and your solution, not vague about 'I cleaned the data.'
What they look for: Genuine hands-on experience with real data pain. They're not expecting a massive project — a class project or Kaggle dataset is fine. The signal is: do you talk concretely about the mess, or do you hand-wave? Specificity = credibility.
Behavioral
12. You're given a task but the requirements are ambiguous — the stakeholder isn't sure exactly what they want. How do you handle it?
How to answer: Describe a concrete approach: ask clarifying questions upfront (what decision will this data inform? what's the time range?), propose a small scoped deliverable to align on direction before building the full thing, check back early with a sample output. Mention that in data work, building the wrong thing is expensive to undo because downstream dashboards depend on your schema.
What they look for: Communication maturity and proactiveness. New grad DEs often jump into coding before fully understanding the ask. Interviewers want to see you've thought about the stakeholder relationship, not just the technical execution. Shows you'll be low-maintenance on a real team.
Study tips
- Practice SQL window functions until they're automatic — ROW_NUMBER, RANK, LAG, LEAD, and SUM OVER PARTITION are tested constantly and most candidates fumble the PARTITION BY / ORDER BY syntax under pressure. Do 20–30 problems on StrataScratch or LeetCode's database section at the medium level.
- Study idempotency and exactly-once semantics concretely: know what UPSERT / INSERT ON CONFLICT does in Postgres and why append-only pipelines are dangerous. This question separates new grads who've thought about production from those who haven't.
- Be able to draw and explain a star schema from memory with a realistic example. Data warehouse fundamentals (fact/dim tables, grain, SCD Type 2) are often assumed knowledge that new grads underestimate — a 2-hour read of 'The Data Warehouse Toolkit' chapter 1 pays off disproportionately.
- For behavioral questions, pick 2–3 real projects and prepare them deeply rather than having shallow answers for 10 projects. At new grad level, interviewers expect modest scope — they're evaluating how clearly and honestly you describe what you actually did, not the impressiveness of the project.
- Don't neglect Python data manipulation fundamentals: pandas read/write, groupby, merge, and handling nulls. You don't need PySpark for new grad roles, but you should be able to write a clean 20-line transformation script without Googling every method.
Practice these against your own résumé
Get questions tailored to your experience, answer them, and get honest feedback — free, no credit card.
Run a free fit check →