Question 1

Write a query to find the top 3 products by revenue in each category for the past 30 days, given a sales table with columns: sale_id, product_id, category, revenue, sale_date.

Accepted Answer

Aggregate first, then rank. In a CTE or subquery, filter with WHERE sale_date >= CURRENT_DATE - 30 (the exact date arithmetic varies by SQL dialect) and GROUP BY category, product_id to get SUM(revenue) per product. Then apply ROW_NUMBER() or RANK() OVER (PARTITION BY category ORDER BY total_revenue DESC) and filter on its alias in an outer query, for example WHERE rn <= 3; a window function's result cannot be referenced in the WHERE clause of the query that computes it. Walk through how ties change the result: ROW_NUMBER returns exactly three rows per category, while RANK and DENSE_RANK can return more when products tie.

What interviewers look for: Correct use of window functions and CTEs. Whether you remember to filter by date before ranking, not after. Bonus: noticing the tie-breaking edge case. This is one of the most frequently tested SQL patterns for new grad DE roles.
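The aggregate-then-rank pattern can be sketched as follows. This is a minimal illustration, not a reference solution: it runs the SQL through Python's built-in sqlite3 module (assuming a bundled SQLite of 3.25 or later, which is needed for window functions), passes the 30-day cutoff as a parameter so the sample data stays deterministic, and uses made-up sample rows. The names top_products, total_revenue, and rn are illustrative.

```python
import sqlite3

TOP_N_SQL = """
WITH product_revenue AS (
    SELECT category, product_id, SUM(revenue) AS total_revenue
    FROM sales
    WHERE sale_date >= :cutoff            -- filter by date BEFORE ranking
    GROUP BY category, product_id
),
ranked AS (
    SELECT category, product_id, total_revenue,
           ROW_NUMBER() OVER (
               PARTITION BY category
               ORDER BY total_revenue DESC, product_id  -- product_id breaks ties
           ) AS rn
    FROM product_revenue
)
SELECT category, product_id, total_revenue
FROM ranked
WHERE rn <= 3
ORDER BY category, rn
"""


def top_products(conn, cutoff):
    """Return the top 3 products per category for sales on or after cutoff."""
    return conn.execute(TOP_N_SQL, {"cutoff": cutoff}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER, product_id INTEGER, "
    "category TEXT, revenue REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "toys", 50.0, "2024-06-20"),
        (2, 101, "toys", 25.0, "2024-06-21"),
        (3, 102, "toys", 60.0, "2024-06-22"),
        (4, 103, "toys", 10.0, "2024-06-23"),
        (5, 104, "toys", 5.0, "2024-06-24"),
        (6, 105, "toys", 999.0, "2024-01-01"),  # outside the date window
        (7, 201, "books", 30.0, "2024-06-25"),
    ],
)
result = top_products(conn, "2024-06-01")
```

Swapping ROW_NUMBER() for RANK() or DENSE_RANK() in the ranked CTE is the only change needed to return tied products together.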

Question 2

What is the difference between a LEFT JOIN and an INNER JOIN? Give an example where using the wrong one would silently corrupt a downstream metric.

Accepted Answer

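Define both precisely: an INNER JOIN returns only rows that have a match in both tables, while a LEFT JOIN returns every row from the left table and fills the right table's columns with NULL where no match exists. Then give a concrete scenario: a daily_active_users table joined to a revenue table. An INNER JOIN drops every user with no matching revenue row, so a DAU count computed from the joined result comes out artificially low on the dashboard, and no error is raised. This is a data correctness trap, not just a performance question.

What interviewers look for: Whether you understand join semantics at the 'data correctness' level, not just the 'what rows come back' level. Interviewers want to see you connect join choice to downstream metric integrity, which is a core DE responsibility.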
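A small sketch of the DAU scenario, using Python's built-in sqlite3 module. The column names (activity_date, user_id, amount) and the sample rows are assumptions made for illustration; only the two table names come from the scenario above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_active_users (activity_date TEXT, user_id INTEGER);
CREATE TABLE revenue (activity_date TEXT, user_id INTEGER, amount REAL);
-- Three active users; only user 1 spent money (two purchases).
INSERT INTO daily_active_users VALUES
    ('2024-06-01', 1), ('2024-06-01', 2), ('2024-06-01', 3);
INSERT INTO revenue VALUES
    ('2024-06-01', 1, 9.99), ('2024-06-01', 1, 4.99);
""")

# COUNT(DISTINCT ...) guards against the row fan-out caused by user 1
# having two revenue rows.
DAU_SQL = """
SELECT COUNT(DISTINCT d.user_id) AS dau,
       COALESCE(SUM(r.amount), 0) AS revenue
FROM daily_active_users AS d
{join} revenue AS r
  ON r.user_id = d.user_id AND r.activity_date = d.activity_date
WHERE d.activity_date = '2024-06-01'
"""

inner_dau, inner_rev = conn.execute(DAU_SQL.format(join="INNER JOIN")).fetchone()
left_dau, left_rev = conn.execute(DAU_SQL.format(join="LEFT JOIN")).fetchone()

print(f"INNER JOIN: dau={inner_dau}, revenue={inner_rev:.2f}")
print(f"LEFT JOIN:  dau={left_dau}, revenue={left_rev:.2f}")
```

Both queries report the same revenue, which is why the bug is silent: only the user count differs, and nothing fails.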

Question 3

You have a table of user events (user_id, event_type, event_timestamp). Write a query to find each user's first and last event timestamp.

Accepted Answer

Use MIN(event_timestamp) and MAX(event_timestamp) with GROUP BY user_id. A stronger answer also mentions the FIRST_VALUE / LAST_VALUE window functions and explains when each approach fits: aggregation for the simple case, window functions when you need to retain other columns from the same row, such as the event_type of the first event. Note that with an ORDER BY and the default window frame, LAST_VALUE only looks as far as the current row, so it needs an explicit frame such as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

What interviewers look for: Basic aggregation correctness. Awareness that window functions exist. Jumping straight to a correlated subquery signals that you haven't internalized set-based thinking, which is a red flag.
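Both approaches can be compared side by side. The sketch below runs them through Python's built-in sqlite3 module (assuming SQLite 3.25 or later for window functions) on made-up sample rows; the output column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event_type TEXT, event_timestamp TEXT);
INSERT INTO events VALUES
    (1, 'signup',   '2024-06-01 09:00:00'),
    (1, 'click',    '2024-06-01 09:05:00'),
    (1, 'purchase', '2024-06-02 10:00:00'),
    (2, 'signup',   '2024-06-03 08:00:00');
""")

# Simple case: only the timestamps are needed, so plain aggregation is enough.
AGG_SQL = """
SELECT user_id,
       MIN(event_timestamp) AS first_ts,
       MAX(event_timestamp) AS last_ts
FROM events
GROUP BY user_id
ORDER BY user_id
"""

# Window version: also recovers the event_type of the first and last event.
# The explicit frame matters: without it LAST_VALUE stops at the current row.
WINDOW_SQL = """
SELECT DISTINCT user_id,
       FIRST_VALUE(event_type) OVER w AS first_event_type,
       LAST_VALUE(event_type)  OVER w AS last_event_type
FROM events
WINDOW w AS (
    PARTITION BY user_id
    ORDER BY event_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
ORDER BY user_id
"""

first_last_ts = conn.execute(AGG_SQL).fetchall()
first_last_type = conn.execute(WINDOW_SQL).fetchall()
```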

Question 4

Write a Python function that reads a CSV file, removes rows where any column is null, and writes the cleaned output to a new CSV. Handle the case where the input file doesn't exist.

Accepted Answer

Use pandas: pd.read_csv(), df.dropna(), df.to_csv(index=False). Wrap the read in try/except FileNotFoundError. A stronger answer also discusses what 'null' means in a CSV: by default read_csv parses empty fields and markers such as NA or NULL as NaN, but a whitespace-only text field is kept as a string and survives dropna. It also states whether dropna should use how='any' (the default, which matches the question) or how='all'. Keep it simple and correct; don't over-engineer.

What interviewers look for: Clean, correct Python. Proper error handling without being asked. Awareness of the 'empty string vs NaN' ambiguity shows real data intuition. They are not expecting Spark; pandas is the right tool here.
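One way the function might look is sketched below. It assumes pandas is installed, and it makes one choice the question leaves open: whitespace-only fields are treated as null. The function name clean_csv, its return value, and the sample file contents are illustrative.

```python
import os
import tempfile

import pandas as pd


def clean_csv(input_path, output_path):
    """Write rows with no null values to output_path.

    Returns the number of rows written, or None if the input file is missing.
    """
    try:
        df = pd.read_csv(input_path)
    except FileNotFoundError:
        print(f"Input file not found: {input_path}")
        return None

    # Assumption: whitespace-only fields count as null too.
    df = df.replace(r"^\s*$", pd.NA, regex=True)
    cleaned = df.dropna(how="any")  # drop a row if ANY column is null
    cleaned.to_csv(output_path, index=False)
    return len(cleaned)


# Usage on a throwaway file: row 2 has an empty name, row 3 only spaces.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.csv")
    dst = os.path.join(tmp, "clean.csv")
    with open(src, "w") as f:
        f.write("id,name\n1,alice\n2,\n3,   \n4,dave\n")

    written = clean_csv(src, dst)
    kept_ids = pd.read_csv(dst)["id"].tolist()
    missing = clean_csv(os.path.join(tmp, "does_not_exist.csv"), dst)
```

Returning None for a missing file is only one option; re-raising with a clearer message or logging and exiting non-zero are equally defensible, as long as the choice is stated.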

Question 5

Given a list of dictionaries representing log records [{user_id, action, timestamp}], write Python code to group them by user_id and count the number of each action per user.

Accepted Answer

Use collections.defaultdict with Counter as the default factory. Iterate once through the list, building a nested mapping: {user_id: {action: count}}. Alternatively, use pandas groupby if the records are already in a DataFrame. Mention time complexity: O(n) for the dict approach, where n is the number of records.

What interviewers look for: Correct use of Python data structures. O(n) awareness. Whether you can cleanly transform raw records, since this mirrors what ETL code actually does at small scale.
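A minimal sketch of the single-pass approach. The function name count_actions and the sample log records are illustrative.

```python
from collections import Counter, defaultdict


def count_actions(records):
    """Return {user_id: {action: count}} in one pass over the records."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["user_id"]][rec["action"]] += 1
    # Convert to plain dicts so the result prints and serializes cleanly.
    return {user: dict(actions) for user, actions in counts.items()}


logs = [
    {"user_id": "u1", "action": "login", "timestamp": "2024-06-01T09:00:00"},
    {"user_id": "u1", "action": "click", "timestamp": "2024-06-01T09:01:00"},
    {"user_id": "u2", "action": "login", "timestamp": "2024-06-01T09:02:00"},
    {"user_id": "u1", "action": "click", "timestamp": "2024-06-01T09:03:00"},
]
action_counts = count_actions(logs)
```

Each record is touched once and every dictionary update is constant time on average, which is where the O(n) bound comes from.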

Question 6

You're building a table to track daily user subscription status (active/cancelled/paused). A user's status can change over time. How would you model this?

Accepted Answer

Explain the slowly changing dimension (SCD) concept even if you don't know the term. Option 1: one row per user, overwriting the current status, which loses history. Option 2: one row per status change with effective_date and end_date, which preserves history and enables point-in-time queries. Recommend option 2 and explain how to query 'status as of a given date' with a date range filter. State how the current row is marked (a NULL or far-future end_date) and whether end_date is inclusive: BETWEEN includes both bounds, so a date on which one row ends and the next begins can match two rows unless the ranges are half-open (effective_date <= d AND d < end_date).

What interviewers look for: Whether you intuitively reach for history-preserving design. Knowledge of SCD Type 2 is a bonus but not required at new grad level. The key signal is: do you think about 'what question will this table need to answer in 6 months?'
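Option 2 and its point-in-time query can be sketched as follows, again through Python's built-in sqlite3 module. The table name subscription_status_history, the half-open date convention, the NULL end_date for the current row, and the sample data are assumptions chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscription_status_history (
    user_id        INTEGER NOT NULL,
    status         TEXT    NOT NULL
                   CHECK (status IN ('active', 'cancelled', 'paused')),
    effective_date TEXT    NOT NULL,  -- inclusive start
    end_date       TEXT,              -- exclusive end; NULL marks the current row
    PRIMARY KEY (user_id, effective_date)
);
INSERT INTO subscription_status_history VALUES
    (1, 'active',    '2024-01-01', '2024-03-01'),
    (1, 'paused',    '2024-03-01', '2024-04-15'),
    (1, 'cancelled', '2024-04-15', NULL);
""")

# Half-open range: a boundary date belongs to exactly one row.
AS_OF_SQL = """
SELECT status
FROM subscription_status_history
WHERE user_id = :user_id
  AND effective_date <= :as_of
  AND (end_date IS NULL OR :as_of < end_date)
"""


def status_as_of(conn, user_id, as_of):
    """Return the user's status on the given ISO date, or None if unknown."""
    row = conn.execute(AS_OF_SQL, {"user_id": user_id, "as_of": as_of}).fetchone()
    return row[0] if row else None


print(status_as_of(conn, 1, "2024-02-10"))
print(status_as_of(conn, 1, "2024-03-01"))
```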

Question 7

What is the difference between a fact table and a dimension table in a star schema? Give a concrete example.

Accepted Answer

Fact table: stores measurable, quantitative events, one row per event (e.g., orders: order_id, customer_id, product_id, amount, order_date). Dimension table: stores descriptive attributes (e.g., customers: customer_id, name, city, segment). Explain how you join them: the fact table holds foreign keys to the dimension tables, which lets you slice metrics by attributes. Give one example query that demonstrates the join.

What interviewers look for: Fundamental data warehouse literacy. New grad DEs are often expected to work inside existing warehouse schemas, so you need to know what you're looking at. This is a filter question; a blank answer is a red flag.
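The example query could look like the sketch below, which slices order revenue by customer segment. It uses Python's built-in sqlite3 module and the example columns named above; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: descriptive attributes, one row per customer.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, city TEXT, segment TEXT
);
-- Fact: one row per order, with a foreign key to the dimension.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    product_id  INTEGER,
    amount      REAL,
    order_date  TEXT
);
INSERT INTO customers VALUES
    (1, 'Ana', 'Lisbon', 'consumer'),
    (2, 'Ben', 'Austin', 'business');
INSERT INTO orders VALUES
    (1, 1, 10, 20.0,  '2024-06-01'),
    (2, 1, 11, 30.0,  '2024-06-02'),
    (3, 2, 10, 100.0, '2024-06-02');
""")

# Measures come from the fact table; the grouping attribute from the dimension.
REVENUE_BY_SEGMENT_SQL = """
SELECT c.segment,
       SUM(o.amount) AS total_amount,
       COUNT(*)      AS order_count
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.segment
ORDER BY c.segment
"""

revenue_by_segment = conn.execute(REVENUE_BY_SEGMENT_SQL).fetchall()
```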

Question 8

Walk me through how you would build a simple daily ETL pipeline that pulls data from a REST API, transforms it, and loads it into a Postgres table.

Accepted Answer

Structure as Extract → Transform → Load. Extract: call the API with requests, handle pagination and HTTP errors, store the raw JSON. Transform: parse the response, validate the schema, handle nulls and type mismatches. Load: use psycopg2 or SQLAlchemy, with INSERT ... ON CONFLICT for idempotency. Schedule with cron or Airflow. Address two failure modes explicitly: what happens if the API is down, and what happens if the job runs twice (idempotency).

What interviewers look for: Whether you naturally think about failure modes and idempotency without being prompted. Knowing the ETL pattern end-to-end at a practical level. This is the most common real task for a new grad DE, and interviewers expect you to have thought it through.
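The extract and transform steps might be sketched as below. To keep the example runnable offline, the HTTP call is replaced by a hypothetical fetch_page callable, and the response shape (a "results" list plus a "next_page" number) and the record fields (id, amount, day) are assumptions, not any real API. The load step is left out of this sketch.

```python
from datetime import date


def extract(fetch_page):
    """Follow pagination until the API reports no next page.

    fetch_page is a stand-in for the HTTP call. In real code it would wrap
    requests.get(url, params=..., timeout=...) and response.raise_for_status().
    """
    page, records = 1, []
    while page is not None:
        payload = fetch_page(page)
        records.extend(payload["results"])
        page = payload.get("next_page")
    return records


def transform(raw_records):
    """Validate and coerce types; set aside records that fail validation."""
    clean, rejected = [], []
    for rec in raw_records:
        try:
            clean.append({
                "id": int(rec["id"]),
                "amount": float(rec["amount"]),
                "day": date.fromisoformat(rec["day"]).isoformat(),
            })
        except (KeyError, TypeError, ValueError):
            rejected.append(rec)  # keep bad rows for inspection, don't drop silently
    return clean, rejected


# Stubbed two-page API response (hypothetical shape).
FAKE_PAGES = {
    1: {"results": [{"id": "1", "amount": "9.5", "day": "2024-06-01"}],
        "next_page": 2},
    2: {"results": [{"id": "2", "amount": None, "day": "2024-06-01"},
                    {"id": "3", "amount": "4", "day": "2024-06-01"}],
        "next_page": None},
}

raw = extract(FAKE_PAGES.__getitem__)
clean, rejected = transform(raw)
```

Keeping rejected records rather than discarding them makes it possible to explain later why the loaded row count is lower than the extracted one.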

Question 9

What is idempotency in the context of a data pipeline, and why does it matter?

Accepted Answer

Define it: running the pipeline multiple times with the same input produces the same output, with no duplicate records and no data loss. Explain why it matters: schedulers retry failed jobs, and network issues cause double-triggers. Give a concrete fix: use INSERT ... ON CONFLICT DO NOTHING, upsert logic (ON CONFLICT DO UPDATE), or a staging table plus MERGE. Contrast with a naive append-only approach that creates duplicates on retry.

What interviewers look for: Whether you've internalized that pipelines fail and retry. This is a real-world engineering concern that distinguishes a DE who has thought about production from one who has only written scripts. They expect a concrete mechanism, not just the abstract concept.
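The contrast between a naive append and an idempotent load can be shown in a few lines. This sketch uses Python's built-in sqlite3 module as a stand-in for Postgres (SQLite 3.24 or later accepts the same ON CONFLICT ... DO UPDATE form); the table names and the batch are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events_naive (event_id INTEGER, amount REAL);
-- The unique key is what makes conflict detection possible.
CREATE TABLE events_idempotent (event_id INTEGER PRIMARY KEY, amount REAL);
""")

batch = [(1, 9.5), (2, 4.0)]


def load_naive(conn, rows):
    """Plain append: every rerun inserts the same rows again."""
    conn.executemany("INSERT INTO events_naive VALUES (?, ?)", rows)


def load_idempotent(conn, rows):
    """Upsert keyed on event_id: reruns overwrite instead of duplicating."""
    conn.executemany(
        """
        INSERT INTO events_idempotent (event_id, amount) VALUES (?, ?)
        ON CONFLICT (event_id) DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )


for _ in range(2):  # simulate the scheduler retrying the same run
    load_naive(conn, batch)
    load_idempotent(conn, batch)

naive_count = conn.execute("SELECT COUNT(*) FROM events_naive").fetchone()[0]
idempotent_count = conn.execute(
    "SELECT COUNT(*) FROM events_idempotent"
).fetchone()[0]
print(naive_count, idempotent_count)
```

DO UPDATE is used here so that a corrected value in a rerun replaces the old one; DO NOTHING would keep the first value instead.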

Question 10

You notice that the row count in your daily pipeline output dropped by 40% compared to yesterday. Walk me through how you'd investigate.

Accepted Answer

Structure your investigation: 1) Check whether the source data volume actually dropped (upstream issue vs pipeline issue). 2) Check pipeline logs for errors, changed filter conditions, or schema changes. 3) Check whether a JOIN is unexpectedly dropping rows, for example an inner join on a newly added key column that contains NULLs, which never match. 4) Check whether a date filter has an off-by-one bug. 5) Check whether an upstream table was truncated or a partition is missing. Prioritize source, then pipeline, then query logic, in that order.

What interviewers look for: Systematic debugging instinct. Whether you distinguish between 'the data source changed' and 'my pipeline broke', since these have very different remedies. Organized, step-by-step thinking under ambiguity is the core signal.
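Checks 1 and 3 above can be turned into a small row-count report. The sketch below is illustrative only: the tables (source_events, regions), the join key region_id, and the data are hypothetical, and the SQL runs through Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE source_events (event_id INTEGER, region_id INTEGER, event_date TEXT);
INSERT INTO regions VALUES (1, 'emea'), (2, 'amer');
-- Two of the five source rows arrive with a NULL join key.
INSERT INTO source_events VALUES
    (1, 1,    '2024-06-02'),
    (2, 2,    '2024-06-02'),
    (3, NULL, '2024-06-02'),
    (4, NULL, '2024-06-02'),
    (5, 1,    '2024-06-02');
""")

# Count rows at each stage so the drop can be pinned to one step.
CHECKS = {
    "source_rows":
        "SELECT COUNT(*) FROM source_events WHERE event_date = :d",
    "rows_after_inner_join":
        "SELECT COUNT(*) FROM source_events AS e "
        "JOIN regions AS r ON r.region_id = e.region_id "
        "WHERE e.event_date = :d",
    "rows_with_null_join_key":
        "SELECT COUNT(*) FROM source_events "
        "WHERE event_date = :d AND region_id IS NULL",
}

report = {
    name: conn.execute(sql, {"d": "2024-06-02"}).fetchone()[0]
    for name, sql in CHECKS.items()
}
for name, count in report.items():
    print(f"{name}: {count}")
```

In this sample the source volume is intact and the whole 40% gap is explained by NULL join keys, which points at the join rather than at the upstream feed.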

Question 11

Tell me about a project (academic, internship, or personal) where you had to work with messy or incomplete data. What did you do?

Accepted Answer

Use STAR structure: Situation (what the dataset or project was), Task (what you needed to produce), Action (how you identified and handled the mess: missing values, schema inconsistencies, duplicates), Result (what you shipped and what you learned). Be specific about the actual data problem and your solution, not vague about 'I cleaned the data.'

What interviewers look for: Genuine hands-on experience with real data pain. They're not expecting a massive project; a class project or Kaggle dataset is fine. The signal is: do you talk concretely about the mess, or do you hand-wave? Specificity is what makes the story credible.

Question 12

You're given a task but the requirements are ambiguous — the stakeholder isn't sure exactly what they want. How do you handle it?

Accepted Answer

Describe a concrete approach: ask clarifying questions upfront (what decision will this data inform? what's the time range?), propose a small scoped deliverable to align on direction before building the full thing, and check back early with a sample output. Mention that in data work, building the wrong thing is expensive to undo because downstream dashboards depend on your schema.

What interviewers look for: Communication maturity and proactiveness. New grad DEs often jump into coding before fully understanding the ask. Interviewers want to see that you've thought about the stakeholder relationship, not just the technical execution. It shows you'll be low-maintenance on a real team.

New Grad Data Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Write a query to find the top 3 products by revenue in each category for the past 30 days, given a sales table with columns: sale_id, product_id, category, revenue, sale_date.

2. What is the difference between a LEFT JOIN and an INNER JOIN? Give an example where using the wrong one would silently corrupt a downstream metric.

3. You have a table of user events (user_id, event_type, event_timestamp). Write a query to find each user's first and last event timestamp.

4. Write a Python function that reads a CSV file, removes rows where any column is null, and writes the cleaned output to a new CSV. Handle the case where the input file doesn't exist.

5. Given a list of dictionaries representing log records [{user_id, action, timestamp}], write Python code to group them by user_id and count the number of each action per user.

6. You're building a table to track daily user subscription status (active/cancelled/paused). A user's status can change over time. How would you model this?

7. What is the difference between a fact table and a dimension table in a star schema? Give a concrete example.

8. Walk me through how you would build a simple daily ETL pipeline that pulls data from a REST API, transforms it, and loads it into a Postgres table.

9. What is idempotency in the context of a data pipeline, and why does it matter?

10. You notice that the row count in your daily pipeline output dropped by 40% compared to yesterday. Walk me through how you'd investigate.

11. Tell me about a project (academic, internship, or personal) where you had to work with messy or incomplete data. What did you do?

12. You're given a task but the requirements are ambiguous — the stakeholder isn't sure exactly what they want. How do you handle it?

Study tips