Question 1

You have a fact table with 2 billion rows and a dimension table with 500k rows. A dashboard query joining them is timing out. Walk me through how you'd diagnose and fix it.

Accepted Answer

Start by running EXPLAIN/EXPLAIN ANALYZE to inspect the query plan — look for sequential scans, hash join vs. nested loop, and row estimate errors. Check statistics freshness (ANALYZE). Evaluate whether the fact table is partitioned on the filter column (usually a date); if not, propose partition pruning. Consider pre-aggregating the fact table into a summary table or materialized view for the dashboard use case. If on a columnar store (Redshift, BigQuery, Snowflake), check sort/cluster keys and distribution style. Finally, ask whether the join key has skew — a small number of dimension values matching huge numbers of fact rows can blow up memory. What interviewers look for: The interviewer wants to see a structured debugging process, not a list of random fixes. Strong candidates go to the query plan first, understand partition pruning and statistics, and distinguish between fixes that help all queries vs. fixes that optimize for this specific access pattern. They also ask clarifying questions about the warehouse technology before prescribing solutions.

Question 2

Write a query that, for each user, finds the first purchase after their account creation date, along with the number of days between signup and that first purchase.

Accepted Answer

Use a CTE or subquery to join the users table with the orders table on user_id where order_date > signup_date. Then use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date ASC) to rank orders and filter to rank = 1. Compute DATEDIFF or equivalent between signup_date and first_order_date. Handle edge cases: users with no purchases should appear as NULL (LEFT JOIN). Be explicit about timezone handling if timestamps are involved. What interviewers look for: Clean use of window functions (ROW_NUMBER or FIRST_VALUE) rather than a self-join or subquery with MIN() — both work but window functions show fluency. The interviewer checks whether the candidate handles the no-purchase case and thinks about NULL semantics. Bonus signal: mentioning that this query could be expensive at scale and suggesting a pre-computed user_first_order table.

Question 3

Design an idempotent daily batch pipeline that ingests clickstream events from S3, transforms them, and loads into a data warehouse. How do you handle late-arriving data?

Accepted Answer

Idempotency means re-running the pipeline for a given date produces the same result. Use date-partitioned S3 paths and warehouse partitions. At load time, DELETE+INSERT (or MERGE) on the target partition rather than appending. For late data: define a lateness window (e.g., accept events up to 3 days late) and reprocess affected partitions when late files land. Use a metadata table or job log to track partition state. In orchestration (Airflow/Prefect), use the logical execution date, not wall-clock time, so backfills work correctly. Flag records with arrived_at vs. event_at skew > threshold for monitoring. What interviewers look for: The key signal is whether the candidate understands idempotency as a design property, not just a buzzword — specifically, DELETE+INSERT/MERGE vs. naive append. Interviewers also check whether the candidate distinguishes event time from processing time and has a concrete, opinionated answer on late data rather than 'it depends.'

Question 4

Your Airflow DAG fails halfway through after writing 3 of 6 partitions. What happens when it retries, and what do you need to have in place to avoid partial or duplicate data?

Accepted Answer

On retry without safeguards, the DAG will re-run all tasks or only downstream tasks depending on configuration, potentially duplicating the 3 already-written partitions. Mitigations: (1) make each task atomic — write to a staging location first, then rename/move (atomic copy pattern); (2) use MERGE or DELETE+INSERT at the destination so re-writing a partition is safe; (3) record task-level completion in a metadata store so already-finished partitions are skipped; (4) use Airflow's task-level retries with idempotent operators. Also mention that task granularity matters — if one task writes one partition, partial failure is easier to recover than one task writing all six. What interviewers look for: Interviewers want to see that the candidate thinks about failure atomicity and recovery, not just happy-path orchestration. Strong candidates immediately identify that naive retries cause duplication and describe concrete write patterns (staging + atomic move, MERGE semantics). They also talk about observability — how do you know which partitions succeeded?

Question 5

You're modeling an e-commerce data warehouse. Product prices change over time. How do you model this so analysts can correctly calculate historical revenue?

Accepted Answer

This is a classic Slowly Changing Dimension (SCD) problem. SCD Type 2 is the right answer: keep a row per product per price period with effective_from, effective_to (NULL for current), and a is_current flag. When joining orders to products for historical revenue, join on product_id AND order_date BETWEEN effective_from AND effective_to. Discuss tradeoffs: SCD2 makes historical queries correct but more complex; SCD1 (overwrite) is simpler but loses history. Mention that in a modern lakehouse you might use Delta Lake or Iceberg time-travel as an alternative, but SCD2 remains the explicit, queryable approach. Consider a surrogate key (product_key) to make joins unambiguous. What interviewers look for: The interviewer checks whether the candidate knows SCD types and picks the right one (Type 2) with a clear rationale. Naming effective_from/effective_to correctly and showing the join pattern for historical accuracy is the core signal. Bonus: mentioning the surrogate key and the performance tradeoff of range joins at scale.

Question 6

When would you choose a wide, denormalized table over a normalized star schema? Give a concrete example.

Accepted Answer

Denormalized wide tables are preferable when: (1) the primary consumers are analysts or BI tools that can't do complex joins, (2) the table is append-only and written by a single pipeline (no update anomalies), (3) performance on columnar stores favors fewer joins over storage costs. Example: a user_events_enriched table that pre-joins user attributes at event time for a product analytics use case — analysts query it with simple filters, not multi-table joins. Star schema is better when dimensions are reused across many fact tables, dimension data changes (SCD), or you need to avoid data redundancy for governance reasons. What interviewers look for: The candidate should show they understand the practical tradeoffs, not just textbook normalization rules. The key insight is that columnar warehouses make denormalization cheap storage-wise, and analyst usability is a real engineering concern. Interviewers flag candidates who give only the academic answer without connecting to real use cases.

Question 7

How would you build a data quality framework for a critical pipeline that feeds a financial reporting dashboard?

Accepted Answer

Layer the checks: (1) schema/type validation at ingestion (Great Expectations, Soda, or custom assertions); (2) row-count and null-rate checks per partition against historical baselines (anomaly detection, not just hard thresholds); (3) business-rule checks — e.g., revenue totals within X% of prior period, referential integrity between fact and dimension; (4) freshness checks — alert if partition doesn't arrive by SLA. Fail the pipeline and block dashboard refresh on critical check failures; log warnings for soft checks. Store check results in a quality metadata table for trending. For financial data specifically, add reconciliation checks against source system totals. What interviewers look for: The signal is whether the candidate distinguishes between types of checks (schema, statistical, business-rule, freshness) and has an opinion on fail-hard vs. warn-and-continue. Financial context should trigger mention of reconciliation against source totals. Interviewers also check whether the candidate thinks about alerting and visibility, not just the checks themselves.

Question 8

Explain how Apache Spark decides how many partitions to use when reading a large Parquet file from S3, and what happens if the partition count is wrong.

Accepted Answer

Spark's default parallelism for Parquet reads is driven by the number of files and spark.sql.files.maxPartitionBytes (default 128MB). Each Parquet row group maps to roughly one task. Too few partitions: large tasks, out-of-memory errors, underutilized cluster. Too many partitions: scheduling overhead, small file problem downstream if writing, GC pressure. Fix: repartition() after read if joining/aggregating (expensive — full shuffle); coalesce() before writing if reducing partitions (no shuffle). For S3, also consider file layout — many small files cause Spark to create many tiny partitions, which is why compaction jobs exist. Tune spark.sql.shuffle.partitions (default 200) separately from read partitions. What interviewers look for: Interviewers check whether the candidate understands the file-to-partition mapping, not just 'increase parallelism.' The distinction between repartition (shuffle) and coalesce (no shuffle) is a common probe. Strong candidates connect this to the small files problem and mention why compaction matters for both read performance and downstream write quality.

Question 9

You notice a Spark job is running much slower than expected. When you look at the Spark UI, you see one task taking 10x longer than all others. What's wrong and how do you fix it?

Accepted Answer

This is data skew. One partition has disproportionately more data — common cause is a skewed join key (e.g., joining on user_id where one user has millions of events, or a NULL key collapsing all nulls into one partition). Diagnose: check task duration distribution in the Spark UI Stage tab, look at partition sizes via rdd.glom().map(len).collect(). Fixes: (1) salting — add a random prefix to the skewed key on both sides and explode the smaller side; (2) broadcast join — if the smaller table fits in memory, broadcast it to avoid shuffle entirely; (3) filter/separate skewed keys and process them independently. Also check if skew is from NULLs and handle them explicitly before the join. What interviewers look for: Naming data skew immediately is the baseline. Strong candidates describe at least two concrete fixes with their tradeoffs — salting adds complexity, broadcast joins have memory limits. The best answers mention proactive diagnosis via the Spark UI rather than just describing fixes in the abstract.

Question 10

You're writing a Python function to process records from a Kafka topic and upsert them into Postgres. The topic has ~10k messages/second. What design concerns do you address?

Accepted Answer

Key concerns: (1) batching — don't upsert one row per message, batch into groups of 500-1000 and use INSERT ON CONFLICT DO UPDATE or COPY; (2) consumer group management — commit offsets only after successful upsert, not before, to ensure at-least-once delivery; (3) schema evolution — handle missing or extra fields gracefully; (4) error handling — dead-letter queue for malformed messages, retry with backoff for transient DB errors; (5) connection pooling — use a pool (psycopg2 with pgbouncer or SQLAlchemy pool) rather than opening a connection per batch; (6) backpressure — if Postgres can't keep up, pause consumption rather than dropping messages. Mention monitoring: lag on the consumer group is the primary SLA metric. What interviewers look for: The interviewer is testing whether the candidate thinks about throughput at 10k/s, not just correctness at 10/s. Batching and offset commit ordering are the two highest-signal answers. Candidates who propose one-at-a-time upserts or who commit offsets before writing show a gap. Bonus: mentioning consumer group lag as the key operational metric.

Question 11

Tell me about a time a data pipeline you owned caused incorrect data to reach a business stakeholder. What happened and what did you change?

Accepted Answer

Use STAR structure. Be specific: what was the pipeline, what was the incorrect data, who was impacted. Strong answers don't minimize the incident — they describe the immediate response (alerting stakeholders, quantifying scope of bad data), the root cause (not just 'a bug' but the specific failure mode), and the systemic fix (not just the one-time patch). The systemic fix is where candidates differentiate themselves: adding a data quality check, improving testing coverage, adding an anomaly alert, or changing the deployment process. Avoid blaming external systems or teammates. What interviewers look for: Interviewers are evaluating ownership, incident response maturity, and whether the candidate learns systemically vs. just fixing the immediate issue. The red flag is a candidate who either can't recall an incident (unlikely at mid-level) or whose 'fix' was just 'I was more careful.' Concrete, honest answers with a real systemic change are the signal.

Question 12

An analyst comes to you and says the numbers in their dashboard don't match the numbers they're getting from a direct database query. How do you handle this?

Accepted Answer

Start by not assuming either source is correct. Gather specifics: what dashboard, what metric, what time range, what filters each is applying. Common causes: (1) timezone differences in aggregation; (2) dashboard uses a cached materialized view, direct query hits live table; (3) different join logic or NULL handling between the two; (4) the analyst's direct query has a bug; (5) the pipeline has a bug. Walk through each systematically. Reproduce both numbers in SQL with explicit logic visible. Communicate status to the analyst throughout — don't go silent for two hours. Document the root cause and outcome regardless of which source was wrong. What interviewers look for: This question tests debugging methodology, communication, and ego. The wrong answer is immediately assuming the pipeline is right or the analyst is wrong. Interviewers want structured troubleshooting (enumerate hypotheses, test them) and evidence that the candidate treats analysts as partners, not tickets. Timezone and caching are the two most common real root causes — naming them shows experience.

Mid-Level Data Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. You have a fact table with 2 billion rows and a dimension table with 500k rows. A dashboard query joining them is timing out. Walk me through how you'd diagnose and fix it.

2. Write a query that, for each user, finds the first purchase after their account creation date, along with the number of days between signup and that first purchase.

3. Design an idempotent daily batch pipeline that ingests clickstream events from S3, transforms them, and loads into a data warehouse. How do you handle late-arriving data?

4. Your Airflow DAG fails halfway through after writing 3 of 6 partitions. What happens when it retries, and what do you need to have in place to avoid partial or duplicate data?

5. You're modeling an e-commerce data warehouse. Product prices change over time. How do you model this so analysts can correctly calculate historical revenue?

6. When would you choose a wide, denormalized table over a normalized star schema? Give a concrete example.

7. How would you build a data quality framework for a critical pipeline that feeds a financial reporting dashboard?

8. Explain how Apache Spark decides how many partitions to use when reading a large Parquet file from S3, and what happens if the partition count is wrong.

9. You notice a Spark job is running much slower than expected. When you look at the Spark UI, you see one task taking 10x longer than all others. What's wrong and how do you fix it?

10. You're writing a Python function to process records from a Kafka topic and upsert them into Postgres. The topic has ~10k messages/second. What design concerns do you address?

11. Tell me about a time a data pipeline you owned caused incorrect data to reach a business stakeholder. What happened and what did you change?

12. An analyst comes to you and says the numbers in their dashboard don't match the numbers they're getting from a direct database query. How do you handle this?

Study tips