Interview questions
Mid-Level Data Engineer Interview Questions
Mid-level data engineering interviews test whether you can own pipelines end-to-end — not just write SQL or glue together tools, but reason about reliability, data quality, and the tradeoffs of different ingestion and transformation patterns. Expect interviewers to probe your hands-on experience with real failure modes, not just textbook definitions. The bar is: can you ship production-grade data infrastructure with moderate guidance, and catch the problems before your stakeholders do?
What to expect
A typical mid-level data engineering loop runs 4-5 rounds: one coding screen (SQL or Python, sometimes both), one or two pipeline design or system design sessions, a data modeling round, a debugging/troubleshooting scenario, and a behavioral round. Coding is lighter than software engineering loops — you won't be asked to implement a red-black tree — but SQL complexity is higher than most candidates expect, often involving window functions, CTEs, and query optimization. System design questions are scoped to single data products or pipelines rather than full platform architecture. Behavioral questions focus on cross-functional collaboration (working with analysts and stakeholders) and handling data incidents.
These are the questions every Data Engineer gets.
Get questions tailored to your experience, answer them, and get honest feedback — free, no credit card.
Run a free fit check →12 questions, with how to answer them
SQL & Query Optimization
1. You have a fact table with 2 billion rows and a dimension table with 500k rows. A dashboard query joining them is timing out. Walk me through how you'd diagnose and fix it.
How to answer: Start by running EXPLAIN/EXPLAIN ANALYZE to inspect the query plan — look for sequential scans, hash join vs. nested loop, and row estimate errors. Check statistics freshness (ANALYZE). Evaluate whether the fact table is partitioned on the filter column (usually a date); if not, propose partition pruning. Consider pre-aggregating the fact table into a summary table or materialized view for the dashboard use case. If on a columnar store (Redshift, BigQuery, Snowflake), check sort/cluster keys and distribution style. Finally, ask whether the join key has skew — a small number of dimension values matching huge numbers of fact rows can blow up memory.
What they look for: The interviewer wants to see a structured debugging process, not a list of random fixes. Strong candidates go to the query plan first, understand partition pruning and statistics, and distinguish between fixes that help all queries vs. fixes that optimize for this specific access pattern. They also ask clarifying questions about the warehouse technology before prescribing solutions.
SQL & Query Optimization
2. Write a query that, for each user, finds the first purchase after their account creation date, along with the number of days between signup and that first purchase.
How to answer: Use a CTE or subquery to join the users table with the orders table on user_id where order_date > signup_date. Then use ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date ASC) to rank orders and filter to rank = 1. Compute DATEDIFF or equivalent between signup_date and first_order_date. Handle edge cases: users with no purchases should appear as NULL (LEFT JOIN). Be explicit about timezone handling if timestamps are involved.
What they look for: Clean use of window functions (ROW_NUMBER or FIRST_VALUE) rather than a self-join or subquery with MIN() — both work but window functions show fluency. The interviewer checks whether the candidate handles the no-purchase case and thinks about NULL semantics. Bonus signal: mentioning that this query could be expensive at scale and suggesting a pre-computed user_first_order table.
Pipeline Design
3. Design an idempotent daily batch pipeline that ingests clickstream events from S3, transforms them, and loads into a data warehouse. How do you handle late-arriving data?
How to answer: Idempotency means re-running the pipeline for a given date produces the same result. Use date-partitioned S3 paths and warehouse partitions. At load time, DELETE+INSERT (or MERGE) on the target partition rather than appending. For late data: define a lateness window (e.g., accept events up to 3 days late) and reprocess affected partitions when late files land. Use a metadata table or job log to track partition state. In orchestration (Airflow/Prefect), use the logical execution date, not wall-clock time, so backfills work correctly. Flag records with arrived_at vs. event_at skew > threshold for monitoring.
What they look for: The key signal is whether the candidate understands idempotency as a design property, not just a buzzword — specifically, DELETE+INSERT/MERGE vs. naive append. Interviewers also check whether the candidate distinguishes event time from processing time and has a concrete, opinionated answer on late data rather than 'it depends.'
Pipeline Design
4. Your Airflow DAG fails halfway through after writing 3 of 6 partitions. What happens when it retries, and what do you need to have in place to avoid partial or duplicate data?
How to answer: On retry without safeguards, the DAG will re-run all tasks or only downstream tasks depending on configuration, potentially duplicating the 3 already-written partitions. Mitigations: (1) make each task atomic — write to a staging location first, then rename/move (atomic copy pattern); (2) use MERGE or DELETE+INSERT at the destination so re-writing a partition is safe; (3) record task-level completion in a metadata store so already-finished partitions are skipped; (4) use Airflow's task-level retries with idempotent operators. Also mention that task granularity matters — if one task writes one partition, partial failure is easier to recover than one task writing all six.
What they look for: Interviewers want to see that the candidate thinks about failure atomicity and recovery, not just happy-path orchestration. Strong candidates immediately identify that naive retries cause duplication and describe concrete write patterns (staging + atomic move, MERGE semantics). They also talk about observability — how do you know which partitions succeeded?
Data Modeling
5. You're modeling an e-commerce data warehouse. Product prices change over time. How do you model this so analysts can correctly calculate historical revenue?
How to answer: This is a classic Slowly Changing Dimension (SCD) problem. SCD Type 2 is the right answer: keep a row per product per price period with effective_from, effective_to (NULL for current), and a is_current flag. When joining orders to products for historical revenue, join on product_id AND order_date BETWEEN effective_from AND effective_to. Discuss tradeoffs: SCD2 makes historical queries correct but more complex; SCD1 (overwrite) is simpler but loses history. Mention that in a modern lakehouse you might use Delta Lake or Iceberg time-travel as an alternative, but SCD2 remains the explicit, queryable approach. Consider a surrogate key (product_key) to make joins unambiguous.
What they look for: The interviewer checks whether the candidate knows SCD types and picks the right one (Type 2) with a clear rationale. Naming effective_from/effective_to correctly and showing the join pattern for historical accuracy is the core signal. Bonus: mentioning the surrogate key and the performance tradeoff of range joins at scale.
Data Modeling
6. When would you choose a wide, denormalized table over a normalized star schema? Give a concrete example.
How to answer: Denormalized wide tables are preferable when: (1) the primary consumers are analysts or BI tools that can't do complex joins, (2) the table is append-only and written by a single pipeline (no update anomalies), (3) performance on columnar stores favors fewer joins over storage costs. Example: a user_events_enriched table that pre-joins user attributes at event time for a product analytics use case — analysts query it with simple filters, not multi-table joins. Star schema is better when dimensions are reused across many fact tables, dimension data changes (SCD), or you need to avoid data redundancy for governance reasons.
What they look for: The candidate should show they understand the practical tradeoffs, not just textbook normalization rules. The key insight is that columnar warehouses make denormalization cheap storage-wise, and analyst usability is a real engineering concern. Interviewers flag candidates who give only the academic answer without connecting to real use cases.
Data Quality
7. How would you build a data quality framework for a critical pipeline that feeds a financial reporting dashboard?
How to answer: Layer the checks: (1) schema/type validation at ingestion (Great Expectations, Soda, or custom assertions); (2) row-count and null-rate checks per partition against historical baselines (anomaly detection, not just hard thresholds); (3) business-rule checks — e.g., revenue totals within X% of prior period, referential integrity between fact and dimension; (4) freshness checks — alert if partition doesn't arrive by SLA. Fail the pipeline and block dashboard refresh on critical check failures; log warnings for soft checks. Store check results in a quality metadata table for trending. For financial data specifically, add reconciliation checks against source system totals.
What they look for: The signal is whether the candidate distinguishes between types of checks (schema, statistical, business-rule, freshness) and has an opinion on fail-hard vs. warn-and-continue. Financial context should trigger mention of reconciliation against source totals. Interviewers also check whether the candidate thinks about alerting and visibility, not just the checks themselves.
Distributed Systems & Tools
8. Explain how Apache Spark decides how many partitions to use when reading a large Parquet file from S3, and what happens if the partition count is wrong.
How to answer: Spark's default parallelism for Parquet reads is driven by the number of files and spark.sql.files.maxPartitionBytes (default 128MB). Each Parquet row group maps to roughly one task. Too few partitions: large tasks, out-of-memory errors, underutilized cluster. Too many partitions: scheduling overhead, small file problem downstream if writing, GC pressure. Fix: repartition() after read if joining/aggregating (expensive — full shuffle); coalesce() before writing if reducing partitions (no shuffle). For S3, also consider file layout — many small files cause Spark to create many tiny partitions, which is why compaction jobs exist. Tune spark.sql.shuffle.partitions (default 200) separately from read partitions.
What they look for: Interviewers check whether the candidate understands the file-to-partition mapping, not just 'increase parallelism.' The distinction between repartition (shuffle) and coalesce (no shuffle) is a common probe. Strong candidates connect this to the small files problem and mention why compaction matters for both read performance and downstream write quality.
Distributed Systems & Tools
9. You notice a Spark job is running much slower than expected. When you look at the Spark UI, you see one task taking 10x longer than all others. What's wrong and how do you fix it?
How to answer: This is data skew. One partition has disproportionately more data — common cause is a skewed join key (e.g., joining on user_id where one user has millions of events, or a NULL key collapsing all nulls into one partition). Diagnose: check task duration distribution in the Spark UI Stage tab, look at partition sizes via rdd.glom().map(len).collect(). Fixes: (1) salting — add a random prefix to the skewed key on both sides and explode the smaller side; (2) broadcast join — if the smaller table fits in memory, broadcast it to avoid shuffle entirely; (3) filter/separate skewed keys and process them independently. Also check if skew is from NULLs and handle them explicitly before the join.
What they look for: Naming data skew immediately is the baseline. Strong candidates describe at least two concrete fixes with their tradeoffs — salting adds complexity, broadcast joins have memory limits. The best answers mention proactive diagnosis via the Spark UI rather than just describing fixes in the abstract.
Python & Engineering Practices
10. You're writing a Python function to process records from a Kafka topic and upsert them into Postgres. The topic has ~10k messages/second. What design concerns do you address?
How to answer: Key concerns: (1) batching — don't upsert one row per message, batch into groups of 500-1000 and use INSERT ON CONFLICT DO UPDATE or COPY; (2) consumer group management — commit offsets only after successful upsert, not before, to ensure at-least-once delivery; (3) schema evolution — handle missing or extra fields gracefully; (4) error handling — dead-letter queue for malformed messages, retry with backoff for transient DB errors; (5) connection pooling — use a pool (psycopg2 with pgbouncer or SQLAlchemy pool) rather than opening a connection per batch; (6) backpressure — if Postgres can't keep up, pause consumption rather than dropping messages. Mention monitoring: lag on the consumer group is the primary SLA metric.
What they look for: The interviewer is testing whether the candidate thinks about throughput at 10k/s, not just correctness at 10/s. Batching and offset commit ordering are the two highest-signal answers. Candidates who propose one-at-a-time upserts or who commit offsets before writing show a gap. Bonus: mentioning consumer group lag as the key operational metric.
Behavioral / Cross-functional
11. Tell me about a time a data pipeline you owned caused incorrect data to reach a business stakeholder. What happened and what did you change?
How to answer: Use STAR structure. Be specific: what was the pipeline, what was the incorrect data, who was impacted. Strong answers don't minimize the incident — they describe the immediate response (alerting stakeholders, quantifying scope of bad data), the root cause (not just 'a bug' but the specific failure mode), and the systemic fix (not just the one-time patch). The systemic fix is where candidates differentiate themselves: adding a data quality check, improving testing coverage, adding an anomaly alert, or changing the deployment process. Avoid blaming external systems or teammates.
What they look for: Interviewers are evaluating ownership, incident response maturity, and whether the candidate learns systemically vs. just fixing the immediate issue. The red flag is a candidate who either can't recall an incident (unlikely at mid-level) or whose 'fix' was just 'I was more careful.' Concrete, honest answers with a real systemic change are the signal.
Behavioral / Cross-functional
12. An analyst comes to you and says the numbers in their dashboard don't match the numbers they're getting from a direct database query. How do you handle this?
How to answer: Start by not assuming either source is correct. Gather specifics: what dashboard, what metric, what time range, what filters each is applying. Common causes: (1) timezone differences in aggregation; (2) dashboard uses a cached materialized view, direct query hits live table; (3) different join logic or NULL handling between the two; (4) the analyst's direct query has a bug; (5) the pipeline has a bug. Walk through each systematically. Reproduce both numbers in SQL with explicit logic visible. Communicate status to the analyst throughout — don't go silent for two hours. Document the root cause and outcome regardless of which source was wrong.
What they look for: This question tests debugging methodology, communication, and ego. The wrong answer is immediately assuming the pipeline is right or the analyst is wrong. Interviewers want structured troubleshooting (enumerate hypotheses, test them) and evidence that the candidate treats analysts as partners, not tickets. Timezone and caching are the two most common real root causes — naming them shows experience.
Study tips
- Practice multi-step SQL with window functions on real-ish datasets — LeetCode's hard SQL problems and Mode Analytics' SQL tutorial go beyond basic aggregations into the patterns you'll actually see. Focus on ROW_NUMBER, LAG/LEAD, and frame-based aggregations.
- For pipeline design questions, build a mental template before the interview: ingestion pattern → transformation layer → load pattern → idempotency mechanism → failure/retry behavior → monitoring. Interviewers notice when candidates cover all six without prompting.
- Review the internals of one distributed processing engine deeply (Spark or Flink) — specifically the shuffle, partition sizing, and skew handling. Superficial tool knowledge ('I've used Spark') doesn't pass mid-level bars; explaining why a job is slow from the Spark UI does.
- Prepare two or three real incidents from your work history: a data quality failure, a pipeline outage, and a stakeholder disagreement. Mid-level behavioral questions are almost always about these three scenarios, and concrete specifics (the actual numbers, the actual root cause) differentiate strong candidates from vague ones.
- Know the tradeoffs between at-least-once and exactly-once delivery in your orchestration and streaming tools. Exactly-once is often not achievable end-to-end; being able to articulate why and what compensating controls you use (idempotent writes, deduplication) shows production-level thinking.
Practice these against your own résumé
Get questions tailored to your experience, answer them, and get honest feedback — free, no credit card.
Run a free fit check →