Q: You have a PySpark job that reads a 10 TB Parquet dataset, joins it with a 500 MB lookup table, and is running out of memory on executors. How do you fix it?

Identify the join strategy: by default Spark may attempt a sort-merge join on both large datasets. The fix is a broadcast join — the 500 MB table is below the default broadcast threshold (200 MB by default) or can be forced with broadcast() hint. Explain why: broadcasting sends the small table to every executor, eliminating the shuffle. Then discuss: check spark.sql.autoBroadcastJoinThreshold, verify the 500 MB is post-filter size, watch for broadcast serialization overhead on very wide tables. Also check for data skew in the large table (salting technique), partition count (too few = large tasks, too many = scheduling overhead), and whether the Parquet files are optimally sized (~128–256 MB). Demonstrate understanding of the execution plan via explain(). What interviewers look for: Can the candidate read the execution plan and connect it to memory pressure? Do they know the broadcast threshold and how to override it? Do they mention skew as a follow-up concern? This tests operational Spark depth, not just API familiarity.

Q: How would you design a data quality framework for a pipeline that feeds a financial reporting system where incorrect data causes regulatory risk?

Layer the approach: (1) source contract validation — schema checks, null checks, referential integrity at ingest using tools like Great Expectations or dbt tests, halt pipeline on critical failures; (2) business rule validation — row counts within expected bounds (compared to yesterday via anomaly detection), sum totals reconciliation against source system totals, duplicate key detection; (3) freshness SLAs — alert if table hasn't been updated within expected window; (4) lineage tracking — know which downstream reports are affected by a failure so you can proactively notify stakeholders; (5) incident runbook — who gets paged, what gets quarantined, audit log of all data corrections. Distinguish between blocking checks (stop the pipeline) and warning checks (alert but continue). Mention the importance of idempotent pipelines so reruns don't create duplicate data. What interviewers look for: Does the candidate think beyond 'add dbt tests'? Do they understand the difference between statistical anomaly detection and deterministic rule checks? Do they address the organizational dimension — who is accountable when bad data ships? This signals senior ownership, not just engineering.

Q: Compare Airflow, Prefect, and Dagster for orchestrating a complex pipeline with 500+ tasks, dynamic task generation, and strict data-aware scheduling. Which do you choose and why?

Frame the comparison on axes that matter at this scale: dynamic DAG generation (Airflow's dynamic task mapping in 2.3+ is functional but verbose; Prefect and Dagster handle dynamic flows more natively), data-aware scheduling (Dagster's asset-based model is purpose-built for this — you schedule based on asset freshness, not just time; Airflow datasets feature is nascent), observability (Dagster's asset catalog gives lineage out of the box; Airflow requires significant plugin work), and operational overhead (self-hosted Airflow at 500+ tasks requires tuning the executor, metadata DB, and scheduler; managed options like MWAA or Astronomer reduce this). For a data-aware scheduling requirement, argue for Dagster unless there's a strong existing Airflow investment. Acknowledge Prefect's Python-native simplicity for teams that prioritize fast iteration. What interviewers look for: Does the candidate have genuine operational experience, or are they just repeating documentation? Do they ground the recommendation in the specific constraints (500+ tasks, dynamic, data-aware) rather than giving a generic answer? Interviewers want opinionated judgment backed by tradeoffs.

Q: Explain how you would handle late-arriving data in a streaming pipeline computing hourly revenue aggregations, and what the tradeoffs are.

Define the problem: events arrive after the watermark has passed their event-time window. Options: (1) Watermarking with allowed lateness — in Flink/Spark SS, set a watermark delay (e.g., 2 hours) that holds windows open longer, balancing completeness against latency; (2) Lambda architecture — stream gives fast approximate results, batch recomputes correct values nightly; (3) Incremental correction — emit updated aggregates when late data arrives, downstream systems must handle updates (upserts vs. append-only); (4) Drop late data with SLA monitoring — acceptable only if late data volume is tiny and documented. For revenue, late data is often high-value (large transactions on slow networks), so dropping is rarely acceptable. Recommend watermarking tuned to the empirical late-data distribution (p99 lateness), plus a daily reconciliation job that compares stream totals to batch totals and fires alerts. What interviewers look for: Does the candidate know what a watermark actually does at the engine level, or just use the word? Do they reason about the business impact of the tradeoff (dropping late revenue data is a compliance issue)? Do they propose a reconciliation strategy rather than assuming streaming is always correct?

Q: Tell me about a time you identified and fixed a significant data quality issue that was affecting business decisions before the business noticed. Walk me through what you did.

Use STAR but go deep on the technical detection mechanism (anomaly detection, automated reconciliation, lineage tracing) and the cross-functional response (who you notified, how you communicated uncertainty, how you prioritized the fix). Quantify the impact: how long had bad data been there, what decisions were potentially affected, what was the remediation. End with the systemic fix — not just patching the immediate issue but the monitoring or process that prevents recurrence. Avoid hero narratives; emphasize collaboration and what you learned. What interviewers look for: Interviewers are evaluating proactive ownership (did you find it or did the business find it?), communication under uncertainty (did you share preliminary findings responsibly?), and systemic thinking (did you fix the root cause?). At senior level, 'I fixed the bug' is not enough — they want to see you improved the system.

Q: Describe a situation where you had to push back on a stakeholder who wanted data faster than your team could deliver it reliably. How did you handle it?

Frame the answer around principled disagreement: you understood their business need, quantified the risk of moving faster (data quality, tech debt, team burnout), and proposed alternatives (phased delivery, a less-precise MVP with clear caveats, resource trade-offs). Show that you escalated or compromised where appropriate rather than being obstructionist. Demonstrate that the outcome was better because of the pushback — either quality was preserved, or a better scoping emerged. What interviewers look for: Can this person influence without authority? Do they understand the business context well enough to have an informed opinion? Do they know when to hold the line and when to adapt? This tests the leadership maturity expected at senior level — not just technical execution.

Question 1

Design a near-real-time analytics pipeline that ingests clickstream events from a high-traffic e-commerce site (100k events/sec peak) and makes aggregated metrics available to a dashboard within 60 seconds.

Accepted Answer

Start by clarifying SLAs: latency budget, acceptable data loss, and query patterns. Propose a Lambda or Kappa architecture: Kafka (or Kinesis) for ingest, a stream processor (Flink or Spark Structured Streaming) for windowed aggregations, writing results to a low-latency serving layer (ClickHouse, Druid, or Redis for counters). Address partitioning strategy in Kafka (by user_id or session_id for ordering guarantees), exactly-once semantics tradeoffs, watermarking for late data, and a cold-path reprocessing story for corrections. Quantify rough throughput: 100k events/sec at ~1KB average = ~100MB/s — discuss partition count, consumer group sizing, and backpressure handling. What interviewers look for: Can the candidate size the system, not just name technologies? Do they reason about exactly-once vs at-least-once and its downstream impact? Do they proactively address late data and schema evolution? Interviewers want to see a structured decomposition: ingest → process → serve, with tradeoffs at each layer.

Question 2

Your company's data warehouse has grown to 200 TB and dbt runs are taking 6+ hours, causing downstream SLA breaches. How do you diagnose and fix this?

Accepted Answer

Break diagnosis into three layers: (1) warehouse execution — use query profiling (EXPLAIN, query history) to find full-table scans, missing clustering/partitioning, or data spill; (2) dbt model graph — identify bottlenecks using dbt's node timing, look for fat nodes that could be split or incrementalized, and detect unnecessary full refreshes; (3) orchestration — check for serial execution where parallelism is possible. Fix levers: convert large models to incremental with appropriate unique_key and merge strategy, apply table clustering on filter columns, use ephemeral or view models for light transforms, partition pruning, and separate fast/slow DAG layers with different schedules. Discuss cost vs. latency tradeoffs of materializing intermediate models. What interviewers look for: Does the candidate have a systematic debugging approach rather than jumping to 'add more compute'? Do they understand incremental model semantics and the risk of silent incorrect results from a bad unique_key? Can they balance engineering effort against operational impact?

Question 3

You're building a data model to support both operational reporting (daily sales by region) and exploratory analysis (ad hoc slicing by 15+ dimensions). How do you structure the warehouse layer?

Accepted Answer

Argue for a layered architecture: raw → staging (typed, lightly cleaned) → intermediate → mart. For the mart layer, distinguish use cases: a narrow, pre-aggregated fact table for operational dashboards (Kimball-style star schema with conformed dimensions), and a wide, denormalized 'one big table' or OBT for exploration. Discuss grain explicitly — the most common modeling failure is mixing grains. Address slowly changing dimensions (SCD Type 1 vs. 2 vs. 6) for dimensions like region or product category. Mention columnar storage benefits for the exploratory case and materialization strategy in dbt. Flag the maintenance cost of keeping multiple representations in sync. What interviewers look for: Does the candidate commit to a grain before designing? Do they know Kimball well enough to adapt it rather than follow it dogmatically? Can they articulate when an OBT is appropriate vs. a star schema, and what the operational cost of each is?

Question 4

Given a table of user sessions with session_start, session_end, and user_id, write a query to identify users who had overlapping sessions (a bug in the session tracking system).

Accepted Answer

Use a self-join: join the table to itself on user_id where s1.session_id != s2.session_id and s1.session_start < s2.session_end and s1.session_end > s2.session_start. To avoid duplicates, add s1.session_id < s2.session_id. Alternatively, use window functions: LAG to get the previous session's end time and compare with the current start — this is more efficient but only catches consecutive overlaps, so clarify assumptions. Discuss indexing on user_id and timestamps, and what to do with the results (flag, deduplicate, alert). What interviewers look for: Does the candidate handle the edge cases in interval overlap logic correctly (the classic mistake is using <= instead of <)? Do they think about query performance at scale? Do they proactively clarify what 'overlapping' means and what to do with the output?

Question 5

You have a PySpark job that reads a 10 TB Parquet dataset, joins it with a 500 MB lookup table, and is running out of memory on executors. How do you fix it?

Accepted Answer

Identify the join strategy: by default Spark may attempt a sort-merge join on both large datasets. The fix is a broadcast join — the 500 MB table is below the default broadcast threshold (200 MB by default) or can be forced with broadcast() hint. Explain why: broadcasting sends the small table to every executor, eliminating the shuffle. Then discuss: check spark.sql.autoBroadcastJoinThreshold, verify the 500 MB is post-filter size, watch for broadcast serialization overhead on very wide tables. Also check for data skew in the large table (salting technique), partition count (too few = large tasks, too many = scheduling overhead), and whether the Parquet files are optimally sized (~128–256 MB). Demonstrate understanding of the execution plan via explain(). What interviewers look for: Can the candidate read the execution plan and connect it to memory pressure? Do they know the broadcast threshold and how to override it? Do they mention skew as a follow-up concern? This tests operational Spark depth, not just API familiarity.

Question 6

How would you design a data quality framework for a pipeline that feeds a financial reporting system where incorrect data causes regulatory risk?

Accepted Answer

Layer the approach: (1) source contract validation — schema checks, null checks, referential integrity at ingest using tools like Great Expectations or dbt tests, halt pipeline on critical failures; (2) business rule validation — row counts within expected bounds (compared to yesterday via anomaly detection), sum totals reconciliation against source system totals, duplicate key detection; (3) freshness SLAs — alert if table hasn't been updated within expected window; (4) lineage tracking — know which downstream reports are affected by a failure so you can proactively notify stakeholders; (5) incident runbook — who gets paged, what gets quarantined, audit log of all data corrections. Distinguish between blocking checks (stop the pipeline) and warning checks (alert but continue). Mention the importance of idempotent pipelines so reruns don't create duplicate data. What interviewers look for: Does the candidate think beyond 'add dbt tests'? Do they understand the difference between statistical anomaly detection and deterministic rule checks? Do they address the organizational dimension — who is accountable when bad data ships? This signals senior ownership, not just engineering.

Question 7

Compare Airflow, Prefect, and Dagster for orchestrating a complex pipeline with 500+ tasks, dynamic task generation, and strict data-aware scheduling. Which do you choose and why?

Accepted Answer

Frame the comparison on axes that matter at this scale: dynamic DAG generation (Airflow's dynamic task mapping in 2.3+ is functional but verbose; Prefect and Dagster handle dynamic flows more natively), data-aware scheduling (Dagster's asset-based model is purpose-built for this — you schedule based on asset freshness, not just time; Airflow datasets feature is nascent), observability (Dagster's asset catalog gives lineage out of the box; Airflow requires significant plugin work), and operational overhead (self-hosted Airflow at 500+ tasks requires tuning the executor, metadata DB, and scheduler; managed options like MWAA or Astronomer reduce this). For a data-aware scheduling requirement, argue for Dagster unless there's a strong existing Airflow investment. Acknowledge Prefect's Python-native simplicity for teams that prioritize fast iteration. What interviewers look for: Does the candidate have genuine operational experience, or are they just repeating documentation? Do they ground the recommendation in the specific constraints (500+ tasks, dynamic, data-aware) rather than giving a generic answer? Interviewers want opinionated judgment backed by tradeoffs.

Question 8

Explain how you would handle late-arriving data in a streaming pipeline computing hourly revenue aggregations, and what the tradeoffs are.

Accepted Answer

Define the problem: events arrive after the watermark has passed their event-time window. Options: (1) Watermarking with allowed lateness — in Flink/Spark SS, set a watermark delay (e.g., 2 hours) that holds windows open longer, balancing completeness against latency; (2) Lambda architecture — stream gives fast approximate results, batch recomputes correct values nightly; (3) Incremental correction — emit updated aggregates when late data arrives, downstream systems must handle updates (upserts vs. append-only); (4) Drop late data with SLA monitoring — acceptable only if late data volume is tiny and documented. For revenue, late data is often high-value (large transactions on slow networks), so dropping is rarely acceptable. Recommend watermarking tuned to the empirical late-data distribution (p99 lateness), plus a daily reconciliation job that compares stream totals to batch totals and fires alerts. What interviewers look for: Does the candidate know what a watermark actually does at the engine level, or just use the word? Do they reason about the business impact of the tradeoff (dropping late revenue data is a compliance issue)? Do they propose a reconciliation strategy rather than assuming streaming is always correct?

Question 9

Tell me about a time you identified and fixed a significant data quality issue that was affecting business decisions before the business noticed. Walk me through what you did.

Accepted Answer

Use STAR but go deep on the technical detection mechanism (anomaly detection, automated reconciliation, lineage tracing) and the cross-functional response (who you notified, how you communicated uncertainty, how you prioritized the fix). Quantify the impact: how long had bad data been there, what decisions were potentially affected, what was the remediation. End with the systemic fix — not just patching the immediate issue but the monitoring or process that prevents recurrence. Avoid hero narratives; emphasize collaboration and what you learned. What interviewers look for: Interviewers are evaluating proactive ownership (did you find it or did the business find it?), communication under uncertainty (did you share preliminary findings responsibly?), and systemic thinking (did you fix the root cause?). At senior level, 'I fixed the bug' is not enough — they want to see you improved the system.

Question 10

Describe a situation where you had to push back on a stakeholder who wanted data faster than your team could deliver it reliably. How did you handle it?

Accepted Answer

Frame the answer around principled disagreement: you understood their business need, quantified the risk of moving faster (data quality, tech debt, team burnout), and proposed alternatives (phased delivery, a less-precise MVP with clear caveats, resource trade-offs). Show that you escalated or compromised where appropriate rather than being obstructionist. Demonstrate that the outcome was better because of the pushback — either quality was preserved, or a better scoping emerged. What interviewers look for: Can this person influence without authority? Do they understand the business context well enough to have an informed opinion? Do they know when to hold the line and when to adapt? This tests the leadership maturity expected at senior level — not just technical execution.

Question 11

Explain the differences between Slowly Changing Dimension Types 1, 2, and 6, and give a concrete example of when you'd use each in a B2B SaaS analytics context.

Accepted Answer

SCD1: overwrite old value, no history — use for correcting data entry errors (e.g., fixing a company name typo). SCD2: add a new row with effective dates, keep full history — use for tracking plan tier changes over time so revenue analysis can correctly attribute ARR to the plan a customer was on at the time of payment. SCD6 (hybrid 1+2+3): current value column on all rows plus versioned rows — use when you need both point-in-time accuracy and easy current-state lookups without joins, e.g., a customer dimension where you frequently query 'current segment' but also need historical correctness in cohort analysis. Discuss the query complexity SCD2 introduces (need to filter on effective dates or use a current_flag), and the storage cost vs. analytical benefit. What interviewers look for: Does the candidate know SCD2 implementation details (surrogate keys, effective date handling, the current_flag pattern)? Do they know SCD6 exists and when it's worth the complexity? Can they ground it in a real scenario rather than giving a textbook definition?

Question 12

What are the performance implications of using an EAV (Entity-Attribute-Value) schema versus a columnar approach for storing customer properties in an analytics warehouse, and when might EAV be justified?

Accepted Answer

EAV anti-patterns in analytics: unpivoting sparse attributes into rows destroys columnar compression, makes aggregation queries require self-joins or conditional aggregations (CASE WHEN attribute='revenue' THEN value END), and breaks type safety since all values are stored as strings. Query complexity grows with every new attribute. Columnar approach: wide tables with nullable columns — modern columnar engines (BigQuery, Snowflake, Redshift) handle sparse columns efficiently via run-length encoding and null compression, so the storage cost of wide tables is lower than it appears. When EAV is justified: when the attribute schema is genuinely dynamic and unpredictable (user-defined custom fields, product with 10k SKU-level attributes where >95% are null per row) — even then, prefer semi-structured columns (VARIANT/JSON) in modern warehouses over EAV. Mention that JSON/VARIANT gives you schema flexibility with better query ergonomics than EAV. What interviewers look for: Does the candidate know why EAV is almost always wrong in a warehouse context, and can they quantify the query performance impact? Do they know the modern alternative (semi-structured columns) rather than just saying 'use a wide table'? This tests real warehouse experience.

Senior Data Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Design a near-real-time analytics pipeline that ingests clickstream events from a high-traffic e-commerce site (100k events/sec peak) and makes aggregated metrics available to a dashboard within 60 seconds.

2. Your company's data warehouse has grown to 200 TB and dbt runs are taking 6+ hours, causing downstream SLA breaches. How do you diagnose and fix this?

3. You're building a data model to support both operational reporting (daily sales by region) and exploratory analysis (ad hoc slicing by 15+ dimensions). How do you structure the warehouse layer?

4. Given a table of user sessions with session_start, session_end, and user_id, write a query to identify users who had overlapping sessions (a bug in the session tracking system).

5. You have a PySpark job that reads a 10 TB Parquet dataset, joins it with a 500 MB lookup table, and is running out of memory on executors. How do you fix it?

6. How would you design a data quality framework for a pipeline that feeds a financial reporting system where incorrect data causes regulatory risk?

7. Compare Airflow, Prefect, and Dagster for orchestrating a complex pipeline with 500+ tasks, dynamic task generation, and strict data-aware scheduling. Which do you choose and why?

8. Explain how you would handle late-arriving data in a streaming pipeline computing hourly revenue aggregations, and what the tradeoffs are.

9. Tell me about a time you identified and fixed a significant data quality issue that was affecting business decisions before the business noticed. Walk me through what you did.

10. Describe a situation where you had to push back on a stakeholder who wanted data faster than your team could deliver it reliably. How did you handle it?

11. Explain the differences between Slowly Changing Dimension Types 1, 2, and 6, and give a concrete example of when you'd use each in a B2B SaaS analytics context.

12. What are the performance implications of using an EAV (Entity-Attribute-Value) schema versus a columnar approach for storing customer properties in an analytics warehouse, and when might EAV be justified?

Study tips