Question 1

Design a data platform that serves both real-time operational dashboards (sub-second latency) and large-scale batch analytics (petabyte-scale) from the same source of truth. Walk through the architecture.

Accepted Answer

Open with the Lambda vs. Kappa architectural debate and explain why you'd choose one over the other given the constraints. For Lambda: define the batch layer (e.g., Iceberg/Delta tables on S3 with Spark), speed layer (Kafka + Flink or ksqlDB), and serving layer (Druid or ClickHouse for OLAP, Redis for point lookups). For Kappa: argue for a single streaming spine (Kafka) with materialized views. Address consistency between layers: how do you handle late-arriving events, reprocessing, and schema evolution across both paths? Introduce a medallion architecture (bronze/silver/gold) and map which consumers read from which tier.

What interviewers look for

Can you articulate the real tradeoffs (operational complexity of Lambda vs. reprocessing cost of Kappa) rather than just naming technologies? Do you handle the consistency and correctness challenges, not just the happy path? Staff candidates are expected to define the architecture, not just describe options.
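As a rough illustration of the consistency question between the batch and speed layers, the sketch below shows one way a Lambda-style serving layer might merge the two views without double counting. The function name `merge_views`, the integer timestamps, and the `batch_cutoff` parameter are assumptions made for this example, not part of any specific product.

```python
from collections import defaultdict


def merge_views(batch_view, speed_events, batch_cutoff):
    """Combine batch totals with stream events newer than the batch cutoff.

    batch_view: dict of key -> total already computed by the batch layer
    speed_events: iterable of (key, event_time, value) from the speed layer
    batch_cutoff: latest event time already covered by the batch layer
    """
    merged = defaultdict(float, batch_view)
    for key, event_time, value in speed_events:
        # Events at or before the cutoff are already in the batch totals,
        # so counting them again would double count.
        if event_time > batch_cutoff:
            merged[key] += value
    return dict(merged)


batch_view = {"eu": 100.0, "us": 250.0}
speed_events = [("eu", 9, 5.0), ("eu", 11, 2.0), ("apac", 12, 7.0)]
result = merge_views(batch_view, speed_events, batch_cutoff=10)
print(result)
```

The cutoff is the design point worth discussing: it is what keeps the two layers consistent when the batch layer is recomputed and catches up with events the speed layer has already served.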

Question 2

Your company ingests data from 300 upstream microservices via change data capture (CDC). Design a system that ensures downstream data warehouse tables are always consistent, handles schema changes gracefully, and supports point-in-time recovery.

Accepted Answer

Start with CDC tooling (Debezium on Postgres/MySQL feeding Kafka). Address schema evolution as a first-class concern: Schema Registry (Confluent or Glue) with compatibility modes (BACKWARD, FULL). Describe how you handle schema changes without breaking downstream consumers: table formats such as Iceberg or Delta that support column adds and renames as schema-evolution operations, plus versioned schemas in the catalog. For consistency: discuss exactly-once semantics in Kafka + Flink or transactional writes to table formats. For point-in-time recovery: Iceberg's time-travel and snapshot isolation, combined with Kafka log retention as a replay source. Address the operational challenge of 300 sources: automated onboarding pipelines, metadata-driven ingestion frameworks, and contract testing with upstream teams.

What interviewers look for

Schema evolution handling and point-in-time recovery are the discriminating signals; most mid-level candidates skip them. Interviewers want to see that you've operated CDC systems at scale and know where they break (connector failures, Kafka lag, DDL events in binlog).
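To make the BACKWARD compatibility mode concrete, here is a minimal sketch of the rule it enforces: a consumer using the new schema must still be able to read data written with the old one. The dictionary schema representation and the function `backward_compatible` are hypothetical and deliberately simplified; a real registry also allows certain type promotions that this sketch rejects.

```python
def backward_compatible(old_schema, new_schema):
    """List reasons the new schema cannot read data written with the old one.

    Schemas are dicts of field name -> {"type": str, optional "default": value}.
    Simplified rules: added fields need a default, removed fields are fine,
    and any type change is treated as breaking.
    """
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif spec["type"] != old_schema[name]["type"]:
            problems.append(
                f"field '{name}' changed type "
                f"{old_schema[name]['type']} -> {spec['type']}"
            )
    return problems


old = {"id": {"type": "long"}, "email": {"type": "string"}}
safe_change = {**old, "plan": {"type": "string", "default": "free"}}
breaking_change = {**old, "plan": {"type": "string"}}

print(backward_compatible(old, safe_change))      # no problems expected
print(backward_compatible(old, breaking_change))  # one problem expected
```

Running a check like this when a producer registers a new schema is one way to stop a breaking change before it reaches the 300 downstream ingestion paths.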

Question 3

You're modeling a SaaS product's billing and usage data into a warehouse. Customers can change plans mid-month, have multiple subscriptions, and usage must be reconcilable with invoices. How do you model this?

Accepted Answer

Identify this as a slowly-changing dimension (SCD) problem with fact grain complexity. Model customer plans as SCD Type 2 (with effective_date, expiration_date, is_current). Discuss the grain of the usage fact table, event-level vs. daily aggregated, and argue for event-level with rollup tables on top to preserve auditability. For invoicing reconciliation: introduce a bridge table for subscription periods and usage allocations. Address the 'mid-month change' explicitly: proration logic belongs in a transformation layer, not the raw model. Discuss how dbt tests (row count, not-null, referential integrity on foreign keys, accepted-range checks on amounts) enforce correctness. Mention audit columns (loaded_at, source_file) for reconciliation.

What interviewers look for

Do you reach for SCD Type 2 naturally? Can you reason about grain choices and their downstream consequences? Interviewers check whether you understand that modeling decisions constrain every downstream analyst and BI tool. Staff-level candidates frame this as a product decision with long-term consequences, not just a SQL exercise.
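The SCD Type 2 mechanics in the answer can be sketched in a few lines. The example below assumes one plan per customer at a time (the multiple-subscription case would add a subscription key) and treats validity as a half-open interval, effective_date inclusive and expiration_date exclusive. The class `PlanRow` and both helper functions are illustrative names only.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PlanRow:
    customer_id: str
    plan: str
    effective_date: date
    expiration_date: Optional[date]  # None means the row is still open
    is_current: bool


def apply_plan_change(rows, customer_id, new_plan, change_date):
    """Close the customer's current row and append a new current row."""
    updated = []
    for row in rows:
        if row.customer_id == customer_id and row.is_current:
            row = replace(row, expiration_date=change_date, is_current=False)
        updated.append(row)
    updated.append(PlanRow(customer_id, new_plan, change_date, None, True))
    return updated


def plan_on(rows, customer_id, as_of):
    """Return the plan that was in effect for the customer on a given date."""
    for row in rows:
        starts_before = row.effective_date <= as_of
        still_open = row.expiration_date is None or as_of < row.expiration_date
        if row.customer_id == customer_id and starts_before and still_open:
            return row.plan
    return None


history = [PlanRow("c1", "basic", date(2024, 1, 1), None, True)]
history = apply_plan_change(history, "c1", "pro", date(2024, 1, 15))
print(plan_on(history, "c1", date(2024, 1, 14)))
print(plan_on(history, "c1", date(2024, 1, 15)))
```

Because history is never overwritten, a usage event dated mid-month can always be joined to the plan that was active at that moment, which is what makes invoice reconciliation possible.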

Question 4

A critical pipeline serving executive dashboards has a 99.9% SLA but currently fails silently about 2% of the time due to upstream data quality issues. How do you fix this without a full rewrite?

Accepted Answer

Frame this as a three-layer problem: detection, containment, and prevention. Detection: add data quality checks at ingestion (Great Expectations, dbt tests, or custom assertions) that fail loudly; instrument pipeline runs with metadata (row counts, null rates, value distributions) and alert on deviation from a rolling baseline. Containment: implement circuit-breaker logic, so that if quality checks fail, the pipeline stops and surfaces the last known-good data with a staleness indicator rather than silently propagating bad data. Prevention: establish data contracts with upstream owners (schema, volume, freshness SLAs documented and tested); set up dead-letter queues for records failing validation with dashboards showing rejection rates. Track MTTR and pipeline error budgets explicitly. Tie this back to organizational process: who is on-call for data quality, and how is incident severity defined?

What interviewers look for

The silent failure aspect is the key signal: staff candidates recognize that silent failures are worse than loud ones and design for observability first. Interviewers look for operational maturity: do you think in terms of SLAs, error budgets, and on-call runbooks, or just 'add more tests'?
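The detection and containment layers can be illustrated together. This is a minimal sketch under stated assumptions: the only quality signal is row count, the baseline is a simple mean of recent runs, and the 50% tolerance is an arbitrary placeholder. The names `row_count_ok` and `publish` are hypothetical.

```python
from statistics import mean


def row_count_ok(history, current, tolerance=0.5):
    """True if the current row count is within tolerance of the rolling mean."""
    if not history:
        return True  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return abs(current - baseline) <= tolerance * baseline


def publish(new_batch, last_good, history):
    """Circuit breaker: return (data_to_serve, is_stale).

    If the new batch fails the check, keep serving the last known-good data
    and mark it stale instead of silently propagating a bad batch.
    """
    if row_count_ok(history, len(new_batch)):
        return new_batch, False
    return last_good, True


history = [1000, 1020, 980]
last_good = list(range(1000))

served, stale = publish(list(range(1010)), last_good, history)
print(len(served), stale)

served_bad, stale_bad = publish(list(range(100)), last_good, history)
print(len(served_bad), stale_bad)
```

The staleness flag is the part that turns a silent failure into a visible one: the dashboard can show "data as of the last good run" rather than wrong numbers.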

Question 5

Write a query to calculate, for each customer, their rolling 30-day revenue, their percentile rank within their industry segment for that metric, and flag customers who have declined more than 20% from their 90-day peak. Use window functions throughout.

Accepted Answer

Structure the query in CTEs: (1) daily_revenue, aggregating raw events; (2) rolling_metrics, using SUM() OVER (PARTITION BY customer_id ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) for 30-day revenue and MAX() OVER a 90-day window for the peak; (3) segmented_ranks, using PERCENT_RANK() or NTILE() OVER (PARTITION BY industry_segment ORDER BY rolling_30d_revenue), computed per date so that customers are compared on the same day; (4) a final SELECT that joins these and adds the flag: CASE WHEN rolling_30d_revenue < 0.8 * peak_90d_revenue THEN true ELSE false END. Discuss the edge cases: a ROWS frame counts rows, not days, so sparse dates require generating a date spine and LEFT JOINing onto it; customers who joined less than 30 days ago need careful handling; PERCENT_RANK() returns 0 for a segment with one customer, which is not a meaningful rank.

What interviewers look for

Fluency with window functions is table stakes; what separates staff is handling edge cases unprompted (sparse data, new customers, singleton segments) and structuring the query readably with CTEs. Interviewers also probe whether you'd push this logic into a dbt model with proper testing rather than leaving it in an ad-hoc query.
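One possible shape for the query is sketched below, run against an in-memory SQLite database so it is self-contained (it assumes a SQLite build with window-function support, version 3.25 or later). The table `revenue_events`, its columns, and the sample rows are invented for illustration. The sketch interprets the 90-day peak as the maximum of the rolling 30-day revenue and ranks customers as of the latest date only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_events "
    "(customer_id TEXT, industry_segment TEXT, event_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO revenue_events VALUES (?, ?, ?, ?)",
    [
        ("a", "retail", "2024-01-01", 100.0),
        ("a", "retail", "2024-03-01", 10.0),
        ("b", "retail", "2024-03-10", 50.0),
        ("c", "health", "2024-02-20", 30.0),
    ],
)

QUERY = """
WITH RECURSIVE spine(day) AS (            -- one row per calendar day
    SELECT MIN(event_date) FROM revenue_events
    UNION ALL
    SELECT date(day, '+1 day') FROM spine
    WHERE day < (SELECT MAX(event_date) FROM revenue_events)
),
customers AS (
    SELECT DISTINCT customer_id, industry_segment FROM revenue_events
),
daily_revenue AS (                        -- dense grid, zero on quiet days
    SELECT c.customer_id, c.industry_segment, s.day,
           COALESCE(SUM(e.amount), 0) AS revenue
    FROM customers c
    CROSS JOIN spine s
    LEFT JOIN revenue_events e
      ON e.customer_id = c.customer_id AND e.event_date = s.day
    GROUP BY c.customer_id, c.industry_segment, s.day
),
rolling AS (
    SELECT *, SUM(revenue) OVER (
        PARTITION BY customer_id ORDER BY day
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30d
    FROM daily_revenue
),
peaks AS (
    SELECT *, MAX(rolling_30d) OVER (
        PARTITION BY customer_id ORDER BY day
        ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS peak_90d
    FROM rolling
),
latest AS (
    SELECT * FROM peaks WHERE day = (SELECT MAX(day) FROM spine)
)
SELECT customer_id,
       rolling_30d,
       PERCENT_RANK() OVER (
           PARTITION BY industry_segment ORDER BY rolling_30d) AS pct_rank,
       CASE WHEN rolling_30d < 0.8 * peak_90d THEN 1 ELSE 0 END AS declined
FROM latest
ORDER BY customer_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The date spine is what makes the ROWS frames safe: with one row per customer per day, 29 preceding rows really are 29 preceding days. Customer "c" sits alone in its segment, so its percentile rank comes back as 0, which is the singleton case the answer calls out.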

Question 6

You have a PySpark job that reads a 10TB dataset, performs multiple joins and aggregations, and writes to Parquet. It's taking 4 hours and the cluster is underutilized at 30%. Diagnose and fix it.

Accepted Answer

Start with diagnosis: read the Spark UI and check for data skew in join stages (a few tasks taking 10× longer than the rest), shuffle read/write sizes, and GC overhead. Likely culprits: (1) Data skew: identify skewed keys with a frequency analysis, then apply salting (add a random salt from 0 to N-1 to the key on the skewed side, replicate each row of the smaller side once per salt value, join on the salted key, then aggregate and strip the salt). (2) Shuffle explosion: check whether small tables are being shuffle-joined when they could be broadcast; use broadcast() hints for tables under the broadcast threshold. (3) Suboptimal partitioning: repartitioning is itself a shuffle, so repartition on the join key after reading only when that partitioning is reused by several joins or aggregations. (4) Predicate pushdown: ensure filters are applied before joins, not after. (5) File format: confirm Parquet with snappy compression and check the partition strategy on write (too many small files or too few large ones). Address the 30% utilization: it is likely caused by serialized stages or by skew producing stragglers, so fix skew first.

What interviewers look for

The 30% utilization clue is intentional: it rules out 'just add more nodes' and forces a diagnosis-first approach. Interviewers want to see Spark UI fluency and the ability to connect symptoms (stragglers, skew, shuffle) to specific fixes. Staff candidates explain the tradeoffs of each fix, not just list them.
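To show why salting removes stragglers, the sketch below simulates hash partitioning in plain Python rather than running Spark. It is an illustration of the idea only: `partition_of` is a stand-in for a shuffle partitioner, and the key names, partition count, and salt count are made up. It also checks that the salted join returns the same rows as the unsalted one once the smaller side is replicated per salt value.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS, NUM_SALTS = 8, 16


def partition_of(key, num_partitions):
    # Deterministic stand-in for a shuffle hash partitioner.
    return zlib.crc32(key.encode()) % num_partitions


def max_partition_load(keys, num_partitions):
    """Size of the largest partition, i.e. the straggler task."""
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


rng = random.Random(0)

# Skewed fact side: a single hot key accounts for 90% of the rows.
fact_keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
dim = {"hot": "Hot Corp", **{f"k{i}": f"Co {i}" for i in range(1000)}}

# Salt the skewed side with a random suffix in [0, NUM_SALTS).
salted_fact_keys = [f"{k}#{rng.randrange(NUM_SALTS)}" for k in fact_keys]

# Replicate the smaller side once per salt value so every salted key matches.
salted_dim = {f"{k}#{s}": v for k, v in dim.items() for s in range(NUM_SALTS)}

unsalted_max = max_partition_load(fact_keys, NUM_PARTITIONS)
salted_max = max_partition_load(salted_fact_keys, NUM_PARTITIONS)

joined = [salted_dim[k] for k in salted_fact_keys]
expected = [dim[k] for k in fact_keys]

print("largest partition without salting:", unsalted_max)
print("largest partition with salting:   ", salted_max)
print("join results identical:", joined == expected)
```

The tradeoff is visible in the sizes: the smaller side grows by a factor of NUM_SALTS, so salting pays off only when the replicated side stays cheap compared with the straggler it removes.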

Question 7

You've identified that three different product teams have independently built nearly identical ETL pipelines, each with different quality standards and no shared infrastructure. How do you drive consolidation without creating a monolith or losing team autonomy?

Accepted Answer

Frame as a platform strategy problem. First, audit the three pipelines to identify what's actually shared (ingestion patterns, quality checks, scheduling, lineage) vs. genuinely different (business logic, SLAs). Propose a thin internal platform: shared libraries for common patterns (ingestion connectors, quality assertion frameworks, observability hooks) that teams opt into, not mandatory rewrites. Introduce a Data Mesh-influenced model: teams own their domain pipelines but must conform to platform contracts (schema registration, lineage emission, SLA declaration). Avoid the monolith trap by making the platform an enabler, not a gatekeeper. Drive adoption through a working group with representatives from each team, dogfood the platform on one pipeline first, and publish migration guides. Measure success by reduction in duplicated code, improvement in aggregate quality metrics, and time-to-production for new pipelines.

What interviewers look for

This is a technical leadership and influence question. Interviewers are checking whether you default to centralization (monolith trap) or abdication (do nothing). The signal is whether you can navigate organizational dynamics, building trust with teams who built the existing systems, while still driving a principled technical outcome.

Question 8

An ML team wants to use your data platform's feature store tables directly in production inference, bypassing the batch pipeline and serving features at <50ms latency. Your tables were never designed for this. How do you respond?

Accepted Answer

Don't say no immediately: this is a real need with legitimate business value. Start by understanding the actual requirements: which features, what QPS, latency budget, consistency requirements (can they tolerate stale features?). Evaluate options: (1) If features can be precomputed, materialize them into a low-latency serving store (Redis or DynamoDB, optionally managed through a feature-store framework such as Feast) as a new product, not a hack on existing tables. (2) If features need online freshness, this requires a streaming feature pipeline separate from the batch layer. (3) If the batch tables are technically serviceable with caching, quantify the risk (consistency, schema changes breaking inference silently). Frame the right answer as a joint design session with the ML platform team, align on who owns the online serving layer, and write an RFC or design doc. Agree on a contract: your team guarantees feature correctness and freshness SLAs; they own the serving infrastructure.

What interviewers look for

Cross-functional technical maturity: can you engage with ML requirements without being dismissive or over-promising? Staff candidates drive toward a durable architecture (not a workaround), clarify ownership boundaries, and produce a written design artifact rather than an ad-hoc decision.

Question 9

Design a streaming pipeline that computes fraud risk scores for payment events in real time. The model requires features computed over the last 24 hours of user activity. Events can arrive up to 2 hours late.

Accepted Answer

Use Kafka as the event backbone. Compute stateful features (transaction count, total amount, merchant diversity over 24h) in Flink using event-time processing, with a watermark that lags the maximum observed event time by 2 hours to handle late arrivals. Keep state in Flink's RocksDB state backend, which supports state larger than memory, and rely on checkpoints to durable storage for fault tolerance. For features requiring historical lookups beyond the Flink window (e.g., lifetime user behavior), implement an enrichment step that queries a low-latency feature store (Redis). Discuss the late-arrival strategy: with a 2-hour watermark, events arriving later than that are either dropped (document the business decision) or routed to a side output for offline reprocessing and model re-scoring. Address exactly-once semantics end-to-end: Kafka transactions + Flink checkpointing + idempotent writes to the scoring output topic. Mention that feature values used for a score must be logged alongside the score for model monitoring and retraining.

What interviewers look for

Late-arrival handling and state management are the discriminating signals. Interviewers want to see that you understand event-time vs. processing-time semantics and can make the watermark tradeoff explicit (latency vs. completeness). Logging features with scores for model monitoring is a staff-level insight most candidates miss.
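The late-arrival routing can be sketched without a streaming engine. The function below is a simplified plain-Python model of a bounded-out-of-orderness watermark, not Flink's API: the watermark trails the largest event time seen so far by `max_lateness`, and any event older than the watermark goes to a side output. Times are minutes, and the function and variable names are hypothetical.

```python
def route_events(events, max_lateness):
    """Split events into on-time and late using a trailing watermark.

    events: iterable of (event_time, payload) in arrival order.
    max_lateness: how far the watermark lags the max event time seen.
    """
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        if event_time < watermark:
            # Too late for the live features: keep it for offline
            # reprocessing and re-scoring instead of dropping it silently.
            late.append((event_time, payload))
        else:
            on_time.append((event_time, payload))
        watermark = max(watermark, event_time - max_lateness)
    return on_time, late


arrivals = [(0, "a"), (200, "b"), (90, "c"), (70, "d")]
on_time, late = route_events(arrivals, max_lateness=120)
print("on time:", [p for _, p in on_time])
print("late:   ", [p for _, p in late])
```

After event "b" the watermark sits at minute 80, so "c" (minute 90) is still accepted while "d" (minute 70) is routed to the side output. A larger `max_lateness` accepts more events but delays the point at which results can be treated as complete, which is the latency versus completeness tradeoff.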

Question 10

How would you implement data contracts across a large organization where upstream services are owned by backend engineers who don't prioritize data quality?

Accepted Answer

Reframe data quality as a reliability concern that backend engineers already care about, not a data team problem. Propose contracts as machine-readable schemas (JSON Schema or Protobuf) checked in to the upstream service's repo, covering: field names and types, nullability, volume expectations (rows per hour), and freshness SLAs. Automate contract validation in the upstream service's CI/CD pipeline so that a schema-breaking change fails their build before it reaches production. Establish a data consumer score (how many downstream pipelines depend on each table) visible to engineering managers to create organizational incentive. For existing systems without contracts, use statistical profiling (Great Expectations, Monte Carlo) to infer implicit contracts and surface violations as incidents with clear ownership. Governance: data contracts owned jointly by producer and consumer, reviewed at launch review, and tracked as a reliability metric in engineering OKRs.

What interviewers look for

Staff-level signal is in the organizational strategy, not just the technical tooling. Can you create incentives that align backend engineers without requiring a mandate from above? Interviewers check whether you've actually navigated this problem, which is one of the hardest in data engineering, and whether your solution is durable or just a process overlay.
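A contract check of the kind that could run in the producer's CI might look roughly like this. The contract layout, the field names, and `validate_batch` are all assumptions for the sake of the example; a real setup would more likely load a JSON Schema or Protobuf definition from the service's repo.

```python
# Hypothetical contract: field types, nullability, and a volume expectation.
CONTRACT = {
    "fields": {
        "order_id": {"type": int, "nullable": False},
        "amount": {"type": float, "nullable": False},
        "coupon": {"type": str, "nullable": True},
    },
    "min_rows_per_hour": 2,
}


def validate_batch(records, contract):
    """Return human-readable violations for one hour of records."""
    violations = []
    if len(records) < contract["min_rows_per_hour"]:
        violations.append(
            f"volume: {len(records)} rows < {contract['min_rows_per_hour']}"
        )
    for i, record in enumerate(records):
        for name, rule in contract["fields"].items():
            value = record.get(name)
            if value is None:
                if not rule["nullable"]:
                    violations.append(f"row {i}: '{name}' is null")
            elif not isinstance(value, rule["type"]):
                violations.append(
                    f"row {i}: '{name}' is {type(value).__name__}, "
                    f"expected {rule['type'].__name__}"
                )
    return violations


good = [
    {"order_id": 1, "amount": 9.5, "coupon": None},
    {"order_id": 2, "amount": 3.0, "coupon": "SPRING"},
]
bad = [{"order_id": "3", "amount": None}]

print(validate_batch(good, CONTRACT))
for violation in validate_batch(bad, CONTRACT):
    print(violation)
```

Failing the producer's build on a non-empty violation list is what moves the cost of a breaking change to the team that makes it, before any downstream pipeline sees the data.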

Question 11

Describe a technical decision you made that turned out to be wrong. How did you discover it, how did you handle it, and what would you do differently?

Accepted Answer

Choose a real, significant example, not a trivial bug. Structure: the context and why the original decision seemed right (show you had good reasoning at the time), the signal that revealed the mistake (ideally something you could have detected earlier), the concrete impact (be specific about cost, delay, or downstream harm), and what you did to remediate. The 'what would I do differently' section should be technically specific: not just 'I'd consult more people' but 'I'd have written a one-page RFC and stress-tested the partition strategy before committing to the table format.' Demonstrate that you updated your decision-making process, not just your opinion on the specific technology.

What interviewers look for

Interviewers are evaluating self-awareness, intellectual honesty, and learning velocity. Red flags: choosing a trivial example, being vague about impact, or framing the failure as someone else's fault. Green flags: specific impact quantification, clear description of what the right decision would have been, and evidence that your process changed as a result.

Question 12

Tell me about a time you influenced a significant technical decision you didn't have direct authority over — for example, convincing a platform or infrastructure team to change something that affected your data systems.

Accepted Answer

Use a structured narrative: the technical problem you identified, why solving it required another team's action, how you built the case (data, prototypes, documented tradeoffs, not just advocacy), how you navigated disagreement or competing priorities, and the outcome. Quantify wherever possible. The strongest answers show that you invested in understanding the other team's constraints before asking them to change, and that you offered to share the implementation burden.

What interviewers look for

Staff engineers must influence without authority, and this question directly evaluates that. Interviewers look for evidence that you build technical credibility across teams, not just within your own. Red flags: framing influence as 'I escalated to my manager' or describing a situation where you simply had authority. Green flags: a genuine constraint-navigation story with a technical artifact (doc, proof-of-concept, benchmark) at its center.

Staff Data Engineer Interview Questions

What to expect

12 questions, with how to answer them

1. Design a data platform that serves both real-time operational dashboards (sub-second latency) and large-scale batch analytics (petabyte-scale) from the same source of truth. Walk through the architecture.

2. Your company ingests data from 300 upstream microservices via change data capture (CDC). Design a system that ensures downstream data warehouse tables are always consistent, handles schema changes gracefully, and supports point-in-time recovery.

3. You're modeling a SaaS product's billing and usage data into a warehouse. Customers can change plans mid-month, have multiple subscriptions, and usage must be reconcilable with invoices. How do you model this?

4. A critical pipeline serving executive dashboards has a 99.9% SLA but currently fails silently about 2% of the time due to upstream data quality issues. How do you fix this without a full rewrite?

5. Write a query to calculate, for each customer, their rolling 30-day revenue, their percentile rank within their industry segment for that metric, and flag customers who have declined more than 20% from their 90-day peak. Use window functions throughout.

6. You have a PySpark job that reads a 10TB dataset, performs multiple joins and aggregations, and writes to Parquet. It's taking 4 hours and the cluster is underutilized at 30%. Diagnose and fix it.

7. You've identified that three different product teams have independently built nearly identical ETL pipelines, each with different quality standards and no shared infrastructure. How do you drive consolidation without creating a monolith or losing team autonomy?

8. An ML team wants to use your data platform's feature store tables directly in production inference, bypassing the batch pipeline and serving features at <50ms latency. Your tables were never designed for this. How do you respond?

9. Design a streaming pipeline that computes fraud risk scores for payment events in real time. The model requires features computed over the last 24 hours of user activity. Events can arrive up to 2 hours late.

10. How would you implement data contracts across a large organization where upstream services are owned by backend engineers who don't prioritize data quality?

11. Describe a technical decision you made that turned out to be wrong. How did you discover it, how did you handle it, and what would you do differently?

12. Tell me about a time you influenced a significant technical decision you didn't have direct authority over — for example, convincing a platform or infrastructure team to change something that affected your data systems.

Study tips