Data Pipeline Projects | ETL & Analytics PM
Manage data engineering work with clear workflows for ETL pipelines and data quality. GitScrum coordinates data engineers, analysts, and stakeholders.
Data pipelines require careful coordination between data engineers, analysts, and stakeholders. GitScrum helps teams manage data infrastructure projects with clear workflows.
Pipeline Development
Pipeline Task Structure
DATA PIPELINE TASK BREAKDOWN:

PIPELINE EPIC:
  DATA-100: Customer Analytics Pipeline

  Goal: Daily customer metrics for analytics

  Source: Production DB (customers, orders)
  Target: Analytics warehouse
  Schedule: Daily at 2 AM

  Tasks:
  ├── DATA-101: Extract customer data
  ├── DATA-102: Extract order data
  ├── DATA-103: Transform customer metrics
  ├── DATA-104: Load to warehouse
  ├── DATA-105: Data quality checks
  ├── DATA-106: Alerting and monitoring
  └── DATA-107: Documentation

INDIVIDUAL STAGE TASK:
  DATA-103: Transform customer metrics

  Input: Raw customer + order data

  Output metrics:
  • total_orders per customer
  • total_revenue per customer
  • avg_order_value
  • first_order_date
  • last_order_date
  • days_since_last_order

  Business rules:
  • Only count completed orders
  • Exclude test accounts
  • Handle refunds correctly

  Testing:
  - Unit tests for each calculation
  - Edge case: no orders
  - Edge case: all refunded
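The DATA-103 transform can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Order` and `CustomerMetrics` shapes, the `status == "completed"` value, and the function name are all hypothetical. The refund rule used here (revenue is net of refunds, and a fully refunded order is not counted) is only one possible reading of "handle refunds correctly"; the real rule should come from the stakeholders.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Order:
    # Hypothetical source shape; real column names will differ.
    customer_id: str
    amount: Decimal
    refunded: Decimal
    status: str
    order_date: date


@dataclass(frozen=True)
class CustomerMetrics:
    customer_id: str
    total_orders: int
    total_revenue: Decimal
    avg_order_value: Decimal
    first_order_date: Optional[date]
    last_order_date: Optional[date]
    days_since_last_order: Optional[int]


def transform_customer_metrics(customer_ids, orders, test_accounts, as_of):
    """Compute one CustomerMetrics row per non-test customer."""
    by_customer = {}
    for o in orders:
        # Business rules: completed orders only. Assumed refund rule:
        # a fully refunded order does not count as an order.
        if o.status != "completed" or o.refunded >= o.amount:
            continue
        by_customer.setdefault(o.customer_id, []).append(o)

    rows = []
    for cid in customer_ids:
        if cid in test_accounts:  # business rule: exclude test accounts
            continue
        kept = by_customer.get(cid, [])
        count = len(kept)
        # Revenue is net of partial refunds (assumption of this sketch).
        revenue = sum((o.amount - o.refunded for o in kept), Decimal("0"))
        avg = (revenue / count).quantize(Decimal("0.01")) if count else Decimal("0.00")
        first = min((o.order_date for o in kept), default=None)
        last = max((o.order_date for o in kept), default=None)
        days = (as_of - last).days if last else None
        rows.append(CustomerMetrics(cid, count, revenue, avg, first, last, days))
    return rows
```

Both edge cases from the testing checklist (no orders, all refunded) produce a row with zero orders, zero revenue, and null dates, which is why the date and recency fields are optional.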
Data Contracts
DATA CONTRACT DEFINITION:

WHY DATA CONTRACTS:
  • Define expected schema
  • Catch breaking changes early
  • Enable parallel development
  • Document data flow

CONTRACT TASK:
  DATA-110: Define customer_metrics contract

  Schema: customer_metrics
    customer_id:            STRING (required)
    total_orders:           INTEGER (>= 0)
    total_revenue:          DECIMAL(10,2)
    avg_order_value:        DECIMAL(10,2)
    first_order_date:       DATE (nullable)
    last_order_date:        DATE (nullable)
    days_since_last_order:  INTEGER (nullable)
    computed_at:            TIMESTAMP (required)

  Constraints:
  • No nulls in customer_id
  • total_orders >= 0
  • first_order_date <= last_order_date

  Consumers:
  • BI dashboard
  • ML churn model
  • Marketing automation

BREAKING CHANGES:
  • Removing columns
  • Changing types
  • Changing semantics
  → Require consumer coordination
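A contract like this can also be checked in code. The sketch below is one possible approach using only the Python standard library. The `CONTRACT` mapping, `validate_row`, and `breaking_changes` are hypothetical names, the SQL-to-Python type mapping is an assumption, and columns whose nullability the schema does not state are treated as non-null. Precision and scale checks for DECIMAL(10,2) are left out for brevity.

```python
from datetime import date, datetime
from decimal import Decimal

# Column name -> (Python type, nullable). Mirrors the customer_metrics
# schema; the type mapping itself is an assumption of this sketch.
CONTRACT = {
    "customer_id": (str, False),
    "total_orders": (int, False),
    "total_revenue": (Decimal, False),
    "avg_order_value": (Decimal, False),
    "first_order_date": (date, True),
    "last_order_date": (date, True),
    "days_since_last_order": (int, True),
    "computed_at": (datetime, False),
}


def validate_row(row: dict) -> list[str]:
    """Return the contract violations for one row (empty list if valid)."""
    errors = []
    for name, (expected, nullable) in CONTRACT.items():
        if name not in row:
            errors.append(f"{name}: missing column")
            continue
        value = row[name]
        if value is None:
            if not nullable:
                errors.append(f"{name}: null not allowed")
        elif type(value) is not expected:
            # Exact type match, so a datetime is not accepted as a date.
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    if errors:
        return errors  # skip constraint checks on malformed rows

    if row["total_orders"] < 0:
        errors.append("total_orders: must be >= 0")
    first, last = row["first_order_date"], row["last_order_date"]
    if first is not None and last is not None and first > last:
        errors.append("first_order_date: must be <= last_order_date")
    return errors


def breaking_changes(old: dict, new: dict) -> list[str]:
    """List removed columns and changed types between two contract versions."""
    issues = []
    for name, (old_type, _) in old.items():
        if name not in new:
            issues.append(f"removed column: {name}")
        elif new[name][0] is not old_type:
            issues.append(f"type changed: {name}")
    return issues
```

Removed columns and changed types can be detected mechanically, as `breaking_changes` does. Changed semantics cannot, which is one reason the contract lists its consumers: someone has to tell them.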
Data Quality
Quality Checks
DATA QUALITY IMPLEMENTATION:

DATA QUALITY TASK:
  DATA-105: Data quality checks

  COMPLETENESS:
  - No null customer_ids
  - All customers have computed_at
  - Row count within expected range

  FRESHNESS:
  - computed_at within last 24 hours
  - last_order_date not in future

  ACCURACY:
  - total_revenue = sum of order amounts
  - avg_order_value = revenue / orders
  - Spot check against source

  CONSISTENCY:
  - Customer exists in customer table
  - No duplicate customer_ids

  REASONABLENESS:
  - avg_order_value within expected range
  - No negative revenue
  - Metric changes within bounds vs yesterday

QUALITY GATE:

  Pipeline succeeds only if:
  • All completeness checks pass
  • All accuracy checks pass
  • Freshness is acceptable

  Warnings (don't block):
  • Reasonableness thresholds exceeded
  • → Alert, but load data
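The split between blocking checks and warnings can be expressed as a small gate function. The sketch below implements a handful of the DATA-105 checks in plain Python. `CheckResult`, `run_checks`, `evaluate_gate`, and the `max_avg_order_value` threshold are hypothetical. The gate does not say how consistency checks are treated, so this sketch assumes everything except reasonableness blocks the load.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class CheckResult:
    name: str
    category: str  # completeness, freshness, accuracy, consistency, reasonableness
    passed: bool


# Assumption: only reasonableness failures are warnings; the rest block.
WARN_ONLY = {"reasonableness"}


def _avg_matches(row):
    """avg_order_value must equal revenue / orders, within one cent."""
    if row["total_orders"] == 0:
        return True
    expected = row["total_revenue"] / row["total_orders"]
    return abs(row["avg_order_value"] - expected) <= Decimal("0.01")


def run_checks(rows, now, max_avg_order_value=Decimal("10000")):
    """Run a subset of the quality checks over already-transformed rows."""
    ids = [r["customer_id"] for r in rows]
    return [
        CheckResult("no null customer_ids", "completeness",
                    all(i is not None for i in ids)),
        CheckResult("computed_at within last 24 hours", "freshness",
                    all(r["computed_at"] is not None
                        and now - r["computed_at"] <= timedelta(hours=24)
                        for r in rows)),
        CheckResult("avg_order_value = revenue / orders", "accuracy",
                    all(_avg_matches(r) for r in rows)),
        CheckResult("no duplicate customer_ids", "consistency",
                    len(ids) == len(set(ids))),
        CheckResult("avg_order_value within expected range", "reasonableness",
                    all(Decimal("0") <= r["avg_order_value"] <= max_avg_order_value
                        for r in rows)),
    ]


def evaluate_gate(results):
    """Return (ok_to_load, blocking_failures, warnings)."""
    failures = [c.name for c in results
                if not c.passed and c.category not in WARN_ONLY]
    warnings = [c.name for c in results
                if not c.passed and c.category in WARN_ONLY]
    return (not failures), failures, warnings
```

A pipeline step would load the data when the first element of the result is true, and send an alert for anything in the warnings list either way.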
Quality Issues
DATA QUALITY BUG HANDLING:

DATA QUALITY INCIDENT:
  🔴 DATA-BUG-42: Duplicate customer records

  Discovered: 2024-01-15 monitoring alert
  Impact: BI dashboards showing inflated counts

  Root cause:
  Source system has duplicate customer_ids due to
  merge issue from acquisition.

  Immediate fix:
  - Add deduplication to transform
  - Rerun pipeline for affected dates
  - Notify consumers of corrected data

  Long-term fix:
  - Work with source team on deduplication
  - Add uniqueness check to quality gates

  Labels: data-quality, pipeline, high-priority

CLASSIFICATION:

  Critical: Data is wrong, decisions affected
  → Fix immediately, notify stakeholders

  High: Data incomplete, some queries fail
  → Fix within 1-2 days

  Medium: Data delayed, freshness impacted
  → Fix within sprint

  Low: Minor issues, workarounds exist
  → Add to backlog
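For an incident like DATA-BUG-42, the two code-level fixes (deduplication in the transform and a uniqueness check for the quality gate) might look like the sketch below. The function names and the `updated_at` tie-breaker are hypothetical: the incident does not say which duplicate record should win, so "keep the most recently updated record" is an assumption to confirm with the source team.

```python
from collections import Counter


def find_duplicate_ids(records):
    """Uniqueness check: return customer_ids that appear more than once."""
    counts = Counter(r["customer_id"] for r in records)
    return sorted(cid for cid, n in counts.items() if n > 1)


def deduplicate(records, order_key="updated_at"):
    """Keep one record per customer_id: the one with the greatest order_key.

    The tie-breaker field is an assumption of this sketch. Output order
    follows the first appearance of each customer_id in the input.
    """
    latest = {}
    for r in records:
        cid = r["customer_id"]
        if cid not in latest or r[order_key] > latest[cid][order_key]:
            latest[cid] = r
    return list(latest.values())
```

Running `find_duplicate_ids` on the transform output as a blocking check makes a recurrence fail the pipeline instead of reaching the BI dashboards.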
Stakeholder Management
Requirements Gathering
DATA PROJECT REQUIREMENTS:

STAKEHOLDER INTAKE TASK:
  DATA-REQ-01: Customer metrics requirements

  Stakeholder: Marketing team
  Use case: Customer segmentation

  Questions answered:

  WHAT DATA:
  - What metrics do you need?
    → Order count, revenue, recency

  WHY:
  - What decisions will this inform?
    → Customer segmentation for campaigns

  HOW OFTEN:
  - How fresh does it need to be?
    → Daily is sufficient

  VOLUME:
  - How many records?
    → ~500K customers

  ACCESS:
  - Who needs access?
    → Marketing analysts, BI tool

  DEFINITIONS:
  - What counts as an "order"?
    → Completed, non-refunded orders

COMMON ISSUES TO CLARIFY:
  • Definition of metrics
  • Time zone handling
  • Currency conversion
  • Historical data needs
  • Backfill requirements
Maintenance
Ongoing Pipeline Work
PIPELINE MAINTENANCE:

MAINTENANCE TASKS (Recurring):

WEEKLY:
  DATA-MAINT: Weekly pipeline health check

  - Review pipeline failures
  - Check data quality trends
  - Review resource usage
  - Address any alerts

MONTHLY:
  DATA-MAINT: Monthly pipeline maintenance

  - Review and update documentation
  - Check for deprecated dependencies
  - Optimize slow queries
  - Clean up old data/logs
  - Review monitoring thresholds

QUARTERLY:
  DATA-MAINT: Quarterly pipeline review

  - Review overall architecture
  - Check schema evolution needs
  - Update dependencies
  - Review access controls
  - Stakeholder check-in (still meeting needs?)

ALLOCATE CAPACITY:
  ~20% of data engineering time for maintenance
  Don't just build new - maintain existing