ML Project Management | Experiments to Production
Manage ML projects from experiment to production. GitScrum systematically tracks timeboxed experiments, productionization tasks, and model monitoring.
9 min read
Machine learning projects differ from traditional software: experiments often fail, timelines are uncertain, and deployment is complex. GitScrum helps teams manage ML work effectively.
ML Project Phases
Phase Structure
ML PROJECT LIFECYCLE:

PHASE 1: PROBLEM DEFINITION
Duration: 1-2 weeks

Tasks:
□ Define business problem
□ Identify success metrics
□ Assess feasibility
□ Define MVP scope

Output: Go/no-go decision, project charter

PHASE 2: DATA PREPARATION
Duration: 2-4 weeks

Tasks:
□ Data collection
□ Data exploration
□ Feature engineering
□ Data pipeline creation
□ Train/test split

Output: Clean dataset, feature set, data pipeline

PHASE 3: EXPERIMENTATION
Duration: 2-6 weeks (timeboxed)

Tasks:
□ Baseline model
□ Experiment iterations
□ Model selection
□ Hyperparameter tuning

Output: Trained model meeting criteria (or decision to stop)

PHASE 4: PRODUCTIONIZATION
Duration: 2-4 weeks

Tasks:
□ Model serving infrastructure
□ Monitoring
□ A/B testing setup
□ Rollout

Output: Production model

PHASE 5: MAINTENANCE
Duration: Ongoing

Tasks:
□ Model monitoring
□ Drift detection
□ Retraining
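Phase 2's outputs (clean dataset, feature set, pipeline) are easiest to keep reproducible when the split and feature steps live in code rather than an ad hoc notebook cell. A minimal sketch with pandas and scikit-learn; the tiny inline dataset and column names are illustrative stand-ins for the real labeled data:

# Illustrative Phase 2 step: a reproducible split plus a feature pipeline.
# The inline dataset is a stand-in for the real labeled data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text": ["great product", "terrible support", "works fine", "broke after a day",
             "love it", "would not recommend", "decent value", "awful experience"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.25,
    stratify=df["label"],   # keep class balance in both splits
    random_state=42,        # fixed seed so the split is reproducible
)

# Feature engineering captured as a pipeline so the exact same transformation
# is reused at training time and later at inference time.
features = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2)))])
X_train_feats = features.fit_transform(X_train)
X_test_feats = features.transform(X_test)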
Experiment Management
Experiment Tasks
ML EXPERIMENT TASK:

EXPERIMENT STRUCTURE:

ML-EXP-05: Test BERT for sentiment classification

HYPOTHESIS:
Fine-tuned BERT will outperform current rule-based sentiment by 15%+ in F1 score.

BASELINE:
Current rule-based: 0.72 F1

SUCCESS CRITERIA:
≥ 0.85 F1 on test set

TIMEBOX:
1 week maximum

APPROACH:
□ Fine-tune bert-base-uncased
□ Use labeled training set (10K examples)
□ 5-fold cross validation
□ Compare with baseline

RESOURCES:
• GPU: 1x V100
• Training time: ~4 hours

STOPPING CONDITIONS:
• F1 < 0.75 after 3 epochs → Stop, try a different model
• Training diverges → Check data, restart
• Time exceeded → Document results, decide next step

EXPERIMENT OUTCOMES:

✅ SUCCESS:
Met criteria, proceed to productionization

⚠️ PARTIAL:
Some improvement, decide if worth continuing

❌ FAILURE:
Below baseline or not worth the complexity
→ Still valuable! Document learnings
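The approach and stopping conditions above translate directly into a small evaluation harness that reports success, partial, or stop against the task's thresholds. A self-contained sketch using scikit-learn; the synthetic data and logistic regression are placeholders, since a real ML-EXP-05 run would evaluate the fine-tuned BERT model instead:

# Illustrative evaluation harness for a timeboxed experiment like ML-EXP-05.
# Data and model are placeholders, not the actual BERT fine-tune.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

BASELINE_F1 = 0.72   # current rule-based system
SUCCESS_F1 = 0.85    # success criterion from the task
STOP_F1 = 0.75       # stopping-condition threshold

X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5, scoring="f1")  # 5-fold CV, as in the task
mean_f1 = scores.mean()

print(f"Candidate F1={mean_f1:.2f} vs baseline {BASELINE_F1}")
if mean_f1 >= SUCCESS_F1:
    print("SUCCESS: meets the 0.85 target, proceed to productionization")
elif mean_f1 < STOP_F1:
    print("STOP: below the stopping threshold, try a different model")
else:
    print("PARTIAL: decide whether further iteration is worth the remaining timebox")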
Tracking Results
EXPERIMENT RESULTS DOCUMENTATION:

UPDATE TASK WHEN COMPLETE:

ML-EXP-05: Test BERT for sentiment
Status: ✅ Complete

RESULTS:
Model       F1     Precision   Recall   Time
Baseline    0.72   0.70        0.74     -
BERT-base   0.89   0.87        0.91     4.2h

OUTCOME: ✅ Success - exceeded 0.85 target

LEARNINGS:
• BERT significantly outperformed rule-based
• 10K examples sufficient for this task
• GPU training practical for daily retraining

NEXT STEPS:
□ Create ML-PROD-01 for productionization

ARTIFACTS:
• MLflow run: [link]
• Model checkpoint: s3://models/bert-sentiment-v1
• Notebook: experiments/exp-05-bert.ipynb

FAILED EXPERIMENTS ARE VALUABLE:

ML-EXP-04: Test simpler logistic regression
Status: ❌ Did not meet criteria

RESULTS: F1 = 0.68 (below baseline)

LEARNINGS:
• Bag of words insufficient for nuanced sentiment
• Confirms need for contextual embeddings

→ Informs EXP-05 decision to try BERT
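Since the task links to an MLflow run, the same numbers can also be logged programmatically so the GitScrum task and the experiment tracker stay consistent. A minimal sketch, assuming an MLflow tracking setup is available; the experiment name, tag, and artifact path are illustrative:

# Minimal MLflow logging for an experiment like ML-EXP-05.
# Experiment name, tag, and artifact path are illustrative.
import mlflow

mlflow.set_experiment("sentiment-classification")

with mlflow.start_run(run_name="exp-05-bert-base"):
    mlflow.set_tag("gitscrum_task", "ML-EXP-05")   # link the run back to the task
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_param("train_examples", 10_000)
    mlflow.log_metric("f1", 0.89)
    mlflow.log_metric("precision", 0.87)
    mlflow.log_metric("recall", 0.91)
    mlflow.log_artifact("experiments/exp-05-bert.ipynb")  # illustrative path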
Productionization
From Experiment to Production
ML PRODUCTIONIZATION TASKS:

PRODUCTION EPIC:

ML-PROD-01: Deploy sentiment model

From: ML-EXP-05 (BERT sentiment model)

Infrastructure:
□ ML-PROD-01a: Model serving API
□ ML-PROD-01b: Inference optimization
□ ML-PROD-01c: Load testing

Monitoring:
□ ML-PROD-01d: Prediction logging
□ ML-PROD-01e: Performance dashboards
□ ML-PROD-01f: Drift detection

Rollout:
□ ML-PROD-01g: Shadow mode (compare to prod)
□ ML-PROD-01h: A/B test setup
□ ML-PROD-01i: Gradual rollout

Documentation:
□ ML-PROD-01j: Model card
□ ML-PROD-01k: Runbook

MODEL SERVING TASK:

ML-PROD-01a: Model serving API

Endpoint: POST /api/v1/sentiment

Requirements:
• Latency p99 < 100ms
• Throughput: 1000 req/sec
• Availability: 99.9%

Implementation:
□ TorchServe or TF Serving
□ Model quantization for speed
□ Batching for efficiency
□ Caching layer
□ Graceful degradation (fallback to rule-based)
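The graceful-degradation item is worth sketching: if the model backend is unavailable, the endpoint should fall back to the old rule-based scorer rather than fail the request. An illustrative FastAPI sketch under that assumption; predict_bert() and rule_based_sentiment() are hypothetical placeholders for the real inference call and the legacy system:

# Illustrative serving endpoint with fallback to the rule-based scorer.
# predict_bert() and rule_based_sentiment() are hypothetical placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SentimentRequest(BaseModel):
    text: str

def predict_bert(text: str) -> dict:
    # Placeholder: in production this would call TorchServe / TF Serving.
    raise NotImplementedError

def rule_based_sentiment(text: str) -> dict:
    # Placeholder for the existing rule-based system (the 0.72 F1 baseline).
    label = "positive" if "good" in text.lower() else "negative"
    return {"label": label, "source": "rule-based"}

@app.post("/api/v1/sentiment")
def sentiment(req: SentimentRequest):
    try:
        result = predict_bert(req.text)
        result["source"] = "bert"
    except Exception:
        # Graceful degradation: never fail the request, fall back instead.
        result = rule_based_sentiment(req.text)
    return result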
Monitoring and Maintenance
ML Monitoring
MODEL MONITORING TASKS:

ONGOING MONITORING:

PERFORMANCE METRICS:
• Prediction latency
• Throughput
• Error rate

MODEL QUALITY:
• Accuracy on labeled samples
• Prediction distribution
• Feature drift
• Label drift

DRIFT ALERT TASK:

🔴 ML-ALERT-12: Sentiment model drift detected

Alert: Prediction distribution shift
Positive predictions: 40% → 65% (past week)

Possible causes:
• Genuine shift in user sentiment
• Data pipeline issue
• Model degradation

Investigation:
□ Check data pipeline
□ Sample and manually label recent predictions
□ Compare input feature distributions

If model issue:
□ Retrain with recent data
□ A/B test new model
□ Roll out if better

RETRAINING SCHEDULE:

Regular retraining task (monthly):
□ Collect new labeled data
□ Retrain model
□ Evaluate vs production
□ Deploy if improved
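A drift alert like ML-ALERT-12 can come from a very simple check: compare the recent positive-prediction rate to a reference window and open a task when the shift exceeds a threshold. A sketch of that check; the 10-point threshold and the prediction log format are assumptions:

# Simple prediction-distribution drift check behind an alert like ML-ALERT-12.
# The log format and the 0.10 threshold are illustrative assumptions.
def positive_rate(predictions: list[str]) -> float:
    return sum(p == "positive" for p in predictions) / len(predictions)

def check_drift(reference_preds: list[str], recent_preds: list[str],
                max_shift: float = 0.10) -> bool:
    """Return True if the positive-prediction rate shifted by more than max_shift."""
    shift = abs(positive_rate(recent_preds) - positive_rate(reference_preds))
    return shift > max_shift

# Example: 40% positive in the reference window vs 65% in the past week
reference = ["positive"] * 40 + ["negative"] * 60
recent = ["positive"] * 65 + ["negative"] * 35
if check_drift(reference, recent):
    print("Drift detected: open an alert task and start the investigation checklist")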
Team Coordination
ML Team Structure
ML TEAM COORDINATION:

TYPICAL ML PROJECT ROLES:

Data Scientist:
• Experimentation
• Model development
• Feature engineering

ML Engineer:
• Productionization
• Model serving
• Pipeline automation

Data Engineer:
• Data pipelines
• Feature stores
• Data quality

Product Manager:
• Problem definition
• Success metrics
• Stakeholder coordination

HANDOFFS:

DS → ML Engineer (productionization):

Handoff includes:
□ Model checkpoint
□ Training code
□ Preprocessing pipeline
□ Performance requirements
□ Known limitations
□ Test cases

SPRINT BALANCE:
• Mix experiments with productionization
• Don't let experiments starve production work
• Don't let production work block all experimentation
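The test-cases item in the handoff is easiest to honor as runnable tests the ML engineer can keep in CI. A small illustrative example; preprocess() and predict() below are trivial stand-ins for whatever the real preprocessing pipeline and model interface look like:

# Illustrative handoff test cases from data scientist to ML engineer.
# preprocess() and predict() are trivial stand-ins for the real interfaces.
def preprocess(text: str) -> str:
    return text.lower().strip()

def predict(text: str) -> str:
    # Stand-in for the handed-off model; real tests would load the checkpoint.
    return "positive" if "love" in text.lower() else "negative"

def test_preprocessing_is_deterministic():
    text = "The product was great!"
    assert preprocess(text) == preprocess(text)

def test_known_positive_example():
    assert predict("I love this, it works perfectly.") == "positive"

def test_output_is_a_valid_label():
    # Pins the output contract; known limitations (e.g. sarcasm) belong in
    # the handoff notes rather than being asserted as correct here.
    assert predict("Oh great, it broke again.") in {"positive", "negative"}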