ML Project Management | Experiments to Production
Manage ML projects from experiment to production. GitScrum systematically tracks timeboxed experiments, productionization tasks, and model monitoring.
9 min read
Machine learning projects differ from traditional software: experiments often fail, timelines are uncertain, and deployment is complex. GitScrum helps teams manage ML work effectively.
ML Project Phases
Phase Structure
ML PROJECT LIFECYCLE:

PHASE 1: PROBLEM DEFINITION
Duration: 1-2 weeks

Tasks:
□ Define business problem
□ Identify success metrics
□ Assess feasibility
□ Define MVP scope

Output: Go/no-go decision, project charter

PHASE 2: DATA PREPARATION
Duration: 2-4 weeks

Tasks:
□ Data collection
□ Data exploration
□ Feature engineering
□ Data pipeline creation
□ Train/test split

Output: Clean dataset, feature set, data pipeline

PHASE 3: EXPERIMENTATION
Duration: 2-6 weeks (timeboxed)

Tasks:
□ Baseline model
□ Experiment iterations
□ Model selection
□ Hyperparameter tuning

Output: Trained model meeting criteria (or decision to stop)

PHASE 4: PRODUCTIONIZATION
Duration: 2-4 weeks

Tasks:
□ Model serving infrastructure
□ Monitoring
□ A/B testing setup
□ Rollout

Output: Production model

PHASE 5: MAINTENANCE
Duration: Ongoing

Tasks:
□ Model monitoring
□ Drift detection
□ Retraining
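Phase 2's outputs (clean dataset, feature set, pipeline) are easiest to keep reproducible when the split and feature steps live in code rather than an ad hoc notebook cell. A minimal sketch with pandas and scikit-learn; the tiny inline dataset and column names are illustrative stand-ins for the real labeled data:

# Illustrative Phase 2 step: a reproducible split plus a feature pipeline.
# The inline dataset is a stand-in for the real labeled data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text": ["great product", "terrible support", "works fine", "broke after a day",
             "love it", "would not recommend", "decent value", "awful experience"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.25,
    stratify=df["label"],   # keep class balance in both splits
    random_state=42,        # fixed seed so the split is reproducible
)

# Feature engineering captured as a pipeline so the exact same transformation
# is reused at training time and later at inference time.
features = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2)))])
X_train_feats = features.fit_transform(X_train)
X_test_feats = features.transform(X_test)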
Experiment Management
Experiment Tasks
ML EXPERIMENT TASK:

EXPERIMENT STRUCTURE:

ML-EXP-05: Test BERT for sentiment classification

HYPOTHESIS:
Fine-tuned BERT will outperform current rule-based sentiment by 15%+ in F1 score.

BASELINE:
Current rule-based: 0.72 F1

SUCCESS CRITERIA:
≥ 0.85 F1 on test set

TIMEBOX:
1 week maximum

APPROACH:
□ Fine-tune bert-base-uncased
□ Use labeled training set (10K examples)
□ 5-fold cross validation
□ Compare with baseline

RESOURCES:
• GPU: 1x V100
• Training time: ~4 hours

STOPPING CONDITIONS:
• F1 < 0.75 after 3 epochs → Stop, try a different model
• Training diverges → Check data, restart
• Time exceeded → Document results, decide next step

EXPERIMENT OUTCOMES:

✅ SUCCESS:
Met criteria, proceed to productionization

⚠️ PARTIAL:
Some improvement, decide if worth continuing

❌ FAILURE:
Below baseline or not worth the complexity
→ Still valuable! Document learnings
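The approach and stopping conditions above translate directly into a small evaluation harness that reports success, partial, or stop against the task's thresholds. A self-contained sketch using scikit-learn; the synthetic data and logistic regression are placeholders, since a real ML-EXP-05 run would evaluate the fine-tuned BERT model instead:

# Illustrative evaluation harness for a timeboxed experiment like ML-EXP-05.
# Data and model are placeholders, not the actual BERT fine-tune.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

BASELINE_F1 = 0.72   # current rule-based system
SUCCESS_F1 = 0.85    # success criterion from the task
STOP_F1 = 0.75       # stopping-condition threshold

X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5, scoring="f1")  # 5-fold CV, as in the task
mean_f1 = scores.mean()

print(f"Candidate F1={mean_f1:.2f} vs baseline {BASELINE_F1}")
if mean_f1 >= SUCCESS_F1:
    print("SUCCESS: meets the 0.85 target, proceed to productionization")
elif mean_f1 < STOP_F1:
    print("STOP: below the stopping threshold, try a different model")
else:
    print("PARTIAL: decide whether further iteration is worth the remaining timebox")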
Tracking Results
EXPERIMENT RESULTS DOCUMENTATION:

UPDATE TASK WHEN COMPLETE:

ML-EXP-05: Test BERT for sentiment
Status: ✅ Complete

RESULTS:
Model       F1     Precision   Recall   Time
Baseline    0.72   0.70        0.74     -
BERT-base   0.89   0.87        0.91     4.2h

OUTCOME: ✅ Success - exceeded 0.85 target

LEARNINGS:
• BERT significantly outperformed rule-based
• 10K examples sufficient for this task
• GPU training practical for daily retraining

NEXT STEPS:
□ Create ML-PROD-01 for productionization

ARTIFACTS:
• MLflow run: [link]
• Model checkpoint: s3://models/bert-sentiment-v1
• Notebook: experiments/exp-05-bert.ipynb

FAILED EXPERIMENTS ARE VALUABLE:

ML-EXP-04: Test simpler logistic regression
Status: ❌ Did not meet criteria

RESULTS: F1 = 0.68 (below baseline)

LEARNINGS:
• Bag of words insufficient for nuanced sentiment
• Confirms need for contextual embeddings

→ Informs EXP-05 decision to try BERT
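Since the task links to an MLflow run, the same numbers can also be logged programmatically so the GitScrum task and the experiment tracker stay consistent. A minimal sketch, assuming an MLflow tracking setup is available; the experiment name, tag, and artifact path are illustrative:

# Minimal MLflow logging for an experiment like ML-EXP-05.
# Experiment name, tag, and artifact path are illustrative.
import mlflow

mlflow.set_experiment("sentiment-classification")

with mlflow.start_run(run_name="exp-05-bert-base"):
    mlflow.set_tag("gitscrum_task", "ML-EXP-05")   # link the run back to the task
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_param("train_examples", 10_000)
    mlflow.log_metric("f1", 0.89)
    mlflow.log_metric("precision", 0.87)
    mlflow.log_metric("recall", 0.91)
    mlflow.log_artifact("experiments/exp-05-bert.ipynb")  # illustrative path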
Productionization
From Experiment to Production
ML PRODUCTIONIZATION TASKS:

PRODUCTION EPIC:

ML-PROD-01: Deploy sentiment model

From: ML-EXP-05 (BERT sentiment model)

Infrastructure:
□ ML-PROD-01a: Model serving API
□ ML-PROD-01b: Inference optimization
□ ML-PROD-01c: Load testing

Monitoring:
□ ML-PROD-01d: Prediction logging
□ ML-PROD-01e: Performance dashboards
□ ML-PROD-01f: Drift detection

Rollout:
□ ML-PROD-01g: Shadow mode (compare to prod)
□ ML-PROD-01h: A/B test setup
□ ML-PROD-01i: Gradual rollout

Documentation:
□ ML-PROD-01j: Model card
□ ML-PROD-01k: Runbook

MODEL SERVING TASK:

ML-PROD-01a: Model serving API

Endpoint: POST /api/v1/sentiment

Requirements:
• Latency p99 < 100ms
• Throughput: 1000 req/sec
• Availability: 99.9%

Implementation:
□ TorchServe or TF Serving
□ Model quantization for speed
□ Batching for efficiency
□ Caching layer
□ Graceful degradation (fallback to rule-based)
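The graceful-degradation item is worth sketching: if the model backend is unavailable, the endpoint should fall back to the old rule-based scorer rather than fail the request. An illustrative FastAPI sketch under that assumption; predict_bert() and rule_based_sentiment() are hypothetical placeholders for the real inference call and the legacy system:

# Illustrative serving endpoint with fallback to the rule-based scorer.
# predict_bert() and rule_based_sentiment() are hypothetical placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SentimentRequest(BaseModel):
    text: str

def predict_bert(text: str) -> dict:
    # Placeholder: in production this would call TorchServe / TF Serving.
    raise NotImplementedError

def rule_based_sentiment(text: str) -> dict:
    # Placeholder for the existing rule-based system (the 0.72 F1 baseline).
    label = "positive" if "good" in text.lower() else "negative"
    return {"label": label, "source": "rule-based"}

@app.post("/api/v1/sentiment")
def sentiment(req: SentimentRequest):
    try:
        result = predict_bert(req.text)
        result["source"] = "bert"
    except Exception:
        # Graceful degradation: never fail the request, fall back instead.
        result = rule_based_sentiment(req.text)
    return result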
Monitoring and Maintenance
ML Monitoring
MODEL MONITORING TASKS:

ONGOING MONITORING:

PERFORMANCE METRICS:
• Prediction latency
• Throughput
• Error rate

MODEL QUALITY:
• Accuracy on labeled samples
• Prediction distribution
• Feature drift
• Label drift

DRIFT ALERT TASK:

🔴 ML-ALERT-12: Sentiment model drift detected

Alert: Prediction distribution shift
Positive predictions: 40% → 65% (past week)

Possible causes:
• Genuine shift in user sentiment
• Data pipeline issue
• Model degradation

Investigation:
□ Check data pipeline
□ Sample and manually label recent predictions
□ Compare input feature distributions

If model issue:
□ Retrain with recent data
□ A/B test new model
□ Roll out if better

RETRAINING SCHEDULE:

Regular retraining task (monthly):
□ Collect new labeled data
□ Retrain model
□ Evaluate vs production
□ Deploy if improved
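A drift alert like ML-ALERT-12 can come from a very simple check: compare the recent positive-prediction rate to a reference window and open a task when the shift exceeds a threshold. A sketch of that check; the 10-point threshold and the prediction log format are assumptions:

# Simple prediction-distribution drift check behind an alert like ML-ALERT-12.
# The log format and the 0.10 threshold are illustrative assumptions.
def positive_rate(predictions: list[str]) -> float:
    return sum(p == "positive" for p in predictions) / len(predictions)

def check_drift(reference_preds: list[str], recent_preds: list[str],
                max_shift: float = 0.10) -> bool:
    """Return True if the positive-prediction rate shifted by more than max_shift."""
    shift = abs(positive_rate(recent_preds) - positive_rate(reference_preds))
    return shift > max_shift

# Example: 40% positive in the reference window vs 65% in the past week
reference = ["positive"] * 40 + ["negative"] * 60
recent = ["positive"] * 65 + ["negative"] * 35
if check_drift(reference, recent):
    print("Drift detected: open an alert task and start the investigation checklist")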
Team Coordination
ML Team Structure
ML TEAM COORDINATION:

TYPICAL ML PROJECT ROLES:

Data Scientist:
• Experimentation
• Model development
• Feature engineering

ML Engineer:
• Productionization
• Model serving
• Pipeline automation

Data Engineer:
• Data pipelines
• Feature stores
• Data quality

Product Manager:
• Problem definition
• Success metrics
• Stakeholder coordination

HANDOFFS:

DS → ML Engineer (productionization):

Handoff includes:
□ Model checkpoint
□ Training code
□ Preprocessing pipeline
□ Performance requirements
□ Known limitations
□ Test cases

SPRINT BALANCE:
• Mix experiments with productionization
• Don't let experiments starve production work
• Don't let production work block all experimentation
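The test-cases item in the handoff is easiest to honor as runnable tests the ML engineer can keep in CI. A small illustrative example; preprocess() and predict() below are trivial stand-ins for whatever the real preprocessing pipeline and model interface look like:

# Illustrative handoff test cases from data scientist to ML engineer.
# preprocess() and predict() are trivial stand-ins for the real interfaces.
def preprocess(text: str) -> str:
    return text.lower().strip()

def predict(text: str) -> str:
    # Stand-in for the handed-off model; real tests would load the checkpoint.
    return "positive" if "love" in text.lower() else "negative"

def test_preprocessing_is_deterministic():
    text = "The product was great!"
    assert preprocess(text) == preprocess(text)

def test_known_positive_example():
    assert predict("I love this, it works perfectly.") == "positive"

def test_output_is_a_valid_label():
    # Pins the output contract; known limitations (e.g. sarcasm) belong in
    # the handoff notes rather than being asserted as correct here.
    assert predict("Oh great, it broke again.") in {"positive", "negative"}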