I’ve shipped production software for 28 years, and continuous integration was one of the first practices that actually changed how it felt to work. Push a commit, the suite runs, you know within minutes whether you broke something. For ordinary application code, CI is a solved problem. For machine learning models, most teams either skip it entirely or bolt on a single “run pytest” step over the code and call it done. Code-only tests catch almost nothing that actually breaks ML systems.

Here’s the hands-on version of how I set up CI for a model, why it has to be different, and where it’s honestly overkill.

Why ML CI is a different problem

Normal CI guards one thing: your code. Change the code, run the tests, the behavior either holds or it doesn’t. A model has three independent surfaces that can each break the system on their own:

  • Code — the training script, the feature transforms, the serving layer. Same as always.
  • Data — the thing you train and predict on. An upstream column changes type, a category disappears, a join silently drops 12% of rows. The code is untouched and green. The model is now wrong.
  • The model artifact itself — the trained weights. Even with identical code and identical-looking data, a retrain can produce a model that’s quietly worse on the slice that matters.

A passing unit-test suite tells you nothing about the second and third. That’s the whole reason ML CI exists, and it’s why a copy-pasted application pipeline doesn’t transfer. You’re not just asking “did the code run” — you’re asking “did the code run, was the data sane, and is the resulting model good enough to promote.”

The pipeline I actually build

Six stages, in order. The first five are gates: if one fails, nothing downstream runs. The sixth runs after deployment and feeds what it learns back into the gates.

  1. Validate the data before training touches it.
  2. Train in a reproducible, containerized environment so the build is the same on my laptop and in CI.
  3. Version the data and the model together with the code commit.
  4. Evaluate against a frozen held-out benchmark.
  5. Promote only past an eval gate — a hard threshold the new model must clear.
  6. Monitor for drift once it’s live, and feed that back into the gate.
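
The fail-fast behavior is easy to sketch locally. This is a minimal runner, assuming each stage is a command that signals failure with a non-zero exit code. The name run_stages and the demo stage commands are mine for illustration, not part of any tool.

```python
import subprocess
import sys


def run_stages(stages):
    """Run (name, command) pairs in order and stop at the first failure.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"gate failed at '{name}'; skipping the rest", file=sys.stderr)
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Illustrative stand-ins for the real stage commands.
    demo = [
        ("validate", [sys.executable, "-c", "print('data ok')"]),
        ("train", [sys.executable, "-c", "raise SystemExit(1)"]),
        ("evaluate", [sys.executable, "-c", "print('never runs')"]),
    ]
    done, failed = run_stages(demo)
    print(done, failed)
```

A CI system gives you this ordering for free, since a failed step stops the job. The sketch is mostly useful for running the same sequence on a laptop before pushing.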

1. Validate the data

This is the cheapest, highest-leverage check, and it’s the one teams skip. Before any training runs, assert the shape of your data. You can reach for Great Expectations or pandera, but a plain pytest check ships today and catches the common failures:

import pandas as pd

EXPECTED_COLS = {"user_id", "feature_a", "feature_b", "label"}

def test_schema_and_quality():
    df = pd.read_parquet("data/train.parquet")

    # Schema: columns the model depends on must exist
    assert EXPECTED_COLS.issubset(df.columns), \
        f"missing columns: {EXPECTED_COLS - set(df.columns)}"

    # Nulls in the label are silent poison
    assert df["label"].notna().all(), "null labels present"

    # Range / sanity bounds catch unit changes upstream
    assert df["feature_a"].between(0, 1).all(), "feature_a out of [0,1]"

    # Class balance hasn't collapsed (e.g. a join dropped positives)
    positive_rate = df["label"].mean()
    assert 0.05 < positive_rate < 0.95, \
        f"label balance looks broken: {positive_rate:.3f}"

If feature_a starts arriving in the 0–100 range because someone upstream switched from a fraction to a percentage, this fails in seconds instead of producing a confidently broken model an hour later.
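
The pytest check above looks at a single snapshot. Two of the upstream failures mentioned earlier, a join that silently drops rows and a category that disappears, need a comparison against what you expected to see. A minimal sketch, assuming you can get row counts before and after the join and a list of the categories the model was trained on. The function names and the 2% tolerance are illustrative, not recommendations.

```python
def check_row_retention(rows_before, rows_after, max_drop=0.02):
    """Raise if a join or filter dropped more than max_drop of the rows."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    drop = 1 - rows_after / rows_before
    if drop > max_drop:
        raise AssertionError(
            f"lost {drop:.1%} of rows ({rows_before} -> {rows_after}), "
            f"tolerance is {max_drop:.1%}"
        )
    return drop


def check_categories(observed, expected):
    """Raise if any category the model was trained on is missing."""
    missing = set(expected) - set(observed)
    if missing:
        raise AssertionError(f"categories missing from data: {sorted(missing)}")
```

With pandas, the inputs would be something like len(left_df), len(joined_df), and the unique values of a categorical column. Both functions raise AssertionError, so they drop straight into a pytest test.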

2. Train in a container

“Works on my machine” is fatal for models because a different NumPy or CUDA version can shift results. Pin everything and build inside a container so CI and your laptop run the same environment:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.lock .
RUN pip install --no-cache-dir -r requirements.lock
COPY . .
CMD ["python", "train.py"]

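Set seeds in the training code too: Python’s random, NumPy, and your framework’s seed. Reproducibility isn’t a nicety here; it’s what makes an eval gate meaningful. If two runs on the same data give different scores, the gate is measuring noise. Some GPU operations stay nondeterministic even with seeds set, so measure the run-to-run spread and keep the gate’s tolerance wider than that.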
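
A minimal seeding helper might look like this. It assumes NumPy’s legacy global RNG and PyTorch as the framework; both imports are optional so the sketch runs without them, and set_seeds is a name I made up.

```python
import os
import random


def set_seeds(seed: int) -> None:
    """Seed every RNG the training run might touch."""
    random.seed(seed)

    # Only affects subprocesses started after this point; the current
    # interpreter fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np

        np.random.seed(seed)  # legacy global NumPy RNG
    except ImportError:
        pass

    try:
        import torch  # assumption: PyTorch; swap in your framework's call

        torch.manual_seed(seed)
    except ImportError:
        pass
```

Call it once at the top of train.py, before any data shuffling or weight initialization.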

3. Version data and model with DVC

Git is terrible at large binaries, and you need to know exactly which data produced which model. DVC stores the big files in object storage (S3, GCS, whatever) and commits a tiny pointer to Git, so the data version travels with the code version:

dvc add data/train.parquet
git add data/train.parquet.dvc .gitignore
git commit -m "data: refresh training set 2026-Q2"
dvc push   # pushes the actual bytes to remote storage

Track the trained model file the same way, and any commit hash reproduces the full state: code, data, and the model artifact. When a model regresses three weeks later, you can check out the exact inputs instead of guessing.
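
DVC handles the storage. For debugging, a small lineage record saved next to the model can also help, so the artifact itself says which commit and which data produced it. This is a sketch of that idea, not something DVC does for you. The names write_lineage, file_sha256, and current_commit, along with the JSON field names, are assumptions.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the checked-out Git commit hash (requires a Git repo)."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()


def write_lineage(data_path, out_path, commit=None):
    """Write a JSON record tying a model to its code commit and data hash."""
    record = {
        "commit": commit or current_commit(),
        "data_path": str(data_path),
        "data_sha256": file_sha256(data_path),
    }
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

At the end of training, a call like write_lineage("data/train.parquet", "models/lineage.json") would do it, where the models/ path is a placeholder.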

4 & 5. Evaluate against a held-out benchmark, then gate

This is the heart of ML CI. Keep a frozen evaluation set — one that never touches training — and a hard threshold the new model must beat before it’s allowed to ship. The gate is just a script that exits non-zero on failure, which is exactly what CI understands:

import json
import sys
from sklearn.metrics import f1_score, roc_auc_score

THRESHOLD_F1 = 0.82      # promotion floor
MAX_REGRESSION = 0.01    # allowed drop vs. the live model

def main():
    y_true, y_pred, y_proba = load_eval_predictions()  # your loader (not shown): frozen-holdout labels, hard predictions, scores
    f1  = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)

    with open("metrics/production.json") as f:
        prod_f1 = json.load(f)["f1"]

    print(f"candidate f1={f1:.4f} auc={auc:.4f} | prod f1={prod_f1:.4f}")

    if f1 < THRESHOLD_F1:
        sys.exit(f"FAIL: f1 {f1:.4f} below floor {THRESHOLD_F1}")
    if f1 < prod_f1 - MAX_REGRESSION:
        sys.exit(f"FAIL: f1 regressed {prod_f1 - f1:.4f} vs production")

    print("PASS: candidate cleared the eval gate")

if __name__ == "__main__":
    main()

Two checks, deliberately: an absolute floor (never ship anything below 0.82 F1) and a relative guard (never ship something meaningfully worse than what’s already live). The relative check is what stops a “harmless” data refresh from quietly degrading production.
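
The gate script calls load_eval_predictions() without showing it. Here is one way to fill it in, assuming an earlier step scored the frozen holdout with the candidate model and wrote a CSV with y_true and y_proba columns. The file path, the column names, and the 0.5 decision threshold are all assumptions for illustration.

```python
import csv


def load_eval_predictions(path="eval/holdout_predictions.csv", threshold=0.5):
    """Return (y_true, y_pred, y_proba) lists from a predictions CSV."""
    y_true, y_pred, y_proba = [], [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            proba = float(row["y_proba"])
            y_true.append(int(row["y_true"]))
            y_proba.append(proba)
            # Hard prediction: positive when the score reaches the threshold.
            y_pred.append(int(proba >= threshold))
    if not y_true:
        raise ValueError(f"no rows found in {path}")
    return y_true, y_pred, y_proba
```

Raising on an empty file matters: a gate that scores zero rows should fail loudly rather than pass by accident.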

Wiring it into GitHub Actions

Now the stages become CI steps. The job runs on every pull request that touches one of the listed paths, and it fails the moment a gate trips:

name: ml-ci
on:
  pull_request:
    paths: ["src/**", "train.py", "eval_gate.py", "tests/**", "data/**", "requirements.lock"]

jobs:
  train-and-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install
        run: pip install -r requirements.lock   # assumes dvc and its S3 support are pinned in the lock file

      - name: Pull versioned data
        run: dvc pull
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.DVC_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.DVC_SECRET }}

      - name: Validate data
        run: pytest tests/test_data.py -v

      - name: Train
        run: python train.py

      - name: Eval gate
        run: python eval_gate.py     # non-zero exit fails the job

Only after this whole job goes green does a model get tagged for promotion. Make the job a required status check in branch protection, and the merge button is backed by data validation, a reproducible train, and a benchmark, not by hope.

6. Monitor drift in production

CI ends at promotion, but the model keeps living. The world shifts under it — that’s drift, and it’s invisible to any pre-deploy check. Log the live feature distributions and prediction rates, compare them on a schedule against the training distribution (a population stability index or a simple Kolmogorov–Smirnov test works), and alert when they diverge. When drift fires, it should trigger a retrain that flows right back through this same pipeline. That loop is the difference between a model you shipped and a model you operate.
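
For a sense of what the scheduled comparison involves, here is a minimal population stability index in plain Python. It assumes equal-width bins over the training range, clamps out-of-range live values into the edge bins, and floors empty bins at a small epsilon to avoid dividing by zero. The function name, bin count, and epsilon are my choices, not a standard.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against `expected`."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("expected sample has zero range")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        n = len(values)
        # Floor empty bins so the log ratio stays finite.
        return [max(c / n, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical distributions score 0, and the value grows as they diverge. A commonly quoted rule of thumb treats values above roughly 0.2 to 0.25 as a significant shift, but the alert threshold is worth tuning on your own features.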

Common mistakes

  • Testing the code but never the data. The most common ML outage I see is a green build serving a broken model because nobody asserted the input schema.
  • No frozen benchmark. If your eval set is regenerated each run, you’re comparing against a moving target and your “improvements” are noise.
  • An absolute threshold with no regression guard. A model can clear the floor while still being worse than what’s live. Check both.
  • Training on data that overlaps the eval set. Split once, freeze the holdout, and keep it out of every training path, or your gate is lying to you.
  • Treating drift monitoring as optional. It’s the only stage that catches failures that happen after a perfect deploy.

When this is overkill

I’ll be honest: if you’re one person with a notebook and a model you retrain by hand twice a year, this is too much. The full stack (containerized training, DVC remotes, drift jobs) earns its keep when retraining is frequent, when more than one person touches the pipeline, or when a bad model has real cost (revenue, safety, compliance). For a small team, start with exactly two stages: the data-validation pytest and the eval gate. In my experience those two catch most of the failures described above, broken inputs and quiet regressions, for almost no setup cost. Add reproducible training, versioning, and drift monitoring as the stakes and the team grow into them. Building the whole thing on day one for a low-stakes model is just ceremony.

This is the kind of machine learning continuous integration I set up so that “the model got worse” becomes a failed check on a pull request instead of a surprise in production.

At Champlin Enterprises we engineer these pipelines for teams putting models in front of real users. If you want a second set of eyes on yours, get in touch.