Training Data Quality Checklist for Production ML
A practical 15-point checklist for evaluating training data quality before building an ML model - covering completeness, labelling, distribution, PII, and version control.
Bad training data produces bad models - and bad models in production are expensive to diagnose, expensive to fix, and damaging to customer trust. Data quality problems are the root cause of most silent production ML failures, yet data quality work is consistently the most underinvested part of the ML development process.
This checklist covers the 15 most critical data quality checks to run before you train a production model.
Completeness
- Missing value rate - What percentage of rows have missing values in each feature? Missing values above 5% in a key feature require investigation and a documented imputation strategy.
- Sample coverage - Does your dataset cover the full distribution of inputs your model will see in production? Gaps in coverage become gaps in model performance.
- Temporal coverage - Does your dataset cover the time periods relevant to your prediction task? A fraud model trained on 2020 data may not reflect 2024 fraud patterns.
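The missing-value check above is easy to automate. A minimal sketch in plain Python, assuming rows are dicts with `None` marking missing values; the function name and the 5% threshold are illustrative:

```python
def missing_value_report(rows, threshold=0.05):
    """Per-feature missing rate, plus the features exceeding the threshold."""
    features = set().union(*(r.keys() for r in rows))
    n = len(rows)
    report = {}
    for f in sorted(features):
        missing = sum(1 for r in rows if r.get(f) is None)
        report[f] = missing / n
    flagged = [f for f, rate in report.items() if rate > threshold]
    return report, flagged
```

Run this per feature before choosing an imputation strategy, and record both the rates and the strategy in your data documentation.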
Labelling Quality
- Inter-annotator agreement - If your labels were created by multiple annotators, what is the agreement rate (Cohen’s Kappa or Fleiss’ Kappa)? A kappa below 0.7 indicates significant labelling inconsistency.
- Label noise estimate - Use a confident learning technique (e.g., Cleanlab) to estimate the label noise rate. Label noise above 10% typically requires relabelling or noise-robust training.
- Label distribution - Is your label distribution severely imbalanced? Class imbalance requires a documented handling strategy (oversampling, class weights, threshold adjustment).
- Annotation guidelines - Are annotation guidelines documented and version-controlled? Undocumented guidelines produce inconsistent labels over time.
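For the two-annotator case, Cohen's Kappa is simple enough to compute directly. A standalone sketch of the standard formula (in practice you would likely reach for `sklearn.metrics.cohen_kappa_score`, which computes the same quantity):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement rate.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    if p_e == 1.0:  # degenerate case: both annotators used a single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A result below 0.7 on a labelled sample is the signal to revisit your annotation guidelines before training.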
Distribution
- Feature distributions - Have you visualised the distribution of all key features? Unexpected spikes, bimodal distributions, or heavy tails require investigation.
- Outlier identification - Have outliers been identified and handled (removed, winsorized, or kept with justification)? Undocumented outliers often indicate data quality issues upstream.
- Training-production distribution match - Does your training data distribution match what your model will see in production? Distribution mismatch is the most common cause of model degradation post-deployment.
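One common way to quantify the training-production match for a numeric feature is the Population Stability Index (PSI). A minimal sketch, assuming a non-degenerate training sample; the bin count and the usual thresholds (< 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift) are conventions rather than hard rules:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between training ('expected') and
    production ('actual') samples of a numeric feature."""
    lo, hi = min(expected), max(expected)  # assumes hi > lo
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample, i):
        left, right = edges[i], edges[i + 1]
        if i == bins - 1:  # last bin is closed on the right
            count = sum(left <= x <= right for x in sample)
        else:
            count = sum(left <= x < right for x in sample)
        return max(count / len(sample), 1e-6)  # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )
```

Compute this per feature on a recent production sample; a high PSI on a key feature is a reason to retrain or re-examine the pipeline, not just a metric to log.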
Privacy and Compliance
- PII scan - Has the dataset been scanned for personally identifiable information (names, emails, phone numbers, SSNs, health data)? PII in training data creates GDPR/CCPA compliance exposure.
- Consent and provenance - Is the data provenance documented? Is there evidence of consent for the intended ML use, where required?
- Sensitive attribute handling - If the dataset includes sensitive attributes (race, gender, health status), is their inclusion documented and justified?
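A first-pass PII scan over free-text fields can be sketched with regular expressions. The patterns below are deliberately simple and illustrative - a production scan should use a dedicated tool (e.g. Microsoft Presidio) with broader coverage and validation:

```python
import re

# Illustrative patterns only: real PII detection needs more formats,
# international variants, and false-positive handling.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_for_pii(texts):
    """Return (row_index, pii_kind) pairs for rows matching any pattern."""
    hits = []
    for i, text in enumerate(texts):
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits.append((i, kind))
    return hits
```

Treat any hit as a blocker: either redact the field or document why the data can lawfully remain in the training set.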
Data Management
- Version control - Is the exact dataset version (including preprocessing steps) used for each model training run recorded? Without version control, you cannot reproduce a model or diagnose a regression.
- Train/validation/test split - Are splits created before any preprocessing or feature selection? Leakage between splits produces optimistic evaluation metrics that don’t generalise to production.
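The version-control check does not require heavy tooling to start. A minimal sketch of a deterministic dataset fingerprint - hashing the rows together with the preprocessing config, so each training run can record exactly what it saw (the function name and config shape are illustrative):

```python
import hashlib
import json

def dataset_fingerprint(rows, preprocessing_config):
    """Deterministic SHA-256 fingerprint of dataset rows plus the
    preprocessing configuration, to record alongside each training run."""
    h = hashlib.sha256()
    h.update(json.dumps(preprocessing_config, sort_keys=True).encode())
    for row in rows:
        h.update(json.dumps(row, sort_keys=True).encode())
    return h.hexdigest()
```

Any change to the data or the preprocessing changes the fingerprint, which makes regressions traceable. For anything beyond small datasets, a purpose-built tool such as DVC gives you the same guarantee with storage and lineage on top.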
How to Use This Checklist
Run this checklist before training your first production model version - and before any major retrain on new data. Document the outcome of each check and the actions taken. This documentation becomes part of your model card and is directly relevant to regulatory compliance and investor due diligence.
Book a data quality audit if you want an independent assessment of your training data against these criteria.
Ship AI You Can Trust.
Book a free 30-minute AI QA scope call with our experts. We review your model, data pipeline, or AI product - and show you exactly what to test before you ship.
Talk to an Expert