Data Validation: A Crucial Step that’s Easy to Skip
Recently, I encountered a situation that many analysts and developers might find familiar: hours spent debugging an analysis, only to discover that the underlying data wasn’t what we thought it was. One small data quality issue had cascaded into misleading results that could have led to poor business decisions.
We often rush to analyze our data without proper validation. The pressure to deliver quick insights can make thorough data checking feel like a luxury. But here’s a simple example of why this mindset is dangerous:
# What many of us do (don't be this person)
import pandas as pd

df = pd.read_csv('dataset.csv')
result = df['value'].mean()
print(f"Average value: {result:.2f}")
# What we should do
def validate_dataset(df):
    issues = []

    # Check for impossible values
    if (df['value'] < 0).any():
        invalid_rows = df[df['value'] < 0]
        issues.append(f"Found {len(invalid_rows)} negative values")

    # Check for outliers (3 standard deviations)
    mean, std = df['value'].mean(), df['value'].std()
    outliers = df[abs(df['value'] - mean) > 3*std]
    if not outliers.empty:
        issues.append(f"Found {len(outliers)} potential outliers")

    # Check for missing dates
    if df['date'].isna().any():
        issues.append(f"Found {df['date'].isna().sum()} missing dates")

    return issues
# Now we're talking
issues = validate_dataset(df)
if issues:
    print("⚠️ Data quality issues found:")
    for issue in issues:
        print(f"- {issue}")
else:
    result = df['value'].mean()
    print(f"Average value: {result:.2f}")
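To make the risk concrete, here is a minimal sketch with made-up numbers. It assumes a source system wrote -999 as a placeholder for a missing reading; the values and the sentinel convention are illustrative assumptions, not data from a real project.

```python
import pandas as pd

# Made-up readings; -999 is assumed to be a sentinel written for "missing"
df = pd.DataFrame({"value": [10.0, 12.0, 11.0, -999.0, 13.0]})

# Mean with the sentinel left in, and with it filtered out
naive_mean = df["value"].mean()
clean_mean = df.loc[df["value"] >= 0, "value"].mean()

# Distance of the sentinel from the contaminated mean, in standard deviations
z_score = abs(-999.0 - naive_mean) / df["value"].std()

print(f"Naive mean: {naive_mean:.2f}")
print(f"Clean mean: {clean_mean:.2f}")
print(f"Sentinel z-score: {z_score:.2f}")
```

One bad row is enough to flip the sign of the average. Note also that in a sample this small the sentinel sits fewer than three standard deviations from the mean, because it inflates the standard deviation it is measured against. A 3-sigma rule on its own would miss it, which is why the explicit "no negative values" check earns its place.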
Here are three validation approaches that have consistently saved me from data-related problems coming back to haunt me.
Before diving into analysis, always look at the distribution of key values. A quick visualization can reveal issues that a single summary statistic hides, and a handful of basic counts is a good place to start:
def analyze_distribution(series, column_name):
    # Basic stats that often reveal issues
    stats = {
        'missing': series.isna().sum(),
        'zeros': (series == 0).sum(),
        'unique_values': series.nunique(),
        'min': series.min(),
        'max': series.max()
    }

    # Add percentage calculations
    total_rows = len(series)
    stats['missing_pct'] = (stats['missing'] / total_rows) * 100
    stats['zero_pct'] = (stats['zeros'] / total_rows) * 100

    return stats
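If you want the "quick visualization" without reaching for a plotting library, a text histogram is often enough. The sketch below is one possible way to do it; `text_histogram`, its `bins` and `width` parameters, and the sample values are hypothetical names and data chosen for illustration.

```python
import pandas as pd

def text_histogram(series, bins=5, width=40):
    """Return one printable line per equal-width bin, with a bar sized by count."""
    # pd.cut assigns each non-null value to an interval; sort_index keeps bins in order
    counts = pd.cut(series.dropna(), bins=bins).value_counts().sort_index()
    peak = counts.max()
    lines = []
    for interval, count in counts.items():
        bar = "#" * int(round(width * count / peak)) if peak else ""
        lines.append(f"{str(interval):>22} | {bar} {count}")
    return lines

# Made-up values: six small readings and one suspiciously large one
values = pd.Series([1, 2, 2, 3, 3, 3, 100])
lines = text_histogram(values)
for line in lines:
    print(line)
```

A lone bar far away from the rest, with empty bins in between, is usually the cue to go and look at those rows before trusting any average.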
Data collection methods and formats can change unexpectedly. Here’s a simple way to catch these shifts:
def detect_pattern_changes(df, date_column, value_column):
    # Group by time period and get basic stats
    # (date_column must already have a datetime dtype for pd.Grouper to work)
    period_stats = df.groupby(pd.Grouper(key=date_column, freq='D'))[value_column].agg([
        'count',
        'mean',
        'std'
    ])

    # Calculate period-over-period changes
    for col in period_stats.columns:
        period_stats[f'{col}_change'] = period_stats[col].pct_change()

    # Flag suspicious changes
    suspicious_periods = period_stats[
        period_stats['count_change'].abs() > 0.5  # 50% change threshold
    ]

    return suspicious_periods
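Here is the same idea run end to end on synthetic data, reduced to row counts only. The helper name `daily_count_changes`, its `threshold` parameter, and the dates are assumptions made up for this sketch.

```python
import pandas as pd

def daily_count_changes(df, date_column, threshold=0.5):
    """Return the days whose row count changed by more than `threshold` vs. the previous day."""
    daily = df.groupby(pd.Grouper(key=date_column, freq="D")).size()
    change = daily.pct_change()
    return change[change.abs() > threshold]

# Synthetic feed: 10 rows a day for three days, then only 2 rows on the fourth
dates = (
    ["2024-01-01"] * 10
    + ["2024-01-02"] * 10
    + ["2024-01-03"] * 10
    + ["2024-01-04"] * 2
)
df = pd.DataFrame({"date": pd.to_datetime(dates), "value": range(len(dates))})

flagged = daily_count_changes(df, "date")
print(flagged)
```

Only the fourth day is flagged, since its count fell by 80%. The first day has no previous day to compare against, so its change is NaN and never passes the threshold.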
Make your data expectations explicit. This helps catch issues early and makes them easier to debug:
validation_rules = {
    'numeric_column': {
        'type': 'float',
        'min': 0,
        'max': 1000,
        'allow_null': False
    },
    'date_column': {
        'type': 'datetime',
        'range': ('2020-01-01', 'now'),
        'allow_null': False
    },
    'id_column': {
        'type': 'string',
        'pattern': r'^ID\d{6}$',  # Regex pattern
        'allow_null': False
    }
}
The key is automating these checks so they become part of your regular workflow:
class DataValidator:
    def __init__(self, rules):
        self.rules = rules

    def validate_dataset(self, df):
        issues = []
        for column, rules in self.rules.items():
            if column not in df.columns:
                issues.append(f"Missing expected column: {column}")
                continue

            # Type validation ('float' in the rules above counts as numeric)
            if rules['type'] in ('numeric', 'float'):
                if not pd.api.types.is_numeric_dtype(df[column]):
                    issues.append(f"{column} should be numeric")
                # Range validation, only attempted once the column is known to be numeric
                elif 'min' in rules and df[column].min() < rules['min']:
                    issues.append(f"{column} contains values below minimum {rules['min']}")

            # More validation logic...

        return issues
# Use it in your workflow
validator = DataValidator(validation_rules)
issues = validator.validate_dataset(new_data)
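For a sense of what the remaining checks might look like, here is a rough sketch of a per-column helper covering the `allow_null`, `min`, `max` and `pattern` keys. The function `check_column` and the sample data are hypothetical, and the `range` rule for dates is deliberately left out; treat this as one possible shape, not a finished validator.

```python
import pandas as pd

def check_column(series, rule):
    """Return a list of human-readable problems for one column."""
    problems = []

    # Null check: only a problem when the rule forbids nulls
    nulls = int(series.isna().sum())
    if nulls and not rule.get("allow_null", True):
        problems.append(f"{nulls} null value(s)")

    present = series.dropna()

    # Numeric type and range checks
    if rule.get("type") in ("numeric", "float"):
        if not pd.api.types.is_numeric_dtype(present):
            return problems + ["not numeric"]
        if "min" in rule and (present < rule["min"]).any():
            problems.append(f"values below {rule['min']}")
        if "max" in rule and (present > rule["max"]).any():
            problems.append(f"values above {rule['max']}")

    # Regex check: every non-null value must match the whole pattern
    if "pattern" in rule and not present.empty:
        bad = ~present.astype(str).str.fullmatch(rule["pattern"])
        if bad.any():
            problems.append(f"{int(bad.sum())} value(s) not matching pattern")

    return problems

# Made-up data with one problem of each kind
sample = pd.DataFrame({
    "amount": [5.0, -1.0, None, 2000.0],
    "record_id": ["ID000001", "bad", "ID123456", "ID000002"],
})

amount_problems = check_column(
    sample["amount"], {"type": "float", "min": 0, "max": 1000, "allow_null": False}
)
id_problems = check_column(
    sample["record_id"], {"type": "string", "pattern": r"^ID\d{6}$", "allow_null": False}
)
print(amount_problems)
print(id_problems)
```

Returning a list of messages per column, rather than raising on the first failure, means a single run reports everything that is wrong with a new file.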
These practices have helped catch numerous issues in my work.
Start small – pick your most important dataset and implement basic validation. You’ll likely find issues you didn’t know existed, and your future self will thank you for the extra effort.
Have you encountered similar data quality issues in your work? How do you handle data validation? Share your experiences in the comments below.