data-quality-skill

masterkram/data-quality-skill
Updated 24 days ago

💡 Summary

This skill diagnoses and fixes data quality issues in datasets, covering the full data quality lifecycle.

🎯 Who It's For

Data analysts, data scientists, business intelligence professionals, data engineers, and machine learning practitioners.


Security Analysis: Medium Risk

Risk: Medium. Recommended checks: how API keys/tokens are obtained and stored, and whether they could leak. Run with least privilege, and audit the code and its dependencies before enabling in production.


---
name: data-quality
description: Diagnose and fix data quality problems in datasets. Use when working with dirty data, finding duplicates, handling missing values, detecting outliers/anomalies, validating constraints (functional dependencies, referential integrity), profiling datasets, or cleaning data for analysis or ML. Covers the full data quality lifecycle - define, detect, clean, measure.
---

Data Quality Skill

Systematic approach to diagnosing and fixing data quality problems.

Data Quality Process

Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
  1. Define: Understand data context, business rules, quality requirements
  2. Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
  3. Clean: Apply appropriate repair strategies
  4. Measure: Validate repairs, quantify improvement

Quick Reference

| Problem | Script | Key Function |
|---------|--------|--------------|
| Data overview | data_profiling.py | profile_dataframe(df) |
| Find quality issues | data_profiling.py | detect_glitches(df) |
| Missing values | missing_data.py | analyze_missing(df) |
| Imputation | missing_data.py | impute_mean/median/regression() |
| Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
| Deduplication | duplicate_detection.py | deduplicate(df, cols) |
| Outliers | anomaly_detection.py | detect_anomalies(df) |
| Constraint check | constraint_checking.py | validate_constraints(df, rules) |
| String matching | similarity_metrics.py | jaro_winkler_similarity() |

Workflow

Step 1: Profile the Data

```python
from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report

# Quick overview
print(generate_quality_report(df))

# Detailed profile
profile = profile_dataframe(df)

# Find issues
glitches = detect_glitches(df)
```

Step 2: Analyze Specific Issues

Missing Data:

```python
from scripts.missing_data import analyze_missing, test_mcar

analysis = analyze_missing(df)

# Check if safe to delete rows
mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])
```

Duplicates:

```python
from scripts.duplicate_detection import find_duplicates, cluster_duplicates

matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
clusters = cluster_duplicates(matches)
```

Outliers:

```python
from scripts.anomaly_detection import detect_anomalies, iqr_outliers

# Multi-column summary
anomalies = detect_anomalies(df, method='iqr')

# Single column detail
result = iqr_outliers(df, 'price', multiplier=1.5)
```

Constraints:

```python
from scripts.constraint_checking import validate_constraints

constraints = [
    {'type': 'unique', 'columns': ['id']},
    {'type': 'not_null', 'columns': ['name', 'email']},
    {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
    {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
]
results = validate_constraints(df, constraints)
```

Step 3: Clean the Data

Handle Missing:

```python
from scripts.missing_data import impute_median, impute_regression, listwise_deletion

# Simple: median for numeric
df_clean = impute_median(df, 'age')

# Better: regression-based
df_clean = impute_regression(df, 'income', ['age', 'education'])

# If MCAR confirmed
df_clean = listwise_deletion(df)
```

Remove Duplicates:

```python
from scripts.duplicate_detection import deduplicate

df_clean, summary = deduplicate(
    df,
    columns=['name', 'email', 'address'],
    threshold=0.8,
    merge_strategy='most_complete',
)
print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")
```

Handle Outliers:

```python
# Cap extreme values
q01, q99 = df['col'].quantile([0.01, 0.99])
df['col'] = df['col'].clip(q01, q99)

# Or remove the flagged rows by index
df_clean = df.drop(index=detect_anomalies(df)['col']['outlier_indices'])
```

Step 4: Validate

Re-run profiling and constraint checks on cleaned data to verify improvements.
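To make "quantify improvement" concrete, here is a minimal, library-free sketch of a before/after comparison; the `completeness` helper and the sample rows are hypothetical, not part of the skill's scripts:

```python
# Hypothetical helper: per-column completeness as a simple quality metric.
def completeness(rows, columns):
    """Fraction of non-null cells in each column."""
    n = len(rows)
    return {c: sum(r[c] is not None for r in rows) / n for c in columns}

cols = ["age", "name"]
before = [{"age": None, "name": "Ada"}, {"age": 30, "name": None}]
after = [{"age": 30, "name": "Ada"}, {"age": 30, "name": "Bob"}]

# Positive deltas mean the cleaning step improved completeness.
delta = {c: completeness(after, cols)[c] - completeness(before, cols)[c] for c in cols}
```

The same before/after pattern applies to any metric the profiling step reports (duplicate counts, constraint violations, outlier counts).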

References

For deeper understanding:

Key Concepts

Data Quality = Fit for Use

  • Free of defects
  • Has features needed for the task
  • Right information, right place, right time

Missing Data Mechanisms:

  • MCAR: Missing Completely At Random (safe to delete)
  • MAR: Missing At Random (imputation may work)
  • MNAR: Missing Not At Random (most problematic)
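A toy illustration (made-up numbers, not from the skill) of why the mechanism matters: when missingness depends on another observed column (MAR), naive summaries over the observed rows are biased.

```python
# Income tends to be missing for older respondents (MAR):
# the observed values then under-represent high incomes.
rows = [{"age": 25, "income": 30}, {"age": 35, "income": 50},
        {"age": 45, "income": 70}, {"age": 55, "income": 90}]

true_mean = sum(r["income"] for r in rows) / len(rows)

# Simulate MAR missingness: income unobserved when age > 40.
observed = [r["income"] for r in rows if r["age"] <= 40]
mar_mean = sum(observed) / len(observed)
```

Here the observed mean (40) badly underestimates the true mean (60), which is why listwise deletion is only safe once MCAR is confirmed.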

Constraints:

  • Functional Dependency: X → Y means X uniquely determines Y
  • Referential Integrity: foreign keys reference valid primary keys
  • Domain Constraints: values within allowed set/range
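The functional-dependency definition above can be checked without any library; a minimal sketch (the `violates_fd` helper is hypothetical, not one of the skill's scripts):

```python
# Functional dependency X -> Y: each determinant value must map to
# exactly one dependent value.
def violates_fd(rows, determinant, dependent):
    """Return rows whose determinant maps to a conflicting dependent."""
    seen = {}
    violations = []
    for row in rows:
        key = tuple(row[c] for c in determinant)
        val = tuple(row[c] for c in dependent)
        if key in seen and seen[key] != val:
            violations.append(row)
        seen.setdefault(key, val)
    return violations

rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Ann"},  # conflicts with the first row: violates id -> name
]
bad = violates_fd(rows, ["id"], ["name"])
```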

Entity Resolution:

  • Blocking reduces O(n²) to O(n·window)
  • Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
  • Cluster by transitive closure, merge by strategy
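The blocking idea above can be sketched as sorted-neighborhood blocking: sort records on a key, then compare only records inside a sliding window instead of all O(n²) pairs (function and field names here are illustrative assumptions):

```python
# Sorted-neighborhood blocking: each record is only paired with records
# that land near it in sort order, giving O(n * window) candidate pairs.
def candidate_pairs(records, key, window=3):
    ordered = sorted(range(len(records)), key=lambda i: records[i][key])
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

records = [
    {"name": "smith, john"},
    {"name": "smyth, john"},
    {"name": "adams, amy"},
    {"name": "zhou, li"},
]
# window=2 pairs each record only with its nearest sorted neighbor:
# 3 candidate pairs instead of the 6 all-pairs comparisons.
pairs = candidate_pairs(records, "name", window=2)
```

Only the candidate pairs are then scored with a similarity metric, which is where most of the speedup comes from.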

Similarity Metrics Comparison

| Metric | Best For | Example |
|--------|----------|---------|
| Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
| Levenshtein | Typos, edit distance | "recieve" vs "receive" |
| Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
| Q-gram | Fuzzy substring matching | Partial matches |
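As a reference for two of the metrics in the table, here is a minimal pure-Python sketch (the skill's similarity_metrics.py presumably has its own implementations; this is only illustrative):

```python
# Levenshtein: minimum number of single-character insertions,
# deletions, and substitutions, via the classic two-row DP.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Jaccard: token-set overlap, order-insensitive, ignores simple punctuation.
def jaccard(a, b):
    ta = {w.strip(",.") for w in a.lower().split()}
    tb = {w.strip(",.") for w in b.lower().split()}
    return len(ta & tb) / len(ta | tb)

d = levenshtein("recieve", "receive")   # the typo example from the table
j = jaccard("John Doe", "Doe, John")    # the reordered-tokens example
```

Note how the table's examples play to each metric's strength: the typo pair is two edits apart, while the reordered name pair is a perfect token-set match.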

Five-Dimension Analysis

  • Clarity: 8/10
  • Novelty: 7/10
  • Practicality: 9/10
  • Completeness: 9/10
  • Maintainability: 8/10

Pros & Cons

Pros

  • Comprehensive coverage of the data quality lifecycle.
  • Multiple methods for detecting and repairing issues.
  • User-friendly scripts for common tasks.

Cons

  • Requires some familiarity with Python.
  • Limited to specific data quality problems.
  • Depends on external libraries.

Related Skills

  • spark-engineer
  • whodb
  • exa-search

Disclaimer: this content comes from an open-source GitHub project and is shown here for display and scoring purposes only.

Copyright belongs to the original author, masterkram.