Data Cleansing -- Concepts
Data Cleansing
• A process of detecting and removing errors and inconsistencies from data in order to
improve the quality of data.
• An iterative process
Data Anomalies Classification
• Syntactical Anomalies: errors in the format and values of the data
• Semantic Anomalies: the data fails to be a comprehensive and non-redundant representation (for example, duplicate records)
• Coverage Anomalies: missing values
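The three anomaly classes above can be illustrated with a small detection sketch. The sample records, the field names, the simplified e-mail pattern, and the choice of exact duplicates as the semantic example are all assumptions made for illustration, not part of the original notes.

```python
import re

# Hypothetical sample records; field names and rules are illustrative assumptions.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "bob(at)example.com", "age": 41},  # syntactical: bad format
    {"id": 3, "email": "cy@example.com", "age": None},     # coverage: missing value
    {"id": 1, "email": "ann@example.com", "age": 34},      # semantic: duplicate record
]

# Deliberately simplified pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_anomalies(rows):
    """Map each anomaly class to the list indexes of the offending rows."""
    found = {"syntactical": [], "semantic": [], "coverage": []}
    seen = set()
    for index, row in enumerate(rows):
        # Coverage anomaly: any field without a value.
        if any(value is None for value in row.values()):
            found["coverage"].append(index)
        # Syntactical anomaly: a value that violates the expected format.
        email = row.get("email")
        if email is not None and not EMAIL_PATTERN.match(email):
            found["syntactical"].append(index)
        # Semantic anomaly (here): an exact duplicate of an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            found["semantic"].append(index)
        seen.add(key)
    return found


anomalies = find_anomalies(records)
print(anomalies)
```

A real audit would likely use several rules per class; the point here is only that each class calls for a different kind of check.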
Data Cleansing Blueprint
(1) Data auditing (or analysis): detect errors and inconsistencies in the data
(2) Definition of transformation workflow: define a sequence of operations on the data
(3) Verification: test and evaluate the correctness and effectiveness of a workflow
(4) Data transformation: execute the transformation steps
(5) Post-processing and controlling: inspect the results to verify the correctness
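The five steps above can be sketched end to end on a toy dataset. This is a minimal sketch under stated assumptions: the `price` field, the helper names (`is_number`, `audit`, `strip_strings`, `parse_price`, `run_workflow`), and the rule of dropping rows without a usable price are all hypothetical choices, not a prescribed method.

```python
def is_number(value):
    """Return True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def audit(rows):
    """Step 1: report the indexes of rows whose 'price' is missing or non-numeric."""
    return [i for i, row in enumerate(rows) if not is_number(row.get("price"))]


def strip_strings(row):
    """Remove leading and trailing whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def parse_price(row):
    """Convert 'price' to float, or to None when it cannot be parsed."""
    value = row.get("price")
    return {**row, "price": float(value) if is_number(value) else None}


def run_workflow(rows, steps):
    """Apply each operation in order, then drop rows without a usable price."""
    for step in steps:
        rows = [step(row) for row in rows]
    return [row for row in rows if row["price"] is not None]


# Step 2: define the workflow as an ordered sequence of operations.
workflow = [strip_strings, parse_price]

# Step 3: verify the workflow on a small sample with a known expected result.
sample = [{"name": " pen ", "price": " 1.50 "}]
assert run_workflow(sample, workflow) == [{"name": "pen", "price": 1.5}]

# Hypothetical raw data for illustration.
raw = [
    {"name": " pen ", "price": "1.50"},
    {"name": "ink", "price": "n/a"},
    {"name": "pad ", "price": None},
    {"name": "clip", "price": "0.25"},
]

flagged = audit(raw)                    # Step 1 on the real data
cleaned = run_workflow(raw, workflow)   # Step 4: execute the transformation
remaining = audit(cleaned)              # Step 5: re-audit the result
print(flagged, cleaned, remaining)
```

Because cleansing is iterative, a non-empty `remaining` list would be the signal to revise the workflow and run the loop again.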
Categorical variables: values or observations that can be sorted into groups or categories
Quantitative data: values or observations that can be measured, and can be placed in
ascending or descending order
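The distinction matters in practice because the two kinds of data support different operations. The sketch below uses made-up values (`blood_types`, `heights_cm`) purely for illustration: categorical values are counted per group, while quantitative values can be ordered and averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations, for illustration only.
blood_types = ["A", "O", "B", "O", "AB", "O", "A"]   # categorical
heights_cm = [172.5, 160.0, 181.2, 168.4]            # quantitative

# Categorical: sort observations into groups and count each group.
group_sizes = Counter(blood_types)

# Quantitative: values can be measured, ordered, and summarized numerically.
ascending = sorted(heights_cm)
descending = sorted(heights_cm, reverse=True)
average = mean(heights_cm)

print(group_sizes)
print(ascending, descending, average)
```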