Data Cleansing -- Concepts

Data Cleansing

• A process of detecting and removing errors and inconsistencies from data in order to improve data quality.
• An iterative process.

Data Anomalies Classification

• Syntactical Anomalies: format and values
• Semantic Anomalies: comprehensiveness and non-redundancy
• Coverage Anomalies: missing values
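
To make the three anomaly classes concrete, here is a minimal sketch that flags one example of each in a tiny record set. The records, the field names, the YYYY-MM-DD date rule, and the "duplicate id" rule are all assumptions chosen for illustration; they are not part of the classification itself.

```python
import re
from collections import Counter

# Hypothetical sample records; field names and rules are illustrative assumptions.
records = [
    {"id": 1, "name": "Alice", "birth_date": "1990-04-12", "age": 34},
    {"id": 2, "name": "Bob", "birth_date": "12/04/1985", "age": 39},      # syntactical: wrong date format
    {"id": 3, "name": "Carol", "birth_date": "1978-09-30", "age": None},  # coverage: missing value
    {"id": 1, "name": "Alice", "birth_date": "1990-04-12", "age": 34},    # semantic: redundant record
]

# Assumed format rule: dates must look like YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def syntactical_anomalies(rows):
    """Return ids of rows whose birth_date violates the assumed date format."""
    return [
        r["id"]
        for r in rows
        if r["birth_date"] is not None and not DATE_PATTERN.fullmatch(r["birth_date"])
    ]


def semantic_anomalies(rows):
    """Return ids that occur more than once (redundant records)."""
    counts = Counter(r["id"] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def coverage_anomalies(rows):
    """Return ids of rows that have at least one missing (None) value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]
```

Each function checks only one class, so a single record can appear in more than one result if it has several kinds of problems.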

Data Cleansing Blueprint

(1) Data auditing (or analysis): detect errors and inconsistencies in the data
(2) Definition of the transformation workflow: define a sequence of operations on the data
(3) Verification: test and evaluate the correctness and effectiveness of the workflow
(4) Data transformation: execute the transformation steps
(5) Post-processing and controlling: inspect the results to verify their correctness

• Categorical variables: values or observations that can be sorted into groups or categories
• Quantitative data: values or observations that can be measured, and can be placed in ascending or descending order
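
The five blueprint steps listed above can be sketched end to end on a toy column. This is only an illustration under stated assumptions: the raw values, the rule "a valid age is a non-negative integer", and all function names are hypothetical, and a real workflow would usually involve many more operations and several iterations.

```python
# Hypothetical raw column: ages stored as strings, some of them dirty.
raw = [" 34", "39 ", "n/a", "-5", "41"]


def audit(values):
    """Step 1 (auditing): indices of values that are not non-negative integers."""
    return [i for i, v in enumerate(values) if not v.strip().isdecimal()]


# Step 2 (workflow definition): an ordered sequence of small operations.
def strip_whitespace(values):
    return [v.strip() for v in values]


def invalid_to_none(values):
    return [v if v.isdecimal() else None for v in values]


def to_int(values):
    return [int(v) if v is not None else None for v in values]


WORKFLOW = [strip_whitespace, invalid_to_none, to_int]


def run_workflow(values, steps=WORKFLOW):
    """Step 4 (transformation): apply every operation in order."""
    for step in steps:
        values = step(values)
    return values


def verify(steps):
    """Step 3 (verification): run the workflow on a sample with a known answer."""
    sample = [" 7 ", "x"]
    return run_workflow(sample, steps) == [7, None]


def post_process(values):
    """Step 5 (controlling): every result is None or a non-negative int."""
    return all(v is None or (isinstance(v, int) and v >= 0) for v in values)


flagged = audit(raw)          # positions of suspicious values
assert verify(WORKFLOW)       # only transform if the workflow passes its check
cleaned = run_workflow(raw)
ok = post_process(cleaned)
```

If `post_process` reports a problem, the process loops back to auditing with the cleaned data, which is what makes cleansing iterative.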