Data sources
Data Value Chain
• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work
Open data
• Free – accessible, costs nothing
• Free – unrestricted usage
• Free – simple, non-proprietary format
Data formats
Machine-readable data: e.g., XML, JSON
Markup language: e.g., Markdown, Javadoc
Digital container: e.g., MPEG
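To illustrate what "machine-readable" means in practice, here is a minimal sketch that reads the same record from JSON and from XML using only the Python standard library. The record and its field names (station, temperature) are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two machine-readable formats.
json_text = '{"station": "A1", "temperature": 21.5}'
xml_text = "<reading><station>A1</station><temperature>21.5</temperature></reading>"

# JSON maps directly onto Python dicts, lists, strings and numbers.
from_json = json.loads(json_text)

# XML is parsed into a tree of elements; values arrive as text,
# so numeric fields have to be converted explicitly.
root = ET.fromstring(xml_text)
from_xml = {
    "station": root.findtext("station"),
    "temperature": float(root.findtext("temperature")),
}

print(from_json == from_xml)  # True
```

Both formats carry the same information; the difference is that JSON keeps basic types, while XML leaves type interpretation to the reader.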
Metadata
• data about data
• structured so that a computer can process & interpret it
Uses:
Descriptive: describes content for identification and retrieval
Structural: documents relationships and links
Administrative: helps to manage information
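As a rough sketch, the three kinds of metadata could be stored in one structured record that a program can serialise, reload and query. Every field name and value below is hypothetical and not taken from any particular metadata standard.

```python
import json

# Hypothetical metadata record for a data file; all names are illustrative.
metadata = {
    # Descriptive: supports identification and retrieval.
    "descriptive": {
        "title": "Daily temperature readings",
        "keywords": ["weather", "temperature"],
    },
    # Structural: documents relationships and links.
    "structural": {
        "part_of": "weather-archive",
        "columns": ["station", "date", "temperature"],
    },
    # Administrative: helps to manage the information.
    "administrative": {
        "created": "2024-01-15",
        "licence": "CC-BY-4.0",
        "format": "text/csv",
    },
}

def matches(record, keyword):
    """Retrieval via descriptive metadata: is the keyword listed?"""
    return keyword in record["descriptive"]["keywords"]

# Because the record is structured, it survives a round trip through JSON.
restored = json.loads(json.dumps(metadata, indent=2))
print(matches(restored, "weather"))  # True
```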
Combining data
Joining data sets: the sets must share a common field (a key) on which records can be matched
Types: outer join, inner join, left/right join
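The join types differ only in which keys are kept. The sketch below implements them for lists of dictionaries in plain Python; the data, the key name "id" and the helper join are all made up for illustration, and keys are assumed to be unique within each data set.

```python
# Two hypothetical data sets sharing the key "id".
people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}, {"id": 3, "name": "Cy"}]
cities = [{"id": 2, "city": "Oslo"}, {"id": 3, "city": "Lima"}, {"id": 4, "city": "Rome"}]

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; assumes keys are unique within each list."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    if how == "inner":    # keys present in both sets
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":   # every key of the left set
        keys = list(left_by_key)
    elif how == "right":  # every key of the right set
        keys = list(right_by_key)
    elif how == "outer":  # keys present in either set
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"unknown join type: {how}")
    # Merge the matching rows; a missing side contributes no fields.
    return [{**left_by_key.get(k, {}), **right_by_key.get(k, {})} for k in keys]

for how in ("inner", "left", "right", "outer"):
    print(how, [row["id"] for row in join(people, cities, "id", how)])
# inner [2, 3]
# left [1, 2, 3]
# right [2, 3, 4]
# outer [1, 2, 3, 4]
```

In this sketch an unmatched row simply lacks the other side's fields; database systems and data-frame libraries usually fill such gaps with a null value instead.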