Data sources

Data Value Chain

• Collection: getting the data
• Wrangling: data preprocessing, cleaning
• Analysis: discovery (learning, visualisation, etc.)
• Presentation: arguing that results are significant and useful
• Engineering: storage and computational resources
• Governance: overall management of data
• Operationalisation: putting the results to work

Open data

• Free – accessible, costs nothing
• Free – unrestricted usage
• Free – simple, non-proprietary format

Data formats

• Machine-readable data: e.g., XML, JSON
• Markup language: e.g., Markdown, Javadoc
• Digital container: e.g., MPEG
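
As a small illustration of machine-readable formats, the sketch below parses the same record from JSON and from XML using Python's standard library. The record itself (a station name and a temperature) is made up for the example and is not part of the notes.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, written once as JSON and once as XML.
json_text = '{"station": "A1", "temperature": 21.5}'
xml_text = "<reading><station>A1</station><temperature>21.5</temperature></reading>"

# JSON carries basic types (string, number), so the parser returns them directly.
from_json = json.loads(json_text)

# XML element text is always a string, so numeric fields must be converted by hand.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}
from_xml["temperature"] = float(from_xml["temperature"])

print(from_json == from_xml)
```

Both formats end up as the same dictionary, which is the point of machine-readable data: a program can recover the structure without human interpretation.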

Metadata

• Data about data
• Structured so that a computer can process and interpret it

Uses:
• Descriptive: describes content for identification and retrieval
• Structural: documents relationships and links
• Administrative: helps to manage information
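
One possible way to picture the three kinds of metadata is as a structured record grouped by purpose. The sketch below is only an illustration: the file name, field names, and values are all invented, and real metadata schemas differ.

```python
# Hypothetical metadata record for an imaginary file "rainfall_2020.csv".
record = {
    "descriptive": {  # what the data is, for identification and retrieval
        "title": "Daily rainfall, 2020",
        "keywords": ["rainfall", "weather", "daily"],
    },
    "structural": {  # how it relates to other items
        "part_of": "weather-archive",
        "related_files": ["stations.csv"],
    },
    "administrative": {  # what is needed to manage it
        "format": "text/csv",
        "licence": "CC BY 4.0",
        "created": "2021-01-15",
    },
}


def has_keyword(record, keyword):
    """Retrieval via descriptive metadata, without opening the data file."""
    return keyword in record["descriptive"]["keywords"]


print(has_keyword(record, "rainfall"))
```

Because the record is structured, a program can answer questions such as "which data sets are about rainfall?" from the metadata alone.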

Combining data

Joining data sets: they must have something in common, such as a shared key.
Types: outer join, inner join, left/right join
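
The join types can be sketched in plain Python. This is a minimal illustration, not a library implementation: the `join` function, the `students` and `grades` tables, and the `id` key are all hypothetical, and missing values are filled with `None`.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is inner, left, right, or outer."""
    # Template row with every column set to None, used to fill in missing values.
    columns = {c for row in left + right for c in row}
    blank = dict.fromkeys(columns)

    # Index the right table by key so matches can be looked up quickly.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    left_keys = {row[key] for row in left}

    result = []
    for lrow in left:
        matches = right_by_key.get(lrow[key], [])
        for rrow in matches:
            result.append({**blank, **lrow, **rrow})
        # Left and outer joins keep left rows that have no match.
        if not matches and how in ("left", "outer"):
            result.append({**blank, **lrow})
    # Right and outer joins keep right rows that have no match.
    if how in ("right", "outer"):
        for rrow in right:
            if rrow[key] not in left_keys:
                result.append({**blank, **rrow})
    return result


students = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}, {"id": 3, "name": "Caz"}]
grades = [{"id": 2, "grade": "A"}, {"id": 3, "grade": "B"}, {"id": 4, "grade": "C"}]

for how in ("inner", "left", "right", "outer"):
    print(how, [row["id"] for row in join(students, grades, "id", how)])
```

The inner join keeps only keys present in both tables, the left and right joins keep every row from one side, and the outer join keeps every row from both.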