Basic Data-Handling Operations

Libraries

import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

(%matplotlib inline is a Jupyter magic; it only works inside a notebook.)

Exploratory Data Analysis Demos.ipynb

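Basic plotting methods (these are matplotlib calls, not pandas):
• Bar plot: plt.bar([10, 20, 30], [5, 8, 2])
• Histogram: plt.hist(y1)
• Box plot: y_outliers = y1 + [-10]; plt.boxplot([y1, y_outliers]); plt.show() (if y1 is a Python list, + appends -10 as an outlier)
• Line plot: plt.plot(x, y1, '-')
• Scatter plot: plt.plot(x, y1, '.r')

List the unique values: df1['sex'].unique()
Count how often each value occurs: df1['fare'].value_counts()
Count missing values: df1['embarked'].isnull().sum()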
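The three counting calls above can be tried on a small table. This is a minimal sketch: the DataFrame below is made-up toy data standing in for the notebook's df1, and only the column names come from the notes.

```python
import pandas as pd

# Hypothetical toy data; the real df1 comes from the notebook
df1 = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "fare": [7.25, 71.28, 7.25, 8.05],
    "embarked": ["S", "C", None, "S"],
})

# Distinct values, in order of first appearance
unique_sex = df1["sex"].unique()

# Series mapping each distinct value to how many times it occurs
fare_counts = df1["fare"].value_counts()

# isnull() gives a boolean Series; summing it counts the True entries
n_missing = df1["embarked"].isnull().sum()

print(list(unique_sex))
print(fare_counts.to_dict())
print(int(n_missing))
```

isnull().sum() gives the same count as sum(df1['embarked'].isnull()) but stays inside pandas, which is usually faster on large columns.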

Week 6 tutorial Data Auditing answer.ipynb

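Describe the data: titanic.describe() and titanic.info()

Select rows by condition:
titanic[titanic.title == "Rev"]
titanic[((titanic.who == "man") | (titanic.who == "woman")) & (titanic.age < 18)]

Replace values (assign the result back; in recent pandas versions, inplace=True on a selected column may not update the DataFrame):
titanic["embark_town"] = titanic["embark_town"].replace({"Cherborg": "Cherbourg", "Cherbourge": "Cherbourg"})
titanic["sex"] = titanic["sex"].replace({"F": "female", "M": "male"})

Handle duplicates (keep only the first occurrence):
titanic[titanic.duplicated(["firstName", "lastName", "age"], keep="first")] only shows the repeated rows, that is, every occurrence after the first.
titanic = titanic.drop_duplicates(subset=["firstName", "lastName", "age"], keep="first") actually removes them.

Split the separate fields packed into a single column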
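The notes stop before showing how to handle duplicates end to end or how to split a column. The sketch below is one possible way, not the tutorial's own answer. The table is hypothetical, and the "Last, Title. First" layout of the name column is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical data; the "Last, Title. First" name layout is an assumption
people = pd.DataFrame({
    "firstName": ["Ann", "Ann", "Bob"],
    "lastName": ["Lee", "Lee", "Ray"],
    "age": [30, 30, 41],
    "name": ["Lee, Mrs. Ann", "Lee, Mrs. Ann", "Ray, Rev. Bob"],
})

key = ["firstName", "lastName", "age"]

# duplicated(keep="first") marks every repeat except the first occurrence
repeats = people[people.duplicated(key, keep="first")]

# drop_duplicates(keep="first") removes exactly those marked rows
deduped = people.drop_duplicates(subset=key, keep="first")

# Split on the comma: column 0 is the last name, column 1 is "Title. First"
parts = deduped["name"].str.split(",", n=1, expand=True)

# Split the remainder on the first period to isolate the title
rest = parts[1].str.strip().str.split(".", n=1, expand=True)

deduped = deduped.assign(title=rest[0], given=rest[1].str.strip())
print(deduped[["lastName", "title", "given"]])
```

Single-character separators are used on purpose: str.split treats a one-character pattern as a literal string, so the period is not interpreted as a regular-expression wildcard.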