Parsing Raw Data -- Reading Data in Different Formats

CSV files

(1) Import pandas and read the file (2) Set the ID column (3) Manipulate the data: split the latitude/longitude into two columns
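
The three CSV steps can be sketched roughly as below. The file content, the column names (id, name, location) and the "lat,lon" string layout are assumptions made for illustration, not the data used in class.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file (made-up values).
csv_text = """id,name,location
101,Library,"-37.7983, 144.9610"
102,Station,"-37.8183, 144.9671"
"""

# In practice this would be pd.read_csv("your_file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Use the ID column as the index.
df = df.set_index("id")

# Split the combined "lat, lon" string into two numeric columns.
parts = df["location"].str.split(",", expand=True)
df["lat"] = pd.to_numeric(parts[0].str.strip())
df["lon"] = pd.to_numeric(parts[1].str.strip())
df = df.drop(columns="location")

print(df)
```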

JSON files

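(1) Call the Google Maps API (2) import json to read the data (3) json_normalize -- this failed when I ran it; remember to check the recording (4) Rename each column (5) Extract values from multi-valued (nested) objects to get the latitude and longitude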
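
A minimal sketch of steps (2) to (5), without calling the API itself. The sample below is only shaped like a Google Maps Geocoding response (results, then geometry, then location); the values and the new column names are made up. Why json_normalize failed in class is not recorded here; one possibility (an assumption) is the import path, since recent pandas versions expose it as pd.json_normalize.

```python
import json
import pandas as pd

# Trimmed sample shaped like a Geocoding API response (values are made up).
raw = """
{"status": "OK",
 "results": [
   {"formatted_address": "Place A",
    "geometry": {"location": {"lat": -37.8, "lng": 144.96}}},
   {"formatted_address": "Place B",
    "geometry": {"location": {"lat": -33.87, "lng": 151.21}}}
 ]}
"""
data = json.loads(raw)

# Flatten the nested objects; nested keys become dotted column names.
places = pd.json_normalize(data["results"])

# Rename the columns to something shorter (names chosen for illustration).
places = places.rename(columns={
    "formatted_address": "address",
    "geometry.location.lat": "lat",
    "geometry.location.lng": "lon",
})

# The same coordinates can also be pulled out by hand from the nested dicts.
lats = [r["geometry"]["location"]["lat"] for r in data["results"]]

print(places)
```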

XML files

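(1) from bs4 import BeautifulSoup (2) Use tree.findall to pull out the data (findall is the xml.etree.ElementTree method; the BeautifulSoup equivalent is find_all) (3) Loop to extract each value (4) Name the columns to get table-form data like the earlier sections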
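
The loop-and-collect pattern for tag-based data might look like the sketch below. It uses the standard library's xml.etree.ElementTree, where findall lives, rather than BeautifulSoup; with bs4 the matching call would be find_all. The tag names (places, place, name, lat, lon) are hypothetical.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Made-up XML standing in for the real file.
xml_text = """
<places>
  <place id="101"><name>Library</name><lat>-37.7983</lat><lon>144.9610</lon></place>
  <place id="102"><name>Station</name><lat>-37.8183</lat><lon>144.9671</lon></place>
</places>
"""
tree = ET.fromstring(xml_text.strip())

# Loop over every <place> element and pull out each value.
rows = []
for place in tree.findall("place"):
    rows.append({
        "id": int(place.get("id")),
        "name": place.findtext("name"),
        "lat": float(place.findtext("lat")),
        "lon": float(place.findtext("lon")),
    })

# Same table shape as in the CSV and JSON cases.
df = pd.DataFrame(rows).set_index("id")
print(df)
```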

PDF

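(1) Install pdfminer.six with pip install pdfminer.six==20181108 and use it to extract all the content (2) Loop over and read out all the values; print(repr(line)) makes newlines show up as \n (3) Trim with regular expressions to get the content needed (4) Split on spaces and newlines (5) Collect the content and output it as a table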
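
Steps (2) to (5) can be tried without a PDF by starting from a string that stands in for the extracted text. Everything in that string, including the "Station Table" and "End of table" markers and the column names, is invented for this sketch. Newer pdfminer.six releases offer pdfminer.high_level.extract_text(path) for step (1); the pinned 20181108 release may require the lower-level API shown in the recording.

```python
import re
import pandas as pd

# Stand-in for the text extracted from a PDF (made-up content).
text = (
    "Report 2019\n"
    "Station Table\n"
    "A01 Library 12\n"
    "A02 Station 7\n"
    "End of table\n"
    "Page 1\n"
)

# repr() makes the line breaks visible as \n while inspecting the text.
for line in text.splitlines(keepends=True)[:2]:
    print(repr(line))

# Cut out the part between the two marker lines with a regular expression.
match = re.search(r"Station Table\n(.*?)End of table", text, flags=re.DOTALL)
body = match.group(1)

# Split into lines, then split each line on whitespace.
rows = [line.split() for line in body.splitlines() if line.strip()]

# Collect the pieces into a table.
df = pd.DataFrame(rows, columns=["code", "name", "count"])
df["count"] = pd.to_numeric(df["count"])
print(df)
```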

EXCEL

(1) Import pandas and read the file (2) Drop useless columns and rows (3) Set the values of one column as the index (4) Tidy up all the columns -- see the video for the exact code
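
The cleanup steps could look something like this. To keep the sketch runnable without an Excel file, a small DataFrame stands in for the result of pd.read_excel; its layout and column names are made up, and the actual code from the video may differ.

```python
import pandas as pd

# Stand-in for pd.read_excel("your_file.xlsx"); layout and names are made up.
raw = pd.DataFrame({
    "Unnamed: 0": [None, None, None],
    "Station ID": ["A01", "A02", None],
    " Name ": ["Library", "Station", None],
    "Count": [12, 7, None],
})

# Drop columns and rows that contain no data at all.
df = raw.dropna(axis=1, how="all")
df = df.dropna(axis=0, how="all")

# Use one column's values as the index.
df = df.set_index("Station ID")

# Tidy the remaining column names.
df.columns = [c.strip().lower() for c in df.columns]
print(df)
```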

Additional notes

GeoJSON: a format for encoding a variety of geographic data structures. Its structure is very similar to JSON; a GeoJSON file is in fact valid JSON, but it has some fixed elements, such as points, line strings, polygons, and multi-part collections of these types. You can use the geopandas package to read a GeoJSON file.
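
Because GeoJSON is ordinary JSON, its fixed structure can be inspected with the json module alone, as in this small sketch. The feature below is made up; with geopandas installed, geopandas.read_file("your_file.geojson") would load a real file into a GeoDataFrame instead.

```python
import json

# A made-up FeatureCollection with a single Point feature.
geojson_text = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature",
    "geometry": {"type": "Point", "coordinates": [144.961, -37.7983]},
    "properties": {"name": "Library"}}
 ]}
"""
collection = json.loads(geojson_text)

points = []
for feature in collection["features"]:
    # GeoJSON stores coordinates as [longitude, latitude].
    lon, lat = feature["geometry"]["coordinates"]
    points.append({"name": feature["properties"]["name"], "lat": lat, "lon": lon})

print(points)
```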

Exporting data: df.to_csv() (a DataFrame method; pandas has no pd.write_csv())
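
In pandas, CSV export goes through the DataFrame method to_csv. A minimal sketch, writing to an in-memory buffer so it runs anywhere; the table contents are made up, and in practice the call would be df.to_csv("output.csv").

```python
import io
import pandas as pd

# Small made-up table with a named index.
df = pd.DataFrame(
    {"lat": [-37.5, -33.5], "lon": [144.5, 151.5]},
    index=pd.Index([101, 102], name="id"),
)

# Write to a buffer here; pass a file path instead to write to disk.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Pass index=False to to_csv if the index should not be written as a column.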

For the rest of the code, see the three Week 3 Jupyter Notebook files.