In this part, we cover scraping data from the web. Data can be presented as HTML, XML, or through an API, among other formats. Web scraping is the practice of using libraries to sift through a web page and gather the data you need in a format that is most useful to you, while preserving the structure of the data.
There are several ways to extract information from the web, and using an API is probably the best. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API, so we often need to scrape the HTML of a website to fetch the information.
The Python libraries needed in this tutorial are imported as follows; only BeautifulSoup (bs4) is non-standard and may need to be installed separately.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
Instead of retrieving all the links that exist in a Wikipedia article, we are interested in extracting only the links that point to other article pages. If you look at the source code of the following page
https://en.wikipedia.org/wiki/Kevin_Bacon
in your browser, you will find that all of these links have three things in common: they reside within the div whose id is "bodyContent", their URLs do not contain colons, and their URLs begin with /wiki/.
We can use these rules to construct our search through the HTML page.
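Before writing the full search, it can help to sanity-check these rules as a regular expression. The sketch below uses the same pattern that appears later in this tutorial; the sample hrefs are made up for illustration.

import re

# Article links begin with /wiki/ and contain no colon (a colon indicates a
# namespace page such as Category: or File:); the third rule, being inside
# the div with id "bodyContent", is enforced later by the BeautifulSoup search.
article_pattern = re.compile("^(/wiki/)((?!:).)*$")

# Illustrative hrefs (not taken from a live page)
samples = ["/wiki/Philadelphia",
           "/wiki/Category:American_male_actors",
           "#cite_note-1"]

for href in samples:
    print(href, "matches" if article_pattern.match(href) else "does not match")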
Firstly, use the urlopen() function to open the Wikipedia page for "Kevin Bacon":
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
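Note that urlopen() raises an exception when a page cannot be fetched. A minimal sketch of guarding the call (not part of the original tutorial code) might look like this:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
except HTTPError as e:
    # The server was reached but returned an error status such as 404 or 500
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all (bad host name, no network, ...)
    print("Server not reachable:", e.reason)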
Then, find and print all the article links. To finish this task, you need to match anchor tags such as
<a href="/wiki/Kevin_Bacon_(disambiguation)" class="mw-disambig" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>
<a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
Hint: a regular expression is needed.
bsobj = BeautifulSoup(html, "lxml")
for link in bsobj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Now, assume that we want to find a random chain of Wikipedia articles starting from "Kevin Bacon", in the spirit of the so-called "Six Degrees of Wikipedia". In other words, the task is to find two subjects linked by a chain containing no more than six subjects (including the two original subjects).
import datetime
import random

# Seed the random number generator with the current time
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Return only the links that point to other article pages
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
The details of the random walk along the links are as follows:
count = 0
while len(links) > 0 and count < 5:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    count = count + 1
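The loop above prints each step but does not remember the chain it followed. A small variation (a sketch, not part of the original code) that reuses getLinks() and records the visited articles makes it easy to check the six-subject limit:

# Record the chain of visited articles, starting from Kevin Bacon
chain = ["/wiki/Kevin_Bacon"]
links = getLinks(chain[0])

# Stop once the chain already contains six subjects (including the start)
while len(links) > 0 and len(chain) < 6:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    chain.append(newArticle)
    links = getLinks(newArticle)

print(" -> ".join(chain))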
The general approach to an exhaustive site crawl is to start at the root, i.e., the home page of a website. Here, we will start with
https://en.wikipedia.org/
by retrieving all the links that appear on the home page and then traversing each link recursively. However, the number of links is very large, and a link can appear in many Wikipedia articles. Thus, we need to consider how to avoid repeatedly crawling the same article or page. To do so, we can keep a running set of visited pages for easy lookups and slightly update the getLinks() function.
pages = set()
We stop once the crawl has collected ten pages, i.e., while len(pages) < 10 still holds. Otherwise, the script would run through the entire Wikipedia website, which would take a very long time to finish, so please avoid that in the tutorial class.
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 10:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
One purpose of traversing all the links is to extract data. The best practice is to look at a few pages from the site and determine the patterns. By looking at a handful of Wikipedia pages, both article and non-article pages, the following patterns can be identified:
<h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>
<h1 id="firstHeading" class="firstHeading" lang="en">Main Page</h1>
Now, the task is to further modify the getLinks() function to print the title, the first paragraph, and the edit link of each page. The content from each page should be separated by
print("----------------\n"+newPage)
pages = set()
As before, we stop once five pages have been collected, i.e., while len(pages) < 5 still holds, to avoid crawling the entire Wikipedia website during the tutorial class.
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 5:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
In addition to HTML, data is commonly exposed on the web through public APIs. We use the 'requests' package (http://docs.python-requests.org) to call APIs from Python. In the following example, we call a public API to collect weather data.
You need to sign up for a free account at http://api.openweathermap.org to get your unique API key, which you will use in the following code.
#Now we use requests to retrieve the web page with our data
import requests
url = 'http://api.openweathermap.org/data/2.5/forecast?id=524901&cnt=16&APPID=1499bcd50a6310a21f11b8de4fb653a5'
# Replace the APPID value in the URL above with your own API key
response = requests.get(url)
response
The response object contains the GET query response. A successful request has a status code of 200. We then need to parse the response as JSON to extract the information.
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print(response.status_code)
# response.content is text
print(type(response.content))
# response.json() converts the content to json
data = response.json()
print(type(data))
data.keys()
data
The keys explain the structure of the fetched data. Try displaying the values for each key. In this example, the weather information is stored under the 'list' key.
data['list'][15]
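To get a feel for the structure before building a table, you can loop over the forecast entries. The field names below ('dt_txt' and 'main' -> 'temp') are assumptions based on the usual OpenWeatherMap forecast response; check data['list'][0].keys() if your response differs.

# Print the forecast time and temperature of each entry
for entry in data['list']:
    print(entry.get('dt_txt'), entry.get('main', {}).get('temp'))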
The next step is to create a DataFrame with the weather information, as demonstrated below. You can select a subset of the columns to display or display the entire DataFrame.
from pandas import DataFrame
# data with the default column headers
weather_table_all = DataFrame(data['list'])
weather_table_all
Further parsing is still required to get the table (DataFrame) into a flat shape. Now it's your turn: parse the weather data to generate a flat table.
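As a starting point, here is one possible sketch using pandas.json_normalize to flatten the nested dictionaries. The column names ('main.temp', 'wind.speed', etc.) assume the usual OpenWeatherMap forecast fields, so adjust the selection to whatever your response actually contains.

import pandas as pd

# Expand nested dictionaries such as 'main' and 'wind' into dot-separated
# columns (e.g. 'main.temp', 'wind.speed')
flat = pd.json_normalize(data['list'])

# 'weather' holds a list of dictionaries, so extract the first description by hand
# (field names are assumptions based on the usual forecast response)
if 'weather' in flat.columns:
    flat['weather.description'] = flat['weather'].apply(
        lambda w: w[0].get('description') if isinstance(w, list) and w else None)
    flat = flat.drop(columns=['weather'])

# Keep a readable subset of columns, skipping any that are not present
wanted = ['dt_txt', 'main.temp', 'main.humidity', 'wind.speed', 'weather.description']
weather_table = flat[[c for c in wanted if c in flat.columns]]
weather_table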
Please note that the materials used in this tutorial are partially based on the book "Web Scraping with Python" by Ryan Mitchell.