Tutorial 9A. Web Scraping

In this part we cover scraping data from the web. Data can be presented in HTML or XML, or exposed through APIs, among other formats. Web scraping is the practice of using libraries to sift through a web page and gather the data you need in a format most useful to you, while at the same time preserving the structure of the data.

There are several ways to extract information from the web, and using an API is probably the best of them. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API, so we often need to scrape the HTML of a website to fetch the information.

Python libraries needed in this tutorial include

  • urllib (part of the standard library)
  • beautifulsoup4 (imported as bs4)
  • requests
In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

Task 1 Extract article links from a Wikipedia page

Instead of retrieving all the links in a Wikipedia article, we are interested in extracting only the links that point to other article pages. If you look at the source code of the following page

https://en.wikipedia.org/wiki/Kevin_Bacon

in your browser, you will find that all these links have three things in common:

  • They are in the div with id set to bodyContent
  • The URLs do not contain colons
  • The URLs begin with /wiki/

We can use these rules to construct our search through the HTML page.
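
To see how these rules translate into a single pattern, here is a small sketch (the example hrefs are made up for illustration) that checks a few candidate hrefs against the regular expression used later in this task:

import re

# Links must start with /wiki/ and must not contain a colon anywhere
wiki_link = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Philadelphia",                  # article link -> matches
    "/wiki/Kevin_Bacon_(disambiguation)",  # article link -> matches
    "/wiki/Category:Living_people",        # contains a colon -> no match
    "/w/index.php?title=Kevin_Bacon",      # does not start with /wiki/ -> no match
    "#cite_note-1",                        # in-page anchor -> no match
]
for href in samples:
    print(href, "->", bool(wiki_link.match(href)))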

First, use the urlopen() function to open the Wikipedia page for "Kevin Bacon":

In [2]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")

Then, find and print all the links. In order to finish this task, you need to

  • find the div whose id = "bodyContent"
  • find all the link tags whose href starts with "/wiki/" and does not contain ":". For example,
    see <a href="/wiki/Kevin_Bacon_(disambiguation)" class="mw-disambig" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>
    <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
    

Hint: a regular expression is needed.

In [3]:
bsobj = BeautifulSoup(html, "lxml")
# Search only inside the bodyContent div, and keep hrefs that start with
# "/wiki/" and contain no colon
for link in bsobj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorrow
/wiki/Guiding_Light
/wiki/Friday_the_13th_(1980_film)
/wiki/Phoenix_Theater
/wiki/Flux
/wiki/Second_Stage_Theatre
/wiki/Obie_Award
/wiki/Forty_Deuce
/wiki/Slab_Boys
/wiki/Sean_Penn
/wiki/Val_Kilmer
/wiki/Barry_Levinson
/wiki/Diner_(film)
/wiki/Steve_Guttenberg
/wiki/Daniel_Stern_(actor)
/wiki/Mickey_Rourke
/wiki/Tim_Daly
/wiki/Ellen_Barkin
/wiki/Footloose_(1984_film)
/wiki/James_Dean
/wiki/Rebel_Without_a_Cause
/wiki/Mickey_Rooney
/wiki/Judy_Garland
/wiki/People_(American_magazine)
/wiki/Typecasting_(acting)
/wiki/John_Hughes_(filmmaker)
/wiki/She%27s_Having_a_Baby
/wiki/The_Big_Picture_(1989_film)
/wiki/Tremors_(film)
/wiki/Joel_Schumacher
/wiki/Flatliners
/wiki/Elizabeth_Perkins
/wiki/He_Said,_She_Said_(film)
/wiki/The_New_York_Times
/wiki/Oliver_Stone
/wiki/JFK_(film)
/wiki/A_Few_Good_Men_(film)
/wiki/Michael_Greif
/wiki/Golden_Globe_Award
/wiki/The_River_Wild
/wiki/Meryl_Streep
/wiki/Murder_in_the_First_(film)
/wiki/Blockbuster_(entertainment)
/wiki/Apollo_13_(film)
/wiki/Sleepers_(film)
/wiki/Picture_Perfect_(1997_film)
/wiki/Losing_Chase
/wiki/Digging_to_China
/wiki/Payola
/wiki/Telling_Lies_in_America_(film)
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/David_Koepp
/wiki/Taking_Chance
/wiki/Paul_Verhoeven
/wiki/Hollow_Man
/wiki/Colin_Firth
/wiki/Rachel_Blanchard
/wiki/M%C3%A9nage_%C3%A0_trois
/wiki/Where_the_Truth_Lies
/wiki/Atom_Egoyan
/wiki/MPAA
/wiki/MPAA_film_rating_system
/wiki/Sean_Penn
/wiki/Tim_Robbins
/wiki/Clint_Eastwood
/wiki/Mystic_River_(film)
/wiki/Pedophile
/wiki/The_Woodsman_(2004_film)
/wiki/HBO_Films
/wiki/Taking_Chance
/wiki/Michael_Strobl
/wiki/Desert_Storm
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Matthew_Vaughn
/wiki/Sebastian_Shaw_(comics)
/wiki/Dustin_Lance_Black
/wiki/8_(play)
/wiki/Perry_v._Brown
/wiki/Proposition_8
/wiki/Charles_J._Cooper
/wiki/Wilshire_Ebell_Theatre
/wiki/American_Foundation_for_Equal_Rights
/wiki/The_Following
/wiki/Saturn_Award_for_Best_Actor_on_Television
/wiki/Huffington_Post
/wiki/Tremors_(film)
/wiki/EE_(telecommunications_company)
/wiki/United_Kingdom
/wiki/Egg_as_food
/wiki/Kyra_Sedgwick
/wiki/PBS
/wiki/Lanford_Wilson
/wiki/Lemon_Sky
/wiki/Pyrates
/wiki/Murder_in_the_First_(film)
/wiki/The_Woodsman_(2004_film)
/wiki/Loverboy_(2005_film)
/wiki/Sosie_Bacon
/wiki/Upper_West_Side
/wiki/Manhattan
/wiki/Tracy_Pollan
/wiki/The_Times
/wiki/Will.i.am
/wiki/It%27s_a_New_Day_(Will.i.am_song)
/wiki/Barack_Obama
/wiki/Ponzi_scheme
/wiki/Bernard_Madoff
/wiki/Finding_Your_Roots
/wiki/Henry_Louis_Gates
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/Trivia
/wiki/Big_screen
/wiki/Six_degrees_of_separation
/wiki/Internet_meme
/wiki/SixDegrees.org
/wiki/Bacon_number
/wiki/Internet_Movie_Database
/wiki/Paul_Erd%C5%91s
/wiki/Erd%C5%91s_number
/wiki/Paul_Erd%C5%91s
/wiki/Bacon_number
/wiki/Erd%C5%91s_number
/wiki/Erd%C5%91s%E2%80%93Bacon_number
/wiki/The_Bacon_Brothers
/wiki/Michael_Bacon_(musician)
/wiki/Music_album
/wiki/Hollywood_Walk_of_Fame
/wiki/Hollywood_Walk_of_Fame
/wiki/Denver_Film_Festival
/wiki/Phoenix_Film_Festival
/wiki/Santa_Barbara_International_Film_Festival
/wiki/Broadcast_Film_Critics_Association
/wiki/Seattle_International_Film_Festival
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Blockbuster_Entertainment_Awards
/wiki/Blockbuster_Entertainment_Awards
/wiki/Hollow_Man
/wiki/Boston_Society_of_Film_Critics
/wiki/Boston_Society_of_Film_Critics_Award_for_Best_Cast
/wiki/Mystic_River_(film)
/wiki/Bravo_Otto
/wiki/Bravo_Otto
/wiki/Footloose_(1984_film)
/wiki/CableACE_Award
/wiki/CableACE_Award
/wiki/Losing_Chase
/wiki/The_Woodsman_(2004_film)
/wiki/Critics%27_Choice_Movie_Awards
/wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
/wiki/Murder_in_the_First_(film)
/wiki/Ghent_International_Film_Festival
/wiki/Ghent_International_Film_Festival
/wiki/The_Woodsman_(2004_film)
/wiki/Giffoni_Film_Festival
/wiki/Giffoni_Film_Festival
/wiki/Digging_to_China
/wiki/Gold_Derby_Awards
/wiki/Gold_Derby_Awards
/wiki/Mystic_River_(film)
/wiki/Golden_Globe_Award
/wiki/Golden_Globe_Award_for_Best_Supporting_Actor_%E2%80%93_Motion_Picture
/wiki/The_River_Wild
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Taking_Chance
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/I_Love_Dick_(TV_series)
/wiki/Independent_Spirit_Awards
/wiki/Independent_Spirit_Award_for_Best_Male_Lead
/wiki/The_Woodsman_(2004_film)
/wiki/Mystic_River_(film)
/wiki/MTV_Movie_%26_TV_Awards
/wiki/MTV_Movie_Award_for_Best_Villain
/wiki/Hollow_Man
/wiki/Taking_Chance
/wiki/The_Following
/wiki/E!_People%27s_Choice_Awards
/wiki/E!_People%27s_Choice_Awards
/wiki/The_Following
/wiki/E!_People%27s_Choice_Awards
/wiki/The_Following
/wiki/Primetime_Emmy_Award
/wiki/Primetime_Emmy_Award_for_Outstanding_Lead_Actor_in_a_Limited_Series_or_Movie
/wiki/Taking_Chance
/wiki/Satellite_Awards
/wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Motion_Picture
/wiki/The_Woodsman_(2004_film)
/wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Taking_Chance
/wiki/Saturn_Award
/wiki/Saturn_Award_for_Best_Actor_on_Television
/wiki/The_Following
/wiki/Saturn_Award_for_Best_Actor_on_Television
/wiki/The_Following
/wiki/Scream_Awards
/wiki/Scream_Awards
/wiki/Screen_Actors_Guild_Award
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Supporting_Role
/wiki/Murder_in_the_First_(film)
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
/wiki/Apollo_13_(film)
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
/wiki/Mystic_River_(film)
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
/wiki/Frost/Nixon_(film)
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Taking_Chance
/wiki/Teen_Choice_Awards
/wiki/Teen_Choice_Award_for_Choice_Movie_Villain
/wiki/Beauty_Shop
/wiki/Teen_Choice_Award_for_Choice_Movie_Villain
/wiki/TV_Guide_Award
/wiki/TV_Guide_Award
/wiki/The_Following
/wiki/Kevin_Bacon_filmography
/wiki/List_of_actors_with_Hollywood_Walk_of_Fame_motion_picture_stars
/wiki/The_Austin_Chronicle
/wiki/Access_Hollywood
/wiki/IMDb
/wiki/Internet_Broadway_Database
/wiki/Lortel_Archives
/wiki/AllMovie
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Kevin_Bacon_filmography
/wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
/wiki/Geoffrey_Rush
/wiki/Jack_Nicholson
/wiki/Ian_McKellen
/wiki/Russell_Crowe
/wiki/Russell_Crowe
/wiki/Russell_Crowe
/wiki/Daniel_Day-Lewis
/wiki/Jack_Nicholson
/wiki/Sean_Penn
/wiki/Jamie_Foxx
/wiki/Philip_Seymour_Hoffman
/wiki/Forest_Whitaker
/wiki/Daniel_Day-Lewis
/wiki/Sean_Penn
/wiki/Jeff_Bridges
/wiki/Colin_Firth
/wiki/George_Clooney
/wiki/Daniel_Day-Lewis
/wiki/Matthew_McConaughey
/wiki/Michael_Keaton
/wiki/Leonardo_DiCaprio
/wiki/Casey_Affleck
/wiki/Gary_Oldman
/wiki/Christian_Bale
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Mickey_Rooney
/wiki/Anthony_Andrews
/wiki/Richard_Chamberlain
/wiki/Ted_Danson
/wiki/Dustin_Hoffman
/wiki/James_Woods
/wiki/Randy_Quaid
/wiki/Michael_Caine
/wiki/Stacy_Keach
/wiki/Robert_Duvall
/wiki/James_Garner
/wiki/Beau_Bridges
/wiki/Robert_Duvall
/wiki/James_Garner
/wiki/Raul_Julia
/wiki/Gary_Sinise
/wiki/Alan_Rickman
/wiki/Ving_Rhames
/wiki/Stanley_Tucci
/wiki/Jack_Lemmon
/wiki/Brian_Dennehy
/wiki/James_Franco
/wiki/Albert_Finney
/wiki/Al_Pacino
/wiki/Geoffrey_Rush
/wiki/Jonathan_Rhys_Meyers
/wiki/Bill_Nighy
/wiki/Jim_Broadbent
/wiki/Paul_Giamatti
/wiki/Al_Pacino
/wiki/Idris_Elba
/wiki/Kevin_Costner
/wiki/Michael_Douglas
/wiki/Billy_Bob_Thornton
/wiki/Oscar_Isaac
/wiki/Tom_Hiddleston
/wiki/Ewan_McGregor
/wiki/Darren_Criss
/wiki/Saturn_Award_for_Best_Actor_on_Television
/wiki/Kyle_Chandler
/wiki/Steven_Weber_(actor)
/wiki/Richard_Dean_Anderson
/wiki/David_Boreanaz
/wiki/Robert_Patrick
/wiki/Ben_Browder
/wiki/David_Boreanaz
/wiki/David_Boreanaz
/wiki/Ben_Browder
/wiki/Matthew_Fox
/wiki/Michael_C._Hall
/wiki/Matthew_Fox
/wiki/Edward_James_Olmos
/wiki/Josh_Holloway
/wiki/Stephen_Moyer
/wiki/Bryan_Cranston
/wiki/Bryan_Cranston
/wiki/Mads_Mikkelsen
/wiki/Hugh_Dancy
/wiki/Andrew_Lincoln
/wiki/Bruce_Campbell
/wiki/Andrew_Lincoln
/wiki/Kyle_MacLachlan
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Raul_Julia
/wiki/Gary_Sinise
/wiki/Alan_Rickman
/wiki/Gary_Sinise
/wiki/Christopher_Reeve
/wiki/Jack_Lemmon
/wiki/Brian_Dennehy
/wiki/Ben_Kingsley
/wiki/William_H._Macy
/wiki/Al_Pacino
/wiki/Geoffrey_Rush
/wiki/Paul_Newman
/wiki/Jeremy_Irons
/wiki/Kevin_Kline
/wiki/Paul_Giamatti
/wiki/Al_Pacino
/wiki/Paul_Giamatti
/wiki/Kevin_Costner
/wiki/Michael_Douglas
/wiki/Mark_Ruffalo
/wiki/Idris_Elba
/wiki/Bryan_Cranston
/wiki/Alexander_Skarsg%C3%A5rd
/wiki/Darren_Criss
/wiki/Bibsys
/wiki/Biblioteca_Nacional_de_Espa%C3%B1a
/wiki/Biblioth%C3%A8que_nationale_de_France
/wiki/Integrated_Authority_File
/wiki/International_Standard_Name_Identifier
/wiki/Library_of_Congress_Control_Number
/wiki/National_Library_of_Latvia
/wiki/MusicBrainz
/wiki/National_Library_of_the_Czech_Republic
/wiki/National_Library_of_Australia
/wiki/SNAC
/wiki/Syst%C3%A8me_universitaire_de_documentation
/wiki/Virtual_International_Authority_File
/wiki/WorldCat_Identities

Task 2 Perform a random walk through a given webpage.

Assume that we want to find a random article in Wikipedia that is linked to "Kevin Bacon" via the so-called "Six Degrees of Wikipedia". In other words, the task is to find two subjects linked by a chain containing no more than six subjects (including the two original subjects).

In [4]:
import datetime
import random

# Seed the random number generator with the current time
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    # Fetch the article and return the links pointing to other article pages
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")

The details of the random walk along the links are:

  • Randomly choose a link from the list of retrieved links.
  • Print the article represented by the link.
  • Retrieve the list of links from that article.
  • Repeat the above steps until the number of retrieved articles reaches 5.
In [5]:
count = 0
while len(links) > 0 and count < 5:
    # Pick a random link, print it, then continue the walk from that article
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    count = count + 1
/wiki/SixDegrees.org
/wiki/Nonprofit_organization
/wiki/Click-to-donate_site
/wiki/GiveWell
/wiki/Earning_to_give
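
If you also want to keep the chain of visited articles (to check the six-degrees idea), a minimal variation of the loop above could record each step, reusing the getLinks() function defined earlier (this re-fetches pages, so it may take a few seconds):

chain = ["/wiki/Kevin_Bacon"]   # the chain starts at the original article
links = getLinks(chain[0])
while len(links) > 0 and len(chain) < 6:
    # pick a random link, append it to the chain and continue from there
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    chain.append(newArticle)
    links = getLinks(newArticle)
print(" -> ".join(chain))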

Task 3 Crawl the entire Wikipedia website

The general approach to an exhaustive site crawl is to start with the root, i.e., the home page of a website. Here, we will start with

https://en.wikipedia.org/

by retrieving all the links that appear on the home page, and then traversing each link recursively. However, the number of links is going to be very large, and a link can appear in many Wikipedia articles. Thus, we need to consider how to avoid repeatedly crawling the same article or page. To do so, we can keep a running set of visited pages for easy lookups and slightly update the getLinks() function.

In [6]:
pages = set()

Note: add a terminating condition in your code, for example,

len(pages) < 10

Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [7]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 10:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)
In [8]:
getLinks("")
----------------
/wiki/Wikipedia
----------------
/wiki/Wikipedia:Protection_policy#semi
----------------
/wiki/Wikipedia:Requests_for_page_protection
----------------
/wiki/Wikipedia:Requests_for_permissions
----------------
/wiki/Wikipedia:Protection_policy#template
----------------
/wiki/Wikipedia:Lists_of_protected_pages
----------------
/wiki/Wikipedia:Protection_policy
----------------
/wiki/Wikipedia:Perennial_proposals
----------------
/wiki/Wikipedia:Reliable_sources/Perennial_sources
----------------
/wiki/Wikipedia:Reliable_sources
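
As a side note, a recursive crawl like the one above can hit Python's default recursion limit on a large site. A sketch of an iterative alternative with an explicit queue, reusing the imports from the start of the tutorial and doing the same bookkeeping without recursion (the limit of 10 pages is again only for the tutorial), might look like:

from collections import deque

def crawlIteratively(startUrl="", maxPages=10):
    visited = set()
    queue = deque([startUrl])
    while queue and len(visited) < maxPages:
        pageUrl = queue.popleft()
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            href = link.attrs.get("href")
            if href and href not in visited and len(visited) < maxPages:
                # We have encountered a new page
                print("----------------\n" + href)
                visited.add(href)
                queue.append(href)

crawlIteratively("")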

Task 4 Collect data across the Wikipedia site

One purpose of traversing all the links is to extract data. The best practice is to look at a few pages from the site and determine the patterns. By looking at a handful of Wikipedia pages, both article and non-article pages, the following patterns can be identified (illustrated in the sketch after the list):

  • All titles are under h1 span tags, and these are the only h1 tags on the page. For example,
    <h1 id="firstHeading" class="firstHeading" lang="en">Kevin Bacon</h1>
    
    <h1 id="firstHeading" class="firstHeading" lang="en">Main Page</h1>
    
  • All body text lives under the div#bodyContent tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using div#mw-content-text -> p.
  • Edit links occur only on article pages. If they occur, they will be found in the li#ca-edit tag, under li#ca-edit -> span -> a
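
Before wiring these selectors into the crawler, the following sketch applies them to a single page (reusing urlopen and BeautifulSoup from earlier, on the Kevin Bacon article; the edit-link lookup is guarded because non-article pages may not have it):

html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")

# Title: the only h1 tag on the page
print(bsObj.h1.get_text())

# First paragraph of the body text
print(bsObj.find(id="mw-content-text").findAll("p")[0].get_text())

# Edit link: li#ca-edit -> span -> a, present only on article pages
editTag = bsObj.find(id="ca-edit")
if editTag is not None and editTag.find("a") is not None:
    print(editTag.find("a").attrs["href"])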

Now, the task is to further modify the getLinks() function to print the title, the first paragraph and the edit link. The content from each page should be separated by

print("----------------\n"+newPage)
In [9]:
pages = set()

Please also add a terminating condition in your code, for example,

len(pages) < 5

Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.

In [10]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        # Print the title, the first paragraph and the edit link (if present)
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages and len(pages) < 5:
                #We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)
In [11]:
getLinks("") 
Main Page
<p><b><a href="/wiki/Teresa_Sampsonia" title="Teresa Sampsonia">Teresa Sampsonia</a></b> (1589–1668) was a noblewoman of the <a class="mw-redirect" href="/wiki/Safavid_Empire" title="Safavid Empire">Safavid Empire</a> of Iran. She was born into a noble <a href="/wiki/Orthodoxy#Christianity" title="Orthodoxy">Orthodox Christian</a> <a href="/wiki/Circassians" title="Circassians">Circassian</a> family and grew up in <a href="/wiki/Isfahan" title="Isfahan">Isfahan</a> in the Iranian royal court. In 1608 she married the <a href="/wiki/Elizabethan_era" title="Elizabethan era">Elizabethan</a> English adventurer <a href="/wiki/Robert_Shirley" title="Robert Shirley">Robert Shirley</a>, who attended the Safavid court in an effort to forge an alliance against the neighbouring <a href="/wiki/Ottoman_Empire" title="Ottoman Empire">Ottoman Empire</a>. She accompanied him on the <a href="/wiki/Persian_embassy_to_Europe_(1609%E2%80%9315)" title="Persian embassy to Europe (1609–15)">Persian embassy to Europe (1609–15)</a>, where he represented the Safavid king <a href="/wiki/Abbas_the_Great" title="Abbas the Great">Abbas the Great</a>. She was received by many of the <a href="/wiki/Monarchies_in_Europe" title="Monarchies in Europe">royal houses of Europe</a>, including the English prince <a href="/wiki/Henry_Frederick,_Prince_of_Wales" title="Henry Frederick, Prince of Wales">Henry Frederick</a> and <a href="/wiki/Anne_of_Denmark" title="Anne of Denmark">Queen Anne</a>, who were her son's <a href="/wiki/Godparent" title="Godparent">godparents</a>. The historian <a href="/wiki/Sir_Thomas_Herbert,_1st_Baronet" title="Sir Thomas Herbert, 1st Baronet">Thomas Herbert</a> considered Robert Shirley "the greatest Traveller of his time", but admired the "undaunted Lady Teresa" even more. Following the death of her husband from <a href="/wiki/Dysentery" title="Dysentery">dysentery</a> in 1628, she left Iran and lived in a convent in <a href="/wiki/Rome" title="Rome">Rome</a> for the rest of her life. (<a href="/wiki/Teresa_Sampsonia" title="Teresa Sampsonia"><b>Full article...</b></a>)
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia
Wikipedia
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:Protection_policy#semi
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:Requests_for_page_protection
Wikipedia:Requests for page protection
<p>This page is for requesting that a page, file or template be <b> fully protected</b>, <b>create protected</b> (<a href="/wiki/Wikipedia:Protection_policy#Creation_protection" title="Wikipedia:Protection policy">salted</a>), <b>extended confirmed protected</b>, <b>semi-protected</b>, added to <b>pending changes</b>, <b>move-protected</b>, <b>template protected</b> (template-specific), <b>upload protected</b> (file-specific), or <b>unprotected</b>. Please read up on the <a href="/wiki/Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection policy</a>. Full protection is used to stop edit warring between multiple users or to prevent vandalism to <a href="/wiki/Wikipedia:High-risk_templates" title="Wikipedia:High-risk templates">high-risk templates</a>; semi-protection and pending changes are usually used only to prevent IP and new user vandalism (see the <a href="/wiki/Wikipedia:Rough_guide_to_semi-protection" title="Wikipedia:Rough guide to semi-protection">rough guide to semi-protection</a>); and move protection is used to stop <a href="/wiki/Wikipedia:Moving_a_page" title="Wikipedia:Moving a page">pagemove</a> revert wars. Extended confirmed protection is used where semi-protection has proved insufficient (see the <a href="/wiki/Wikipedia:Rough_guide_to_extended_confirmed_protection" title="Wikipedia:Rough guide to extended confirmed protection">rough guide to extended confirmed protection</a>)
</p>
/w/index.php?title=Wikipedia:Requests_for_page_protection&action=edit
----------------
/wiki/Wikipedia:Requests_for_permissions
Wikipedia:Requests for permissions
<p><span class="sysop-show" id="coordinates"><a href="/wiki/Wikipedia:Requests_for_permissions/Administrator_instructions" title="Wikipedia:Requests for permissions/Administrator instructions">Administrator instructions</a></span>
</p>
This page is missing something! No worries though!
----------------
/wiki/Wikipedia:Protection_policy#template
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! No worries though!

Task 5 API access

In addition to HTML, data is commonly available on the web through public APIs. We use the 'requests' package (http://docs.python-requests.org) to call APIs from Python. In the following example, we call a public API to collect weather data.

You need to sign up for a free account to get your unique API key to use in the following code. Register at http://api.openweathermap.org.
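
The cell below pastes the API key directly into the URL for simplicity. As a side note, a common alternative is to keep the key out of the notebook, for example in an environment variable (the variable name OWM_API_KEY below is just an illustrative assumption), and let requests build the query string:

import os
import requests

# Read the key from the environment; the fallback string is only a placeholder
api_key = os.environ.get("OWM_API_KEY", "your-api-key-here")
params = {"id": 524901, "cnt": 16, "APPID": api_key}
response = requests.get("http://api.openweathermap.org/data/2.5/forecast", params=params)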

In [12]:
# Now we use requests to retrieve the weather data from the web API
import requests
url = 'http://api.openweathermap.org/data/2.5/forecast?id=524901&cnt=16&APPID=1499bcd50a6310a21f11b8de4fb653a5'
# Replace the APPID value above with your own API key
response = requests.get(url)
response
Out[12]:
<Response [200]>

The response object contains the GET query response. A successful request has a status code of 200. We need to parse the response as JSON to extract the information.
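
As a defensive habit, you can also let requests raise an exception on a failed request before parsing, for example:

# Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error body
response.raise_for_status()
data = response.json()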

In [13]:
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print (response.status_code)
200
In [14]:
# response.content is raw bytes
print (type(response.content))
<class 'bytes'>
In [15]:
# response.json() parses the JSON content into a Python dictionary
data = response.json()
print (type(data))
<class 'dict'>
In [16]:
data.keys()
Out[16]:
dict_keys(['cod', 'message', 'cnt', 'list', 'city'])
In [17]:
data
Out[17]:
{'cod': '200',
 'message': 0.0106,
 'cnt': 16,
 'list': [{'dt': 1556355600,
   'main': {'temp': 296.31,
    'temp_min': 295,
    'temp_max': 296.31,
    'pressure': 1011.01,
    'sea_level': 1011.01,
    'grnd_level': 989.15,
    'humidity': 36,
    'temp_kf': 1.31},
   'weather': [{'id': 802,
     'main': 'Clouds',
     'description': 'scattered clouds',
     'icon': '03d'}],
   'clouds': {'all': 35},
   'wind': {'speed': 3.4, 'deg': 250.866},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-27 09:00:00'},
  {'dt': 1556366400,
   'main': {'temp': 296.09,
    'temp_min': 295.105,
    'temp_max': 296.09,
    'pressure': 1009.38,
    'sea_level': 1009.38,
    'grnd_level': 987.62,
    'humidity': 33,
    'temp_kf': 0.98},
   'weather': [{'id': 802,
     'main': 'Clouds',
     'description': 'scattered clouds',
     'icon': '03d'}],
   'clouds': {'all': 45},
   'wind': {'speed': 4.28, 'deg': 252.899},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-27 12:00:00'},
  {'dt': 1556377200,
   'main': {'temp': 294.06,
    'temp_min': 293.405,
    'temp_max': 294.06,
    'pressure': 1007.65,
    'sea_level': 1007.65,
    'grnd_level': 986.14,
    'humidity': 41,
    'temp_kf': 0.66},
   'weather': [{'id': 804,
     'main': 'Clouds',
     'description': 'overcast clouds',
     'icon': '04d'}],
   'clouds': {'all': 90},
   'wind': {'speed': 3.72, 'deg': 239.992},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-27 15:00:00'},
  {'dt': 1556388000,
   'main': {'temp': 290.13,
    'temp_min': 289.805,
    'temp_max': 290.13,
    'pressure': 1006.83,
    'sea_level': 1006.83,
    'grnd_level': 984.78,
    'humidity': 55,
    'temp_kf': 0.33},
   'weather': [{'id': 804,
     'main': 'Clouds',
     'description': 'overcast clouds',
     'icon': '04n'}],
   'clouds': {'all': 91},
   'wind': {'speed': 3.55, 'deg': 233.789},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-27 18:00:00'},
  {'dt': 1556398800,
   'main': {'temp': 286.232,
    'temp_min': 286.232,
    'temp_max': 286.232,
    'pressure': 1006.73,
    'sea_level': 1006.73,
    'grnd_level': 984.7,
    'humidity': 78,
    'temp_kf': 0},
   'weather': [{'id': 500,
     'main': 'Rain',
     'description': 'light rain',
     'icon': '10n'}],
   'clouds': {'all': 98},
   'wind': {'speed': 3.43, 'deg': 18.704},
   'rain': {'3h': 0.688},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-27 21:00:00'},
  {'dt': 1556409600,
   'main': {'temp': 283.784,
    'temp_min': 283.784,
    'temp_max': 283.784,
    'pressure': 1008.42,
    'sea_level': 1008.42,
    'grnd_level': 986.25,
    'humidity': 87,
    'temp_kf': 0},
   'weather': [{'id': 804,
     'main': 'Clouds',
     'description': 'overcast clouds',
     'icon': '04n'}],
   'clouds': {'all': 93},
   'wind': {'speed': 4.64, 'deg': 16.44},
   'rain': {},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-28 00:00:00'},
  {'dt': 1556420400,
   'main': {'temp': 281.722,
    'temp_min': 281.722,
    'temp_max': 281.722,
    'pressure': 1011.07,
    'sea_level': 1011.07,
    'grnd_level': 988.49,
    'humidity': 75,
    'temp_kf': 0},
   'weather': [{'id': 500,
     'main': 'Rain',
     'description': 'light rain',
     'icon': '10d'}],
   'clouds': {'all': 91},
   'wind': {'speed': 6.26, 'deg': 32.303},
   'rain': {'3h': 0.062},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-28 03:00:00'},
  {'dt': 1556431200,
   'main': {'temp': 278.729,
    'temp_min': 278.729,
    'temp_max': 278.729,
    'pressure': 1014.57,
    'sea_level': 1014.57,
    'grnd_level': 991.86,
    'humidity': 67,
    'temp_kf': 0},
   'weather': [{'id': 804,
     'main': 'Clouds',
     'description': 'overcast clouds',
     'icon': '04d'}],
   'clouds': {'all': 94},
   'wind': {'speed': 6.19, 'deg': 30.014},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-28 06:00:00'},
  {'dt': 1556442000,
   'main': {'temp': 281.341,
    'temp_min': 281.341,
    'temp_max': 281.341,
    'pressure': 1016.79,
    'sea_level': 1016.79,
    'grnd_level': 993.76,
    'humidity': 47,
    'temp_kf': 0},
   'weather': [{'id': 803,
     'main': 'Clouds',
     'description': 'broken clouds',
     'icon': '04d'}],
   'clouds': {'all': 65},
   'wind': {'speed': 5.72, 'deg': 37.846},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-28 09:00:00'},
  {'dt': 1556452800,
   'main': {'temp': 283.066,
    'temp_min': 283.066,
    'temp_max': 283.066,
    'pressure': 1017.85,
    'sea_level': 1017.85,
    'grnd_level': 994.78,
    'humidity': 35,
    'temp_kf': 0},
   'weather': [{'id': 802,
     'main': 'Clouds',
     'description': 'scattered clouds',
     'icon': '03d'}],
   'clouds': {'all': 32},
   'wind': {'speed': 5.71, 'deg': 32.076},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-28 12:00:00'},
  {'dt': 1556463600,
   'main': {'temp': 281.817,
    'temp_min': 281.817,
    'temp_max': 281.817,
    'pressure': 1018.75,
    'sea_level': 1018.75,
    'grnd_level': 995.69,
    'humidity': 38,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01d'}],
   'clouds': {'all': 0},
   'wind': {'speed': 5.2, 'deg': 32.025},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-28 15:00:00'},
  {'dt': 1556474400,
   'main': {'temp': 278.308,
    'temp_min': 278.308,
    'temp_max': 278.308,
    'pressure': 1021.34,
    'sea_level': 1021.34,
    'grnd_level': 998.27,
    'humidity': 48,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01n'}],
   'clouds': {'all': 0},
   'wind': {'speed': 3.98, 'deg': 33.794},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-28 18:00:00'},
  {'dt': 1556485200,
   'main': {'temp': 276.176,
    'temp_min': 276.176,
    'temp_max': 276.176,
    'pressure': 1022.97,
    'sea_level': 1022.97,
    'grnd_level': 999.78,
    'humidity': 52,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01n'}],
   'clouds': {'all': 0},
   'wind': {'speed': 3.21, 'deg': 32.784},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-28 21:00:00'},
  {'dt': 1556496000,
   'main': {'temp': 274.592,
    'temp_min': 274.592,
    'temp_max': 274.592,
    'pressure': 1024.2,
    'sea_level': 1024.2,
    'grnd_level': 1000.78,
    'humidity': 60,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01n'}],
   'clouds': {'all': 0},
   'wind': {'speed': 2.68, 'deg': 35.503},
   'sys': {'pod': 'n'},
   'dt_txt': '2019-04-29 00:00:00'},
  {'dt': 1556506800,
   'main': {'temp': 274.092,
    'temp_min': 274.092,
    'temp_max': 274.092,
    'pressure': 1024.94,
    'sea_level': 1024.94,
    'grnd_level': 1001.43,
    'humidity': 60,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01d'}],
   'clouds': {'all': 0},
   'wind': {'speed': 2.4, 'deg': 43.128},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-29 03:00:00'},
  {'dt': 1556517600,
   'main': {'temp': 278.181,
    'temp_min': 278.181,
    'temp_max': 278.181,
    'pressure': 1025.67,
    'sea_level': 1025.67,
    'grnd_level': 1002.36,
    'humidity': 41,
    'temp_kf': 0},
   'weather': [{'id': 800,
     'main': 'Clear',
     'description': 'clear sky',
     'icon': '01d'}],
   'clouds': {'all': 0},
   'wind': {'speed': 2.85, 'deg': 44.801},
   'sys': {'pod': 'd'},
   'dt_txt': '2019-04-29 06:00:00'}],
 'city': {'id': 524901,
  'name': 'Moscow',
  'coord': {'lat': 55.7522, 'lon': 37.6156},
  'country': 'RU'}}

The keys explain the structure of the fetched data. Try displaying the values for each element. In this example, the weather information is in the 'list' element.
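
For example, to get a quick overview of the forecast entries, you can loop over the 'list' element and print the timestamp, temperature (in Kelvin by default) and weather description of each record:

for entry in data['list']:
    print(entry['dt_txt'], entry['main']['temp'], entry['weather'][0]['description'])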

In [18]:
data['list'][15]
Out[18]:
{'dt': 1556517600,
 'main': {'temp': 278.181,
  'temp_min': 278.181,
  'temp_max': 278.181,
  'pressure': 1025.67,
  'sea_level': 1025.67,
  'grnd_level': 1002.36,
  'humidity': 41,
  'temp_kf': 0},
 'weather': [{'id': 800,
   'main': 'Clear',
   'description': 'clear sky',
   'icon': '01d'}],
 'clouds': {'all': 0},
 'wind': {'speed': 2.85, 'deg': 44.801},
 'sys': {'pod': 'd'},
 'dt_txt': '2019-04-29 06:00:00'}

The next step is to create a DataFrame with the weather information, as demonstrated below. You can select a subset of the data to display, or display all of it.

In [19]:
from pandas import DataFrame
# data with the default column headers
weather_table_all= DataFrame(data['list'])
weather_table_all
Out[19]:
clouds dt dt_txt main rain sys weather wind
0 {'all': 35} 1556355600 2019-04-27 09:00:00 {'temp': 296.31, 'temp_min': 295, 'temp_max': ... NaN {'pod': 'd'} [{'id': 802, 'main': 'Clouds', 'description': ... {'speed': 3.4, 'deg': 250.866}
1 {'all': 45} 1556366400 2019-04-27 12:00:00 {'temp': 296.09, 'temp_min': 295.105, 'temp_ma... NaN {'pod': 'd'} [{'id': 802, 'main': 'Clouds', 'description': ... {'speed': 4.28, 'deg': 252.899}
2 {'all': 90} 1556377200 2019-04-27 15:00:00 {'temp': 294.06, 'temp_min': 293.405, 'temp_ma... NaN {'pod': 'd'} [{'id': 804, 'main': 'Clouds', 'description': ... {'speed': 3.72, 'deg': 239.992}
3 {'all': 91} 1556388000 2019-04-27 18:00:00 {'temp': 290.13, 'temp_min': 289.805, 'temp_ma... NaN {'pod': 'n'} [{'id': 804, 'main': 'Clouds', 'description': ... {'speed': 3.55, 'deg': 233.789}
4 {'all': 98} 1556398800 2019-04-27 21:00:00 {'temp': 286.232, 'temp_min': 286.232, 'temp_m... {'3h': 0.688} {'pod': 'n'} [{'id': 500, 'main': 'Rain', 'description': 'l... {'speed': 3.43, 'deg': 18.704}
5 {'all': 93} 1556409600 2019-04-28 00:00:00 {'temp': 283.784, 'temp_min': 283.784, 'temp_m... {} {'pod': 'n'} [{'id': 804, 'main': 'Clouds', 'description': ... {'speed': 4.64, 'deg': 16.44}
6 {'all': 91} 1556420400 2019-04-28 03:00:00 {'temp': 281.722, 'temp_min': 281.722, 'temp_m... {'3h': 0.062} {'pod': 'd'} [{'id': 500, 'main': 'Rain', 'description': 'l... {'speed': 6.26, 'deg': 32.303}
7 {'all': 94} 1556431200 2019-04-28 06:00:00 {'temp': 278.729, 'temp_min': 278.729, 'temp_m... NaN {'pod': 'd'} [{'id': 804, 'main': 'Clouds', 'description': ... {'speed': 6.19, 'deg': 30.014}
8 {'all': 65} 1556442000 2019-04-28 09:00:00 {'temp': 281.341, 'temp_min': 281.341, 'temp_m... NaN {'pod': 'd'} [{'id': 803, 'main': 'Clouds', 'description': ... {'speed': 5.72, 'deg': 37.846}
9 {'all': 32} 1556452800 2019-04-28 12:00:00 {'temp': 283.066, 'temp_min': 283.066, 'temp_m... NaN {'pod': 'd'} [{'id': 802, 'main': 'Clouds', 'description': ... {'speed': 5.71, 'deg': 32.076}
10 {'all': 0} 1556463600 2019-04-28 15:00:00 {'temp': 281.817, 'temp_min': 281.817, 'temp_m... NaN {'pod': 'd'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 5.2, 'deg': 32.025}
11 {'all': 0} 1556474400 2019-04-28 18:00:00 {'temp': 278.308, 'temp_min': 278.308, 'temp_m... NaN {'pod': 'n'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 3.98, 'deg': 33.794}
12 {'all': 0} 1556485200 2019-04-28 21:00:00 {'temp': 276.176, 'temp_min': 276.176, 'temp_m... NaN {'pod': 'n'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 3.21, 'deg': 32.784}
13 {'all': 0} 1556496000 2019-04-29 00:00:00 {'temp': 274.592, 'temp_min': 274.592, 'temp_m... NaN {'pod': 'n'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 2.68, 'deg': 35.503}
14 {'all': 0} 1556506800 2019-04-29 03:00:00 {'temp': 274.092, 'temp_min': 274.092, 'temp_m... NaN {'pod': 'd'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 2.4, 'deg': 43.128}
15 {'all': 0} 1556517600 2019-04-29 06:00:00 {'temp': 278.181, 'temp_min': 278.181, 'temp_m... NaN {'pod': 'd'} [{'id': 800, 'main': 'Clear', 'description': '... {'speed': 2.85, 'deg': 44.801}

Discussion:

Further parsing is still required to get the table (DataFrame) into a flat shape. Now it's your turn: parse the weather data to generate a flat table. One possible approach is sketched below.
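
One possible way to flatten the nested dictionaries (a sketch, not the only solution) is pandas json_normalize, which expands nested keys such as 'main' and 'wind' into dotted column names; the 'weather' field is a list of dictionaries, so it is handled separately here:

import pandas as pd

# Expand nested dictionaries into flat columns such as 'main.temp' and 'wind.speed'
# (use pandas.io.json.json_normalize in older pandas versions)
weather_flat = pd.json_normalize(data['list'])
# 'weather' is a list of dicts; pull out the fields of its first entry explicitly
weather_flat['weather_main'] = [entry['weather'][0]['main'] for entry in data['list']]
weather_flat['weather_description'] = [entry['weather'][0]['description'] for entry in data['list']]
print(weather_flat[['dt_txt', 'main.temp', 'wind.speed', 'weather_description']].head())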

Please note that the materials used in this tutorial are partly based on the book "Web Scraping with Python".

In [ ]: