Recall from Tutorial 2 that we produced some simple line plots and side-by-side box plots using the hourly Melbourne pedestrian count data after wrangling the data into a tidy long form.
Tutorial 2 explored pedestrian counts from two locations, Melbourne Central and Southbank. In this tutorial, we introduce a third location: Bourke Street Mall (North).
Exploratory data analysis (EDA) is performed at the beginning of a data science project, after the data has been collected and read into R. It involves a lot of data wrangling and exploratory data visualisation (informative visualisations for you or other data scientists that you are working closely with). Doing this will help you learn about missing values, seasonal patterns, anomalies, etc., which is information that can determine what you need to do next in your analysis.
Read the hourly pedestrian counts data from 2016 to 2018 for sensors located in Melbourne Central, Southbank and Bourke Street Mall (North).
* Load the `tidyverse` and `rwalkr` packages.
* Using the `melb_walk_fast()` function from the `rwalkr` package, read in the pedestrian count data from 2016 to 2018 based on the above sensors.
* Store the data for each sensor in objects named `ped_melbcent`, `ped_southbank` and `ped_bourkest.mallnorth`.
* Row bind the three data frames using the `bind_rows()` function and store this data in an object named `ped_melb.south.bourke`.
# Load tidyverse
???
# Load rwalkr
???
# Read in pedestrian data from 2016 to 2018 for 3 locations
ped_melbcent <- ???(year = 2016:2018, sensor = "Melbourne Central")
??? <- ???(year = 2016:2018, sensor = "???")
ped_bourkest.mallnorth <- ???(year = ???, sensor = "Bourke Street Mall (North)")
# Row bind 3 data frames
ped_melb.south.bourke <- ???(ped_melbcent, ped_southbank, ped_bourkest.mallnorth)
# Answer
# Load tidyverse
library(tidyverse)
# Load rwalkr
library(rwalkr)
# Read in pedestrian data from 2016 to 2018 for 3 locations
ped_melbcent <- melb_walk_fast(year = 2016:2018, sensor = "Melbourne Central")
ped_southbank <- melb_walk_fast(year = 2016:2018, sensor = "Southbank")
ped_bourkest.mallnorth <- melb_walk_fast(year = 2016:2018, sensor = "Bourke Street Mall (North)")
# Row bind 3 data frames
ped_melb.south.bourke <- bind_rows(ped_melbcent, ped_southbank, ped_bourkest.mallnorth)
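To see what the row-binding step does without downloading anything, here is a minimal sketch on made-up data. The `toy_*` objects and their values are hypothetical and only stand in for the real sensor downloads.

```r
library(dplyr)

# Hypothetical toy data standing in for two sensor downloads
toy_melbcent  <- tibble(Sensor = "Melbourne Central", Time = 0:2, Count = c(120, 80, 45))
toy_southbank <- tibble(Sensor = "Southbank", Time = 0:2, Count = c(300, 210, 150))

# bind_rows() stacks the data frames on top of each other, matching columns by name
toy_combined <- bind_rows(toy_melbcent, toy_southbank)

# Check that each sensor contributed the expected number of rows
toy_combined %>% count(Sensor)
```

Running the same `count(Sensor)` check on `ped_melb.south.bourke` is a quick way to confirm that all three sensors made it into the combined data.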
You can explore the City of Melbourne Pedestrian Counting System to identify these locations on an interactive map.
Look at `ped_melb.south.bourke` using the `glimpse()` function and answer the following questions:
# Answer
# Look at the data
glimpse(ped_melb.south.bourke)
# Answer
A glimpse of the data shows that each observation is an hourly log of the reading from the sensor device. The variables in the data include:
* __`Sensor`__ - location of the sensor device
* __`Date_Time`__ - date and time of the reading
* __`Date`__ - date (yyyy-mm-dd)
* __`Time`__ - hour of the day
* __`Count`__ - total sensor count of pedestrians
It is important that students understand that this is a temporal (time-series) data set. You can explain the difference between temporal and cross-sectional data by comparing this data set with the space launches data used in Tutorial 4.
Using what you have learned about dates, create the following variables, which are decomposed from the variable `Date`.
* `year` - year
* `month` - month of the year
* `wday` - day of the week
* `day` - day of the month
To do so, fill out the missing parts of the code chunk (`???`) below.
# Load lubridate
library(???)
# Create 'time' variables
ped_melb.south.bourke <- ped_melb.south.bourke %>%
mutate(year = ???(???),
month = ???(???, label = TRUE, abbr = TRUE),
wday = ???(Date, label = TRUE, abbr = TRUE, week_start = 1),
day = ???(???))
## Warning: package 'lubridate' was built under R version 3.6.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
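If you are unsure what each lubridate helper returns, the sketch below applies them to three made-up dates. The `toy_dates` data frame is hypothetical; the function arguments mirror those in the chunk above.

```r
library(dplyr)
library(lubridate)

# Hypothetical dates, one from each year covered by the pedestrian data
toy_dates <- tibble(Date = as.Date(c("2016-01-01", "2017-07-15", "2018-12-31")))

# Decompose each date into its components
toy_dates <- toy_dates %>%
  mutate(year  = year(Date),                                    # numeric year
         month = month(Date, label = TRUE, abbr = TRUE),        # ordered factor of month labels
         wday  = wday(Date, label = TRUE, abbr = TRUE,
                      week_start = 1),                          # ordered factor, week starts on Monday
         day   = day(Date))                                     # day of the month

toy_dates
```

Because `label = TRUE` returns ordered factors, months and weekdays will appear in calendar order (rather than alphabetical order) when used on a plot axis.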
The pedestrian sensor devices count the number of pedestrians over each hour of the day. For many reasons, sensors can malfunction or produce abnormal readings. This means that it’s crucial for you to thoroughly examine missing values and outliers in the data (this is not to say that all missing values and outliers are due to faulty sensor devices).
Check for missing values/time gaps in the data using what you have learned about visualising missing values.
The code chunk below is partially filled out but will guide you through the steps. Filling out the missing parts of the code chunk (`???`) will produce plots of the pedestrian count in each year for each sensor, with missing values placed in the bottom margin.
# Load naniar
???
# Melbourne Central time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Melbourne Central") %>%
ggplot(aes(x=Date_Time, y=Count)) +
geom_miss_point(size = 0.7) +
facet_wrap(year ~., scales = "free_x", nrow = 3) +
labs(title = "Melbourne Central", y = "Count", x = "Date-Time")
# Southbank time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Southbank") %>%
ggplot(aes(x=Date_Time, y=Count)) +
???(size = 0.7) +
???(year ~., scales = "free_x", nrow = 3) +
labs(title = "Southbank", y = "Count", x = "Date-Time")
# Bourke Street Mall (North) time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Bourke Street Mall (North)") %>%
ggplot(aes(x=Date_Time, y=???)) +
???(size = 0.7) +
???(year ~., scales = "free_x", nrow = 3) +
labs(title = "Bourke Street Mall (North)", y = "Count", x = "Date-Time")
## Warning: package 'naniar' was built under R version 3.6.2
# Melbourne Central time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Melbourne Central") %>%
ggplot(aes(x=Date_Time, y=Count)) +
geom_miss_point(size = 0.7) +
facet_wrap(year ~., scales = "free_x", nrow = 3) +
labs(title = "Melbourne Central", y = "Count", x = "Date-Time")
# Southbank time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Southbank") %>%
ggplot(aes(x=Date_Time, y=Count)) +
geom_miss_point(size = 0.7) +
facet_wrap(year ~., scales = "free_x", nrow = 3) +
labs(title = "Southbank", y = "Count", x = "Date-Time")
# Bourke Street Mall (North) time gaps
ped_melb.south.bourke %>%
filter(Sensor == "Bourke Street Mall (North)") %>%
ggplot(aes(x=Date_Time, y=Count)) +
geom_miss_point(size = 0.7) +
facet_wrap(year ~., scales = "free_x", nrow = 3) +
labs(title = "Bourke Street Mall (North)", y = "Count", x = "Date-Time")
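The plots show where the gaps are; a numerical summary can complement them by showing how much is missing. The following is a minimal sketch on made-up data (the `toy_ped` values are hypothetical), using only `dplyr` to count missing readings per sensor and year.

```r
library(dplyr)

# Hypothetical readings with some missing counts
toy_ped <- tibble(
  Sensor = rep(c("Melbourne Central", "Southbank"), each = 4),
  year   = rep(c(2016, 2016, 2017, 2017), times = 2),
  Count  = c(100, NA, 250, 300, NA, NA, 400, 120)
)

# Number and proportion of missing counts for each sensor and year
toy_missing <- toy_ped %>%
  group_by(Sensor, year) %>%
  summarise(n_missing    = sum(is.na(Count)),
            prop_missing = mean(is.na(Count))) %>%
  ungroup()

toy_missing
```

The same pipeline should work on `ped_melb.south.bourke`, assuming the `year` variable has already been created.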
Answer the following questions based on the plots above:
It is useful to be able to quantitatively describe the central tendency of numerical variables in the data. Running the following code chunk will return the mean and median hourly pedestrian counts in Melbourne Central, Southbank and Bourke Street Mall (North).
# Table of mean and median pedestrian count
ped_melb.south.bourke %>%
group_by(Sensor) %>%
summarise(meanCount = mean(Count, na.rm=TRUE),
medianCount = median(Count, na.rm=TRUE)) %>%
ungroup()
Notice that the median is lower than the mean, which indicates that the distribution of hourly pedestrian counts is positively skewed. (If this is unclear, think about which of the two measures of central tendency is more sensitive to large values.)
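A small made-up example may help here. The vector below is hypothetical: six ordinary hours and one very busy hour.

```r
# Hypothetical hourly counts: mostly moderate values plus one very busy hour
toy_counts <- c(50, 60, 70, 80, 90, 100, 2000)

toy_mean   <- mean(toy_counts)    # 350: pulled upwards by the single large value
toy_median <- median(toy_counts)  # 80: barely affected by the large value

# Dropping the busy hour moves the mean a lot, but the median only slightly
mean(toy_counts[-7])    # 75
median(toy_counts[-7])  # 75
```

The mean sits far above the median only while the large value is present, which is the pattern you see in a positively skewed distribution.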
Fill out the missing parts of the code chunk (`???`) below to produce a histogram of the pedestrian count for each location.
# Histogram of pedestrian count
ped_melb.south.bourke %>%
ggplot(aes(x = ???)) +
geom_???() +
labs(title = "Distribution of hourly pedestrian count",
x = "Pedestrians detected",
y = "Frequency") +
facet_wrap(~ ???, scales = "free", nrow = 3)
# Answer
# Histogram of pedestrian count
ped_melb.south.bourke %>%
ggplot(aes(x = Count)) +
geom_histogram() +
labs(title = "Distribution of hourly pedestrian count",
x = "Pedestrians detected",
y = "Frequency") +
facet_wrap(~ Sensor, scales = "free", nrow = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6127 rows containing non-finite values (stat_bin).
Based on the distribution of pedestrian counts, which statistic would provide a representative measure of the central tendency of pedestrian counts? Why?
Use this measure of central tendency to compute the ‘typical’ pedestrian count for each month and location. Once you have done this, convert the data into a wide form.
The following code chunk is partially filled out but can guide you through the step.
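As a rough guide to the shape of this step, here is a minimal sketch on made-up data. It assumes the median is the chosen measure and uses `pivot_wider()` from `tidyr` for the reshaping; the `toy_ped` values and the `medianCount` name are illustrative only, and the intended tutorial solution may differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical hourly counts for two sensors over two months
toy_ped <- tibble(
  Sensor = rep(c("Melbourne Central", "Southbank"), each = 6),
  month  = rep(rep(c("Jan", "Feb"), each = 3), times = 2),
  Count  = c(100, 200, 900, 150, NA, 250,
             300, 400, 2000, 350, 450, 550)
)

toy_wide <- toy_ped %>%
  group_by(Sensor, month) %>%
  summarise(medianCount = median(Count, na.rm = TRUE)) %>%  # 'typical' count per sensor and month
  ungroup() %>%
  pivot_wider(names_from = Sensor, values_from = medianCount)  # one column per sensor

toy_wide
```

In the wide form, each row is a month and each sensor has its own column, which makes it easy to compare locations side by side.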
For a challenge, reproduce the following line plot. The code chunk provided is partially filled out but can guide you through the steps.
# Challenge: Line plots median hourly pedestrian count
ped_melb.south.bourke %>%
group_by(Sensor, month) %>%
summarise(medianCount = median(Count, na.rm=TRUE)) %>%
ungroup() %>%
ggplot(aes(???, ???, ???, ???)) +
geom_???() +
geom_???() +
labs(title = "Median Hourly Pedestrian Counts, 2016-2018",
subtitle = "Generally more pedestrians detected in Southbank across all months.",
x = "Month",
y = "Median Counts")
# Answer
# Line plots median hourly pedestrian count
ped_melb.south.bourke %>%
group_by(Sensor, month) %>%
summarise(medianCount = median(Count, na.rm=TRUE)) %>%
ungroup() %>%
ggplot(aes(x = month, y = medianCount, colour = Sensor, group = Sensor)) +
geom_point() +
geom_line() +
labs(title = "Median Hourly Pedestrian Counts, 2016-2018",
subtitle = "Generally more pedestrians detected in Southbank across all months.",
x = "Month",
y = "Median Counts")
## `summarise()` regrouping output by 'Sensor' (override with `.groups` argument)
You can use box plots to help visualise how the distribution of pedestrian counts changes from hour to hour.
Fill out the missing parts of the code chunk (`???`) below to produce side-by-side box plots of the pedestrian count for each hour of the day, facetted by year. You will need to set your code chunk figure options to `fig.height=9, fig.width=12`.
# Box plot of pedestrian counts
ped_melb.south.bourke %>%
ggplot(aes(x = as.factor(Time), y = ???, colour = ???)) +
geom_???(alpha = 0.5) +
facet_???(~ ???, nrow = ???) +
theme(legend.position = "bottom") + # change the legend position
labs(title = "Distribution of pedestrian counts at each hour of the day", y = "Pedestrian Counts", x = "Hour of the day")
# Answer
# Box plot of pedestrian counts
ped_melb.south.bourke %>%
ggplot(aes(x = as.factor(Time), y = Count, colour = Sensor)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~ year, nrow = 3) +
theme(legend.position = "bottom") + # change the legend position
labs(title = "Distribution of pedestrian counts at each hour of the day", y = "Pedestrian Count", x = "Hour of the day")
## Warning: Removed 6127 rows containing non-finite values (stat_boxplot).
Answer the following questions based on the side-by-side box plots above:
In the box plot, the interquartile range (IQR) is the difference between edges of the box, i.e., the 3rd quartile minus the 1st quartile. The larger the box, the greater the IQR, and hence the greater the variability of the variable. Explore the box plots of pedestrian counts at Southbank. During which hour of the day is the IQR largest? Explain why this might be the case.
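Reading the IQR off a plot can be imprecise, so it may help to compute it directly. The sketch below uses made-up counts for two hours of the day (the `toy_ped` values are hypothetical) and the base R `IQR()` and `quantile()` functions.

```r
library(dplyr)

# Hypothetical counts for two hours of the day
toy_ped <- tibble(
  Time  = rep(c(8, 17), each = 5),
  Count = c(100, 110, 120, 130, 140,
            200, 400, 600, 800, 1000)
)

# Quartiles and IQR of the count for each hour, largest IQR first
toy_iqr <- toy_ped %>%
  group_by(Time) %>%
  summarise(q1  = quantile(Count, 0.25, na.rm = TRUE),
            q3  = quantile(Count, 0.75, na.rm = TRUE),
            iqr = IQR(Count, na.rm = TRUE)) %>%
  arrange(desc(iqr))

toy_iqr
```

To apply this to the real data, you could first filter `ped_melb.south.bourke` to `Sensor == "Southbank"` and then group by `Time` in the same way.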
During which hours of the day and at what location did the sensor detect the highest pedestrian count?
The highest detected pedestrian count is approximately 9,000. Approximately how many times larger is the highest detected pedestrian count than the overall median pedestrian count at this location?
Provide an explanation for the high pedestrian counts detected in Southbank during the later hours of the day.
A reasonable explanation for the large number of pedestrians detected prior to midnight is that these observations occurred on New Year’s Eve.
It would be reasonable to expect the city’s New Year’s Eve festivities, which include entertainment, activities and fireworks, to attract many locals and tourists to the city. Confirm your hypothesis by filling in the code chunk to produce the line plots below of pedestrian counts during the days leading up to New Year’s Eve.
# Fill out ???
ped_melb.south.bourke %>%
filter(month == ???, day %in% 24:31) %>%
ggplot(aes(x = ???, y = ???, colour = ???)) +
geom_???(alpha = 0.5) +
facet_wrap(??? ~., scales = "free_x", nrow = 3) +
theme(legend.position = "bottom") + # change the legend position
???(title = "Pedestrian count at each hour of the day leading up to NYE", y = "Pedestrian Count", x = "Hour of the day")
# Answer
# Pedestrian count days prior to NYE
ped_melb.south.bourke %>%
filter(month == "Dec", day %in% 24:31) %>%
ggplot(aes(x=Date_Time, y=Count, colour = Sensor)) +
geom_line(alpha = 0.5) +
facet_wrap(year ~., scales = "free_x", nrow = 3) +
theme(legend.position = "bottom") + # change the legend position
labs(title = "Pedestrian count at each hour of the day leading up to NYE", y = "Pedestrian Count", x = "Hour of the day")
## Warning: Removed 192 row(s) containing missing values (geom_path).
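Another way to check the hypothesis is to list the busiest hours and look at their dates. The following is a minimal sketch on made-up data; the `toy_ped` timestamps and counts are hypothetical and are not taken from the sensor data.

```r
library(dplyr)

# Hypothetical Southbank readings around New Year's Eve
toy_ped <- tibble(
  Sensor    = "Southbank",
  Date_Time = as.POSIXct(c("2016-12-30 23:00:00", "2016-12-31 22:00:00",
                           "2016-12-31 23:00:00", "2017-01-01 00:00:00"),
                         tz = "UTC"),
  Count     = c(800, 7500, 9000, NA)
)

# Drop missing readings, sort from busiest to quietest, keep the top two hours
toy_busiest <- toy_ped %>%
  filter(!is.na(Count)) %>%
  arrange(desc(Count)) %>%
  head(2)

toy_busiest
```

If the same pipeline applied to `ped_melb.south.bourke` returns timestamps that fall on 31 December, that would support the New Year’s Eve explanation.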
Once you’ve produced your plot, answer the following questions: