Tutorial 6

Pedestrian activity

Recall from Tutorial 2 that we produced some simple line plots and side-by-side box plots using the hourly Melbourne pedestrian count data after wrangling the data into a tidy long form.

Tutorial 2 explored pedestrian counts at two locations, Melbourne Central and Southbank. This tutorial introduces a third location: Bourke Street Mall (North).

Exploratory data analysis

Exploratory data analysis (EDA) is performed at the beginning of a data science project, after the data have been collected and read into R. It involves a lot of data wrangling and exploratory data visualisation (informative visualisations for you or for other data scientists you are working closely with). Doing this helps you learn about missing values, seasonal patterns, anomalies, etc., which can determine what you need to do next in your analysis.

Prepare the data

Read the hourly pedestrian counts data from 2016 to 2018 for sensors located in Melbourne Central, Southbank and Bourke Street Mall (North).

Load the tidyverse and rwalkr packages.
Using the melb_walk_fast() function from the rwalkr package, read in the pedestrian count data from 2016 to 2018 for each of the above sensors.
Store each data set in its own object. The objects should be named ped_melbcent, ped_southbank and ped_bourkest.mallnorth.
Row bind these three data frames using the bind_rows() function and store the result in an object named ped_melb.south.bourke.

# Load tidyverse
???

# Load rwalkr
???

# Read in pedestrian data from 2016 to 2018 for 3 locations
ped_melbcent <- ???(year = 2016:2018, sensor = "Melbourne Central")
??? <- ???(year = 2016:2018, sensor = "???")
ped_bourkest.mallnorth <- ???(year = ???, sensor = "Bourke Street Mall (North)")

# Row bind 3 data frames
ped_melb.south.bourke <- ???(ped_melbcent, ped_southbank, ped_bourkest.mallnorth)

# Answer

# Load tidyverse
library(tidyverse)

# Load rwalkr
library(rwalkr)

# Read in pedestrian data from 2016 to 2018 for 3 locations
ped_melbcent <- melb_walk_fast(year = 2016:2018, sensor = "Melbourne Central")
ped_southbank <- melb_walk_fast(year = 2016:2018, sensor = "Southbank")
ped_bourkest.mallnorth <- melb_walk_fast(year = 2016:2018, sensor = "Bourke Street Mall (North)")

# Row bind 3 data frames
ped_melb.south.bourke <- bind_rows(ped_melbcent, ped_southbank, ped_bourkest.mallnorth)
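
If you want to see what bind_rows() does without downloading anything, the following is a minimal sketch on two small made-up data frames. The object names toy_a, toy_b and toy_all and all of the values are hypothetical and are not part of the pedestrian data.

```r
library(dplyr)

# Two toy data frames standing in for per-sensor downloads (values are made up)
toy_a <- data.frame(Sensor = "A", Time = 0:2, Count = c(5L, 3L, 8L),
                    stringsAsFactors = FALSE)
toy_b <- data.frame(Sensor = "B", Time = 0:2, Count = c(12L, NA, 7L),
                    stringsAsFactors = FALSE)

# bind_rows() stacks the rows and matches columns by name
toy_all <- bind_rows(toy_a, toy_b)

nrow(toy_all)           # total number of rows after stacking
count(toy_all, Sensor)  # number of rows contributed by each sensor
```

Because every row keeps its Sensor value, the stacked data can still be grouped or filtered by location afterwards.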

You can explore the City of Melbourne Pedestrian Counting System to identify these locations on an interactive map.

Look at the data

Look at ped_melb.south.bourke using the glimpse() function and answer the following questions:

What does each observation represent?
What does each variable represent?

# Look at the data
???

# Answer

# Look at the data
glimpse(ped_melb.south.bourke)

# Answer
A glimpse of the data shows that each observation is an hourly log of the reading from the sensor device. The variables in the data include:

Sensor - location of the sensor device
Date_Time - date and time of the reading
Date - date (yyyy-mm-dd)
Time - hour of the day
Count - total sensor count of pedestrians

It is important that students understand that these are temporal (time-series) data. You can explain the difference between temporal and cross-sectional data by comparing this data set with the space launches data used in Tutorial 4.

Create time variables by decomposing the date

Using what you have learned about dates, create the following variables, which are decomposed from the variable Date.

year - year
month - month of the year
wday - day of the week
day - day of the month

To do so, fill out the missing parts of the code chunk (???) below.

# Load lubridate
library(???)

# Create 'time' variables
ped_melb.south.bourke <- ped_melb.south.bourke %>%
  mutate(year = ???(???),
         month = ???(???, label = TRUE, abbr = TRUE), 
         wday = ???(Date, label = TRUE, abbr = TRUE, week_start = 1),
         day = ???(???))

# Answer

# Load lubridate
library(lubridate)

## Warning: package 'lubridate' was built under R version 3.6.2

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# Create 'time' variables
ped_melb.south.bourke <- ped_melb.south.bourke %>%
  mutate(year = year(Date),
         month = month(Date, label = TRUE, abbr = TRUE), 
         wday = wday(Date, label = TRUE, abbr = TRUE, week_start = 1),
         day = day(Date))
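
To check what each lubridate accessor returns, it can help to try them on a single date first. The sketch below uses one made-up date, stored in a hypothetical object d, rather than the pedestrian data.

```r
library(lubridate)

# A single made-up date to show what each accessor returns
d <- as.Date("2018-12-31")

year(d)                                # calendar year as a number
month(d)                               # month as a number from 1 to 12
month(d, label = TRUE)                 # ordered factor with an abbreviated month name
wday(d, week_start = 1)                # day of the week as a number, Monday counted as 1
wday(d, label = TRUE, week_start = 1)  # ordered factor whose levels start at Monday
day(d)                                 # day of the month
```

Setting week_start = 1 matters mainly for plots and tables: the weekday levels are then ordered from Monday to Sunday instead of starting on Sunday.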

Exploring time gaps

The pedestrian sensor devices count the number of pedestrians over each hour of the day. For many reasons, sensors can malfunction or produce abnormal readings. This means that it’s crucial for you to thoroughly examine missing values and outliers in the data (this is not to say that all missing values and outliers are due to faulty sensor devices).

Check for missing values/time gaps in the data using what you have learned about visualising missing values.

The code chunk below is partially filled out but will guide you through the steps. Filling out the missing parts of the code chunk (???) will produce plots of the pedestrian count in each year for each sensor with missing values placed in the bottom margin.

# Load naniar
???

#  Melbourne Central time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Melbourne Central") %>%
  ggplot(aes(x=Date_Time, y=Count)) + 
  geom_miss_point(size = 0.7) +
  facet_wrap(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Melbourne Central", y = "Count", x = "Date-Time")

#  Southbank time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Southbank") %>%
  ggplot(aes(x=Date_Time, y=Count)) + 
  ???(size = 0.7) +
  ???(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Southbank", y = "Count", x = "Date-Time")

#  Bourke Street Mall (North) time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Bourke Street Mall (North)") %>%
  ggplot(aes(x=Date_Time, y=???)) + 
  ???(size = 0.7) +
  ???(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Bourke Street Mall (North)", y = "Count", x = "Date-Time")

# Answer

# Load naniar
library(naniar)

## Warning: package 'naniar' was built under R version 3.6.2

#  Melbourne Central time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Melbourne Central") %>%
  ggplot(aes(x=Date_Time, y=Count)) + 
  geom_miss_point(size = 0.7) +
  facet_wrap(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Melbourne Central", y = "Count", x = "Date-Time")

#  Southbank time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Southbank") %>%
  ggplot(aes(x=Date_Time, y=Count)) + 
  geom_miss_point(size = 0.7) +
  facet_wrap(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Southbank", y = "Count", x = "Date-Time")

#  Bourke Street Mall (North) time gaps
ped_melb.south.bourke %>% 
  filter(Sensor == "Bourke Street Mall (North)") %>%
  ggplot(aes(x=Date_Time, y=Count)) + 
  geom_miss_point(size = 0.7) +
  facet_wrap(year ~., scales = "free_x", nrow = 3) +
  labs(title = "Bourke Street Mall (North)", y = "Count", x = "Date-Time")

Answer the following questions based on the plots above:

During which period from 2016 to 2018 do you observe time gaps for each location?
Which location contains the least number of time gaps?
Provide some reasons to explain why you have observed time gaps in this data set.
How can time gaps become problematic when analysing the data?
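
The plots show where the gaps sit in time. A numeric tally can complement them. The following is a rough sketch on made-up data, assuming that a missing reading is stored as NA in Count (hours that are absent from the data altogether would not be counted this way). The names toy_gap_data, toy_gaps, n_missing and prop_missing are hypothetical.

```r
library(dplyr)

# Toy hourly readings (made-up values); NA marks an hour with no recorded count
toy_gap_data <- data.frame(
  Sensor = rep(c("A", "B"), each = 4),
  year   = rep(c(2016, 2016, 2017, 2017), times = 2),
  Count  = c(10, NA, 7, 9, NA, NA, 4, 6),
  stringsAsFactors = FALSE
)

# Number and proportion of missing counts per sensor and year
toy_gaps <- toy_gap_data %>%
  group_by(Sensor, year) %>%
  summarise(n_missing = sum(is.na(Count)),
            prop_missing = mean(is.na(Count))) %>%
  ungroup()

toy_gaps
```

The same group_by() and summarise() steps could be applied to ped_melb.south.bourke to compare the three sensors year by year.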

Distribution of count

It is useful to be able to quantitatively describe the central tendency of the numerical variables in the data. Running the following code chunk will return the mean and median hourly pedestrian counts at Melbourne Central, Southbank and Bourke Street Mall (North).

# Table of mean and median pedestrian count
ped_melb.south.bourke %>% 
  group_by(Sensor) %>%
  summarise(meanCount = mean(Count, na.rm=TRUE),
            medianCount = median(Count, na.rm=TRUE)) %>%
  ungroup()

Notice that the median is lower than the mean, which indicates that the distribution of hourly pedestrian counts is positively skewed. (If this is unclear, think about which of the two measures of central tendency is more sensitive to large values.)
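
A small made-up example may make the hint concrete. The vectors toy_counts and toy_counts2 below are hypothetical and are chosen only to show how a single large value affects each measure.

```r
# Made-up counts: mostly small values plus one very large reading
toy_counts <- c(10, 12, 15, 20, 500)

mean(toy_counts)    # pulled upwards by the single large value
median(toy_counts)  # the middle value of the sorted counts

# Making the largest value even larger changes the mean but not the median
toy_counts2 <- c(10, 12, 15, 20, 5000)

mean(toy_counts2)
median(toy_counts2)
```

When a few very large counts pull the mean well above the median, as here, the distribution has a long right tail.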

Fill out the missing parts of the code chunk (???) below to produce a histogram of the pedestrian count for each location.

# Histogram of pedestrian count
ped_melb.south.bourke %>%
  ggplot(aes(x = ???)) +
  geom_???() +
  labs(title = "Distribution of hourly pedestrian count", 
       x = "Pedestrians detected",
       y = "Frequency") +
  facet_wrap(~ ???, scales = "free", nrow = 3)

# Answer

# Histogram of pedestrian count
ped_melb.south.bourke %>%
  ggplot(aes(x = Count)) +
  geom_histogram() +
  labs(title = "Distribution of hourly pedestrian count", 
       x = "Pedestrians detected",
       y = "Frequency") +
  facet_wrap(~ Sensor, scales = "free", nrow = 3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 6127 rows containing non-finite values (stat_bin).

Based on the distribution of pedestrian counts, which statistic would provide a representative measure of central tendency? Why?

Use this measure of central tendency to compute the ‘typical’ pedestrian count for each month and location. Once you have done this, convert the data into a wide form.

The following code chunk is partially filled out but can guide you through the step.

ped_melb.south.bourke %>% 
  ???(Sensor, ???) %>%
  summarise(??? = ???(???, na.rm=TRUE)) %>%
  ungroup() %>%
  ???(Sensor, ???)

# Answer

# Median hourly pedestrian count for each month and location
ped_melb.south.bourke %>% 
  group_by(Sensor, month) %>%
  summarise(medianCount = median(Count, na.rm=TRUE)) %>%
  ungroup() %>%
  spread(Sensor, medianCount)
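
In tidyr 1.0.0 and later, spread() still works but has been superseded by pivot_wider(). As an alternative, here is a minimal sketch of the same long-to-wide step on made-up monthly medians. The objects toy_long and toy_wide and their values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up monthly medians in long form: one row per sensor and month
toy_long <- data.frame(
  Sensor      = rep(c("A", "B"), each = 2),
  month       = rep(c("Jan", "Feb"), times = 2),
  medianCount = c(100, 120, 300, 280),
  stringsAsFactors = FALSE
)

# Wide form: one row per month and one column per sensor
toy_wide <- toy_long %>%
  pivot_wider(names_from = Sensor, values_from = medianCount)

toy_wide
```

The names_from argument plays the role of the first argument of spread() (the key), and values_from plays the role of the second (the value).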

Line plot of median hourly pedestrian count

For a challenge, reproduce the following line plot. The code chunk provided is partially filled out but can guide you through the steps.

# Challenge: Line plots median hourly pedestrian count
ped_melb.south.bourke %>% 
  group_by(Sensor, month) %>%
  summarise(medianCount = median(Count, na.rm=TRUE)) %>%
  ungroup() %>%
  ggplot(aes(???, ???, ???, ???)) +
  geom_???() +
  geom_???() +
  labs(title = "Median Hourly Pedestrian Counts, 2016-2018", 
       subtitle = "Generally more pedestrians detected in Southbank across all months.",
       x = "Month", 
       y = "Median Counts")

# Answer

# Line plots median hourly pedestrian count
ped_melb.south.bourke %>% 
  group_by(Sensor, month) %>%
  summarise(medianCount = median(Count, na.rm=TRUE)) %>%
  ungroup() %>%
  ggplot(aes(x = month, y = medianCount, colour = Sensor, group = Sensor)) +
  geom_point() +
  geom_line() +
  labs(title = "Median Hourly Pedestrian Counts, 2016-2018", 
       subtitle = "Generally more pedestrians detected in Southbank across all months.",
       x = "Month", 
       y = "Median Counts")

## `summarise()` regrouping output by 'Sensor' (override with `.groups` argument)
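
The message above is informational. It comes from dplyr 1.0.0 and later, where summarise() removes only the last grouping variable by default. If you prefer to make the grouping explicit, one option is the .groups argument, sketched here on made-up data (toy_month_data and toy_medians are hypothetical names).

```r
library(dplyr)

# Made-up counts for two sensors over two months
toy_month_data <- data.frame(
  Sensor = rep(c("A", "B"), each = 4),
  month  = rep(c("Jan", "Jan", "Feb", "Feb"), times = 2),
  Count  = c(10, 30, 5, NA, 100, 200, 50, 70),
  stringsAsFactors = FALSE
)

# .groups = "drop" returns an ungrouped result, so no ungroup() step is needed
toy_medians <- toy_month_data %>%
  group_by(Sensor, month) %>%
  summarise(medianCount = median(Count, na.rm = TRUE), .groups = "drop")

is_grouped_df(toy_medians)  # checks whether any grouping remains
```

With .groups = "drop" the result is the same as calling ungroup() afterwards, and the regrouping message is not printed.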

Box plots of pedestrian counts

You can use box plots to help visualise how the distribution of pedestrian counts changes from hour to hour.

Fill out the missing parts of the code chunk (???) below to produce side-by-side box plots of the pedestrian count for each hour of the day, faceted by year. You will need to set your code chunk figure options to fig.height=9, fig.width=12.

# Box plot of pedestrian counts
ped_melb.south.bourke %>% 
  ggplot(aes(x = as.factor(Time), y = ???, colour = ???)) + 
  geom_???(alpha = 0.5) +
  facet_???(~ ???, nrow = ???) +
  theme(legend.position = "bottom") + # change the legend position
  labs(title = "Distribution of pedestrian counts at each hour of the day", y = "Pedestrian Count", x = "Hour of the day")

# Answer

# Box plot of pedestrian counts
ped_melb.south.bourke %>% 
  ggplot(aes(x = as.factor(Time), y = Count, colour = Sensor)) + 
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~ year, nrow = 3) +
  theme(legend.position = "bottom") + # change the legend position
  labs(title = "Distribution of pedestrian counts at each hour of the day", y = "Pedestrian Count", x = "Hour of the day")

## Warning: Removed 6127 rows containing non-finite values (stat_boxplot).

Answer the following questions based on the side-by-side box plots above:

In a box plot, the interquartile range (IQR) is the distance between the edges of the box, i.e., the 3rd quartile minus the 1st quartile. The larger the box, the greater the IQR, and hence the greater the variability of the variable. Explore the box plots of pedestrian counts at Southbank. During which hour of the day is the IQR largest? Explain why this might be the case.
During which hours of the day and at what location did the sensor detect the highest pedestrian count?
The highest detected pedestrian count is approximately 9,000. Approximately how many times larger is this count than the overall median pedestrian count at this location?
Provide an explanation for the very high pedestrian counts in Southbank during the later hours of the day.

# Answer

# 1. 8am hour - sudden jump because of people going to work; the sensor is along the Yarra River, so joggers and cyclists may also contribute to the jump
# 2. Later hours of the day, Southbank
# 3. 9000/1654, i.e. roughly 5.4 times
# 4. Single event each year i.e. NYE fireworks
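
The IQR can also be computed directly instead of being read off the plot. The sketch below uses made-up counts for two hours of the day. The names toy_hourly and toy_iqr and the values are hypothetical, and quantile() and IQR() are used with their default settings.

```r
library(dplyr)

# Made-up counts for two hours of the day, four readings each
toy_hourly <- data.frame(
  Time  = rep(c(8, 14), each = 4),
  Count = c(100, 400, 900, 1500, 500, 520, 540, 560)
)

# Quartiles and IQR for each hour, sorted so the most variable hour comes first
toy_iqr <- toy_hourly %>%
  group_by(Time) %>%
  summarise(q1  = quantile(Count, 0.25),
            q3  = quantile(Count, 0.75),
            iqr = IQR(Count)) %>%
  ungroup() %>%
  arrange(desc(iqr))

toy_iqr
```

Applied to the real data, the same steps (with a filter() on Sensor and na.rm = TRUE inside quantile() and IQR()) would give a table you could compare against the box plots.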

Pedestrian count prior to NYE fireworks

A reasonable explanation for the large number of pedestrians detected prior to midnight is that these observations occurred on New Year’s Eve.

It would be reasonable to expect the city’s New Year’s Eve festivities, which include entertainment, activities and fireworks, to attract many locals and tourists to the city. Confirm your hypothesis by filling in the code chunk to produce the line plots below of pedestrian count during the days leading up to New Year’s Eve.

# Fill out ??? 
ped_melb.south.bourke %>%
  filter(month == ???, day %in% 24:31) %>%
  ggplot(aes(x = ???, y = ???, colour = ???)) + 
  geom_???(alpha = 0.5) +
  facet_wrap(??? ~., scales = "free_x", nrow = 3) +
  theme(legend.position = "bottom") + # change the legend position
  ???(title = "Pedestrian count at each hour of the day leading up to NYE", y = "Pedestrian Count", x = "Hour of the day")

# Answer

# Pedestrian count days prior to NYE 
ped_melb.south.bourke %>%
  filter(month == "Dec", day %in% 24:31) %>%
  ggplot(aes(x=Date_Time, y=Count, colour = Sensor)) + 
  geom_line(alpha = 0.5) +
  facet_wrap(year ~., scales = "free_x", nrow = 3) +
  theme(legend.position = "bottom") + # change the legend position
  labs(title = "Pedestrian count at each hour of the day leading up to NYE", y = "Pedestrian Count", x = "Hour of the day")

## Warning: Removed 192 row(s) containing missing values (geom_path).
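
Note that filter(month == "Dec") relies on English month labels, because lubridate takes the labels from the session locale. If your labels might differ, one alternative is to filter on the numeric month instead. This is a minimal sketch on made-up dates. The names toy_dates, toy_nye and month_num are hypothetical.

```r
library(dplyr)
library(lubridate)

# Made-up daily counts around the end of the year
toy_dates <- data.frame(
  Date  = as.Date(c("2017-11-30", "2017-12-23", "2017-12-24",
                    "2017-12-31", "2018-01-01")),
  Count = c(50, 60, 70, 900, 80)
)

# Keep 24 to 31 December using the numeric month, which does not depend on locale
toy_nye <- toy_dates %>%
  mutate(month_num = month(Date),
         day = day(Date)) %>%
  filter(month_num == 12, day %in% 24:31)

toy_nye
```

Only the rows dated 24 and 31 December remain, which is the same window that the plot above is restricted to.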

Once you’ve produced your plot, answer the following questions:

In which year is the pedestrian count at Melbourne Central missing for the days leading up to NYE?
Compare the pedestrian count at each location during the hours leading up to the midnight fireworks. Of the 3 locations, which is likely to be the most popular vantage point to view the midnight fireworks?
Which areas are best for viewing the midnight fireworks in Melbourne? (Hint: Search the web to help identify the best viewing areas.)

# Answer

# 1. Melbourne Central (MC) sensor data are missing in 2016
# 2. Southbank is likely to be the most popular vantage point of the three locations
# 3. Southbank likely best location to view fireworks due to the visibility