Pedestrian activity

Recall from Tutorial 2 that we produced some simple line plots and side-by-side box plots using the hourly Melbourne pedestrian count data after wrangling the data into a tidy long form.

Whereas Tutorial 2 explored pedestrian counts at two locations, Melbourne Central and Southbank, this tutorial introduces a third location: Bourke Street Mall (North).

Exploratory data analysis

An exploratory data analysis (EDA) is performed at the beginning of a data science project, after the data has been collected and read into R. It involves a lot of data wrangling and exploratory data visualisation (informative visualisations for you or other data scientists that you are working closely with). Doing this will help you learn about missing values, seasonal patterns, anomalies, etc., all of which can determine what you need to do next in your analysis.

Prepare the data

Read the hourly pedestrian counts data from 2016 to 2018 for sensors located in Melbourne Central, Southbank and Bourke Street Mall (North).

  • Load the tidyverse and rwalkr packages
  • Using the melb_walk_fast() function from the rwalkr package, read in the pedestrian count data from 2016 to 2018 based on the above sensors.
  • Store each data set into an object. The objects should be named ped_melbcent, ped_southbank and ped_bourkest.mallnorth
  • Row bind these three data frames using the bind_rows() function and store this data in an object named ped_melb.south.bourke.
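The row-binding step can be sketched without network access. The three small data frames below are hypothetical stand-ins with made-up counts, not real sensor readings, and the commented-out download call shows argument names as I understand them from the rwalkr documentation, so treat it as an assumption.

```r
library(dplyr)

# With network access, each sensor would be read with something like
# (argument names assumed from the rwalkr documentation):
#   ped_melbcent <- rwalkr::melb_walk_fast(year = 2016:2018,
#                                          sensor = "Melbourne Central")

# Hypothetical stand-ins for the three downloads (made-up counts)
ped_melbcent <- tibble(
  Sensor = "Melbourne Central",
  Date   = as.Date(c("2016-01-01", "2016-01-01")),
  Time   = c(0L, 1L),
  Count  = c(1200L, 850L)
)
ped_southbank <- tibble(
  Sensor = "Southbank",
  Date   = as.Date(c("2016-01-01", "2016-01-01")),
  Time   = c(0L, 1L),
  Count  = c(3100L, 1400L)
)
ped_bourkest.mallnorth <- tibble(
  Sensor = "Bourke Street Mall (North)",
  Date   = as.Date(c("2016-01-01", "2016-01-01")),
  Time   = c(0L, 1L),
  Count  = c(300L, 150L)
)

# Stack the three data frames; columns are matched by name
ped_melb.south.bourke <- bind_rows(
  ped_melbcent, ped_southbank, ped_bourkest.mallnorth
)

# Quick check: number of rows contributed by each sensor
count(ped_melb.south.bourke, Sensor)
```

Because bind_rows() matches columns by name, the three objects only need to share the same column names, not the same column order.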

You can explore the City of Melbourne Pedestrian Counting System to identify these locations on an interactive map.

Create time variables by decomposing the date

Using what you have learned about dates, create the following variables, which are decomposed from the variable Date.

  • year - year
  • month - month of the year
  • wday - day of the week
  • day - day of the month

To do so, fill out the missing parts of the code chunk (???) below.
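One possible way to write this step is sketched below on a small hypothetical data frame (the dates and counts are made up). It assumes the lubridate accessors year(), month(), wday() and day(); the labelled, Monday-first settings are illustrative choices rather than anything the tutorial prescribes.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: three dates with made-up counts
ped <- tibble(
  Sensor = "Southbank",
  Date   = as.Date(c("2016-12-31", "2017-07-04", "2018-02-14")),
  Count  = c(5000L, 300L, 450L)
)

ped_time <- ped %>%
  mutate(
    year  = year(Date),                                   # e.g. 2016
    month = month(Date, label = TRUE, abbr = TRUE),       # ordered factor, Jan..Dec
    wday  = wday(Date, label = TRUE, abbr = TRUE,
                 week_start = 1),                         # ordered factor, Monday first
    day   = day(Date)                                     # day of the month, 1..31
  )

ped_time
```

Keeping month and wday as ordered factors means later plots and tables sort them in calendar order instead of alphabetically.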

Exploring time gaps

The pedestrian sensor devices count the number of pedestrians over each hour of the day. For many reasons, sensors can malfunction or produce abnormal readings. This means that it’s crucial for you to thoroughly examine missing values and outliers in the data (this is not to say that all missing values and outliers are due to faulty sensor devices).

Check for missing values/time gaps in the data using what you have learned about visualising missing values.

The code chunk below is partially filled out but will guide you through the steps. Filling out the missing parts of the code chunk (???) will produce plots of the pedestrian count in each year for each sensor with missing values placed in the bottom margin.
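As a rough illustration of the idea, the sketch below builds a hypothetical two-day series with a six-hour gap, counts the missing readings, and plots them with naniar's geom_miss_point(), which places missing values below the observed range. The data and object names are invented for the example.

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(naniar)

# Hypothetical toy data: 48 hourly readings with a six-hour gap
ped <- tibble(
  Sensor    = "Southbank",
  Date_Time = seq(ymd_hm("2016-03-01 00:00"), by = "hour", length.out = 48),
  Count     = rep(c(50L, 120L, 400L, 250L), times = 12)
)
ped$Count[10:15] <- NA

# Number of missing hourly readings per sensor
n_gaps <- ped %>%
  group_by(Sensor) %>%
  summarise(n_missing = sum(is.na(Count)))

# Missing counts are drawn in the bottom margin, below the observed values
p_gaps <- ggplot(ped, aes(x = Date_Time, y = Count)) +
  geom_miss_point(size = 0.7) +
  facet_wrap(~ Sensor)

n_gaps
p_gaps
```

With the full data, replacing facet_wrap(~ Sensor) with a grid over sensor and year would give one panel per sensor per year.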

Answer the following questions based on the plots above:

  • During which period from 2016 to 2018 do you observe time gaps for each location?
  • Which location has the fewest time gaps?
  • Provide some reasons to explain why you have observed time gaps in this data set.
  • How can time gaps become problematic when analysing the data?

Distribution of count

It is useful to be able to quantitatively describe the central tendency of numerical variables in the data. Running the following code chunk will return the mean and median hourly pedestrian counts in Melbourne Central, Southbank and Bourke Street Mall (North).
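A minimal sketch of such a summary, on hypothetical made-up counts, is shown here. The na.rm = TRUE argument matters because missing counts would otherwise make both statistics NA.

```r
library(dplyr)

# Hypothetical toy data: one very large count per sensor, one missing value
ped <- tibble(
  Sensor = rep(c("Melbourne Central", "Southbank"), each = 5),
  Count  = c(100, 200, 300, 400, 4000,
             50, 60, 70, NA, 820)
)

ped_centre <- ped %>%
  group_by(Sensor) %>%
  summarise(
    mean_count   = mean(Count, na.rm = TRUE),
    median_count = median(Count, na.rm = TRUE)
  )

ped_centre
```

In this toy data a single large count pulls each mean well above the corresponding median, which is the pattern a positively skewed distribution produces.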

Notice that the median is lower than the mean, which indicates that the distribution of hourly pedestrian counts is positively skewed. (If this is unclear, think about which of the two measures of central tendency is more sensitive to large values.)

Fill out the missing parts of the code chunk (???) below to produce a histogram of the pedestrian count for each location.
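For orientation only, a faceted histogram could look like the sketch below. The counts are simulated from an exponential distribution purely to get a right-skewed shape, and the bin width of 250 is an arbitrary choice for this toy data.

```r
library(ggplot2)

# Hypothetical, simulated right-skewed counts for two sensors
set.seed(2018)
ped <- data.frame(
  Sensor = rep(c("Melbourne Central", "Southbank"), each = 200),
  Count  = c(rexp(200, rate = 1 / 500), rexp(200, rate = 1 / 300))
)

# One histogram panel per sensor, stacked vertically for easy comparison
p_hist <- ggplot(ped, aes(x = Count)) +
  geom_histogram(binwidth = 250) +
  facet_wrap(~ Sensor, ncol = 1)

p_hist
```

Setting binwidth explicitly avoids ggplot2's default of 30 bins, which is rarely the best choice.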

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6127 rows containing non-finite values (stat_bin).

Based on the distribution of pedestrian count, which statistic would provide a representative measure of central tendency? Why?

Use this measure of central tendency to compute the ‘typical’ pedestrian count for each month and location. Once you have done this, convert the data into a wide form.

The following code chunk is partially filled out but will guide you through the steps.
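The mechanics of summarising and then reshaping can be sketched on hypothetical data as follows. The median is used here only to illustrate the pattern, and pivot_wider() from tidyr is one option for the reshaping step (older code often uses spread() instead).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two sensors, two months, made-up counts
ped <- tibble(
  Sensor = rep(c("Melbourne Central", "Southbank"), each = 4),
  month  = rep(c(1, 1, 2, 2), times = 2),
  Count  = c(100, 300, 200, NA,
             50, 70, 90, 110)
)

ped_wide <- ped %>%
  group_by(Sensor, month) %>%
  summarise(median_count = median(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  # One row per month, one column per sensor
  pivot_wider(names_from = Sensor, values_from = median_count)

ped_wide
```

The wide form puts the locations side by side, which makes month-by-month comparisons easier to read in a table.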

Box plots of pedestrian counts

You can use box plots to help visualise how the distribution of pedestrian counts changes from hour to hour.

Fill out the missing parts of the code chunk (???) below to produce side-by-side box plots of the pedestrian count for each hour of the day, facetted by year. You will need to set your code chunk figure options to fig.height=9, fig.width=12.
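A rough sketch of hourly box plots on simulated data is given below. The Poisson counts and the simple upward trend across the day are invented for illustration and do not resemble the real sensors.

```r
library(ggplot2)

# Hypothetical, simulated data: 10 readings per hour per year
set.seed(1)
ped <- expand.grid(Time = 0:23, year = 2016:2018, rep = 1:10)
ped$Count <- rpois(nrow(ped), lambda = 100 + 40 * ped$Time)

# Treat the hour as a factor so each hour gets its own box
p_box <- ggplot(ped, aes(x = factor(Time), y = Count)) +
  geom_boxplot() +
  facet_wrap(~ year, ncol = 1) +
  labs(x = "Hour of day", y = "Pedestrian count")

p_box
```

Converting Time to a factor is the key step: with a numeric x aesthetic, geom_boxplot() would draw a single box for all hours.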

## Warning: Removed 6127 rows containing non-finite values (stat_boxplot).

Answer the following questions based on the side-by-side box plots above:

  1. In the box plot, the interquartile range (IQR) is the difference between the edges of the box, i.e., the 3rd quartile minus the 1st quartile. The larger the box, the greater the IQR, and hence the greater the variability of the variable. Explore the box plots of pedestrian counts at Southbank. During which hour of the day is the IQR largest? Explain why this might be the case.

  2. During which hours of the day and at what location did the sensor detect the highest pedestrian count?

  3. The highest detected pedestrian count is approximately 9,000. Approximately how many times larger is the highest detected pedestrian count than the overall median pedestrian count at this location?

  4. Provide an explanation for the high pedestrian counts in Southbank during the later hours of the day.
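The quantities in questions 1 and 3 can also be computed directly instead of being read off the plot. The sketch below uses hypothetical made-up counts for three hours of the day; IQR() is the base R function, which uses R's default quantile definition.

```r
library(dplyr)

# Hypothetical toy data: four readings for each of three hours
ped <- tibble(
  Time  = rep(c(8L, 17L, 23L), each = 4),
  Count = c(100, 200, 300, 400,
            500, 700, 900, 1100,
            200, 400, 600, 9000)
)

# IQR (3rd quartile minus 1st quartile) for each hour, largest first
iqr_by_hour <- ped %>%
  group_by(Time) %>%
  summarise(iqr = IQR(Count, na.rm = TRUE)) %>%
  arrange(desc(iqr))

# How many times larger the maximum count is than the overall median
ratio <- max(ped$Count, na.rm = TRUE) / median(ped$Count, na.rm = TRUE)

iqr_by_hour
ratio
```

In this toy data the single extreme reading at 23:00 inflates both the IQR for that hour and the ratio of the maximum to the median.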

Pedestrian count prior to NYE fireworks

A reasonable explanation for the large number of pedestrians detected prior to midnight is that these observations occurred on New Year’s Eve.

It would be reasonable to expect the city’s New Year’s Eve festivities, which include entertainment, activities and fireworks, to attract many locals and tourists to the city. Check this hypothesis by filling in the code chunk to produce line plots of pedestrian count during the days leading up to New Year’s Eve.
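One way to isolate the relevant days is sketched below on hypothetical data. The choice of 24 to 31 December as the window, the flat baseline of 500 and the spike of 6,000 on the evening of 31 December are all assumptions made for the example.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical toy data: hourly readings from 20 Dec 2016 to 2 Jan 2017
ped <- tibble(
  Sensor    = "Southbank",
  Date_Time = seq(ymd_hm("2016-12-20 00:00"),
                  ymd_hm("2017-01-02 23:00"), by = "hour")
) %>%
  mutate(
    Date  = as_date(Date_Time),
    Time  = hour(Date_Time),
    month = month(Date),
    day   = day(Date),
    # Made-up counts: a spike from 8 pm on 31 December, flat otherwise
    Count = if_else(month == 12 & day == 31 & Time >= 20, 6000, 500)
  )

# Keep only the last eight days of December
ped_nye <- ped %>%
  filter(month == 12, day %in% 24:31)

# One line per day, so New Year's Eve stands out against the other days
p_nye <- ggplot(ped_nye, aes(x = Time, y = Count, colour = factor(day))) +
  geom_line() +
  facet_wrap(~ Sensor) +
  labs(x = "Hour of day", y = "Pedestrian count", colour = "Day of December")

p_nye
```

Drawing each day as its own line makes it easy to see whether the late-evening peak is specific to 31 December.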

## Warning: Removed 192 row(s) containing missing values (geom_path).

Once you’ve produced your plot, answer the following questions:

  1. In which year is the pedestrian count in Melbourne Central missing for the days leading up to NYE?
  2. Compare the pedestrian count of each location during the hours leading up to the midnight fireworks. Of the 3 locations, which is likely to be the most popular vantage point to view the midnight fireworks?
  3. Which areas are best for viewing the midnight fireworks in Melbourne? (Hint: Search the web to help identify the best viewing areas.)