The OECD Programme for International Student Assessment (PISA) tests 15-year-olds across the globe every three years to measure their performance in reading, mathematics and science. Alongside the test scores in maths, science and reading, it collects information on students, teachers and parents through questionnaires. In this step, you will study a subset of this data (Australian students) by applying the linear models you have learned.
The question of interest is whether more time spent studying science is associated with higher science scores, and how this varies with the enjoyment of science. Begin by downloading the PISA data pisa_au.rda from Moodle and storing the file in your R project directory.
Your R project directory is the directory (or path) containing your .Rproj file. For example, an R project named FIT5145 has the project file FIT5145.Rproj located in, say, “C:/Users/Quang/Documents/R/FIT5145”. This directory is the R project directory; more precisely, it is the top-level directory of the R project. From this top-level directory, you should also have a data folder, i.e., “C:/Users/Quang/Documents/R/FIT5145/data”.
If pisa_au.rda is placed in the data folder, then its file path is “C:/Users/Quang/Documents/R/FIT5145/data/pisa_au.rda”.
Whether your current working directory (run getwd() in your R console to see it) is the project directory or a subdirectory within the project directory, you can load the data into R using the here() function from the here package. The here() function is a common way to access files outside of your current working directory.
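For example (assuming pisa_au.rda sits in a data folder at the top level of your project, as described above), the load might look like this sketch:

```r
library(here)

# here() builds the file path from the project's top-level directory,
# regardless of your current working directory
path <- here("data", "pisa_au.rda")

# load() reads the .rda file and creates the pisa_au object in your session
if (file.exists(path)) load(path)
```

Because here() resolves paths from the project root, this same code works whether it is run from the project directory or from a subdirectory.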
Once you’ve loaded the PISA data into your R session, you will need to apply a series of data wrangling and cleaning steps. The variables involved are:

- science, which represents the student’s science score, computed as the average of 10 plausible value (PV) scores in science.
- ST094Q01NA, which represents the student’s enjoyment of science.
- SMINS, which represents the student’s time spent studying science per week.
- W_FSTUWT, which represents the student’s weighting, an indication of how many other students they represent in Australia, relative to their socioeconomic and demographic characteristics.
- ST004D01T, which represents the student’s gender (NA is also a category here, but there are no NAs for this variable in this data set). Any ‘missingness’ in this variable is information and should not be removed or corrected.
- ANXTEST, which represents the student’s test anxiety level.

The following code chunk is partially complete. Fill out the missing parts (???) and then run.
# Load tidyverse
library(???)
# Create data frame of science scores
pisa_au_science <- pisa_au %>%
# Create science score based on the 10 PV science scores
mutate(??? = (PV1SCIE + PV2SCIE + PV3SCIE + PV4SCIE + PV5SCIE +
PV6SCIE + PV7SCIE + PV8SCIE + PV9SCIE + PV10SCIE) / 10) %>%
# Select and rename ambiguous names
???(science, science_fun = ST094Q01NA, science_time = SMINS,
stuweight = W_FSTUWT, gender=ST004D01T, anxtest = ANXTEST) %>%
# Recode gender variable
mutate(gender = factor(gender, levels = c(1, 2), labels = c("girl", "boy"), exclude = NULL)) %>%
# Filter out missing values in science_time
???(!is.na(science_fun), !is.na(???)) %>%
# Convert science_fun into a factor
mutate(science_fun = ???(science_fun), science_time = as.numeric(science_time)) %>%
# Filter for science_time greater than 0
filter(??? > 0)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Create data frame of science scores
pisa_au_science <- pisa_au %>%
# Create science score based on the 10 PV science scores
mutate(science = (PV1SCIE + PV2SCIE + PV3SCIE + PV4SCIE + PV5SCIE +
PV6SCIE + PV7SCIE + PV8SCIE + PV9SCIE + PV10SCIE) / 10) %>%
# Select and rename ambiguous names
select(science, science_fun = ST094Q01NA, science_time = SMINS,
stuweight = W_FSTUWT, gender=ST004D01T, anxtest = ANXTEST) %>%
# Recode gender variable
mutate(gender = factor(gender,
levels = c(1, 2),
labels = c("girl", "boy"),
exclude = NULL)) %>% # exclude = NULL keeps NA as a factor level
# Filter anomaly and missing values in science_time
filter(!is.na(science_fun), !is.na(science_time)) %>%
# Convert variables into appropriate class
mutate(science_fun = as.factor(science_fun),
science_time = as.numeric(science_time)) %>%
# Filter anomalies in science_time
filter(science_time > 0)
Notice that the science score was created by taking the average of the 10 PV science scores.
How might we calculate the mean across columns in a smarter way, i.e., without typing out each individual column? The following functions can help us:
rowMeans()
select_at()
vars()
starts_with()
ends_with()
The following code chunk is partially complete. Fill out the missing parts (???
) and then run.
pisa_au %>%
mutate(science = ???(select_at(., vars(starts_with("???") & ends_with("???"))))) %>%
select(science, everything())
# Answer
pisa_au %>%
mutate(science = rowMeans(select_at(., vars(starts_with("PV") & ends_with("SCIE"))))) %>%
select(science, everything())
# Notes for tutor
# The best way that I can think of to do this involves the rowMeans() function as well as some select helper functions https://rdrr.io/cran/tidyselect/man/select_helpers.html, which go inside of the select_at() function. It seems like a common operation to perform on the data and should have an obvious elegant solution, but it is actually a little tricky to do.
# Here, we are taking the mean by rows, but we have to use select_at() to specify the variables that we want to be involved in the row-wise mean computation. Inside of select_at(), we're telling R that the variables to be selected for the row-wise mean computation are those that start with "PV" and end with "SCIE".
# The final select() function just places the science variable at first column for the data frame.
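A minimal sketch of the idea on a toy tibble (made-up values standing in for the PV columns):

```r
library(dplyr)

# Toy tibble standing in for pisa_au (values are made up); OTHER should
# be ignored by the selection helpers
toy <- tibble(
  PV1SCIE = c(400, 500),
  PV2SCIE = c(420, 480),
  OTHER   = c(1, 2)
)

toy <- toy %>%
  mutate(science = rowMeans(select_at(., vars(starts_with("PV") & ends_with("SCIE")))))

# Row 1: (400 + 420) / 2 = 410; row 2: (500 + 480) / 2 = 490
toy
```

The same pattern scales to the 10 PV columns in pisa_au without listing them out by hand.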
The goal is to develop a linear regression model of science score to understand how the independent variables affect a student’s science score. Before developing the linear regression model, we should explore the relationship between a student’s science score and factors that we expect to impact it, e.g., the time that students spend studying science.
Below is a scatter plot of science score against time spent studying science per week:
# Answer
# We notice immediately that there are outliers, i.e., some students in the data spent a large proportion of their time studying science each week. These outliers stand out in the scatter plot, stretching the x-axis range to accommodate the large values. This makes it difficult to visualise the relationship between a student's science score and the amount of time they spend studying science each week.
Below is a density plot (similar to a histogram in that it provides a visualisation of the distribution of a continuous numerical variable, but the y-axis represents the density, not the count) of time spent studying science each week:
Notice that the time spent studying science is highly positively skewed.
Applying a log transformation to a numerical variable (note: we cannot take the log of a variable that contains zeros or negative values) will scale back any large values. The code chunk below shows the effect on the distribution of time spent studying science before and after the log transformation.
dens_sci_time <- pisa_au_science %>%
ggplot(aes(x = science_time)) +
geom_density() +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
labs(title = "Distribution of time spent studying science **before** log transformation")
dens_log_sci_time <- pisa_au_science %>%
ggplot(aes(x = science_time)) +
geom_density() +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
scale_x_log10() + # apply a log transformation on the x-axis
labs(title = "Distribution of time spent studying science **after** log transformation")
gridExtra::grid.arrange(dens_sci_time, dens_log_sci_time, nrow = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
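As a quick numeric illustration of why the log transformation tames the skew (base-10, matching scale_x_log10()):

```r
# Equal ratios become equal steps: each value is 10x the previous one,
# but only one unit further along the log10 axis
log10(c(10, 100, 1000, 10000))
#> [1] 1 2 3 4
```

A student who studies 1000x longer than another moves only 3 units along the log10 axis, so the extreme values no longer dominate the plot.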
What is layered onto the ggplot call to apply a log transformation to time spent studying science (i.e., to log-scale the x-axis)?
# Answer
# The x-axis is log-scaled by layering the function **`scale_x_log10()`** onto the **`ggplot`** call, i.e., a log transformation is applied to the time spent studying science per week (minutes), which produces a more meaningful plot.
# The distribution changes from positively skewed to approximately normal
# The scatter plot will be more interpretable, and the relationship between science score and the log of time spent studying will be clearer
The following code chunk is partially complete. Fill out the missing parts (???
) to produce a scatter plot of science score against the log of time spent studying science.
pisa_au_science %>%
ggplot(aes(x = ???, y = ???)) +
geom_???(alpha = 0.1) +
scale_???() +
labs(title = "Relationship between science score and time spent studying it",
subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = science_time, y = science)) +
geom_point(alpha = 0.1) +
scale_x_log10() +
labs(title = "Relationship between science score and time spent studying it",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
By grouping students based on their enjoyment of science (using the facet_wrap() function), you are controlling for the impact that a student’s enjoyment of science has on their science score, as students with the same level of enjoyment are grouped together.
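If facet_wrap() is new to you, here is a minimal sketch using the built-in mtcars data (nothing here is specific to PISA):

```r
library(ggplot2)

# One panel per value of cyl; within a panel, cyl is held fixed,
# which is what "controlling for" a variable looks like visually
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 2)
p
```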
Fill out the missing parts of the code chunk (???
) to plot science score against time spent studying science.
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_???(~ ???, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# convert x to log10 scale
scale_x_log10() +
theme(legend.position = "bottom") +
labs(title = "Relationship between science score and time spent studying it",
subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_wrap(~ science_fun, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# log-scale the x-axis
scale_x_log10() +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and time spent studying it") +
labs(subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale") +
xlab("Time spent studying science per week*") +
ylab("Science score")
We want to explore the relationship between science score and the time spent studying and, of course, there will be some outliers, i.e., a small number of students who spend an extremely long time studying science each week. These outliers do not help to explain science scores. A log transformation of time spent studying science will ‘scale back’ these extreme values (think about how a positively skewed variable becomes more symmetric when you take its log), which produces a more meaningful plot of science score against time spent studying science.
Based on the scatter plot of science score against (log) time spent studying science per week, conditional on the enjoyment of science:
# Answer
# * If you did not apply a log transformation to study time, how would you describe its distribution?
# - heavily positively skewed
# * Based on the log scaled x-axis, how would you describe the relationship between science score and time spent studying?
# - weak
# * Are students who enjoy science more likely to receive higher scores, given they spend an equal amount of time studying science?
# - yes
Remove scale_x_log10() +
from the code chunk. What do you see?
# Answer
# Most of the data points are squashed into a cluster on the left side of the plot, because the handful of students who spend an extremely long time studying science stretch the x-axis.
# For tutor: What the plot looks like if we do not apply a log transformation to science_time
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_wrap(~ science_fun, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# log-scale the x-axis
#scale_x_log10() +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and time spent studying it") +
labs(subtitle = "Students are grouped based on how much they enjoy science") +
xlab("Time spent studying science per week (mins)") +
ylab("Science score")
There are two possible models for you to consider:
\[ \begin{align} 1)\ science &= \beta_0 + \beta_1 \log(science\_time) + \beta_2 science\_fun\_2 \\ &+ \beta_3 science\_fun\_3 + \beta_4 science\_fun\_4 + \varepsilon \\ \\ 2)\ science &= \beta_0 + \beta_1 \log(science\_time) + \beta_2 science\_fun\_2 \\ &+ \beta_3 science\_fun\_3 + \beta_4 science\_fun\_4 \\ &+ \beta_5 \log(science\_time) \times science\_fun\_2 \\ &+ \beta_6 \log(science\_time) \times science\_fun\_3 \\ &+ \beta_7 \log(science\_time) \times science\_fun\_4 + \varepsilon \end{align} \] where,
Note that one category of science_fun is omitted from the linear regression model, i.e., strongly disagree that science is enjoyable (science_fun = 1). This category serves as the base (reference) level against which the other dummy variables are compared.
Model 2 has interaction terms between the log of time spent studying science and science enjoyment. This means that the slope of the log of time spent studying science can vary at different levels of science enjoyment. Since the time spent studying science is heavily skewed, and we do not want the extreme values of the study time to influence science score, we use the log transformation of study time as an independent variable in the model.
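To see what an interaction buys you, here is a toy example with made-up data where the slope of x genuinely differs between two groups:

```r
# Made-up data: group "a" has slope 2, group "b" has slope 5
set.seed(1)
g <- factor(rep(c("a", "b"), each = 50))
x <- runif(100)
y <- ifelse(g == "a", 1 + 2 * x, 1 + 5 * x) + rnorm(100, sd = 0.1)

# y ~ x * g expands to y ~ x + g + x:g
m <- lm(y ~ x * g)
coef(m)

# The x:gb coefficient estimates the *difference* in slope between
# group b and the baseline group a (close to 5 - 2 = 3 here)
```

Model 2 does the same thing with log10(science_time) in place of x and the four levels of science_fun in place of g.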
Fill out the missing parts of the code chunk (???
) to train the 2 models of student’s science scores using the pisa_au_science
data.
In the PISA survey, each student was given a different weighting that depended on their school and other factors. You’ll notice that each student’s weighting is taken into account by specifying the variable stuweight
in the weights
argument of the lm()
function.
# Fit both models of student's science score
mod1 <- ???(formula = ??? ~ ??? + ???,
data = ???, weights = stuweight)
mod2 <- lm(??? ~ ??? * ???,
data = ???, weights = stuweight)
# Answer
# Fit both models of student's science score
mod1 <- lm(science ~ log10(science_time) + science_fun,
data = pisa_au_science, weights = stuweight)
mod2 <- lm(science ~ log10(science_time) * science_fun,
data = pisa_au_science, weights = stuweight)
Now load the broom package and print out the output of both trained/fitted models using the tidy() and glance() functions:
# Load broom, which contains tidy() and glance()
library(???)
# Output of first fitted model
tidy(???)
???(mod1)
# Output of second fitted model
tidy(???)
???(mod2)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 268. 14.6 18.3 1.26e- 73
## 2 log10(science_time) 83.8 6.18 13.6 1.31e- 41
## 3 science_fun2 31.8 3.15 10.1 9.28e- 24
## 4 science_fun3 63.1 2.80 22.5 7.17e-110
## 5 science_fun4 104. 3.25 32.1 3.00e-216
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.123 0.123 387. 385. 6.87e-311 4 -66706. 1.33e5 1.33e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 463. 43.5 10.6 2.37e-26
## 2 log10(science_time) -0.00400 18.6 -0.000215 1.00e+ 0
## 3 science_fun2 -152. 55.5 -2.74 6.18e- 3
## 4 science_fun3 -167. 48.1 -3.46 5.33e- 4
## 5 science_fun4 -122. 53.6 -2.29 2.23e- 2
## 6 log10(science_time):science_fun2 78.6 23.8 3.31 9.33e- 4
## 7 log10(science_time):science_fun3 98.4 20.6 4.78 1.74e- 6
## 8 log10(science_time):science_fun4 96.9 22.8 4.25 2.11e- 5
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.125 0.125 387. 224. 2.62e-312 7 -66694. 1.33e5 1.33e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The intercept term in a regression table tells us the expected value of the response variable when all of the predictor variables are equal to zero.
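A quick sanity check of this on made-up data: the fitted intercept equals the model's prediction when the predictor is zero.

```r
set.seed(42)
x <- rnorm(30)
y <- 3 + 2 * x + rnorm(30)

m <- lm(y ~ x)

# Prediction at x = 0 is exactly the intercept estimate
pred_at_zero <- predict(m, newdata = data.frame(x = 0))
all.equal(unname(pred_at_zero), unname(coef(m)["(Intercept)"]))
#> [1] TRUE
```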
Using the output of both fitted models:
# Answer
# For Mod1
# term estimate
# 1 (Intercept) 268.
# 2 log10(science_time) 83.8
# 3 science_fun2 31.8
# 4 science_fun3 63.1
# 5 science_fun4 104.
#
# mod1: 268 + 83.8 * log10(science_time)
# + 31.8 * science_fun2
# + 63.1 * science_fun3
# + 104 * science_fun4
#
# For Mod 2
# term estimate
# 1 (Intercept) 463.
# 2 log10(science_time) -0.00400
# 3 science_fun2 -152.
# 4 science_fun3 -167.
# 5 science_fun4 -122.
# 6 log10(science_time):science_fun2 78.6
# 7 log10(science_time):science_fun3 98.4
# 8 log10(science_time):science_fun4 96.9
#
# mod2: 463 - 0.00400 * log10(science_time)
# - 152 * science_fun2
# - 167 * science_fun3
# - 122 * science_fun4
# + 78.6 * log10(science_time)*science_fun2
# + 98.4 * log10(science_time)*science_fun3
# + 96.9 * log10(science_time)*science_fun4
#
# R^2(mod1) = 0.123
# R^2(mod2) = 0.125
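The written-out equations above are exactly what predict() evaluates. A sketch on made-up, pisa-like data (variable names mimic the real ones, values are invented):

```r
set.seed(7)
d <- data.frame(
  science_time = runif(200, 30, 600),
  science_fun  = factor(sample(1:4, 200, replace = TRUE))
)
d$science <- 400 + 60 * log10(d$science_time) +
  20 * as.numeric(d$science_fun) + rnorm(200, sd = 30)

m <- lm(science ~ log10(science_time) + science_fun, data = d)
b <- coef(m)

# Evaluate the fitted equation by hand for one hypothetical student
# (science_time = 120 minutes, science_fun level 3)
new <- data.frame(science_time = 120, science_fun = factor(3, levels = 1:4))
by_hand <- b["(Intercept)"] + b["log10(science_time)"] * log10(120) + b["science_fun3"]

all.equal(unname(by_hand), unname(predict(m, new)))
#> [1] TRUE
```

Note how the dummy coefficient for the student's own science_fun level is the only one added in; the omitted base level contributes nothing beyond the intercept.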
Generate the above scatter plot of student’s science score against a measure of their test anxiety level (anxtest
), coloured by gender
, by appropriately filling out ???
in the code chunk below.
# Plot science score against anxiety level
pisa_au_science %>%
ggplot(aes(x = ???, y = science, colour = as.factor(???))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and anxiety level") +
labs(colour = "Gender") +
xlab("Anxiety level") +
ylab("Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = anxtest, y = science, colour = as.factor(gender))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and anxiety level") +
labs(colour = "Gender") +
xlab("Anxiety level") +
ylab("Science score")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
Fit two models of science score using information about the student’s gender and anxiety level:
To fit these models and return their output, complete the code by replacing parts that contain ???
in the following code chunk.
# Fit the models
sci_lm1 <- lm(science ~ ??? + ???, data = pisa_au_science, weights = stuweight)
sci_lm2 <- lm(science ~ ??? * ???, data = pisa_au_science, weights = stuweight)
# Output of first fitted model
tidy(sci_lm1)
glance(???)
# Output of second fitted model
tidy(???)
glance(sci_lm2)
Which model do you think is better?
# Answer
# The interaction term anxtest:genderboy is statistically significant (p ≈ 0.004), and sci_lm2 has a
# slightly higher adjusted R-squared (0.0178 vs 0.0171), so sci_lm2 is the (marginally) better model.
# Fit the models
sci_lm1 <- lm(science ~ anxtest + gender, data = pisa_au_science, weights = stuweight)
sci_lm2 <- lm(science ~ anxtest * gender, data = pisa_au_science, weights = stuweight)
# Output of each fitted model
tidy(sci_lm1)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 523. 1.35 386. 0.
## 2 anxtest -13.0 0.978 -13.3 4.30e-40
## 3 genderboy 0.358 1.90 0.189 8.50e- 1
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0173 0.0171 410. 96.3 3.47e-42 2 -67170. 1.34e5 1.34e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 522. 1.41 369. 0.
## 2 anxtest -10.4 1.35 -7.69 1.58e-14
## 3 genderboy 1.29 1.92 0.671 5.02e- 1
## 4 anxtest:genderboy -5.61 1.96 -2.86 4.20e- 3
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0180 0.0178 410. 67.0 6.54e-43 3 -67166. 1.34e5 1.34e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For linear regression models we ideally want a model that explains \(y\) as well as possible using only independent variables that are statistically significant, i.e., variables that have a statistically significant impact on what we are modelling (here, the science score). This matters because adding other variables (those without a true or statistically significant impact on science score) will still increase the model's \(R^2\) (a measure of how well the model fits the data, i.e., how much the independent variables help to explain the dependent variable), but it leads to overfitting: the model becomes great at predicting science scores for students in the training data set, but generalises poorly to new data (students that were not used to train the model). With this in mind, what does the model development process look like, and how much trial and error should there be?
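A small demonstration of this on made-up data: adding a pure-noise predictor can only increase R-squared, even though it explains nothing.

```r
set.seed(123)
x <- rnorm(100)
y <- 5 + 2 * x + rnorm(100)
noise <- rnorm(100)  # unrelated to y by construction

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + noise)

# R^2 never decreases when a predictor is added to a nested OLS model
summary(m2)$r.squared >= summary(m1)$r.squared
#> [1] TRUE

# Adjusted R^2 penalises the extra parameter, so it will typically
# (though not always) be lower for m2
summary(m2)$adj.r.squared - summary(m1)$adj.r.squared
```

This is why adjusted R-squared (reported by glance()) is usually a fairer basis than raw R-squared for comparing models with different numbers of predictors.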
In practice, the model development stage does not begin until an in-depth exploration of the data has been performed. This involves exploring the variable to be modelled and the variables that could help explain it (think histograms and box plots to understand the distributions of the variables, and scatter plots and correlation coefficients to explore the relationships between variables), which reduces the amount of trial and error required.
Once the exploratory data analysis (EDA) has been performed, we move into the model development phase. Here, we take what we have learned from the EDA to guide our model development. Suppose we learn that a student’s maths score has a strong linear relationship with their science score (a scatter plot might show this). This would indicate that maths score could be a good variable to include in our linear model of science score. If we learn that the impact of maths score on science score differs by gender (same scatter plot, but points coloured by gender), we may want to include the interaction of maths score and gender in our model of science score.
We drew these insights from our EDA, and they guided us through model development (reducing the amount of trial and error). Without the EDA we would not understand our data well, and model development would turn into an exercise of mostly trial and error (Is maths score a good predictor of science score? Is English score? Does it differ by gender? Time spent studying? Does time spent studying affect science score differently between genders? What about schools? etc.). These questions should be asked in the preliminary analysis when we explore the data. Once we have a good idea of what the data tell us (how, or whether, the variables available to us impact science score, etc.), we can move into model development with less trial and error.
All of the material is licensed under the Creative Commons BY-SA 4.0 licence.