The OECD Programme for International Student Assessment (PISA) tests 15-year-olds across the globe every three years to measure their performance in reading, mathematics and science. Alongside the test scores in maths, science and reading, it collects information on students, teachers and parents through questionnaires. In this step, you will study a subset of this data (Australian students) by applying the linear models you have learned.
The question of interest is whether more time spent studying science is associated with higher science scores, and how this varies with the enjoyment of science. Begin by downloading the PISA data pisa_au.rda from Moodle and storing the file in your R project directory.
Your R project directory is the directory (or path) containing your .Rproj file. For example, an R project named FIT5145 has the project file FIT5145.Rproj located in, say, “C:/Users/Quang/Documents/R/FIT5145”. This directory is the R project directory; more precisely, it is the top-level directory of the R project. From this top-level directory, you should also have a data folder, i.e., “C:/Users/Quang/Documents/R/FIT5145/data”.
If pisa_au.rda is placed in the data folder, then its file path is “C:/Users/Quang/Documents/R/FIT5145/data/pisa_au.rda”.
Whether your current working directory (run getwd() in your R console to see it) is the project directory or a subdirectory within the project directory, you can load the data into R using the here() function from the here package. The here() function is a common way to access files outside of your current working directory.
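For example (assuming pisa_au.rda sits in a data folder at the top level of your project, as described above), the load might look like this sketch:

```r
library(here)

# here() builds the file path from the project's top-level directory,
# regardless of your current working directory
path <- here("data", "pisa_au.rda")

# load() reads the .rda file and creates the pisa_au object in your session
if (file.exists(path)) load(path)
```

Because here() resolves paths from the project root, this same code works whether it is run from the project directory or from a subdirectory.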
Once you’ve loaded the PISA data into your R session, you will need to apply a series of data wrangling and cleaning steps. The variables involved are:

- science, which represents the student’s science score, computed as the average of 10 plausible value (PV) scores in science.
- ST094Q01NA, which represents the student’s enjoyment of science.
- SMINS, which represents the student’s time spent studying science per week.
- W_FSTUWT, which represents the student’s weighting, an indication of how many other students they represent in Australia, relative to their socioeconomic and demographic characteristics.
- ST004D01T, which represents the student’s gender (NA is also a category here, but there are no NAs for this variable in this data set). Any ‘missingness’ in this variable is information and should not be removed or corrected.
- ANXTEST, which represents the student’s test anxiety level.

The following code chunk is partially complete. Fill out the missing parts (???) and then run.
# Load tidyverse
library(???)
# Create data frame of science scores
pisa_au_science <- pisa_au %>%
# Create science score based on the 10 PV science scores
mutate(??? = (PV1SCIE + PV2SCIE + PV3SCIE + PV4SCIE + PV5SCIE +
PV6SCIE + PV7SCIE + PV8SCIE + PV9SCIE + PV10SCIE) / 10) %>%
# Select and rename ambiguous names
???(science, science_fun = ST094Q01NA, science_time = SMINS,
stuweight = W_FSTUWT, gender=ST004D01T, anxtest = ANXTEST) %>%
# Recode gender variable
mutate(gender = factor(gender, levels = c(1, 2), labels = c("girl", "boy"), exclude = NULL)) %>%
# Filter out missing values in science_time
???(!is.na(science_fun), !is.na(???)) %>%
# Convert science_fun into a factor
mutate(science_fun = ???(science_fun), science_time = as.numeric(science_time)) %>%
# Filter for science_time greater than 0
filter(??? > 0)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Create data frame of science scores
pisa_au_science <- pisa_au %>%
# Create science score based on the 10 PV science scores
mutate(science = (PV1SCIE + PV2SCIE + PV3SCIE + PV4SCIE + PV5SCIE +
PV6SCIE + PV7SCIE + PV8SCIE + PV9SCIE + PV10SCIE) / 10) %>%
# Select and rename ambiguous names
select(science, science_fun = ST094Q01NA, science_time = SMINS,
stuweight = W_FSTUWT, gender=ST004D01T, anxtest = ANXTEST) %>%
# Recode gender variable
mutate(gender = factor(gender,
levels = c(1, 2),
labels = c("girl", "boy"),
exclude = NULL)) %>% # exclude = NULL keeps NA as a factor level
# Filter anomaly and missing values in science_time
filter(!is.na(science_fun), !is.na(science_time)) %>%
# Convert variables into appropriate class
mutate(science_fun = as.factor(science_fun),
science_time = as.numeric(science_time)) %>%
# Filter anomalies in science_time
filter(science_time > 0)
Notice that the science score was created by taking the average of the 10 PV science scores.
How might we calculate the mean across columns in a smarter way, i.e., without typing out each individual column? The following functions can help us:
rowMeans()
select_at()
vars()
starts_with()
ends_with()
The following code chunk is partially complete. Fill out the missing parts (???
) and then run.
pisa_au %>%
mutate(science = ???(select_at(., vars(starts_with("???") & ends_with("???"))))) %>%
select(science, everything())
# Answer
pisa_au %>%
mutate(science = rowMeans(select_at(., vars(starts_with("PV") & ends_with("SCIE"))))) %>%
select(science, everything())
# Notes for tutor
# The best way that I can think of to do this involves the rowMeans() function as well as some select helper functions https://rdrr.io/cran/tidyselect/man/select_helpers.html, which go inside of the select_at() function. It seems like a common operation to perform on the data and should have an obvious elegant solution, but it is actually a little tricky to do.
# Here, we are taking the mean by rows, but we have to use select_at() to specify the variables that we want to be involved in the row-wise mean computation. Inside of select_at(), we're telling R that the variables to be selected for the row-wise mean computation are those that start with "PV" and end with "SCIE".
# The final select() function just places the science variable at first column for the data frame.
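A minimal sketch of the idea on a toy tibble (made-up values standing in for the PV columns):

```r
library(dplyr)

# Toy tibble standing in for pisa_au (values are made up); OTHER should
# be ignored by the selection helpers
toy <- tibble(
  PV1SCIE = c(400, 500),
  PV2SCIE = c(420, 480),
  OTHER   = c(1, 2)
)

toy <- toy %>%
  mutate(science = rowMeans(select_at(., vars(starts_with("PV") & ends_with("SCIE")))))

# Row 1: (400 + 420) / 2 = 410; row 2: (500 + 480) / 2 = 490
toy
```

The same pattern scales to the 10 PV columns in pisa_au without listing them out by hand.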
The goal is to develop a linear regression model of science score to understand how the independent variables affect a student’s science score. Before developing the linear regression model, we should explore the relationship between a student’s science score and factors that we expect to impact it, e.g., the time that students spend studying science.
Below is a scatter plot of science score against time spent studying science per week:
# Answer
# We notice immediately that there are outliers, i.e., some students in the data spent a large proportion of their time studying science each week. These outliers stand out in the scatter plot, stretching the x-axis range to accommodate the large values. This makes it difficult to visualise the relationship between a student's science score and the amount of time they spend studying science each week.
Below is a density plot (similar to a histogram in that it provides a visualisation of the distribution of a continuous numerical variable, but the y-axis represents the density, not the count) of time spent studying science each week:
Notice that the time spent studying science is highly positively skewed.
Applying a log transformation to a numerical variable (note: we cannot take the log of a variable that contains zeros or negative values) will scale back any large values. The code chunk below shows the effect on the distribution of time spent studying science before and after the log transformation.
dens_sci_time <- pisa_au_science %>%
ggplot(aes(x = science_time)) +
geom_density() +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
labs(title = "Distribution of time spent studying science **before** log transformation")
dens_log_sci_time <- pisa_au_science %>%
ggplot(aes(x = science_time)) +
geom_density() +
geom_histogram(aes(y = ..density..), alpha = 0.5) +
scale_x_log10() + # apply a log transformation on the x-axis
labs(title = "Distribution of time spent studying science **after** log transformation")
gridExtra::grid.arrange(dens_sci_time, dens_log_sci_time, nrow = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
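As a quick numeric illustration of why the log transformation tames the skew (base-10, matching scale_x_log10()):

```r
# Equal ratios become equal steps: each value is 10x the previous one,
# but only one unit further along the log10 axis
log10(c(10, 100, 1000, 10000))
#> [1] 1 2 3 4
```

A student who studies 1000x longer than another moves only 3 units along the log10 axis, so the extreme values no longer dominate the plot.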
What is layered onto the ggplot call to apply a log transformation to time spent studying science (i.e., to log-scale the x-axis)?
# Answer
# The x-axis is log-scaled by layering the function **`scale_x_log10()`** onto the **`ggplot`** call, i.e., a log transformation is applied to the time spent studying science per week (minutes), which produces a more meaningful plot.
# The distribution changes from positively skewed to approximately normal
# The scatter plot will be more interpretable, and the relationship between science score and the log of time spent studying will be clearer
The following code chunk is partially complete. Fill out the missing parts (???
) to produce a scatter plot of science score against the log of time spent studying science.
pisa_au_science %>%
ggplot(aes(x = ???, y = ???)) +
geom_???(alpha = 0.1) +
scale_???() +
labs(title = "Relationship between science score and time spent studying it",
subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = science_time, y = science)) +
geom_point(alpha = 0.1) +
scale_x_log10() +
labs(title = "Relationship between science score and time spent studying it",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
By grouping students based on their enjoyment of science (using the facet_wrap() function), you are controlling for the impact that a student’s enjoyment of science has on their science score, as students with the same level of enjoyment are grouped together.
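If facet_wrap() is new to you, here is a minimal sketch using the built-in mtcars data (nothing here is specific to PISA):

```r
library(ggplot2)

# One panel per value of cyl; within a panel, cyl is held fixed,
# which is what "controlling for" a variable looks like visually
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 2)
p
```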
Fill out the missing parts of the code chunk (???
) to plot science score against time spent studying science.
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_???(~ ???, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# convert x to log10 scale
scale_x_log10() +
theme(legend.position = "bottom") +
labs(title = "Relationship between science score and time spent studying it",
subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale",
x = "Time spent studying science per week*",
y = "Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_wrap(~ science_fun, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# log-scale the x-axis
scale_x_log10() +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and time spent studying it") +
labs(subtitle = "Students are grouped based on how much they enjoy science",
caption = "*x-axis is plotted on a log scale") +
xlab("Time spent studying science per week*") +
ylab("Science score")
We want to explore the relationship between science score and the time spent studying and, of course, there will be some outliers, i.e., a small number of students who spend an extremely long time studying science each week. These outliers do not help to explain science scores. A log transformation of time spent studying science will ‘scale back’ these extreme values (think about how a positively skewed variable becomes more symmetric when you take its log), which produces a more meaningful plot of science score against time spent studying science.
Based on the scatter plot of science score against (log) time spent studying science per week, conditional on the enjoyment of science:
# Answer
# * If you did not apply a log transformation to study time, how would you describe its distribution?
# - heavily positively skewed
# * Based on the log scaled x-axis, how would you describe the relationship between science score and time spent studying?
# - weak
# * Are students who enjoy science more likely to receive higher scores, given they spend an equal amount of time studying science?
# - yes
Remove scale_x_log10() +
from the code chunk. What do you see?
# Answer
# Most of the data points are squashed into a cluster on the left side of the plot, because the handful of students who spend an extremely long time studying science stretch the x-axis.
# For tutor: What the plot looks like if we do not apply a log transformation to science_time
pisa_au_science %>%
ggplot(aes(x = science_time, y = science, colour = science_fun)) +
geom_point(alpha = 0.1) +
facet_wrap(~ science_fun, ncol = 2) +
scale_colour_brewer("Enjoy science", palette = "Dark2") +
# log-scale the x-axis
#scale_x_log10() +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and time spent studying it") +
labs(subtitle = "Students are grouped based on how much they enjoy science") +
xlab("Time spent studying science per week (mins)") +
ylab("Science score")
There are two possible models for you to consider:
\[ \begin{align} 1)\ science &= \beta_0 + \beta_1 \log(science\_time) + \beta_2 science\_fun\_2 \\ &+ \beta_3 science\_fun\_3 + \beta_4 science\_fun\_4 + \varepsilon \\ \\ 2)\ science &= \beta_0 + \beta_1 \log(science\_time) + \beta_2 science\_fun\_2 \\ &+ \beta_3 science\_fun\_3 + \beta_4 science\_fun\_4 \\ &+ \beta_5 \log(science\_time) \times science\_fun\_2 \\ &+ \beta_6 \log(science\_time) \times science\_fun\_3 \\ &+ \beta_7 \log(science\_time) \times science\_fun\_4 + \varepsilon \end{align} \] where,
Note that one category of science_fun is omitted from the linear regression model, i.e., strongly disagree that science is enjoyable (science_fun = 1). This category serves as the base (reference) level against which the other dummy variables are compared.
Model 2 has interaction terms between the log of time spent studying science and science enjoyment. This means that the slope of the log of time spent studying science can vary at different levels of science enjoyment. Since the time spent studying science is heavily skewed, and we do not want the extreme values of the study time to influence science score, we use the log transformation of study time as an independent variable in the model.
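To see what an interaction buys you, here is a toy example with made-up data where the slope of x genuinely differs between two groups:

```r
# Made-up data: group "a" has slope 2, group "b" has slope 5
set.seed(1)
g <- factor(rep(c("a", "b"), each = 50))
x <- runif(100)
y <- ifelse(g == "a", 1 + 2 * x, 1 + 5 * x) + rnorm(100, sd = 0.1)

# y ~ x * g expands to y ~ x + g + x:g
m <- lm(y ~ x * g)
coef(m)

# The x:gb coefficient estimates the *difference* in slope between
# group b and the baseline group a (close to 5 - 2 = 3 here)
```

Model 2 does the same thing with log10(science_time) in place of x and the four levels of science_fun in place of g.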
Fill out the missing parts of the code chunk (???
) to train the 2 models of student’s science scores using the pisa_au_science
data.
In the PISA survey, each student was given a different weighting that depended on their school and other factors. You’ll notice that each student’s weighting is taken into account by specifying the variable stuweight
in the weights
argument of the lm()
function.
# Fit both models of student's science score
mod1 <- ???(formula = ??? ~ ??? + ???,
data = ???, weights = stuweight)
mod2 <- lm(??? ~ ??? * ???,
data = ???, weights = stuweight)
# Answer
# Fit both models of student's science score
mod1 <- lm(science ~ log10(science_time) + science_fun,
data = pisa_au_science, weights = stuweight)
mod2 <- lm(science ~ log10(science_time) * science_fun,
data = pisa_au_science, weights = stuweight)
Now load the broom package and print out the output of both trained/fitted models using the tidy() and glance() functions:
# Load broom, which contains tidy() and glance()
library(???)
# Output of first fitted model
tidy(???)
???(mod1)
# Output of second fitted model
tidy(???)
???(mod2)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 268. 14.6 18.3 1.26e- 73
## 2 log10(science_time) 83.8 6.18 13.6 1.31e- 41
## 3 science_fun2 31.8 3.15 10.1 9.28e- 24
## 4 science_fun3 63.1 2.80 22.5 7.17e-110
## 5 science_fun4 104. 3.25 32.1 3.00e-216
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.123 0.123 387. 385. 6.87e-311 4 -66706. 1.33e5 1.33e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 463. 43.5 10.6 2.37e-26
## 2 log10(science_time) -0.00400 18.6 -0.000215 1.00e+ 0
## 3 science_fun2 -152. 55.5 -2.74 6.18e- 3
## 4 science_fun3 -167. 48.1 -3.46 5.33e- 4
## 5 science_fun4 -122. 53.6 -2.29 2.23e- 2
## 6 log10(science_time):science_fun2 78.6 23.8 3.31 9.33e- 4
## 7 log10(science_time):science_fun3 98.4 20.6 4.78 1.74e- 6
## 8 log10(science_time):science_fun4 96.9 22.8 4.25 2.11e- 5
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.125 0.125 387. 224. 2.62e-312 7 -66694. 1.33e5 1.33e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The intercept term in a regression table tells us the expected value of the response variable when all of the predictor variables are equal to zero.
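A quick sanity check of this on made-up data: the fitted intercept equals the model's prediction when the predictor is zero.

```r
set.seed(42)
x <- rnorm(30)
y <- 3 + 2 * x + rnorm(30)

m <- lm(y ~ x)

# Prediction at x = 0 is exactly the intercept estimate
pred_at_zero <- predict(m, newdata = data.frame(x = 0))
all.equal(unname(pred_at_zero), unname(coef(m)["(Intercept)"]))
#> [1] TRUE
```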
Using the output of both fitted models:
# Answer
# For Mod1
# term estimate
# 1 (Intercept) 268.
# 2 log10(science_time) 83.8
# 3 science_fun2 31.8
# 4 science_fun3 63.1
# 5 science_fun4 104.
#
# mod1: 268 + 83.8 * log10(science_time)
# + 31.8 * science_fun2
# + 63.1 * science_fun3
# + 104 * science_fun4
#
# For Mod 2
# term estimate
# 1 (Intercept) 463.
# 2 log10(science_time) -0.00400
# 3 science_fun2 -152.
# 4 science_fun3 -167.
# 5 science_fun4 -122.
# 6 log10(science_time):science_fun2 78.6
# 7 log10(science_time):science_fun3 98.4
# 8 log10(science_time):science_fun4 96.9
#
# mod2: 463 - 0.00400 * log10(science_time)
# - 152 * science_fun2
# - 167 * science_fun3
# - 122 * science_fun4
# + 78.6 * log10(science_time)*science_fun2
# + 98.4 * log10(science_time)*science_fun3
# + 96.9 * log10(science_time)*science_fun4
#
# R^2(mod1) = 0.123
# R^2(mod2) = 0.125
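The written-out equations above are exactly what predict() evaluates. A sketch on made-up, pisa-like data (variable names mimic the real ones, values are invented):

```r
set.seed(7)
d <- data.frame(
  science_time = runif(200, 30, 600),
  science_fun  = factor(sample(1:4, 200, replace = TRUE))
)
d$science <- 400 + 60 * log10(d$science_time) +
  20 * as.numeric(d$science_fun) + rnorm(200, sd = 30)

m <- lm(science ~ log10(science_time) + science_fun, data = d)
b <- coef(m)

# Evaluate the fitted equation by hand for one hypothetical student
# (science_time = 120 minutes, science_fun level 3)
new <- data.frame(science_time = 120, science_fun = factor(3, levels = 1:4))
by_hand <- b["(Intercept)"] + b["log10(science_time)"] * log10(120) + b["science_fun3"]

all.equal(unname(by_hand), unname(predict(m, new)))
#> [1] TRUE
```

Note how the dummy coefficient for the student's own science_fun level is the only one added in; the omitted base level contributes nothing beyond the intercept.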
Generate the above scatter plot of student’s science score against a measure of their test anxiety level (anxtest
), coloured by gender
, by appropriately filling out ???
in the code chunk below.
# Plot science score against anxiety level
pisa_au_science %>%
ggplot(aes(x = ???, y = science, colour = as.factor(???))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and anxiety level") +
labs(colour = "Gender") +
xlab("Anxiety level") +
ylab("Science score")
# Answer
pisa_au_science %>%
ggplot(aes(x = anxtest, y = science, colour = as.factor(gender))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm") +
theme(legend.position = "bottom") +
ggtitle("Relationship between science score and anxiety level") +
labs(colour = "Gender") +
xlab("Anxiety level") +
ylab("Science score")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
Fit two models of science score using information about the student’s gender and anxiety level:
To fit these models and return their output, complete the code by replacing parts that contain ???
in the following code chunk.
# Fit the models
sci_lm1 <- lm(science ~ ??? + ???, data = pisa_au_science, weights = stuweight)
sci_lm2 <- lm(science ~ ??? * ???, data = pisa_au_science, weights = stuweight)
# Output of first fitted model
tidy(sci_lm1)
glance(???)
# Output of second fitted model
tidy(???)
glance(sci_lm2)
Which model do you think is better?
# Answer
# The interaction term anxtest:genderboy is statistically significant (p ≈ 0.004), and sci_lm2 has a
# slightly higher adjusted R-squared (0.0178 vs 0.0171), so sci_lm2 is the (marginally) better model.
# Fit the models
sci_lm1 <- lm(science ~ anxtest + gender, data = pisa_au_science, weights = stuweight)
sci_lm2 <- lm(science ~ anxtest * gender, data = pisa_au_science, weights = stuweight)
# Output of each fitted model
tidy(sci_lm1)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 523. 1.35 386. 0.
## 2 anxtest -13.0 0.978 -13.3 4.30e-40
## 3 genderboy 0.358 1.90 0.189 8.50e- 1
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0173 0.0171 410. 96.3 3.47e-42 2 -67170. 1.34e5 1.34e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 522. 1.41 369. 0.
## 2 anxtest -10.4 1.35 -7.69 1.58e-14
## 3 genderboy 1.29 1.92 0.671 5.02e- 1
## 4 anxtest:genderboy -5.61 1.96 -2.86 4.20e- 3
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0180 0.0178 410. 67.0 6.54e-43 3 -67166. 1.34e5 1.34e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For linear regression models we ideally want a model that explains \(y\) as well as possible using only independent variables that are statistically significant, i.e., variables that have a statistically significant impact on what we are modelling (here, the science score). This matters because adding other variables (those without a true or statistically significant impact on science score) will still increase the model's \(R^2\) (a measure of how well the model fits the data, i.e., how much the independent variables help to explain the dependent variable), but it leads to overfitting: the model becomes great at predicting science scores for students in the training data set, but generalises poorly to new data (students that were not used to train the model). With this in mind, what does the model development process look like, and how much trial and error should there be?
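A small demonstration of this on made-up data: adding a pure-noise predictor can only increase R-squared, even though it explains nothing.

```r
set.seed(123)
x <- rnorm(100)
y <- 5 + 2 * x + rnorm(100)
noise <- rnorm(100)  # unrelated to y by construction

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + noise)

# R^2 never decreases when a predictor is added to a nested OLS model
summary(m2)$r.squared >= summary(m1)$r.squared
#> [1] TRUE

# Adjusted R^2 penalises the extra parameter, so it will typically
# (though not always) be lower for m2
summary(m2)$adj.r.squared - summary(m1)$adj.r.squared
```

This is why adjusted R-squared (reported by glance()) is usually a fairer basis than raw R-squared for comparing models with different numbers of predictors.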
In practice, the model development stage does not begin until an in-depth exploration of the data has been performed. This involves exploring the variable to be modelled and the variables that could help explain it (think histograms and box plots to understand the distributions of the variables, and scatter plots and correlation coefficients to explore the relationships between variables), which reduces the amount of trial and error required.
Once the exploratory data analysis (EDA) has been performed, we move into the model development phase. Here, we take what we have learned from the EDA to guide our model development. Suppose we learn that a student’s maths score has a strong linear relationship with their science score (a scatter plot might show this). This would indicate that maths score could be a good variable to include in our linear model of science score. If we learn that the impact of maths score on science score differs by gender (same scatter plot, but points coloured by gender), we may want to include the interaction of maths score and gender in our model of science score.
We drew these insights from our EDA, and they guided us through model development (reducing the amount of trial and error). Without the EDA we would not understand our data well, and model development would turn into an exercise of mostly trial and error (Is maths score a good predictor of science score? Is English score? Does it differ by gender? Time spent studying? Does time spent studying affect science score differently between genders? What about schools? etc.). These questions should be asked in the preliminary analysis when we explore the data. Once we have a good idea of what the data tell us (how, or whether, the variables available to us impact science score, etc.), we can move into model development with less trial and error.
All of the material is licensed under the Creative Commons BY-SA 4.0 licence.