Tutorial 5 - Space Launches

Space launches

The space launches data sets are publicly available from The Economist GitHub and were used to develop the interactive visualisation in the article The space race is dominated by new contenders.

The GitHub repository graphic-detail-data from The Economist contains the data, which is split into 2 .csv files. This data is also available from the tidytuesday GitHub repository by the R for Data Science community.

agencies.csv - space launch providers
launches.csv - individual space launches

These data sets are relational data, meaning that both are connected by one or more variables.
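As a minimal sketch of what "connected by one or more variables" means, the two toy tables below (hypothetical values, not the real data) share a column named agency. Listing the column names that appear in both tables gives the candidate connecting variables. A shared name is only a candidate: in the real data, columns such as type, state_code and agency_type also appear in both files, so the definitions still need to be checked.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables standing in for agencies.csv and launches.csv
toy_agencies <- tibble(
  agency = c("AA", "BB"),
  name   = c("Agency A", "Agency B"),
  type   = c("state", "private")
)
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3"),
  agency = c("AA", "AA", "BB"),
  type   = c("Rocket X", "Rocket X", "Rocket Y")
)

# Column names present in both tables are candidate connecting variables
common_cols <- intersect(names(toy_agencies), names(toy_launches))
common_cols
```

Here both agency and type are returned, but only agency holds the same kind of information in both tables, so it is the sensible connecting variable.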

Many publicly available real-world data sets are available on GitHub for us to explore. One of the best ways to reinforce our learning is to apply what we have learned to real-world data, so it is useful to know how to read data from online resources like GitHub repositories.

Before analysing the data, we should read through the table of variable definitions (provided by The Economist and R for Data Science GitHub).

Reading the data

We have seen from our analysis of Melbourne pedestrian count data in Tutorial 2 how data can be read into R from GitHub. Following the same steps, we can read the agencies.csv and launches.csv data with the read_csv() function from the tidyverse:

Load the tidyverse, which contains the read_csv() function to read .csv files in R.
Reading in agencies.csv:
- Go to the agencies.csv file on GitHub and then click Raw, which will open the file (as raw text) in your web browser. The URL will be used to read the data into R.
- Copy the URL of agencies.csv and paste inside the read_csv() function.
- Store the data in an object named agencies.

The following code chunk, with missing parts (???), may guide you through the steps above:

# Load tidyverse
library(???)

# Read agencies data
agencies <- read_???("???")

# Answer

# Load tidyverse
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.2

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Read agencies data
agencies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/agencies.csv")

## Parsed with column specification:
## cols(
##   agency = col_character(),
##   count = col_double(),
##   ucode = col_character(),
##   state_code = col_character(),
##   type = col_character(),
##   class = col_character(),
##   tstart = col_character(),
##   tstop = col_character(),
##   short_name = col_character(),
##   name = col_character(),
##   location = col_character(),
##   longitude = col_character(),
##   latitude = col_character(),
##   error = col_character(),
##   parent = col_character(),
##   short_english_name = col_character(),
##   english_name = col_character(),
##   unicode_name = col_character(),
##   agency_type = col_character()
## )

Now that you have read the agencies data into R, replicate the previous steps to read the space launches data, storing it in an object named launches.

# Read launches data
launches <- read_csv("???")

# Answer

# URL of raw text from launches
launches_URL <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/launches.csv"

# Read launches data
launches <- read_csv(launches_URL)

## Parsed with column specification:
## cols(
##   tag = col_character(),
##   JD = col_double(),
##   launch_date = col_date(format = ""),
##   launch_year = col_double(),
##   type = col_character(),
##   variant = col_character(),
##   mission = col_character(),
##   agency = col_character(),
##   state_code = col_character(),
##   category = col_character(),
##   agency_type = col_character()
## )

Look at `agencies` and `launches`

The learning objective of this tutorial is to strengthen our understanding of related data sets, i.e., understanding how they are connected and recognising that more insights are available when data sets are joined than when they are analysed in silos. While the definition of each variable is available to us, we still need to examine the content of each variable.

The glimpse() function, which comes from the tidyverse, gives us a glimpse of each variable in the data.

Fill out the missing parts of the code chunk (???) to look at both data sets.

# Look at agencies
glimpse(???)

# Look at launches
glimpse(???)

# Answer
glimpse(agencies)
glimpse(launches)

How many rows/observations and columns/variables are in the agencies data?
How many in the launches data?

# Answer

# Agencies: 
#  Rows: 74
#  Columns: 19
# Launches:
#  Rows: 5,726
#  Columns: 11

Conducting a preliminary analysis

A preliminary analysis of data helps us develop a better understanding of the information contained in the data. Often, asking the right questions gives us information that we can use for further analysis.

The following are some questions that we could ask when conducting a preliminary analysis of the data.

How many distinct agencies (space launch providers) are in the agencies data?
Which variable in agencies uniquely describes each agency?
Is there an agency that has more than one organisation phase code? If so, which agency?
Why might an agency have more than one organisation phase code?

Fill out the missing parts of the code chunk below (???) to help answer these questions.

Distinct agencies

While agencies contains information about various space agencies, it is important for our analysis to identify the variable in the data that uniquely describes each agency, i.e., a name, or perhaps, an ID.

Describe how you would go about identifying possible variables in agencies that uniquely describe each agency and how you would confirm this.

# Answer

# You have already looked at the data with glimpse(), but you can look at it again with head() to get a better understanding.
# Read over the definition of each variable.
# Based on this, the obvious candidates that uniquely describe each agency are the variables agency and name.
# Now check that (1) every row in agencies is distinct, (2) the number of distinct values of agency equals the number of rows in agencies, and (3) the number of distinct values of name equals the number of rows in agencies.
# distinct() is a function that can be used here and is easy to explain. count() is also useful, but we will introduce it in the next exercise.
# For now, let's use nrow() and distinct() together to count the number of distinct rows.

# Check if each row in agencies is distinct (should return same data)
agencies %>%
  distinct()

# Count the distinct agency data (number of rows)
# Both of the following should return 74
nrow(agencies %>%
  distinct()) 

nrow(agencies)

# Count the distinct agencies (numbers of rows)
# Should return 74

nrow(agencies %>%
       distinct(agency))

# agencies %>% distinct(agency) shows the values and a count

# Count the distinct names (number of rows)
# Something seems a bit wrong in the count here. There is a repeat!
# We'll explore which agency that is in the next section.

nrow(agencies %>%
  distinct(name))
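The same check can be written in a single summarise() call with n_distinct(), which counts distinct values directly. The sketch below uses a hypothetical toy table rather than the real agencies data, so the numbers are only illustrative. A variable uniquely describes each row when its distinct count equals the row count.

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: agency is unique, name has one repeat
toy_agencies <- tibble(
  agency = c("AA", "BB", "CC"),
  name   = c("Agency A", "Agency B", "Agency A")
)

# Compare the number of rows with the number of distinct values per variable
key_check <- toy_agencies %>%
  summarise(
    n_rows   = n(),
    n_agency = n_distinct(agency),
    n_name   = n_distinct(name)
  )
key_check
```

In this toy table n_agency matches n_rows but n_name is one short, which is the same pattern seen in the real data.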

Complications with real-data: duplicates

One variable that we should have inspected to see if it uniquely describes each agency is name, which contains the name of the space agency. Interestingly, there are 73 space agency names, which is 1 less than the number of agencies in the data (74). This tells us that there is a space agency whose name appears twice in the data. To count the number of times each agency name is observed in the data:

Take agencies, then pipe it (%>%) into the count() function
Inside of count(), specify the variable whose elements we want to count, i.e., name

Fill out the missing parts of the code chunk below (???) then run.

# Count agency name
agencies %>%
  ???(???, sort = TRUE)

# Answer

# Count agency name
agencies %>%
  count(name, sort = TRUE)

Which space agency appears twice in the data?
Filter the data for this space agency and explain what has happened.

# Answer

# Filter for Sea Launch Limited Partnership. 
# Change of headquarters in 2000: Cayman Islands to California
agencies %>%
  filter(name == "Sea Launch Limited Partnership")
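When the repeated value is not known in advance, one possible approach is to keep only the names that count() reports more than once and then pull the matching rows. The sketch below assumes a hypothetical toy table (toy_agencies and its values are made up for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: "Agency A" appears under two different codes
toy_agencies <- tibble(
  agency   = c("AA1", "AA2", "BB"),
  name     = c("Agency A", "Agency A", "Agency B"),
  location = c("Here", "There", "Elsewhere")
)

# Names that occur more than once
dup_names <- toy_agencies %>%
  count(name, sort = TRUE) %>%
  filter(n > 1)

# All rows belonging to those names, with every column kept for inspection
dup_rows <- toy_agencies %>%
  semi_join(dup_names, by = "name")
dup_rows
```

Looking at the full rows side by side makes it easier to see which columns differ between the repeats, such as location in this toy example.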

Joining the data

How are agencies and launches connected? We know that agencies contains information about space launch providers, e.g., their organisation phase code, number of launches, name, etc., while launches contains information about space launches, e.g., the space launch provider, date of launch, outcome of launch (success or failure), etc.

To join the data sets, we need to determine the variable that connects them together. The launches data contains the variable agency, which is the space launch provider (the agency responsible for the space launch) and is represented by an organisation phase code.

Fill out the missing parts of the code chunk below (???) to obtain the number of agencies responsible for a space launch.

# Number of agencies that provided a space launch
launches %>%
  summarise(num_agencies = n_distinct(???))

# Answer

# Number of agencies that provided a space launch
launches %>%
  summarise(num_agencies = n_distinct(agency))

Left join

Using the appropriate connecting variable, join agencies to launches with a left join and assign this data the name launches_agencies. Explain why information about the agency e.g. number of launches, name, location, is missing for some observations.

If you’re unsure how to make a left join, have a look at the quick guide in Moodle for a refresher on joins, how they work, and more.

# Left join agencies to launches
launches_agencies <- left_join(???, ???, by = "???")

# Print launches_agencies
launches_agencies

# Answer

# Left join agencies to launches
launches_agencies <- left_join(launches, agencies, by = "agency")

# Print launches_agencies
launches_agencies

## # A tibble: 5,726 x 29
##    tag       JD launch_date launch_year type.x variant mission agency
##    <chr>  <dbl> <date>            <dbl> <chr>  <chr>   <chr>   <chr> 
##  1 1967… 2.44e6 1967-06-29         1967 Thor … <NA>    Secor … US    
##  2 1967… 2.44e6 1967-08-23         1967 Thor … <NA>    DAPP 3… US    
##  3 1967… 2.44e6 1967-10-11         1967 Thor … <NA>    DAPP 4… US    
##  4 1968… 2.44e6 1968-05-23         1968 Thor … <NA>    DAPP 5… US    
##  5 1968… 2.44e6 1968-10-23         1968 Thor … <NA>    DAPP 6… US    
##  6 1969… 2.44e6 1969-07-23         1969 Thor … <NA>    DAPP 7… US    
##  7 1970… 2.44e6 1970-02-11         1970 Thor … <NA>    DAPP B… US    
##  8 1970… 2.44e6 1970-09-03         1970 Thor … <NA>    DAPP B… US    
##  9 1971… 2.44e6 1971-02-17         1971 Thor … <NA>    DAPP B… US    
## 10 1971… 2.44e6 1971-06-08         1971 Thor … <NA>    P70-1   US    
## # … with 5,716 more rows, and 21 more variables: state_code.x <chr>,
## #   category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## #   state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## #   short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## #   latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## #   english_name <chr>, unicode_name <chr>, agency_type.y <chr>

# Some observations in launches have an agency value with no matching organisation phase code in agencies, so the columns that come from agencies are NA for those rows.
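To see this behaviour on a small scale, the sketch below uses hypothetical toy tables in which one launch has an agency code ("ZZ") that does not exist in the agency table. A left join keeps that launch and fills the agency columns with NA, and anti_join() lists the launches that found no match.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; "ZZ" has no entry in toy_agencies
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3", "L4"),
  agency = c("AA", "AA", "BB", "ZZ")
)
toy_agencies <- tibble(
  agency = c("AA", "BB"),
  name   = c("Agency A", "Agency B")
)

# Left join: every launch is kept, unmatched ones get NA in name
toy_left <- left_join(toy_launches, toy_agencies, by = "agency")
toy_left

# Anti join: only the launches without a matching agency
toy_unmatched <- anti_join(toy_launches, toy_agencies, by = "agency")
toy_unmatched
```

Running anti_join(launches, agencies, by = "agency") on the real data in the same way would list the launches responsible for the missing agency information.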

Retaining only matching values

Suppose that you wanted to join agencies and launches together so that only observations from both data sets with matching organisation phase codes are retained. How could you do it?

Using the appropriate join function, fill out the following code chunk to perform this join in RStudio.

# Join to retain only observations with matches
launches_agencies <- ???_join(launches, agencies, by = "agency")

# Print launches_agencies
launches_agencies

# Answer

# Inner join
launches_agencies <- inner_join(launches, agencies, by = "agency")

# Print launches_agencies
launches_agencies

## # A tibble: 950 x 29
##    tag       JD launch_date launch_year type.x variant mission agency
##    <chr>  <dbl> <date>            <dbl> <chr>  <chr>   <chr>   <chr> 
##  1 1984… 2.45e6 1984-11-10         1984 Arian… <NA>    Spacen… AE    
##  2 1984… 2.45e6 1984-05-23         1984 Arian… <NA>    Spacen… AE    
##  3 1984… 2.45e6 1984-08-04         1984 Arian… <NA>    ECS 2   AE    
##  4 1985… 2.45e6 1985-02-08         1985 Arian… <NA>    Arabsa… AE    
##  5 1985… 2.45e6 1985-05-08         1985 Arian… <NA>    Gstar … AE    
##  6 1985… 2.45e6 1985-07-02         1985 Arian… <NA>    Giotto  AE    
##  7 1986… 2.45e6 1986-02-22         1986 Arian… <NA>    SPOT    AE    
##  8 1986… 2.45e6 1986-03-28         1986 Arian… <NA>    Gstar 2 AE    
##  9 1987… 2.45e6 1987-09-16         1987 Arian… <NA>    Aussat… AE    
## 10 1987… 2.45e6 1987-11-21         1987 Arian… <NA>    TV-SAT  AE    
## # … with 940 more rows, and 21 more variables: state_code.x <chr>,
## #   category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## #   state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## #   short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## #   latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## #   english_name <chr>, unicode_name <chr>, agency_type.y <chr>
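The .x and .y endings in the output above (for example type.x and agency_type.y) appear because both tables contain columns with the same name that are not used as the joining variable. As a sketch, the suffix argument of the join functions can give these columns more descriptive names. The toy tables and the suffixes "_launch" and "_agency" below are hypothetical choices for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables that both contain a column called type
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3", "L4"),
  agency = c("AA", "AA", "BB", "ZZ"),
  type   = c("Rocket X", "Rocket X", "Rocket Y", "Rocket Z")
)
toy_agencies <- tibble(
  agency = c("AA", "BB"),
  name   = c("Agency A", "Agency B"),
  type   = c("state", "private")
)

# Inner join keeps only matching rows; suffix renames the clashing columns
toy_inner <- inner_join(
  toy_launches, toy_agencies,
  by = "agency",
  suffix = c("_launch", "_agency")
)
toy_inner
```

The launch with code "ZZ" is dropped because it has no match, and the two type columns are returned as type_launch and type_agency.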

Analysis of joined data

By joining agencies to launches, you found out about the agencies involved in each space launch, e.g., the location of their headquarters, the ownership status (public or private), the founding date, etc.

Bar plot I

In RStudio, run the following code chunks. They will produce an overlapping bar plot of launch outcomes from each space agency. Inspect the plot.

# Overlapping bar plot of successful and failed launches for each space agency
launches_agencies %>%
  ggplot(aes(x = name, fill = category)) +
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.5) + # offset the bars slightly so they overlap & make them semi-transparent
  labs(title = "Launch Outcomes", x = "Space Agency", y = "Frequency") +
  scale_fill_discrete(name = "Outcome", labels = c("Failure", "Success")) + # change legend title and labels
  theme(legend.position = "bottom") +
  coord_flip() # flips the x and y coordinates

Which variable used to construct the overlapping bar plot came originally from agencies and which from launches?

Bar plot II

Note that we can certainly improve the visualisation of total failed and successful space launches by agency with some data wrangling.

# Wrangling: Create tidy data frame of count of successful and failed launches
success_fail_count <- launches_agencies %>%
  group_by(short_name, category) %>%
  summarise(n_success_fail = n()) %>%
  ungroup() %>%
  spread(category, n_success_fail) %>%
  rename(Success = O, Fail = `F`) %>%
  mutate(Success = replace_na(Success, 0),
         Fail = replace_na(Fail, 0),
         Total_Launch = Fail + Success) %>%
  # Order space agency by total launches
  mutate(short_name = fct_reorder(short_name, Total_Launch)) %>%
  gather(Outcome, Count, -short_name)

## `summarise()` regrouping output by 'short_name' (override with `.groups` argument)

# Plotting: Overlapping bar plot of successful and failed launches for each space agency
success_fail_count %>%
  filter(Outcome != "Total_Launch") %>%
  ggplot(aes(x = short_name, y = Count, fill = Outcome)) +
  geom_bar(stat = "identity") + 
  labs(title = "Launch Outcomes", x = "Space Agency", y = "Frequency") +
  # colours based on Dark2 colour palette
  scale_fill_manual(values = c("#D95F02", "#1B9E77")) +
  facet_wrap(~ Outcome, scales = "free_x") +
  coord_flip()
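The wrangling above reshapes the data with spread() and gather(). In tidyr 1.0.0 and later, pivot_wider() and pivot_longer() are the recommended replacements, so the same reshaping could be sketched as follows. The toy counts below are hypothetical and stand in for the grouped summary of launches_agencies.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical counts of outcomes (O = success, F = failure) per agency
toy_counts <- tibble(
  short_name     = c("A", "A", "B"),
  category       = c("O", "F", "O"),
  n_success_fail = c(5, 1, 3)
)

# Wide format: one column per outcome, missing combinations filled with 0
toy_wide <- toy_counts %>%
  pivot_wider(
    names_from  = category,
    values_from = n_success_fail,
    values_fill = list(n_success_fail = 0)
  ) %>%
  rename(Success = O, Fail = `F`) %>%
  mutate(Total_Launch = Fail + Success)

# Back to long format: one row per agency and outcome
toy_long <- toy_wide %>%
  pivot_longer(-short_name, names_to = "Outcome", values_to = "Count")
toy_long
```

Filling missing combinations inside pivot_wider() takes the place of the two replace_na() calls in the original chunk.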

# Answer

# Variable from agencies: name 
# Variable from launches: category

# Number of successful and failed launches by each agency
launches_agencies %>%
  group_by(name) %>%
  summarise(num_success = sum(category == "O"),
            num_failure = sum(category == "F")) %>%
  ungroup() %>%
  mutate(perc_success = round(100*num_success/(num_success + num_failure), 2))

Labelled plots with text

Suppose you are interested in the launch vehicles with the highest number of launches. In RStudio, fill out the following code chunk to plot the launch vehicle frequency from each space agency.

Which launch vehicles are in the top five highest number of launches and which space agency do they come from?

# Plot of launch vehicle by frequency
launches_agencies %>%
  group_by(short_name, ???, ???) %>%
  summarise(freq_launchvehicle = n()) %>%
  # Return vehicle with highest launch number
  top_n(1, freq_launchvehicle) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(short_name, freq_launchvehicle), y = freq_launchvehicle, colour = state_code.y)) +
  geom_text(aes(label = type.x), size = 3) +
  labs(title = "Number of times vehicle was launched", x = "Space Agency", y = "Frequency") +
  scale_colour_discrete(name = "State Responsible", labels = c("Cayman Islands", "France", "Japan", "Russia", "USA")) +
  coord_flip()

# Answer

# Plot of launch vehicle by frequency
launches_agencies %>%
  group_by(short_name, state_code.y, type.x) %>%
  summarise(freq_launchvehicle = n()) %>%
  # Return vehicle with highest launch number
  top_n(1, freq_launchvehicle) %>%
  ungroup() %>% 
  ggplot(aes(x = fct_reorder(short_name, freq_launchvehicle), y = freq_launchvehicle, colour = state_code.y)) +
  geom_text(aes(label = type.x), size = 3) +
  labs(title = "Number of times vehicle was launched", x = "Space Agency", y = "Frequency") +
  scale_colour_discrete(name = "State Responsible", labels = c("Cayman Islands", "France", "Japan", "Russia", "USA")) +
  coord_flip()

## `summarise()` regrouping output by 'short_name', 'state_code.y' (override with `.groups` argument)

# Proton-M, Ariane5ECA, Falcon 9, Atlas V 401 and Delta 7925.
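The chunk above uses top_n() to keep the most frequently launched vehicle within each group. Since dplyr 1.0.0, top_n() is superseded by slice_max(), so an equivalent step could be sketched as below. The toy frequencies are hypothetical and only illustrate the behaviour.

```r
library(dplyr)
library(tibble)

# Hypothetical launch frequencies for two agencies and four vehicles
toy_freq <- tibble(
  short_name         = c("A", "A", "B", "B"),
  type               = c("R1", "R2", "R3", "R4"),
  freq_launchvehicle = c(10, 4, 2, 7)
)

# Keep the vehicle with the highest frequency within each agency
top_vehicle <- toy_freq %>%
  group_by(short_name) %>%
  slice_max(freq_launchvehicle, n = 1) %>%
  ungroup()
top_vehicle
```

Like top_n(), slice_max() keeps tied rows by default, so an agency with two equally frequent vehicles would return both.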

Tutorial 5 - Space Launches

Quang Bui

31 August, 2020
