Space launches

The space launches data sets are publicly available from The Economist's GitHub and were used to develop the interactive visualisation in the article The space race is dominated by new contenders.

The Economist's GitHub repository, graphic-detail-data, contains the data, which is split into 2 .csv files. The data is also available from the tidytuesday GitHub repository maintained by the R for Data Science community.

These data sets are relational data, meaning that both are connected by one or more variables.

Many real-world data sets are publicly available on GitHub for us to explore. One of the best ways to reinforce our learning is to apply what we have learned to real-world data, so it is useful to know how to read data from online resources like GitHub repositories.

Before analysing the data, we should read through the table of variable definitions (provided by The Economist and R for Data Science GitHub).

Reading the data

In our analysis of the Melbourne pedestrian count data in Tutorial 2, we saw how data can be read into R from GitHub. Following the same steps, we can read the agencies.csv and launches.csv data with the read_csv() function from the tidyverse:

  • Load the tidyverse, which contains the read_csv() function to read .csv files in R.
  • Reading in agencies.csv:
    • Go to the agencies.csv file on GitHub and then click Raw, which will open the file (as raw text) in your web browser. The URL will be used to read the data into R.
    • Copy the URL of agencies.csv and paste inside the read_csv() function.
    • Store the data in an object named agencies.

The following code chunk, with missing parts (???), may guide you through the steps above:

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Parsed with column specification:
## cols(
##   agency = col_character(),
##   count = col_double(),
##   ucode = col_character(),
##   state_code = col_character(),
##   type = col_character(),
##   class = col_character(),
##   tstart = col_character(),
##   tstop = col_character(),
##   short_name = col_character(),
##   name = col_character(),
##   location = col_character(),
##   longitude = col_character(),
##   latitude = col_character(),
##   error = col_character(),
##   parent = col_character(),
##   short_english_name = col_character(),
##   english_name = col_character(),
##   unicode_name = col_character(),
##   agency_type = col_character()
## )
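
As a rough illustration of the steps above, here is a minimal sketch. To keep it runnable offline it writes a tiny stand-in file with made-up values (demo_path and agencies_demo are hypothetical names, not part of the tutorial); read_csv() takes a URL in the same position as a file path, so for the real data you would pass the raw URL copied from GitHub instead.

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Stand-in for the raw GitHub file: a tiny CSV with made-up values.
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("agency,count,name",
    "AAA,12,Agency A",
    "BBB,3,Agency B"),
  demo_path
)

# read_csv() accepts a local path or a URL as its first argument.
agencies_demo <- read_csv(demo_path)
agencies_demo

# With the real data, replace demo_path with the URL copied after
# clicking Raw, e.g. agencies <- read_csv("<raw URL of agencies.csv>")
```

The column specification that read_csv() prints is worth reading: it shows the type guessed for each column, which is how we can see above that longitude and latitude were read as character rather than numeric.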

Now that you have read the agencies data into R, repeat the previous steps to read the space launches data, storing it in an object named launches.

## Parsed with column specification:
## cols(
##   tag = col_character(),
##   JD = col_double(),
##   launch_date = col_date(format = ""),
##   launch_year = col_double(),
##   type = col_character(),
##   variant = col_character(),
##   mission = col_character(),
##   agency = col_character(),
##   state_code = col_character(),
##   category = col_character(),
##   agency_type = col_character()
## )

Look at agencies and launches

The learning objective of this tutorial is to strengthen our understanding of related data sets: how they are connected, and why joining them yields more insight than analysing each in isolation. Although a definition of each variable is available, we still need to examine the contents of each variable.

The glimpse() function, which comes from the tidyverse, gives us a glimpse of each variable in the data.

Fill out the missing parts of the code chunk (???) to look at both data sets.

  • How many rows/observations and columns/variables are in the agencies data?
  • How many in the launches data?
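
To show what glimpse() reports, here is a small sketch on a made-up table (toy and its values are invented for illustration). The same call works on agencies and launches once they have been read in.

```r
library(dplyr)  # part of the tidyverse; provides glimpse() and tibble()

# A made-up table standing in for the real data.
toy <- tibble(
  agency = c("AAA", "BBB", "CCC"),
  count  = c(12, 3, 7)
)

# Prints the number of rows and columns, then one line per variable
# with its type and first few values.
glimpse(toy)

# dim() returns the same two numbers (rows, then columns) as a vector.
dim(toy)
```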

Conducting a preliminary analysis

A preliminary analysis helps us develop a better understanding of the information contained in the data. Often, asking the right questions gives us information that we can use for further analysis.

The following are some questions that we could ask when conducting a preliminary analysis of the data.

  • How many distinct agencies (space launch providers) are in the agencies data?
  • Which variable in agencies uniquely describes each agency?
  • Is there an agency that has more than one organisation phase code? If so, which agency?
  • Why might an agency have more than one organisation phase code?

Fill out the missing parts of the code chunk below (???) to help answer these questions.

Distinct agencies

While agencies contains information about various space agencies, it is important for our analysis to identify the variable in the data that uniquely describes each agency, i.e., a name or perhaps an ID.

Describe how you would go about identifying possible variables in agencies that uniquely describe each agency, and how you would confirm this.
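
One possible approach, sketched on a made-up table: a variable uniquely describes each row when its number of distinct values equals the number of rows. The table toy_agencies, its values, and the helper is_key() are all invented for illustration.

```r
library(dplyr)

# Made-up agencies table: codes are unique, one name is repeated.
toy_agencies <- tibble(
  agency = c("AAA", "BBB", "CCC", "DDD"),
  name   = c("Agency A", "Agency B", "Agency C", "Agency C")
)

# Compare the number of rows with the distinct values in each column.
toy_agencies %>%
  summarise(
    rows     = n(),
    n_agency = n_distinct(agency),
    n_name   = n_distinct(name)
  )

# Hypothetical helper: a column can identify rows only if no value repeats.
is_key <- function(data, column) {
  !any(duplicated(data[[column]]))
}

is_key(toy_agencies, "agency")  # TRUE
is_key(toy_agencies, "name")    # FALSE
```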

Complications with real data: duplicates

One variable that we should have inspected for uniqueness is name, which contains the name of the space agency. Interestingly, there are 73 distinct space agency names, which is 1 fewer than the number of agencies in the data (74). This tells us that one agency name appears twice in the data. To count the number of times each agency name is observed in the data:

  • Take agencies, then pipe it (%>%) into the count() function
  • Inside count(), specify the variable whose values we want to count, i.e., name

Fill out the missing parts of the code chunk below (???) then run.

  • Which space agency appears twice in the data?
  • Filter the data for this space agency and explain what has happened.
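
The counting steps above might look like the following sketch. It runs on a made-up table (toy_agencies and its values are invented), so it shows the pattern without revealing which agency is repeated in the real data.

```r
library(dplyr)

# Made-up agencies table in which one name occurs twice.
toy_agencies <- tibble(
  agency = c("AAA", "BBB", "CCC", "DDD"),
  name   = c("Agency A", "Agency B", "Agency C", "Agency C")
)

# Count how often each name occurs, most frequent first.
name_counts <- toy_agencies %>%
  count(name, sort = TRUE)

# Keep only the names that occur more than once.
repeated <- name_counts %>%
  filter(n > 1)

# Inspect the full rows behind the repeated name to see how they differ.
toy_agencies %>%
  filter(name %in% repeated$name)
```

Looking at the full rows, rather than only the count, is what lets us explain the repeat: here the two rows share a name but carry different codes.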

Joining the data

How are agencies and launches connected? We know that agencies contains information about space launch providers, e.g., their organisation phase code, number of launches, name, etc., while launches contains information about space launches, e.g., the space launch provider, date of launch, outcome of launch (success or failure), etc.

To join the data sets, we need to determine the variable that connects them. The launches data contains the variable agency, which is the space launch provider (the agency responsible for the space launch) and is represented by an organisation phase code.

Fill out the missing parts of the code chunk below (???) to obtain the number of agencies responsible for a space launch.
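
A hedged sketch of one way to count the agencies that appear in a launches table, using made-up data (toy_launches and its values are invented). The na.rm = TRUE argument is included only in case some launches have no agency recorded.

```r
library(dplyr)

# Made-up launches table; the last launch has no agency recorded.
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3", "L4"),
  agency = c("AAA", "AAA", "BBB", NA)
)

# Number of distinct agency codes, ignoring missing values.
n_agencies <- toy_launches %>%
  summarise(n_agencies = n_distinct(agency, na.rm = TRUE)) %>%
  pull(n_agencies)

n_agencies
```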

Left join

Using the appropriate connecting variable, join agencies to launches with a left join and assign the result the name launches_agencies. Explain why information about the agency, e.g., number of launches, name, and location, is missing for some observations.

If you’re unsure how to make a left join, have a look at the quick guide in Moodle for a refresher on joins, how they work, and more.

## # A tibble: 5,726 x 29
##    tag       JD launch_date launch_year type.x variant mission agency
##    <chr>  <dbl> <date>            <dbl> <chr>  <chr>   <chr>   <chr> 
##  1 1967… 2.44e6 1967-06-29         1967 Thor … <NA>    Secor … US    
##  2 1967… 2.44e6 1967-08-23         1967 Thor … <NA>    DAPP 3… US    
##  3 1967… 2.44e6 1967-10-11         1967 Thor … <NA>    DAPP 4… US    
##  4 1968… 2.44e6 1968-05-23         1968 Thor … <NA>    DAPP 5… US    
##  5 1968… 2.44e6 1968-10-23         1968 Thor … <NA>    DAPP 6… US    
##  6 1969… 2.44e6 1969-07-23         1969 Thor … <NA>    DAPP 7… US    
##  7 1970… 2.44e6 1970-02-11         1970 Thor … <NA>    DAPP B… US    
##  8 1970… 2.44e6 1970-09-03         1970 Thor … <NA>    DAPP B… US    
##  9 1971… 2.44e6 1971-02-17         1971 Thor … <NA>    DAPP B… US    
## 10 1971… 2.44e6 1971-06-08         1971 Thor … <NA>    P70-1   US    
## # … with 5,716 more rows, and 21 more variables: state_code.x <chr>,
## #   category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## #   state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## #   short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## #   latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## #   english_name <chr>, unicode_name <chr>, agency_type.y <chr>
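
To illustrate how a left join behaves, here is a minimal sketch on two made-up tables (all names and values are invented). It also shows where suffixes such as type.x and type.y in the output above come from: columns that share a name in both tables, but are not used as the join key, are kept separately with .x and .y appended.

```r
library(dplyr)

# Made-up tables; "ZZZ" deliberately has no match in toy_agencies.
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3"),
  type   = c("Rocket 1", "Rocket 2", "Rocket 3"),
  agency = c("AAA", "BBB", "ZZZ")
)
toy_agencies <- tibble(
  agency = c("AAA", "BBB"),
  type   = c("private", "state"),
  name   = c("Agency A", "Agency B")
)

# left_join() keeps every row of the left table (the launches).
joined <- toy_launches %>%
  left_join(toy_agencies, by = "agency")

joined
# The "ZZZ" launch is kept, but its agency columns (type.y, name) are NA
# because no row of toy_agencies carries that code.
```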

Retaining only matching values

Suppose that you wanted to join agencies and launches so that only observations with matching organisation phase codes in both data sets are retained. How could you do it?

Using the appropriate join function, fill out the following code chunk to perform this join in RStudio.

## # A tibble: 950 x 29
##    tag       JD launch_date launch_year type.x variant mission agency
##    <chr>  <dbl> <date>            <dbl> <chr>  <chr>   <chr>   <chr> 
##  1 1984… 2.45e6 1984-11-10         1984 Arian… <NA>    Spacen… AE    
##  2 1984… 2.45e6 1984-05-23         1984 Arian… <NA>    Spacen… AE    
##  3 1984… 2.45e6 1984-08-04         1984 Arian… <NA>    ECS 2   AE    
##  4 1985… 2.45e6 1985-02-08         1985 Arian… <NA>    Arabsa… AE    
##  5 1985… 2.45e6 1985-05-08         1985 Arian… <NA>    Gstar … AE    
##  6 1985… 2.45e6 1985-07-02         1985 Arian… <NA>    Giotto  AE    
##  7 1986… 2.45e6 1986-02-22         1986 Arian… <NA>    SPOT    AE    
##  8 1986… 2.45e6 1986-03-28         1986 Arian… <NA>    Gstar 2 AE    
##  9 1987… 2.45e6 1987-09-16         1987 Arian… <NA>    Aussat… AE    
## 10 1987… 2.45e6 1987-11-21         1987 Arian… <NA>    TV-SAT  AE    
## # … with 940 more rows, and 21 more variables: state_code.x <chr>,
## #   category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## #   state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## #   short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## #   latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## #   english_name <chr>, unicode_name <chr>, agency_type.y <chr>
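
For comparison, a small sketch of a join that keeps only matching rows, again on made-up tables (names and values invented). inner_join() is one dplyr function with this behaviour; anti_join() is added here only as a way to list the rows that were dropped.

```r
library(dplyr)

# Made-up tables; "ZZZ" has no match in toy_agencies.
toy_launches <- tibble(
  tag    = c("L1", "L2", "L3"),
  agency = c("AAA", "BBB", "ZZZ")
)
toy_agencies <- tibble(
  agency = c("AAA", "BBB"),
  name   = c("Agency A", "Agency B")
)

# inner_join() keeps only rows whose key appears in both tables.
matched <- toy_launches %>%
  inner_join(toy_agencies, by = "agency")

# anti_join() returns the launches that found no matching agency.
unmatched <- toy_launches %>%
  anti_join(toy_agencies, by = "agency")

nrow(matched)    # 2
nrow(unmatched)  # 1
```

The same logic explains why the inner join above has far fewer rows (950) than the left join (5,726): launches without a matching agency are dropped rather than padded with NA.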

Analysis of joined data

By joining agencies to launches, you found out about the agencies involved in each space launch, e.g., the location of their headquarters, the ownership status (public or private), the founding date, etc.

Bar plot II

Note that we can certainly improve the visualisation of total failed and successful space launches by agency with some data wrangling.

## `summarise()` regrouping output by 'short_name' (override with `.groups` argument)
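
The message above appears when summarise() is applied to data grouped by more than one variable. The following sketch, on made-up data, shows one way to wrangle launch outcomes into counts before plotting. The names toy_joined and outcome_counts, and the outcome codes "O" and "F", are placeholders; check the variable definitions for the real coding of category.

```r
library(dplyr)
library(ggplot2)

# Made-up joined data: one row per launch.
toy_joined <- tibble(
  short_name = c("A", "A", "A", "B", "B"),
  category   = c("O", "O", "F", "O", "F")
)

# Count launches per agency and outcome; .groups = "drop" returns an
# ungrouped result and silences the regrouping message.
outcome_counts <- toy_joined %>%
  group_by(short_name, category) %>%
  summarise(n_launches = n(), .groups = "drop")

# Stacked bars, agencies ordered by their total number of launches.
p <- ggplot(
  outcome_counts,
  aes(x = reorder(short_name, n_launches, FUN = sum),
      y = n_launches,
      fill = category)
) +
  geom_col() +
  coord_flip() +
  labs(x = "Agency", y = "Number of launches", fill = "Outcome")

p
```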

Labelled plots with text

Suppose you are interested in the launch vehicles with the highest number of launches. In RStudio, fill out the following code chunk to plot the launch vehicle frequency from each space agency.

Which launch vehicles are in the top five highest number of launches and which space agency do they come from?

## `summarise()` regrouping output by 'short_name', 'state_code.y' (override with `.groups` argument)
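
A possible shape for such a labelled plot, sketched on made-up data: count launches per agency and launch vehicle, then draw each vehicle's name as a text label with geom_text(). The table toy_joined and its values are invented; type.x is assumed here to hold the launch vehicle, as the joined output above suggests.

```r
library(dplyr)
library(ggplot2)

# Made-up joined data: one row per launch.
toy_joined <- tibble(
  short_name = c("A", "A", "A", "B", "B", "B"),
  type.x     = c("Rocket 1", "Rocket 1", "Rocket 2",
                 "Rocket 3", "Rocket 3", "Rocket 3")
)

# Launches per agency and vehicle, largest first.
vehicle_counts <- toy_joined %>%
  count(short_name, type.x, name = "n_launches") %>%
  arrange(desc(n_launches))

# Each vehicle is drawn as its own name, positioned by agency and count.
p <- ggplot(
  vehicle_counts,
  aes(x = short_name, y = n_launches, label = type.x)
) +
  geom_text() +
  labs(x = "Agency", y = "Number of launches")

p

# The most frequently launched vehicles can also be read off the table.
head(vehicle_counts, 2)
```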