The space launches data sets are publicly available from The Economist's GitHub and were used to develop the interactive visualisation in the article The space race is dominated by new contenders.
The GitHub repository graphic-detail-data from The Economist contains the data, which is split into two .csv files. This data is also available from the tidytuesday GitHub repository maintained by the R for Data Science community.
- agencies.csv: space launch providers
- launches.csv: individual space launches
These data sets are relational data, meaning that both are connected by one or more variables.
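As a rough illustration of what "connected by one or more variables" means, the sketch below builds two small tables and lists the column names they share. The tables `providers` and `flights` and their values are hypothetical stand-ins, not the real files.

```r
library(tibble)

# Hypothetical toy tables (invented values, not the real data)
providers <- tibble(
  agency = c("A1", "B2"),
  name   = c("Agency One", "Agency Two")
)
flights <- tibble(
  tag    = c("f1", "f2", "f3"),
  agency = c("A1", "A1", "B2")
)

# Column names that appear in both tables are candidate connecting variables
shared <- intersect(names(providers), names(flights))
shared
```

A shared column name is only a hint. Whether the column really links the two tables still has to be checked against the variable definitions and the values themselves.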
Many real-world data sets are publicly available on GitHub for us to explore. One of the best ways to reinforce our learning is to apply what we have learned to real-world data, so it is useful to know how to read data from online resources like GitHub repositories.
Before analysing the data, we should read through the table of variable definitions (provided by The Economist and R for Data Science GitHub).
We have seen from our analysis of the Melbourne pedestrian count data in Tutorial 2 how data can be read into R from GitHub. Following the same steps, we can read the agencies.csv and launches.csv data with the read_csv() function from the tidyverse:
- Load the tidyverse, which contains the read_csv() function to read .csv files in R.
- To read agencies.csv:
  - Locate the agencies.csv file on GitHub and then click Raw, which will open the file (as raw text) in your web browser. The URL will be used to read the data into R.
  - Copy the URL of agencies.csv and paste it inside the read_csv() function.
  - Store the data in an object named agencies.
The following code chunk, with missing parts (???), may guide you through the steps above:
# Load the tidyverse
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Read agencies data
agencies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/agencies.csv")
## Parsed with column specification:
## cols(
## agency = col_character(),
## count = col_double(),
## ucode = col_character(),
## state_code = col_character(),
## type = col_character(),
## class = col_character(),
## tstart = col_character(),
## tstop = col_character(),
## short_name = col_character(),
## name = col_character(),
## location = col_character(),
## longitude = col_character(),
## latitude = col_character(),
## error = col_character(),
## parent = col_character(),
## short_english_name = col_character(),
## english_name = col_character(),
## unicode_name = col_character(),
## agency_type = col_character()
## )
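The column specification above shows that read_csv() guessed the type of every column, and that longitude and latitude were read as character. If a different type is wanted, the types can be declared through the `col_types` argument. The sketch below uses a small temporary file with invented values in place of the GitHub URL. Whether the real longitude column converts cleanly to numbers has not been checked here.

```r
library(readr)

# Hypothetical stand-in for a .csv file (invented values)
path <- tempfile(fileext = ".csv")
writeLines(c("agency,count,longitude",
             "A1,10,12.5",
             "B2,3,-45.25"), path)

# Declare the column types instead of letting read_csv() guess them
toy <- read_csv(
  path,
  col_types = cols(
    agency    = col_character(),
    count     = col_double(),
    longitude = col_double()
  )
)
toy
```

Supplying `col_types` also suppresses the "Parsed with column specification" message, because nothing is left to guess.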
Now that you have read the agencies data into R, replicate the previous steps to read the space launches data, storing it in an object named launches.
# Answer
# URL of raw text from launches
launches_URL <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/launches.csv"
# Read launches data
launches <- read_csv(launches_URL)
## Parsed with column specification:
## cols(
## tag = col_character(),
## JD = col_double(),
## launch_date = col_date(format = ""),
## launch_year = col_double(),
## type = col_character(),
## variant = col_character(),
## mission = col_character(),
## agency = col_character(),
## state_code = col_character(),
## category = col_character(),
## agency_type = col_character()
## )
agencies and launches
The learning objective of this tutorial is to strengthen our understanding of related data sets, i.e., understanding how they are connected and recognising that more insights are available when data sets are joined than when they are kept in silos. While the definition of each variable is available to us, we still need to examine the contents of each variable.
The glimpse() function, which comes from the tidyverse, gives us a glimpse of each variable in the data.
Fill out the missing parts of the code chunk (???) to look at both data sets.
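To show what kind of output to expect, here is a minimal sketch of glimpse() on a small hypothetical tibble (`toy`, with invented values). It prints the number of rows and columns, followed by each column's name, type and first few values.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (invented values)
toy <- tibble(
  agency = c("A1", "B2"),
  count  = c(10, 3)
)

# One line per variable: name, type, and the first few values
glimpse(toy)

# Number of rows and columns, if only the dimensions are needed
dim(toy)
```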
… agencies data?
… launches data?
A preliminary analysis of data helps us develop a better understanding of the information contained in the data. Often, asking the right questions gives us information that we can use for further analysis.
The following are some questions that we could ask when conducting a preliminary analysis of the data.
- … agencies data?
- Which variable in agencies uniquely describes each agency?
Fill out the missing parts of the code chunk below (???) to help answer these questions.
While agencies contains information about various space agencies, it is important for our analysis to identify the variable in the data that uniquely describes each agency, i.e., a name or perhaps an ID.
Describe how you would go about identifying possible variables in agencies that uniquely describe each agency and how you would confirm this.
# Answer
# You have already looked at the data with glimpse(), but you can look at it again with head() to get a better understanding.
# Read over the definition of each variable.
# Based on what you have done, the obvious candidates that uniquely describe each agency are the variables agency and name.
# Now it is a matter of checking that:
#   - each row in agencies is distinct,
#   - the number of distinct values of agency equals the number of rows in agencies,
#   - the number of distinct values of name equals the number of rows in agencies.
# distinct() is a function that can be used and is easy to explain. count() is also useful here, but we will introduce it in the next exercise.
# For now, let's use nrow() and distinct() together to count the number of distinct rows.
# Check if each row in agencies is distinct (should return the same data)
agencies %>%
  distinct()
# Count the distinct rows of agencies (number of rows)
# Both of the following should return 74
nrow(agencies %>%
  distinct())
nrow(agencies)
# Count the distinct agencies (number of rows)
# Should return 74
nrow(agencies %>%
  distinct(agency))
# agencies %>% distinct(agency) shows the values and a count
# Count the distinct names (number of rows)
# Something seems a bit wrong in the count here. There is a repeat!
# We'll explore which agency that is in the next section.
nrow(agencies %>%
  distinct(name))
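The same check can be wrapped in a small helper that compares the number of distinct values in a column with the number of rows. This is only a sketch: `is_key()` is a hypothetical helper written for this illustration, and `toy_agencies` holds invented values.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in which one name appears twice
toy_agencies <- tibble(
  agency = c("A1", "B2", "C3"),
  name   = c("Orbit Co", "Orbit Co", "Star Lab")
)

# TRUE when every row has a different value in the given column
is_key <- function(data, column) {
  n_distinct(data[[column]]) == nrow(data)
}

is_key(toy_agencies, "agency")  # every code is different
is_key(toy_agencies, "name")    # one name is repeated
```

Note that n_distinct() counts a missing value as a value of its own, so a column with a single NA can still pass this check.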
One variable that we should have inspected to see if it uniquely describes each agency is name, which contains the name of the space agency. Interestingly, there are 73 distinct space agency names, which is 1 fewer than the number of agencies in the data (74). This tells us that there is a space agency name that appears twice in the data. To count the number of times each agency name is observed in the data:
- Start with agencies, then pipe it (%>%) into the count() function.
- Inside count(), specify the variable whose elements we want to count, i.e., name.
Fill out the missing parts of the code chunk below (???), then run it.
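Once the counts are available, the repeated values can be isolated by keeping only the counts greater than one. A minimal sketch on a hypothetical toy table (invented names, not the real agencies):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in which one name appears twice
toy_agencies <- tibble(
  agency = c("A1", "B2", "C3"),
  name   = c("Orbit Co", "Orbit Co", "Star Lab")
)

# count() adds a column n with the number of rows per name;
# filter() then keeps only the names seen more than once
dupes <- toy_agencies %>%
  count(name) %>%
  filter(n > 1)
dupes
```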
How are agencies and launches connected? We know that agencies contains information about space launch providers, e.g., their organisation phase code, number of launches, name, etc., while launches contains information about space launches, e.g., the space launch provider, date of launch, outcome of launch (success or failure), etc.
To join the data sets, we need to determine the variable that connects them. The launches data contains the variable agency, which is the space launch provider (the agency responsible for the space launch) and is represented by an organisation phase code.
Fill out the missing parts of the code chunk below (???) to obtain the number of agencies responsible for a space launch.
# Number of agencies that provided a space launch
launches %>%
  summarise(num_agencies = n_distinct(???))
# Answer
# Number of agencies that provided a space launch
launches %>%
  summarise(num_agencies = n_distinct(agency))
Using the appropriate connecting variable, join agencies to launches with a left join and assign this data the name launches_agencies. Explain why information about the agency, e.g., number of launches, name, location, is missing for some observations.
If you’re unsure how to make a left join, have a look at the quick guide in Moodle for a refresher on joins, how they work, and more.
# Left join agencies to launches
launches_agencies <- left_join(???, ???, by = "???")
# Print launches_agencies
launches_agencies
# Answer
# Left join agencies to launches
launches_agencies <- left_join(launches, agencies, by = "agency")
# Print launches_agencies
launches_agencies
## # A tibble: 5,726 x 29
## tag JD launch_date launch_year type.x variant mission agency
## <chr> <dbl> <date> <dbl> <chr> <chr> <chr> <chr>
## 1 1967… 2.44e6 1967-06-29 1967 Thor … <NA> Secor … US
## 2 1967… 2.44e6 1967-08-23 1967 Thor … <NA> DAPP 3… US
## 3 1967… 2.44e6 1967-10-11 1967 Thor … <NA> DAPP 4… US
## 4 1968… 2.44e6 1968-05-23 1968 Thor … <NA> DAPP 5… US
## 5 1968… 2.44e6 1968-10-23 1968 Thor … <NA> DAPP 6… US
## 6 1969… 2.44e6 1969-07-23 1969 Thor … <NA> DAPP 7… US
## 7 1970… 2.44e6 1970-02-11 1970 Thor … <NA> DAPP B… US
## 8 1970… 2.44e6 1970-09-03 1970 Thor … <NA> DAPP B… US
## 9 1971… 2.44e6 1971-02-17 1971 Thor … <NA> DAPP B… US
## 10 1971… 2.44e6 1971-06-08 1971 Thor … <NA> P70-1 US
## # … with 5,716 more rows, and 21 more variables: state_code.x <chr>,
## # category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## # state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## # short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## # latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## # english_name <chr>, unicode_name <chr>, agency_type.y <chr>
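One way to see why some rows lack agency details is to list the rows of the left table whose key has no match in the right table. The sketch below does this with anti_join() from dplyr. The tables `toy_launches` and `toy_agencies` and the codes in them are hypothetical and only illustrate the mechanism.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables (invented codes and names)
toy_launches <- tibble(
  tag    = c("l1", "l2", "l3"),
  agency = c("A1", "US", "B2")
)
toy_agencies <- tibble(
  agency = c("A1", "B2"),
  name   = c("Agency One", "Agency Two")
)

# A left join keeps every launch; unmatched launches get NA in name
joined <- left_join(toy_launches, toy_agencies, by = "agency")
joined

# anti_join() returns only the launches with no matching agency
unmatched <- anti_join(toy_launches, toy_agencies, by = "agency")
unmatched
```

Applied to the real data, the same pattern would show which agency codes in launches never appear in agencies.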
Suppose that you wanted to join agencies and launches so that only observations from both data sets with matching organisation phase codes are retained. How could you do it?
Using the appropriate join function, fill out the following code chunk to perform this join in RStudio.
# Join to retain only observations with matches
launches_agencies <- ???_join(launches, agencies, by = "agency")
# Print launches_agencies
launches_agencies
# Answer
# Inner join
launches_agencies <- inner_join(launches, agencies, by = "agency")
# Print launches_agencies
launches_agencies
## # A tibble: 950 x 29
## tag JD launch_date launch_year type.x variant mission agency
## <chr> <dbl> <date> <dbl> <chr> <chr> <chr> <chr>
## 1 1984… 2.45e6 1984-11-10 1984 Arian… <NA> Spacen… AE
## 2 1984… 2.45e6 1984-05-23 1984 Arian… <NA> Spacen… AE
## 3 1984… 2.45e6 1984-08-04 1984 Arian… <NA> ECS 2 AE
## 4 1985… 2.45e6 1985-02-08 1985 Arian… <NA> Arabsa… AE
## 5 1985… 2.45e6 1985-05-08 1985 Arian… <NA> Gstar … AE
## 6 1985… 2.45e6 1985-07-02 1985 Arian… <NA> Giotto AE
## 7 1986… 2.45e6 1986-02-22 1986 Arian… <NA> SPOT AE
## 8 1986… 2.45e6 1986-03-28 1986 Arian… <NA> Gstar 2 AE
## 9 1987… 2.45e6 1987-09-16 1987 Arian… <NA> Aussat… AE
## 10 1987… 2.45e6 1987-11-21 1987 Arian… <NA> TV-SAT AE
## # … with 940 more rows, and 21 more variables: state_code.x <chr>,
## # category <chr>, agency_type.x <chr>, count <dbl>, ucode <chr>,
## # state_code.y <chr>, type.y <chr>, class <chr>, tstart <chr>, tstop <chr>,
## # short_name <chr>, name <chr>, location <chr>, longitude <chr>,
## # latitude <chr>, error <chr>, parent <chr>, short_english_name <chr>,
## # english_name <chr>, unicode_name <chr>, agency_type.y <chr>
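The two outputs above have 5,726 rows (left join) and 950 rows (inner join). If agency is indeed unique in agencies, as the earlier check suggested, the difference of 4,776 rows would be the launches with no matching agency. The sketch below checks the same relationship on hypothetical toy tables with invented values.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; agency is unique in toy_agencies
toy_launches <- tibble(
  tag    = c("l1", "l2", "l3", "l4"),
  agency = c("A1", "US", "B2", "A1")
)
toy_agencies <- tibble(
  agency = c("A1", "B2"),
  name   = c("Agency One", "Agency Two")
)

n_left      <- nrow(left_join(toy_launches, toy_agencies, by = "agency"))
n_inner     <- nrow(inner_join(toy_launches, toy_agencies, by = "agency"))
n_unmatched <- nrow(anti_join(toy_launches, toy_agencies, by = "agency"))

# With a unique key in the right table, the counts add up
n_left == n_inner + n_unmatched
```

If the key were repeated in the right table, matched rows would be duplicated in both joins. The two totals would still balance, but the left join would then have more rows than launches, so neither count could be read as a number of launches.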
By joining agencies to launches, you found out about the agencies involved in each space launch, e.g., the location of their headquarters, the ownership status (public or private), the founding date, etc.
In RStudio, run the following code chunks. They will produce an overlapping bar plot of launch outcomes from each space agency. Inspect the plot.
# Overlapping bar plot of successful and failed launches for each space agency
launches_agencies %>%
  ggplot(aes(x = name, fill = category)) +
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.5) + # adjust bar position & make bars semi-transparent
  labs(title = "Launch Outcomes", x = "Space Agency", y = "Frequency") +
  scale_fill_discrete(name = "Outcome", labels = c("Failure", "Success")) + # change legend title and labels
  theme(legend.position = "bottom") +
  coord_flip() # flips the x and y coordinates
Which variable used to construct the overlapping bar plot came originally from agencies and which from launches?
Note that we can certainly improve the visualisation of total failed and successful space launches by agency with some data wrangling.
# Wrangling: Create tidy data frame of count of successful and failed launches
success_fail_count <- launches_agencies %>%
  group_by(short_name, category) %>%
  summarise(n_success_fail = n()) %>%
  ungroup() %>%
  spread(category, n_success_fail) %>%
  rename(Success = O, Fail = `F`) %>%
  mutate(Success = replace_na(Success, 0),
         Fail = replace_na(Fail, 0),
         Total_Launch = Fail + Success) %>%
  # Order space agency by total launches
  mutate(short_name = fct_reorder(short_name, Total_Launch)) %>%
  gather(Outcome, Count, -short_name)
## `summarise()` regrouping output by 'short_name' (override with `.groups` argument)
# Plotting: Faceted bar plot of successful and failed launches for each space agency
success_fail_count %>%
  filter(Outcome != "Total_Launch") %>%
  ggplot(aes(x = short_name, y = Count, fill = Outcome)) +
  geom_bar(stat = "identity") +
  labs(title = "Launch Outcomes", x = "Space Agency", y = "Frequency") +
  # colours based on Dark2 colour palette
  scale_fill_manual(values = c("#D95F02", "#1B9E77")) +
  facet_wrap(~ Outcome, scales = "free_x") +
  coord_flip()
# Answer
# Variable from agencies: name
# Variable from launches: category
# Number of successful and failed launches by each agency
launches_agencies %>%
  group_by(name) %>%
  summarise(num_success = sum(category == "O"),
            num_failure = sum(category == "F")) %>%
  ungroup() %>%
  mutate(perc_success = round(100 * num_success / (num_success + num_failure), 2))
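The wrangling step earlier in this exercise reshapes the counts with spread() and gather(). Recent versions of tidyr (1.0.0 onwards) also provide pivot_wider() and pivot_longer() for the same job. The sketch below shows a comparable reshaping on a hypothetical table of counts (`toy_counts`, invented values). It is an alternative for illustration, not the approach used in this tutorial.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical counts per agency and outcome ("O" = success, "F" = failure)
toy_counts <- tibble(
  short_name = c("A", "A", "B"),
  category   = c("O", "F", "O"),
  n          = c(5, 1, 2)
)

# Wide form: one column per outcome, missing combinations filled with 0
wide <- toy_counts %>%
  pivot_wider(names_from = category, values_from = n,
              values_fill = list(n = 0)) %>%
  rename(Success = O, Fail = `F`) %>%
  mutate(Total_Launch = Success + Fail)

# Long form again: one row per agency and outcome
long <- wide %>%
  pivot_longer(-short_name, names_to = "Outcome", values_to = "Count")
long
```

Filling with `values_fill` inside pivot_wider() takes the place of the separate replace_na() calls in the spread() version.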
Suppose you are interested in the launch vehicles with the highest number of launches. In RStudio, fill out the following code chunk to plot the launch vehicle frequency from each space agency.
Which launch vehicles are in the top five highest number of launches and which space agency do they come from?
# Plot of launch vehicle by frequency
launches_agencies %>%
  group_by(short_name, ???, ???) %>%
  summarise(freq_launchvehicle = n()) %>%
  # Return vehicle with highest launch number
  top_n(1, freq_launchvehicle) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(short_name, freq_launchvehicle), y = freq_launchvehicle, colour = state_code.y)) +
  geom_text(aes(label = type.x), size = 3) +
  labs(title = "Number of times vehicle was launched", x = "Space Agency", y = "Frequency") +
  scale_colour_discrete(name = "State Responsible", labels = c("Cayman Island", "France", "Japan", "Russia", "USA")) +
  coord_flip()
# Answer
# Plot of launch vehicle by frequency
launches_agencies %>%
  group_by(short_name, state_code.y, type.x) %>%
  summarise(freq_launchvehicle = n()) %>%
  # Return vehicle with highest launch number
  top_n(1, freq_launchvehicle) %>%
  ungroup() %>%
  ggplot(aes(x = fct_reorder(short_name, freq_launchvehicle), y = freq_launchvehicle, colour = state_code.y)) +
  geom_text(aes(label = type.x), size = 3) +
  labs(title = "Number of times vehicle was launched", x = "Space Agency", y = "Frequency") +
  scale_colour_discrete(name = "State Responsible", labels = c("Cayman Island", "France", "Japan", "Russia", "USA")) +
  coord_flip()
## `summarise()` regrouping output by 'short_name', 'state_code.y' (override with `.groups` argument)
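The code above keeps the most frequently launched vehicle per group with top_n(). From dplyr 1.0.0 onwards, slice_max() is offered for the same purpose. The sketch below shows it on a hypothetical table of frequencies (`toy_freq`, invented values), as an alternative rather than a correction.

```r
library(dplyr)
library(tibble)

# Hypothetical launch frequencies per agency and vehicle type
toy_freq <- tibble(
  short_name = c("A", "A", "B", "B"),
  type       = c("X1", "X2", "Y1", "Y2"),
  freq       = c(3, 7, 4, 2)
)

# Keep the row with the largest freq within each agency
top_vehicle <- toy_freq %>%
  group_by(short_name) %>%
  slice_max(freq, n = 1) %>%
  ungroup()
top_vehicle
```

Like top_n(), slice_max() keeps tied rows by default, so an agency whose top two vehicles have the same frequency would appear twice.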