2. Data Types
2.1 Main Data Classes
R has five basic or atomic classes of objects:
- numeric, which comes in two types:
  - double (real numbers): values like 2.3, 3.14, -5.7634, …
  - integer: values like 0, 1, 2, -4, …
- character: values like "GDDS", 'exe'
- logical: TRUE and FALSE (always capital letters)
- complex: not used in this unit.
typeof(2) # numbers by default are double
## [1] "double"
typeof(2L) # to force to be integer
## [1] "integer"
typeof(3.14)
## [1] "double"
typeof(TRUE)
## [1] "logical"
typeof("TRUE")
## [1] "character"
2.2 Vectors
The most basic type of R object is a vector. All the objects we have used so far are vectors of length 1. Vectors are variables with one or more values of the same type, e.g., all of numeric class. For example, a numeric vector might consist of the numbers (1.2, 2.3, 0.2, 1.1).
- Vectors are created by the c() function (the concatenation function)
- They can also be created by the vector() function: v <- vector("numeric", length=5)
- A vector should contain objects of the same class
- If you put in objects of different classes, an implicit coercion happens (the class of some values is changed)
- Vectors can also be created using the seq and rep functions.
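As a small illustration of implicit coercion, the sketch below mixes two classes at a time and checks which type R settles on. The variable names are made up for this example.

```r
# Each call mixes two classes; typeof() shows the type R settles on.
lgl_int <- c(TRUE, 2L)    # logical + integer  -> integer
int_dbl <- c(1L, 2.5)     # integer + double   -> double
dbl_chr <- c(1.5, "a")    # double + character -> character

typeof(lgl_int)
typeof(int_dbl)
typeof(dbl_chr)
```

The pattern is that R moves to the more general type: logical, then integer, then double, then character.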
v1 <- c(5,7,9) # a vector called v1 is created.
v1
## [1] 5 7 9
print(v1)
## [1] 5 7 9
#this says v1 is a vector, or a sequence of objects, and the first one is 5.
v2 <- 3:35 # a sequence of consecutive integers is put in v2. The sequence starts from 3 and goes to 35
print(v2)
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [24] 26 27 28 29 30 31 32 33 34 35
# the first item is 3 and the 26th item is 28.
v3 <- c("Helo", "Hi", "Bye") # a vector of characters
print(v3)
## [1] "Helo" "Hi" "Bye"
v4 <- c(TRUE, TRUE, FALSE, TRUE, TRUE) # a vector of logical values
v4
## [1] TRUE TRUE FALSE TRUE TRUE
length(v4) # gives the length of a vector
## [1] 5
v5 <- seq(2,8) #another way of making a vector of consecutive numbers. Same as 2:8
v5
## [1] 2 3 4 5 6 7 8
v6 <- seq(from=3, to=10, by=2) # equally you can write seq(3,10,2)
print(v6)
## [1] 3 5 7 9
#learn more about seq() function by typing ? seq
?seq
v7 <- vector(mode="numeric", length=5) # another way of creating a vector
print(v7)
## [1] 0 0 0 0 0
v8 <- c(5, "a", 2) #different types, so a coercion happens. Be very careful about this.
print(v8)
## [1] "5" "a" "2"
#accessing elements of a vector
v8[1]
## [1] "5"
print(v8[2])
## [1] "a"
vv <- c(1,2,3)
vv
## [1] 1 2 3
vv[2] #prints the second item
## [1] 2
vv[2] <- 257 # changes the value stored in the second element
vv
## [1] 1 257 3
#to choose more than one element from a vector
x <- c(12.2, 52.3, 10.2, 11.1)
x[1] # only the first element
## [1] 12.2
x[c(1,3)] # the first and third elements
## [1] 12.2 10.2
# Adding an element to the end of a vector
v <- c(1,2,3)
print(v)
## [1] 1 2 3
v <- c(v, 100) # 100 is added to the end of a vector
print(v)
## [1] 1 2 3 100
# Create sequential data
x1 <- 0:10 # Assigns number 0 through 10 to x1
x2 <- 10:0 # Assigns number 10 through 0 to x2
x3 <- seq(10) # Counts from 1 to 10
x4 <- seq(30, 0, by = -3) # Counts down by 3
x <- c(1,3,6,9,0)
x
## [1] 1 3 6 9 0
x[-2] # all the elements except the second element
## [1] 1 6 9 0
x[3] <- 200 #modify an element
x
## [1] 1 3 200 9 0
# to empty a vector, assign NULL to it (rm(x) removes the variable entirely)
x <- NULL
x
## NULL
x <- c(2, 9, 7)
x
## [1] 2 9 7
y <- c(x, x, 10)
y
## [1] 2 9 7 2 9 7 10
round(seq(1,3,length=10), 2)
## [1] 1.00 1.22 1.44 1.67 1.89 2.11 2.33 2.56 2.78 3.00
seq(from = 2, by = -0.1, length.out = 4)
## [1] 2.0 1.9 1.8 1.7
x <- rep(3,4)
x
## [1] 3 3 3 3
rep(1:5,3)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
x <- c(7,3,5,2,0,1)
y <- x[-3]
y
## [1] 7 3 2 0 1
y <- x[-length(x)] # always deletes the final element
y
## [1] 7 3 5 2 0
2.3 Lists
Another basic object in R is a list. A list is very similar to a vector, but it can contain objects of different classes. You can create a list using the list() function. A major use of lists is to hold the outputs of functions. Later we will see an important example, the lm() function.
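To make the remark about function outputs concrete, here is a minimal sketch, assuming the built-in cars dataset as example data. It fits a linear model and inspects the result as a list.

```r
# lm() returns its results packed into a list-like object.
fit <- lm(dist ~ speed, data = cars)   # cars ships with R

is.list(fit)             # TRUE: the fitted model is stored as a list
names(fit)               # the named components of that list
fit$coefficients         # one component, accessed by name
fit[["coefficients"]]    # the same component, accessed with double brackets
```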
L1 <- list(5, "a", 2)
print(L1)
## [[1]]
## [1] 5
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] 2
# L1 has 3 elements, and each element is itself a vector
# pay attention to the double brackets: they index the elements of the list
L1 #auto printing
## [[1]]
## [1] 5
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] 2
length(L1)
## [1] 3
L2 <- list(c(1,2,3), c("One", "Two"), TRUE)
print(L2)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "One" "Two"
##
## [[3]]
## [1] TRUE
L2
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "One" "Two"
##
## [[3]]
## [1] TRUE
L1[1]
## [[1]]
## [1] 5
print(L2[1])
## [[1]]
## [1] 1 2 3
print(L2[[1]])
## [1] 1 2 3
2.4 Numbers
Numbers in R are considered numeric (real numbers with double precision). If you want an integer, you need to explicitly add L to the end of the number; otherwise it is a double.
Special numbers:
- Inf, infinity, for \(\frac{1}{0}\)
- NaN, not a number, for \(\frac{0}{0}\)
- NA can be thought of as a missing value
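The special values can be produced and tested directly. The sketch below uses made-up variable names and only base R test functions.

```r
pos_inf <- 1 / 0    # Inf
not_num <- 0 / 0    # NaN
absent  <- NA       # missing value

is.finite(pos_inf)  # FALSE
is.nan(not_num)     # TRUE
is.na(not_num)      # TRUE: NaN also counts as missing
is.nan(absent)      # FALSE: NA is missing, but it is not NaN
```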
x <- 1
print(x)
## [1] 1
class(x)
## [1] "numeric"
typeof(x)
## [1] "double"
y <- 1L
print(y)
## [1] 1
class(y)
## [1] "integer"
typeof(y)
## [1] "integer"
c1 <- "Heloo" # character variable
c2 <- "The World!" # another character variable
paste(c1, c2)
## [1] "Heloo The World!"
print(c(c1, c2))
## [1] "Heloo" "The World!"
sqrt(-2) #NaN stands for not a number
## Warning in sqrt(-2): NaNs produced
## [1] NaN
2.5 Changing Class of a Value
You saw that a vector contains values of only one class. If values of different classes are mixed in one vector, an implicit coercion happens: R converts all the values to a single common class. However, sometimes we want to change the type of a value ourselves, so we apply an explicit coercion with the as.SomeClass() functions.
- as.numeric() changes the type to numeric, if possible
- as.logical() changes the type to logical, if possible
- as.character()
- as.complex()
- as.integer()
Sometimes R cannot convert one type to another and returns NA. In that case R also issues a warning.
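A few explicit coercions are sketched below, including cases where the conversion fails and gives NA. The variable names are illustrative only. suppressWarnings() is used just to keep the output quiet.

```r
whole   <- as.integer("12")                      # 12L
chopped <- as.integer(3.9)                       # 3L: the fraction is dropped, not rounded
flags   <- as.logical(c("TRUE", "true", "T", "yes"))   # TRUE TRUE TRUE NA
nonzero <- as.logical(0:2)                       # FALSE TRUE TRUE
partial <- suppressWarnings(as.numeric(c("1.5", "abc")))  # 1.5 NA
```

Only strings that look like numbers (or like TRUE/FALSE) can be converted; everything else becomes NA.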
x <- 1:5 #sequence of numbers
class(x)
## [1] "integer"
y <- as.numeric(x)
class(y)
## [1] "numeric"
z <- as.logical(x)
print(z)
## [1] TRUE TRUE TRUE TRUE TRUE
class(z)
## [1] "logical"
u <- as.character(z)
print(u)
## [1] "TRUE" "TRUE" "TRUE" "TRUE" "TRUE"
class(u)
## [1] "character"
t <- as.numeric(u)
## Warning: NAs introduced by coercion
t
## [1] NA NA NA NA NA
class(t)
## [1] "numeric"
# a list has no problem with mixing data types. Very powerful!
x <- list(14, "Hello", TRUE, list(23, "Hi", TRUE, FALSE))
x
## [[1]]
## [1] 14
##
## [[2]]
## [1] "Hello"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [[4]][[1]]
## [1] 23
##
## [[4]][[2]]
## [1] "Hi"
##
## [[4]][[3]]
## [1] TRUE
##
## [[4]][[4]]
## [1] FALSE
print(x)
## [[1]]
## [1] 14
##
## [[2]]
## [1] "Hello"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [[4]][[1]]
## [1] 23
##
## [[4]][[2]]
## [1] "Hi"
##
## [[4]][[3]]
## [1] TRUE
##
## [[4]][[4]]
## [1] FALSE
# elements of a list are printed with double brackets around their index; other objects use single brackets
2.6 Factors
Categorical data in R are represented using factors. We will learn a lot about this type of data soon. Factors are stored as integers, and each integer is assigned a label. By default, R sorts the levels of a factor in alphabetical order. Factors can be ordered or unordered: R treats plain factors as nominal categorical variables and "ordered" factors as ordinal categorical variables.
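Because factors are stored as integer codes, converting a factor of number-like labels straight to a number returns the codes rather than the labels. The following sketch, with a made-up factor called readings, shows the difference.

```r
readings <- factor(c("10", "20", "10", "30"))

codes  <- as.integer(readings)                 # 1 2 1 3: the internal codes
values <- as.numeric(as.character(readings))   # 10 20 10 30: the labels as numbers
```

Going through as.character() first is a common way to recover the original numbers.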
x <- factor(c("male", "fmale", "male", "male", "fmale", "male")) #create a factor object
print(x)
## [1] male fmale male male fmale male
## Levels: fmale male
levels(x) #alphabetical order
## [1] "fmale" "male"
nlevels(x)
## [1] 2
unclass(x)
## [1] 2 1 2 2 1 2
## attr(,"levels")
## [1] "fmale" "male"
table(x) #gives frequency count
## x
## fmale male
## 2 4
levels(x)
## [1] "fmale" "male"
summary(x)
## fmale male
## 2 4
#change the order of levels
#this is important in linear regression. The first level is used as the baseline level.
x <- factor(c("male", "fmale", "male", "male", "fmale", "male"), levels=c("male", "fmale"))
print(x)
## [1] male fmale male male fmale male
## Levels: male fmale
d <- c(1,1,2,3,1,3,3,2)
d[1]+d[2] # ordinary numbers, so addition works
## [1] 2
fd <- factor(d)
print(fd)
## [1] 1 1 2 3 1 3 3 2
## Levels: 1 2 3
fd[1]+fd[2] #factors, you will get warning
## Warning in Ops.factor(fd[1], fd[2]): '+' not meaningful for factors
## [1] NA
unclass(fd) # bring down to integer vector
## [1] 1 1 2 3 1 3 3 2
## attr(,"levels")
## [1] "1" "2" "3"
rd <- factor(d, labels=c("A", "B", "C")) # a factor is an integer vector where each integer has a label
print(rd)
## [1] A A B C A C C B
## Levels: A B C
levels(rd) <- c("AA", "BB", "CC")
print(rd)
## [1] AA AA BB CC AA CC CC BB
## Levels: AA BB CC
is.factor(d)
## [1] FALSE
is.factor(fd)
## [1] TRUE
#ordered factor variable
x1 <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
print(x1)
## [1] low high medium high low medium high
## Levels: high low medium
x1f <- factor(x1, levels = c("low", "medium", "high"))
print(x1f)
## [1] low high medium high low medium high
## Levels: low medium high
x1o <- ordered(x1, levels = c("low", "medium", "high"))
print(x1o)
## [1] low high medium high low medium high
## Levels: low < medium < high
min(x1o) ## works!
## [1] low
## Levels: low < medium < high
is.factor(x1o)
## [1] TRUE
attributes(x1o)
## $levels
## [1] "low" "medium" "high"
##
## $class
## [1] "ordered" "factor"
With the gl() function we can generate factor levels. It takes two integers as input, which indicate how many levels there are and how many times each level is repeated.
- gl(n, k, labels)
- n is the number of levels
- k is the number of repetitions
- labels is a vector of labels
v <- gl(3, 4, labels = c("H1", "H2","H3"))
print(v)
## [1] H1 H1 H1 H1 H2 H2 H2 H2 H3 H3 H3 H3
## Levels: H1 H2 H3
class(v)
## [1] "factor"
2.7 Missing Values
A variable might not have a value, or its value might be missing. In R, missing values are displayed by the symbol NA (not available).
- NA stands for "not available"
- NA makes certain calculations impossible unless it is handled explicitly
- is.na() tests for missing values
- is.nan() tests for NaN (not a number)
- NA values have a class too
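The points above can be checked in a few lines. This is a small sketch with illustrative variable names.

```r
cmp      <- NA > 1                  # NA: the comparison cannot be decided
total    <- sum(c(1, NA))           # NA: the missing value propagates
total_ok <- sum(c(1, NA), na.rm = TRUE)   # 1: the NA is dropped first
has_na   <- anyNA(c(4, NA, 2))      # TRUE

class(NA)             # "logical": the default NA is a logical value
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"
```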
x1 <- c(4, 2.5, 3, NA, 1)
summary(x1) # Works with NA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.125 2.750 2.625 3.250 4.000 1
mean(x1) # Returns NA because of the missing value
## [1] NA
mean(x1, na.rm=TRUE)
## [1] 2.625
is.na(x1)
## [1] FALSE FALSE FALSE TRUE FALSE
# To find missing values
which(is.na(x1)) # Give index number
## [1] 4
# Ignore missing values with na.rm = T
mean(x1, na.rm = T)
## [1] 2.625
# Replace missing values with 0 (or other number)
# In data wrangling you will learn a lot about this.
x2 <- x1
x2[is.na(x2)] <- 0
x2
## [1] 4.0 2.5 3.0 0.0 1.0
2.8 Subsetting
- [] always returns an object of the same class as the original, and it can select more than one element
- [[]] is used to extract an element from a list or dataframe. It always returns a single element.
- $ extracts an element from a list or dataframe using its name
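The three operators can be compared side by side. The sketch below uses a made-up list (scores) and a made-up dataframe (people).

```r
scores <- list(maths = c(70, 85), art = "A")
single  <- scores["maths"]     # still a list (of length 1)
inner   <- scores[["maths"]]   # the numeric vector itself
by_name <- scores$maths        # same result as scores[["maths"]]

people <- data.frame(age = c(31, 45), city = c("Perth", "Hobart"))
col_df  <- people["age"]       # a one-column dataframe
col_vec <- people[["age"]]     # a plain numeric vector
```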
x <- c("a1", "a2", "a3", "a4", "a5", "a6")
x[1] #extracts the first item. it's a vector
## [1] "a1"
x[2:5] # extracts a sequence. it's a vector
## [1] "a2" "a3" "a4" "a5"
x <- list(prime=c(2,3,5,7), even=c(0,2,4,6), odd=c(1,3,5,7), digit=3.14)
print(x)
## $prime
## [1] 2 3 5 7
##
## $even
## [1] 0 2 4 6
##
## $odd
## [1] 1 3 5 7
##
## $digit
## [1] 3.14
print(x[1]) #extracts the first element of the list, and it is a list
## $prime
## [1] 2 3 5 7
class(x[1])
## [1] "list"
print(x[[1]]) #extracts the first element and returns a vector.
## [1] 2 3 5 7
print(x[4])
## $digit
## [1] 3.14
print(x[[4]])
## [1] 3.14
x$digit
## [1] 3.14
x[c(1,4)]
## $prime
## [1] 2 3 5 7
##
## $digit
## [1] 3.14
2.9 Vectorised Operations
Vectorised operations make life much easier! We can treat vectors as single variables in R. Sometimes we want to apply a particular calculation to all the members of a vector, or between two vectors.
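As a rough comparison, the sketch below computes the same result once with a vectorised expression and once with an explicit loop, and also shows how a shorter vector is recycled. The names prices, with_tax and looped are made up for the example.

```r
prices <- c(10, 20, 30, 40)

with_tax <- prices * 1.1            # one expression handles every element

looped <- numeric(length(prices))   # the same calculation, element by element
for (i in seq_along(prices)) {
  looped[i] <- prices[i] * 1.1
}

pair_sum <- prices + c(1, 2)        # the shorter vector is recycled: 11 22 31 42
```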
x <- 1:4
2*x
## [1] 2 4 6 8
y <- 2:5
print(x+y)
## [1] 3 5 7 9
x[x>2]
## [1] 3 4
print(x*y)
## [1] 2 6 12 20
print(x>y)
## [1] FALSE FALSE FALSE FALSE
# Matrices will be covered soon.
m1 <- matrix(1:4,2,2)
m2 <- matrix(2:5, 2,2)
m1+m2
## [,1] [,2]
## [1,] 3 7
## [2,] 5 9
m1*m2
## [,1] [,2]
## [1,] 2 12
## [2,] 6 20
m1%*%m2 #matrix multiplication
## [,1] [,2]
## [1,] 11 19
## [2,] 16 28
R can apply functions over entire vectors, and conditions can be used to select certain elements within a vector. Here is a list of the more frequently used functions:
- max(x)
- min(x)
- sum(x)
- mean(x)
- var(x)
- sd(x)
- median(x)
- range(x)
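Applied to a small made-up vector, these functions behave as follows.

```r
obs <- c(2, 4, 4, 5, 7)

max(obs)      # 7
min(obs)      # 2
sum(obs)      # 22
mean(obs)     # 4.4
median(obs)   # 4
range(obs)    # 2 7
var(obs)      # 3.3 (sample variance, dividing by n - 1)
sd(obs)       # square root of the variance
```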
3. Data Tables
3.1 Matrices
A matrix is a rectangular array of numbers. From a technical perspective, it is a vector with two additional attributes, namely the numbers of rows and columns. The vectors we considered so far were one-dimensional. Matrices are a special type of vector: they have a dimension attribute. In other words, matrices are two-dimensional vectors.
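The idea that a matrix is a vector plus a dimension attribute can be seen directly. In this sketch the names v and by_row are illustrative.

```r
v <- 1:6
dim(v) <- c(2, 3)      # give the vector 2 rows and 3 columns
is.matrix(v)           # TRUE: the same values are now a matrix
v[1, ]                 # 1 3 5: values were laid out column by column

by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row instead
by_row[1, ]            # 1 2 3
```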
m <- matrix(nrow=2, ncol=3) #empty matrix with dimension
m
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
print(m)
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
attributes(m)
## $dim
## [1] 2 3
dim(m)
## [1] 2 3
print(paste(dim(m)[1], " + ", dim(m)[2]))
## [1] "2 + 3"
m <- matrix(c(1,3,6,2,8,4), nrow=2, ncol=3 ) #matrices are built column-wise
print(m)
## [,1] [,2] [,3]
## [1,] 1 6 8
## [2,] 3 2 4
str(m) # one of the most important functions
## num [1:2, 1:3] 1 3 6 2 8 4
m[2,2]
## [1] 2
Other commonly used approaches to create matrix are cbind() and rbind().
#two other methods to create matrices
x <- c(1,11,111)
y <- c(2,22,222)
m1 <- cbind(x,y) #column-binding
print(m1)
## x y
## [1,] 1 2
## [2,] 11 22
## [3,] 111 222
print("****")
## [1] "****"
m2 <- rbind(x,y) #row-binding
print(m2)
## [,1] [,2] [,3]
## x 1 11 111
## y 2 22 222
3.2 Data Frames
Data frames are a very important object in R. When you have \(m\) observations with \(n\) attributes, you have a dataframe of size \(m\times n\). As the attributes can be of any class, a data frame is technically a list, with each component being a vector corresponding to a column in our data matrix. Dataframes are therefore a special type of list in which every element has the same length. Dataframes can store a different class of object in each column, whereas matrices must have the same class for every element.
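The list nature of a dataframe can be checked with a small made-up example. stringsAsFactors = FALSE is set explicitly here because its default changed in R 4.0: older versions turn character columns into factors, which is why some printed outputs in this section show Levels.

```r
people <- data.frame(id     = 1:3,
                     name   = c("Ann", "Bob", "Cy"),
                     member = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

is.list(people)                        # TRUE: a dataframe is a list of columns
length(people)                         # 3: one list element per column
col_classes <- sapply(people, class)   # the class of each column
col_classes
```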
# to create a dataframe
x <- c(1,2,3)
y <- c("a", "b", "c")
z <- c(TRUE, TRUE, FALSE)
df <- data.frame(x,y,z)
print(df)
## x y z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c FALSE
attributes(df)
## $names
## [1] "x" "y" "z"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
nrow(df)
## [1] 3
ncol(df)
## [1] 3
df[2,2]
## [1] b
## Levels: a b c
z <- data.frame(c(1,2), c(3,4))
z
## c.1..2. c.3..4.
## 1 1 3
## 2 2 4
class(z)
## [1] "data.frame"
z1 <- data.frame(cbind(c(1,2), c(3,4)))
z1
## X1 X2
## 1 1 3
## 2 2 4
class(z1)
## [1] "data.frame"
Names
x <- c(3,5,7)
names(x)
## NULL
names(x) <- c("low", "med", "high")
print(x)
## low med high
## 3 5 7
names(x)
## [1] "low" "med" "high"
x
## low med high
## 3 5 7
names(x) <- NULL
x
## [1] 3 5 7
y <- list(low=3, med=5, high=7)
print(y)
## $low
## [1] 3
##
## $med
## [1] 5
##
## $high
## [1] 7
# Access list elements by their name
y$low
## [1] 3
print(y$low)
## [1] 3
m <- matrix(1:6, nrow=3, ncol=2)
dimnames(m)<- list(c("a", "b", "c"), c("d", "e"))
print(m)
## d e
## a 1 4
## b 2 5
## c 3 6
colnames(m) <- c("male", "fmale")
rownames(m) <- c("ice-cream", "coffee", "cake")
print(m)
## male fmale
## ice-cream 1 4
## coffee 2 5
## cake 3 6
print(df)
## x y z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c FALSE
row.names(df) <- c("f1", "f2", "f3")
print(df)
## x y z
## f1 1 a TRUE
## f2 2 b TRUE
## f3 3 c FALSE
colnames(df) <- c("rank", "character", "value")
print(df)
## rank character value
## f1 1 a TRUE
## f2 2 b TRUE
## f3 3 c FALSE
names(df) <- c("r1", "r2", "r3")
print(df)
## r1 r2 r3
## f1 1 a TRUE
## f2 2 b TRUE
## f3 3 c FALSE
attributes(df)
## $names
## [1] "r1" "r2" "r3"
##
## $row.names
## [1] "f1" "f2" "f3"
##
## $class
## [1] "data.frame"
class(df)
## [1] "data.frame"
mode(df)
## [1] "list"
typeof(df)
## [1] "list"
x<- 5
print(x)
## [1] 5
names(x)
## NULL
names(x) <- c("low")
print(x)
## low
## 5
names(x)
## [1] "low"
attributes(x)
## $names
## [1] "low"
Matrices and dataframes are very similar to each other, as both are generally two-dimensional. However, matrices are extensions of vectors, and dataframes are extensions of lists. Matrices hold data of a single type. Therefore, when your data has different data types, use dataframes.
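One way to see the difference is to convert a mixed dataframe to a matrix. In this sketch (variable names are made up) everything is coerced to character, while an all-numeric dataframe stays numeric.

```r
mixed  <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)
as_mat <- as.matrix(mixed)
typeof(as_mat)       # "character": the numbers were turned into text

nums    <- data.frame(a = 1:2, b = c(0.5, 1.5))
num_mat <- as.matrix(nums)
typeof(num_mat)      # "double": a single numeric type fits all values
```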
m1<- matrix(1:25,5,5)
m1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
str(m1)
## int [1:5, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
is.matrix(m1)
## [1] TRUE
is.data.frame(m1)
## [1] FALSE
df1 <- as.data.frame(m1)
df1
## V1 V2 V3 V4 V5
## 1 1 6 11 16 21
## 2 2 7 12 17 22
## 3 3 8 13 18 23
## 4 4 9 14 19 24
## 5 5 10 15 20 25
str(df1)
## 'data.frame': 5 obs. of 5 variables:
## $ V1: int 1 2 3 4 5
## $ V2: int 6 7 8 9 10
## $ V3: int 11 12 13 14 15
## $ V4: int 16 17 18 19 20
## $ V5: int 21 22 23 24 25
# object.size() reports how much memory an object takes up in the computer
print(paste("the size of df1 is ", object.size(df1), " bytes and the size of m1 is ", object.size(m1), " bytes" ))
## [1] "the size of df1 is 1264 bytes and the size of m1 is 328 bytes"
3.3 Reading and Writing Data in R
Generally we read data from a file. In this unit we will focus on reading .txt (tab-delimited) and .csv (comma-separated values) data files. In all cases, we will read a data file into a dataframe, which is why being able to manipulate a dataframe is very important. You need to make sure that either the data file exists in your current working directory or you give a path to the location of the file. Besides the name of the file, you can pass a number of other parameters. Please see ?read.table or ?read.csv to get an idea.
- read.table() to read a .txt data file, and read.csv() for .csv files
- source() to bring .r files and make the code inside the file available
- write.table(), write.csv() to export data into a file.
mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
After working with a dataset, we might like to save it.
write.table(mydata, "c:/mydata.txt", sep="\t")
Important parameters
- header=TRUE: the first row is the header
- sep="\t": tab delimited
- sep=",": comma separated
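A round trip through temporary files shows these parameters in action. This is only a sketch: sample_df and the file paths are made up, and tempfile() is used so that nothing is written to your working directory.

```r
sample_df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.25))

# Comma-separated file
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_df, csv_path, row.names = FALSE)
read_back <- read.csv(csv_path, header = TRUE)

# Tab-delimited file
txt_path <- tempfile(fileext = ".txt")
write.table(sample_df, txt_path, sep = "\t", row.names = FALSE)
read_tab <- read.table(txt_path, header = TRUE, sep = "\t")
```

Setting row.names = FALSE keeps the row numbers out of the file, so the data read back has the same columns as the original.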
getwd() #gives you the current working directory
## [1] "\\\\ad.monash.edu/home/User005/xiaoleig/Documents/Other/Monash work/FIT5197/Tutes_new/Week 1"
#pay attention to the way that a directory is represented in your OS
dir() # a list of files and folders
## [1] "airfoil_self_noise.txt" "data.a1.txt"
## [3] "data.a2.txt" "data.b.txt"
## [5] "mydata.csv" "mydata222.txt"
## [7] "plot1.jpeg" "plot2.png"
## [9] "saving_plot4.pdf" "Tute_1.html"
## [11] "Tute_1.Rmd"
data <- read.table(file='airfoil_self_noise.txt')
str(data)
## 'data.frame': 1502 obs. of 6 variables:
## $ V1: int 1000 1250 1600 2000 2500 3150 4000 5000 6300 8000 ...
## $ V2: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V3: num 0.305 0.305 0.305 0.305 0.305 ...
## $ V4: num 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 ...
## $ V5: num 0.00266 0.00266 0.00266 0.00266 0.00266 ...
## $ V6: num 125 126 128 127 126 ...
dim(data)
## [1] 1502 6
head(data)
## V1 V2 V3 V4 V5 V6
## 1 1000 0 0.3048 71.3 0.00266337 125.201
## 2 1250 0 0.3048 71.3 0.00266337 125.951
## 3 1600 0 0.3048 71.3 0.00266337 127.591
## 4 2000 0 0.3048 71.3 0.00266337 127.461
## 5 2500 0 0.3048 71.3 0.00266337 125.571
## 6 3150 0 0.3048 71.3 0.00266337 125.201
write.csv(data, file="mydata.csv")
dir()
## [1] "airfoil_self_noise.txt" "data.a1.txt"
## [3] "data.a2.txt" "data.b.txt"
## [5] "mydata.csv" "mydata222.txt"
## [7] "plot1.jpeg" "plot2.png"
## [9] "saving_plot4.pdf" "Tute_1.html"
## [11] "Tute_1.Rmd"
write.table(data, file="mydata222.txt")
dir()
## [1] "airfoil_self_noise.txt" "data.a1.txt"
## [3] "data.a2.txt" "data.b.txt"
## [5] "mydata.csv" "mydata222.txt"
## [7] "plot1.jpeg" "plot2.png"
## [9] "saving_plot4.pdf" "Tute_1.html"
## [11] "Tute_1.Rmd"
# Split up data
a1 <- data[1:14, 1:3] # Starting data
a2 <- data[1:14, 4:6] # New columns to add
b <- data[15:16, ] # New rows to add
write.table(a1, "data.a1.txt", sep="\t")
write.table(a2, "data.a2.txt", sep="\t")
write.table(b, "data.b.txt", sep="\t")
rm(list=ls()) # Clear out everything to start fresh
# Import data
a1t <- read.table("data.a1.txt", sep="\t")
a2t <- read.table("data.a2.txt", sep="\t")
3.4 Managing your files
- getwd(): gets the current working directory, in essence where you are now
- setwd(): changes the working directory
- dir(): gives you a list of all files and folders
- ls(): lists all existing variables
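These functions can be combined as in the sketch below, which moves to a scratch directory, creates a file, and then returns. The file name example_note.txt is hypothetical.

```r
old_wd <- getwd()                            # remember where we started
setwd(tempdir())                             # move to a scratch directory
invisible(file.create("example_note.txt"))   # create an empty file there
note_listed <- "example_note.txt" %in% dir() # dir() now lists it
unlink("example_note.txt")                   # tidy up the scratch file
setwd(old_wd)                                # go back to the original directory
```

Saving the old directory first makes it easy to return to where you were.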
#options() # gives you the setting of R. Most of its parameters are not changeable in jupyterhub
3.5 Built-in Datasets
There are plenty of interesting datasets already available in R. In fact, there is a package, datasets, which is installed by default and contains many of them. We will use these built-in datasets a lot.
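Built-in datasets can be used by name without loading anything. A quick sketch with two of them:

```r
class(airmiles)     # "ts": a time series
length(airmiles)    # 24 yearly values
head(mtcars, 3)     # first three rows of a built-in dataframe
dim(mtcars)         # 32 rows and 11 columns
```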
#To see a list of the available datasets
data()
?airmiles
str(airmiles)
## Time-Series [1:24] from 1937 to 1960: 412 480 683 1052 1385 ...
3.6 Packages
Packages are collections of R functions that are ready to use.
- library(): with no arguments, shows all installed packages
- search(): shows the packages currently loaded
- install.packages(): installs a package. You don't need this in jupyterhub.
- require(): loads a package so that you can use it
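A script may want to check whether a package is installed before loading it. The sketch below is one possible pattern, not the only one; the package name aPackageThatDoesNotExist is deliberately fictitious.

```r
pkg <- "ggplot2"
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)    # attach the package if it is installed
} else {
  message("Package ", pkg, " is not installed.")
}

# require() returns FALSE (with a warning) instead of stopping with an error
loaded_ok <- suppressWarnings(
  require("aPackageThatDoesNotExist", character.only = TRUE, quietly = TRUE)
)
loaded_ok    # FALSE
```

library() stops with an error when a package is missing, while require() returns FALSE, so require() suits this kind of check.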
# See current packages
search() # Shows packages that are currently loaded
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
# TO INSTALL AND USE PACKAGES
# Can use menus: Tools > Install Packages... (or use Package window)
# Or can use scripts, which can be saved and incorporated in source files
#install.packages("ggplot2") # Downloads package from CRAN and installs in R
# Make package available;
require("ggplot2")
## Loading required package: ggplot2
3.7 Frequently used functions
- length(object) # number of elements or components
- str(object) # structure of an object
- class(object) # class or type of an object
- names(object) # names
- c(object,object,…) # combine objects into a vector
- cbind(object, object, …) # combine objects as columns
- rbind(object, object, …) # combine objects as rows
- ls() # list current objects
- rm(object) # delete an object
- sort()
# sort is another useful function
x <- c(2,5,3,9,4,1)
x
## [1] 2 5 3 9 4 1
sort(x, decreasing = FALSE)
## [1] 1 2 3 4 5 9
sort(x, decreasing = TRUE)
## [1] 9 5 4 3 2 1
R Built-in Functions
To use R's built-in functions we need to know their arguments. A function takes arguments as input and returns an object as output.
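Arguments can be matched by position or by name. The sketch below uses round(), whose arguments are x and digits, with made-up result names.

```r
by_position <- round(3.14159, 2)               # arguments matched in order
by_name     <- round(x = 3.14159, digits = 2)  # arguments matched by name
reordered   <- round(digits = 2, x = 3.14159)  # named arguments can come in any order
defaulted   <- round(3.14159)                  # digits falls back to its default of 0
```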
x <- 1:10
sum(x)
## [1] 55
length(x)
## [1] 10
median(x)
## [1] 5.5
? seq
#Type the name of the function without any parentheses or arguments
seq
## function (...)
## UseMethod("seq")
## <bytecode: 0x00000000174ba060>
## <environment: namespace:base>
#if you see UseMethod, there are multiple methods (functions)
#associated with the seq function
### some functions might be hidden!
methods(seq)
## [1] seq.Date seq.default seq.POSIXt
## see '?methods' for accessing help and source code
#seq.Date
seq()
## [1] 1
args(seq)
## function (...)
## NULL
args(round)
## function (x, digits = 0)
## NULL
ls
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
## pattern, sorted = TRUE)
## {
## if (!missing(name)) {
## pos <- tryCatch(name, error = function(e) e)
## if (inherits(pos, "error")) {
## name <- substitute(name)
## if (!is.character(name))
## name <- deparse(name)
## warning(gettextf("%s converted to character string",
## sQuote(name)), domain = NA)
## pos <- name
## }
## }
## all.names <- .Internal(ls(envir, all.names, sorted))
## if (!missing(pattern)) {
## if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
## ll != length(grep("]", pattern, fixed = TRUE))) {
## if (pattern == "[") {
## pattern <- "\\["
## warning("replaced regular expression pattern '[' by '\\\\['")
## }
## else if (length(grep("[^\\\\]\\[<-", pattern))) {
## pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
## warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
## }
## }
## grep(pattern, all.names, value = TRUE)
## }
## else all.names
## }
## <bytecode: 0x0000000013a54ce0>
## <environment: namespace:base>
6. Simulation
6.1 Generating Random Numbers
Here are the functions for probability distributions in R. They help us simulate variables from given probability distributions.
- rnorm: generates random normal variates
- pnorm: evaluates the cumulative distribution function of the normal distribution
- dnorm: evaluates the normal probability density
- qnorm: evaluates the quantile function
For each probability distribution, there are four related functions, distinguished by their prefix:
- d for density
- r for random number generation
- p for cumulative distribution
- q for quantile function
Examples:
- dnorm(x, mean=0, sd=1, log=FALSE)
- pnorm(q, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
- qnorm(p, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
- rnorm(n, mean=0, sd=1)
If \(F\) is the cumulative distribution function of a standard normal distribution, then \(\text{pnorm}(q)=F(q)\) and \(\text{qnorm}(p)= F^{-1}(p)\).
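The relationship between the four prefixes can be checked numerically. This is a small sketch with illustrative variable names; the seed value is arbitrary.

```r
p    <- pnorm(1.96)    # F(1.96), roughly 0.975
q    <- qnorm(p)       # F^{-1}(p), which takes us back to 1.96
peak <- dnorm(0)       # density at the mean: 1 / sqrt(2 * pi)

set.seed(1)
draws <- rnorm(10000, mean = 5, sd = 2)   # random draws centred near 5
mean(draws)

u <- punif(0.3)        # the same prefixes work for other distributions: 0.3
```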
#Simulation
# rnorm, dnorm, pnorm,
x <- rnorm(10)
x
## [1] 0.7170349 -1.1499620 0.6184537 -0.6282082 1.1563672 -0.3281488
## [7] 0.1507257 0.1272213 -1.3083004 -1.0102714
x <- rnorm(10,20,2)
x
## [1] 19.67680 20.98605 21.33224 21.49759 18.40721 22.18899 21.81866
## [8] 21.21492 21.12033 22.02307
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.41 21.02 21.27 21.03 21.74 22.19
set.seed(1)
rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
rnorm(5)
## [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
set.seed(1)
rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
rnorm(5)
## [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
ppois(2,2) ##cumulative distribution
## [1] 0.6766764
##Pr(x<=2)
ppois(4,2) ##Pr(x<=4)
## [1] 0.947347
set.seed(20)
x <- rnorm(100)
e <- rnorm(100,0,2)
y <- 0.5+2*x+e
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.4084 -1.5402 0.6789 0.6893 2.9303 6.5052
plot(x,y)

6.2 Random Sampling
The sample() function draws randomly from a specified set of (scalar) objects, allowing you to sample from arbitrary distributions.
Summary:
- Drawing samples from a specific probability distribution can be done with the r- functions
- Standard distributions are Normal, Poisson, Binomial, Exponential, Gamma, etc.
- The sample() function can be used to draw random samples from arbitrary vectors
- Setting the random number generator seed via set.seed() is critical for reproducibility.
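The effect of set.seed() and of unequal sampling probabilities can be sketched as follows. The seed 42 and the 70/30 coin are arbitrary choices made for illustration.

```r
set.seed(42)
first_draw <- sample(1:100, 5)

set.seed(42)                      # same seed, so the same "random" draw
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)   # TRUE

# A biased coin: "H" is drawn with probability 0.7, "T" with probability 0.3
flips <- sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.7, 0.3))
table(flips)
```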
set.seed(1)
sample(1:10, 4) # without replacement
## [1] 3 4 5 7
sample(1:10,4)
## [1] 3 9 8 5
sample(letters, 5)
## [1] "q" "b" "e" "x" "p"
sample(1:10) #permutation
## [1] 4 7 10 6 9 2 8 3 1 5
sample(1:10)
## [1] 2 3 4 1 9 5 10 8 6 7
sample(1:10, replace=TRUE) #sample with replacement
## [1] 2 9 7 8 2 8 5 9 7 8
7. Plotting
7.1 Building graphics from data
Dataframes are a powerful tool for organizing and visualizing data. However, it is hard to interpret large data sets, no matter how organized they are. Sometimes it is much easier to interpret graphs than numbers.
Some of the key base plotting functions
- plot(): plots based on the object type of the input
- lines(): add lines to the plot (just connect dots)
- points(): add points
- text(): add text labels to a plot using x,y coordinates
- title(): add titles
- mtext(): add arbitrary text to the margin
- axis(): add axis ticks/labels
Some important parameters
- pch: the plotting symbol (plotting character)
- lty: the line type; solid, dashed, …
- lwd: the line width; lwd=2
- col: color; col="red"
- xlab: x-axis label; xlab="units"
- ylab: y-axis label; ylab="price"
plot(c(2,3), c(3,4))

x <- seq(-2*pi,2*pi,0.1)
plot(x, sin(x),
main="my Sine function",
xlab="the values",
ylab="the sine values")

Different values for type
- "p": points (default)
- "l": lines
- "b": both points and lines
- "c": empty points joined by lines
- "o": overplotted points and lines
- "s" and "S": stair steps
- "h": histogram-like vertical lines
- "n": does not produce any points or lines
x <- seq(-2*pi,2*pi,0.1)
plot(x, sin(x),
main="my Sine function",
xlab="the values",
ylab="the sine values",
type="s",
col="blue")

Calling plot() multiple times will replace the previous graph with the new one each time. However, sometimes we wish to overlay the plots in order to compare the results. This is done with the functions lines() and points(), which add lines and points, respectively, to the existing plot.
plot(x, sin(x),
main="Overlaying Graphs",
type="l",
col="blue")
lines(x,cos(x), col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red")
)

By setting some graphical parameters we can put several graphs in a single plot. The par() function sets global graphics parameters, which control the way our graphs are displayed.
- before making any change, record the current parameters with oldpar <- par()
- las: the orientation of axis labels on the plot
- bg: the background color
- mar: the margin size
- oma: the outer margin size
- mfrow: a vector c(rows, columns) giving the layout of the plotting grid (plots are filled row-wise)
- mfcol: the same layout vector, but plots are filled column-wise
- at the end, restore the settings with par(oldpar) and ignore the warning messages about read-only parameters.
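One way to avoid the warnings when restoring the settings is to save only the parameters that can be changed, using par(no.readonly = TRUE). The sketch below is a variation on the approach described above, not a replacement for it. It draws to a throwaway device, pdf(NULL), so that no file is written.

```r
pdf(NULL)                            # throwaway graphics device
oldpar <- par(no.readonly = TRUE)    # save only the settable parameters

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))   # two plots side by side
hist(airquality$Temp, main = "Histogram")
boxplot(airquality$Temp, main = "Boxplot")

par(oldpar)                          # restore the saved settings
restored <- par("mfrow")             # back to a single plot per page
invisible(dev.off())
```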
#par() # to see all the parameters
par("mar") # to see the margins, bottom, left, top, right
## [1] 5.1 4.1 4.1 2.1
# par(mfrow=c(1,2)) # set the plotting area into a 1*2 array
oldpar <- par()
# make labels and margins smaller
par(cex=0.7, mai=c(0.1,0.1,0.2,0.1))
Temperature <- airquality$Temp
# define area for the histogram
par(fig=c(0.1,0.7,0.3,0.9))
hist(Temperature)
# define area for the boxplot
par(fig=c(0.8,1,0,1), new=TRUE)
boxplot(Temperature)
# define area for the stripchart
par(fig=c(0.1,0.67,0.1,0.25), new=TRUE)
stripchart(Temperature, method="jitter")

par(oldpar)
## Warning in par(oldpar): graphical parameter "cin" cannot be set
## Warning in par(oldpar): graphical parameter "cra" cannot be set
## Warning in par(oldpar): graphical parameter "csi" cannot be set
## Warning in par(oldpar): graphical parameter "cxy" cannot be set
## Warning in par(oldpar): graphical parameter "din" cannot be set
## Warning in par(oldpar): graphical parameter "page" cannot be set
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
plot(mtcars$wt, mtcars$mpg, main="MPG and weight", col="blue", pch=5)
abline(lm(mtcars$mpg~mtcars$wt), col="red", lwd=3)

plot(mtcars$wt, mtcars$mpg,
main="MPG and weight",
col="blue",
pch=5,
xlab="wt",
ylab="mpg")
abline(lm(mtcars$mpg~mtcars$wt), col="red", lwd=3)

oldpar <- par()
par(mfrow = c(1,2))
hist(islands, breaks = 16)
boxplot(islands)

par(oldpar)
## Warning in par(oldpar): graphical parameter "cin" cannot be set
## Warning in par(oldpar): graphical parameter "cra" cannot be set
## Warning in par(oldpar): graphical parameter "csi" cannot be set
## Warning in par(oldpar): graphical parameter "cxy" cannot be set
## Warning in par(oldpar): graphical parameter "din" cannot be set
## Warning in par(oldpar): graphical parameter "page" cannot be set
drawFun <- function(f){
x <- seq(-5, 5, len=1000)
y <- sapply(x, f)
plot(x, y, type="l", col="blue")
}
drawFun(sin)
abline(h=0, col="red", lwd=3, lty=1)
abline(v=2, col="green", lwd=3, lty=2)
abline(2,1, col="pink", lwd=3, lty=3)

#develop a function which overlays a normal approximation density function and kernel density function over a histogram
funn <- function(x){
h <- hist(x, col="red", breaks=10, freq=FALSE)
xfit<-seq(min(x)-10,max(x)+10,length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
#yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
d <- density(x) # returns the kernel density estimate of x
lines(d, col="green", lwd=2)
}
funn(mtcars$mpg)

7.2 Saving Graphs
Temperature <- airquality$Temp
#to save as a jpeg to the current directory
jpeg(file="plot1.jpeg")
hist(Temperature, col="darkgreen")
dev.off()
## png
## 2
#saving as a png
png(file="plot2.png",
width=600, height=350)
hist(Temperature, col="gold")
dev.off()
## png
## 2
#saving as a pdf file
pdf(file="saving_plot4.pdf")
hist(Temperature, col="violet")
dev.off()
## png
## 2
x <- seq(-4,4, 0.01)
y <- sin(x)
plot(x,y, ylim=c(-2,7), type="l", col="blue")
lines(c(1.5,2.5,3),c(3,3,5), col="red")
lines(c(0,0.5,1),c(0,2,0), col="green")

plot(c(1,2,3), c(1,2,4))

x <- c(1,2,3)
y <- c(1,3,8)
plot(x,y)
lmout <- lm(y ~ x)
abline(lmout)

plot() is a generic function, meaning that it is a placeholder for a family of functions. The function that actually gets called depends on the class of the object on which it is called. Using plot(), you can add components one by one.
- abline() adds a straight line to the current graph
- lines() takes a vector of x values and a vector of y values, and joins the points to each other
- points() adds a set of (x,y)-points
- legend() adds a legend to a multicurve graph
- text() places some text anywhere in the current graph
- mtext() adds text in the margins
- polygon() draws arbitrary polygonal objects
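The dispatch on class can be seen by plotting objects of different classes and by looking up the methods behind plot(). This sketch uses the built-in airmiles and cars datasets and draws to a throwaway device; the label text and its coordinates are arbitrary.

```r
pdf(NULL)                      # throwaway graphics device

plot(airmiles)                 # a time series: the "ts" method draws a line plot
plot(cars)                     # a dataframe: the "data.frame" method draws a scatter plot

# Components are then added one by one to the current graph
abline(lm(dist ~ speed, data = cars), col = "red")
text(10, 100, "stopping distance")
mtext("cars dataset", side = 3)

# The methods that plot() dispatches to can be looked up by class
ts_method <- getS3method("plot", "ts")
df_method <- getS3method("plot", "data.frame")

invisible(dev.off())
```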
plot(c(0,2,3), c(1,2,4))

x <- c(0,2,3)
y <- c(1,3,8)
plot(x,y) # same as before
fit <- lm(y ~ x) # a regression line
#The call to abline() then adds a line to the current graph.
#abline(c(2,1)) adds y = x + 2
abline(fit) #adds a line to a plot.
abline(h=1, col="red")
abline(v=2, col="blue")
abline(3,4, col="green") # y=4x+3 (intercept 3, slope 4)

plot(x,y, type="l", col="blue")
lines(c(1.5,2.5),c(3,3), col="red")
text(2.5,4,"R is COOL")

f <- function(x) return(sin(x))
curve(f,0,2)
polygon(c(1.2,1.4,1.4,1.2),c(0,0,f(1.3),f(1.3)),col="gray")

f <- function(x) return(1-exp(-x))
curve(f,0,2)
polygon(c(1.2,1.4,1.4,1.2),c(0,0,f(1.3),f(1.3)),col="gray")

plot(x,y)
lines(lowess(x,y))

g <- function(t) { return((t^2+1)^0.5) } # define g(); the whole expression must be inside return()
x <- seq(0,5,length=10000) # x = [0, 0.0005, 0.0010, ..., 5] (approximately)
y <- g(x) # y = [g(0), g(0.0005), g(0.0010), ..., g(5)]
plot(x,y,type="l")

curve((x^2+1)^0.5,0,5)

x <- c(0,2,3)
y <- c(1,3,8)
plot(x,y) # same as before
fit <- lm(y ~ x) # a regression line
#The call to abline() then adds a line to the current graph.
#abline(c(2,1)) adds y = x + 2
abline(fit) #adds a line to a plot.
abline(h=1, col="red")
abline(v=2, col="blue")
abline(3,4, col="green") # y=4x+3 (intercept 3, slope 4)
curve((x^2+1)^0.5,0,5,add=T, col="yellow")

f <- function(x) return((x^2+1)^0.5)
plot(f,0,5) # the argument must be a function name
