Aside
Last time we mentioned the set of reserved words and noted that it’s better to use TRUE or FALSE rather than T or F, because T and F are ordinary variables that can be reassigned.
# Run this as a reminder to use "TRUE"
T <- FALSE
# You can't do this:
TRUE <- FALSE
# Check reserved words:
?Reserved
Control structures and Functions
This tutorial will cover basic control structures in R, writing functions, and some of the apply family of functions.
I assume some basic familiarity with programming, such as a first class in programming.
A simple function
We’ll start off with the structure of a basic function and then build on that once we’ve gone over some control structures.
Here is the basic syntax of a function:
fname <- function(arg1, arg2, arg3 = default.value, etc.){
function.body
}
The function defined in the function body will be assigned to the variable fname. The arguments arg1, arg2, and arg3 are the values passed to the function and are available to operations in the function body.
Here’s a simple example:
echo <- function(arg ){
cat(arg, "\n")
}
echo("arrrrrrrg")
You can define a function on a single line without curly braces, but for now I’d recommend using braces to define the function body.
The cat function writes its arguments to the console, and it can take a combination of strings and variables. (The "\n" is equivalent to hitting return after writing a sentence.)
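To illustrate how cat mixes strings and variables, here is a minimal sketch. The variable names n_rows and label are made up for this example.

```r
# Hypothetical values used only for illustration
n_rows <- 3
label <- "rows"

# cat() writes its arguments one after another, separated by a space by default
cat("The data has", n_rows, label, "\n")

# sep controls the separator; here nothing is inserted between the pieces
cat("file", 1, ".csv", "\n", sep = "")

# cat() is called for its side effect and returns NULL invisibly
result <- cat("done", "\n")
is.null(result)
```

Because cat returns NULL, it is useful for printing messages but not for building a string to reuse later; paste, which appears further down, does that.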
Here’s another simple example:
add <- function(a,b){
a + b
}
add(3,3)
# with an explicit call to return()
add2 <- function(a,b){
res <- a + b
return(res)
}
add2(2,3)
Control structures
Control structures determine the flow of execution in a program. The most basic example of controlling program flow is branching on some condition.
x = 5
if(x == 4) {
cat("it's a four")
} else {
cat("it's not a four")
}
The “==” is a relational operator that checks for equality. The relational operators are
==
!=
<
>
<=
>=
and the logical operators are
!
&
&&
|
||
The “&” is the logical “and” operator, which checks whether both of its operands are true, and “|” is the “or” operator. These act element-wise and will return a logical vector if multiple values are being checked.
x <- 1:10
x[(x > 3) & (x<2)]
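The double operators (&& and ||) work on single values and short-circuit: the right-hand side is evaluated only if it is needed to decide the result. In older versions of R they silently used only the first element of a longer vector; since R 4.3.0, passing a vector of length greater than one is an error. In control flow statements (if, while) where only one element is being compared, either form can be used. See the documentation or the answers to this stackoverflow question for more information (be sure to read more than the first response).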
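The difference between the single and double operators can be sketched as follows. The variables y and safe are made up for this example, and && is only given single values so the code behaves the same across R versions.

```r
x <- 1:10

# & and | compare element by element and return a logical vector
x > 3 & x < 7
x[x > 3 & x < 7]   # elements strictly between 3 and 7
x[x < 2 | x > 9]   # elements at either end

# && and || expect a single TRUE/FALSE on each side and short-circuit:
# the right-hand side is skipped once the left-hand side decides the result
y <- NULL
safe <- !is.null(y) && y > 0   # y > 0 is never evaluated here
safe
```

Short-circuiting is what makes guards like the last one useful: the second test only runs when the first one has confirmed it is safe to do so.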
R has no dedicated ternary operator, but the ifelse function plays a similar role and is vectorized. Here’s a comparison of using ifelse and using if ... else syntax.
a = 2
ifelse( a==2, "two", "not two")
a = 3
if(a==2) "two" else "not two"
In the last statement no braces were needed because each branch is a single expression completed on the same line. I would recommend using braces unless you have a strong preference or until you have more experience programming.
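One practical difference between the two forms is that ifelse tests every element of a vector, while if ... else chooses a branch from a single condition. A small sketch, with the names labels and first chosen only for illustration:

```r
a <- c(1, 2, 3)

# ifelse() is vectorized: it tests every element and returns a vector
labels <- ifelse(a == 2, "two", "not two")
labels

# if ... else works on a single condition and picks one branch
first <- if (a[1] == 2) "two" else "not two"
first
```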
for, while, switch statements
Any for loop can be rewritten as a while loop, and many while loops can be rewritten as for loops.
For loops are usually used to iterate over an index or some number of elements in an array or vector.
for(k in 1:10){
a = 5
cat( paste("iteration:", k, sep = " "), "\n")
}
k = 1
while(k <= 1024){
cat( paste("product:", k, sep = " "), "\n")
k = k*2
}
There is another way to create for loops that can be safer than the for syntax mentioned above.
# seq_along
# this behavior is unexpected for a zero-length vector:
# 1:length(xx) is 1:0, so the loop runs for k = 1 and then k = 0
xx <- NULL
for(k in 1:(length(xx))){
cat("k: ", k, "\n")
}
for(k in seq_along(xx)){
cat("k: ", k, "\n")
}
xx <- 3:20
length(xx)
new_vec <- double(length(xx))
for(k in seq_along(xx)){
cat("k: ", k, "\n")
new_vec[k] <- k + 5
}
new_vec
The seq_along() function can be used to iterate over structures that have some length or dimension.
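To see why the two loops behave differently on an empty input, it can help to look at the index vectors themselves. This is a small sketch; seq_len is a related base R helper that is not used elsewhere in this tutorial, and the variable names are made up.

```r
empty <- NULL

# 1:length(empty) is 1:0, which counts down from 1 to 0
bad_index <- 1:length(empty)
bad_index

# seq_along() returns an empty index, so a loop over it never runs
good_index <- seq_along(empty)
length(good_index)

# seq_len(n) is the analogous helper when you already have a count n
seq_len(0)
seq_len(3)
```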
Functions
We’ve already seen the basic form of a function. Here we add a few more details and create some more complex functions.
Since a single value is a vector of length one, many base R functions can operate either on a single value or on a longer vector.
f <- function(x){
sin(x) + cos(2*x)
}
x <- 3
f(x)
x <- seq(0,2, by = 0.1)
f(x)
# simple plot
plot(x, f(x), type = 'l', col = "blue", lwd = 2)
Exercise: Write a function to compute the Euclidean norm of a vector of any size.
Many of the base R functions have arguments with default values. You can also add default values to your functions.
# default arguments
pow1 <- function(a, b = 1){
a^b
}
pow1(1,2)
pow1(2)
# can change order if you name the argument
pow1(b=1, 2)
Arguments are matched by position unless they are named.
If a variable isn’t defined inside a function, R looks one level up, in the environment in which the function was created.
x <- 100
f <- function( ) {
y <- 2
c(x, y)
}
f()
rm(f)
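The phrase "the environment in which the function was created" matters when a function is called from inside another function. Here is a minimal sketch; the names show_x and caller are hypothetical.

```r
x <- "global"

# show_x has no local x, so R looks in the environment where it was defined
show_x <- function() {
  x
}

# caller defines its own x, but that is not where show_x was created
caller <- function() {
  x <- "local to caller"
  show_x()
}

result <- caller()
result
```

The result is "global": the x inside caller is invisible to show_x, because lookup follows where a function was defined rather than where it was called.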
For loops have a reputation for being slow in R, and you can often rewrite a for loop using a vectorized function or an apply type function.
fill_NAs <- function(v){
ind <- is.na(v)
mu <- mean(v, na.rm = TRUE, trim = 0.1)
v[ind] <- mu
return(v)
}
vec <- c(1:5, rep(NA, 5), 3:10)
fill_NAs(vec)
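As a rough sketch of the three styles side by side, the snippet below computes square roots with a loop, with a vectorized call, and with sapply. The sapply function is a member of the apply family that is not otherwise covered here, and the variable names are made up.

```r
v <- c(1, 4, 9, 16)

# explicit loop with a preallocated result vector
roots_loop <- double(length(v))
for (k in seq_along(v)) {
  roots_loop[k] <- sqrt(v[k])
}

# vectorized: sqrt() already works on the whole vector at once
roots_vec <- sqrt(v)

# apply style: sapply() calls the function on each element and simplifies
roots_apply <- sapply(v, sqrt)

all.equal(roots_loop, roots_vec)
all.equal(roots_vec, roots_apply)
```

When a vectorized function already exists, as with sqrt, it is usually the simplest choice; the apply style is more useful when the per-element work is a function of your own.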
apply type functions
First we’ll introduce the replicate function.
Here’s the usage from the documentation:
replicate(n, expr, simplify = "array")
The expr is some expression that is evaluated n times. Here I use it to create a data frame.
# pass the expression itself so that it is re-evaluated for each of the 5 columns
df <- data.frame(replicate(5, rnorm(10, 0, 1)))
df
# we can't do this
mean(df)
# we can do this
apply(df, 2, mean)
apply(df, 1, mean, trim = 0.1)
# using a matrix
M <- as.matrix(df)
apply(M, 2, sd)
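Because replicate re-evaluates its expression on every repetition, it matters whether you pass the expression or a value computed beforehand. A small sketch, assuming an arbitrary seed and made-up variable names:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# the expression is re-evaluated on every repetition: columns differ
fresh <- replicate(3, rnorm(4))
dim(fresh)

# a stored value is just copied: every column is the same vector
stored <- rnorm(4)
copied <- replicate(3, stored)
all(copied[, 1] == copied[, 2])
```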
Here is the function usage:
apply(X, MARGIN, FUN, ...)
X is an array structure, such as a matrix, or a data frame. MARGIN is the dimension to apply over (1 for rows and 2 for columns), and FUN is the function you want to apply to each row or column.
The ... means additional arguments can be passed to the function. In the above example trim = 0.1 is an argument to the mean function.
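A small matrix makes the roles of MARGIN and ... easy to check by hand. This is only an illustrative sketch with made-up variable names:

```r
m <- matrix(1:6, nrow = 2)  # filled column by column
m

row_sums <- apply(m, 1, sum)   # one value per row
col_sums <- apply(m, 2, sum)   # one value per column
row_sums
col_sums

# collapse = "-" is not an argument of apply(); it is passed on to paste()
row_labels <- apply(m, 1, paste, collapse = "-")
row_labels
```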
There is a family of apply style functions. The lapply function operates on lists, and tapply operates on a vector, with operations grouped by a second index argument.
# tapply example
# generate three vectors of equal length (tapply requires the data and the index to have the same length)
countries <- c( rep("US", 13), rep("UK", 17), rep("FR", 10))
len <- length(countries)
medals <- sample(c(0,1), len, replace = TRUE)
ages <- rpois(len, 26)
len; medals; ages
# apply sum
tapply(medals, factor(countries), sum)
tapply(ages, factor(countries), mean)
tapply(ages, factor(countries), summary)
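Since the example above uses random data, here is a tiny deterministic sketch of the same grouping idea. The scores and groups are made up so the results can be checked by hand.

```r
# Made-up scores and groups, chosen so the results are easy to verify
scores <- c(10, 20, 30, 40, 50)
group  <- c("a", "b", "a", "b", "a")

# sum of scores within each group
totals <- tapply(scores, group, sum)
totals

# mean of scores within each group
averages <- tapply(scores, group, mean)
averages
```

The result has one entry per group, named after the group, in the sorted order of the group labels.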
Here’s an R version of lapply from Advanced R, which gives an idea of what the function does under the hood.
lapply2 <- function(x, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], ...)
}
return(out)
}
We’ll walk through this function. The argument x is a list or other data structure with a length. The variable out is a list and is what the function will return; space is allocated in it for each element of x. The argument f is a function that will be applied to each element of x. The loop applies f to each element of x and stores the result in out, which is returned when the for loop exits.
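One way to check the walkthrough is to run lapply2 next to the built-in lapply. The sketch below repeats the definition so it runs on its own; the variables inputs, ours, and base are made up.

```r
# lapply2 is repeated here so the snippet runs on its own
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

inputs <- list(1:3, c(4, NA, 6))

# na.rm = TRUE travels through ... to sum()
ours <- lapply2(inputs, sum, na.rm = TRUE)
base <- lapply(inputs, sum, na.rm = TRUE)
identical(ours, base)

# unlike lapply(), this simplified version does not copy names to the result
names(lapply2(list(a = 1, b = 2), sqrt))
```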
Here’s an example of operating on a list:
# A list of vectors of different size
vec_list <- list(a = c(1:10, NA), b = 2:20, c = 3:40, d = 5:50)
lapply(vec_list, sum)
lapply(vec_list, mean)
# remove NA's before computing means
lapply(vec_list, mean, na.rm = TRUE)
lapply(vec_list, sd, na.rm = TRUE )
res <- lapply(vec_list, sd, na.rm = TRUE )
unlist(res)
The lapply function returns a list of the same length as the list passed to it. The unlist function can be used to turn the results into a vector.
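A compact sketch of the list-to-vector step, using a small made-up list so the values are easy to verify:

```r
# lapply() returns a list with one element per input, keeping the names
res <- lapply(list(a = 1:3, b = 4:6), sum)
is.list(res)
length(res)

# unlist() flattens the list into a named vector
flat <- unlist(res)
flat
names(flat)
```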
User-defined functions can also be used. In addition, you can pass an anonymous function, defined inline as the function argument to lapply. Here is the syntax:
# anonymous functions
(function(x) x*2)(2)
lapply(vec_list, function(x) x*2)
# paste/paste0 functions
paste0("filename", 1, ".csv")
# create a list of file names
dflist <- lapply(2:10, function(x) paste0("filename", x, ".csv"))
dflist
# seq
seq(2, 20, by = 4)
nums <- seq(2, 20, by = 4)
# repeating above with a different sequence
dflist <- lapply(nums , function(x) paste0("filename", x, ".csv"))
dflist
We’ll take an example from here of running a simple simulation to test whether the difference between two observations is significant.
Here’s the setup:
Let’s suppose we’re comparing two webpages to see which one converts our customers to “sign up” at a higher rate (This is commonly referred to as an A/B Test). For page A we have seen 20 convert and 100 not convert, for page B we have 38 converting and 110 not converting. We’ll model this as two Beta distributions as we can see below:
# generate samples from the distributions
runs <- 100
rbeta(runs, 38, 110)
rbeta(runs, 20, 100)
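# vary the number of runs and compute the share of runs in which page B does not beat page A
sig_test <- function(runs){
a <- rbeta(runs, 38, 110)  # page B: 38 converted, 110 did not
b <- rbeta(runs, 20, 100)  # page A: 20 converted, 100 did not
p_val <- 1 - sum(a > b)/runs
return(p_val)
}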
# check time
system.time(pval1 <- sig_test(1000))
system.time(pval2 <- sig_test(100000))
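Timing is one reason to vary the number of runs; stability of the estimate is another. The sketch below is a variation on the function above, not the original author’s code: the name sig_test2, the seed, and the number of repetitions are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

sig_test2 <- function(runs) {
  page_b <- rbeta(runs, 38, 110)
  page_a <- rbeta(runs, 20, 100)
  # mean() of a logical vector is the proportion of TRUE values,
  # so this equals 1 - sum(page_b > page_a) / runs
  1 - mean(page_b > page_a)
}

# repeat the whole simulation 20 times at two different sizes
small <- replicate(20, sig_test2(100))
large <- replicate(20, sig_test2(10000))

# spread of the estimates at each size
sd(small)
sd(large)
```

Comparing the two standard deviations gives a feel for how much a single estimate can move from one call to the next at each simulation size.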
Closures
You can also create functions that return functions based on input.
make_pow <- function(n){
function(a){
a^n
}
}
pow3 <- make_pow(3)
pow4 <- make_pow(4)
is.function(pow3)
pow3(2)
pow4(3)
# create a list of functions
pows <- lapply(5:10, make_pow)
# pass the functions to the anonymous function to evaluate
lapply(pows, function(x) x(2:4))
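What makes these functions closures is that each one keeps a reference to the environment it was created in, including the value of n. A minimal sketch that repeats make_pow so it runs on its own; the names cube and square are made up.

```r
make_pow <- function(n) {
  function(a) {
    a^n
  }
}

cube <- make_pow(3)

# the returned function remembers the n it was created with
cube(2)
environment(cube)$n

# each call to make_pow() creates a separate environment with its own n
square <- make_pow(2)
square(2)
environment(square)$n
```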