Data Frames

Aside from last sesson

A couple things that came up in the last session that I said I’d post answers to so here they are.

What functions are available at start-up

There are a number of default settings in R that can be adjusted. I’d recommend leaving them as or reading up on how the options function before making changes. Settings can be changed in an .Rprofile.

To view default options you can use the options() command.

# check which packages are loaded at start-up
options()$defaultPackages

# look at functions available in a package
  library(help = base)
  library(help = stats)

Here’s a list from Hadley Wickham if you are interested in expanding your function vocabulary

The first two on that list are, ? and str.

Getting packages

So far we haven’t been using any package. If you’re searching for a topic Google is a good choice. There is also the CRAN taskview, or Microsoft’s version, the MRAN taskview.

Rstudio also has a built in panel for searching for and installing libraries.

We’ll start off with three packages that make data manipulation easier. These come from the so-called tidyverse:

  # install packages
  install.packages("dplyr", "readr", "stringr")

  # look at functions in dplyr
  library(help = dplyr)

Review

A quick review on creating vectors and lists


  # review : vectors and types
  a <- 1:5
  letters[1:5]
  b <- factor(letters[1:5])
  c <- c("hey", "hi", "yo", "sup", "g'day")

  # create a list
  alist <- list(a,b,c)
  # returns a list
  alist[3]
  # returns the item
  alist[[3]][5]

Data frames

A data frame is basically a list with equal vectors of items of a single type (the columns).

Here’s a very simple data frame which we make from the vectors we created above


  # create the vectors if you haven't already
  a <- 1:5
  letters[1:5]
  b <- factor(letters[1:5])
  c <- c("hey", "hi", "yo", "sup", "g'day")


  # create a data frame
  df <- data.frame(a,b,c)
  df

The names of the columns default to the variable names of the vectors the data frame was created

Simple indexing and subsetting

Subsetting follows similar rules to what we’ve seen with matrices. Note that if you subset a single column of a data frame you still have a data frame.

  # indexing and subsetting
  dim(df)
  df[1,]
  df[,2]
  df[1,2]
  # byl column name
  df$a
  # other methods
  df[, -2]
  df[-1, ]
  #logical subsetting
  df[c(TRUE, FALSE, TRUE), ]

  # subset with vectors
  df[c(1,3), 2:3]

We’ll make a slightly larger data frame to demonstrate som

  # from a set of vectors
  name <- c("Bob", "Saghi", "Alice", "Elise", "Ciera", "Elijah")
  age <- c(65, 23, 54, 10, 25, 32)
  bloodtype <- factor(c("O", "AB", "A", NA, "A", "B"))
  last_appt <- c(rep(Sys.Date(), length(age))- floor(rnorm(length(age),365, 100)))

  patients <- data.frame(name, age, bloodtype, last_appt)

  # summary
  head(patients)
  summary(patients)
  names(patients)

Looking at NA values

There are many ways of dealing with NA values. The simplest is just to remove columns. Here we look at a few basic ways of checking for missing values.

  sum(is.na(patients))
  na.omit(patients)
  # identify the observation that's a problem 
  complete.cases(patients)
  patients[!complete.cases(patients), ]

The complete.cases function is useful for looking at observations (rows) that have no NA values.

The lapply function can also be used on lists of data frames. This can be a pretty powerful method for applying functions to multiple similar data frames. You need to make sure the operations are valid on the columns they’re applied to, of course.

  # review lapply(list, FUN )
  nas <- lapply(patients, is.na)
  lapply(list(df, patients), summary)

It’s easy to add columns to data frames, or to get and change column names.

  # add a column (check err)
  random_nums <- floor(rnorm(6, 20, 5))
  patients$nums <- random_nums


  old_names <- names(patients)
  names(patients) <- c("Names", "Age", "Bloodtype", "Last_appt", "ATP_level")
  patients

  # replace old names 
  # names(patients) <- old_names

Reading and writing files

Here we go over a simple case of reading and writing files. For most school projects you’re likely to be given nice clean data to load. Once you’re using data from a wider range of sources, you’ll need to be able to check more carefully the formatting of the file before loading.

First we figure out where R has been loaded. In this example I’m just saying to the current directory R is in. In other cases you mean need to load or save to a directory using relative or absolute file paths.


  # write file
  getwd()
  dir()
  setwd()
  curr_dir <- getwd()
  filename <- "patients.txt"
  # check slash direction
  file_dir <- paste0(curr_dir, "/",  filename)
  write.table(patients, file_dir)

  df1 <- read.table(file_dir)
  df2 <- read.table(file_dir, stringsAsFactors = FALSE)

Dplyr Vignette

It’s good to be familiar with all the ways of manipulating data frames in base R and if you are minimizing dependencies of you code you might want to stick to base R functions for common tasks.

dplyr is a package written by Hadley Wickham that has a number of convenient tools for manipulating data frames.

Out of laziness and to demonstrate where to find good examples, this example is taken from the very good vignette provided for the dplyr package. Find the whole thing here

library(readr, dplyr)
# manipulation in dplyr
select(df2, -ATP_level)

library(nycflights13)

head(flights)
# filter() 
# arrange()
# select() 
# distinct()
# mutate() 
# summarise()

# January flights
filter(flights, month == 1)
filter(flights, month == 1 | month == 2)

distinct(flights, carrier)
arrange(flights, year, month, day)

arrange(flights, dep_delay)
mutate(flights, speed = distance/air_time * 60)

group_by(flights, carrier)


summarise(group_by(flights, carrier), 
          arr_delay = mean(arr_delay))


summarise(group_by(flights, carrier), 
          arr_delay = mean(arr_delay, na.rm = TRUE))

In the next example, these operations are combined. We group data by the plane tail_num and then create a summary data frame using summarise. The n() is a dplyr utility function that counts the number of observations for a given factor.

# plotting data

  by_tailnum <- group_by(flights, tailnum)
  delay <- summarise(by_tailnum,
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    arr_delay = mean(arr_delay, na.rm = TRUE))
  delay <- filter(delay, count > 20, dist < 2000)

  head(delay)

  with(delay, plot(dist, arr_delay, pch = 16, 
      cex = 0.5 + 5*count/max(count),  
      col = rgb(.1,.1,.1,0.4) ))
  lines(with(delay, loess.smooth(dist, arr_delay)), 
    col = "blue", lwd = 3, lty = 3)

My previous plot was a partial attempt to mimic the ggplot that is demonstrated in the ddplyr vignette.

  ## This example has duplicate points, so avoid cv = TRUE
  library(ggplot2)

  # Interestingly, the average delay is only slightly related to the
  # average distance flown by a plane.
  ggplot(delay, aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()