Chapter 5 R Coding Basics

Now that you know how to navigate RStudio and have a working project, we’ll take a look at the basics of R. As we’re chemists first, and not computer programmers, we’ll try and avoid as much of the nitty-gritty underneath the hood aspects of R. However, a risk of this approach is being unable to understand errors and warnings preventing your code from running. As such, we’ll introduce the most important and pertinent aspects of the R language to meet your environmental chemistry needs.

5.1 Variables

We’ve already talked about how R can be used like a calculator:

(1000 * pi) / 2
## [1] 1570.796
(2 * 3) + (5 * 4)
## [1] 26

But managing these inputs and outputs is simplified with variables. Variables in R, like those you’ve encountered in math class, can only have one value, and you can reference or pass that value along by referring the variable name. And, unlike the variables in math classes, you can change that value whenever you want. Another way to think about it is that a variable is a box in which you store your value. When you want to move (reference) your value, you move the box (and whatever is inside of it). Then you can simply open the box somewhere else without having to worry about the hassle of what’s inside.

You can assign a value to a variable using <-, as shown below.

x <- 12
x
## [1] 12

Codes using <- should be read right to left: x <- 12 would be read as “take the value 12 and store it into the variable x”. The second line of code, x, simply evaluates the value that x is assigned to. Note that when a variable is typed on it own, R will print out its contents. You can now use this variable in snippets of code:

x <- x * 6.022e23
x
## [1] 7.2264e+24

Remember, R evaluates <- from right to left, so the code above is taking the number 6.022e23 and multiplying it by the value of x, which is 12 and storing that value back into x. That’s how we’re able to modifying the contents of a variable using its current value. You can also overwrite the contents of a variable at anytime (i.e. x <- 25).

Variable names are case sensitive, so if your variable is named x and you type X into the console, R will not be able to print the contents of x. Variable names can consist of letters, numbers, dots (.) and/or underlines (_). Here are some rules and guidelines for naming variables in R:

  • Variable Name Requirements as dictated by R
    • names must begin with a letter or with the dot character. var and .var are acceptable.
    • Variable names cannot start with a number or the . character cannot be followed by a number. var1 is acceptable, 1var and .1var are not.
    • Variable names cannot contain a space. var 1 is interpreted as two separate values, var and 1.
    • Certain words are reserved for R, and cannot be used as variable names. These include, but are not limited to, if, else, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NA, and NaN

Good names for variables are short, sweet, and easy to type while also being somewhat descriptive. For example, let’s say you have an air pollution data set. A good name to assign the data set to would be airPol or air_pol, as these names tell us what is contained in the data set and are easy to type. A bad name for the data set would be airPollution_NOx_O3_June20_1968. While this name is much more descriptive than the previous names, it will take you a long time to type, and will become a bit of a nuisance when you have to type it 10+ times to refer to the data set in a single script. Please refer to the Style Guide found in Advanced R by H. Wickham for more information.

Lastly, R evaluates multiple lines of code one at a time, from top-to-bottom. So if you reference a variable it must have already been created at an earlier point in your script. For example:

y + 1
## [1] 6
y <- 12

The code above returns the object 'y' not found error because we’re adding + 1 to y which hasn’t been created yet, it’s created on the next line. These errors also pop up when you edit your code without clearing your workplace. All variables created in a session are stored in the working environment so you can call them, even if you change your code. This means you can accidentally reference a variable that isn’t reproduced in the latest iteration of your code. Consequently, a good practice is to frequently clear your work-space using the ‘broom’ button in the Environment pane. This will help you to ensure the code you’re writing is organized in the correct order; see Saving R Markdown for why this is important.

5.2 Data Types

Data types refer to how data is stored and handled by and in R. This can get complicated quickly, but we’ll focus on the most common types here so you can get started on your work. Firstly, here are the data types you’ll likely be working with:

  • character: "a", "howdy", "1", is used to represent text values in R. Character values may be wrapped in either single-quotes or double-quotes; for consistency in this textbook, we’ll always use double-quotes. For example, "1", despite being read as number by us, is stored as a character and treated as such by R. Also known as string values.
  • numeric: any real or decimal number such as 2, 3.14, 6.022e23.
  • integer such as 2L, note the ‘L’ tells R this is an integer.
  • logical: either TRUE or FALSE; also known as a boolean values.

Sometimes R will misinterpret a value as the wrong data type. This can hamper your work as you can’t do arithmetic on a string!

x <- "6"
x / 2
## Error in x/2: non-numeric argument to binary operator

“non-numeric argument to binary operator” is a commonly encountered error, and it’s simply telling you that you’re trying to do math on something you can’t do math on. You might think if x is 6, why can’t I divide it by 2?

Let’s look at some helpful functions to test the data type of a value in R, and how to fix errors like this one. First, let’s see what type of data x is:

is.numeric(x)   # test if numeric 
## [1] FALSE
is.logical(x)   # test if logical
## [1] FALSE
is.integer(x)   # test if integer 
## [1] FALSE
is.character(x) # test if character
## [1] TRUE

So the value of x is a character, in other words R treats it as a word, and we can’t do math on that (note the quotation marks “” around the 6 in the code above that defined it as text). So let’s convert the data type of x to numeric to proceed.

x 
## [1] "6"
x <- as.numeric(x)
is.numeric(x)
## [1] TRUE
x
## [1] 6
x / 2
## [1] 3

So we’ve converted our character string "6" to the numerical value 6. Keep in mind there are other conversion functions which are described elsewhere, but you can’t always convert types. In the above example we could convert a character to numeric because it was ultimately a number, but we couldn’t do the same if the value of x was "six".

x <-"six"
x <- as.numeric(x)
## Warning: NAs introduced by coercion
x
## [1] NA

“NAs introduced by coercion” means that as.numeric didn’t know how to convert “six” to a numeric value, so it instead turned it into an NA, representing a missing value.

5.3 Data Structures

Data structures refers to how R stores data. It’s easy to get lost in the weeds here, so we’ll start with the focus on the most common and useful data structure for your work: data frames.

5.3.1 Data Frames

Data frames consist of data stored in rows and columns. If you’ve ever worked with a spreadsheet, it’s essentially that with the caveat that all data stored in a column must be of the same type. Different columns can have different data types, but within a column all the data needs to be the same type.

When importing data into a data frame, R will automatically convert data to ensure consistency in each column. This can lead to surprising behaviour: a common error is a single character in a column of numerical values leading to the entire column to be interpreted as character values. We’ll need to be careful about this when importing our own data, which we’ll start doing shortly!

5.3.1.1 Creating a Data Frame from Scratch

Let’s see how we can create a data frame by explicitly listing out the values.

# First, create data for each column. We use the "c" function to create
# one *vector* for each column. We'll discuss vectors in more detail below.
names <- c("Alice", "Bob", "Charlie", "David", "Eve")
ages <- c(20, 21, 22, 23, 19)
food <- c("Bubble Tea", "Pineapple Pizza", "Diet Pepsi", "Korean BBQ", "Sushi AYCE")

# Creating the data frame
students <- data.frame(Name = names, Age = ages, Food = food)

# Displaying the data frame
print(students)
##      Name Age            Food
## 1   Alice  20      Bubble Tea
## 2     Bob  21 Pineapple Pizza
## 3 Charlie  22      Diet Pepsi
## 4   David  23      Korean BBQ
## 5     Eve  19      Sushi AYCE

5.3.1.2 Reading Data from a File

Obviously when we have many more data, it would be unrealistic to manually list them out in our code. So instead, we can create a data frame by reading a file.

From the R4EnvChem-ProjectTemplate, downloaded in Importing a project, let’s import some real data that was included in the downloaded project by typing the following code into the console:

airPol <- read.csv("data/2018-01-01_60430_Toronto_ON.csv")

read.csv() is a useful R built-in function which, as you might guess from its name, can read a .csv file and convert it into a data frame. The data we just imported contains air quality data measured in downtown Toronto around January 2018. The “Column specification” summary printed to the console is a useful feature of read.csv(). It tells you what data type was determined for each column when it was imported. Note that double is simply another term for the numeric data type. Some of the variables are:

  • naps, city, p, latitude, longitude to tell you where the data was measured.
  • data.time for when the measurements were taking. Note this is a datetime, which is a subset of numeric data. The values contained herein correspond to time elements such as year, month, data, and time.
  • pollutant for the chemical measured
  • concentration for the measured concentration in parts-per-million (ppm).

We’ve assigned it to the variable airPol. This is so we can reference it and make use of it later on (see below). If we didn’t do this our data would simply be printed to the console which isn’t helpful. Let’s take a look at the first few rows of the data using the head function:

head(airPol)
##    naps    city  p latitude longitude           date.time pollutant
## 1 60430 Toronto ON 43.70944  -79.5435 2018-01-01 00:00:00        O3
## 2 60430 Toronto ON 43.70944  -79.5435 2018-01-01 00:00:00       NO2
## 3 60430 Toronto ON 43.70944  -79.5435 2018-01-01 00:00:00       SO2
## 4 60430 Toronto ON 43.70944  -79.5435 2018-01-01 01:00:00        O3
## 5 60430 Toronto ON 43.70944  -79.5435 2018-01-01 01:00:00       NO2
## 6 60430 Toronto ON 43.70944  -79.5435 2018-01-01 01:00:00       SO2
##   concentration
## 1             3
## 2            39
## 3             1
## 4             1
## 5            47
## 6             3

In this data frame, each column is a variable and each row is an observation. So reading the first row, we know that the Toronto 60430 station on 2018-07-01 at midnight measured ambient O3 concentrations of 46 ppm (Note the concentration column isn’t printed due to width). Using head will only output a small chunk of our data for us to see. If you’d like to see it in full, go to the Environment pane and double-click on the airPol variable.

5.3.2 Accessing Data in Subfolders

Note that read.csv() requires us to specify the file name, but in the above example we prefixed our file name with "data/2018...". This is because the .csv file we want to open is stored in the data subfolder. By specifying this in the prefix, we tell read.csv() to first go to the data sub folder in the working directory and then search for and open the specified data file.

What we’ve done above is called relative referencing and it’s a huge benefit of projects. The actual data file is stored somewhere on your computer in a folder like "C:/User/Your_name/Documents/School/Undergrad/Second_Year/R4EnvChemTemplate/data/2018-01-01_60430_Toronto_ON.csv". If we weren’t in a project, this is what you’d need to type to open your file, but since we’re working in the project, R assumes the long part, and begins searching for files inside the project folder. Hence, why we only need "data/2018...". Not only is this much simpler to type, and but it makes sharing your work with colleagues, TAs, and instructors (and yourself!) much easier. In other words, if you wanted to share your code, you would send the entire project folder (code & data) and the receiver could open it and run it as is.

5.3.3 Other Data Structures

R has several other data structures. They aren’t as frequently used, but it’s worth being aware of their existence. Other structures include:

  • Vectors, which contain multiple elements of the same type; either numeric, character, logical, or integer. Vectors are created using c(), which is short for combine. A data frame is just multiple vectors arranged into columns. Some examples of vectors are shown below.

    num <- c(1, 2, 3, 4, 5)
    num
    ## [1] 1 2 3 4 5
    char <- c("blue", "green", "red")
    char
    ## [1] "blue"  "green" "red"
    log <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
    log
    ## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
  • Lists are similar to vectors in that they are one dimensional data structures which contain multiple elements. However, there are two important differences: (1) lists can contain multiple elements of different types, while vectors only contain a single type of data; and (2) each piece of data in the list is given a name with which we can refer to that data. You can create lists using list(), as illustrated below.

    # Creating a list with three components: "Greetings", "someNumbers", and "someBooleans"
    hi <- list("Greetings" = "Hello", "someNumbers" = c(5,10,15,20), "someBooleans" = c(TRUE, TRUE, FALSE))
    hi
    ## $Greetings
    ## [1] "Hello"
    ## 
    ## $someNumbers
    ## [1]  5 10 15 20
    ## 
    ## $someBooleans
    ## [1]  TRUE  TRUE FALSE

    We can access individual components of the list by using the $ operator:

    # Access the "Greetings" component of list "hi"
    hi$Greetings
    ## [1] "Hello"
    # Access the "someNumbers" component of list "hi"
    hi$someNumbers
    ## [1]  5 10 15 20

There are many freely available resources online which dive more in depth into different data structures in R. If you are interested in learning more about different structures, you can check out the Data structure chapter of Advanced R by Hadley Wickham.

5.4 Conditional Statements

In programming, it’s often necessary to make decisions and execute certain portions of code based on specific conditions. That’s where conditional statements come into play.

In R, the primary mechanism to make decisions is the if-else construct. With it, you can evaluate a condition and, based on whether it’s true or false, choose which code block to execute.

5.4.1 Understanding R Syntax

Before diving into conditional statements, let’s take a moment to understand the syntax used in R.

R, like many programming languages, uses a combination of parentheses (), curly braces {}, and other symbols to organize and structure the code.

  1. Parentheses (): These are primarily used to enclose arguments of functions and conditions in control statements, like ‘if’. For example, in if (x > 5), the condition x > 5 is enclosed in parentheses.

  2. Curly Brackets {}: These are used to group multiple lines of code into a block. This is particularly useful in control statements where more than one line of code should be executed based on a condition.

The reason the curly bracket might span multiple lines is for readability. It makes it clear where a block of code begins and ends. While it’s possible to write if-else statements without curly brackets if only one statement is being conditioned, it’s good practice to always use them for clarity.

Now, with this understanding, let’s move on to how R uses these in conditional statements.

5.4.2 The Basic if Statement

x <- 10

# In this example, R checks if x is greater than 5. 
if (x > 5) {
  print("x is greater than 5!")
}
## [1] "x is greater than 5!"

5.4.3 Expanding with else and else if

For situations where you want to specify actions for both true and false conditions, you can add an else section.

x <- 3

if (x > 5) {
  print("x is greater than 5!")
} else { # if x <= 5
  print("x is 5 or less!")
}
## [1] "x is 5 or less!"

Here, because x is 3 (which is not greater than 5), R prints “x is 5 or less!”.

For situations where multiple conditions need to be evaluated in sequence, you can use the else if construct. This allows you to add more conditions after the initial if.

x <- 6

if (x > 10) {
  print("x is greater than 10!")
} else if (x > 5) {
  print("x is greater than 5 but less than or equal to 10!")
} else {
  print("x is 5 or less!")
}
## [1] "x is greater than 5 but less than or equal to 10!"

In terms of syntax, it’s important to remember:

  • Always enclose the condition you’re testing within parentheses ().
  • Use curly brackets {} to group the lines of code that should be executed for a particular condition.
  • Make sure each else if or else follows an if or another else if. They cannot stand alone.

5.5 R built-in Functions

Built-in functions are the essential tools that allow you to perform a wide range of tasks without having to write the underlying code from scratch. These functions are part of the R language itself and are readily available for your use.

In this chapter, you’ve already come across a few built-in functions that are incredibly useful. For instance, you’ve used the read.csv() function to import data from CSV files into your R environment. Additionally, the as.numeric() function has been employed to convert data to numeric format, and the list() function has aided in creating lists to organize and store data elements.

5.5.1 Exploring More Built-In Functions

Let’s delve into a few more built-in functions that are integral to your R experience:

print(): The print() function displays output on the console. When you want to see the result of an expression or the contents of a variable, print() makes it effortless.

print("Hello, R!")
## [1] "Hello, R!"

mean(): The mean() function calculates the average of a numeric vector.

mean(c(5, 10, 15, 20))
## [1] 12.5

max()/min(): With the max() function, you can effortlessly determine the maximum value within a numeric vector. Similarly, min() function returns the minimum value within a vector.

max(c(5, 10, 15, 20))
## [1] 20
min(c(5, 10, 15, 20))
## [1] 5

5.5.2 Function Documentation

An often unappreciated aspect of packages is that they not only contain functions we can use, but documentation. Documentation provides a description of the function (what it does), what arguments it takes, details, and working examples. Often the easiest way to learn how to use a function is to take a working example and change it bit by bit to see how it works etc. To see documentation check the “help” tab in the “outputs” window or type a question mark in front of a functions name:

# Takes you to the help document for the read.csv function
?read.csv

You can also write your own functions. Please see Writing custom functions in R for additional details.

5.6 Summary

In this chapter we’ve covered:

  • The basics of coding in R including variables, data types, and data structures (notably data.frames).
  • Importing data from your project folder into R
  • Using if-else structure to build a conditional logic
  • Using R built-in functions and opening function documentations

Now that you’re familiar with navigating RStudio and some basic R coding, you may have realized that working the console can get real messy, real quick. Read on to Workflows for R Coding where we’ll discuss R workflows to make everyone’s lives easier.

5.7 Exercise

There is a set of exercises available for this chapter!

Not sure how to access and work on the exercise Rmd files?

  • Refer to Running Tests for Your Exercises for step-by-step instructions on accessing the exercises and working within the UofT JupyterHub’s RStudio environment.

  • Alternatively, if you’d like to simply access the individual files, you can download them directly from this repository.

Always remember to save your progress regularly and consult the textbook’s guidelines for submitting your completed exercises.