Chapter 12 Transform: Data Manipulation

Transformation encompasses any steps you take to manipulate, reshape, refine, or transform your data. We’ve already touched upon some useful transformation functions in previous example code snippets, such as the mutate function for adding columns. This section will explore some of the most useful functionalities of the dplyr package, explicitly introduce the pipe operator %>%, and showcase how you can leverage these tools to quickly manipulate your data.

The essential dplyr functions are:

  • mutate() to create new columns/variables from existing data
  • arrange() to reorder rows
  • filter() to refine observations by their values (in other words by row)
  • select() to pick variables by name (in other words by column)
  • summarize() to collapse many values down to a single summary.

We’ll go through each of these functions; for more details, you can read Chapter 3: Data Transformation from R for Data Science, which provides a more comprehensive breakdown. Note that the information here is based on a tidyverse approach, but this is only one way of doing things. See the Further reading section for links to other suitable approaches to data transformation.

Let’s explore the functionality of dplyr using some flame absorption/emission spectroscopy (FAES) data from a CHM317 lab. This data represents the emission signal of five sodium (Na) standards measured in triplicate:

FAES <- read_csv(file = "data/FAES.csv")
head(FAES)
## # A tibble: 6 × 4
##    ...1 std_Na_conc      replicate signal
##   <dbl> <chr>                <dbl>  <dbl>
## 1     1 blank 0 ppm              1   502.
## 2     2 blank 0 ppm              2   592.
## 3     3 blank 0 ppm              3   581.
## 4     4 standard 0.1 ppm         1  5656.
## 5     5 standard 0.1 ppm         2  5654.
## 6     6 standard 0.1 ppm         3  5667.

In this dataset you can see that two important aspects of the data, sample type (sample, blank, or standard) and concentration, are grouped in one column. We can use the separate() function we learned about in Separating columns to separate these values into two columns to facilitate further analysis.

FAES <- separate(
  FAES,
  col = std_Na_conc,
  into = c("type", "conc_Na", "units"),
  sep = " ",
  convert = TRUE
)

DT::datatable(FAES)

12.1 Selecting by row or value

filter() allows us to subset our data based on observation (row) values.

filter(FAES, conc_Na == 0)
## # A tibble: 3 × 6
##    ...1 type  conc_Na units replicate signal
##   <dbl> <chr>   <dbl> <chr>     <dbl>  <dbl>
## 1     1 blank       0 ppm           1   502.
## 2     2 blank       0 ppm           2   592.
## 3     3 blank       0 ppm           3   581.

Note how we need to pass logical operations to filter() to specify which rows we want to select. In the above code, we used filter() to get all rows where the concentration of sodium is equal to 0 (== 0). Note the presence of two equal signs (==). In R one equal sign (=) is used to pass an argument, two equal signs (==) is the logical operation “is equal” and is used to test equality (i.e. that both sides have the same value). A frequent mistake is to use = instead of == when testing for equality.
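To see the difference in action, here is a minimal sketch using a small hypothetical data frame (not the FAES data): `==` performs the equality test, while `=` is read as a named argument and will not filter at all.

```r
library(dplyr)

# A small, hypothetical data frame to illustrate == vs =
df <- tibble(conc = c(0, 0.1, 0.2))

# Correct: `==` tests equality, keeping only the matching rows
filter(df, conc == 0)

# Incorrect: `=` reads as a named argument, so dplyr raises an
# informative error rather than silently testing equality
# filter(df, conc = 0)
```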

12.1.1 Logical operators

filter() can use other relational and logical operators, or combinations thereof. Relational operators compare values, while logical operators carry out Boolean operations (TRUE or FALSE) and are used to combine multiple relational tests. Here’s a list of what they are and how we can use them:

Operator Type Description
< relational Less than
> relational Greater than
<= relational Less than or equal to
>= relational Greater than or equal to
== relational Equal to
!= relational Not equal to
& logical AND
! logical NOT
| logical OR
is.na() function Checks for missing values, TRUE if NA
  • Selecting all signals below a threshold value:

    filter(FAES, signal < 4450)
    ## # A tibble: 3 × 6
    ##    ...1 type  conc_Na units replicate signal
    ##   <dbl> <chr>   <dbl> <chr>     <dbl>  <dbl>
    ## 1     1 blank       0 ppm           1   502.
    ## 2     2 blank       0 ppm           2   592.
    ## 3     3 blank       0 ppm           3   581.
  • Selecting signals between values:

    filter(FAES, signal >= 4450 & signal < 8150)
    ## # A tibble: 3 × 6
    ##    ...1 type     conc_Na units replicate signal
    ##   <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
    ## 1     4 standard     0.1 ppm           1  5656.
    ## 2     5 standard     0.1 ppm           2  5654.
    ## 3     6 standard     0.1 ppm           3  5667.
  • Selecting all replicates other than replicate 2:

    filter(FAES, replicate != 2)
    ## # A tibble: 10 × 6
    ##     ...1 type     conc_Na units replicate signal
    ##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
    ##  1     1 blank        0   ppm           1   502.
    ##  2     3 blank        0   ppm           3   581.
    ##  3     4 standard     0.1 ppm           1  5656.
    ##  4     6 standard     0.1 ppm           3  5667.
    ##  5     7 standard     0.2 ppm           1  9393.
    ##  6     9 standard     0.2 ppm           3  9332.
    ##  7    10 standard     0.5 ppm           1 20187.
    ##  8    12 standard     0.5 ppm           3 20153.
    ##  9    13 standard     1   ppm           1 30798.
    ## 10    15 standard     1   ppm           3 30790.
  • Selecting the first standard replicate OR any of the blanks:

    filter(FAES, (type == "standard" & replicate == 1) | (type == "blank"))
    ## # A tibble: 7 × 6
    ##    ...1 type     conc_Na units replicate signal
    ##   <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
    ## 1     1 blank        0   ppm           1   502.
    ## 2     2 blank        0   ppm           2   592.
    ## 3     3 blank        0   ppm           3   581.
    ## 4     4 standard     0.1 ppm           1  5656.
    ## 5     7 standard     0.2 ppm           1  9393.
    ## 6    10 standard     0.5 ppm           1 20187.
    ## 7    13 standard     1   ppm           1 30798.
  • Removing any rows with missing signal values (NA) using is.na(). Note there are no missing values in our data set, so nothing will be removed. If we removed the NOT operator (!), we would instead select only the rows with missing values.

    filter(FAES, !is.na(signal))
    ## # A tibble: 15 × 6
    ##     ...1 type     conc_Na units replicate signal
    ##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
    ##  1     1 blank        0   ppm           1   502.
    ##  2     2 blank        0   ppm           2   592.
    ##  3     3 blank        0   ppm           3   581.
    ##  4     4 standard     0.1 ppm           1  5656.
    ##  5     5 standard     0.1 ppm           2  5654.
    ##  6     6 standard     0.1 ppm           3  5667.
    ##  7     7 standard     0.2 ppm           1  9393.
    ##  8     8 standard     0.2 ppm           2  9363.
    ##  9     9 standard     0.2 ppm           3  9332.
    ## 10    10 standard     0.5 ppm           1 20187.
    ## 11    11 standard     0.5 ppm           2 20141.
    ## 12    12 standard     0.5 ppm           3 20153.
    ## 13    13 standard     1   ppm           1 30798.
    ## 14    14 standard     1   ppm           2 30837.
    ## 15    15 standard     1   ppm           3 30790.

These are just some examples, but you can combine the logical operators in any way that works for you. Likewise, multiple combinations can yield the same result; it’s up to you to figure out which works best for you.

12.2 Arranging rows

arrange() reorders the rows based on the values you pass to it. By default it arranges the specified values in ascending order. Let’s arrange our data by increasing signal value:

arrange(FAES, signal)
## # A tibble: 15 × 6
##     ...1 type     conc_Na units replicate signal
##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
##  1     1 blank        0   ppm           1   502.
##  2     3 blank        0   ppm           3   581.
##  3     2 blank        0   ppm           2   592.
##  4     5 standard     0.1 ppm           2  5654.
##  5     4 standard     0.1 ppm           1  5656.
##  6     6 standard     0.1 ppm           3  5667.
##  7     9 standard     0.2 ppm           3  9332.
##  8     8 standard     0.2 ppm           2  9363.
##  9     7 standard     0.2 ppm           1  9393.
## 10    11 standard     0.5 ppm           2 20141.
## 11    12 standard     0.5 ppm           3 20153.
## 12    10 standard     0.5 ppm           1 20187.
## 13    15 standard     1   ppm           3 30790.
## 14    13 standard     1   ppm           1 30798.
## 15    14 standard     1   ppm           2 30837.

Since our original FAES data is already arranged by increasing conc_Na and replicate, let’s invert that order by arranging conc_Na in descending order using the desc() function while arranging the signal values in ascending order:

# Note the order of precedence (left-to-right)
arrange(FAES, desc(conc_Na), signal)
## # A tibble: 15 × 6
##     ...1 type     conc_Na units replicate signal
##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
##  1    15 standard     1   ppm           3 30790.
##  2    13 standard     1   ppm           1 30798.
##  3    14 standard     1   ppm           2 30837.
##  4    11 standard     0.5 ppm           2 20141.
##  5    12 standard     0.5 ppm           3 20153.
##  6    10 standard     0.5 ppm           1 20187.
##  7     9 standard     0.2 ppm           3  9332.
##  8     8 standard     0.2 ppm           2  9363.
##  9     7 standard     0.2 ppm           1  9393.
## 10     5 standard     0.1 ppm           2  5654.
## 11     4 standard     0.1 ppm           1  5656.
## 12     6 standard     0.1 ppm           3  5667.
## 13     1 blank        0   ppm           1   502.
## 14     3 blank        0   ppm           3   581.
## 15     2 blank        0   ppm           2   592.

Just note with arrange() that NA values will always be placed at the bottom, whether you use desc() or not.
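A minimal sketch with a small hypothetical tibble makes this NA behaviour easy to verify:

```r
library(dplyr)

# A small, hypothetical tibble containing a missing value
df <- tibble(x = c(2, NA, 1))

arrange(df, x)        # rows ordered 1, 2, NA
arrange(df, desc(x))  # rows ordered 2, 1, NA; the NA is still last
```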

12.3 Selecting columns by name

select() allows you to readily select columns by name. Note however that it will always return a tibble, even if you only select one variable/column.

select(FAES, signal)
## # A tibble: 15 × 1
##    signal
##     <dbl>
##  1   502.
##  2   592.
##  3   581.
##  4  5656.
##  5  5654.
##  6  5667.
##  7  9393.
##  8  9363.
##  9  9332.
## 10 20187.
## 11 20141.
## 12 20153.
## 13 30798.
## 14 30837.
## 15 30790.

You can also select multiple columns using the same operators and helper functions described in Tidying Your Data.

select(FAES, conc_Na:replicate)
## # A tibble: 15 × 3
##    conc_Na units replicate
##      <dbl> <chr>     <dbl>
##  1     0   ppm           1
##  2     0   ppm           2
##  3     0   ppm           3
##  4     0.1 ppm           1
##  5     0.1 ppm           2
##  6     0.1 ppm           3
##  7     0.2 ppm           1
##  8     0.2 ppm           2
##  9     0.2 ppm           3
## 10     0.5 ppm           1
## 11     0.5 ppm           2
## 12     0.5 ppm           3
## 13     1   ppm           1
## 14     1   ppm           2
## 15     1   ppm           3
# Getting columns containing the character "p"
select(FAES, contains("p"))
## # A tibble: 15 × 2
##    type     replicate
##    <chr>        <dbl>
##  1 blank            1
##  2 blank            2
##  3 blank            3
##  4 standard         1
##  5 standard         2
##  6 standard         3
##  7 standard         1
##  8 standard         2
##  9 standard         3
## 10 standard         1
## 11 standard         2
## 12 standard         3
## 13 standard         1
## 14 standard         2
## 15 standard         3

12.4 Deleting Columns or Rows

While the process of selecting and filtering data is pivotal in data analysis, there are instances when you may need to remove specific columns or rows entirely. This is especially useful when you’re dealing with redundant or irrelevant data that might clutter your analysis.

12.4.1 Deleting columns

To delete a column, you can use the select() function with the - sign before the column name you want to remove:

# This will remove the 'signal' column from the FAES dataset
head(select(FAES, -signal))
## # A tibble: 6 × 5
##    ...1 type     conc_Na units replicate
##   <dbl> <chr>      <dbl> <chr>     <dbl>
## 1     1 blank        0   ppm           1
## 2     2 blank        0   ppm           2
## 3     3 blank        0   ppm           3
## 4     4 standard     0.1 ppm           1
## 5     5 standard     0.1 ppm           2
## 6     6 standard     0.1 ppm           3

Multiple columns can be deleted by providing more column names after the - sign:

# Deleting both 'signal' and 'replicate' columns from the FAES dataset
head(select(FAES, -c(signal, replicate)))
## # A tibble: 6 × 4
##    ...1 type     conc_Na units
##   <dbl> <chr>      <dbl> <chr>
## 1     1 blank        0   ppm  
## 2     2 blank        0   ppm  
## 3     3 blank        0   ppm  
## 4     4 standard     0.1 ppm  
## 5     5 standard     0.1 ppm  
## 6     6 standard     0.1 ppm

12.4.2 Deleting rows

To delete rows, the filter() function can be used in conjunction with relational or logical conditions that define the rows you wish to exclude:

# This will remove rows where 'signal' values are less than 20000
filter(FAES, !(signal < 20000))
## # A tibble: 6 × 6
##    ...1 type     conc_Na units replicate signal
##   <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>
## 1    10 standard     0.5 ppm           1 20187.
## 2    11 standard     0.5 ppm           2 20141.
## 3    12 standard     0.5 ppm           3 20153.
## 4    13 standard     1   ppm           1 30798.
## 5    14 standard     1   ppm           2 30837.
## 6    15 standard     1   ppm           3 30790.

The key here is the use of the ! (NOT) operator which excludes rows that meet the specified condition.

12.5 Adding new variables

mutate() allows you to add new variables (read: columns) to your existing data set. It’ll probably be the workhorse function of your data transformation, as you can readily pass other functions and mathematical operators to it to transform your data. Let’s suppose that our standards were diluted by a factor of 10; we can add a new column dil_fct for this:

mutate(FAES, dil_fct = 10)
## # A tibble: 15 × 7
##     ...1 type     conc_Na units replicate signal dil_fct
##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>   <dbl>
##  1     1 blank        0   ppm           1   502.      10
##  2     2 blank        0   ppm           2   592.      10
##  3     3 blank        0   ppm           3   581.      10
##  4     4 standard     0.1 ppm           1  5656.      10
##  5     5 standard     0.1 ppm           2  5654.      10
##  6     6 standard     0.1 ppm           3  5667.      10
##  7     7 standard     0.2 ppm           1  9393.      10
##  8     8 standard     0.2 ppm           2  9363.      10
##  9     9 standard     0.2 ppm           3  9332.      10
## 10    10 standard     0.5 ppm           1 20187.      10
## 11    11 standard     0.5 ppm           2 20141.      10
## 12    12 standard     0.5 ppm           3 20153.      10
## 13    13 standard     1   ppm           1 30798.      10
## 14    14 standard     1   ppm           2 30837.      10
## 15    15 standard     1   ppm           3 30790.      10

We can also create multiple columns in the same mutate() call:

mutate(FAES, 
       dil_fct = 10, 
       adj_signal = signal * dil_fct)
## # A tibble: 15 × 8
##     ...1 type     conc_Na units replicate signal dil_fct adj_signal
##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl>   <dbl>      <dbl>
##  1     1 blank        0   ppm           1   502.      10      5023.
##  2     2 blank        0   ppm           2   592.      10      5918.
##  3     3 blank        0   ppm           3   581.      10      5815.
##  4     4 standard     0.1 ppm           1  5656.      10     56563.
##  5     5 standard     0.1 ppm           2  5654.      10     56536.
##  6     6 standard     0.1 ppm           3  5667.      10     56674.
##  7     7 standard     0.2 ppm           1  9393.      10     93934.
##  8     8 standard     0.2 ppm           2  9363.      10     93627.
##  9     9 standard     0.2 ppm           3  9332.      10     93320.
## 10    10 standard     0.5 ppm           1 20187.      10    201869.
## 11    11 standard     0.5 ppm           2 20141.      10    201405.
## 12    12 standard     0.5 ppm           3 20153.      10    201530.
## 13    13 standard     1   ppm           1 30798.      10    307977.
## 14    14 standard     1   ppm           2 30837.      10    308365.
## 15    15 standard     1   ppm           3 30790.      10    307898.

A couple of things to note:

  1. Quotation marks are generally optional when creating a new variable in mutate(), but they become necessary if the variable name contains spaces, special characters, or starts with a number. For example, "dil_fct", dil_fct, and dil_fct1 are all valid, but if you had a variable name like "dil fct", "dil-fct", or "2nd_fct", the quotes would be required.
  2. The variables we’re referencing do not need to be in quotation marks; signal is unquoted because this variable already exists.
  3. Note the order of precedence: dil_fct is created first so we can reference it in the second column being added; we would get an error if we swapped the order.
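The order-of-precedence point is easy to demonstrate with a small hypothetical tibble: the first call works because dil_fct exists by the time it is referenced, while swapping the order would fail.

```r
library(dplyr)

# Hypothetical data illustrating order of precedence within mutate()
df <- tibble(signal = c(100, 200))

# Works: dil_fct exists by the time adj_signal references it
mutate(df, dil_fct = 10, adj_signal = signal * dil_fct)

# Errors: adj_signal would reference dil_fct before it is created
# mutate(df, adj_signal = signal * dil_fct, dil_fct = 10)
```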

12.5.1 Mutate with a condition

In data analysis, there are often scenarios where we want to categorize or re-label values based on certain conditions. The case_when() function offers a versatile and readable solution for handling these multiple conditions.

The syntax for case_when() is straightforward: for each condition, you specify the logical test followed by the tilde (~) operator, and then the value or expression to return if the condition is TRUE. A .default value can be provided for cases when none of the conditions are TRUE.

With our FAES data, say you want to label each conc_Na as “Low”, “Medium”, or “High” based on its value. You can use case_when() within mutate() as follows:

mutate(FAES, 
       conc_Na_level = case_when(
         conc_Na < 0.2 ~ "Low",
         conc_Na < 0.4 ~ "Medium",
         .default = "High"))
## # A tibble: 15 × 7
##     ...1 type     conc_Na units replicate signal conc_Na_level
##    <dbl> <chr>      <dbl> <chr>     <dbl>  <dbl> <chr>        
##  1     1 blank        0   ppm           1   502. Low          
##  2     2 blank        0   ppm           2   592. Low          
##  3     3 blank        0   ppm           3   581. Low          
##  4     4 standard     0.1 ppm           1  5656. Low          
##  5     5 standard     0.1 ppm           2  5654. Low          
##  6     6 standard     0.1 ppm           3  5667. Low          
##  7     7 standard     0.2 ppm           1  9393. Medium       
##  8     8 standard     0.2 ppm           2  9363. Medium       
##  9     9 standard     0.2 ppm           3  9332. Medium       
## 10    10 standard     0.5 ppm           1 20187. High         
## 11    11 standard     0.5 ppm           2 20141. High         
## 12    12 standard     0.5 ppm           3 20153. High         
## 13    13 standard     1   ppm           1 30798. High         
## 14    14 standard     1   ppm           2 30837. High         
## 15    15 standard     1   ppm           3 30790. High

For those interested in exploring further, there’s a similar function called ifelse() which provides conditional transformations in R. You can learn more about it in the R documentation (enter ?ifelse in the console).
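As a quick sketch with hypothetical concentrations: ifelse() handles a single two-way split, whereas case_when() scales more cleanly to several conditions.

```r
library(dplyr)

# Hypothetical concentrations for a simple two-way split
df <- tibble(conc = c(0, 0.1, 0.5))

mutate(df, level = ifelse(conc < 0.2, "Low", "High"))
```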

12.5.2 Useful mutate functions

There are a myriad of functions you can make use of with mutate. Here are some of the mathematical operators available in R:

Operator or Function Definition
+ addition
- subtraction
* multiplication
/ division
^ exponent; to the power of…
log() returns the specified base-log; see also log10() and log2()
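These operators and functions can all be combined inside a single mutate() call. A minimal sketch with hypothetical signal values:

```r
library(dplyr)

# Hypothetical signal values spanning several orders of magnitude
df <- tibble(signal = c(10, 100, 1000))

mutate(df,
       log10_signal = log10(signal),  # base-10 logarithm
       half_signal  = signal / 2,     # division
       signal_sq    = signal^2)       # exponent
```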

12.6 Group and summarize data

summarize() effectively summarizes your data based on the functions you’ve passed to it. Looking at our FAES data, we might want the mean and standard deviation of the triplicate signals. Let’s see what happens when we apply the summarize function straight up:

summarise(FAES, mean = mean(signal), stdDev = sd(signal))
## # A tibble: 1 × 2
##     mean stdDev
##    <dbl>  <dbl>
## 1 13310. 11242.

This doesn’t look like what we wanted. What we got was the mean and standard deviation of all of the signals, regardless of the concentration of the standard. Also note how we’ve lost the other columns/variables and are only left with the mean and stdDev. This is all because we need to group our observations by a variable. We can do this by using the group_by() function.

groupedFAES <- group_by(FAES, type, conc_Na)
summarise(groupedFAES, mean = mean(signal), stdDev = sd(signal))
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 4
## # Groups:   type [2]
##   type     conc_Na   mean stdDev
##   <chr>      <dbl>  <dbl>  <dbl>
## 1 blank        0     559.  48.9 
## 2 standard     0.1  5659.   7.34
## 3 standard     0.2  9363.  30.7 
## 4 standard     0.5 20160.  24.0 
## 5 standard     1   30808.  25.0

Here we’ve created a new data set, groupedFAES, that we grouped by the variables type and conc_Na so we could get the mean and standard deviation of each group. Note the multiple levels of grouping. Depending on your dataset and the analysis you’re performing, you’ll need to decide how to group your data: the more variables you use, the smaller each group will be.

12.6.1 Useful summarize functions

We’ve used the mean() and sd() functions above, but there are a host of other useful functions you can use in conjunction with summarize. See Useful Functions in the summarise() documentation (enter ?summarise in the console). This is also discussed in more depth in the Summarizing Data chapter.
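For instance, n(), min(), and max() are common companions to mean() and sd(). A minimal sketch with hypothetical grouped data:

```r
library(dplyr)

# Hypothetical grouped data
df <- tibble(group = c("a", "a", "b", "b"),
             value = c(1, 3, 2, 8))

grouped <- group_by(df, group)
summarise(grouped,
          n     = n(),          # observations per group
          least = min(value),
          most  = max(value))
```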

12.7 The Pipe: Chaining Functions Together

Piping is a concept that allows you to chain functions together in a way that simplifies and clarifies your code. At its core, piping is similar to function composition in mathematics, where the output of one function becomes the input to the next. This helps you build complex operations in a readable and logical sequence.

12.7.1 Function Composition Example

Consider a mathematical example of function composition:

\[ f(g(x)) \]

Here, the function \(g(x)\) is applied first, and its output is then passed as the input to the function \(f(x)\). In programming, this concept can be translated to chaining functions together.

12.7.2 Abstract Example of Piping

The pipe operator %>% is an incredibly useful tool for writing more legible and understandable code. The pipe changes how you read code to emphasize the functions you’re working with, passing each intermediate result along to the next function behind the scenes.

Now, consider the following abstract example of piping in R:

result <- data %>%
  step1() %>%
  step2() %>%
  step3()

Here’s what’s happening:

  • step1(): Takes data as its input.
  • step2(): Takes the result of step1() as its input.
  • step3(): Takes the result of step2() as its input.

This sequence of operations could be written without pipes by nesting function calls, but the use of pipes makes the flow of data more explicit and easier to read.
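A minimal concrete sketch of the two styles, using hypothetical arithmetic steps rather than real data functions:

```r
library(dplyr)  # loads the %>% pipe (re-exported from magrittr)

x <- -20.25

# Nested form: read inside-out, innermost call first
nested <- round(sqrt(abs(x)), 1)

# Piped form: read left to right, one step at a time
piped <- x %>% abs() %>% sqrt() %>% round(1)

nested == piped  # TRUE; both evaluate to 4.5
```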

12.7.3 Simple Examples of Piping

Let’s start with a simple, single use of piping:

FAES %>% nrow()

In this case, the %>% operator pipes the FAES dataset directly into the nrow() function, which returns the number of rows in the dataset. This is functionally equivalent to:

nrow(FAES)

Now, let’s consider an example where the function takes an additional argument:

meanBlank <- FAES %>%
  filter(type == "blank")

This is functionally equivalent to:

meanBlank <- filter(FAES, type == "blank")

In both cases, piping might seem redundant because it only removes the need to specify the first argument explicitly.

12.7.4 Piping in Practice

With the tools presented in this chapter, we could do a decent job analyzing our FAES data without using the pipe operator. Let’s say we wanted to subtract the mean of the blank from each standard signal and then summarize those results. It would look something like this:

blank <- filter(FAES, type == "blank")
meanBlank <- summarize(blank, mean(signal))
meanBlank <- as.numeric(meanBlank)

paste("The mean signal from the blank triplicate is:", meanBlank)
## [1] "The mean signal from the blank triplicate is: 558.5249"
stds_1 <- filter(FAES, type == "standard")
stds_2 <- mutate(stds_1, cor_sig = signal - meanBlank)
stds_3 <- group_by(stds_2, conc_Na)
stds_4 <- summarize(stds_3, mean = mean(cor_sig), stdDev = sd(cor_sig))
stds_4
## # A tibble: 4 × 3
##   conc_Na   mean stdDev
##     <dbl>  <dbl>  <dbl>
## 1     0.1  5101.   7.34
## 2     0.2  8804.  30.7 
## 3     0.5 19602.  24.0 
## 4     1   30249.  25.0

If we use pipes, we can make this code much more legible and easier to understand. The code with pipes would look like this:

meanBlank <- FAES %>%
  filter(type == "blank") %>%
  summarise(mean(signal)) %>%
  as.numeric()

paste("The mean signal from the blank triplicate is:", meanBlank)
## [1] "The mean signal from the blank triplicate is: 558.5249"
stds <- FAES %>%
  filter(type == "standard") %>%
  mutate(cor_sig = signal - meanBlank) %>% 
  group_by(conc_Na) %>%
  summarize(mean = mean(cor_sig), stdDev = sd(cor_sig))

stds
## # A tibble: 4 × 3
##   conc_Na   mean stdDev
##     <dbl>  <dbl>  <dbl>
## 1     0.1  5101.   7.34
## 2     0.2  8804.  30.7 
## 3     0.5 19602.  24.0 
## 4     1   30249.  25.0

While the initial code did its job, it certainly wasn’t easy to type and certainly not easy to read. At every step of the way we’ve saved our updated data outputs to a new variable (stds_1, stds_2, etc.). However, most of these intermediates aren’t important, and moreover the repetitive names clutter our code. As the code above is written, we’ve had to pay special attention to the variable suffix to make sure we’re calling the correct data set as our code progresses. An alternative would be to reassign the outputs back to the original variable name (i.e. stds_1 <- mutate(stds_1, ...)), but that doesn’t solve the issue of readability, as there’s still redundant assigning.

Things may look a bit different, but our underlying code hasn’t changed much. What’s happening is that the pipe operator passes the output to the first argument of the next function. So the output of filter(...) is passed to the first argument of summarise(...), and the argument we specified in summarise() is actually the second argument it receives. You’re probably wondering how hiding stuff makes your code more legible, but think of %>% as being equivalent to “then”. We can read our code as:

“Take the FAES dataset, then filter for type == "blank" then collapse the dataset to the mean signal value and then convert to numeric value then pass this final output to the new variable meanBlank.”

Not only is the pipe less typing, but the emphasis is on the functions so you can better understand what you’re doing vs. where all the intermediate values are going.

12.7.5 Notes on piping

The pipe is great, but it does have some limitations:

  • You can’t easily extract intermediate steps, so you’ll need to break up your piping chain to output any intermediate results you need.
  • The benefit of piping is legibility; this fades as the number of steps grows and you lose track of what’s going on. Keep pipes short and thematically similar.
  • Pipes are linear; if you have multiple inputs or outputs, you should consider an alternative approach.

12.8 Further reading

  • Chapter 5: Data Transformation of R for Data Science for a deeper breakdown of dplyr and its functionality.
  • Chapter 18: Pipes of R for Data Science for more information on pipes.
  • Syntax equivalents: base R vs Tidyverse by Hugo Tavares for a comparison of base-R solutions to the tidyverse. This entire book is largely biased towards tidyverse solutions, but there’s no denying that certain base-R approaches can be more elegant. Check out this write-up to get a better idea.

12.9 Exercise

There is a set of exercises available for this chapter!

Not sure how to access and work on the exercise Rmd files?

  • Refer to Running Tests for Your Exercises for step-by-step instructions on accessing the exercises and working within the UofT JupyterHub’s RStudio environment.

  • Alternatively, if you’d like to simply access the individual files, you can download them directly from this repository.

Always remember to save your progress regularly and consult the textbook’s guidelines for submitting your completed exercises.