Chapter 10 Importing Your Data Into R

Unlike Excel, you can’t copy and paste your data into R (or RStudio). Instead you need to import your data into R so you can work with it. This chapter will discuss how your data is stored, and how to import it into R (with some accompanying nuances).

10.1 csv files

Data can be stored in a myriad of ways, and instruments often record results in a proprietary vendor format. However, the data you’re likely to encounter in an undergraduate lab will be in the form of a csv, or comma-separated values, file. As the name implies, values are separated by commas (go ahead and open any csv file in any text editor to observe this). Essentially, you can think of each line as a row and of the commas as separating values into columns, which is exactly how R and Excel handle csv files.
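To make the "lines are rows, commas separate columns" idea concrete, here is a minimal sketch using only base R. The file contents are made-up values for illustration (they are not taken from ATR_plastics.csv), and the file is written to a temporary location so nothing in your project is touched.

```r
# Write a tiny csv to a temporary file (values are invented for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("sample,wavenumber,absorbance",
    "EPDM,550.1,0.212",
    "Polystyrene,550.1,0.075"),
  csv_path
)

# Read the file back as plain text: one string per line, commas still in place
raw_lines <- readLines(csv_path)
print(raw_lines)

# Splitting the first line on commas recovers the column names
header <- strsplit(raw_lines[1], ",")[[1]]
print(header)
```

This is the same view you would get by opening the file in a text editor; read_csv does the splitting and type conversion for you.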

10.2 read_csv

Importing a csv file into R simply requires the read_csv tidyverse function. Its first argument is the most important, as it’s the file path. Recall that, unless you specify otherwise, R interprets file paths relative to the current working directory. So in the example below we’re importing ATR_plastics.csv from the data subfolder of our project by specifying "data/ATR_plastics.csv", and assigning the result to the variable atr_plastics.

library(tidyverse)

atr_plastics <- read_csv("data/ATR_plastics.csv")
## Rows: 28628 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sample
## dbl (2): wavenumber, absorbance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
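If read_csv reports that it cannot find a file, the relative path is usually the culprit. The base R sketch below shows one way to check how a relative path resolves. It assumes nothing about your project, so file.exists() may well return FALSE if you run it outside the project folder.

```r
# Where is R currently looking? Relative paths are resolved from here.
getwd()

# Build the relative path used above; file.path() joins pieces with "/"
rel_path <- file.path("data", "ATR_plastics.csv")
rel_path

# TRUE only if the file exists relative to the working directory
file.exists(rel_path)

# The full path R would actually try to open
normalizePath(rel_path, mustWork = FALSE)
```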

A benefit of using read_csv is that it prints out the column specification, listing each column’s name (how you’ll reference it in code) and the type of value it holds. Different columns can have different data types, but the data type must be consistent within any given column. Checking the column specification is a good way to ensure R is reading your data correctly. The most common data types are:

  • int for integer values (-1, 1, 2, 10, etc.)
  • dbl for doubles (decimals) or real numbers (-1.20, 0.0, 1.200, 1e7, etc.)
  • chr for character vectors or strings (“A”, “chemical”, “Howdy ma’am”, etc.)
  • lgl for logical values, either TRUE or FALSE
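The message printed by read_csv suggests specifying the column types yourself. A minimal sketch of that is shown below, assuming a small made-up file with hypothetical column names (sample, replicate, absorbance, is_blank). It uses readr's cols() helpers through the col_types argument.

```r
library(tidyverse)

# A small made-up csv written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("sample,replicate,absorbance,is_blank",
    "A,1,0.212,FALSE",
    "B,2,0.075,TRUE"),
  csv_path
)

# Default behaviour: read_csv guesses the types (whole numbers become dbl)
guessed <- read_csv(csv_path, show_col_types = FALSE)

# Explicit behaviour: state the type expected for each column
typed <- read_csv(
  csv_path,
  col_types = cols(
    sample     = col_character(),
    replicate  = col_integer(),
    absorbance = col_double(),
    is_blank   = col_logical()
  )
)

# Compare the resulting classes column by column
sapply(guessed, class)
sapply(typed, class)
```

Stating the types up front means a column that fails to parse is flagged at import time, rather than silently changing type.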

We can inspect this dataset either through the Environment pane or with the head() function.

head(atr_plastics)
## # A tibble: 6 × 3
##   wavenumber sample               absorbance
##        <dbl> <chr>                     <dbl>
## 1       550. EPDM                   0.212   
## 2       550. Polystyrene            0.0746  
## 3       550. Polyethylene           0.000873
## 4       550. Sample: Shopping bag   0.0236  
## 5       551. EPDM                   0.212   
## 6       551. Polystyrene            0.0746

As you can see, the head() function, by default, shows the first six rows of the data frame. If you want to inspect more or fewer rows, you can provide the optional n argument, as in head(atr_plastics, n = 10). Note the column type shown beneath each column name.

Also note how the first line of ATR_plastics.csv has been interpreted as column names (or headers) by R. This is common practice, and gives you a handle by which you can manipulate your data. If you did not intend for R to interpret the first row as headers, you can suppress this with the additional argument col_names = FALSE.

head(read_csv("data/ATR_plastics.csv", col_names = FALSE))
## Rows: 28629 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 3
##   X1         X2                   X3         
##   <chr>      <chr>                <chr>      
## 1 wavenumber sample               absorbance 
## 2 550.0952   EPDM                 0.2119556  
## 3 550.0952   Polystyrene          0.07463058 
## 4 550.0952   Polyethylene         0.000873196
## 5 550.0952   Sample: Shopping bag 0.02364882 
## 6 550.5773   EPDM                 0.2124079

Note in the example above that, since the headers are now treated as data and are composed of characters, each entire column is interpreted as character values. The same thing happens if even a single non-numeric entry is introduced into an otherwise numeric column, so beware of typos when recording data! If we want to skip rows (e.g. to avoid notes or instrument metadata at the top of our csv file), we can use the skip = <n> argument to skip the first <n> rows:

head(read_csv("data/ATR_plastics.csv", col_names = FALSE, skip = 1))
## Rows: 28628 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X2
## dbl (2): X1, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 3
##      X1 X2                         X3
##   <dbl> <chr>                   <dbl>
## 1  550. EPDM                 0.212   
## 2  550. Polystyrene          0.0746  
## 3  550. Polyethylene         0.000873
## 4  550. Sample: Shopping bag 0.0236  
## 5  551. EPDM                 0.212   
## 6  551. Polystyrene          0.0746

Note in the example above that we skipped our headers, so read_csv() created placeholder headers (X1, X2, etc.).
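The two points above (a single typo turning a numeric column into characters, and skip for leading lines) can be sketched together. The file below is entirely made up: the first line stands in for hypothetical instrument metadata, and one absorbance value contains a deliberate typo.

```r
library(tidyverse)

csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Exported by hypothetical instrument software",
    "wavenumber,absorbance",
    "550.1,0.212",
    "550.6,0.2l3",   # typo: lowercase letter l instead of the digit 1
    "551.1,0.214"),
  csv_path
)

# skip = 1 drops the metadata line so the real header row is used
spectrum <- read_csv(csv_path, skip = 1, show_col_types = FALSE)

# wavenumber is numeric, but the typo forces absorbance to character
sapply(spectrum, class)

# Locate the offending entries: values that cannot be converted to numbers
bad <- is.na(suppressWarnings(as.numeric(spectrum$absorbance)))
spectrum[bad, ]
```

Once the typo is found, it is usually best to correct the source file and re-import rather than patching the value in R.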

Another useful function to inspect data is tail(), which displays the last six rows of a data frame. Similarly, it accepts an optional n argument to specify the number of rows you want to view.

tail(atr_plastics)
## # A tibble: 6 × 3
##   wavenumber sample               absorbance
##        <dbl> <chr>                     <dbl>
## 1      4000. Polyethylene           0.000125
## 2      4000. Sample: Shopping bag   0.0664  
## 3      4000. EPDM                   0.113   
## 4      4000. Polystyrene            0.0706  
## 5      4000. Polyethylene           0.000195
## 6      4000. Sample: Shopping bag   0.0663

10.2.1 Tibbles vs. data frames

Quick eyes will notice that the first line of the output above is # A tibble: 6 × 3. Tibbles are a variation of the data frames introduced in Data Frames, but built specifically for the tidyverse. While data frames and tibbles are often interchangeable, it’s important to be aware of the difference in case you do run into a rare conflict. In these situations you can readily transform a tibble into a data frame by coercion with the as.data.frame() function, and vice versa with the as_tibble() function.

class(as.data.frame(atr_plastics))
## [1] "data.frame"
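One place where the two behave differently is single-column subsetting with square brackets. The sketch below uses a small made-up data frame (not the atr_plastics data) to show the conversion in both directions and that difference.

```r
library(tidyverse)

# A small made-up data frame and its tibble equivalent
df <- data.frame(wavenumber = c(550.1, 550.6), absorbance = c(0.212, 0.213))
tb <- as_tibble(df)

# A tibble is still a data frame, with extra classes layered on top
class(tb)

# Selecting one column with [ , ] simplifies a data frame to a plain vector...
df_col <- df[, "absorbance"]
is.numeric(df_col)

# ...whereas the same operation on a tibble returns a one-column tibble
tb_col <- tb[, "absorbance"]
is_tibble(tb_col)
```

Code written for base data frames that relies on the vector result is one source of the rare conflicts mentioned above; as.data.frame() restores the base behaviour.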

10.3 Importing other data types

There are other functions for importing different types of tabular data, all of which work much like read_csv. Examples include read_tsv for tab-separated values files (.tsv), and read_excel and read_xlsx for Excel files. The two Excel functions come from the readxl package, which is installed with the tidyverse but must be loaded separately with library(readxl).

Warning: most Excel files have probably been formatted for legibility (e.g. merged columns), which can lead to errors when importing into R. If you plan on importing Excel files, it’s probably best to open them in Excel to remove any formatting, and then save as .csv for smoother importing into R.
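As a sketch of the read_tsv case mentioned above, the example below writes a small made-up tab-separated file and imports it. It also shows read_delim, the more general readr function where you name the delimiter yourself. The file contents are illustrative only.

```r
library(tidyverse)

# A small made-up tab-separated file ("\t" is the tab character)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("sample\tabsorbance",
    "EPDM\t0.212",
    "Polystyrene\t0.075"),
  tsv_path
)

# read_tsv works like read_csv, but splits on tabs instead of commas
tsv_data <- read_tsv(tsv_path, show_col_types = FALSE)
tsv_data

# read_delim gives the same result when told the delimiter explicitly
same_data <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)
```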

10.4 Saving data

As you progress with your analysis you may want to save intermediate or final datasets. This is readily accomplished using the write_csv() tidyverse function. It follows the same path rules as read_csv, but here the first argument is the tibble or data frame we’re saving, and the second argument specifies the save location and file name. Note that R will not create a folder this way, so if you’re saving to a subfolder you’ll have to make sure it exists or create it yourself.

write_csv(atr_plastics, "data/ATRSaveExample.csv")
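Since write_csv will not create a missing folder, one option is to create it from R first. The sketch below does this with base R's dir.create(). The tibble, the folder name "output", and the file name are all hypothetical, and everything is written under a temporary directory so your project is left untouched.

```r
library(tidyverse)

# A small made-up tibble to save
results <- tibble(
  sample     = c("EPDM", "Polystyrene"),
  absorbance = c(0.212, 0.075)
)

# Create the destination folder only if it does not already exist
out_dir <- file.path(tempdir(), "output")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Save, then read the file back to confirm the contents survived
out_path <- file.path(out_dir, "results.csv")
write_csv(results, out_path)
round_trip <- read_csv(out_path, show_col_types = FALSE)
round_trip
```

In a real project you would replace file.path(tempdir(), "output") with a relative path such as "data", as in the example above.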

10.5 Further Reading

See Chapters 10 and 11 of R for Data Science for some more details on tibbles and read_csv.

10.6 Exercise

There is a set of exercises available for this chapter!

Not sure how to access and work on the exercise Rmd files?

  • Refer to Running Tests for Your Exercises for step-by-step instructions on accessing the exercises and working within the UofT JupyterHub’s RStudio environment.

  • Alternatively, if you’d like to simply access the individual files, you can download them directly from this repository.

Always remember to save your progress regularly and consult the textbook’s guidelines for submitting your completed exercises.