Chapter 10 Importing Your Data Into R
Unlike Excel, you can’t copy and paste your data into R (or RStudio). Instead you need to import your data into R so you can work with it. This chapter will discuss how your data is stored, and how to import it into R (with some accompanying nuances).
10.1 csv files
While there are myriad ways to store data, and instruments often record results in a proprietary vendor format, the data you’re likely to encounter in an undergraduate lab will be in the form of a csv, or comma-separated values, file. As the name implies, values are separated by commas (go ahead and open any csv file in any text editor to observe this). Essentially you can think of each line as a row and of the commas as separating values into columns, which is exactly how R and Excel handle csv files.
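To make the row-and-comma structure concrete, here is a minimal sketch that writes a three-line csv to a temporary file and reads it back as plain text. The file name example.csv and the variable names are assumptions made for illustration; the values are borrowed from the first rows shown later in this chapter.

```r
# Hypothetical three-line csv written to a temporary file
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c(
  "wavenumber,sample,absorbance",
  "550.0952,EPDM,0.2119556",
  "550.0952,Polystyrene,0.07463058"
), csv_path)

# Read the file back as plain text: each element is one line (row) of the file
raw_lines <- readLines(csv_path)
print(raw_lines)

# Splitting a line on commas recovers the individual column values
second_row <- strsplit(raw_lines[2], ",")[[1]]
print(second_row)
```

This is only meant to show what a text editor would display; in practice you would not split lines by hand.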
10.2 read_csv
Importing a csv file into R simply requires the read_csv tidyverse function. The first input to this function is the most important, as it’s the file path. Recall that R, unless told otherwise, uses relative referencing. So in the example below we’re importing ATR_plastics.csv from the data subfolder in our project by specifying "data/ATR_plastics.csv" and assigning it to the variable atr_plastics.
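As a minimal sketch of that call, the snippet below first writes a tiny stand-in file to a temporary folder so that it can run anywhere. The demo_path name and the three-row file are assumptions for illustration only; in the chapter’s project you would pass "data/ATR_plastics.csv" directly, and the message printed below comes from that full file rather than from this stand-in.

```r
library(readr)

# Hypothetical stand-in for data/ATR_plastics.csv (only three rows)
demo_path <- file.path(tempdir(), "ATR_plastics.csv")
writeLines(c(
  "wavenumber,sample,absorbance",
  "550.0952,EPDM,0.2119556",
  "550.0952,Polystyrene,0.07463058",
  "550.5773,EPDM,0.2124079"
), demo_path)

# In the chapter's project this would be read_csv("data/ATR_plastics.csv")
atr_plastics <- read_csv(demo_path)
```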
## Rows: 28628 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): sample
## dbl (2): wavenumber, absorbance
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A benefit of using read_csv is that it prints out the column specification, listing each column’s name (how you’ll reference it in code) and its value type. Columns can have different data types, but the data type must be consistent within any given column. Checking the column specification is a good way to ensure R is reading your data correctly. The most common data types are:
- int for integer values (-1, 1, 2, 10, etc.)
- dbl for doubles (decimals) or real numbers (-1.20, 0.0, 1.200, 1e7, etc.)
- chr for character vectors or strings (“A”, “chemical”, “Howdy ma’am”, etc.)
- lgl for logical values, either TRUE or FALSE
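The four types listed above can be seen side by side in a small hand-built tibble. This is a sketch with made-up column names and values (none of them come from the ATR data); printing the tibble shows the same int, dbl, chr and lgl abbreviations under each column name.

```r
library(tibble)

# Hypothetical columns, one per data type
types_demo <- tibble(
  count    = c(1L, 2L, 10L),            # int: the L suffix marks integers
  mass     = c(-1.20, 0.0, 1e7),        # dbl: decimals / real numbers
  label    = c("A", "chemical", "B"),   # chr: character strings
  detected = c(TRUE, FALSE, TRUE)       # lgl: logical values
)

print(types_demo)

# typeof() reports the underlying type of each column
column_types <- sapply(types_demo, typeof)
print(column_types)
```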
We can inspect this dataset either through the Environment pane or with the head() function.
## # A tibble: 6 × 3
## wavenumber sample absorbance
## <dbl> <chr> <dbl>
## 1 550. EPDM 0.212
## 2 550. Polystyrene 0.0746
## 3 550. Polyethylene 0.000873
## 4 550. Sample: Shopping bag 0.0236
## 5 551. EPDM 0.212
## 6 551. Polystyrene 0.0746
As you can see, the head() function, by default, shows the first six rows of the data frame. If you want to inspect more or fewer rows, you can provide an optional n argument, like head(data, n = 10). Note the column types shown under each column name.
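A short sketch of head() with and without the n argument is given here. The demo tibble is a made-up 20-row stand-in, not the ATR data, so that the snippet runs on its own.

```r
library(tibble)

# Hypothetical 20-row stand-in data
demo <- tibble(id = 1:20, value = (1:20) / 10)

default_rows <- head(demo)          # first six rows by default
ten_rows     <- head(demo, n = 10)  # first ten rows
three_rows   <- head(demo, n = 3)   # fewer rows work the same way

print(three_rows)
```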
Also note how the first line of ATR_plastics.csv has been interpreted as column names (or headers) by R. This is common practice, and gives you a handle by which you can manipulate your data. If you did not intend for R to interpret the first row as headers, you can suppress this with the additional argument col_names = FALSE.
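The effect of col_names = FALSE can be sketched on a small stand-in file. The demo_path name and the shortened file are assumptions for illustration; with the chapter’s data the path would be "data/ATR_plastics.csv".

```r
library(readr)

# Hypothetical stand-in file with a header line and three data rows
demo_path <- file.path(tempdir(), "ATR_plastics_no_header_demo.csv")
writeLines(c(
  "wavenumber,sample,absorbance",
  "550.0952,EPDM,0.2119556",
  "550.0952,Polystyrene,0.07463058",
  "550.5773,EPDM,0.2124079"
), demo_path)

# The header line is now treated as the first row of data
no_headers <- read_csv(demo_path, col_names = FALSE, show_col_types = FALSE)
print(no_headers)
```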
## Rows: 28629 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): X1, X2, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 wavenumber sample absorbance
## 2 550.0952 EPDM 0.2119556
## 3 550.0952 Polystyrene 0.07463058
## 4 550.0952 Polyethylene 0.000873196
## 5 550.0952 Sample: Shopping bag 0.02364882
## 6 550.5773 EPDM 0.2124079
Note in the example above that since the headers are now considered data, and are composed of strings of characters, each entire column is interpreted as character values. This will happen if even a single non-numeric character is introduced in a column, so beware of typos when recording data! If we want to skip rows (e.g. to avoid unwanted rows at the top of our csv file), we can use the skip = <n> argument to skip the first <n> rows:
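A minimal sketch of skipping the first line of a file follows, again using a small hypothetical stand-in (demo_path and its contents are assumptions) rather than the full ATR_plastics.csv.

```r
library(readr)

# Hypothetical stand-in file with a header line and three data rows
demo_path <- file.path(tempdir(), "ATR_plastics_skip_demo.csv")
writeLines(c(
  "wavenumber,sample,absorbance",
  "550.0952,EPDM,0.2119556",
  "550.0952,Polystyrene,0.07463058",
  "550.5773,EPDM,0.2124079"
), demo_path)

# Skip the first line (the headers) and do not treat the next line as headers
skipped <- read_csv(demo_path, col_names = FALSE, skip = 1, show_col_types = FALSE)
print(skipped)
```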
## Rows: 28628 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X2
## dbl (2): X1, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 3
## X1 X2 X3
## <dbl> <chr> <dbl>
## 1 550. EPDM 0.212
## 2 550. Polystyrene 0.0746
## 3 550. Polyethylene 0.000873
## 4 550. Sample: Shopping bag 0.0236
## 5 551. EPDM 0.212
## 6 551. Polystyrene 0.0746
Note in the example above that we skipped our headers, so read_csv() created placeholder headers (X1, X2, etc.).
Another useful function to inspect data is tail(), which displays the last six rows of a data frame. Similarly, it accepts an optional n argument to specify the number of rows you want to view.
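For instance, on a made-up 20-row tibble (a stand-in for illustration, not the ATR data), tail() might be used like this:

```r
library(tibble)

# Hypothetical 20-row stand-in data
demo <- tibble(id = 1:20, value = (1:20) / 10)

last_six <- tail(demo)         # last six rows by default
last_two <- tail(demo, n = 2)  # only the last two rows

print(last_two)
```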
## # A tibble: 6 × 3
## wavenumber sample absorbance
## <dbl> <chr> <dbl>
## 1 4000. Polyethylene 0.000125
## 2 4000. Sample: Shopping bag 0.0664
## 3 4000. EPDM 0.113
## 4 4000. Polystyrene 0.0706
## 5 4000. Polyethylene 0.000195
## 6 4000. Sample: Shopping bag 0.0663
10.2.1 Tibbles vs. data frames
Quick eyes will notice the first line of the output above is # A tibble: 6 × 3. Tibbles are a variation of the data frames introduced in Data Frames, but built specifically for the tidyverse. While data frames and tibbles are often interchangeable, it’s important to be aware of the difference in case you do run into a rare conflict. In these situations you can readily transform a tibble into a data frame by coercion with the as.data.frame() function, and vice versa with the as_tibble() function.
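The round trip between the two can be sketched as follows. The small tbl object is a hand-built stand-in with values taken from the rows shown earlier; its name and those of the other variables are assumptions for illustration.

```r
library(tibble)

# Small stand-in tibble
tbl <- tibble(
  sample     = c("EPDM", "Polystyrene"),
  absorbance = c(0.2119556, 0.07463058)
)

# Tibble -> plain data frame
df <- as.data.frame(tbl)
class(df)

# Data frame -> tibble
tbl_again <- as_tibble(df)
class(tbl_again)
```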
## [1] "data.frame"
10.3 Importing other data types
There are other functions to import different types of tabular data, all of which work like read_csv, such as read_tsv for tab-separated values files (.tsv) and read_excel and read_xlsx (from the readxl package, which you need to load separately) for Excel files.
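As a sketch of how closely these functions mirror read_csv, the snippet below writes a tiny tab-separated file to a temporary folder and imports it with read_tsv. The file name and contents are assumptions for illustration. The Excel call is shown only as a comment, with a hypothetical path, because it needs the readxl package and an actual workbook.

```r
library(readr)

# Hypothetical tab-separated stand-in file ("\t" is a tab character)
tsv_path <- file.path(tempdir(), "example.tsv")
writeLines(c(
  "wavenumber\tsample\tabsorbance",
  "550.0952\tEPDM\t0.2119556",
  "550.0952\tPolystyrene\t0.07463058"
), tsv_path)

# Same usage pattern as read_csv, only the delimiter differs
tsv_data <- read_tsv(tsv_path, show_col_types = FALSE)
print(tsv_data)

# For Excel files (hypothetical path; requires the readxl package):
# excel_data <- readxl::read_excel("data/my_workbook.xlsx")
```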
Warning: most Excel files have probably been formatted for legibility (e.g. merged cells), which can lead to errors when importing into R. If you plan on importing Excel files, it’s probably best to open them in Excel to remove any formatting, and then save them as .csv for smoother importing into R.
10.4 Saving data
As you progress with your analysis you may want to save intermediate or final datasets. This is readily accomplished using the write_csv() tidyverse function. Similar rules apply to how we used read_csv, but now the second argument specifies the save location and file name, while the first argument is the tibble/data.frame we’re saving. Note that R will not create a folder this way, so if you’re saving to a subfolder you’ll have to make sure it exists or create it yourself.
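A minimal sketch of saving a dataset, including creating the destination folder first, might look like the following. The results tibble, the output folder and the results.csv file name are all hypothetical, and a temporary directory is used so the snippet does not touch your project.

```r
library(readr)
library(tibble)

# Hypothetical dataset to save
results <- tibble(
  sample     = c("EPDM", "Polystyrene"),
  absorbance = c(0.2119556, 0.07463058)
)

# write_csv() will not create missing folders, so make sure the folder exists
out_dir <- file.path(tempdir(), "output")
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

# First argument: what to save; second argument: where to save it
out_path <- file.path(out_dir, "results.csv")
write_csv(results, out_path)

file.exists(out_path)
```

In your own project, the second argument would typically be a relative path such as a file inside an existing subfolder.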
10.5 Further Reading
See Chapters 10 and 11 of R for Data Science for some more details on tibbles and read_csv.
10.6 Exercise
There is a set of exercises available for this chapter!
Not sure how to access and work on the exercise Rmd files?
Refer to Running Tests for Your Exercises for step-by-step instructions on accessing the exercises and working within the UofT JupyterHub’s RStudio environment.
Alternatively, if you’d like to simply access the individual files, you can download them directly from this repository.
Always remember to save your progress regularly and consult the textbook’s guidelines for submitting your completed exercises.