Chapter 16 Visualizations for Env Chem

Visualizations have always been an important part of data science and chemistry. Good graphics illuminate trends and patterns you may have otherwise missed and allow us to quickly inspect thousands of values. R via the ggplot2 package is one of, if not the premier, data visualization language available. This chapter will formally introduce the ggplot2 package, explain a bit of the logic undergirding its operation, and give you some quick examples of how it works. Afterwards we’ll delve deeper into specific visualizations you’ll use and encounter in your studies culminating in preparing your plots for publication.

We’ve already encountered and produced several types of plots to visualize our data. We’ve also gone over the theory and basic operations of ggplot() in the ggplot basic visualizations section. Now, we’ll expand on these and explicitly walk through the most common data visualization methods you’ll encounter in the field of environmental chemistry. Additionally, we’ll learn how to get your plots ready for publication.

The plots we’ll be covering include:

These are only a smattering of the possible data visualizations you can perform in R. We’re focusing on them because of their ubiquity in our field, but they often won’t be the ideal visualizations you need to communicate your story. We highly recommend you check out the following resources. Not only are they a great source of inspiration, they provide example code to get you up and running. We consult them regularly.

  • Data to viz which features a decision tree to help you decide on what plot would serve you best.
  • ggplot2 extensions gallery which is the best repository to the plethora of ggplot2() extensions. If you need a specialized plot, check here. Odds are someone has a solution to your problem. Some great extensions include ggrepel for easy labelling of points; ggpmisc for statistical annotations; and ggpubr for publication ready plots, group wise comparisons, and annotation of statistical significance.
  • The R Graph Gallery contains hundreds of charts made with R. While it’s not as easy to navigate as Data to viz, it does contain many more examples; it is definitely worth exploring.

16.1 Discrete vs. Continuous variables

The type of plots available to you, and how they display, are dependent on the type of data. Namely, whether your data is discrete (i.e. can only take particular values) or continuous (is not restricted to defined separate values, but can occupy any value over a continuous range). So a variable consisting of cities would be discrete, whereas a variable like concentration of a chemical would be continuous. You can treat numeric data as categorical if you so choose. Understanding the difference between discrete and continuous data will shape how you plot your data.

16.2 Prerequisites

Additionally, for this section we’ll mostly be using the atlNO2 and sumAtl datasets we created in the Summarizing data chapter. Now that we’ve seen how to turn data into a longer format, we’ll show you how atlNO2 was tidied from the raw (wide format) data. This atlNO2 is identical to the previous atlNO2 we saw.

atlNO2 <- read_csv("data/2018hourlyNO2_Atl_wide.csv", skip = 7, na =c("-999")) %>%
  rename_with(~tolower(gsub("/.*", "", .x))) %>%
  pivot_longer(cols = starts_with("h"), 
               names_prefix = "h", 
               names_to = "hour", 
               names_transform = list(hour = as.numeric),
               values_to = "conc", 
               values_transform = list(conc = as.numeric),
               values_drop_na = TRUE) 
## Rows: 8395 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Pollutant//Polluant, City//Ville, P/T//P/T
## dbl (28): NAPS ID//Identifiant SNPA, Latitude//Latitude, Longitude//Longitud...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sumAtl <- atlNO2 %>%
  group_by(p, city) %>%
  summarize(mean = mean(conc),
            sd = sd(conc),
            median = median(conc),
            min = min(conc),
            max = max(conc))
## `summarise()` has grouped output by 'p'. You can override using the `.groups`
## argument.