Section 5 Data Visualization
Visualizing data is one of the most powerful ways to explore relationships and trends between variables. Visualizations are also a powerful way to communicate these relationships. In this section we will discuss the creation of the most common type of plot in chemistry: the scatterplot. How to make changes to a plot, how to add and format axes labels and how to add trendlines.
The majority of this section uses the air quality datasets from the NAPS network as an example. If you received a dataset through your CHM135 lab section feel free to use it to follow along. If not, you can visit the Introduction page to download the example dataset used throughout this resource (note that you cannot use the example dataset for your analysis in CHM135).
5.1 Creating a Scatterplot
Many physical chemistry labs result in the collection of a series of data with an independent (time, volume…) and a dependent (absorption, pH, temperature…) variable. Comparing these variables often involves plotting the independent variable on the x-axis and the dependent variable on the y-axis as a scatterplot, which plots each data point independently on the grid space of the graph. This model of data analysis is where we will begin our analysis of the relationship between NO2, O3 and OX in the NAPS dataset by creating a timeseries where the concentrations of the chemicals (dependent variable) are plotted over time (independent variable).
To create a scatterplot, select the columns of data you want to visualize and then go to the INSERT ribbon and select CHARTS, then SCATTER, then SCATTER again (this selection may look like a button with a picture of a little scatterplot). Alternatively, you can select the INSERT tab, then CHARTS, then X Y (Scatter).
If you would like to make a plot with more than 1 dataset included, as in the GIF below, then you can select all of the columns in question and create the scatterplot as described. You can select an entire column by clicking on the column heading, and if you need to select columns that are not directly adjacent to one another hold the CTRL button in Windows or COMMAND on a Mac and click on the column headings you wish to select.
The first thing to do once you make a plot is to simply look at it. Are the correct values plotted on the correct axes? Are the values properly formatted? Here is the plot that resulted from the GIF above. What do you think?
Although this figure is a bit difficult to read, from the flow of the points it looks as though the date and time is plotted on the x-axis and concentration on the y-axis. If you haven’t removed all of the -999 values in your dataset you might see some outlier values. If this is the case the Data Cleaning section walks you through how to remove them.
5.2 Changing Chart Type or Design
As mentioned above, scatterplots are the most common type of plot in the chemistry lab as the data collected is discrete. However, as is often the case, rules were meant to be broken. Adding a continuous smooth line will make relationships between the variables much easier to visualize, and will also allow us to practice changing a chart type after a plot has been generated. To do this click on the plot, then select the CHART DESIGN ribbon, then CHANGE CHART TYPE, then X Y (Scatter), then SCATTER WITH SMOOTH LINES (note that the last menu often appears as pictures of plots and so choose the one with a smooth line). This is also the menu you would choose if you wanted to change the chart type to visualize your data in another way (e.g. bar chart).
5.3 Adding Axis Labels and a Chart Title
Once you have confirmed your graph is appropriate you will need to add axes titles. To do this, click on the plot, then click on the CHART DESIGN ribbon, then ADD CHART ELEMENT, then AXIS TITLES and add titles to both the x-axis (PRIMARY HORIZONTAL) and y-axis (PRIMARY VERTICAL). Once the axes titles are present, you can edit them to an appropriate label and remember to include units if relevant.
A chart title is typically added automatically, but if not, the ADD CHART ELEMENT menu is also where you can find the CHART TITLE menu. Once the title is present you can edit it. Note that the addition of subscript or superscript text to an Excel chart title or axis label will require that you select the text in question, right click on it, select FONT, and then select the change you want.
5.3.1 Notes on Chart Titles
For CHM135 you are asked for a chart title, but for lab reports and other scientific writing, you should not use chart titles, and instead include figure captions, which are placed immediately below the figure. Figure captions are standalone, and concise, blocks of text which include all of the information readers may need to properly understand the figure. They will specify any statistics shown on the plot, explain different markers, enumerate sample sizes, and define any abbreviations
5.4 Formatting Axis Labels
Looking at the graph it is obvious that the x-axis is not properly formatted. To format an axis, double click the axis values to reveal the FORMAT AXIS option panel that includes two tabs AXIS OPTIONS and TEXT OPTIONS. Here you have many options to explore. The timeseries dates are complicated as a simple adjustment of font size is not enough to make it easy to read. There are two ways to approach adjusting the date and time displayed on the x-axis:
- Adjust the spacing of the major and minor units in the AXIS OPTIONS tab
- Adjust how the text is displayed in the TEXT OPTIONS tab
The major and minor units of the x-axis can be adjusted in the AXIS OPTIONS tab of the FORMAT AXIS panel. In the plot we are working on, the major unit was set to 1 (meaning every day is displayed) and the minor unit to 0.2. In this case the major unit corresponds to a gridline and a value displayed on the x-axis, while the minor unit is not displayed. Try changing these to something larger so that the date values are more spaced out. In the GIF below the major unit was changed to 2 and then to 3 and the minor unit to 1. One of the consequences of changing the major and minor units of the x-axis is that it may change the bounds of the plot. To adjust this, the MINIMUM and MAXIMUM values of the axis can be adjusted in the BOUNDS menu of the AXIS OPTIONS tab to reflect the 7 days in your dataset. Remember that Excel stores date and time values using numerical values (see the Data Wrangling section for more details). To adjust the bounds, add or subtract integer values (i.e. 1, 2, 3) to the minimum and maximum until your data is properly framed. Notice how changes made in the MINIMUM and MAXIMUM values of the axis update as soon as you click elsewhere, this is useful when playing around to find the proper framing for the data.
To adjust the x-axis by changing the direction of the text, select the TEXT OPTIONS tab of the FORMAT AXIS panel, find the TEXT BOX options box and change the angle of the text either by selecting an option such as ROTATED 90o from the TEXT DIRECTION menu or changing the value in the CUSTOM ANGLE box. One final note about axis formatting is that you may want to change how the values themselves are formatted (for example you may only want to see the date values and not the time in your timeseries). Remember that the values displayed in the graph are tied to specific values in your Excel sheet, and so any adjustments made to the values in the sheet will be reflected in the graph.
5.5 Adding a Linear Regression Line
One way to explore the relationship between O3 and NO2 is to create a correlation plot, which is essentially a plot of two variables against one another to visualize how they relate to one another. To accomplish this, first create a scatterplot as described above, but this time the x-axis will correspond to the concentration of NO2 and the y-axis to the concentration O3. Then to assess if there is any correlation between the two pollutants, we’ll perform a linear least squares regression analysis. Linear refers to the fact that the relationship can be expressed as a straight line defined by the equation y = mx + b. Least squares means that the trendline is the best approximation of the given dataset, while still being a straight line. Lastly, the term regression means we’re going to estimate the relationship between two variables.
Excel calculates the linear least squares regression trendline by minimizing the vertical distance (squares) between every actual value and its predicted value on the trendline. For example, in the figure below, the two lines (black and green) are both linear, but the black line is a better trendline as the sum of the squared distances between the points (the length of the arrows) is less than the green line. It’s important to remember that the ideal trendline may be farther from individual points than others (it might not touch any points at all), but the summed distance is the shortest.
The first step to performing a linear regression in Excel is to add a linear trendline to your scatterplot. To do this right click on any of the data points in your plot and select ADD TRENDLINE from the dropdown menu. In the FORMAT TRENDLINE panel select LINEAR trendline, and then select DISPLAY EQUATION ON CHART and DISPLAY R-SQUARED VALUE ON CHART. Your plot should now resemble the plot in the GIF below (if interested, download the example dataset here, which was taken from Tyler Vigen’s collection of spurious Correlations).
You are likely very familiar with format of an equation for a linear line, but may be less familiar with interpreting an R2 value. The R2 value of a linear regression is a measure of how well the line predicts the variation observed in the data. It can be summarized by this very general relationship:
\[ \begin{align} R^2 = \frac{\text{variance explained by the model}}{\text{total variance}} \label{eq10}\tag{10} \end{align} \]
The takehome message of this equation is that the closer R2 is to 1, the better the regression line describes the relationship between the two variables (aka the closer the line is to every plotted point). The R2 of a trendline is a useful value, but it is only describing the variance in the data, and so a higher R2 does not prove the correlation between two variables just as a low R2 does not disprove it.
The data shown in the GIF above has a positive relationship. Between the years 2000 and 2009, the number of computer science PhDs graduating in the United States increased and so did the revenues from arcades. This is a positive correlation as the two variables increase or decrease together. This is evident visually and in the positive value of the slope of the trendline. Variables are negatively correlated when one increases the other decreases, in which case the slope of the trendline would be negative.
Does the observed correlation between arcade revenues and computer sciences PhDs mean one variable caused the other (causation)? Maybe, maybe not. It’s not possible to answer this question using this plot and the R2 value alone, but the correlation observed between the two variables suggests the relationship may warrant further investigation.
5.6 Adding a Dataseries to an Existing Plot
The timeseries generated in the Creating a Scatterplot subsection above had all of the data ready to plot, but that’s not always the case. If we pretend the timeseries was made before the OX values were calculated, we can then add the OX dataseries to the plot to demonstrate this workflow. In the GIF below the OX data series is deleted and then re-added.
To add a new dataseries to a plot right click on the plot and click SELECT DATA from the dropdown menu (or go to the CHART DESIGN ribbon and click SELECT DATA). In the SELECT DATA SOURCE dialog box you should see the two data series already on your plot for NO2 and O3. Directly above this list, click ADD in Windows or “+” directly below the list on a Mac and the EDIT SERIES dialog box will open. This dialog box includes three boxes entitled SERIES NAME:, SERIES X VALUES: and SERIES Y VALUES: (on Mac no new dialog box will open, but a data series will be added to the list and the boxes NAME:, X VALUES: and Y VALUES: will become available). To input information into these three boxes you can type directly into the box or click the arrow (on Mac this looks more like a little Excel sheet) to select data from your Excel sheet. The SERIES NAME: box is what your dataset will be referred to in the chart legend. You can type “Ox” directly in the box, or you can select the column header. For the SERIES X VALUES: and SERIES Y VALUES: select the relevant data in your sheet without including the column headers.
When you have multiple dataseries on the same plot you can format or add trendlines to each of these plots in the same way you would on a simpler plot with only one dataseries. For formatting options simply double click on the dataseries in question to open the FORMAT DATA SERIES dialog box. To add a trendline select the relevant dataseries right click and select ADD TRENDLINE from the dropdown menu. All of the options from the [Adding a Trendline] section will be available and apply only to the selected dataseries. As an example a trendline is added to the OX dataseries in the GIF below, this trendline is simply to demonstrate how to do it and is not a useful addition to this analysis.
5.7 Annotating a Plot
One final aspect of plotting in Excel that will be useful in the first year chemistry labs is the ability to annotate a graph. These annotations can be used to indicate interesting aspects of a plot and as a tool for data analysis. To annotate a plot on the desktop version of Excel, select the plot followed by the FORMAT ribbon that appears and then the INSERT SHAPE function and the shape of interest from the dropdown menu. The annotation tool is not available on the online version of Excel and so in this case the plot should be copied into PowerPoint and annotated there.