Section 4 Math, Stats, and Programming
Physical chemistry is a quantitative science, meaning we use mathematical operations to explain and explore chemical phenomena. This math can be done by hand, but once you have more than one or two data points this gets very tiring and a spreadsheet can be a lifesaver! In this section we will talk about how to program mathematical equations into Excel, how to use common mathematical and statistical functions, and how to reference other values within the spreadsheet. We will also dip our toes into some simple computer programming that can be very useful.
The first two subsection Mathematical Operations and Cell Referencing use the air quality datasets from the NAPS network as examples. If you received a dataset through your CHM135 lab section feel free to use it to follow along. If not, you can visit the Introduction page to download the example dataset used throughout this resource (note that you cannot use the example dataset for your analysis in CHM135).
4.1 Mathematical Operations
As described in the Introduction, atmospheric chemists have defined the term “odd oxygen” or OX as the sum of the concentrations of NO2 and O3. To add OX to your analysis you will need to calculate the concentration OX at each timepoint. To do this, simply add the cells containing O3 and NO2 for the first timepoint (both concentrations are in ppb so they can be added directly) then copy this formula down the column. When you add a formula into Excel remember to start with an equals sign (“=”), this will let Excel know you are writing a formula. If you are referencing a cell you can either write it in (e.g. “C2”) or you can click on the cell. Once your formula is complete, hit ENTER and Excel will perform the calculation.
Excel can perform many mathematical operations and the table below provides a list of some of the most common with their corresponding formula.
Input | Operator | Example… | … solves to | |
---|---|---|---|---|
= | Equal sign indicates a mathematical operation | |||
+ | Addition | =2+2 | 4 | |
- | Subtraction | =3–2 | 1 | |
* | Multiplication | =3*2 | 6 | |
/ | Division | =4/2 | 2 | |
^ | Exponent | =4^2 | 16 | |
( ) | Brackets to indicate order of operations | =(4+2)*2 | 12 | |
LOG(number,base) | Logarithm (10 is the default if no base is indicated) | =LOG(100,10) | 2 | |
LN(number) | Natural Logarithm | =LN(2) | 0.693… |
To copy the formula from one cell down the list of data points, you can use the copy and paste function in the edit menu or the keyboard shortcuts CONTROL+C to copy and CONTROL+V to paste in Windows or COMMAND+C and COMMAND+V on a Mac. You can also use a shortcut specific to Excel. Click on the cell with your formula and notice the little square in the lower righthand corner. Click on this square and drag the formula down the column, or, even better, double click on the square and Excel will automatically paste the formula down the column until the end of the series (note that it will stop at the first blank cell so if you are missing values you may need to copy and paste the formula the rest of the way). When you add a new value, make sure you remember to label your column.
If you are learning about mathematical operations in Excel for the first time we strongly suggest reading the next section on cell referencing.
4.2 Cell Referencing
One of Excel’s strengths is its ability to easily perform mathematical operations in larger datasets. To take advantage of this feature it is important to appreciate how Excel references values in the spreadsheet. Simply referencing a cell in Excel will add a relative reference, which means that the reference will move as the equation is copied and pasted to other cells. This is the type of reference created in the OX calculation in the section above. If you are unsure of the cells being referenced in an equation Excel will visualize the reference when you click on the cell. To exit the function hit escape (ESC) on your keyboard.
A relative reference is very useful when calculating a series of numbers, however there are times when you might wish to make an absolute reference, such that it does not move as it is copied and pasted. To do this you add a $ in front of the letter and/or the number in the cell reference. It is possible to add a $ to only the letter (column reference) or the number (row reference) and only lock one aspect of the referencing or to lock both.
The gif below, which is a spreadsheet setup to calculate the energy of a photon in J/molecule from its wavelength in nm (if interested, download the example dataset here), is an example that includes both absolute and relative references. The equation to calculate energy includes absolute references to the relevant constants and a relative reference to the wavelength of the photon such that it moves as the equation is copied down the column. As shown in the gif, if the absolute reference is not properly applied the calculated values don’t make any sense. A more complicated example of cell referencing is discussed in the If Statements section below.
4.3 Spreadsheet Organization
Before moving on to more complicated functions in Excel it is useful to discuss ways of organizing a spreadsheet, particularly when planning to use cell references. In the example above, the constants were added to the top of the sheet so they could be easily incorporated into the calculation. A limitation of this organization is that it’s not possible to select entire columns using the letter column headers. If you decide to move your constants around, using the CUT and PASTE functions will preserve the cell reference. Cells can be CUT in Excel by selecting CUT from the dropdown menu that appears when you right click or using the keyboard shortcuts, CONTROL+X in Windows and COMMAND+X on a Mac. You can also add rows or columns to your sheet by selecting INSERT from the dropdown menu when you right click. The number of rows or columns added correspond to the number highlighted and the rows will be added above your selection and the columns to the left of your selection.
There is no one right way to organize a spreadsheet, but when using values that need to be referenced or values that are calculated it is useful to plan their position in the sheet and to be able to move them if your original position isn’t optimal.
4.4 Simple Statistical Functions
When we have several measurements it is often useful to know how they compare to one another. To get a sense of that we can employ some simple statistical functions. One such function is a mean or average value, given the symbol \(\overline{x}\). Another typical function is a standard deviation, given the symbol \(\sigma\). We typically understand what a mean value represents, but what is a standard deviation? The mathematical formula for the standard deviation of a set of values is shown in equation (\(\ref{eq8}\)):
\[ \begin{align} \sigma = \sqrt{ \frac{\sum (x_i-\overline{x})^{2}}{N-1}} \label{eq8}\tag{8} \end{align} \]
where \(x_i\) is the individual data points within the dataset; \(\overline{x}\) is the mean value of the dataset; and N is the number of values in the dataset. Looking at this formula it is clear that standard deviation is related to the difference between the individual values in the dataset and the mean value of the dataset. Standard deviation is a useful metric for appreciating the spread in the data. The greater the spread, the greater the standard deviation. More specifically, as shown in the figure below, one standard deviation away from the mean on either side will include 68% of the data points and two standard deviations will include 95% of the data.
Standard deviation is a useful metric. The magnitude of the spread in the data is often related to the overall magnitude of the values. For example, the standard deviation of 0.5 is large for a dataset with a mean value of 1, but small for a dataset with a mean value of 50. To provide this context the relative standard deviation (RSD) can be calculated by dividing the standard deviation by the mean value and then multiplying this ratio by 100 (equation (\(\ref{eq9}\)).
\[ \begin{align} RSD = \frac{\sigma}{\overline{x}} \times 100 \label{eq9}\tag{9} \end{align} \]
The RSD of a dataset expresses the magnitude of the spread in the data using a percentage to normalize for differences in magnitude (e.g. a dataset with a mean value of 1 and a standard deviation of 0.5 has an RSD of 50%, and that with a mean value of 50 and a standard deviation of 0.5 has an RSD of 1%).
The table below presents formula and descriptions to calculate the mean, median, standard deviation, as well as the minimum and maximum values for a set of numbers.
Calculation | Formula | Description | ||
---|---|---|---|---|
Mean | =AVERAGE(number1, [number2], …) | Returns the arithmetic mean of the inputs. | ||
Median | =MEDIAN(number1, [number2], …) | Returns the middle number in a group of supplied number. | ||
Standard Deviation | =STDEV(number1, [number2], …) | Returns the standard deviation of the samples provided. | ||
Minimum | =MIN(number1, [number2], …) | Returns the minimum value in a set of numbers. | ||
Maximum | =MAX(number1, [number2], …) | Returns the maximum value in a set of numbers. |
The figure below provides some context on how to use these formulas with an example calculating the mean, median, standard deviation, relative standard deviation, minimum and maximum values of a list of outdoor summer temperatures in Toronto (if interested, download the example dataset here). The formula for each calculation is provided as text next to the value in question (note the quotation marks, which indicate to Excel that the contents of the cell should be treated as text). Keep in mind that Excel will help you with these calculations, once you put an equal sign into the equation bar and start typing it will give you suggestions of possible functions, don’t be afraid to use them. Once you select the function and the brackets are open you can also select one cell, or a series of cells, using the cursor as opposed to writing them in.
4.5 Sort Function
One very useful aspect of Excel is its ability to sort tables and list according to defined characteristics, typically alphabetical order or in ascending or descending numerical order. To sort data in a table first highlight all of the relevant rows and columns, then select “Sort” from the “Data” menu or select Data > Sort from the top menu. A dialog box will pop up, the first thing to do is to tell Excel whether the highlighted cells include your column headers, in the example below (which uses data from CHM135 Lab 2) the table does include headers and so the “My List has Headers” box on the top right was clicked. If your dataset doesn’t include headers then you will choose the data in a column as defined by its letter in the Excel sheet. Whether you have headers or not, the next step is to select the column you will use to sort your data, in this case “Unknown Sample Letter” and then select “A to Z” in the “Order” input box. The sort function then organizes the entire table such that the column selected is in alphabetical order.
4.6 If Statements
Now that you have used some simple statistical functions in Excel, let’s explore some other functions and simple computer programming. The first function we will explore is the use of “if” statements. “if” is a conditional statement that signals to Excel that a decision will need to be made. As with the functions described in the Simple Statistical Functions section, once you start the formula by writing “=i” into the formula bar Excel will provide you with many possible functions. Choose “IF”.
Once you choose the IF function, Excel will again help you by explaining what the function requires to operate, in this case a logical test and a result for when the test is true and for when it is false. As is typical in more sophisticated Excel programming, each of these requirements is separated by commas.
To see how the IF function can be applied, let’s imagine a scenario where a small cafe has four employees and needs to sort out the hours they have worked (if interested, download the example dataset here). In this case the timekeeping software exports a ledger that indicates the employee, day, time, and number of hours they worked.
The cafe employer would like to sort this information so that they have a clear picture of the number of hours worked by each employee over the course of the week. To do this they create a column header that includes the employee’s names (Ayaan, Charlotte, Melena and Zambala) and then write IF statements for each that only add the number of hours into the column if the employee’s name is found in that row. As described above the IF always includes:
Logical Test: in this case, the logical test is asking whether the name in a given row matches the name at the top of the column (note the use of a relative reference for the name in the row and an absolute reference for the column header, details on referencing can be found in the Cell Referencing section).
Value If True: here the formula selects the number of hours for the row being investigated as this is what we would like to display if that employee was the one to work these hours.
Value If False: here the formula indicates the conditional statement should report a empty cell (indicated by the empty brackets ““) if the employee mentioned in this row does not match the name at the top of the column.
To copy this first IF statement to the remaining employees we could try to simply copy and paste the cell itself to the next column, but (as shown in the gif below, or if you try it yourself) this would move all the relative references and cause a mess. To address this we need to rethink the relative and absolute references used in the gif above. Before you go through the changes made below, challenge yourself to write out the changes you think would work by hand or work through it using the dataset provided.
Here are the changes that need to be implemented so that the function can be copied across both columns and rows:
Logical Test: The reference to the employee name in column A needs an absolute reference on the column, so it doesn’t more as you copy the formula to the other columns, but a relative reference to the row so it moves as the formula is copy down the rows. Conversely the reference to the names in the column headers needs a relative reference on the column but an absolute reference on the row.
Value If True: Similar to employee name in column A, hours worked in column E need an absolute reference on the column but a relative reference on the row.
Value If False: No cell references here.
4.7 SUM and COUNT Functions
Excel includes many useful functions, far too many to outline here. A couple of very useful functions are the the SUM and COUNT functions. As the name suggests, the SUM function returns the sum of a range of numbers. The COUNT function returns the number of values in the outlined range. Both of these functions can be combined with the IF function using the SUMIF and COUNTIF functions where values are summed or counted only if certain criteria are met. The AVERAGE function introduced in the Simple Statistical Functions section can also be included with an IF statement using the AVERAGEIF function.
Calculation | Formula | Description |
---|---|---|
Sum | =SUM(number1, [number2], …) | Returns the sum of the inputs |
Sum If | =SUMIF(range, criteria, [sum_range]) | Returns the sum of the inputs based on the outlined criteria |
Count | =COUNT(value1, [value2], …) | Returns the number of values in the given range |
Count If | =COUNTIF(range, criteria) | Returns the number of values in the given range based on the outlined criteria |
Mean If | =AVERAGEIF(range, criteria, [average_range]) | Returns the arithmetic mean of the inputs based on the outlined criteria |
To explore these functions let’s take another look at the book-keeping exercise from the If Statements section. Imagine the hours for each employee need to be summed over the week. This can be accomplished using the SUM function. The number of shifts worked can also be determined using the COUNT function, and the average length of the shifts using the AVERAGE function.
The work flow outlined above is a two-step process that first uses an IF statement and then the SUM, COUNT and AVERAGE functions. This works perfectly well. The two functions can also be combined using the SUMIF, COUNTIF and AVERAGEIF functions. To accomplish this, select the range of values that will be used to judge the criteria, the criteria itself, and then the range to be summed, counted or averaged.
4.8 SLOPE, INTERCEPT and R2 Functions
The relationship between many variables in chemistry can be explained using a linear regression line with equation y = mx + b. Excel can add linear trendlines with equations and R2 values to scatterplots. Details on how to do, as well as a discussion of what a linear regression entails, are provided in the Adding a Linear Regression Line subsection of the Data visualization section of this resource. If the equation of the linear trendline is required for subsequent calculations it is useful to calculate the slope (m) and intercept (b) in the body of the spreadsheet as opposed to simply reading them off of the graph. The functions used to do perform these calculations are provided in the table below.
Calculation | Formula | Description |
---|---|---|
Slope | =SLOPE(known_ys, known_xs) | Returns the slope of the linear trendline |
Intercept | =INTERCEPT(known_ys, known_xs) | Returns the intercept of the linear trendline |
R2 | =RSQ(known_ys, known_xs) | Returns the R2 of the linear regression, which is a measure of how well the line predicts the variation observed in the data |
The GIF below provides an example executing these functions using the dataset introduced in the Adding a Linear Regression Line subsection that compares the number of computer science PhDs graduating in the US with US arcade revenues each year. If we assume the number of US computer science PhDs graduating in a given year is a good predictor of US arcade revenues (which we certainly do not know to be the case, but just humor me) then we could use the equation of the trendline to predict US arcade revenues in a year when the US graduated 1300 computer science PhDs.
Although there is no clear utility to the data used in this example, it is clear that having the slope and intercept of a linear equation available in the body of the Excel sheet can facilitate subsequent calculations.