15 Summary Statistics
learn by watching
The links below will launch the video lessons in YouTube
Treat the first two videos here as definitions and vocabulary. Although I explain some of the mathematics here, that is just so you understand the purpose and meaning of the terms. ALL calculations on your homework will be done in Desmos.
For the “Summary Statistics Vocabulary Practice” homework assignment, you will be analyzing information and answering questions where the data is presented in graphs that are already created. In the “Summary Statistics Using Technology Practice” homework assignment, you will be given data, and then asked to use Desmos to find the summary statistics and draw graphs.
- Measures of Central Value (9 minutes 43 seconds)
- Measures of Range (13 minutes 13 seconds)
- Reading Histograms (11 minutes 56 seconds)
Try “Summary Statistics Vocabulary Practice” Part 1: Investigating Histograms
- Reading Boxplots (13 minutes 6 seconds)
Try “Summary Statistics Vocabulary Practice” Part 2: Investigating Boxplots
- Categorical versus Quantitative Data (5 minutes 8 seconds)
- Choosing an Appropriate Graph Type for Your Data (9 minutes 37 seconds)
Try “Summary Statistics Vocabulary Practice” Part 3: Working with Graphs (Mixed Types) Problems #1-13
- Using Desmos to Calculate Summary Statistics (9 minutes 20 seconds)
- Using Desmos to Draw Boxplots (6 minutes 53 seconds)
- Using Desmos to Draw Histograms (6 minutes 26 seconds)
Try “Summary Statistics Using Technology Practice” using Data Sets #1-4
learn by reading
The world is full of data. As we conduct research, we often need ways that we can share out the data that we’ve collected. In this lesson, we will be looking at ways to summarize and present data sets both numerically and graphically.
types of data
There are two different classifications of data: quantitative and qualitative. The type of data you are collecting determines the types of calculations and graphs that you can do.
Types of Data
QUALITATIVE data is descriptive in nature. Usually, the collected data is words. We sometimes use the term “categorical” data instead of qualitative.
QUANTITATIVE data is numerical in nature. The collected data/survey results are always numbers. Usually quantitative data is a measurement of some type.
With quantitative data, we have a lot of ways that we can summarize and report out the data. We can organize and display our data on a number line. We can use the data to calculate meaningful things such as averages.
With qualitative (categorical) data, we can only group similar answers together. We can share out the “count” of how many data results were in each category. At best, we can calculate and share percentages of how frequently certain results occurred.
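To see what "counts and percentages" look like in practice, here is a minimal Python sketch. The candidate names and the helper name `category_summary` are made up for illustration; they are not from any real survey.

```python
from collections import Counter

# Hypothetical survey answers (invented for this illustration).
votes = ["Lopez", "Chen", "Lopez", "Patel", "Lopez", "Chen", "Lopez", "Chen"]


def category_summary(answers):
    """Return {category: (count, percent)} for a list of categorical answers."""
    counts = Counter(answers)
    total = len(answers)
    return {name: (count, 100 * count / total) for name, count in counts.items()}


summary = category_summary(votes)
for name, (count, percent) in summary.items():
    print(f"{name}: {count} answers ({percent:.1f}%)")
```

Notice that nothing here is averaged. Grouping and counting is all we can do with word answers.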
Example 1: Identify if the following data types are Qualitative or Quantitative:
- Reporting the height of every student in a kindergarten class.
- Recording what candidate a person is planning to vote for.
- Measuring the weights of potato chip bags coming off of a factory line to make sure the machines are working correctly.
- Asking a survey participant to report the zip code that they live in.
- Heights of kindergarten students: Quantitative. We are gathering a measurement. Averaging these results will give you a meaningful “typical” value.
- Candidate a person plans to vote for: Qualitative. The answers will all be words (people’s names). We can group/categorize the answers to find who is most likely to win. However, we cannot do things like calculate an average.
- Weights of potato chip bags: Quantitative. Again, we are gathering a measurement, and our results are numbers. Averaging these results will give you a meaningful “typical” value.
- Zip codes: Qualitative. This one is tricky. Although zip codes are numbers, they really are a way to report a category (where people live) and not a measurement. Although we can calculate an average, that result will not be meaningful.
One way I like to remember the difference between these terms is this:
- The word QUANTITATIVE has the word “quantity” in it. This reminds me of measurements and numbers.
- The word QUALITATIVE has the word “quality” in it. This reminds me of describing in words how good something is.
statistical summaries
Collecting data can be interesting. However, if you want to share your work with someone else, you generally don’t want to hand them over a long list of all the data you’ve collected. You’d like a way that you can summarize and describe the results in a concise and meaningful way.
“Typical” Values
There are three ways that we can describe the typical value of a data set.
Typical Values
- The MEAN. This is what people usually think of when they hear the word “average”. To find the mean, we are basically pooling all the data values together, and then redistributing them evenly among all the participants.
- The MEDIAN. This is the “central” number. If you were to line all your data results in order from smallest to biggest, this would be the one in the center. 50% of all the data points lie at or below the median, and 50% of all the data points lie at or above the median.
- The MODE. This is the value that is repeated most often in the data set. Unlike the mean and median, it is possible to have more than one mode, or even none at all (if all the data points are unique).
If we have qualitative (categorical) data, the only “typical” value we can find is the mode. You can’t do calculations on word/descriptive results.
If we have quantitative data, we generally stick to the mean and the median. We can order and do math processes on numerical data, so these calculations are a great way to share a lot of results using a single number.
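If you are curious how the three typical values compare on one small list, the sketch below uses Python's built-in `statistics` module. The quiz scores and color answers are hypothetical, and this is only an illustration alongside the Desmos work in this lesson.

```python
import statistics

# Hypothetical quiz scores (invented for this illustration).
scores = [7, 8, 8, 9, 10, 6, 8]

mean_score = statistics.mean(scores)      # pool every value, then share evenly
median_score = statistics.median(scores)  # middle value once the list is sorted
score_modes = statistics.multimode(scores)  # list of the most frequent value(s)

# The mode also works on word answers, where mean and median do not apply.
colors = ["red", "blue", "red", "green", "blue"]
color_modes = statistics.multimode(colors)  # two values tie, so two modes

print(mean_score, median_score, score_modes, color_modes)
```

One caution: when nothing repeats, `multimode` returns every value in the list, while the definition above treats that situation as having no mode at all.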
This is how I remember which typical value is which:
- Mean — it’s not very nice to calculate when you have lots of data because you have to do the math using every single data point.
- Median — similar to the “middle of the road”, this typical value describes what is in the exact center.
- Mode — sounds like the word “more”…which is what we are finding.
Skew
The “mean” is generally considered the best way to summarize data because every single data point is used as part of the calculation.
The “median” is generally considered a better way to describe a data set when we have outliers. An outlier is a data point that is unusually far away from the rest of the values. In these cases, the mean will be strongly affected by that very different result. The median, however, is barely affected by having one or two unusually large (or unusually small) data points, because we are only looking at what falls in the middle. We are not concerned by what is going on at the extremes of our data set when describing the median as a typical value.
For example, if we want to find a typical cost of meat at a deli counter, the values all are fairly close together. The mean will give a result that describes the data set well. Adding up all the individual prices for each product and redistributing it so the prices of each product are equal is a meaningful “typical” value.
However, consider the case of a neighborhood that has houses that sell for about $350,000, but on the corner, there is a $12 million mansion. In this case, if we find the mean (average) of the data set, our result will likely be more than the price of every house except the most expensive one. The one large outlier is pulling the mean to be larger than the rest of the data. Reporting this number as a summary of the data, while accurate, may give an inaccurate idea of what the neighborhood looks like. In this case, seeing the median value of $350,000 would give you a better idea of what to expect in the neighborhood.
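Here is a quick Python sketch of the neighborhood example. The text does not say how many houses are on the street, so I am assuming nine houses at exactly $350,000 plus the one mansion.

```python
import statistics

# Assumption: nine houses at $350,000 and one $12 million mansion.
prices = [350_000] * 9 + [12_000_000]

mean_price = statistics.mean(prices)      # pulled upward by the mansion
median_price = statistics.median(prices)  # still a regular house price

print(f"mean:   ${mean_price:,.0f}")
print(f"median: ${median_price:,.0f}")
```

With these assumed numbers, the mean comes out to $1,515,000, which is more than four times the price of a regular house, while the median stays at $350,000.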
When data calculations are being pulled one direction or another, we call the data skewed. A data set is skewed to the right when we have large outliers. A data set is skewed to the left when we have small outliers. In general, we say a data set is:
- Skewed right when the mean is bigger than the median.
- Skewed left when the mean is smaller than the median.
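The comparison above can be sketched as a small Python function. The name `skew_direction` and the sample lists are hypothetical, and this only applies the rule of thumb from this section. It is not a formal skewness calculation.

```python
import statistics


def skew_direction(data):
    """Apply the mean-versus-median rule of thumb to a list of numbers."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "skewed right"
    if mean < median:
        return "skewed left"
    return "no skew indicated"


print(skew_direction([1, 2, 3, 4, 50]))     # one unusually large value
print(skew_direction([1, 47, 48, 49, 50]))  # one unusually small value
print(skew_direction([1, 2, 3, 4, 5]))      # evenly spaced values
```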
Describing Data Spread
The problem with just reporting a single number to summarize your data is that while you get a “typical” data result, you don’t know anything else about the data in general. To see how this might be a problem, consider the following two data sets:
A = [50, 50, 50, 50, …, 50]
B = [1, 2, 3, 4, …, 97, 98, 99]
These two data sets both have a mean of 50 and a median of 50. However, they are obviously very different in nature! In describing a quantitative data set, it is a good idea to have a way to talk about how spread out the data is.
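You can check this claim with a short Python sketch. The text does not say how many 50s are in list A, so I am assuming 99 of them to match the length of list B.

```python
import statistics

A = [50] * 99            # assumption: 99 copies of 50
B = list(range(1, 100))  # the whole numbers 1 through 99

for name, data in (("A", A), ("B", B)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    spread = max(data) - min(data)  # distance from smallest to biggest value
    print(f"{name}: mean={mean}, median={median}, max-min={spread}")
```

Both lists report 50 for the mean and the median, but the distance from smallest to biggest value is 0 for A and 98 for B.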
Some useful numbers we can use in talking about data spread are called the 5-number summary.
The 5-Number Summary
- Minimum: the smallest number in the data set
- 1st Quartile (Q1): the number that lies 1/4 of the way through the data list. You may want to think of this as the median of the lower half of the data. [25% of data points lie at or below Q1, and 75% of data points lie at or above Q1]
- Median: the number that lies 1/2 of the way through the data list. [50% of data points lie at or below the median, and 50% of data points lie at or above the median]
- 3rd Quartile (Q3): the number that lies 3/4 of the way through the data list. You may want to think of this as the median of the upper half of the data. [75% of the data points lie at or below Q3, and 25% of data points lie at or above Q3]
- Maximum: the biggest number in the data set
The 5-number summary gives us a way to show how spread out a data set is, essentially dividing all the data points into 4 equally-sized chunks.
- 25% of the data points lie between the minimum and Q1.
- 25% of the data points lie between Q1 and the median.
- 25% of the data points lie between the median and Q3.
- 25% of the data points lie between Q3 and the maximum.
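The "median of the lower half / median of the upper half" idea can be sketched in Python as follows. The function name and the sample data are hypothetical. When the list has an odd number of values, this sketch leaves the middle value out of both halves. That is one of several quartile conventions, so a calculator or software package may report slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves.

    Assumes at least two data points. For an odd-length list, the middle
    value is left out of both halves (one common convention).
    """
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]
    upper_half = values[n // 2 + n % 2:]
    return (
        values[0],
        statistics.median(lower_half),
        statistics.median(values),
        statistics.median(upper_half),
        values[-1],
    )


# Hypothetical data, deliberately entered out of order.
summary = five_number_summary([12, 3, 7, 9, 15, 1, 5, 20])
print(summary)
```

For this list, the sorted values are 1, 3, 5, 7, 9, 12, 15, 20, so the lower half is 1, 3, 5, 7 and the upper half is 9, 12, 15, 20.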
Data Spread
- The RANGE is calculated by subtracting the minimum from the maximum (maximum – minimum).
- The INTERQUARTILE RANGE (IQR) is calculated by subtracting Q1 from Q3 (Q3 – Q1). This gives you how far apart the “middle 50%” of the data lie. This result is not affected by outliers.
- The STANDARD DEVIATION is calculated using every single data point. It is a pretty involved calculation that includes finding how far each individual data point is from the mean and then performing a modified average based on those values. A small standard deviation indicates that the data values are clustered closely around the mean. A large standard deviation indicates that the data is more spread out.
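To make these three measures concrete, here is a Python sketch on a small hypothetical list. For the standard deviation, I am assuming the "sample" version, where the squared distances from the mean are added up and divided by one less than the number of data points before taking the square root. The quartiles use the median-of-each-half idea from the 5-number summary.

```python
import math
import statistics

# Hypothetical data (invented for this illustration).
data = sorted([2, 4, 4, 4, 5, 5, 7, 9])
n = len(data)

# RANGE: maximum minus minimum.
data_range = max(data) - min(data)

# INTERQUARTILE RANGE: Q3 minus Q1, using the median of each half.
q1 = statistics.median(data[: n // 2])
q3 = statistics.median(data[n // 2 + n % 2:])
iqr = q3 - q1

# STANDARD DEVIATION (sample version, an assumption):
# 1. find how far each data point is from the mean, squared
# 2. add those up and divide by n - 1 (the "modified average")
# 3. take the square root
mean = statistics.mean(data)
squared_distances = [(x - mean) ** 2 for x in data]
sample_sd = math.sqrt(sum(squared_distances) / (n - 1))

print(data_range, iqr, round(sample_sd, 3))
```

The hand-built `sample_sd` agrees with Python's own `statistics.stdev(data)`, which is a handy way to check the steps.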
Data Visualizations
The old saying “A picture is worth a thousand words” is certainly true when sharing data. A graph is a quick way to visually illustrate the results of data collection.
The types of graphs you can use depend on the type of data you collect.
- Qualitative data can be demonstrated using bar graphs (where each category is displayed by the height of a bar on the graph) or pie charts (where each category is displayed by the size of a sector of a circle).
- Quantitative data can be demonstrated using histograms or boxplots. These graphs both use a number line as one of the axes in a graphical display. However, these graph types summarize and display data in different ways. A histogram creates equally-sized groups, and displays the frequencies of each group as bars growing up out of the number line (a specialized type of bar graph). A box plot is a way to visually describe the 5-number summary.
Exercises: Try these!
Part 1: Summary Statistics Vocabulary – Investigating Histograms
Part 2: Summary Statistics Vocabulary – Investigating Boxplots
Part 3: Summary Statistics Vocabulary – Working with Graphs (Mixed Types)
calculating statistical summaries
Technology makes it easy to find statistical summaries. I like to use the Desmos graphing calculator application (available as a phone app or at http://www.desmos.com/calculator). NOTE: this is a different version of Desmos than the scientific calculator I have been using up to this point. Be sure to use this new website and/or download the new app.
First you need to enter your data list. Give your list a name (I like to use “A” or “B”). We can define our data list in this format:
A=[_, _, _]
When you are done, Desmos will tell you that you have a __-element list. This is a way to verify that you have included all of the data points.
Some cautions:
- Capitalization matters when naming your data list. You need to be consistent.
- The list needs to be enclosed specifically in square brackets [ ]. Otherwise, Desmos will not recognize it as a list.
- The list does NOT need to be in order from smallest to biggest. Desmos will work fine doing calculations on out-of-order lists.
- Your data points need to be separated by commas. Be aware that this means you need to enter large numbers without commas or Desmos will count them as separate numbers. For example, write one million as 1000000. If you write it as 1,000,000 then Desmos will count that as three different numbers (a 1 and two 0s).
Once you have your list, you can use the following functions:
Desmos Numerical Statistical Summaries
- mean(A) to find the mean of your list, A.
- median(A) to find the median of your list, A.
- stats(A) to find the 5-number summary of your list, A. This includes the minimum, Q1, median, Q3, and maximum of your data set.
- stdev(A) to find the standard deviation of your list, A.
(note: Desmos will not directly calculate the range or the IQR. But those are simple subtractions once you have your 5-number summary identified)
I recommend watching the following video to walk you through these processes so you can see what things look like: Using Desmos to Calculate Summary Statistics (9 minutes 20 seconds)
creating data visualizations
Once you have entered your data list, Desmos will even draw graphs for you!
You can create boxplots by using the function boxplot(A).
You can create histograms by using the function histogram(A, __). After the comma you want to declare what you want your number line to count by. This will be the width/grouping of your bars. You may want to count by 5s, or 10s, or even 1,000s depending on how large your data values are. I recommend changing the Bin Alignment to “left”. That will match the graphs in the answer key.
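If you want to see what "counting by" a bin width does to the data, here is a Python sketch of the grouping step. It assumes that each bar starts at a multiple of the bin width and that a value sitting exactly on a boundary is counted in the bar to its right. Both are my assumptions for this illustration, and the function name and ages are hypothetical.

```python
import math
from collections import Counter


def histogram_counts(data, bin_width):
    """Return {bar_start: count}, where each bar covers
    bar_start up to (but not including) bar_start + bin_width."""
    starts = (math.floor(x / bin_width) * bin_width for x in data)
    return dict(sorted(Counter(starts).items()))


# Hypothetical ages, grouped by 10s.
ages = [3, 7, 12, 15, 18, 21, 22, 29, 35, 41]
bars = histogram_counts(ages, 10)

for start, count in bars.items():
    print(f"{start} to {start + 10}: {'#' * count}")
```

Counting by 10s gives five bars for this list, which lands in a comfortable range for a readable histogram. Counting by 1s would give ten very short bars, and counting by 50s would lump everything into one.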
Hint: next to the command line box, you will see a little magnifying glass that says: “Zoom Fit”. This will change the window to center all of your data points so your graph is visible.
I recommend checking out the following videos. These will walk you through the process of creating your boxplots and histograms. They are full of little tricks like labeling your axes, putting 2 graphs on the same number line, identifying statistical outliers, and exporting your graphs so you can turn them in as part of your homework.
- Using Desmos to Draw Boxplots (6 minutes 53 seconds)
- Using Desmos to Draw Histograms (6 minutes 26 seconds)
Exercises: Try These!
For this assignment, use the Desmos Graphing Calculator (www.desmos.com/calculator) to find the requested statistics and to sketch the requested graphs.
SINCE YOU ARE USING DESMOS TO FIND THE STATISTICS AND SKETCH THE GRAPHS, THERE IS NO WORK TO SHOW FOR THIS ASSIGNMENT — YOU SHOULD NOT BE CALCULATING THE STATISTICS BY HAND!!!
Calculating Summary Statistics and Creating Data Visualizations
For each of the 4 data sets that follow:
- Find the mean. Include units.
- Find the standard deviation. Include units.
- Using your answers to #1 and #2, between which 2 values do the majority of data fall?
- Identify the 5-number summary.
- Find the interquartile range (IQR).
- Create a boxplot of the data. Be sure your axes are labeled.
- Create a histogram of the data. Choose an appropriate bin width for your data (you generally want 4-7 bars in your histogram to keep it meaningful without being too busy).
- Are any of the data points statistical outliers? If so, which ones? How did this affect your summary statistics?
** You can put the boxplot and histogram on the same set of axes for each different data set. You may use your desktop’s snipping tool or use the “export your graph” option in Desmos to save your graphs in a Word document to submit your answers. If using the phone app, you can take a screenshot of your graphs to include in your HW submission. If you get frustrated by the technology, just sketch what you see by hand on paper and submit pictures like you usually do on your homework. **
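Two of the questions above can be sketched in Python on a hypothetical list (not one of the homework data sets). I am reading "between which 2 values do the majority of data fall" as the interval from one standard deviation below the mean to one standard deviation above it. For outliers, I am assuming the common 1.5 × IQR convention, which this reading section does not spell out. The function names and commute times are invented for illustration.

```python
import statistics


def typical_interval(data):
    """Return (mean - SD, mean + SD) using the sample standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return mean - sd, mean + sd


def iqr_outliers(data):
    """Return values more than 1.5 * IQR below Q1 or above Q3.

    Assumption: the 1.5 * IQR convention, with quartiles found as the
    median of each half of the sorted list.
    """
    values = sorted(data)
    n = len(values)
    q1 = statistics.median(values[: n // 2])
    q3 = statistics.median(values[n // 2 + n % 2:])
    fence = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - fence or x > q3 + fence]


# Hypothetical commute times in minutes.
commute_minutes = [12, 15, 15, 18, 20, 22, 25, 27, 30, 95]
low, high = typical_interval(commute_minutes)
outliers = iqr_outliers(commute_minutes)

print(f"most data fall between {low:.1f} and {high:.1f} minutes")
print("outliers:", outliers)
```

In this made-up list, the 95-minute commute is flagged as an outlier, and it also stretches the mean and the standard deviation, which is the kind of effect the last question asks you to describe.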
DATA SET 1: Seahawks Player Salaries
2023 Seahawks Starting Player | Average Annual Salary (contract in millions of dollars) |
Geno Smith | 25 |
Kenneth Walker III | 2.11 |
DK Metcalf | 24 |
Tyler Lockett | 17.25 |
Jaxon Smith-Njigba | 3.6 |
Noah Fant | 3.15 |
Nick Bellore | 3.3 |
Charles Cross | 5.35 |
Damien Lewis | 1.23 |
Evan Brown | 2.25 |
Phil Haynes | 4 |
Abraham Lucas | 1.35 |
DATA SET 2: Tuition and Fees for Washington State Public 4-year Colleges
4-year Public University/College | 2021-2022 Resident Tuition |
University of Washington | $11,380 |
Washington State University | $10,997 |
Central Washington University | $7,368 |
Eastern Washington University | $6,896 |
The Evergreen State College | $7,389 |
Western Washington University | $7,572 |
DATA SET 3: Tuition and Fees for Washington State Private Non-Profit 4-year Colleges
4-year Private Non-Profit University/College | 2021-2022 Resident Tuition |
Antioch University | $22,035 |
Bastyr University | $26,895 |
City University of Seattle | $19,035 |
Cornish College of the Arts | $34,200 |
Gonzaga University | $46,920 |
Heritage University | $18,392 |
Northwest University | $33,980 |
Pacific Lutheran University | $46,200 |
Saint Martin’s University | $39,940 |
Seattle Pacific University | $47,244 |
Seattle University | $48,090 |
University of Puget Sound | $53,800 |
Walla Walla University | $29,931 |
Western Governors University | $7,040 |
Whitman College | $55,968 |
Whitworth University | $46,250 |
Compare to the WA Community and Technical Colleges average tuition for 2021-2022 of $4,332 🙂
DATA SET 4: “Large” (over 100 acres of timber or 300 acres in grass) Wildfires in WA State during 2022
LARGE DNR FIRE NAME | ACRES BURNED |
Black Hole | 457 |
Bolt Creek | 14,715 |
Boulder Mountain | 2,233 |
Cow Canyon | 5,812 |
Goat Rocks | 8,133 |
Kalama | 495 |
Loch Katrine | 1,918 |
Nakia Creek | 1,869 |
North Fork (Little Chill) | 1,817 |
Stayman Flats | 1,101 |
Siouxon | 2,112 |
Vantage Highway | 30,658 |
White River | 11,122 |
Williams Lake | 1,869 |