
Data Displays, Normal Distributions and Lines of Best Fit

Objective

In this lesson, you will study how to organize data sets using methods such as frequency tables, histograms, standard line graphs, bar graphs, stem-and-leaf displays, and scatter plots. In addition, you will discuss normal distributions, as well as how to find a line of best fit using least squares regression.

Previously Covered:

A data set, or data distribution, is a collection of values representing a population. It is usually represented as a range of figures or terms enclosed in braces, e.g., $\{x_1, x_2, \ldots, x_n\}$.
The mean of a data set is the average value. It is defined by the formula $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of elements in the data set.
The mode of a data set is the value that occurs most often. Remember that the mode can be represented by more than one number, because more than one element in a set might recur an equal number of times.
The range is the difference between the smallest number in a data set (the minimum) and the largest number in that data set (the maximum).
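The median of a data set is the number that falls in the middle of the data set when its elements are arranged in order. If the data set has an even number of elements, the median is the mean of the two middle elements.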
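The variance measures the way that a data set’s elements are dispersed. It is defined by the formula $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$, where $\mu$ is the mean and $n$ is the number of elements.
The standard deviation is the square root of the variance. It is defined by the formula $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$.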
The maximum of a data set is the largest element in that data set.
The minimum of a data set is the smallest element in that data set.
The lower quartile of a data set is the median of the subset of elements between the data set’s median and its minimum. These elements are greater than the minimum and less than the median.
The upper quartile of a data set is the median of the subset of elements between the data set’s median and its maximum. These elements are greater than the median and less than the maximum.
A box-and-whisker plot organizes the information in a data set graphically.
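
As a quick refresher, the measures listed above can be computed with Python's standard `statistics` module. This is only a sketch: the data set is made up for illustration, and the variance and standard deviation use the population form (dividing by the number of elements), which is an assumption about the formula intended above.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)           # sum of the elements / number of elements
median = statistics.median(data)       # middle element of the sorted data
modes = statistics.multimode(data)     # every value tied for most frequent
data_range = max(data) - min(data)     # maximum minus minimum
variance = statistics.pvariance(data)  # population variance (divides by n)
std_dev = statistics.pstdev(data)      # square root of the population variance

print(f"mean={mean}, median={median}, modes={modes}, range={data_range}")
print(f"variance={variance:.3f}, standard deviation={std_dev:.3f}")
```

`statistics.multimode` returns a list because, as noted above, a data set can have more than one mode.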


The box-and-whisker plot is a useful way to display information relating to the distribution of elements in a data set. There are many other ways to graphically display the information in a data set. Different display methods can highlight different important properties of the data distribution.

What are histograms and frequency tables, and how are they related?

Just as a box-and-whisker plot arranges information in relation to the median of a data set, a frequency table provides information relating to the mode of a data set. A frequency table is a data display that lists the number of times that each element in a data set occurs. Often, the relative frequency is also displayed in a frequency table. The relative frequency is a value, usually given as a percent, that represents the share of the data set an element accounts for: the number of times the element occurs divided by the total number of elements.

For example, the number of points a local soccer team scored in their last 35 games is listed below.

Arrange this information into a frequency table.

The elements of this data set range from 0 to 5. The score 0 occurs 5 times. The score 1 occurs 6 times, 2 occurs 8 times, and so on.

Frequency Table
Score	Frequency	Relative Frequency
0	5
1	6
2	8
3	7
4	5
5	4

To compute the relative frequency of the score 0, we divide its frequency (5) by the total number of elements in the data set (35). Therefore, we have $\frac{5}{35} \approx 0.143$. So, 0 occurs approximately 14.3% of the time. To compute the relative frequency of the score 1, we divide its frequency (6) by the total number of elements: $\frac{6}{35} \approx 0.171$. So, 1 occurs approximately 17.1% of the time. Following the same method, we find that 2 occurs approximately 22.9% of the time, 3 occurs exactly 20% of the time, 4 occurs approximately 14.3% of the time, and 5 occurs approximately 11.4% of the time.

Frequency Table
Score	Frequency	Relative Frequency
0	5	14.3%
1	6	17.1%
2	8	22.9%
3	7	20%
4	5	14.3%
5	4	11.4%
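
The relative frequency column can be checked with a few lines of Python. This is a minimal sketch that starts from the frequencies in the table; the variable names are illustrative.

```python
# Frequencies taken from the frequency table above (score -> number of games).
frequency = {0: 5, 1: 6, 2: 8, 3: 7, 4: 5, 5: 4}

total_games = sum(frequency.values())  # total number of elements in the data set

# Relative frequency = frequency / total number of elements, as a percent
# rounded to one decimal place.
relative_frequency = {
    score: round(100 * count / total_games, 1)
    for score, count in frequency.items()
}

for score, percent in relative_frequency.items():
    print(f"{score}\t{frequency[score]}\t{percent}%")
```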


A histogram is another type of display that can be used to graph information relating to frequency. In a histogram, bars are used to represent the number of times an element of a data set occurs.

Histograms sort elements. The x-axis denotes the categories, or classes, the elements will be sorted into, and the y-axis tells how many elements fall into each specific class. The class interval is a rule by which we define each of the classes that the elements of a data set will be sorted into when we organize a histogram. Many times the class interval of a histogram is a range of values, though this is not always the case.

The histogram below displays the information from the frequency table above. In this histogram, the class interval is defined as one goal. We could display the same information in a histogram with a different class interval.

Histograms are often displayed with a class interval larger than one unit. Thus, each class denotes a range of values rather than just one value. The histogram below displays the same information as above with a class interval of 2 goals.
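
To see how a larger class interval regroups the data, the sketch below sorts the same scores into classes that are 2 goals wide and prints a rough text histogram. The class boundaries (0 to 1, 2 to 3, 4 to 5) are an assumption; the lesson does not state where each class starts.

```python
from collections import Counter

# Frequencies taken from the frequency table above (score -> number of games).
frequency = {0: 5, 1: 6, 2: 8, 3: 7, 4: 5, 5: 4}
class_interval = 2  # width of each class, in goals

# Add each score's frequency to the class that contains it.
class_counts = Counter()
for score, count in frequency.items():
    class_start = (score // class_interval) * class_interval
    class_counts[class_start] += count

# One row per class; the number of # marks is the height of the bar.
for class_start in sorted(class_counts):
    class_end = class_start + class_interval - 1
    print(f"{class_start}-{class_end}: {'#' * class_counts[class_start]}")
```

Setting `class_interval = 1` reproduces the original frequencies, one class per score.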

Important Tidbit

As a rule, histograms are drawn with no gap between the bars. So, be sure to place the bars of a histogram right next to each other.

Occasionally, histograms are drawn with extra space between bars, making them look more like bar graphs. We discuss the difference between histograms and bar graphs below.

How are bar graphs and histograms related?

An important thing to remember is that a histogram is a certain type of bar graph. In a histogram, we are using bars to display specific information about a data set. A histogram is like a frequency table that uses bars to represent the frequency. In histograms, there is only one variable to consider, and we categorize the elements of the data set using this one variable.

A bar graph, on the other hand, is more general. A bar graph uses bars to display information relating many measurements to many different items. For example, we would use a bar graph like the one below, not a histogram, to display how many points each soccer team scored at the last tournament.

$Bar Graph$

The big difference between histograms and bar graphs is that a histogram uses bars to display the frequency of just one variable, while a bar graph uses bars to relate many different measurements to many different items.

What are line graphs?

A line graph provides another method to display the information contained in a data set. More specifically, a line graph relates two variables, an independent variable and a dependent variable, within a data distribution. The x-axis denotes the independent variable, and the y-axis denotes the dependent variable.

Line graphs are useful not only because they represent data clearly and concisely, but also because they let us interpolate and extrapolate more refined information from a given data set. In addition, line graphs are often used to recognize a relationship between two variables and to make informed predictions for the future based upon the relationship displayed in the graph.

Question

Dr. Connolly teaches Honors Calculus at a university. Over several years, Dr. Connolly has monitored the number of women that enroll in his class. He has compiled this information into the line graph below.

$Class enrollment line graph$

According to the line graph Dr. Connolly created, which statement below is NOT correct?

A. The number of women enrolled in Dr. Connolly’s Honors Calculus class stayed the same from 1999 to 2000.
B. One year, there were no women enrolled in Dr. Connolly’s Honors Calculus class.
C. The number of women in Dr. Connolly’s Honors Calculus class has continuously increased since 1997.
D. 7 women were enrolled in Dr. Connolly’s Honors Calculus class in 2003.

Answer

The correct choice is C. The number of women in the Honors Calculus class has not been continuously increasing. In fact, there was a decline from 2003 to 2004.

Important Tidbit

We can extrapolate information that is not explicitly contained in a line graph by recognizing patterns in the data relationships. Though the number of enrolled women is not strictly increasing, there is an obvious trend toward more women in the Honors Calculus class. Based on this line graph, we can conjecture that in the coming years, there will be an even greater number of women enrolled in Dr. Connolly’s Honors Calculus class.

What are some other methods used to display data?

As previously stated, line graphs are very useful in noticing trends in a data set and making predictions for future developments based upon those trends. Another data display that is useful when determining the existence of a relationship between two variables is a scatter diagram, also called a scatter plot.

A scatter plot is a coordinate plane with points plotting one set of data values against another. Whereas most of the other methods for displaying data sets we have examined are based upon displaying a single variable in a data set, a scatter plot is used to compare two data sets, with one set being represented on the x-axis and the other on the y-axis.

Question

Suppose we measured the height and shoe size of a large sample group of people. By plotting the height on the x-axis and the shoe size on the y-axis, we create the scatter plot below.

$Scatter plot of shoe size vs. height$

Which statement below is supported by the information in the scatter plot?

A. Anyone with a shoe size over 12 is taller than 5’10’’.
B. As a person’s height increases, their shoe size decreases.
C. There is no clear relationship between a person’s height and their shoe size.
D. In general, the taller a person is, the larger their shoe size is.

Answer

The correct choice is D. The scatter plot indicates that height and shoe size are positively related: in general, as height increases, shoe size increases as well.

Be Aware!

When interpreting a scatter plot, be wary of absolute statements, such as the statement made in choice A. The lone point in the upper left corner shows that there was one person measured who is about 5 ft 2 in tall, with a shoe size larger than 12, which contradicts choice A.

Always be on the lookout for aberrational elements, often called outliers, on a scatter plot. They can provide counterexamples to absolute statements. When making general statements, it is all right to set these elements aside and describe the general trend of the data.

Why are there so many ways to display a data set?

A data set can be very large and, by simply listing the elements in a data set, it can be hard to deduce any useful information. All the methods we have discussed provide ways to simplify the information in a data set. However, as we have seen, different display methods are better at highlighting different properties of a data set. It is important to be familiar with each of the display methods so that we know how to best highlight a certain aspect of a data set.

Another method that is often used to arrange data into a more accessible format is the stem-and-leaf display. A stem-and-leaf display is similar to a histogram because it allows us to quickly count the number of elements in a data set that fall within a specific range.

Consider the data distribution {5, 16, 18, 4, 23, 25, 29, 31, 24, 35, 44, 42, 39, 51, 40, 50, 39, 22, 48, 57, 12, 65, 44, 33, 28, 29, 10, 9, 27, 8}.

To organize this data as a stem-and-leaf display, we let the stem be the tens digit and the leaf be the ones digit of each element. Then we create the following table.

Stem and Leaf Display
Stem	Leaf
0	4, 5, 8, 9
1	0, 2, 6, 8
2	2, 3, 4, 5, 7, 8, 9, 9
3	1, 3, 5, 9, 9
4	0, 2, 4, 4, 8
5	0, 1, 7
6	5

Each element of the data set is represented, and we can quickly see that the greatest number of elements falls in the twenties (from 20 to 29), because the leaf adjacent to the “2” stem is the longest.
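
The same display can be built programmatically. The sketch below is one possible approach, using `divmod` to split each element into its tens digit (stem) and ones digit (leaf); it assumes every element is a non-negative whole number.

```python
from collections import defaultdict

data = [5, 16, 18, 4, 23, 25, 29, 31, 24, 35, 44, 42, 39, 51, 40,
        50, 39, 22, 48, 57, 12, 65, 44, 33, 28, 29, 10, 9, 27, 8]

# Group the leaves (ones digits) under their stems (tens digits).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = ", ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem}\t{leaves}")

# The stem with the most leaves marks the most crowded range of values.
largest_stem = max(stems, key=lambda s: len(stems[s]))
print("Most elements fall under stem", largest_stem)
```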

What are normal distributions?

If we define the class interval of a histogram as a very small interval, and create correspondingly small bars, we could easily imagine a curve created by the bars of the histogram. In fact, the smaller we make our class interval, the more accurate our curve will be.

Suppose we want to make a histogram that displays the heights of all the students in a particular high school. If we define the class interval as 6 inches, we get the following histogram.

$Histogram with class interval 6 inches$

If we define the class interval as 0.1 inches, however, we get the histogram below.

$Histogram with class interval 0.1$

The resultant curve is called the normal curve, or the bell curve. The histogram tells us that the majority of the students in the high school are of average height, yet there are a couple of very tall and very short students. Data sets which follow this pattern are called normal distributions.

Important Tidbit

Many properties in nature and in the behavioral and social sciences follow a normal distribution and produce a normal curve when graphed in this way. Examples include:

Height
IQ
SAT score

How does all this relate to probability?

The normal curve makes very natural statements about probability. Consider, for example, the histogram relating the IQ of a population, in which the class interval is one point. As with many phenomena in nature, a bell curve results.

$Bell curve$

Probabilistically speaking, this curve states that the citizens in this population have an average IQ of about 100. As the IQ increases beyond the average, the number of citizens with this IQ decreases. The bell curve tells us that there are very few citizens who qualify as geniuses (that is, have an IQ above 140). However, there are similarly few citizens with an IQ below 55. Thus, a citizen chosen at random will most likely have an IQ close to the average, and it would be extremely unlikely to randomly pick a genius out of the population.

Using the language of standard deviations, we can make these statements much more precise. Examine the graph below.

$Bell curve with probabilities$

Each interval in the graph is one standard deviation wide and is labeled with the probability that a randomly chosen element falls within it. Note that the probabilities add up to 100%, with 50% on either side of the mean value.

Important Tidbit

Recall: $\mu$ is the mean and $\sigma$ is the standard deviation.
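
The percentages attached to each interval can be estimated with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). This is only a sketch: it computes the bands for an ideal normal curve, so the results differ slightly from the rounded figures used in this lesson (34%, 13.5%, 2.35%, and 0.15%). For example, the band between two and three standard deviations comes out to about 2.14% rather than 2.35%.

```python
from statistics import NormalDist

# Standard normal curve: mean 0 and standard deviation 1, so positions are
# measured in standard deviations away from the mean.
standard_normal = NormalDist(mu=0, sigma=1)

def percent_between(low, high):
    """Percent of a normal distribution between low and high (in standard deviations)."""
    return 100 * (standard_normal.cdf(high) - standard_normal.cdf(low))

bands = {
    "mean to 1 sd": percent_between(0, 1),
    "1 sd to 2 sd": percent_between(1, 2),
    "2 sd to 3 sd": percent_between(2, 3),
    "beyond 3 sd": 100 * (1 - standard_normal.cdf(3)),
}

for label, percent in bands.items():
    print(f"{label}: {percent:.2f}%")

# By symmetry, the four bands on one side of the mean cover half of the data.
one_side_total = sum(bands.values())
```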

Question

The lengths of time for telephone calls in the Jones household approximate a normal distribution. If the mean length is 4.5 minutes, with a standard deviation of 1.5 minutes, about 84% of the calls are…

A. more than 9 minutes long
B. between 4.5 and 9 minutes long
C. less than 4.5 minutes long
D. between 3 and 9 minutes long

Answer

The correct choice is D. Use the normal curve, and fill in the appropriate values for the mean and standard deviations.

$Normal Curve with telephone calls$

According to the chart, only 0.15% of the calls made are more than 9 minutes long, eliminating choice A. The percentage of calls between 4.5 and 9 minutes is 34 + 13.5 + 2.35 = 49.85%, which eliminates choice B. For choice C, according to the chart, the percentage of calls less than 4.5 minutes is 34 + 13.5 + 2.35 + 0.15 = 50%, which eliminates choice C. Finally, according to the chart, the percentage of calls between 3 and 9 minutes is 34 + 34 + 13.5 + 2.35 = 83.85%, which is about 84%, confirming D as the correct choice.
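
The four choices can also be checked numerically. The sketch below models the call lengths as an exact normal distribution with the stated mean and standard deviation, which is an idealization of "approximate a normal distribution"; the variable names are illustrative.

```python
from statistics import NormalDist

# Call lengths in minutes: mean 4.5, standard deviation 1.5.
calls = NormalDist(mu=4.5, sigma=1.5)

more_than_9 = 1 - calls.cdf(9)                    # choice A
between_4_5_and_9 = calls.cdf(9) - calls.cdf(4.5) # choice B
less_than_4_5 = calls.cdf(4.5)                    # choice C
between_3_and_9 = calls.cdf(9) - calls.cdf(3)     # choice D

for label, probability in [
    ("A", more_than_9),
    ("B", between_4_5_and_9),
    ("C", less_than_4_5),
    ("D", between_3_and_9),
]:
    print(f"{label}: {100 * probability:.2f}%")
```

Only the interval from 3 to 9 minutes, which runs from one standard deviation below the mean to three standard deviations above it, comes out close to 84%.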

The normal distribution and the normal curve it generates are very useful patterns of data distribution. These patterns of distribution can be found in naturally occurring phenomena, such as height, and in data sets used in the behavioral and social sciences, such as sets of IQ scores.

What if the data set does not have a normal distribution? Can we find another curve that approximates the data set?

Much research in mathematics is devoted to finding a curve that accurately describes discrete data sets. A linear regression for a collection of data points is a linear equation that closely approximates the behavior of those points. The most popular method of finding a linear regression is the least squares method.

The term regression indicates that a linear equation is a less-than-perfect approximation of a data set. Indeed, it is very rare that a data distribution precisely mimics linear behavior. We can only create a good guess.

Recall that a linear equation is defined by the formula y = ax + b, where a and b are constants.

How exactly does a least squares regression work?

The method of least squares regression creates a linear equation that minimizes the sum of the squares of the vertical distances between the points in the data set and the corresponding values on the line.

$Least squares regression$

In the figure above, the points in the data set are black and the corresponding values on the line are blue. The resulting least squares regression line minimizes the sum of the squares of the vertical distances (denoted in red) between each data point and its corresponding point on the line.

Since a linear equation has the general form y = ax + b, we need to determine how to calculate the values of a and b in order to find a linear regression. Luckily, there are formulas we can use to calculate the values of a and b which will result in the least squares regression.

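Suppose we are given the data set $\{x_1, x_2, \ldots, x_n\}$, and corresponding to each element of the data set, we have the ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The following equations are used to compute the values of a and b.

Important Tidbit

Use the equations below to solve for a and b in the linear regression. Each sum runs over all $n$ ordered pairs.

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

$$b = \frac{\sum y_i - a\sum x_i}{n}$$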

Question

Which choice shows the least squares regression line for the data set below?

{(1, 1), (2, 4), (3, 2), (4, 4), (5, 3), (7, 9), (8, 5), (9, 10)}

A. y = x
B. y = x + 0.299
C. y = 2x + 1
D. y = x – 0.225

Answer

The correct choice is B. Before we begin plugging values into the equations above, let’s organize what we need to know by creating a table.

i	$x_i$	$y_i$	$x_i^2$	$x_i y_i$
1	1	1	1	1
2	2	4	4	8
3	3	2	9	6
4	4	4	16	16
5	5	3	25	15
6	7	9	49	63
7	8	5	64	40
8	9	10	81	90
Sum	39	38	249	239

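Now we can more easily compute the values for a and b. With $n = 8$ and the column sums from the table, we have

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{8(239) - (39)(38)}{8(249) - 39^2} = \frac{430}{471} \approx 0.913$$

$$b = \frac{\sum y_i - a\sum x_i}{n} = \frac{38 - (0.913)(39)}{8} \approx 0.299$$

The answer choices give the slope rounded to the nearest whole number. Thus the equation of the least squares regression line is y = x + 0.299. Without rounding the slope, the line is y = 0.913x + 0.299.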

In the figure below, the points of the data set are mapped with the corresponding least squares regression line in red.

$Least squares example$

We now have a linear equation that closely resembles the data we have collected. Using this equation, we can more easily extrapolate or interpolate information from the data set. For example, though the data set contains no point with x = 6, the line y = x + 0.299 predicts that when x = 6, y = 6 + 0.299 = 6.299. (This line uses a slope rounded to 1; with the unrounded least squares slope of about 0.913, the prediction is about 5.78.) The method of least squares regression is a very useful technique for making predictions based upon trends in the existing data.
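
The whole calculation can be sketched in Python. The code below applies the standard least squares formulas for the slope a and intercept b to the data set from the question; the helper `sum_squared_error` is a hypothetical name introduced here to measure the quantity that least squares minimizes.

```python
points = [(1, 1), (2, 4), (3, 2), (4, 4), (5, 3), (7, 9), (8, 5), (9, 10)]

# Column sums, matching the table in the worked answer.
n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_x2 = sum(x * x for x, _ in points)
sum_xy = sum(x * y for x, y in points)

# Least squares slope and intercept for the line y = ax + b.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

def sum_squared_error(slope, intercept):
    """Sum of the squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"prediction at x = 6: {a * 6 + b:.3f}")
print(f"squared error, fitted line:   {sum_squared_error(a, b):.3f}")
print(f"squared error, y = x + 0.299: {sum_squared_error(1, 0.299):.3f}")
```

Because the fitted line minimizes the sum of squared vertical distances, its error is smaller than that of the line with the slope rounded to 1.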

Review of New Vocabulary and Concepts

A frequency table is a data display that lists the number of times each element in a data set occurs.
A histogram is a type of display that can be used to graph information relating to frequency. In a histogram, bars are used to represent the number of times an element of the data set occurs.
Class interval is the rule by which we define each of the classes that the elements of a data set will be sorted into when creating a histogram. For instance, if the classes on a histogram are 5, 10, 15, 20, the class interval is 5. If they are 20, 40, 60, 80, the class interval is 20.
A bar graph is a type of data display that is similar to, but more general than, a histogram. It uses bars to display information relating many measurements to many different items.
A line graph relates two variables, an independent variable and a dependent variable, within a data distribution.
A scatter plot is a coordinate plane with points plotting one set of data values against another. It is used to compare two separate data sets. One set is plotted on the x-axis, and the other set is plotted on the y-axis.
A stem-and-leaf display is a data display that is similar to a histogram because it allows us to quickly count the number of elements in a data set that fall within a specific range.
