Second Edition

INTRODUCTORY

STATISTICS A Problem-Solving Approach STEPHEN KOKOSKA

APPLICATIONS

Introductory Statistics: A Problem-Solving Approach, 2e presents a wide variety of applications from diverse disciplines. The following list indicates the Example and Exercise numbers related to different fields. Note that some items appear in more than one category.

EXAMPLES BY APPLICATION

Public Health and Nutrition 2.13, 2.14, 4.37, 5.1, 5.7, 5.10, 6.7, 8.1, 8.3, 8.4, 10.2, 10.6, 10.11, 10.15, 11.1, 11.2, 12.2, 12.7, 12.8, 12.12, 13.3, 14.2, 14.3, 14.4

Public Policy and Political Science 1.10, 3.7, 4.2, 4.3, 4.16, 4.25, 5.4, 5.17, 6.3, 7.1, 8.1, 9.13, 9.15

Biology and Environmental Science

Sports and Leisure

1.11, 2.7, 3.2, 3.3, 3.8, 3.9, 3.17, 5.1, 5.2, 6.8, 7.4, 8.5, 9.8, 9.15, 9.19, 10.9, 10.16, 11.2, 11.6, 12.9, 13.1, 14.5

1.7, 2.1, 3.2, 3.3, 3.12, 3.18, 3.20, 4.10, 4.18, 4.21, 4.22, 4.24, 5.8, 5.12, 5.19, 6.1, 9.10, 10.7, 12.1, 12.10

Business and Management 3.5, 4.7, 4.31, 4.36, 4.39, 5.2, 7.3, 7.8, 7.9, 7.11, 9.16, 9.18, 11.3, 14.7, 14.8, 14.9

Technology and the Internet

Demographics and Population Statistics

Travel and Transportation

1.2, 4.1, 4.31, 4.32, 7.1, 9.14, 12.7, 12.8

1.1, 2.3, 2.4, 2.5, 2.6, 2.8, 2.12, 3.3, 3.7, 3.13, 3.17, 3.18, 3.20, 4.2, 4.3, 4.7, 4.27, 4.38, 5.9, 6.1, 6.6, 7.1, 7.7, 8.9, 9.2, 9.12, 10.12, 12.6, 13.4

Economics and Finance 4.14, 10.15, 12.14, 14.1, 14.6, 14.8, 14.10

Education and Child Development 2.3, 3.15, 4.6, 5.11, 6.7, 7.6

4.9, 4.19, 8.8, 9.17, 10.1

EXERCISES BY APPLICATION

Fuel Consumption and Cars

Biology and Environmental Science

2.13, 3.11, 4.20, 4.38, 8.2, 8.11, 10.10, 12.11, 12.13

0.7, 0.9, 1.30, 1.36, 1.41, 1.42, 1.43, 2.7, 2.8, 2.9, 2.12, 2.14, 2.22, 2.40, 2.60, 2.63, 2.65, 2.67, 2.86, 2.92, 2.94, 2.109, 3.14, 3.19, 3.21, 3.31, 3.54, 3.55, 3.59, 3.65, 3.84, 3.86, 3.112, 3.134, 4.123, 4.130, 4.146, 4.151, 4.164, 4.177, 5.9, 5.12, 5.15, 5.16, 5.39, 6.47, 6.53, 6.55, 6.58, 6.82, 6.109, 7.16, 7.42, 7.45, 7.46, 7.48, 7.80, 7.91, 7.105, 8.12, 8.13, 8.14, 8.37, 8.44, 8.69, 8.86, 8.149, 8.157, 8.173, 8.177, 9.14, 9.15, 9.73, 9.79, 9.92, 9.121, 9.142, 9.144, 9.157, 9.159, 9.229, 9.236, 9.249, 9.254, 10.47, 10.57, 10.72, 10.74, 10.82, 10.92, 10.137, 10.154, 10.155, 10.159, 10.161, 10.168, 10.169, 11.21, 11.25, 11.26, 11.52, 11.53, 11.55, 11.60, 11.62, 11.77, 11.81, 11.98, 11.100, 12.18, 12.22, 12.28, 12.49, 12.56, 12.57, 12.60, 12.78, 12.81, 12.107, 12.115, 12.146, 12.147, 12.149, 12.153, 12.156, 12.159, 12.165, 13.14, 13.25, 13.70, 14.19, 14.21, 14.38, 14.63, 14.76, 14.114, 14.115, 14.116, 14.118, 14.123, 14.124, 14.128, 14.145, 14.153

Manufacturing and Product Development 1.12, 2.10, 2.11, 4.26, 5.2, 5.20, 6.2, 6.11, 8.7, 8.13, 9.3, 9.4, 9.6, 10.3, 10.5, 12.5

Marketing and Consumer Behavior 1.12, 2.2, 2.3, 2.8, 3.16, 4.4, 4.8, 4.13, 4.15, 4.17, 4.18, 4.19, 4.23, 4.30, 4.34, 4.35, 5.5, 5.14, 5.15, 5.18, 7.1, 7.10, 8.11, 9.14, 10.13, 11.4

Medicine and Clinical Studies 1.3, 1.5, 1.8, 2.9, 3.6, 3.10, 3.14, 3.19, 4.33, 4.37, 5.2, 5.7, 5.16, 6.9, 6.10, 9.1, 9.5, 9.7, 9.9, 9.17, 10.4, 10.8, 12.3, 12.4, 13.3, 14.4

Physical Sciences 1.4, 5.6, 8.10, 10.9, 11.5

Psychology and Human Behavior 1.5, 1.6, 1.8, 1.9, 2.3, 3.6, 4.6, 7.2, 8.8, 9.11, 9.18, 10.1, 10.8, 10.11, 13.2, 14.8

Business and Management 1.19, 2.32, 2.61, 2.96, 3.89, 3.105, 3.114, 4.76, 4.78, 4.90, 4.112, 4.131, 4.171, 5.9, 5.38, 5.93, 5.97, 5.98, 5.127, 5.132, 5.139, 5.142, 5.157, 6.95, 6.110, 7.77, 8.11, 8.50, 8.51, 8.75, 8.109, 8.110, 8.113, 8.133, 8.134, 8.145, 8.165, 9.29, 9.72, 9.146, 9.151, 9.222, 9.226, 9.239, 9.251, 9.256, 10.15, 11.28, 11.76, 12.23, 12.58, 12.108, 12.113, 12.142, 13.19, 13.26, 13.29, 13.58, 14.42, 14.59, 14.81

Demographics and Population Statistics 1.28, 1.43, 2.25, 3.80, 4.56, 4.64, 4.129, 4.149, 4.176, 5.43, 5.94, 5.95, 5.128, 7.14, 7.83, 7.90, 9.153, 9.195, 9.197, 11.50, 11.57, 12.83, 13.66

Economics and Finance 1.11, 2.14, 2.103, 3.110, 3.117, 4.20, 4.33, 4.91, 4.120, 4.141, 4.152, 4.157, 4.160, 4.178, 4.188, 5.9, 5.42, 5.87, 5.126, 5.133, 5.140, 6.20, 6.42, 6.117, 7.9, 7.21, 7.107, 8.45, 8.78, 8.107, 8.116, 8.160, 9.30, 9.50, 9.77, 9.148, 9.232, 9.257, 10.28, 10.60, 10.72, 10.102, 10.165, 10.167, 11.103, 12.61, 12.141, 12.150, 12.155, 12.161, 12.164, 13.13, 13.63, 14.35, 14.37, 14.100, 14.125

Education and Child Development 2.6, 2.21, 2.26, 3.17, 3.22, 3.47, 3.87, 3.108, 3.126, 4.94, 4.141, 5.14, 5.65, 6.76, 7.17, 7.76, 8.115, 9.12, 9.22, 9.46, 9.181, 9.192, 10.72, 13.46, 13.62, 14.102

Fuel Consumption and Cars 0.11, 1.5, 1.27, 1.31, 1.37, 2.5, 2.23, 2.87, 2.95, 2.97, 2.104, 2.106, 3.16, 3.29, 3.46, 4.22, 4.83, 4.85, 4.111, 4.142, 4.169, 4.190, 5.10, 5.33, 5.88, 6.94, 6.111, 6.118, 7.18, 7.57, 8.71, 8.139, 8.144, 9.89, 9.96, 9.114, 9.155, 10.45, 10.74, 10.91, 10.150, 11.23, 11.30, 12.54, 12.77, 12.140, 14.14, 14.62, 14.122, 14.140

Manufacturing and Product Development 1.6, 1.14, 1.16, 1.17, 1.29, 1.32, 1.33, 1.38, 1.39, 1.42, 1.52, 1.53, 2.12, 2.13, 2.62, 2.90, 2.102, 3.18, 3.26, 3.30, 3.83, 3.91, 3.93, 3.119, 3.121, 3.123, 3.128, 3.132, 3.133, 4.24, 4.52, 4.65, 4.84, 4.96, 4.112, 4.167, 4.172, 4.183, 5.12, 5.32, 5.37, 5.56, 5.62, 5.101, 5.103, 5.125, 5.141, 5.157, 6.12, 6.14, 6.45, 6.56, 6.59, 6.61, 6.83, 6.112, 6.116, 6.123, 6.124, 7.10, 7.11, 7.13, 7.41, 7.43, 7.54, 7.55, 7.59, 7.81, 7.82, 7.84, 7.86, 7.98, 7.100,

7.104, 8.15, 8.67, 9.68, 8.138, 8.146, 8.161, 8.175, 9.49, 9.82, 9.84, 9.90, 9.91, 9.97, 9.116, 9.118, 9.119, 9.120, 9.217, 9.219, 9.221, 9.225, 9.227, 9.228, 9.230, 9.231, 9.233, 9.234, 9.237, 9.245, 9.255, 10.14, 10.16, 10.19, 10.21, 10.23, 10.46, 10.48, 10.50, 10.52, 10.53, 10.55, 10.56, 10.64, 10.73, 10.87, 10.115, 10.120, 10.144, 10.145, 10.147, 10.158, 10.164, 11.16, 11.24, 11.29, 11.51, 11.63, 11.90, 11.91, 11.93, 12.21, 12.31, 12.104, 12.106, 12.162, 12.167, 12.170, 14.15, 14.16, 14.57, 14.96, 14.101, 14.132, 14.135, 14.138, 14.147, 14.148

Marketing and Consumer Behavior 1.5, 1.6, 1.7, 1.12, 1.18, 1.40, 1.41, 1.43, 1.45, 2.6, 2.7, 2.8, 2.11, 2.29, 2.31, 2.33, 2.38, 2.88, 3.51, 3.56, 3.90, 3.95, 4.35, 4.37, 4.55, 4.58, 4.60, 4.63, 4.67, 4.79, 4.81, 4.92, 4.112, 4.126, 4.127, 4.141, 4.142, 4.173, 4.179, 4.185, 4.194, 5.13, 5.36, 5.64, 5.89, 5.100, 5.102, 5.130, 5.148, 5.149, 5.154, 6.18, 6.19, 6.22, 6.43, 6.46, 6.54, 6.81, 6.100, 6.108, 6.126, 7.9, 7.11, 7.72, 7.74, 7.88, 7.89, 7.92, 8.34, 8.77, 8.102, 8.105, 8.108, 8.111, 8.153, 8.158, 9.13, 9.16, 9.53, 9.74, 9.83, 9.93, 9.94, 9.147, 9.152, 9.156, 9.164, 9.186, 9.253, 10.18, 10.27, 10.61, 10.102, 10.109, 10.119, 11.19, 11.99, 12.163, 13.12, 13.15, 13.16, 13.17, 13.22, 13.23, 13.43, 13.44, 13.56, 13.60, 13.65, 13.69, 13.71, 14.97, 14.137, 14.149

Medicine and Clinical Studies 0.4, 0.10, 0.13, 1.5, 1.6, 1.7, 1.15, 1.42, 1.46, 2.5, 2.105, 3.28, 3.124, 4.27, 4.66, 4.86, 4.111, 4.134, 4.155, 4.162, 4.165, 4.180, 4.184, 5.9, 5.12, 5.18, 5.153, 5.155, 5.158, 6.21, 6.57, 6.99, 7.9, 7.10, 7.96, 8.17, 8.81, 8.117, 8.148, 8.152, 8.156, 8.162, 8.169, 8.174, 9.24, 9.47, 9.52, 9.56, 9.113, 9.149, 9.150, 9.183, 9.194, 9.238, 9.244, 9.246, 10.20, 10.24, 10.49, 10.58, 10.73, 10.74, 10.90, 10.113, 10.122, 10.138, 11.22, 11.49, 11.85, 11.87, 12.26, 12.27, 12.29, 12.33, 12.87, 12.105, 12.110, 12.111, 12.154, 12.160, 12.168, 13.50, 14.13, 14.22, 14.23, 14.32, 14.33, 14.55, 14.103, 14.146

Physical Sciences 1.5, 1.49, 2.57, 2.58, 2.99, 3.45, 3.50, 3.52, 3.85, 3.127, 3.129, 3.130, 3.131, 4.19, 4.23, 4.153, 4.174, 5.121, 5.123, 5.131, 6.16, 6.60, 6.79, 6.101, 6.119, 7.10, 7.11, 7.50, 7.87, 7.93, 7.102, 8.31, 8.35, 8.42, 8.70, 8.136,

8.137, 8.143, 8.151, 8.167, 9.17, 9.45, 9.55, 9.75, 9.95, 9.160, 9.161, 9.188, 9.193, 9.215, 9.216, 9.235, 9.248, 9.250, 10.26, 10.59, 10.62, 10.63, 10.79, 10.83, 10.148, 10.151, 10.156, 10.170, 11.27, 11.31, 11.89, 11.96, 12.19, 12.30, 12.50, 12.51, 12.55, 12.62, 12.63, 12.76, 12.86, 12.88, 12.103, 12.109, 12.114, 12.139, 12.143, 12.144, 12.145, 12.148, 12.173, 13.68, 14.41, 14.75, 14.82, 14.120, 14.127, 14.136

Psychology and Human Behavior 0.12, 1.9, 1.12, 1.13, 2.19, 2.20, 2.30, 2.43, 2.55, 2.98, 4.28, 4.62, 4.87, 4.95, 4.97, 4.103, 4.128, 4.132, 4.156, 4.161, 4.181, 4.189, 5.17, 5.61, 5.66, 5.104, 5.119, 5.135, 5.147, 6.17, 6.23, 6.50, 6.96, 6.103, 7.10, 7.22, 7.44, 7.79, 8.47, 8.79, 8.114, 8.147, 8.163, 8.166, 9.51, 9.165, 9.180, 9.185, 10.85, 10.88, 10.102, 10.112, 10.117, 10.121, 10.162, 11.84, 12.25, 12.75, 12.82, 13.11, 13.24, 13.47, 13.53, 13.55, 13.59, 13.64, 13.67, 14.77, 14.99, 14.121, 14.150

Public Health and Nutrition 1.5, 1.6, 1.12, 1.43, 1.44, 1.54, 2.27, 2.56, 2.59, 2.89, 2.108, 3.24, 3.53, 3.61, 3.109, 3.115, 3.118, 4.29, 4.31, 4.54, 4.57, 4.133, 4.168, 4.186, 5.12, 5.59, 5.60, 5.91, 5.96, 5.134, 5.138, 5.156, 5.158, 6.48, 6.49, 6.75, 6.98, 6.102, 6.106, 6.107, 6.115, 6.122, 7.9, 7.15, 7.39, 7.47, 7.51, 7.70, 7.75, 7.101, 8.18, 8.32, 8.49, 8.83, 8.87, 8.100, 8.150, 8.168, 8.170, 8.172, 8.176, 9.31, 9.78, 9.85, 9.87, 9.111, 9.162, 9.163, 9.179, 9.218, 9.224, 9.242, 9.252, 10.25, 10.74, 10.81, 10.84, 10.86, 10.89, 10.106, 10.110, 10.114, 10.149, 10.160, 10.171, 11.15, 11.20, 11.56, 11.58, 11.61, 11.75, 11.86, 11.94, 11.101, 11.102, 12.32, 12.52, 12.59, 12.79, 12.151, 12.157, 12.172, 13.18, 13.20, 13.51, 13.57, 14.39, 14.61, 14.78, 14.117, 14.129, 14.131, 14.141, 14.142, 14.144

Public Policy and Political Science 1.5, 1.10, 1.12, 1.41, 1.47, 2.5, 2.24, 2.34, 2.42, 3.49, 3.107, 3.111, 3.113, 3.116, 4.82, 4.98, 4.105, 4.125, 4.135, 4.150, 4.154, 5.10, 5.34, 5.35, 5.58, 5.92, 5.150, 6.44, 6.78, 6.93, 6.97, 7.11, 7.40, 7.107, 8.41, 8.80, 8.101, 8.104, 8.106, 8.118, 9.21, 9.25, 9.26, 9.27, 9.28, 9.44, 9.48, 9.81, 9.86, 9.122, 9.145, 9.166, 9.182,

9.184, 9.187, 9.189, 9.190, 9.196, 9.240, 9.241, 9.247, 10.17, 10.107, 10.153, 10.163, 10.166, 11.33, 11.79, 12.85, 13.28, 13.45, 13.52, 13.54, 13.61, 14.34, 14.79, 14.95, 14.98

Sports and Leisure 0.5, 0.6, 0.8, 1.6, 1.34, 2.5, 2.10, 2.11, 2.12, 2.13, 2.28, 2.36, 2.41, 2.64, 2.91, 2.107, 3.20, 3.23, 3.27, 3.44, 3.82, 3.92, 3.120, 3.122, 3.125, 3.135, 4.7, 4.9, 4.25, 4.26, 4.59, 4.77, 4.88, 4.89, 4.99, 4.102, 4.111, 4.112, 4.119, 4.121, 4.124, 4.158, 4.159, 4.170, 4.175, 4.193, 5.9, 5.19, 5.40, 5.41, 5.57, 5.63, 5.90, 5.99, 5.105, 5.124, 5.129, 5.144, 5.146, 6.13, 6.25, 6.51, 6.52, 6.62, 6.63, 6.77, 6.80, 6.114, 7.12, 7.20, 7.24, 7.49, 7.52, 7.53, 7.56, 7.73, 7.78, 7.103, 7.106, 8.33, 8.46, 8.48, 8.52, 8.72, 8.76, 8.82, 8.84, 8.85, 8.140, 8.141, 8.154, 8.159, 8.171, 9.19, 9.54, 9.76, 9.112, 9.115, 9.154, 9.158, 9.220, 9.223, 10.54, 10.74, 10.80, 10.116, 10.140, 10.141, 10.142, 10.146, 10.157, 11.18, 11.32, 11.54, 11.59, 11.78, 11.82, 11.88, 11.97, 12.20, 12.24, 12.48, 12.53, 12.64, 12.112, 12.116, 12.169, 13.21, 13.42, 13.48, 13.49, 14.17, 14.36, 14.58, 14.80, 14.119, 14.126, 14.134, 14.151

Technology and the Internet 1.50, 2.100, 2.101, 3.15, 3.32, 3.62, 4.61, 4.112, 4.142, 5.120, 5.145, 6.105, 6.113, 7.9, 7.71, 7.94, 8.38, 8.43, 8.73, 8.119, 8.135, 8.164, 9.18, 9.80, 9.123, 9.191, 10.78, 10.108, 10.118, 11.92, 12.84, 12.158, 14.56, 14.60, 14.133

Travel and Transportation 1.12, 1.35, 1.41, 1.48, 1.51, 2.6, 2.8, 2.14, 2.35, 2.37, 2.39, 2.66, 2.93, 3.13, 3.25, 3.48, 3.57, 3.58, 3.81, 3.88, 3.94, 3.106, 3.136, 4.21, 4.30, 4.32, 4.36, 4.53, 4.93, 4.104, 4.111, 4.121, 4.122, 4.163, 4.166, 4.182, 4.187, 4.191, 4.192, 4.193, 4.194, 5.9, 5.10, 5.12, 5.122, 5.143, 6.15, 6.120, 6.121, 7.10, 7.19, 7.99, 8.16, 8.36, 8.40, 8.74, 8.103, 8.112, 8.142, 9.20, 9.23, 9.43, 9.117, 9.143, 9.243, 10.22, 10.51, 10.72, 10.73, 10.93, 10.111, 10.139, 10.143, 10.152, 11.17, 11.80, 11.83, 11.95, 12.80, 12.152, 12.166, 12.171, 13.27, 13.72, 14.18, 14.20, 14.40, 14.74, 14.113, 14.130, 14.139, 14.143, 14.152


INTRODUCTORY STATISTICS

A Problem-Solving Approach

Stephen Kokoska
Bloomsburg University

Senior Publisher: Terri Ward
Senior Acquisitions Editor: Karen Carson
Marketing Manager: Cara LeClair
Development Editor: Leslie Lahr
Associate Editor: Marie Dripchak
Senior Media Editor: Laura Judge
Media Editor: Catriona Kaplan
Associate Media Editor: Liam Ferguson
Editorial Assistant: Victoria Garvey
Marketing Assistant: Bailey James
Photo Editor: Robin Fadool
Cover Designer: Vicki Tomaselli
Text Designer: Jerry Wilke
Managing Editor: Lisa Kinne
Senior Project Manager: Dennis Free, Aptara®, Inc.
Illustrations and Composition: Aptara®, Inc.
Production Coordinator: Julia DeRosa
Printing and Binding: QuadGraphics
Cover credit: Omar Harran/Moment/Getty Images

Library of Congress Preassigned Control Number: 2014950583

Student Edition Hardcover (packaged with EESEE/CrunchIt! access card):
ISBN-13: 978-1-4641-1169-3
ISBN-10: 1-4641-1169-3

Student Edition Loose-leaf (packaged with EESEE/CrunchIt! access card):
ISBN-13: 978-1-4641-5752-3
ISBN-10: 1-4641-5752-9

Instructor Complimentary Copy:
ISBN-13: 978-1-4641-7986-0
ISBN-10: 1-4641-7986-7

© 2015, 2011 by W. H. Freeman and Company
All rights reserved
Printed in the United States of America
First printing

W. H. Freeman and Company
41 Madison Avenue
New York, NY 10010
Houndmills, Basingstoke RG21 6XS, England
www.whfreeman.com

BRIEF CONTENTS

Chapter 0 Why Study Statistics
Chapter 1 An Introduction to Statistics and Statistical Inference
Chapter 2 Tables and Graphs for Summarizing Data
Chapter 3 Numerical Summary Measures
Chapter 4 Probability
Chapter 5 Random Variables and Discrete Probability Distributions
Chapter 6 Continuous Probability Distributions
Chapter 7 Sampling Distributions
Chapter 8 Confidence Intervals Based on a Single Sample
Chapter 9 Hypothesis Tests Based on a Single Sample
Chapter 10 Confidence Intervals and Hypothesis Tests Based on Two Samples or Treatments
Chapter 11 The Analysis of Variance
Chapter 12 Correlation and Linear Regression
Chapter 13 Categorical Data and Frequency Tables
Chapter 14 Nonparametric Statistics

Optional Sections (available online at www.whfreeman.com/introstats2e and on LaunchPad):
Section 6.5 The Normal Approximation to the Binomial Distribution
Section 12.6 The Polynomial and Qualitative Predictor Models
Section 12.7 Model Selection Procedures

CONTENTS

Chapter 0 Why Study Statistics
The Statistical Inference Procedure
Problem Solving
With a Little Help from Technology

Chapter 1 An Introduction to Statistics and Statistical Inference
1.1 Statistics Today
1.2 Populations, Samples, Probability, and Statistics
1.3 Experiments and Random Samples

Chapter 2 Tables and Graphs for Summarizing Data
2.1 Types of Data
2.2 Bar Charts and Pie Charts
2.3 Stem-and-Leaf Plots
2.4 Frequency Distributions and Histograms

Chapter 3 Numerical Summary Measures
3.1 Measures of Central Tendency
3.2 Measures of Variability
3.3 The Empirical Rule and Measures of Relative Standing
3.4 Five-Number Summary and Box Plots

Chapter 4 Probability
4.1 Experiments, Sample Spaces, and Events
4.2 An Introduction to Probability
4.3 Counting Techniques
4.4 Conditional Probability
4.5 Independence

Chapter 5 Random Variables and Discrete Probability Distributions
5.1 Random Variables
5.2 Probability Distributions for Discrete Random Variables
5.3 Mean, Variance, and Standard Deviation for a Discrete Random Variable
5.4 The Binomial Distribution
5.5 Other Discrete Distributions

Chapter 6 Continuous Probability Distributions
6.1 Probability Distributions for a Continuous Random Variable
6.2 The Normal Distribution
6.3 Checking the Normality Assumption
6.4 The Exponential Distribution

Chapter 7 Sampling Distributions
7.1 Statistics, Parameters, and Sampling Distributions
7.2 The Sampling Distribution of the Sample Mean and the Central Limit Theorem
7.3 The Distribution of the Sample Proportion

Chapter 8 Confidence Intervals Based on a Single Sample
8.1 Point Estimation
8.2 A Confidence Interval for a Population Mean When $\sigma$ Is Known
8.3 A Confidence Interval for a Population Mean When $\sigma$ Is Unknown
8.4 A Large-Sample Confidence Interval for a Population Proportion
8.5 A Confidence Interval for a Population Variance or Standard Deviation

Chapter 9 Hypothesis Tests Based on a Single Sample
9.1 The Parts of a Hypothesis Test and Choosing the Alternative Hypothesis
9.2 Hypothesis Test Errors
9.3 Hypothesis Tests Concerning a Population Mean When $\sigma$ Is Known
9.4 p Values
9.5 Hypothesis Tests Concerning a Population Mean When $\sigma$ Is Unknown
9.6 Large-Sample Hypothesis Tests Concerning a Population Proportion
9.7 Hypothesis Tests Concerning a Population Variance or Standard Deviation

Chapter 10 Confidence Intervals and Hypothesis Tests Based on Two Samples or Treatments
10.1 Comparing Two Population Means Using Independent Samples When Population Variances Are Known
10.2 Comparing Two Population Means Using Independent Samples from Normal Populations
10.3 Paired Data
10.4 Comparing Two Population Proportions Using Large Samples
10.5 Comparing Two Population Variances or Standard Deviations

Chapter 11 The Analysis of Variance
11.1 One-Way ANOVA
11.2 Isolating Differences
11.3 Two-Way ANOVA

Chapter 12 Correlation and Linear Regression
12.1 Simple Linear Regression
12.2 Hypothesis Tests and Correlation
12.3 Inferences Concerning the Mean Value and an Observed Value of Y for $x = x^*$
12.4 Regression Diagnostics
12.5 Multiple Linear Regression

Chapter 13 Categorical Data and Frequency Tables
13.1 Univariate Categorical Data, Goodness-of-Fit Tests
13.2 Bivariate Categorical Data, Tests for Homogeneity and Independence

Chapter 14 Nonparametric Statistics
14.1 The Sign Test
14.2 The Signed-Rank Test
14.3 The Rank-Sum Test
14.4 The Kruskal-Wallis Test
14.5 The Runs Test
14.6 Spearman’s Rank Correlation

Notes and Data Sources

Tables Appendix
Table I Binomial Distribution Cumulative Probabilities
Table II Poisson Distribution Cumulative Probabilities
Table III Standard Normal Distribution Cumulative Probabilities
Table IV Standardized Normal Scores
Table V Critical Values for the t Distribution
Table VI Critical Values for the Chi-Square Distribution
Table VII Critical Values for the F Distribution
Table VIII Critical Values for the Studentized Range Distribution
Table IX Critical Values for the Wilcoxon Signed-Rank Statistic
Table X Critical Values for the Wilcoxon Rank-Sum Statistic
Table XI Critical Values for the Runs Test
Table XII Greek Alphabet

Answers to Odd-Numbered Exercises
Index

Optional Sections (available online at www.whfreeman.com/introstats2e and on LaunchPad):
Section 6.5 The Normal Approximation to the Binomial Distribution
Section 12.6 The Polynomial and Qualitative Predictor Models
Section 12.7 Model Selection Procedures

PREFACE

Students frequently ask me why they need to take an introductory statistics course. My answer is simple. In almost every occupation and in ordinary daily life, you will have to make data-driven decisions and inferences, as well as assess risk. In addition, you must be able to translate complex problems into manageable pieces, recognize patterns, and most important, solve problems. This text helps students develop the fundamental lifelong tool of solving problems and interpreting solutions in real-world terms. One of my goals was to make this problem-solving approach accessible and easy to apply in many situations. I certainly want students to appreciate the beauty of statistics and the connections to so many other disciplines. However, it is even more important for students to be able to apply problem-solving skills to a wide range of academic and career pursuits, including business, science and technology, and education.

Introductory Statistics: A Problem-Solving Approach, Second Edition, presents long-term, universal skills for students taking a one- or two-semester introductory-level statistics course. Examples include guided, explanatory Solution Trails that emphasize problem-solving techniques. Example solutions are presented in a numbered, step-by-step format. The generous collection and variety of exercises provide ample opportunity for practice and review. Concepts, examples, and exercises are presented from a practical, realistic perspective. Real and realistic data sets are current and relevant. The text uses mathematically correct notation and symbols and precise definitions to illustrate statistical procedures and proper communication.

This text is designed to help students fully understand the steps in basic statistical arguments, emphasizing the importance of assumptions in order to follow valid arguments or identify inaccurate conclusions. Most important, students will understand the process of statistical inference.
A four-step process (Claim, Experiment, Likelihood, Conclusion) is used throughout the text to present the smaller pieces of introductory statistics on which the larger, essential statistical inference puzzle is built.

NEW TO THIS EDITION

In this thoroughly updated new edition, Steve Kokoska again combines a classic approach to teaching statistics with contemporary examples, pedagogical features, and use of technology. He blends solid mathematics with lucid, often humorous, writing and a distinctive stepped “Solution Trail” problem-solving approach, which helps students understand the processes behind basic statistical arguments, statistical inference, and data-based decision making.

LaunchPad

Introductory Statistics is accompanied by its own dedicated version of W. H. Freeman’s breakthrough online course space, which offers the following:

■ Pre-built Units for each chapter, curated by experienced educators, with media for each chapter organized and ready to assign or customize to suit the course.
■ All online resources for the text in one location, including an interactive e-Book, LearningCurve adaptive quizzing, Try It Now exercises, StatTutors, video technology manuals, statistical applets, CrunchIt! and JMP statistical software, EESEE case studies, and statistical videos.
■ Intuitive and useful analytics, with a Gradebook that allows instructors to see how the class is progressing, for individual students and as a whole.
■ A streamlined and intuitive interface that lets instructors build an entire course in minutes.


New Solution Trail Exercises

Kokoska’s unique “Solution Trail” framework appears in the text margins alongside selected examples. This feature, highly praised by reviewers, serves as a unique guide for approaching and solving the problems before moving to the solution steps within the example. To allow students to put this guidance to use, exercise sets now feature questions that ask students to create their own solution trails.

New Concept Check Exercises

Strengthening the book’s conceptual coverage, these exercises open each exercise set with true/false, fill-in-the-blank, and short-answer questions that help students solidify their understanding of the reading and the essential statistical ideas.

New Chapter 0

This introductory chapter eases students into the course and Kokoska’s approach. It includes about a dozen exercises that instructors can assign for the first day of class, helping students settle into the course more easily.

Revised Chapter Openers that include “Looking Forward/Looking Back”

“Looking Back” recaps key concepts learned in prior chapters. “Looking Forward” lists the key concepts to be covered within the chapter.

New “Last Step” Exercises Based on Opening Scenarios

The chapter-opening question is presented again as an exercise at the end of the chapter, to close the concept and application loop, as a last step. In addition, this gives instructors the option of making the scenarios assignable and assessable.

Try It Now References

Most examples include a reference to a specific related exercise in the end-of-chapter set. With this, students can test their understanding of the example’s concepts and techniques immediately.

Approximately 40% New and Updated Exercises and Examples

Approximately 100 new examples and almost 800 new exercises are included in this new edition.

More Statistical Technology Integration

In addition to presenting Excel, Minitab, and TI output and instruction, the new edition incorporates sample output screens and guidance for both CrunchIt!, W. H. Freeman’s web-based statistical software, and JMP. (CrunchIt! and JMP packages are available free of charge in LaunchPad.)

FEATURES

Focus on Statistical Inference

The main theme of this text is statistical inference and decision making through interpretation of numerical results. The process of statistical inference is introduced in a variety of contexts, all using a similar, carefully delineated, four-step approach: Claim, Experiment, Likelihood, and Conclusion.


Can the Florida Everglades be saved?

Burmese pythons have invaded the Florida Everglades and now threaten the wildlife indigenous to the area. It is likely that people were keeping pythons as pets and somehow a few animals slithered into Everglades National Park. The first python was found in the Everglades in 1979, and these snakes became an officially established species in 2000.¹ The Everglades has an ideal climate for the pythons, and the large areas of grass allow the snakes plenty of places to hide. In January 2013, the Florida Fish and Wildlife Conservation Commission started the Python Challenge. The purpose of the contest was to thin the python population, which could be tens of thousands, and help save the natural wildlife in the Everglades. There were 800 participants, with prizes for the most pythons captured and for the longest. At the end of the competition, 68 Burmese pythons had been harvested. Suppose a random sample of pythons captured during the Challenge was obtained. The length (in feet) of each python is given in the following table.

 9.3   7.4  11.1   3.9   4.1
 3.5  14.2   3.7   6.7   5.2
 5.2  13.6   7.0   3.3   4.7
 8.3   8.3  12.2   8.3   5.8
 4.6   7.5   5.2  10.9   6.4
11.1   5.2   8.1   9.5   3.8
10.5   6.4   4.2   9.4   7.1
 3.7  12.0   6.1   4.3   4.6
 2.8  10.7   6.3   4.6   7.5
 5.9   4.0  13.2   5.8   6.0

The tabular and graphical techniques presented in this chapter will be used to describe the shape, center, and spread of this distribution of python lengths and to identify any outliers.

Chapter Opener

Each chapter begins with a unique, real-world question, providing an interesting introduction to new concepts and an application to begin discussion. The chapter question is presented again as an exercise at the end of the chapter, to close the concept and application loop, as a last step.

Looking Back

■ Recall that $\bar{x}$, $\hat{p}$, and $s^2$ are the point estimates for the parameters $\mu$, $p$, and $\sigma^2$.
■ Remember how to construct and interpret confidence intervals.
■ Think about the concept of a sampling distribution for a statistic and the process of standardization.

Looking Forward

■ Use the available information in a sample to make a specific decision about a population parameter.
■ Understand the formal decision process and learn the four-part hypothesis test procedure.
■ Conduct formal hypothesis tests concerning the population parameters $\mu$, $p$, and $\sigma^2$.

Looking Back and Looking Forward

At the beginning of almost every chapter, “Looking Back” includes reminders of specific concepts from earlier chapters that will be used to develop new skills. “Looking Forward” offers the learning objectives for the chapter.

Solution Trail 9.8

KEYWORDS
■ Is there any evidence?
■ Greater than the long-term mean
■ Standard deviation 1850
■ Random sample

TRANSLATION
■ Conduct a one-sided, right-tailed test about a population mean $\mu$
■ $\mu_0 = 5960$
■ $\sigma = 1850$

CONCEPTS
■ Hypothesis test concerning a population mean when $\sigma$ is known

VISION
Use the template for a one-sided, right-tailed test about $\mu$. The underlying population distribution is unknown, but n is large and $\sigma$ is known. Determine the appropriate alternative hypothesis and the corresponding rejection region, find the value of the test statistic, and draw a conclusion.
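The vision in Solution Trail 9.8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values $\mu_0 = 5960$ and $\sigma = 1850$ come from the trail, while the sample mean (6500), the sample size (36), and the significance level (0.05) are hypothetical, because the excerpt does not give them. The function name is also an invention for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """One-sided, right-tailed z test of H0: mu = mu0 versus Ha: mu > mu0.

    Assumes sigma is known and n is large enough for the
    standard normal approximation to be reasonable.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))        # test statistic
    critical = NormalDist().inv_cdf(1 - alpha)  # rejection region: z >= critical
    p_value = 1 - NormalDist().cdf(z)           # right-tail probability
    return z, critical, p_value, z >= critical


# mu0 and sigma are taken from Solution Trail 9.8;
# xbar, n, and alpha are hypothetical values chosen for illustration.
z, critical, p_value, reject = right_tailed_z_test(
    xbar=6500, mu0=5960, sigma=1850, n=36, alpha=0.05
)
print(f"z = {z:.4f}, critical value = {critical:.4f}, p = {p_value:.4f}")
print("Reject H0" if reject else "Do not reject H0")
```

With different hypothetical inputs the conclusion may change; the point is the order of operations: state the hypotheses, fix the rejection region, compute the test statistic, and then draw a conclusion.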

Solution Trail

The Solution Trail is a structured technique and visual aid for solving problems that appears in the text margins alongside selected examples. Solution Trails serve as guides for approaching and solving the problems before moving to the solution steps within the example. The four steps of the Solution Trail are

1. Find the keywords.
2. Correctly translate these words into statistics.
3. Determine the applicable concepts.
4. Develop a vision for the solution.

The keywords lead to a translation into statistics. Then, the statistics question is solved with the use of specific concepts. Finally, the keywords, translation, and concepts are all used to develop a vision for the solution. This method encourages students to think conceptually before making calculations. Selected exercises ask students to write a formal Solution Trail.

Step-by-Step Solutions

The solutions to selected examples are presented in logical, systematic steps. Each line in a calculation is explained so that the reader can clearly follow each step in a solution.

SOLUTION

STEP 1 Find the sample mean:

\[
\bar{x} = \frac{1}{5}(6.2 + 4.5 + 6.6 + 7.0 + 8.2) = \frac{1}{5}(32.5) = 6.5
\]

STEP 2 Use Equation 3.4 to find the sample variance.

\[
\begin{aligned}
s^2 &= \frac{1}{4}\left[(6.2 - 6.5)^2 + (4.5 - 6.5)^2 + (6.6 - 6.5)^2 + (7.0 - 6.5)^2 + (8.2 - 6.5)^2\right] && \text{Use data and } \bar{x}. \\
    &= \frac{1}{4}\left[(-0.3)^2 + (-2.0)^2 + (0.1)^2 + (0.5)^2 + (1.7)^2\right] && \text{Compute differences.} \\
    &= \frac{1}{4}\left[0.09 + 4.0 + 0.01 + 0.25 + 2.89\right] && \text{Square each difference.} \\
    &= \frac{1}{4}(7.24) = 1.81 && \text{Add, divide by 4.}
\end{aligned}
\]

STEP 3 Take the positive square root of the variance to find the standard deviation.

\[
s = \sqrt{1.81} \approx 1.3454
\]

A technology solution is shown in Figure 3.17.
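As a rough cross-check of the three steps, the same arithmetic can be reproduced in Python. This is a minimal sketch using only the five observations shown in the sample solution; it is not the technology solution of Figure 3.17, and the variable names are illustrative.

```python
from math import sqrt
from statistics import variance

# The five observations from the sample solution.
data = [6.2, 4.5, 6.6, 7.0, 8.2]
n = len(data)

# STEP 1: sample mean.
xbar = sum(data) / n

# STEP 2: sample variance, dividing the sum of squared deviations by n - 1.
deviations = [x - xbar for x in data]
s2 = sum(d ** 2 for d in deviations) / (n - 1)

# STEP 3: sample standard deviation is the positive square root.
s = sqrt(s2)

print(f"mean = {xbar:.1f}, variance = {s2:.2f}, standard deviation = {s:.4f}")

# The standard library uses the same n - 1 divisor for the sample variance.
library_s2 = variance(data)
```

The hand computation and `statistics.variance` agree up to floating-point rounding because both divide by n - 1.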


Technology Solutions

Wherever possible, a technology solution using CrunchIt!, JMP, the TI-84, Minitab, or Excel is presented at the end of each text example. This allows students to focus on concepts and interpretation.

The points do not lie along a straight line. Each tail is flat, which makes the graph look S-shaped. This suggests that the underlying population is not normal. Figure 6.64 shows a technology solution.

Figure 6.63 Normal probability plot for the chemotherapy dose data (x versus z).

Figure 6.64 JMP normal probability plot.

A Closer Look

The details provided in these sections offer straightforward explanations of various definitions and concepts. The itemized specifics, including hints, tips, and reminders, make it easier for the reader to comprehend and learn important statistical ideas.

Theory Symbols

More advanced material, which may be found in “A Closer Look” and regular exposition as appropriate, is offset with a blue triangle. This material can be skipped by the typical reader, but provides more complete explanations of various topics.

A CLOSER LOOK

1. In Example 4.32,

\[
P(R) = P(R \cap M) + P(R \cap F) = P(R \cap M) + P(R \cap M')
\]

In general, for any two events A and B,

\[
P(A) = P(A \cap B) + P(A \cap B')
\]

This decomposition technique is often needed in order to find P(A). The Venn diagram in Figure 4.19 illustrates this equation. The events $B$ and $B'$ make up the entire sample space: $S = B \cup B'$.

Figure 4.19 Venn diagram showing decomposition of the event A.

2. Suppose $B_1$, $B_2$, and $B_3$ are mutually exclusive and exhaustive: $B_1 \cup B_2 \cup B_3 = S$. For any other event A,

\[
P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3)
\]
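The decomposition can be checked numerically on a small sample space. The sketch below assumes a hypothetical experiment (one roll of a fair six-sided die with equally likely outcomes); the events A and B and the three-set partition are invented for illustration and do not come from Example 4.32.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))


A = {2, 4, 6}   # the roll is even
B = {1, 2, 3}   # the roll is at most 3
B_comp = S - B  # complement of B

# Two-piece decomposition: P(A) = P(A and B) + P(A and B').
two_piece = prob(A & B) + prob(A & B_comp)

# Three mutually exclusive and exhaustive events B1, B2, B3.
partition = [{1, 2}, {3, 4}, {5, 6}]
three_piece = sum(prob(A & Bi) for Bi in partition)

print(prob(A), two_piece, three_piece)
```

Exact fractions are used so that the equalities hold without rounding error; the same check fails if the sets in `partition` overlap or do not cover S.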

How to Construct a Standard Box Plot

Given a set of n observations $x_1, x_2, \ldots, x_n$:

1. Find the five-number summary $x_{\min}$, $Q_1$, $\tilde{x}$, $Q_3$, $x_{\max}$.
2. Draw a (horizontal) measurement axis. Carefully sketch a box with edges at the quartiles: left edge at $Q_1$, right edge at $Q_3$. (The height of the box is irrelevant.)
3. Draw a vertical line in the box at the median.
4. Draw a horizontal line (whisker) from the left edge of the box to the minimum value (from $Q_1$ to $x_{\min}$). Draw a horizontal line (whisker) from the right edge of the box to the maximum value (from $Q_3$ to $x_{\max}$).
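The numbers needed for these steps can be computed as follows. This is a sketch under one important assumption: quartiles are found with the "inclusive" method of Python's `statistics.quantiles`, which may not match the quartile convention the text defines, so results on other data sets can differ slightly from the book's. The function names and the sample data are illustrative.

```python
from statistics import quantiles


def five_number_summary(observations):
    """Return (x_min, Q1, median, Q3, x_max) for a list of numbers.

    Quartiles use the 'inclusive' method, an assumption of this sketch.
    """
    data = sorted(observations)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def box_plot_segments(summary):
    """Return the box edges and the two whisker segments on the measurement axis."""
    x_min, q1, median, q3, x_max = summary
    box = (q1, q3)               # left and right edges of the box
    left_whisker = (x_min, q1)   # from the minimum to the left edge
    right_whisker = (q3, x_max)  # from the right edge to the maximum
    return box, median, left_whisker, right_whisker


summary = five_number_summary([5, 1, 4, 2, 3])  # hypothetical data
print(summary)
print(box_plot_segments(summary))
```

Only the horizontal positions are computed, which is consistent with step 2: the height of the box carries no information.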

How To Boxes This feature provides clear steps for constructing basic graphs or performing essential calculations. How To boxes are color-coded and easy to locate within each chapter.

Definition/Formula Boxes Definitions and formulas are clearly marked and outlined with clean, crisp color-coded lines.

Definition
The sample (arithmetic) mean, denoted $\bar{x}$, of the n observations $x_1, x_2, \ldots, x_n$ is the sum of the observations divided by n. Written mathematically,

$\bar{x} = \frac{1}{n} \sum x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$    (3.1)


PREFACE

Technology Corner This feature, at the end of most sections, presents step-by-step instructions for using CrunchIt!, the TI-84, Minitab, and Excel to solve the examples presented in that section. Keystrokes, menu items, specific functions, and screen illustrations are presented.

Technology Corner
Procedure: Compute the sample mean, sample median, a trimmed mean, and the mode.
Reconsider: Example 3.2, solution, and interpretations.
Figure 3.8 CrunchIt! descriptive statistics.
Figure 3.9 The sample mean and the sample median using built-in calculator functions.
Figure 3.10 The sample mean is part of the output from the 1-Var Stats function.
Figure 3.11 The second output screen from 1-Var Stats shows the sample median (Med).
Figure 3.12 Minitab descriptive statistics.
Figure 3.13 Excel descriptive statistics.

Helpful Icons

Data Set icons indicate when a data set is available online, and also the name of the data set (sample icon: CUBETIME).

Statistical Applet icons indicate statistical applets that are available in LaunchPad (sample icon: STATISTICAL APPLET, MEAN AND MEDIAN).

Stepped Tutorial icons indicate detailed tutorials for specific calculations (sample icon: STEPPED TUTORIALS, BOX PLOTS).

Video Tech Manual icons indicate video instructions for solving certain kinds of problems using statistical software (sample icon labels: VIDEO TECH MANUALS, EXCEL DESCRIPTIVE; SAMPLING FROM A DATA SET).

Solution Trail icons within the exercise sets indicate the opportunity for students to create their own Solution Trails.

Grouped Exercises Kokoska offers a wide variety of interesting, engaging exercises on relevant topics, based on current data, at the end of each section and chapter. These problems provide plenty of opportunity for practice, review, and application of concepts. Answers to odd-numbered section and chapter exercises are given at the back of the book. Exercises are grouped according to:

Concept Check True/False, Fill-in-the-Blank, and Short-Answer exercises designed to reinforce the basic concepts presented in the section.

Concept Check
2.73 True/False A histogram can be used to describe the shape, center, and variability of a distribution.
2.74 Short Answer
a. When is a density histogram appropriate?
b. In a density histogram, what is the sum of areas of all rectangles?
2.75 Fill in the Blank
a. The most common unimodal distribution is a ______.
b. A unimodal distribution is ______ if there is a vertical line of symmetry.
c. If a unimodal distribution is not symmetric, then it is ______.
2.76 True/False A bimodal distribution cannot be symmetric.

Practice Basic, introductory problems to familiarize students with the concepts and solution methods.

Practice
2.77 Consider the data given in the following table. (Data set: EX2.77)

87 81 86 90 88 85 79 91 87 82
91 86 86 87 88 85 92 85 87 86
91 81 89 89 83 90 83 80 90 80
89 85 86 90 90 89 78 91 83 92

Construct a frequency distribution to summarize these data using the class intervals 78–80, 80–82, 82–84, . . . .
2.78 Consider the data given on the text website. Construct a frequency distribution to summarize these data. (Data set: EX2.78)
2.79 Consider the following frequency distribution.


Applications Realistic, appealing exercises to build confidence and promote routine understanding. Many exercises are based on interesting and carefully researched data.

Applications
2.86 Biology and Environmental Science A weather station located along the Maine coast in Kennebunkport collects data on temperature, wind speed, wind chill, and rain. The maximum wind speeds (in miles per hour) for 50 randomly selected times in February 2013 are given on the text website.27 (Data set: MAXWIND)
a. Construct a frequency distribution to summarize these data, and draw the corresponding histogram.
b. Describe the shape of the distribution. Are there any outliers?
2.87 Fuel Consumption and Cars The quality of an auto-

Extended Applications Applied problems that require extra care and thought.

Extended Applications
2.92 Biology and Environmental Science Fruits such as cherries and grapes are harvested and placed in a shallow box or crate called a lug. The size of a lug varies, but one typically holds between 16 and 28 pounds. A random sample of the weight (in pounds) of full lugs holding peaches was obtained, and the data are summarized in the following table.

Class        Frequency
20.0–20.5     6
20.5–21.0    12
21.0–21.5    17
21.5–22.0    21
22.0–22.5    28
22.5–23.0    25
23.0–23.5    19
23.5–24.0    15
24.0–24.5    11
24.5–25.0    10

a. Complete the frequency distribution.
b. Construct a histogram corresponding to this frequency distribution.
c. Estimate the weight w such that 90% of all full peach lugs weigh more than w.
2.93 Travel and Transportation Maglev trains operate in

Challenge Additional exercises and technology projects that allow students to discover more advanced concepts and connections.

CHALLENGE
2.107 Sports and Leisure An ogive, or cumulative relative frequency polygon, is another type of visual representation of a frequency distribution. To construct an ogive:
■ Plot each point (upper endpoint of class interval, cumulative relative frequency).
■ Connect the points with line segments.
Figures 2.52 and 2.53 show a frequency distribution and the corresponding ogive. The observations are ages. The values to be used in the plot are shown in bold in the table.

Class    Frequency    Relative frequency    Cumulative relative frequency
12–16        8              0.08                      0.08
16–20       10              0.10                      0.18
20–24       20              0.20                      0.38
24–28       30              0.30                      0.68
28–32       15              0.15                      0.83
32–36       10              0.10                      0.93
36–40        7              0.07                      1.00
Total      100              1.00

Figure 2.52 Frequency distribution.

Figure 2.53 Resulting ogive. (Axes: Age versus cumulative relative frequency; labeled points include (12, 0) and (28, 0.68).)

A random sample of game scores from Abby Sciuto’s evening bowling league with Sister Rosita was obtained, and the data are given on the text website. (Data set: BOWLING)
a. Construct a frequency distribution for these data.
b. Draw the resulting ogive for these data.
2.108 Public Health and Nutrition A doughnut graph is

Last Step Each set of chapter exercises concludes with the “Last Step.” This exercise is connected to the chapter-opening question and the solution involves the skills and concepts presented in the chapter.

LAST STEP
2.109 Can the Florida Everglades be saved? In January 2013, the Florida Fish and Wildlife Conservation Commission started the Python Challenge. The purpose of the contest was to thin the python population, which could be tens of thousands, and help save the natural wildlife in the Everglades. At the end of the competition, 68 Burmese pythons had been harvested. Suppose a random sample of pythons captured during the Challenge was obtained and the length (in feet) of each is given in the following table: (Data set: PYTHON)

 9.3  3.5  5.2  8.3  4.6 11.1 10.5  3.7  2.8  5.9
 7.4 14.2 13.6  8.3  7.5  5.2  6.4 12.0 10.7  4.0
11.1  3.7  7.0 12.2  5.2  8.1  4.2  6.1  6.3 13.2
 3.9  6.7  3.3  8.3 10.9  9.5  9.4  4.3  4.6  5.8
 4.1  5.2  4.7  5.8  6.4  3.8  7.1  4.6  7.5  6.0

a. Construct a frequency distribution, stem-and-leaf plot, and histogram for these data.
b. Use these tabular and graphical techniques to describe the shape, center, and spread of this distribution, and to identify any outlying values.
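The ogive construction described in sample Exercise 2.107 can be sketched in a few lines. The class frequencies and upper endpoints are taken from the age table in that exercise; the starting point at the lower endpoint of the first class follows the labeled point (12, 0) in Figure 2.53. The function name `ogive_points` is hypothetical, and the sketch stops at computing the points rather than drawing them.

```python
# Values from the age table in sample Exercise 2.107.
lower_bound = 12                                  # lower endpoint of the first class
upper_endpoints = [16, 20, 24, 28, 32, 36, 40]    # upper endpoint of each class
frequencies = [8, 10, 20, 30, 15, 10, 7]

def ogive_points(lower, uppers, freqs):
    """Return (upper endpoint, cumulative relative frequency) pairs."""
    total = sum(freqs)
    points = [(lower, 0.0)]          # the ogive starts at height 0
    running = 0
    for upper, freq in zip(uppers, freqs):
        running += freq              # cumulative frequency so far
        points.append((upper, running / total))
    return points

points = ogive_points(lower_bound, upper_endpoints, frequencies)
for x, y in points:
    print(f"({x}, {y:.2f})")
```

Connecting consecutive points with line segments, in any plotting tool, gives the ogive. Accumulating the integer counts first and dividing once per class avoids the small rounding drift that can come from repeatedly adding relative frequencies.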

CHAPTER 2 SUMMARY

Concept | Page | Notation / Formula / Description
Categorical data set | 29 | Consists of observations that may be placed into categories.
Numerical data set | 29 | Consists of observations that are numbers.
Discrete data set | 30 | The set of all possible values is finite, or countably infinite.
Continuous data set | 30 | The set of all possible values is an interval of numbers.
Frequency distribution | 33 | A table used to describe a data set. It includes the class, frequency, and relative frequency (and cumulative relative frequency, if the data set is numerical).
Class frequency | 33 | The number of observations within a class.
Class relative frequency | 33 | The proportion of observations within a class: class frequency divided by total number of observations.

Chapter Summary A table at the end of each chapter provides a list of the main concepts with brief descriptions, proper notation, and applicable formulas, along with page numbers for quick reference.

MEDIA AND SUPPLEMENTS

W. H. Freeman’s new online homework system, LaunchPad, offers our quality content curated and organized for easy assignability in a simple but powerful interface. We’ve taken what we’ve learned from thousands of instructors and hundreds of thousands of students to create a new generation of W. H. Freeman/Macmillan technology.

Curated Units. Combining a curated collection of videos, homework sets, tutorials, applets, and e-Book content, LaunchPad’s interactive units give instructors building blocks to use as is or as a starting point for their own learning units. Thousands of exercises from the text can be assigned as online homework, including many algorithmic exercises. An entire unit’s worth of work can be assigned in seconds, drastically reducing the amount of time it takes to have a course up and running.

Easily customizable. Instructors can customize the LaunchPad Units by adding quizzes and other activities from our vast collection of resources. They can also add a discussion board, a dropbox, and an RSS feed with a few clicks. LaunchPad allows instructors to customize their students’ experience as much or as little as they like.

Useful analytics. The Gradebook quickly and easily allows instructors to look up performance metrics for classes, individual students, and individual assignments.

Intuitive interface and design. The student experience is simplified. Students’ navigation options and expectations are clearly laid out at all times, ensuring that they can never get lost in the system.

Assets integrated into LaunchPad include:

Interactive e-Book. Every LaunchPad e-Book comes with powerful study tools for students, video and multimedia content, and easy customization for instructors. Students can search, highlight, and bookmark, making it easier to study and access key content. And teachers can ensure that their classes get just the book they want to deliver: customize and rearrange chapters, add and share notes and discussions, and link to quizzes, activities, and other resources.

LearningCurve provides students and instructors with powerful adaptive quizzing, a gamelike format, direct links to the e-Book, and instant feedback. The quizzing system features questions tailored specifically to the text and adapts to students’ responses, providing material at different difficulty levels and topics based on student performance.

SolutionMaster offers an easy-to-use web-based version of the instructor’s solutions, allowing instructors to generate a solution file for any set of homework exercises.

New Stepped Tutorials are centered on algorithmically generated quizzing with step-by-step feedback to help students work their way toward the correct solution. These new exercise tutorials (two to three per chapter) are easily assignable and assessable. Icons in the textbook indicate when a Stepped Tutorial is available for the material being covered.


Statistical Video Series consists of StatClips, StatClips Examples, and Statistically Speaking “Snapshots.” View animated lecture videos, whiteboard lessons, and documentary-style footage that illustrate key statistical concepts and help students visualize statistics in real-world scenarios.

New Video Technology Manuals available for TI-83/84 calculators, Minitab, Excel, JMP, SPSS, R, Rcmdr, and CrunchIt! provide brief instructions for using specific statistical software.

Updated StatTutor Tutorials offer multimedia tutorials that explore important concepts and procedures in a presentation that combines video, audio, and interactive features. The newly revised format includes built-in, assignable assessments and a bright new interface.

Updated Statistical Applets give students hands-on opportunities to familiarize themselves with important statistical concepts and procedures, in an interactive setting that allows them to manipulate variables and see the results graphically. These new applets now include a “Quiz Me” function that allows them to be both assignable and assessable. Icons in the textbook indicate when an applet is available for the material being covered.

CrunchIt! is a web-based statistical program that allows users to perform all the statistical operations and graphing needed for an introductory statistics course and more. It saves users time by automatically loading data from the text, and it provides the flexibility to edit and import additional data.

JMP Student Edition (developed by SAS) is easy to learn and contains all the capabilities required for introductory statistics, including pre-loaded data sets from Introductory Statistics: A Problem-Solving Approach. JMP is the commercial data analysis software of choice for scientists, engineers, and analysts at companies around the globe (for Windows and Mac).

Stats@Work Simulations put students in the role of the statistical consultant, helping them better understand statistics interactively within the context of real-life scenarios.

EESEE Case Studies (Electronic Encyclopedia of Statistical Examples and Exercises), developed by The Ohio State University Statistics Department, teach students to apply their statistical skills by exploring actual case studies using real data.

Data files are available in ASCII, Excel, TI, Minitab, SPSS (an IBM Company),* and JMP formats.

Student Solutions Manual provides solutions to the odd-numbered exercises in the text. Available electronically within LaunchPad, as well as in print form.

Interactive Table Reader allows students to use statistical tables interactively to seek the information they need.

Instructor’s Solutions Manual contains full solutions to all exercises from Introductory Statistics: A Problem-Solving Approach. Available electronically within LaunchPad.

Test Bank offers hundreds of multiple-choice questions. Also available on CD-ROM (for Windows and Mac), where questions can be downloaded, edited, and resequenced to suit each instructor’s needs.

*SPSS was acquired by IBM in October 2009.


Lecture PowerPoint Slides offer a detailed lecture presentation of statistical concepts covered in each chapter of Introductory Statistics: A Problem-Solving Approach.

Additional Resources Available with Introductory Statistics: A Problem-Solving Approach

Companion Website www.whfreeman.com/introstats2e This open-access website includes statistical applets, data files, and self-quizzes. The website also offers three optional sections covering the normal approximation to the binomial distribution (Section 6.5), polynomial and qualitative predictor models (Section 12.6), and model selection procedures (Section 12.7). Instructor access to the Companion Website requires user registration as an instructor and features all of the open-access student web materials, plus:
■ Instructor version of EESEE with solutions to the exercises in the student version.
■ PowerPoint Slides containing all textbook figures and tables.
■ Lecture PowerPoint Slides
■ Tables and Formulas cards offer tables, key concepts, and formulas for use as a study tool or during exams (as allowed by the instructor); available as downloadable PDFs.

Special Software Packages Student versions of JMP and Minitab are available for packaging with the text. JMP is available inside LaunchPad at no additional cost. Contact your W. H. Freeman representative for information or visit www.whfreeman.com.

i-clicker is a two-way radio-frequency classroom response solution developed by educators for educators. Each step of i-clicker’s development has been informed by teaching and learning. To learn more about packaging i-clicker with this textbook, please contact your local sales rep or visit www.iclicker.com.


ACKNOWLEDGMENTS

I would like to thank the following colleagues who offered specific comments and suggestions on the second-edition manuscript throughout various stages of development:

Jonathan Baker, Ohio State University
Andrea Boito, Penn State Altoona
Alexandra Challiou, Notre Dame of Maryland University
Carolyn K. Cuff, Westminster College
Greg Davis, University of Wisconsin Green Bay
Richard Gonzalez, University of Michigan
Justin Grieves, Murray State University
Christian Hansen, Eastern Washington University
Christopher Hay-Jahans, University of Alaska Southeast
Susan Herring, Sonoma State University
Chester Ismay, Arizona State University
Ananda Jayawardhana, Pitt State University
Phillip Kendall, Michigan Technological University
Bashir Khan, St. Mary’s University
Barbara Kisilevsky, Queens University
Tammi Kostos, McHenry County College
Adam Lazowski, Sacred Heart University
Jiexiang Li, College of Charleston
Edgard Maboudou, University of Central Florida
Tina Mancuso, Sage College
Scott McClintock, West Chester University
Jackie Miller, University of Michigan
Daniel Ostrov, Santa Clara University
William Radulovich, Florida State College at Jacksonville
Enayetur Raheem, University of Wisconsin - Green Bay
Daniel Rothe, Alpena Community College
James Stamey, Baylor University
Sunny Wang, St. Francis Xavier University
Derek Webb, Bemidji State University
Daniel Weiner, Boston University
Mark Werner, University of Georgia
Nancy Wyshinski, Trinity College

A special thanks to Ruth Baruth, Terri Ward, Karen Carson, Cara LeClair, Lisa Kinne, Tracey Kuehn, Julia DeRosa, Vicki Tomaselli, Robin Fadool, Marie Dripchak, Liam Ferguson, Catriona Kaplan, Laura Judge, and Victoria Garvey of W. H. Freeman and Company. Designer Jerry Wilke and illustrator Cambraia Fernandez, led by Vicki Tomaselli, offered the creativity, expertise, and hard work that went into the design of this new edition. I am very grateful to Jackie Miller for her insights, suggestions, and editorial talent throughout the production of the second edition. She is doggedly accurate in her accuracy reviews and page proof examination.
Thanks to Dennis Free of Aptara for his patience and typesetting expertise. Much appreciated are the copy editing skills brought to the project by Lynne Lackenbach; her time and perseverance helped to add cohesion and continuity to the flow of the material. Many thanks to Aaron Bogan for bringing his attention to detail and knowledge of statistics to the accuracy review of the solutions manuals. And I could not have completed this project without Karen Carson and Leslie Lahr. Both have superb editing skills, a keen eye for style, a knack for eliciting the best from an author, and unwavering support.

My sincere thanks go to the authors and reviewers of the supplementary materials available with Introductory Statistics: A Problem-Solving Approach, Second Edition; their hard work, expertise, and creativity have culminated in a top-notch package of resources:

Test Bank written by Julie Clark, Hollins University
Test Bank and iClicker slides accuracy reviewed by John Samons, Florida State College at Jacksonville
Practice Quizzes written by James Stamey, Baylor University
Practice Quizzes accuracy reviewed by Laurel Chiappetta, University of Pittsburgh
iClicker slides created by Paul Baker, Catawba College
Lecture PowerPoints created by Susan Herring, Sonoma State University

I am very grateful to the entire Antoniewicz family for providing the foundation for a wide variety of problems, including those that involve nephelometric turbidity units, floor slip testers, and crazy crawler fishing lures. I continue to learn a great deal with every day of writing. I believe this kind of exposition has made me a better teacher. To Joan, thank you for your patience, understanding, inspiration, and tasty treats.

ABOUT THE AUTHOR

Steve received his undergraduate degree from Boston College and his M.S. and Ph.D. from the University of New Hampshire. His initial research interests included the statistical analysis of cancer chemoprevention experiments. He has published a number of research papers in mathematics journals, including Biometrics, Anticancer Research, and Computer Methods and Programs in Biomedicine; presented results at national conferences; and written several books. He has been awarded grants from the National Science Foundation, the Center for Rural Pennsylvania, and the Ben Franklin Program. Steve is a long-time consultant for the College Board and conducted workshops in Brazil, the Dominican Republic, and China. He was the AP Calculus Chief Reader for four years and has been involved with calculus reform and the use of technology in the classroom. He has been teaching at Bloomsburg University for 25 years and recently served as Director of the Honors Program. Steve has been teaching introductory statistics classes throughout his academic career, and there is no doubt that this is his favorite course. This class (and text) provides students with basic, lifelong, quantitative skills that they will use in almost any job and teaches them how to think and reason logically. Steve believes very strongly in data-driven decisions and conceptual understanding through problem solving. Steve’s uncle, Fr. Stanley Bezuszka, a Jesuit and professor at Boston College, was one of the original architects of the so-called new math in the 1950s and 1960s. He had a huge influence on Steve’s career. Steve helped Fr. B. with test accuracy checks, as a teaching assistant, and even writing projects through high school and college. Steve learned about the precision, order, and elegance of mathematics and developed an unbounded enthusiasm to teach.

Credit: Eric Foster


Why Study Statistics

The Science of Intuition In the movie Erin Brockovich, actress Julia Roberts plays a feisty, unemployed, single mother of three children. After losing a lawsuit because of her bad behavior in the courtroom, Erin pressures her lawyer, Ed Masry, for a job, and he concedes. Despite having no legal background, Erin begins working on a real estate case involving Pacific Gas and Electric (PG&E) and the purchase of a home in Hinkley, California. Erin visits the seller, Donna Jensen, and learns that her husband has Hodgkin’s disease and that many Hinkley residents have concerns about the environment. After further investigation, Erin discovers that several residents of Hinkley have suffered from autoimmune disorders and various forms of cancer. In fact, so many people in Hinkley suffer from similar rare diseases that Erin concludes it could not be a coincidence. This is a very natural, intuitive conclusion, and it is the essence of statistical inference.

Erin observed an occurrence that was so rare and extraordinary that she instinctively concluded it could not be due to pure chance or luck. There had to be another reason. Her logic was correct: The unusually high incidence of cancer in Hinkley suggested that something abnormal was happening. Indeed, PG&E had dumped water contaminated with the chemical chromium 6 into unlined storage pools. The polluted water seeped into the groundwater and eventually into local wells, and many people became ill with various medical problems.

We all have this same natural instinctive reaction when we see something that is extraordinary. Sometimes we think, “Wow, that’s incredibly lucky.” More often we question the observed outcomes, “There must be some other explanation.” This natural reaction is the foundation of statistical inference. We make these kinds of decisions every single day. We gather evidence, we make an observation, and we conclude that the outcome is either reasonable or extraordinary.

The purpose of statistics is simply to quantify this typical, everyday, deductive process. We need to learn about probability so that we know for sure when an outcome is really rare. And we need to study the concepts of randomness and uncertainty. The most important point here is that this process is not unusual or exceptional. The purpose of this text is to translate this common practice into statistical terms and models. This will make you better prepared to interpret outcomes, draw appropriate conclusions, and assess risk.

Credit: s_bukley/Newscom


CHAPTER 0

Why Study Statistics

Here is another example of an extraordinary event involving a daily lottery number. The 1980 Pennsylvania Lottery scandal, or the Triple Six Fix, involved a three-digit daily lottery number. Nick Perry was the announcer for the Daily Number and the plan’s architect. With the help of partners, Nick was able to weight all of the balls except for the ones numbered 4 and 6. This meant that the winning three-digit lottery number would be a combination of 4s and 6s. There were thus only eight possible winning lottery numbers, 444, 446, 464, 466, 644, 646, 664, and 666, and the conspirators were certain that the plan would work. The winning number on the day of the fix was 666. Ignoring the connection to the Book of Revelation, lottery officials discovered that there were very unusual betting patterns that day, all on the eight possible lottery numbers involving 4 and 6. This extraordinary occurrence suggested that the unusual bets were not due to pure chance. This conclusion, along with an anonymous tip, helped in a grand jury investigation leading to convictions and jail time for several men.
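The count of eight possible winning numbers can be checked by brute-force enumeration. The sketch below is only an illustration of the counting argument; the function name `possible_numbers` is hypothetical.

```python
from itertools import product

def possible_numbers(digits=("4", "6"), length=3):
    """List every number of the given length built only from `digits`."""
    return ["".join(combo) for combo in product(digits, repeat=length)]

rigged = possible_numbers()
all_numbers = 10 ** 3   # 000 through 999 in a fair three-digit drawing

print(rigged)
print(f"{len(rigged)} of {all_numbers} three-digit numbers remain possible")
```

Each of the three positions has two choices, so the list has 2 × 2 × 2 = 8 entries, compared with 1000 equally likely numbers in a fair drawing. Heavy betting concentrated on exactly these eight numbers is the kind of outcome that is very hard to attribute to chance.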

The Statistical Inference Procedure The crucial prevailing theme in this text is statistical inference and decision making through problem solving. Computation is important and is shown throughout the text. However, calculators and computers remove the drudgery of hand calculations and allow us to concentrate more on interpretation and drawing conclusions. Most problems in this text contain a part asking the reader to interpret the numerical result or to draw a conclusion. The process of questioning a rare occurrence or claim can be described in four steps.

Claim: This is the status quo, the ordinary, typical, and reasonable course of events: what we assume to be true.

Experiment: To check a claim, we conduct a relevant experiment or make an appropriate observation.

Likelihood: Here we consider the likelihood of occurrence of the observed experimental outcome, assuming the claim is true. We will use many techniques to determine whether the experimental outcome is a reasonable observation (subject to some variability), or whether it is an exceptionally rare occurrence. We need to consider carefully and quantify our natural reaction to the relevant experiment. Using probability rules and concepts, we will convert our natural reaction to an experimental outcome into a precise measurement.

Conclusion: There are always only two possible conclusions.
1. If the outcome is reasonable, then we cannot doubt the original claim. The natural conclusion is that nothing out of the ordinary is occurring. More formally, there is no evidence to suggest that the claim is false.
2. If the experimental outcome is rare or extraordinary, we usually disregard the lucky alternative, and we think something is wrong. A rare outcome is a contradiction. Strange occurrences naturally make us question a claim. In this case we believe there is evidence to suggest that the claim is false.

Let’s try to apply these four steps to the PG&E case in Erin Brockovich. The claim or status quo is that the cancer incidence rate in Hinkley is equivalent to the national incidence rate. Recent figures from the American Cancer Society suggest that the cancer incidence rate is approximately 551 in 100,000 for men and 419 in 100,000 for women.1 The experiment or observed outcome is the cancer incidence rate for the population living in Hinkley. In the movie, it is implied that Erin counts the number of people in Hinkley who have developed cancer.


Erin determines that the likelihood, or probability, of observing that many people in Hinkley who have developed cancer is extremely low. Subject to reasonable variability, we should not see that many people with cancer in this location. The conclusion is that this rare event is not due to pure chance or luck. There is some other reason for this rare observation. The implication in the movie is that there is evidence to suggest that something else is affecting the health of the people in Hinkley.
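To make the four steps concrete, here is a minimal sketch of the likelihood calculation under strong simplifying assumptions. Every number in it is hypothetical: the town size, the observed count, and the cutoff for calling an outcome rare are invented for illustration, and the claimed rate is simply a value between the two national figures quoted above. The sketch also assumes that residents develop cancer independently and with the same probability, so that the count follows a binomial distribution; this is a modeling assumption, not something stated in the text.

```python
from math import comb

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below

# All values below are hypothetical, chosen only to illustrate the steps.
claimed_rate = 485 / 100_000   # Claim: the town matches a national-style rate
town_size = 2_000              # Experiment: number of residents examined
observed_cases = 30            # Experiment: number of cancer cases found
rare_cutoff = 0.01             # assumed threshold for calling an outcome rare

# Likelihood: chance of at least this many cases if the claim is true.
likelihood = binomial_tail(town_size, claimed_rate, observed_cases)

# Conclusion: exactly one of the two possible conclusions.
if likelihood < rare_cutoff:
    conclusion = "evidence to suggest the claim is false"
else:
    conclusion = "no evidence to suggest the claim is false"

print(f"P(X >= {observed_cases}) = {likelihood:.2e}: {conclusion}")
```

With these invented numbers, about ten cases would be expected if the claim were true, so observing thirty falls far out in the tail. Different assumed inputs could just as easily lead to the other conclusion.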

Problem Solving

Solution Trail

KEYWORDS
■ Normally distributed
■ Mean
■ Standard deviation

TRANSLATION
■ Normal random variable
■ $\mu = 34$
■ $\sigma = 0.5$

CONCEPTS
■ Normal probability distribution
■ Standardization

VISION
Define a normal random variable and translate each question into a probability statement. Standardize and use cumulative probability associated with Z if necessary.

Perhaps one of the most difficult concepts to teach is problem solving. We all struggle to solve problems: thinking about where to begin, what assumptions we can make, and which rules and techniques to use. One reason many students consider statistics a difficult course is that almost every problem is a word problem. These word problems have to be translated into mathematics. The Solution Trail in this text is a prescriptive technique and visual aid for problem solving. To decipher a word problem, start by identifying the keywords and phrases. Here are the four steps identified in each Solution Trail for solving many of the problems in this text.
1. Find the keywords.
2. Correctly translate these words into statistics.
3. Determine the applicable concepts.
4. Develop a vision, or strategy, for the solution.

Many of the examples presented in this text have a corresponding Solution Trail in the margin to aid in problem solving. An example of a Solution Trail appears in the margin. Note that many of these terms and symbols may be unfamiliar to you at this point. Right now, just focus on the idea that the Solution Trail involves keywords, a translation, concepts, and a vision. The keywords in the problem lead to a translation into statistics. The statistics question is then solved by using the appropriate, specific concepts. The keywords, translation, and concepts are used to develop a grand vision for solving the problem. This solution technique is not applicable to every problem, but it is most appropriate for finding probabilities through hypothesis testing, which is the foundation of most introductory statistics courses. Some exercises in this text ask you to write each step in the Solution Trail formally. As you become accustomed to using this solution style, it will become routine, natural, and helpful.
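The vision in the sample Solution Trail (standardize, then use cumulative probability associated with Z) can be sketched as follows. The mean 34 and standard deviation 0.5 come from the sample trail; the cutoff value 34.5 is hypothetical, because the original question behind that trail is not shown here.

```python
from statistics import NormalDist

mu, sigma = 34, 0.5      # translation step from the sample Solution Trail
x = 34.5                 # hypothetical cutoff, assumed for illustration

# Standardization: convert x to a z-score.
z = (x - mu) / sigma

# Cumulative probability associated with Z, the standard normal variable.
p_via_z = NormalDist().cdf(z)

# The same probability computed directly from the original distribution.
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(x)

print(f"z = {z}")
print(f"P(X <= {x}) = P(Z <= {z}) = {p_via_z:.4f}")
```

Both routes give the same probability, which is the point of standardization: any normal probability question can be rewritten as a question about Z.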

With a Little Help from Technology Although it is important to know and understand underlying formulas, their derivations, and how to apply them, we will use and present several different technology tools to supplement problem solving. Your focus should be on the interpretation of results, not the actual numerical calculations. Four common technology tools are presented in this text.
1. CrunchIt! is available in LaunchPad, the publisher’s online homework system, and is accessed under the Resources tab. The opening screen (Figure 0.1) looks like a spreadsheet with pull-down menus at the top. You can enter data in columns, Var1, Var2, etc., import data from a file, and export and save data. Most Statistics, Graphics, and Distribution Calculator functions start with input screens. Output is displayed in a new screen. Figure 0.2 shows the input screen for a bar chart with summarized data, and Figure 0.3 shows the resulting graph.2


CHAPTER 0

Why Study Statistics

Figure 0.1 CrunchIt! opening screen.

Figure 0.2 Bar chart input screen.

Figure 0.3 CrunchIt! bar chart.

2. The Texas Instruments TI-84 Plus C graphing calculator includes many common statistical features such as confidence intervals, hypothesis tests, and probability distribution functions. Data are entered and edited in the stat list editor as shown in Figure 0.4. Figure 0.5 shows the results from a one-sample t test, and Figure 0.6 shows a visualization of this hypothesis test.


Figure 0.4 The stat list editor.

Figure 0.5 One-sample t-test output.


Figure 0.6 One-sample t-test visualization.

3. Minitab is a powerful software tool for analyzing data. It has a logical interface, including a worksheet screen similar to a common spreadsheet. Data, graph, and statistics tools can be accessed through pull-down menus, and most commands can also be entered in a session window. Figure 0.7 shows a bar chart of the number of Rolex 24 sports car race wins by automobile manufacturer.
4. Excel 2013 includes many common chart features accessible under the Insert tab. There are also probability distribution functions that allow the user to build templates for confidence intervals, hypothesis tests, and other statistical procedures. The Data Analysis tool pack provides additional statistical functions. Figure 0.8 shows some descriptive statistics associated with the ages of 100 stock brokers at a New York City firm.

Figure 0.7 Minitab bar chart.

Figure 0.8 Excel descriptive statistics.

In addition to these tools, JMP statistical software is used by scientists, engineers, and others who want to explore or mine data. Various statistical tools and dynamic graphics are available, and this software features a friendly interactive interface. Figure 0.9 shows a scatter plot of the price of used Honda Accords versus the age of the vehicle, the least-squares regression line, and confidence bands for the true mean price for each age.
Many other technology tools and statistical software packages are also available. For example, R is free statistical software, SPSS is used primarily in the social sciences, and SAS incorporates a proprietary programming language. Regardless of your technology choice, remember that careful and thorough interpretation of the results is an essential part of using software properly.


Figure 0.9 JMP scatter plot, regression line, and confidence bands.

CHAPTER 0 EXERCISES

0.1 Name the four parts of every statistical inference problem.

0.2 Apply the four statistical inference steps to the Triple Six Fix.

0.3 Name the four parts of the Solution Trail.

0.4 The Canary Party recently began the Not a Coincidence campaign to highlight women who have been affected by Merck’s human papillomavirus (HPV) vaccine, Gardasil.3 As of November 2013, a report states that there have been 31,741 adverse events, 10,849 hospitalizations, and 144 deaths due to HPV vaccines. Explain why The Canary Party believes that there must be something wrong with the vaccine.

0.5 It had been very rare for an NBA player to suffer a major knee injury while on the court. Derrick Rose tore an anterior cruciate ligament (ACL) in his left knee in 2012. Rose was the first player to suffer an ACL tear since Danny Manning in 1995 and Bernard King in 1985. Since the injury to Derrick Rose, at least six NBA players have experienced similar injuries (torn ACLs). State two possible explanations for this rare rash of injuries. Which explanation do you think is more plausible? Why?

0.6 In the movie, Wall Street, corporate raider Gordon Gekko and his partner Bud Fox made a lot of money trading stocks. However, several of the trades attracted the attention of the Securities and Exchange Commission (SEC). Why do you think the SEC believed Gordon and Bud may have had inside information or manipulated the price of certain stocks?

0.7 What do you think it means when a weatherperson says, “There is a 50% chance of rain today.” Contact a weatherperson and ask him or her what this statement means. Does this explanation agree with yours?

0.8 James Bozeman of Orlando won the Florida lotto twice. He beat the odds of 1 in 22,957,480 twice to win a total of $13 million. State two possible explanations for this occurrence. Which explanation do you think is more reasonable? Why?

0.9 In January 2014, 33 whales died off the coast of Florida. Twenty-five were found on Kice Island in Collier County. Blair Mase, a marine mammal scientist with the National Oceanic and Atmospheric Administration (NOAA) indicated that NOAA was carefully investigating these deaths.4 Explain why NOAA believes the whales did not die as a result of natural causes and is investigating the deaths.

0.10 In January 2014, 62 people became sick after dining at one of two restaurants that share a kitchen in Muskegon County, Michigan.5 The illnesses occurred over a four-day period, and county health officials began an immediate investigation.
a. Explain why officials investigated the source of these illnesses.
b. Apply the four statistical inference steps to this situation.


0.11 In 2009 and 2010, Toyota issued a costly recall of over 9 million vehicles because of possibly out-of-control gas pedals. There had been at least 60 reported cases of runaway vehicles, some of which resulted in at least one death.6
a. State two possible reasons for this observed high number of runaway vehicles.
b. Why do you think Toyota issued this recall?

0.12 Suppose there were 15 home burglaries in a small town during the entire year. None occurred on a Thursday. Do you think there is evidence to suggest that something very unusual is happening on Thursdays in this town to prevent burglaries on this day of the week? Why or why not?

0.13 Recently, the Sedgwick County Health Department reported at least 27 cases of whooping cough in one month. This observed count was more than in any month in the previous five years. Do you think health officials should be concerned about this outbreak of whooping cough? Why or why not?

0.14 To understand the definitions and formulas in this text, you will need to feel comfortable with mathematical notation. To review and prepare for the notation we will use, make sure you are familiar with the following:
a. Subscript notation, for example, $x_1, x_2, \ldots$
b. Summation notation, for example, $\sum_{i=1}^{n} x_i$
c. The definition of a function.


CHAPTER 1

An Introduction to Statistics and Statistical Inference

1

An Introduction to Statistics and Statistical Inference

Looking Forward

■ Recognize that data and statistics are pervasive and that statistics are used to describe typical values and variability, and to make decisions that affect everyone.
■ Understand the relationships among a population, a sample, probability, and statistics.
■ Learn the basic steps in a statistical inference procedure.

Is it safe to eat rice?
Arsenic is a naturally occurring element that is found mainly in the Earth’s crust. Some people are exposed to high levels of arsenic in their jobs, or near hazardous waste sites, or in some areas of the country in which there are high levels of arsenic in the surrounding soil, rocks, or even water. Exposure to small amounts of arsenic can cause skin discoloration, and long-term exposure has been associated with higher rates of some forms of cancer. Excessive exposure can cause death. In 2012, the U.S. Food and Drug Administration (FDA) and Consumer Reports announced test results that revealed many brands of rice contain more arsenic in a single serving than is allowed by the Environmental Protection Agency (EPA) in a quart of drinking water.1 Trace amounts of arsenic may also be found in flour, juices, and even beer. Earlier in that year, a study conducted at Dartmouth College detected arsenic in cereal bars and infant formula. The FDA has established a safe level of arsenic in drinking water, 10 parts per billion (ppb). However, there is no equivalent safe maximum level for food. Suppose the FDA is conducting an extensive study to determine whether to issue any warnings about rice consumption. One hundred random samples of rice are obtained, and each is carefully measured for arsenic. The methods presented in this chapter will enable us to identify the population of interest and the sample, and to understand the definition and importance of a random sample. Most important, we will characterize the deductive process used when an extraordinary event is observed and cannot be attributed to luck.

CONTENTS
1.1 Statistics Today
1.2 Populations, Samples, Probability, and Statistics
1.3 Experiments and Random Samples


1.1 Statistics Today
Statistical data are everywhere: in newspapers, magazines, the Internet, the evening weather forecast, medical studies, and even sports reports. They are used to describe typical values and variability, and to make decisions that affect every one of us. It is important to be able to read and understand statistical summaries and arguments with a critical eye. This chapter presents the basic elements of every statistics problem: a population and a sample, and their connection to probability and statistics. Two common methods for data collection, observational sampling and experimentation, are also introduced.
Statistical data are used by professionals in many different disciplines. Actuaries are probably the biggest users of statistics. They conduct statistical analyses, assess risk, and estimate financial outcomes. An actuary helped compute your last automobile insurance bill. Statistical analyses are used in a variety of settings. The National Agricultural Statistics Service publishes statistics on food production and supply, prices, farm labor, and even the price of land. Pollsters use statistical methods to predict a candidate’s chances of winning an election. Using complex statistical analyses, companies make decisions about new products.
Traditional statistical techniques and new sophisticated methods are used every day in making decisions that affect our lives directly. Pharmaceutical companies use a battery of standard statistical tests to determine a new drug’s efficacy and possible side effects. Data mining, a combination of computer science and statistics, is a new technique used for constructing theoretical models and detecting patterns. This technique is used by many companies to understand customers better and to respond quickly to their needs. Predictive microbiology is used to ensure that our food is not contaminated and is safe to consume.
Given certain food properties and environmental parameters, a mathematical model is used to predict safety and shelf life.
Statistics is the science of collecting and interpreting data, and drawing logical conclusions from available information to solve real-world problems. This text presents several numerical and graphical procedures for organizing and summarizing data. The constant theme throughout the course, however, is statistical inference using a four-step approach: claim, experiment, likelihood, and conclusion. Here are some examples of statistics in the news.
1. Statistical inference: As reported in the Archives of Internal Medicine,2 researchers discovered a decline in the incidence of heart attacks in one Minnesota county following the implementation of new smoke-free workplace laws. The incidence of heart attacks decreased by 33%, from 150.8 to 100.7 per 100,000 people. The authors concluded that second-hand smoke affects the cardiovascular system nearly as much as active smoking.
2. Summary statistics: In July 2012, Time Newsfeed reported that the average Canadian is now richer than the average American. The story summarized an article published in the Toronto-based Globe and Mail and concluded that the average net worth of a Canadian household was approximately $40,000 more than that of a typical American household. The average net worths in this report are summary statistics that suggest the middle, or central tendency, of a data set.
3. Probability and odds: Every year, approximately 1 in 5 Americans gets the flu. WebMD suggests that the best way to prevent the flu is to get a flu shot, or influenza vaccine. Some people have a higher risk of getting the flu. For example, children and infants, pregnant women, and seniors have a greater chance of getting the flu. Individuals with disabilities are at higher risk because of their lack of mobility. People with certain health conditions have a greater chance of getting the flu, and people traveling to certain areas of the world may be subject to a higher probability of getting the flu. A solid background in probability is necessary to understand statistical inference.


4. Likelihood and inference: Recent research suggests that there is a link between the incidence of lung cancer and cancer-causing chemical pollutants near large industrial facilities, especially oil refineries.3 The incidence of lung cancer in oil refinery counties was higher than in non–oil refinery counties. The chance of this happening was so small that the researchers concluded that oil refinery products significantly affect lung carcinogenesis.
5. Relative frequency and probability: According to Oddee, some vending machines mysteriously fall over and crush 13 people per year.4 If all 312 million Americans are equally likely to be killed by a vending machine, then the probability that a randomly selected individual will be crushed by a vending machine during a particular year is 13/312,000,000 ≈ 0.00000004167. The relative frequency of occurrence is a good estimate of probability and is often used to develop statistical models and make predictions.
There has been an explosion of numerical information, in stories like those above, in business, in consumer reports, and even in casual conversation. Interpretation of graphs and evaluation of statistical arguments are no longer reserved for academics and researchers. It is essential for all of us to be able to understand arguments based on acquired data. This numerical, or quantitative, literacy is a vital life-long tool. No matter how you are employed or where you live, you will have to make decisions based on available information or data. Here are some questions you may have to consider.

1. Do you have enough information (data) to make a confident decision? How were the data obtained? If more information is necessary, how will these data be gathered?
2. How are the data summarized? Are the graphical and/or numerical techniques appropriate? Does the summary represent the data accurately?
3. What is the appropriate statistical technique for analyzing the data? Are the conclusions reasonable and reliable?
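The relative frequency calculation in the vending machine item above can be checked with a few lines of Python. This is only a minimal sketch: the function name relative_frequency is hypothetical (it is not from the text), and the two counts are simply the figures quoted in that item.

```python
# Relative frequency as an estimate of probability (vending machine item).
# The counts below are the figures quoted in the text; the function name
# relative_frequency is a hypothetical helper introduced for this sketch.

def relative_frequency(occurrences: int, trials: int) -> float:
    """Return occurrences / trials, the relative frequency of an event."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return occurrences / trials

deaths_per_year = 13          # people crushed per year, as quoted
population = 312_000_000      # number of Americans, as quoted

p_crushed = relative_frequency(deaths_per_year, population)

print(f"{p_crushed:.11f}")                      # probability to 11 decimal places
print(f"about 1 in {round(1 / p_crushed):,}")   # the same value expressed as odds
```

Expressing the result as "1 in N" is often easier to read than a long string of leading zeros, which is one reason news reports tend to quote small probabilities that way.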

1.2 Populations, Samples, Probability, and Statistics
There are two very general applications of statistics: descriptive statistics and inferential statistics. Descriptive statistics involve summarizing and organizing the given information, graphically and/or numerically. The focus of this text is statistical inference. The procedures of inferential statistics allow us to use the given data to draw conclusions and assess risk.

Definition

Here’s a dictionary definition for inference: a deduction or logical conclusion.

Descriptive statistics: Graphical and numerical methods used to describe, organize, and summarize data.
Inferential statistics: Techniques and methods used to analyze a small, specific set of data so as to draw a conclusion about a large, more general collection of data.

Example 1.1 Mishandled Baggage The U.S. Department of Transportation publishes information concerning automobiles, public transportation, railroads, and waterways. Much of this information can be summarized or organized—in tables or charts, with a variety of graphs, and numerically— to describe typical values and variability. These summary descriptive procedures might be used to indicate preference for a certain airline or to promote safety records. Figure 1.1 shows a bar graph of the total mishandled baggage reports for certain airlines in July 2012.

[Bar chart. Vertical axis: Number of Reports (5,000 to 40,000). Horizontal axis: Airline (Airtran, Jetblue, Frontier, Delta, American, Hawaiian, Southwest, Alaska, United, Mesa, Skywest).]

Figure 1.1 Mishandled baggage reports. (Source: Office of Aviation Enforcement and Proceedings, U.S. Department of Transportation.)

Example 1.2 Summary Statistics The U.S. Census Bureau maintains a huge database of people and households, business and industry, geography, and other special topics. This information may be neatly organized using tables, bar charts, pie charts, histograms, or stem-and-leaf plots, or summarized numerically using the mean, median, quartiles, percentiles, variance, or standard deviation. These simple descriptive statistics reveal characteristics of the entire data set. Part of a table is shown in Table 1.1.
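As a rough illustration of the numerical summaries named in Example 1.2, the sketch below uses Python's standard statistics module. The data are hypothetical household sizes invented for this sketch; they do not come from Table 1.1 or from any Census Bureau file.

```python
import statistics

# Hypothetical household sizes for ten households (illustrative only,
# not taken from any Census Bureau table).
household_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]

summary = {
    "mean": statistics.mean(household_sizes),
    "median": statistics.median(household_sizes),
    # Three cut points that split the ordered data into four equal parts.
    "quartiles": statistics.quantiles(household_sizes, n=4),
    "variance": statistics.variance(household_sizes),  # sample variance
    "std_dev": statistics.stdev(household_sizes),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Each entry condenses the whole list into one or a few numbers, which is the sense in which simple descriptive statistics reveal characteristics of the entire data set.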

Example 1.3 Hand Sanitizers that Kill Germs A department manager for Walgreens claims that Germ-X HandiSani hand sanitizer kills 99.99% of common harmful germs and bacteria in as little as 15 seconds. For a hand sanitizer to be effective, the alcohol concentration must be at least 60%. To check the germ-killing claim, several randomly selected bottles of the hand sanitizer are obtained and the alcohol concentration in each is carefully measured. These concentrations are used to determine whether there is any evidence to suggest that the sanitizer is ineffective. The collected data are used in inferential statistics to draw a conclusion regarding a claim.

Example 1.4 Magnetic Fuel Savers Many companies sell magnetic fuel savers for stoves, which are designed to condition liquefied propane gas (LPG) prior to combustion to increase power output, reduce emissions, and save gas. An independent agency tests these devices by recording the amount of gas necessary to boil a specific volume of water. Each test boil is classified by stove brand, shape and thickness of the pot, and burner size. The data collected are used to determine whether there is a difference in efficiency. If there is a difference in the amount of LPG used, further inferential statistical techniques will be used to isolate this difference. It may be due to the stove brand, type of pot, or burner size.

TRY IT NOW GO TO EXERCISE 1.5

Whether we are summarizing data or making an inference, every statistics problem involves a population and a sample. Consider the definitions on the page that follows.


Table 1.1 A portion of table summarizing health and safety characteristics in owner-occupied units (national)

                                                                              Region
Characteristics                                             Total  Northeast    Midwest      South       West
Total                                                      76,091     13,480     18,032     29,119     15,460
Safety equipment
  Smoke detectors:
    Working smoke detector                                 70,801     12,893     17,015     26,476     14,417
    Powered by:
      Electricity                                           4,506        853        954      1,791        908
      Batteries                                            40,094      7,553     10,372     14,221      7,949
      Both                                                 25,763      4,409      5,622     10,253      5,479
      Not reported                                            438         78         68        211         81
  Carbon monoxide detectors:
    Working carbon monoxide detector                       35,215      9,311     10,832      9,129      5,944
    Powered by:
      Electricity                                           7,248      1,764      2,811      1,647      1,026
      Batteries                                            15,895      4,467      4,533      4,205      2,689
      Both                                                 11,810      2,998      3,416      3,205      2,191
      Not reported                                            262         80         72         72         38
Mold
  Housing units with mold in last 12 months                 2,015        527        524        611        353
    Kitchen                                                   213         28         49         88         49
    Bathroom(s)                                               683         99        180        265        139
    Bedroom(s)                                                378         78         69        137         94
    Living room                                               219         45         52         86         35
    Basement                                                  611        276        216         94         26
    Other room                                                277         75         65         76         60
  Mold not present                                         72,817     12,762     17,231     27,953     14,870
  Not reported                                              1,259        191        277        554        238
Musty smells
  Housing units with musty smells in last 12 months        11,238      2,377      2,999      4,000      1,861
    Daily                                                     772        233        216        202        121
    Weekly                                                  5,235        808      1,083      2,316      1,028
    Monthly                                                   354         94        112         85         63
    A few times                                             4,877      1,242      1,589      1,397        649
  Musty smells not present                                 63,563     10,912     14,760     24,544     13,346
  Not reported                                              1,291        191        273        574        253

All numbers are in thousands. Source: U.S. Census Bureau.

Definition
A population is the entire collection of individuals or objects to be considered or studied.
A sample is a subset of the entire population, a small selection of individuals or objects taken from the entire collection.
A variable is a characteristic of an individual or object in a population of interest.

A CLOSER LOOK
1. A population consists of all objects of a particular type. There are usually infinitely many objects in a population, or at least so many that we cannot look at all of them.
2. A sample is simply a (usually) small part of a population.
3. A variable may be a qualitative (categorical) or a quantitative (numerical) attribute of each individual in a population.

The Solution Trail is a technique and visual aid for problem solving (illustrated in the next example). It is a guide to help us plan how to solve a problem. Look at the Solution Trail before you read the steps of the solution. Start this hike by identifying keywords and phrases. The four steps to solving each problem are
1. Find the keywords.
2. Correctly translate these words into statistics.
3. Determine the applicable concepts.
4. Develop a vision for the solution.

The keywords lead to a translation into statistics. The statistics question is solved using specific concepts. The keywords, translation, and concepts are all used to develop a vision for the solution. This technique is not applicable or necessary in every problem. It is most appropriate for probability through hypothesis testing, the foundation of most introductory statistics courses. The following examples illustrate the relationships among populations, samples, and variables.

Solution Trail 1.5
KEYWORDS
■ Magnesium level
■ Slice of whole-grain bread
■ One hundred slices
TRANSLATION
■ Characteristic of each slice
■ All slices of whole-grain bread
■ Subset of all slices
CONCEPTS
■ Variable
■ Population
■ Sample
VISION
Determine the set of all objects of interest, the subset, and the attribute to be measured.

Example 1.5 High Anxiety Various research studies suggest that whole-grain foods may be a natural help for those people who suffer from anxiety, or maybe even statistics anxiety! Whole grains generally contain high levels of magnesium, and magnesium deficiency can lead to anxiety.5 A new study is concerned with the magnesium level in a slice of whole-grain bread. One hundred slices of whole-grain bread are selected at random from various markets, and the magnesium in each is carefully measured and recorded. Describe the population, sample, and variable in this problem.

SOLUTION
STEP 1 The population consists of all slices of whole-grain bread in the entire world. Although this population is not infinite, we certainly could not examine every single slice.
STEP 2 The sample is the 100 slices selected at random. This is a subset of or selection from the population.
STEP 3 The variable in this problem is the magnesium level. This characteristic will be carefully measured for each slice, and the data will be summarized or used to draw a conclusion.
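The three ideas in Example 1.5 (population, sample, variable) can be mimicked in a short simulation. The sketch below is only an illustration under stated assumptions: the population of 10,000 slices and the magnesium values (roughly 25 mg per slice, with some spread) are simulated and hypothetical, not real measurements.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result on every run

# Hypothetical population: 10,000 slices of whole-grain bread, each with a
# magnesium level in milligrams. These values are simulated for illustration.
population = [random.gauss(25, 3) for _ in range(10_000)]

# Sample: 100 slices selected at random, without replacement.
sample = random.sample(population, 100)

# Variable: the magnesium level, measured on every slice in the sample.
sample_mean = sum(sample) / len(sample)

print(f"population size: {len(population)}")
print(f"sample size:     {len(sample)}")
print(f"sample mean magnesium level (mg): {sample_mean:.2f}")
```

In a real study the population could never be stored in a list like this; only the sample values would be available, which is exactly why inference about the population is needed.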

Example 1.6 Asleep at the Wheel The Centers for Disease Control and Prevention released results from a study that indicated approximately 4% of all adults in the United States said they had fallen asleep at least once while driving in the last month. Nodding off while driving seems to be more common in men, and some officials claim the percentage of all U.S. adults who have fallen asleep while driving is greater than 4%. To check this claim, 10 adult drivers were selected from across the country. Each person was asked if he or she had fallen asleep while driving, and the results were recorded. Describe the population, sample, and variable in this problem.

Solution Trail 1.6
KEYWORDS
■ Adult drivers
■ 10 adult drivers
■ Fallen asleep while driving
TRANSLATION
■ All adult drivers in the United States
■ Subset of all adult drivers
■ Characteristic of adult drivers
CONCEPTS
■ Population
■ Sample
■ Variable
VISION
Determine the set of all objects of interest, the subset, and the attribute measured for each object.

SOLUTION
STEP 1 The population consists of all adult drivers in the United States. This population is not infinite, but it is so large that it would be impossible to contact every adult driver.
STEP 2 The sample consists of the 10 adult drivers selected.
STEP 3 The variable in the problem is whether the driver has fallen asleep at the wheel, a yes/no response.

TRY IT NOW GO TO EXERCISE 1.7

A CLOSER LOOK
Example 1.6 raises some important issues regarding the sample of 10 adult drivers.
1. How large a sample is necessary for us to be confident in our conclusion? Ten adult drivers may not seem like enough. But how many do we need? 100? 1000? We will consider the problem of sample size in Chapter 7 and beyond.
2. This problem does not say how the sample was obtained. Perhaps the first 10 drivers who recently renewed their licenses were selected. Or maybe only those from one state were included. To draw a valid conclusion, we need to be certain the sample is representative of the entire population. The formal definition of a representative sample is presented in Section 1.3.

Statistical inference is based on, and follows from, basic probability concepts. Probability and inferential statistics are both related to a population and a sample, but from different perspectives. For the rest of this chapter, statistics really means inferential statistics.

Definition To solve a probability problem, certain characteristics of a population are assumed to be known. We then answer questions concerning a sample from that population. In a statistics problem, we assume very little about a population. We use the information about a sample to answer questions concerning the population.

Figure 1.2 illustrates this definition. Picture an entire population of individuals or objects. Suppose we know everything about the population and we select a sample from this population. A probability problem would involve answering a question concerning the sample. In a typical statistics problem, we assume very little about the population. We select a sample and analyze it completely. We use this information to draw a conclusion about the population.

[Diagram labels: Population, Sample, Probability, Statistics.]

Figure 1.2 Relationships among probability, statistics, population, and sample.

In Figure 1.2, it may seem like we can start our study anywhere in this circular diagram. However, we need to understand probability before we can learn statistics. A solid background in probability is necessary before we can actually do statistical inference.


Example 1.7 Most-Watched Television Finales According to Koldcast Entertainment Media, the final episode of M*A*S*H, with almost 106 million viewers, is still the most-watched television finale ever.6 Other top television finales include Cheers, Seinfeld, Breaking Bad, and Friends. Consider the population consisting of all television viewers and a sample of 20 from this population.
Population: All television viewers at the time M*A*S*H was aired
Sample: The 20 television viewers from this population
Here is a probability question. The final episode of M*A*S*H was watched by 47% of all television viewers. What is the probability that 10 or more (of the 20) selected viewers watched this episode? We know something about the population, and try to answer a question about the sample.
Here is a statistics question. Suppose we interview the 20 viewers in the sample and find that 9 of the 20 watched the final episode of M*A*S*H. What can we conclude about the percentage of all television viewers who watched the final episode of M*A*S*H? We know about the sample and try to answer a question about the (whole or general) population.
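To make the contrast in Example 1.7 concrete, here is a minimal Python sketch of both questions. It rests on an assumption the example does not state: that the 20 viewers are selected independently and each watched the finale with probability 0.47, so the number who watched follows a binomial model. The function names binomial_pmf and prob_at_least are hypothetical helpers written for this sketch.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k_min: int, n: int, p: float) -> float:
    """Probability of k_min or more successes in n independent trials."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

# Probability question: the population proportion (0.47) is known, and we
# ask about the sample of 20 viewers.
prob_10_or_more = prob_at_least(10, n=20, p=0.47)
print(f"P(10 or more of 20 watched) = {prob_10_or_more:.4f}")

# Statistics question: the sample result (9 of 20) is known, and we use it
# to estimate the unknown population proportion.
sample_proportion = 9 / 20
print(f"sample proportion = {sample_proportion:.2f}")
```

The first calculation runs from population to sample; the second runs from sample to population and yields only an estimate, whose reliability is the subject of later chapters.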

Example 1.8 Trouble Falling Asleep A recent study of sleep habits by Statistics Canada indicated that 35% of those women surveyed had difficulty falling asleep and staying asleep, whereas only 25% of men experienced the same troubles. The study also showed that Canadian women tend to sleep longer than Canadian men.7 Consider the population consisting of all Canadian women and a sample of 100 from this population.
A probability question: Suppose 35% of all Canadian women have difficulty falling asleep. What is the probability that at most 30 (of the 100) women in the sample have difficulty falling asleep?
A statistics question: All 100 women selected are asked to complete an extensive questionnaire. The information indicates that 45 of the 100 have difficulty falling asleep. What does this suggest about the proportion of all Canadian women who have difficulty falling asleep?

TRY IT NOW GO TO EXERCISE 1.12
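The two questions in Example 1.8 can also be written symbolically. The expressions below are a sketch under an assumption not made in the example itself: the 100 women are selected independently, so the number X who have difficulty falling asleep follows a binomial model with n = 100 and p = 0.35. The symbols X and p-hat are introduced here for illustration only.

```latex
\[
P(X \le 30) \;=\; \sum_{k=0}^{30} \binom{100}{k} (0.35)^{k} (0.65)^{100-k},
\qquad
\hat{p} \;=\; \frac{45}{100} \;=\; 0.45 .
\]
```

The sum on the left answers the probability question, using a known population proportion. The sample proportion on the right is the starting point for the statistics question, where the population proportion is unknown.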

SECTION 1.2 EXERCISES

Concept Check

1.1 True/False Inferential statistics are used to draw a conclusion about a population.

1.2 True/False Descriptive statistics are used to indicate how the data were collected.

1.3 Fill in the Blank
a. The entire collection of objects being studied is called the ______.
b. A small subset from the set of all 2013 minivans is called a ______.
c. Consider the amount of sugar in breakfast cereals. This characteristic of breakfast cereal (objects) is called a ______.

Practice

1.4 Probability/Statistics In each of the following problems, write a probability and a statistics question associated with the given information.
a. It has been reported that 62% of all people use social media.
b. Only 37% of all people in the United States are eligible to donate blood.
c. Fifty-two percent of all old washing machines are front-loading.
d. Forty percent of the people in the Washington, DC, area travel over the holiday season.

Applications

1.5 Descriptive or Inferential Statistics Determine whether each of the following is a descriptive or an inferential statistics problem. a. The Nebraska Department of Transportation maintains records concerning all trucks stopped for inspection. A report of these inspections lists the proportion of all trucks stopped, by cargo. b. Eric Knudsen, a researcher at Stanford University Medical Center, obtains a random sample of wild owls and measures how far each can turn its neck. The data are used to conclude that an owl can turn its neck more than 120 degrees from the forward position. c. A Navy research facility runs several tests to check the structural integrity of a new submarine. A laboratory report states the vessel can withstand pressure at depths of at most 800 ft. d. A safety inspector in Atlanta selects a sample of apartment buildings and checks the fire ladders on each. The proportion of broken ladders in the sample is used to estimate the proportion of broken fire ladders in the entire city. e. Like most states, New York State spends a large portion of its budget on health care, pensions, and education. The pie chart in Figure 1.3 shows the percentage of the budget spent on each item for Fiscal Year 2013.8 f. A report from the Louisiana Department of Agriculture and Forestry lists the prices paid for raw forest products at the first point of sale.

Figure 1.3 Pie chart illustrating New York State Fiscal Year 2013 spending. [Pie chart, Fiscal Year 2013 Spending: Health Care 38%, Pensions 12%, Education 11%, General 2%, Interest 3%, Protection 6%, Transportation 10%, Welfare 8%, Other Spending 10%.]

1.6 Descriptive or Inferential Statistics Determine whether each of the following is a descriptive or an inferential statistics problem. a. The bar chart in Figure 1.4 shows the number of bankruptcies for certain industries in Quebec during a recent year.9 b. The resting heart rate was measured for adult males from two separate groups: those who exercise at least three days per week, and those who do not exercise regularly. The resulting data are used to suggest that regular exercise decreases the resting heart rate in adult males. c. Interior Exterior Remodeling, in Northridge, California, maintains a comprehensive list of each home constructed by type, size, exterior color, etc. d. Researchers at the Center for Food Safety selected a sample of frozen toaster apple strudel sold in grocery stores. Measurements indicated the producer was baking each piece of strudel with less apple than advertised on the box. e. A report issued by the athletic department at Brigham Young University listed each item in a trainer’s bag and the number of times each was used. f. The manager at the Bear Pause Theater in Hackensack, Minnesota, surveyed patrons and summarized opinions associated with seating comfort, movie sound, and snacks.

Figure 1.4 Total bankruptcies in Quebec. [Bar chart: bankruptcies (vertical axis, 50 to 300) by industry (horizontal axis): Construction, Manufacturing, Wholesale trade, Retail trade, Transportation, Information, Finance, Real estate.]

1.7 Medicine and Clinical Studies Managers at Cedarcrest Hospital in Newington, Connecticut, are interested in the length of stay (in days) of patients admitted for open-heart surgery. Hospital managers have decided to limit their investigation to open-heart patients who were operated on within the last year. Thirty open-heart surgery patients admitted to the hospital within the last year are selected. What is the population of interest, the sample, and the variable in this problem? Write a Solution Trail for this problem.

1.8 Marketing and Consumer Behavior T-shirt labels irritate many people’s skin, so Calico Graphics of Wolfeboro, New Hampshire, would like to produce shirts without a label. The company wants to know whether there is an advantage to producing this type of T-shirt. Fifty people are surveyed about whether they cut the tags off their T-shirts. What is the population of interest, the sample, and the variable in this problem? Write a Solution Trail for this problem.

1.9 Psychology and Human Behavior Managers at Citigroup, Inc., in New York, are concerned about the number of employees who eat and/or drink at their desks while working. Some managers believe this is an unnecessary distraction, and spills can cause computer failures and ruin documents. Thirty-five employees are selected, and each is questioned about eating/drinking while working. Describe the population and the sample in this problem.

1.10 Public Policy and Political Science Senator Marco Rubio of Florida is unsure of his vote on an emotional and


controversial issue. Before he votes, he would like to know what his constituents think about the proposed bill. An aide for the senator selects 500 people from Florida and asks them whether they believe the bill should become law. Describe the population and the sample in this problem.

1.11 Economics and Finance In 2012, Hurricane Sandy caused over 2 million households to lose power and damaged over 70,000 homes and businesses. Every family filed some sort of insurance claim for damage to their home and/or car. An insurance company serving the area is interested in the typical amount of a claim as a result of this storm. Seventy-five affected families are selected and their total claims are recorded. Describe the population and the sample in this problem.

1.12 Probability or Statistics In each of the following problems, identify the population and the sample, and determine whether the question involves probability or statistics. a. Seventy-five percent of all people who buy a dining room table purchase matching chairs. Five people who purchased a dining room table within the last month are selected at random. What is the probability that all five purchased matching chairs? b. Twenty-five people entering a rest area and food court on Highway 59 near Houston are selected at random. Of these 25 people, 20 purchased food from at least one of the eateries. Estimate the true proportion of people stopping at this rest area who purchase food. c. Historical records indicate 1 out of every 500 people using a particularly steep water slide suffers some kind of injury. Fifty people using the slide are selected at random. How many do you expect to be injured? d. A building inspector in Henderson, Nevada, is checking public buildings with doors that open automatically. One hundred doors are randomly selected. Careful inspection reveals that 12 doors are broken. Use this information to estimate the percentage of automatic doors in Henderson that are broken. e. One thousand people entering Los Angeles International Airport (LAX) are selected at random. Each person is asked to complete a short survey regarding travel. The survey results show 637 carry a frequent-flier card. Is there evidence to suggest the true proportion of travelers entering LAX who carry frequent-flier cards is greater than 0.60? f. The Risdall Advertising Agency reports 65% of all women have purchased perfume within the last three months. Thirty-four women are selected at random. Is it likely more than 20 of these women purchased perfume within the last three months? g. Representatives from the Occupational Safety and Health Administration inspected several for-profit and Medicare nursing homes for any violations. The resulting data will be used to determine whether there is any evidence to suggest the quality of treatment is different in the two types of nursing homes.

1.13 Psychology and Human Behavior During each summer, many families spend part of their vacation time at a beach along the East or West Coast. Due to the popularity of movies like Jaws, and recent shark attacks on surfers, swimmers, snorkelers, and spearfishermen, Americans have become increasingly concerned about water activities. Research suggests that 46% of all shark attacks are on divers.10 One thousand records of shark attacks are selected, and each is categorized by victim group. a. What is the population of interest? b. What is the sample? c. Describe the variable of interest.

1.14 Manufacturing and Product Development Spray-on tans, or fake tans, contain several chemicals that have been linked to allergies, diabetes, and obesity.11 Twenty fake tan products are selected and the amount of dihydroxyacetone (the active ingredient) in each is measured. Describe the population, the sample, and the variable in this problem.

1.15 Medicine and Clinical Studies There is some evidence to suggest people with chronic hepatitis C have a liver enzyme level that fluctuates between normal and abnormal.12 Fifty patients diagnosed with hepatitis C are selected and their liver enzyme levels are recorded each day for one month. Describe the population, the sample, and the variable in this problem.

1.16 Manufacturing and Product Development Paper towel manufacturers constantly advertise their products’ strength, amount of stretch, and softness. A consumer group is interested in testing the absorption of Bounty paper towels. Thirty-five rolls are selected, and the amount of absorption for a single paper towel from each roll is recorded. a. What is the population of interest? b. What is the sample? c. Describe the variable of interest.

Extended Applications

1.17 Manufacturing and Product Development While much of the cheddar cheese consumed around the world is processed, some is still produced in the traditional manner: made in small batches, wrapped in cloth to breathe, and allowed to age. Most traditional cheddar is aged one to two years; like fine wines, older cheddars assume their own character and flavor. Suppose 75% of all cheddars are aged less than two years, and a sample of 20 cheddar cheeses from around the world is obtained. a. Describe the population and the sample in this problem. b. Write a probability question and a statistics question involving this population and sample.

1.18 Marketing and Consumer Behavior Magazines, newspapers, and books have become more readily available in digital format. In addition, the quality of readers, for example, the Kindle, Nook, and iPad, has increased. A recent study suggests that 21% of adults read an ebook within the past year.13 Suppose a sample of 500 adults in the United States is obtained. a. Describe the population and the sample in this problem. b. Write a probability question and a statistics question involving this population and sample.

1.19 Business and Management One of the main reasons that U.S. companies shift jobs overseas is labor costs. Although the compensation gap between the United States and China has decreased recently, the tax code still rewards companies for making certain investments overseas. A recent study suggests that 4% of large companies have plans to relocate jobs back to the United States.14 Seventy-five large companies are selected, and each is surveyed to determine if it plans to move jobs back to the United States. a. What is the population of interest? b. What is the sample? c. Describe the variable of interest. d. Write a probability question and a statistics question involving this population and sample.

1.3 Experiments and Random Samples

Statisticians analyze data from two types of experiments: observational studies and experimental studies. The definitions are given below.

Definition In an observational study, we observe the response for a specific variable for each individual or object. In an experimental study, we investigate the effects of certain conditions on individuals or objects in the sample.

The collected data in an observational study may be summarized in a variety of ways, or used to draw a conclusion about the entire population. The following is an example of an observational study.

Example 1.9 Is There Time for Breakfast?


Lorraine GaNun, a guidance counselor at Rice School in Marlton, New Jersey, is interested in the amount of time each student spends in the morning eating breakfast. Some students wake up an hour before the bus arrives, have a leisurely breakfast, read the newspaper comics, and complete last-minute homework. Others roll out of bed and onto the school bus. Mrs. GaNun decides to measure the amount of time from wake-up to school bus arrival. A random sample of students is selected, and each is asked for the amount of school-day preparation time. The data are summarized graphically and numerically in this observational study.

In almost all statistical applications, it is important for the data to be representative of the relevant population. A representative sample has characteristics similar to those of the entire population, and therefore can be used to draw a conclusion about the (general) population. The following definition describes a method for obtaining data in an observational study to ensure the resulting sample is representative of the corresponding population.

Definition A (simple) random sample (SRS) of size n is a sample selected in such a way that every possible sample of size n has the same chance of being selected.


A CLOSER LOOK

How can we be absolutely certain every possible sample of size n is equally likely?

1. In practice, a random sample may be very difficult to achieve. Statisticians employ various techniques, including random number tables and random number generators, to select a random sample.

2. If a sample is not random, then it is biased. There are many different kinds of bias and factors that contribute to a biased sample.

3. Nonresponse bias is very common when data are collected using surveys. The majority of people who receive a survey in the mail simply discard it. The original collection of people receiving the survey may be random, but the final sample of completed surveys is not. Because the sample is biased, it is impossible to draw a valid conclusion.

4. Self-selection bias occurs when the individuals (or objects) choose to be included in the sample, as opposed to being selected. For example, a television news program may ask viewers to respond to a yes/no question by dialing one of two phone numbers to cast their vote. Viewers choose to participate, and usually those with strong opinions (either way) vote. There are many more who did not have the opportunity to respond, so not every possible sample is equally likely. Certainly this sample is biased, and hence no valid conclusion is possible.

5. If the population is infinite, then the number of simple random samples is also infinite. For finite populations, the formula for the number of possible random samples is presented in Chapter 7.

6. A simple random sample is vital for sound statistical practice. Before doing any analysis, you should always ask how the data were obtained. If there is any evidence of a pattern in selection, if the observations are associated or linked in some way, or if there is some connection among the observations, then the sample is not random. There is simply no way to transform bad data into good statistics.
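To make the definition of a simple random sample concrete, the following minimal sketch lists every possible sample of size n = 2 from a tiny population and then picks one of them so that each has the same chance. The five-member population and the helper name all_samples are hypothetical, chosen only for illustration. The count is checked against the standard number of unordered selections without replacement, C(N, n).

```python
import itertools
import math
import random

def all_samples(population, n):
    """Return every possible unordered sample of size n (no repeats)."""
    return list(itertools.combinations(population, n))

# Hypothetical population of five individuals (labels are illustrative).
population = ["A", "B", "C", "D", "E"]
n = 2

samples = all_samples(population, n)

# The number of possible samples equals the binomial coefficient C(5, 2).
assert len(samples) == math.comb(len(population), n)

# A simple random sample is one of these samples, chosen so that each
# has the same chance, 1 / len(samples), of being selected.
chosen = random.choice(samples)
print(len(samples), "possible samples; selected:", chosen)
```

For a realistic population the list of all samples is far too long to write out, which is why random number tables and generators are used to select the sample directly.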

Example 1.10 Town Facility Master Plan

STATISTICAL APPLET: SIMPLE RANDOM SAMPLE

Mark Traeger, a member of the Sandown, New Hampshire, planning board, is conducting a survey of residents concerning proposed changes in facilities over the next 10 years. He plans to choose 100 residents from the total town population of 5143 and will ask each selected person to complete a short questionnaire. The results will be summarized and presented at the next town meeting. Traeger would like a simple random sample of size 100, a representative sample of the entire town population.

Here is one basic selection procedure. Write each person’s name on a piece of paper and place all of them in a hat. Thoroughly mix the papers and then select 100 names. Although this procedure is clear-cut and uncomplicated, it can be very tedious if the number of individuals or objects in the population is large. In addition, it is hard to guarantee a thorough mixing of the slips of paper.

More practical methods for selecting a simple random sample include the use of a random number table or a random number generator (available in most statistical software packages). In this example, we might assign each resident a number, from 1 to 5143, and use a random number generator to produce a list of 100 distinct numbers in this range. The residents associated with these 100 numbers would make up the random sample.

TRY IT NOW GO TO EXERCISE 1.29
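The random number generator approach described in this example might look like the sketch below. It assumes the residents have already been numbered 1 to 5143; the function name select_resident_ids and the seed value are hypothetical and serve only to make the illustration reproducible.

```python
import random

POPULATION_SIZE = 5143  # town population given in Example 1.10
SAMPLE_SIZE = 100       # number of residents to survey

def select_resident_ids(population_size, sample_size, seed=None):
    """Return sorted resident numbers chosen without replacement."""
    rng = random.Random(seed)  # the seed is optional, for reproducibility
    resident_ids = range(1, population_size + 1)  # residents numbered 1..N
    # rng.sample never repeats a number, so no resident is chosen twice.
    return sorted(rng.sample(resident_ids, sample_size))

selected = select_resident_ids(POPULATION_SIZE, SAMPLE_SIZE, seed=2024)
print(selected[:10])  # the first few selected resident numbers
```

Sampling without replacement matters here: drawing 100 independent random numbers could produce duplicates, which would leave fewer than 100 distinct residents in the sample.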

Researchers often investigate the effects of certain conditions on individuals or objects. The data obtained are from an experimental study. Individuals are randomly assigned to specific groups, and certain factors are systematically controlled, or imposed, in order to investigate and isolate specific effects. The following example is of an experimental study.

Example 1.11 To Fertilize or Not to Fertilize

The manager of Gardener’s Supply Company claims that a new organic fertilizer, in comparison with the leading brand, increases the yield and size of tomatoes. To test this claim, tomato plants are randomly assigned to one of two groups. One group is grown using the leading fertilizer and the other is cultivated using the new product. At harvest time the size and weight of each tomato is recorded, along with the total yield per plant. The collected data from this experiment are used to compare the two fertilizers.

TRY IT NOW GO TO EXERCISE 1.28
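One way to carry out the random assignment in an experiment like this is to shuffle the plants and split the shuffled list in half. The sketch below is only an illustration: the example does not say how many plants are used, so the 20 plant labels, the group names, and the function assign_to_groups are all assumptions.

```python
import random

def assign_to_groups(plant_ids, seed=None):
    """Shuffle the plants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # the seed is optional, for reproducibility
    shuffled = list(plant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "leading_brand": sorted(shuffled[:half]),
        "new_fertilizer": sorted(shuffled[half:]),
    }

# Hypothetical: 20 tomato plants labeled 1..20 (the text gives no count).
groups = assign_to_groups(range(1, 21), seed=7)
for name, plants in groups.items():
    print(name, plants)
```

Because chance alone decides which plant receives which fertilizer, systematic differences between the groups (such as stronger plants all ending up with the new product) become unlikely.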

In an experimental study, researchers must be careful to ensure that significant effects are indeed due to an imposed treatment, or controlled factor. Confounding occurs when several factors together contribute to an effect, but no single cause can be isolated. Suppose the tomato plants in one of the groups in Example 1.11 are watered more and/or exposed to more sunlight and warmer temperatures. If the tomato plants that received the new fertilizer were subject to these different (favorable) growing conditions, a difference in yield cannot be attributed to the new product.

The focus of this text is statistical inference, most of which is based on determining the likelihood of an observed experimental outcome. This strategy will be used informally in the early chapters of this book. Formal procedures will be presented beginning in Chapter 9. For now, we will follow the four-step process presented below.

Statistical Inference Procedure

The process of checking a claim can be divided into four parts.

Claim: This is a statement of what we assume to be true.

Experiment: To check the claim, we conduct a relevant experiment.

Likelihood: This considers the likelihood of occurrence of the observed experimental outcome assuming the claim is true. We will use many techniques to determine whether the experimental outcome is a reasonable observation (subject to reasonable variability), or whether it is a rare occurrence.

Conclusion: There are only two possible conclusions. (1) If the outcome is reasonable, then we cannot doubt the claim. We usually write, “There is no evidence to suggest the claim is false.” (2) If the outcome is rare, we disregard the lucky alternative, and question the claim. A rare outcome is a contradiction. It shouldn’t happen (often) if the claim is true. In this case we write, “There is evidence to suggest the claim is false.”

Example 1.12 Cell Phone Chargers

The Wireless Emporium ships a box containing 1000 cell phone chargers and claims 999 are in perfect condition and only 1 is defective. Upon receipt of the shipment, a quality control inspector reaches into the box, mixes the chargers around a bit, selects one at random, and it’s defective!

Claim: There were 999 good cell phone chargers and 1 defective charger in the box.

Experiment: The quality control inspector selected one cell phone charger from the box, tested it, and found it to be defective.

Likelihood: One of two things has happened. 1. The quality control inspector could be incredibly lucky. Intuitively, the chance of selecting the one defective charger from among the 1000 total chargers is very small. It is possible to select the one defective charger, but it is very unlikely. 2. The claim (999 perfect chargers, 1 defective) is false. Because the chance of selecting the single defective charger is so small, it is more likely the manufacturer (Wireless Emporium) lied about the number of defective chargers in the shipment. (Perhaps there are really 999 defective chargers and only one good charger in the box.)

Conclusion: Typically, statistical inference discounts the lucky alternative. Selecting the single defective charger is an extremely rare occurrence. Therefore, there is evidence to suggest the manufacturer’s claim is false, because this outcome is very rare.

We have found evidence the claim is false by showing that the observed experimental outcome is unreasonable, an outcome so rare that it should almost never happen if the claim is really true.

We will use this four-step process to check a claim in many different contexts. The method for determining likelihood is the key to this valuable tool for logical reasoning.
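The likelihood step in this example can be made numerical with a short calculation. The sketch below assumes one charger is drawn so that each of the 1000 chargers is equally likely to be picked; the function name chance_pick_defective is hypothetical. It compares the chance of the observed outcome under the claim with the chance under the alternative mentioned in the example (999 defective chargers).

```python
from fractions import Fraction

def chance_pick_defective(num_defective, total):
    """Probability that one charger drawn at random is defective."""
    return Fraction(num_defective, total)

TOTAL_CHARGERS = 1000

# Under the claim: 1 defective charger among 1000.
p_claim = chance_pick_defective(1, TOTAL_CHARGERS)

# Under the alternative in the text: 999 defective chargers among 1000.
p_alternative = chance_pick_defective(999, TOTAL_CHARGERS)

print(float(p_claim))        # 0.001
print(float(p_alternative))  # 0.999
```

If the claim is true, the observed outcome has only a 1 in 1000 chance, which is why it is treated as rare and why the conclusion questions the claim rather than crediting luck.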

SECTION 1.3 EXERCISES

Concept Check

1.20 True/False In an observational study, we record the response for a specific variable for each individual or object.

1.21 True/False In an experimental study, we investigate the effects of certain conditions on at least three different groups.

1.22 True/False The number of simple random samples is always infinite.

1.23 True/False A simple random sample is representative of the entire population of interest.

1.24 Fill in the Blank a. If a sample is not random, then it is ______. b. It is very common to experience ______ when data are collected using surveys. c. ______ occurs when individuals ask to be included in a survey.

1.25 Statistical Inference Name the four parts of every statistical inference problem.

1.26 Liar, Liar Suppose an experimental outcome is very rare. What two things could have happened?

Applications

1.27 Fuel Consumption and Cars The administration at the University of Nebraska in Lincoln is interested in student reaction to a planned parking garage on campus. A dormitory near the proposed site is selected and several Student Senate members volunteer to solicit responses. One Thursday evening, the volunteers each take a specific dorm wing, knock on doors, and record student answers to several prepared questions. a. Is this an observational or an experimental study? b. Describe the sample in this problem. c. Is this a random sample? Justify your answer.

1.28 Demographics and Population Statistics State Farm Insurance Company would like to estimate the proportion of volunteer firefighters across the country who are full-time teachers. The 25 largest volunteer fire companies in the United States are identified. Each is contacted and asked to complete a short survey regarding the number of volunteers and the occupation of each volunteer. a. Is this an observational or an experimental study? b. Describe the sample in this problem. c. Is this a random sample? Justify your answer.

1.29 Manufacturing and Product Development The Visniak Bottling Plant in Cheektowaga, New York, has been accused of systematically underfilling 12-oz bottles of soda. An inspection team enters the plant one afternoon and selects bottled soda ready for shipment from various locations within the plant. The contents of each selected bottle are carefully measured. a. Describe the population and the sample in this problem. b. Is this a random sample? Justify your answer.

1.30 Biology and Environmental Science Oregon Scientific has come under suspicion of purposely shipping defective wireless weather stations. The Attorney General’s office in Delaware would like to estimate the proportion of defective products being shipped by this company. Describe a method for obtaining a simple random sample of shipped wireless weather stations.

1.31 Fuel Consumption and Cars The Massachusetts State Police union is interested in the number of miles driven by each officer during an 8-hour shift. Twelve officers are selected from the 11:00 P.M. to 7:00 A.M. shift, and the number of miles traveled by each officer is recorded. a. Is this an observational or an experimental study? b. Describe the population and the sample in this problem. c. Is this a random sample? Justify your answer.

1.32 Manufacturing and Product Development Gillette claims a new disposable razor provides a closer shave than any other brand currently on the market. One hundred men who are observed buying a disposable razor are selected and asked to participate in a shaving study. a. Describe the population and the sample in this problem. b. Is this a random sample? Justify your answer.

1.33 Manufacturing and Product Development Midwest Pet Supplies claims its K9 Chain Link Dog Kennel can be set up in less than 30 minutes. An investigative reporter would like to check this claim. Describe a method for obtaining a simple random sample of customers who set up this kennel.

1.34 Sports and Leisure A National Football League coach is permitted to initiate two challenges to referee calls per game (outside of the final 2 minutes in each half). If both challenges are successful, then the coach is given a third. During a challenge, the referee reviews the play in question on a replay monitor on the field, and the call is either confirmed or the challenge is upheld. The NFL reports the time required to resolve a coach’s challenge is less than 5 minutes. A sports statistician would like to check this claim. Describe a method for obtaining a simple random sample of challenges during NFL games.

1.35 Travel and Transportation The Department of Public Works in Bismarck, North Dakota, would like to estimate the number of potholes per mile (after a long, snowy winter). Each selected mile-long stretch will be thoroughly examined for potholes, and the number in each section will be recorded. a. Describe a method for obtaining a simple random sample of mile-long road segments. b. Is this an observational or an experimental study?

1.36 Biology and Environmental Science The Faber Floral Company in Kankakee, Illinois, claims to have developed a special spray for roses that causes the blossom to last longer than an untreated flower. Fifty long-stemmed roses are obtained and randomly assigned to one of two groups: treated versus untreated. The treated roses are sprayed, and the lifetime of each blossom is carefully recorded. a. Is this an observational or an experimental study? b. What is the variable of interest? c. Describe a technique to randomly assign each rose to a group.


1.37 Fuel Consumption and Cars Electric and plug-in electric cars are designed to save gasoline and help the environment. In addition, there are certain tax credits for these types of hybrid automobiles.15 Although there are certainly benefits to owning a hybrid car, many people complain about the slow acceleration, repair expense, and overall comfort. Thirty-five passengers are randomly selected. Each is blindfolded and taken for a ride in a traditional combustion-engine automobile and in a comparably sized hybrid car (over the same route). The passenger is then asked to select the car with the most comfortable ride. a. Is this an observational or an experimental study? b. What is the variable of interest? c. Describe possible sources of bias in these results.

1.38 Manufacturing and Product Development The ceramic tile used to construct the floors in a mall must be sturdy, easy to clean, and long-lasting. Before installing a specific tile, a construction firm orders a box of 25 tiles and uses a standard strength test on each. The results are used to determine whether the tiles will be used throughout the new mall. a. Describe the population and the sample in this problem. b. Is this a random sample? If so, justify your answer. If not, describe a technique for obtaining a random sample.

1.39 Manufacturing and Product Development Many comforters contain both white feathers and down in order to provide a warm, soft cover. A bed-and-bath company would like to expand its line of products and sell comforters for queen- and king-size beds. Before manufacturing begins, a random sample of comforters is obtained from other companies and the proportions of white feathers, down, and other components are measured and recorded. These data will be used to determine the exact mixture of feathers and down for the new line of comforters. a. Is this an observational or an experimental study? b. What are the variables of interest? c. Describe a method for obtaining a random sample of comforters from current manufacturers.

1.40 Marketing and Consumer Behavior Disney World is going to initiate the use of wireless tracking wristbands for visitors to the Orlando, Florida, theme park.16 The new MagicBand has several functions: it serves as a hotel room key and park entry pass, and is linked to customer credit card information. Visitors wearing these wristbands will have immediate access to certain rides. Suppose a random sample of Disney World visitors wearing wristbands is obtained and the wait time for Big Thunder Mountain is recorded for each. a. Is this an observational or an experimental study? b. What are the variables of interest? c. Describe a method for obtaining a random sample of visitors wearing wristbands.


CHAPTER 1 SUMMARY

Concept | Page | Notation / Formula / Description
Descriptive statistics | 11 | Graphical and numerical methods used to describe, organize, and summarize data.
Inferential statistics | 11 | Techniques and methods used to draw a conclusion or make an inference.
Population | 13 | The entire collection of individuals or objects to be considered or studied.
Sample | 13 | A subset of the entire population.
Variable | 13 | A characteristic of an individual or object in a population of interest.
Probability problem | 15 | Certain properties of a population are assumed known. Questions involve a sample taken from this population.
Statistics problem | 15 | Information about a sample is used to answer questions concerning a population.
Observational study | 19 | We observe the response for a specific variable for each individual or object in the sample.
Experimental study | 19 | We investigate the effects of certain conditions on individuals or objects in the sample.
Simple random sample of size n | 19 | A sample selected in such a way that every possible sample of size n has the same chance of being selected.
Statistical inference procedure | 21 | Four-step process: Claim, Experiment, Likelihood, and Conclusion.

CHAPTER 1 EXERCISES

APPLICATIONS

1.41 Descriptive or Inferential Statistics Determine whether each of the following is a descriptive or an inferential statistics problem. a. The Society of Government Economists conducted a salary and working conditions survey of top bank executives in the United States. A report issued by this group included a table that listed the number of bank executives in each state with salaries above $1 million. b. The Flowers Canada Growers obtained a sample of people who sent roses for Valentine’s Day and recorded the color of the roses purchased. This information was used to construct a table listing the proportion of each color of rose purchased on Valentine’s Day. c. The Intergovernmental Panel on Climate Change collected data associated with global warming and predicted the extinction of up to 30% of plant and animal species in the world. d. American Express conducted a survey of travelers at Los Angeles International Airport. The information was used to estimate the proportion of all travelers who make a purchase in an airport duty-free shop.

1.42 Descriptive or Inferential Statistics Determine whether each of the following is a descriptive or an inferential statistics problem. a. A report by NASA listed each weather satellite orbiting the Earth and the number of years each has been in service. b. The Agricultural Research Service obtained samples of natural cocoa from a variety of sources and measured the total antioxidant capacity in each sample. The resulting data were used to suggest that eating a moderate amount of chocolate may help prevent cancer, heart disease, and stroke. c. The U.S. Patent Office issued a report listing every company that was granted a patent in 2013 and the number of patents awarded to each company. d. A researcher at Emory University used brain scans to conclude that zen meditation may help treat disorders characterized by distracting thoughts.

1.43 Descriptive or Inferential Statistics Determine whether each of the following is a descriptive or an inferential statistics problem. a. The Food Channel conducted a blind taste test to determine the best chocolate for baking. A random sample of adults was obtained, and each was asked to select the best chocolate from among 10 varieties. The final report listed each chocolate along with the number of people who rated it the best. b. After an extensive survey, the Association of Realtors in Chicago concluded that the mean price of a single-family home was less than $500,000. c. After conducting several measurements, the Beijing Municipal Environmental Monitoring Center issued a warning that indicated the density of PM2.5 (fine particulate matter, a measure of air pollution) was over the safe limit. d. International Living issued a report listing percentages of Americans retired and living in each foreign country.

Chapter 1

1.44 Public Health and Nutrition Parents Association would like to determine the proportion of teenagers who have the ability to prepare an entire meal. A sample of teenagers was obtained and all were asked if they can cook. Describe the population of interest, the sample, and the variable of interest in this problem.

1.45 Marketing and Consumer Behavior Hallmark is interested in the proportion of adults who sent a greeting card on Mother-in-Law Day. A sample of 400 adults was obtained and all were asked whether they sent a greeting card on this holiday, which started in 2002. Describe the population and the sample in this problem.

1.46 Medicine and Clinical Studies A recent study by the American Academy of Neurology suggests that soft drinks, iced tea, and even fruit drinks may lead to depression.¹⁷ One thousand individuals who regularly consume soft drinks were selected and each was evaluated for signs of depression. Describe the population, the sample, and the variable in this problem.

1.47 Public Policy and Political Science The Office of the Privacy Commissioner of Canada’s (OPC) Contributions Program is interested in reaction to a proposal to allow police to obtain cell phone records without a subpoena. One thousand people in British Columbia were called and each was asked to respond to several questions. a. Is this an observational or an experimental study? b. Describe the sample in this problem. c. Is this a random sample? Justify your answer.

1.48 Travel and Transportation Amtrak would like to estimate the proportion of travelers on the Sunset Limited, from New Orleans to Los Angeles, who utilize the Sightseer Lounge en route. At the end of the trip on March 15, an Amtrak representative stopped every third person getting off the train and asked them if they used the Sightseer Lounge to buy food, a drink, or souvenirs. a. Is this an observational or an experimental study? b. Describe the sample in this problem. c. Is this a random sample? If not, suggest a method for obtaining a random sample.

1.49 Physical Sciences The Air Liquide Company has developed a new deicing chemical for airplanes, consisting of glycol and several proprietary additives. The new chemical was designed to keep aircraft wings ice-free for a longer period of time. Ten typical Dehaviland commuter airplanes were obtained and randomly assigned to one of two groups: new chemical versus old chemical. Each plane was subject to constant icing conditions in a controlled environment and treated with one of the chemicals. The length of time until ice formed on the wings was recorded for each plane. a. Is this an observational or an experimental study? b. What is the variable of interest? c. Describe a technique to randomly assign each plane to a chemical group.


EXTENDED APPLICATIONS

1.50 Technology and Internet NationMaster.com reported that the most recent software piracy rate in the United States was 20%. They define the piracy rate as the number of units of pirated software deployed divided by the total number of units of software installed. One thousand installed software titles are selected. Each is carefully examined to determine if the software was pirated. a. What is the population of interest? b. What is the sample? c. Describe the variable of interest. d. Write a probability question and a statistic question involving this population and sample.

1.51 Travel and Transportation The Channel Tunnel, or Chunnel, is a 31.4-mile railroad tunnel beneath the English Channel between Folkstone, Kent, in England and Coquelles in France. To ensure passenger safety, engineers selected the 35 deepest areas in the tunnel and measured the pressure on each section. a. Is this an observational or an experimental study? b. What is the variable of interest? c. Is this sample random? If so, justify your answer. If not, describe a technique to obtain a random sample.

1.52 Manufacturing and Product Development In January 2013, flaws were discovered in two Boeing 787 Dreamliners aircraft. Japan Airlines found cracks in the cockpit window in one jet and a minor oil leak in another.¹⁸ To assure the public that the jet is safe, the FAA selected 20 Dreamliners currently operated by American Airlines and carefully inspected each for any flaws. a. Is this an observational or an experimental study? b. What is the variable of interest? c. Is this sample random? If so, justify your answer. If not, describe a technique to obtain a random sample.

1.53 Public Policy and Political Science The Thirty Bench Wine Makers in Beamsville, Ontario, would like to determine if the alcohol content of its wine is determined by the grape variety. Samples from two Riesling wines, made from different grape varieties, were obtained and the alcohol content in each bottle was carefully measured. a. Is this an observational or an experimental study? b. What is the variable of interest?

LAST STEP 1.54 Is it safe to eat rice? A rice manufacturer claims that a single serving contains at most 10 ppb of arsenic. Suppose a random sample of 100 rice servings is obtained and each is measured for arsenic. a. Identify the population of interest and the sample. b. Apply the statistical inference procedure to draw a conclusion when an extraordinary, rare event is observed.

2 Tables and Graphs for Summarizing Data

Looking Back
■ Realize the difference between a sample and the population.
■ Recognize the importance of a simple random sample in the statistical inference procedure.
■ Understand the difference between descriptive and inferential statistics.

Looking Forward
■ Be able to classify a data set as categorical or numerical, discrete or continuous.
■ Learn several graphical summary techniques.
■ Construct bar charts, pie charts, stem-and-leaf plots, and histograms.

Can the Florida Everglades be saved?

Burmese pythons have invaded the Florida Everglades and now threaten the wildlife indigenous to the area. It is likely that people were keeping pythons as pets and somehow a few animals slithered into Everglades National Park. The first python was found in the Everglades in 1979, and these snakes became an officially established species in 2000.¹ The Everglades has an ideal climate for the pythons, and the large areas of grass allow the snakes plenty of places to hide.

In January 2013, the Florida Fish and Wildlife Conservation Commission started the Python Challenge. The purpose of the contest was to thin the python population, which could be tens of thousands, and help save the natural wildlife in the Everglades. There were 800 participants, with prizes for the most pythons captured and for the longest. At the end of the competition, 68 Burmese pythons had been harvested. Suppose a random sample of pythons captured during the Challenge was obtained. The length (in feet) of each python is given in the following table.

9.3   7.4   11.1  3.9   4.1
3.5   14.2  3.7   6.7   5.2
5.2   13.6  7.0   3.3   4.7
8.3   8.3   12.2  8.3   5.8
4.6   7.5   5.2   10.9  6.4
11.1  5.2   8.1   9.5   3.8
10.5  6.4   4.2   9.4   7.1
3.7   12.0  6.1   4.3   4.6
2.8   10.7  6.3   4.6   7.5
5.9   4.0   13.2  5.8   6.0

The tabular and graphical techniques presented in this chapter will be used to describe the shape, center, and spread of this distribution of python lengths and to identify any outliers.

CONTENTS
2.1 Types of Data
2.2 Bar Charts and Pie Charts
2.3 Stem-and-Leaf Plots
2.4 Frequency Distributions and Histograms

2.1 Types of Data

As members of an information society, we have access to all kinds of descriptive statistics: in newspapers, in research journals, and even via the Internet. Whether the information is obtained from a carefully designed experiment or an observational study, the first step is to organize and summarize the data. Tables, charts, and graphs reveal characteristics about the shape, center, and variability of a data set, or distribution. For example, Figure 2.1 shows a stacked bar chart of the number of automobile crashes related to hand-held cell phone use in certain New Jersey counties over the past several years.

Figure 2.1 Automobile accidents in six New Jersey counties over several years. [Stacked bar chart: County (Atlantic, Bergen, Burlington, Camden, Cape May, Cumberland) on the horizontal axis, Number of crashes on the vertical axis.]

The shape of a distribution may be symmetric or skewed. The center of a distribution refers to the position of the majority of the data, and measures of variability indicate the spread of the data. The variability (or dispersion) of a distribution describes how much the measurements vary, and how compact or how spread out the data are. Although they are not suitable for making inferences, the tabular and graphical techniques introduced in this chapter help to describe the distribution of data and identify unusual characteristics. The summary table or graph to be used, and later the statistical analysis to be performed, depends on the type of data. Consider golfers arriving at a public country club on a Saturday morning. Here are several characteristics we could record: brand of golf clubs, handicap, whether the patron wears a golf hat, even the number of days since the golfer last played at this course.

Definition
A data set consisting of observations on only a single characteristic, or attribute, is a univariate data set. If we measure, or record, two observations on each individual, the data set is bivariate. If there are more than two observations on the same person, the data set is multivariate.

We’ll do more with bivariate data in Chapter 12.

Suppose we record only the make of car driven by each person who arrives at the country club, which gives a univariate data set. The observations, for example, Ford, Honda, or Lexus, are categorical. There is no natural ordering of the data, and each observation falls into only one category or class. We might instead ask each person who arrives how long it took to reach the country club. This time the responses, for example, 10, 15, or 45 minutes, are numerical.

Definition
A categorical, or qualitative, univariate data set consists of non-numerical observations that may be placed in categories. A numerical, or quantitative, univariate data set consists of observations that are numbers.

The following examples illustrate the two basic types of data sets.

Example 2.1 Oscar Night
A random sample of actresses attending the Oscars was obtained, and the designer of each gown was recorded. The responses are given in the following table.

Gucci      Versace       Dior       Vera Wang  Dior  Versace
Vera Wang  Ralph Lauren  Valentino  Gucci      Dior

Each response is non-numerical, because there is no natural ordering. This is a (univariate) categorical data set.

Example 2.2 Priority Mail
The U.S. Postal Service offers Priority Mail Flat Rate boxes with which customers can expect delivery within two days of packages weighing up to 70 pounds. A random sample of Small Boxes shipped from Post Offices in Oklahoma was obtained and each was weighed. The resulting weights (in pounds) are given in the following table.

DATA SET: MAILWTS

2.0  6.1  7.3   6.4  8.0  8.2  9.9  5.2  7.7  6.7
6.9  4.8  10.8  9.2  3.2  7.9  8.5  8.9  6.6  8.1

Because each observation is numerical, this is a (univariate) numerical data set.

We can classify numerical data even further. Consider the following examples. On a hot summer day in the Southeast, suppose we record the number of lightning strikes within a specified county during the next 24 hours. The possible values are 0, 1, 2, 3, up to, say, 10. There are only a finite number of possible numerical values, and these values are discrete, isolated points on a number line (Figure 2.2). Instead, suppose we record the barometric pressure, in millibars, at 4:00 P.M. The possible values are not discrete and isolated. Rather, the barometric pressure can (theoretically) be any number in the continuous interval 960 to 1070 (Figure 2.3).

The number of lightning strikes is discrete. For instance, the number of possible lightning strikes can be 1, or 2, or 3, and so on; but not, for example, 2.5. However, if we had an instrument that could measure barometric pressure accurately enough, any number between 960 and 1070 millibars is possible, for example, 995.466347789.

Figure 2.2 Possible values for a discrete data set: a finite number of values, isolated on a number line. [Number line with isolated points at 0, 1, 2, ..., 10.]

Figure 2.3 Possible values for a continuous data set: numerical values on some interval. [Number line with tick marks at 1000, 1050, and 1100.]

Definition
A numerical data set is discrete if the set of all possible values is finite, or countably infinite. Discrete data sets are usually associated with counting. A numerical data set is continuous if the set of all possible values is an interval of numbers. Continuous data sets are usually associated with measuring.

Mathematically, a set is countably infinite if it can be put into one-to-one correspondence with the counting numbers (1, 2, 3, 4, . . .). The three dots, . . . , mean the list continues in the same manner.

A CLOSER LOOK
1. To decide whether a data set is discrete or continuous, consider all the possible values. Finite or countably infinite means discrete. An interval of possible values means continuous.
2. Countably infinite means there are infinitely many possible values, but they are countable. You may not ever be able to finish counting all of the possible values, but there exists a method for actually counting them.
3. The interval for a continuous data set can be any interval, of any length, open or closed. The exact interval may not be known, only that there is some interval of possible values.
4. In practice, we have no measurement device that is precise enough to return any number in some interval. We may only be able to achieve up to 10 digits of accuracy. So a continuous data set may contain any number in some interval in theory, but not in reality.

The classifications of univariate data are shown in Figure 2.4. Here is an example to illustrate these classifications.

Figure 2.4 Classifications of univariate data. [Tree diagram: univariate data divide into categorical data and numerical data; numerical data divide into discrete data and continuous data.]

Example 2.3 Univariate Data Classifications
A researcher obtained the following observations. Classify each resulting data set as categorical or numerical. If the data set is numerical, determine whether it is discrete or continuous.
a. The number of books read by middle-school students during an academic year.
b. The position of the drawbridge in Belmar, New Jersey, at noon on days in July. Assume the drawbridge is not moving, and is either open or closed to boat traffic.
c. The length of time (in minutes) it takes to get a haircut.
d. The number of garage sales advertised in a local newspaper.
e. The types of candy received at houses on Halloween.
f. The air pressure in footballs at the beginning of college games.
g. The type of plumbing problem reported by the next person who contacts the Plumbing Pros.

SOLUTION
a. The observations are numbers, so the data set is numerical. The set of possible values is finite. We don’t know the maximum number of books read, but the possible numbers in the data set represent counts. The data set is discrete.
b. The observations are categorical: Up (open) or down (closed). There is no natural ordering; the possible responses fall into groups or classes. This data set is categorical.
c. The observations are numbers and the set of possible values is some interval, perhaps 5 to 45 minutes. This is a numerical continuous data set.
d. The observations are numbers and the set of possible values is finite. We can count the number of advertised garage sales. The minimum number may be 0 and the maximum may be 25. This is a numerical discrete data set.
e. The observations may be Milky Way, Snickers, Nestle Crunch, etc. Although there may be some personal preference and an individual ranking, this is a categorical data set.
f. The observations are numbers and the set of possible values is some interval, say 12.5 to 13.5 psi. This is a numerical continuous data set.
g. The observations are dripping faucets, leaking pipes, plugged sinks, etc. There may be some preference for the Plumbing Pros, but this is a categorical data set.

TRY IT NOW: GO TO EXERCISE 2.5

Ways of summarizing and displaying categorical data are discussed in Section 2.2, and tables and graphs for numerical data are presented in Sections 2.3 and 2.4.
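The classification scheme in Figure 2.4 can be sketched in a few lines of Python. This is only an illustration, not part of the text: the function name `classify` and the flag `measured_on_interval` are hypothetical, and the book counts are made-up values. Whether a numerical data set is discrete or continuous depends on the set of possible values, which the observations alone cannot reveal, so the sketch asks the user to supply that judgment.

```python
from numbers import Real

def classify(observations, measured_on_interval=False):
    """Return 'categorical', 'discrete', or 'continuous' for a univariate data set.

    measured_on_interval is a judgment supplied by the user (hypothetical flag):
    True when the possible values form an interval (measuring),
    False when they are finite or countably infinite (counting).
    """
    # bool is a subclass of int in Python, so exclude it explicitly
    numerical = all(
        isinstance(x, Real) and not isinstance(x, bool) for x in observations
    )
    if not numerical:
        return "categorical"
    return "continuous" if measured_on_interval else "discrete"

gowns = ["Gucci", "Versace", "Dior", "Vera Wang"]   # from Example 2.1
weights = [2.0, 6.1, 7.3, 6.4, 8.0]                 # from Example 2.2 (pounds)
books = [3, 12, 7, 0, 5]                            # hypothetical counts of books read

print(classify(gowns))                               # categorical
print(classify(weights, measured_on_interval=True))  # continuous
print(classify(books))                               # discrete
```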

SECTION 2.1 EXERCISES

Concept Check

2.1 True/False A data set obtained by recording the height and weight of every person entering a doctor’s office is univariate.

2.2 True/False Every data set is multivariate.

2.3 True/False A data set consisting of 37 times, in seconds, for pedestrians to cross a certain city street is univariate.

2.4 Fill in the Blank a. A ______ univariate data set consists of observations that are numbers. b. A ______ univariate data set consists of nonnumerical observations. c. If the set of all possible values for a numerical data set is finite, then the data set is ______. d. If the set of all possible values for a numerical data set is some interval of numbers, then the data set is ______.

Practice

2.5 Univariate Data Classifications A set of observations is obtained as indicated below. In each case, classify the resulting data set as categorical or numerical. If the data set is numerical, determine whether it is discrete or continuous. a. The weights of several reams of paper. b. The number of cars towed from the Pennsylvania Turnpike during given 24-hour periods. c. The first ingredient in the product listing of boxes of cereal. d. The number of games the Red Sox win during several seasons. e. The amount of sand used on roads during winters in a small town. f. The diagnoses of patients in an emergency ward.

2.6 Univariate Data Classifications A set of observations is obtained as indicated below. In each case, classify the resulting data set as categorical or numerical. If the data set is numerical, determine whether it is discrete or continuous. a. The lengths of the spans of bridges in New York State. b. The number of people hired by a company during certain weeks. c. The cloud ceiling at airports around the country. d. The temperature of the coffee purchased at several fast-food restaurants. e. The type of notebook used by students in a statistics class. f. The classifications of Forward Operating Air Force bases (Main Air Base, Air Facility, Air Site, or Air Point).

2.7 Univariate Data Classifications A set of observations is obtained as indicated below. In each case, classify the resulting data set as categorical or numerical. If the data set is numerical, determine whether it is discrete or continuous. a. The number of steps on apartment fire escapes. b. The number of leaves on maple trees. c. The reason several automobiles fail inspection. d. The weight of fully loaded tractor trailers. e. The area of several Nebraska farms. f. The cellular calling plan selected by customers.

2.8 Univariate Data Classifications A set of observations is obtained as indicated below. In each case, classify the resulting data set as categorical or numerical. If the data set is numerical, determine whether it is discrete or continuous. a. The number of engine revolutions per minute in automobiles. b. The thickness of the polar ice cap in several locations. c. The state in which families vacationed last summer. d. The type of Internet connection in county households. e. The make of watch worn by people entering a certain department store. f. The number of raisins in 24-ounce boxes.

2.9 Numerical Observations A set of numerical observations is obtained as described below. Classify each resulting data set as discrete or continuous. a. The widths of posters at an art gallery. b. The time it takes to compile computer programs. c. The number of radioactive particles that escape from special containers during a one-hour period. d. The time it takes to bake batches of banana muffins. e. The concentration of carbon monoxide in homes during the winter. f. The number of pages in best-selling murder-mystery novels.

2.10 Numerical Observations A set of numerical observations is obtained as described below. Classify each resulting data set as discrete or continuous. a. The weight of baseball bats. b. The area of selected dorm rooms. c. The number of bees in hives. d. The height of a storm surge during hurricanes. e. The amount of ink used in office printers during a week. f. The number of fish in office aquariums.

2.11 Numerical Observations A set of numerical observations is obtained as described below. Classify each resulting data set as discrete or continuous. a. The time it takes giant slalom skiers to cover a race course. b. The number of magazines available for sale at newsstands. c. The number of black squares in crossword puzzles. d. The length of time spent waiting in line at grocery-store checkout lanes. e. The number of French fries in a small order from fast-food restaurants. f. The length in words of email messages received.

2.12 Univariate Data Classifications Classify each data set as categorical, discrete, or continuous. a. A random sample of mature Eastern tent caterpillars is obtained from a tree branch in a neighborhood yard. The length of each caterpillar is recorded. b. Randomly selected prime-time television shows are selected and the number of violent acts is recorded for each show. c. A representative sample of employees from a large company is obtained, and the overtime hours for the past month are recorded for each employee. d. An HMO selects a random sample of subscribers and records the number of office visits over the past year for each patient. e. Thirty-six apples are randomly selected from an orchard. Each is graded for quality of appearance: excellent, good, fair, or poor. f. A random sample of mattresses is obtained and the firmness (medium, medium firm, firm, or extra firm) of each is recorded. 2.13 Univariate Data Classifications Classify each data set as categorical, discrete, or continuous. a. A random sample of cheeses is obtained and the number of months each is allowed to age before sale is recorded. b. Sixteen universities are selected and each computer network system is carefully analyzed. The computer virus threat is assessed for each campus: low, medium, or high. c. A random sample of Waterford Normandy dinner plates is selected and the weight of each plate is recorded. d. Thirty-five new customers at a health club are selected and the body-fat percentage of each member is computed and recorded. e. A random sample of CDs is obtained from a local music store. The company that produced each CD is noted. f. A collection of pens is obtained from employees at a large company. For each pen, the outside diameter of the barrel at its widest point is measured and recorded. 2.14 Univariate Data Classifications Classify each data set as categorical, discrete, or continuous. a. 
A random sample of Hudson River ferry trips is obtained and the number of riders on each trip is recorded. b. A random sample of military helicopters is obtained and the weight of each is recorded. c. A random sample of communities in Canada is selected and the number of full-time police officers employed is recorded. d. A random sample of locations in the United States is selected. Temperature data are used to determine whether or not a new record high temperature was set during the past year. e. A random sample of stock analysts is obtained and each is asked to rate a specific stock as buy, sell, or hold. f. A random sample of cross-country flights is obtained and the number of controllers each pilot talks to during the flight is recorded.


2.2 Bar Charts and Pie Charts

The natural summary measures for a categorical data set are the number of times each category occurred and the proportion of times each category occurred. These values are usually displayed in a table as in Table 2.1.

Table 2.1 A frequency distribution summarizing the results of a survey on computer security threats

Class                        Frequency   Relative Frequency
Physical damage                 130            0.26
Natural events                   50            0.10
Loss of essential services       75            0.15
Compromise of information        35            0.07
Technical failures               95            0.19
Compromise of functions         115            0.23
Total                           500            1.00

Definition
A frequency distribution for categorical data is a summary table that presents categories, counts, and proportions.
1. Each unique value in a categorical data set is a label, or class. In Table 2.1, the classes are physical damage, natural events, loss of essential services, etc.
2. The frequency is the count for each class. In Table 2.1, the frequency for the compromise of information class is 35 (i.e., 35 computer security threats were due to compromise of information).
3. The relative frequency, or sample proportion, for each class is the frequency of the class divided by the total number of observations. In Table 2.1, the relative frequency for the technical failures class is 95/500 = 0.19.

A frequency distribution for a categorical data set is illustrated in the next example.

Example 2.4 Cruise Ship Destinations
A random sample of cruise ships leaving from the Port of New York showed the following destinations.

Bermuda        Southampton    Caribbean      Caribbean      Bahamas
Southampton    Bermuda        Bermuda        Southampton    Bermuda
Mediterranean  Southampton    Mediterranean  Mediterranean  Bahamas
Southampton    Caribbean      Caribbean      Southampton    Southampton
Caribbean      Caribbean      Southampton    Southampton    Southampton

Construct a frequency distribution to describe these data. What proportion of cruise ships did not go to Southampton?

SOLUTION
STEP 1 Each unique destination is a label, or class. This is a categorical data set. There are five unique classes and 25 observations in total.
STEP 2 Draw a table and list each unique class in the left-hand column. Find the frequency and relative frequency for each class. For example, because Bermuda appears four times in the sample, the frequency for this class is 4. The relative frequency for Bermuda is 4/25 = 0.16.

Class           Tally          Frequency   Relative frequency
Bahamas         ||                  2       0.08  (= 2/25)
Bermuda         ||||                4       0.16  (= 4/25)
Caribbean       ||||/ |             6       0.24  (= 6/25)
Mediterranean   |||                 3       0.12  (= 3/25)
Southampton     ||||/ ||||/        10       0.40  (= 10/25)
Total                              25       1.00

The proportion of cruise ships that did go to Southampton is 10/25 = 0.40. The total proportion is always 1.00. Therefore, the proportion of cruise ships that did not go to Southampton is 1.00 − 0.40 = 0.60.

TRY IT NOW: GO TO EXERCISE 2.19
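The two steps of this solution can also be carried out with software. The following is a minimal Python sketch, not the text's own method: it uses the standard-library `Counter` to tally the 25 destinations from Example 2.4, and the helper name `frequency_distribution` is hypothetical.

```python
from collections import Counter

# The 25 observations from Example 2.4
destinations = [
    "Bermuda", "Southampton", "Caribbean", "Caribbean", "Bahamas",
    "Southampton", "Bermuda", "Bermuda", "Southampton", "Bermuda",
    "Mediterranean", "Southampton", "Mediterranean", "Mediterranean", "Bahamas",
    "Southampton", "Caribbean", "Caribbean", "Southampton", "Southampton",
    "Caribbean", "Caribbean", "Southampton", "Southampton", "Southampton",
]

def frequency_distribution(data):
    """Map each class to (frequency, relative frequency), classes in alphabetical order."""
    counts = Counter(data)   # frequency of each class
    n = len(data)            # total number of observations
    return {label: (counts[label], counts[label] / n) for label in sorted(counts)}

table = frequency_distribution(destinations)
for label, (freq, rel) in table.items():
    print(f"{label:<14}{freq:>4}{rel:>8.2f}")

# Relative frequencies sum to 1, so the complement gives "did not go to Southampton"
not_southampton = 1 - table["Southampton"][1]
print(f"Proportion not going to Southampton: {not_southampton:.2f}")
```

Summing the frequency column (25) and the relative frequency column (1.00) is the same check recommended for a table built by hand.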

A CLOSER LOOK
1. If you have to construct a frequency distribution by hand, an additional tally column is helpful. Insert this after the class column, and use a tally mark or tick mark to count observations as you read them from the table.
2. The last (total) row is optional, but it is a good check of your calculations. The frequencies should sum to the total number of observations, and the relative frequencies should sum to 1.00 (subject to round-off error).
3. There is no rule for ordering the classes. In Example 2.4, the classes happen to be presented in alphabetical order.

A tally mark is a short line drawn for each count up to four. On number five, draw a diagonal line across the other four. Count in sets of five.

A bar chart is a graphical representation of a frequency distribution for categorical data. An example of a bar chart is shown in Figure 2.5.²

Figure 2.5 Bar chart showing the number of Nathan’s World Famous Beef Hot Dogs franchises in certain states as of March 25, 2012. [Bar chart: State (CT, DE, FL, MA, NJ, PA, RI) on the horizontal axis, Frequency on the vertical axis.]

How to Construct a Bar Chart
1. Draw a horizontal axis with equally spaced tick marks, one for each class.
2. Draw a vertical axis for the frequency (or relative frequency) and use appropriate tick marks. Label each axis.
3. Draw a rectangle centered at each tick mark (class) with height equal to, or proportional to, the frequency of each class (also called the class frequency). The bars should be of equal width, but do not necessarily have to abut one another; there can be spaces between them.

Example 2.5 Cruise Ship Destinations, Continued
Construct a bar chart for the cruise ship data in Example 2.4.

Either frequency or relative frequency may be used on the vertical axis. Both are acceptable because the resulting graphical representations of the distribution are identical. The only difference between the two graphs is the labels on the vertical axis. Unless it is stated otherwise, frequencies are used.

SOLUTION
STEP 1 Use the frequency distribution for the cruise ship data. There are five classes, and the frequencies range from 2 to 10.
STEP 2 Draw a horizontal and a vertical axis. On the horizontal axis, draw five ticks for the five classes and label them with the class names. Because the greatest frequency is 10, draw and label tick marks from 0 to at least 10 on the vertical axis.
STEP 3 The height of each vertical bar is determined by the frequency of the class. For example, the frequency of trips to Bermuda is 4, so the height of the bar representing Bermuda is 4. The resulting bar chart is shown in Figure 2.6. A technology solution is shown in Figure 2.7.
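As a rough illustration of these steps, the Python sketch below prints a plain-text bar chart from the cruise ship frequencies. It is an assumption-laden stand-in rather than the chart in Figure 2.6: the bars run horizontally instead of vertically so they can be printed line by line, and the function name `text_bar_chart` is hypothetical. The key property is the same, namely that each bar's length is proportional to the class frequency.

```python
# Frequencies from the frequency distribution in Example 2.4
frequencies = {
    "Bahamas": 2,
    "Bermuda": 4,
    "Caribbean": 6,
    "Mediterranean": 3,
    "Southampton": 10,
}

def text_bar_chart(freqs, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    width = max(len(label) for label in freqs)  # pad labels so the bars line up
    lines = []
    for label, count in freqs.items():
        lines.append(f"{label:<{width}} | {symbol * count} {count}")
    return "\n".join(lines)

chart = text_bar_chart(frequencies)
print(chart)
```

Replacing each frequency with its relative frequency would change only the axis labels, not the shape of the chart, which is the point made in the example.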

Figure 2.6 Bar chart for the cruise ship data. [Bar chart: Destination (Bahamas, Bermuda, Caribbean, Mediterranean, Southampton) on the horizontal axis, Frequency on the vertical axis.]

Figure 2.7 CrunchIt! bar chart.

TRY IT NOW: GO TO EXERCISE 2.21

A pie chart is another graphical representation of a frequency distribution for categorical data. An example of a pie chart is shown in Figure 2.8.³

Figure 2.8 Projected global spending on medicine in 2016. [Pie chart slices: US 31%, Emerging 30%, Europe 18%, Other 11%, Japan 10%.]

How to Construct a Pie Chart
1. Divide a circle (or pie) into slices or wedges so that each slice corresponds to a class.
2. The size of each slice is measured by the angle of the slice. To compute the angle of each slice, multiply the relative frequency by 360° (the number of degrees in a whole or complete circle).
3. The first slice of a pie chart is usually drawn with an edge horizontal and to the right (0°). The angle is measured counterclockwise. Each successive slice is added counterclockwise with the appropriate angle.

Example 2.6 Cruise Ship Destinations: Another Stop
Construct a pie chart for the cruise ship data in Example 2.4.

SOLUTION
STEP 1 Add a column to the frequency distribution for slice angle. Use the relative frequency of each class to find the slice angle.

Class           Relative frequency   Angle
Bahamas               0.08            28.8°  (= 0.08 × 360°)
Bermuda               0.16            57.6°  (= 0.16 × 360°)
Caribbean             0.24            86.4°  (= 0.24 × 360°)
Mediterranean         0.12            43.2°  (= 0.12 × 360°)
Southampton           0.40           144.0°  (= 0.40 × 360°)
Total                 1.00           360.0°

STEP 2 Draw a circle and mark slices using the angles in the frequency distribution. Draw the first slice with an edge extending from the center of the circle to the right. The remaining slices are drawn moving around the pie counterclockwise. It may be helpful to use a protractor and compass to draw the circle and measure the angles. See Figure 2.9. A technology solution is shown in Figure 2.10.
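The slice-angle arithmetic in Step 1, together with the counterclockwise placement in Step 2, can be sketched in Python as follows. This is an illustrative sketch only; the function name `slice_angles` is hypothetical, and the start and end angles assume the first slice begins at 0° as described above.

```python
# Relative frequencies from the frequency distribution in Example 2.4
relative_frequencies = {
    "Bahamas": 0.08,
    "Bermuda": 0.16,
    "Caribbean": 0.24,
    "Mediterranean": 0.12,
    "Southampton": 0.40,
}

def slice_angles(rel_freqs):
    """Map each class to (start, end) angles in degrees, measured counterclockwise from 0."""
    angles = {}
    start = 0.0
    for label, rel in rel_freqs.items():
        sweep = rel * 360            # slice angle = relative frequency x 360 degrees
        angles[label] = (start, start + sweep)
        start += sweep               # the next slice begins where this one ends
    return angles

angles = slice_angles(relative_frequencies)
for label, (start, end) in angles.items():
    print(f"{label:<14} sweep {end - start:6.1f}  from {start:6.1f} to {end:6.1f}")
```

Because the relative frequencies sum to 1.00, the last slice should end at 360°, which gives a quick check on the calculations.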

Figure 2.9 Pie chart for the cruise ship data. [Pie chart with slices labeled Bahamas (28.8°), Bermuda, Caribbean, Mediterranean, and Southampton.]

Figure 2.10 CrunchIt! pie chart.

Note: Because Southampton corresponds to the biggest slice of the pie (chart), this class has the greatest frequency (and relative frequency); it is the destination that occurred most often in the sample.

TRY IT NOW: GO TO EXERCISE 2.23

A CLOSER LOOK
1. A pie chart is hard to draw accurately by hand, even with a protractor and compass. A graphing calculator or computer is quicker and more efficient for constructing this graph.
2. There are lots of pie-chart variations, for example, exploding pie charts and 3D pie charts. Each is simply a visual representation of a frequency distribution for categorical data.

Technology Corner
Procedure: Construct a bar chart. Reconsider: Example 2.5, solution, and interpretations.

CrunchIt!
CrunchIt! has a built-in function to construct a bar chart from original or summarized data.
1. Enter the classes in column Var1 and the frequencies in column Var2. Rename each column if desired.
2. Select Graphics; Bar Chart. Choose Var1 for Labels and Var2 for Heights. Optionally enter a Title, X Label, and Y Label. Click the Calculate button. Refer to Figure 2.7.

TI-84 Plus C
The TI-84 Plus C does not accept categorical data. Therefore, there is no built-in function to construct a bar chart. However, you may assign a number to each class, and use the Histogram statistical plot to construct a bar chart.
1. Enter integers corresponding to each class in list L1 and the frequency for each class in the corresponding row in list L2 (Figure 2.11).
2. Press STATPLOT and select Plot1 from the STAT PLOTS menu.
3. Turn the plot On and select Type histogram. For Xlist, enter the name of the list containing the categories. For Freq, enter the name of the list containing the frequencies. Select a Color (Figure 2.12).
4. Consider each class to have width 1. Enter appropriate WINDOW settings. Press GRAPH to display the bar chart (Figure 2.13).

Figure 2.11 The categories and frequencies.

Figure 2.12 The Plot1 setup screen.

Figure 2.13 TI-84 Plus C bar chart.


Minitab
The input can be either the entire data set in a single column or a summary table of categories and frequencies in two columns.
1. Enter the data into column C1.
2. Select Graph; Bar chart. Choose Counts of unique values and Simple.
3. Enter C1 under Categorical variables.
4. Edit graph attributes as necessary, for example, the axes labels, plot title, and gaps between clusters. See Figure 2.14.

Figure 2.14 Minitab bar chart.

Excel
The built-in functions Frequency or Sumif may be used to construct a frequency distribution. Assume this summary information is available.
1. Enter the categories into column A and the corresponding frequencies into column B.
2. Select the range of cells A1:B5. Under the Insert tab, select Column; 2-D Column; Clustered Column.
3. Use Chart Tools to format the bar chart as necessary. See Figure 2.15.

Figure 2.15 Excel bar chart.

Procedure: Construct a pie chart. Reconsider: Example 2.6, solution, and interpretations.

CrunchIt!
CrunchIt! has a built-in function to construct a pie chart from original or summarized data.
1. Enter the classes in column Var1 and the frequencies in column Var2. Rename each column if desired.
2. Select Graphics; Pie Chart. Choose Var1 for Labels and Var2 for Sizes. Optionally enter a Title. Click the Calculate button. Refer to Figure 2.10.


TI-84 Plus C
A pie chart can be constructed using the CellSheet App for the TI-84 Plus C.
1. Open a new spreadsheet in the CellSheet app.
2. Enter the categories in row 1 and the corresponding frequencies in row 2. See Figure 2.16.
3. Select MENU; Charts; Pie. Enter the range for the categories, the range for the frequencies, select Number or Percent for display, and enter a Title if desired. See Figure 2.17.
4. Highlight Draw and press ENTER to display the pie chart. See Figure 2.18.

Figure 2.16 The categories and frequencies in a CellSheet spreadsheet.

Figure 2.17 The PIE CHART setup screen.

Figure 2.18 TI-84 Plus C pie chart.

Minitab
The input can be either the entire data set in a single column or a summary table of categories and frequencies in two columns.
1. Enter the data into column C1.
2. Select Graph; Pie Chart. Choose Chart counts of unique values.
3. Enter C1 under Categorical variables.
4. Edit graph attributes as necessary, for example, the plot title and slice labels. See Figure 2.19.

Figure 2.19 Minitab pie chart.

Excel
The built-in functions Frequency or Sumif may be used to construct a frequency distribution. Assume this summary information is available.
1. Enter the categories into column A and the corresponding frequencies into column B.
2. Select the range of cells A1:B5. Under the Insert tab, select Pie; 2-D Pie; Pie.
3. Use Chart Tools to format the pie chart as necessary. See Figure 2.20.


Figure 2.20 Excel pie chart.

SECTION 2.2 EXERCISES

Concept Check
2.15 True/False A frequency distribution is a summary table for categorical data.
2.16 True/False The relative frequency for each class in a frequency distribution is a sample proportion.
2.17 True/False A bar chart is constructed using the frequency for each class.
2.18 True/False All the slices in a pie chart should have approximately the same angle.

Applications
2.19 Psychology and Human Behavior A random sample of TV viewers was obtained and each person was asked to select the entertainment category of his or her favorite show. The results are given in the following table.
DATA SET TVSHOW

Comedy Reality Sports Drama Educational Sports Comedy Sports Soap Soap Comedy
Comedy Sports Comedy Soap Reality Educational Soap Drama Reality Drama Comedy
Drama Reality Drama Soap Drama Drama Reality Drama Reality Educational
Soap Sports Reality Soap Drama Comedy Soap Soap Soap Drama

Construct a frequency distribution for these data.

2.20 Psychology and Human Behavior A random sample of patrons visiting the Rena Branston Gallery in San Francisco was obtained, and each was asked the type of art the patron most enjoys viewing. The results are given in the following table.
DATA SET ARTSTYLE

Abstract Realist Surrealist Realist Surrealist Expressionist Surrealist Surrealist Expressionist Abstract Realist
Abstract Realist Abstract Realist Surrealist Expressionist Surrealist Surrealist Realist Abstract Realist
Surrealist Realist Abstract Abstract Abstract Abstract Abstract Abstract Realist Abstract
Expressionist Realist Abstract Abstract Realist Surrealist Abstract Realist Expressionist Expressionist

Construct a frequency distribution for these data.

2.21 Education and Child Development To prepare for negotiations, the Faculty Association at Eastern Michigan University asked members to name the most important contract issue. A summary of their responses is given in the following table.
DATA SET CONTRACT

Issue                  Frequency
Salary                  50
Health insurance       100
Retirement benefits     75
Class size              60
Temporary faculty       90
Parking                 25

a. Find the relative frequency for each issue.
b. Construct a bar chart for these data using frequency on the vertical axis.

2.22 Biology and Environmental Science The following table lists the number of dairy farms for various counties in Vermont.4
DATA SET DAIRY

County        Frequency
Addison       145
Bennington     18
Caledonia      80
Chittenden     42
Essex          12
Franklin      210
Grand Isle     17
Lamoille       38
Orange         93
Orleans       141
Rutland        71
Washington     39
Windham        25
Windsor        39

a. Find the relative frequency for each county.
b. Construct a bar chart for these data using relative frequency on the vertical axis.

2.23 Fuel Consumption and Cars The business manager of a Chrysler automobile dealership sent a survey to randomly selected owners in order to gauge customer satisfaction. One question was, “How likely are you to buy another car of the same make and model?” Survey participants could answer Very Likely (VL), Likely (L), Neutral (N), Unlikely (U), or Very Unlikely (VU). The results are given on the text website.
DATA SET CARSATIS
a. Construct a frequency distribution for these data.
b. Use the table in part (a) to construct a pie chart for these data.

2.24 Public Policy and Political Science According to the Atlanta Journal, U.S. Senator Saxby Chambliss will not seek reelection in 2014.5 A random sample of Georgia voters was obtained and each was asked to consider certain potential successors. The political affiliation of each voter is given on the website for this book: Democrat (D), Republican (R), Independent (I).
DATA SET VOTING
a. Construct a frequency distribution for these data.
b. Use the table in part (a) to construct a pie chart for these data.

2.25 Demographics and Population Statistics The following table lists the number of Nobel Prize laureates in each category.6
DATA SET PRIZECT

Nobel Prize          Frequency
Physics              194
Chemistry            163
Medicine             201
Literature           109
Peace                125
Economic Sciences     71

a. Find the relative frequency for each prize.
b. Construct a pie chart for these data.

2.26 Education and Child Development The grade distribution for a large psychology class at Louisiana State University is given in the following frequency distribution:
DATA SET PSYCHGRD

Grade    Frequency    Relative frequency
A        10
B        43
C        54
D        26
F        15

a. Find the relative frequency for each grade.
b. Construct a bar chart using frequency on the vertical axis and a pie chart from the frequency distribution.
c. How many students were in this psychology class? What proportion of students passed (i.e., received a D or better)?

2.27 Public Health and Nutrition A random survey of 200 customers who purchased ice cream at Brigham’s showed the following proportions:
DATA SET ICECREAM

Ice cream             Relative frequency
The Big Dig           0.100
Cashew Turtle         0.185
Chocolate Chip        0.260
Pistachio             0.150
Strawberry            0.080
Vanilla with Oreos    0.225

a. Find the frequency of each ice cream (class).
b. Construct a bar chart using frequency on the vertical axis and a pie chart for these ice cream data.

2.28 Sports and Leisure A random sample of long-time subscribers to Popular Woodworking was obtained, and each person was asked to name the brand of table saw he or she uses. The results are given in the following table:
DATA SET TABLESAW

DeWalt, DeWalt, Craftsman, DeWalt, Black & Decker, Makita, Makita, Delta, Black & Decker, DeWalt
DeWalt, Craftsman, Craftsman, Black & Decker, DeWalt, Delta, DeWalt, Delta, Makita
Craftsman, Delta, Delta, Makita, Delta, Makita, Black & Decker, Makita, Craftsman

a. Construct a frequency distribution for these data.
b. Carefully sketch a bar chart using frequency on the vertical axis and a pie chart for these data.

c. What proportion of people in this sample use a Craftsman or Black & Decker table saw?
d. What proportion of people in this sample do not use a Delta table saw?

2.29 Marketing and Consumer Behavior Suppose there were 253 exhibitors at the NFPA World Fire Safety Conference in Chicago, Illinois, in June 2013. Each exhibitor was classified according to the type of product or service offered for sale. The proportions are given in the following table:
DATA SET FIREXHIB

Product                 Proportion
Alarms                  0.2964
Training                0.0632
Extinguishers           0.0514
Pumps                   0.0237
Sprinklers              0.0632
Building materials      0.0751
Electrical equipment    0.1265
Hazmat storage          0.0870
Security products       0.1621
Signaling systems       0.0514

a. Find the number of exhibitors in each classification.
b. Carefully sketch a bar chart and a pie chart using the proportions for each class.

2.30 Psychology and Human Behavior Using the Library of Congress classification scheme, the Brookings Public Library in South Dakota recorded the type of book borrowed by 30 randomly selected patrons. The data are given in the following table:
DATA SET BOOKS

Medicine Science Education Education Technology Science Law Technology
Science Education Law Technology Medicine Literature Literature Education
Medicine Medicine Medicine Literature Technology Medicine Law
Medicine Science Technology Education Science Literature Technology

a. Construct a frequency distribution for these data.
b. Carefully sketch a bar chart using relative frequency on the vertical axis and a pie chart for these data.
c. Do you think the public library should try to purchase more books in one particular subject area? Why or why not?

2.31 Marketing and Consumer Behavior Cardinal Glass Industries produces several products for residential buildings, for vehicles, and for ordinary consumer use. The proportion of each type of manufactured product is given in the following table:
DATA SET GLASS

Product            Proportion
Building window    0.35
Vehicle window     0.15
Containers         0.10
Tableware          0.15
Lamps              0.25

Construct a bar chart and a pie chart for these data using the proportions in the table.

2.32 Business and Management In the 2012 Canadian Lawyer Corporate Counsel Survey, each company/organization was classified by sector. The sectors and corresponding proportions are given in the following table.7
DATA SET CASECTOR

Sector                     Proportion
Government                 0.246
Professional services      0.062
Technology                 0.136
Industry, manufacturing    0.154
Service                    0.104
Resource-based             0.098
Financial                  0.142
Nonprofit                  0.058

a. Construct a bar chart and a pie chart for these data using the proportions in the table.
b. Suppose 225 companies participated in this survey. Find the frequency, the number of companies, for each sector.

2.33 Marketing and Consumer Behavior A survey of new homes built in the Sleepy Creek Mountains of West Virginia produced the following results for the type of siding:
DATA SET SIDING

Siding      Frequency
Aluminum    20
Brick       15
Stucco      12
Vinyl       45
Wood        24

a. Find the relative frequency for each siding classification.
b. Construct a bar chart using frequency on the vertical axis and a pie chart for these data.

2.34 Public Policy and Political Science There are many think tanks in the world, consisting of groups of independent scholars with academic, government, and/or private experience. These think-tank scholars publish articles in appropriate journals and offer advice on politics, economics, and governmental policy matters. The following table lists the number of think tanks (TTs) in the world in 2012 by region.8
DATA SET THINK

Region                             No. of TTs
Africa                              554
Asia                               1194
Europe                             1836
Latin America and the Caribbean     721
Middle East and North Africa        339
North America                      1919
Oceania                              40

a. Find the relative frequency associated with each region.
b. Construct a bar chart using frequency on the vertical axis and a pie chart for these data.

2.35 Travel and Transportation In late October 2012, Hurricane Sandy had a devastating effect on the coast of the northeastern United States. This superstorm caused an estimated $65.6 billion in damages and was the largest Atlantic hurricane by diameter ever recorded. A survey was conducted to determine the primary source of local transportation information for people affected by the storm. The following table lists each source and its frequency.9
DATA SET SANDY

Source                          Frequency
Official websites and alerts    265
Social media                    198
News websites                   152
News TV/radio                   147
Friends/family                  115
Community groups                 45
Smartphone apps                  45
Other                            30

a. Find the relative frequency associated with each source.
b. Construct a pie chart for these data.

Extended Applications
2.36 Sports and Leisure Complete the following frequency distribution from a random sample of people visiting Atlantic City casinos.

Class: Bally’s, Caesars, Harrah’s, Resorts, Sands, Trump Plaza, Total
Frequency: 40 25 32 25
Relative frequency: 0.125 0.110 0.280 1.000

a. What is the size of the random sample?
b. Which casino is most preferred by people in this survey? Justify your answer.

2.37 Travel and Transportation Families traveling to Walt Disney World in Florida often rent a car rather than use airport and hotel shuttle buses. A recent survey asked families to indicate the rental car agency used. The results are presented in the following bar chart.

[Bar chart: Frequency (vertical axis, 0 to 45) by Agency (horizontal axis): Alamo, Avis, Budget, Enterprise, Hertz, Thrifty, Value.]

a. Construct a frequency distribution for these survey results.
b. How many observations were in this data set?
c. What proportion of people did not use Hertz or Enterprise?
d. Construct a pie chart for these data.

2.38 Marketing and Consumer Behavior One thousand customers entering the Mall of America in Bloomington, Minnesota, were randomly selected and asked to rank the variety of stores. The results are given in the following table.

Response: Excellent, Very good, Good, Fair, Poor
Frequency: 50 152 255
Relative frequency: 0.4250 0.1180

a. Complete the frequency distribution.
b. Construct a bar chart using frequency on the vertical axis and a pie chart for these data.
c. What proportion of customers did not rank the store variety as very good or excellent?

2.39 Travel and Transportation The following table shows the number of fatal vehicle crashes in 2011 by day of the week for rural and urban roads or streets.10
DATA SET CRASH

Day of week    Rural    Urban
Monday         557      448
Tuesday        383      359
Wednesday      348      328
Thursday       351      330
Friday         415      386
Saturday       464      404
Sunday         676      514

a. Find the relative frequency for rural for each day. Construct a bar chart using relative frequency on the vertical axis (and day on the horizontal axis).
b. Find the relative frequency for urban for each day. Construct a bar chart using relative frequency on the vertical axis (and day on the horizontal axis).
c. How do these two bar charts compare? Describe any similarities or differences.

2.40 Biology and Environmental Science An avalanche is a major danger facing skiers, snowboarders, and snowmobilers. The number of avalanche fatalities per year in the United States varies from approximately 5 to 35, but in general has increased steadily since 1956. The following table shows the number of avalanche fatalities from 2002 to 2011 in the United States by activity.11
DATA SET AVALANCH

Activity                   Frequency
Climber                     30
Hiker                        2
Rec snowplayer               1
Resident                     8
Ski inbounds                 9
Ski out of bounds           21
Ski tour                    53
Snowboard inbounds           1
Snowboard out of bounds     10
Snowboard tour              23
Snowmobiler                107
Snowshoer                   12
Work other                   1
Work patrol                  3
Ski helicopter               2
Snowmobiler other            1

a. Find the relative frequency of fatalities for each activity.
b. Construct a bar chart using frequency on the vertical axis, and a bar chart using relative frequency on the vertical axis. Which of these two graphs do you think is a better graphical description of avalanche fatalities by activity? Why?

2.41 Sports and Leisure The following table shows the intended game involved in hunting accidents in Texas in the years 2009–2011.12
DATA SET HUNTING

Animal hunted           Frequency
Dove                    14
White-tailed deer       11
Rabbit/hare              5
Hog                     21
Quail/pheasant           6
Turkey                   3
Duck/goose               4
Coyote                   2
Squirrel/prairie dog     4
Nongame bird/snake       5
Raccoon                  2

a. Construct a bar chart for these data.
b. Construct a pie chart for these data.
c. Is it reasonable to conclude that hunters in Texas are more likely to be injured hunting hogs than any other animal? Why or why not?

2.42 Public Policy and Political Science A side-by-side or a stacked bar chart may be used to compare categorical data obtained from two (or more) different sources or groups. Figures 2.21 and 2.22 show an example of each: a comparison of test grades in two different sections of an introductory statistics course. The blue rectangles represent students from Section 01; the green rectangles represent students from Section 02.
DATA SET RATINGS

Figure 2.21 Side-by-side bar chart. Bars corresponding to the same category are placed side by side for easy comparison.

Figure 2.22 Stacked bar chart. Within each category, bars are stacked for comparison.

In January 2013, a poll by ABC News indicated that President Obama’s popularity had reached a three-year high. Suppose the frequency of occurrence of each response, grouped by sex, is given in the following table.

Rating       Men frequency    Women frequency
Excellent    368              350
Very good    550              375
Good         426              165
Fair         450              360
Poor         206              250

a. Compute the relative frequency for each rating, for both groups.
b. Construct a side-by-side bar chart using the relative frequency of each class.
c. Why should relative frequency be used for comparison in the side-by-side bar chart rather than frequency?

2.43 Psychology and Human Behavior The following table shows the number of property-crime violations in Manitoba and Saskatchewan in 2011.13
DATA SET PROPERTY

Violation                                  Manitoba    Saskatchewan
Breaking and entering                       9,305       9,079
Theft of motor vehicles                     3,919       4,967
Theft over $5,000 (non-motor vehicle)         427         510
Theft under $5,000 (non-motor vehicle)     17,933      19,756
Mischief                                   26,361      31,741

a. Find the relative frequency for Manitoba for each violation.
b. Find the relative frequency for Saskatchewan for each violation.
c. Construct a side-by-side bar chart using the relative frequency of each class.

2.3 Stem-and-Leaf Plots

This section introduces the stem-and-leaf plot, a graphical technique for describing numerical data. In Section 2.4, you will learn about some other tables and graphs for summarizing numerical data. The goal of all of these techniques is the same: to get a quick idea of the distribution of the data in terms of shape, center, and variability. In addition, we are always watching for outliers, values that are very far away from the rest. We will eventually need a quantitative measure of very far away.

A stem-and-leaf plot is a relatively new graphical procedure used to describe numerical data. It is fairly easy to construct, even by hand, and most statistical software packages have options for drawing this graph. A stem-and-leaf plot is a combination of sorting and graphing. One advantage of this plot is that the actual data are used to create the graph; we do not lose the original data values as we do when using tally marks to count them.

A stem-and-leaf plot can be used to describe the shape, center, and variability of the distribution. In Section 2.4, some specific terms and expressions used to describe shape are defined and illustrated. The center of a distribution, or typical value, often occurs where the data are clustered. To estimate the center of a distribution, or to find a typical value, first arrange the observations in increasing order. Simply approximate a middle value, or range of values, in this list. More precise definitions and computations are presented in Chapter 3. The variability refers to the spread or compactness of the data. In addition, we always check for outliers.

How to Create a Stem-and-Leaf Plot
To create a stem-and-leaf plot, each observation in the data set must have at least two digits. (There are, of course, exceptions to this two-digit rule.) Think of each observation as consisting of two pieces (a stem and a leaf). For example, suppose we consider the number of people watching a movie, and in one theater there are 372 people. The number 372 could be split into the pieces 37 (the first two digits) and 2 (the last digit).
1. Split each observation into a Stem: one or more of the leading, or left-hand, digits; and a Leaf: the trailing, or remaining, digit(s) to the right. Each observation in the data set must be split at the same place, for example between the tens place and the ones place.
2. Write a sequence of stems in a column, from the smallest occurring stem to the largest. Include all stems between the smallest and largest, even if there are no corresponding leaves.
3. List all the digits of each leaf next to its corresponding stem. It is not necessary to put the leaves in increasing order, but make sure the leaves line up vertically.
4. Indicate the units for the stems and leaves.
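The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data are non-negative integers, every observation is split between the tens place and the ones place when `leaf_unit` is 1, and the names `stem_and_leaf`, `render`, and the sample attendance counts are hypothetical.

```python
def stem_and_leaf(data, leaf_unit=1):
    """Split each observation into (stem, one-digit leaf) and group leaves by stem."""
    split = [(x // (10 * leaf_unit), (x // leaf_unit) % 10) for x in data]
    stems = [stem for stem, _ in split]
    # Include every stem between the smallest and largest, even if it has no leaves.
    plot = {stem: [] for stem in range(min(stems), max(stems) + 1)}
    for stem, leaf in split:
        plot[stem].append(leaf)
    return plot

def render(plot, ordered=False):
    """Return one text line per stem; leaves keep data order unless ordered=True."""
    width = len(str(max(plot)))
    lines = []
    for stem, leaves in plot.items():
        shown = sorted(leaves) if ordered else leaves
        lines.append(f"{stem:>{width}} | {''.join(map(str, shown))}".rstrip())
    return lines

# Hypothetical movie attendance counts, for illustration only.
attendance = [372, 365, 381, 359, 372, 404]
lines = render(stem_and_leaf(attendance))
print("\n".join(lines))
print("Stem = 10, Leaf = 1")
```

Note that the 39 stem row is printed even though it has no leaves, as step 2 requires.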


DATA SET WTRFALL

Example 2.7 Waterfall Heights
Kerepakupai Meru, or Angel Falls, is the highest waterfall in the world.14 Because the falls are so high, 979 meters, by the time water reaches the canyon below, it has vaporized into a giant mist cloud. Suppose the following table lists the total height, in meters, of several waterfalls in the world.

693 745 631 635 625 629 739 738 732 725
720 719 715 715 707 707 706 705 700 680
674 671 671 665 660 660 650 646 645 640
640 638 620 620 612 610 610 610 610 610
610 610 610 610 610 600 600 600 651 727

Construct a stem-and-leaf plot for these data.

SOLUTION
STEP 1 There are only two options for splitting each observation:
a. split between the hundreds place and the tens place (e.g., split 693 as 6 and 93); or
b. split between the tens place and the ones place (e.g., split 693 as 69 and 3).
If we split between the hundreds and tens place, there will be only two stems, because the only numbers in the hundreds place are 6 and 7. The resulting plot will not reveal much about the distribution of the data. The better split is between the tens place and the ones place.
STEP 2 Scan the data to find the smallest and largest stems, and list all of the stems in a vertical column. Write each leaf next to its corresponding stem. For example, 693 is split as 69 | 3 (stem 69, leaf 3), so a 3 is placed in the 69 stem row. For 745, a 5 is placed in the 74 stem row. For 631, a 1 is placed in the 63 stem row. For 635, a 5 is placed in the 63 stem row.
STEP 3 Continue in this manner, to produce the following stem-and-leaf plot:

60 | 000
61 | 20000000000
62 | 5900
63 | 158
64 | 6500
65 | 01
66 | 500
67 | 411
68 | 0
69 | 3
70 | 77650
71 | 955
72 | 507
73 | 982
74 | 5
Stem = 10, Leaf = 1


Note that Stem = 10 means the rightmost digit in the stem is in the tens place and Leaf = 1 means the leftmost digit in each leaf is in the ones place. Reading from the graph, the smallest waterfall height is 600 meters and the largest is 745 meters. The center of a data set is a typical value or values near the middle of the observations when they are arranged in (increasing) order. For these data, the center appears to be in the 64 or 65 stem row. There are no outlying values. Figure 2.23 shows a technology solution.

Figure 2.23 CrunchIt! stem-and-leaf plot.

TRY IT NOW GO TO EXERCISE 2.51

A CLOSER LOOK
1. As a general rule of thumb, try to construct the plot with 5 to 20 stems. With fewer than 5, the graph is too compact; with more than 20, the observations are too spread out. Neither extreme reveals much about the distribution.
2. Sometimes, to help us find the center of the data, we put the leaves in increasing order, to make an ordered stem-and-leaf plot.
3. Some advantages of a stem-and-leaf plot: each observation is a visible part of the graph and (in an ordered stem-and-leaf plot, as when using a computer) the data are sorted. However, a stem-and-leaf plot can get very big, very fast. If a stem-and-leaf plot is made for a very large data set, the stems may be divided, usually in half or fifths. Consider the following example.

Example 2.8 Hotel Room Rates
The I-95 Exit Guide allows travelers to easily find hotels at exits along I-95 within 1/2 mile of an I-95 exit.15 Suppose a random sample of room rates (in dollars) for hotels along I-95 was obtained, and the data are given in the following table.

75 76 89 77 55 77

84 77 89 83 60 75

78 77 74 78 63 61

79 78 73 81 69 71

72 61 79 91 65 68

73 58 86 70 73 72

50 80 85 93 93 77

90 81 94 75 81

85 75 64 78 79

69 75 72 54 79

Construct an ordered stem-and-leaf plot for these data.

SOLUTION
STEP 1 If we split each observation between the tens place and the ones place, there will be five stems. However, the leaves will extend far to the right and the shape, center, and spread of the distribution will be unclear.
STEP 2 Divide each stem in half. The first 5-stem row holds numbers 50–54, the second 5-stem row holds numbers 55–59, the first 6-stem row holds numbers 60–64, etc.
STEP 3 The resulting stem-and-leaf plot, with divided stems, offers a better graphical description of the distribution. Note that the leaves have been ordered.

Stem-and-leaf plot for the hotel rate data
5 | 04
5 | 58
6 | 01134
6 | 5899
7 | 0122233334
7 | 5555567777788889999
8 | 011134
8 | 55699
9 | 01334
Stem = 10, Leaf = 1

STEP 4 Notice that we can draw a straight line across the stem-and-leaf plot near the 75–79 row and the graph is almost a mirror image, or reflection, over this line. Therefore, the distribution of the data is approximately symmetric, centered near 75–79. In addition, the distribution is compact and there are no outlying values. Figure 2.24 shows a technology solution.

Figure 2.24 JMP stem-and-leaf plot.
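Dividing each stem in half, as in this example, can be sketched as follows. This is an illustration only: it assumes two-digit integer data split between the tens and ones places, the function name `divided_stem_plot` is hypothetical, and the room rates below are a small made-up sample rather than the data in the example.

```python
def divided_stem_plot(data):
    """Ordered stem-and-leaf lines with each stem split into leaves 0-4 and 5-9."""
    lo, hi = min(data), max(data)
    first = (lo // 10, (lo % 10) // 5)   # (stem, half) of the smallest value
    last = (hi // 10, (hi % 10) // 5)    # (stem, half) of the largest value
    rows = {}
    key = first
    while key <= last:                   # include every half-stem row in between
        rows[key] = []
        key = (key[0], 1) if key[1] == 0 else (key[0] + 1, 0)
    for x in sorted(data):               # sorting gives ordered leaves
        rows[(x // 10, (x % 10) // 5)].append(x % 10)
    return [f"{stem} | {''.join(map(str, leaves))}".rstrip()
            for (stem, _), leaves in rows.items()]

# Hypothetical room rates (in dollars), for illustration only.
rates = [75, 50, 69, 54, 61, 58, 72, 60, 55, 75]
plot_lines = divided_stem_plot(rates)
print("\n".join(plot_lines))
print("Stem = 10, Leaf = 1")
```

Each stem now appears twice, so the number of rows roughly doubles and long rows of leaves are shortened.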

TRY IT NOW

GO TO EXERCISE 2.55

Two sets of data can be compared graphically using a back-to-back stem-and-leaf plot. Two plots are constructed using the same stem column. List the leaves for one data set to the left, and those for the other to the right.
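A back-to-back plot can be sketched in Python as shown next. The sketch assumes three-digit integer data split between the tens and ones places; the function name `back_to_back` and the two small data sets are hypothetical. Leaves on both sides are ordered so that they increase moving outward from the stem column.

```python
def back_to_back(left, right):
    """Return text lines of a back-to-back ordered stem-and-leaf plot."""
    combined = left + right
    stems = range(min(combined) // 10, max(combined) // 10 + 1)

    def leaves(data, stem, reverse):
        digits = sorted((x % 10 for x in data if x // 10 == stem), reverse=reverse)
        return "".join(map(str, digits))

    # Left leaves are reversed so they increase away from the stem column.
    rows = [(leaves(left, s, True), s, leaves(right, s, False)) for s in stems]
    width = max(len(l) for l, _, _ in rows)
    return [f"{l:>{width}} | {s} | {r}".rstrip() for l, s, r in rows]

# Hypothetical measurements for two groups, for illustration only.
group_a = [110, 112, 124]
group_b = [118, 121, 125, 129]
b2b_lines = back_to_back(group_a, group_b)
print("\n".join(b2b_lines))
```

Both groups share one stem column, so differences in center and spread can be read directly across each row.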

Example 2.9 Cholesterol Levels
Your total cholesterol level is the sum of low-density lipoproteins (LDLs) and high-density lipoproteins (HDLs). A total cholesterol level of less than 200 mg/dL (milligrams per deciliter) is desirable, whereas 240 mg/dL or higher is considered high risk.16 According to the Centers for Disease Control and Prevention, the average total cholesterol for adult Americans is about 200 mg/dL. Suppose a random sample of total cholesterol levels was obtained for men and women. The data are given in the following table.

Men
110 124 132 147 157
164 172 180 193 201
210 224 112 158 165
173 181 193 205 216
194 194 179 185 185

Women
183 190 201 211 154
186 212 213 169 173
177 195 203 203 189
207 158 218 213 205
205 204 189 179 177

Construct a back-to-back ordered stem-and-leaf plot for these data.

SOLUTION
STEP 1 The following graph is a back-to-back stem-and-leaf plot for these data.

 Men | Stem | Women
  20 |  11  |
   4 |  12  |
   2 |  13  |
   7 |  14  |
  87 |  15  | 48
  54 |  16  | 9
 932 |  17  | 3779
5510 |  18  | 3699
4433 |  19  | 05
  51 |  20  | 1334557
  60 |  21  | 12338
   4 |  22  |
Stem = 10, Leaf = 1

STEP 2 The center column of numbers (11, 12, 13, . . .) represents the stems for both groups. The stem-and-leaf plot for the men’s data is constructed to the left, while the plot for the women’s data is constructed to the right. Note that the leaves have been placed in increasing order, starting from the stem and proceeding outward.
STEP 3 The distribution for the men seems more spread out, or has more variability, while the distribution of women’s total cholesterol levels is more compact and seems centered at a slightly greater value.

TRY IT NOW GO TO EXERCISE 2.57

In mathematics, truncate means discard the digits to the right of a specific place. In order to round a number to a certain position, consider the digit to the right of the rounding position. If this digit is a 5 or greater, then round up. Otherwise, leave the rounding digit unchanged (and replace all digits to the right with 0).

When constructing a stem-and-leaf plot, if there are two or more digits in each leaf, the trailing digits may be truncated or the entire leaf may be rounded. Suppose a data set includes the total yardage for randomly selected golf courses. Consider three observations, 6518, 6523, and 6576, and suppose each observation is split between the hundreds place and the tens place. The following diagram shows the 65 stem row for three stem-and-leaf plots. The first is constructed with two-digit leaves. The second is constructed by simply truncating the ones, or last, digit. The third plot is constructed by rounding each leaf to the nearest ten.

Two-digit leaf
65 | 18, leaf = 18
65 | 23, leaf = 23
65 | 76, leaf = 76
Stem row: 65 | 18 23 76
Stem = 100, Leaf = 10

Truncate each leaf
65 | 18, leaf = 1
65 | 23, leaf = 2
65 | 76, leaf = 7
Stem row: 65 | 1 2 7
Stem = 100, Leaf = 10

Round each leaf
65 | 18 rounds to 20, leaf = 2
65 | 23 rounds to 20, leaf = 2
65 | 76 rounds to 80, leaf = 8
Stem row: 65 | 2 2 8
Stem = 100, Leaf = 10
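The truncate and round options above can be checked with a short Python sketch. The function name `split_observation` and its parameters are hypothetical, chosen only for this illustration, and the sketch assumes nonnegative integer observations.

```python
def split_observation(value, leaf_unit=10, mode="truncate"):
    """Split a nonnegative integer into (stem, one-digit leaf).

    leaf_unit is the place value of the leaf digit (10 means the tens place).
    mode is "truncate" (drop the lower digits) or "round" (nearest leaf_unit).
    """
    if mode == "round":
        # Round half up: a trailing 5 or greater rounds up.
        value = (value + leaf_unit // 2) // leaf_unit * leaf_unit
    elif mode != "truncate":
        raise ValueError("mode must be 'truncate' or 'round'")
    stem, remainder = divmod(value, leaf_unit * 10)
    return stem, remainder // leaf_unit


yardages = [6518, 6523, 6576]
truncated = [split_observation(y, mode="truncate") for y in yardages]
rounded = [split_observation(y, mode="round") for y in yardages]
print(truncated)  # [(65, 1), (65, 2), (65, 7)]
print(rounded)    # [(65, 2), (65, 2), (65, 8)]
```

Note that rounding can move an observation to the next stem (6597 rounds to 6600, which belongs in the 66 stem row), whereas truncation never changes the stem.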


Technology Corner

Procedure: Construct a stem-and-leaf plot. Reconsider: Example 2.7, solution, and interpretations.

There is no built-in command on the TI-84 Plus C or in Excel to construct a stem-and-leaf plot. Some calculator programs are available, and there are several add-ins for Excel for drawing a stem-and-leaf plot.

CrunchIt!
1. Enter the data in column Var1. Rename the column if desired.
2. Select Graphics; Stem and Leaf. Choose Var1 for the Sample and optionally enter a Title. Click the Calculate button.

Refer to Figure 2.23 (page 47).

Minitab
1. Enter the data into column C1.
2. Select Graph; Stem-and-Leaf.
3. Enter C1 under Graph variables.
4. Click OK.

The increment (distance between stems) is automatically selected and the leaves are placed in order. The numbers in the left column represent the cumulative counts from each end. The stem row containing the middle value is marked by only a count, in parentheses, of the number of observations in that row. N is the total number of observations. See Figure 2.25.

Figure 2.25 Minitab stem-and-leaf plot.

SECTION 2.3 EXERCISES

Concept Check
2.44 True/False The stem in a stem-and-leaf plot must be only one digit.
2.45 True/False When constructing a stem-and-leaf plot, one can omit stem rows that contain no leaves.
2.46 True/False There may be more than one way to split each observation into a stem and a leaf.
2.47 True/False When constructing a stem-and-leaf plot, every observation must be split into a stem and a leaf in the same way.
2.48 Short Answer We try to construct a stem-and-leaf plot with 5–20 stems. What happens if we use fewer than 5 or more than 20 stems?

Practice
2.49 Construct a stem-and-leaf plot for the following data.

4.7  5.1  6.6  3.9  5.0  2.9  3.6  5.5  4.2  5.1
4.9  5.4  6.1  4.1  3.6  6.4  4.7  4.1  5.7  3.6
6.8  3.5  6.4  6.4  7.1  2.7  5.8  5.2  5.9  5.7

Determine a range of numbers to indicate the center of the data. Within this range, select one number that is a typical value for this data set. (Data set: EX2.49)

2.50 Construct a stem-and-leaf plot for the data given on the text website. (Data set: EX2.50)

2.51 Construct a stem-and-leaf plot for the data given on the text website. Split each observation between the tens place and the ones place, and divide each stem in half. Determine a range of numbers to indicate the center of the data. Within this range, select one number that is a typical value for this data set. (Data set: EX2.51)

2.52 Construct a stem-and-leaf plot for the data given on the website for this book. Use the stem-and-leaf plot to identify any outliers in this distribution. (Data set: EX2.52)

2.53 Consider the following stem-and-leaf plot:

50 | 3
51 | 5
52 | 37
53 | 46
54 | 339
55 | 00337
56 | 111334677
57 | 0011344445788
58 | 012234466677
59 | 3335569
60 | 112
Stem = 10, Leaf = 1

a. List the actual observations in the 54 stem row.
b. What is a typical value for this data set?
c. Do the data seem to be evenly distributed, or does one end tail off more slowly than the other?
d. Does the stem-and-leaf plot suggest there are any outliers in this data set? If so, what are they?

2.54 Consider the data given in the table below: (Data set: EX2.54)

1717  1719  1645  3739  3024  3664  3830
2991  2430  2730  3469  5086  2119  3021
3292  2844  3426  2067  3215  2767  3124
2573  2840  2449  2584  1505  1390  1645
2497  3466  3228  3192

a. Construct a stem-and-leaf plot by splitting each observation between the thousands place and the hundreds place.
b. Construct a stem-and-leaf plot by splitting each observation between the hundreds place and the tens place (using two-digit leaves).
c. Which plot presents a better picture of the distribution? Why?

Applications
2.55 Psychology and Human Behavior A random sample of patients involved in a psychology experiment was selected and the reaction time (in seconds) for each was recorded. (Data set: REACTIME)
a. Construct a stem-and-leaf plot by splitting each observation between the ones place and the tenths place. Truncate the hundredths digit so that each leaf has a single digit.
b. Construct a stem-and-leaf plot by splitting each observation between the ones place and the tenths place. Round each leaf to the nearest tenth so that each leaf has a single digit.
c. Describe any differences between the two plots. What is a typical value?

2.56 Public Health and Nutrition The owner of Copperfield Racquet and Health Club randomly selected 50 people and recorded the number of calories burned after 20 minutes on a treadmill. (Data set: CALBURN)
a. Construct a stem-and-leaf plot by splitting each observation between the tens place and the ones place.
b. Construct a stem-and-leaf plot by splitting each observation between the tens place and the ones place, and by dividing each stem in half.
c. Which stem-and-leaf plot is better? Why?

2.57 Physical Sciences A random sample of hot water temperatures (°F) on lower floors and upper floors in the Renaissance Dallas Hotel was obtained. (Data set: WATEMP)
a. Construct a back-to-back stem-and-leaf plot to compare these two distributions.
b. Using the plot in part (a), describe any similarities and/or differences between the distributions.

2.58 Physical Sciences The intensity of light is measured in

foot-candles or in lux. In full daylight, the light intensity is approximately 10,700 lux, and at twilight the light intensity is about 11 lux. The recommended level of light in offices is 500 lux.17 A random sample of 50 offices was obtained and the lux measurement at a typical work area was recorded for each. The data are given in the following table: (Data set: LUXMEAS)

468  526  463  520  481  521  536  492  509  520
497  487  506  464  474  516  503  481  562  514
503  482  531  486  488  508  495  536  504  514
529  518  495  497  471  458  494  519  511  490
435  520  499  492  519  466  450  482  514  475

a. Construct a stem-and-leaf plot for these light-intensity data.
b. What is a typical light intensity? Are there any outliers? If so, what are they?

2.59 Public Health and Nutrition Every patient who visits a hospital emergency room is classified by the immediacy with which the patient should be seen. The American College of Emergency Physicians suggests that approximately 92% of all patients who visit emergency rooms can be classified as urgent, that is, need attention within 1 minute to 2 hours.18 Suppose a random sample of hospital emergency rooms was obtained and yearly records were examined. The


percentage of patient visits classified as urgent is given on the text website. (Data set: URGENT)
a. Construct a stem-and-leaf plot for these data.
b. What is a typical value? Are there any outliers? If so, what are they?

2.60 Biology and Environmental Science The Port of Tacoma (Washington) handled approximately 1.7 million TEUs (20-foot equivalent units) in 2012.19 A random sample of domestic containers was obtained and the volume of each (in TEUs) was recorded. (Data set: DOMVOL)
a. Construct a stem-and-leaf plot for these data. Split each observation between the tenths place and the hundredths place.
b. Describe the container volume distribution in terms of shape, center, and spread. Are there any outliers? If so, what are they?

2.61 Business and Management A random sample of gasoline stations in Philadelphia was obtained. The number of years each station has been in operation is given on the text website. (Data set: GASYEAR)
a. Construct a stem-and-leaf plot for these data.
b. What is a typical number of years a station has been in operation? Are there any outliers? If so, what are they?

2.62 Manufacturing and Product Development Home

Depot conducted a survey on the lifetime of dishwashers. Forty random users were contacted and asked to report the number of years their dishwasher lasted before needing replacement. (Data set: DISHLIFE)
a. Construct a stem-and-leaf plot for these data. Split each observation between the ones place and the tenths place.
b. Describe the distribution of dishwasher lifetimes in terms of shape, center, and spread.
c. What is a typical lifetime? Are there any outliers? If so, what are they?

2.63 Biology and Environmental Science The Great Pumpkin Commonwealth promotes the hobby of growing giant pumpkins. This group establishes standards and regulations so that each pumpkin is of high quality and to ensure fairness in the competition for the largest pumpkin. A random sample of the largest pumpkins from 2012 was obtained, and the data are given on the text website.20 (Data set: PUMPKIN)
a. Construct a stem-and-leaf plot for these data.
b. What is a typical weight for these giant pumpkins? Are there any outliers in the data set? If so, what are they?

Extended Applications
2.64 Sports and Leisure A greyhound race handicapper uses several factors to predict the winner, such as past performance, track condition, early speed, form, and competition. Races on a 5/16-mile track were randomly selected at the Naples-Fort Myers track, and the winning time (in seconds) was recorded for each. The data are given in the following table.21 (Data set: GREYRACE)

30.78  30.00  30.47  30.81  30.02  30.76
30.47  30.70  30.17  30.58  30.56  30.44
30.35  30.35  30.41  31.37  30.57  30.52
30.06  30.59  30.56  30.38  30.82  31.05
30.67  30.31  29.95  30.21  29.98  30.59

a. Construct a stem-and-leaf plot for these data. Split each observation between the tenths place and the hundredths place.
b. What is a typical winning time? If a dog has never run better than 31.20 seconds in a 5/16-mile race, do you think it has a chance of winning? Justify your answer.
c. Could a stem-and-leaf plot be constructed with the split between the ones place and the tenths place? How about between the tens place and the ones place? Explain.

2.65 Biology and Environmental Science Many piano sellers recommend a special humidifier, especially for more expensive pianos. This device is installed inside the piano and works to keep the instrument in tune by maintaining a stable humidity. To test whether a humidifier really helps, several pianos with and without humidifiers were tuned and then checked six months later. Middle C was used as a measure of how well each piano stayed in tune. In a perfectly tuned piano, middle C has a frequency of 256 cycles per second. The frequency (in cycles per second) of middle C for each group, after six months, is given on the text website. (Data set: MIDDLEC)
a. Construct a back-to-back stem-and-leaf plot for these data.
b. Use the plot in part (a) to describe any differences between the groups. Based on this plot, do you think a humidifier helps a piano stay in tune? Justify your answer.

2.66 Travel and Transportation There are national standards for every road sign, pavement marking, and traffic signal. However, there are no formal state policies regarding the duration of an amber light. According to the Center for Sustainable Mobility at the Virginia Tech Transportation Institute, the amber light time is set to 4.2 seconds on a 45-mph road.22 Longer times would be used on roads with higher speed limits. Suppose a random sample of traffic signals for 45-mph roads in Norman, Oklahoma, was selected, and the duration of the amber light was recorded for each. (Data set: SIGNAL)
a. Construct a stem-and-leaf plot for these data. Divide each stem into five parts.
b. Based on the plot in part (a), do you believe this city has set the amber light duration to meet federal recommendations? Justify your answer.

2.67 Biology and Environmental Science The 2011 Maine Sea Scallop Survey was conducted in November 2011 between West Quoddy Head and Matinicus Island.23 Each sea scallop catch was divided into three size categories: seed, sublegal, and harvestable (≥101.6 mm). Based on information in this report, a random sample of shell heights (in millimeters) is given on the text website. (Data set: SCALLOP)
a. Construct a stem-and-leaf plot for these data.
b. Describe the distribution of shell heights in terms of shape, center, and spread.
c. Estimate the proportion of sea scallops that are harvestable.


2.4 Frequency Distributions and Histograms

Stem-and-leaf plots can be used to describe the shape, center, and variability of a numerical data set, but they can become huge and complex if the number of observations is large. A summary table like a frequency distribution for categorical data would be helpful. However, when the data set is numerical, there are no natural categories, as for qualitative data. The solution is to use intervals as categories, or classes. We can then construct a frequency distribution for continuous data (similar to the categorical case), and a histogram (analogous to a bar chart for categorical data). For a random sample of days in 2011 and 2012, Figure 2.26 shows a histogram of silver prices (in dollars per ounce).24

Figure 2.26 An example of a histogram. [Vertical axis: Frequency; horizontal axis: Silver price]

Definition A frequency distribution for numerical data is a summary table that displays classes, frequencies, relative frequencies, and cumulative relative frequencies.

Here is a procedure for constructing a frequency distribution, along with the necessary definitions.

How to Construct a Frequency Distribution for Numerical Data
1. Choose a range of values that captures all of the data. Divide it into nonoverlapping (usually equal) intervals. Each interval is called a class, or class interval. The endpoints of each class are the class boundaries. In other words, partition the measurement axis into 5–20 subintervals.
2. We use the left-endpoint convention. An observation equal to an endpoint is allocated to the class with that value as its lower endpoint. Hence, the lower class boundary is always included in the interval, and the upper class boundary is never included. This ensures that each observation falls into exactly one interval.
3. As a rule of thumb, there should be 5–20 intervals. Use friendly numbers, for example, 10–20, 20–30, etc., not 15.376–18.457, 18.457–21.538, etc.
4. Count the number of observations in each class interval. This count is the class frequency or simply the frequency.
5. Compute the proportion of observations in each class. This ratio, the class frequency divided by the total number of observations, is the relative frequency.
6. Find the cumulative relative frequency (CRF) for each class: the sum of all the relative frequencies of classes up to and including that class. This column is a running total or accumulation of relative frequency, by row.

Example 2.10 Nuts and Bolts

Torque is a measure of the force needed to cause an object to rotate. It is usually measured in foot-pounds (ft-lb). As part of a quality-control program, Whirlpool inspectors measure the initial torque needed to loosen the balancing bolts on each leg of a clothes washer. A random sample of these measurements is given in the following table. (Data set: TORQUE)

20.4  24.1  28.4  53.4  62.1  31.7  57.2  45.7  38.1  51.1
41.3  11.0  37.5  36.4  25.6  43.5  23.1  24.2  35.5  26.4
13.0  44.4  16.9  14.9  63.7

Construct a frequency distribution for these data.

SOLUTION
STEP 1 The data set is numerical (continuous). The observations are measurements, and each can be any number in some interval.
STEP 2 Scan the data to find the smallest and largest observations (11.0 and 63.7). Choose between 5 and 20 reasonable (equal) intervals that capture all of the data. The range of values 10–70 captures all of the data. Divide this range using the friendly numbers 10, 20, 30, . . . into the class intervals 10–20, 20–30, 30–40, etc.
STEP 3 Count the number of observations in each interval. For example, in the interval 10–20, there are four observations (16.9, 14.9, 11.0, and 13.0), so the frequency is 4.
STEP 4 Compute the proportion of observations in each class. For example, in the interval 10–20, the relative frequency is 4 (observations) divided by 25 (total number of observations).
STEP 5 Find the CRF for each class. For example, for the class 30–40, the cumulative relative frequency is the sum of the relative frequencies of this class and of all those listed above it: 0.16 + 0.28 + 0.20 = 0.64.

Class    Frequency   Relative frequency   Cumulative relative frequency
10–20    4           0.16 (= 4/25)        0.16 (= 0.16)
20–30    7           0.28 (= 7/25)        0.44 (= 0.16 + 0.28)
30–40    5           0.20 (= 5/25)        0.64 (= 0.44 + 0.20)
40–50    4           0.16 (= 4/25)        0.80 (= 0.64 + 0.16)
50–60    3           0.12 (= 3/25)        0.92 (= 0.80 + 0.12)
60–70    2           0.08 (= 2/25)        1.00 (= 0.92 + 0.08)
Total    25          1.00

STEP 6 As for categorical data, if you must construct a frequency distribution by hand, an additional tally column is helpful (as introduced in Section 2.2). Insert this after the class column, and use a tally mark or tick mark to count observations as you read them from the table.

TRY IT NOW GO TO EXERCISE 2.77
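The construction in Example 2.10 can also be sketched in Python. This is only an illustration under stated assumptions: the function `frequency_distribution` and its parameters are hypothetical names, equal class widths are assumed, and the torque values are those listed in the example.

```python
import math


def frequency_distribution(data, low, width, n_classes):
    """Return rows of (class label, frequency, relative frequency, CRF)."""
    counts = [0] * n_classes
    for x in data:
        # Left-endpoint convention: a value equal to a boundary is placed
        # in the class that has that boundary as its lower endpoint.
        k = math.floor((x - low) / width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} is outside the chosen range")
        counts[k] += 1
    n = len(data)
    rows, running = [], 0
    for k, f in enumerate(counts):
        running += f  # running total of frequencies gives the CRF
        label = f"{low + k * width}-{low + (k + 1) * width}"
        rows.append((label, f, round(f / n, 2), round(running / n, 2)))
    return rows


torque = [20.4, 24.1, 28.4, 53.4, 62.1, 31.7, 57.2, 45.7, 38.1, 51.1,
          41.3, 11.0, 37.5, 36.4, 25.6, 43.5, 23.1, 24.2, 35.5, 26.4,
          13.0, 44.4, 16.9, 14.9, 63.7]

rows = frequency_distribution(torque, low=10, width=10, n_classes=6)
for row in rows:
    print(row)
```

With non-integer class boundaries (for example 0.10, 0.20, . . .), floating-point arithmetic can misplace an observation that sits exactly on a boundary, so rescaling the data to integers first is a safer choice in that case.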

The last (total) row in a frequency distribution is optional, but it is a good check of your calculations. The frequencies should sum to the total number of observations (25 in Example 2.10), and the relative frequencies should sum to 1.00 (subject to round-off error). The CRF of the first class row is equal to the relative frequency of the first class. There are no other observations before the first class. The CRF of the last class should be 1.00 (subject to round-off error). You must accumulate all of the data by the last class. CRF gives the proportion of observations in that class and all previous classes. In Example 2.10, the CRF of the class 40–50 is 0.80. Interpretation: the proportion of torque measurements less than 50 is 0.80.
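These checks are easy to automate. The following Python sketch uses the frequencies from Example 2.10 and exact fractions (an implementation choice made here to avoid round-off error). Because the CRF is a running total, subtracting consecutive CRF values recovers each relative frequency.

```python
from fractions import Fraction

frequencies = [4, 7, 5, 4, 3, 2]  # classes 10-20, ..., 60-70 in Example 2.10
n = sum(frequencies)

# Build the CRF column as a running total of relative frequencies.
crf = []
running = Fraction(0)
for f in frequencies:
    running += Fraction(f, n)
    crf.append(running)

# Work backward: each relative frequency is a difference of consecutive CRFs.
recovered = [crf[0]] + [b - a for a, b in zip(crf, crf[1:])]

print([float(c) for c in crf])        # [0.16, 0.44, 0.64, 0.8, 0.92, 1.0]
print([float(r) for r in recovered])  # [0.16, 0.28, 0.2, 0.16, 0.12, 0.08]
```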

A CLOSER LOOK
1. Suppose you were given just the CRF for each class. To find the relative frequency for a class, take the class CRF and subtract the previous class CRF. In Example 2.10, to find the relative frequency for the class 50–60: 0.92 (CRF for the class 50–60) − 0.80 (CRF for the previous class 40–50) = 0.12. This idea of working backward from cumulative relative frequency to obtain relative frequency is a handy technique for answering many probability questions.
2. If the data set is numerical and discrete, use the same procedure outlined above for constructing a frequency distribution. If the number of discrete observations is small, then each value may be a class, or category. In addition, certain liberties are sometimes acceptable in listing the classes. For example, suppose a discrete data set consists of integers from 1 to 30. One might use the classes 1–5, 6–10, 11–15, 16–20, 21–25, and 26–30. This is not a strict partition of the interval 1–30 even though these classes are disjoint, or do not overlap. They do not allow for all numbers between 1 and 30. For example, the value 5.5 is between 1 and 30 but does not fall into any of these classes. However, these classes work fine in this case, because each observation is an integer. The resulting frequency distribution is perfectly valid.

A histogram is a graphical representation of a frequency distribution, a plot of frequency versus class interval. Given a frequency distribution, here is a procedure for constructing a histogram.

How to Construct a Histogram
1. Draw a horizontal (measurement) axis and place tick marks corresponding to the class boundaries.
2. Draw a vertical axis and place tick marks corresponding to frequency. Label each axis.
3. Draw a rectangle above each class with height equal to frequency.

Example 2.11 Nuts and Bolts, Continued

Construct a frequency histogram for the torque data presented in Example 2.10. For reference, here is the frequency distribution from Example 2.10:

Class    Frequency   Relative frequency   Cumulative relative frequency
10–20    4           0.16                 0.16
20–30    7           0.28                 0.44
30–40    5           0.20                 0.64
40–50    4           0.16                 0.80
50–60    3           0.12                 0.92
60–70    2           0.08                 1.00

SOLUTION
STEP 1 Draw a horizontal axis and place tick marks corresponding to the class boundaries, or endpoints: 10 through 70 by tens.
STEP 2 Draw a vertical axis for frequency and place appropriate tick marks by checking the frequency distribution. The frequencies range from 0 to 7, so draw tick marks at 0 to at least 7 on the vertical axis.
STEP 3 Draw a rectangle above each class with height equal to frequency. The resulting histogram is shown in Figure 2.27. Figure 2.28 shows a technology solution.

Figure 2.27 Frequency histogram for torque. [Vertical axis: Frequency; horizontal axis: Torque]

Figure 2.28 CrunchIt! histogram.

TRY IT NOW GO TO EXERCISE 2.86
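When no graphing software is at hand, a rough text version of the torque histogram can be produced from the frequency distribution alone. The sketch below is an illustration only; `text_histogram` is a hypothetical helper, and the bars run horizontally, so the picture is Figure 2.27 turned on its side.

```python
def text_histogram(classes, frequencies, mark="#"):
    """Return one line per class: the label, then one mark per observation."""
    label_width = max(len(c) for c in classes)
    return [f"{c:>{label_width}} | {mark * f} ({f})"
            for c, f in zip(classes, frequencies)]


classes = ["10-20", "20-30", "30-40", "40-50", "50-60", "60-70"]
frequencies = [4, 7, 5, 4, 3, 2]

lines = text_histogram(classes, frequencies)
for line in lines:
    print(line)
```

Each bar length equals the class frequency, which is the same rule used in Step 3 of the example.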

A CLOSER LOOK
1. A histogram tells us about the shape, center, and variability of the distribution. In addition, we can quickly identify any outliers. (Histogram usually means frequency histogram.)
2. If you must draw a histogram by hand, then you need to construct the frequency distribution first. However, calculators and computers construct histograms directly from the data. The frequency distribution is in the background and is usually not displayed.
3. To construct a relative frequency histogram, plot relative frequency versus class interval. The only difference between a frequency histogram and a relative frequency histogram is the scale on the vertical axis. The two graphs are identical in appearance. In Example 2.12 both a frequency histogram and a relative frequency histogram are shown.
4. Histograms should not be used for inference. They provide a quick look at the distribution of data and only suggest certain characteristics.

Example 2.12 Highway Tunnels

A random sample of highway tunnel lengths (in feet) was obtained, and the resulting frequency distribution is shown in the following table:

Class        Frequency   Relative frequency   Cumulative relative frequency
0–500        16          0.08                 0.08
500–1000     28          0.14                 0.22
1000–1500    54          0.27                 0.49
1500–2000    48          0.24                 0.73
2000–2500    36          0.18                 0.91
2500–3000    18          0.09                 1.00
Total        200         1.00

Use this table to construct a frequency histogram and a relative frequency histogram for these data.

SOLUTION
STEP 1 For each graph, draw a horizontal axis and place tick marks at the class boundaries: 0, 500, 1000, . . . , 3000.
STEP 2 For the frequency histogram:
a. Draw a vertical axis for frequency. Since the largest frequency is 54, use the tick marks at 0, 10, 20, . . . , 60.
b. Draw a rectangle above each class with height equal to frequency. The resulting frequency histogram is shown in Figure 2.29.
STEP 3 For the relative frequency histogram:
a. Draw a vertical axis for relative frequency. Because the largest relative frequency is 0.27, use the tick marks at 0, 0.05, 0.10, . . . , 0.30.
b. Draw a rectangle above each class with height equal to relative frequency. The resulting relative frequency histogram is shown in Figure 2.30.

Figure 2.29 Frequency histogram for the tunnel-length data.

Figure 2.30 Relative frequency histogram for the tunnel-length data.

TRY IT NOW GO TO EXERCISE 2.89

If the class widths are unequal in a frequency distribution, then neither the frequency nor the relative frequency should be used on the vertical axis of the corresponding histogram. To account for the unequal class widths, set the area of each rectangle equal to the relative frequency. In this case, the height of each rectangle is called the density, and it is equal to the relative frequency divided by the class width.

How to Find the Density
To find the density for each class:
1. Set the area of each rectangle equal to relative frequency. The area of each rectangle is height times class width.
   Area of rectangle = Relative frequency = (Height) × (Class width)
2. Solve for the height.
   Density = Height = (Relative frequency) / (Class width)

If two classes have the same frequency, but one class has double the width, then the corresponding rectangle in a traditional histogram would have double the area. This misrepresents the distribution.

The following example shows an extended frequency distribution with the density of each class included, and the corresponding density histogram.

Example 2.13 Accident Demographics

Younger drivers tend to be involved in more automobile crashes than older drivers. This may be attributed to risk and inexperience. Suppose the following table shows the number of automobile accidents in Michigan in 2013 for each driver age group.25 The width of each class and the density calculations are also shown.

Class    Frequency   Relative frequency   Width   Density
16–18    18,157      0.0494               2       0.0247 (= 0.0494/2)
18–21    40,122      0.1091               3       0.0364 (= 0.1091/3)
21–25    45,247      0.1231               4       0.0308 (= 0.1231/4)
25–30    43,106      0.1172               5       0.0234 (= 0.1172/5)
30–40    73,846      0.2008               10      0.0201 (= 0.2008/10)
40–50    78,442      0.2133               10      0.0213 (= 0.2133/10)
50–60    68,781      0.1871               10      0.0187 (= 0.1871/10)
Total    367,701     1.0000

Use this table to construct a density histogram for these data.

SOLUTION
STEP 1 The class intervals are of unequal width, so the class density must be used as the height of each rectangle in a histogram.
STEP 2 Draw a horizontal axis corresponding to age. Because the classes range from 16 to 60, use tick marks at 15, 20, 25, . . . , 60, or tick marks corresponding to the endpoints of each class.
STEP 3 Add a vertical axis for density. The largest density is 0.0364, so use the tick marks 0, 0.005, 0.010, . . . , 0.040.
STEP 4 Draw a rectangle above each class with height equal to density. The resulting density histogram is shown in Figure 2.31. A technology solution is shown in Figure 2.32.

Figure 2.31 Histogram for unequal class widths: density histogram for the age data. [Vertical axis: Density; horizontal axis: Age]

Figure 2.32 A technology solution: Minitab density histogram.

TRY IT NOW GO TO EXERCISE 2.94
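The density column in Example 2.13 can be reproduced with a few lines of Python. This is a sketch using the class boundaries and frequencies from the table; the variable names are illustrative only. It also checks that the rectangle areas (density times width) add up to 1.

```python
boundaries = [16, 18, 21, 25, 30, 40, 50, 60]
frequencies = [18157, 40122, 45247, 43106, 73846, 78442, 68781]

total = sum(frequencies)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
relative = [f / total for f in frequencies]

# Density = relative frequency / class width (the height of each rectangle).
densities = [r / w for r, w in zip(relative, widths)]

# Area of each rectangle = density x width, which is the relative frequency.
total_area = sum(d * w for d, w in zip(densities, widths))

print([round(d, 4) for d in densities])
print(round(total_area, 10))
```

The densities here are computed from unrounded relative frequencies, whereas the table divides the rounded values; for these data the two approaches agree to four decimal places.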

Because the relative frequency is equal to the area of each rectangle in a density histogram, the sum of the areas of all the rectangles is 1. This is an important concept as we begin to associate area with probability.

Shape of a Distribution

The shape of a distribution, represented in a histogram, is an important characteristic. To help describe the various shapes, we draw a smooth curve along the tops of the rectangles that captures the general nature of the distribution (as shown in Figure 2.33). To help identify and describe distributions quickly, a smoothed histogram is often drawn on a graph without a vertical axis, without any tick marks on the measurement axis, and without any rectangles (as shown in Figure 2.34).

Figure 2.33 Smooth curve that captures the general shape of the distribution.

Figure 2.34 Typical smoothed histogram.

The first important characteristic of a distribution is the number of peaks.

Definition
1. A unimodal distribution has one peak. This is very common; almost all distributions have a single peak.
2. A bimodal distribution has two peaks. This shape is not very common and may occur if data from two different populations are accidentally mixed.
3. A multimodal distribution has more than one peak. A distribution with more than two distinct peaks is very rare.

Examples of these three types of distributions are shown in Figures 2.35–2.37.

Figure 2.35 Unimodal histogram.

Figure 2.36 Bimodal histogram.

Figure 2.37 Multimodal distribution with four peaks.

The following characteristics are used to further classify and identify unimodal distributions.

Definition
1. A unimodal distribution is symmetric if there is a vertical line of symmetry in the distribution.
2. The lower tail of a distribution is the leftmost portion of the distribution, and the upper tail is the rightmost portion of the distribution.
3. If a unimodal distribution is not symmetric, then it is skewed.
   (a) In a positively skewed distribution, or a distribution that is skewed to the right, the upper tail extends farther than the lower tail.
   (b) In a negatively skewed distribution, or a distribution that is skewed to the left, the lower tail extends farther than the upper tail.

Figures 2.38 and 2.39 show examples of symmetric, unimodal distributions. Each shows the (dashed) line of symmetry. The left half of the distribution is a mirror image of the right half. A bimodal or multimodal distribution may also be symmetric, and many distributions are approximately symmetric.

Figure 2.38 Symmetric distribution.

Figure 2.39 Symmetric distribution.

Examples of skewed distributions are shown in Figures 2.40 and 2.41. Positively skewed distributions are more common. The distribution of the lifetime of an electronics part might be positively skewed.

Figure 2.40 Positively skewed distribution.

Figure 2.41 Negatively skewed distribution.

The most common unimodal distribution shape is a normal curve (as shown in Figure 2.42). This curve is symmetric and bell-shaped, and can be used to model, or approximate, many populations. A curve with heavy tails has more observations in the tails of the distribution than a comparable normal curve. The tails do not drop down to the measurement axis as quickly as a normal curve. A curve with light tails has fewer observations in the tails of the distribution than a comparable normal curve. The tails drop to the measurement axis quickly. Examples of curves with heavy and light tails are shown in Figures 2.43 and 2.44. Both of these characteristics are subtle and tricky to spot.

We will learn much more about the normal curve in Chapter 6. The vertical cross section of a bell is a normal curve.

Figure 2.42 Normal curve.

Figure 2.43 A distribution with heavy tails.

Figure 2.44 A distribution with light tails.

Example 2.14 Radiation Exposure

The U.S. Nuclear Regulatory Commission was established to regulate commercial, industrial, academic, and medical uses of nuclear materials.26 The NRC is charged with monitoring our health and safety and protecting the environment. Part of this responsibility involves monitoring the radiation exposure at nuclear power reactors and other facilities. The individual radiation dose per year is measured in rem (roentgen equivalent man). Suppose a sample of 50 individual radiation measurements was obtained from employees and the data are given in the following table: (Data set: REMS)

0.62  0.29  0.06  0.09  0.10  0.24  0.06  0.38  0.32  0.46
0.71  0.53  0.28  0.19  0.16  0.40  0.08  0.24  0.57  0.11
0.30  0.32  0.18  0.29  0.15  0.13  0.42  0.18  0.28  0.39
0.14  0.18  0.27  0.20  0.37  0.22  0.26  0.31  0.11  0.29
0.19  0.12  0.22  0.21  0.12  0.05  0.22  0.26  0.49  0.43

a. Construct a frequency distribution and a histogram for these data using the class intervals 0–0.10, 0.10–0.20, etc.
b. Describe the shape, center, and spread of the distribution.
c. What proportion of observations are less than 0.40 rem?
d. What proportion of observations are at least 0.50 rem?

Solution Trail 2.14
KEYWORDS
■ Less than 0.40
■ At least 0.50
TRANSLATION
■ Up to but not including 0.40
■ 0.50 or greater
CONCEPTS
■ Use the frequency distribution to describe the distribution and answer questions about specific classes.
VISION
Use the frequency distribution to construct the histogram. Smooth out the rectangles to describe the shape, consider the middle of the observations to determine the center, and try to decide whether the data are compact or spread out over a wide range. Use the relative frequencies or cumulative relative frequencies to find the proportion of observations in certain classes.

SOLUTION STEP 1 The class intervals are given. Create a frequency distribution and compute the

frequency, relative frequency, and cumulative relative frequency for each class.

Class        Frequency   Relative frequency   Cumulative relative frequency
0.00–0.10     5          0.10 (= 5/50)        0.10 (= 0.10)
0.10–0.20    14          0.28 (= 14/50)       0.38 (= 0.10 + 0.28)
0.20–0.30    15          0.30 (= 15/50)       0.68 (= 0.38 + 0.30)
0.30–0.40     7          0.14 (= 7/50)        0.82 (= 0.68 + 0.14)
0.40–0.50     5          0.10 (= 5/50)        0.92 (= 0.82 + 0.10)
0.50–0.60     2          0.04 (= 2/50)        0.96 (= 0.92 + 0.04)
0.60–0.70     1          0.02 (= 1/50)        0.98 (= 0.96 + 0.02)
0.70–0.80     1          0.02 (= 1/50)        1.00 (= 0.98 + 0.02)
Total        50          1.00
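The Step 1 computations can also be checked with a short program. The following Python sketch is one possible way to do this, not part of the text's technology solutions; the function name frequency_distribution is hypothetical, and the code assumes that an observation on a class boundary is counted in the larger class (so 0.40 belongs to 0.40–0.50), which agrees with the frequencies reported in this example.

```python
from collections import Counter

# The 50 radiation measurements (rem) from Example 2.14.
rems = [
    0.62, 0.71, 0.30, 0.14, 0.19, 0.29, 0.53, 0.32, 0.18, 0.12,
    0.06, 0.28, 0.18, 0.27, 0.22, 0.09, 0.19, 0.29, 0.20, 0.21,
    0.10, 0.16, 0.15, 0.37, 0.12, 0.24, 0.40, 0.13, 0.22, 0.05,
    0.06, 0.08, 0.42, 0.26, 0.22, 0.38, 0.24, 0.18, 0.31, 0.26,
    0.32, 0.57, 0.28, 0.11, 0.49, 0.46, 0.11, 0.39, 0.29, 0.43,
]


def frequency_distribution(data, n_classes=8):
    """Return rows (lower, upper, frequency, relative, cumulative)
    for classes of width 0.10 starting at 0.00.

    Assumption: a boundary value is placed in the larger class.
    """
    # Convert to whole hundredths first so that boundary values such as
    # 0.30 are not misplaced by floating-point error.
    counts = Counter(round(x * 100) // 10 for x in data)
    n = len(data)
    rows, running = [], 0
    for k in range(n_classes):
        running += counts[k]
        rows.append((k / 10, (k + 1) / 10, counts[k], counts[k] / n, running / n))
    return rows


table = frequency_distribution(rems)
for lower, upper, freq, rel, cum in table:
    print(f"{lower:.2f}-{upper:.2f}  {freq:2d}  {rel:.2f}  {cum:.2f}")
```

Working in hundredths is a design choice for this sketch: comparing decimal values such as 0.30 directly against computed class boundaries can misclassify observations that sit exactly on a boundary.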

Use the frequency distribution to sketch the histogram (Figure 2.45). A technology solution is shown in Figure 2.46.

[Histogram: frequency (vertical axis) versus radiation (horizontal axis, 0.0 to 0.8).]

Figure 2.45 Histogram for the radiation data.

Figure 2.46 JMP histogram for the radiation data.

62

CH APT ER 2

Tables and Graphs for Summarizing Data

Other estimates of the center are also valid here. 0.24 is reasonable. So is 0.27. More precise measurements of the center of a data set are presented in Chapter 3.

The distribution is positively skewed. There are more observations in the lower tail than in the upper tail. The upper tail extends farther than the lower tail.

To estimate the center of the distribution, use the histogram to identify a value such that approximately half of the observations are below that number and half are above that number. A number between 0.2 and 0.3 appears to divide the ordered data in half. Typical values for this data set are in this range, and an estimate of the center is 0.25.

The variability is typically described as either compact (data that are compressed or squeezed together) or spread out (observations that extend over a wide range). Although this is somewhat subjective for now, this data set is fairly compact. All of the observations lie between 0.05 and 0.71 (even though the smallest class boundary is 0.00 and the largest class boundary is 0.80).

STEP 2 Using the cumulative relative frequency column of the frequency distribution, the proportion of observations less than 0.40 is 0.82.

STEP 3 There are two ways to find the proportion of observations that are at least 0.50.
a. Add the relative frequencies that correspond to the classes that are at least 0.50.

0.04 + 0.02 + 0.02 = 0.08 (the relative frequencies for the classes 0.50–0.60, 0.60–0.70, and 0.70–0.80, respectively)

b. Find the cumulative relative frequency up to 0.50 and subtract this value from 1.

Proportion of observations ≥ 0.50 = 1 − (proportion of observations < 0.50) = 1 − 0.92 = 0.08

TRY IT NOW GO TO EXERCISE 2.95
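The two approaches in Step 3 can be compared numerically. This is a minimal Python sketch, assuming the relative frequencies from the frequency distribution in Example 2.14; the dictionary layout and the helper names proportion_below and proportion_at_least are hypothetical, and each cutoff must be a class boundary for the result to be meaningful.

```python
# Relative frequencies from Example 2.14, keyed by the lower endpoint of
# each class in hundredths of a rem (integers avoid inexact decimal keys).
relative = {0: 0.10, 10: 0.28, 20: 0.30, 30: 0.14,
            40: 0.10, 50: 0.04, 60: 0.02, 70: 0.02}


def proportion_below(cutoff):
    """Sum the relative frequencies of all classes that end at or before cutoff."""
    return sum(rel for lower, rel in relative.items() if lower < cutoff)


def proportion_at_least(cutoff):
    """Sum the relative frequencies of all classes that start at or after cutoff."""
    return sum(rel for lower, rel in relative.items() if lower >= cutoff)


below_40 = proportion_below(40)        # Step 2: less than 0.40 rem
direct = proportion_at_least(50)       # Step 3a: add the upper classes
complement = 1 - proportion_below(50)  # Step 3b: 1 minus the cumulative value

print(round(below_40, 2), round(direct, 2), round(complement, 2))
```

Both Step 3 methods give the same proportion because the relative frequencies of all classes sum to 1.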

Technology Corner Procedure: Construct a histogram. Reconsider: Example 2.11, solution, and interpretations.

VIDEO TECH MANUALS: EXCEL DESCRIPTIVE HISTOGRAM

CrunchIt!
CrunchIt! has a built-in function to construct a histogram.
1. Enter the data into column Var1.
2. Select Graphics; Histogram. Choose Sample (column) Var1. Optionally enter the Bin Width and Start Bins At. Optionally enter a Title and X Label. The Y Label is Count by default. Click the Calculate button. Refer to Figure 2.28 (page 56).

TI-84 Plus C
A histogram is one of the six types of TI-84 Plus C statistical plots. There is no built-in function to construct a density histogram. There are calculator programs available that will produce this graph.
1. Enter the data into list L1.
2. Select STATPLOT; Plot1 to define, or set up, the histogram. Turn the plot On, select Type histogram, set Xlist to the name of the list containing the data, set Freq (frequency of occurrence of each observation) to 1, and select a Color. See Figure 2.47.
3. Set the WINDOW parameters so that Xmin is the left endpoint on the first class, Xmax is the right endpoint on the last class, and Xscl is the width of each class. Ymin should be 0 (the smallest frequency) and Ymax should be at least the largest frequency. Set Yscl to a reasonable distance between tick marks. See Figure 2.48.
4. Press GRAPH to display the histogram. See Figure 2.49. Note: The TRACE key is used to move on the graph between rectangles (classes). The corresponding class boundaries and frequency are displayed.

Figure 2.47 TI-84 Plus C Plot1 setup screen.

Figure 2.48 The WINDOW settings.

Figure 2.49 TI-84 Plus C histogram.

Minitab
The input may be either a single column containing the data or summary information in two columns: observations and frequencies.
1. Enter the data into column C1.
2. Select Graph; Histogram and highlight Simple histogram. Click OK.
3. Enter C1 in the Graph variables window. Click OK to view the histogram.
4. Edit the horizontal axis scale to use the correct class intervals. Under the Binning tab, select Interval type: Cutpoint and enter the Midpoint/Cutpoint positions (class boundaries). See Figure 2.50.

Figure 2.50 Minitab histogram.

Excel
The input may be either a single column containing the data or summary information in two columns: right endpoint of each class (bin limits) and corresponding frequencies. There are several methods to construct a frequency distribution in Excel using FREQUENCY, SUMIF, or COUNTIFS, for example.
1. Enter the data into column A and the right endpoint of each class into column B.
2. Under the Data tab, select Data Analysis and choose Histogram. Enter the Input Range, the Bin Range, the Output Range, and select Chart Output.
3. Each class is labeled with its right endpoint. In addition, Excel places observations on a boundary in the smaller class. See Figure 2.51.

Figure 2.51 Excel histogram.

SECTION 2.4 EXERCISES

Concept Check

2.68 True/False The classes in a frequency distribution may overlap.

2.69 True/False The classes in a frequency distribution should have the same width.

2.70 True/False The cumulative relative frequency for each class in a frequency distribution may be greater than 1.

2.71 True/False The relative frequency for each class can be determined by using the cumulative relative frequencies.

2.72 True/False The only difference between a frequency histogram and a relative frequency histogram (for the same data) is the scale on the vertical axis.

2.73 True/False A histogram can be used to describe the shape, center, and variability of a distribution.

2.74 Short Answer
a. When is a density histogram appropriate?
b. In a density histogram, what is the sum of areas of all rectangles?

2.75 Fill in the Blank
a. The most common unimodal distribution is a ______.
b. A unimodal distribution is ______ if there is a vertical line of symmetry.
c. If a unimodal distribution is not symmetric, then it is ______.

2.76 True/False A bimodal distribution cannot be symmetric.

Practice

2.77 Consider the data given in the following table. EX2.77

87  91  91  89
81  86  81  85
86  86  89  86
90  87  89  90
88  88  83  90
85  85  90  89
79  92  83  78
91  85  80  91
87  87  90  83
82  86  80  92

Construct a frequency distribution to summarize these data using the class intervals 78–80, 80–82, 82–84, . . . .

2.78 Consider the data given on the text website. Construct a frequency distribution to summarize these data. EX2.78

2.79 Consider the following frequency distribution.

Class      Frequency   Relative frequency   Cumulative relative frequency
400–410     5          0.0758               0.0758
410–420     8          0.1212               0.1970
420–430    10          0.1515               0.3485
430–440    12          0.1818               0.5303
440–450     9          0.1364               0.6667
450–460     8          0.1212               0.7879
460–470     5          0.0758               0.8637
470–480     4          0.0606               0.9243
480–490     3          0.0455               0.9698
490–500     2          0.0303               1.0001

Draw the corresponding frequency histogram. (Notice the last entry in the Cumulative Relative Frequency column is not exactly 1. This is due to round-off error.)

Class

2.80 Consider the following frequency distribution.

Class      Frequency   Relative frequency   Cumulative relative frequency
0.5–1.0     6          0.03                 0.03
1.0–1.5     8          0.04                 0.07
1.5–2.0    10          0.05                 0.12
2.0–2.5    16          0.08                 0.20
2.5–3.0    24          0.12                 0.32
3.0–3.5    34          0.17                 0.49
3.5–4.0    36          0.18                 0.67
4.0–4.5    22          0.11                 0.78
4.5–5.0    18          0.09                 0.87
5.0–5.5    12          0.06                 0.93
5.5–6.0     8          0.04                 0.97
6.0–6.5     4          0.02                 0.99
6.5–7.0     2          0.01                 1.00

Draw the corresponding relative frequency histogram.

2.81 Complete the following frequency distribution.

Class      Frequency   Relative frequency   Cumulative relative frequency
100–150    155
150–200    120
200–250    130
250–300    145
300–350    150
350–400    100

2.82 Complete the following frequency distribution.

1.0–1.1 1.1–1.2 1.2–1.3 1.3–1.4 1.4–1.5 1.5–1.6 1.6–1.7 1.7–1.8 Total

Frequency

Relative frequency

Cumulative relative frequency

0.05 20 0.15 65 0.25 35 25 300

2.83 Complete the following frequency distribution and draw the corresponding histogram.

Class      Frequency   Relative frequency   Cumulative relative frequency
0–25                                        0.150
25–50                                       0.350
50–75                                       0.525
75–100                                      0.675
100–125                                     0.800
125–150                                     0.900
150–175                                     0.975
175–200                                     1.000
Total      1000

2.84 Consider the data given on the text website. EX2.84
a. Construct a frequency distribution to summarize these data using the class intervals 0–1, 1–2, 2–3, etc., and draw the corresponding histogram.
b. Use the histogram to describe the shape of the distribution. Are there any outliers?

2.85 Consider the data given on the text website. EX2.85
a. Construct a frequency distribution to summarize these data and draw the corresponding histogram.
b. Use the histogram to describe the shape of the distribution.
c. Use the frequency distribution to estimate the middle of the data: a number M such that 50% of the observations are below M and 50% are above M.
d. Use the frequency distribution to estimate a number Q1 such that 25% of the observations are below Q1 and 75% are above Q1.
e. Use the frequency distribution to estimate a number Q3 such that 75% of the observations are below Q3 and 25% are above Q3.

Applications

2.86 Biology and Environmental Science A weather station located along the Maine coast in Kennebunkport collects data on temperature, wind speed, wind chill, and rain. The maximum wind speed (in miles per hour) for 50 randomly selected times in February 2013 are given on the text website.27 MAXWIND
a. Construct a frequency distribution to summarize these data, and draw the corresponding histogram.
b. Describe the shape of the distribution. Are there any outliers?

2.87 Fuel Consumption and Cars The quality of an automobile battery is often measured by cold cranking amps (CCA), a measure of the current supplied at 0°F. Thirty automobile batteries were randomly selected and subjected to subfreezing temperatures. The resulting CCA data are given in the following table. BATTERY

 63  122  340
 87  514  199
302   91   77
  4  117  217
259  325   64
106   39  320
198   30  145
 55  164   84
 99   75   47
134   16  232

a. Construct a frequency distribution to summarize these data, and draw the corresponding histogram.
b. Describe the shape of the distribution.
c. Estimate the middle of the distribution, a number M such that 50% of the data are below M and 50% are above M.

2.88 Marketing and Consumer Behavior The weights of a diamond and other precious stones are usually measured in carats. One carat is traditionally equal to 200 milligrams. A random sample of the weights (in carats) of conflict-free loose diamonds is given in the following table.28 DIAMOND

0.23 0.51 0.76 1.52 1.76

0.27 0.58 0.90 1.05 2.00

0.30 0.61 1.14 1.51 1.01

0.25 0.80 1.11 1.36 1.69

0.27 0.90 1.38 0.91 1.51

0.26 1.02 1.52 1.22 2.00

0.40 0.92 0.96 1.54 1.45

0.40 1.01 1.16 1.35 1.38

a. Construct a frequency distribution and a histogram for these data.
b. Multiply each observation in the table by 200, to convert the weights into milligrams. Construct a frequency distribution and a histogram for these new, transformed data.
c. Compare the two histograms. Are the shapes similar? Describe any differences.

2.89 Public Health and Nutrition Vitamin B3 (niacin) helps to detoxify the body, aids digestion, can ease the pain of migraine headaches, and helps promote healthy skin. A random sample of adults in the United States and in Europe was obtained and the daily intake of niacin (in milligrams) was recorded. The data are summarized in the following table.

Class    United States frequency   Europe frequency
0–3      15                         4
3–6      23                         6
6–9      21                        12
9–12     14                        17
12–15    12                        32
15–18     9                        25
18–21     3                        20
21–24     2                        10

a. Construct two relative frequency histograms, one for the United States and one for Europe.
b. Describe the shape of each histogram. Does a comparison of the two histograms suggest any differences in niacin intake between the two samples? Explain.

2.90 Manufacturing and Product Development In the United States, yarn is often sold in hanks. For woolen yarn, one hank is approximately 1463 meters. A quality control inspector uses a special machine to quickly measure each hank. A random sample was obtained during the manufacturing process, and the length (in meters) of each hank is given on the text website. YARN
a. Construct a histogram for these data.
b. Describe the distribution in terms of shape, center, and variability.

2.91 Sports and Leisure The National Hockey League is concerned about the number of penalty minutes assessed to each player. While some people in attendance hope to see a lot of fighting (and penalty minutes), the League Office believes most fans are interested in good, clean hockey. A sample of total penalty minutes per player during the 2012–2013 regular season was obtained, and the data are given in the following table.29 PENALTY

39 21 39 25

38 29 21 14

14 16 19 22

22 18 19 17

26 19 15 15

15 17 25 71

65 37 46 14

39 40 14 13

24 56 32 23

44 17 30 24

a. Construct a histogram for these data. Describe the distribution in terms of shape, center, and variability. Write a Solution Trail for this problem.
b. Find a value m for the number of minutes such that 90% of all players have fewer than m penalty minutes.

Extended Applications

2.92 Biology and Environmental Science Fruits such as cherries and grapes are harvested and placed in a shallow box or crate called a lug. The size of a lug varies, but one typically holds between 16 and 28 pounds. A random sample of the weight (in pounds) of full lugs holding peaches was obtained, and the data are summarized in the following table.

Class        Frequency
20.0–20.5     6
20.5–21.0    12
21.0–21.5    17
21.5–22.0    21
22.0–22.5    28
22.5–23.0    25
23.0–23.5    19
23.5–24.0    15
24.0–24.5    11
24.5–25.0    10

a. Complete the frequency distribution.
b. Construct a histogram corresponding to this frequency distribution.
c. Estimate the weight w such that 90% of all full peach lugs

weigh more than w.

2.93 Travel and Transportation Maglev trains operate in Germany and Japan at speeds of up to 300 miles per hour. Magnets create a frictionless system in which the train operates at a distance of 100–150 millimeters from the rail. The size of this air gap is monitored constantly to ensure a safe ride. A random sample of the size of air gaps (in millimeters) at one specific location in the track was obtained. The frequency distribution for this data is shown in the following table.

Class      Frequency   Relative frequency   Cumulative relative frequency
100–105                                     0.050
105–110                                     0.425
110–115                                     0.625
115–120                                     0.750
120–125                                     0.850
125–130                                     0.925
130–135                                     0.975
135–140                                     1.000
Total      200

a. Complete the frequency distribution.
b. Draw a histogram corresponding to this frequency distribution.
c. What proportion of air gaps were between 110 and 125 millimeters?

2.94 Biology and Environmental Science Many scientists have warned that global warming is causing the polar ice caps to melt and, therefore, sea levels around the world to rise. A random sample of the sea level (in millimeters) at Rockport, Massachusetts, from 1987 to 2011 was obtained and is summarized in the following table.30

Class        Frequency   Relative frequency   Width   Density
1400–1500      1
1500–1600      2
1600–1800     41
1800–2000    192
2000–2100     79
2100–2200     60
2200–2300     15
2300–2800     10
Total        400

a. Complete the frequency distribution.
b. A traditional frequency histogram or relative frequency histogram is not appropriate in this case. Why not?
c. Construct a density histogram corresponding to this frequency distribution.

2.95 Fuel Consumption and Cars The total cost of owning an automobile includes the amount spent on repairs. Before purchasing a new car, many consumers research the past quality of specific makes and models. The data in the following table lists the number of problems per 100 vehicles over the three years 2010–2012. CARCOST

Lexus        71    Porsche      94    Lincoln     112
Toyota      112    Mercedes    115    Buick       118
Honda       119    Acura       120    Ram         122
Suzuki      122    Mazda       124    Chevrolet   125
Ford        127    Cadillac    128    Subaru      132
BMW         133    GMC         134    Scion       135
Nissan      137    Infiniti    138    Kia         140
Hyundai     141    Audi        147    Volvo       149
Mini        150    Chrysler    153    Jaguar      164
Volkswagen  174    Jeep        178    Mitsubishi  178
Dodge       190    Land Rover  220

Source: J. D. Powers 2013 Dependability Survey

a. Construct a histogram for these data.
b. Describe the distribution in terms of shape, center, and variability.
c. Find a number Q1 such that 25% of the problem data are less than Q1. Find a number Q3 such that 25% of the problem data are greater than Q3.
d. How many values should be between Q1 and Q3? Find the actual number of values between Q1 and Q3. Explain any difference between these two values.

CHAPTER 2 SUMMARY

Concept (page): Notation / Formula / Description

Categorical data set (29): Consists of observations that may be placed into categories.
Numerical data set (29): Consists of observations that are numbers.
Discrete data set (30): The set of all possible values is finite, or countably infinite.
Continuous data set (30): The set of all possible values is an interval of numbers.
Frequency distribution (33): A table used to describe a data set. It includes the class, frequency, and relative frequency (and cumulative relative frequency, if the data set is numerical).
Class frequency (33): The number of observations within a class.
Class relative frequency (33): The proportion of observations within a class: class frequency divided by total number of observations.
Class cumulative relative frequency (34): The proportion of observations within a class and every class before it: the sum of all the relative frequencies up to and including the class.
Bar chart (34): A graphical representation of a frequency distribution for categorical data with a vertical bar for each class.
Pie chart (35): A graphical representation of a frequency distribution for categorical data with a slice, or wedge, for each class.
Stem-and-leaf plot (45): A graph used to describe numerical data. Each observation is split into a stem and a leaf.
Histogram (55): A graphical representation of a frequency distribution for numerical data.
Density histogram (58): A graphical representation of a frequency distribution for numerical data containing class intervals of unequal width.
Unimodal distribution (59): A distribution with one peak.
Bimodal distribution (59): A distribution with two peaks.
Multimodal distribution (59): A distribution with more than one peak.
Symmetric distribution (59): A distribution with a vertical line of symmetry.
Positively skewed distribution (60): A distribution in which the upper tail extends farther than the lower tail.
Negatively skewed distribution (60): A distribution in which the lower tail extends farther than the upper tail.
Normal curve (60): The most common distribution, a bell-shaped curve.

CHAPTER 2 EXERCISES

APPLICATIONS

2.96 Business and Management A laborshed is a region from which an employment center draws its workforce. In order to understand the potential workforce in a laborshed, the Walker County Development Authority in Alabama sampled residents and reported the data in the following table. WORKERS

Employment status         Frequency
Employed (white collar)   125
Employed (blue collar)    200
Unemployed                 30
Homemaker                  50
Retired                    95

a. Add a relative frequency column to the table.
b. Construct a bar chart and a pie chart for these data.

2.97 Fuel Consumption and Cars The coefficient of drag (Cd) is a measure of a car’s aerodynamics. This unitless number is related directly to the speed of the car, overall performance, and miles per gallon. A low coefficient of drag indicates good performance. A random sample of new automobiles was examined, and the coefficient of drag was computed. The results are given on the text website. DRAGCOEF
a. Construct a stem-and-leaf plot for these data.
b. Use the plot in part (a) to describe the distribution in terms of shape, center, and variability.

2.98 Psychology and Human Behavior In January 2013, Harris Interactive released results of a survey in which adults were asked to name their favorite TV personality.31 Ellen DeGeneres captured the top spot, with Mark Harmon second. Jon Stewart, Jay Leno, and Jim Parsons round out the top five, and Bill O’Reilly, Anderson Cooper, and Oprah Winfrey also received strong support. Suppose the results from this survey are given in the following table. TVHOST

TV personality     Frequency
Ellen DeGeneres    638
Mark Harmon        532
Jon Stewart        402
Jay Leno           376
Jim Parsons        350
Others             320

a. Find the relative frequency for each category.
b. Construct a pie chart for these data.
c. What proportion of adults selected Jay Leno or Jim Parsons?
d. What proportion of adults did not select Jon Stewart?

2.99 Physical Science Construction equipment used to build homes, businesses, and roads (for example, cranes, backhoes, and front loaders) can be exceptionally loud. The noise level in dBA (A-weighted decibels) measured 50 feet away from several construction-related machines are given on the text website.32 NOISE
a. Construct a frequency distribution for these data.
b. Draw the corresponding histogram.
c. What proportion of construction equipment had a peak noise level below 80 dBA?
d. What proportion of construction equipment have peak noise levels of at least 90 dBA?

2.100 Technology and the Internet Many computer sellers and most software vendors maintain help lines for customers. A random sample of the duration (in minutes) of customer support calls to Amazon.com was obtained, and the resulting stem-and-leaf plot is given below.

Stem   Leaf
0      11223344555566678888999
1      0001222222335668999
2      012334556678
3      000123478
4      334468
5      125
6      15
7      7
Stem = 10, Leaf = 1

a. Describe the shape of this distribution of the duration of technical support calls.
b. Use the plot to construct a frequency distribution using the class intervals 0–5, 5–10, 10–15, etc.
c. What proportion of support calls last less than 15 minutes?
d. If a call lasts at least 25 minutes, a supervisor monitors the conversation. What proportion of calls were monitored?

2.101 Technology and the Internet Many police departments have been experimenting with and implementing state-of-the-art emergency 9-1-1 equipment. This equipment is designed to allow a faster response time without voice contact. Caller information is displayed on a monitor, printed, and then processed. To compare the two procedures (old and new), a random sample of police response times (in minutes) was obtained. The data are given on the text website. POLICE
a. Construct a back-to-back stem-and-leaf plot for these data.
b. Use the plot in part (a) to describe any similarities and/or differences between the distributions.
c. Based on the plot in part (a), which procedure is better? Justify your answer.

2.102 Manufacturing and Product Development Microwave ovens are often rated by their output power, for example, 900 watts. However, the actual output of a microwave oven tends to decrease with age. If the actual output is more than 400 watts below the rated output, then service is recommended. A random sample of five-year-old, 1000-watt-rated microwave ovens was obtained and tested for output. MICRO
a. Construct a frequency distribution for these data and draw the corresponding histogram.
b. Based on this random sample, what proportion of five-year-old, 1000-watt microwave ovens need service?
c. Suppose the performance of these microwave ovens is graded by actual output power, according to the following chart.

Power      Grade
900–1000   Excellent
800–900    Very good
700–800    Good
600–700    Fair
500–600    Poor
0–500      Not serviceable

Classify each power output, construct a frequency distribution by grade, and draw the resulting pie chart.

EXTENDED APPLICATIONS

2.103 Economics and Finance PwC, World Bank, and IFC released a study about paying taxes around the world. The report includes measures of the world’s tax systems associated with a standardized business. One of the measures, Time to Comply (in hours), for each of the 185 countries in the study, is given on the text website.33 TAXPAY
a. Use the class intervals 0–200, 200–400, etc., to construct a frequency distribution and draw the corresponding histogram.
b. Describe the distribution in terms of shape, center, and variability.
c. What is a typical Time to Comply? Are there any outliers?
d. What proportion of countries had a time to comply of at least 800 hours?

2.104 Fuel Consumption and Cars Remanufactured parts are common in the automotive industry. To ensure quality, Hite Parts Exchange routinely checks the maximum output of rebuilt alternators. Each day a random sample is obtained and the output delivered (in amps) at 2500 rpm is recorded. The results from a recent day are presented in the following table.

Class        Frequency
30.0–32.0      8
32.0–33.0      7
33.0–34.0     10
34.0–34.5     25
34.5–35.0     30
35.0–35.5     40
35.5–36.0     45
36.0–50.0      5
Total        170

a. Find the width and the density for each class.
b. Construct a density histogram for these data.

2.105 Medicine and Clinical Studies A common cold usually lasts from 3 to 14 days. Some studies suggest echinacea, zinc, or vitamin C can prevent colds and/or shorten their duration. In a new study of the effect of vitamin C, patients with colds were randomly assigned to a placebo group or a


vitamin C group. The duration of each cold (in days) was recorded, and the data are summarized in the following table.

Duration   Placebo frequency   Vitamin C frequency
 3          0                   3
 4          0                   6
 5          8                   7
 6          7                  10
 7         21                  18
 8         10                  15
 9         26                  17
10         15                  10
11          8                   9
12          3                   2
13          1                   3
14          1                   0

a. Use appropriate graphical procedures to compare the placebo and vitamin C data sets.
b. Do the graphs suggest any differences in shape, center, or variability?
c. Is there any graphical evidence to suggest vitamin C reduced the duration of a cold?

2.106 Fuel Consumption and Cars The performance of a gas furnace can be measured by the annual fuel utilization efficiency (AFUE). This number depends on many furnace properties, and is an indication of the proportion of fuel energy delivered as heat energy during an entire heating season. The U.S. Department of Energy (DOE) requires all new gas furnaces to operate at an AFUE of at least 78%.34 A gas company selected a random sample of customers, carefully tested each furnace, and recorded the AFUE number. The data are given on the text website. FURNACE
a. Construct a stem-and-leaf plot for these data.
b. Construct a frequency distribution for these data and draw the corresponding histogram.
c. Describe the distribution in terms of shape, center, and variability. Are there any outliers? If so, what are they?
d. Using the frequency distribution in part (b), approximately what proportion of furnaces do not meet the DOE’s minimum AFUE requirement?
e. The gas company classifies each AFUE reading according to the following scheme: 90 or above, excellent; at least 80 but below 90, good; at least 70 but below 80, fair; and less than 70, poor. Classify each reading in the table above, and construct a bar chart for these classification data.

CHALLENGE

2.107 Sports and Leisure An ogive, or cumulative relative frequency polygon, is another type of visual representation of a frequency distribution. To construct an ogive:
■ Plot each point (upper endpoint of class interval, cumulative relative frequency).
■ Connect the points with line segments.

Figures 2.52 and 2.53 show a frequency distribution and the corresponding ogive. The observations are ages. The values to be used in the plot are shown in bold in the table.

Class    Frequency   Relative frequency   Cumulative relative frequency
12–16      8         0.08                 0.08
16–20     10         0.10                 0.18
20–24     20         0.20                 0.38
24–28     30         0.30                 0.68
28–32     15         0.15                 0.83
32–36     10         0.10                 0.93
36–40      7         0.07                 1.00
Total    100         1.00

Figure 2.52 Frequency distribution.

[Ogive: cumulative relative frequency (vertical axis, 0 to 1.0) versus age (horizontal axis, 5 to 45), with the points (12, 0) and (28, 0.68) labeled.]

Figure 2.53 Resulting ogive.
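The ogive construction described above can be sketched in Python. This is only an illustration under stated assumptions: the function name ogive_points is hypothetical, the last class in Figure 2.52 is taken to be 36–40 (consistent with the equal class widths), and the starting point (12, 0) is placed at the lower endpoint of the first class, as labeled in Figure 2.53.

```python
# Class intervals (ages) and frequencies from Figure 2.52.
# Assumption: the last class is 36-40, matching the equal class widths.
classes = [(12, 16), (16, 20), (20, 24), (24, 28), (28, 32), (32, 36), (36, 40)]
frequencies = [8, 10, 20, 30, 15, 10, 7]


def ogive_points(classes, frequencies):
    """Return the points (upper endpoint, cumulative relative frequency),
    preceded by (lower endpoint of the first class, 0)."""
    total = sum(frequencies)
    points = [(classes[0][0], 0.0)]
    running = 0
    for (_, upper), freq in zip(classes, frequencies):
        running += freq
        points.append((upper, running / total))
    return points


points = ogive_points(classes, frequencies)
for age, cumulative in points:
    print(f"({age}, {cumulative:.2f})")
```

Connecting consecutive points in this list with line segments gives the ogive; any plotting tool can be used for that step.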

A random sample of game scores from Abby Sciuto’s evening bowling league with Sister Rosita was obtained, and the data are given on the text website. BOWLING
a. Construct a frequency distribution for these data.
b. Draw the resulting ogive for these data.

2.108 Public Health and Nutrition A doughnut graph is another graphical representation of a frequency distribution for categorical data. To construct a doughnut graph:
1. Divide a (flat) doughnut (or washer) into pieces, so that each piece (bite of the doughnut) corresponds to a class.
2. The size of each piece is measured by the angle made at the center of the doughnut. To compute the angle of each piece, multiply the relative frequency times 360° (the number of degrees in a whole, or complete, circle).

The manager at a Whole Foods Market obtained a random sample of customers who purchased at least one popular herb (for cooking or medicinal purposes). Figure 2.54 and 2.55 show a frequency distribution and the corresponding doughnut graph.

Herb              Frequency   Relative frequency
Echinacea          25         0.125
Ephedra            15         0.075
Feverfew           20         0.100
Garlic             35         0.175
Ginkgo             40         0.200
Kava               30         0.150
Saw palmetto       20         0.100
St. John’s wort    15         0.075
Total             200         1.000

Figure 2.54 Frequency distribution.

[Doughnut graph with one piece per herb: Echinacea 12.5%, Ephedra 7.5%, Feverfew 10%, Garlic 17.5%, Ginkgo 20%, Kava 15%, Saw palmetto 10%, St. John's wort 7.5%.]

Figure 2.55 Resulting doughnut graph.
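The angle rule for a doughnut graph (relative frequency times 360 degrees) can be illustrated with the herb data in Figure 2.54. The following Python sketch is an assumption-laden illustration, not a prescribed method; the function name doughnut_angles is hypothetical.

```python
# Herb frequencies from Figure 2.54.
herbs = {
    "Echinacea": 25, "Ephedra": 15, "Feverfew": 20, "Garlic": 35,
    "Ginkgo": 40, "Kava": 30, "Saw palmetto": 20, "St. John's wort": 15,
}


def doughnut_angles(frequencies):
    """Return the central angle (in degrees) of each piece:
    relative frequency multiplied by 360."""
    total = sum(frequencies.values())
    return {name: freq / total * 360 for name, freq in frequencies.items()}


angles = doughnut_angles(herbs)
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

Because the relative frequencies sum to 1, the angles sum to 360 degrees, which is a quick check that no class was omitted.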

A random sample of house fires in Bismarck, North Dakota, was selected and the cause of each was recorded. The resulting data are shown in the following table.

Class                            Frequency
Smoking or smoking materials      70
Heating equipment                 85
Cooking and cooking equipment    205
Children playing with matches    105
Arson / suspicious                35

a. Find the relative frequency for each class.
b. Draw a doughnut graph for these data.

LAST STEP

2.109 Can the Florida Everglades be saved? In January 2013, the Florida Fish and Wildlife Conservation Commission started the Python Challenge. The purpose of the contest was to thin the python population, which could be tens of thousands, and help save the natural wildlife in the Everglades. At the end of the competition, 68 Burmese pythons had been harvested. Suppose a random sample of pythons captured during the Challenge was obtained and the length (in feet) of each is given in the following table: PYTHON

 9.3   3.5   5.2   8.3   4.6  11.1  10.5   3.7   2.8   5.9
 7.4  14.2  13.6   8.3   7.5   5.2   6.4  12.0  10.7   4.0
11.1   3.7   7.0  12.2   5.2   8.1   4.2   6.1   6.3  13.2
 3.9   6.7   3.3   8.3  10.9   9.5   9.4   4.3   4.6   5.8
 4.1   5.2   4.7   5.8   6.4   3.8   7.1   4.6   7.5   6.0

a. Construct a frequency distribution, stem-and-leaf plot, and histogram for these data.
b. Use these tabular and graphical techniques to describe the shape, center, and spread of this distribution, and to identify any outlying values.

3 Numerical Summary Measures

Looking Back
■ Be familiar with several common tabular and graphical summary procedures.
■ Be able to construct a bar chart, pie chart, frequency distribution, stem-and-leaf plot, and histogram.

Looking Forward
■ Learn how to compute and interpret common numerical summary measures that describe central tendency, variability, or relative standing.
■ Learn how to measure distance in statistics.
■ Find a five-number summary and construct box plots.

How efficient is the Canadian Pacific Railway?

The Canadian Pacific Railway (CPR) was incorporated in 1881 and played an important role in the development of western Canada. It is primarily a freight railway with over 14,000 miles of track. To increase efficiency, officials at CPR monitor several variables, including train speed, cars on each train, and terminal dwell time. In addition, the type and amount of freight is carefully recorded for each train. The following table shows the number of carloads of grain mill products for 30 randomly selected weeks in 2011 and 2012.¹

572  711  582  663  612  577  650  550  590  659
610  611  557  685  683  629  626  637  634  723
718  707  673  697  808  755  438  569  684  637

The procedures presented in this chapter will be used to describe the center and variability of these data, and to search for any unusual observations.

CONTENTS
3.1 Measures of Central Tendency
3.2 Measures of Variability
3.3 The Empirical Rule and Measures of Relative Standing
3.4 Five-Number Summary and Box Plots


3.1 Measures of Central Tendency

As we learned in Chapter 2, tabular and graphical procedures provide some very useful summaries of data. However, these techniques are not sufficient for statistical inference. For example, because there are no definite rules for constructing a histogram, two people may construct very different looking displays for the same data, which could lead to different conclusions. The numerical summary measures presented in this chapter are more precise, combine information from the data into a single number, and allow us to draw a conclusion about an entire population. The two most common types of numerical summary measures describe the center and the variability of the data.

A numerical summary measure is a single number computed from a sample that conveys a specific characteristic of the entire sample.

Measures of central tendency indicate where the majority of the data is centered, bunched, or clustered.

There are many different measures of central tendency. They all combine information from a sample into a single number, and each has advantages and disadvantages. To properly define and understand numerical summary measures, the following notation will be used.

$x$: This stands for a specific, fixed observation on a variable. In general, lowercase letters are used to represent observations on a variable; $y$ and $z$ are also commonly used. Note: A capital, or uppercase, $X$ has a very different meaning (introduced in Chapter 5).

$n$: This is usually used to denote the number of observations in a data set, or the sample size. If there are two relevant data sets, then $m$ and $n$ may be used to denote their sample sizes. Or, if there are two (or more) relevant data sets, then $n_1, n_2, n_3, \ldots$ may be used to denote their sample sizes.

$x_1, x_2, x_3, \ldots, x_n$: This refers to a set of fixed observations on a variable. The subscripts indicate the order in which the observations were selected, not magnitude. For example, $x_5$ is the fifth observation drawn from a population, not the fifth largest. The three dots, $\ldots$, mean the list continues in the same manner.

$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$: This is an example of summation notation, often used to write long mathematical expressions more concisely. Here, the sum of $n$ observations can be written more compactly by using the notation on the left side. $\Sigma$ is the Greek capital letter sigma; $i$ is the index of summation; 1 is the lower bound; and $n$ is the upper bound. To make the notation more compact and less threatening, we will usually omit the subscript $i = 1$ and superscript $n$. Unless specifically indicated, each summation applies to all values of the variable. For example, the following notation is used to represent the sum of each squared observation: $\sum x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$.

The following example illustrates the use of this notation and some of the computations used throughout this text.

Example 3.1 Sum Practice DATA SET EG3.1

Suppose $x_1 = 5$, $x_2 = 9$, $x_3 = 12$, $x_4 = -6$, $x_5 = 17$, and $x_6 = -2$. Compute the following sums:

a. $\left(\sum x_i\right)^2$
b. $\sum x_i^2$
c. $\sum (x_i - 7)^2$

SOLUTION

In each case, $i$ is the index of summation, 1 is the lower bound, and 6 is the upper bound. Apply the definition of summation notation to each expression.

a. In words, expression (a) says add all of the observations, and square the result.

$$
\begin{aligned}
\left(\sum x_i\right)^2 &= (x_1 + x_2 + x_3 + x_4 + x_5 + x_6)^2 && \text{Expand summation notation.} \\
&= [5 + 9 + 12 + (-6) + 17 + (-2)]^2 && \text{Use given data.} \\
&= (35)^2 = 1225 && \text{Add, and square the sum.}
\end{aligned}
$$

b. In words, expression (b) says square each observation, and add the resulting values.

$$
\begin{aligned}
\sum x_i^2 &= x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + x_6^2 && \text{Expand summation notation.} \\
&= (5)^2 + (9)^2 + (12)^2 + (-6)^2 + (17)^2 + (-2)^2 && \text{Use given data.} \\
&= 25 + 81 + 144 + 36 + 289 + 4 && \text{Square each observation.} \\
&= 579 && \text{Add.}
\end{aligned}
$$

c. In words, expression (c) says subtract 7 from each observation, square each difference, and add the resulting values.

$$
\begin{aligned}
\sum (x_i - 7)^2 &= (x_1 - 7)^2 + (x_2 - 7)^2 + (x_3 - 7)^2 + (x_4 - 7)^2 + (x_5 - 7)^2 + (x_6 - 7)^2 && \text{Expand summation notation.} \\
&= (5 - 7)^2 + (9 - 7)^2 + (12 - 7)^2 + (-6 - 7)^2 + (17 - 7)^2 + (-2 - 7)^2 && \text{Use given data.} \\
&= (-2)^2 + (2)^2 + (5)^2 + (-13)^2 + (10)^2 + (-9)^2 && \text{Compute each difference.} \\
&= 4 + 4 + 25 + 169 + 100 + 81 = 383 && \text{Square each difference, and add.}
\end{aligned}
$$

TRY IT NOW GO TO EXERCISE 3.2
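The three sums in Example 3.1 differ only in the order of the operations, which is easy to check by machine. The following is a minimal Python sketch (the variable names are illustrative, not part of the text) that mirrors the wording of parts (a), (b), and (c).

```python
# Observations from Example 3.1.
x = [5, 9, 12, -6, 17, -2]

# (a) Add all of the observations, then square the result.
square_of_sum = sum(x) ** 2

# (b) Square each observation, then add the resulting values.
sum_of_squares = sum(xi ** 2 for xi in x)

# (c) Subtract 7 from each observation, square each difference, then add.
sum_sq_dev_from_7 = sum((xi - 7) ** 2 for xi in x)

print(square_of_sum, sum_of_squares, sum_sq_dev_from_7)  # 1225 579 383
```

Note that parts (a) and (b) give very different answers: squaring the sum is not the same as summing the squares.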

The most common measure of central tendency is the sample, or arithmetic, mean.

Definition

The sample (arithmetic) mean, denoted $\bar{x}$, of the $n$ observations $x_1, x_2, \ldots, x_n$ is the sum of the observations divided by $n$. Written mathematically,

$$\bar{x} = \frac{1}{n}\sum x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad (3.1)$$

$\bar{x}$ is read as "x bar."

A CLOSER LOOK

1. The notation $\bar{x}$ is used to represent the sample mean for a set of observations denoted by $x_1, x_2, \ldots, x_n$. Similarly, $\bar{y}$ would represent the sample mean for a set of observations denoted by $y_1, y_2, \ldots, y_n$.
2. The population mean is denoted by $\mu$, the Greek letter mu.

Example 3.2 Base Camp Temperature DATA SET DENALI

Denali National Park and Preserve in Alaska covers over 6 million acres and includes the tallest mountain in North America, Mount McKinley. Over 1200 climbers reached the peak of Mount McKinley in 2012. The temperature (in degrees Fahrenheit) at the 7200-foot base camp for 12 randomly selected days is given in the following table.²

6  11  20  19  23  28  30  8  23  25  29  33

Find the sample mean temperature at the base camp.

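SOLUTION

Use Equation 3.1 to find the sample mean.

$$
\begin{aligned}
\bar{x} &= \frac{1}{12}\sum x_i = \frac{1}{12}(x_1 + x_2 + \cdots + x_{12}) && \text{Add all the numbers, and divide by } n = 12. \\
&= \frac{1}{12}(6 + 11 + 20 + 19 + 23 + 28 + 30 + 8 + 23 + 25 + 29 + 33) \\
&= \frac{1}{12}(255) = 21.25\,^{\circ}\mathrm{F}
\end{aligned}
$$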
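Equation 3.1 translates directly into code. The sketch below assumes Python; the function name `sample_mean` is an illustrative choice, not something defined in the text. It reproduces the base camp calculation.

```python
def sample_mean(observations):
    """Return the sum of the observations divided by n (Equation 3.1)."""
    n = len(observations)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(observations) / n

# Base camp temperatures (degrees Fahrenheit) from Example 3.2.
denali = [6, 11, 20, 19, 23, 28, 30, 8, 23, 25, 29, 33]
print(sample_mean(denali))  # 21.25
```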


Figure 3.1 shows the sample mean using CrunchIt!.

Figure 3.1 The sample mean using CrunchIt!.

A CLOSER LOOK

1. $\bar{x}$ is a sample characteristic. It describes the center of a fixed collection of data. There is no set rule to determine the number of included decimal places. Often, at least one extra decimal place to the right is used to write the result; then the sample mean has one more decimal place than the original data values.
2. The sample mean is an average. There are many other averages, for example, the geometric mean, the harmonic mean, a weighted mean, the median, and the mode. People usually associate the average with the sample mean.
3. $\mu$ is a population characteristic. It describes the center of an entire population. If the population happens to be of finite size $N$, then $\mu$ is the sum of all the values divided by $N$. Most populations of interest are infinite, or at least very large, and therefore $\mu$ is an unknown constant that cannot be measured. It seems reasonable to use $\bar{x}$ to estimate and draw conclusions about $\mu$.
4. The population mean $\mu$ is a fixed constant. $\bar{x}$ varies from sample to sample. It is reasonable to think that two sample means computed using samples from the same population should be close, but different.

If a data set contains outliers (observations very far away from the rest), then the sample mean may not be a very good measure of central tendency. An outlier has lots of influence on the sample mean, and tends to pull the mean in its direction. Example 3.3 shows how an outlier can affect the sample mean.

Example 3.3 Base Camp Temperature (Modified) DATA SET DENALI2

Modify the data in Example 3.2: Suppose one temperature at the base camp was 72, not 33. So the data set is now

6  11  20  19  23  28  30  8  23  25  29  72

The observation 72 is an obvious outlier. The new sample mean is

$$
\begin{aligned}
\bar{y} &= \frac{1}{12}(6 + 11 + 20 + 19 + 23 + 28 + 30 + 8 + 23 + 25 + 29 + 72) \\
&= \frac{1}{12}(294) = 24.5\,^{\circ}\mathrm{F}
\end{aligned}
$$

Because $\bar{x} = 21.25$, $\bar{y} > \bar{x}$. The sample mean is pulled in the direction of the outlier, and is therefore not necessarily an adequate measure of central tendency.

The sample median is another measure of central tendency that is not as sensitive to outlying values.

Definition

The sample median, denoted $\tilde{x}$, of the $n$ observations $x_1, x_2, \ldots, x_n$ is the middle number when the observations are arranged in order from smallest to largest.
1. If $n$ is odd, the sample median is the single middle value.
2. If $n$ is even, the sample median is the mean of the two middle values.

$\tilde{x}$ is read as "x tilde."


A CLOSER LOOK

1. The median divides the data set into two parts, so that half of the observations lie below and half lie above the median.
2. Only one calculation is necessary to find the median (no calculations are needed if $n$ is odd). Put the observations in ascending order of magnitude (not the order in which the observations were selected), and find the middle value.
3. Similarly, $\tilde{y}$ represents the sample median for a set of observations denoted by $y_1, y_2, \ldots, y_n$.
4. The population median is denoted by $\tilde{\mu}$.

Example 3.4 Median Calculations

The following three examples show how to find the median under various circumstances, and the effect of an outlying value. The observations are already arranged in order from smallest to largest.

a. Observations: 10 11 14 16 17
   Median: There are $n = 5$ observations. The middle number is in the third position. $\tilde{x} = 14$.

b. Observations: 10 11 14 16 57
   Median: There are still $n = 5$ observations. The middle number is in the third position, and $\tilde{x} = 14$. The outlier 57 does not affect the median.

c. Observations: 10 11 14 16 17 20
   Median: There are $n = 6$ observations. There is no single middle value. The median is the mean of the observations in the third and fourth positions. $\tilde{x} = \frac{1}{2}(14 + 16) = 15$.

Example 3.5 Nonfarm Employment DATA SET PUBLISH

The number of people employed in the publishing industry in Oregon over the last 12 years is given in the following table. Find the median number of people employed.

16,100  16,900  15,200  13,900  13,500  14,300
15,200  15,900  15,700  14,500  14,100  14,000

Source: Oregon Employment Department.

SOLUTION

STEP 1 Arrange the observations in order.

13,500  13,900  14,000  14,100  14,300  14,500  15,200  15,200  15,700  15,900  16,100  16,900

STEP 2 There are $n = 12$ observations. The median is the mean of the two middle values (in the sixth and seventh positions).

$$\tilde{x} = \frac{1}{2}(14{,}500 + 15{,}200) = 14{,}850$$

Figure 3.2 shows a technology solution.

TRY IT NOW GO TO EXERCISE 3.13
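The odd and even cases in the definition of the sample median can be sketched as a short Python function. The name `sample_median` is illustrative. The last two lines revisit the base camp data from Examples 3.2 and 3.3 to show that replacing 33 with the outlier 72 moves the mean but leaves the median where it was.

```python
def sample_median(observations):
    """Return the middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(observations)   # order by magnitude, not by selection order
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the two middle values

# Example 3.4
print(sample_median([10, 11, 14, 16, 17]))      # 14
print(sample_median([10, 11, 14, 16, 57]))      # 14
print(sample_median([10, 11, 14, 16, 17, 20]))  # 15.0

# Example 3.5
publish = [16100, 16900, 15200, 13900, 13500, 14300,
           15200, 15900, 15700, 14500, 14100, 14000]
print(sample_median(publish))                   # 14850.0

# Examples 3.2 and 3.3: the outlier changes the mean, not the median.
denali = [6, 11, 20, 19, 23, 28, 30, 8, 23, 25, 29, 33]
denali2 = denali[:-1] + [72]
print(sample_median(denali), sample_median(denali2))  # 23.0 23.0
```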

Figure 3.2 The sample median and other summary statistics found using JMP.

A CLOSER LOOK

1. In general, the sample mean is not equal to the sample median, $\bar{x} \neq \tilde{x}$. If the distribution of the sample is symmetric, then $\bar{x} = \tilde{x}$. If the sample distribution is approximately symmetric, then $\bar{x} \approx \tilde{x}$.
2. In general, the population mean is not equal to the population median, $\mu \neq \tilde{\mu}$. If the distribution of the population is symmetric, then $\mu = \tilde{\mu}$.
3. The relative positions of $\bar{x}$ and $\tilde{x}$ suggest the shape of a distribution. The smoothed histograms in Figures 3.3–3.5 illustrate three possibilities:
   a. If $\bar{x} > \tilde{x}$, the distribution of the sample is positively skewed, or skewed to the right (Figure 3.3).
   b. If $\bar{x} \approx \tilde{x}$, the distribution of the sample is approximately symmetric (Figure 3.4).
   c. If $\bar{x} < \tilde{x}$, the distribution of the sample is negatively skewed, or skewed to the left (Figure 3.5).

Recall: A histogram consists of rectangles drawn above each class with height proportional to frequency or relative frequency. We draw a curve along the tops of the rectangles to smooth out the histogram and display an enhanced graphical representation of the distribution.

STATISTICAL APPLET: MEAN AND MEDIAN
STEPPED TUTORIALS: MEASURES OF CENTER: MEAN AND MEDIAN; BOX PLOTS

Figure 3.3 Positively skewed distribution ($\tilde{x}$ lies to the left of $\bar{x}$).

Figure 3.4 Approximately symmetric distribution ($\tilde{x} \approx \bar{x}$).

Figure 3.5 Negatively skewed distribution ($\bar{x}$ lies to the left of $\tilde{x}$).
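The comparison of the sample mean with the sample median can be sketched in Python as follows. The function name `suggested_shape` and the `tolerance` parameter are assumptions made for illustration: the text does not say how close the two values must be to count as approximately equal, so the cutoff is left to the user.

```python
from statistics import mean, median

def suggested_shape(observations, tolerance=0.0):
    """Suggest a shape from the relative positions of the mean and the median.

    `tolerance` is a hypothetical cutoff for "approximately equal";
    it is an assumption of this sketch, not a rule from the text.
    """
    x_bar = mean(observations)
    x_tilde = median(observations)
    if abs(x_bar - x_tilde) <= tolerance:
        return "approximately symmetric"
    if x_bar > x_tilde:
        return "positively skewed"
    return "negatively skewed"

# Modified base camp data (Example 3.3): mean 24.5, median 23.
denali2 = [6, 11, 20, 19, 23, 28, 30, 8, 23, 25, 29, 72]
print(suggested_shape(denali2))  # positively skewed
```

This is only a suggestion about shape. A histogram or stem-and-leaf plot of the data should still be examined.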

Because the sample mean is extremely sensitive to outliers, and the sample median is very insensitive to outliers, it seems reasonable to search for a compromise measure of central tendency. A trimmed mean is moderately sensitive to outliers.

Definition

A 100p% trimmed mean, denoted $\bar{x}_{tr(p)}$, of the $n$ observations $x_1, x_2, \ldots, x_n$ is the sample mean of the trimmed data set.
1. Order the observations from smallest to largest.
2. Delete, or trim, the smallest 100p% and the largest 100p% of the observations from the data set.
3. Compute the sample mean for the remaining data.

100p is the trimming percentage, the percentage of observations deleted from each end of the ordered list.

A CLOSER LOOK

1. We compute a trimmed mean by deleting the smallest and largest values, which are possible outliers. Some statisticians believe that deleting any data is a bad idea, because every observation contributes to the big picture.
2. A 100p% trimmed mean is computed by deleting the smallest 100p% and the largest 100p% of the observations. Therefore, 2(100p)% of the observations are removed.
3. There is no set rule for determining the value of $p$. It seems reasonable to delete only a few observations, and to select $p$ so that $np$ (the number of observations deleted from each end of the ordered data) is an integer.
4. Here is a specific example using the notation: $\bar{x}_{tr(0.05)}$ is a (100)(0.05) = 5% trimmed mean. In this example, 10% of the observations are discarded.

Example 3.6 Overtime and Stress DATA SET STRESS

According to an article in The Guardian,³ Americans spend more time at their jobs than workers in Germany do. Dr. Paul Landsbergis, an epidemiologist at Mt. Sinai Medical Center, studies job stress, and he warns that too many overtime hours may increase the chance of heart disease. Suppose the following December overtime hours for tellers at the Kaw Valley State Bank and Trust Company in Topeka, Kansas, were obtained. Find a 10% trimmed mean.

0.2  0.8  1.5  1.5  1.6  1.7  1.7  1.8  2.0  2.0
2.2  2.5  2.7  2.7  3.0  3.0  3.2  3.5  4.0  5.0

SOLUTION

STEP 1 The trimming percentage is 10%, so $p = 10/100 = 0.10$. Find the number of observations to delete from each end of the ordered list. There are $n = 20$ observations.

$$np = (20)(0.10) = 2 \qquad \text{Trim 2 observations from each end.}$$

Note that $np$ may not be an integer. Computer software packages have algorithms for dealing with this problem.

STEP 2 Delete the two smallest observations (0.2 and 0.8) and the two largest observations (4.0 and 5.0). The resulting data set is

1.5  1.5  1.6  1.7  1.7  1.8  2.0  2.0  2.2  2.5  2.7  2.7  3.0  3.0  3.2  3.5

STEP 3 Find the sample mean for the remaining data.

$$\bar{x}_{tr(0.10)} = \frac{1}{16}(1.5 + 1.5 + 1.6 + \cdots + 3.0 + 3.2 + 3.5) = \frac{1}{16}(36.6) \approx 2.29$$

2.29 hours is the 10% trimmed mean.

Figure 3.6 Calculation of a trimmed mean using Excel.

TRY IT NOW GO TO EXERCISE 3.15
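The three steps in the definition of a trimmed mean can be sketched in Python. The function name `trimmed_mean` is illustrative, and the decision to raise an error when $np$ is not an integer is an assumption of this sketch; as noted above, software packages handle that case with their own algorithms.

```python
def trimmed_mean(observations, p):
    """Return the 100p% trimmed mean: drop n*p values from each end, then average."""
    if not 0 <= p < 0.5:
        raise ValueError("p must satisfy 0 <= p < 0.5")
    ordered = sorted(observations)           # Step 1: order the observations
    n = len(ordered)
    if n == 0:
        raise ValueError("the trimmed mean is undefined for an empty sample")
    k = round(n * p)                         # number trimmed from each end
    if abs(n * p - k) > 1e-9:
        # Assumption of this sketch: refuse a non-integer n*p rather than guess a rule.
        raise ValueError("n * p must be an integer in this sketch")
    remaining = ordered[k:n - k]             # Step 2: trim both ends
    return sum(remaining) / len(remaining)   # Step 3: mean of the remaining data

# December overtime hours from Example 3.6.
overtime = [0.2, 0.8, 1.5, 1.5, 1.6, 1.7, 1.7, 1.8, 2.0, 2.0,
            2.2, 2.5, 2.7, 2.7, 3.0, 3.0, 3.2, 3.5, 4.0, 5.0]
print(round(trimmed_mean(overtime, 0.10), 2))  # 2.29
```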

Another commonly used measure of central tendency is the mode.

Definition The mode, denoted M, of the n observations x1, x2, . . . , xn is the value that occurs most often, or with the greatest frequency. If all the observations occur with the same frequency, then the mode does not exist. If two or more observations occur with the same greatest frequency, then the mode is not unique. If there are two modes, the distribution is bimodal, three modes, trimodal, etc.

Figure 3.7 We expect the mode M of a sample from this distribution to be near the population mean $\mu$.

The mode is easy to compute and, intuitively, it does return a reasonable measure of central tendency. For example, consider a bell-shaped distribution. A random sample from this distribution should contain lots of (identical) values near the center. Therefore, the mode should suggest the middle of the distribution (Figure 3.7). For symmetric distributions, the mean, the median, and the mode will be about the same.
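The cases in the definition of the mode (no mode, one mode, several modes) can be sketched in Python with a frequency count. The function name `sample_modes` is illustrative. Returning an empty list when the mode does not exist, and treating a sample with a single distinct value as having that value as its mode, are assumptions of this sketch.

```python
from collections import Counter

def sample_modes(observations):
    """Return a sorted list of the most frequent values; an empty list if no mode exists."""
    counts = Counter(observations)
    if not counts:
        return []
    highest = max(counts.values())
    # If every distinct value occurs equally often, the mode does not exist.
    # (A sample with one distinct value is treated as having a mode: an assumption.)
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

denali = [6, 11, 20, 19, 23, 28, 30, 8, 23, 25, 29, 33]
print(sample_modes(denali))                # [23]
print(sample_modes([10, 11, 14, 16, 17]))  # []
print(sample_modes([1, 1, 2, 2, 3]))       # [1, 2]
```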

A CLOSER LOOK

1. Sometimes only a data summary table, or grouped data, is available. Let $x_1, x_2, \ldots, x_k$ be a set of (representative) observations with corresponding frequencies $f_1, f_2, \ldots, f_k$. For example, $x_7$ occurs $f_7$ times. The total number of observations is $n = \sum f_i$. If the data are grouped, there are corresponding formulas for the (approximate) measures of central tendency defined above.
2. Remember, there are many other averages, for example, the weighted mean, the geometric mean, and the harmonic mean.
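The text does not state the grouped-data formulas. One standard version for the mean weights each representative value by its frequency, $\bar{x} \approx \frac{1}{n}\sum f_i x_i$ with $n = \sum f_i$. The Python sketch below uses that formula; the function name `grouped_mean` and the small data set are hypothetical, chosen only to show the arithmetic.

```python
def grouped_mean(values, frequencies):
    """Approximate sample mean from representative values and their frequencies."""
    if len(values) != len(frequencies):
        raise ValueError("each value needs exactly one frequency")
    n = sum(frequencies)   # total number of observations, n = sum of f_i
    if n == 0:
        raise ValueError("the total frequency must be positive")
    return sum(f * x for x, f in zip(values, frequencies)) / n

# Hypothetical grouped data: representative values and how often each occurs.
values = [5, 15, 25]
frequencies = [2, 5, 3]
print(grouped_mean(values, frequencies))  # 16.0
```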

The remainder of this section describes summary measures for qualitative data.

The natural summary measures for observations on a qualitative variable are simply the frequency and relative frequency of occurrence for each category. We have already done this! Recall Example 2.4, in which 25 cruise ships were randomly selected and the destination of each ship was recorded. Each response was categorical (destination), and the data were summarized in a table listing only category, frequency of occurrence for each category, and relative frequency of occurrence for each category.

Suppose now that the commuter students at a small college are asked to complete a survey to identify the make of car they use to drive to school. Numerical summary measures for this categorical variable should include frequencies and relative frequencies, or proportions, as shown in the following table. (The cumulative relative frequency is used only for numerical data sets, and doesn’t really make sense here because there is no natural ordering.)

Category     Frequency   Relative frequency
Buick           137          0.0938
Chevrolet       288          0.1973
Ford            202          0.1384
Honda           336          0.2301
Mazda           175          0.1199
Saturn          322          0.2205
Total          1460          1.0000
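Each relative frequency in the table is the category frequency divided by the total number of responses. A minimal Python sketch of that calculation, using the frequencies above (the variable names are illustrative):

```python
# Frequencies from the car make survey table.
frequencies = {
    "Buick": 137, "Chevrolet": 288, "Ford": 202,
    "Honda": 336, "Mazda": 175, "Saturn": 322,
}

n = sum(frequencies.values())   # total number of responses
relative = {make: count / n for make, count in frequencies.items()}

for make, count in frequencies.items():
    print(f"{make:<10}{count:>6}{relative[make]:>10.4f}")
print(f"{'Total':<10}{n:>6}{sum(relative.values()):>10.4f}")
```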

A dichotomous or Bernoulli variable is a special categorical variable that has only two possible responses. One response is often associated with, or called, a success, denoted S, and the other response is called a failure, denoted F. The two possible actual responses are ignored. For example, suppose a medical researcher selects children at random and asks them all whether they have had an ear infection within the past year. The response had an ear infection might be a success, and had no ear infection would be a failure. The same numerical measures are used to summarize observations on this kind of categorical variable: frequency and relative frequency of occurrence for each response. The relative frequency of successes has a special name.

Definition

For observations on a categorical variable with only two responses, the sample proportion of successes, denoted $\hat{p}$, is the relative frequency of occurrence of successes:

$$\hat{p} = \frac{n(S)}{n} = \frac{\text{number of S's in the sample}}{\text{total number of responses}} \qquad (3.2)$$

$\hat{p}$ is read as "p hat."

A CLOSER LOOK

The symbol $p$ is used in notation to represent several quantities: the population proportion of successes, the sample proportion of successes, and in the definition of the trimmed mean. The context in which the notation is used implies the appropriate concept.

1. The population proportion of successes is denoted by $p$.
2. The success response is not necessarily associated with a good thing. For example, a researcher may be interested in the proportion of laboratory animals that die when they are exposed to a certain toxic chemical. A success may be associated with the death of an animal.
3. The sample proportion of successes $\hat{p}$ can be thought of as a sample mean in disguise. Suppose every S is changed to a 1, and every F to a 0. The sample mean for this new numerical data is

$$\bar{x} = \frac{1}{n}(\text{a sum of 0's and 1's}) = \frac{n(S)}{n} = \hat{p}$$


Example 3.7 Seatbelt Checkpoint DATA SET SEATBELT

In many states it is against the law to drive without a fastened seatbelt. The State Police recently established a checkpoint along a heavily traveled road. A success was recorded for a driver wearing a seatbelt, and a failure recorded otherwise. The observations from this checkpoint are given in the following table.

S  S  F  F  S  S  F  S  F  S  F  S  S  S
S  F  S  S  S  S  F  S  S  F  S  F  S  F

The sample contains 28 observations and 18 successes. The sample proportion of successes is

$$\hat{p} = \frac{n(S)}{n} = \frac{18}{28} = 0.6429$$

Approximately 64% of the drivers stopped at the checkpoint were wearing their seatbelts. It is reasonable to assume the value of $\hat{p}$ is close to the population proportion of successes, which in this example is the true proportion of drivers who wear a seatbelt.

TRY IT NOW GO TO EXERCISE 3.10
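The checkpoint calculation, and the remark that $\hat{p}$ is a sample mean in disguise, can both be checked with a short Python sketch. The variable names are illustrative; the two strings hold the rows of the seatbelt table.

```python
# Seatbelt checkpoint observations from Example 3.7 (two rows of 14).
responses = list("SSFFSSFSFSFSSS" "SFSSSSFSSFSFSF")

n = len(responses)
n_success = responses.count("S")
p_hat = n_success / n                  # Equation 3.2

# Recode S as 1 and F as 0; the sample mean of the recoded data equals p-hat.
indicators = [1 if r == "S" else 0 for r in responses]
x_bar = sum(indicators) / n

print(n, n_success, round(p_hat, 4))   # 28 18 0.6429
print(p_hat == x_bar)                  # True
```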

Technology Corner

Procedure: Compute the sample mean, sample median, a trimmed mean, and the mode.
Reconsider: Example 3.2, solution, and interpretations.

CrunchIt!
CrunchIt! has a built-in function to find certain descriptive statistics, including the sample mean and the sample median. There is no built-in function to compute a trimmed mean or a sample mode.
1. Enter the data into a column.
2. Select Statistics; Descriptive Statistics. Choose the appropriate column and click the Calculate button. See Figure 3.8.

Figure 3.8 CrunchIt! descriptive statistics.

TI-84 Plus C
There are several ways to find the sample mean and the sample median using the graphing calculator. There is no built-in function to compute a trimmed mean or a sample mode.
1. Enter the data into list L1.
2. Use the command LIST ; MATH; mean to compute the sample mean. Use the command LIST ; MATH; median to compute the sample median. See Figure 3.9.
3. The function STAT ; CALC; 1-Var Stats returns several summary statistics. The sample mean is on the first output screen and the sample median is on the second, denoted by Med. See Figures 3.10 and 3.11.


Figure 3.9 The sample mean and the sample median using built-in calculator functions.

Figure 3.10 The sample mean is part of the output from the 1-Var Stats function.

Figure 3.11 The second output screen from 1-Var Stats shows the sample median (Med).

Minitab
There are several ways to find the summary statistics using Minitab. In addition to the general Describe command, there are Calc; Column statistics functions, Calc; Calculator functions, and various macros.
1. Enter the data into column C1.
2. Select Stat; Basic Statistics; Display Descriptive Statistics. Enter C1 in the Variables window.
3. Choose the Statistics option button and check the summary statistics Mean, Median, Mode, and Trimmed mean. Note: Minitab computes only a 5% trimmed mean. Other macros allow any percentage. See Figure 3.12.

Figure 3.12 Minitab descriptive statistics.

Excel
Excel has built-in functions for these four descriptive statistics. Under the Data tab, Data Analysis; Descriptive Statistics can also be used to compute several summary statistics simultaneously.
1. Enter the data into column A.
2. Use the appropriate Excel function to compute the sample mean, sample median, trimmed mean, and mode. Note: the second argument in TRIMMEAN is the total proportion of data trimmed. Excel rounds the number of trimmed observations down to the nearest multiple of 2. See Figure 3.13.

Figure 3.13 Excel descriptive statistics.
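The rounding rule in the Excel note can be made concrete with a small Python sketch. This is an illustration of the rule as described above, not Excel's actual implementation; the function names `excel_style_trim_count` and `excel_style_trimmed_mean` are hypothetical, and the small constant added before rounding down is an assumption to guard against floating-point error.

```python
import math

def excel_style_trim_count(n, percent):
    """Observations removed from EACH end when `percent` is the total proportion trimmed."""
    total = math.floor(n * percent + 1e-9)  # total number of points to exclude
    total -= total % 2                      # round down to the nearest multiple of 2
    return total // 2                       # half from each end

def excel_style_trimmed_mean(observations, percent):
    """Trimmed mean using the rounding rule sketched above."""
    ordered = sorted(observations)
    k = excel_style_trim_count(len(ordered), percent)
    remaining = ordered[k:len(ordered) - k]
    return sum(remaining) / len(remaining)

# Overtime hours from Example 3.6: a 10% trimmed mean trims a total proportion of 0.20.
overtime = [0.2, 0.8, 1.5, 1.5, 1.6, 1.7, 1.7, 1.8, 2.0, 2.0,
            2.2, 2.5, 2.7, 2.7, 3.0, 3.0, 3.2, 3.5, 4.0, 5.0]
print(excel_style_trim_count(20, 0.20))                    # 2
print(round(excel_style_trimmed_mean(overtime, 0.20), 2))  # 2.29
print(excel_style_trim_count(30, 0.10))                    # 1
```

In the last line, 10% of 30 observations is 3, which is rounded down to 2, so only one observation is removed from each end.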


SECTION 3.1 EXERCISES

Concept Check

3.1 Fill in the Blank
a. The two most common types of numerical summary measures describe the _____________ and the _____________ of the data.
b. Measures of central tendency suggest where the data is _____________.

3.2 True/False
a. The sample mean and the population mean are always the same value.
b. The sample mean and the sample median can be the same value.
c. The sample mean is sensitive to outliers.
d. When computing a trimmed mean, we discard the same number of observations from each end of the ordered list.
e. The mode may not exist for a specific data set.
f. It is reasonable to assume that the sample proportion of successes is close to the population proportion of successes.

Practice

3.3 Compute each summation using the following random sample. DATA SET EX3.3

$x_1 = -15 \quad x_2 = 6 \quad x_3 = 40 \quad x_4 = 13 \quad x_5 = 38$

a. $\sum x_i$
b. $\sum x_i^2$
c. $\sum (x_i - 10)$
d. $\sum (x_i - 5)^2$
e. $\sum (2x_i)$
f. $2\sum x_i$

3.4 Suppose the following random sample is obtained. DATA SET EX3.4

43.3  52.7  67.7  52.1  54.7
54.1  46.7  47.2  48.5  45.8

Compute the following sums.
a. $\sum x_i^2$
b. $\sum x_i^3$
c. $\sum (x_i - 50)^2$
d. $\left(\sum x_i\right)^2$
e. $\sum (x_i - 51.28)$
f. $\sum \dfrac{x_i}{7}$

3.5 Compute the mean for each sample with known sum.
a. $\sum x_i = 1057$, $n = 10$
b. $\sum x_i = 356$, $n = 27$
c. $\sum x_i = -50.5$, $n = 36$
d. $\sum x_i = 1.355$, $n = 11$
e. $\sum x_i = -37.4$, $n = 15$
f. $\sum x_i = 496.81$, $n = 28$

3.6 Find the position, or location, of the sample median in an ordered data set of size $n$.
a. $n = 22$
b. $n = 37$
c. $n = 117$
d. $n = 64$

3.7 Find the sample mean and the sample median for each data set. DATA SET EX3.7
a. 5, 3, 7, 9, 11, 5, 6, 7, 7
b. −7, 10, 25, 22, 36, −24, 0, 1, 12, 9, −11
c. 5.4, 3.3, 6.0, 10.1, 13.6, 7.7, 16.6, 28.9, 4.6
d. −103.7, −110.4, −109.1, −99.7, −115.6

3.8 Consider the data given in the following table. DATA SET EX3.8

5  7  8  27  3  15  7  6  4  5  5  1

Find the sample median. Note that this summary statistic is a better measure of central tendency than the sample mean for this data set. Why?

3.9 Use the values of the sample mean and the sample median to determine whether the distribution is symmetric, skewed to the left, or skewed to the right.
a. $\bar{x} = 37$, $\tilde{x} = 49$
b. $\bar{x} = 63.5$, $\tilde{x} = 62.75$
c. $\bar{x} = -37$, $\tilde{x} = -16$
d. $\bar{x} = -12.56$, $\tilde{x} = 12.56$

3.10 Compute the indicated trimmed mean for each data set. DATA SET EX3.10
a. {24, 36, 26, 30, 28, 35, 33, 33, 34, 27}  $\bar{x}_{tr(0.10)}$
b. {72, 76, 76, 77, 85, 76, 80, 86, 62, 70}  $\bar{x}_{tr(0.20)}$
c. {182, 169, 180, 166, 173, 101, 188, 124, 182, 137, 100, 137, 118, 111, 137, 181, 189, 130, 168, 133}  $\bar{x}_{tr(0.20)}$
d. {5.5, 7.5, 7.3, 6.4, 5.3, 9.5, 7.2, 5.8, 7.0, 6.7, 9.0, 8.1, 8.4, 5.8, 5.4, 7.2, 7.4, 7.5, 5.9, 7.5}  $\bar{x}_{tr(0.15)}$

3.11 Find the mode for each data set, if it exists. DATA SET EX3.11
a. 3, 5, 6, 7, 3, 4, 6, 6, 8, 11, 13, 2, 1
b. −17, −10, 0, 3, −5, 4.3, 12, 0, 5, −2.1, 1.7, −7
c. 6.6, 7.3, 5.2, 6.2, 8.3, 9.8, 4.1, 3.7

3.12 Find the sample proportion of successes for each data set. DATA SET EX3.12
a. S, F, S, F, F, F, F, F, S, S, S, F, F, S
b. F, S, S, F, S, F, F, S, S, S, S, S, S, S, S, F, S, S, S, S, S
c. S, F, S, F, F, F, F, F, S, S, S, F, F, S, F, F, S, F, S, S, S, S, F, F, F, F, F, F, F, S, S, F, S, F, F

Applications

3.13 Travel and Transportation Tractor trailers tend to exceed the speed limit (65 mph) on one downhill stretch of Route 80 in Pennsylvania. Using a radar gun, the following tractor trailer speeds (in mph) were observed. DATA SET RADAR

81  66  67  69  79  62  70  73  67  60  61
67  74  65  77  74  64  71  64  67  61

a. Find the sample mean, $\bar{x}$.
b. Find the sample median, $\tilde{x}$.
c. What do your answers to parts (a) and (b) suggest about the shape of the distribution of speeds?

3.14 Biology and Environmental Science The 2012 estimated wheat production (in 1000 metric tons) for several countries is given on the text website.⁴ DATA SET WHEAT
a. Find the sample mean and the sample median for these data.
b. What do the summary statistics in part (a) suggest about the shape of the distribution of wheat yield?

3.15 Technology and the Internet Steam is a software platform used to distribute and manage multiplayer online games. A day was selected at random and the number of simultaneous peak users for certain games is given in the following table.⁵ DATA SET GAMES

930  858  827  849  744  849  763  753  781
948  678  742  769  754  782  862  894  861

a. Find the sample mean and the sample median.
b. Suppose the last observation had been 2861 instead of 861. Find the sample mean and sample median for this revised data set. Explain how this change in the data affects the mean and median found in part (a).

3.16 Fuel Consumption and Cars The text website contains a table that lists the atmospheric CO2 concentration (in ppm) for 36 months ending in October 2012.⁶ DATA SET ATMOCO2
a. Find the sample mean and sample median for these data.
b. A certain group considers any monthly concentration less than 390 a success (not harmful to the environment). Find the sample proportion of successes.

3.17 Education and Child Development The Math SAT scores for all students in an introductory statistics class at Edinboro University are given on the text website. DATA SET EDINSAT
a. Find the sample mean and the sample median.
b. Find a 5% trimmed mean.
c. Using these three numerical summary measures, describe the shape of the distribution.

3.18 Manufacturing and Product Development A random sample of 12-ounce cans of Dr Pepper soda was obtained from E. M. Heaths supermarket. The exact amount of soda (in ounces) in each can was measured, and the data are given on the text website. DATA SET SODACAN
a. Find the sample mean and the sample median.
b. What do the summary statistics in part (a) suggest about the shape of the distribution of the amount of soda in each can?
c. Suppose any amount of 12 ounces or greater is considered a success. Find the sample proportion of successes.

3.19 Biology and Environmental Science There are approximately 10,000 commercial fishing harvesters in Maine, and these businesses contribute one-third of the value of all New England fisheries. Some of the species caught off the Maine coast include cod, salmon, and flounder. The number of pounds (in millions) caught by Maine fisheries for several recent years is given on the text website.⁷ DATA SET CATCH
a. Find the sample mean and the sample median for this data set.
b. Which statistic is a better measure of central tendency for these data? Justify your answer.

3.20 Sports and Leisure Some critics of Major League Baseball believe the ball is juiced (livelier) because it is manufactured to give hitters an advantage. To investigate this claim, a sample of the earned run average (ERA) for American League starting pitchers for the 2012 season was obtained.⁸ DATA SET PITCH
a. Find the sample mean and the sample median.
b. Suppose the pitcher with the highest ERA plays in Colorado, where the air is thin, and the home runs are many. To eliminate such outliers, find a 4% trimmed mean.
c. Find the mode for the original data set, if it exists.

3.21 Biology and Environmental Science The water temperature (in degrees Fahrenheit) during the summer of 2012 at several locations off the coast of Florida is given in the following table.⁹ DATA SET TEMP2012

79  80  80  80  80  81  79  82  84  86
81  81  83  84  84  84  80  81  83  84
84  85  86  86  79  84  79  80  79  79

a. Find the sample mean and the sample median.
b. Find a 10% trimmed mean for these data.
c. Find the mode for the original data, if it exists.

3.22 Education and Child Development An educational study was designed to compare cooperative learning versus traditional lecture style. Two sections of an introductory statistics class were used. Seven students were randomly selected from each section. The scores on the second test (a 30-item exam) are given in the following table. DATA SET EDSTYLE

Traditional   21  28  25  25  21  19  23
Cooperative   25  30  28  25  24  24  29

Which group of students did better on average? Justify your answer.¹⁰

Extended Applications

3.23 Sports and Leisure The gold medal in the women’s 10-meter platform diving at the 2012 Summer Olympics in London was won by Ruolin Chen from China. Many of the participants in this competition included a 407C, an inward 3½ somersault, as one of their dives. The text website contains the scores for some of these dives for various participants.¹¹ DATA SET DIVING
a. Find the sample mean and the sample median for this data set.
b. Find the mode, if it exists.
c. Multiply each score by the degree of difficulty, 3.2. Find the sample mean for this new data set. How does this sample mean compare with the sample mean found in part (a)?

3.24 Public Health and Nutrition Residents in the greater Toronto area have complained that there has been an increase in aircraft noise.¹² This increased noise may lead to sleep disturbances and other health problems. Suppose the noise level (in dBA) of several aircraft was measured using one flight path. The data are given in the following table. DATA SET AIRNOISE

75  72  71  65  63  72  70  68

a. Find the sample mean for these data.
b. Some researchers believe that the wind velocity gradient can add as much as 4 dB to each reading. Add 4 dB to each observation in the data set. Compute the new sample mean. How does this compare with the sample mean found in part (a)?

3.25 Travel and Transportation The following table contains the estimated unlinked light rail transit passenger trips (in thousands) for various cities during June 2012. DATA SET TRANSIT

 995.1  5132.7   190.9  1018.0  2593.8  4055.9
 849.6  1691.2    22.5   597.5  6415.3   752.4
 933.4  1426.4   409.9  1867.0   517.0   245.8
3605.5  2381.4   655.3   155.0  1822.0   896.2

Source: American Public Transportation Association.

transit trips.
b. Use each June observation to estimate the yearly number of trips. That is, multiply each observation by 12. Find the sample mean for this new data set. How does this sample mean compare with the sample mean found in part (a)?

3.26 Manufacturing and Product Development A new

quality control program was recently started at a Hyundai manufacturing facility. Several times each day, randomly selected panels from a stamping press are inspected for defects. A nondefective panel is a success (S). A defective panel is a failure (F) and must be restamped at an additional cost. During a recent inspection, the following 32 observations were recorded: S S S

S F S

S S S

F S

S S

F S

F S

S S

S S

S S

S F

S S

3.28 Medicine and Clinical Studies In a random sample of

13 patients with calcaneus bone fractures, the sample mean number of days until fracture healing was x 5 37.85 and the sample median was ~x 5 40. Suppose an additional patient is added to the sample so that x14 ! 44.5. a. Find the sample mean for all 14 patients. b. Is there any way to determine the sample median for all 14 patients? Explain. 3.29 Fuel Consumption and Cars The estimated oil

reserves (in millions of barrels) of four wells are given by x1 5 1078 x2 5 5833 x3 5 10,772 x4 5 7320 a. Find x5 so that the mean for all five observations is 6883.4. b. Find x5 so that the sample mean is equal to the sample

median. 3.30 Manufacturing and Product Development A

a. Find the sample mean number of unlinked light rail

S S S

85

Measures of Central Tendency

S S

a. Find the sample proportion of successes. b. Change each S to a 1, and each F to a 0. Find the sample

mean for these new data. How does the mean compare with the sample proportion of successes found in part (a)? c. Suppose 8 additional panels were selected and inspected (for a total of 40 panels). Is it possible for the sample proportion of successes to be 0.9? Why or why not? 3.27 Sports and Leisure The playing time for rookies in the

National Basketball Association (NBA) depends on many factors, including position and performance. A random sample of playing times per game (in minutes) for rookies in the NBA during the 2011–2012 season was obtained. The data are given ROOKIES in the following table. 34.2 30.5 29.4 29.4 25.5 23.1 20.2 19.5 10.5 18.9 18.6 16.7 15.2 15.0 14.6 13.5 13.2 12.8 Source: National Basketball Association.

a. Find the sample mean and the sample median. b. What do the summary statistics in part (a) suggest about

consumer group has tested the drying time for 15 samples of exterior latex paint. The sample mean drying time is 83.8 minutes. What must the 16th drying time be if the 16th observation decreases the mean drying time by 30 seconds? By 1 minute? 3.31 Biology and Environmental Science The beaches along the coast of New Hampshire are famous for chilly waters, even during the hottest summer days. A recent sample of the water temperature on 24 randomly selected summer days was obtained. The following temperatures are in degrees BEACH Fahrenheit ("F).

58 57 59

58 61 53

53 56 59

53 55 63

59 59

57 60

54 55

61 53

56 55

a. Find the sample mean and the sample median. b. Convert each temperature to degrees Celsius ("C). Use the 2 32 formula C 5 F 1.8 . Find the mean for all the water temperatures in degrees Celsius. c. What is the relationship between the sample means in parts (a) and (b)?

3.32 Technology and the Internet A recent survey of students at Minneapolis North High School included a question about the number of computers at home. The (grouped) data are HOMECOMP summarized below.

Number of computers

Frequency of occurrence

0 1 2 3 4 5

3 27 23 7 3 1

the shape of the distribution of playing time for rookies? c. Can you change the maximum observation (34.2) so that the

sample mean is equal to the sample median? Why or why not?

60 58

Find the sample mean and the sample median number of computers at home.

CHAPTER 3 Numerical Summary Measures

3.2 Measures of Variability

Measures of central tendency are only one characteristic of a data set. These numerical summary measures alone are not sufficient to describe a sample completely. It is possible to have two very different data sets with (approximately) the same mean (and median). Figures 3.14 and 3.15 show two smoothed histograms to illustrate the problem.

Figure 3.14 Sample 1: $x_1, x_2, \ldots, x_n$. The smoothed histogram suggests a compact distribution. (Horizontal axis marked from 5 to 30.)

Figure 3.15 Sample 2: $y_1, y_2, \ldots, y_m$. The smoothed histogram suggests the data are more disperse, or spread out. (Horizontal axis marked from 5 to 30.)

The measures of central tendency (sample mean and sample median) are approximately the same ($\bar{x} \approx \bar{y} \approx 15$ and $\tilde{x} \approx \tilde{y} \approx 15$), but the data in Sample 1 are more compact because more of the data are clustered about the mean $\bar{x} = 15$. To describe the difference between the data sets, we need to consider variability.

Definition
The (sample) range, denoted R, of the n observations $x_1, x_2, \ldots, x_n$ is the largest observation minus the smallest observation. Written mathematically,

\[ R = x_{\max} - x_{\min} \qquad (3.3) \]

where $x_{\max}$ denotes the maximum, or largest, observation, and $x_{\min}$ stands for the minimum, or smallest, observation.

A CLOSER LOOK
1. In theory, the sample range does measure, or describe, variability. A data set with a small range has little variability and is compact. A data set with a large range has lots of variability and is spread out.
2. The sample range is used in many quality control applications. For example, a production supervisor may want to maintain small variability in a manufacturing process. The sample range may be used to determine whether the process is still well controlled, or whether there is abnormal variation.

Despite being very easy to compute and a logical measure, the sample range is not adequate for describing variability. The sample range may not accurately represent the variability of a distribution if the maximum and minimum values are outliers. The sample range for each data set summarized by the smoothed histograms in Figures 3.14 and 3.15 is approximately the same: $R \approx 30 - 0 = 30$. The two data sets also have approximately the same mean, yet one is clearly more spread out than the other. Therefore, it is necessary to use a better, more sensitive measure of variability.

Figure 3.16 Stacked dot plot. Sample 1 (x's): {17, 18, 19, 21, 22, 23}. Sample 2 (y's): {12, 13, 14, 26, 27, 28}. The two samples have the same mean, $\bar{x} = \bar{y}$.

To derive a more precise measure of variability, consider how far each observation lies from the mean. A graph may be used to visualize the spread of data and to suggest another measurement. A dot plot is a graph that simply displays a dot corresponding to each observation along a number line. The stacked dot plot in Figure 3.16 may be used to compare the variability in Sample 1 (x's) versus Sample 2 (y's). In Sample 1, the data set is compact; each observation is very close to the mean. In Sample 2, the data set is more spread out; each observation is far away from the mean. This analysis of Figure 3.16 suggests that a better measure of variability might include the distances from the mean.
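The comparison in Figure 3.16 can be checked numerically. The following is a minimal Python sketch (Python is not one of the technologies used in this text, and the function name `sample_range` is only illustrative) that computes the mean and the range of the two samples listed in the figure.

```python
from statistics import mean

# Data read from Figure 3.16.
sample_1 = [17, 18, 19, 21, 22, 23]   # x's
sample_2 = [12, 13, 14, 26, 27, 28]   # y's

def sample_range(data):
    """Largest observation minus smallest observation (Equation 3.3)."""
    return max(data) - min(data)

for name, data in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    # The means agree, but the ranges do not.
    print(name, "mean =", mean(data), "range =", sample_range(data))
```

Both samples have the same mean, so a measure of center alone cannot distinguish them; the range already hints that Sample 2 is more spread out.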

Definition
Given a set of n observations $x_1, x_2, \ldots, x_n$, the ith deviation about the mean is $x_i - \bar{x}$.

A CLOSER LOOK
1. Given a data set, to calculate the ith deviation about the mean, find $\bar{x}$, then compute the difference $x_i - \bar{x}$. For example, the seventh deviation about the mean is the value $x_7 - \bar{x}$.
2. We usually do not need any one deviation about the mean; all of the deviations about the mean together will be used to find a suitable measure of variability.
3. If the ith deviation about the mean is positive, then the observation is to the right of the mean: If $x_i - \bar{x} > 0$, then $x_i > \bar{x}$. If the ith deviation about the mean is negative, then the observation is to the left of the mean: If $x_i - \bar{x} < 0$, then $x_i < \bar{x}$.

A data set with little variability should have small deviations about the mean, and the squares of the deviations should be small. A data set with lots of variability should have large deviations about the mean, and the squares of the deviations should be large. This idea is used to define the sample variance.
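To see why squared deviations are a natural ingredient, the sketch below (Python, with the illustrative helper name `deviations_about_mean`) lists the deviations about the mean for the two samples in Figure 3.16. It is only an illustration under those assumptions, not a procedure from the text.

```python
from statistics import mean

def deviations_about_mean(data):
    """Return the list of deviations x_i - x_bar."""
    x_bar = mean(data)
    return [x - x_bar for x in data]

sample_1 = [17, 18, 19, 21, 22, 23]   # compact sample from Figure 3.16
sample_2 = [12, 13, 14, 26, 27, 28]   # spread-out sample from Figure 3.16

for name, data in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    devs = deviations_about_mean(data)
    print(name, "deviations:", devs)
    # Positive and negative deviations cancel, so their plain sum is not useful.
    print(name, "sum of deviations:", sum(devs))
    # Squaring removes the signs; the more spread-out sample gives a larger total.
    print(name, "sum of squared deviations:", sum(d ** 2 for d in devs))
```

The deviations in each sample add to zero, which is why the deviations are squared before they are combined.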

Definition
The sample variance, denoted $s^2$, of the n observations $x_1, x_2, \ldots, x_n$ is the sum of the squared deviations about the mean divided by $n - 1$. Written mathematically,

\[ s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2 \right] \qquad (3.4) \]

The sample standard deviation, denoted s, is the positive square root of the sample variance. Written mathematically,

\[ s = \sqrt{s^2} \qquad (3.5) \]


A CLOSER LOOK
The sample variance $s^2$ is often called an average of the squared deviations about the mean, yet we divide the sum of the squared deviations by $n - 1$. Although this does not seem correct, dividing by $n - 1$ makes $s^2$ an unbiased estimator of $\sigma^2$. We will see later in the text that an unbiased statistic is, in some sense, a good thing. There are $n - 1$ degrees of freedom, a kind of dimension of variability, associated with the sample variance $s^2$.

1. The population variance, a measure of variability for an entire population, is denoted by $\sigma^2$, and the population standard deviation is denoted by $\sigma$, the Greek letter sigma.
2. Just knowing $s^2$ doesn't seem to say much about variability. If $s^2 = 6$, for example, it is hard to infer anything about variability. However, the sample variance $s^2$ is a measure of variability, and it is useful in comparisons. For example, if Sample 1 and Sample 2 have similar units, $s_1^2 = 14$, and $s_2^2 = 10$, then the data in Sample 2 are more compact.
3. The sample standard deviation s is used (rather than $s^2$) in many statistical inference problems. So, if we need to find s (by hand), we need to compute $s^2$ first, and then take the positive square root to find s.
4. The units for the sample standard deviation are the same as for the original data. And a value of $s = 0$ means there is no variability in the data set.
5. The notation $s_x^2$ is used to represent the sample variance for a set of observations denoted by $x_1, x_2, \ldots, x_n$. Similarly, $s_y^2$ represents the sample variance for a set of observations $y_1, y_2, \ldots, y_n$.

Example 3.8 Zucchini Weight
DATA SET ZUCCHINI

Welliver Farms in Bloomsburg, Pennsylvania, sells a wide variety of fruits and vegetables and frequently donates crates of zucchini to the local food cupboard. Five of the donated zucchini were randomly selected, and each was carefully weighed. The weights, in ounces, were 6.2, 4.5, 6.6, 7.0, and 8.2. Find the sample variance and the sample standard deviation for these data.

SOLUTION
STEP 1 Find the sample mean:

\[ \bar{x} = \frac{1}{5}(6.2 + 4.5 + 6.6 + 7.0 + 8.2) = \frac{1}{5}(32.5) = 6.5 \]

STEP 2 Use Equation 3.4 to find the sample variance.

\begin{align*}
s^2 &= \frac{1}{4}\left[(6.2 - 6.5)^2 + (4.5 - 6.5)^2 + (6.6 - 6.5)^2 + (7.0 - 6.5)^2 + (8.2 - 6.5)^2\right] && \text{Use data and } \bar{x}. \\
    &= \frac{1}{4}\left[(-0.3)^2 + (-2.0)^2 + (0.1)^2 + (0.5)^2 + (1.7)^2\right] && \text{Compute differences.} \\
    &= \frac{1}{4}\left[0.09 + 4.0 + 0.01 + 0.25 + 2.89\right] && \text{Square each difference.} \\
    &= \frac{1}{4}(7.24) = 1.81 && \text{Add, divide by 4.}
\end{align*}

STEP 3 Take the positive square root of the variance to find the standard deviation.

\[ s = \sqrt{1.81} \approx 1.3454 \]

Figure 3.17 Sample variance and sample standard deviation.
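As a cross-check of the hand calculation, here is a short Python sketch of Equation 3.4 applied to the zucchini weights. The function name `sample_variance` is illustrative; the comparison with the standard library's `statistics.variance` (which also divides by n − 1) is included only as a sanity check.

```python
import math
from statistics import mean, variance

weights = [6.2, 4.5, 6.6, 7.0, 8.2]   # zucchini weights in ounces (Example 3.8)

def sample_variance(data):
    """Sum of squared deviations about the mean, divided by n - 1 (Equation 3.4)."""
    n = len(data)
    x_bar = mean(data)
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

s2 = sample_variance(weights)
s = math.sqrt(s2)                      # Equation 3.5

print("s^2 =", round(s2, 4))
print("s   =", round(s, 4))
print("agrees with statistics.variance:", math.isclose(s2, variance(weights)))
```

Rounded to four decimal places, the results match the values found by hand; tiny differences in later digits come from floating-point arithmetic.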

A technology solution is shown in Figure 3.17. Equation 3.4 is the definition of the sample variance and may be used to find s2, but there is actually a more efficient technique for computing s2.

Definition
The computational formula for the sample variance is

\[ s^2 = \frac{1}{n-1} \left[ \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 \right] \qquad (3.6) \]


This is a convenient shortcut method for calculating $s^2$ without having to find all the deviations about the mean. Suppose $x_1, x_2, \ldots, x_n$ is a set of observations. To find $s^2$, Equation 3.6 says:
1. Find the sum of the squared observations, $\sum x_i^2$.
2. Find the sum of the observations, $\sum x_i$.
3. Square the sum of the observations, $\left( \sum x_i \right)^2$.
4. Multiply the square of the sum of the observations by $1/n$: $\frac{1}{n} \left( \sum x_i \right)^2$.
5. Subtract the two quantities, and multiply the difference by $1/(n - 1)$:

\[ s^2 = \frac{1}{n-1} \left[ \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 \right] \]

Example 3.9 Zucchini Weight (Continued)
Use the computational formula for $s^2$ to find the sample variance for the data in Example 3.8. The zucchini weights are 6.2, 4.5, 6.6, 7.0, and 8.2.

SOLUTION
STEP 1 Find the sum of the squared observations:

\[ \sum x_i^2 = 6.2^2 + 4.5^2 + 6.6^2 + 7.0^2 + 8.2^2 = 38.44 + 20.25 + 43.56 + 49.0 + 67.24 = 218.49 \]

STEP 2 Find the sum of the observations:

\[ \sum x_i = 6.2 + 4.5 + 6.6 + 7.0 + 8.2 = 32.5 \]

STEP 3 Square this sum and multiply by $1/n$:

\[ \frac{1}{n} \left( \sum x_i \right)^2 = \frac{1}{5}(32.5)^2 = 211.25 \]

STEP 4 Subtract the two quantities, and multiply by $1/(n - 1)$:

\[ s^2 = \frac{1}{4}(218.49 - 211.25) = \frac{1}{4}(7.24) = 1.81 \]

(the same answer as above).

TRY IT NOW GO TO EXERCISE 3.31
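The steps of the computational formula translate almost line for line into code. The sketch below is one possible Python rendering (the function name `variance_computational` is an assumption made for illustration), applied to the zucchini weights.

```python
import math

weights = [6.2, 4.5, 6.6, 7.0, 8.2]   # zucchini weights in ounces

def variance_computational(data):
    """Sample variance via the computational formula (Equation 3.6)."""
    n = len(data)
    sum_of_squares = sum(x ** 2 for x in data)   # sum of the squared observations
    square_of_sum = sum(data) ** 2               # square of the sum of the observations
    return (sum_of_squares - square_of_sum / n) / (n - 1)

s2 = variance_computational(weights)
print("s^2 =", round(s2, 4))
```

Only two running totals are needed, the sum of the observations and the sum of their squares, so the data can be processed in a single pass.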

It can be shown that Equation 3.4 and Equation 3.6 are equivalent. Exercise 3.49 at the end of this section asks for a proof. If you must find a sample variance by hand, then use the computational formula. It has fewer calculations (is more efficient) and is usually more accurate (has less round-off error). In fact, most calculator and computer programs that find the sample variance use the computational formula.

The sample variance is always greater than or equal to zero: $s^2 \ge 0$. This is easy to see by looking at the definition in Equation 3.4. We sum squared deviations about the mean (always greater than or equal to zero) and divide by a positive number ($n - 1$). There are two special cases.
1. $s^2 = 0$: This occurs if all the observations are the same. If all the observations are equal to some constant c, the mean is c, and all the deviations about the mean are zero. Hence, $s^2 = 0$. This makes sense intuitively also: If all the observations are the same, there is no variability.
2. $n = 1$: This is a strange case, but it can occur. If $n = 1$, there is no variability or, another way to think of this, we cannot measure variability. The denominator in Equation 3.4 is zero, and anything divided by zero is undefined.
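Software has to deal with these two special cases as well. As an illustration only, the Python sketch below uses the standard library's `statistics.variance`, which raises `StatisticsError` when fewer than two observations are supplied; the wrapper name `variance_or_none` is hypothetical.

```python
from statistics import StatisticsError, variance

def variance_or_none(data):
    """Return the sample variance, or None when it is undefined (n < 2)."""
    try:
        return variance(data)
    except StatisticsError:
        # With a single observation the denominator n - 1 would be zero.
        return None

print(variance_or_none([4.0, 4.0, 4.0, 4.0, 4.0]))   # all observations equal
print(variance_or_none([4.0]))                       # only one observation
```

A constant data set gives a variance of zero, while a single observation gives no variance at all, mirroring the two cases above.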


The sample variance (and the sample standard deviation) can be greatly influenced by outliers. An observation very far away from the rest has a large deviation about the mean, a large squared deviation about the mean, and therefore contributes a lot to the sum (in the definition of the sample variance). The interquartile range is another measure of variability, and it is resistant to outliers.

Definition
Let $x_1, x_2, \ldots, x_n$ be a set of observations. The quartiles divide the data into four parts.
1. The first (lower) quartile, denoted $Q_1$ ($Q_L$), is the median of the lower half of the observations when they are arranged in ascending order.
2. The second quartile is the median, $\tilde{x} = Q_2$.
3. The third (upper) quartile, denoted $Q_3$ ($Q_U$), is the median of the upper half of the observations when they are arranged in ascending order.
4. The interquartile range, denoted IQR, is the difference $\text{IQR} = Q_3 - Q_1$.

Note that the definition for $Q_1$ and $Q_3$ involves the median, not the mean.

The quartiles are illustrated in Figure 3.18. In smoothed histograms, the area under the curve between two points corresponds to the proportion of observations between those points. Interpreting Figure 3.18: 25% of the observations are between $Q_1$ and $\tilde{x}$.

Figure 3.18 Smoothed histogram and quartiles. The points $Q_1$, $\tilde{x} = Q_2$, and $Q_3$ divide the area under the curve into four regions of 25% each.

A CLOSER LOOK
There is a very intuitive method for finding the quartiles. Arrange the data in order from smallest to largest. The median, $\tilde{x} = Q_2$, is the middle value. The first quartile, $Q_1$, is the median of the lower half, and the third quartile, $Q_3$, is the median of the upper half. In practice, a more general method is used for locating the position, or depth, of the first and third quartiles (in the ordered data set).

How to Compute Quartiles
Suppose $x_1, x_2, \ldots, x_n$ is a set of n observations.
1. Arrange the observations in ascending order, from smallest to largest.
2. To find $Q_1$, compute $d_1 = n/4$.
   a. If $d_1$ is a whole number, then the depth of $Q_1$ (position in the ordered list) is $d_1 + 0.5$. $Q_1$ is the mean of the observations in positions $d_1$ and $d_1 + 1$ in the ordered list.
   b. If $d_1$ is not a whole number, round up to the next whole number for the depth of $Q_1$.
3. To find $Q_3$, compute $d_3 = 3n/4$.
   a. If $d_3$ is a whole number, then the depth of $Q_3$ is $d_3 + 0.5$. $Q_3$ is the mean of the observations in positions $d_3$ and $d_3 + 1$ in the ordered list.
   b. If $d_3$ is not a whole number, round up to the next whole number for the depth of $Q_3$.
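The depth rule above can be sketched in a few lines of Python. The function name `quartile` and its argument `k` (1 for the first quartile, 3 for the third) are illustrative choices, not notation from the text; the data are the pulse rates used in Example 3.10.

```python
import math

def quartile(data, k):
    """Quartile by the depth rule: k = 1 gives Q1, k = 3 gives Q3."""
    ordered = sorted(data)                 # step 1: ascending order
    n = len(ordered)
    if (k * n) % 4 == 0:
        d = k * n // 4                     # whole-number d: depth is d + 0.5
        # Mean of the observations in positions d and d + 1 (1-based).
        return (ordered[d - 1] + ordered[d]) / 2
    depth = math.ceil(k * n / 4)           # otherwise round up to the next whole number
    return ordered[depth - 1]

pulse_10 = [68, 71, 64, 58, 61, 76, 73, 62, 72, 66]
pulse_12 = pulse_10 + [78, 81]

for label, data in (("n = 10", pulse_10), ("n = 12", pulse_12)):
    q1, q3 = quartile(data, 1), quartile(data, 3)
    print(label, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```

Working with the integer product k·n avoids testing a floating-point value for being a whole number.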


Example 3.10 Pulse Rates
DATA SET PULSE

The following 10 observations represent the resting pulse rate for patients involved in an exercise study:

68 71 64 58 61 76 73 62 72 66

a. Find the first quartile, the third quartile, and the interquartile range.
b. Suppose there are 12 patients in the study, with $x_{11} = 78$ and $x_{12} = 81$. Find the first quartile, the third quartile, and the interquartile range for this modified data set.

SOLUTION
STEP 1 Arrange the observations in order from smallest to largest.

Observation   58  61  62  64  66  68  71  72  73  76
Position       1   2   3   4   5   6   7   8   9  10

STEP 2 Find the depth of the first quartile.

\[ d_1 = \frac{n}{4} = \frac{10}{4} = 2.5 \]

Because $d_1$ is not a whole number, round up. The depth of the first quartile is 3. $Q_1$ is in the third position in the ordered list. Using the table above, $Q_1 = 62$.

STEP 3 Find the depth of the third quartile.

\[ d_3 = \frac{3n}{4} = \frac{(3)(10)}{4} = 7.5 \]

Because $d_3$ is not a whole number, round up. The depth of the third quartile is 8. $Q_3$ is in the eighth position in the ordered list. Using the table above, $Q_3 = 72$.

STEP 4 Find the interquartile range $\text{IQR} = Q_3 - Q_1$.

\[ \text{IQR} = 72 - 62 = 10 \]

A technology solution is shown in Figures 3.19 and 3.20.

Figure 3.19 1-Var Stats is used to compute the quartiles.

Figure 3.20 Compute IQR on the Home screen.

STEP 5 If there are 12 patients in the study, arrange the observations in order from smallest to largest in the modified data set.

Observation   58  61  62  64  66  68  71  72  73  76  78  81
Position       1   2   3   4   5   6   7   8   9  10  11  12

STEP 6 Find the depth of the first quartile.

\[ d_1 = \frac{n}{4} = \frac{12}{4} = 3 \]

Because $d_1$ is a whole number, add 0.5. The depth of the first quartile is 3.5. $Q_1$ is the mean of the observations in the third and fourth positions in the ordered list.

\[ Q_1 = \frac{1}{2}(62 + 64) = 63 \]

STEP 7 Find the depth of the third quartile.

\[ d_3 = \frac{3n}{4} = \frac{(3)(12)}{4} = 9 \]

Because $d_3$ is a whole number, add 0.5. The depth of the third quartile is 9.5. $Q_3$ is the mean of the observations in the ninth and tenth positions in the ordered list.

\[ Q_3 = \frac{1}{2}(73 + 76) = 74.5 \]

STEP 8 Find the interquartile range.

\[ \text{IQR} = 74.5 - 63 = 11.5 \]

A technology solution is shown in Figures 3.21 and 3.22.

Figure 3.21 1-Var Stats is used to compute the quartiles.

Figure 3.22 Compute IQR on the Home screen.

TRY IT NOW GO TO EXERCISE 3.39

STEPPED TUTORIAL MEASURES OF SPREAD
STEPPED TUTORIALS BOX PLOTS

A CLOSER LOOK
1. The interquartile range is the length of an interval that includes the middle half (middle 50%) of the data.
2. The interquartile range is not sensitive to outlying values. The lower and/or upper 25% of the distribution can be extreme without affecting $Q_1$ and/or $Q_3$.

STATISTICAL APPLET ONE-VARIABLE STATISTICAL CALCULATOR
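The resistance of the interquartile range to outliers can be illustrated with a small experiment. In the Python sketch below, the largest pulse rate from Example 3.10 is replaced by 176, a made-up value standing in for a recording error; the helper names `textbook_quartile` and `iqr` are illustrative, and the quartiles follow the depth rule described in this section.

```python
import math
from statistics import stdev

def textbook_quartile(data, k):
    """Depth rule from this section: k = 1 gives Q1, k = 3 gives Q3."""
    ordered = sorted(data)
    n = len(ordered)
    if (k * n) % 4 == 0:
        d = k * n // 4
        return (ordered[d - 1] + ordered[d]) / 2
    return ordered[math.ceil(k * n / 4) - 1]

def iqr(data):
    """Interquartile range, Q3 - Q1."""
    return textbook_quartile(data, 3) - textbook_quartile(data, 1)

pulse = [68, 71, 64, 58, 61, 76, 73, 62, 72, 66]
# Hypothetical outlier: the maximum, 76, is replaced by 176.
with_outlier = [176 if x == 76 else x for x in pulse]

for label, data in (("original", pulse), ("with outlier", with_outlier)):
    print(label, "s =", round(stdev(data), 2), "IQR =", iqr(data))
```

The standard deviation grows several-fold when the single extreme value is introduced, while the interquartile range does not change.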

Technology Corner
Procedure: Compute the sample variance, sample standard deviation, first quartile, third quartile, and interquartile range.
Reconsider: Example 3.10(b), solution, and interpretations.

VIDEO TECH MANUALS EXCEL DESCRIPTIVE SUMMARY STATISTICS

CrunchIt!
CrunchIt! has a built-in function to find certain descriptive statistics, including the sample standard deviation, first quartile, and third quartile. There is no built-in function to compute the sample variance or the interquartile range.
1. Enter the data into a column.
2. Select Statistics; Descriptive Statistics. Choose the appropriate column and click the Calculate button. See Figure 3.23.

Figure 3.23 CrunchIt! descriptive statistics.

TI-84 Plus C
1. Enter the data into list L1.
2. Select LIST; MATH; variance. Take the square root of the variance to find the standard deviation. See Figure 3.17.
3. Select STAT; CALC; 1-Var Stats.
4. The quartiles are displayed on the second output screen. Refer to Figure 3.21. Note: The sample standard deviation is displayed on the first output screen and the value is stored in the statistic variable Sx.
5. Compute the interquartile range on the Home screen. Use the TI-84 Plus statistics variables that represent the quartiles. Refer to Figure 3.22.

Minitab
1. Enter the data into column C1.
2. Select Stat; Basic Statistics; Display Descriptive Statistics. Enter C1 in the Variables window.
3. Choose the Statistics option button and check the summary statistics Standard deviation, Variance, First quartile, Third quartile, and Interquartile range. Note: Minitab computes quartiles using a slightly different algorithm. See Figure 3.24.

Figure 3.24 Measures of variability computed using Minitab.

Excel
Use built-in functions to compute the sample standard deviation, sample variance, quartiles, and interquartile range.
1. Enter the data into column A.
2. Use the function STDEV.S to compute the sample standard deviation; VAR.S to compute the sample variance; QUARTILE to compute the first and third quartile; and compute the interquartile range using the results. See Figure 3.25. Note: Excel computes quartiles using yet another algorithm.

Figure 3.25 Measures of variability computed using Excel.
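The notes above point out that software packages do not all compute quartiles the same way. Python is not among the technologies covered in this text, but its standard library makes the point easy to see: `statistics.quantiles` offers two interpolation methods, and the sketch below applies both to the 12 pulse rates from Example 3.10(b). No claim is made that either method corresponds to the algorithm used by Minitab or Excel.

```python
from statistics import quantiles

# Ordered pulse rates from Example 3.10(b).
pulse_12 = [58, 61, 62, 64, 66, 68, 71, 72, 73, 76, 78, 81]

for method in ("exclusive", "inclusive"):
    # n=4 asks for the three cut points that split the data into four parts.
    q1, _median, q3 = quantiles(pulse_12, n=4, method=method)
    print(method, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```

Both methods interpolate between neighboring observations, so their quartiles differ slightly from each other and from the values $Q_1 = 63$ and $Q_3 = 74.5$ found with the depth rule. When comparing results across tools, check which quartile algorithm each one uses.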


SECTION 3.2 EXERCISES

Concept Check
3.33 True/False Every deviation about the mean is nonnegative.
3.34 True/False The sample standard deviation is greater than or equal to 0.
3.35 True/False The sample standard deviation and the population standard deviation are always the same value.
3.36 True/False The computational formula for the sample variance is used only for large data sets.
3.37 True/False Quartiles divide the data into four parts.

Practice
3.38 Find the sample range, sample variance, and sample standard deviation for each data set. DATA SET EX3.38
a. {2.7, 6.0, 5.7, 5.4, 4.0, 3.1, 6.6, 5.7, 6.1, 3.0}
b. {18.5, 23.5, 15.7, 15.7, 36.3, 20.8, 21.1, 20.2, 26.8, 19.9, 17.6, 17.5, 21.5, 22.4, 25.7}
c. {23.94, −31.04, 37.09, 22.64, −61.23, 1.59, 23.09, 1.14}
d. {0.13, 0.96, −0.50, 0.10, −1.65, −0.14, 1.43, −2.57, −1.28, −0.24, −0.90, −1.27, 1.53, 3.00, −1.28, 1.04, −0.90, 2.44, 1.70, 3.13}

3.39 Compute the sample variance and the sample standard deviation for each sample with known sum(s).
a. $\sum x_i = 1219.29$, $\sum x_i^2 = 58{,}945.1$, $n = 30$
b. $\sum x_i = 35.2918$, $\sum x_i^2 = 7748.98$, $n = 17$
c. $\sum x_i = 218.291$, $\sum x_i^2 = 3615.96$, $n = 15$
d. $\sum (x_i - \bar{x})^2 = 49.784$, $n = 21$

3.40 Find the depth of the first quartile and the third quartile in an ordered data set of size n.
a. n = 60
b. n = 37
c. n = 100
d. n = 48

3.41 Find the first quartile, the third quartile, and the interquartile range for each data set. DATA SET EX3.41
a. {20, 17, 37, 33, 29, 50, 20, 33}
b. {13.1, 7.8, 11.9, 2.3, 6.7, 2.3, 7.4, 2.7, 8.9, 6.6, 6.8, 5.1, 2.2, 5.6, 5.5, 2.1, 7.7, 13.9, 1.6, 1.7}
c. {−15, −13, −7, −15, −22, −12, −21, −21, −26, −17}
d. {43.6, 44.1, 59.5, 52.3, 50.9, 39.7, 42.4, 58.5, 40.9, 38.5, 44.2, 60.3, 72.2, 34.8, 46.0, 54.7, 51.0, 54.3, 49.7, 62.9, 44.6, 61.3, 52.4, 43.9, 68.8, 59.2, 57.1, 70.5, 52.3, 49.5}

3.42 Consider the following data set: DATA SET EX3.42

21 28 38 12 33 47 51 11 81 36

a. Find the sample variance and the sample standard deviation.
b. If 20 is subtracted from each observation in part (a), a new data set is formed:

1 8 18 −8 13 27 31 −9 61 16

Find the sample variance and the sample standard deviation for this new data set. How are these values related to the sample variance and the sample standard deviation found in part (a)?
c. If each observation in part (a) is multiplied by 20, the following data set is formed:

420 560 760 240 660 940 1020 220 1620 720

Find the sample variance and the sample standard deviation for this new data set. How are these values related to the sample variance and the sample standard deviation found in part (a)?

3.43 How does an outlier affect each of the following?
a. The sample variance
b. The sample standard deviation
c. The first quartile and the third quartile
d. The interquartile range

Applications
3.44 Sports and Leisure The following times (in seconds) are from the ISU World Cup 2012/2013, Montreal, women's 500-meter speed skating event, October 26–28, 2012.13 DATA SET SPDSKT

47.611 46.206 45.299 45.405 47.611
47.149 59.028 46.188 54.118 45.416

a. Find the sample range, R.
b. Find the sample variance, $s^2$, and the sample standard deviation, s.
c. Find the first and third quartiles, $Q_1$ and $Q_3$, and the interquartile range, IQR.

3.45 Physical Sciences The following table includes some of the data from an experiment performed by H. S. Lew for the Center for Building Technology at the U.S. National Institute of Standards and Technology (NIST). These data are used to certify computational results and evaluate statistical software. Each observation represents the deflection of a steel-concrete beam while subjected to periodic pressure. DATA SET BEAM

−213  −564   −35   −15   141   115  −420
−360   203  −338  −431   194  −220  −513
 154  −125  −559    92   −21  −579

Source: National Institute of Standards and Technology.

a. Find the sample standard deviation.
b. Find the interquartile range.
c. Which statistic, s or IQR, is a better measure of variability for this data set? Why?

3.46 Fuel Consumption and Cars The gross vehicle weight rating (in pounds) for several 2013 automobiles is given in the following table:14 DATA SET AUTOWT

5369 5612 6305 6355 6891 5137 6371 6327
6472 3925 4201 4680 4734 4178 5730 5899

a. Find the sample variance and the sample standard deviation.
b. Find the first and third quartiles.
c. Find the interquartile range and the quartile deviation (another measure of variability), $QD = (Q_3 - Q_1)/2$.

3.47 Education and Child Development Many educators believe that success in school is related directly to the amount of time spent completing homework assignments. A research study compared the academic ability of 17-year-olds who spend less than one hour on homework every day and those who spend more than two hours on homework every day. The National Assessment of Educational Progress (NAEP) scores for each student in each group are given in the following table.19 DATA SET HOMEWORK

Less than one hour
290 289 291 289 289 294 288 291 293 290 290 291 290 290 296 292

More than two hours
303 305 302 297 294 303 299 297 303 299 300 295 297 297 297 293 296 297 302 294

Source: National Center for Education Statistics.

a. Find the sample variance, sample standard deviation, and interquartile range of the progress scores for students who spend less than one hour on homework.
b. Find the sample variance, sample standard deviation, and interquartile range of the progress scores for students who spend more than two hours on homework.
c. Use your answers to parts (a) and (b) to determine which data set has more variability.

3.48 Travel and Transportation Air Canada recently discontinued a regularly scheduled flight from Montreal to Iqaluit. The route was not profitable because of rising fuel costs. Before the flight was canceled, seven days were randomly selected and the number of passengers recorded. The data are given in the following table. DATA SET PASSENGER

51 76 47 61 53 68 79

a. Compute $s^2$ using the definition in Equation 3.4.
b. Compute $s^2$ using the computational formula in Equation 3.6.
c. How do your answers to parts (a) and (b) compare?

3.49 Public Policy and Political Science The president of the United States has the authority to grant clemencies, pardons, and commutations of sentences to convicted criminals. A sample of U.S. presidents was obtained, and the number of presidential clemency actions for each was recorded. The data are given in the following table.15 DATA SET CLEMENCY

President              Clemency actions
Calvin Coolidge        1545
Jimmy Carter            566
Woodrow Wilson         2480
John F. Kennedy         575
Thomas Jefferson        119
Millard Fillmore        170
Rutherford B. Hayes     893
Richard Nixon           926
Andrew Jackson          386
James Madison           196
Zachary Taylor           38
Martin Van Buren        168
Ulysses S. Grant       1332
Lyndon B. Johnson      1187
George W. Bush          176

a. Find $Q_1$, $Q_3$, and IQR for the clemency actions data.
b. Find $s^2$ and s.
c. Franklin D. Roosevelt had the highest number of clemency actions of any president, 3687. Add this value to the data set. Find IQR and $s^2$ for this expanded data set.
d. How do IQR and $s^2$ compare in these two data sets? Explain why these values are the same/different.

3.50 Physical Sciences The following operating temperatures (°F) for a certain steam turbine were measured on 10 randomly selected days. DATA SET TURBINE

298 313 305 292 283 348 291 286 346 304

a. Find $Q_1$, $Q_3$, and IQR.
b. Find $s^2$ and s.
c. Suppose the smallest observation (283) is changed to 226. Find IQR and $s^2$ for this modified data set.
d. How do IQR and $s^2$ compare in these two data sets? Which measurement is more sensitive to outliers?

3.51 Marketing and Consumer Behavior Two measures designed to give a relative measure of variability are the coefficient of variation, denoted CV, and the coefficient of quartile variation, denoted CQV. These measures are defined by

\[ CV = 100 \cdot \frac{s}{\bar{x}} \qquad\qquad CQV = 100 \cdot \frac{Q_3 - Q_1}{Q_3 + Q_1} \]

The areas (in square feet) for homes constructed in two new residential developments in San Antonio (one in North Central and one on the city's West Side) were recorded and are given in the following table. DATA SET HOMES

East-side development 2038

1939

2024

1990

2109

2102

1918

2022

2142 1877

2382

1489

2070

2340

West-side development 2061 1725

2383 2368

2638 1674


a. Compute CV and CQV for each development.
b. Compare the coefficient of variation and the coefficient of quartile variation for each development. Which data set has more variability?

3.52 Physical Sciences Solar wind released from the Sun can affect power grids on Earth, the northern and southern lights, and even the tails of comets. The following table lists the proton density in protons per cubic centimeter (p/cc) for several times in November 2012.16 DATA SET SUNWIND

2.8 15.7  0.7  0.5  0.6  2.7  2.2  2.7  3.1  0.5
5.0  5.5  1.3 10.9  3.2  3.4  2.7  1.2  1.7  0.8

a. Find the sample variance and the sample standard deviation.
b. Find $Q_1$, $Q_3$, and IQR for these data.
c. Remove the two largest proton densities from the data set. Answer parts (a) and (b) for this reduced data set. Compare the sample standard deviation and IQR in these two data sets and explain how these values have changed.

3.53 Public Health and Nutrition The Center for Science in the Public Interest (CSPI), a consumer group concerned about nutrition labeling, has defined a new measure of breakfast cereal called the nutritional index (NI), which is based on calories, vitamins, minerals, and sugar content per serving. A larger NI indicates greater nutritional value. The NI was measured for randomly selected cereals sold by Kellogg's and General Mills. The results are given in the following table. DATA SET CEREALNI

70 83

77 70

79 67

71 72

80 68

88 74

62 80

81 62

82 74

31 59

46 47

29 80

81 41

63 91

41 41

60 33

General Mills 54 66

49 68

State Fires

AK 398

CA 7737

CO 1447

CT 180

GA FL 2878 2217

State Fires

IN 67

KY 896

LA 828

MA 1446

MD 143

State Fires

NC 2575

NH 312

NJ 994

OH 218

ME 551

a. Find the sample variance and the sample standard deviation.
b. Find Q1, Q3, and IQR for these data.
c. Verify that the sum of the deviations about the mean is 0 (subject to round-off error).

3.56 Marketing and Consumer Behavior An Internet search for the best deal on a 12-megapixel digital camera revealed the following prices (in U.S. dollars).19 (Data set: BESTDEAL)

160 300

169 295

783 600

90 579

129 356

188 553

a. Find Q1, Q3, and IQR for these price data. b. Suppose the highest price (783) is changed to 699. Find

Q1, Q3, and IQR for this modified data set. c. How large could the maximum price be without changing

IQR? d. How much could the minimum price be raised before Q1

Kellogg’s 86 75

and the number of acres burned has approximately doubled. The following table lists the number of wildland fires in 2012 WILDFIRE as of November 2 in selected states.18

50 39

a. Find $s^2$, s, and IQR for Kellogg’s.
b. Find $s^2$, s, and IQR for General Mills.
c. Use the results in parts (a) and (b) to compare the variability in NI for the two companies.

3.54 Biology and Environmental Science

Stage data (in feet-NGVD) for the Mississippi River at Baton Rouge at various times in 2012 are given on the text MSRIVER website.17 a. Find Q1, Q3, and IQR for these stream velocity data. b. How large could the minimum stage be without changing the IQR? c. Find the coefficient of quartile variation, CQV (defined in Exercise 3.51).

Extended Applications 3.55 Biology and Environmental Science The number of

wildland fires has increased dramatically over the last decade,

changes? 3.57 Travel and Transportation A typical road bridge is

constructed to last 50 years. The mean age of all bridges in the United States is approximately 43 years. The number of structurally deficient bridges in each state and the District of Columbia BRIDGES as of 2009 is given on the text website.20 a. Find the sample variance and the sample standard deviation of the number of structurally deficient bridges. b. Suppose each state is able to repair 10% of all structurally deficient bridges. Find the sample variance and the sample standard deviation for this new data set. c. How do your answers to parts (a) and (b) compare? 3.58 Travel and Transportation In its annual study, the

International Telework Association & Council asks survey participants how many miles they must drive to work each day. Six study participants were selected at random and their mileage was recorded: (Data set: MILEAGE)

25  39  16  35  18  45

a. Find each deviation about the mean.
b. Verify that the sum of the deviations about the mean is 0 (subject to round-off error).
c. Prove that, in general, $\sum (x_i - \bar{x}) = 0$. (Hint: Write as two separate sums, and use the definition of the sample mean.)


3.59 Biology and Environmental Science The Virginia Estuarine & Coastal Observing System monitors the Chesapeake Bay and records values of several variables, including salinity, temperature, and turbidity. The wind speed (in miles per hour) at selected locations on November 6, 2012, is given in the following table.21 (Data set: CHESABAY)

2  6  30  5  9  25  11  24  14  17  27  24
9  8  10  7  8  18  11  10  18  10  28  25

a. Find the sample variance and the sample standard deviation for these wind speed data.
b. Convert each observation to meters per second (multiply each observation by 0.44704). Find the sample variance and the sample standard deviation for the wind-speed data in meters per second.
c. How do your answers to parts (a) and (b) compare?

3.60 Proof Prove that Equation 3.4 (the definition of the sample variance) can be written as Equation 3.6. That is, show that

$$\frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]$$

3.61 Public Health and Nutrition A nutritional study recently found the following number of calories in one slice of plain pizza at 10 different national chains.27 (Data set: PIZZA)

228  281  274  408  364  259  317  299  302  231

Source: Food Science and Human Nutrition, Colorado State University.

a. Find the sample variance and the sample standard deviation.
b. Add 15 (calories) to each observation. Find the sample variance and the sample standard deviation for this modified data set.
c. How do your answers to parts (a) and (b) compare?
d. Suppose a data set (x’s) has variance $s_x^2$ and standard deviation $s_x$. A new (transformed) data set is created using the equation $y_i = x_i + b$, where b is a constant. How are the variance and standard deviation of the new data set ($s_y^2$ and $s_y$) related to $s_x^2$ and $s_x$?

3.62 Technology and the Internet A benchmark computer program was executed on eight different machines, and the following times to completion (in seconds) were recorded: (Data set: PROGRAMS)

12.592  13.646  12.396  15.377  6.801  7.602  14.152  12.075

a. Find the sample variance and the sample standard deviation.
b. Multiply each observation by 7. Find the sample variance and the sample standard deviation for this modified data set.
c. How do your answers to parts (a) and (b) compare?
d. Suppose a data set (x’s) has variance $s_x^2$ and standard deviation $s_x$. A new (transformed) data set is created using the equation $y_i = a x_i$, where a is a constant. How are the variance and standard deviation of the new data set ($s_y^2$ and $s_y$) related to $s_x^2$ and $s_x$?

3.63 Transformed Data Combine the results obtained in the previous two exercises. Suppose a data set (x’s) has variance $s_x^2$ and standard deviation $s_x$. A new (transformed) data set is created using the equation $y_i = a x_i + b$, where a and b are constants. How are the variance and standard deviation of the new data set ($s_y^2$ and $s_y$) related to $s_x^2$ and $s_x$?

3.64 Is This Possible? Consider the set of observations

{5, 7, 3, 2, 4, 6, 9, 11, 13}

Can you find a subset of size n = 7 with $\bar{x} = 5$ and $s^2 = 6$? If not, why not?

Challenge

3.65 Biology and Environmental Science A whale-watching tour off the coast of Maine is considered a success if at least one whale is sighted. Thirty-two randomly selected summer tours are classified in the following table:

S  S  S  S  S  F  S  S
S  S  S  S  F  S  S  S
S  S  S  S  S  S  F  S
S  S  S  S  S  S  S  S

a. Find the sample proportion of successes.
b. Change each S to a 1 and each F to a 0. Find the sample variance for these new data. Write the sample variance in terms of the sample proportion.
c. If a population happens to be of finite size N, then the population mean and population variance are defined by

$$\mu = \frac{1}{N}\sum x_i \qquad\qquad \sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2$$

Suppose the table represents an entire population. Find the population variance for the data (consisting of 0’s and 1’s). Write the population variance in terms of the sample proportion.

3.66 Other Summary Statistics Many other summary statistics can also be used to describe various characteristics of a numerical data set. Suppose x1, x2, . . . , xn is a set of observations. For r = 1, 2, 3, . . . , the rth moment about the mean $\bar{x}$ is defined as

$$m_r = \frac{1}{n}\sum (x_i - \bar{x})^r$$

For example, the second moment about the mean is

$$m_2 = \frac{1}{n}\sum (x_i - \bar{x})^2$$

Certain moments about the mean are used to define the coefficient of skewness ($g_1$) and the coefficient of kurtosis ($g_2$):

$$g_1 = \frac{m_3}{m_2^{3/2}} \qquad\qquad g_2 = \frac{m_4}{m_2^{2}}$$

The statistic $g_1$ is a measure of the lack of symmetry, and $g_2$ is a measure of the extent of the peak in a distribution. Use technology to compute the values $g_1$ and $g_2$ for various distributions: skewed, symmetric, unimodal, uniform. Use your results to determine the values of $g_1$ that suggest more skewness in the distribution, and the values of $g_2$ that indicate a flatter, more uniform distribution.
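For readers who want to try Exercise 3.66 with technology, the following is a minimal Python sketch of the moment formulas. The function names and the two small data sets are illustrative assumptions, not part of the text.

```python
def moment_about_mean(data, r):
    """r-th moment about the mean: m_r = (1/n) * sum((x_i - xbar)**r)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** r for x in data) / n


def skewness_g1(data):
    """Coefficient of skewness: g1 = m3 / m2**(3/2)."""
    return moment_about_mean(data, 3) / moment_about_mean(data, 2) ** 1.5


def kurtosis_g2(data):
    """Coefficient of kurtosis: g2 = m4 / m2**2."""
    return moment_about_mean(data, 4) / moment_about_mean(data, 2) ** 2


# Hypothetical data sets, chosen only to exercise the functions.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

for name, data in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    print(name, skewness_g1(data), kurtosis_g2(data))
```

Running the functions on data sets of different shapes is one way to explore the question posed in the exercise.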


3.3 The Empirical Rule and Measures of Relative Standing Measures of central tendency and measures of variability are used to describe the general nature of a data set. These two types of measures may be combined to describe the distribution of a data set more precisely. In addition, these values may be used to define measures of relative standing, quantities used to compare observations from different data sets (with different units), or even to draw a conclusion or make an inference. The first result combines the mean and the standard deviation to describe a distribution.

Chebyshev’s Rule

Let k > 1. For any set of observations, the proportion of observations within k standard deviations of the mean [lying in the interval $(\bar{x} - ks, \bar{x} + ks)$, where s is the standard deviation] is at least $1 - (1/k^2)$.

What happens if k = 1?

Recall interval notation: (a, b) denotes an open interval, with the endpoints not included, from a to b. Therefore, $(\bar{x} - ks, \bar{x} + ks)$ means the set of all x’s such that $\bar{x} - ks < x < \bar{x} + ks$.

The diagram in Figure 3.26, and the accompanying table, illustrate this idea. For any set of observations, the smoothed histogram shows that the proportion of observations captured in the interval $(\bar{x} - ks, \bar{x} + ks)$ is at least $1 - (1/k^2)$. For example, the proportion of observations within 1.5 standard deviations of the mean is at least 0.56 (or 56%). The proportion of observations within 3 standard deviations of the mean is at least 0.89 (or 89%).

k      1 − 1/k²
1.5    1 − 1/1.5² ≈ 0.56
2.0    1 − 1/2.0² = 0.75
2.3    1 − 1/2.3² ≈ 0.81
3.0    1 − 1/3.0² ≈ 0.89

Recall: In smoothed histograms, the area under the curve between two points a and b corresponds to the proportion of observations between a and b.

Figure 3.26 Illustration of Chebyshev’s rule.
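The table values can be reproduced with a few lines of Python. This is a minimal sketch; the function name `chebyshev_lower_bound` is an illustrative choice, and rejecting k ≤ 1 simply mirrors the condition k > 1 in the statement of the rule.

```python
def chebyshev_lower_bound(k):
    """Lower bound, 1 - 1/k**2, on the proportion of observations
    within k standard deviations of the mean (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is stated for k > 1")
    return 1 - 1 / k**2


for k in (1.5, 2.0, 2.3, 3.0):
    inside = chebyshev_lower_bound(k)
    # The complement bounds the proportion in the two tails combined.
    print(f"k = {k}: at least {inside:.2f} inside, at most {1 - inside:.2f} outside")
```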

A CLOSER LOOK

A symmetric interval about the mean is centered at the mean and has endpoints that are the same distance from the mean.

1. Chebyshev’s rule simply helps to describe a set of observations using symmetric intervals about the mean. If we move k standard deviations from the mean in both directions, then the proportion of observations captured is at least $1 - (1/k^2)$.
2. The total area under the curve (the sum of all the proportions) is 1. Hence, Chebyshev’s rule also implies that the proportion of observations in the tails of the distribution, outside the interval $(\bar{x} - ks, \bar{x} + ks)$, is at most $1/k^2$.
3. As indicated in the statement of Chebyshev’s rule and as suggested in the table, you may use any value of k greater than 1, including decimals. The two most common values for k are k = 2 and k = 3. The actual proportions of observations within 2 and within 3 standard deviations can be compared to the values predicted by Chebyshev’s rule and the empirical rule (page 100). In addition, k = 2 and k = 3 provide the fundamental background to statistical inference.
4. Chebyshev’s rule is very conservative because it applies to any set of observations. Usually, the proportion of observations within k standard deviations of the mean is bigger than $1 - (1/k^2)$.
5. Chebyshev’s rule may also be used to describe a population. If the mean and standard deviation are known, then $\mu$ and $\sigma$ may be used in place of $\bar{x}$ and s. For any population, the proportion of observations that lie in the interval $(\mu - k\sigma, \mu + k\sigma)$ is at least $1 - (1/k^2)$.

Example 3.11 Automobile Battery Lifetime

In a random sample of the lifetime (in months) of a Honda Odyssey automobile battery, $\bar{x} = 54$ and s = 5.3. Use Chebyshev’s rule with k = 2 and k = 3 to describe this distribution of battery lifetimes.

SOLUTION

STEP 1 For k = 2: $1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75$
At least 3/4 (or 75%) of the observations lie in the interval $(\bar{x} - 2s, \bar{x} + 2s)$ = (54 − 2(5.3), 54 + 2(5.3)) = (43.4, 64.6).

STEP 2 For k = 3: $1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 0.89$
At least 8/9 (or 89%) of the observations lie in the interval $(\bar{x} - 3s, \bar{x} + 3s)$ = (54 − 3(5.3), 54 + 3(5.3)) = (38.1, 69.9).

STEP 3 Note also: At most 1/4 (or 25%) of the observations lie outside the interval (43.4, 64.6). At most 1/9 (or 11%) of the observations lie outside the interval (38.1, 69.9).

Example 3.12 How Long Was “In-A-Gadda-Da-Vida”?

In 1968, the psychedelic rock band Iron Butterfly recorded the 17-minute song “In-A-Gadda-Da-Vida.” Most popular songs are much shorter; for example, “Viva la Vida” by Coldplay is approximately 4 minutes long. Suppose that in a random sample of the length (in minutes) of songs produced by hard rock bands, $\bar{x} = 3.35$ and s = 0.5.
a. Find the approximate proportion of observations between 2.35 and 4.35 minutes.
b. Find the approximate proportion of observations less than 1.85 or greater than 4.85 minutes.
c. Approximately what proportion of songs lasts more than 5 minutes?

Solution Trail 3.12
KEYWORDS
■ Approximate proportion of observations between.
TRANSLATION
■ What proportion of observations is captured by the interval?
CONCEPTS
■ Chebyshev’s rule.
VISION
We don’t know anything about the shape of the distribution of the length of songs. However, Chebyshev’s rule applies to any distribution, tells us about the proportion of observations captured by certain intervals, and may be used here if the questions involve symmetric intervals about the mean.

SOLUTION

No values of k are specified, so use k = 2 and k = 3.

a. $(\bar{x} - 2s, \bar{x} + 2s)$ = (3.35 − 2(0.5), 3.35 + 2(0.5)) = (2.35, 4.35), so k = 2.
At least 1 − (1/4) = 3/4 (or 75%) of the observations lie between 2.35 and 4.35 minutes.

b. $(\bar{x} - 3s, \bar{x} + 3s)$ = (3.35 − 3(0.5), 3.35 + 3(0.5)) = (1.85, 4.85), so k = 3.
At least 1 − (1/9) = 8/9 (or 89%) of the observations lie between 1.85 and 4.85 minutes. At most 1/9 (or 11%) of the observations are less than 1.85 or greater than 4.85 minutes.

c. Since Chebyshev’s rule measures intervals in terms of the number of standard deviations from $\bar{x}$, find out how far 5 is from $\bar{x}$ in standard deviations.

$$\bar{x} + ks = 3.35 + k(0.5) = 5 \;\Rightarrow\; k = 3.3$$

We cannot assume anything about the shape of the distribution.

$$1 - \frac{1}{k^2} = 1 - \frac{1}{3.3^2} \approx 0.91$$

At least 0.91 (or 91%) of the observations lie in the interval $(\bar{x} - 3.3s, \bar{x} + 3.3s)$ = (3.35 − 3.3(0.5), 3.35 + 3.3(0.5)) = (1.7, 5.0).

Therefore, at most 1 − 0.91 = 0.09 (or 9%) of the observations are outside this interval, either less than 1.7 or greater than 5.0 minutes. We cannot assume that the distribution is symmetric, so we do not know what part of the 9% is less than 1.7 and what part is more than 5.0 minutes. To be conservative, the best we can say is that at most 9% of the observations are more than 5 minutes long.

A normal curve is bell-shaped and symmetric, centered at the mean.

TRY IT NOW

GO TO EXERCISE 3.65
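The arithmetic in part (c) of Example 3.12 can be checked with a short Python sketch. The variable names are illustrative; the values 3.35, 0.5, and 5 come from the example.

```python
xbar, s = 3.35, 0.5   # sample mean and standard deviation (minutes)
target = 5.0          # song length of interest (minutes)

# How many standard deviations is the target from the mean?
k = (target - xbar) / s

# Chebyshev's rule: at least 1 - 1/k**2 of the observations lie within
# k standard deviations of the mean, so at most 1/k**2 lie outside.
inside_at_least = 1 - 1 / k**2
outside_at_most = 1 - inside_at_least

print(f"k = {k:.1f}")
print(f"at least {inside_at_least:.2f} inside ({xbar - k * s:.1f}, {xbar + k * s:.1f})")
print(f"at most {outside_at_most:.2f} outside, so at most that proportion above {target}")
```

Because nothing is assumed about symmetry, the whole outside proportion is used as the bound for the upper tail, as in the example.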

If a set of observations can be reasonably modeled by a normal curve, then we can describe this distribution more precisely. The empirical rule involves the mean and standard deviation also, and the results apply to three specific symmetric intervals about the mean.

The Empirical Rule

If the shape of the distribution of a set of observations is approximately normal, then:
1. The proportion of observations within one standard deviation of the mean is approximately 0.68.
2. The proportion of observations within two standard deviations of the mean is approximately 0.95.
3. The proportion of observations within three standard deviations of the mean is approximately 0.997.

Figure 3.27 illustrates the empirical rule, the symmetric intervals about the mean, and the proportions. The empirical rule conclusions are more accurate than Chebyshev’s rule because we know (assume) more about the shape of the distribution (normality).

Figure 3.27 Symmetric intervals and proportions associated with the empirical rule. [Three normal curves: 0.68 between $\bar{x} - 1s$ and $\bar{x} + 1s$; 0.95 between $\bar{x} - 2s$ and $\bar{x} + 2s$; 0.997 between $\bar{x} - 3s$ and $\bar{x} + 3s$.]

A CLOSER LOOK

For now, the reasons for the proportions 0.68, 0.95, and 0.997 remain a mystery. We will discover where these numbers come from in Chapter 6.

1. Given a set of observations, the empirical rule may be used to check normality. To test for normality numerically, find the mean, standard deviation, and the three symmetric intervals about the mean $(\bar{x} - ks, \bar{x} + ks)$, k = 1, 2, 3. Compute the actual proportion of observations in each interval. If the actual proportions are close to 0.68, 0.95, and 0.997, then normality seems reasonable. Otherwise, there is evidence to suggest the shape of the distribution is not normal. This process is sort of a backward empirical rule.
2. The empirical rule may also be used to describe a population. If the distribution of the population is approximately normal, and the mean and standard deviation are known, then $\mu$ and $\sigma$ may be used in place of $\bar{x}$ and s.
3. The proportion of observations beyond three standard deviations from the mean is 1 − 0.997 = 0.003 (pretty small). Therefore, if the shape of a (population) distribution is approximately normal, it would be unusual to have an observation more than three standard deviations from the mean. What if there is one? (See Example 3.14.)
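The backward empirical rule in item 1 can be sketched in Python as follows. The function name `proportions_within` and the small data set are hypothetical, chosen only to show the mechanics; a sample this small says little about normality.

```python
import statistics


def proportions_within(data, ks=(1, 2, 3)):
    """Actual proportion of observations in (xbar - k*s, xbar + k*s)
    for each k, using the sample mean and sample standard deviation."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    n = len(data)
    return {k: sum(xbar - k * s < x < xbar + k * s for x in data) / n for k in ks}


# Hypothetical observations, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
actual = proportions_within(sample)

# Compare the actual proportions with the empirical-rule values.
for k, expected in zip((1, 2, 3), (0.68, 0.95, 0.997)):
    print(f"k = {k}: actual {actual[k]:.3f}, empirical rule {expected}")
```

How close is "close" is a judgment call; the text treats this as an informal check rather than a formal test.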


Example 3.13 Expensive Speeding Tickets

Some of the world’s most expensive speeding tickets are issued in Finland and Canada. Over a long weekend in August 2012, there were 3556 speeding tickets issued in Alberta, Canada.22 The cost of each ticket depends on the speed of the car and the posted limit. In a random sample of these ticket fines (in Canadian dollars), suppose the shape of the distribution is approximately normal, with $\bar{x} = 130$ and s = 25.
a. Approximately what proportion of observations is between 80 and 180?
b. Approximately what proportion of observations is greater than 205 or less than 55?
c. Approximately what proportion of observations is greater than 205?
d. Approximately what proportion of observations is between 105 and 180?

Solution Trail 3.13
KEYWORDS
■ Approximately normal
■ Approximate proportion of observations between
TRANSLATION
■ What proportion of observations is captured by the interval?
CONCEPTS
■ The empirical rule.
VISION
Since the shape of the distribution is approximately normal, the empirical rule may be used to determine the proportion of observations captured by certain intervals, related in some way to three special symmetric intervals about the mean.

SOLUTION

a. Find the values one, two, and three standard deviations about the mean in each direction. See Figure 3.28. Notice that (130 − 2(25), 130 + 2(25)) = (80, 180). So, 80 to 180 is a symmetric interval about the mean, two standard deviations in each direction. The empirical rule states that approximately 0.95 (or 95%) of the observations lie in this interval. See Figure 3.28.

b. Notice that (130 − 3(25), 130 + 3(25)) = (55, 205). So, 55 to 205 is a symmetric interval about the mean, three standard deviations in each direction. The empirical rule states that approximately 0.997 (or 99.7%) of the observations lie in this interval. The remaining proportion, 1 − 0.997 = 0.003 (or 0.3%), of observations lie outside this interval, greater than 205 or less than 55. See Figure 3.29.

Figure 3.28 Approximately 0.95 (or 95%) of the observations lie within two standard deviations of the mean, in the interval (80, 180).

Figure 3.29 Approximately 1 − 0.997 = 0.003 (or 0.3%) of the observations lie outside the interval (55, 205).

c. Because a normal distribution is symmetric about the mean, the remaining proportion outside three standard deviations from the mean (1 − 0.997 = 0.003) is divided evenly between the two tails. Therefore, approximately 0.003/2 = 0.0015 (or 0.15%) of the observations are greater than 205. See Figure 3.30.

d. (105, 180) is not a symmetric interval about the mean. However, approximately 0.68 of the observations lie in the interval (105, 155) (one standard deviation from the mean). Approximately 0.95 of the observations lie in the interval (80, 180) (two standard deviations from the mean). This means that 0.95 − 0.68 = 0.27 of the observations lie in the intervals (80, 105) and (155, 180). Because a normal distribution is symmetric, 0.27/2 = 0.135 of the observations lie between 155 and 180. Therefore, a total of approximately 0.68 + 0.135 = 0.815 (or 81.5%) of the observations lie between 105 and 180. See Figure 3.31.

Figure 3.30 Approximately 0.0015 (or 0.15%) of the observations are greater than 205.

Figure 3.31 Approximately 0.27 (or 27%) of the observations lie in the intervals (80, 105) and (155, 180).

TRY IT NOW GO TO EXERCISE 3.70
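The four answers in Example 3.13 all follow from the three empirical-rule proportions plus symmetry. A minimal Python sketch of that bookkeeping, with illustrative variable names:

```python
# Empirical-rule proportions within k standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

xbar, s = 130, 25  # sample mean and standard deviation (Canadian dollars)

# Symmetric intervals (xbar - k*s, xbar + k*s) for k = 1, 2, 3.
intervals = {k: (xbar - k * s, xbar + k * s) for k in WITHIN}

# a. (80, 180) is the k = 2 interval.
between_80_180 = WITHIN[2]

# b. Outside (55, 205), the k = 3 interval.
outside_55_205 = 1 - WITHIN[3]

# c. By symmetry, half of the outside proportion lies in the upper tail.
above_205 = outside_55_205 / 2

# d. (105, 180) = (105, 155) plus the upper half of the band between k = 1 and k = 2.
between_105_180 = WITHIN[1] + (WITHIN[2] - WITHIN[1]) / 2

print(intervals)
print(between_80_180, round(outside_55_205, 4), round(above_205, 4), round(between_105_180, 4))
```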

Example 3.14 When Will the Pain Stop?

First Horizon Pharmaceutical has just developed a new medicine for treatment of routine aches and pains. The company claims the distribution of pain-relief times (in hours) is approximately normal, with mean $\mu = 8$ and standard deviation $\sigma = 0.2$. A patient with a typical muscle ache is randomly selected and the medicine is administered. The patient reports pain relief for only 7 hours. Is there any evidence to refute the manufacturer’s claim?

Solution Trail 3.14
KEYWORDS
■ Approximately normal
■ Evidence to refute the claim?
TRANSLATION
■ Draw a conclusion.
CONCEPTS
■ Empirical rule
■ Inference procedure.
VISION
Because the distribution of pain-relief times is approximately normal, the empirical rule may be used to determine how often observed times in certain intervals occur. If the observed pain-relief time is rare, then we should question the manufacturer’s claim.

SOLUTION

STEP 1 Because the shape of the (population) distribution is approximately normal, the empirical rule applies (using $\mu = 8$ and $\sigma = 0.2$).
a. Approximately 0.68 of the population lies in the interval (7.8, 8.2).
b. Approximately 0.95 of the population lies in the interval (7.6, 8.4).
c. Approximately 0.997 of the population lies in the interval (7.4, 8.6).

STEP 2 The observation x = 7 hours lies outside the largest interval (7.4, 8.6). Only 1 − 0.997 = 0.003 of the population lies outside this interval. More precisely (because of symmetry), only 0.003/2 = 0.0015 of the population lies below 7.4. Seven hours is a very rare observation. Two things may have occurred.
a. Seven hours is an incredibly lucky observation. Even though the proportion of observations below 7.4 is small, it is still possible for the manufacturer’s claim to be true and for the pain reliever to last only 7 hours in this patient.
b. The manufacturer’s claim is false. Because an observation of 7 hours is so rare, it is more likely that one of the assumptions is wrong. The shape of the distribution may not be normal, the mean may be different from 8, and/or the standard deviation might be different from 0.2.

STEP 3 Typically, statistical inference discounts the lucky alternative. Therefore, because 7 hours is such an unlikely observation, there is evidence to suggest the manufacturer’s claim is false. Something is awry. We would rarely see pain relief of only 7 hours if the claim is true.


Note: We may be too quick to make an inference based on only a single observation. We will learn how to use more observations (information) to reach a more confident conclusion.

One method for comparing observations from different samples (with different units) is to use a standardized score. For a given observation, this relative measure is used to determine the distance from the mean in standard deviations.

Definition

Suppose x1, x2, . . . , xn is a set of n observations with mean $\bar{x}$ and standard deviation s. The z-score corresponding to the ith observation, $x_i$, is given by

$$z_i = \frac{x_i - \bar{x}}{s} \tag{3.7}$$

$z_i$ is a measure associated with $x_i$ that indicates the distance from $\bar{x}$ in standard deviations.

A CLOSER LOOK

Statisticians tend to measure distances in standard deviations, not miles, feet, inches, or meters. We often ask, “How many standard deviations from the mean is a given observation?”

1. $z_i$ may be positive or negative (or zero). A positive z-score indicates the observation is to the right of the mean. A negative z-score indicates the observation is to the left of the mean.
2. A z-score is a measure of relative standing; it indicates where an observation lies in relation to the rest of the values in the data set. There are other methods of standardization, but this is the most common.
3. Given a set of n observations, the sum of all the z-scores is 0; $\sum z_i = 0$. Can you prove this?
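One possible argument for item 3 is sketched below. It assumes s > 0 and uses only the definition of the sample mean, $\sum x_i = n\bar{x}$; try the proof yourself before reading it.

```latex
\sum_{i=1}^{n} z_i
  = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s}
  = \frac{1}{s}\left( \sum_{i=1}^{n} x_i - n\bar{x} \right)
  = \frac{1}{s}\left( n\bar{x} - n\bar{x} \right)
  = 0
```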

Example 3.15 Starting Salary

Most college career counselors agree that starting salary is associated with academic major. Even if a person’s first job is not related directly to his course of study, his salary may still be related to his academic major. A recent survey of academic major and starting salary of graduates showed the following information:

Major               Mean       Standard deviation
English             $38,100    $3,600
Computer science    $57,690    $5,370

A computer science major who responded to the survey received a starting salary of $64,000, and an English major received an offer of $47,000. Which salary is better, in terms of statistics?

Solution Trail 3.15
KEYWORDS
■ Which salary is better?
TRANSLATION
■ Which salary is farther away from the mean (to the right) in standard deviations?
CONCEPTS
■ z-score.
VISION
Compute and compare the z-scores for each salary. This will allow us to determine how many statistical steps each observation is from the mean. The higher the z-score, the better the salary.

SOLUTION

STEP 1 The higher starting salary is probably better (subject to working conditions, benefits, location, etc.), but to answer this question in terms of statistics, consider the z-scores.

STEP 2 Computer science major:

$$z = \frac{64{,}000 - 57{,}690}{5{,}370} \approx 1.18$$

$64,000 is approximately 1.18 standard deviations to the right of the mean.

English major:

$$z = \frac{47{,}000 - 38{,}100}{3{,}600} \approx 2.47$$

$47,000 is approximately 2.47 standard deviations to the right of the mean.

STEP 3 The English major’s starting salary is actually better, because the salary is much higher than those of most English majors.

TRY IT NOW

GO TO EXERCISE 3.73
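The z-score comparison in Example 3.15 can be reproduced with a small Python sketch of Equation 3.7. The function name `z_score` is an illustrative choice; the salary figures are those given in the example.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd


z_cs = z_score(64_000, 57_690, 5_370)       # computer science major
z_english = z_score(47_000, 38_100, 3_600)  # English major

print(f"Computer science: z = {z_cs:.2f}")
print(f"English:          z = {z_english:.2f}")

# The larger z-score marks the salary that is farther above its own group's mean.
better = "English" if z_english > z_cs else "Computer science"
print(f"Better in terms of relative standing: {better}")
```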

Example 3.16 Pet Return Policy

The owner of the Jungle Pet Store is trying to establish a policy for the return of animals. In a random sample of the lifetime (in months) of pet guinea pigs, $\bar{x} = 72$ and s = 12. One of the guinea pigs in this sample lived 62 months. Is this a reasonable lifetime, or should the store provide some sort of refund (or a new guinea pig)?23

Solution Trail 3.16
KEYWORDS
■ Is this a reasonable lifetime?
TRANSLATION
■ Draw a conclusion
■ Do you think the guinea pig should have lived longer?
CONCEPTS
■ Inference procedure
■ z-score.
VISION
For any distribution, most observations are within three standard deviations of the mean, or have a z-score between −3 and +3. Compute the z-score for this guinea pig’s lifetime, the number of statistical steps from the mean.

SOLUTION

STEP 1 To determine whether 62 months is a reasonable lifetime, consider the z-score corresponding to this observation.

STEP 2
$$z = \frac{62 - 72}{12} \approx -0.83$$
The observation (62 months) is only 0.83 standard deviations to the left of the mean. Because 62 months is within 1 standard deviation of the mean (regardless of the shape of the distribution), this is a very conservative, reasonable observation.

STEP 3 The guinea pig lived a very normal life. No refund is necessary.

TRY IT NOW

GO TO EXERCISE 3.74

Another indication of relative standing is a percentile. Do you remember all of those standardized tests in grade school? The results were usually reported in terms of percentiles. The 90th percentile was a good score and the 25th percentile meant more homework in your future.

Definition

Let x1, x2, . . . , xn be a set of observations. The percentiles divide the data set into 100 parts. For any integer r (0 < r < 100), the rth percentile, denoted $p_r$, is a value such that r percent of the observations lie at or below $p_r$ (and 100 − r percent lie above $p_r$).

Figure 3.32 The 75th percentile is illustrated using a smoothed histogram. Remember, the area under the curve between a and b corresponds to the proportion of observations between a and b. So the total area under the curve is 1.

The rth percentile has the same units as the observations, not a percent. Figure 3.32 shows a smoothed histogram and illustrates the location of the 75th percentile on the measurement axis.

A CLOSER LOOK

1. The 50th percentile is the median, $p_{50} = \tilde{x}$.
2. The 25th percentile is the first quartile and the 75th percentile is the third quartile: $p_{25} = Q_1$, $p_{75} = Q_3$.


How to Compute Percentiles

Suppose x1, x2, . . . , xn is a set of n observations.
1. Arrange the observations in ascending order, from smallest to largest.
2. To find $p_r$, compute $d_r = \dfrac{n \cdot r}{100}$.
   a. If $d_r$ is a whole number, then the depth of $p_r$ (position in the ordered list) is $d_r + 0.5$. $p_r$ is the mean of the observations in positions $d_r$ and $d_r + 1$ in the ordered list.
   b. If $d_r$ is not a whole number, round up to the next whole number for the depth of $p_r$.
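The steps above can be sketched in Python as follows. The function name `percentile` and the ten-value data set are hypothetical; integer arithmetic is used to decide whether $d_r$ is a whole number, which avoids floating-point round-off in that test.

```python
def percentile(data, r):
    """r-th percentile using the depth rule described in the text.

    d_r = n*r/100. If d_r is a whole number, average the observations in
    positions d_r and d_r + 1; otherwise round d_r up and take that position.
    """
    if not (isinstance(r, int) and 0 < r < 100):
        raise ValueError("r must be an integer with 0 < r < 100")
    ordered = sorted(data)          # Step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("data must not be empty")
    product = n * r
    if product % 100 == 0:          # Step 2a: d_r is a whole number
        d = product // 100
        return (ordered[d - 1] + ordered[d]) / 2   # positions d and d + 1 (1-based)
    d = product // 100 + 1          # Step 2b: round up to the next whole number
    return ordered[d - 1]


# Hypothetical observations, for illustration only.
values = [12, 7, 3, 9, 15, 1, 10, 5, 8, 20]
print(percentile(values, 25), percentile(values, 50), percentile(values, 75))
```

Other software may use a different rule (for example, interpolation between neighboring observations), so results can differ slightly from this procedure.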

Example 3.17 Camping Out There are 4524 campsites managed by the Minnesota Department of Natural Resources Division of Parks and Trails.24 The number of campsites utilized each day is carefully monitored, and on a randomly selected day there were 3000 campsites in use, a number that lies at the 75th percentile. Interpret these results.

SOLUTION

Here, 3000 is a single observation from the population of number of campsites used per day and percentiles divide these observations into 100 parts. Because 3000 lies at the 75th percentile, 75% of the days had 3000 or fewer campsites used, and on 100 − 75 = 25% of the days, more than 3000 campsites were utilized.

Note: We do not know anything about the shape of the distribution, nor do we know the mean or standard deviation. There is no way of telling how far 3000 is from the mean in standard deviations.

DATA SET BRIDGE

Example 3.18 Scenic Stroll Across the Brooklyn Bridge The Brooklyn Bridge in New York City is a popular tourist attraction, and many people enjoy a walk along the pedestrian walkway. A walk across the bridge takes approximately 25–60 minutes.25 A random sample of people walking across the bridge was obtained, and their times are given in the following table: 44 28 37

Solution Trail 3.18 K EYW ORDS ■

Find the time at which it took 20% of the walkers to make it across the bridge.

T RANSLATI O N ■

Find the time, t, such that 20% of all times are less than or equal to t.

C ONCEPT S ■

Percentiles.

V ISI ON

Find p20 (in minutes) so that 20% of the observations lie below and 80% lie above. Follow the steps for computing percentiles.

51 30 44

43 60 48

31 42 51

50 36 34

53 54 38

59 31 58

49 33 59

55 48 53

25 39 59

Find the time at which it took 20% of the walkers to make it across the bridge.

SOLUTION STEP 1 Order the data from smallest to largest. A portion of this ordered list is given in

the following table: Observation Position

25

28

30

31

31

33

34

36

37

38

1

2

3

4

5

6

7

8

9

10

STEP 2 Compute

d20 5

n#r 30 # 20 5 56 100 100

STEP 3 Because d20 is a whole number, add 0.5. The depth of p20 is d20 ! 0.5 # 6 ! 0.5 # 6.5.

106

CHA PTER 3

Numerical Summary Measures

STEP 4 The 20th percentile, p20, is the mean of the sixth and seventh observations.

$$p_{20} = \frac{1}{2}(33 + 34) = 33.5$$

STEP 5 Twenty percent of the walkers made it across the bridge before 33.5 minutes, and 80% made it after 33.5 minutes.

Figure 3.33 shows a technology solution.

Figure 3.33 The 20th percentile computed using CrunchIt! Note: CrunchIt! computes percentiles using a slightly different algorithm.
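The steps of Example 3.18 can also be sketched in a few lines of Python. This is only an illustration, not part of the text's Technology Corner: the function name `percentile` and the list name `bridge_times` are made up for this sketch, and the branch for a depth that is not a whole number (round up to the next whole position) is an assumption about the percentile rule, since Example 3.18 exercises only the whole-number case.

```python
import math

def percentile(data, r):
    """Return the r-th percentile (0 < r < 100) using the depth rule."""
    if not 0 < r < 100:
        raise ValueError("r must be strictly between 0 and 100")
    xs = sorted(data)                  # STEP 1: order the data
    d = len(xs) * r / 100              # STEP 2: d_r = n * r / 100
    if d.is_integer():
        # STEP 3 and 4: whole number, so the depth is d + 0.5, i.e. the
        # mean of the observations in positions d and d + 1 (1-based).
        k = int(d)
        return (xs[k - 1] + xs[k]) / 2
    # Assumed rule for the other case: round the depth up.
    return xs[math.ceil(d) - 1]

# Crossing times (in minutes) from Example 3.18.
bridge_times = [44, 28, 37, 51, 30, 44, 43, 60, 48, 31,
                42, 51, 50, 36, 34, 53, 54, 38, 59, 31,
                58, 49, 33, 59, 55, 48, 53, 25, 39, 59]

p20 = percentile(bridge_times, 20)
print(p20)  # 33.5
```

Software such as CrunchIt!, Minitab, and Excel may interpolate between observations instead, so a value that differs slightly from 33.5 is not necessarily an error.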

Technology Corner Procedure: Compute the rth percentile. Reconsider: Example 3.18, solution, and interpretation. The TI-84 does not have a built-in function to compute the rth percentile.

CrunchIt! CrunchIt! has a built-in function to find certain descriptive statistics, including the rth percentile.
1. Enter the data into a column.
2. Select Statistics; Descriptive Statistics. Choose the appropriate column and enter the desired value of r. Click the Calculate button. Refer to Figure 3.33.

Minitab The Minitab calculator function Percentile can be used to compute the rth percentile in a Session Window. 1. Enter the data into column C1. 2. In the Session Window, compute and print the appropriate percentile (Figure 3.34).

Figure 3.34 The 20th percentile computed using Minitab.

Excel 1. Enter the data into column A. 2. Use the function PERCENTILE.EXC to compute the 20th percentile. See Figure 3.35. Figure 3.35 The Excel function PERCENTILE.


SECTION 3.3 EXERCISES

Concept Check
3.67 True/False Chebyshev’s rule applies to any set of data.
3.68 True/False The conclusion in Chebyshev’s rule applies to a symmetric interval about the mean.
3.69 True/False In a smoothed histogram, the area under the curve between two points a and b corresponds to the proportion of observations between a and b.
3.70 Short Answer Why is Chebyshev’s rule conservative?
3.71 True/False The empirical rule applies to any set of observations.
3.72 Short Answer If the shape of a distribution is approximately normal, what does it mean if an observation is more than three standard deviations from the mean?
3.73 True/False A z-score is a measure of relative standing.
3.74 True/False All quartiles are percentiles.

Practice
3.75 For each data set with x̄ and s given, find a symmetric interval k standard deviations about the mean, and use Chebyshev’s rule to compute the approximate proportion of observations within this interval.
a. x̄ = 50, s = 5, k = 2
b. x̄ = 352, s = 10.5, k = 3
c. x̄ = 17, s = 3.5, k = 1.6
d. x̄ = 36.5, s = 10.45, k = 1.75
e. x̄ = 158, s = 25, k = 2.5
f. x̄ = 255, s = 0.125, k = 2.8
g. x̄ = 1.7, s = 25.8, k = 2.25
3.76 Assume the distribution of each data set is approximately normal, with x̄ and s given. Find the intervals (referred to by the empirical rule) that are one, two, and three standard deviations about the mean. Carefully sketch the corresponding normal curve for each data set, indicating the endpoints of each interval.
a. x̄ = 20, s = 5
b. x̄ = 37, s = 0.2
c. x̄ = 675, s = 250
d. x̄ = 25.5, s = 12
e. x̄ = 98.6, s = 1.7
f. x̄ = 5280, s = 150
3.77 For each data set with x̄ and s given, find the z-score corresponding to the given observation x.
a. x̄ = 8, s = 3, x = 17
b. x̄ = 100, s = 16, x = 80
c. x̄ = 15, s = 3, x = 17.5
d. x̄ = 27, s = 4.5, x = 22
e. x̄ = 122, s = 32, x = 175
f. x̄ = −105, s = 33, x = −90
g. x̄ = 6.55, s = 0.25, x = 6
h. x̄ = 64, s = 8.75, x = 100
i. x̄ = 0.025, s = 0.0018, x = 0.027
j. x̄ = 407, s = 16, x = 500
3.78 For each data set with x̄ and s given, find an observation corresponding to the z-score given.
a. x̄ = 25, s = 5, z = 2.3
b. x̄ = 9.8, s = 1.2, z = −0.7
c. x̄ = 2456, s = 37, z = 1.25
d. x̄ = 37.6, s = 5.9, z = −1.96
e. x̄ = 55, s = 0.05, z = 3.5
f. x̄ = 3.14, s = 0.5, z = 1.28
g. x̄ = 2.35, s = 0.94, z = −2.5
h. x̄ = 0.529, s = 1.9, z = 0.55
3.79 Find the position, or depth, of the indicated percentile in an ordered data set of size n.
a. n = 150, p80
b. n = 257, p35
c. n = 36, p60
d. n = 75, p40
e. n = 100, p20
f. n = 5035, p70

Applications
3.80 Demographics and Population Statistics The FBI uses public assistance in tracking criminals by maintaining the “Ten Most Wanted Fugitives” list. A fugitive is removed from this list if she is captured, the charges are dropped, or she no longer fits a certain profile. In a random sample of fugitives, the mean time on the list was 26.5 months, with a standard deviation of 4.3 months.
a. What values are one standard deviation away from the mean? What values are two standard deviations away from the mean?
b. Without assuming anything about the shape of the distribution of times, approximately what proportion of times is between 17.9 months and 35.1 months? Write a Solution Trail for this problem.
3.81 Travel and Transportation Royal Caribbean Cruises recently ordered a cruise ship similar to Oasis of the Seas, the world’s largest cruise ship. This ship has over 12 restaurants, four pools, a parklike area with trees, and even zip lines. A random sample of large passenger liners was obtained, and the cruising speed of each was recorded. The sample mean was 25.6 knots and the standard deviation was 3.4 knots. Assume the shape of the speed distribution is approximately normal.
a. What values are two standard deviations away from the mean? What values are three standard deviations away from the mean?
b. Approximately what proportion of speeds is between 22.2 and 29.0 knots?
3.82 Sports and Leisure During the Hawaiian International Billfish Tournament, teams tag and release Pacific blue marlin. During the 2012 tournament, the team headed by Sue Vermillion caught a 638-pound fish, a weight in the 85th percentile of all blue marlin caught. Interpret this value.
3.83 Manufacturing and Product Development There are approximately 3 million parts in a Boeing 777, and suppliers are all over the world. It takes considerable coordination and organization to assemble this aircraft. From the time the first part is moved from the factory to delivery of an aircraft, the mean time to assemble a Boeing 777 is 83 days.26 Suppose the standard deviation is 6 days.
a. What values are one standard deviation away from the mean? What values are two standard deviations away from the mean?
b. Without assuming anything about the shape of the distribution of times, approximately what proportion of assembly times is between 71 and 95 days?
c. Without assuming anything about the shape of the distribution of times, approximately what proportion of assembly times is either less than 65 or greater than 101 days?
d. Assuming the distribution of times is normal, what proportion of assembly times is between 71 and 95? Either less than 65 or greater than 101?
3.84 Biology and Environmental Science The Commonwealth of Pennsylvania is concerned about the dwindling number of family-owned farms and the number of smaller, less efficient farms. For a random sample, the total acreage of each farm was recorded. The mean was 1125 acres, with a standard deviation of 250. The shape of the distribution of areas is not normal.
a. Approximately what proportion of areas is between 625 and 1625 acres?
b. Approximately what proportion of areas is between 375 and 1875 acres?
c. Approximately what proportion of areas is less than 375 acres?
d. Approximately what proportion of areas is between 750 and 1500 acres?

3.85 Physical Sciences During the spring, many rivers are monitored very carefully so as to be able to warn residents of an impending flood. The depth (in feet) of the Susquehanna River at the Bloomsburg bridge is measured and reported daily. In a random sample of depths, x̄ = 16.7, s = 2.1, and the shape of the distribution is approximately normal.
a. Approximately what proportion of depths is between 14.6 and 18.8 feet?
b. Approximately what proportion of depths is less than 14.6 feet?
c. Approximately what proportion of depths is between 14.6 and 23 feet?
3.86 Biology and Environmental Science Many farmers use the height of their corn on July 4th as an indication of the entire crop. In a random sample of corn-stalk heights on July 4th in Columbia County, x̄ = 25.6, s = 0.9 inches, and a histogram of the observations is bell-shaped.
a. Approximately what proportion of observations is between 23.8 and 27.4 inches?
b. Approximately what proportion of observations is between 22.9 and 26.5 inches?
c. Approximately what proportion of observations is less than 27.4 inches?
3.87 Education and Child Development The Iowa Test of Basic Skills (ITBS) is a multiple-choice exam given to students in various grades in each state each year. The purpose is to test fundamental skills in reading, mathematics, language, social studies, and science. Scores are reported as state and/or national percentile points. Results from the 2010–2011 academic year indicate that students in grades 7–9 at Carlisle High School in Arkansas scored at the 53rd and 69th (national) percentiles in mathematics and science, respectively.27
a. Interpret these values.
b. The 50th percentile (in any subject area) is the national average. Explain the meaning of average in this context.
c. Suppose a seventh grader scored at the 99th percentile (nationally) in mathematics. Interpret this result.
3.88 Travel and Transportation Bicycle delivery services are utilized in many metropolitan areas because they can provide rush deliveries and are not subject to traffic jams or parking restrictions. Suppose an architectural firm would like to evaluate two bicycle delivery services in New York City. The first service has a mean and standard deviation for delivery (in minutes) of 37 and 5. The second service has a mean of 42 with a standard deviation of 7. The company sent two test packages to the same location, one with each delivery service. The times to delivery were 33 and 35 minutes, respectively. Use z-scores to determine which service performed better.
3.89 Business and Management The Green Mill Restaurant and Bar in Wausau, Wisconsin, is advertising quick lunches with a mean waiting time of 11 minutes and a standard deviation of 2.5 minutes. The general manager (Rob Meyer, a former statistician) also claims that the distribution of waiting times is approximately normal.
a. Suppose your waiting time is 13 minutes. Is there any reason to believe the general manager’s claim is false? (Use a z-score.) Write a Solution Trail for this problem.
b. Suppose your waiting time is 20 minutes. Now, is there any evidence to refute the general manager’s claim?
3.90 Marketing and Consumer Behavior The time spent in a grocery store is an important issue for shoppers and for companies trying to market new products. Men tend to spend less time in a grocery store than women, and people spend more time in the store on weekends. A random sample of shoppers at a local grocery store was obtained and the shopping time (in minutes) for each was recorded. (DATA SET GROCERY)
a. Construct a histogram for these data.
b. Use your histogram in part (a) to approximate the following percentiles: (i) 45th, (ii) 80th, (iii) 10th.
c. Compute the exact percentiles in part (b) and compare your results.

Extended Applications
3.91 Manufacturing and Product Development The engine in a tractor trailer is designed to last 1,000,000 miles before a rebuild or overhaul. The engines are also designed to run nonstop and have between 400 and 600 horsepower.28 A random sample of tractor trailers was obtained and the horsepower was measured for each engine. (DATA SET ENGINEHP)
a. Find the mean and the standard deviation of these horsepower measurements.
b. Find the actual proportion of observations within one standard deviation of the mean, within two standard deviations of the mean, and within three standard deviations of the mean.
c. Using the results in part (b), do you think the shape of the distribution of horsepower measurements is normal? Why or why not?
3.92 Sports and Leisure In 1974, Erno Rubik created an imaginative and best-selling puzzle, the Rubik’s cube. Many countries hold competitions in which participants try to solve this puzzle as quickly as possible. The text website provides a sample from the list of record times (in seconds) in official world competitions.29 (DATA SET CUBETIME)
a. Find the actual proportion of observations within one standard deviation of the mean, within two standard deviations of the mean, and within three standard deviations of the mean.
b. Using the results in part (a), do you think the shape of the distribution of national record times is normal? Why or why not?
c. Construct a histogram for these data. Describe the shape of the distribution.
3.93 Manufacturing and Product Development Paint viscosity is a measure of thickness that determines whether the paint will cover in a single coat. A random sample of latex paint viscosities (in KU, or Krebs units) was obtained, and the data are given in the following table: (DATA SET VISCOS)

113 124 141 115 115 129 113 129 112 112

a. Find the mean and the standard deviation for this data.
b. Find the z-score for each observation.
c. Find the mean and the standard deviation for all of the z-scores.
d. For any set of observations, can you predict the mean and standard deviation of the corresponding z-scores? Try to prove this result.

Challenge
3.94 Travel and Transportation According to the Massachusetts Bay Transportation Authority, the ride from Chestnut Hill to Boston’s Logan Airport on the MBTA takes less than 45 minutes. A random sample of travel times (in minutes) was obtained, and the results are given in the following table: (DATA SET MBTA)

46.5 38.3 39.1 41.1 42.0 39.0 34.8 36.5 38.6 38.4 37.6 44.4 41.6 45.5 42.4

a. Find the mean (x̄) and the standard deviation (s) for this data set.
b. Compute each z-score and find $\sum z_i^2$.
c. Find a general formula for $\sum z_i^2$ for any data set.
3.95 Reconsider Example 3.16. Find a good minimum guaranteed life. That is, if a guinea pig fails to reach such an age, then the store would provide a refund.

3.4 Five-Number Summary and Box Plots

A box plot, or box-and-whisker plot, is a compact graphical summary that conveys information about central tendency, symmetry, skewness, variability, and outliers. A standard box plot is constructed using the minimum and maximum values in the data set, the first and third quartiles, and the median. This collection of values is called the five-number summary.

Definition The five-number summary for a set of n observations x1, x2, . . . , xn consists of the minimum value, the maximum value, the first and third quartiles, and the median.

Recall: The range of a data set is the largest observation (maximum value) minus the smallest observation (minimum value). This descriptive statistic was our first attempt at measuring variability in a data set.

These five numbers do provide a glimpse of symmetry, central tendency, and variability in a data set. For example, minimum and maximum values that are very far apart suggest lots of variability. If the median is approximately halfway between the minimum and maximum values and approximately halfway between the first and third quartiles, that suggests the distribution is symmetric. A box plot is constructed as described below.


How to Construct a Standard Box Plot

Recall: xmin denotes the minimum value and xmax denotes the maximum value.

Given a set of n observations x1, x2, . . . , xn:
1. Find the five-number summary xmin, Q1, x̃, Q3, xmax.
2. Draw a (horizontal) measurement axis. Carefully sketch a box with edges at the quartiles: left edge at Q1, right edge at Q3. (The height of the box is irrelevant.)
3. Draw a vertical line in the box at the median.
4. Draw a horizontal line (whisker) from the left edge of the box to the minimum value (from Q1 to xmin). Draw a horizontal line (whisker) from the right edge of the box to the maximum value (from Q3 to xmax).

Figure 3.36 illustrates this step-by-step procedure and shows a standard box plot with the five numbers indicated on a measurement axis. Note that the length of the box is the interquartile range. The box contains the middle half of the values.

Figure 3.36 Standard box plot. (Values marked on the measurement axis, left to right: xmin, Q1, x̃, Q3, xmax.)

The position of the vertical line in the box (median) and the lengths of the horizontal lines (whiskers) indicate symmetry or skewness, and variability. Figure 3.37 shows a standard box plot for a distribution of data that is skewed to the right. The lower half of the data is in the interval from 3 to 4.5, while the upper half of the data is much more spread out, from 4.5 to 11. Figure 3.38 shows a standard box plot for a fairly symmetric distribution with lots of variability. The lower and upper half of the data are evenly distributed, but the whiskers extend far from each edge of the box. That is, 25% of the data are between 0 and 4, and 25% are between 7 and 11.

Figure 3.37 Standard box plot for data skewed to the right.

Figure 3.38 Standard box plot for a symmetric distribution.

Example 3.19 Blood Pressure
DATA SET SYSTOLIC

There is some evidence to suggest that consumption of nonalcoholic red wine may decrease systolic and diastolic blood pressure.30 Suppose the systolic blood pressure for 30 randomly selected subjects involved in this research study is given in the following table. Construct a standard box plot for these data.

177 122 128 191 180 142 197 196  67 160
167 138 107 188 102 116 138 114 188 176
148 175 169 203 135 142 168 181 168 150

SOLUTION
STEP 1 Find the five-number summary:

$$x_{\min} = 67 \qquad Q_1 = 135 \qquad \tilde{x} = 163.5 \qquad Q_3 = 180 \qquad x_{\max} = 203$$

STEP 2 Draw a measurement axis and sketch a box with edges at Q1 = 135 and Q3 = 180.


STEP 3 Draw a vertical line at the median, x̃ = 163.5.
STEP 4 Draw a horizontal line from Q1 = 135 to xmin = 67, and another horizontal line from Q3 = 180 to xmax = 203. The resulting box plot is shown in Figure 3.39.

Figure 3.39 Standard box plot for the systolic blood pressure data. Tick marks for Q1 = 135, x̃ = 163.5, and Q3 = 180 are added to this graph for clarity.

STEP 5 The box plot suggests the data are negatively skewed, or skewed to the left. The lower half of the data is much more spread out than the upper half. A technology solution is shown in Figure 3.40.

Figure 3.40 JMP quantile box plot.
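The five-number summary in Example 3.19 can be checked with a short Python sketch. The helper names `depth_percentile` and `five_number_summary` are made up for this illustration, and the sketch assumes that Q1, the median, and Q3 are the 25th, 50th, and 75th percentiles found with the depth rule (average two neighboring observations when the depth is a whole number, otherwise round the depth up). That assumption reproduces the values reported in the example, but other software may use a different quartile method.

```python
import math

def depth_percentile(xs_sorted, r):
    """r-th percentile of already sorted data, using the depth rule."""
    d = len(xs_sorted) * r / 100
    if d.is_integer():
        k = int(d)
        # Whole-number depth: mean of positions d and d + 1 (1-based).
        return (xs_sorted[k - 1] + xs_sorted[k]) / 2
    # Assumed rule otherwise: round the depth up.
    return xs_sorted[math.ceil(d) - 1]

def five_number_summary(data):
    """Return (xmin, Q1, median, Q3, xmax)."""
    xs = sorted(data)
    return (xs[0],
            depth_percentile(xs, 25),
            depth_percentile(xs, 50),
            depth_percentile(xs, 75),
            xs[-1])

# Systolic blood pressure readings from Example 3.19.
systolic = [177, 167, 148, 122, 138, 175, 128, 107, 169, 191,
            188, 203, 180, 102, 135, 142, 116, 142, 197, 138,
            168, 196, 114, 181, 67, 188, 168, 160, 176, 150]

xmin, q1, median, q3, xmax = five_number_summary(systolic)
print(xmin, q1, median, q3, xmax)  # 67 135 163.5 180 203

# Spread of each half of the data, as read from the box plot.
lower_spread = median - xmin   # 96.5
upper_spread = xmax - median   # 39.5
```

The lower half spans 96.5 units and the upper half only 39.5, which is the numerical counterpart of the long left whisker in the box plot.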

A CLOSER LOOK
1. A box plot has only one measurement axis, and it may be horizontal or vertical. Many software packages, including CrunchIt! and Minitab, draw box plots with a vertical measurement axis by default. The construction and interpretations are the same. The Transpose option in Minitab produces a box plot with a horizontal measurement axis.
2. The software does not usually include tick marks on the measurement axis for the five-number summary. The tick marks and scale are selected simply for convenience.

There are some disadvantages to using a standard box plot based on the five-number summary to describe a data set. By examining the graph, there is no way of knowing how many observations are between each quartile and the extreme. Each whisker is drawn from the quartile to the extreme, regardless of the number of observations in between. In addition, there are no provisions for identifying outliers. A standard (graphical) technique for distinguishing outliers is important because these values play an important role in statistical inference. Therefore, many statisticians prefer to use a modified box plot to describe a data set graphically. This type of graph still conveys information about center, variability, symmetry, and skewness, but it is also more precise and plots outliers.

How to Construct a Modified Box Plot
Given a set of n observations x1, x2, . . . , xn:
1. Find the quartiles, the median, and the interquartile range: Q1, x̃, Q3, IQR = Q3 − Q1.
2. Compute the two inner fences (low and high) and the two outer fences (low and high) using the following formulas:

$$\mathrm{IF}_L = Q_1 - 1.5(\mathrm{IQR}) \qquad \mathrm{IF}_H = Q_3 + 1.5(\mathrm{IQR})$$
$$\mathrm{OF}_L = Q_1 - 3(\mathrm{IQR}) \qquad \mathrm{OF}_H = Q_3 + 3(\mathrm{IQR})$$

Think of the interquartile range as a step. The inner fences are 1.5 steps away from the quartiles, and the outer fences are 3 steps away from the quartiles.
3. Draw a (horizontal) measurement axis. Carefully sketch a box with edges at the quartiles: left edge at Q1, right edge at Q3. Draw a vertical line in the box at the median.
4. Draw a horizontal line (whisker) from the left edge of the box to the most extreme observation within the low inner fence. This line will extend from Q1 to at most IFL. Draw a horizontal line (whisker) from the right edge of the box to the most extreme observation within the high inner fence. This line will extend from Q3 to at most IFH.
5. Any observations between the inner and outer fences (between IFL and OFL, or between IFH and OFH) are classified as mild outliers and are plotted separately with shaded circles.


Any observations outside the outer fences (less than OFL, or greater than OFH) are classified as extreme outliers and are plotted separately with open circles. Note: Some statistical packages will use other symbols for outliers and may not distinguish mild and extreme outliers. Figure 3.41 shows the relationship between construction points for a modified box plot and the location of any outliers.

Figure 3.41 Construction points for a modified box plot. (Points marked on the axis, left to right: OFL, IFL, Q1, x̃, Q3, IFH, OFH. The box has length IQR; the inner fences lie 1.5(IQR) and the outer fences 3(IQR) from the quartiles. Mild outliers fall between the inner and outer fences; extreme outliers fall beyond the outer fences.)

Example 3.20 Sled Dog Trips
DATA SET SLEDDOGS

The Ignace, Ontario, fishing and hunting resort Agimac River Outfitters offers guided sled dog trips on wooded trails and along beautiful lakes.31 Most trips last approximately 2 1/2 hours, but there are all-day trips and the weather conditions may affect the length of a scheduled trip. A random sample of sled dog trips was obtained and the length (in hours) of each was recorded. The data are given in the following table. Construct a modified box plot for these data.

0.7  1.7  0.8  2.7  1.4  0.8  2.6  3.5  1.3  0.6
9.5  2.6  2.7 11.9  3.1  7.9  0.6  6.1  5.2  1.9
2.2  3.7  0.5  1.1  1.4  3.1  0.1  2.6  4.3  4.5

SOLUTION
STEP 1 Find the quartiles, the median, and the interquartile range.

$$Q_1 = 1.1 \qquad \tilde{x} = 2.6 \qquad Q_3 = 3.7 \qquad \mathrm{IQR} = 3.7 - 1.1 = 2.6$$

STEP 2 Find the inner and outer fences.

$$\mathrm{IF}_L = 1.1 - (1.5)(2.6) = 1.1 - 3.9 = -2.8$$
$$\mathrm{IF}_H = 3.7 + (1.5)(2.6) = 3.7 + 3.9 = 7.6$$
$$\mathrm{OF}_L = 1.1 - (3)(2.6) = 1.1 - 7.8 = -6.7$$
$$\mathrm{OF}_H = 3.7 + (3)(2.6) = 3.7 + 7.8 = 11.5$$

STEP 3 Draw a (horizontal) measurement axis. Carefully sketch a box with edges at the quartiles: left edge at Q1, right edge at Q3. Draw a vertical line in the box at the median.
STEP 4 Draw a horizontal line (whisker) from the left edge of the box to the most extreme observation within the inner fence IFL (0.1). Draw a horizontal line (whisker) from the right edge of the box to the most extreme observation within the inner fence IFH (6.1).
STEP 5 Plot any mild outliers, observations between −6.7 and −2.8, or between 7.6 and 11.5. There are two mild outliers, 7.9 and 9.5. Plot any extreme outliers, observations less than −6.7 or greater than 11.5. There is one extreme outlier, 11.9.
STEP 6 The resulting modified box plot is shown in Figure 3.43. The box plot suggests the data are positively skewed, or skewed to the right. The upper half of the data is much more spread out than the lower half. There are two mild outliers and one extreme outlier.

A technology solution:
Figure 3.42 TI-84 Plus C modified box plot.
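The fence calculations and the outlier classification in Example 3.20 can be sketched in Python as follows. The function names `fences` and `classify` are hypothetical, chosen only for this illustration, and the quartiles Q1 = 1.1 and Q3 = 3.7 are taken from STEP 1 of the example rather than recomputed. With IQR = 2.6, the low outer fence is Q1 − 3(IQR) = 1.1 − 7.8 = −6.7.

```python
def fences(q1, q3):
    """Inner and outer fences from the quartiles."""
    iqr = q3 - q1
    return {
        "IFL": q1 - 1.5 * iqr,   # low inner fence
        "IFH": q3 + 1.5 * iqr,   # high inner fence
        "OFL": q1 - 3 * iqr,     # low outer fence
        "OFH": q3 + 3 * iqr,     # high outer fence
    }

def classify(data, q1, q3):
    """Return (mild outliers, extreme outliers, whisker endpoints)."""
    f = fences(q1, q3)
    mild, extreme, inside = [], [], []
    for x in sorted(data):
        if x < f["OFL"] or x > f["OFH"]:
            extreme.append(x)        # beyond an outer fence
        elif x < f["IFL"] or x > f["IFH"]:
            mild.append(x)           # between inner and outer fences
        else:
            inside.append(x)         # within the inner fences
    # Whiskers end at the most extreme observations inside the inner fences.
    return mild, extreme, (inside[0], inside[-1])

# Trip lengths (in hours) from Example 3.20.
trips = [0.7, 9.5, 2.2, 1.7, 2.6, 3.7, 0.8, 2.7, 0.5, 2.7,
         11.9, 1.1, 1.4, 3.1, 1.4, 0.8, 7.9, 3.1, 2.6, 0.6,
         0.1, 3.5, 6.1, 2.6, 1.3, 5.2, 4.3, 0.6, 1.9, 4.5]

f = fences(1.1, 3.7)
mild, extreme, whiskers = classify(trips, 1.1, 3.7)
print(mild)      # [7.9, 9.5]
print(extreme)   # [11.9]
print(whiskers)  # (0.1, 6.1)
```

Because the fence values are computed in floating point, they may print with tiny rounding error (for example 7.6000000000000005); none of the observations sits on a fence, so the classification is unaffected.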

STEPPED TUTORIAL BOX PLOTS

Figure 3.43 Modified box plot for the sled dog trip data.

Note: IFL and OFL are negative even though an observed trip time cannot be less than 0 hours. That’s OK. This is a correct statistical calculation, not a contradiction, even though it seems odd. Figure 3.42 shows a technology solution.

TRY IT NOW GO TO EXERCISE 3.91

A CLOSER LOOK
When we compare two (or more) data sets graphically, the corresponding box plots may be placed on the same measurement axis (one above the other using a horizontal axis, or side-by-side with a vertical axis). Figure 3.44 shows three box plots on the same measurement axis, representing the number of gallons of gasoline pumped in randomly selected vehicles at three different stations.

Figure 3.44 TI-84 Plus C box plots for gasoline data.

TRY IT NOW GO TO EXERCISE 3.94

Technology Corner Procedure: Construct a box plot. Reconsider: Example 3.20, solution, and interpretations.

VIDEO TECH MANUALS EXELPLOTS DISCRIPTIVE BOX

CrunchIt! CrunchIt! has a built-in function to construct a box plot.
1. Enter the data into a column.
2. Select Graphics; Box Plot. Choose the appropriate column, optionally enter a Title, X Label, and Y Label, and click the Calculate button. The resulting box plot is shown in Figure 3.45.

Figure 3.45 CrunchIt! box plot of the sled dog trip data.

TI-84 Plus C The TI-84 Plus has two built-in statistical plots, a standard and a modified box plot. The modified box plot does not distinguish between mild and extreme outliers.
1. Enter the data into list L1.
2. Press STATPLOT and select Plot1 from the STATPLOTS menu.
3. Turn the plot On and select Type box plot (modified or standard). Set Xlist to the name of the list containing the data and Freq to 1. Choose a Mark for outliers (if constructing a modified box plot) and select a Color.
4. Enter appropriate WINDOW settings. Press GRAPH to display the box plot. A modified box plot is shown in Figure 3.42.


Minitab The Minitab modified box plot does not distinguish between mild and extreme outliers.
1. Enter the data into column C1.
2. Select Graph; Boxplot and choose a One Y; Simple box plot.
3. Enter C1 in the Graph variables window. Select the Scale options button and check Transpose value and category scales to construct a box plot with a horizontal measurement axis.
4. Select the Data view options button and check Interquartile range box, and Outlier symbols for a modified box plot. Note there are only two outliers in this box plot due to the numerical method Minitab uses to compute quartiles. See Figure 3.46.

Figure 3.46 Minitab modified box plot.

Excel The following steps may be used to construct a standard box plot. Additional calculations and options are necessary to construct a modified box plot.
1. Enter the data into column A.
2. Find the five-number summary in the order shown. Highlight these five cells.
3. Under the Insert tab, select Insert Line Chart; 2-D Line; Line with Markers.
4. Under Chart tools; Design, select Switch Row/Column.
5. Right-click on the point representing the minimum value. Select Format Data Series; Line and choose No line. Repeat this process for each point.
6. Select any point on the graph. Select Add Chart Element; High-Low Lines and Add Chart Element; Up/Down Bars; Up/Down Bars.
7. Format other graph items as appropriate. See Figure 3.47.

Figure 3.47 Excel standard box plot.

SECTION 3.4 EXERCISES

Concept Check
3.96 True/False The five-number summary for a set of observations is determined by finding the most extreme observations.
3.97 True/False In a standard box plot, a line is drawn in the box at the sample mean.
3.98 True/False A box plot may reveal whether a distribution is symmetric.
3.99 True/False A modified box plot always has markers for mild and extreme outliers.

Practice
3.100 Find the five-number summary for each data set. (DATA SET EX3.100)
a. {34, 40, 34, 32, 32, 40, 35, 35, 28, 35}
b. {57, 65, 70, 71, 67, 56, 52, 66, 74, 57, 67, 78}
c. {94, 80, 91, 94, 83, 92, 83, 93, 96, 80, 87, 98, 81, 93}
d. {2.3, 1.8, 2.1, 1.0, 2.4, 2.3, 0.4, 9.8, 0.6, 1.4, 3.1, 10.9, 3.8, 0.5, 0.9, 2.2, 1.3, 1.3}
e. {166.8, 103.1, 119.9, 141.9, 110.6, 189.8, 121.6, 141.6, 133.6, 178.2, 158.9, 145.9, 139.1, 148.6, 135.0, 174.0, 152.4, 119.7, 196.9, 118.7, 159.7, 150.3, 113.8, 108.9, 163.2}
f. {−33.8, −9.8, −18.5, −11.5, −36.3, −33.1, −21.1, −26.2, −25.4, −32.1, −35.9, −28.0, −38.2, −12.0, −29.2, −40.1, −13.1}
3.101 Construct a standard box plot for each five-number summary.
a. xmin = 15.3, Q1 = 21.8, x̃ = 25.3, Q3 = 28.2, xmax = 34.2
b. xmin = 70.9, Q1 = 167.8, x̃ = 187.1, Q3 = 225.3, xmax = 329.3
c. xmin = 0.06, Q1 = 5.3, x̃ = 13.7, Q3 = 30.8, xmax = 122.3
d. xmin = 10.1, Q1 = 10.7, x̃ = 11.3, Q3 = 12.5, xmax = 26.7
3.102 For each data set with Q1 and Q3 given, find the interquartile range and the inner and outer fences.
a. Q1 = 22, Q3 = 46
b. Q1 = 1255, Q3 = 1306
c. Q1 = 65.75, Q3 = 75.21
d. Q1 = 914.9, Q3 = 1140.5
e. Q1 = 1.275, Q3 = 4.07
f. Q1 = 0.265, Q3 = 2.51
g. Q1 = −33.67, Q3 = −23.90
h. Q1 = 98.43, Q3 = 98.81
3.103 For each data set with Q1 and Q3 given, determine whether the observation x is a mild outlier, an extreme outlier, or neither.
a. Q1 = 20, Q3 = 29, x = 35
b. Q1 = 486.1, Q3 = 510.9, x = 440
c. Q1 = 5.18, Q3 = 6.32, x = 4.2
d. Q1 = 96.3, Q3 = 101.1, x = 116.5
e. Q1 = 68.92, Q3 = 69.07, x = 68.4
f. Q1 = 101.26, Q3 = 144.59, x = 132.6
3.104 For each box plot, find the five-number summary. (Estimate these numbers as best you can by using the tick marks on each graph.)
[Box plots for parts a through g, each drawn on its own measurement axis.]

Applications
3.105 Business and Management A Roth’s supermarket in Salem, Oregon, is using a new statistical tool to help in ordering bottles of raspberry iced tea. A random sample of the number of bottles sold per day is given in the following table. Construct a modified box plot for these data. Describe the distribution of the number of bottles sold. (DATA SET BOTTLES)

48 52 46 58 50 46 59 51 46 48
45 47 50 48 49 49 49 48 48 45

3.106 Travel and Transportation Major airlines compete for customers by advertising on-time arrival. A random sample of flights arriving at the Philadelphia International Airport was obtained and the actual arrival time was compared to the scheduled arrival time. The differences (in minutes) are given in the following table (negative numbers indicate the flight arrived before the scheduled arrival time).32 Construct a modified box plot for these data. Describe the distribution in terms of symmetry, skewness, and variability. Are there any outliers? If so, are they mild or extreme? (DATA SET ARRIVAL)

 14  64  −5  −5  74 215  19 202 113  90
135 121   6 −18 −17  −5  82 116  18  82
210  98  90 175   6  42  35  54  10  12


3.107 Public Policy and Political Science A recent study reported the school tax bills (in dollars) for randomly selected families in Hillsboro, New Hampshire, a small rural community. Use the following modified box plot to describe the distribution of the data.
[Modified box plot; measurement axis from 1300 to 1900 dollars.]
3.108 Education and Child Development Dr. Jan Remer, a psychologist in Hermitage, Tennessee, randomly selected six-year-olds and recorded the time (in minutes) each child needed to complete a 20-piece picture puzzle. The data were used to predict readiness for first grade. The following standard box plot for the data was drawn. Describe the distribution of completion times.
[Standard box plot; measurement axis from 1.0 to 5.0 minutes.]

3.109 Public Health and Nutrition As part of a new physical fitness program, the Crooked Oak Middle School in Oklahoma City, Oklahoma, records the number of sit-ups each sixth-grade student can do in one minute. A random sample for males and females was obtained and the following modified box plots were drawn. Describe the male and female data separately. What similarities and/or differences do the box plots suggest?
[Modified box plots for Males and Females on the same measurement axis, from 15 to 55 sit-ups.]
3.110 Economics and Finance Many government program budgets are determined by the annual inflation rate. The inflation rate (as a percent change in the consumer price index since 1913) for the United States for the years 1965–2011 are given on the text website.33 Construct a modified box plot for these data. Describe the distribution of inflation rate over the past 40 years in terms of symmetry, skewness, variability, and outliers. (DATA SET INFRATE)
3.111 Public Policy and Political Science According to US Vending Company, the national average commission per machine per month is approximately $35.34 Suppose the Culver City, California, City Council recently entered into an agreement with Coca-Cola Bottling Company. As part of this contract, two US Vending snack machines were installed, one at Veteran’s Park and the other at the Teen Center. The 2012 monthly commission checks (in dollars) associated with these machines are given in the following table: (DATA SET VENDING)

 51.24  27.60  37.80  44.76  24.36  14.04
119.52 123.48  38.48  35.28  16.92  47.25
 25.65  31.95  40.50  22.50  33.75  59.85
 36.90  18.45  31.95  18.45  19.50  22.65

a. Construct a modified box plot for these data. Describe the distribution in terms of symmetry, skewness, variability, and outliers.
b. Construct a standard box plot for these data. Which plot do you think is more descriptive? Why?
3.112 Biology and Environmental Science Several weather centers across the country carefully track and record data for tropical disturbances during the hurricane season. The following table from the 2012 Atlantic hurricane season contains the minimum central pressure (in millibars) for each named tropical storm and hurricane.35 (DATA SET STORMS)

 998  992  987  990  980 1000  965
1004  968 1006  970  968  964  978
 997 1005  969  940 1000

Construct a modified box plot for the hurricane data. Describe the distribution in terms of symmetry, skewness, variability, and outliers. How would the graph change if a standard box plot were used?

Extended Applications

3.113 Public Policy and Political Science The amber-light time at an intersection usually varies according to the speed limit.36 A random sample of amber-light times (in seconds) at intersections with different speed limits is given below. Construct a modified box plot for each set of data on the same measurement axis. Describe and compare the distributions. Do the box plots suggest that amber-light times are longer at intersections with a higher speed limit? (Data set: AMBERLT)

30 mph 3.7 3.1

3.7 3.9

3.4 4.3

2.2 1.8

3.4 3.0

3.6 1.6

3.5 4.5

3.8 2.7

2.6 2.2

2.1 2.4

4.3 6.6

4.1 5.0

4.8 5.5

5.4 3.6

2.9 3.7

3.9 5.9

2.1

6.9

50 mph 4.1 4.6

4.7 3.3

3.114 Business and Management In an attempt to control costs, a manager at San José State University is carefully looking at the number of photocopies made by faculty members at the campus duplicating center. Random samples of liberal arts faculty and of natural science faculty were obtained. The number of copies made by each faculty member is given on the text website (data set: COPIES).
a. Construct a modified box plot for each data set on the same measurement axis.
b. Compare the distributions in terms of symmetry, skewness, variability, and outliers.
c. Is there any graphical evidence to suggest that one group of faculty uses the copy center more than the other?

3.115 Public Health and Nutrition The U.S. Food and Drug Administration has become concerned about claims made by companies selling vitamin and mineral tablets. A random sample of 400-mg vitamin C tablets was obtained and analyzed by an independent laboratory for exact vitamin C content. The results (in milligrams) are given on the text website (data set: VITC).
a. Construct a modified box plot for the vitamin C data.
b. Describe the distribution in terms of symmetry, skewness, variability, and outliers.
c. Does the box plot suggest any graphical evidence that the claim of a 400-mg content is wrong?

3.116 Public Policy and Political Science If you receive a jury summons, you are obligated to appear in court. There are, however, several general (honest) instant, temporary, and hardship excuses to avoid jury duty. For example, firefighters and physicians may be automatically excused from serving. The compensation for serving on a jury varies by state; some states reimburse child-care expenses and/or transportation costs; some states do not compensate jurors for the first few days of service. A sample of states was obtained, and the juror pay per day as of January 1, 2012, for each was recorded.37 The data are given in the following table (data set: JURY).

State           Pay      State            Pay
Alabama         10.00    Alaska           25.00
Arkansas        50.00    California       15.00
Delaware        20.00    Hawaii           30.00
Idaho           10.00    Illinois          4.00
Kentucky        12.50    Maine            10.00
Maryland        15.00    Massachusetts    50.00
Missouri         6.00    Nebraska         35.00
New Mexico      40.00    North Carolina   12.00
Pennsylvania     9.00    Texas             6.00
Virginia        30.00    Wyoming          40.00

a. Construct a modified box plot for the jury pay per day data.
b. Describe the distribution in terms of symmetry, skewness, variability, and outliers.
c. Suppose that, in January 2013, the $9.00 rate in Pennsylvania was raised to $60.00 and all other rates given remained the same. Change the value for Pennsylvania to $60.00 and construct a new modified box plot. How does this new box plot compare with the one in part (a)? Describe any similarities and/or differences.

3.117 Economics and Finance In January 2006, Wisconsin joined at least 18 other states by posting on websites the names of people and businesses that owe back taxes. The law in Wisconsin requires the Department of Revenue to list those who owe at least $25,000. These “websites of shame” are designed to help states collect additional tax money during tight budget times. A random sample of names from the Wisconsin list as of October 4, 2012, was obtained, and the taxes owed by each individual or business are given on the text website.38 Construct a modified box plot for these data and describe the distribution in terms of symmetry, skewness, variability, and outliers. (Data set: TAXLIST)

CHAPTER 3 SUMMARY

Concept                              Notation / Formula / Description
Sample mean                          $\bar{x} = \frac{1}{n}\sum x_i$: the sum of the observations divided by $n$.
Population mean                      $\mu$: the mean of an entire population.
Sample median                        $\tilde{x}$: the middle value of the ordered data.
Population median                    $\tilde{\mu}$: the middle value of an entire population.
Trimmed mean                         $\bar{x}_{tr(p)}$: the sample mean of a trimmed data set.
Mode                                 The value that occurs most often.
Sample proportion of successes       $\hat{p} = \frac{n(S)}{n}$: the relative frequency of occurrence of successes.
Population proportion of successes   $p$: the true proportion of successes in an entire population.
Sample range                         $R$: the largest observation ($x_{\max}$) minus the smallest observation ($x_{\min}$).
Deviation about the mean             $x_i - \bar{x}$.
Sample variance                      $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]$
Sample standard deviation            $s = \sqrt{s^2}$: the positive square root of the sample variance.
Population variance                  $\sigma^2$: the variance for an entire population.
Population standard deviation        $\sigma = \sqrt{\sigma^2}$: the positive square root of the population variance.
Quartiles                            The quartiles divide the data into four parts. $Q_1$ is the first quartile and $Q_3$ is the third quartile. $Q_2$ is the median.
Interquartile range                  $\text{IQR} = Q_3 - Q_1$
Chebyshev’s rule                     For any set of observations, the proportion of observations within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$.
Empirical rule                       If a distribution is approximately normal, the proportion of observations within one, two, and three standard deviations about the mean is approximately 0.68, 0.95, and 0.997, respectively.
z-score                              $z_i = \frac{x_i - \bar{x}}{s}$: how far an observation is from the mean in standard deviations.
Percentiles                          The percentiles divide a data set into 100 parts.
Five-number summary                  $x_{\min}, Q_1, \tilde{x}, Q_3, x_{\max}$
Box plot                             A graphical description of a data set, constructed using the five-number summary. The graph conveys information about central tendency, symmetry, skewness, and variability.
Modified box plot                    A graphical description of a data set, constructed using $\tilde{x}$, $Q_1$, $Q_3$, IQR, and the inner and outer fences. This box plot also indicates any outliers.
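As a quick numerical check on several of the summary formulas, the following Python sketch computes the sample mean, the sample variance (by the definition and by the computational formula), z-scores, and the proportion of observations within k standard deviations of the mean, alongside the Chebyshev lower bound. The data values and function names are made up for illustration and do not come from the text.

```python
import math

def sample_mean(xs):
    """Sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Definition: sum of squared deviations about the mean, divided by n - 1."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_variance_shortcut(xs):
    """Computational formula: [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def z_scores(xs):
    """Distance of each observation from the mean, in standard deviations."""
    xbar = sample_mean(xs)
    s = math.sqrt(sample_variance(xs))
    return [(x - xbar) / s for x in xs]

def proportion_within(xs, k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(1 for z in z_scores(xs) if abs(z) <= k) / len(xs)

def chebyshev_bound(k):
    """Chebyshev's rule: at least 1 - 1/k^2 of the observations."""
    return 1 - 1 / k ** 2

# Hypothetical observations, chosen only to illustrate the calculations.
data = [12.0, 15.0, 11.0, 14.0, 18.0, 13.0, 16.0, 13.0]

print(sample_mean(data))
print(sample_variance(data), sample_variance_shortcut(data))
print(proportion_within(data, 2), chebyshev_bound(2))
```

The two variance functions should agree up to rounding error, and the observed proportion within two standard deviations should never fall below the Chebyshev bound.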

CHAPTER 3 EXERCISES


APPLICATIONS

3.118 Public Health and Nutrition Most multivitamins contain calcium for strong bones and to lower the risk of heart disease. A random sample of multivitamins was obtained, and the calcium content for each (in milligrams) is given in the following table (data set: CALCIUM):

156 151 173 201 182 166 173 180 174 185
160 178 173 169 203 190 187 202 173 171

a. Find the mean, the variance, and the standard deviation.
b. Find the proportion of observations within one, two, and three standard deviations about the mean.
c. Using the proportions obtained in part (b), do you think the distribution of observations is normal? Why or why not?

3.119 Manufacturing and Product Development Many boxed cake mixes include special high-altitude baking instructions. To determine any difference between baking times at low and high altitudes, the consumer group Public Citizen made several similar cakes in nine-inch round pans in Miami and Denver, and carefully recorded the time to bake (in minutes). The data are given in the following table (data set: CAKEMIX).

Low-altitude times (Miami)
25.1 25.6 24.9 23.7 25.5 22.4 24.7 24.2 25.6 24.8
23.9 24.4 24.7 24.4 26.4 24.7 24.7 26.8 24.9 24.3

High-altitude times (Denver)
22.8 30.0 27.3 30.3 28.3 31.1 27.0
26.8 26.3 29.1 23.5 26.2 29.2 23.0

a. Construct a modified box plot for each data set on the same measurement axis.
b. Describe each box plot in terms of center, shape, spread, and outliers.
c. Describe the similarities and differences between the two distributions.

3.120 Sports and Leisure The longest-running U.S.-produced, fictional-content television show by number of episodes is WWE Raw.39 Other long-running shows of this type include Gunsmoke, Lassie, Ozzie and Harriett, and Bonanza. Law and Order is currently number 6, with approximately 500 shows. The number of episodes for some of the top 115 shows is given in the following table (data set: TVSHOW):

633 357 260

588 344 254

505 336 243

500 331 227

456 291 223

452 296 216

435 286 213

430 284 212

369 278 180

361 271 160

a. Find the median, the first and third quartiles, and the interquartile range.
b. Find the 30th and the 95th percentiles.
c. Suppose Dallas currently has 357 episodes (as of May 2014). Using the data in the table, in what percentile does this episode count lie?

3.121 Manufacturing and Product Development The most popular wind turbine sold by General Electric has a rated mean electrical generating capacity of 1.5 megawatts (MW)40 with a standard deviation of 0.07 MW. A quality control engineer is trying to develop a plan for routine maintenance based on z-scores.
a. Suppose a randomly selected wind turbine is inspected and found to have a generating capacity of 1.54 MW. Is there any reason to believe this generating capacity is unusual? Why or why not?
b. Suppose another randomly inspected wind turbine has a generating capacity of 1.3 MW. Is there any reason to believe this generating capacity is unusual? Why or why not?

3.122 Sports and Leisure String tension in tennis rackets is usually measured in pounds. Recommended string tensions are usually in the mid-60s (pounds) for oversize rackets, and high 50s to low 60s for mid-overs. Higher tensions tend to decrease the size of the “sweet spot” and reduce power, but increase control. This book’s website presents the results from a random sample of string tension from tennis rackets of players on the professional tour (data set: RACKETS).
a. Find the range, sample variance, interquartile range, coefficient of variation, and coefficient of quartile variation for each type of racket. (CV and CQV were defined in Exercise 3.51.)
b. Using the results from part (a), compare the variability in string tension for the two types of rackets.
c. Construct a modified box plot for each type of racket on the same measurement axis. Does this graphical comparison support your numerical comparison in part (b)?

3.123 Manufacturing and Product Development Many homes that use forced hot air for heat have air ducts installed in every room. A system using galvanized pipe is constructed to distribute heat throughout the house. A random sample of six-inch diameter, five-foot long, 28-gauge galvanized pipe was obtained from various manufacturers and the weights (in pounds) are given on the text website (data set: AIRDUCT).

a. Find the sample mean and the sample median.
b. Use your results from part (a) to describe the symmetry of the distribution.
c. Find a 10% trimmed mean. Is the use of a trimmed mean to measure central tendency justified (or necessary) in this case? Why or why not?

3.124 Medicine and Clinical Studies Although caffeine is believed to be safe in moderate amounts, some health experts suggest that 300 mg of caffeine (the amount in about three cups of coffee) is a moderate intake.41 The amount of caffeine in a cup of coffee varies according to coffee bean, brewing technique, filter, etc. A random sample of eight-ounce cups of coffee was obtained and the caffeine content (in milligrams) was measured. The data are given in the following table (data set: CAFFEIN):

89 75 90 115 88 96 107 106 93
97 95 101 115 112 100 71 109 89

a. Find the mean, median, variance, and standard deviation.
b. Construct a modified box plot for these data.
c. Use your results from parts (a) and (b) to describe the data.
d. Based on your results in parts (a) and (b), do you believe a person who drinks three cups of coffee ingests a moderate amount of caffeine? Justify your answer.

3.125 Sports and Leisure World of Warcraft is one of the most popular multiplayer video games. The number of subscribers to World of Warcraft from 1st quarter 2005 to 3rd quarter 2012 (in millions) is given on the text website.42 (Data set: GAMING)
a. Find the mean, variance, and standard deviation.
b. Find the proportion of observations within one, two, and three standard deviations about the mean. Use these proportions to determine whether the distribution of subscribers is approximately normal.
c. Blizzard Entertainment has decided to advertise more if the number of subscribers per quarter drops below a certain threshold. Using the data above, find the number of subscribers, c, so that 90% of all values are at or below c.

3.126 Education and Child Development The time (in minutes) it takes to read a certain passage is part of an elementary school assessment test. Two different groups were given the same passage to read. One group received a standard reading curriculum, and the other was given reading instruction based on the “whole language” paradigm. The results are given on the text website (data set: READING).
a. Find the mean, variance, and standard deviation for each group.
b. Construct a modified box plot for each group and display the graphs on the same measurement axis.
c. Based on your results in parts (a) and (b), describe any differences in reading speed distributions.

3.127 Physical Sciences A standard often used for measuring brightness is lux. For example, bright moonlight has 0.1 lux and bright sunshine has 100,000 lux. The light required for general office work is approximately 400 lux. A random sample of the brightness in office cubicles was obtained, and the data are given on the text website (data set: CUBELUX).
a. Find the mean, variance, and standard deviation.
b. Construct a modified box plot for these data. Classify any outliers as mild or extreme.
c. Using the data in the table, in what percentile does 400 lux lie?
d. Use Chebyshev’s rule to describe this data set (k = 2, 3).


3.128 Manufacturing and Product Development The density of tires is an important selling point for serious mountain bike riders. The tire industry uses a type A durometer to measure the indentation hardness for mountain bike tires. Suppose the distribution of tire hardness is approximately normal, with mean 45 and standard deviation 7.
a. Carefully sketch the normal curve for tire hardness.
b. Is a tire hardness of 30 unusually soft? Justify your answer.
c. A certain bicycle shop claims the hardness of all its tires is in the 84th percentile. If this is true, what is the minimum hardness of any tire in the store?

3.129 Physical Sciences There were approximately 17,000 earthquakes around the world in 2012.43 A random sample of the magnitudes (on the Richter scale) of these earthquakes during November is given on the text website (data set: QUAKES).
a. Find the mean, median, variance, and standard deviation of the magnitudes.
b. Find the 40th and the 80th percentiles.
c. How likely is a magnitude of 4.8? Justify your answer.

EXTENDED APPLICATIONS

3.130 Physical Sciences Hydraulic fracturing, or fracking, is a method used to extract natural gas from deep shale deposits. This process involves over 500 chemicals and millions of gallons of water. In a random sample of fracking wells, the mean depth was 8000 feet.44 Assume the standard deviation is 450 feet and the distribution of depths is approximately normal.
a. What proportion of wells have depths between 7100 and 8900 feet?
b. What proportion of wells have depths less than 6650 feet?
c. What proportion of wells have depths between 7550 and 9350 feet?
d. Suppose a new fracking well was drilled in 2012 to a depth of 8255 feet. Is there any evidence to suggest that the mean depth of wells has changed?

3.131 Physical Sciences A building code officer inspected random home fire extinguishers for pressure (in psi), and the data are given on the text website (data set: FIREX).
a. Construct a modified box plot for these data.
b. Use the empirical rule to decide whether this distribution of pressures is approximately normal.
c. Create a new set of observations, $y_i = \ln(x_i)$, where ln is the natural logarithm function. Construct a modified box plot for this new set of data. Use this graph and the empirical rule to decide whether the distribution of the transformed data is approximately normal.

3.132 Manufacturing and Product Development America’s favorite candy is M&Ms, with over $670 million in annual sales. M&Ms were originally sold in tubes and are now available in several versions and can even be personalized.45 Suppose the manufacturer (Mars) claims the mean weight of a single M&M is 0.91 gram with standard deviation 0.04 gram.

a. Without assuming anything about the shape of the distribution of M&M weights, what proportion of M&Ms have weights between 0.83 and 0.99 gram?
b. Suppose a random M&M has weight 0.74 gram. Do you believe the manufacturer’s claim about the mean weight? Justify your answer.

3.133 Manufacturing and Product Development The actual width of a 2 × 4 piece of lumber is approximately 1¾ inches but can vary considerably. The Lumber Yard in Martinsburg, West Virginia, advertises consistent dimensions for better building, and claims all 2 × 4s sold have a mean width of 1¾ inches with a standard deviation of 0.02 inch.
a. Assume the distribution of widths is approximately normal. Find a symmetric interval about the mean that contains almost all of the 2 × 4 widths.
b. Suppose a random 2 × 4 has width 1.79 inches. Is there any evidence to suggest The Lumber Yard’s claim is wrong? Justify your answer.
c. Suppose a random 2 × 4 has width 1.68 inches. Is there any evidence to suggest The Lumber Yard’s claim is wrong? Justify your answer.

3.134 Biology and Environmental Science Some fish have been found to have mercury levels greater than 1 ppm (parts per million), a level considered safe by the U.S. Food and Drug Administration. Suppose the mean mercury level for smallmouth bass in the Susquehanna River is 0.7 ppm with standard deviation 0.1 ppm, and the distribution of mercury level is approximately normal.
a. Is it likely a fisherman will catch a smallmouth bass with mercury level greater than 1 ppm? Justify your answer.
b. Suppose the standard deviation is 0.05 ppm. Now, is it likely a fisherman will catch a smallmouth bass with mercury level greater than 1 ppm? Justify your answer.
c. Carefully sketch the normal curves for parts (a) and (b) on the same measurement axis.

3.135 Sports and Leisure The longest-running Broadway show, with over 10,000 performances, is Phantom of the Opera, surpassing Cats in 2006. A sample of Broadway shows was obtained, and the number of performances of each was recorded.46 (Data set: SHOWS)
a. Find the sample mean and the sample median number of performances. What do these values suggest about the shape of the distribution?
b. Find the sample variance and the sample standard deviation. Find the proportion of observations within one standard deviation of the mean, within two standard deviations of the mean, and within three standard deviations of the mean. What do these proportions suggest about the shape of the distribution?
c. Find the first quartile, the third quartile, and the interquartile range. Construct a modified box plot for the performance data. Use this graph to describe the distribution in terms of symmetry, skewness, variability, and outliers. Does your description based on the box plot agree with your answers to parts (a) and (b)? Why or why not?
d. Find out how many performances there have been for Phantom of the Opera and add this value to the data set. How will this value affect the sample mean, sample median, sample variance, and quartiles? Find these values and verify your predictions.

LAST STEP

3.136 How efficient is the Canadian Pacific Railway? To increase efficiency, officials at the Canadian Pacific Railway monitor several variables including train speed, cars on each train, and terminal dwell time. In addition, the type and amount of freight is carefully recorded for each train. The following table shows the number of carloads of grain mill products for 30 randomly selected weeks in 2011 and 2012 (data set: RAILWAY).

572 610 718

711 611 707

582 557 673

663 685 697

612 683 808

577 629 755

650 626 438

550 637 569

590 634 684

659 723 637

a. Compute the summary statistics for these data, including the mean, median, variance, standard deviation, and quartiles.
b. Construct a box plot for these data.
c. Describe the distribution. Identify any outliers.
d. Find the proportion of observations within one, two, and three standard deviations of the mean. Compare the results to the empirical rule.
e. Find the 90th percentile.

4 Probability

Looking Back
■ Understand the relationships among a population, a sample, probability, and statistics.

Looking Forward
■ Learn the definition of probability and useful probability concepts.
■ Compute the probability of various events involving counting techniques, independence, or conditional probability.

What are the chances of winning a prize in Monopoly Sweepstakes?

In 2012, McDonald’s once again offered customers an opportunity to play the Monopoly Game. The objective of this contest was to collect McDonald’s Monopoly game pieces corresponding to the properties from the original Monopoly board game. Customers could collect the game pieces with the purchase of certain McDonald’s products, and some game pieces were instant winners. However, special collections of properties were worth big prizes, including cash, cars, and vacations. Each set of properties consists of two to four game pieces, corresponding to squares on the Monopoly board. In the McDonald’s game, one game piece in each set of properties is rare. Here is a list of a few property collections, the rare piece, and the probability of finding the rare piece.

St. James Place, Tennessee Avenue, and New York Avenue: $10,000 cash. Probability of finding Tennessee Avenue: 0.00000000193
Park Place, Boardwalk: $1,000,000 annuity. Probability of finding Boardwalk: 0.00000000326
Reading Railroad, Pennsylvania Railroad, B&O Railroad, Short Line: EA Sports fan trip. Probability of finding Short Line: 0.00000000185

The techniques presented in this chapter will be used to determine the probability of winning at least one of the property prizes. Hint: It’s going to take a few Big Macs!1

CONTENTS
4.1 Experiments, Sample Spaces, and Events
4.2 An Introduction to Probability
4.3 Counting Techniques
4.4 Conditional Probability
4.5 Independence

4.1 Experiments, Sample Spaces, and Events

To understand probability concepts, we need to think carefully about experiments. Consider the activity, or act, of tossing a coin, selecting a card from a standard poker deck, counting the number of contaminants in 1 cm³ of drinking water, or even testing a cell phone for defects before shipment. In every one of these activities, the outcome is uncertain. For example, when we test a new cell phone, we do not know (for sure) whether it will be defect-free. This idea of uncertainty leads to the definition of an experiment.

Definition
An experiment is an activity in which there are at least two possible outcomes and the result of the activity cannot be predicted with absolute certainty.

Here are some examples of experiments.
1. Roll a six-sided die and record the number that lands face up. We cannot say with certainty that the number face up will be a 1, or a 2, etc., so this activity is an experiment.
2. Using a radar gun, record the speed of a pitch at a Red Sox baseball game. We’re not sure whether the pitch will be a fastball, curveball, slider, etc. And even if we steal the signal from the catcher, we cannot predict the speed of the pitch with certainty.
3. Count the number of patients who arrive at the emergency room of a city hospital during a 24-hour period. Although past records might help us form an estimate, there is no way of predicting the exact number of emergency room patients during a 24-hour period.
4. Select two Keurig Home Brewers and inspect each for flaws in materials and workmanship. Even though a strict quality control process might be in place, there is no way of knowing whether both Keurigs will be flawless, one will contain a flaw, or both will have flaws.

Because we don’t know for sure what will happen when we conduct an experiment, we need to consider all possible outcomes. This sounds easy (just think about all the things that can happen), but it can be tricky. Sometimes it involves a lot of counting, but often outcomes can be visualized using a tree diagram. Consider the following examples.

Example 4.1 Social Security Numbers
Suppose a U.S. citizen is selected and the last digit of her Social Security number is recorded. How many possible outcomes are there, and what are they?

This is an experiment, because we cannot predict the last digit with certainty.

SOLUTION
STEP 1 The last digit of a person’s Social Security number can be any integer from 0 to 9.
STEP 2 There are 10 possible outcomes. The outcomes are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Example 4.2 Buckle Up
Two drivers on the Pennsylvania Turnpike are selected at random and checked for compliance with the seatbelt law. How many possible outcomes are there, and what are they?

SOLUTION
STEP 1 If a driver is wearing a seatbelt, denote this observation by R (for restrained), and if he is not wearing a seatbelt, use U (for unrestrained).
STEP 2 Each outcome is a pair of observations, one on each driver. There are four possible outcomes, and here is a way to write them: RR, RU, UR, UU. The first letter indicates the observation on the first driver, and the second letter indicates the observation on the second driver. RU is a different outcome from UR. RU means the first driver was wearing a seatbelt and the second driver was not. UR means the first driver was not wearing a seatbelt and the second driver was.

There are lots of other ways to denote these four outcomes. There is no single correct notation. Write the outcomes so that others can understand and interpret your list.

All of the outcomes from the experiment in Example 4.2 can be determined by constructing a tree diagram, a visual road map of possible outcomes. Figure 4.1 is a tree diagram associated with this experiment.

Tree diagrams will also be extremely useful for determining probabilities in problems involving Bayes’s rule. Problems of this type are presented in Section 4.5.

[Figure 4.1 Tree diagram for Example 4.2. The first-generation branches R and U each split into second-generation branches R and U, giving the outcomes RR, RU, UR, and UU.]

The first-generation branches indicate the possible choices associated with the first driver, and the second-generation branches represent the choices for the second driver. A path from left to right represents a possible experimental outcome.
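The left-to-right paths of a symmetric tree diagram can also be listed mechanically. The following Python sketch is one way to do this for the seatbelt experiment; the function name `seatbelt_outcomes` is made up for illustration, and the approach assumes every driver has the same two possible observations, R and U.

```python
from itertools import product

def seatbelt_outcomes(n_drivers):
    """List every path through the tree: one letter (R or U) per driver."""
    return ["".join(path) for path in product("RU", repeat=n_drivers)]

# Two drivers, as in Example 4.2.
print(seatbelt_outcomes(2))  # ['RR', 'RU', 'UR', 'UU']
```

Each element of the list corresponds to one path through the tree, read from the first-generation branch to the last.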

Example 4.3 Buckle Up (Continued) Extend the previous example. How many outcomes are there if we stop three drivers and record their seatbelt status?

SOLUTION
Now there are eight possible outcomes: RRR, RRU, RUR, RUU, URR, URU, UUR, UUU. Figure 4.2 is a tree diagram for this extended experiment. Again, every path from left to right represents a possible outcome.

[Figure 4.2 Tree diagram for Example 4.3. First-, second-, and third-generation branches each split into R and U, giving the outcomes RRR, RRU, RUR, RUU, URR, URU, UUR, and UUU.]


A CLOSER LOOK
1. Tree diagrams are a fine technique for finding all the possible outcomes for an experiment. However, they can get very big, very fast.
2. A tree diagram does not have to be symmetric, as they are in Figures 4.1 and 4.2. The branches and paths depend on the experiment. Consider the next example.

Tree diagrams are also used to prove the multiplication rule (Section 4.3), an arithmetic technique used to count the number of possible outcomes in certain experiments.

Example 4.4 Breakfast of Champions A consumer in Clarkdale, Arizona, is searching for a box of his favorite breakfast cereal. He will check all three grocery stores in town if necessary, but will stop if the cereal is found. The experiment consists of searching for the cereal. How many possible outcomes are there, and what are they?

SOLUTION
STEP 1 If the cereal is in stock, use the letter I; if it is out of stock, use O. Figure 4.3 shows a tree diagram for this experiment.
STEP 2 On the tree diagram, there are four possible paths from left to right. The outcomes are

Outcome   Experiment result
I         The cereal is in stock in store 1.
OI        The cereal is not in stock in store 1, but it is in stock in store 2.
OOI       The cereal is not in stock in stores 1 and 2, but it is in stock in store 3.
OOO       The cereal is not in stock in any store.

Why isn’t IO a possible outcome?

[Figure 4.3 Tree diagram for Example 4.4. At each store the branch I ends the path and the branch O continues to the next store, giving the outcomes I, OI, OOI, and OOO.]

The symbol S is used to denote several different objects, or quantities, in probability and statistics; a small s is also common. The text of the problem reveals the relevant meaning of the symbol.

This tree diagram is not symmetric, but all possible outcomes are represented by left-to-right paths. The paths representing the outcomes have different lengths. Some of the outcomes are shorter because the experiment ends early if the consumer finds the cereal in the first or second store.
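The stopping rule that makes this tree asymmetric can be written out directly. The following Python sketch lists the outcomes of the cereal search for any number of stores; the function name `search_outcomes` is made up for illustration, and the code assumes, as in Example 4.4, that the search ends at the first store where the cereal is in stock.

```python
def search_outcomes(n_stores):
    """List outcomes when the search stops at the first store with the cereal in stock."""
    outcomes = []
    prefix = ""  # the run of O's (out of stock) seen so far
    for _ in range(n_stores):
        outcomes.append(prefix + "I")  # in stock here: the experiment ends
        prefix += "O"                  # out of stock here: go on to the next store
    outcomes.append(prefix)            # out of stock in every store
    return outcomes

print(search_outcomes(3))  # ['I', 'OI', 'OOI', 'OOO']
```

An outcome such as IO never appears, because nothing is appended after an I.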

Definition The sample space associated with an experiment is a listing of all the possible outcomes using set notation. It is the collection of all outcomes written mathematically, with curly braces, and denoted by S.


Example 4.5 Sample Spaces
Find the sample space for each of the four experiments above.

SOLUTION
We determined the outcomes for each experiment. Write the sample space using set notation.
STEP 1 Last digit of Social Security number: S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
STEP 2 Seatbelt experiment: S = {RR, RU, UR, UU}.
STEP 3 Extended seatbelt experiment: S = {RRR, RRU, RUR, RUU, URR, URU, UUR, UUU}.
STEP 4 Cereal experiment: S = {I, OI, OOI, OOO}.

TRY IT NOW

GO TO EXERCISE 4.19

Given an experiment and the sample space, we usually study and find the probability of specific collections of outcomes, called events.

Definition
1. An event is any collection (or set) of outcomes from an experiment (any subset of the sample space).
2. A simple event is an event consisting of exactly one outcome.
3. An event has occurred if the resulting outcome is contained in the event.

A CLOSER LOOK
1. An event may be given in standard set notation, or it may be defined in words. If a written definition is given, we need to translate the words into mathematics in order to identify the event outcomes.
2. Notation: (a) Events are denoted with capital letters, for example, A, B, C, . . . . (b) Simple events are often denoted by E1, E2, E3, . . . .
3. It is possible for an event to be empty. An event containing no outcomes is denoted by {} or ∅ (the empty set).

Example 4.6 College Dining Two resident students at Bucknell University are selected and asked if they purchased a meal plan (M) or cook for themselves (C). The experiment consists of recording the response from both students.

Translate at most and at least carefully. These expressions appear frequently in probability and statistics questions.

There are four possible outcomes. A tree diagram is shown in the margin. The sample space is S = {MM, MC, CM, CC}. There are four relevant simple events:

E1 = {MM}, E2 = {MC}, E3 = {CM}, E4 = {CC}.

[Margin tree diagram: first-generation branches M and C, each followed by second-generation branches M and C, giving the outcomes MM, MC, CM, CC.]

Here are some other events, in words and in set notation.
Let A be the event that both students made the same choice. A = {MM, CC}.
Let B be the event that at most one student purchased a meal plan. B = {CC, MC, CM} contains observations with at most one M.
Let D be the event that at least one student cooks for himself. D = {CM, MC, CC} contains observations with one or more Cs.
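The translation from words to sets in Example 4.6 can be checked mechanically. The following is a minimal Python sketch, not part of the original example; the variable names are illustrative, and each outcome is assumed to be stored as a two-character string such as "MC".

```python
from itertools import product

# Sample space for Example 4.6: each of two students answers M or C.
S = {"".join(pair) for pair in product("MC", repeat=2)}

# "Both students made the same choice."
A = {o for o in S if o[0] == o[1]}

# "At most one student purchased a meal plan": zero or one M.
B = {o for o in S if o.count("M") <= 1}

# "At least one student cooks": one or more Cs.
D = {o for o in S if o.count("C") >= 1}

print(sorted(S))
print(sorted(A), sorted(B), sorted(D))
```

Note that "at most one M" and "at least one C" describe the same set of outcomes here, which is why B and D contain the same outcomes.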


Example 4.7 On-Time Delivery UPS delivery routes include as many right turns as possible.

A UPS driver may deliver to floors 2 through 6 in an office building and use one of three elevators (labeled A, B, and C). The experiment consists of recording the floor and elevator used. There are 15 possible outcomes because there are three elevators for each of the five floors. A tree diagram works again. The sample space is S = {2A, 3A, 4A, 5A, 6A, 2B, 3B, 4B, 5B, 6B, 2C, 3C, 4C, 5C, 6C}.

The number in each outcome represents the floor, and the letter represents the elevator.


Let E be the event that the delivery is made on an odd floor using elevator B. E = {3B, 5B}.

Let F be the event that the delivery is made on an even floor. This definition says nothing about the elevator used. There are no restrictions on the elevator in this event. F = {2A, 4A, 6A, 2B, 4B, 6B, 2C, 4C, 6C}.

Let G be the event that the delivery is made using elevator C. G = {2C, 3C, 4C, 5C, 6C}.

TRY IT NOW

GO TO EXERCISE 4.22

When an experiment is conducted, only one outcome can occur. For example, if the UPS driver used elevator B to deliver to the third floor, the experimental outcome is 3B. The observed outcome may be included in several relevant events. In the delivery example above, if the outcome 4C is observed, then the events F and G have occurred. The event E did not occur. Given an experiment, the sample space, and some relevant events, we often combine events in various ways to create and study new events. Events are really sets, so the methods of combining events are set operations.
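To make the idea of an event "occurring" concrete, here is a short Python sketch of the delivery experiment in Example 4.7. It is only an illustration: the helper name occurred and the string encoding of outcomes (floor digit followed by elevator letter) are assumptions, not notation from the text.

```python
from itertools import product

floors = [2, 3, 4, 5, 6]
elevators = ["A", "B", "C"]

# Every outcome pairs one floor with one elevator, e.g. "3B".
S = {f"{floor}{elev}" for floor, elev in product(floors, elevators)}

E = {o for o in S if int(o[0]) % 2 == 1 and o[1] == "B"}  # odd floor, elevator B
F = {o for o in S if int(o[0]) % 2 == 0}                  # even floor, any elevator
G = {o for o in S if o[1] == "C"}                         # elevator C, any floor

def occurred(event, outcome):
    """An event has occurred if the observed outcome is contained in it."""
    return outcome in event

observed = "4C"
results = {name: occurred(ev, observed) for name, ev in [("E", E), ("F", F), ("G", G)]}
print(len(S), results)
```

A single observed outcome can belong to several events at once, so more than one event can occur on the same trial.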

Definition Let A and B denote two events associated with a sample space S.
1. The event A complement, denoted A′, consists of all outcomes in the sample space S that are not in A.
2. The event A union B, denoted A ∪ B, consists of all outcomes that are in A or B or both.
3. The event A intersection B, denoted A ∩ B, consists of all outcomes that are in both A and B.
4. If A and B have no elements in common, they are disjoint or mutually exclusive, written A ∩ B = { }.

A′ is read as “A prime” or “A complement.”

A CLOSER LOOK
1. The event A′ is also called not A. The word not in the text of a probability question usually means you need to find the complement of an event.
2. Or usually means union; A or B means A ∪ B.
3. And usually means intersection; A and B means A ∩ B.
4. Any outcome in both A and B is included only once in the event A ∪ B.
5. The three events defined above could be denoted using any new symbols. A′, A ∪ B, and A ∩ B are traditional mathematical symbols to denote complement, union, and intersection.
6. It is possible for one of these new events to contain all the outcomes in the sample space.


Example 4.8 One-Coat Coverage Home Depot sells Behr Premium Plus Ultra interior paint in one of three finishes: flat (F), satin (T), or gloss (G). The manager is interested in customer preferences and conducts an experiment by recording the interior paint finish for the next two customers who buy paint. The sample space for this experiment has nine outcomes. See Figure 4.4. S = {FF, FT, FG, TF, TT, TG, GF, GT, GG}.

Consider the following events.
A = {FF, TT, GG}: Both buy the same finish.
B = {FF, FT, TF, TT}: Neither buys gloss.
C = {FF, FT, FG, TF, GF}: At least one buys flat.
D = {FT, TF, TG, GT}: Exactly one buys satin.

[Figure 4.4: tree diagram with first-generation branches F, T, G, each followed by second-generation branches F, T, G, giving the outcomes FF, FT, FG, TF, TT, TG, GF, GT, GG.]

Figure 4.4 Tree diagram for Example 4.8.

Here are some new events created from the four given events:

D′ = {FF, FG, TT, GF, GG}: D complement; neither or both buy satin. All outcomes in S not in D.
A ∪ C = {FF, TT, GG, FT, FG, TF, GF}: Both buy the same finish or at least one buys flat. All outcomes in A or C (or both).
A ∩ D = { }: Both buy the same finish and exactly one buys satin. All outcomes in A and D. A and D are disjoint.
(A ∪ C)′ = {TG, GT}: A union C, complement. All outcomes in S not in A ∪ C.
(A ∩ D)′ = S: A intersection D, complement. All outcomes in S not in A ∩ D.

TRY IT NOW

GO TO EXERCISE 4.12
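Because events are sets, the combinations in Example 4.8 map directly onto set operations. The sketch below uses Python's built-in set type as one possible illustration; the helper complement is a hypothetical name introduced here, since a complement only makes sense relative to a stated sample space.

```python
S = {"FF", "FT", "FG", "TF", "TT", "TG", "GF", "GT", "GG"}
A = {"FF", "TT", "GG"}              # both buy the same finish
C = {"FF", "FT", "FG", "TF", "GF"}  # at least one buys flat
D = {"FT", "TF", "TG", "GT"}        # exactly one buys satin

def complement(event, sample_space):
    """All outcomes in the sample space that are not in the event."""
    return sample_space - event

D_comp = complement(D, S)   # D'
A_or_C = A | C              # A union C
A_and_D = A & D             # A intersection D

print(sorted(D_comp))
print(sorted(A_or_C))
print(A_and_D == set(), A.isdisjoint(D))
print(sorted(complement(A_or_C, S)))
print(complement(A_and_D, S) == S)
```

The union operator keeps each outcome only once, so FF appears a single time in A ∪ C even though it belongs to both A and C.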


A Venn diagram may be used to visualize a sample space and events, to determine outcomes in combinations of events, and to answer probability questions in later sections. To construct a Venn diagram draw a rectangle to represent the sample space. Various figures (often circles) are drawn inside the rectangle to represent events. The Venn diagrams in Figure 4.5 illustrate various combinations of events.

[Figure 4.5: four Venn diagrams, each drawn inside a rectangle representing S. Panels: A complement: A′; A union B: A ∪ B; A intersection B: A ∩ B; A and B are disjoint: A ∩ B = { }.]

Figure 4.5 Venn diagrams.

In a Venn diagram, plane regions represent events. We often add labeled points to denote outcomes. Later, probabilities assigned to events will be added to the diagrams. The definitions of union, intersection, and disjoint events can be extended to a collection consisting of more than two events.

Definition Let A1, A2, A3, . . . , Ak be a collection of k events.
1. The event A1 ∪ A2 ∪ ⋯ ∪ Ak is a generalized union and consists of all outcomes in at least one of the events A1, A2, A3, . . . , Ak.
2. The event A1 ∩ A2 ∩ ⋯ ∩ Ak is a generalized intersection and consists of all outcomes in every one of the events A1, A2, A3, . . . , Ak.
3. The k events A1, A2, A3, . . . , Ak are disjoint if no two have any element in common.

Example 4.9 Priority Request A university computer technician attaches a priority code to each help request. The range is 0 to 9, with 0 as the lowest priority and 9 as the highest priority. Consider an experiment in which a random request is selected and the priority is recorded. The sample space is S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Consider the events A = {0, 1, 2, 3, 4}, B = {3, 4, 5, 6}, C = {7, 8}, and D = {2, 4, 6, 9}.


a. List the outcomes in the event A ∪ B ∪ C and illustrate these three events using a Venn diagram.
b. List the outcomes in the event A ∪ B ∪ D and illustrate these three events using a Venn diagram.
c. List the outcomes in each of the following events: i. A ∩ B ∩ C ii. A ∩ B ∩ D iii. (A ∪ B)′ iv. (A ∪ B ∪ D)′

SOLUTION
STEP 1 The event A ∪ B ∪ C includes all the outcomes in at least one of the events A, B, or C. A ∪ B ∪ C = {0, 1, 2, 3, 4, 5, 6, 7, 8}. Figure 4.6 shows the relationships among the events A, B, and C, and the sample space S.
STEP 2 The event A ∪ B ∪ D includes all the outcomes in at least one of the events A, B, or D. A ∪ B ∪ D = {0, 1, 2, 3, 4, 5, 6, 9}. Figure 4.7 shows the relationships among the events A, B, and D, and the sample space S.
STEP 3
A ∩ B ∩ C = { }. There are no outcomes in all three events.
A ∩ B ∩ D = {4}. 4 is the only outcome in all three events.
(A ∪ B)′ = {7, 8, 9}. All outcomes in S not in A ∪ B.
(A ∪ B ∪ D)′ = {7, 8} = C. All outcomes in S not in A ∪ B ∪ D.

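Figure 4.6 The events A, B, and C in Example 4.9. [Venn diagram of A, B, and C inside the sample space S, with the outcomes 0 through 9 placed in the appropriate regions.]

Figure 4.7 The events A, B, and D in Example 4.9. [Venn diagram of A, B, and D inside the sample space S, with the outcomes 0 through 9 placed in the appropriate regions.]

TRY IT NOW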

GO TO EXERCISE 4.13
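The generalized union and intersection in Example 4.9 extend naturally to any number of events. The following Python sketch is one way to express them; the function names union_all, intersection_all, and pairwise_disjoint are hypothetical helpers introduced for illustration, and each event is assumed to be a Python set.

```python
from itertools import combinations

S = set(range(10))
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {7, 8}
D = {2, 4, 6, 9}

def union_all(*events):
    """Outcomes in at least one of the events."""
    return set().union(*events)

def intersection_all(*events):
    """Outcomes in every one of the events (needs at least one event)."""
    return set.intersection(*events)

def pairwise_disjoint(*events):
    """True if no two of the events share an outcome."""
    return all(x.isdisjoint(y) for x, y in combinations(events, 2))

print(sorted(union_all(A, B, C)))
print(sorted(union_all(A, B, D)))
print(intersection_all(A, B, C), intersection_all(A, B, D))
print(sorted(S - union_all(A, B)), sorted(S - union_all(A, B, D)))
print(pairwise_disjoint(A, C), pairwise_disjoint(A, B, C))
```

The last line illustrates part 3 of the definition: an empty three-way intersection (as with A, B, and C) does not make the events disjoint, because A and B still share outcomes.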

SECTION 4.1 EXERCISES

Concept Check

4.1 True/False In an experiment, the result of the activity is always pretty certain.

4.2 True/False A tree diagram is always symmetric.

4.3 True/False A sample space consists of all possible outcomes.

4.4 True/False A simple event occurs very rarely.

4.5 Short Answer a. The word ______ is usually associated with the complement of an event. b. The word ______ is usually associated with union. c. The word ______ is usually associated with intersection.

Practice

4.6 An experiment consists of rolling a six-sided die, recording the number face up, and then tossing a coin and recording head or tail. Carefully sketch a tree diagram and find the sample space for this experiment.

4.7 A basketball player is going to select a sneaker with red, blue, green, or black stripes, and in either low- or high-top style. An experiment consists of recording the color and style. Carefully sketch a tree diagram and find the sample space for this experiment.

4.8 An experiment consists of selecting one letter from B, I, N, G, O, and one of five rows. How many possible outcomes are there in this experiment? Carefully sketch the corresponding tree diagram.


4.9 One playing card is selected from a regular 52-card deck. An experiment consists of recording the denomination (ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king) and suit (club, diamond, heart, or spade). How many possible outcomes are there in this experiment?

4.10 Consider an experiment with sample space S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and the events A = {0, 2, 4, 6, 8}, B = {1, 3, 5, 7, 9}, C = {0, 1, 2, 3, 4}, D = {5, 6, 7, 8, 9}. Find the outcomes in each of the following events. a. A′ b. C′ c. D′ d. A ∪ B e. A ∪ C f. A ∪ D

4.11 Use the sample space and the events in Exercise 4.10 to find the outcomes in each of the following events. a. B ∩ C b. B ∩ D c. A ∩ B d. A ∩ C e. (B ∩ C)′ f. B′ ∪ C′

4.12 Consider an experiment with sample space S = {a, b, c, d, e, f, g, h, i, j, k} and the events A = {a, c, e, g}, B = {b, c, f, j, k}, C = {c, f, g, h, i}, D = {a, b, d, e, g, h, j, k}. Find the outcomes in each of the following events. a. A′ b. C′ c. D′ d. A ∩ B e. A ∩ C f. C ∩ D

4.13 Use the sample space and the events in Exercise 4.12 to find the outcomes in each of the following events. a. A ∪ B ∪ D b. B ∪ C ∪ D c. B ∩ C ∩ D d. A ∩ B ∩ C

4.14 Use the sample space and the events in Exercise 4.12 to find the outcomes in each of the following events. a. (A ∩ B ∩ C)′ b. A ∪ B ∪ C ∪ D c. (B ∪ C ∪ D)′ d. B′ ∩ C′ ∩ D′

4.15 The Venn diagram below shows the relationship between two events.

[Venn diagram: events A and B in the sample space S.]

Redraw the Venn diagram for each part of this problem and carefully shade in the region corresponding to each new event. a. (A ∪ B)′ b. (A ∩ B)′ c. A′ ∩ B d. A ∩ B′ e. A′ ∩ B′ f. A′ ∪ B′

4.16 The Venn diagram below shows the relationship among three events.

[Venn diagram: events A, B, and C in the sample space S.]

Redraw the Venn diagram for each part of this problem and carefully shade in the region corresponding to each new event. a. A ∪ B ∪ C b. A ∩ B ∩ C c. A ∪ C d. B ∩ C e. B ∩ C′ f. (A ∪ B)′ ∩ C g. (A ∪ B ∪ C)′ h. A′ ∩ B′ ∩ C′ i. B ∩ C ∩ A′

4.17 Consider an experiment with sample space S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}. a. Find the outcomes in each of the following events. A = Exactly one Y. B = Exactly two Ns. C = At least one Y. D = At most one N. Find the outcomes in each event described in words and write each as a combination of the events A, B, C, and D. b. Exactly one Y or at most one N c. Two or more Ns d. Exactly two Ns and at least one Y e. Two or more Ys

4.18 Consider an experiment with sample space S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and the events A = {0, 1, 2, 7, 8, 9}, B = {0, 1, 2, 4, 8}, C = {0, 1, 3, 9}, D = {1, 4, 9}. Draw a separate Venn diagram to illustrate the relationships among each collection of events and the sample space S. a. B and C b. A and D c. A, B, and C d. A, C, and D

Applications

4.19 Physical Sciences An experiment consists of recording the time zone (E, C, M, P) and strength (L, M, H) for the next earthquake in the 48 contiguous United States. a. Carefully sketch a tree diagram to illustrate the possible outcomes for this experiment. b. Find the sample space S for this experiment.


4.20 Economics and Finance Three taxpayers are selected at random and asked whether they itemized their tax deductions last year or used the standard deduction. An experiment consists of recording each response. Construct a tree diagram to represent this experiment and find the outcomes in the sample space.

4.21 Travel and Transportation Two people who work in New York City are selected at random and asked how they get to work: drive, take a train, or take a bus. An experiment consists of recording each response. Construct a tree diagram to represent this experiment and find the outcomes in the sample space.

4.22 Fuel Consumption and Cars In early 2013 it was revealed that the Winnipeg Police Service had worked 627 overtime shifts over the last six months specifically for traffic enforcement.2 The cost to Canadian taxpayers was approximately $900,000, and some people believe the time and money should be spent on crime prevention instead. An experiment consists of selecting a ticketed car at random and recording: a. Whether the driver has a valid registration. b. Whether the automobile is properly insured. c. The time of day during the traffic stop: morning, afternoon, or evening. Construct a tree diagram to represent this experiment and find the outcomes in the sample space.

4.23 Physical Sciences A construction crew excavating a site for a building foundation must remove the rock and prepare a trench for concrete footers. An experiment consists of recording the type of rock present (I, igneous; S, sedimentary; M, metamorphic) and the number of days needed to prepare the site (1 to 5). a. Carefully sketch a tree diagram to illustrate the possible outcomes for this experiment. b. Find the sample space S for this experiment.

4.24 Manufacturing and Product Development One of four calculator batteries is bad. An experiment consists of testing each battery until the dead one is found. a. How many possible outcomes are there for this experiment? b. Is the outcome GBGG (Good, Bad, Good, Good) possible? Why or why not?

4.25 Sports and Leisure An experiment consists of recording the number of pins knocked down on each roll during a frame of a bowling game. A bowler may take a maximum of two rolls per frame. How many outcomes are in the sample space for this experiment? Hint: If the first roll is a 10 (a strike), the experiment is over.

4.26 Sports and Leisure A sports statistician must carefully chart opposition football plays in preparation for the next game. An experiment consists of recording the type of play (pass or rush) and the yards gained (−99, −98, −97, . . . , −2, −1, 0, 1, 2, . . . , 97, 98, 99) on a randomly selected first down. How many outcomes are in the sample space for this experiment?

4.27 Medicine and Clinical Studies The Emergency Room in a rural hospital is staffed in four six-hour shifts (1, 2, 3, 4). During any shift, an Emergency Room patient is attended to by either a general physician (G), a surgeon (R), or an intern (I). An experiment consists of coding the next Emergency Room patient by shift and attending doctor. Consider the following events. A = The attending doctor is the general physician. B = The patient is admitted during the second shift. C = The patient is admitted during shift 3 or is seen by the intern. D = The patient is admitted during shift 4 and is seen by the general physician. a. Find the sample space S for this experiment. b. List the outcomes in each of the events A, B, C, and D. c. List the outcomes in the events A ∪ B and A ∩ B.

4.28 Psychology and Human Behavior Drivers entering the Quaker Bride Mall parking lot at the main entrance may turn left, right, or go straight. An experiment consists of recording the direction of the next car entering the mall and the vehicle style (sedan, SUV, van, or pickup). Consider the following events. A = The next vehicle is a van. B = The next vehicle is a sedan or pickup. C = The next vehicle turns left. D = The next vehicle goes straight or turns right. a. Find the sample space S for this experiment. b. List the outcomes in each of the events A, B, C, and D. c. List the outcomes in the events C ∪ D and C ∩ D.

4.29 Public Health and Nutrition Each patient with a regular appointment at Dr. Kenneth Heise’s dental office is classified by the number of cavities found (assume 4 is the maximum) and as late (L) or on time (T) for the appointment. a. Find the sample space S for this experiment. b. Describe the following events in words. A = {0L, 1L, 2L, 3L, 4L} B = {3L, 4L, 3T, 4T} C = {1L, 3L, 1T, 3T} D = {0L, 0T} E = {0L, 0T, 1L, 2L, 3L, 4L} F = {4T}

4.30 Travel and Transportation Every passenger arriving at the Las Vegas McCarran International Airport is classified as American (A) or Foreign (F), and by the number of checked bags (assume 5 is the maximum). a. Find the sample space S for this experiment. b. Describe the following events in words. A = {A0, F0} B = {F0, F1, F2, F3, F4, F5} C = {A1, F1, A2, F2} D = {F0, F5} E = {A1, F1, A3, F3, A5, F5}

4.31 Public Health and Nutrition A researcher working for a Five Guys fast-food restaurant in Savannah, Georgia, selects random customers and classifies each according to sex [male (M) or female (F)], fresh-cut French fries or not [(C) or (N)], and age group [young (Y), middle aged (D), or senior (R)]. a. Find the sample space S for this experiment. b. Describe the following events in words. A = {MCY, MCD, MCR, MNY, MND, MNR} B = {MCR, FCR} C = {MCY, MNY, FCY, FNY} D = {MNY, MND, MNR, FNY, FND, FNR}

Extended Applications

4.32 Sports and Leisure A single six-sided die is rolled. If the number face up is even, then the experiment is over. If the number face up is odd, then the die is rolled again. The experiment continues until the number face up is even. a. Carefully sketch (part of) a tree diagram to illustrate the possible outcomes for this experiment. b. Find the sample space for this experiment.

4.33 Economics and Finance A taxpayer in need of advice will call the IRS repeatedly until she can get through (no busy signal). If she receives a busy signal, she will hang up and try again later, and will stop calling as soon as she reaches an agent. An experiment consists of recording the calling pattern. A possible outcome is BBH: a busy signal (B) on the first two calls, and (finally) help (H) on the third call. a. How many possible outcomes are there in this experiment? b. List some of the outcomes for this experiment.

4.34 Marketing and Consumer Behavior Musicnotes.com sells sheet music in the following genres: rock, jazz, new age, and country. An experiment consists of recording the preferred genre for the next customer, and the number of songs purchased (assume 5 is the maximum). Consider the following events. A = The next customer prefers rock. B = The next customer prefers jazz and buys at least three songs. C = The next customer buys at most two songs. D = The next customer prefers country and buys one song. a. Find the sample space S for this experiment. b. Find the outcomes in each of the following events. i. A′ ii. A ∪ C iii. A ∩ D iv. C ∩ D v. A ∩ C ∩ D vi. (A ∩ B)

4.35 Marketing and Consumer Behavior Verizon Wireless offers smartphone and tablet customers a variety of data plans. Each user must select the number of gigabytes depending on the planned volume of emails, music, and videos.3 An experiment consists of selecting a Verizon Wireless smartphone customer and recording the data plan (1, 2, 3, 4, 5, or 6 GB) and whether the customer used more data than planned last month. Consider the following events. A = The customer used more data than planned. B = The customer has a 1-, 2-, or 3-GB plan. C = The customer has a 5- or 6-GB plan, and used more data than planned. D = The customer has a 2-, 4-, or 6-GB plan. a. Find the sample space S for this experiment. b. Find the outcomes in each of the following events. i. B′ ii. A ∪ B iii. A ∩ B iv. C ∩ D v. A ∩ B ∩ D vi. (A ∩ D)′

4.36 Travel and Transportation An experiment consists of selecting a random passenger on a train from Washington, DC, Union Station to Trenton, NJ, and recording the purpose of travel (business or pleasure) and the number of pieces of luggage (0 to 4). Consider the following events: A = The passenger is traveling on business. B = The passenger has no luggage. C = The passenger has at most one piece of luggage. D = The passenger has three pieces of luggage or is traveling for pleasure. a. Find the sample space S for this experiment. b. Find the outcomes in each of the following events. i. A ∪ B ii. A ∩ B iii. B ∪ C iv. B ∩ C v. A ∩ D vi. A ∩ B ∩ C ∩ D

4.37 Marketing and Consumer Behavior Raman’s Coffee and Chai offers Chai tea in the following variations: hot or cold; with whipped cream or without; and in small, medium, or large size. An experiment consists of recording these three options for the next customer. a. Carefully sketch a tree diagram to illustrate the possible outcomes for this experiment. b. Find the sample space S for this experiment. c. Consider the following events. A = The next customer order is small. B = The next customer order is cold. C = The next customer order is small or hot. Find the outcomes in each of the following events. i. A ∪ B ii. B ∪ C iii. B ∩ C iv. C′

4.2 An Introduction to Probability

P(A) works like a function. The inputs are events; the outputs are probabilities.

Given an experiment, some events are more likely to occur than others. For any event A, we need to assign a number to A that corresponds to this intuitive likelihood of occurrence. The likelihood that A will occur is simply the probability of the event A. For example, the probability that an asteroid 100 meters in diameter will strike the Earth in any given year is 0.001 (a pretty unlikely event). The probability of wind gusts over 40 miles per hour at the Mount Washington Observatory on any given winter day is 0.07 (a more likely event). The notation P(A) is used to denote this likelihood, the probability of an event A. To begin our discussion of probability, consider the following working definition.


Definition The probability of an event A is a number between 0 and 1 (including those endpoints) that measures the likelihood A will occur. 1. If the probability of an event is close to 1, then the event is likely to occur. 2. If the probability of an event is close to 0, then the event is not likely to occur.

Would you enroll in a class where the probability of receiving an A is 1?

If the probability of an event A is 1, then the event is a certainty: It will occur. If the probability of an event B is 0, then B is definitely not going to occur. What about events with probabilities in between? How do we decide to assign a probability of 0.3, for example, to an event C? We need a reasonable, all-purpose rule for linking an event to its likelihood of occurrence. The natural (theoretical) definition for assigning a probability to an event is very intuitive.

Definition The relative frequency of occurrence of an event is the number of times the event occurs divided by the total number of times the experiment is conducted.

Example 4.10 Pick a Card, Any Card In a regular 52-card deck there are 13 clubs, 13 diamonds, 13 hearts, and 13 spades. Suppose an experiment consists of selecting one card from the deck and recording the suit. What is the probability of selecting a club?

It seems like the answer should be 1/4. Why?

SOLUTION
STEP 1 Let C be the event that a club is selected. We want the probability of the event C, which is denoted by P(C). (Relative frequency was defined in Chapter 2 in the context of frequency distributions.)
STEP 2 To estimate the probability of C, it seems reasonable to conduct the experiment several times and see how often a club is selected. If C occurs often (we get a club a lot of the time), then the likelihood (probability) should be high. If C occurs rarely, then the probability should be close to 0.
STEP 3 To estimate the likelihood of selecting a club, we use the relative frequency of occurrence of a club, which is the frequency divided by total trials, or

Relative frequency = (number of times a club is selected) / (total number of selections)

After every selection, the observed card is placed back in the deck. The deck is shuffled, and another selection is made.
STEP 4 Suppose that after 10 tries, a club was selected only twice. The relative frequency is 2/10 = 0.2. This is an estimate of P(C). It’s quick and easy, but it doesn’t seem too accurate.
STEP 5 Suppose we try the experiment a few more times. With more observations we should be able to make a better guess at P(C). The table below shows values for N, the number of trials, and p, the relative frequency of occurrence of a club.

N    10     50     100    200    300    400    500    600    700
p    0.2    0.3    0.29   0.23   0.223  0.205  0.228  0.252  0.267

N    800    900    1000   1100   1200   1300   1400   1500
p    0.245  0.243  0.227  0.254  0.260  0.256  0.261  0.249

For example, after 300 draws, the relative frequency of occurrence of the event C was 0.223.


STEP 6 Figure 4.8 shows a plot of relative frequency versus number of trials. The graph shows a remarkable pattern. As N increases, the points are noticeably closer to the dashed line. The relative frequencies seem to be homing in on one number (around 0.25); this relative frequency, whatever it is, should be the probability of the event C.

[Figure 4.8: scatter plot with horizontal axis Number of trials, N (0 to 1500) and vertical axis Relative frequency, p (0.0 to 0.3); a dashed horizontal line marks the value the points approach.]

Figure 4.8 Scatter plot of relative frequency versus number of trials.

STEP 7 In the long run, the relative frequencies tend to stabilize, or even out, and become almost constant. They close in on one number, the limiting relative frequency. The probability of the event C is the limiting relative frequency.
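The experiment in Example 4.10 is easy to imitate on a computer. The sketch below is a simulation under stated assumptions, not the procedure used to produce the table in the text: it uses Python's pseudorandom generator with an arbitrary seed, models the deck as a list of 52 suit labels, and the function name relative_frequency_of_club is hypothetical. The printed values will differ from the table, but they should show the same tendency to settle near one number as N grows.

```python
import random

SUITS = ["club", "diamond", "heart", "spade"]

def relative_frequency_of_club(n_trials, rng):
    """Draw one card with replacement n_trials times; return the fraction of clubs."""
    deck = [suit for suit in SUITS for _ in range(13)]  # 13 cards of each suit
    clubs = sum(1 for _ in range(n_trials) if rng.choice(deck) == "club")
    return clubs / n_trials

rng = random.Random(2024)  # arbitrary seed so that a run can be repeated
for n in (10, 100, 1000, 10000, 100000):
    print(n, relative_frequency_of_club(n, rng))
```

Each call to rng.choice plays the role of shuffling the full deck and selecting one card, which matches the replace-and-reshuffle step described in the example.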

STATISTICAL APPLET PROBABILITY

If an experiment is conducted N times and an event occurs n times, then the probability of the event is approximately n/N (the relative frequency of occurrence). The probability of an event A, P(A), is the limiting relative frequency, the proportion of time the event A will occur in the long run. This is a basic and sensible definition, a rule for assigning probability to an event. Given an event, all we need to do is find the limiting relative frequency. Although this definition makes sense, and Example 4.10 and Figure 4.8 support and illustrate our intuition, there is a real practical problem. We cannot conduct experiments over and over, compute relative frequencies, and only then estimate the true probability. How will we ever know the true limiting relative frequency? How large should N be? When are we close enough? Will we ever hit the limiting relative frequency exactly? The definition is nice, but there seems little hope of ever finding the true probability of an event. Fortunately, there is another way to determine the exact probability in some cases. Consider the next two examples.

Example 4.11 Call It in the Air Suppose an experiment consists of tossing a fair coin and recording the side that lands face up. The event H is the coin landing with heads face up. Find P(H).

SOLUTION If we were to conduct this experiment over and over, the relative frequency of occurrence of H would close in on 1/2. There are only two possible outcomes on each flip of the coin, and they are both equally likely to occur. In the long run, we expect heads to occur half of the time. Therefore, P(H) = 1/2. Without flipping the coin thousands of times, making estimates, or guessing at the limiting relative frequency, we are certain the probability is 1/2.

Example 4.12 Roll the Die An experiment consists of tossing a fair six-sided die and recording the number that lands face up. Consider the event E = {1}, rolling a one. Find P(E).


SOLUTION The relative frequency of occurrence of a 1 would get closer and closer to 1/6 as the number of rolls gets larger and larger. There are six possible outcomes on each roll of the die, and they are all equally likely to occur. In the long run, we expect 1 to occur one-sixth of the time. Therefore, P(E) = 1/6. We can identify the exact limiting relative frequency.

These two examples suggest it is indeed possible to find the limiting relative frequency! They are special cases, however, because in each experiment, all of the outcomes are equally likely.

Properties of Probability

The word chance is also used to express likelihood. A 10% chance means the probability is 0.10.

1. For any event A, 0 ≤ P(A) ≤ 1. The probability of any event is a limiting relative frequency, and a relative frequency is a number between 0 and 1. An event with probability close to 0 is very unlikely to occur, and an event with probability close to 1 is very likely to occur.
2. For any event A, P(A) is the sum of the probabilities of all of the outcomes in A. To compute P(A), just add up the probability of each outcome or simple event in A.
3. The sum of the probabilities of all possible outcomes in a sample space is 1: P(S) = 1. The sample space S is an event. If an experiment is conducted, S is guaranteed to occur.
4. The probability of the empty set is 0: P({ }) = P(∅) = 0. This event contains no outcomes.

In the next example, the probability (limiting relative frequency) of each simple event is assumed to be known. We will use the properties above and some earlier definitions to develop some common tools and strategies for solving similar probability questions.

Example 4.13 Try the Easy Button There are five sales associates (indicated by their employee number) on duty in a Staples office supply store: three women (3, 4, and 5) and two men (1 and 2). An experiment consists of classifying the next customer’s action. He or she will make a purchase from one of the sales associates (indicated by number) or buy nothing (99). The probability of each simple event is given in the table below.

Simple event   1     2     3     4     5     99
Probability    0.08  0.12  0.10  0.25  0.15  0.30

Consider the following events.
A = {1, 2} = The next customer buys something from a male sales associate.
B = {3, 4, 5} = The next customer buys something from a female sales associate.
C = {99} = The next customer buys nothing.
D = {1, 4} = The next customer buys from one of these two sales associates.
Find P(A), P(C), P(B ∪ D), P(A ∩ D), and P(A ∩ B).

SOLUTION
STEP 1 P(A) = P(1) + P(2) = 0.08 + 0.12 = 0.20. Add the probabilities of each simple event in A.
STEP 2 P(C) = P(99) = 0.30. There is only one outcome in C.
STEP 3 P(B ∪ D) = P(1, 3, 4, 5). Find the outcomes in the event B ∪ D.
= P(1) + P(3) + P(4) + P(5). Add up the probabilities of each simple event.
= 0.08 + 0.10 + 0.25 + 0.15 = 0.58
STEP 4 P(A ∩ D) = P(1) = 0.08. The intersection is one simple event. Check the probability in the table above.
STEP 5 P(A ∩ B) = P({ }) = 0. The intersection is empty, so the probability is 0.

TRY IT NOW

GO TO EXERCISE 4.44

Think about tossing a fair coin, or rolling a fair die, or randomly selecting a student in a class to answer a question.

To find probabilities in the previous example, we looked at each event piece by piece. We broke down each event into simple events. Let’s apply the same properties in an equally likely outcome experiment. Suppose an experiment has n equally likely outcomes, S ! {e1, e2, e3, . . . , en}. Each simple event has the same chance of occurring, so the probability of each is 1/n; P(ei ) ! 1/n. The limiting relative frequency of ei is 1/n. This is exactly what we found in Examples 4.11 and 4.12. Consider an event A ! {e1, e2, e3, e4, e5}. To find P(A), add up the probabilities of each simple event in A. P ( A ) 5 P ( e1 ) 1 P ( e2 ) 1 P ( e3 ) 1 P ( e4 ) 1 P ( e5 ) 1 1 1 1 1 5 5 1 1 1 1 5 n n n n n n number of outcomes in A N(A) 5 5 number of outcomes in the sample space S N(S)
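As a rough check of both ideas, the following sketch adds up simple-event probabilities for the events in Example 4.13 and then repeats the calculation for a fair six-sided die. The die is an illustrative equally likely experiment chosen here, not part of the example, and the helper name `event_probability` and the use of exact fractions are assumptions made for this illustration.

```python
from fractions import Fraction


def event_probability(event, simple_probs):
    """Add up the probabilities of the simple events (outcomes) in `event`."""
    return sum((simple_probs[outcome] for outcome in event), Fraction(0))


# Simple-event probabilities from the table in Example 4.13.
staples = {
    1: Fraction("0.08"), 2: Fraction("0.12"), 3: Fraction("0.10"),
    4: Fraction("0.25"), 5: Fraction("0.15"), 99: Fraction("0.30"),
}
A, B, C, D = {1, 2}, {3, 4, 5}, {99}, {1, 4}

p_A = event_probability(A, staples)
p_C = event_probability(C, staples)
p_B_or_D = event_probability(B | D, staples)   # union
p_A_and_D = event_probability(A & D, staples)  # intersection
p_A_and_B = event_probability(A & B, staples)  # empty intersection

# Equally likely case: a fair six-sided die (an illustrative choice).
n = 6
die = {face: Fraction(1, n) for face in range(1, n + 1)}
even = {2, 4, 6}
p_even = event_probability(even, die)

print(float(p_A), float(p_C), float(p_B_or_D), float(p_A_and_D), float(p_A_and_B))
# Adding 1/n once per outcome agrees with N(A) / N(S).
print(p_even == Fraction(len(even), len(die)))
```

Exact fractions are used only so that sums such as 0.08 + 0.12 are not affected by floating-point rounding.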

Finding Probabilities in an Equally Likely Outcome Experiment
In an equally likely outcome experiment, the probability of any event A is the number of outcomes in A divided by the total number of outcomes in the sample space S. Finding the probability of any event, in this case, means counting the number of outcomes in A, counting the number of outcomes in the sample space S, and dividing.

$$P(A) = \frac{N(A)}{N(S)}$$

You will not always see the phrase equally likely outcomes in these probability questions. We will identify some keywords and work with familiar experiments that imply equally likely outcomes.

Section 4.3 presents some special counting rules to help compute probabilities associated with common experiments and events. However, we can solve some of these problems already, and even use our results to make a statistical inference.

Example 4.14 Bank Teller Jobs
The Beneficial Savings Bank in Tabernacle, New Jersey, has five tellers: 1 and 2 are trainees; 3, 4, and 5 are veterans. Tellers 2, 3, and 4 are female, and tellers 1 and 5 are male. At the end of the day, two tellers will be randomly selected and all of their transactions for the day will be audited.
a. What is the probability that both trainees will be selected for the audit?
b. What is the probability that one male and one female will be selected for the audit?
c. What is the probability that two females will be selected for the audit?

Solution Trail 4.14
KEYWORDS
■ Randomly selected
TRANSLATION
■ Equally likely outcomes
CONCEPTS
■ $P(A) = N(A) / N(S)$
VISION
To find the probability of each event, count the number of outcomes in that event and divide by the total number of outcomes in the sample space.

SOLUTION
The experiment consists of selecting two tellers at random. The outcomes consist of two tellers who can be represented by their numbers. Therefore, 12 represents the outcome that tellers 1 and 2 were selected. The order of selection does not matter. For example, 12 and 21 both represent the event tellers 1 and 2 were selected. We can (a) list all possible outcomes systematically, (b) sketch a tree diagram, or (c) use combinations (to be presented in Section 4.3). There are 10 outcomes in the sample space.

$S = \{12, 13, 14, 15, 23, 24, 25, 34, 35, 45\}$

a. Let A = both trainees are selected for the audit. Because the trainees are tellers 1 and 2, there is only one outcome in the event A: $A = \{12\}$.

$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S} = \frac{N(A)}{N(S)} = \frac{1}{10} = 0.10$$

b. Let B = one male and one female teller are selected. Tellers 2, 3, and 4 are female, and tellers 1 and 5 are male. Check the sample space carefully to list the outcomes in B.

$B = \{12, 13, 14, 25, 35, 45\}$, so $P(B) = \frac{N(B)}{N(S)} = \frac{6}{10} = 0.60$

c. Let C = two females are selected. Tellers 2, 3, and 4 are female. Check the sample space again, and pick out the matching outcomes.

$C = \{23, 24, 34\}$, so $P(C) = \frac{N(C)}{N(S)} = \frac{3}{10} = 0.30$
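The count-and-divide approach in Example 4.14 can also be sketched in a few lines of Python. This is only an illustration: the variable names are chosen here, and `itertools.combinations` stands in for the systematic listing of unordered pairs.

```python
from fractions import Fraction
from itertools import combinations

tellers = [1, 2, 3, 4, 5]
trainees = {1, 2}
females = {2, 3, 4}

# Each outcome is an unordered pair of tellers, so 12 and 21 are the same outcome.
sample_space = list(combinations(tellers, 2))


def prob(event):
    """N(event) / N(S) for an equally likely outcome experiment."""
    return Fraction(len(event), len(sample_space))


both_trainees = [pair for pair in sample_space if set(pair) <= trainees]
one_male_one_female = [pair for pair in sample_space if len(set(pair) & females) == 1]
two_females = [pair for pair in sample_space if set(pair) <= females]

print(len(sample_space))
print(float(prob(both_trainees)), float(prob(one_male_one_female)), float(prob(two_females)))
```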

The next example involves an equally likely outcome experiment and an inference question. We’ll need to compute the likelihood of the observed event to help us draw a conclusion.

Example 4.15 Buttered Bagels
Suppose Bloomin Bagels sells only two different varieties of bagels: plain (P) and cinnamon raisin (C). The owner believes the demand for each kind is the same and the shop should continue to bake these varieties in equal numbers. Five customers are selected at random. Each customer buys only one bagel and the bagel purchased is noted.
a. Find the probability that exactly one person buys a plain bagel.
b. Suppose all five customers purchase a plain bagel. Is there any evidence to suggest that demand is weighted more toward one variety?

Solution Trail 4.15a
KEYWORDS
■ Demand for each kind is the same
■ Selected at random
TRANSLATION
■ Equally likely outcomes
CONCEPTS
■ $P(A) = N(A) / N(S)$
VISION
To find the probability of each event, count the number of outcomes in that event and divide by the total number of outcomes in the sample space.

SOLUTION The experiment consists of selecting five customers at random and recording their bagel purchase. Each outcome is a sequence of five letters: Cs and/or Ps. For example, the outcome CCPCP stands for: the first customer buys a cinnamon raisin bagel, the second customer buys a cinnamon raisin bagel, the third customer buys a plain bagel, the fourth buys a cinnamon raisin bagel, and the fifth buys a plain bagel. There are 32 possible outcomes; a systematic listing helps, and a tree diagram works (but is big). (The multiplication rule also works here. This very useful counting technique is presented in Section 4.3.) Here is the sample space: S = {PPPPP, PPPPC, PPPCP, PPPCC, PPCPP, PPCPC, PPCCP, PPCCC, PCPPP, PCPPC, PCPCP, PCPCC, PCCPP, PCCPC, PCCCP, PCCCC, CPPPP, CPPPC, CPPCP, CPPCC, CPCPP, CPCPC, CPCCP, CPCCC, CCPPP, CCPPC, CCPCP, CCPCC, CCCPP, CCCPC, CCCCP, CCCCC}.


a. Let A = exactly one person buys a plain bagel. Check the sample space and carefully list all the outcomes in A.

$A = \{PCCCC, CPCCC, CCPCC, CCCPC, CCCCP\}$

$$P(A) = \frac{N(A)}{N(S)} = \frac{5}{32} = 0.15625$$

b. The claim is that the demand for each type of bagel is equal. If this is true, then all of the outcomes in the sample space S are equally likely. The experiment consists of observing the bagel purchase for the next five customers. Let B = the observed outcome, everyone buys a plain bagel. Find the likelihood of the event B occurring. There is only one outcome in B, so the probability of the event B is

$P(B) = \frac{N(B)}{N(S)}$ (Equally likely outcomes.)
$= \frac{1}{32} = 0.03125$ (Count and divide.)

Solution Trail 4.15b
KEYWORDS
■ All five purchase a plain bagel
■ Is there any evidence?
TRANSLATION
■ Experimental outcome
■ Draw a conclusion
CONCEPTS
■ Inference procedure
VISION
Find the probability of the experimental outcome that all five customers buy a plain bagel. Compute how likely this outcome is, and draw a conclusion about the claim.

The conclusion: Because this probability is so small, all five people buying a plain bagel is a rare event. But it happened! This suggests the assumption is wrong: there is evidence to suggest that the demand for each type of bagel is not equal.
Note: There is really evidence to suggest that some assumption is wrong. It could be, for example, that the five customers were not selected at random. To draw a conclusion about the demand for these two types of bagels, we must accept all other assumptions are true.

TRY IT NOW GO TO EXERCISE 4.52
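The 32-outcome sample space in Example 4.15 is tedious to list by hand, so a short sketch may help confirm the two counts. It assumes, as the example does, that every sequence of five purchases is equally likely; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Each outcome is a string of five letters: P (plain) or C (cinnamon raisin).
sample_space = ["".join(seq) for seq in product("PC", repeat=5)]

exactly_one_plain = [s for s in sample_space if s.count("P") == 1]
all_plain = [s for s in sample_space if s == "PPPPP"]

p_exactly_one = Fraction(len(exactly_one_plain), len(sample_space))
p_all_plain = Fraction(len(all_plain), len(sample_space))

print(len(sample_space), float(p_exactly_one), float(p_all_plain))
```

The small value of `p_all_plain` is the quantity used in part (b) to judge whether the observed outcome is a rare event under the equal-demand assumption.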

Consider an experiment, two events A and B, and known probabilities P(A) and P(B). Suppose we use A and B to create a new event using complement, union, or intersection. Sometimes we can use the known probabilities P(A) and P(B) to calculate the probability of the new event quickly. We may not have to break down the new event into simple events, or even count all the outcomes in the new event (if it is an equally likely outcome experiment). The complement rule and the addition rule for two events are two rules that help with probability calculations.

The Complement Rule
For any event A, $P(A) = 1 - P(A')$.

A CLOSER LOOK
1. The complement rule is easy to visualize and justify by looking at a Venn diagram. Figure 4.9 shows an event A and its complement $A'$. Remember, the area of a region represents probability. $P(A) + P(A') = P(S) = 1$, which can be written as $P(A) = 1 - P(A')$ or $P(A') = 1 - P(A)$.
2. The complement rule is incredibly handy; it is used in various contexts throughout probability and statistics. The problem is, how do you know when to use it? Look for keywords such as not, at least, and at most. A rule of thumb: If you are faced with a very long probability calculation involving many simple events, or one that may require lots of counting, try looking at the complement.

[Figure 4.9 Venn diagram for visualizing the complement rule: the sample space S, the event A, and its complement $A'$.]

$(A')'$ is read as “A complement, complement.” What is $(A')'$? All outcomes not in $A'$, which is A!

Example 4.16 Law and Order
Three public defenders are assigned to cases randomly. An experiment consists of recording the lawyer (by number) assigned to the next three cases. The outcome 132 means lawyer 1 was assigned case 1, lawyer 3 was assigned case 2, and lawyer 2 was assigned case 3.
a. Find the probability that all three cases are assigned to different lawyers.
b. Find the probability that lawyer 2 is not assigned to any of the three cases.
c. Find the probability that lawyer 2 is assigned to at least one case.


SOLUTION
There are 27 possible outcomes; a tree diagram works. Note: Each case can be assigned to one of three lawyers. Multiplying the number of possible assignments for each case gives

$$\underbrace{3}_{\text{Case 1}} \times \underbrace{3}_{\text{Case 2}} \times \underbrace{3}_{\text{Case 3}} = 27$$

Here’s the sample space:
S = {111, 112, 113, 121, 122, 123, 131, 132, 133, 211, 212, 213, 221, 222, 223, 231, 232, 233, 311, 312, 313, 321, 322, 323, 331, 332, 333}.

a. Let A = all three cases are assigned to different lawyers. Find all the outcomes in S with a 1, a 2, and a 3.

$A = \{123, 132, 213, 231, 312, 321\}$

$P(A) = \frac{N(A)}{N(S)} = \frac{6}{27} = 0.2222$ (Equally likely outcomes.)

b. Let B = lawyer 2 is not assigned to any of the three cases. Find all the outcomes without a 2.

$B = \{111, 113, 131, 133, 311, 313, 331, 333\}$

$P(B) = \frac{N(B)}{N(S)} = \frac{8}{27} = 0.2963$

8/27 is really approximately equal to 0.2963. Many answers in this text are rounded (here, to four decimal places) and an equal sign is used for simplicity and convenience.

c. Let C be the event lawyer 2 has at least one case. The outcomes in C include those with one 2, two 2s, and three 2s. That seems like a lot of counting. This is a good opportunity to use the complement rule.

$P(C) = 1 - P(C')$ (Complement rule.)
$= 1 - P(\text{lawyer 2 is assigned 0 cases})$ (Interpretation of $C'$.)
$= 1 - P(B)$ ($C' = B$ in this example.)
$= 1 - \frac{8}{27} = 1 - 0.2963 = 0.7037$

Solution Trail 4.16c
KEYWORDS
■ At least one case
TRANSLATION
■ Let C be the event that lawyer 2 has at least one case
CONCEPTS
■ Complement rule
VISION
Consider the complement, $C'$, the event that lawyer 2 has no cases. Count the outcomes that have no 2s and use the complement rule: $P(C) = 1 - P(C')$.

TRY IT NOW

GO TO EXERCISE 4.54
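To see that the complement rule in Example 4.16 gives the same answer as direct counting, the sample space can be enumerated. The sketch below is an illustration only; the names are chosen here, and each outcome is stored as a three-character string such as "132".

```python
from fractions import Fraction
from itertools import product

lawyers = "123"
# Each outcome records the lawyer assigned to case 1, case 2, and case 3.
sample_space = ["".join(assignment) for assignment in product(lawyers, repeat=3)]


def prob(event):
    """N(event) / N(S) for an equally likely outcome experiment."""
    return Fraction(len(event), len(sample_space))


all_different = [s for s in sample_space if len(set(s)) == 3]
no_lawyer_2 = [s for s in sample_space if "2" not in s]
at_least_one_2 = [s for s in sample_space if "2" in s]

p_direct = prob(at_least_one_2)             # count outcomes with one, two, or three 2s
p_by_complement = 1 - prob(no_lawyer_2)     # complement rule: 1 - P(no 2s)

print(len(sample_space), len(all_different), len(no_lawyer_2))
print(round(float(p_direct), 4), round(float(p_by_complement), 4))
```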

The Addition Rule for Two Events
1. For any two events A and B, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
2. For any two disjoint events A and B, $P(A \cup B) = P(A) + P(B)$.

A CLOSER LOOK
1. Figure 4.10 helps illustrate and justify this rule. To find the probability of the union, start by adding $P(A) + P(B)$. This sum includes the region of intersection, $P(A \cap B)$, twice. Adjust this total by subtracting the intersection area, once.
2. $A \cup B = B \cup A$; order doesn’t matter here. So, $P(A \cup B) = P(B \cup A)$.
3. The first, more general, formula always works. If A and B are disjoint, then $P(A \cap B) = 0$.
4. The addition rule can be extended. For any three events A, B, and C:
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$
You can also visualize and derive this by using a Venn diagram. In this case, the sum $P(A) + P(B) + P(C)$ includes the double intersections twice and the triple intersection three times. We therefore need to adjust the total accordingly.
5. Let $A_1, A_2, A_3, \ldots, A_k$ be a collection of k disjoint events.
$$P(A_1 \cup A_2 \cup \cdots \cup A_k) = P(A_1) + P(A_2) + \cdots + P(A_k)$$
If the events are disjoint, to find the probability of a union, just add up the corresponding probabilities. This is especially useful in questions that ask about the number of individuals or objects with a specific attribute. For example, suppose 10 people are asked whether they received a flu shot this winter. The probability of at most 3 is the probability of 0, plus the probability of 1, plus the probability of 2, plus the probability of 3.
6. Complement, union, and intersection are operations applied to events. It doesn’t make sense to take the union of probabilities (which are numbers). Similarly, addition and subtraction are operations on real numbers. You shouldn’t try to add or subtract events.

[Figure 4.10 Venn diagram for illustrating the addition rule: the sample space S with overlapping events A and B.]

Beware of these common errors: $P(A + B)$, because you can’t add two events; $P(A) \cup P(B)$, because you can’t union two numbers.

Probabilities are given as percentages in Example 4.17. Divide each by 100 to convert to a probability.

Solution Trail 4.17
KEYWORDS
■ 70% use sugar
■ 35% use milk
■ 25% use both
TRANSLATION
■ $P(G) = 0.70$
■ $P(M) = 0.35$
■ $P(G \cap M) = 0.25$
CONCEPTS
■ Probability of events
■ Intersection
VISION
Create a Venn diagram and find probabilities of events using the Venn diagram and probability rules.

Example 4.17 Milk or Sugar?
Marketing research by The Coffee Beanery in Detroit, Michigan, indicates that 70% of all customers put sugar in their coffee, 35% add milk, and 25% use both. Suppose a Coffee Beanery customer is selected at random.
a. Draw a Venn diagram to illustrate the events in this problem.
b. What is the probability that the customer uses at least one of these two items?
c. What is the probability that the customer uses neither?
d. What is the probability that the customer uses just sugar?
e. What is the probability that the customer uses just one of these two items?

SOLUTION
a. Define the events given in the problem.
Let G = the customer adds sugar; $P(G) = 0.70$.
Let M = the customer adds milk; $P(M) = 0.35$.
Use both means uses sugar and milk, which means intersection. Therefore, $P(G \cap M) = 0.25$. These three probabilities add up to more than 1. That’s OK because the events M and G intersect.


Remember, area of a region corresponds to probability. To complete the picture, start at the inside and work your way out.
i. The shaded area represents the probability that the customer uses both sugar and milk, that is, $P(G \cap M)$. We know that $P(G \cap M) = 0.25$. Because $P(G) = 0.70$, the remaining area representing G corresponds to $0.70 - 0.25 = 0.45$.
ii. Similarly, because $P(G \cap M) = 0.25$ and $P(M) = 0.35$, the remaining area representing M corresponds to $0.35 - 0.25 = 0.10$.
iii. The total probability in the entire sample space must sum to 1; the remaining probability is $1 - (0.45 + 0.25 + 0.10) = 0.20$.
Figure 4.11 is the Venn diagram that corresponds to this problem.

[Figure 4.11 Venn diagram for Example 4.17. Regions: G only, 0.45; G and M, 0.25; M only, 0.10; outside G and M, 0.20.]

b. The probability of using at least one item means using sugar, or milk, or both. That’s a union of two events.

$P(G \cup M) = P(G) + P(M) - P(G \cap M)$ (Addition rule for two events.)
$= 0.70 + 0.35 - 0.25 = 0.80$ (Use the known probabilities.)

The Venn diagram supports this answer. Look at the region that represents $P(G \cup M)$, and add up the corresponding probabilities.

c. Uses neither means does not use sugar or milk. Because $G \cup M$ means sugar or milk, neither suggests the complement of $G \cup M$.

$P[(G \cup M)'] = 1 - P(G \cup M)$ (Complement rule applied to the event $G \cup M$.)
$= 1 - 0.80 = 0.20$ (Use the previous answer.)

In the Venn diagram, this is the region outside $G \cup M$.

d. Uses just sugar means uses sugar but not both sugar and milk. This is not simply P(G), because this probability includes more than just sugar. Start with the probability of using sugar, and subtract the probability of using both.

$P(\text{just sugar}) = P(G) - P(G \cap M)$ (Use the Venn diagram.)
$= 0.70 - 0.25 = 0.45$ (Use the known probabilities.)

e. Uses just one of these items means uses sugar or milk, but not both. Start with the union, subtract off the intersection.

$P(\text{exactly one}) = P(G \cup M) - P(G \cap M)$ (Use the Venn diagram.)
$= 0.80 - 0.25 = 0.55$ (Use the known probabilities.)

TRY IT NOW GO TO EXERCISE 4.61
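The inside-out bookkeeping used for the Venn diagram in Example 4.17 can be sketched as a few lines of arithmetic. The variable names below are chosen for this illustration, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

# Known probabilities from Example 4.17.
p_G = Fraction("0.70")     # customer adds sugar
p_M = Fraction("0.35")     # customer adds milk
p_both = Fraction("0.25")  # customer adds sugar and milk

# Work from the inside of the Venn diagram outward.
just_sugar = p_G - p_both
just_milk = p_M - p_both
at_least_one = p_G + p_M - p_both   # addition rule
neither = 1 - at_least_one          # complement rule
exactly_one = at_least_one - p_both

regions = [just_sugar, p_both, just_milk, neither]
print([float(r) for r in regions], float(sum(regions)))
print(float(at_least_one), float(exactly_one))
```

A useful sanity check is that the four regions add to 1 and that "exactly one" equals "just sugar" plus "just milk".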

Example 4.18 Movie Receipts
The Cheswick Theatre in Pittsburgh, Pennsylvania, has six screens, each showing a different movie. Receipts from a recent weekend were used to compile the following table, showing the probability of watching each movie.

Movie         M1     M2     M3     M4     M5     M6
Probability   0.10   0.25   0.20   0.30   0.10   0.05


Consider the following events.
$A = \{M1, M2\}$ (movies rated PG)
$B = \{M2, M3, M6\}$ (action adventures)
$C = \{M4, M5\}$ (dramas)
$D = \{M6\}$ (foreign film)
Suppose a patron is randomly selected. Find the probability he
a. watched a movie rated PG or an action adventure.
b. watched a movie rated PG or a drama.
c. watched a movie rated PG, or a drama, or a foreign film.

SOLUTION
Find the probability of A, B, C, and D. Break down each event and look at the individual outcomes.

$P(A) = P(M1) + P(M2) = 0.10 + 0.25 = 0.35$
$P(B) = P(M2) + P(M3) + P(M6) = 0.25 + 0.20 + 0.05 = 0.50$
$P(C) = P(M4) + P(M5) = 0.30 + 0.10 = 0.40$
$P(D) = P(M6) = 0.05$

a. Or means union. Find the corresponding events, and translate everything into mathematics.

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ ((General) addition rule.)
$= P(A) + P(B) - P(M2)$ (Find the events in $A \cap B$.)
$= 0.35 + 0.50 - 0.25 = 0.60$ (Use known probabilities.)

b. Part (b) is the same kind of question; or means union.

$P(A \cup C) = P(A) + P(C)$ (A and C are disjoint.)
$= 0.35 + 0.40 = 0.75$ (Use known probabilities.)

c. Or means union again in part (c), but with three events.

$P(A \cup C \cup D) = P(A) + P(C) + P(D)$ (Three disjoint events.)
$= 0.35 + 0.40 + 0.05 = 0.80$ (Use known probabilities.)
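One way to check the addition rule on Example 4.18 is to compute each union directly from its outcomes and compare it with the value the rule gives. The sketch below does that with Python sets; the helper name `P` and the extra check of the three-event rule on A, B, and C are additions made here for illustration and are not part of the example.

```python
from fractions import Fraction

# Probability of watching each movie, from the table in Example 4.18.
movie_probs = {
    "M1": Fraction("0.10"), "M2": Fraction("0.25"), "M3": Fraction("0.20"),
    "M4": Fraction("0.30"), "M5": Fraction("0.10"), "M6": Fraction("0.05"),
}


def P(event):
    """Add up the probabilities of the outcomes in `event`."""
    return sum((movie_probs[m] for m in event), Fraction(0))


A = {"M1", "M2"}        # movies rated PG
B = {"M2", "M3", "M6"}  # action adventures
C = {"M4", "M5"}        # dramas
D = {"M6"}              # foreign film

# Part (a): general addition rule versus adding the outcomes in the union.
union_direct = P(A | B)
union_by_rule = P(A) + P(B) - P(A & B)

# Parts (b) and (c): disjoint events, so the probabilities simply add.
p_A_or_C = P(A | C)
p_A_or_C_or_D = P(A | C | D)

# Three-event addition rule, checked on A, B, and C (an extra illustration).
three_direct = P(A | B | C)
three_by_rule = (P(A) + P(B) + P(C)
                 - P(A & B) - P(A & C) - P(B & C)
                 + P(A & B & C))

print(float(union_direct), float(p_A_or_C), float(p_A_or_C_or_D))
print(union_direct == union_by_rule, three_direct == three_by_rule)
```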

SECTION 4.2 EXERCISES

Concept Check
4.38 True/False The probability of any event is always a number between 0 and 1.
4.39 Fill in the Blank
a. If the probability of an event is close to 1, then the event is ________.
b. If the probability of an event is close to 0, then the event is ________.
c. The probability of an event is ________.
4.40 True/False There is no way of knowing the sum of all possible outcomes in a sample space.
4.41 True/False In an equally likely outcome experiment, the probability of any event A is the number of outcomes in A.
4.42 True/False For any event A, $P(A) + P(A') = 1$.
4.43 Fill in the Blank For any two events A and B, $P(A \cup B) = P(A) + P(B) -$ ________.

Practice
4.44 Consider an experiment with the probability of each simple event given in the table below.

Simple event   e1     e2     e3     e4     e5     e6     e7
Probability    0.07   0.09   0.13   0.18   0.22   0.15   0.16

The events A, B, C, and D are defined by
$A = \{e_1, e_2, e_3\}$
$B = \{e_2, e_4, e_6, e_7\}$
$C = \{e_1, e_5, e_7\}$
$D = \{e_3, e_4, e_5, e_6, e_7\}$

Find the following probabilities.
a. $P(A)$
b. $P(C)$
c. $P(D)$
d. $P(A \cup B)$
e. $P(A \cap C)$
f. $P(B \cap D)$
g. $P(A')$
h. $P(A \cap C')$
i. $P(A' \cap D)$
j. $P(C')$
k. $P(B \cap C \cap D)$
l. $P[(B \cup C)']$
How do you know there is no other possible simple event in this experiment?
4.45 An experiment consists of rolling a special 18-sided die. All of the numbers, 1 through 18, are equally likely. Find the probability of each event.
a. A = rolling an even number.
b. B = rolling a number divisible by 3.
c. C = rolling a number less than 7.
d. D = rolling at least a 10.
4.46 An experiment consists of rolling a special 22-sided die. All of the numbers, 1 through 22, are equally likely. Find the probability of each event.
a. A = rolling a number greater than 10 and even.
b. B = rolling a prime number or a number divisible by 5.
c. C = rolling at most an 11.
d. D = rolling a number divisible by 2 and 3.
4.47 Consider an experiment, the events A and B, and probabilities $P(A) = 0.55$, $P(B) = 0.45$, and $P(A \cap B) = 0.15$. Find the probability of:
a. A or B occurring.
b. A and B occurring.
c. Just A occurring.
d. Just A or just B occurring.
4.48 Consider an experiment, the events A and B, and probabilities $P(A) = 0.26$, $P(B) = 0.68$, and $P(A \cup B) = 0.80$. Find each probability.
a. $P(A \cap B)$
b. $P(A')$
c. $P[(A \cap B)']$
d. $P[(A \cup B)']$

4.49 Consider an experiment, the events A and B, and probabilities $P(A) = 0.355$, $P(B) = 0.406$, and $P(A \cap B) = 0.229$. Find each probability.
a. $P(A \cup B)$
b. $P[(A \cup B)']$
c. $P(B')$
d. $P[(A \cap B)']$
4.50 Carefully sketch a Venn diagram showing the relationship between two events. Add probabilities to the appropriate regions so that the following statements are true: $P(A \cap B) = 0.31$, $P(A) = 0.57$, and $P(B) = 0.48$.
4.51 Carefully sketch a Venn diagram showing the relationships among three events. Add probabilities to the appropriate regions so that the following statements are true:
$P(A) = 0.46$, $P(B) = 0.35$, $P(C) = 0.44$
$P(A \cap B) = 0.05$, $P(A \cap C) = 0.18$, $P(B \cap C) = 0.14$
$P(A \cap B \cap C) = 0.03$

Applications
4.52 Manufacturing and Product Development Valassis, a marketing services company, offers a cafeteria-style benefit program; an employee may select three benefits from five. The five possible benefits are health insurance, life insurance, a prescription plan, dental insurance, and vision insurance.
a. How many different benefit packages can an employee select? List them.
b. If all benefit packages are equally likely, what is the probability that an employee selects a package that includes health insurance?
c. If all benefit packages are equally likely, what is the probability that an employee selects a package that includes life insurance and a prescription?
4.53 Travel and Transportation Suppose a bridge has 10 toll booths in the east-bound lane: four are only for E-Z Pass holders, two are only for exact change, one takes only tokens, and the remainder are manned by toll collectors who accept only cash. During heavy-traffic hours it is difficult to see the signs indicating the type of toll booth. Suppose a driver selects a toll booth randomly.
a. What is the probability that an exact-change toll booth is selected?
b. What is the probability that a manual-collection toll booth or the token toll booth is selected?
c. What is the probability that an E-Z Pass toll booth is not selected?
d. Suppose the driver has only tokens. What is the probability of selecting the appropriate toll booth?
4.54 Public Health and Nutrition As of February 16, 2013, the risk of contracting the flu in the United States was still elevated. However, the number of new cases was decreasing. The following table lists the proportion of reported cases of influenza A by region since September 30, 2012.⁴

Region       1       2       3       4       5       6       7       8       9       10
Proportion   0.076   0.074   0.215   0.080   0.154   0.065   0.062   0.089   0.104   0.081

Suppose a reported case is selected at random.
a. What is the probability that the case is from Region 1, 2, 3, or 4?
b. What is the probability that the case is from Region 9 or 10?
c. What is the probability that the case is not from Region 5?
4.55 Marketing and Consumer Behavior Delorenzo’s Pizza offers five different toppings on its pizzas: pepperoni, sausage, olives, mushrooms, and anchovies. A large pizza comes with any two different toppings.
a. How many different two-topping pizzas are possible?
b. Suppose that all of the pizzas are equally likely. What is the probability that the next pizza ordered has at least one meat topping?


c. What is the probability that the next pizza ordered does not have anchovies?
d. Suppose one more large pizza choice is added: plain cheese with no toppings. Answer parts (b) and (c) with this added assumption.
4.56 Demographics and Population Statistics The following table lists the proportion of employed people by each major industry in Japan as of December 2012.⁵

Major industry                        Proportion
Agriculture and forestry              0.0328
Construction                          0.0845
Manufacturing                         0.1722
Information and communications        0.0330
Transport and postal activities       0.0585
Wholesale and retail trade            0.1786
Scientific, professional, technical   0.0367
Accommodations                        0.0666
Personal services                     0.0412
Education                             0.0511
Medical                               0.1247
Services                              0.0797
Government                            0.0404

Suppose an employed person in Japan is selected at random.
a. What is the probability that the person works in construction or manufacturing?
b. What is the probability that the person does not work in wholesale or retail trade?
c. What is the probability that the person does not work in education nor medical?
4.57 Public Health and Nutrition The number of reported cases of vaccine-preventable diseases in Canada in 2011 is given in the following table.⁶

Disease          Frequency
Hib meningitis   38
Measles          759
Mumps            282
Pertussis        676
Other            8

Suppose a reported case is selected at random.
a. What is the probability that the case is measles?
b. What is the probability that the case is mumps or pertussis?
c. What is the probability that the case is not Hib meningitis?
4.58 Marketing and Consumer Behavior A marketing firm can place an advertisement using several media. The table below shows the probability that a randomly selected person in a targeted region will see the advertisement in the given medium.

Medium        Newspaper   Radio   Magazine   TV     Internet   Billboard   Not seen
Probability   0.15        0.10    0.08       0.30   0.12       0.05        0.20

Consider the following events.
A = {Magazine, Newspaper}
B = {TV, Radio, Internet}
C = {Magazine, Newspaper, Internet, Billboard}
Find the following probabilities.
a. $P(A)$, $P(B)$, $P(C)$
b. $P(A \cup B)$, $P(A \cap B)$, $P(B \cap C)$
c. $P(A')$, $P(A' \cap C)$, $P(A \cap B \cap C)$
d. $P(B' \cap C')$, $P[(B \cup C)']$
4.59 Sports and Leisure The Florida Cash 3 daily midday lottery number consists of three digits, each 0–9.
a. How many possible midday numbers are there?
b. If all of the midday numbers are equally likely, find the probability that all three digits are the same.
c. If all of the midday numbers are equally likely, find the probability that all three digits are 8s or 9s.
d. There is an evening number, also consisting of three digits, 0–9. If all of the midday and evening numbers are equally likely, what is the probability that the two numbers are the same?
4.60 Marketing and Consumer Behavior The following table lists the most popular U.S. convention centers.⁷ Suppose the probability given represents the likelihood that a randomly selected U.S. convention will be held at that site.

Site             Probability
Orlando          0.310
Chicago          0.225
Las Vegas        0.098
Washington, DC   0.075
Dallas           0.064
Atlanta          0.055
Phoenix          0.033
Other            0.140

Suppose a convention is randomly selected. Consider the events:
A = {Convention in Orlando or Chicago}
B = {Convention not in Washington, DC}
C = {Convention in Las Vegas}
Find the following probabilities.
a. $P(A)$, $P(B)$, $P(C)$
b. $P(A \cap B)$, $P(A \cup C)$, $P(A \cap C)$
c. $P(A' \cup C)$, $P(A \cup B \cup C')$
4.61 Technology and the Internet Tablet computers have become very popular and fill a gap between smartphones and PCs. A recent survey indicated that of those people who own tablets, 70% use the device to play games and 44% use the device to access bank accounts.⁸ Suppose 30% do both (play games and access bank accounts), and suppose a tablet user is selected at random.
a. What is the probability that the tablet user plays games or accesses bank accounts?
b. What is the probability that the tablet user does not play games nor access bank accounts?
c. What is the probability that the tablet user only plays games? Only accesses bank accounts?


Extended Applications
4.62 Psychology and Human Behavior According to the 2011–2012 APPA National Pet Owners Survey, approximately 33% of all U.S. households own a cat and 39% of all U.S. households own a dog.⁹ Suppose 10% of all U.S. households own both a cat and a dog.
a. Carefully sketch a Venn diagram with probabilities to illustrate the relationship between the two events C = household owns a cat, and D = household owns a dog.
b. What is the probability that a randomly selected U.S. household owns a cat or a dog?
c. What is the probability that a randomly selected household owns neither a cat nor a dog?
d. What is the probability that the U.S. household owns only a cat?
4.63 Marketing and Consumer Behavior Of all those people who enter Uncle’s Stereo, a discount electronics store in New York City, 28% purchase a digital camera, 5% buy a home theater receiver, and 4% buy both. Suppose a customer is selected at random.
a. What is the probability that the customer buys a digital camera or a home theater receiver?
b. What is the probability that the customer buys either a digital camera or a home theater receiver, but not both?
c. What is the probability that the customer buys only a digital camera?
d. What is the probability that the customer does not buy a home theater receiver?
4.64 Demographics and Population Statistics The following table shows the ABO and Rh blood-type probabilities for people in the United States.¹⁰ (This table is called a joint probability table. Each number in the table can be thought of as the probability of an intersection; for example, the probability of blood type A and negative Rh is 0.06.)

                      ABO type
Rh type      O      A      B      AB
Positive     0.38   0.34   0.09   0.03
Negative     0.07   0.06   0.02   0.01

Suppose a U.S. resident is selected at random. Find the following probabilities.
a. The person has Rh-positive blood.
b. The person has type B blood.


c. The person does not have type O blood.
d. The person has type AB or Rh-negative blood.
4.65 Manufacturing and Product Development A tire manufacturer has started a program to monitor production. In every batch of eight tires, two will be randomly selected and tested for defects electronically. An experiment consists of recording the condition of these two tires: defect-free (G) or reject (B). Suppose two of the eight tires in a batch actually have serious defects.
a. List the outcomes in this experiment.
b. What is the probability that both tires selected will be defect-free?
c. What is the probability that at least one of the tires selected will have a defect?
d. What is the probability that both tires selected will have a defect?
4.66 Medicine and Clinical Studies The number of Emergency Room visits has increased over the past several years in the United States. One reason for this increase may be that in difficult economic times people tend to postpone routine health care. This results in more visits to the ER. Of all patients who visit an Emergency Room, suppose 22% are seen in less than 15 minutes, 13% are admitted to the hospital, and 5% are seen in less than 15 minutes and admitted to the hospital.¹¹ Suppose a patient who made an Emergency Room visit is selected at random.
a. Carefully sketch a Venn diagram showing the relationship between the events seen in less than 15 minutes and admitted to the hospital, and add probabilities to the appropriate regions.
b. What is the probability that the patient was seen in less than 15 minutes or admitted to the hospital?
c. What is the probability that the patient was seen in less than 15 minutes but not admitted to the hospital?
d. What is the probability that the patient was neither seen in less than 15 minutes nor admitted to the hospital?

Challenge
4.67 Reconsider Example 4.15. Suppose the owner records the type of bagel purchased for the next ten customers. Find the probability that everyone buys a plain bagel. What do you think about the assumption of equal demand now?

4.3 Counting Techniques

In an equally likely outcome experiment, computing probabilities means counting. To find the probability of an event A, count the number of outcomes in the event A and divide by the number of outcomes in the entire sample space S: $P(A) = N(A) / N(S)$. If N(S) is large, drawing a tree diagram or listing all of the possible outcomes is impractical. For certain experiments, the following rules may be used instead to count outcomes in an event and/or a sample space.


The Multiplication Rule

Suppose an outcome in an experiment consists of an ordered list of k items selected using the following procedure:
1. There are $n_1$ choices for the first item.
2. There are $n_2$ choices for the second item, no matter which first item was selected.
3. The process continues until there are $n_k$ choices for the kth item, regardless of the previous items selected.
There are $N(S) = n_1 \cdot n_2 \cdot n_3 \cdots n_k$ outcomes in the sample space S.

A CLOSER LOOK
1. You can picture (and even prove) this rule by drawing a tree diagram and counting the number of paths from left to right.
2. To use this rule, think of each choice as a slot, or a position, to fill. Write the number of choices for each slot above the slot, and multiply:
\[ \underset{\text{Item 1}}{n_1} \times \underset{\text{Item 2}}{n_2} \times \cdots \times \underset{\text{Item } k}{n_k} = n_1 \cdot n_2 \cdots n_k \]
3. This counting technique can also be used for events, not just for sample spaces.

Example 4.19 Surround Sound A home theater system consists of a receiver, surround-sound speakers, and a Blu-ray player. Vann’s Inc. store sells 7 different receivers, 12 types of speakers, and 9 different Blu-ray players. How many possible systems can be constructed?

SOLUTION
STEP 1 This is a counting problem, and there are three slots to fill: receiver, speakers, and Blu-ray player. We’ll assume that all components are compatible, and that the choice of any one item does not depend on any other item.
STEP 2 Here’s how to apply the multiplication rule:
\[ \underset{\text{Receiver}}{7} \times \underset{\text{Speakers}}{12} \times \underset{\text{Blu-ray player}}{9} = 756 \]
There are 756 possible systems.
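The count in Example 4.19 can be checked by brute force. The following Python sketch is only an illustration: the component labels (R1, S1, B1, and so on) are hypothetical placeholders, not actual products. It lists every (receiver, speakers, player) triple and compares the number of triples with the product of the slot counts.

```python
from itertools import product
from math import prod

# Hypothetical labels standing in for the components in Example 4.19.
receivers = [f"R{i}" for i in range(1, 8)]    # 7 receivers
speakers = [f"S{i}" for i in range(1, 13)]    # 12 types of speakers
players = [f"B{i}" for i in range(1, 10)]     # 9 Blu-ray players

# Every ordered triple is one possible system (one path in a tree diagram).
systems = list(product(receivers, speakers, players))

# Multiplication rule: multiply the number of choices in each slot.
rule_count = prod([len(receivers), len(speakers), len(players)])

print(len(systems), rule_count)
```

Listing outcomes is practical here only because the sample space is small; the multiplication rule gives the same count without building the list.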

Example 4.20 License Plates A Connecticut license plate consists of three letters followed by three numbers. a. How many different license plates are possible? b. How many license plates end in 555?

SOLUTION
a. This is a counting problem. There are six slots to fill: three letters followed by three numbers. There are 26 possible letters for each of the first three positions, and 10 possible numbers for each of the last three positions. Use the multiplication rule.
\[ \underset{\text{Letter}}{26} \times \underset{\text{Letter}}{26} \times \underset{\text{Letter}}{26} \times \underset{\text{Number}}{10} \times \underset{\text{Number}}{10} \times \underset{\text{Number}}{10} = 17{,}576{,}000 \]
There are 17,576,000 possible different license plates. (The actual number of possible license plates is smaller, because some three-letter words aren’t allowed.)


b. If the license plate ends in 555, then each of the number positions is fixed; there is only one choice. We are still free to choose any letter in each of the first three positions. The multiplication rule still works.
\[ \underset{\text{Letter}}{26} \times \underset{\text{Letter}}{26} \times \underset{\text{Letter}}{26} \times \underset{\text{Number}}{1} \times \underset{\text{Number}}{1} \times \underset{\text{Number}}{1} = 17{,}576 \]
There are 17,576 license plates that end in 555.
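As a quick check of both parts of Example 4.20, the slot-by-slot products can be computed directly. This is a minimal sketch; the helper name `count_plates` is made up for illustration, and, as in the example, no three-letter words are excluded.

```python
from math import prod

LETTERS = 26   # choices for each letter slot
DIGITS = 10    # choices for each number slot

def count_plates(slot_choices):
    """Multiplication rule: product of the number of choices per slot."""
    return prod(slot_choices)

# Part (a): three free letter slots followed by three free number slots.
all_plates = count_plates([LETTERS] * 3 + [DIGITS] * 3)

# Part (b): the three number slots are fixed at 5, so each has one choice.
ends_in_555 = count_plates([LETTERS] * 3 + [1] * 3)

print(all_plates, ends_in_555)
```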

Example 4.21 Five-of-a-Kind In the game of Yahtzee, five fair dice are rolled and the numbers that land face up are recorded. a. How many different rolls are possible? b. What is the probability of rolling a Yahtzee (all five dice show the same number)?

SOLUTION
a. There are five slots to fill, one for each die. Use the multiplication rule.
\[ \underset{\text{Die 1}}{6} \times \underset{\text{Die 2}}{6} \times \underset{\text{Die 3}}{6} \times \underset{\text{Die 4}}{6} \times \underset{\text{Die 5}}{6} = 7776 \]
There are 7776 possible rolls, or outcomes, in the sample space.
b. There are only six possible Yahtzees: 11111, 22222, 33333, 44444, 55555, and 66666. Because all the outcomes are equally likely (fair dice), the probability of rolling a Yahtzee is
\[ P(\text{Yahtzee}) = \frac{\text{number of Yahtzees}}{\text{number of different rolls}} = \frac{6}{7776} = 0.0007716 \]
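Because the sample space in Example 4.21 has only 7776 outcomes, it can be enumerated completely. The sketch below, written for illustration, treats each roll as an ordered 5-tuple of faces and counts the rolls in which all five faces match.

```python
from fractions import Fraction
from itertools import product

# Each outcome is an ordered 5-tuple: one face (1 through 6) per die.
rolls = list(product(range(1, 7), repeat=5))

# A Yahtzee is a roll in which all five dice show the same number.
yahtzees = [roll for roll in rolls if len(set(roll)) == 1]

# Equally likely outcomes: P(A) = N(A) / N(S).
p_yahtzee = Fraction(len(yahtzees), len(rolls))

print(len(rolls), len(yahtzees), float(p_yahtzee))
```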

Example 4.22 Win, Place, or Show Suppose there are 12 entries in the Preakness Stakes horse race. An experiment consists of recording the finish: the first-, second-, and third-place horse. For example, the outcome (7, 9, 2) means horse 7 came in first, horse 9 came in second, and horse 2 came in third. a. How many different finishes are possible? b. What is the probability of a finish with horse 4 or 5 in first place? c. What is the probability that horse 7 will not finish first, second, or third?

SOLUTION
a. There are three positions to fill, but the number of choices in the second slot depends on the first choice and the number of choices in the third slot depends on the first two choices. Even though we are drawing from the same, reduced, collection, we can still use the multiplication rule. There are 12 horses that could finish first. Once a first-place horse is selected, there are only 11 left that could come in second. After a first and second-place horse are selected, there are only 10 possible for third place. The multiplication rule is used here to count the number of permutations.
\[ \underset{\text{First}}{12} \times \underset{\text{Second}}{11} \times \underset{\text{Third}}{10} = 1320 \]
There are 1320 possible different finishes. $N(S) = 1320$.


b. Let A be the event horse 4 or 5 wins the race. We’ll assume that all of the outcomes are equally likely so that $P(A) = N(A) / N(S) = N(A) / 1320$. There are two choices for first place (horse 4 or 5). There are now 11 choices for second place (the horse not selected for first, plus the remaining 10), and 10 choices for third place. The multiplication rule is used to find the number of outcomes in A:
\[ \underset{\text{First}}{2} \times \underset{\text{Second}}{11} \times \underset{\text{Third}}{10} = 220 \]
Finally, $P(A) = 220 / 1320 = 0.1667$.
c. The word not suggests the use of a complement, but a direct approach may be easier. Let the event B = horse 7 does not finish first, second, or third. $P(B) = N(B) / N(S) = N(B) / 1320$. Use the multiplication rule again to count the number of outcomes in the event B. We do not want horse 7 in the top three. That leaves 11 possible horses for first place, 10 for second, and 9 for third.
\[ \underset{\text{First}}{11} \times \underset{\text{Second}}{10} \times \underset{\text{Third}}{9} = 990 \]
$P(B) = 990 / 1320 = 0.75$.

TRY IT NOW

GO TO EXERCISE 4.76
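The three counts in Example 4.22 can be confirmed by listing every ordered finish. This is an illustrative sketch that assumes, as the example does, that all finishes are equally likely; the horses are simply numbered 1 through 12.

```python
from fractions import Fraction
from itertools import permutations

horses = range(1, 13)  # the 12 entries, numbered 1 through 12

# Sample space: every ordered (first, second, third) finish.
finishes = list(permutations(horses, 3))

# Event A: horse 4 or horse 5 finishes first.
event_a = [f for f in finishes if f[0] in (4, 5)]

# Event B: horse 7 does not finish first, second, or third.
event_b = [f for f in finishes if 7 not in f]

p_a = Fraction(len(event_a), len(finishes))
p_b = Fraction(len(event_b), len(finishes))

print(len(finishes), len(event_a), len(event_b), p_a, p_b)
```

`itertools.permutations` produces ordered selections without repetition, which matches the "reduced collection" reasoning used in part (a).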

The following notation is often used to write large numbers associated with counting problems more concisely.

Definition For any positive whole number n, the symbol n! (read “n factorial”) is defined by
\[ n! = n(n-1)(n-2)\cdots(3)(2)(1) \]
In addition, $0! = 1$ (0 factorial is 1).

A CLOSER LOOK
1. To find n!, just start with n, multiply by (n − 1), then (n − 2), . . . , down to 1. For example,
\[ 7! = (7)(6)(5)(4)(3)(2)(1) = 5040 \]
\[ 10! = (10)(9)(8)(7)(6)(5)(4)(3)(2)(1) = 3{,}628{,}800 \]
2. Factorials get really big, really fast. Try finding 50!. If you absolutely have to find a large factorial, then you should probably use a good calculator or computer.

Consider a generalization of the horse-racing problem. Suppose there are n items to choose from, r positions to fill, and the order of selection matters. There are n choices for the first position, n − 1 choices for the second position, and n − 2 choices for the third position. This process continues until there are n − (r − 1) choices for the rth position. The product of these numbers is the total number of permutations.

Definition Given a collection of n different items, an ordered arrangement, or subset, of these items is called a permutation. The number of permutations of n items, taken r at a time, is given by
\[ {}_{n}P_{r} = n(n-1)(n-2)\cdots[n-(r-1)] \]
Using the definition of factorial,
\[ {}_{n}P_{r} = \frac{n!}{(n-r)!} \]
${}_{n}P_{r}$ is also referred to as n items permuted r at a time. In the denominator, do the subtraction first, then the factorial.

A CLOSER LOOK
1. All n items must be different in order for this formula to be used.
2. A distinguishing characteristic of a permutation is that order matters. For example, if the outcome AB is different from the outcome BA, that suggests a permutation. Suppose an experiment consists of selecting two students from a class of 35. The first one selected will be the president and the second will be the vice president. Order certainly matters here; we will be counting permutations. If the two students selected will form a committee, however, then the order of selection does not matter. Counting in this case involves a combination, which will be introduced a little later.
3. Here is an example to justify this formula.
\begin{align*}
{}_{12}P_{3} &= (12)(11)(10) && \text{Definition of } {}_{12}P_{3},\ n = 12,\ r = 3. \\
&= (12)(11)(10) \times \frac{(9)(8)(7)(6)(5)(4)(3)(2)(1)}{(9)(8)(7)(6)(5)(4)(3)(2)(1)} && \text{Multiply by 1 in a nice form.} \\
&= \frac{(12)(11)(10)(9)(8)(7)(6)(5)(4)(3)(2)(1)}{(9)(8)(7)(6)(5)(4)(3)(2)(1)} && \text{Rewrite as one fraction.} \\
&= \frac{12!}{9!} = \frac{12!}{(12-3)!} = \frac{n!}{(n-r)!} && \text{Definition of factorial.}
\end{align*}
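The two forms of the permutation formula can be compared numerically. The sketch below is illustrative only: the function names `permutations_product` and `permutations_factorial` are invented for this example, while `math.factorial` and `math.perm` are standard-library functions (`math.perm` assumes Python 3.8 or later).

```python
import math

def permutations_product(n, r):
    """n(n-1)...[n-(r-1)]: multiply the r slot counts directly."""
    result = 1
    for k in range(r):
        result *= n - k
    return result

def permutations_factorial(n, r):
    """n! / (n - r)!, doing the subtraction before the factorial."""
    return math.factorial(n) // math.factorial(n - r)

n, r = 12, 3
print(permutations_product(n, r), permutations_factorial(n, r), math.perm(n, r))

# Factorials get really big, really fast: print 50! and its number of digits.
fifty_factorial = math.factorial(50)
print(fifty_factorial, len(str(fifty_factorial)))
```

The product form avoids computing large factorials that mostly cancel, which is the same cancellation seen when the formula is worked by hand.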

Example 4.23 Vending Machine Selection A vending machine has room for six types of soda. The soda can be arranged in any order to correspond with the selection buttons on the front of the machine. If the operator has 10 different types of soda to choose from, how many machine selection arrangements are possible?

SOLUTION
STEP 1 There are n = 10 items, we need to choose r = 6, and the order in which the soda is arranged matters. For example, if capital letters represent soda types, then the arrangement ABCDEF is different from ABCEDF. We must count the number of permutations of 10 items, taken 6 at a time.
STEP 2
\begin{align*}
{}_{10}P_{6} &= \frac{10!}{(10-6)!} = \frac{10!}{4!} && \text{Definition of } {}_{n}P_{r}, \text{ using factorials.} \\
&= \frac{(10)(9)(8)(7)(6)(5)(4)(3)(2)(1)}{(4)(3)(2)(1)} && \text{Definition of factorial.} \\
&= (10)(9)(8)(7)(6)(5) = 151{,}200 && \text{Cancel; multiply.}
\end{align*}
There are 151,200 ordered arrangements of soda types in the vending machine. Figure 4.12 shows a technology solution.

Note: If you compute nPr by hand, there is always a lot of canceling. nPr is a count, so the answer has to be an integer.

Figure 4.12 TI-84 Plus C permutation function.

Example 4.24 Sheldon and Leonard A fan of The Big Bang Theory has taped nine episodes from the most recent season of this show. However, he only has time to watch four episodes. Suppose he selects four shows at random. a. How many different ordered arrangements of episodes are possible? b. If the season finale is recorded, what is the probability that he will select and watch this episode last?


Solution Trail 4.24b
KEYWORDS
■ Selects at random
TRANSLATION
■ Equally likely outcomes
CONCEPTS
■ $P(A) = N(A) / N(S)$
VISION
Count the number of arrangements in which the final recording is selected last, and divide this count by the total number of ordered arrangements.

SOLUTION
a. There are n = 9 episodes to choose from. We need to count the number of ordered arrangements of r = 4 recordings.
\begin{align*}
{}_{9}P_{4} &= \frac{9!}{(9-4)!} = \frac{9!}{5!} && \text{Definition of } {}_{n}P_{r}, \text{ using factorials.} \\
&= \frac{(9)(8)(7)(6)(5)(4)(3)(2)(1)}{(5)(4)(3)(2)(1)} && \text{Definition of factorial.} \\
&= (9)(8)(7)(6) = 3024 && \text{Cancel; multiply.}
\end{align*}
There are 3024 different ordered arrangements of four episodes.
b. Let A = the last recording selected is the season finale. There are four positions to fill, but the last slot is fixed (with the season finale). The first three positions can be filled by any of the remaining eight recordings, in any order.
\[ N(A) = \underbrace{\underset{\text{Rec 1}}{8} \times \underset{\text{Rec 2}}{7} \times \underset{\text{Rec 3}}{6}}_{{}_{8}P_{3}} \times \underset{\text{Rec 4}}{1} = 336 \]
\[ P(A) = \frac{336}{3024} = 0.1111 \]

Figure 4.13 shows a technology solution. TRY IT NOW

GO TO EXERCISE 4.91
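The counts in Example 4.24 can also be obtained by enumeration. In this illustrative sketch the episode labels are hypothetical; the string "finale" stands for the season finale.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical labels: eight ordinary episodes plus the season finale.
episodes = [f"ep{i}" for i in range(1, 9)] + ["finale"]

# Sample space: every ordered arrangement of 4 of the 9 recordings.
orders = list(permutations(episodes, 4))

# Event A: the season finale is the last recording watched.
finale_last = [order for order in orders if order[-1] == "finale"]

p_finale_last = Fraction(len(finale_last), len(orders))

print(len(orders), len(finale_last), p_finale_last, round(float(p_finale_last), 4))
```

The exact fraction makes the result easier to interpret: the finale is equally likely to occupy the last position as any of the other eight recordings.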

Figure 4.13 Find the number of permutations and divide by the total number of outcomes.

In many experiments, the order in which the items are selected does not matter, for example, selecting five manufactured items from a batch of 50 for inspection, choosing nine people from 35 for a search committee, or picking three tax returns from 100 for a federal audit. In each case, the order of selection is not important; the collection, or group selected, is a single outcome. These unordered arrangements are called combinations.

Definition Given a collection of n different items, an unordered arrangement, or subset, of these items is called a combination. The number of combinations of n items, taken r at a time, is given by
\[ {}_{n}C_{r} = \binom{n}{r} = \frac{n!}{r!\,(n-r)!} = \frac{{}_{n}P_{r}}{r!} \]

A CLOSER LOOK
1. $\binom{n}{r}$ is read as “n choose r.”
2. To find nCr from nPr we need to collapse all ordered arrangements of the same r items into one possible outcome. Dividing by r! does this because every unordered set of r distinct items can be arranged in r! ways.
3. If you have to calculate nCr by hand, there is always a lot of cancellation. The final answer must be an integer because it is a count.
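The "collapsing" idea behind dividing by r! can be seen on a small case. This is a minimal sketch under stated assumptions: the five items A through E are hypothetical, r = 3 is chosen only to keep the lists short, and `math.comb` assumes Python 3.8 or later.

```python
import math
from itertools import combinations, permutations

items = "ABCDE"   # hypothetical collection of n = 5 different items
r = 3

ordered = list(permutations(items, r))      # ordered arrangements, counted by nPr
unordered = list(combinations(items, r))    # unordered arrangements, counted by nCr

# Collapse ordered arrangements that use the same r items into one outcome.
collapsed = {frozenset(arrangement) for arrangement in ordered}

print(len(ordered), len(unordered), len(collapsed))

# Each unordered set of r items appears r! times among the ordered arrangements.
print(len(ordered) // math.factorial(r) == math.comb(len(items), r))
```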

Example 4.25 Jury Duty How many different ways are there to select a jury of 12 people from a pool of 20?

SOLUTION
STEP 1 There are n = 20 prospective jurors, and we need to choose r = 12, without regard to order. A jury is an unordered arrangement of 12 people. We need to count the number of combinations of 20 items, taken 12 at a time.
STEP 2
\begin{align*}
{}_{20}C_{12} = \binom{20}{12} &= \frac{20!}{12!\,(20-12)!} = \frac{20!}{12!\,8!} && \text{Definition of } \binom{n}{r}. \\
&= \frac{(20)(19)(18)(17)(16)(15)(14)(13)}{8!} = 125{,}970 && \text{Cancellation; computation.}
\end{align*}
There are 125,970 ways to select a jury of 12 from a pool of 20 candidates. Figure 4.14 shows a technology solution.

Figure 4.14 The TI-84 Plus C combination function.

Example 4.26 Hardwood Floors Lumber Liquidators ships 2 1/4-inch solid oak wood flooring in cartons containing 20 square feet. Suppose that two cartons in a shipment of 11 contain defective pieces. An installer randomly selects five cartons. a. What is the probability that there are no defective pieces in any of the five cartons selected? b. What is the probability that the installer picks exactly one carton that contains defective pieces?

Solution Trail 4.26
KEYWORDS
■ Randomly selects
TRANSLATION
■ Equally likely outcomes
CONCEPTS
■ Probability of an event means counting
■ Combinations
VISION
The order in which the cartons are selected does not matter. To find the number of outcomes in the sample space, count combinations. To count the number of outcomes in each event, use the multiplication rule and the formula for nCr.

SOLUTION
There are n = 11 cartons, and we need to choose r = 5, without regard to order.
\[ {}_{11}C_{5} = \binom{11}{5} = \frac{11!}{5!\,(11-5)!} = \frac{11!}{5!\,6!} = \frac{(11)(10)(9)(8)(7)}{5!} = 462 \]
There are 462 outcomes in the sample space, all equally likely.
a. Let A = select no cartons that contain defective pieces. Count the number of ways to select no cartons that contain defective pieces. This is the same as choosing five good cartons. There are nine good cartons, so we count the number of ways to select five cartons from the nine good cartons, without regard to order.
\[ N(A) = \binom{9}{5} = \frac{9!}{5!\,(9-5)!} = \frac{9!}{5!\,4!} = \frac{(9)(8)(7)(6)}{4!} = 126 \]
Because this is an equally likely outcome experiment,
\[ P(A) = \frac{N(A)}{N(S)} = \frac{126}{462} = 0.2727 \]
b. Let B = select one carton that contains defective pieces (and, therefore, four good cartons). To find the number of outcomes in B, there are two cases to consider: the number of ways to select one bad carton, and the number of ways to select four good cartons.
\[ N(B) = \binom{2}{1} \times \binom{9}{4} = 2 \times 126 = 252 \]
Here $\binom{2}{1}$ is the number of ways to select one bad carton from two, without regard to order, and $\binom{9}{4}$ is the number of ways to select four good cartons from nine, without regard to order.
\[ P(B) = \frac{N(B)}{N(S)} = \frac{252}{462} = 0.5455 \]
Figure 4.15 shows a technology solution.

Figure 4.15 Probability calculations.

TRY IT NOW

GO TO EXERCISE 4.85
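The probabilities in Example 4.26 follow one pattern: choose k bad cartons from the two bad ones, choose the rest from the nine good ones, and divide by the number of ways to choose five from all eleven. The sketch below is an illustration; the function name `prob_defective` is hypothetical, and the brute-force check labels cartons by position only.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

GOOD, BAD, PICKED = 9, 2, 5   # cartons in the shipment and number selected

def prob_defective(k):
    """P(exactly k of the PICKED cartons contain defective pieces)."""
    favorable = comb(BAD, k) * comb(GOOD, PICKED - k)   # multiplication rule
    return Fraction(favorable, comb(GOOD + BAD, PICKED))

p_none = prob_defective(0)
p_one = prob_defective(1)

# Brute-force check: list every unordered selection of five cartons.
cartons = ["bad"] * BAD + ["good"] * GOOD
selections = list(combinations(range(len(cartons)), PICKED))
exactly_one = sum(
    1 for s in selections if sum(cartons[i] == "bad" for i in s) == 1
)

print(len(selections), round(float(p_none), 4), round(float(p_one), 4), exactly_one)
```

Summing `prob_defective(k)` over k = 0, 1, 2 gives 1, which is a useful sanity check that every selection has been counted exactly once.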


Technology Corner Procedure: Compute permutations and combinations. Reconsider: Examples 4.23 and 4.25, solutions, and interpretations.

TI-84 Plus C There are built-in functions to compute permutations, combinations, and even factorials. 1. On the Home screen, enter the value of n. 2. Select MATH ; PROB ; nPr or MATH ; PROB ; nCr. 3. Enter the value of r. See Figures 4.13 and 4.14.

Minitab Use the functions Permutations and Combinations in either a Session window or in Calc; Calculator. See Figure 4.16.

Figure 4.16 Minitab functions for permutations and combinations.

Excel Use the function PERMUT to compute permutations and the function COMBIN to compute combinations. Use intermediate results to compute probabilities. See Figure 4.17.

Figure 4.17 Excel functions for permutations and combinations.

SECTION 4.3 EXERCISES

Concept Check

4.68 Fill in the Blank A ________ is a visualization of the multiplication rule.

4.69 Fill in the Blank An ordered arrangement is called a ________.

4.70 Fill in the Blank An unordered arrangement is called a ________.

4.71 Short Answer Counting rules are most helpful in what kind of an experiment?

4.72 True/False For fixed values of n and r, nPr is always greater than or equal to nCr.

Practice

4.73 Find the number of permutations indicated. a. ${}_{8}P_{4}$ b. ${}_{11}P_{7}$ c. ${}_{12}P_{4}$ d. ${}_{10}P_{10}$ e. ${}_{10}P_{1}$ f. ${}_{10}P_{0}$ g. ${}_{9}P_{2}$ h. ${}_{20}P_{2}$ i. ${}_{100}P_{2}$

4.74 Find the number of combinations indicated. a. $\binom{9}{5}$ b. $\binom{9}{4}$ c. $\binom{14}{7}$ d. $\binom{10}{10}$ e. $\binom{10}{1}$ f. $\binom{10}{0}$ g. $\binom{12}{3}$ h. $\binom{16}{7}$ i. $\binom{20}{18}$

4.75 How many permutations of the letters in the word HISTOGRAM are possible?

4.76 A businessman’s outfit consists of a pair of pants, a shirt, and a tie. Suppose he can choose from among 5 pairs of pants, 8 shirts, and 15 ties. a. How many different outfits are possible? b. Suppose a winter outfit includes a sweater and he can select one of 7 sweaters. Now how many different winter outfits are possible?

4.77 A disc jockey has 20 songs to choose from but can play only 7 in the next half-hour. How many different playlists are possible?

4.78 A grocery store has 6 cashiers on duty, 10 baggers, and 4 people who will help customers load their groceries into a car. How many different checkout crews are possible?

4.79 A television station is developing a new identifying three-note theme. How many different three-note themes are possible if there are 20 notes to choose from and no note can be repeated?

4.80 A small basket contains 17 good apples and 3 rotten apples. a. How many different handfuls of six apples are possible? b. How many different handfuls of five good apples and one rotten apple are possible? c. How many different handfuls of three good apples and three rotten apples are possible?

Applications

4.81 Manufacturing and Product Development Suppose Target sells a combination lock, which is really a permutation lock, with 40 numbers, 0 to 39. The combination for each lock is set at the factory and consists of three numbers. a. How many lock combinations are possible if numbers can be repeated? b. If all lock combinations are equally likely, what is the probability of selecting a lock with only single-digit numbers in the combination? c. Answer parts (a) and (b) if the lock combination must be three different numbers.

4.82 Public Policy and Political Science Suppose 14 carpenters report to the union hall hoping for a chance to work. Three of the 14 do not have their union card, and 6 carpenters will be selected at random for construction jobs. a. What is the probability that all six carpenters selected will have their union card? Write a Solution Trail for this problem. b. What is the probability that exactly one carpenter selected will not have a union card? c. What is the probability that at least one carpenter selected will not have a union card?

4.83 Fuel Consumption and Cars Suppose Geico offers automobile insurance with specific levels of coverage according to the table below.

Coverage levels
Medical: $10,000; $20,000; $50,000; $100,000
Bodily injury liability: $50,000; $100,000
Property damage liability: $25,000; $50,000; $100,000
Uninsured motorists: $50,000; $100,000; $200,000
Comprehensive: $250,000; $500,000; $1,000,000

Suppose an automobile policy must have all five coverages. a. How many different automobile policies are possible? b. How many policies have comprehensive coverage of at least $500,000? c. How many policies have bodily injury liability and property damage liability of $100,000?

4.84 Manufacturing and Product Development eBags offers backpacks in 5 styles, 3 sizes, and 10 colors. a. How many different backpacks does the company offer? b. Midnight blue and dark green are the two most popular colors. How many different backpacks in these colors does the company offer? c. The urban-style backpack is the least popular. If the company eliminates this style, how many different backpacks will it offer?

4.85 Fuel Consumption and Cars A small tool-and-die shop manufactures kneuter valves. A shipment of 15 valves to a Swedish automobile assembly plant contains three defective valves. Suppose the assembly plant randomly selects four valves from the shipment. a. What is the probability that all four valves will be defect-free? b. What is the probability that the plant will select all three defectives? c. What is the probability that the plant will select at least one defective?

4.86 Medicine and Clinical Studies A physician routinely visits a local nursing home on Thursday mornings to examine patients. Suppose the facility has 20 residents, but the physician only has time to check 8. The supervisor places 8 random patients on an ordered list and presents the schedule to the physician. a. How many different schedules are possible? b. If there are 15 women and 5 men in the facility, what is the probability that all appointments will be with women?


4.87 Psychology and Human Behavior A telemarketer has 12 people on his contact list. Suppose he will randomly select 8 people to call during the next shift. a. How many different calling schedules are possible? b. Suppose only 2 of the 12 will definitely purchase the product when contacted. What is the probability that these 2 people will be the first 2 called? c. Suppose another 2 of the 12 will ask to be placed on the do-not-call list when contacted. What is the probability that these 2 people will not be called?

4.88 Sports and Leisure In preparation for the coming season, a bass fisherman decides to buy 5 random lures out of the 10 new ones in the local tackle shop. a. How many different collections of 5 new lures are possible? b. Suppose 1 of the 10 lures is a Crazy Crawler. What is the probability that the fisherman will not select this lure? Write a Solution Trail for this problem. c. Suppose 3 of the 10 are Excalibur lures. What is the probability that at least 1 of the 5 selected will be an Excalibur lure?

4.89 Classic Vinyl A music collector has 15 unopened classic rock albums in her collection. Suppose she decides to select three to sell at an upcoming auction. a. How many different ways are there to select three albums from her collection? b. Suppose the albums are selected at random and five are by the Beatles. What is the probability that all three selected will be by the Beatles? c. What is the probability that none of the three will be by the Beatles? d. Suppose that one album is by the Doors. What is the probability that this album is not selected?

4.90 Business and Management The Gagosian art gallery in New York City has 20 stored paintings but has just made room to display several of them. Seven paintings will be randomly selected and offered to the public for sale. a. How many different collections of 7 paintings are possible? b. Suppose 10 of the 20 stored works are by the same local artist. What is the probability that all 7 of the selected paintings will be by this artist? c. The featured room in the gallery receives the most attention, and the order in which the paintings are displayed in this room is related to buyer interest. Suppose the 7 selected paintings will be placed in this featured room. How many different arrangements are possible?

4.91 Economics and Finance The purchasing agent for a state office building placed a call for bids on replacing the entry doors. Suppose that eight sealed bids are received by the deadline. The bids will be opened in random order. a. In how many different ways can the bids be opened? b. What is the probability that the lowest bid will be opened last?

4.92 Marketing and Consumer Behavior In remodeling a kitchen, a builder decides to place a splashguard behind the sink consisting of 8 six-inch-square ceramic tiles decorated with different botanical herbs. The tiles will be installed in a custom-made wooden panel. The tile supplier has 12 different herb designs to choose from, and the builder selects 8 of these 12 at random. Suppose the order in which the tiles are arranged on the splashguard does not matter. a. Two of the 12 herb tiles contain a blue tint that matches the kitchen color scheme. What is the probability that these 2 tiles will be included in the splashguard? b. The family actually grows 5 of the 12 herbs in a backyard garden. What is the probability that all 5 of these will be included on the splashguard?

4.93 Travel and Transportation A PennDOT road line-painting crew consists of a foreman, a driver, and a painter. Suppose a supervisor is preparing the schedule to paint lines on roads in Johnstown and 10 foremen, 15 drivers, and 17 painters are available. a. How many different crews are possible? b. Suppose the crews are selected at random, and there is one foreman who has a severe personality conflict with one driver. What is the probability that neither of these individuals will be on the road painting crew? c. Eight of the painters have been cited by a supervisor for improper painting. What is the probability that the crew will include one of these painters?

4.94 Education and Child Development A university library is preparing a display case of books written by faculty members. There are 25 new faculty books, but there is room for only 10 in the display case. Suppose 10 books are selected at random. a. How many different faculty book collections can be displayed? b. If 15 of the new books are written by faculty members from the College of Science and Technology, what is the probability that all 10 displayed books are written by faculty members from this college? c. If none of the 10 displayed books is written by faculty members from the College of Science and Technology, is there any evidence to suggest the selection process was not random? Justify your answer.

4.95 Psychology and Human Behavior In a family with five children, two of the five are selected at random each evening to do the dishes. The first one selected washes, and the second one dries. a. How many different wash–dry crews are possible? b. Suppose there are two girls and three boys in the family. If the two girls are selected to wash and dry, is there any evidence to suggest the selection process was not random? Justify your answer.

Extended Applications

4.96 Manufacturing and Product Development A remote-control garage door opener has a series of 10 two-position (0 or 1) switches used to set the access code. The code is initially set at the factory, and the switch sequence on the remote control and the opener must match in order to use the system. a. How many different access codes are possible? b. If all access codes are equally likely, what is the probability that a randomly selected system will have a code with exactly one 0? c. To increase security and ensure that customers will have different access codes, new systems have 10 three-position switches (0, 1, or 2). Answer parts (a) and (b) using the new system.

4.97 Psychology and Human Behavior An annual family picture following Thanksgiving dinner is arranged with all 10 family members in a row in front of a fireplace. a. How many different arrangements of family members are possible? b. Suppose the family includes one set of twins, and all arrangements are equally likely. What is the probability that the twins will be in the middle two places (positions 5 and 6)? c. What is the probability that the twins will be side by side in the picture? d. Suppose the family includes five males and five females. What is the probability that the picture arrangement will alternate male, female, male, female, etc., or female, male, female, male, etc.?

4.98 Public Policy and Political Science A special committee on community development has four members from the town council. The full town council has 14 members, six Democrats and eight Republicans. a. How many different committees on community development are possible? b. Suppose the committee members are selected at random. What is the probability of a committee consisting of all Republicans? c. Suppose every member of the committee selected is a Democrat. Do you believe the selection process was random? Justify your answer.

4.99 Sports and Leisure Texas hold ’em poker has become very popular in gambling casinos and is seen on ESPN and the Travel Channel. In November 2012, Greg Merson won the World Series of Poker in Las Vegas and a cool $8.53 million. The game is played with a standard 52-card deck, and starts with each player being dealt two (random) cards face down (hole cards). There is a round of betting, the dealer then flips three cards face up (the flop), betting, one card is flipped (the turn), betting, a fifth card is flipped (the river), and more betting. Let’s focus on the two hole cards, called a (pre-flop) hand, in this problem. a. How many (two-card, pre-flop) hands are possible in Texas hold ’em? b. What is the probability that a pre-flop hand consists of two aces? c. What is the probability that a pre-flop hand consists of a pair, that is, two cards of the same rank? d. What is the probability that a pre-flop hand consists of two cards of the same suit?


Challenge

4.100 The Complement Rule Reconsider Example 4.22. Verify the probability in part (c) using the complement rule.

4.101 Combination Patterns Find the following sums.
a. $\binom{2}{0} + \binom{2}{1} + \binom{2}{2}$
b. $\binom{3}{0} + \binom{3}{1} + \binom{3}{2} + \binom{3}{3}$
c. $\binom{4}{0} + \binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4}$
d. $\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n}$

4.102 Sports and Leisure Consider a regular deck of 52 playing cards. For a five-card poker hand, find the probability of: a. One pair. b. Two pairs. c. Three of a kind: three cards of the same rank and two others of different ranks, for example, JJJ74. d. A straight: five cards in sequence; the ace can be either high or low. e. A flush: five cards of the same suit.

4.103 Psychology and Human Behavior How many different ways are there to arrange n people at a round table? (Hint: A simple rotation of a seating plan, shifting each person around the table but keeping the order the same, is not a different arrangement.)

4.104 Travel and Transportation Suppose there are n items of which $n_1$ are of one type, $n_2$ are of a second type, . . . , and $n_k$ are of the kth type, and $n_1 + n_2 + \cdots + n_k = n$. The number of unordered arrangements of the n items is a generalized combination given by
\[ \binom{n}{n_1\ n_2\ \cdots\ n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \]
(Think about arranging a string of colored Christmas tree lights.) Suppose the Amtrak Auto Train from Washington, DC, to Florida has 10 sleeper cars, 2 diner cars, and 14 car carriers. Discounting the engine and caboose, how many different arrangements of cars in the train are there?

4.105 Public Policy and Political Science The U.S. Senate Committee on Homeland Security and Governmental Affairs has 16 members. The full Senate has 53 Democrats, 45 Republicans, and 2 Independents. a. How many different 16-member Senate committees are possible? b. If the committee members are selected at random, what is the probability of a committee consisting of all Democrats? c. What is the probability that the committee consists of 14 Democrats and 2 Independents?


4.4 Conditional Probability

The probability questions we have considered so far have all been examples of unconditional probability. No special conditions were imposed, nor was any extra information given. However, sometimes two events are related so that the probability of one depends on whether the other has occurred. In this case, knowing something extra may affect the probability assignment. This type of situation usually involves two events. The extra information may be expressed as an event separate from the event whose probability is desired.

Example 4.27 Morning Commute

Consider a banker who commutes 30 miles to work every day. Because of several factors (weather, road construction, family obligations, etc.), the probability that she makes it to work on time on any random day is 0.5. If the event $T$ is

$T$ = the banker makes it to work on time,

then $P(T) = 0.5$. This is an unconditional probability statement: No extra information related to the event $T$ is known or given.

Suppose a random day is selected, and the road conditions are terrible because of a snowstorm. The probability that the banker arrives at work on time is surely lower, perhaps around 0.1. Knowing the extra information (a snowstorm) changes the probability assignment for $T$. The statement, “What is the probability that the banker arrives at work on time if it is snowing?” is a conditional probability question. The extra information is that it’s snowing outside. If the event $F$ is defined as

$F$ = a snowstorm,

then this conditional probability is written as $P(T \mid F) = 0.1$; the probability that the banker arrives at work on time, given that it is snowing, is 0.1. (The vertical bar, |, in the probability statement is read as “given.”)

Suppose another random day is selected, but this time the banker wakes up before the alarm goes off and leaves the house early. The probability that she makes it to work on time is certainly higher, say, close to 0.95. Once again, knowing some extra information changes the probability assignment for $T$. If the event $E$ is

$E$ = the banker leaves her house early,

then $P(T \mid E) = 0.95$.

Knowing something extra may change the probability assignment. How do we use any added information to compute the (possibly) new probability? Consider the next example.

Example 4.28 Roll the Die

Consider an experiment in which a fair, six-sided die is rolled and the number landing face up is recorded. The sample space is $S = \{1, 2, 3, 4, 5, 6\}$. Consider the following events:

$A = \{1\}$ = roll a 1 and $B = \{1, 3, 5\}$ = roll an odd number.

Finding $P(A)$ is an unconditional probability question because no extra information is known. Because all of the outcomes in the experiment are equally likely, and there is one outcome in $A$ and there are six outcomes in the sample space, $P(A) = 1/6$.


Suppose someone rolls the die, covers it with her hands, peeks at the number, and reports, “I rolled an odd number.” With this added information, the probability of a 1 is now $P(A \mid B) = 1/3$. This conditional probability is reasonable because now we only have to consider three possibilities; that is, we have reduced the sample space from six outcomes to three, and the number of outcomes in $A$ is 1. The idea of reducing, or shrinking, the sample space is key to calculating conditional probabilities. The definition of conditional probability, and some justification for it, are given below.

Definition

Suppose $A$ and $B$ are events with $P(B) > 0$. The conditional probability of the event $A$ given that the event $B$ has occurred, $P(A \mid B)$, is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

What goes wrong with this definition if $P(B) = 0$?

A CLOSER LOOK

1. The unconditional probability of an event $A$ can be written as

$$P(A) = \frac{P(A)}{1} = \frac{P(A)}{P(S)} = \frac{\text{probability of the event } A}{\text{probability of the relevant sample space}}$$

We use this same reasoning to find $P(A \mid B)$.

2. Given that $B$ has occurred, the relevant sample space has changed. It is reduced from $S$ to $B$. (See Figure 4.18.)

3. Given that $B$ has occurred, the only way $A$ can occur is if $A \cap B$ has occurred, because the sample space has been reduced to $B$.

4. $P(A \mid B)$ is the probability that $A$ has occurred, $P(A \cap B)$, divided by the probability of the relevant sample space, $P(B)$.

Figure 4.18 An illustration for calculating conditional probability.

In the following example, the formula for conditional probability is used to confirm our intuitive answer to the die problem above.

Example 4.29 Roll the Die (Continued)

The experiment consists of rolling a fair, six-sided die and recording the number that lands face up. $S = \{1, 2, 3, 4, 5, 6\}$. Consider the following events:

$A = \{1\}$ = roll a 1 and $B = \{1, 3, 5\}$ = roll an odd number.

Find $P(A \mid B)$, the probability of rolling a one given that an odd number was rolled.


SOLUTION

STEP 1 We will need the following probabilities: $P(B) = 3/6$ and $P(A \cap B) = P(1) = 1/6$.

STEP 2

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{6} \cdot \frac{6}{3} = \frac{1}{3}$$

This answer agrees with the intuitive answer above (thank goodness).
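The two routes to this answer (applying the definition, and shrinking the sample space) can be checked side by side. The following is a minimal Python sketch, not part of the text's method: the helper names `prob` and `cond_prob` are hypothetical, and the code assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(event) for an experiment with equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(a, b, sample_space):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) > 0."""
    p_b = prob(b, sample_space)
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b, sample_space) / p_b

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a fair die
A = {1}                  # roll a 1
B = {1, 3, 5}            # roll an odd number

by_formula = cond_prob(A, B, S)               # (1/6) / (3/6)
by_reduction = Fraction(len(A & B), len(B))   # count inside the reduced space B

print(by_formula, by_reduction)  # 1/3 1/3
```

Exact fractions are used here so that the two results can be compared without rounding.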

A CLOSER LOOK

Here are some facts about union, intersection, and conditional probability to help translate and solve many of the problems that follow.

1. $P(A \cup B) = P(B \cup A)$. This is always true, because $A \cup B = B \cup A$ (all the outcomes in $A$ or $B$ or both).

2. $P(A \cap B) = P(B \cap A)$. This is also always true, because $A \cap B = B \cap A$.

3. $P(A \mid B) \neq P(B \mid A)$. These two probabilities could be equal, but in general they are different. It’s all right to switch $A$ and $B$ with union and intersection, but not with conditional probability. (When are these two conditional probabilities equal?)

4. The keywords given and suppose often signal partial information and, therefore, indicate a conditional probability question.
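Facts 1 through 3 can be illustrated with the die events from Example 4.29. This is a small Python sketch under the same equally-likely-outcomes assumption; the helper `p` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {1}                  # roll a 1
B = {1, 3, 5}            # roll an odd number

def p(event):
    """Probability of an event when the outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

# For Python sets, "|" is union and "&" is intersection
# (not the "given" bar of conditional probability).
union_symmetric = p(A | B) == p(B | A)
intersection_symmetric = p(A & B) == p(B & A)

p_a_given_b = p(A & B) / p(B)   # (1/6) / (3/6)
p_b_given_a = p(B & A) / p(A)   # (1/6) / (1/6)

print(union_symmetric, intersection_symmetric)  # True True
print(p_a_given_b, p_b_given_a)                 # 1/3 1
```

Here the two conditional probabilities differ: knowing the roll is odd leaves a 1 uncertain, but knowing the roll is a 1 makes "odd" certain.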

Example 4.30 Do You Have a Reservation?

The Zagat Survey, started in 1979 by two Yale-educated lawyers, invites diners to rate and review restaurants. The first survey included only New York City restaurants, but the company now offers dining guides to thousands of restaurants worldwide.12 Suppose Zagat asked 510 people selected at random to rate Charlie Trotter’s Restaurant in Chicago according to price (low, medium, or high) and food (1, 2, 3, or 4 stars). The results of this survey are presented in the two-way, or contingency, table below. The numbers in this table represent frequencies. For example, in the third row and fourth column, 30 people rated the prices high and the food 4 stars. The last column contains the sum for each row, and similarly, the bottom row contains the sum for each column. These sums are often called marginal totals.

                             Food rating
Price         1 star (A)   2 stars (B)   3 stars (C)   4 stars (D)   Total
Low (L)           20           35            90            15         160
Medium (M)        50           80            95            25         250
High (H)          25            5            40            30         100
Total             95          120           225            70         510

You can also think of this table as representing all of the simple events in an equally likely outcome experiment. For example, let the outcome HA mean a person rated the prices high and the food 1 star. The probability of HA is the number of outcomes in HA divided by the number of outcomes in the sample space: $N(HA)/N(S) = 25/510$.

Assume that these results are representative of the entire population of Chicago, so the relative frequency of occurrence is the true probability of the event. A person from Chicago is randomly selected.
a. Find the probability that the person rates the prices medium.
b. Find the probability that the person rates the food 2 stars.
c. Suppose the person selected rates the prices high. What is the probability that he rates the restaurant 1 star?
d. Suppose the person selected does not rate the food 4 stars. What is the probability that she rates the prices high?


SOLUTION

STEP 1 This is an unconditional probability question, asking only about the event $M$. Compute the relative frequency of occurrence of $M$, that is, the proportion of responses that rated the restaurant medium priced.

$$P(M) = \frac{50 + 80 + 95 + 25}{510} = \frac{250}{510} = 0.4902$$

STEP 2 This is just another unconditional probability question. Find the relative frequency of occurrence of 2 stars.

$$P(B) = \frac{35 + 80 + 5}{510} = \frac{120}{510} = 0.2353$$

STEP 3 This is a conditional probability question; the key word is suppose. The given information is rates prices high. We need the probability of the event $A$ given that the event $H$ has occurred. Using the formula for conditional probability,

$$P(A \mid H) = \frac{P(A \cap H)}{P(H)} = \frac{25/510}{100/510} = \frac{25}{510} \cdot \frac{510}{100} = \frac{25}{100} = 0.2500$$

This probability can also be obtained directly by reducing the sample space in the two-way table. The High (H) row, marked with an asterisk, is the reduced sample space.

Price         1 star (A)   2 stars (B)   3 stars (C)   4 stars (D)   Total
Low (L)           20           35            90            15         160
Medium (M)        50           80            95            25         250
High (H) *        25            5            40            30         100
Total             95          120           225            70         510

In the reduced sample space, 25 outcomes are in the event $A$. Therefore,

$$P(A \mid H) = \frac{\text{number of outcomes in } A \text{ and in the reduced sample space}}{\text{number of outcomes in the reduced sample space}} = \frac{25}{100} = 0.2500$$

STEP 4 Solve this conditional probability question by reducing the sample space via the two-way table. The given information is that the person does not rate the food 4 stars, so the reduced sample space consists of the 1 star, 2 stars, and 3 stars columns.

Price         1 star (A)   2 stars (B)   3 stars (C)   4 stars (D)   Total
Low (L)           20           35            90            15         160
Medium (M)        50           80            95            25         250
High (H)          25            5            40            30         100
Total             95          120           225            70         510

There are 440 (= 95 + 120 + 225) outcomes in the reduced sample space, and 70 (= 25 + 5 + 40) of these people rated the prices high.

$$P(H \mid D') = \frac{70}{440} = 0.1591$$

TRY IT NOW GO TO EXERCISE 4.124
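The four calculations in Example 4.30 follow one pattern: count the outcomes in the event of interest, then divide by the size of the relevant (possibly reduced) sample space. A minimal Python sketch of that pattern is shown below; the dictionary layout and the helper name `count` are assumptions made for illustration, while the frequencies are those in the two-way table.

```python
# Frequencies from the two-way table: counts[price][food rating]
counts = {
    "L": {"A": 20, "B": 35, "C": 90, "D": 15},
    "M": {"A": 50, "B": 80, "C": 95, "D": 25},
    "H": {"A": 25, "B": 5, "C": 40, "D": 30},
}

ALL_PRICES = set(counts)
ALL_STARS = {"A", "B", "C", "D"}

def count(prices, stars):
    """Number of respondents whose price is in `prices` and rating is in `stars`."""
    return sum(counts[p][s] for p in prices for s in stars)

total = count(ALL_PRICES, ALL_STARS)   # 510 respondents

p_m = count({"M"}, ALL_STARS) / total                      # part a
p_b = count(ALL_PRICES, {"B"}) / total                     # part b
p_a_given_h = count({"H"}, {"A"}) / count({"H"}, ALL_STARS)  # part c: reduce to row H

not_d = ALL_STARS - {"D"}                                  # complement of 4 stars
p_h_given_not_d = count({"H"}, not_d) / count(ALL_PRICES, not_d)  # part d

print(round(p_m, 4), round(p_b, 4), round(p_a_given_h, 4), round(p_h_given_not_d, 4))
# 0.4902 0.2353 0.25 0.1591
```

Dividing counts within the reduced sample space gives the same value as dividing the two probabilities, because the common factor 1/510 cancels.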


Example 4.31 The Changing Labor Force

Over the past several decades, the nature of the U.S. labor force has changed dramatically. More women are searching for jobs, more men are staying home with children, and senior citizens are remaining in their jobs longer. According to the U.S. Census Bureau, for married-couple family groups, 84.9% of all fathers are employed; and in 57.5% of these households, both parents are employed.13 Suppose a married-couple family group is selected at random. If the father is employed, what is the probability that the mother is also employed?

Solution Trail 4.31

KEYWORDS
■ If the father is employed
■ Probability that the mother is also employed

TRANSLATION
■ Given the event the father is employed, find the probability of the event the mother is employed

CONCEPTS
■ Conditional probability

VISION
Use the formula for conditional probability to find $P(M \mid F)$. The solution here requires careful translation of the words into mathematics.

SOLUTION

STEP 1 Consider the events:

$F$ = the father is employed. $M$ = the mother is employed.

The statement of the problem includes two probabilities involving these two events.

$P(F) = 0.849$ (percentage converted to unconditional probability)
$P(F \cap M) = 0.575$ (the word both means intersection)

STEP 2

$$P(M \mid F) = \frac{P(M \cap F)}{P(F)} \quad \text{(translated conditional probability; definition)}$$

$$= \frac{0.575}{0.849} = 0.6773 \quad \text{(use known probabilities)}$$

If the father is employed in a married-couple family, the probability that the mother is also employed is 0.6773.

TRY IT NOW GO TO EXERCISE 4.122

Steps for Calculating a Conditional Probability

To find the conditional probability of the event $A$ given that the event $B$ has occurred:
a. Calculate $P(B)$ and $P(A \cap B)$.
b. Find $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$.

Example 4.32 Sex, Marital Status, and the Census

The U.S. Constitution directs the government to conduct a census of the population every 10 years. Population totals are used to allocate congressional seats, electoral votes, and funding for many government programs. The U.S. Census Bureau also compiles information related to income and poverty, living arrangements for children, and marital status. The following joint probability table lists the probabilities corresponding to marital status and sex of persons 18 years and over.14

                                     Marital status
Sex           Married (R)   Never married (N)   Widowed (W)   Divorced or separated (D)   Total
Male (M)         0.282            0.147            0.013               0.043              0.485
Female (F)       0.284            0.121            0.050               0.060              0.515
Total            0.566            0.268            0.063               0.103              1.000

Suppose a U.S. resident 18 years or older is selected at random.
a. Find the probability that the person is female and widowed.
b. Suppose the person is male. What is the probability that he was never married?
c. Suppose the person is married. What is the probability that the person is female?


The body of the table contains intersection probabilities: the probability of a row event and a column event. For example, the probability that a person is male and divorced is 0.043, the intersection of the first row and the fourth column. The probabilities obtained by summing across rows or down columns are called marginal probabilities. The total probability in the table is 1.000.


SOLUTION

STEP 1 The keyword is and, which means intersection. The probability of female ($F$) and widowed ($W$) is found by reading the appropriate cell.

Sex           Married (R)   Never married (N)   Widowed (W)   Divorced (D)   Total
Male (M)         0.282            0.147            0.013          0.043      0.485
Female (F)       0.284            0.121            0.050          0.060      0.515
Total            0.566            0.268            0.063          0.103      1.000

$P(F \cap W) = 0.050$

STEP 2 The keyword is suppose. That suggests conditional probability. The extra information is male.

$$P(N \mid M) = \frac{P(N \cap M)}{P(M)} \quad \text{(translated conditional probability; definition)}$$

$$= \frac{0.147}{0.485} = 0.303 \quad \text{(use known probabilities)}$$

STEP 3 This is another conditional probability. This time, the event $R$ is given.

$$P(F \mid R) = \frac{P(F \cap R)}{P(R)} = \frac{0.284}{0.566} = 0.502$$

TRY IT NOW GO TO EXERCISE 4.125
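With a joint probability table, every conditional probability is a cell (an intersection probability) divided by a marginal probability. The Python sketch below illustrates this for Example 4.32; the dictionary layout and the helper names `marginal` and `conditional` are hypothetical choices for this illustration, and the probabilities are the ones in the table.

```python
# Joint probabilities keyed by (sex, marital status), from the table in Example 4.32
joint = {
    ("M", "R"): 0.282, ("M", "N"): 0.147, ("M", "W"): 0.013, ("M", "D"): 0.043,
    ("F", "R"): 0.284, ("F", "N"): 0.121, ("F", "W"): 0.050, ("F", "D"): 0.060,
}

def marginal(label, position):
    """Sum the cells whose key has `label` at `position` (0 = sex, 1 = status)."""
    return sum(p for key, p in joint.items() if key[position] == label)

def conditional(cell, given_label, given_position):
    """P(cell event | given event) = intersection probability / marginal probability."""
    return joint[cell] / marginal(given_label, given_position)

p_f_and_w = joint[("F", "W")]                   # part a: read the cell
p_n_given_m = conditional(("M", "N"), "M", 0)   # part b: 0.147 / 0.485
p_f_given_r = conditional(("F", "R"), "R", 1)   # part c: 0.284 / 0.566

print(p_f_and_w, round(p_n_given_m, 3), round(p_f_given_r, 3))  # 0.05 0.303 0.502
```

The marginal probabilities are recomputed from the cells rather than typed in, which mirrors how the last row and column of the table are obtained.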

A CLOSER LOOK

1. In Example 4.32,

$$P(R) = P(R \cap M) + P(R \cap F) = P(R \cap M) + P(R \cap M')$$

In general, for any two events $A$ and $B$,

$$P(A) = P(A \cap B) + P(A \cap B')$$

This decomposition technique is often needed in order to find $P(A)$. The Venn diagram in Figure 4.19 illustrates this equation. The events $B$ and $B'$ make up the entire sample space: $S = B \cup B'$.

Figure 4.19 Venn diagram showing decomposition of the event $A$ into the regions $A \cap B$ and $A \cap B'$.

2. Suppose $B_1$, $B_2$, and $B_3$ are mutually exclusive and exhaustive: $B_1 \cup B_2 \cup B_3 = S$. For any other event $A$,

$$P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3)$$

Try to draw the Venn diagram to illustrate this equality.
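Both decompositions can be checked numerically. The sketch below is only an illustration: the first part uses the Example 4.32 values, and the second part uses a hypothetical three-part partition of a fair die's sample space (the events and the helper `p` are assumptions, not taken from the text).

```python
from fractions import Fraction
import math

# Item 1 with the Example 4.32 values: P(R) = P(R and M) + P(R and F).
p_r = 0.282 + 0.284
print(math.isclose(p_r, 0.566))  # True

# Item 2 with a hypothetical partition of a fair die's sample space.
S = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}                          # roll an odd number
partition = [{1, 2}, {3, 4}, {5, 6}]   # mutually exclusive and exhaustive

def p(event):
    """Probability of an event when the outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

pieces = [p(A & b) for b in partition]   # P(A and B1), P(A and B2), P(A and B3)
print(sum(pieces) == p(A))  # True
```

The floating-point sum is compared with `math.isclose` because decimal probabilities such as 0.282 are not stored exactly; the die example uses exact fractions instead.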


SECTION 4.4 EXERCISES

Concept Check

4.106 True/False In an unconditional probability statement, no extra relevant information related to the event in question is given.

4.107 True/False Extra information always changes a probability assignment.

4.108 Fill in the Blank In the conditional probability statement $P(A \mid B)$, the relevant sample space is ______.

4.109 True/False
a. $P(A \cup B) = P(B \cup A)$
b. $P(A \cap B) = P(B \cap A)$
c. $P(A \mid B) = P(B \mid A)$

4.110 Short Answer Suppose $B_1$, $B_2$, $B_3$, and $B_4$ are mutually exclusive and exhaustive events. For any other event $A$, write $P(A)$ as a sum of probabilities involving the events $B_1$, $B_2$, $B_3$, and $B_4$.

Practice

4.111 Identify each of the following statements as a conditional or unconditional probability question.
a. The probability that a randomly selected car will start in the morning.
b. The probability that a person will remember to bring home a loaf of bread after work if he leaves a Post-It note reminder on the steering wheel.
c. The probability that the next batter will get a hit in a baseball game.
d. The probability that a randomly selected heart transplant operation will be successful.
e. Of all one-way streets in a large city, the probability that the street has more than two lanes.

4.112 Identify each of the following statements as a conditional or unconditional probability question.
a. The probability that a randomly selected circuit board will be defective, given that it was manufactured during the third shift.
b. The probability that a waitress receives a tip of more than 18% of the cost of the meal.
c. The probability that the next customer in a bookstore will buy a magazine.
d. The probability that a company’s sales will increase, given that more money is spent on advertising.
e. The probability that a bowler will make three strikes in a row.

4.113 Consider the following joint probability table describing the events $A$, $B$, $C$, $D$, $E$, $F$, and $G$.

       F      G
A    0.12   0.05
B    0.15   0.07
C    0.17   0.04
D    0.19   0.02
E    0.11   0.08

a. Verify that this is a valid joint probability table; that is, each probability must be greater than or equal to 0, and the sum of all probabilities must equal 1.
b. Compute the marginal probabilities.
c. Find $P(A \cap F)$, $P(B \cap G)$, and $P(D \cap G)$.
d. Find $P(A \mid G)$, $P(F \mid D)$, and $P(E \mid C)$.
e. Verify that $P(C) = P(C \cap F) + P(C \cap G)$.

4.114 Consider the following joint probability table.

        B1      B2      B3
A1    0.095   0.016   0.007
A2    0.205   0.188   0.003
A3    0.155   0.238   0.093

a. Find $P(A_1)$, $P(A_2)$, and $P(A_3)$.
b. Find $P(B_1)$, $P(B_2)$, and $P(B_3)$.
c. Find $P(A_1 \cap B_1)$, $P(A_2 \cap B_2)$, and $P(A_3 \cap B_3)$.
d. Find $P(A_1 \mid B_1)$, $P(B_1 \mid A_1)$, and $P(A_1' \mid B_1')$.
e. Find $P(B_2 \mid A_2)$ and $P(B_3 \mid A_3)$.

4.115 Consider the following joint probability table.

          C1      C2      C3     Total
A       0.135   0.125   0.206   0.466
B       0.145   0.174   0.215   0.534
Total   0.280   0.299   0.421   1.000

a. Find $P(A)$ and $P(C_2)$.
b. Find $P(A \cap C_1)$ and $P(B \cap C_3)$.
c. Find $P(C_2 \mid B)$, $P(A \mid C_3)$, and $P(A \mid C_3')$.
d. Verify that $P(B) = P(B \cap C_1) + P(B \cap C_2) + P(B \cap C_3)$. Carefully sketch a Venn diagram to illustrate this equality.

4.116 A recent survey classified each person according to the following two-way table.

B1 A1 A2 A3

B2

B3 244

165

150 202

466

583

985

178

815

a. Complete the two-way table.
b. How many people participated in this survey?

Assume that the results from this survey are representative of the entire population, and one person from this population is randomly selected.
c. Find $P(A_1)$, $P(A_2)$, and $P(A_3)$.
d. Find $P(B_1 \cap A_1)$, $P(B_2 \cap A_2)$, and $P(B_3 \cap A_3)$.
e. Find $P(A_3 \mid B_1)$, $P(B_2 \mid A_2)$, and $P(A_3 \mid B_1')$.

4.117 Consider an experiment and three events $A$, $B$, and $C$ defined in the Venn diagram below.

[Venn diagram: sample space $S$ with events $A$, $B$, and $C$ and the outcomes 1 through 9]

The following table gives the probability of each outcome.

Outcome        1      2      3      4      5      6      7      8      9
Probability   0.01   0.12   0.11   0.10   0.15   0.25   0.14   0.08   0.04

Find the following probabilities.
a. $P(A)$, $P(B)$, and $P(C)$. Why don’t these three probabilities sum to 1?
b. $P(A \cap B)$ and $P(B \cap C)$.
c. $P(B \mid C)$ and $P(C \mid B)$.
d. $P(A \mid B')$, $P(C \mid A')$, and $P[1 \mid (A \cup B)']$.
e. $P(3 \mid B)$, $P(4 \mid B)$, and $P(5 \mid B)$. Why do these three probabilities sum to 1?

4.118 Consider an experiment and three events $A$, $B$, and $C$ defined in the Venn diagram below.

[Venn diagram: sample space $S$ with events $A$, $B$, and $C$ and the outcomes 0 through 9]

The following table gives the probability of each outcome.

Outcome         0       1       2       3       4
Probability   0.135   0.130   0.142   0.128   0.147

Outcome         5       6       7       8       9
Probability   0.083   0.072   0.063   0.055   0.045

Find the following probabilities.
a. $P(A)$, $P(B)$, and $P(C)$.
b. $P(A \cap B)$ and $P(B \cap C)$.
c. $P(A \mid B)$, $P(B \mid C)$, and $P[(A \cap B) \mid C]$.
d. $P(0 \mid C')$, $P(7 \mid C)$, and $P[(A \cup B) \mid C']$.
e. $P(2 \mid B)$, $P(3 \mid B)$, and $P(7 \mid B)$.

Applications

4.119 Sports and Leisure Consider a regular 52-card deck of playing cards. Suppose two cards are drawn at random from the deck without replacement.
a. What is the probability that the second card is an ace, given that the first card is a king?
b. What is the probability that the second card is an ace, given that the first card is an ace?
c. What is the probability that the second card is a heart, given that the first card is a heart?
d. Suppose two cards are drawn at random from the deck with replacement. What is the probability that the second card is a heart, given that the first card is a heart?

4.120 Economics and Finance In the United States, there is some evidence to suggest a link between people who participate in an office football pool and those who cheat on their income taxes. Suppose 25% of all people participate in an office football pool. The IRS estimates that 15% of all people participate in an office football pool and cheat on their income tax return. Suppose a person is randomly selected. If the person is known to participate in an office football pool, what is the probability that she cheats on her income tax return? Write a Solution Trail for this problem.

4.121 Trail Users According to Trail Count, the annual survey of San Jose’s off-street bicycle and pedestrian trail users, approximately 24% use trails daily.15 Suppose 12% use trails daily and exercise, and 8% use trails daily for commuting. Suppose a trail user is randomly selected.
a. Given that the person uses trails daily, what is the probability that he uses the trails for exercise?
b. Given that the person uses trails daily, what is the probability that she uses the trails for commuting?
c. If the person uses trails daily, what is the probability that he does not use the trail for commuting?

4.122 Travel and Transportation In a particularly rural area in upstate New York, 80% of all people use chains on their car tires (for winter driving), 60% carry a snow shovel in their car and use chains, and 15% carry a shovel but do not use chains. Suppose a person from this area is selected at random.
a. If the person uses chains, what is the probability that he carries a shovel? Write a Solution Trail for this problem.
b. Given that the person does not use chains, what is the probability that she carries a shovel?

4.123 Biology and Environmental Science The Florida Sea Grants Agents, Florida Fish and Wildlife Commission, and


volunteer divers work together in the Great Goliath Grouper Count. The table below lists the probability of each size grouper in two regions.16

                       Length
Region       < 3 ft   3–5 ft   > 5 ft
Sarasota     0.059    0.347    0.079
Monroe       0.030    0.277    0.208

Suppose a grouper from one of these regions is selected at random.
a. Suppose the grouper is 3–5 feet long. What is the probability that it is from Monroe?
b. Suppose the grouper is from Sarasota. What is the probability that the grouper is longer than 5 feet?
c. Suppose the grouper is longer than 3 feet. What is the probability that it is not from Sarasota?

4.124 Sports and Leisure An assistant football coach at a Division II school helps his team prepare for the next opponent by charting plays. He looks at game films and records the down distance (first, second, or third down, and a categorical measure of the number of yards needed for a first down) and type of play (rush or pass), to look for tendencies. The two-way table below shows the number of plays that fall into each category. (Fourth-down plays are not charted, because they usually involve a punt or a field-goal attempt.)

                           Down/distance
Play     1st   2nd short   2nd long   3rd short   3rd long
Rush     126      35          46         65          12
Pass      87      16          67         23          59

Suppose this table represents the true tendencies of the next opponent.
a. What is the probability that the opponent rushes the ball?
b. Suppose the opponent has a first down. What is the probability of a pass?
c. Suppose it is a first or second down. What is the probability that the opponent rushes the ball?
d. Suppose the opponent passes the ball. What is the probability that it is a third down?

4.125 Public Policy and Political Science As a result of decreasing revenue, economic conditions, and deep budget deficits, many states have tried to legalize gambling or, in states where gambling is already legal, expand casino operations. For example, Washington state and Pennsylvania have joined multistate lotteries, several states have raised taxes on receipts from riverboat gambling, and at least four states have had ballot questions about starting new types of games.17 To measure public opinion in Kansas, a random sample of residents was selected and each response was categorized according to revenue preference and age. The results are given in the following two-way table.

                   Revenue source
Age       Gambling   Liquor stores   Other
18–21        33           68           12
21–30        55          121           50
30–45       117          109          132
≥ 45        158          110           90

Assume this table is representative of the entire state’s population and suppose a resident is randomly selected.
a. What is the probability that the resident is in favor of legalized gambling?
b. Suppose the person is in favor of state-owned liquor stores. What is the probability that the person is 30–45 years old?
c. Suppose the person selected is under 21. What is the probability that the person is in favor of some other option?
d. Suppose the resident is not in favor of legalized gambling. What is the probability that the respondent is 21–30 years old?
e. If the person selected is under 21 or at least 45, what is the probability that this resident is in favor of state-owned liquor stores?

4.126 Marketing and Consumer Behavior Because

wireless technology has become reliable and prevalent, many people are terminating their landline service. According to a survey by the Centers for Disease Control and Prevention, 35.8% of households have wireless phones only. In addition, 52.5% have a landline with wireless, 9.4% have a landline without wireless, 0.2% have a landline with unknown wireless, and 2.1% are phoneless.18 Suppose the survey indicated that 22.2% were wireless-only users and in excellent health, 9.3% were wireless-only users and in adequate health, and 4.3% were wireless-only users and in poor health. Suppose a person who completed the survey is selected at random.
a. Suppose the person is a wireless-only user. What is the probability that she is in excellent health?
b. Suppose the person is a wireless-only user. What is the probability that she is in poor health?
c. Suppose the person is a wireless-only user. What is the probability that he is not in poor health?

4.127 Marketing and Consumer Behavior The manager at Of The Land Gallery in Red Lake Falls, Minnesota, is looking at many ways to increase pottery sales for next year. The probability that she will advertise more is 0.65, and the probability of advertising more and increasing revenue is 0.35.
a. Suppose the store manager decides to advertise more. What is the probability that revenue will increase?
b. If the store manager does advertise more, what is the probability that revenue will not increase?
c. What is the probability of not advertising more and revenue increasing?

4.128 Psychology and Human Behavior A random sample of adult drivers was obtained, 1000 men and 900 women. A survey showed that 640 men rely on GPS systems and 450 women rely on them.19 Suppose a person included in this survey is randomly selected.

a. What is the probability that the person selected is a woman and relies on GPS systems?
b. Suppose the person selected is a man. What is the probability that he relies on a GPS system?
c. Suppose the person selected relies on a GPS system. What is the probability that the person is a woman?

Extended Applications

4.129 Demographics and Population Statistics According to the U.S. Census Bureau, 87.1% of people living in the United States are native-born and 12.9% are foreign-born. In addition, 10.16% are native-born and had no health insurance during the last year, and 3.78% are foreign-born and had no health insurance during the last year.20 Suppose a person living in the United States is selected at random.
a. Suppose the person is native-born. What is the probability that he had no health insurance during the last year?
b. Suppose the person is foreign-born. What is the probability that she had no health insurance during the last year?
c. Suppose the person is foreign-born. What is the probability that he had health insurance during the last year?

4.130 Biology and Environmental Science Homeowners who cultivate small backyard gardens are often worried about pests (for example, rabbits and groundhogs) ruining plants. Some gardeners protect their gardens with a fence, others spread chemicals around the perimeter of the garden to keep animals away, and some do nothing. The joint probability table below shows the relationships among these garden protection methods and success.

                   Garden defense
Result      Fence   Chemicals   Nothing
Pests       0.05      0.08       0.34
No pests    0.30      0.20       0.03

Suppose a backyard gardener is selected at random.
a. Suppose the garden had pests. What is the probability the gardener used nothing?
b. Suppose the gardener used chemicals. What is the probability there were pests?
c. Given that the garden had no pests, which method of defense did the gardener most likely use? Justify your answer.

4.131 Business and Management Bank of America, Time Warner Cable, and Delta Airlines are some of the companies ranked worst in customer service according to the American Consumer Satisfaction Index.21 Suppose the table below represents the results of a survey of customer satisfaction for Comcast, another company on the Worst Customer Service list.

                     Customer service
Region     Excellent   Good    Fair    Poor
North        0.102     0.105   0.075   0.004
Midwest      0.059     0.105   0.084   0.007
South        0.062     0.144   0.213   0.040

Suppose this table is representative of all Comcast user customer satisfaction and a customer is selected at random.
a. What is the probability that the customer is from the Midwest and customer service is fair?
b. If the customer is from the North, what is the probability that customer service is excellent?
c. Suppose customer service is poor. What is the probability that the customer is from the South?
d. If the customer service is good or fair, which region is the customer most likely from? Justify your answer.

4.132 Psychology and Human Behavior The following partial two-way table lists the number of adult criminal cases in Canada by case type and sentence.22

                                                  Type of sentence
Case type                            Custody   Conditional sentence   Probation    Total
Criminal code
  Crimes against the person          16,067          2,465             ______     55,006
  Property crimes                    22,178         ______             34,353     60,456
  Administration of justice          28,186          1,460             20,378     50,024
  Other criminal code offenses       ______            456              5,708     10,288
Criminal code offenses (traffic)      7,387            751              7,141     15,279
Other federal statute                 7,980          2,732             10,534     21,246
Total                                85,922         ______            114,588     ______

a. Complete the table.
b. Suppose the case was a crime against the person. What is the probability that it resulted in custody?
c. Suppose the case results in probation. What is the probability that it was a conviction for a criminal code offense (traffic)?
d. Suppose the case was a crime against the person. What is the probability that it resulted in custody?

4.133 Public Health and Nutrition The McPherson Middle School in Clyde, Ohio, is set to examine its school lunch program. A survey of 2200 students asked students about their lunch type and how they got to school in the morning. The following (partial) two-way table is assumed to represent the entire student body.

Arrival mode Bus Car Walk Lunch

Carries Buys

466 345

142 500

967

970

a. Complete the two-way table.
b. Suppose a student at the school is randomly selected. What is the probability that the student carries a lunch and gets to school by car?


c. Suppose the student takes the bus to school. What is the probability that the student buys lunch?
d. Suppose the student does not walk to school. What is the probability that the student carries a lunch?
e. If the student buys lunch, how did he or she most likely get to school?

4.134 Medicine and Clinical Studies In the movie A Christmas Story, Ralph “Ralphie” Parker wanted an official Red Ryder carbine-action 200-shot range model BB gun (with a compass in the stock). Everyone in the movie (including Santa) tells Ralphie he will shoot his eye out with this present. According to the Centers for Disease Control and Prevention, approximately 30,000 people visited an Emergency Room last year for BB gun accidents. The most common injuries were to the face, head, neck, and eye.23 In a survey of BB– and pellet gun–related injuries treated at hospitals, each injury was classified by primary body part injured and victim–shooter relationship. The results are given in the table below.

                                  Victim-shooter relationship
Body part               Friend/                              Other/shooter    Not
injured       Self    acquaintance   Relative   Stranger       not seen      stated
Extremity     200         128           89         17             25          189
Trunk          57          40           21          5             10           58
Face           51          35           20          4             14           49
Head/neck      40          34           19          6             12           44
Eye            23          20            9          3              7           26
Other           2           3            3          2              1            5

Suppose this table is representative of the entire population of BB– and pellet gun–related injuries, and a person seen in an Emergency Room suffering from this type of injury is selected at random.
a. What is the probability that the injury is to the eye and the shooter is a relative?
b. Suppose the injury was caused by a stranger. What is the probability that the body part injured is an extremity?
c. Suppose the injury is to the head/neck. What is the probability that the shooter is a friend/acquaintance?
d. Suppose the injury is not to the eye. What is the probability that the shooter is a relative?

Challenge 4.135 Public Policy and Political Science

A survey of voters in a certain district asked if they favored a return to stronger isolationism. The following three-way table classifies each response by sex, political party, and response.

Sex      Party   Yes    No
Male     Dem     202    124
Male     Rep     126    288
Male     Ind     105     85
Female   Dem     234    312
Female   Rep     101     66
Female   Ind      95    150

Suppose a random voter is selected from this district.
a. What is the probability that the voter is in favor of isolationism, a female, and a Republican?
b. What is the probability that the voter is not in favor of isolationism?
c. Suppose the voter is female. What is the probability that she is a Democrat?
d. Suppose the voter is not in favor of isolationism. What is the probability that the voter is a Republican and male?
e. Suppose the voter is not an Independent. What is the probability that he or she is in favor of isolationism?

4.5 Independence

If extra information is given, sometimes we simply say, “So what?”

In the last section we learned about conditional probability, that is, how knowing extra information may change a probability assignment. Often, however, additional information has no effect on the probability assignment. Consider the following examples.

Example 4.33 The Common Cold

Hundreds of different viruses can cause the common cold. Many people are able to develop a resistance to some of these viruses, but they may still contract a cold from a different virus. Catching a cold is not related to cold temperatures or bad weather, exercise, diet, or enlarged tonsils or adenoids.24 Let the event C = catching a cold. Suppose the (unconditional) probability that a certain person contracts a cold this winter is 0.45: P(C) = 0.45. If this person decides to exercise more this winter, the cold facts above mean this extra exercise has no effect on contracting a cold. What is the probability that this person contracts a cold this winter given the event E = they exercise more?

P(C | E) = P(C) = 0.45.


Knowing extra information here does not change the conditional probability assignment. Intuitively, the events C (contracting a cold) and E (exercising more) are unrelated, or independent.

Example 4.34 No Purchase Necessary

There are lots of sweepstakes in which a consumer is automatically entered by making a purchase. However, almost all sweepstakes entry rules explain that there is “No purchase necessary to enter.” A person may make a purchase to enter the sweepstakes, or instead enter by mailing a postcard or completing an online form. If this statement in the rules is true, then making a purchase cannot change the probability of winning the sweepstakes. Suppose the event A = winning the sweepstakes, and the event B = making a purchase.

P(A | B) = P(A) and P(A | B′) = P(A).

Whether or not you make a purchase has no effect on the chance of your entry being the winner. The events winning the sweepstakes and making a purchase are independent.

In these two examples, the occurrence or nonoccurrence of one event has no effect on the occurrence of the other. In this case, the two events are independent.

Definition
Two events A and B are independent if and only if

P(A | B) = P(A).

If A and B are not independent, they are said to be dependent events.
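The definition can be checked directly on a small sample space. The sketch below is only an illustration: the two-dice experiment and the helper names (`prob`, `cond_prob`, `independent`) are assumptions introduced here, not part of the text. It computes P(A | B) from equally likely outcomes and compares it with P(A).

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space (not from the text): the 36 equally likely
# outcomes of rolling two fair dice.
SAMPLE_SPACE = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) when every outcome in SAMPLE_SPACE is equally likely."""
    favorable = sum(1 for outcome in SAMPLE_SPACE if event(outcome))
    return Fraction(favorable, len(SAMPLE_SPACE))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); defined only when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda outcome: a(outcome) and b(outcome)) / p_b


def independent(a, b):
    """Definition of independence: P(A | B) = P(A)."""
    return cond_prob(a, b) == prob(a)


def first_is_even(outcome):
    return outcome[0] % 2 == 0


def sum_is_7(outcome):
    return sum(outcome) == 7


def sum_is_8(outcome):
    return sum(outcome) == 8


print(independent(first_is_even, sum_is_7))  # True
print(independent(first_is_even, sum_is_8))  # False
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.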

A CLOSER LOOK

One way to verify independent events: Is P(A | B) = P(A)? If so, then A and B are independent; if not, they are dependent.

1. If we know the events A and B are independent, then

P(A | B) = P(A) and P(B | A) = P(B).

Similarly, if either one of these equations is true, then the other is also true, and the events are independent.
2. If A and B are independent events, then so are all combinations of these two events and their complements. Mathematical translation: If P(A | B) = P(A), then P(A | B′) = P(A), P(A′ | B) = P(A′), and P(A′ | B′) = P(A′).
3. Unfortunately, independent events cannot be shown on a Venn diagram. In problems that involve independent events, we’ll have to translate the words into a probability question and then use an appropriate formula.
4. It is reasonable to think of independent events as unrelated. One might conclude that they are therefore disjoint. This is not true! Suppose A and B are mutually exclusive and P(A) ≠ 0 (there is some positive probability associated with the event A). Then

P(A | B) = 0 ≠ P(A).

The probability of A given B has to be 0, because A and B are disjoint. Once B occurs, A cannot occur. Hence, disjoint events are dependent.

In Section 4.4, we learned the formula for finding conditional probability:

P(A | B) = P(A ∩ B) / P(B) or P(B | A) = P(A ∩ B) / P(A)


We can solve both of these equations for P(A ∩ B) to obtain the following probability multiplication rule.

The Probability Multiplication Rule
For any two events A and B,

P(A ∩ B) = P(B) · P(A | B)    (always true)
         = P(A) · P(B | A)    (always true)
         = P(A) · P(B)        (only true if A and B are independent)

A CLOSER LOOK

1. The real skill in applying this rule is knowing which equality to use. The first two equalities are always true. Use one of these only if A and B are dependent and you need to find P(A ∩ B). Read the problem carefully to determine which conditional and unconditional probabilities are given. If A and B are independent, use the third equality to compute the probability of intersection. The word independent will not always appear in the problem. It may be implied or can be inferred from the type of experiment described.
2. If events are dependent, a modified tree diagram can be used to apply the probability multiplication rule. In Figure 4.20 the probability of traveling along any branch is written along the appropriate leg. Second-generation branch probabilities are conditional. For example, P(C | A) (the probability of C given A) is the probability of taking path C, given path A. On this road map, to determine a final probability, we multiply probabilities along the way.

[Figure 4.20: tree diagram. First-generation branches A and B carry the probabilities P(A) and P(B). From A, branches to C and D carry P(C | A) and P(D | A); from B, branches to E and F carry P(E | B) and P(F | B). The four paths end at P(A ∩ C) = P(A) · P(C | A), P(A ∩ D) = P(A) · P(D | A), P(B ∩ E) = P(B) · P(E | B), and P(B ∩ F) = P(B) · P(F | B).]

Figure 4.20 The probability multiplication rule on a tree diagram.

All probabilities coming from a single node must sum to 1. To find the probability of traveling along a complete path from left to right (equivalent to the probability of an intersection), we multiply probabilities along the path.
3. The probability multiplication rule can be extended. For any three events A, B, and C: P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B). A modified tree diagram is useful here also. Try drawing one to illustrate this extended rule.
4. If the events A1, A2, . . . , Ak are mutually independent, then P(A1 ∩ A2 ∩ · · · ∩ Ak) = P(A1) · P(A2) · · · P(Ak). In words, if the events are mutually independent, the probability of an intersection is the product of the corresponding probabilities.
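The tree-diagram version of the multiplication rule can be sketched in a few lines of code. The branch probabilities below are made up for illustration (they are not from the text), and the dictionary layout and the name `path_probability` are assumptions of this sketch. It checks that the probabilities leaving each node sum to 1 and multiplies along each path.

```python
from fractions import Fraction

# Hypothetical branch probabilities for a tree shaped like Figure 4.20.
# These numbers are made up for illustration; they are not from the text.
first_generation = {"A": Fraction("0.6"), "B": Fraction("0.4")}
second_generation = {
    "A": {"C": Fraction("0.7"), "D": Fraction("0.3")},  # P(C | A), P(D | A)
    "B": {"E": Fraction("0.2"), "F": Fraction("0.8")},  # P(E | B), P(F | B)
}


def path_probability(first, second):
    """Multiply along a path: P(first and second) = P(first) * P(second | first)."""
    return first_generation[first] * second_generation[first][second]


# Probabilities leaving any single node must sum to 1.
assert sum(first_generation.values()) == 1
assert all(sum(branches.values()) == 1 for branches in second_generation.values())

paths = {
    (first, second): path_probability(first, second)
    for first, branches in second_generation.items()
    for second in branches
}

for (first, second), p in paths.items():
    print(f"P({first} and {second}) = {float(p):.2f}")

# The four complete paths are disjoint and exhaust the tree.
print(sum(paths.values()) == 1)  # True
```

Because the four paths are disjoint and cover every outcome, their probabilities must also sum to 1, which gives a quick check on any tree you draw by hand.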


There are lots of probability rules, formulas, and diagrams in this section. Here are some examples (along with Solution Trails to help you translate the words into mathematics) to illustrate these concepts.

Example 4.35 Mobile Shoppers

Solution Trail 4.35
KEYWORDS
■ Used their mobile device and made a purchase
TRANSLATION
■ What is the probability of the intersection of the event used a mobile device and the event made a purchase?
CONCEPTS
■ Probability multiplication rule
VISION
Determine whether the events are independent, determine which probabilities are given, and use the appropriate form of the probability multiplication rule.

Cyber Monday is a huge online shopping day. More people are now using their mobile devices, smart phones or tablets, to shop and make purchases on this Monday after Thanksgiving. Approximately 18% of Cyber Monday shoppers used their mobile device to look for deals.25 If a person used a mobile device to shop, the probability of making a purchase was 0.45. Suppose a Cyber Monday shopper is selected at random. What is the probability that the shopper used a mobile device and made a purchase?

SOLUTION

STEP 1 Define the following events: V = used a mobile device; M = made a purchase. We are given that the probability the person used a mobile device is 0.18, that is, P(V) = 0.18. In addition, we are told that the probability the shopper made a purchase if he used a mobile device is 0.45. This is a conditional probability statement that can be written P(M | V) = 0.45.

STEP 2
P(V ∩ M) = P(V) · P(M | V)    Probability multiplication rule.
= (0.18)(0.45) = 0.081    Use the given probabilities.

Note: Using the probability multiplication rule, we can also write:

P(V ∩ M) = P(M) · P(V | M)

This is a correct application of the rule, but it doesn’t help in this problem, because the probabilities on the right-hand side are not given. An accurate but inappropriate use of the probability multiplication rule is evident because the probabilities given in the problem and in the equality are mismatched. Simply try the other equality.
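The arithmetic in Step 2 can be reproduced with a short sketch. The function name `intersection_probability` is an assumption introduced here for illustration; the two probabilities are the ones given in the example.

```python
from fractions import Fraction


def intersection_probability(p_first, p_second_given_first):
    """Probability multiplication rule: P(V and M) = P(V) * P(M | V)."""
    for p in (p_first, p_second_given_first):
        if not 0 <= p <= 1:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_first * p_second_given_first


p_v = Fraction("0.18")          # P(V): shopper used a mobile device
p_m_given_v = Fraction("0.45")  # P(M | V): purchase, given a mobile device

p_v_and_m = intersection_probability(p_v, p_m_given_v)
print(float(p_v_and_m))  # 0.081
```

Note that the function takes an unconditional probability and a conditional probability that is conditioned on the same event, which mirrors the matching of given probabilities to the correct equality described in the Note above.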

TRY IT NOW GO TO EXERCISE 4.154

Example 4.36 It’s Made for Sleep

Solution Trail 4.36
KEYWORDS
■ Both mattresses will be delivered on time
TRANSLATION
■ What is the probability of the intersection of the event mattress 1 is delivered on time and the event mattress 2 is delivered on time?
CONCEPTS
■ Probability multiplication rule
VISION
Determine whether the events are independent, determine which probabilities are given, and use the appropriate form of the probability multiplication rule.

Better Bedding in East Hartford, Connecticut, claims that 99.4% of all its mattress deliveries are on time. Suppose two mattress deliveries are selected at random.
a. What is the probability that both mattresses will be delivered on time?
b. What is the probability that both mattresses will be delivered late?
c. What is the probability that exactly one mattress will be delivered on time?

SOLUTION

a. Let Mi = mattress i is delivered on time; P(Mi) = 0.994 (given). Both mattresses delivered on time means mattress 1 is on time and mattress 2 is on time.

P(M1 ∩ M2) = P(M1) · P(M2)    Both mattresses on time; independent events.
= (0.994)(0.994) = 0.988036    Probability of each mattress delivered on time.

The probability that both mattresses will be delivered on time is 0.9880.


b. Both mattresses delivered late means mattress 1 is delivered late and mattress 2 is delivered late. Delivered late is the complement of delivered on time.

P(M1′ ∩ M2′) = P(M1′) · P(M2′)    Independent events.
= [1 − P(M1)] · [1 − P(M2)]    Complement rule.
= (0.006)(0.006) = 0.000036

The probability that both mattresses will be delivered late is 0.000036.

c. Exactly one mattress on time means mattress 1 is on time and mattress 2 is late, or mattress 1 is late and mattress 2 is on time.

P(Exactly one mattress is on time) = P[(M1 ∩ M2′) ∪ (M1′ ∩ M2)]

We don’t usually see this step, but there really is a union of two events in the background. [Notice the or separating the two cases above.] These two events are disjoint, and the probability of the union of disjoint events is the sum of the corresponding probabilities.

= P(M1 ∩ M2′) + P(M1′ ∩ M2)
= P(M1) · P(M2′) + P(M1′) · P(M2)    Independent events.
= (0.994)(0.006) + (0.006)(0.994)    Use known probabilities.
= 0.011928

The probability that exactly one mattress will be on time is 0.0119.

In Chapter 5, we will convert all (symbolic) outcomes into real numbers, and use the probabilities of experimental outcomes to find the probabilities associated with real numbers.

Note: There is another way to solve part (c). With two mattresses to deliver, one of three things must happen: 0 are on time, 1 is on time, or 2 are on time; and the probabilities of these three events must sum to 1. (Why?) From part (a), P(2 on time) = 0.988036; from part (b), P(0 on time) = 0.000036.

P(1 on time) = 1 − [P(0 on time) + P(2 on time)]    Complement rule.
= 1 − (0.000036 + 0.988036)    Use known probabilities.
= 1 − 0.988072 = 0.011928

TRY IT NOW GO TO EXERCISE 4.151
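All three parts of Example 4.36 can be checked together. The sketch below assumes, as the worked solution does, that the two deliveries are independent; the variable names are introduced here for illustration.

```python
from fractions import Fraction

p_on_time = Fraction("0.994")  # P(Mi), given in Example 4.36
p_late = 1 - p_on_time         # complement rule

# Independence is assumed, as in the worked solution.
both_on_time = p_on_time * p_on_time
both_late = p_late * p_late
exactly_one = p_on_time * p_late + p_late * p_on_time

results = [
    ("both on time", both_on_time),
    ("both late", both_late),
    ("exactly one on time", exactly_one),
]
for label, p in results:
    print(f"{label}: {float(p):.6f}")

# 0, 1, or 2 on-time deliveries are disjoint and cover every outcome.
print(both_on_time + both_late + exactly_one == 1)  # True
```

The final check is the idea behind the alternative solution to part (c): once two of the three probabilities are known, the third follows from the complement rule.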

Example 4.37 Immunizations Federal health officials have reported that the proportion of children (ages 19 to 35 months) who received a full series of inoculations against vaccine-preventable diseases, including diphtheria, tetanus, measles, and mumps, increased up until 2006, but has stalled since. The CDC reports that 14 states have achieved a vaccination coverage rate of at least 80% for the 4:3:1:3:3:1 series.26 The probability that a randomly selected toddler in Alabama has received a full set of inoculations is 0.792, for a toddler in Georgia, 0.839, and for a toddler in Utah, 0.711.27 Suppose a toddler from each state is randomly selected. a. Find the probability that all three toddlers have received these inoculations. b. Find the probability that none of the three has received these inoculations. c. Find the probability that exactly one of the three has received these inoculations.


SOLUTION

a. Define the following three events: A = toddler A from Alabama has received these inoculations; G = toddler G from Georgia has received these inoculations; and U = toddler U from Utah has received these inoculations. Assume these three events are independent.

P(A ∩ G ∩ U)    All three means intersection.
= P(A) · P(G) · P(U)    Independent events.
= (0.792)(0.839)(0.711) = 0.4725    Use given probabilities.

The probability that all three toddlers have received these inoculations is 0.4725.

b. None of the three has received inoculations means toddler A has not received the inoculations and toddler G has not received the inoculations and toddler U has not received the inoculations. Translate this sentence into mathematics using intersection and complement.

P(A′ ∩ G′ ∩ U′)    Math translation: intersection.
= P(A′) · P(G′) · P(U′)    Independent events.
= [1 − P(A)] · [1 − P(G)] · [1 − P(U)]    Complement rule.
= (1 − 0.792)(1 − 0.839)(1 − 0.711)    Use given probabilities.
= (0.208)(0.161)(0.289)    Simplify.
= 0.0097

The probability that none of the three has received the inoculations is 0.0097.

c. To write a probability statement for exactly one has received the inoculations, ask “How can that happen?” Toddler A has received the inoculations and toddlers G and U have not, or toddler G has received the inoculations and toddlers A and U have not, or toddler U has received the inoculations and toddlers A and G have not. Translate this sentence into probability using intersection and complement.

P(A ∩ G′ ∩ U′) + P(A′ ∩ G ∩ U′) + P(A′ ∩ G′ ∩ U)    Three ways exactly one toddler has received the inoculations.
= P(A) · P(G′) · P(U′) + P(A′) · P(G) · P(U′) + P(A′) · P(G′) · P(U)    Independent events.
= (0.792)(0.161)(0.289) + (0.208)(0.839)(0.289) + (0.208)(0.161)(0.711)    Known probabilities; complement rule.
= 0.0369 + 0.0504 + 0.0238 = 0.1111

The probability that exactly one toddler of the three has received the inoculations is 0.1111.

TRY IT NOW GO TO EXERCISE 4.152
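The “How can that happen?” question in part (c) amounts to listing every outcome and keeping the ones that fit. The sketch below does this by enumeration, assuming independence as in the solution; the names `outcome_probability` and `prob_exactly` are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import product

# Given probabilities from Example 4.37 (A = Alabama, G = Georgia, U = Utah).
p = {"A": Fraction("0.792"), "G": Fraction("0.839"), "U": Fraction("0.711")}


def outcome_probability(outcome):
    """Probability of one joint outcome, assuming the three events are independent.

    `outcome` maps each state to True (inoculated) or False (not inoculated).
    """
    result = Fraction(1)
    for state, inoculated in outcome.items():
        result *= p[state] if inoculated else 1 - p[state]
    return result


def prob_exactly(k):
    """P(exactly k of the three toddlers have received the inoculations)."""
    total = Fraction(0)
    for flags in product([True, False], repeat=len(p)):
        if sum(flags) == k:
            total += outcome_probability(dict(zip(p, flags)))
    return total


print(f"{float(prob_exactly(3)):.4f}")  # 0.4725
print(f"{float(prob_exactly(0)):.4f}")  # 0.0097
print(f"{float(prob_exactly(1)):.4f}")  # 0.1111
```

Enumeration is slower than writing out the three cases by hand, but it makes it harder to miss a case when the number of events grows.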

Example 4.38 Winter Tires Winter tires are designed to reduce automobile crashes and improve driver safety in a wide range of winter weather conditions. Despite the advantages of using winter tires, there is increased cost, decreased fuel economy, and the aggravation of mounting and installation. According to a recent survey, 40% of drivers in Manitoba use winter tires (event M), 43% in Ontario (event O) do, and 98% in Quebec (event Q) do.28 Suppose one driver from each jurisdiction is randomly selected. Find the probability that at least one driver uses winter tires.


Solution Trail 4.38
KEYWORDS
■ At least one driver uses winter tires
TRANSLATION
■ The words at least one means the event one, two, or three drivers use winter tires. This is the same as the complement of the event none of the drivers uses winter tires.
CONCEPTS
■ Complement rule
VISION
Define the event none of the drivers uses winter tires and use the complement rule to find the probability that at least one does.

Solution Trail 4.39a
KEYWORDS
■ Stays at hotel A, problem with the reservation
TRANSLATION
■ What is the probability of the intersection of the event A and the event R?
CONCEPTS
■ Probability multiplication rule
VISION
Determine whether the events are independent, determine which probabilities are given, and use the appropriate form of the probability multiplication rule.

SOLUTION

P(at least one driver uses winter tires)
= 1 − P(0 drivers use winter tires)    Complement rule.
= 1 − P(M′ ∩ O′ ∩ Q′)    All three do not use winter tires.
= 1 − P(M′) · P(O′) · P(Q′)    Independent events.
= 1 − [1 − P(M)] · [1 − P(O)] · [1 − P(Q)]    Complement rule.
= 1 − (1 − 0.40)(1 − 0.43)(1 − 0.98)    Use given probabilities.
= 1 − (0.60)(0.57)(0.02)
= 1 − 0.0068 = 0.9932

The probability that at least one driver uses winter tires is 0.9932.

Challenge: Find this probability using a direct approach, without using the complement rule.

TRY IT NOW GO TO EXERCISE 4.153
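The complement-rule calculation generalizes to any number of independent events. The following sketch uses the three usage rates from Example 4.38 and assumes independence, as the solution does; the dictionary `usage` and the variable names are introduced here for illustration.

```python
from fractions import Fraction
from math import prod

# Given probabilities from Example 4.38.
usage = {
    "Manitoba": Fraction("0.40"),
    "Ontario": Fraction("0.43"),
    "Quebec": Fraction("0.98"),
}

# P(none) is the product of the complements (independent events),
# and P(at least one) = 1 - P(none) by the complement rule.
p_none = prod(1 - p for p in usage.values())
p_at_least_one = 1 - p_none

print(f"{float(p_none):.4f}")          # 0.0068
print(f"{float(p_at_least_one):.4f}")  # 0.9932
```

With k events, the direct approach would require summing over every outcome with one or more successes, while the complement rule needs only a single product.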

Example 4.39 A Traveling Salesperson

During frequent trips to a certain city, a traveling salesperson stays at hotel A 50% of the time, at hotel B 30% of the time, and at hotel C 20% of the time. When checking in, there is some problem with the reservation 3% of the time at hotel A, 6% of the time at hotel B, and 10% of the time at hotel C. Suppose the salesperson travels to this city.
a. Find the probability that the salesperson stays at hotel A and has a problem with the reservation.
b. Find the probability that the salesperson has a problem with the reservation.
c. Suppose the salesperson has a problem with the reservation; what is the probability that the salesperson is staying at hotel A?

SOLUTION

Define the following events: A = stays at hotel A; B = stays at hotel B; C = stays at hotel C; and R = problem with the reservation. Convert all the given percentages into probabilities. The phrase of the time indicates conditional probability.

P(A) = 0.50    P(B) = 0.30    P(C) = 0.20
P(R | A) = 0.03    P(R | B) = 0.06    P(R | C) = 0.10

This experiment can be represented with a modified tree diagram (Figure 4.21). Remember, the probabilities along all paths coming from a node must sum to 1, and second-generation branch probabilities are conditional. To find P(R′ | A), apply the complement rule to a conditional probability statement: P(R′ | A) = 1 − P(R | A).

a. The events A and R are dependent. The likelihood of a problem with a reservation depends on the hotel.

P(A ∩ R) = P(A) · P(R | A)    Probability multiplication rule.
= (0.50)(0.03) = 0.0150    Use known probabilities.

The probability of staying at hotel A and having a problem with the reservation is 0.0150.

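[Figure 4.21: tree diagram. First-generation branches: A (0.50), B (0.30), and C (0.20). Second-generation branches: from A, R (0.03) and R′ (0.97); from B, R (0.06) and R′ (0.94); from C, R (0.10) and R′ (0.90). The six paths end at P(A ∩ R), P(A ∩ R′), P(B ∩ R), P(B ∩ R′), P(C ∩ R), and P(C ∩ R′).]

Figure 4.21 Tree diagram for Example 4.39.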

Solution Trail 4.39b
KEYWORDS
■ Problem with the reservation
TRANSLATION
■ The event R
CONCEPTS
■ Unconditional probability
VISION
To find P(R), ask, “How can that happen?” Which paths from left to right involve the event R? The tree diagram suggests that three compound events (paths) involve a problem with the reservation.

b. P(R) = P(A ∩ R) + P(B ∩ R) + P(C ∩ R)    Decomposition of R.
= P(A) · P(R | A) + P(B) · P(R | B) + P(C) · P(R | C)    Probability multiplication rule.
= (0.50)(0.03) + (0.30)(0.06) + (0.20)(0.10)    Use known probabilities.
= 0.0150 + 0.0180 + 0.0200 = 0.0530

The probability of a problem with the reservation (regardless of the hotel) is 0.0530. Figure 4.22 shows this decomposition of R using a Venn diagram.

[Figure 4.22: Venn diagram. The sample space is split into the regions A, B, and C; the event R overlaps all three, forming the pieces P(A ∩ R), P(B ∩ R), and P(C ∩ R).]

Figure 4.22 Venn diagram showing decomposition of R.

Solution Trail 4.39c
KEYWORDS
■ Problem with the reservation
■ Staying at hotel A
TRANSLATION
■ Given the event R, find the probability of the event A
CONCEPTS
■ Conditional probability with the event R given
VISION
Find P(A | R).

c. P(A | R) = P(A ∩ R) / P(R)    Formula for conditional probability.
= 0.0150 / 0.0530    Use answers to (a) and (b).
= 0.2830

The probability that the salesperson stayed at hotel A, given a problem with the reservation, is 0.2830.

TRY IT NOW GO TO EXERCISE 4.163
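The three parts of the hotel example follow one pattern: multiply along each path, add the paths that involve R, and then divide. The sketch below is one way to organize that calculation; the dictionary names `prior`, `p_problem_given`, `joint`, and `posterior` are assumptions introduced for illustration.

```python
from fractions import Fraction

# Given probabilities from Example 4.39.
prior = {"A": Fraction("0.50"), "B": Fraction("0.30"), "C": Fraction("0.20")}
p_problem_given = {"A": Fraction("0.03"), "B": Fraction("0.06"), "C": Fraction("0.10")}

# Part (a), for every hotel: P(hotel and R) = P(hotel) * P(R | hotel).
joint = {hotel: prior[hotel] * p_problem_given[hotel] for hotel in prior}

# Part (b): P(R) is the sum over the three disjoint paths that involve R.
p_problem = sum(joint.values())

# Part (c), for every hotel: P(hotel | R) = P(hotel and R) / P(R).
posterior = {hotel: joint[hotel] / p_problem for hotel in prior}

print(f"{float(joint['A']):.4f}")      # 0.0150
print(f"{float(p_problem):.4f}")       # 0.0530
print(f"{float(posterior['A']):.4f}")  # 0.2830
```

Although hotel A is the most frequently used hotel, it accounts for well under half of the reservation problems, because its conditional probability of a problem is the smallest of the three.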


A CLOSER LOOK

1. Part (c) of the hotel example illustrates Bayes’ rule. This theorem loosely states: Given certain conditional probabilities (and other unconditional probabilities), we are able to solve for a new conditional probability where the events are inverted, or swapped. In the hotel example, we were given the conditional probabilities P(R | A), P(R | B), and P(R | C). Using these probabilities and the unconditional probabilities P(A), P(B), and P(C), we were able to find P(A | R), a conditional probability with the events A and R switched.
2. Suppose P(A), P(B), and P(A ∩ B) are known. To decide whether A and B are independent, check whether the equation P(A ∩ B) = P(A) · P(B) holds. If the probability of the intersection is equal to the product of the probabilities, then the events are independent. If not, they are dependent.
3. There are many applications in probability and statistics that involve repeated sampling from a population with replacement. In this case, each draw is independent of any other draw. Other applications involve sampling without replacement, for example, exit polls and telephone surveys. Consider each individual response as an event. These events are definitely dependent. However, if the population is large enough and the sample is small relative to the size of the population, then the events are almost independent. Calculating probabilities assuming independence results in little loss of accuracy. Exercise 4.134 illustrates this idea.
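The claim in item 3 can be illustrated numerically. The sketch below compares the probability that every draw is a "success" when sampling with and without replacement. The population sizes and the function name `p_all_successes` are hypothetical, chosen only to show the effect of population size.

```python
from fractions import Fraction


def p_all_successes(population, successes, draws, replace):
    """P(every draw is a success) for a random sample of the given size.

    With replacement the draws are independent, so the same factor repeats.
    Without replacement each factor is conditional on the earlier draws.
    """
    if draws > population:
        raise ValueError("cannot draw more items than the population holds")
    prob = Fraction(1)
    for i in range(draws):
        if replace:
            prob *= Fraction(successes, population)
        else:
            prob *= Fraction(successes - i, population - i)
    return prob


# Hypothetical populations, both with 30% successes; sample size 3.
for population, successes in [(10, 3), (10_000, 3_000)]:
    with_repl = p_all_successes(population, successes, 3, replace=True)
    without_repl = p_all_successes(population, successes, 3, replace=False)
    print(population, f"{float(with_repl):.6f}", f"{float(without_repl):.6f}")
```

For the population of 10 the two answers differ substantially, while for the population of 10,000 they agree to about four decimal places, which is the sense in which the draws are almost independent.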

SECTION 4.5 EXERCISES

Concept Check

4.136 Fill in the Blank Two events A and B are independent if and only if P(A | B) = ______.

4.137 Fill in the Blank If A and B are independent events, P(A ∩ B) = ______.

4.138 Fill in the Blank If the events A, B, and C are mutually independent, P(A ∩ B ∩ C) = ______.

4.139 True/False For any two events A and B, P(A ∩ B) = P(A) · P(B | A).

4.140 True/False When sampling from a population with replacement, each draw is independent of any other draw.

Practice

4.141 Decide whether each pair of events is independent or dependent.
a. A = make an error on your income tax return, and B = file Form 1040 long.
b. C = put together a swing set correctly, and D = read the directions.
c. E = run out of milk, and F = the refrigerator breaks down.
d. G = break your pencil lead while writing, and H = feel overly stressed.

4.142 Decide whether each pair of events is independent or dependent.
a. A = a randomly selected CD has a scratch, and B = a random email message is spam.
b. C = one paper towel is enough to completely clean a spill, and D = you use a generic paper towel.
c. E = no accidents are reported in 24 hours in a county, and F = there are no storms in the area.
d. G = your automobile insurance bill increases, and H = you had one speeding ticket within the last year.

4.143 Suppose the following probabilities are known: P(A) = 0.25, P(B | A) = 0.34, and P(C | A ∩ B) = 0.62.
a. Find P(A ∩ B), P(B′ | A), and P(A ∩ B′).
b. Find P(A ∩ B ∩ C), P(C′ | A ∩ B), and P(A ∩ B ∩ C′).
c. Are the events A and B independent? Justify your answer.

4.144 Suppose the events A, B, and C are independent and P(A) = 0.55, P(B) = 0.45, and P(C) = 0.35. Find the following probabilities.
a. P(A ∩ B), P(A ∩ C), and P(B ∩ C).
b. P(A ∩ B ∩ C) and P(A′ ∩ B′ ∩ C′).
c. P(A ∩ B′ ∩ C′) and P(A′ ∩ B ∩ C).

4.145 Suppose the following probabilities are known: P(A) = 0.40, P(B | A) = 0.25, P(C | A) = 0.45, and P(D | A) = 0.30.
a. Find P(A ∩ B), P(A ∩ C), and P(A ∩ D).
b. Are the events A and B independent? Justify your answer.
c. Find P(B′ | A). If the event A occurs, are there any other events in addition to B, C, and D that can occur? Justify your answer.

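4.146 Suppose the probability that an individual has blue eyes is 0.41. Four people are randomly selected.
a. Find the probability that all four have blue eyes.
b. Find the probability that none has blue eyes.
c. Find the probability that exactly two have blue eyes.

4.147 Consider the modified tree diagram below.

[Modified tree diagram: first-generation branches A and A′; second-generation branches B, C, and D. Some branch probabilities are given and others are left blank; the placement of the printed values is not recoverable here.]

a. Identify and determine each missing path probability.
b. Find P(A ∩ C) and P(A′ ∩ B).
c. Find P(D).

4.148 Consider the modified tree diagram below.

[Modified tree diagram: first-generation branches A and A′; second-generation branches B and B′; third-generation branches C and C′. Some branch probabilities are given and others are left blank; the placement of the printed values is not recoverable here.]

a. Identify and determine each missing path probability.
b. Find P(A ∩ B ∩ C) and P(A′ ∩ B ∩ C′).
c. Find P(C). Are the events B and C independent? Justify your answer.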

Applications

4.149 Demographics and Population Statistics Fishing is often considered a quiet, serene pastime. However, the job of fisherman is actually very dangerous: The Deadliest Catch on The Discovery Channel chronicles the risky lives of fishermen on the Bering Sea. According to recent data, the fatality rate for fishermen is 0.0012.29 Suppose two fishermen are selected at random.
a. What is the probability that both fishermen will be fatally injured during the year?
b. What is the probability that neither will be fatally injured during the year?
c. What is the probability that exactly one will be fatally injured during the year?

4.150 Public Policy and Political Science The Port of South Louisiana is the largest-tonnage port in the United States. Inspectors randomly select ships at one of the facilities and check for safety violations. Past records indicate that 90% of all ships inspected have no safety violations. Suppose two ships are selected at random.
a. What is the probability that both ships have safety violations? Write a Solution Trail for this problem.
b. What is the probability that neither ship has a safety violation?
c. What is the probability that exactly one ship has a safety violation?

4.151 Snakes on a Plane

India has reported the greatest number of venomous snake bites and fatalities in the world. The four snakes that reportedly do the most biting are the King Cobra, Indian Krait, Russell’s Viper, and the Saw Scaled Viper. The snake bite fatality rate in India is 0.20.30 Suppose two people in India bitten by a Krait are selected at random. a. What is the probability that both people will die? b. What is the probability that exactly one person will die? c. What is the probability that at most one person will die? 4.152 Economics and Finance In a 2013 survey conducted by the Bank of Montreal, Canadians indicated that they were feeling more optimistic about the economy. It was found that 38% of the respondents believed that their employer would hire additional people during the year.31 Suppose three Canadian workers are selected at random. a. What is the probability that all three believe their employer will hire additional people in the coming year? b. What is the probability that at least one believes that her employer will hire additional people in the coming year? c. What is the probability that none of the three believes that their employer will hire additional people in the coming year? 4.153 Physical Sciences The San Francisco Bay Area

is near several geological fault lines and is therefore vulnerable to the constant threat of earthquakes. According to a study by the U.S. Geological Survey, the probability of a


magnitude 6.7 or greater earthquake in the Greater Bay Area in the next 30 years is 0.63. The probability of a large earthquake within the next 30 years along four major fault lines is given in the following table.32

Fault line              Probability
North San Andreas       0.21
Hayward                 0.31
San Gregorio            0.06
Concord–Green Valley    0.03

Suppose earthquakes occur in this area independently of one another. a. What is the probability that there will be a major earthquake within the next 30 years in all four fault regions? Write a Solution Trail for this problem. b. What is the probability that there will be no major earthquake within the next 30 years in any of the four regions? c. What is the probability that there will be a major earthquake within the next 30 years in at least one of the four regions? d. Suppose each probability is doubled if we consider the next 40 years. What is the probability that there will be a major earthquake within the next 40 years in at least one of the four regions? 4.154 Public Policy and Political Science Recent national

elections suggest that the political ideology of adults in the United States is very evenly divided. In a USA Today/Gallup survey, 29% were Republicans, 30% were Democrats, and 41% were Independents.33 In addition, 51% of all Republicans, 16% of all Democrats, and 28% of all Independents describe their political views as conservative. Suppose an adult in the United States is selected at random. a. What is the probability that the adult is a Republican and describes her political views as conservative? b. What is the probability that the adult is a Democrat and describes his political views as conservative? c. Suppose the adult described his views as conservative. What is the probability that he is an Independent? 4.155 Medicine and Clinical Studies According to the Alzheimer’s Association, approximately 13% of older Americans have Alzheimer’s disease.34 This is the sixth leading cause of death in the United States, and there is no treatment to prevent, cure, or even slow the disease. Suppose four older Americans are selected at random. a. What is the probability that all four have Alzheimer’s disease? b. What is the probability that exactly one has Alzheimer’s disease? c. What is the probability that at least two have Alzheimer’s disease? 4.156 Psychology and Human Behavior

There are almost 36 million homes in the United States that have four TVs. In homes where there are TVs, 88% have a TV in the

living room, 68% have a TV in the bedroom, and 17% have a TV in the kitchen.35 Suppose a home with TVs is selected at random. a. What is the probability that the home has a TV in all three rooms? b. What is the probability that the home has a TV in only the living room? c. What is the probability that the home has a TV in exactly two of the three rooms? 4.157 Economics and Finance Detailed analysis of two technology stocks indicates over the next six months the probability that the price of stock 1 will rise is 0.42 and for stock 2 the probability is 0.63. Suppose the stock prices react independently. a. What is the probability that both stock prices will rise over the next six months? b. What is the probability that stock 1 will rise and stock 2 will sink? c. Suppose both stocks are in the technology sector, and stock 2 tends to follow stock 1. If stock 1 rises over the next six months, the chance of stock 2 rising is 81%. Now what is the probability of both stock prices rising over the next six months? 4.158 Sports and Leisure The PGA Tour maintains

statistical reports on variables such as money leaders, driving distance, and driving accuracy. The table below lists the probability that selected players were able to hit the green “in regulation.”36 According to the PGA Tour, a green is considered hit in regulation if any portion of the ball is touching the putting surface after the green in regulation stroke has been taken.

Golfer             Probability
Nick Watney        0.7738
Bubba Watson       0.7654
Camilo Villegas    0.7525

Suppose these three golfers are playing a round at Sawgrass and they tee up on number 11, one of the most difficult holes to play on the professional tour. a. What is the probability that all three players will hit the green in regulation? b. What is the probability that none of the three players will hit the green in regulation? c. What is the probability that exactly one of the players will hit the green in regulation? d. What is the probability that all three players will hit the green in regulation on all four rounds? 4.159 Sports and Leisure As of March 2013, Larry Bird had the tenth highest career free-throw percentage in NBA history—88.6%. Mark Price was number one.37 Suppose Larry were still playing and he steps up to the free-throw line for two shots. It is unlikely that the two shots are independent. If he misses the first shot, the probability that he makes the second is 0.95, and if he makes the first shot, the probability that he makes the second is 0.85.


a. What is the probability that he makes both shots? b. What is the probability that he misses both shots? c. What is the probability that he makes only one shot? 4.160 Economics and Finance As Baby Boomers reach retirement age, many are beginning to carefully examine their savings plans and government benefits. This generation is worried about remaining financially independent as many companies cut or even eliminate pension plans. According to a report commissioned by the Canadian bank CIBC, approximately 25% of retiring Baby Boomers expect to carry some debt into their retirement.38 Suppose three Baby Boomers are selected at random. a. What is the probability that exactly two of the three expect to carry some debt into retirement? b. What is the probability that all three expect to carry some debt into retirement? c. Suppose another random sample of five Baby Boomers is obtained. What is the probability that exactly one adult from each sample expects to carry some debt into retirement? 4.161 Psychology and Human Behavior A recent survey

revealed that more people in the United Kingdom believe in space aliens than in God.39 One in 10 people has reported seeing a UFO, and 20% of the respondents believe that UFOs have landed on Earth. Suppose three people from the United Kingdom are selected at random. a. What is the probability that all three believe UFOs have landed on Earth? b. What is the probability that none of the three believes UFOs have landed on Earth? c. What is the probability that exactly one of the three believes UFOs have landed on Earth?

Extended Applications 4.162 Medicine and Clinical Studies When a person has a

certain type of leukemia, a physician may perform a bone marrow transplant in order to restore a healthy blood supply. Among the general population, the chances of an acceptable bone marrow match are 1 in 20,000.40 Suppose a person needs a bone marrow transplant and four people from the general population are selected at random. a. What is the probability that none of the four will match? b. What is the probability that at least one will match? c. How many people would have to be tested in order for the probability of at least one match to be 0.50? 4.163 Travel and Transportation A family trying to arrange a vacation is using the Internet to name their own price for a rental car. The software reports that 50% of all people name a price of $30 per day, 40% bid $25 per day, and 10% bid $20 per day. The Internet company also reports that 90% of all $30 bids are accepted, 60% of all $25 bids are accepted, and only 5% of all $20 bids are accepted. a. What is the probability that the family will submit a bid of $25 and have it accepted?


b. What is the probability that their bid will be accepted? c. Suppose their bid is accepted. What is the probability that

it is for $20? 4.164 Biology and Environmental Science Opponents of the U.S. Navy SURTASS LFA Sonar System argue that it constitutes a substantial risk to marine life, causing extraordinary numbers of stranded, or beached, whales. Consider the following statements concerning the use of this system near a remote island in the South Pacific. ■ On any given day, the probability of a mass stranding of whales in this area is 0.01. ■ The probability of a military exercise on any given day is 0.001. ■ If there is a military exercise, the probability of a mass stranding is 0.17. a. Define events and write a probability statement for each fact above. b. On a randomly selected day, what is the probability of a mass stranding of whales and a military exercise? c. Are the events mass stranding and military exercise independent? Justify your answer. 4.165 Medicine and Clinical Studies A tine test is a common method used to determine whether a person has been exposed to tuberculosis. Approximately 5% of people in the United States have been exposed to tuberculosis.41 Using the tine test, 95% of all people who have been exposed test positive, and 98% of those not exposed test negative. Suppose a person is randomly selected and given the tine test. a. What is the probability that the person tests positive and has been exposed to tuberculosis? b. What is the probability that the person tests positive? c. Suppose the test is positive; what is the probability that the person actually has been exposed? 4.166 Travel and Transportation There are four major air carriers with flights from Boston to Los Angeles; 32% of all passengers take American Airlines, 25% take Jet Blue, 17% take United, and 26% take Virgin America. 
Data from 2012 indicate that 20% of all American Airlines flights from Boston to Los Angeles are late, 23% of Jet Blue flights from Boston to Los Angeles are late, 19% of United flights from Boston to Los Angeles are late, and 11% of Virgin America flights from Boston to Los Angeles are late.42 Suppose a passenger taking a flight from Boston to Los Angeles is randomly selected. a. What is the probability that the passenger takes American Airlines and is late? b. What is the probability that the passenger is late? On time? c. Suppose the passenger arrives late. Which airline did the passenger most likely fly? 4.167 Manufacturing and Product Development

The Italian Aspide missile, a licensed version of the U.S. Sparrow, has a sophisticated homing guidance system and single-shot hit probability of 0.80.43 Suppose an enemy plane is within range of three missile firing stations, all three


fire an Aspide surface-to-air missile, and the missiles operate independently. a. What is the probability that the plane is hit? b. What is the probability that all three missiles miss? c. How many missiles would have to be fired at the plane in order to be 99.99% sure it would be hit? 4.168 Public Health and Nutrition Many more adults in the United States have celiac disease than a decade ago. A new study suggests that approximately 1% of all U.S. adults have celiac disease and should avoid eating foods with gluten.44 Suppose five adults in the United States are selected at random. a. What is the probability that exactly one of the five has celiac disease? b. What is the probability that only the first adult and the fifth adult selected have celiac disease? c. Suppose all five adults have celiac disease. Do you believe the claim concerning the percentage of adults with celiac disease? Justify your answer.

Challenge

4.169 Fuel Consumption and Cars After a minor collision, a driver must take his car to one of two body shops in the area. Consider the following events.

D = driver takes his car to shop D.
L = driver takes his car to shop L.
T = the work is completed on time.
B = the cost is less than or equal to the estimate (under budget).

The following modified tree diagram describes the relationships among these events.

[Tree diagram: the first branch is the shop (D or L), the second is whether the work is completed on time (T or T′), and the third is whether the cost is under budget (B or B′). The branch probabilities are not legible in this copy.]

a. Complete the tree diagram by filling in the missing path probabilities.
b. What is the probability that the car is repaired under budget, on time, and with company D?
c. What is the probability that the cost of the repair is over the estimate?
d. What is the probability that the car is repaired under budget, given that it is ready on time?

4.170 Sports and Leisure Suppose two cards are drawn without replacement from a regular deck of 52 playing cards. Consider the events

$A_1$ = an ace is selected on the first draw;
$A_2$ = an ace is selected on the second draw.

a. Find $P(A_2 \mid A_1)$ and $P(A_2)$. Are the events $A_1$ and $A_2$ independent? Justify your answer.
b. Suppose the two cards are drawn without replacement from six regular 52-card decks shuffled together. Find $P(A_2 \mid A_1)$ and $P(A_2)$ for this experiment. Are the events $A_1$ and $A_2$ independent? Justify your answer.
c. In part (b), the events are almost independent. For six decks, find $P(A_1 \cap A_2)$ exactly, and then find the same probability assuming the two events are independent (with the probability of an ace on any draw being 24 / 312).

4.171 The Traveling Salesperson Reconsider Example 4.39. Suppose the salesperson has a problem with the reservation. In which hotel did the salesperson most likely stay? 4.172 The Grapes of China The Chinese wine industry has become very large. However, there are many quality control problems.45 Wine batches are systematically examined for alcohol content, total sugar, volatile acid, and other food additives. Suppose the Yantai Best Cellar Consulting Company claims that only 3% of all bottles are unqualified, or defective. Suppose six bottles of Yantai wine are selected at random. a. What is the probability that none of the six bottles will be unqualified? b. What is the probability that at least one of the bottles will be unqualified? c. Suppose all six bottles are unqualified. Do you believe the company’s claim? Justify your answer.

CHAPTER 4 SUMMARY

| Concept | Page | Notation / Formula / Description |
|---|---|---|
| Experiment | 124 | An activity in which there are at least two possible outcomes and the result cannot be predicted with certainty. |
| Sample space | 126 | $S$: a listing of all possible outcomes, using set notation. |
| Tree diagram | 125 | A visual road map of possible outcomes in an experiment. |
| Event | 127 | Any collection of outcomes from an experiment. |
| Simple event | 127 | An event consisting of exactly one outcome. |
| Complement | 128 | $A'$: all outcomes in the sample space $S$ not in $A$. |
| Union | 128 | $A \cup B$: all outcomes in $A$ or $B$ or both. |
| Intersection | 128 | $A \cap B$: all outcomes in both $A$ and $B$. |
| Disjoint events | 128 | Two events are disjoint if their intersection is empty: $A \cap B = \{\,\}$. |
| Venn diagram | 130 | Geometric representation of a sample space and events. |
| Probability of an event | 136 | The limiting relative frequency of occurrence. |
| Equally likely outcome experiment | 138 | All outcomes in the experiment have the same chance of occurring. |
| Probabilities in an equally likely outcome experiment | 138 | $P(A) = N(A)/N(S)$ |
| Complement rule | 140 | $P(A) = 1 - P(A')$. |
| Addition rule | 141 | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. $P(A \cup B) = P(A) + P(B)$ if $A$ and $B$ are disjoint. |
| Multiplication rule | 148 | $N(S) = n_1 \cdot n_2 \cdot n_3 \cdots n_k$. |
| n factorial | 150 | $n! = n(n-1)(n-2)\cdots(3)(2)(1)$. |
| Permutation | 150 | An ordered arrangement. ${}_nP_r = n(n-1)(n-2)\cdots[n-(r-1)] = \dfrac{n!}{(n-r)!}$. |
| Combination | 152 | An unordered arrangement. ${}_nC_r = \dbinom{n}{r} = \dfrac{n!}{r!\,(n-r)!}$. |
| Conditional probability | 159 | The conditional probability of the event $A$ given that the event $B$ has occurred is $P(A \mid B) = P(A \cap B)/P(B)$. |
| Two-way (contingency) table | 160 | A two-way table with observed frequencies corresponding to classifications of two variables. |
| Joint probability table | 162 | A two-way table with probabilities corresponding to the intersection of two events. |
| Independent events | 169 | Two events $A$ and $B$ are independent if $P(A \mid B) = P(A)$; if the occurrence of event $B$ does not affect the occurrence or nonoccurrence of the event $A$. |
| Dependent events | 169 | If two events $A$ and $B$ are not independent, then they are dependent. |
| Probability multiplication rule | 170 | $P(A \cap B) = P(B) \cdot P(A \mid B) = P(A) \cdot P(B \mid A)$. $P(A \cap B) = P(A) \cdot P(B)$ if $A$ and $B$ are independent. |
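The counting and conditional probability formulas in the summary can be checked numerically. What follows is a minimal Python sketch, not part of the text: the fair-die experiment, the events A and B, and the helper name `prob` are assumptions chosen only for illustration. It relies on `math.perm` and `math.comb` (Python 3.8 or later) and on exact fractions.

```python
from fractions import Fraction
from math import comb, factorial, perm

# Counting rules: nPr = n!/(n-r)! and nCr = n!/(r!(n-r)!)
n, r = 5, 2
assert perm(n, r) == factorial(n) // factorial(n - r)
assert comb(n, r) == factorial(n) // (factorial(r) * factorial(n - r))

# Illustrative equally likely outcome experiment (an assumption): roll one fair die.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the roll is even
B = {4, 5, 6}   # the roll is greater than 3

def prob(event, sample_space=S):
    """P(event) = N(event) / N(S) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_union = prob(A) + prob(B) - prob(A & B)   # addition rule
p_a_given_b = prob(A & B) / prob(B)         # conditional probability
independent = (p_a_given_b == prob(A))      # independence check: P(A | B) = P(A)?

print(p_union, p_a_given_b, independent)
```

Using `Fraction` keeps every probability exact, so the comparison in the independence check is not affected by rounding.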

CHAPTER 4 EXERCISES

APPLICATIONS

4.173 Marketing and Consumer Behavior A home decorating store received a shipment of 20 different Tiffany-style lamps and the store manager selects three lamps at random for display.
a. How many different displays are possible?
b. Suppose three of the lamps were damaged during shipping. What is the probability that two of the lamps selected will be broken?
c. What is the probability that at least one of the lamps selected will be broken?

4.174 Physical Sciences At a state middle-school science fair, students launch bottle rockets designed and built from plastic two-liter beverage containers. An experiment consists of recording the general appearance of a rocket [bad (B), good (G), or excellent (E)] and the maximum altitude [low (L), medium (M), or high (H)]. Consider the following events.
A = The rocket is rated as excellent.
B = The rocket flies to a high altitude.
C = The rocket is rated as bad or flies low.
D = The rocket is good and flies to a medium altitude.
a. Find the sample space $S$ for this experiment.
b. List the outcomes in each of the events A, B, C, and D.
c. List the outcomes in $A \cup B$, $B \cup C$, and $D'$.
d. List the outcomes in $A \cap B$, $C \cap D$, and $(B \cup D)'$.

4.175 Sports and Leisure The Boston Bruins play in the Northeast Division of the Eastern Conference in the National Hockey League. For each game played against another team in the Eastern Conference, the division [Atlantic (A), Northeast (N), or Southeast (S)] and the outcome [win (W), loss (L), tie (T), or overtime loss (O)] are recorded. Consider the following events.
E = The opponent is in the Southeast Division.
F = The Bruins win the game.
G = The opponent is in the Northeast or the game is an overtime loss.
H = The Bruins lose and the opponent is from the Atlantic Division.
a. Carefully sketch a tree diagram to illustrate the possible outcomes for this experiment.
b. Find the sample space for this experiment.
c. Find the outcomes in each of the events E, F, G, and H.
d. List the outcomes in $E \cup F$, $F \cap G$, and $H'$.
e. List the outcomes in $E \cup H'$, $E \cup F \cup G'$, and $F \cup G'$.

4.176 Demographics and Population Statistics

In a recent population survey, the U.S. Census Bureau reported the following classifications and corresponding probabilities.46

| Educational attainment | Probability |
|---|---|
| No degree | 0.1317 |
| High school graduate | 0.3001 |
| Some college, no degree | 0.1946 |
| Associate degree | 0.0916 |
| Bachelor’s degree | 0.1844 |
| Master’s degree | 0.0708 |
| Professional degree | 0.0132 |
| Doctorate degree | 0.0136 |

Consider the events:
A = has a bachelor’s, master’s, or doctorate degree
B = does not have an associate degree
C = does not have a degree
Find the following probabilities.
a. $P(A)$, $P(B)$, and $P(C)$.
b. $P(A \cap B)$, $P(A \cup C)$, and $P(B \cup C)$.
c. $P(C')$, $P(A' \cup B)$, and $P(B' \cap C')$.

4.177 Biology and Environmental Science

The germination rate for pumpkin seeds is directly related to the prevailing weather conditions. The Autumn Gold is a popular medium-sized pumpkin and ripens to a deep orange. If conditions are seasonable, the probability of germination is 0.85.47 If it is dry, suppose the probability that a random seed will germinate is 0.75. Recent weather history suggests there is a 40% chance of a dry start to the growing season. Suppose an Autumn Gold pumpkin seed is randomly selected. a. What is the probability that the growing season will be dry and the seed will germinate? b. What is the probability that the seed will germinate? c. Suppose the seed does not germinate. What is the probability that the growing season had a dry start?

4.178 Economics and Finance Many Americans use savings bonds to supplement retirement funds or to pay for qualified higher-education expenses. The U.S. Treasury even sells savings bonds online. Approximately one in every six Americans owns savings bonds.48 Suppose four Americans are randomly selected. a. What is the probability that all own savings bonds? b. What is the probability that none of the four owns savings bonds? c. What is the probability that exactly two of the four own savings bonds? 4.179 Marketing and Consumer Behavior At Elmo’s, an

old-fashioned barber shop in Melbourne, Florida, 70% of all customers get a haircut, 40% get a shave, and 15% get both. a. What is the probability that a randomly selected customer gets a shave or a haircut? b. What is the probability that a randomly selected customer gets neither? c. What is the probability that a randomly selected customer gets only a shave? d. What is the probability that a randomly selected customer gets a shave, given that he gets a haircut? e. Suppose two customers are selected at random. What is the probability that both get only a haircut? 4.180 Medicine and Clinical Studies

More and more people are trying herbal remedies, including gooseberry juice, eucalyptus oil, and crushed ajwain, for relief from the common cold. The following joint probability table shows the relationship between having tried an herbal remedy and highest degree earned.

| Highest degree earned | High school | College degree | Graduate degree | Vocational |
|---|---|---|---|---|
| Tried | 0.23 | 0.17 | 0.06 | 0.05 |
| Not tried | 0.04 | 0.12 | 0.15 | 0.18 |

Suppose one person is randomly selected. a. What is the probability that the person has tried an herbal remedy, given that the highest degree earned is from college? b. If the person has not tried an herbal remedy, what is the probability that the highest degree earned is from high school? c. Suppose the person has not earned a graduate degree. What is the probability that the person has tried an herbal remedy? d. Suppose two people are selected at random. What is the probability that exactly one has tried an herbal remedy? 4.181 Psychology and Human Behavior Do you believe in ghosts? According to a recent survey, 21% of people in Sweden believe in ghosts.49 Suppose that of those who believe in ghosts, 20% said they have had a spiritual encounter with a ghost. Suppose a Swede is selected at random.

a. If the person believes in ghosts, what is the probability that she has never had an encounter with a ghost?
b. What is the probability that the person believes in ghosts and has had an encounter with a ghost?
c. What is the probability that the person believes in ghosts and has not had an encounter with one?

4.182 Travel and Transportation A super-commuter is a person who commutes to work from one large metro area to another by car, rail, bus, or even air. Super-commuters are not necessarily elite business travelers, but rather middle-income individuals who are willing to commute long distances in order to secure affordable housing or better schools. According to a recent study, 13% of workers in Houston are super-commuters, 8.6% in Phoenix, and 7.5% in Atlanta.50 Suppose three workers are selected at random, one from each city.
a. What is the probability that all three are super-commuters?
b. What is the probability that none of the three are super-commuters?
c. What is the probability that only the worker from Houston is a super-commuter?
d. What is the probability that exactly two of the workers are super-commuters?

4.183 Manufacturing and Product Development At a

glass manufacturing facility, crystal stemware is carefully inspected for correct dimensions, quality, and production trends. After lengthy studies, the factory is known to produce 15% defectives. Most of these pieces are discovered through inspection and are reworked or discarded. Suppose two pieces are randomly selected for inspection. a. What is the probability that both pieces are defect-free? b. What is the probability that neither piece is defect-free? c. Suppose at least one of the pieces has a flaw; what is the probability that both are defective? 4.184 Medicine and Clinical Studies There is a constant shortage of organ donors in the United States. Fewer people are donating organs and ever more people are on waiting lists. One solution to this problem involves compensating organ donors. A recent poll suggests that 60% of Americans support some form of compensation in terms of future health care for people who make organ donations while alive, for example, kidneys, bone marrow, or liver.51 Suppose four Americans are selected at random. a. What is the probability that all four support compensation for organ donors? b. What is the probability that none of the four supports compensation for organ donors? c. What is the probability that exactly two of the four support compensation for organ donors? 4.185 Marketing and Consumer Behavior Americans drink a lot of coffee, and they put all sorts of extras into their coffee to enhance the drink, including flavor shots and flavored creams. Research data indicates that 62% of all coffee drinkers put creamer in their coffee.52 Of those people who use creamer, 40% say they would drink more coffee if


their preferred flavors were offered. Suppose a coffee drinker is selected at random. a. Suppose the coffee drinker uses creamer. What is the probability that he would not drink more even if his preferred flavor were offered? b. What is the probability that the coffee drinker uses creamer and would drink more if his preferred flavor were offered? c. Suppose three coffee drinkers are selected at random. What is the probability that exactly one uses creamer?

EXTENDED APPLICATIONS 4.186 Public Health and Nutrition

Tobacco smoke contains more than 7000 chemicals, many that are toxic and several that are known to cause cancer. As a result, many smokers try various methods to quit. Consider a group of smokers who want to quit. In this group, 2.7% have tried an electronic cigarette, or e-cigarette. Of those who have tried e-cigarettes, 31% quit smoking after six months.53 Of those people who tried some other method, suppose 16% quit smoking after six months. Suppose a smoker who would like to quit is selected at random. a. What is the probability that the smoker tried e-cigarettes and quit smoking after six months? b. What is the probability that the smoker quit after six months? c. Suppose the smoker quit smoking after six months. What is the probability that she tried e-cigarettes? 4.187 Travel and Transportation In a study of the worldwide commercial jet fleet through 2011, 37% of all fatal accidents occurred when the plane was on final approach or landing.54 Suppose four fatal jet accidents are selected at random. a. What is the probability that all four occurred during final approach or landing? b. What is the probability that none of the four occurred during final approach or landing? c. What is the probability that exactly one of the four occurred during final approach or landing? 4.188 Economics and Finance Customers at a Publix grocery store in Charleston, South Carolina, can pay for purchases with cash, a debit card, or a credit card. Fifty-five percent of all customers use cash and 38% use a debit card. Careful research has shown of those paying with cash, 75% use coupons; of those using a debit card, 35% use coupons; and of those using a credit card, only 10% use coupons. Suppose a customer is randomly selected. a. What is the probability that the customer pays with a credit card and does not use coupons? b. What is the probability that the customer does not use coupons? c. If the customer does not use coupons, what is the probability that he paid with a debit card? 
4.189 Psychology and Human Behavior According to a recent survey, 11% of men ages 50 to 64 now color their hair.55


Many men feel this is necessary in order to remain competitive in the workplace. Suppose four men ages 50 to 64 are selected at random. a. What is the probability that all four color their hair? b. What is the probability that exactly one of the four colors his hair? c. Suppose none of the four colors his hair. Is there any evidence to suggest that the study’s claim is not correct? Justify your answer. 4.190 Fuel Consumption and Cars Auto Parts Warehouse

offers a wide variety of parts and accessories for cars. Consider the following events:
A = a randomly selected customer purchases a manual
B = a randomly selected customer purchases trim accessories
C = a randomly selected customer purchases a car-care product
Suppose the following probabilities are known.
$P(A) = 0.44$, $P(B) = 0.52$, $P(C) = 0.39$, $P(A \cap B) = 0.19$, $P(A \cap C) = 0.10$, $P(B \cap C) = 0.23$, $P(A \cap B \cap C) = 0.08$.
a. Carefully sketch a Venn diagram illustrating the relationship among these three events and label each region with the corresponding probability.
b. Find the probability of just event A occurring.
c. Find the probability of none of the events (A, B, or C) occurring.
d. Find $P(A \mid C)$, $P(B \mid A \cap C)$, and $P(A \cap B \cap C \mid A)$.

4.191 Travel and Transportation

Pasco County in Florida has special evacuation plans in the event of a hurricane. Suppose residents can take one of five different major highways out of the county. Department of Transportation officials have produced the following table indicating the probability that a resident will use a selected road.

| Road | A | B | C | D | E |
|---|---|---|---|---|---|
| Probability | 0.20 | 0.18 | 0.26 | 0.32 | 0.04 |

Suppose three Pasco County residents are selected at random and a hurricane strikes. a. What is the probability that all three will take the same escape route? b. What is the probability that exactly one will take escape route E? c. What is the probability that two will take escape route C? d. Suppose all three Pasco County residents hear a traffic report indicating that route A is flooded and impassable. What is the probability that all three will take route B? 4.192 Travel and Transportation

Some researchers believe that the severity of an automobile accident is related to the type of vehicle. For example, pickup trucks tend to be involved in more fatal crashes than other types of vehicles. The following table presents the number of vehicles involved in each type of crash.56 Suppose this table is representative of all crashes in the United States and one crash is selected at random.

| Vehicle type | Crash: Fatal | Crash: Injury | Crash: Property |
|---|---|---|---|
| Passenger car | 18,350 | 1,506,595 | 3,686,062 |
| Pickup | 8,452 | 376,156 | 1,001,893 |
| Utility | 6,924 | 447,946 | 1,187,911 |
| Van | 2,494 | 174,299 | 453,197 |
| Other light truck | 32 | 67,828 | 222,940 |
| Large truck | 3,215 | 53,411 | 239,298 |
| Bus | 221 | 9,968 | 47,387 |
| Other | 1,152 | 6,282 | 12,429 |

a. What is the probability that the crash results in injury only?
b. Suppose the crash involves a utility vehicle. What is the probability that it is fatal?
c. Suppose the crash involves property damage only. What is the probability that the vehicle is a large truck?
d. Suppose the crash is not fatal. What is the probability that it involves a bus?
e. Suppose the crash does not involve a passenger car. What is the probability that it results in injury only?
f. Are the events fatal crash and van independent? Justify your answer.
g. Suppose two crashes are selected at random. What is the probability that both were fatal and involved an unknown type of vehicle?

CHALLENGE 4.193 Free Nights

During the month of August, one guest at the Golden Nugget in Las Vegas will be selected at random to participate in a contest to win free lodging. A fair quarter will be tossed until the first head is recorded. If the first head occurs on toss x, the contestant will win x free nights’ stay at the Golden Nugget. So, if a head is obtained on the first coin toss, the contest is over, and the guest wins one free night. If the first head appears on the fourteenth toss (13 tails and then a head), the guest wins 14 free nights. Theoretically, a guest could win any number of free nights, 1, 2, 3, 4, . . . , although it seems unlikely someone could win, for example, 100 free nights.
a. Use technology to model this contest. Try your simulation 10 times and record the number of free nights awarded each time. Did anyone win five or more free nights’ stay?
b. Consider the event A = the guest wins five or more free nights at the Golden Nugget. Simulate the contest n = 50 times and compute the relative frequency of occurrence of the event A. Repeat this process for n = 100, 150, 200, . . . , 2000.
c. Construct a plot of the relative frequency versus the number of simulations. Describe any patterns.
d. Use your results in (b) and (c) to estimate the probability of winning five or more free nights at the Golden Nugget.
e. Find the exact probability of winning five or more free nights at the Golden Nugget. Hint: Consider the complement of the event A.

4.194 What are the chances of winning a prize in Monopoly Sweepstakes? The object in Monopoly Sweepstakes was to collect game pieces with the purchase of certain products. Some game pieces were instant winners. However, special collections of properties were worth big prizes. There were nine rare property prizes involving the following game pieces: Mediterranean Avenue, Vermont Avenue, Virginia Avenue, Tennessee Avenue, Kentucky Avenue, Ventnor Avenue, Pennsylvania Avenue, Boardwalk, and Short Line. The probability of finding the winning combination involving each of these rare properties is given in the following table.

| Rare property | Probability of winning |
|---|---|
| Mediterranean Avenue | 1 / 402,602 |
| Vermont Avenue | 1 / 578,695,060 |
| Virginia Avenue | 1 / 12,953,122 |
| Tennessee Avenue | 1 / 518,330,833 |
| Kentucky Avenue | 1 / 161,914,024 |
| Ventnor Avenue | 1 / 499,516,192 |
| Pennsylvania Avenue | 1 / 158,948,243 |
| Boardwalk | 1 / 306,939,484 |
| Short Line | 1 / 539,566,072 |

Find the probability of winning at least one of these rare property prizes.
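Exercise 4.193 asks for a technology-based model of the free-nights contest. One possible way to set this up is sketched below in Python; it is only an illustration, not the text's method. The function names `free_nights` and `relative_frequency`, the seed value, and the choice to count a random value below 0.5 as a head are all assumptions made here.

```python
import random

def free_nights(rng):
    """Toss a fair coin until the first head; return the number of tosses."""
    tosses = 1
    while rng.random() >= 0.5:   # a value below 0.5 counts as a head (assumed convention)
        tosses += 1
    return tosses

def relative_frequency(n, rng, threshold=5):
    """Relative frequency of the event A = {threshold or more free nights} in n contests."""
    wins = sum(free_nights(rng) >= threshold for _ in range(n))
    return wins / n

rng = random.Random(2013)        # arbitrary fixed seed so that runs are repeatable

# Part (a): ten simulated contests.
first_ten = [free_nights(rng) for _ in range(10)]
print("free nights in 10 contests:", first_ten)

# Part (b): n = 50, 100, 150, ..., 2000 simulated contests.
for n in range(50, 2001, 50):
    print(n, round(relative_frequency(n, rng), 4))
```

The pairs printed by the loop are the points needed for the plot in part (c); any spreadsheet or plotting tool can be used to draw them.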


5 Random Variables and Discrete Probability Distributions

Looking Back
■ Recall the definition of an experiment, how to find a sample space, and operations on events.
■ Remember the properties and rules used to compute the probability of various events.

Looking Forward
■ Learn the concept of a random variable, a bridge between the experimenter’s world and the statistician’s world, and how information is transferred between worlds.
■ Understand the connection between an experimental outcome and the number associated with that outcome.
■ Understand probability distributions for discrete random variables and work with several special discrete distributions.

Is a flu shot really effective? Each year, the Centers for Disease Control and Prevention (CDC) recommend a flu shot for certain groups of people who are classified as at risk for serious complications from the most common strains of influenza virus. Adults aged 50 or older, residents of nursing homes, people with chronic heart or lung conditions, and even people who simply hate the flu, are all advised to get a shot when the shots become available, usually during the fall. Approximately 135 million doses of flu vaccine were distributed during the 2012–2013 flu season. The vaccine was designed to protect individuals against the three most common types of flu predicted to occur. The CDC reported that the flu vaccine was 56% effective. That is, 56% of all people receiving the flu vaccine who were exposed to a flu virus did not contract the flu. To check this claim (56% effective), a random sample of 50 at-risk people who received a flu shot was selected. During the flu season, all 50 were exposed to the flu and 29 actually contracted the disease. The techniques presented in this chapter will allow us to compute the likelihood of at least 29 people (out of 50) contracting the flu. This result will be used to determine whether there is any evidence that the claim is false.
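The likelihood mentioned above can be previewed with a short calculation. The Python sketch below is an illustration under stated assumptions, not the chapter's worked solution: it assumes the 50 people contract the flu independently, that each does so with probability 1 − 0.56 = 0.44 if the claim is true, and that the count follows the binomial distribution introduced in Section 5.4. The function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 50              # at-risk people in the sample, all exposed to the flu
p_flu = 1 - 0.56    # assumed chance that a vaccinated person contracts the flu if the claim holds

# P(X >= 29): add the probabilities that exactly 29, 30, ..., 50 people contract the flu.
p_at_least_29 = sum(binomial_pmf(k, n, p_flu) for k in range(29, n + 1))
print(f"P(X >= 29) = {p_at_least_29:.4f}")
```

A small value here would suggest that 29 or more cases is unusual if the vaccine really is 56% effective.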

CONTENTS
5.1 Random Variables
5.2 Probability Distributions for Discrete Random Variables
5.3 Mean, Variance, and Standard Deviation for a Discrete Random Variable
5.4 The Binomial Distribution
5.5 Other Discrete Distributions


CHAPTER 5

Random Variables and Discrete Probability Distributions

5.1 Random Variables

The idea of assignment suggests the need for a function.

A function f is a rule that takes an input value and returns an output value (according to the rule). Suppose the function f is defined by $f(x) = x^2 + 4$. This rule indicates that f takes an input x and assigns, or maps, x to the value $x^2 + 4$. For example, the function f assigns the input 1 to the output 5 because $f(1) = 1^2 + 4 = 5$. A random variable is just a special kind of function.

Definition A random variable is a function that assigns a unique numerical value to each outcome in a sample space.

A CLOSER LOOK
1. Such functions are called random variables because their values cannot be predicted with certainty before the experiment is performed.
2. Capital letters, such as X and Y, are used to represent random variables.
3. A random variable is a rule for assigning each outcome in a sample space to a unique real number. If e is an experimental outcome and x is a real number, here is a formal way to picture this assignment: $X(e) = x$. The random variable X takes an outcome e and maps, or assigns, it to the number x. The number x is associated with the outcome e, and is a value the random variable can take on, or assume.
   $X: S \to \mathbb{R}$. A random variable maps elements of a sample space to the real numbers.
4. Figures 5.1 and 5.2 help us understand how a random variable works. These figures illustrate the random variable X as the link between experimental outcomes and numerical values. The rule for a random variable may be given by a formula, as a table, or even in words. Note that several outcomes may be assigned to the same number, but each outcome is assigned to only one number.

[Figure 5.1 A random variable assigns a numerical value to each outcome.]

[Figure 5.2 Another visualization of the definition of a random variable. The random variable X links outcomes in the experimenter’s world to numbers in the statistician’s world.]


The next example shows how a specific random variable maps outcomes to numbers. The notation will get shorter and more concise as the concept of assignment becomes clearer.

Example 5.1 That Sinking Feeling


In February 2013, a sinkhole suddenly opened up in the bedroom of a home in a Tampa, Florida, suburb. The homeowner died in this incident and the house was razed because officials feared it could collapse at any time. The Florida Department of Environmental Protection maintains a database of subsidence incident reports and has recorded over 3000 sinkholes since 1970.¹ The most common type of sinkhole in Florida is a collapse sinkhole. Suppose three sinkholes in Florida are selected at random and each is classified as a collapse sinkhole (C) or some other type (O). Let the random variable X be defined to be the number of collapse sinkholes out of the three selected.

SOLUTION
STEP 1 The experiment consists of recording the type of each sinkhole. Each outcome consists of a sequence of three letters, with each letter a C or an O. There are eight possible outcomes (from the multiplication rule). Here is the sample space:
$S = \{OOO, OOC, OCO, OCC, COO, COC, CCO, CCC\}$

STEP 2 The random variable X takes each outcome and returns the number of collapse sinkholes (Cs). Here is a table that illustrates this mapping and the values the random variable X can assume:

Outcome   Value of X
OOO       0
OOC       1
OCO       1
OCC       2
COO       1
COC       2
CCO       2
CCC       3

More formally, one can write: $X(OOO) = 0$, $X(OOC) = 1$, $X(OCO) = 1$, $\ldots$

The statement $X = 1$ is an event defined in terms of a random variable.

A number is assigned to each outcome. Note that the outcomes OOC, OCO, and COO are mapped to the same number (1), and the outcomes OCC, COC, and CCO are mapped to 2.
STEP 3 Here’s the key: We are no longer interested in the sequence of letters, or outcomes, but rather focus on the numbers associated with the outcomes. We need to consider the number of possible values X can assume and the probability that X assumes each value.
STEP 4 To find, say, the probability that X takes on the value 1, think about which outcomes are assigned to 1 and sum the probabilities of those outcomes. The probability that the random variable X equals 1 is
$P(X = 1) = P(OOC) + P(OCO) + P(COO)$
because these three outcomes are mapped to 1. As shown in Figure 5.3, the random variable X links these three outcomes and their associated probabilities to the number 1.
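The mapping in Example 5.1 can also be traced in a few lines of code. The following Python sketch is an illustration only and is not part of the original example; the names `sample_space`, `X`, and `outcomes_by_value` are assumptions chosen for readability. It lists the eight outcomes, applies the random variable to each, and groups the outcomes by the value assigned.

```python
from itertools import product

# Sample space: every sequence of three letters, each O (other) or C (collapse).
sample_space = ["".join(seq) for seq in product("OC", repeat=3)]

def X(outcome):
    """Random variable: the number of collapse sinkholes (Cs) in an outcome."""
    return outcome.count("C")

# Group the outcomes by the number that X assigns to them.
outcomes_by_value = {}
for e in sample_space:
    outcomes_by_value.setdefault(X(e), []).append(e)

for x in sorted(outcomes_by_value):
    print(x, outcomes_by_value[x])
```

The group stored under the value 1 contains OOC, OCO, and COO, the three outcomes whose probabilities are added to find P(X = 1).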


x is a possible value of the random variable X.

[Figure 5.3 The random variable X maps three outcomes to the number 1.]

There are two types of random variables. The type depends on the number of possible values the random variable can assume.

Definition A random variable is discrete if the set of all possible values is finite, or countably infinite. A random variable is continuous if the set of all possible values is an interval of numbers. These definitions are analogous to those for discrete and continuous data sets. The following remarks are also similar.

A CLOSER LOOK
1. Discrete random variables are usually associated with counting, and continuous random variables are usually associated with measuring.
2. To decide whether a random variable is discrete or continuous, consider all the possible values the random variable could assume. Finite or countably infinite means discrete. An interval of possible values means continuous.
3. Recall: Countably infinite means there are infinitely many possible values, but they are countable. You may not ever be able to finish counting all of the possible values, but there exists a method for actually counting them.
4. The interval of possible values for a continuous random variable can be any interval, of any length, open or closed. The exact interval may not be known, only that there is some interval of possible values.
5. In practice, no measurement device is precise enough to return any number in some interval. In theory, a continuous random variable may assume any value in some interval (but not in reality).

Remember, an experiment may result in a numerical value right away, not a symbol or a token. In this case we do not need any extra link or connection to the real numbers. The description of the experiment is the same as the definition of the random variable. The values the random variable can assume are the possible distinct experimental outcomes. In the following example, several experiments are described and each associated random variable is identified.

Example 5.2 Discrete or Continuous Consider each experiment below and determine whether the associated random variable is discrete or continuous.


a. A Kohl’s department store has 65 cash registers. At the end of the day, the receipts are carefully audited to determine whether each cash register balances. Let the random variable X be the number of cash registers that balance on a randomly selected day.
b. Patients undergoing a tonsillectomy are administered a general anesthetic. Let the random variable Y be the length of time from injection of the anesthetic until a patient is rendered unconscious.
c. Schlage manufactures and sells a maximum-security double-cylinder deadbolt lock for homes. At the facility where the locks are made and assembled, finished locks are randomly selected and carefully checked for defects. If a defective lock is found, the assembly line is shut down. An experiment consists of recording whether the selected lock is good (G) or defective (B). The sample space is $S = \{B, GB, GGB, GGGB, GGGGB, \ldots\}$. Let the random variable X be the number of locks inspected until a defect is found.
d. Let the random variable Y be the length of the largest fish caught on the next party boat arriving back to the dock in Belmar, New Jersey.

SOLUTION
a. There is no need to use a collection of symbols to represent experimental outcomes for the cash registers. The possible values for X (and the distinct experimental outcomes) are finite: 0, 1, 2, 3, . . . , 65. These values are distinct, disconnected points on a number line. The random variable X is discrete.
b. Y is a measurement, the time elapsed until a patient is unconscious. The possible values for Y are any number in some interval, say, 0 to 60 minutes. The random variable Y is continuous.
c. The values X can assume are 1, 2, 3, 4, . . . . The number of possible values is countably infinite; the values are disconnected on a number line. The random variable X is discrete.
d. Y is a measurement, and can (theoretically) take on any value in some interval. The possible values for Y are any number in some interval, say, 5 to 25 inches. The random variable Y is continuous.

SECTION 5.1 EXERCISES

Concept Check
5.1 True/False The set of all possible values for a random variable can be infinite.
5.2 True/False A random variable may assign more than one numerical value to an outcome.
5.3 True/False A random variable can be both discrete and continuous.
5.4 Fill in the Blank A random variable is a special kind of _____________.
5.5 Fill in the Blank A random variable maps elements of the _____________ to the _____________.
5.6 Fill in the Blank Discrete random variables are usually associated with _____________.
5.7 Fill in the Blank Continuous random variables are usually associated with _____________.
5.8 Short Answer If X is a discrete random variable, explain how to find $P(X = 2)$.

Practice
5.9 Classify each random variable as discrete or continuous.
a. The number of boll weevils in one acre of a Louisiana cotton farm.
b. The volume of ice cream in one scoop.
c. The area of a randomly selected baseball field including foul territory.
d. The number of late deliveries in one month by a package delivery service.
e. The number of girls born in a rural hospital during the next year.
f. The interest rate on a savings account at a randomly selected bank in Philadelphia.
g. The number of tickets sold in the next Powerball lottery.
h. The number of oil tankers registered to a certain country at a given time.

5.10 Classify each random variable as discrete or continuous.
a. The number of visitors to the Museum of Science in Boston on a randomly selected day.
b. The camber-angle adjustment necessary for a front-end alignment.
c. The total number of pixels in a photograph produced by a digital camera.
d. The number of days until a rose begins to wilt after purchase from a flower shop.
e. The running time for the latest James Bond movie.
f. The blood alcohol level of the next person arrested for DUI in a particular county.

5.11 Classify each random variable as discrete or continuous.
a. The number of people requesting vegetarian meals on a flight from New York to London.
b. The exact thickness (in millimeters) of a paper towel.
c. The time it takes a driver to react after the car in front stops suddenly.
d. The number of escapees in the next prison breakout.
e. The length of time a deep-space probe remains in contact with Earth.
f. The number of points on a randomly selected buck. The definition of a point is an antler projection at least one inch in length from the base to tip. The brow tine and main beam tip shall be counted as points regardless of length.

5.12 Classify each random variable as discrete or continuous.
a. The number of votes necessary to elect a new Pope.
b. The amount of sugar in a 16-ounce sweetened bottled drink purchased in a New York City cafe.
c. The total number of riders on all forms of public transportation in the United States during the year.
d. The number of residents in an assisted-living center who suffer from hardening of the arteries.
e. The amount of lead measured in the soil of a children’s playground.
f. The time it takes an automobile to pass through the George Massey Tunnel in Vancouver, British Columbia.

Applications
5.13 Marketing and Consumer Behavior T. J. Maxx sells home fashions and men’s, women’s, boys’, and girls’ apparel. An experiment consists of classifying the next two items purchased, each as men’s, women’s, boys’, or girls’ apparel. Let the random variable X be the number of sales of women’s or girls’ apparel.
a. List the outcomes in the sample space.
b. What are the possible values for X? Is X discrete or continuous? Justify your answer.

5.14 Education and Child Development An experiment consists of showing a four-year-old child an interactive instructional video and then asking the child to tie his shoelaces. The random variable Y is the length of time the child takes to tie the first shoelace. Is Y discrete or continuous? Justify your answer.

5.15 Biology and Environmental Science The Waynesburg Lions Club receives a shipment of 300 Christmas trees from Wending Creek Farms in Coudersport, Pennsylvania, to sell as a fundraiser. Classify each of the following random variables as discrete or continuous.
a. The number of trees over six feet tall.
b. The moisture content (expressed as a percentage) of a randomly selected tree.
c. The number of Douglas fir trees in the shipment.
d. The diameter of the trunk at the bottom of a randomly selected tree.

5.16 Biology and Environmental Science To map the current of bottom water in a certain part of the Atlantic Ocean, a dye is released and used to trace the water flow. Let the random variable X be the maximum distance (in meters) from release at which the dye is detected after one day. Is X discrete or continuous? Justify your answer.

5.17 Psychology and Human Behavior An experiment consists of recording the behavior of a randomly selected Duluth cab driver as a traffic signal changes from red to green. Let the random variable X be the acceleration (in ft/s²) of the cab one second after the light changes. Is X discrete or continuous? Justify your answer.

5.18 Medicine and Clinical Studies A report in the journal Cancer suggests that women who take aspirin on a regular basis lower their risk of a certain kind of melanoma.² Every woman who visits the Turtle Creek Medical Center in Dallas, Texas, during the next business day will be asked whether she regularly takes an aspirin. Let X be the number of women who take an aspirin daily. Is X discrete or continuous? Justify your answer.

5.19 Sports and Leisure The Boston Red Sox play their spring training games in JetBlue Park, Fort Myers, Florida. Suppose a game against the Tampa Bay Rays is selected. Classify each of the following random variables as discrete or continuous.
a. The price of a randomly selected ticket.
b. The time it takes to complete the game.
c. Whether the game is postponed due to rain or is completed.
d. The speed of the first pitch in the bottom of the third inning.
e. The number of fans in attendance.
f. The number of hot dogs sold during the entire game.
g. The total number of errors in the game.
h. The weight in ounces of the bat used by the third hitter in the fifth inning.


5.2 Probability Distributions for Discrete Random Variables

A random variable is a rule that assigns each experimental outcome to a real number. To complete the description of a discrete random variable so that we can understand and answer questions involving the random variable, we need to know all the possible values the random variable can assume and all the associated probabilities. This collection of values and probabilities is called a probability distribution. Because random variables are used to model populations, a probability distribution is a theoretical description of a population.

A random variable provides the link between experimental outcomes and real numbers. An experimental outcome and the probability assigned to that outcome are both associated with exactly the same value of the random variable. This connection determines probability assignments for a random variable.

Definition The probability distribution for a discrete random variable X is a method for specifying all of the possible values of X and the probability associated with each value.

A CLOSER LOOK
1. A probability distribution for a discrete random variable may be presented in the form of an itemized listing, a table, a graph, or a function.
2. A probability mass function (pmf) is denoted with a small p, and is the probability that a discrete random variable is equal to some specific value. In symbols, it is defined by
$p(x) = \underbrace{P(X = x)}_{\text{Rule}}$
In words, the rule for the function p evaluated at an input x is the probability of an event, the probability that the random variable X takes on the specific value x. The function p and its probability rule are used interchangeably. Suppose X is a discrete random variable. Then p(7) means find the probability that the random variable X equals 7, or $P(X = 7)$.

A probability distribution is constructed using the definition of a random variable and the links between experimental outcomes and real numbers. The next example illustrates this concept.

Example 5.3 Construct a Probability Distribution DATA SET EG5.3

Suppose an experiment has eight possible outcomes, each denoted by a sequence of three letters, each an N or a D. The probability of each outcome is given in the following table.

Outcome      NNN    NND    NDN    DNN    NDD    DND    DDN    DDD
Probability  0.336  0.224  0.144  0.084  0.096  0.056  0.036  0.024

The random variable X is defined to be the number of Ds in an outcome. Find the probability distribution for X.

SOLUTION
STEP 1 The probability distribution for X consists of all the possible values X can assume along with the associated probabilities. The table below shows the random variable assignment and a technique for calculating the probability of each value.


Experiment

Outcome      NNN    NND    NDN    DNN    NDD    DND    DDN    DDD
Probability  0.336  0.224  0.144  0.084  0.096  0.056  0.036  0.024

Probability distribution (the random variable X links each outcome to a value x)

Value, x   Probability
0          $P(X = 0) = 0.336$
1          $P(X = 1) = 0.224 + 0.144 + 0.084 = 0.452$
2          $P(X = 2) = 0.096 + 0.056 + 0.036 = 0.188$
3          $P(X = 3) = 0.024$

Looking at just this (probability distribution) table, how do you know X cannot assume any other value?

To find the probability that X takes on a specific value x, find all the outcomes that are mapped to x, and add the probabilities of these outcomes.
STEP 2 The random variable X takes on the values 0, 1, 2, and 3. The probability distribution can be presented in a table as shown below:

x      0      1      2      3
p(x)   0.336  0.452  0.188  0.024

TRY IT NOW GO TO EXERCISE 5.29
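The technique in Example 5.3 (find the outcomes mapped to each value and add their probabilities) can be sketched in Python. This is an illustrative sketch rather than part of the original example; the helper `build_distribution` and the variable names are hypothetical.

```python
from collections import defaultdict

# Outcome probabilities from the table in Example 5.3.
outcome_probs = {
    "NNN": 0.336, "NND": 0.224, "NDN": 0.144, "DNN": 0.084,
    "NDD": 0.096, "DND": 0.056, "DDN": 0.036, "DDD": 0.024,
}

def build_distribution(outcome_probs, random_variable):
    """Add the probabilities of all outcomes mapped to the same value."""
    dist = defaultdict(float)
    for outcome, prob in outcome_probs.items():
        dist[random_variable(outcome)] += prob
    # Round away floating-point noise and sort by value for display.
    return {x: round(p, 10) for x, p in sorted(dist.items())}

# X is the number of Ds in an outcome.
pmf = build_distribution(outcome_probs, lambda outcome: outcome.count("D"))
print(pmf)
```

The same helper works for any random variable defined on the outcomes; only the function passed as `random_variable` changes.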

A CLOSER LOOK
1. Think about this process of constructing a probability distribution. To find the probability that X takes on the value x, look back at the experiment and find all the outcomes that are mapped to x. Drag along these probabilities and sum them.
2. The probability distribution for a random variable X is a reference for use in answering probability questions about the random variable. For example, we’ll need to answer probability questions such as “Find $P(X = 3)$.” Think of $X = 3$ as an event stated in terms of a random variable. The details needed for answering this question are in the probability distribution.

The next example illustrates various methods for presenting a probability distribution. Don’t worry about where these actual probabilities came from here. In this example, focus only on the methods for conveying all the values and probabilities.

DATA SET DEFENDER

Example 5.4 Public Defender Suppose the random variable Y represents the number of arraignments in a day before a certain judge in which the accused uses a public defender. The probabilities of Y taking on various values are as follows: 5 / 15 for no public defender; 4 / 15 for one; 3 / 15 for two; 2 / 15 for three; and 1 / 15 for four. Here are several ways to represent the probability distribution for Y.

SOLUTION
STEP 1 A complete listing of all possible values and associated probabilities (use either the probability mass function, p, or the assignment rule):
$P(Y = 0) = 5/15$
$P(Y = 1) = 4/15$
$P(Y = 2) = 3/15$
$P(Y = 3) = 2/15$
$P(Y = 4) = 1/15$
The random variable Y can take on the values 0, 1, 2, 3, or 4, and the probability of each value is given. There can be no other value of Y, because the probabilities sum to 1.


STEP 2 A table of values and probabilities:

y      0     1     2     3     4
p(y)   5/15  4/15  3/15  2/15  1/15

This kind of table is the most common way to present a probability distribution for a discrete random variable. It concisely lists all the values Y can assume and the associated probabilities.
STEP 3 A probability histogram:
[Probability histogram: y on the horizontal axis (0 to 4), p(y) on the vertical axis (0 to 5/15).]
The distribution of Y is represented graphically. A rectangle is drawn for each value y, centered at y, with height equal to p(y).
STEP 4 A point representation:
[Point representation: y on the horizontal axis (0 to 4), p(y) on the vertical axis (0 to 5/15).]

Plot the points (y, p(y)) and draw a line from (y, 0) to (y, p(y)).
STEP 5 A formula:
$$p(y) = \frac{5 - y}{15}, \qquad y = 0, 1, 2, 3, 4$$
This shows the rule for the probability mass function. For example, to find p(2), the probability $Y = 2$, let $y = 2$ in the formula to find:
$$p(2) = \frac{5 - 2}{15} = \frac{3}{15}$$
TRY IT NOW GO TO EXERCISE 5.30
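The formula in Step 5 can be checked against the table in Step 2 with exact fractions. The Python sketch below is an illustration under the assumption that p(y) is 0 for any y outside 0, 1, 2, 3, 4; the names `p`, `table`, and `total` are chosen here and do not come from the text.

```python
from fractions import Fraction

def p(y):
    """pmf from Example 5.4: p(y) = (5 - y)/15 for y = 0, 1, 2, 3, 4."""
    if isinstance(y, int) and 0 <= y <= 4:
        return Fraction(5 - y, 15)
    return Fraction(0)  # assumed: no probability on any other value

# Rebuild the table of values and probabilities from the formula.
table = {y: p(y) for y in range(5)}
total = sum(table.values())

for y, prob in table.items():
    print(y, prob)          # Fraction prints in lowest terms, e.g. 3/15 as 1/5
print("sum of probabilities:", total)
```

Because `Fraction` does exact arithmetic, the sum is exactly 1, which matches the remark in Step 1 that Y can take no other value.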


All of the techniques presented in the previous example are valid methods for presenting a probability distribution. Use the style that is most convenient or appropriate, or what is called for in the question. Often, a graphical representation of the distribution will be helpful. Sometimes, having a formula for the probability distribution is more useful. In the next example we’ll construct another probability distribution and consider some probability questions involving a random variable.

Example 5.5 Who Wants Coffee?
The Hard Rock Cafe in Dallas carefully monitors customer orders and has found that 70% of all customers ask for some kind of coffee (C), while the remainder order a specialized tea (T). Suppose four customers are selected at random. Let the random variable X be the number of customers who order coffee.
a. Find the probability distribution for X.
b. Find the probability that more than two customers order coffee.
c. Suppose at least two customers order coffee. What is the probability that all four customers order coffee?

SOLUTION
The experiment consists of observing four customer choices. Each outcome consists of a sequence of four letters, each a C or a T. From the multiplication rule, there are 16 possible outcomes: CCCC, CCCT, CCTC, etc. Because the customers are selected at random, each choice is independent, and the probability of each outcome is obtained by multiplying the corresponding probabilities. For example,
$$\begin{aligned}
P(CTCT) &= P(C \cap T \cap C \cap T) && \text{First customer buys coffee and second customer buys tea and . . .} \\
&= P(C) \cdot P(T) \cdot P(C) \cdot P(T) && \text{Events are independent. Multiply corresponding probabilities.} \\
&= (0.70)(0.30)(0.70)(0.30) && P(T) = 1 - P(C) \\
&= 0.0441
\end{aligned}$$

Solution Trail 5.5a
KEYWORDS
■ Probability distribution
TRANSLATION
■ Find all the values X can assume and all the associated probabilities.
CONCEPTS
■ Connection between experimental outcomes and real numbers
VISION
First, think about the experiment. Use the definition of the random variable to link experimental outcomes with values of the random variable, and drag along all of the probabilities. Construct a table listing all the values of X and the associated probabilities.

The following table lists all the possible experimental outcomes, the probability of each outcome (computed as above), and the value of the random variable assigned to each outcome.

Outcome   Probability   x      Outcome   Probability   x
TTTT      0.0081        0      CTTC      0.0441        2
TTTC      0.0189        1      CTCT      0.0441        2
TTCT      0.0189        1      CCTT      0.0441        2
TCTT      0.0189        1      TCCC      0.1029        3
CTTT      0.0189        1      CTCC      0.1029        3
TTCC      0.0441        2      CCTC      0.1029        3
TCTC      0.0441        2      CCCT      0.1029        3
TCCT      0.0441        2      CCCC      0.2401        4

This table shows that the values of X are 0, 1, 2, 3, and 4.

a. Use the links in the table to construct the probability distribution for X.
$p(0) = P(X = 0) = P(TTTT) = 0.0081$
There is only one outcome assigned to a 0, and the probability of that outcome is 0.0081.
$$\begin{aligned}
p(1) &= P(X = 1) && \text{Definition of a probability mass function.} \\
&= P(TTTC \text{ or } TTCT \text{ or } TCTT \text{ or } CTTT) && \text{These outcomes are mapped to 1.} \\
&= P(TTTC) + P(TTCT) + P(TCTT) + P(CTTT) && \text{Or means union; the outcomes are disjoint.} \\
&= 0.0189 + 0.0189 + 0.0189 + 0.0189 = 0.0756
\end{aligned}$$
Continue in this manner to obtain the probability distribution for X.

x      0       1       2       3       4
p(x)   0.0081  0.0756  0.2646  0.4116  0.2401

Solution Trail 5.5b
KEYWORDS
■ More than two
TRANSLATION
■ $> 2$
CONCEPTS
■ Find the probability that the random variable X takes on a value greater than 2.
VISION
Use the probability distribution to determine which values are greater than 2, and add the associated probabilities.

b.
$$\begin{aligned}
P(X > 2) &= P(X = 3) + P(X = 4) && \text{Only values greater than 2.} \\
&= 0.4116 + 0.2401 = 0.6517 && \text{Use the probability distribution table.}
\end{aligned}$$
The probability of more than two customers ordering coffee is 0.6517.
Note: How would this probability change if the question asked for the probability that two or more customers order coffee?

Solution Trail 5.5c
KEYWORDS
■ Suppose at least two
■ All four
TRANSLATION
■ Given at least two
■ Find the probability that $X = 4$.
CONCEPTS
■ Conditional probability
VISION
Given that at least two customers order coffee, the number of values X can assume is reduced. Use the definition of conditional probability with events involving the random variable.

c. Given that X is at least 2, find the probability that X is exactly 4.
$$\begin{aligned}
P(X = 4 \mid X \ge 2) &= \frac{P(X = 4 \cap X \ge 2)}{P(X \ge 2)} && \text{Definition of conditional probability.} \\
&= \frac{P(X = 4)}{P(X \ge 2)} && \text{Intersection of } X = 4 \text{ and } X \ge 2. \\
&= \frac{0.2401}{0.2646 + 0.4116 + 0.2401} && \text{Use the probability distribution.} \\
&= \frac{0.2401}{0.9163} = 0.2620
\end{aligned}$$
Given that at least two people order coffee, the probability that exactly four order coffee is 0.2620. Here is a way to picture this conditional probability using the probability distribution table:

x      0       1       2       3       4
p(x)   0.0081  0.0756  0.2646  0.4116  0.2401

Given that X is either 2, 3, or 4, the reduced, or relevant, probability is 0.9163. The proportion of time that X is equal to 4, given X is 2, 3, or 4, is 0.2401 / 0.9163.

TRY IT NOW GO TO EXERCISE 5.36
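The three parts of Example 5.5 can be reproduced by brute-force enumeration of the 16 outcomes. The Python sketch below is illustrative only; the names `P_LETTER`, `pmf`, and the three result variables are assumptions introduced here, and the independence of customer choices is taken from the example.

```python
from itertools import product
from math import prod

# Probability of each letter: coffee (C) 0.70, tea (T) 0.30.
P_LETTER = {"C": 0.70, "T": 0.30}

# Part a: enumerate all 16 outcomes and add each outcome's probability
# to the value of X (number of Cs) that the outcome is mapped to.
pmf = {}
for seq in product("CT", repeat=4):
    x = seq.count("C")
    pmf[x] = pmf.get(x, 0.0) + prod(P_LETTER[letter] for letter in seq)

# Part b: P(X > 2).
p_more_than_two = sum(p for x, p in pmf.items() if x > 2)

# Part c: P(X = 4 | X >= 2) = P(X = 4) / P(X >= 2).
p_at_least_two = sum(p for x, p in pmf.items() if x >= 2)
p_four_given_at_least_two = pmf[4] / p_at_least_two

for x in sorted(pmf):
    print(x, round(pmf[x], 4))
print(round(p_more_than_two, 4), round(p_four_given_at_least_two, 4))
```

Rounded to four decimal places, the results agree with the table and with the answers 0.6517 and 0.2620 found above.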

The probability distribution of a random variable reveals which values of the random variable are most likely to occur. This information is extremely helpful in making a statistical inference. Consider the following example.

Example 5.6 In Case There Is a Power Outage The Carson City Hospital has three emergency generators for use in case of a power failure. Each generator operates independently, and the manufacturer claims the probability that each generator will function properly during a power failure is 0.95. Suppose a power failure occurs and all three generators fail. Do you have reason to doubt the manufacturer’s claim? Justify your answer.


Solution Trail 5.6
KEYWORDS
■ Reason to doubt the manufacturer’s claim?
TRANSLATION
■ Use the available evidence to draw a reasonable conclusion.
CONCEPTS
■ Inference procedure
VISION
Consider the claim, the experiment, and the likelihood of the experimental outcome. Then draw a conclusion. Define a random variable, and use the probability distribution to determine the probability that all three generators fail.

SOLUTION
STEP 1 In the event of a power failure, let F stand for a generator that fails, and let S represent a generator that functions properly, that is, starts. There are eight possible experimental outcomes. Let X be the number of failures.
STEP 2 The table below lists each outcome, the probability of each outcome, and the value of the random variable associated with each outcome.

Outcome   Probability   x
SSS       0.8574        0
SSF       0.0451        1
SFS       0.0451        1
SFF       0.0024        2
FSS       0.0451        1
FSF       0.0024        2
FFS       0.0024        2
FFF       0.0001        3

Note: The probabilities are rounded to four places to the right of the decimal. Because each generator operates independently, the probability of each outcome is the product of the corresponding probabilities. For example,
$$P(SFS) = P(S \cap F \cap S) = P(S) \cdot P(F) \cdot P(S) = (0.95)(0.05)(0.95) = 0.0451$$
STEP 3 Use the links in the table to construct the probability distribution for X.

x      0       1       2       3
p(x)   0.8574  0.1353  0.0072  0.0001

STEP 4 Use the four-step inference procedure.
Claim: The probability that each generator will function properly during a power failure is 0.95.
Experiment: The value of the random variable observed is $x = 3$.
Likelihood: The likelihood of the observed outcome is $P(X = 3) = 0.0001$.
Conclusion: Because this probability is so small, the outcome of observing three failures is very rare. But it happened! This small probability suggests the assumption is wrong. There is evidence to suggest that the claim of 0.95 (start probability) is wrong.

TRY IT NOW GO TO EXERCISE 5.42
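The likelihood used in the inference procedure of Example 5.6 can be computed directly from the claim. The Python sketch below is an illustration, not part of the original solution; `P_START`, `prob`, and `pmf` are names assumed here. It works with unrounded outcome probabilities, so its values for p(1) and p(2) can differ in the fourth decimal place from a table built by adding probabilities that were already rounded.

```python
from itertools import product

P_START = 0.95  # claimed probability that a generator functions properly

def prob(outcome):
    """Probability of one outcome, assuming the generators are independent."""
    result = 1.0
    for letter in outcome:
        result *= P_START if letter == "S" else 1 - P_START
    return result

# X is the number of failures (Fs) among the three generators.
pmf = {}
for seq in product("SF", repeat=3):
    x = seq.count("F")
    pmf[x] = pmf.get(x, 0.0) + prob(seq)

for x in sorted(pmf):
    print(x, pmf[x])
```

If the claim is true, the unrounded probability that all three generators fail is (0.05)(0.05)(0.05) = 0.000125, which is the rare event on which the conclusion rests.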

As the previous examples suggest, the following properties must be true for every probability distribution for a discrete random variable X.

Properties of a Valid Probability Distribution for a Discrete Random Variable
1. $0 \le p(x) \le 1$
The probability that X takes on any value, $p(x) = P(X = x)$, must be between 0 and 1.
2. $\sum_{\text{all } x} p(x) = 1$
The sum of all the probabilities in a probability distribution for a discrete random variable must equal 1.
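These two properties translate directly into a short check. The Python sketch below is illustrative; the function name `is_valid_pmf` and the tolerance `tol` are assumptions introduced here. A small tolerance is used because decimal probabilities stored as floating-point numbers rarely sum to exactly 1.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Check the two properties of a discrete probability distribution.

    pmf maps each possible value x to p(x).
    """
    probs = list(pmf.values())
    in_range = all(0 <= p <= 1 for p in probs)                    # Property 1
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)      # Property 2
    return in_range and sums_to_one

# The distribution from Example 5.3 satisfies both properties.
print(is_valid_pmf({0: 0.336, 1: 0.452, 2: 0.188, 3: 0.024}))
# These probabilities sum to 1.1, so Property 2 fails.
print(is_valid_pmf({0: 0.5, 1: 0.6}))
```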


The following example involves a probability distribution for a discrete random variable and illustrates these two properties.

Example 5.7 Rewards for Donating
DATA SET DONORS
The Central Blood Bank in Pittsburgh, Pennsylvania, offers bonus points to donors, which can be redeemed at an online store.³ The number of points a donor has earned is a random variable Y. Suppose Y has the following probability distribution:

y      0     100   150   200   250   300
p(y)   0.01  0.05  ?     0.25  0.35  0.30

a. Find p(150).
b. Find $P(100 \le Y \le 250)$ and $P(100 < Y < 250)$.
c. Construct the corresponding probability histogram.

SOLUTION

a. The sum of all the probabilities must equal 1.
$$\begin{aligned}
p(150) &= 1 - [p(0) + p(100) + p(200) + p(250) + p(300)] \\
&= 1 - (0.01 + 0.05 + 0.25 + 0.35 + 0.30) \\
&= 1 - 0.96 = 0.04
\end{aligned}$$
b. The values Y takes on between 100 and 250 inclusive are 100, 150, 200, and 250.
$$\begin{aligned}
P(100 \le Y \le 250) &= p(100) + p(150) + p(200) + p(250) \\
&= 0.05 + 0.04 + 0.25 + 0.35 = 0.69
\end{aligned}$$
The values Y takes on strictly between 100 and 250 are 150 and 200.
$$P(100 < Y < 250) = p(150) + p(200) = 0.04 + 0.25 = 0.29$$
In this example, including (or excluding) an endpoint (a single value) changes the probability assignment. It is important to remember that a single value may make a difference in a probability assignment for a discrete random variable.
c. To construct the probability histogram, draw a rectangle for each value y, centered at y, with height equal to p(y).
[Probability histogram: y on the horizontal axis, p(y) on the vertical axis (0.0 to 0.4).]

TRY IT NOW GO TO EXERCISE 5.26
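Parts (a) and (b) of Example 5.7 can be sketched in Python as follows. The code is an illustration only; `known`, `p_150`, `pmf`, and the helper `prob_between` are names assumed here. The `inclusive` flag makes the effect of including or excluding the endpoints explicit.

```python
# Known probabilities from the table in Example 5.7; p(150) is missing.
known = {0: 0.01, 100: 0.05, 200: 0.25, 250: 0.35, 300: 0.30}

# Part a: the probabilities must sum to 1, so p(150) is what remains.
p_150 = 1 - sum(known.values())
pmf = dict(sorted({**known, 150: p_150}.items()))

def prob_between(pmf, low, high, inclusive=True):
    """Add p(y) over the values y between low and high."""
    if inclusive:
        return sum(p for y, p in pmf.items() if low <= y <= high)
    return sum(p for y, p in pmf.items() if low < y < high)

# Part b: including or excluding the endpoints changes the answer.
print(round(p_150, 2))
print(round(prob_between(pmf, 100, 250), 2))
print(round(prob_between(pmf, 100, 250, inclusive=False), 2))
```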


SECTION 5.2 EXERCISES

Concept Check

5.20 True/False A probability distribution is a theoretical model of a population.

5.21 True/False The sum of all the probabilities in a probability distribution for a discrete random variable must equal 1.

5.22 True/False For a discrete random variable, under certain circumstances p(x) could be less than 0.

5.23 Fill in the Blank The probability distribution for a discrete random variable is a method for specifying _____________ and _____________.

5.24 Short Answer Briefly describe several methods to represent a probability distribution.

Practice

5.25 The probability distribution for the random variable X is given in the following table:

x       1      2      3      4      5      6      7
p(x)    0.35   0.20   0.15   0.12   ?      0.08   0.03

a. Find p(5).
b. Find \( P(2 \le X \le 6) \) and \( P(2 < X \le 6) \).
c. Find \( P(X < 4) \).
d. Find the probability that X takes on the value 1 or 7.

5.26 The probability distribution for the random variable Y is given in the following table:

y       10      20      25      30      45      50
p(y)    0.155   0.237   0.184   0.122   ?       0.258

a. Find p(45).
b. Find \( P(Y \ge 25) \) and \( P(Y > 25) \).
c. Find the probability Y is divisible by 10.
d. Construct the corresponding probability histogram.

5.27 The probability distribution for the random variable X is given in the following table:

x       -3     -2     -1     0      1      2      3
p(x)    0.20   0.10   0.05   0.30   0.05   0.10   0.20

a. Find \( P(X \ge 0) \) and \( P(X > 0) \).
b. Find \( P(X^2 > 1) \).
c. Find \( P(X \ge 2 \mid X \ge 0) \).
d. Construct the corresponding probability histogram.

5.28 Determine whether each probability distribution below is valid. Justify your answers.

a.
x       2      4      6      8      10     12
p(x)    0.15   0.16   0.17   0.18   0.19   0.20

b.
x       2      4      6      8      10     12
p(x)    0.25   0.25   0.25   -0.25  0.25   0.25

c.
x       2      4      6      8      10     12
p(x)    0.05   0.20   0.25   0.25   0.20   0.05

5.29 The table below lists all of the possible outcomes for an experiment, the probability of each outcome, and the value of a random variable assigned to each outcome. Use this table to construct the probability distribution for X. Construct the corresponding probability histogram.

Outcome   Probability   x        Outcome   Probability   x
AA        0.01          1        CA        0.03          3
AB        0.02          2        CB        0.06          3
AC        0.03          3        CC        0.09          3
AD        0.04          4        CD        0.12          4
BA        0.02          2        DA        0.04          4
BB        0.04          2        DB        0.08          4
BC        0.06          3        DC        0.12          4
BD        0.08          4        DD        0.16          4

5.30 The table on the text website lists all of the possible outcomes for an experiment, the probability of each outcome, and the value of a random variable assigned to each outcome. Use this table to construct the probability distribution for Y. Construct the corresponding probability histogram. DATA SET EX5.30

5.31 The probability distribution for a discrete random variable X is given by the formula

\[ p(x) = \frac{x(x+1)}{112}, \qquad x = 1, 2, \ldots, 6 \]

a. Verify that this is a valid probability distribution.
b. Find \( P(X = 4) \).
c. Find \( P(X > 2) \).
d. Find the probability that X takes on the value 3 or 4.
e. Construct the corresponding probability histogram.

Applications

5.32 Manufacturing and Product Development A wooden kitchen cabinet is carefully inspected at the manufacturing facility before it is sent to a retailer. The random variable X is the number of defects found in a randomly selected cabinet. The probability distribution for X is given in the table below.

x       0       1       2       3       4       5
p(x)    0.900   0.050   0.025   0.020   0.004   0.001

Suppose a cabinet is selected at random.
a. What is the probability that the cabinet is defect free?
b. What is the probability that the cabinet has at most two defects?
c. What is the probability that two randomly selected cabinets both have at least three defects? Write a Solution Trail for this problem.
d. Find \( P(2 \le X \le 4) \) and \( P(2 < X < 4) \).

5.33 Fuel Consumption and Cars An automobile insurance policy depends on many factors, including the vehicle type, year, make, and model, type of coverage, your driving history, your insurance score based on claims history, payment history, and credit score, and your state of residence.4 Suppose that for some driver category, the probability distribution for the random variable Y, the amount (in dollars) paid to policyholders for claims in one year, is given in the table below.

y       0      500    1000   5000   10000
p(y)    0.65   0.20   0.10   0.04   0.01

a. Find \( P(Y > 0) \).
b. Find \( P(Y \le 1000) \).
c. What is the probability that a randomly selected driver is paid $5000?
d. Suppose two drivers are selected at random. What is the probability that both are paid $1000?
e. Suppose two drivers are selected at random. What is the probability that at least one is paid $500 or more?

5.34 Public Policy and Political Science Camden, New Jersey, has one of the highest crime rates in the United States. As a result, a new Camden County Police Department will include 400 officers and be responsible for an area covering nine square miles.5 Suppose the number of times a police cruiser from this new department drives through the Whitman Park neighborhood during a one-hour period is a random variable X, with probability distribution as given in the table below.

x       0        1        2        3
p(x)    0.3679   0.3679   0.1839   0.0613

x       4        5        6        7
p(x)    0.0153   0.0031   0.0005   0.0001

Suppose a one-hour period is randomly selected.
a. What is the probability that no police cruiser will drive through the neighborhood?
b. What is the probability that at least one police cruiser will drive through the neighborhood?
c. What is the probability that at most two police cruisers will drive through the neighborhood?
d. What is the probability that more than seven police cruisers will drive through the neighborhood?
e. Suppose at least two police cruisers were sighted in the neighborhood during the one-hour period. What is the probability that there were at least four during this time?

5.35 Public Policy and Political Science According to a January 2013 survey by the Pew Research Center, the percentage of Americans who trust the government in Washington has decreased steadily since Bill Clinton left office.6 Today, approximately 30% of Americans say they trust the government (always or most of the time). Suppose three Americans are selected at random. Let X be the number who trust the government.
a. Construct the probability distribution for X. Construct the corresponding probability histogram.
b. What is the probability that all three Americans trust the government?
c. What is the probability that at least one of the three trusts the government?

5.36 Marketing and Consumer Behavior Twinkies are an American tradition, sold for over 83 years, originating in a Chicago bakery, and forever linked to legal lingo. (You may have read about the “Twinkie defense,” associated with a 1979 murder trial in San Francisco.) Approximately 12% of households in the United States buy Twinkies.7 Suppose four households are selected at random and let Y be the total number of households that buy Twinkies.
a. Construct the probability distribution for Y. Write a Solution Trail for this problem.
b. What is the probability that at least one household buys Twinkies?
c. Suppose at least two households buy Twinkies. What is the probability that all four households buy Twinkies?

5.37 Manufacturing and Product Development Staples has six special drafting pencils for sale, two of which are defective. A student buys two of these six drafting pencils, selected at random. Let the random variable X be the number of defective pencils purchased. Construct the probability distribution for X.

5.38 Business and Management Two packages are independently shipped from Fort Collins, Colorado, to the Convention Center in Kansas City, Missouri, and each is guaranteed to arrive within four days. The probability that a package arrives within one day is 0.10, within two days is 0.15, within three days is 0.25, and on the fourth day is 0.50. Let the random variable X be the total number of days for both packages to arrive. Construct the probability distribution for X.

Extended Applications

5.39 Biology and Environmental Science Agway, a farm and garden supply store, sells winter fertilizer in 50-pound bags. For customers who purchase this product, the probability distribution for the random variable X, the number of bags sold, is given in the table below.

x       1      2      3      4      5
p(x)    0.55   0.35   0.07   0.02   0.01

Suppose a person buying winter fertilizer is randomly selected.
a. What is the probability that the customer buys more than two bags?
b. What is the probability that the customer does not buy two bags \( [P(X \ne 2)] \)?
c. Find the probability that two randomly selected customers each buy one bag.
d. Suppose two customers are randomly selected. What is the probability that the total number of bags purchased will be at least eight?
e. Let the random variable Y be the number of pounds sold to a randomly selected customer buying winter fertilizer. Find the probability distribution for Y.

5.40 Sports and Leisure Suppose the probability that a person says he or she was at the Woodstock Festival and Concert is 0.20. An experiment consists of randomly selecting people in Green Bay and asking them whether they were at Woodstock. The experiment stops as soon as one person says he or she was there. The random variable X is the number of people stopped and questioned (until one person says he or she was there). Let Y and N stand for a Yes and No response, respectively.
a. List the first several outcomes in the sample space.
b. Find the probability of each outcome in part (a).
c. Find the value of the random variable associated with each outcome in (a).
d. Find a formula for the probability distribution of X.

5.41 Sports and Leisure A game show contestant on Let’s Make a Deal selects two envelopes with prize money enclosed. Two of the envelopes contain $100, one envelope contains $250, two envelopes contain $500, and the last envelope contains $1000. Let the random variable M be the maximum of the two prizes.
a. Find the probability distribution for M.
b. Suppose two contestants independently select prize envelopes on two different days. What is the probability that both win the top prize?

5.42 Economics and Finance A Subway restaurant in Scotsbluff, Nebraska, recently installed a new computer system that allows customers to pay for meals with a bank debit card. The manager of the restaurant claims this new system will decrease the waiting time, and that the probability of getting a meal in under two minutes (with this system in place) is 0.75. Suppose four customers are selected at random. Let the random variable X be the number of customers who receive their meal in under two minutes.
a. Construct the probability distribution for X.
b. Suppose none of the four customers receives their meal in under two minutes. Is there any evidence to suggest the manager’s claim is false? Justify your answer.

5.43 Demographics and Population Statistics Most physicians in Canada graduated from a medical school in Canada. However, some attended medical school in the United States or another foreign country. The probability that a physician in Alberta attended a Canadian medical school is 0.684; in British Columbia, 0.705; in Quebec, 0.891; and in Ontario, 0.736.8 Suppose one physician is independently selected from each province, and let the random variable Y be the number of physicians who attended a Canadian medical school.
a. Construct a probability distribution for Y.
b. What is the probability that at least one physician attended a Canadian medical school?
c. Suppose another group of physicians is selected, one from each province. What is the probability that at least three physicians from each group attended a Canadian medical school?

5.3 Mean, Variance, and Standard Deviation for a Discrete Random Variable

Just as there are descriptive measures of a sample (for example, \( \bar{x} \), \( s^2 \), and \( s \)), there are corresponding descriptive measures of a population (\( \mu \), \( \sigma^2 \), and \( \sigma \)). As we said in Chapter 3, these population parameters describe the center and variability of the entire population. They are usually unknown values we would like to estimate. However, because a random variable may be used to model a population, these (population) descriptive measures are inherent in and determined by the probability distribution. This section presents the methods used to compute the mean, variance, and standard deviation of a random variable (or population). The next example suggests a definition of expected value.

Example 5.8 The Caddy Pool

A teenager is a member of the caddy program at the Montebello Country Club. Each morning he arrives at the golf course and enters his name into the caddy pool. The probability of being selected on any day is 4/5, and if selected he will earn $50. On days he is not selected, he earns nothing. How much money does this caddy earn per day on average? Or, in the long run, how much does the caddy earn each day?

SOLUTION

STEP 1 This question is concerned with the amount earned each day on average, not on any one particular day. Consider the probabilities given and consider five typical days. On four of five days, the caddy earns $50. On the fifth day, he earns $0. The total earned for the five typical days is $200. To find the average amount earned each day, divide by five: $200/5 = $40.

STEP 2 Consider a random variable X that takes on only two values, 0 and 50, with probabilities 0.20 and 0.80, respectively. Another way to compute the average earned each day is to use this probability distribution.

\[ 40 = \underbrace{0}_{\text{Value}} \times \underbrace{0.20}_{\text{Probability}} + \underbrace{50}_{\text{Value}} \times \underbrace{0.80}_{\text{Probability}} \]

The long-run average earnings per day can be found by using a probability distribution. Multiply each value by its corresponding probability, and sum these products.

Definition Let X be a discrete random variable with probability mass function p(x). The mean, or expected value, of X is

\[ \underbrace{E(X) = \mu = \mu_X}_{\text{Notation}} = \underbrace{\sum_{\text{all } x} [x \cdot p(x)]}_{\text{Calculation}} \tag{5.1} \]

A CLOSER LOOK

1. The capital E stands for expected value and is a function. The function E takes a random variable as an input and returns the expected value. More generally, E accepts as an input any function of a random variable. For example, suppose f(X) is a function of a discrete random variable X. The expected value of f(X) is

\[ E[f(X)] = \sum_{\text{all } x} [f(x) \cdot p(x)] \tag{5.2} \]

2. \( \mu \) is the mean, or expected value, of a random variable (which may model a population). If necessary, the associated random variable is used as a subscript for identification, for example, \( \mu_X \) or \( \mu_Y \).

3. The mean is easy to compute. Multiply each value of the random variable by its corresponding probability, and add the products.

4. The mean of a random variable is a weighted average and is only what happens on average. The mean may not be any of the possible values of the random variable.

Example 5.9 Road Construction Next 30 Miles
DATA SET PAVING

Thirty miles of Interstate Highway I-16 near Macon, Georgia, were recently repaved. At the end of each work day, the project supervisor estimated the number of hours behind or ahead of schedule. Suppose this estimate is a discrete random variable, X, with probability distribution as given in the following table:

x       -20    -10    30     60
p(x)    0.2    0.3    0.4    0.1

Find the mean of X.

SOLUTION

STEP 1 X is a discrete random variable. To find the mean, use Equation 5.1.

STEP 2

\[
\begin{aligned}
\mu &= \sum_{\text{all } x} [x \cdot p(x)] && \text{Equation 5.1.} \\
    &= (-20)(0.2) + (-10)(0.3) + (30)(0.4) + (60)(0.1) && \text{Multiply each value by its probability, and sum.} \\
    &= (-4) + (-3) + (12) + (6) = 11
\end{aligned}
\]

STEP 3 The mean, or long-run value, of X is 11. On average, the project supervisor estimated the project was 11 hours ahead of schedule. In this example the mean is not a possible value of X.

As the sample size increases, the sample mean, \( \bar{x} \), will tend to, or approach, the population mean \( \mu = 11 \).
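The long-run interpretation of the mean can be illustrated with a small simulation. This is a sketch under assumptions: the seed, the sample sizes, and the helper name `sample_mean` are arbitrary illustrative choices. It draws repeated values of X from the distribution in Example 5.9 using the standard library's `random.choices` and reports the sample mean for increasing sample sizes.

```python
import random

# Probability distribution from Example 5.9.
values = [-20, -10, 30, 60]
probs = [0.2, 0.3, 0.4, 0.1]


def sample_mean(n, rng):
    """Draw n independent values of X and return their average."""
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n


rng = random.Random(2024)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    # Sample means should settle near mu = 11 as n grows.
    print(n, round(sample_mean(n, rng), 2))
```

For small n the sample mean can be far from 11; for large n it should typically land close to 11, although any single run is still random.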

Example 5.10 The Health Risks of Soda
DATA SET SODA

According to a Gallup poll, almost half of all Americans drink at least one glass of soda per day.9 There is some evidence that this habit leads to increased health risks, especially diabetes, heart disease, obesity, and high cholesterol. Suppose X is a random variable that represents the number of glasses of soda that a randomly selected American drinks each day. The probability distribution for X is given in the following table.

x       0      1      2      3      4      5
p(x)    0.55   0.28   0.09   0.04   0.03   0.01

Find the expected number of glasses of soda that an American drinks each day.

SOLUTION

STEP 1 X is a discrete random variable (it takes on only a finite number of values). Find the mean using Equation 5.1.

\[
\begin{aligned}
\mu &= \sum_{\text{all } x} [x \cdot p(x)] && \text{Equation 5.1.} \\
    &= (0)(0.55) + (1)(0.28) + (2)(0.09) + (3)(0.04) + (4)(0.03) + (5)(0.01) && \text{Sum of each value times probability.} \\
    &= 0 + 0.28 + 0.18 + 0.12 + 0.12 + 0.05 = 0.75
\end{aligned}
\]

STEP 2 The mean number of glasses of soda per day for Americans is 0.75.

The variance and standard deviation of a random variable measure the spread of the distribution. The variance is computed using the expected value function, and the standard deviation together with the mean can be used to determine the most likely values of the random variable.

Definition Let X be a discrete random variable with probability mass function p(x). The variance of X is

\[ \underbrace{\operatorname{Var}(X) = \sigma^2 = \sigma_X^2}_{\text{Notation}} = \underbrace{\sum_{\text{all } x} [(x - \mu)^2 \cdot p(x)]}_{\text{Calculation}} = \underbrace{E[(X - \mu)^2]}_{\text{Definition in terms of expected value}} \tag{5.3} \]

The standard deviation of X is the positive square root of the variance:

\[ \underbrace{\sigma = \sigma_X}_{\text{Notation}} = \underbrace{\sqrt{\sigma^2}}_{\text{Calculation}} \tag{5.4} \]

A CLOSER LOOK

1. In words, the variance is the expected value of the squared deviations about the mean.

2. The symbol Var stands for variance and is a function. The function Var takes a random variable as an input and returns the variance.

3. To compute the variance using Equation 5.3:
   a. Find the mean, \( \mu \), of X using Equation 5.1.
   b. Find each difference: \( (x - \mu) \).
   c. Square each difference: \( (x - \mu)^2 \).
   d. Multiply each squared difference by the associated probability.
   e. Sum the products.

4. There is a computational formula for the variance of a random variable.

Computational Formula for \( \sigma^2 \)

\[ \sigma^2 = E(X^2) - [E(X)]^2 = E(X^2) - \mu^2 \tag{5.5} \]

In words, the variance is the expected value of X squared minus the expected value of X, squared.
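The two routes to the variance, the definition (Equation 5.3) and the computational formula (Equation 5.5), can be compared side by side in code. The sketch below is illustrative only: the function names are invented here, and the distribution is the two-value caddy model from Example 5.8.

```python
import math


def expected_value(pmf, f=lambda x: x):
    """E[f(X)]: sum of f(x) * p(x) over all x (Equations 5.1 and 5.2)."""
    return sum(f(x) * p for x, p in pmf.items())


def variance_by_definition(pmf):
    """Equation 5.3: expected squared deviation about the mean."""
    mu = expected_value(pmf)
    return expected_value(pmf, lambda x: (x - mu) ** 2)


def variance_computational(pmf):
    """Equation 5.5: E(X^2) minus the square of E(X)."""
    mu = expected_value(pmf)
    return expected_value(pmf, lambda x: x ** 2) - mu ** 2


# Example 5.8: X = 0 with probability 0.20, X = 50 with probability 0.80.
caddy = {0: 0.20, 50: 0.80}

var_def = variance_by_definition(caddy)
var_comp = variance_computational(caddy)
sd = math.sqrt(var_comp)  # Equation 5.4

print(round(var_def, 4), round(var_comp, 4), round(sd, 4))  # 400.0 400.0 20.0
```

By hand: \( (0 - 40)^2(0.20) + (50 - 40)^2(0.80) = 320 + 80 = 400 \), and \( E(X^2) - \mu^2 = 2000 - 1600 = 400 \), so both formulas agree.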

Example 5.11 Children in Day Care
DATA SET DAYCARE

Suppose the discrete random variable X, the age of a randomly selected child at the Looney Toons Child Care Center in Cleveland, Ohio, has the probability distribution given in the following table.

x       1      2      3      4      5      6      7
p(x)    0.05   0.10   0.15   0.25   0.20   0.15   0.10

a. Find the expected value, variance, and standard deviation of X.
b. Find the probability that the random variable X takes on a value within one standard deviation of the mean.

SOLUTION

a. Expected value: Find the expected value of X using Equation 5.1.

\[
\begin{aligned}
E(X) &= \sum_{\text{all } x} [x \cdot p(x)] \\
     &= (1)(0.05) + (2)(0.10) + (3)(0.15) + (4)(0.25) + (5)(0.20) + (6)(0.15) + (7)(0.10) \\
     &= 0.05 + 0.20 + 0.45 + 1.00 + 1.00 + 0.90 + 0.70 = 4.30 = \mu
\end{aligned}
\]

Variance: Find the variance of X using Equation 5.3.

\[
\begin{aligned}
\operatorname{Var}(X) &= \sum_{\text{all } x} (x - \mu)^2 \cdot p(x) && \text{Equation 5.3.} \\
&= (1 - 4.30)^2(0.05) + (2 - 4.30)^2(0.10) + (3 - 4.30)^2(0.15) + (4 - 4.30)^2(0.25) \\
&\quad + (5 - 4.30)^2(0.20) + (6 - 4.30)^2(0.15) + (7 - 4.30)^2(0.10) && \text{Sum over all values of } x. \\
&= (10.89)(0.05) + (5.29)(0.10) + (1.69)(0.15) + (0.09)(0.25) \\
&\quad + (0.49)(0.20) + (2.89)(0.15) + (7.29)(0.10) && \text{Square each difference.} \\
&= 0.5445 + 0.5290 + 0.2535 + 0.0225 + 0.0980 + 0.4335 + 0.7290 && \text{Compute each product.} \\
&= 2.6100 = \sigma^2 && \text{Sum the products.}
\end{aligned}
\]

Here is a tabular method for computing the variance using the definition. Sum the last column to obtain \( \sigma^2 \).

x     x - mu    (x - mu)^2    p(x)    (x - mu)^2 * p(x)
1     -3.30     10.89         0.05    0.5445
2     -2.30      5.29         0.10    0.5290
3     -1.30      1.69         0.15    0.2535
4     -0.30      0.09         0.25    0.0225
5      0.70      0.49         0.20    0.0980
6      1.70      2.89         0.15    0.4335
7      2.70      7.29         0.10    0.7290
                              Sum:    2.6100  (sigma^2)

Variance: Using the computational formula. Find \( E(X^2) \), the expected value of \( X^2 \).

\[
\begin{aligned}
E(X^2) &= \sum_{\text{all } x} [x^2 \cdot p(x)] && \text{Equation 5.2.} \\
&= 1^2(0.05) + 2^2(0.10) + 3^2(0.15) + 4^2(0.25) + 5^2(0.20) + 6^2(0.15) + 7^2(0.10) && \text{Sum over all values of } x. \\
&= 1(0.05) + 4(0.10) + 9(0.15) + 16(0.25) + 25(0.20) + 36(0.15) + 49(0.10) && \text{Square each } x. \\
&= 0.05 + 0.40 + 1.35 + 4.00 + 5.00 + 5.40 + 4.90 && \text{Compute each product.} \\
&= 21.10 && \text{Sum the products.}
\end{aligned}
\]

Use this result to find the variance.

\[
\begin{aligned}
\sigma^2 &= E(X^2) - \mu^2 && \text{Equation 5.5.} \\
&= 21.10 - (4.30)^2 && \text{Use previous results.} \\
&= 21.10 - 18.49 && \text{Find } \mu^2. \\
&= 2.61 && \text{Find the difference.}
\end{aligned}
\]

Variance: Using a tabular method. Start at the middle of each row and work toward the ends. Sum the outer columns to obtain \( \mu \) and \( E(X^2) \).

x^2 * p(x)    x^2    p(x)    x    x * p(x)
0.05           1     0.05    1    0.05
0.40           4     0.10    2    0.20
1.35           9     0.15    3    0.45
4.00          16     0.25    4    1.00
5.00          25     0.20    5    1.00
5.40          36     0.15    6    0.90
4.90          49     0.10    7    0.70
21.10  (E(X^2))                   4.30  (mu)

Sum the first column for \( E(X^2) \) and the last column for \( \mu \).

\[ \sigma^2 = E(X^2) - \mu^2 = 21.10 - (4.30)^2 = 2.61 \]

Standard deviation: The positive square root of the variance.

\[ \sigma = \sqrt{\sigma^2} = \sqrt{2.61} \approx 1.6155 \]

Solution Trail 5.11b
KEYWORDS: Within one standard deviation of the mean
TRANSLATION: In the interval \( (\mu - \sigma, \mu + \sigma) \), within one step in each direction from the mean
CONCEPTS: Probability distribution; Probability statement
VISION: The probability distribution is given. Use the Translation to write a mathematical probability statement. Find the value(s) of X that lie in the interval, and add the corresponding probabilities.

b.

\[
\begin{aligned}
P(\mu - \sigma \le X \le \mu + \sigma) & && \text{Translation to a probability statement.} \\
&= P(4.30 - 1.6155 \le X \le 4.30 + 1.6155) && \text{Use values for } \mu \text{ and } \sigma. \\
&= P(2.6845 \le X \le 5.9155) && \text{Compute the difference and sum.} \\
&= P(X = 3) + P(X = 4) + P(X = 5) && \text{Find values of } X \text{ in the interval.} \\
&= 0.15 + 0.25 + 0.20 && \text{Use corresponding probabilities.} \\
&= 0.60 && \text{Compute the sum.}
\end{aligned}
\]

The probability that X takes on a value within one standard deviation of the mean is 0.60.
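The whole of Example 5.11 can be reproduced in a few lines, which is a useful check on the hand arithmetic. This is a minimal sketch: the names `mean_and_sd` and `prob_within` are illustrative, and the variance is computed with the computational formula (Equation 5.5).

```python
import math

# Example 5.11: age of a randomly selected child at the day care center.
daycare = {1: 0.05, 2: 0.10, 3: 0.15, 4: 0.25, 5: 0.20, 6: 0.15, 7: 0.10}


def mean_and_sd(pmf):
    """Return (mu, sigma), using sigma^2 = E(X^2) - mu^2."""
    mu = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mu, math.sqrt(second_moment - mu ** 2)


def prob_within(pmf, k=1):
    """P(mu - k*sigma <= X <= mu + k*sigma), by adding p(x) over the
    values of x that fall in the interval."""
    mu, sd = mean_and_sd(pmf)
    return sum(p for x, p in pmf.items() if mu - k * sd <= x <= mu + k * sd)


mu, sd = mean_and_sd(daycare)
print(round(mu, 2), round(sd, 4), round(prob_within(daycare, 1), 2))
# 4.3 1.6155 0.6
```

Only x = 3, 4, and 5 fall between 2.6845 and 5.9155, so the function adds 0.15 + 0.25 + 0.20, matching part (b).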

A CLOSER LOOK

1. The computational formula for the variance is quicker and produces less round-off error. Use Equation 5.5 to find the variance of a discrete random variable.

2. In Example 5.11, the random variable X is not (approximately) normal; the empirical rule does not apply. In addition, even though Chebyshev’s rule applies to any distribution, it should not be used if the probability distribution is known. Chebyshev’s rule provides only an estimate, a lower bound for the probability that X is within k standard deviations of the mean. The exact probability can be determined using the known probability distribution. (Actually, Chebyshev’s rule can’t help at all here, because k must be greater than 1.)

3. Neither the TI-84 Plus C nor Minitab has a built-in menu function to find the mean, variance, and standard deviation of a discrete random variable. However, it’s easy to perform list operations on the calculator and column operations in Minitab or Excel in order to produce these summary statistics. See the technology manuals for details.

Example 5.12 Rockin’ 5’s
DATA SET ROCKIN5

One of the newest instant scratch-off games in the North Carolina Education Lottery is Rockin’ 5’s. The payout (in dollars) is a discrete random variable with probability distribution as given in the following table.10

x       25000      5000       500        100        50         25
p(x)    0.000001   0.000004   0.000022   0.000800   0.002667   0.006667

x       20         10         5          4          2          0
p(x)    0.006667   0.013333   0.040000   0.046667   0.100000   0.783172

a. Find the expected payout, and the variance and standard deviation of the payout.
b. Interpret the values found in part (a). In particular, if it costs $2.00 to purchase a ticket, what happens in the long run?

SOLUTION

a. Use the tabular method and the computational formula for the variance.

x^2 * p(x)    x^2          p(x)        x        x * p(x)
625.0000      625000000    0.000001    25000    0.0250
100.0000       25000000    0.000004     5000    0.0200
  5.5000         250000    0.000022      500    0.0110
  8.0000          10000    0.000800      100    0.0800
  6.6675           2500    0.002667       50    0.1334
  4.1669            625    0.006667       25    0.1667
  2.6668            400    0.006667       20    0.1333
  1.3333            100    0.013333       10    0.1333
  1.0000             25    0.040000        5    0.2000
  0.7467             16    0.046667        4    0.1867
  0.4000              4    0.100000        2    0.2000
  0.0000              0    0.783172        0    0.0000
755.4811  (E(X^2))                              1.2894  (mu)

\[ \sigma^2 = E(X^2) - \mu^2 = 755.4811 - (1.2894)^2 = 753.8187 \]

\[ \sigma = \sqrt{\sigma^2} = \sqrt{753.8187} = 27.4558 \]

b. The mean payout from this instant scratch-off game is approximately $1.29, with variance $753.82 and standard deviation $27.46. If it costs $2.00 to play (purchase a ticket), in the long run the player loses 2.00 - 1.29 = 0.71, or 71 cents, on average. This doesn’t seem like much. However, from the State’s point of view, every time someone buys a ticket, North Carolina makes 71 cents (on average).
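With twelve payout values, Example 5.12 is tedious by hand and well suited to a short script. The sketch below assumes the payout table as given; the variable names and the $2.00 ticket price constant are taken from the example, while the layout is an illustrative choice. It applies the same column operations as the tabular method.

```python
import math

# Example 5.12: payout (in dollars) and probability for Rockin' 5's.
payout = {
    25000: 0.000001, 5000: 0.000004, 500: 0.000022, 100: 0.000800,
    50: 0.002667, 25: 0.006667, 20: 0.006667, 10: 0.013333,
    5: 0.040000, 4: 0.046667, 2: 0.100000, 0: 0.783172,
}

mean = sum(x * p for x, p in payout.items())              # column x * p(x)
second_moment = sum(x * x * p for x, p in payout.items())  # column x^2 * p(x)
variance = second_moment - mean ** 2                       # Equation 5.5
sd = math.sqrt(variance)

ticket_price = 2.00
expected_loss = ticket_price - mean  # long-run average loss per ticket

print(round(mean, 4), round(second_moment, 4))  # 1.2894 755.4811
print(round(variance, 2), round(sd, 2), round(expected_loss, 2))
```

The large standard deviation relative to the mean reflects the rare, very large payouts: most tickets pay nothing, and a handful pay thousands of dollars.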

SECTION 5.3 EXERCISES

Concept Check

5.44 True/False The mean of a discrete random variable must be a possible value of the random variable.

5.45 True/False The expected value of a discrete random variable can be negative.

5.46 True/False The standard deviation of a discrete random variable is always nonnegative.

5.47 True/False The standard deviation of a discrete random variable could be 0.

5.48 True/False The computational formula for \( \sigma^2 \) should only be used when the number of possible values for the random variable is small.

5.49 Fill in the Blank Suppose f(X) is a function of a discrete random variable. \( E[f(X)] = \) _____________.

5.50 Fill in the Blank The variance of a discrete random variable is the expected value of _____________.

Practice

5.51 Suppose X is a discrete random variable. Complete the table below and find the mean, variance, and standard deviation of X.

x^2 * p(x)    x^2    p(x)    x     x * p(x)
                     0.10    2
                     0.16    4
                     0.20    6
                     0.24    8
                     0.18    10
                     0.12    12

5.52 The probability distribution for a random variable X is given in the table below.

x       5      10     15     20
p(x)    0.10   0.15   0.70   0.05

a. Find the mean, variance, and standard deviation of X.
b. Find the probability X takes on a value smaller than the mean.
c. Using the probability distribution, explain why the value of the mean of X makes sense.

5.53 Suppose Y is a discrete random variable with probability distribution given in the table below.

y       -20    -10    0      10     20
p(y)    0.30   0.15   0.10   0.15   0.30

a. Find \( \mu \), \( \sigma^2 \), and \( \sigma \).
b. Find \( P(\mu - 2\sigma \le Y \le \mu + 2\sigma) \).
c. Find \( P(Y \ge \mu) \) and \( P(Y > \mu) \).

5.54 Suppose the random variable X has the probability distribution given in the table below.

x       2      3      5      7      11     13
p(x)    0.15   0.25   0.15   0.10   0.30   0.05

a. Find the mean, variance, and standard deviation of X.
b. Suppose the random variable Y is defined by \( Y = 2X + 1 \). Find the mean, variance, and standard deviation of Y.
c. Suppose the random variable W is defined by \( W = X^2 + 1 \). Find the mean, variance, and standard deviation of W.

5.55 Suppose X is a discrete random variable with probability distribution as given in the table below.

x       1      2      3      5      8      13     21
p(x)    0.05   0.10   0.15   0.20   0.25   0.20   0.05

a. Find the mean, variance, and standard deviation of X.
b. Find the probability X is more than one standard deviation from the mean.
c. Find \( P(X \le \mu + 2\sigma) \).

Applications

5.56 Manufacturing and Product Development In a quarter-pound bag of red pistachio nuts, some shells are too difficult to pry open by hand. Suppose the random variable X, the number of pistachios in a randomly selected bag that cannot be opened by hand, has the probability distribution given in the table below.

x       0       1       2       3       4       5
p(x)    0.500   0.250   0.100   0.050   0.075   0.025

a. Is this a valid probability distribution? Justify your answer.
b. Find the mean, variance, and standard deviation of X.
c. Find the probability X takes on a value less than the mean.
d. Suppose two bags of pistachios are selected at random. What is the probability that both bags have four or more pistachios too difficult to open by hand?

5.57 Sports and Leisure Suppose the number of rides that a visitor enjoyed at Disney World during a day is a random variable with probability distribution given in the table below.11

x       5      6      7      8      9      10     11     12
p(x)    0.04   0.07   0.09   0.12   0.20   0.30   0.13   0.05

a. Find the mean number of rides for a Disney World visitor.
b. Find the variance and standard deviation of the number of rides for a Disney World visitor.
c. Find the probability that the number of rides for a randomly selected Disney World visitor is within one standard deviation of the mean. Write a Solution Trail for this problem.
d. Find the probability that the number of rides for a randomly selected Disney World visitor is less than one standard deviation above the mean.

5.58 Public Policy and Political Science Approximately 100 children’s products are recalled every year.12 In particular, children’s clothing is recalled for a variety of reasons, for example, drawstrings that are too long and pose a hazard, small buttons that may break off and cause choking, and material that fails to meet federal flammability standards. Suppose the number of recalls of children’s clothing during a given month is a random variable with probability distribution given in the table below.

x       0       1       2       3       4       5       6
p(x)    0.005   0.185   0.275   0.305   0.200   0.020   0.010

a. Find the mean, variance, and standard deviation of the number of recalls of children’s clothing during a given month.
b. Suppose the number of recalls in a given month is at least three. What is the probability that the number of recalls that month will be at least five?
c. If the number of recalls in a given month is above more than one standard deviation from the mean, the federal government issues a special warning directed toward parents. What is the probability that a special warning will be issued during a given month?

5.59 Public Health and Nutrition A bill introduced into the Virginia State Senate stipulated that the owner of a tanning facility must identify and document the skin type of every customer, and must advise every customer of the maximum time of recommended exposure in the tanning device.13 Past records indicate that most sessions range from 10 to 30 minutes. Suppose the duration of a tanning session (in minutes) at the Solar Planet tanning facility in Herendon, Virginia, is a discrete random variable with probability distribution given in the table below.

x       10     12     15     20     25     30
p(x)    0.30   0.25   0.15   0.12   0.10   0.08

a. Find the mean, variance, and standard deviation of the duration of a tanning session time.
b. Find the probability that a randomly selected session has a duration within one standard deviation of the mean.
c. Find the probability that a randomly selected session has a duration within two standard deviations of the mean.
d. Suppose a sunlamp lasts for 100 hours. After approximately how many tanning sessions will the sunlamp have to be replaced?

5.60 Public Health and Nutrition For nurses, additional patients contribute heavily to increased stress and job burnout. Suppose the number of patients assigned to each nurse at the Banner Thunderbird Medical Center in Glendale, Arizona, is a random variable with probability distribution given in the table below.14

x       3      4      5      6      7      8
p(x)    0.07   0.12   0.18   0.37   0.17   0.09

a. Find the mean, variance, and standard deviation of the number of patients assigned to each nurse.
b. Find the probability that the number of assigned patients is greater than one standard deviation to the right of the mean.
c. Suppose three nurses are selected at random. What is the probability that exactly two of the three have five assigned patients?

5.61 Psychology and Human Behavior A certain elevator in Tampa’s tallest office building, 100 North Tampa, is used heavily between 8:00 A.M. and 9:00 A.M. as employees arrive for work. Suppose the number of people who board the elevator on the ground floor going up is a random variable with probability distribution given in the table below.

x       1       2       3       4       5       6
p(x)    0.002   0.010   0.050   0.060   0.080   0.090

x       7       8       9       10      11      12
p(x)    0.100   0.120   0.140   0.150   0.150   0.048

Extended Applications

5.62 Manufacturing and Product Development A cordless drill has several torque settings for driving different screws into different materials. A manufacturer models the torque setting required for a randomly selected task with a probability distribution given by

p(x) = [formula missing], x = 1, 5, 10, 15, 20

a. Verify that this is a valid probability distribution.
b. Find the mean, variance, and standard deviation of X.
c. A torque setting is classified as rare if it is more than one standard deviation from the mean. Find the probability of a task requiring a rare torque setting.

5.63 Six Degrees of Kevin Bacon The actor Kevin Bacon has been in so many movies that almost everyone in Hollywood can be connected to him within six degrees. Let n = 1,326,359, the number of actors linked to Kevin Bacon. The probability distribution for the Bacon Number of a randomly selected Hollywood personality is given in the following table.15

x       0       1          2             3             4
p(x)    1/n     2511/n     262,544/n     839,562/n     204,764/n

x       5           6          7         8
p(x)    15,344/n    1397/n     204/n     32/n

a. Find the mean, variance, and standard deviation for the Bacon Number.
b. Find \( P(X \ge \mu - \sigma) \).
c. The number of movies, Y, in which a Hollywood personality has appeared is related to the Bacon Number by the formula \( Y = 2X + 5 \). Find the mean, variance, and standard deviation of the number of movies in which a Hollywood personality has appeared.

x

x

( x 2 12 ) 2 247

5.64 Marketing and Consumer Behavior While the temperature range of most household ovens is approximately 200–600°F, most consumers only use four or five common settings. Suppose the probability distribution for the oventemperature setting for a randomly selected use is given in the table below.

x

300

325

350

375

400

500

p(x)

0.040

0.205

0.400

0.075

0.200

0.080

2

a. Find m, s , and s . b. For a randomly selected elevator ride from the ground

floor going up, what is the probability that the number of riders is within one standard deviation of the mean? c. For two randomly selected elevator rides from the ground floor going up, what is the probability that both trips have a number of riders more than two standard deviations from the mean?

a. Find the mean, variance, and standard deviation of the

oven temperature settings. b. Suppose three different uses are randomly selected. Find

the probability that the temperature settings for all three are at least 400°F. c. Suppose three different uses are randomly selected. Find the probability that exactly one use is for 350°F.


5.65 Education and Child Development An elementary class rarely remains the same size from the beginning of the school year until the end. Families move in and out of the district, some students are reassigned, and scheduling conflicts necessitate changes. Suppose the change in the number of students in a class is a random variable with probability distribution given by

$$p(x) = \frac{|x| + 1}{19}, \qquad x = -3, -2, -1, 0, 1, 2, 3$$

a. Verify that this is a valid probability distribution.
b. Find the mean, variance, and standard deviation of the change in class size.
c. Suppose two classes are selected at random. Find the probability that both classes remain the same size for the entire year.

5.66 Psychology and Human Behavior Many organizations publish wedding guidelines with specific suggestions regarding the reception, wedding cake, flowers, and even themes and styles. However, the number of bridesmaids and groomsmen is usually a very personal decision made by the bride and groom. For semiformal weddings, one to six bridesmaids are typical, with a possible flower girl and/or ring bearer.16 Suppose the probability distribution for the number of bridesmaids at a semiformal wedding is given in the table below.

x      1     2     3     4     5     6
p(x)   0.05  0.21  0.34  0.27  0.08  0.05

a. Find the mean, variance, and standard deviation for the number of bridesmaids at a semiformal wedding.
b. Suppose a randomly selected semiformal wedding has at least four bridesmaids. What is the probability that it has exactly six bridesmaids?
c. Suppose four semiformal weddings are randomly selected. What is the probability that all four have at least three bridesmaids?


Challenge

5.67 Dichotomous Random Variable Suppose the random variable X takes on only two values, according to the probability distribution given below.

x      0    1
p(x)   0.4  0.6

a. Find the mean, variance, and standard deviation of X.
b. Suppose $P(X = 1) = 0.7$, and therefore $P(X = 0) = 1 - 0.7 = 0.3$. Find the mean, variance, and standard deviation of X.
c. Suppose $P(X = 1) = 0.8$. Find the mean, variance, and standard deviation of X.
d. Suppose $P(X = 1) = p$ and $P(X = 0) = 1 - p = q$. Find the mean, variance, and standard deviation of X in terms of p and q.
e. For what values of p (and q) is the variance of X greatest?

5.68 Linear Function Suppose X is a discrete random variable with mean $\mu_X$ and variance $\sigma_X^2$. Let Y be a linear function of X, such that $Y = aX + b$, where a and b are constants. Find the mean and variance of Y in terms of $\mu_X$ and $\sigma_X^2$.

5.69 Variance Computation Formula Suppose X is a discrete random variable that takes on a finite number of values. Prove the variance computation formula. That is, show that

$$E[(X - \mu)^2] = E(X^2) - \mu^2$$

Hint: Write $E[(X - \mu)^2]$ as a sum using the probability mass function p(x). Expand and simplify.

5.70 Standardization Suppose X is a discrete random variable with mean $\mu_X$ and variance $\sigma_X^2$. Let Y be defined in terms of X by

$$Y = \frac{X - \mu_X}{\sigma_X}$$

Find the mean, variance, and standard deviation of Y.

5.4 The Binomial Distribution

In the previous sections, the general definition and probability distribution for a discrete random variable were introduced. This section presents a specific discrete random variable that is common and very important. The binomial random variable can be used to model many real-world populations and to do more formal inference. As with any random variable, there is a related experiment in the background. Consider the following experiments (and look for similarities).

1. Simply toss a coin 50 times and record the sequence of heads and tails.
2. For a random sample of 100 voters, ask each one whether he or she is going to vote for a particular candidate. Record the sequence of yes and no responses.
3. Select a random sample of 25 customers at a fast-food restaurant and record whether or not each pays with exact change.
4. Drill a series of randomly selected test oil wells. Each well will either yield oil worth drilling or be classified as dry. Record the result for each well.


There are four common properties in all of these experiments. These properties are used to describe a binomial experiment, and they are necessary to define a binomial random variable.

Properties of a Binomial Experiment

1. The experiment consists of n identical trials.
2. Each trial can result in only one of two possible (mutually exclusive) outcomes. One outcome is usually designated a success (S) and the other a failure (F).
3. The outcomes of the trials are independent.
4. The probability of a success, p, is constant from trial to trial.

A CLOSER LOOK

1. A trial is a small part of the larger experiment. A trial results in a single occurrence of either a success or a failure. For example, flipping a coin once, or drilling one test oil well, is a single trial. A typical binomial experiment might consist of $n = 50$ trials.
2. A success does not have to be a good thing. For example, the experiment may consist of injecting animals with a potential carcinogen and checking for the development of tumors. A success might be an animal that develops at least one tumor. Success and failure could stand for heads and tails, acceptable and not acceptable, or even dead and alive.
3. Trials are independent if whatever happens on one trial has no effect on any other trial. For example, any one voter response has no effect on any other voter response.
4. The probability of a success on every trial is exactly the same. For example, the probability of the tossed (fair) coin landing with head face up is always 1/2.

In a binomial experiment, outcomes consist of sequences of Ss and Fs. For example, SSFSFSFS is a possible outcome in a binomial experiment with $n = 8$ trials.

The Binomial Random Variable

The binomial random variable maps each outcome in a binomial experiment to a real number, and is defined to be the number of successes in n trials.

Notation

1. The probability of a success is denoted by p. Therefore, $P(S) = p$ and $P(F) = 1 - p = q$. (Why is $P(F) = 1 - p$?)
2. A binomial random variable X is completely determined by the number of trials n and the probability of a success p. If we know those two values, then we will be able to answer any probability question involving X. The shorthand notation $X \sim B(n, p)$ means X is (distributed as) a binomial random variable with n trials and probability of a success p. For example, $X \sim B(25, 0.4)$ means X is a binomial random variable with 25 trials and probability of success 0.4.

Our goal now is to find the probability distribution for a binomial random variable. Given n and p, we want to find the probability of obtaining x successes in n trials, $P(X = x) = p(x)$. We will solve this problem by first considering a simple case with $n = 5$.


For example, suppose five people are selected at random. Let p be the probability that a randomly selected person snores.


Example 5.13 Binomial Experiment with n = 5

Consider a binomial experiment with $n = 5$ trials and probability of success p.
a. A typical outcome with two successes is SFFSF. Find the probability of this outcome.
b. Another possible outcome with two successes is FSFSF. Find the probability of this outcome.
c. Compare your results from (a) and (b).
d. Find the probability that $X = 2$ successes.

SOLUTION

a. The probability of this outcome is

$$
\begin{aligned}
P(SFFSF) &= P(S \cap F \cap F \cap S \cap F) && \text{Success on the first trial and failure on the second trial and } \ldots \\
&= P(S) \cdot P(F) \cdot P(F) \cdot P(S) \cdot P(F) && \text{Trials are independent (property of a binomial experiment).} \\
&= p \cdot (1 - p) \cdot (1 - p) \cdot p \cdot (1 - p) && P(S) = p,\ P(F) = 1 - p. \\
&= p^2 (1 - p)^3 && \text{Multiplication is commutative.}
\end{aligned}
$$

b. The probability of this outcome is

$$
\begin{aligned}
P(FSFSF) &= P(F) \cdot P(S) \cdot P(F) \cdot P(S) \cdot P(F) && \text{Trials are independent.} \\
&= (1 - p) \cdot p \cdot (1 - p) \cdot p \cdot (1 - p) && P(S) = p,\ P(F) = 1 - p. \\
&= p^2 (1 - p)^3 && \text{Multiplication is commutative.}
\end{aligned}
$$

c. The results are identical. Every other outcome with two successes, and therefore three failures, has exactly the same probability, $p^2 (1 - p)^3$. Therefore, the probability of an outcome depends on the number of successes (and failures), not on the order in which they appear.

d. To compute the probability that $X = 2$ successes, find all the outcomes mapped to a 2, and add the corresponding probabilities. However, every outcome that is mapped to a 2 has the same probability, so all we need to know is how many outcomes are mapped to a 2.

$$P(X = 2) = (\text{number of outcomes with 2 successes}) \, p^2 (1 - p)^3$$

Generalizing, suppose $X \sim B(n, p)$. We want to find the probability of obtaining x successes in n trials. The number of successes and the number of failures must sum to n, the total number of trials. The probability of any single outcome with x successes, and therefore $n - x$ failures, is $p^x (1 - p)^{n - x}$. The probability of obtaining x successes is

$$P(X = x) = (\text{number of outcomes with } x \text{ successes}) \, p^x (1 - p)^{n - x}$$

We need a method for quickly counting the number of outcomes with x successes.

Recall: For any positive whole number n, the symbol $n!$ (read “n factorial”) is defined by

$$n! = n(n - 1)(n - 2) \cdots (3)(2)(1)$$

In addition, $0! = 1$ (0 factorial is 1). Given a collection of n items, the number of combinations of size x is given by

$$_nC_x = \binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

The number of outcomes with x successes is determined using combinations. Suppose $X \sim B(n, p)$. The number of outcomes with x successes is $\binom{n}{x}$. (Can you figure out why this is true?) We can now write an expression for the probability of obtaining x successes in n trials.
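The counting argument above can be checked by brute force for the case $n = 5$. The following Python sketch is an illustration only: the helper name `outcome_probability` and the value p = 0.3 are assumptions chosen for the demonstration, not part of the text. It lists every sequence of Ss and Fs, keeps those with exactly two successes, and compares the count with $\binom{5}{2}$.

```python
from itertools import product
from math import comb


def outcome_probability(outcome: str, p: float) -> float:
    """Probability of one specific S/F sequence, assuming independent trials."""
    prob = 1.0
    for result in outcome:
        prob *= p if result == "S" else (1 - p)
    return prob


n, p = 5, 0.3  # p = 0.3 is an arbitrary illustrative value

# Every possible outcome of the experiment: all sequences of S and F of length n.
outcomes = ["".join(seq) for seq in product("SF", repeat=n)]

# Outcomes that the binomial random variable maps to the value 2.
two_successes = [o for o in outcomes if o.count("S") == 2]

print(len(two_successes), comb(n, 2))  # both counts should agree

# Each such outcome has the same probability, so P(X = 2) is count * p^2 (1 - p)^3.
prob_x_equals_2 = sum(outcome_probability(o, p) for o in two_successes)
print(prob_x_equals_2, comb(n, 2) * p**2 * (1 - p) ** 3)
```

Enumeration is practical only for small n, since the number of sequences doubles with each added trial; the combination formula avoids listing them at all.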


The Binomial Probability Distribution

Suppose X is a binomial random variable with n trials and probability of a success p: $X \sim B(n, p)$. Then

$$p(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, 2, 3, \ldots, n \tag{5.6}$$

Here $\binom{n}{x}$ is the number of outcomes with x successes, and $p^x (1 - p)^{n - x}$ is the probability of x successes and $n - x$ failures in any single outcome.

Solution Trail 5.14a

KEYWORDS
■ 75%
■ 10 bottles
■ Selected at random
■ Exactly six

TRANSLATION
■ Binomial experiment

CONCEPTS
■ Binomial probability distribution

VISION
Assume this problem describes a binomial experiment: $n = 10$ trials (bottles), $P(\text{ice wine}) = p = 0.75$, p is the same on each trial, and the trials are independent. Define the binomial random variable and write a probability statement.

Example 5.14 Dessert Wine

Ice wine is made from grapes that are allowed to freeze naturally while still on the vine. Seventy-five percent of all ice wine in Canada is made in Ontario.17 The remaining 25% is made in other provinces. Suppose 10 bottles of Canadian ice wine are selected at random.
a. Find the probability that exactly six bottles are from Ontario.
b. Find the probability that at least seven bottles are from Ontario.

SOLUTION

Let X be the number of bottles of ice wine (out of the 10 selected) that come from Ontario. The experiment exhibits the properties of a binomial experiment, so $X \sim B(10, 0.75)$ ($n = 10$, $p = 0.75$).

a. Translate the words into a probability statement involving the random variable X. Exactly six means $X = 6$. Use Equation 5.6 to find the relevant probability.

$$
\begin{aligned}
P(X = 6) &= \binom{10}{6} (0.75)^6 (1 - 0.75)^{10 - 6} && \text{Equation 5.6.} \\
&= (210)(0.75)^6 (0.25)^4 && \text{Compute } {}_{10}C_6. \\
&= 0.1460
\end{aligned}
$$

The probability that exactly 6 of the 10 randomly selected bottles of ice wine come from Ontario is 0.1460.

Solution Trail 5.14b

KEYWORDS
■ At least seven

TRANSLATION
■ Seven or more, X is 7 or greater.

CONCEPTS
■ Binomial probability distribution

VISION
We already have the distribution for X. We need a probability statement. Use Equation 5.6 to calculate the necessary probabilities.

b. The probability that X is greater than or equal to 7 means the probability that X is 7 or 8 or 9 or 10.

$$
\begin{aligned}
P(X \ge 7) &= P(X = 7 \text{ or } X = 8 \text{ or } X = 9 \text{ or } X = 10) \\
&= P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10) && \text{Or means union; the outcomes are disjoint.} \\
&= \binom{10}{7} (0.75)^7 (0.25)^3 + \binom{10}{8} (0.75)^8 (0.25)^2 \\
&\quad + \binom{10}{9} (0.75)^9 (0.25)^1 + \binom{10}{10} (0.75)^{10} (0.25)^0 && \text{Use Equation 5.6 four times.} \\
&= 0.2503 + 0.2816 + 0.1877 + 0.0563 && \text{Compute combinations and powers, and multiply.} \\
&= 0.7759
\end{aligned}
$$

The probability that at least seven bottles of ice wine are from Ontario is 0.7759.

Note:
1. The two most important elements for solving this problem are (1) the probability distribution and (2) the probability statement.
2. Often, the properties of a binomial experiment will not be stated explicitly in the problem. Usually we must read into the problem to see the n trials, to identify a success, to recognize independence, and to presume that the probability of a success remains constant from trial to trial.
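The arithmetic in Example 5.14 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice, and the function is simply Equation 5.6 written as code.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Equation 5.6: probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.75  # 10 bottles, 75% of Canadian ice wine made in Ontario

# Part (a): exactly six bottles from Ontario.
exactly_six = binomial_pmf(6, n, p)

# Part (b): at least seven means X = 7, 8, 9, or 10; the events are disjoint, so add.
at_least_seven = sum(binomial_pmf(x, n, p) for x in range(7, n + 1))

print(f"P(X = 6)  = {exactly_six:.4f}")
print(f"P(X >= 7) = {at_least_seven:.4f}")
```

Carrying full precision until the final formatting step avoids the round-off that accumulates when each term is rounded to four places before adding.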

A CLOSER LOOK

1. Even for small values of n, many of the probabilities associated with a binomial random variable are a little tedious to calculate, and are subject to lots of round-off error. Technology helps, and Table 1 in Appendix A presents cumulative probabilities for a binomial random variable, for various values of n and p.
2. Cumulative probability is an important concept. If $X \sim B(n, p)$, the probability that X takes on a value less than or equal to x is cumulative probability. Accumulate all the probability associated with values up to and including x. Symbolically, cumulative probability is

$$
\begin{aligned}
P(X \le x) &= \sum_{k=0}^{x} P(X = k) \\
&= P(X = 0) + P(X = 1) + P(X = 2) + \cdots + P(X = x)
\end{aligned}
\tag{5.7}
$$

3. Graphically, cumulative probability is like standing on a special staircase, looking down (or back), and measuring the height. The steps are labeled 0, 1, 2, . . . , n, the height of step x is $P(X = x)$, and the total height of the staircase is 1. Figure 5.4 illustrates $P(X \le 3)$. The number of steps is $n + 1$, and the height of each step depends on n and p. In this example, $n = 10$ and $p = 0.25$. The largest steps (the highest probabilities) are associated with $X = 1$, 2, and 3. Steps 7, 8, 9, and 10 are hard to see because $P(X = 7)$, $P(X = 8)$, $P(X = 9)$, and $P(X = 10)$ are so small.

Solution Trail 5.15

KEYWORDS
■ 40%
■ 20 pizza orders
■ Randomly selected

TRANSLATION
■ $p = 0.40$
■ $n = 20$

CONCEPTS
■ Probabilities associated with a binomial distribution

VISION
Consider a binomial experiment: n and p are given, and the four characteristics of a binomial experiment are assumed to apply. Identify the probability distribution and write each probability question in terms of the random variable. Use appropriate probability rules.

Figure 5.4 Staircase analogy to cumulative probability. (Steps are labeled 0 through 10; the step heights $P(X = 0)$, $P(X = 1)$, $P(X = 2)$, and $P(X = 3)$ accumulate to $P(X \le 3)$, and the total height of the staircase is 1.)
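The staircase picture can also be tabulated. The sketch below is an illustration, not part of the text's technology coverage; the variable names `step_heights` and `staircase` are assumptions. It uses the values $n = 10$ and $p = 0.25$ from the discussion of Figure 5.4, computes each step height with Equation 5.6, and accumulates them as in Equation 5.7.

```python
from itertools import accumulate
from math import comb

n, p = 10, 0.25  # the values used for the staircase in Figure 5.4

# Height of step x is P(X = x), from the binomial probability distribution.
step_heights = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Cumulative height after step x is P(X <= x): a running total of the step heights.
staircase = list(accumulate(step_heights))

for x, (height, total) in enumerate(zip(step_heights, staircase)):
    print(f"x = {x:2d}   P(X = x) = {height:.4f}   P(X <= x) = {total:.4f}")
```

The last running total should be 1 (up to floating-point error), which is the total height of the staircase, and the entries for steps 7 through 10 show why those steps are hard to see in the figure.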

Every probability question about a binomial random variable can be answered using cumulative probability. There may also be other, faster methods, but cumulative probability always works. The following example illustrates some of the techniques for converting to and using cumulative probability.

Example 5.15 A Slice Above

Approximately 40% of all pizza orders are carry-out.18 Suppose 20 pizza orders are randomly selected.
a. Find the probability that at most 8 are carry-out orders.
b. Find the probability that exactly 10 are carry-out orders.
c. Find the probability that at least 7 are carry-out orders.
d. Find the probability that between 5 and 11 (inclusive) are carry-out orders.


SOLUTION

Let X be the number of pizza orders (out of the 20 selected) that are carry-out. X is a binomial random variable with $n = 20$ and $p = 0.40$: $X \sim B(20, 0.40)$.

a. The probability that at most 8 are carry-out orders

$$
\begin{aligned}
&= P(X \le 8) && \text{Translate the words into mathematics.} \\
&= 0.5956 && \text{Cumulative probability; use Table 1 in Appendix A.}
\end{aligned}
$$

b. The probability that exactly 10 are carry-out orders

$$
\begin{aligned}
&= P(X = 10) && \text{Translate the words into mathematics.} \\
&= P(X \le 10) - P(X \le 9) && \text{Convert to cumulative probability.} \\
&= 0.8725 - 0.7553 = 0.1172 && \text{Use Table 1 in Appendix A.}
\end{aligned}
$$

This solution may also be found by using the probability mass function for a binomial random variable (Equation 5.6). In addition, most statistical software can compute binomial probabilities for single values.

c. The probability that at least 7 are carry-out orders

$$
\begin{aligned}
&= P(X \ge 7) && \text{Translate the words into mathematics.} \\
&= 1 - P(X < 7) && \text{The complement rule.} \\
&= 1 - P(X \le 6) && \text{The first value X takes on that is less than 7 is 6.} \\
&= 1 - 0.2500 = 0.7500 && \text{Use Table 1 in Appendix A.}
\end{aligned}
$$

d. The probability that between 5 and 11 (inclusive) are carry-out orders

$$
\begin{aligned}
&= P(5 \le X \le 11) && \text{Translate the words into mathematics.} \\
&= P(X \le 11) - P(X \le 4) && \text{Convert to cumulative probability.} \\
&= 0.9435 - 0.0510 = 0.8925 && \text{Use Table 1 in Appendix A.}
\end{aligned}
$$

Figures 5.5 through 5.8 show technology solutions.

Figure 5.5 $P(X \le 8)$; cumulative probability.

Figure 5.6 $P(X = 10)$; using the probability mass function.

Figure 5.7 $P(X \ge 7)$; using cumulative probability.

Figure 5.8 $P(5 \le X \le 11)$; using cumulative probability.
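As a complement to the table lookups in Example 5.15, the same four conversions to cumulative probability can be sketched in Python. The helper names `binomial_pmf` and `binomial_cdf` are illustrative assumptions (they are not functions from any package mentioned in the text); `binomial_cdf` plays the role of Table 1 in Appendix A.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Equation 5.6: probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x: int, n: int, p: float) -> float:
    """Equation 5.7: cumulative probability P(X <= x)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


n, p = 20, 0.40  # 20 pizza orders, 40% carry-out

part_a = binomial_cdf(8, n, p)                           # at most 8
part_b = binomial_cdf(10, n, p) - binomial_cdf(9, n, p)  # exactly 10
part_c = 1 - binomial_cdf(6, n, p)                       # at least 7 (complement rule)
part_d = binomial_cdf(11, n, p) - binomial_cdf(4, n, p)  # between 5 and 11 inclusive

for label, value in [("a", part_a), ("b", part_b), ("c", part_c), ("d", part_d)]:
    print(f"({label}) {value:.4f}")
```

The results should agree with the table-based answers to about three decimal places. A difference in the last digit, as can happen in part (b), comes from subtracting two table entries that were each rounded to four places.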

The next example shows how the binomial distribution can be used to make an inference.

Example 5.16 Lower Your Cholesterol

The drug Lipitor, made by Pfizer, is used to lower cholesterol levels. It was first sold in 1997 and is the best-selling drug of all time. Based on clinical trials, Pfizer claims that approximately 10% of patients using Lipitor in a 40-mg dose will experience arthralgia, or joint pain, which is considered an adverse reaction.19 Suppose 25 people who need Lipitor are selected at random. Each is given a 40-mg dose (per day), and the number of people who experience joint pain is recorded.
a. Find the probability that at most one person will experience joint pain.
b. Suppose seven people experience joint pain. Is there any evidence to suggest that Pfizer’s claim is wrong? Justify your answer.

Solution Trail 5.16a

KEYWORDS
■ 10%
■ 25 people
■ At most one

TRANSLATION
■ $p = 0.10$
■ $n = 25$
■ $X \le 1$

CONCEPTS
■ Binomial random variable

VISION
Consider a binomial experiment: $n = 25$ trials with two outcomes (joint pain or no joint pain), the trials are independent (random sample), and the probability of a success (experience joint pain) is constant from trial to trial.

Solution Trail 5.16b

KEYWORDS
■ Is there any evidence?

TRANSLATION
■ Use the experimental outcome to draw a conclusion concerning Pfizer’s claim.

CONCEPTS
■ Inference procedure

VISION
To decide whether seven people experiencing joint pain is reasonable, we need to follow the four-step inference procedure. This process now involves a random variable. Consider the assumption (claim), the experiment, and the likelihood of the experimental outcome.

SOLUTION

Let X be the number of people (out of the 25 selected) who experience joint pain after taking Lipitor. X is a binomial random variable with $n = 25$ and $p = 0.10$: $X \sim B(25, 0.10)$. Translate the words into a mathematical probability statement, convert to cumulative probability if necessary, and use Table 1 in Appendix A.

a. The probability that at most one will experience joint pain

$$
\begin{aligned}
&= P(X \le 1) && \text{Translate the words into mathematics.} \\
&= 0.2712 && \text{Already cumulative probability; use Table 1 in Appendix A.}
\end{aligned}
$$

The probability of at most one person experiencing joint pain is 0.2712.

b. Pfizer claims $p = 0.10$. This implies that the random variable X has a binomial distribution with $n = 25$ and $p = 0.10$.

Claim: $p = 0.10 \Rightarrow X \sim B(25, 0.10)$.

The experimental outcome is that seven people experience joint pain.

Experiment: $x = 7$.

It seems reasonable to consider $P(X = 7)$ and draw a conclusion based on this probability. However, to be conservative (to give the person making the claim the benefit of the doubt), we always consider a tail probability. We accumulate the probability in a tail of the distribution, and if it is small, then there is evidence to suggest the claim is false.

So, which tail? It depends on the mean of the distribution (and later on, the alternative hypothesis). Formulas for the mean, variance, and standard deviation of a binomial random variable are given below. Intuitively, however, the mean of a binomial random variable is $\mu = np$. If $n = 25$ and $p = 0.10$, we expect to see $\mu = (25)(0.10) = 2.5$ people experience joint pain. Because $x = 7$ is to the right of the mean, we’ll consider a right-tail probability. See Figure 5.9.

Figure 5.9 A portion of the probability histogram for the random variable X in Example 5.16 (horizontal axis x from 0 to 10, vertical axis p(x) from 0.00 to 0.25). The right-tail probability $P(X \ge 7)$ is the sum of the heights of the rectangles above 7, 8, 9, . . . , 25.


Likelihood:

$$
\begin{aligned}
P(X \ge 7) &= 1 - P(X < 7) && \text{The complement rule.} \\
&= 1 - P(X \le 6) && \text{The first value X takes on that is less than 7 is 6.} \\
&= 1 - 0.9905 = 0.0095 && \text{Cumulative probability; use Table 1 in Appendix A.}
\end{aligned}
$$

Conclusion: Because this tail probability is so small (less than 0.05), it is very unusual to observe seven or more people with joint pain. But it happened! This is either an incredibly lucky occurrence, or someone is lying. We usually discount the lucky possibility, and conclude that there is evidence to suggest Pfizer’s claim is false. Figure 5.10 shows a technology solution.

Figure 5.10 Probability calculations using JMP.
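The claim, experiment, likelihood, and conclusion steps of Example 5.16 can also be traced in a short Python sketch. It is offered as an illustration under stated assumptions: the helper `binomial_cdf` is a hypothetical name, and the 0.05 cutoff is the threshold mentioned in the conclusion above, not a universal rule.

```python
from math import comb


def binomial_cdf(x: int, n: int, p: float) -> float:
    """Cumulative probability P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


# Claim: p = 0.10, so X ~ B(25, 0.10).
n, p = 25, 0.10
# Experiment: seven people experience joint pain.
observed = 7

mean = n * p                          # expected number with joint pain
at_most_one = binomial_cdf(1, n, p)   # part (a)

# Likelihood: the observation is to the right of the mean, so use the right tail.
right_tail = 1 - binomial_cdf(observed - 1, n, p)  # P(X >= 7) = 1 - P(X <= 6)

print(f"mean = {mean}")
print(f"P(X <= 1) = {at_most_one:.4f}")
print(f"P(X >= 7) = {right_tail:.4f}")

# Conclusion: a small tail probability is evidence against the claim.
print("evidence against the claim" if right_tail < 0.05 else "no evidence against the claim")
```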

Solution Trail 5.17a

KEYWORDS
■ 60%
■ 100 children

TRANSLATION
■ $p = 0.60$
■ $n = 100$

CONCEPTS
■ Binomial random variable, mean, variance, standard deviation

VISION
Consider a binomial experiment: $n = 100$ trials, two outcomes (living in poverty or not living in poverty), trials are independent (random sample), and probability of a success (living in poverty) is constant from trial to trial. Define a random variable and identify its probability distribution.

A random variable is often described, or characterized, by its mean and variance (or standard deviation): $\mu$ and $\sigma^2$ (or $\sigma$). If we know $\mu$ and $\sigma$, we can use Chebyshev’s rule to determine the most likely values of the random variable. For any population (random variable), most (at least 89%) of the values are within three standard deviations of the mean. This fact provides another approach to statistical inference, for determining the likelihood of an experimental outcome.

To find the mean and variance of a binomial random variable, we could use the mathematical definitions (Equations 5.1 and 5.3). These formulas are used to produce the general results below. However, the mean is intuitive. Consider a binomial random variable $X \sim B(10, 0.5)$. We expect to see $(10)(0.5) = 5 = np$ successes in 10 trials. (Think about tossing a fair coin 10 times.) Similarly, if $X \sim B(100, 0.75)$, we expect to see $100(0.75) = 75 = np$ successes. The mean of a binomial random variable with n trials and probability of a success p is $\mu = np$.

Mean, Variance, and Standard Deviation of a Binomial Random Variable

If X is a binomial random variable with n trials and probability of a success p, $X \sim B(n, p)$, then

$$\mu = np \qquad \sigma^2 = np(1 - p) \qquad \sigma = \sqrt{np(1 - p)} \tag{5.8}$$

Given a binomial random variable with known n and p, we know the mean, variance, and standard deviation immediately. There is no need to create a table of values and probabilities, and use the formulas to find $\mu$ and $\sigma^2$. Here is an example to illustrate the use of this concept.
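To see that the shortcut formulas agree with the general definitions of the mean and variance of a discrete random variable, one can build the full table of values and probabilities numerically and compare. The sketch below is illustrative; the function name `mean_and_variance_from_definition` is an assumption, and the two cases are the ones mentioned above, B(10, 0.5) and B(100, 0.75).

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Equation 5.6: probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def mean_and_variance_from_definition(n: int, p: float) -> tuple[float, float]:
    """Sum x p(x) for the mean, then sum (x - mean)^2 p(x) for the variance."""
    mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
    variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, variance


for n, p in [(10, 0.5), (100, 0.75)]:
    mean, variance = mean_and_variance_from_definition(n, p)
    print(f"B({n}, {p}): definition gives mean {mean:.6f}, variance {variance:.6f}")
    print(f"           formulas give   mean {n * p:.6f}, variance {n * p * (1 - p):.6f}")
```

A numerical check like this is not a proof, but it shows why the table-based computation is unnecessary once n and p are known.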

Example 5.17 Children in Poverty

Troubles in the automobile industry and the global economic downturn caused high unemployment and a population exodus in Detroit. According to the 2013 State of Detroit report, 60% of all children there live in poverty.20 Suppose 100 children in Detroit are selected at random.
a. Find the mean, variance, and standard deviation of the number of children living in poverty.
b. Suppose 55 of the 100 children are living in poverty. Is there any evidence to suggest the report’s claim is false? Justify your answer.

Solution Trail 5.17b

KEYWORDS
■ Is there any evidence?

TRANSLATION
■ Use the experimental outcome to draw a conclusion concerning the report’s claim.

CONCEPTS
■ Inference procedure
■ Most likely values of a binomial random variable

VISION
Follow the four-step inference procedure. Use the mean and standard deviation to determine the most likely values of the random variable.

SOLUTION

a. Let X be the number of children (out of the 100 selected) who are living in poverty. X is a binomial random variable with $n = 100$ and $p = 0.60$: $X \sim B(100, 0.60)$. Use Equation 5.8 to find the mean, variance, and standard deviation.

$$
\begin{aligned}
\mu &= np = (100)(0.60) = 60 \\
\sigma^2 &= np(1 - p) = (100)(0.60)(1 - 0.60) = (100)(0.60)(0.40) = 24 \\
\sigma &= \sqrt{\sigma^2} = \sqrt{24} \approx 4.9
\end{aligned}
$$

The expected number of children living in poverty is 60, with a variance of 24 and a standard deviation of approximately 4.9.

b. The report claims $p = 0.60$. This implies that the random variable X has a binomial distribution with $n = 100$ and $p = 0.60$.

Claim: $p = 0.60 \Rightarrow X \sim B(100, 0.60)$.

The experimental outcome is that 55 children are living in poverty.

Experiment: $x = 55$.

Likelihood: From part (a), $\mu = 60$ and $\sigma = 4.9$. Most observations are within three standard deviations of the mean. Therefore, most values of X are in the interval

$$
\begin{aligned}
(\mu - 3\sigma,\ \mu + 3\sigma) &= (60 - 3(4.9),\ 60 + 3(4.9)) \\
&= (60 - 14.7,\ 60 + 14.7) \\
&= (45.3,\ 74.7)
\end{aligned}
$$

Conclusion: Because 55 lies in this interval, it is a reasonable observation. There is no evidence to lead us to doubt the claim of $p = 0.60$.

A CLOSER LOOK

1. Recall: In statistics, we usually measure distance in standard deviations, not miles, feet, inches, meters, or other units. We often want to know how many standard deviations from the mean a given observation is.
2. The inference problem in part (b) of Example 5.17 can also be answered using the tail probability approach. (Try it!) This method leads to the same conclusion and is generally more precise than constructing an interval about the mean. A more formal process for checking claims (hypothesis tests) is introduced in Chapter 9.
3. Whenever we test a claim, there are only two possible conclusions:
   a. There is evidence to suggest the claim is false.
   b. There is no evidence to suggest the claim is false.
   Note that, in either case, we never state with absolute certainty that the claim is true or the claim is false. This is because we never look at the entire population, only at a sample. With a large, random (representative) sample, we can be pretty confident in our conclusion, but never absolutely sure.
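Both approaches to part (b) of Example 5.17 can be sketched side by side in Python. This is an illustration under stated assumptions: the variable names are hypothetical, the interval uses the unrounded standard deviation (so its endpoints differ slightly from those computed with 4.9), and the left tail is used because the observation 55 lies to the left of the mean 60.

```python
from math import comb, sqrt

# Claim: p = 0.60, so X ~ B(100, 0.60). Experiment: x = 55.
n, p, observed = 100, 0.60, 55

# Equation 5.8: mean and standard deviation of a binomial random variable.
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Approach 1: is the observation within three standard deviations of the mean?
low, high = mu - 3 * sigma, mu + 3 * sigma
inside_interval = low < observed < high

# Approach 2: accumulate the left-tail probability P(X <= 55).
left_tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(observed + 1))

print(f"mu = {mu:.1f}, sigma = {sigma:.2f}")
print(f"interval = ({low:.1f}, {high:.1f}), 55 inside: {inside_interval}")
print(f"P(X <= 55) = {left_tail:.4f}")
```

If the printed tail probability is not small (for instance, not below the 0.05 cutoff used in Example 5.16), the tail approach leads to the same conclusion as the interval: no evidence to doubt the claim.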

Technology Corner Procedure: Compute probabilities associated with a binomial random variable. Reconsider: Example 5.15, solution, and interpretations.



CrunchIt!

CrunchIt! has a built-in function to compute probabilities associated with a binomial random variable.

1. Select Distribution Calculator; Binomial. Enter the values for n and p, and select an appropriate inequality symbol or the equals sign. Enter the value for x and click Calculate. See Figure 5.11.
2. To find the probability that X takes on a single value, use the same menu choices and enter values for n and p. Select the equals sign and enter the value for x. Click Calculate. See Figure 5.12.

Figure 5.11 Cumulative probability.

Figure 5.12 The probability mass function.

TI-84 Plus C

Suppose $X \sim B(n, p)$. There are built-in functions to compute a value of the probability mass function (the probability X takes on a single value) and cumulative probability.

1. For cumulative probability, use DISTR; DISTR; binomcdf. Enter values for the number of trials (n), p (p), and x value (x). Position the cursor on Paste and tap ENTER. The appropriate calculator command is copied to the Home screen. Tap ENTER again to compute the resulting probability. Refer to Figure 5.5.
2. To find the probability that X takes on a single value, use DISTR; DISTR; binompdf. Enter values for the number of trials (n), p (p), and x value (x). Position the cursor on Paste and tap ENTER. The appropriate calculator command is copied to the Home screen. Tap ENTER again to compute the resulting probability. Refer to Figure 5.6.

Minitab

There are built-in functions accessed via input windows or the command language to compute a value of the probability mass function (the probability that X takes on a single value) and cumulative probability. The command language may be necessary to perform additional calculations involving probabilities.

1. Select Calc; Probability Distributions; Binomial. Choose Cumulative probability. Enter the Number of trials (n), the Event probability (p), and the Input constant (x). See Figure 5.13.
2. To find the probability that X takes on a single value, select Calc; Probability Distributions; Binomial. Choose Probability. Enter the Number of trials (n), the Event probability (p), and the Input constant (x). See Figure 5.14.

Figure 5.13 Cumulative probability.

Figure 5.14 The probability mass function.

5.4

The Binomial Distribution

221

Excel

There is a single built-in function to find the probability that X takes on a single value or cumulative probability. The last argument of the function is either True for cumulative probability or False for the probability mass function. Additional spreadsheet calculations may be necessary to find the final answer.
1. To find cumulative probability, use the function BINOMDIST. Enter x, n, p, and True. See Figure 5.15.
2. To find the probability that X takes on a single value, use BINOMDIST. Enter x, n, p, and False. See Figure 5.15.

Figure 5.15 Excel function for finding cumulative probability and for evaluating the probability mass function.
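For readers working outside the packages above, the same two computations can be sketched in Python. This is only an illustration under stated assumptions: the helper names binom_pmf and binom_cdf are hypothetical, and the values n = 10 and p = 0.5 are chosen for demonstration and do not come from the text. The two helpers play the roles of the False and True settings of the last BINOMDIST argument.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): the probability mass function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p): sum the pmf over 0, 1, ..., x."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))

# Illustrative values only (an assumption, not taken from the text).
n, p = 10, 0.5

single_value = binom_pmf(5, n, p)       # P(X = 5)
cumulative = binom_cdf(5, n, p)         # P(X <= 5)
upper_tail = 1 - binom_cdf(5, n, p)     # P(X > 5), by the complement rule

print(single_value, cumulative, upper_tail)
```

As with the spreadsheet function, probabilities such as P(X > x) or P(a ≤ X ≤ b) need one extra step: rewrite them in terms of cumulative probability first, then combine calls to binom_cdf.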

SECTION 5.4 EXERCISES

Concept Check

5.71 True/False A binomial random variable is completely described by the number of trials, n.
5.72 True/False There can be three or more outcomes in each trial of a binomial experiment.
5.73 True/False For a binomial random variable, $P(F) = 1 - P(S)$.
5.74 True/False Every probability question about a binomial random variable can be answered using cumulative probability.
5.75 True/False Suppose $X \sim B(n, p)$. Then $E(X) = np$.
5.76 True/False The most common values for a binomial random variable are less than the mean.
5.77 Fill in the Blank The binomial random variable is a count of ______________.
5.78 Fill in the Blank Suppose $X \sim B(n, p)$. The number of outcomes with x successes is ______________.
5.79 Short Answer Name the four properties of a binomial experiment.
5.80 Short Answer Write a probability expression involving a binomial random variable X that represents cumulative probability.

Practice

5.81 Suppose $X \sim B(15, 0.25)$. Find the following probabilities. a. $P(X \le 2)$ b. $P(X < 2)$ c. $P(X = 7)$ d. $P(X > 6)$ e. $P(3 \le X \le 10)$

5.82 Suppose $X \sim B(20, 0.40)$. Find the following probabilities. a. $P(X \ge 12)$ b. $P(X \ne 10)$ c. $P(X \le 15)$ d. $P(2 < X \le 8)$

5.83 Suppose $X \sim B(25, 0.70)$. Find the following probabilities. a. $P(X \ge 1)$ b. $P(X \ge 10)$ c. $P(X \ge 17.5)$ d. $P(10.1 \le X \le 19)$

5.84 Suppose X is a binomial random variable with $n = 25$ and $p = 0.80$. a. Find the mean, variance, and standard deviation of X. b. Find the probability X is within one standard deviation of the mean. c. Find the probability X is more than two standard deviations from the mean.

5.85 Suppose X is a binomial random variable with $n = 30$ and $p = 0.40$. a. Find the mean, variance, and standard deviation of X. b. Find the intervals $\mu \pm \sigma$, $\mu \pm 2\sigma$, and $\mu \pm 3\sigma$. c. Find $P(X > \mu + 3\sigma)$. d. Find $P(X \le \mu - 2\sigma)$.

5.86 Suppose X is a binomial random variable with $n = 10$ and $p = 0.50$. a. Create a table of values of X and associated probabilities. (Hint: This is quick and easy using technology.) b. Use the table in part (a) and the definitions of expected value and variance (Equations 5.1 and 5.3) to find $\mu$, $\sigma^2$, and $\sigma$. c. Use Equation 5.8 to find $\mu$, $\sigma^2$, and $\sigma$. Check these answers with those in part (b).

Applications 5.87 Economics and Finance Approximately 90% of

freshmen at Marquette University receive financial aid.21 Suppose 20 Marquette freshmen are randomly selected. a. Find the probability that at most 15 freshmen receive financial aid. b. Find the probability that at least 12 freshmen receive financial aid.


c. Find the expected number of freshmen who receive

financial aid. d. Suppose at least 15 freshmen receive financial aid. What is the probability that all 20 freshmen receive financial aid? 5.88 Fuel Consumption and Cars The battery

manufacturer Varta sells a car battery with 800 cold-cranking amps and advertises great performance even in bitingly cold weather. Varta claims that after sitting on a frozen Minnesota lake for 10 days at temperatures below 32°F, this battery will still have enough power to start a car. Suppose the actual probability of starting a car following this experiment is 0.75, and 15 randomly selected cars (equipped with this battery) are subjected to these grueling conditions. a. Find the probability that fewer than 10 cars will start. Write a Solution Trail for this problem. b. Find the probability that more than 12 cars will start. c. Suppose 9 cars actually start. Is there any evidence to suggest that the probability of starting a car is different from 0.75? Justify your answer.

5.89 Marketing and Consumer Behavior Levain Bakery, on West 74th Street in New York City, is trying to determine the number of loaves of raisin bread to make each day. Over the past few months the store has baked 50 loaves each day and has sold out with probability 0.80. Suppose the owner continues this practice and 30 days are selected at random. a. What is the expected number of days on which all 50 loaves will be sold? b. Find the probability of selling all 50 loaves on at least 20 days. c. Find the probability of selling all 50 loaves on at most 18 days.

5.90 Sports and Leisure A Six Flags Great Adventure Theme Park now offers a “wild safari” drive-thru with more than 1000 animals roaming freely on over 400 acres. The park claims that the probability of some car damage by an animal during a safari drive-thru is 0.60. Suppose 20 cars are selected at random. a. Find the probability that exactly 10 cars will be damaged. b. Find the probability that at least 15 cars will be damaged. c. Find the probability that no more than 12 cars will be damaged. d. Suppose 19 cars are damaged. Is there any evidence to suggest the claim of 0.60 is false? Justify your answer.
Write a Solution Trail for this problem. 5.91 Public Health and Nutrition Parents tend to be very good at diagnosing their children’s routine medical problems, such as an ear infection, sinus infection, or strep throat. If an ailment is identified correctly, a trip to the doctor’s office may be avoided. A physician may confer with a parent by telephone, and simply call a pharmacy with a prescription for an antibiotic. Suppose parents are correct 90% of the time, and 50 families with a child suffering from some minor illness are selected at random. a. Find the mean, variance, and standard deviation of the number of parents who identify their child’s illness correctly.

b. Find the probability that at least 42 parents are correct. c. Find the probability that between 42 and 47 (inclusive)

parents are correct. d. Suppose 41 parents are actually correct. Is there any

evidence to suggest that fewer than 90% of parents are correct? Justify your answer. 5.92 Public Policy and Political Science

A building inspector enforces building, electrical, mechanical, plumbing, and energy code requirements for the safety and health of people in a certain city, county, or state. In Santa Cruz County, the probability that a building inspector will find at least one code violation at a commercial building is 0.25. Suppose 30 commercial buildings are selected at random. a. Find the mean, variance, and standard deviation of the number of commercial buildings with at least one violation. b. Find the probability that the number of commercial buildings with at least one violation will be within one standard deviation of the mean. c. Find the probability that the number of commercial buildings with at least one violation will be more than two standard deviations from the mean. d. Suppose the actual number of commercial buildings with at least one violation is 10. Is there any evidence to suggest that code violations are found in more than 25% of commercial buildings? Justify your answer.

5.93 Business and Management The Sundance Film Festival is held every January in Park City, Utah. Individuals from the film community determine various awards for independent film makers, and audience awards are also presented. In 2012, approximately 17% of the feature films in the documentary category were directed by women.22 Suppose a sample of 35 documentary films are selected at random. a. What is the probability that exactly five films were directed by women? b. What is the probability that at least eight films were directed by women? c. Suppose three films were directed by women. Is there any evidence to suggest that the proportion of films directed by women is different from 0.17? Justify your answer. 5.94 Demographics and Population Statistics In Illinois, a typical DUI (Driving Under the Influence) offender is a 34-year-old male, arrested between 11 P.M. and 4 A.M. on a weekend, and has a BAC (blood alcohol content) of 0.16. Eighty-five percent of all drivers arrested in Illinois are first-time offenders.23 Suppose 40 people arrested for DUI in Illinois are selected at random. a. What is the probability that at least 12 are first-time offenders? b. What is the probability that between 7 and 10 (inclusive) are first-time offenders? c. Suppose 4 of those arrested are first-time offenders. Is there any evidence to suggest that the proportion of first-time offenders arrested for DUI in Illinois has changed? Justify your answer.


5.95 Demographics and Population Statistics There is some evidence to suggest that businesses are moving out of states where unions are prevalent. In California, 18.4% of all workers belong to a union, and in Arkansas, 3.7%.24 Suppose 20 workers from California and 20 workers from Arkansas are selected at random. a. Find the probability that at most two California workers belong to a union. b. Find the probability that none of the Arkansas workers belongs to a union. c. Find the probability that at least one worker from each state belongs to a union. 5.96 Public Health and Nutrition A recent report suggests that one-third of all U.S. children are overweight or obese. One possible cause is the availability of junk foods and sugary snacks in schools. Approximately 40% of all students buy and eat one or more snacks at school.25 Suppose 40 school children are selected at random. a. Find the mean, variance, and standard deviation of the number of school children (out of the 40 selected) who buy and eat a snack at school. b. Construct intervals one, two, and three standard deviations from the mean. c. Suppose 27 students (out of the 40 selected) buy and eat a snack at school. How many standard deviations from the mean is this observation? What does this distance measure indicate about the likelihood of observing 27 students who buy and eat a snack at school? 5.97 Business and Management According to the CICA Business Monitor, 11% of Canada’s executive chartered accountants are pessimistic about the economy.26 To check this claim, 50 chartered accountants were randomly selected and asked whether they feel pessimistic about the economy in 2013. a. If the claim is true, find the probability that exactly seven chartered accountants are pessimistic about the economy. b. If the claim is true, find the probability that at most four chartered accountants are pessimistic about the economy. c. Suppose 12 chartered accountants are pessimistic about the economy. 
Is there any evidence to suggest that the claim is false? Justify your answer.

Extended Applications 5.98 Business and Management In the movie Lethal Weapon, the character played by Joe Pesci is concerned about drive-thru windows at fast-food restaurants. Suppose the probability that an order at a drive-thru window at a fast-food restaurant will be filled correctly is 0.75. Twenty orders are selected at random. a. What is the probability that exactly 15 orders will be filled correctly? b. What is the probability that at most 12 orders will be filled correctly? c. What is the probability that between 10 and 14 (inclusive) orders will be filled correctly?


d. Suppose two groups of 20 random orders are independently

selected. What is the probability that at least 16 orders will be filled correctly in both groups? 5.99 Sports and Leisure The reality television series

Splash! features celebrities attempting to learn how to dive. The first episode aired in January 2013 and earned a 23.6% audience share. That is, 23.6% of all TVs in use during the show time period were tuned to a station airing Splash!.27 Twenty people who watched TV during that time period were selected at random. a. Find the probability that at least six watched Splash!. b. Find the expected number of people who watched Splash!. Find the probability that the number of people who watched Splash! is less than the mean. c. Suppose that at most four people watched Splash!. What is the probability that no one watched Splash!? 5.100 Marketing and Consumer Behavior More children

are being rushed to the hospital because they were able to push down and twist the cap on a medication bottle and were poisoned by a common drug. A recent research study suggested that 25% of all preschool children can open a medication bottle. Suppose 10 preschool children are selected at random. Let the random variable X be the number of children who can open the bottle. a. Construct a probability histogram for the random variable X. b. Find the mean, variance, and standard deviation of X. Indicate the mean on the graph from part (a). c. Find $P(\mu - \sigma \le X \le \mu + \sigma)$ and indicate this probability on the graph from part (a). d. Suppose these 10 children try to open a medicine bottle with a new design for the cap and one child is able to open the bottle. Is there any evidence to suggest that the new cap is more effective in stopping children from opening the bottle? Justify your answer.

5.101 Manufacturing and Product Development A company has developed a very inexpensive explosive-detection machine for use at airports. However, if an explosive is actually in a suitcase, the probability of it being detected by this machine is only 0.60. Therefore, several of these machines will be used simultaneously to screen each piece of luggage independently. Suppose a piece of luggage actually contains an explosive. a. If three machines screen this luggage, what is the probability that exactly one will detect the explosive? What is the probability that none of the three will detect the explosive? b. If four machines screen this luggage, what is the probability that at least one device will detect the explosive? c. If five machines screen this luggage, what is the probability that at least one device will detect the explosive? d. How many machines are necessary for screening in order to be certain that at least one device will detect the explosive with probability 0.999 or greater?


5.102 Marketing and Consumer Behavior Forever 21 sells women’s flip-flops in (oddly enough) 21 different colors. Despite this vast available color selection, 50% of all flip-flop purchases are in white. Suppose 30 buyers are selected at random. a. Find the mean, variance, and standard deviation of the number of buyers who purchase white flip-flops. b. Find the probability that the number of white flip-flops purchased will be within two standard deviations of the mean. Compare this with the predicted result from Chebyshev’s rule. c. Suppose two groups of 30 customers are independently selected. What is the probability of at least one group having exactly 15 people who buy white flip-flops?

5.103 Manufacturing and Product Development In early 2013 there was a worldwide glut of solar panels. This forced prices down and made this form of energy production affordable to many more people. This oversupply of solar panels may have caused some manufacturers to cut corners. Sainty Solar is a leading producer and claims the proportion of its solar panels that are defective is 0.02. Solar Solutions, a leading installer, receives a shipment of 50,000 solar panels. Before accepting the entire lot, Solar Solutions selects a random sample of 25 panels and thoroughly tests each one. If four or more panels are found to be defective, the entire shipment will be sent back. Otherwise the shipment will be accepted. a. Suppose the claim is true: The actual proportion of defectives is $p = 0.02$. What is the probability that the shipment will be rejected? (This is one type of error probability. The company would be making a mistake if this event occurred. It would be rejecting the shipment when the proportion of defectives is as claimed.) b. Suppose the actual proportion of defectives is $p = 0.05$. What is the probability that the shipment will be accepted? (This is another type of error probability. In this case, the company would also be making a mistake. It would be accepting the shipment when the proportion of defectives is too high.) c. Suppose the actual proportion of defectives is $p = 0.07$. What is the probability that the shipment will be accepted?

5.104 Psychology and Human Behavior As a result of stricter training requirements, fewer big fires, higher-paying jobs in cities, and changes in society, the number of volunteer firefighters is declining. Approximately three-fourths of all firefighters in the United States are volunteers, and the total number of volunteer firefighters has decreased steadily over the last two decades. Suppose 30 U.S. firefighters are selected at random. a. Find the probability that exactly 22 of the firefighters are volunteers. b. Find the probability that more than 25 of the firefighters are volunteers. c. Suppose 17 of the firefighters are volunteers. Is there any evidence to suggest that the proportion of volunteer firefighters has decreased? Justify your answer. d. Suppose 50 firefighters are selected at random from the West and 50 firefighters are selected from the Northeast. What is the probability that at least 40 of the firefighters will be volunteers in both groups?

5.105 Sports and Leisure Most ski resorts operate beginner, intermediate, and advanced terrain in order to appeal to people with varying abilities. The table below lists several ski areas in Canada and the proportion of skiers who attempt the advanced terrain during their visit.

Ski area         Probability
Big White        0.28
Kicking Horse    0.60
Norquay          0.44

Suppose 20 skiers are randomly selected from each area. a. Find the probability that exactly five skiers at Big White will attempt the advanced terrain. b. Find the probability that more than eight skiers at Kicking Horse will attempt the advanced terrain. c. Find the probability that between 12 and 16 (inclusive) skiers at Norquay will attempt the advanced terrain. d. Find the probability that at most five skiers at all three locations will attempt the advanced terrain.

5.5 Other Discrete Distributions

There are many other common discrete probability distributions. This section presents three of these distributions along with brief background, properties, and examples. Remember that many of the problems involving these distributions are solved using the same general technique:
1. Define a random variable and identify its probability distribution (distribution statement).
2. Translate the words into a probability question where the event is stated in terms of the random variable (probability statement).
3. If necessary, try to convert the probability statement into an equivalent expression involving cumulative probability. Use tables and technology wherever possible.


The geometric distribution is closely related to the binomial distribution. In a binomial experiment, n (the number of trials) is fixed and the number of successes varies. The binomial random variable is the number of successes in n trials. In a geometric experiment, the number of successes is fixed at 1, and the number of trials varies.

Properties of a Geometric Experiment
1. The experiment consists of identical trials.
2. Each trial can result in only one of two possible outcomes: a success (S) or a failure (F).
3. The trials are independent.
4. The probability of a success, p, is constant from trial to trial.

The experiment ends when the first success is obtained.

The Geometric Random Variable The geometric random variable is the number of trials necessary to realize the first success.

Think of an experiment in which you continue to phone a friend until you get through. The number of calls necessary until the first success (reaching your friend) is the value of a geometric random variable. The derivation of the probability distribution involves the properties given above. Let X be a geometric random variable, the number of trials until the first success (including the trial on which the success is obtained). Given p, the probability of a success, find the probability of needing x trials, $P(X = x) = p(x)$.

$$P(X = 1) = P(S) = p$$

$X = 1$ means the first trial results in a success, and the experiment is over. The probability of a success is simply p.

$$P(X = 2) = P(F \cap S) = P(F) \cdot P(S) = (1 - p)p$$

$X = 2$ means the first trial is a failure and the second trial is a success. Because trials are independent, we multiply the corresponding probabilities.

$$P(X = 3) = P(F \cap F \cap S) = P(F) \cdot P(F) \cdot P(S) = (1 - p)(1 - p)p = (1 - p)^2 p$$

$X = 3$ means the first two trials are failures and the third trial is a success. We use independence again, and multiply the corresponding probabilities.

Why isn’t FSF a possible outcome in a geometric experiment?

$$P(X = 4) = P(F \cap F \cap F \cap S) = P(F) \cdot P(F) \cdot P(F) \cdot P(S) = (1 - p)(1 - p)(1 - p)p = (1 - p)^3 p$$

$X = 4$ means the first three trials are failures and the fourth trial is a success. We use independence again, and multiply the corresponding probabilities. In general,

$$P(X = x) = \underbrace{P(F) \cdot P(F) \cdots P(F)}_{x - 1 \text{ failures}} \cdot P(S) = \underbrace{(1 - p)(1 - p) \cdots (1 - p)}_{x - 1 \text{ terms}} \cdot p = (1 - p)^{x-1} p$$


$X = x$ means the first $x - 1$ trials are failures and the xth trial is the first success. This generalization is the formula for the probability distribution.

The Geometric Probability Distribution

Suppose X is a geometric random variable with probability of a success p. Then

$$p(x) = P(X = x) = (1 - p)^{x-1} p, \qquad x = 1, 2, 3, \ldots \tag{5.9}$$

$$\mu = \frac{1}{p} \quad \text{and} \quad \sigma^2 = \frac{1 - p}{p^2} \tag{5.10}$$

A CLOSER LOOK

1. The geometric random variable is discrete. The number of possible values is countably infinite: 1, 2, 3, . . . .
2. The geometric distribution is completely characterized, or defined, by one parameter, p.
3. We do not need a table to find cumulative probabilities associated with a geometric random variable because there is an easy formula for computing these values. If X is a geometric random variable with probability of success p, then

$$P(X \le x) = 1 - (1 - p)^x \tag{5.11}$$

4. Equation 5.9 is a valid probability distribution. Each probability is between 0 and 1, and the sum of all the probabilities is an infinite series. The sum

$$\sum_{x=1}^{\infty} P(X = x) = \sum_{x=1}^{\infty} (1 - p)^{x-1} p$$

is called a geometric series and it does sum to 1!

There is a formula for the sum of a geometric series. Can you use it to show that this sum is 1?

Example 5.18 Bekins Men are Careful, Quick, and Kind

The number of people who changed residences in the United States has declined steadily since 1985. However, perhaps due to the sluggish economy, the U.S. Census Bureau estimates that approximately 12% of all people changed residences in 2012.28 Suppose researchers at Bekins randomly call people in the United States and ask if they have moved in the last year.
a. What is the probability that the fourth person called will be the first to have moved in the past year?
b. What is the probability that it will take at least six calls before speaking to someone who has moved in the past year?

Several years ago the Bekins Moving Company used a clever jingle in many of their advertisements. The song started, “Bekins men are careful, quick, and kind, Bekins takes a load off of your mind. . . .”

Solution Trail 5.18
KEYWORDS: First person to have moved
TRANSLATION: First success
CONCEPTS: Geometric probability distribution
VISION: Consider a geometric experiment. Each trial is a call to ask if the person has moved in the past year, $P(S) = P(\text{moved}) = 0.12$, P(S) is the same on each trial, and the trials are independent. Define the geometric random variable and write a probability statement for each part.

SOLUTION

a. Let X be the number of calls necessary until the first mover is found. X is a geometric random variable with $P(S) = 0.12 = p$. The probability that the first mover (success) is found on the fourth call:

$$\begin{aligned}
P(X = 4) &= (1 - p)^{4-1} p && \text{Equation 5.9.} \\
&= (1 - 0.12)^3 (0.12) && \text{Use } p = 0.12. \\
&= (0.88)^3 (0.12) = 0.0818
\end{aligned}$$

The probability that the first mover is found on the fourth call is 0.0818.

b. At least six calls before speaking to someone who has moved in the past year means the first success will occur on the sixth call or later.


The probability that at least six calls will be needed is

$$\begin{aligned}
P(X \ge 6) &= 1 - P(X < 6) && \text{The complement rule.} \\
&= 1 - P(X \le 5) && \text{The first value X takes on that is less than 6 is 5.} \\
&= 1 - [1 - (1 - p)^5] && \text{Use Equation 5.11.} \\
&= 1 - [1 - (0.88)^5] && \text{Use } p = 0.12. \\
&= 1 - 0.4723 = 0.5277 && \text{Expand and simplify.}
\end{aligned}$$

The probability that it will take six or more calls to find the first mover is 0.5277. Figures 5.16 and 5.17 show technology solutions.
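The arithmetic in Example 5.18 can also be checked with a few lines of Python. This is a hedged sketch, not one of the text's technology solutions: geom_pmf and geom_cdf are hypothetical helper names that implement Equations 5.9 and 5.11 directly, and the last step compares the closed-form cumulative probability with a term-by-term sum of the probability mass function.

```python
def geom_pmf(x, p):
    """P(X = x) = (1 - p)**(x - 1) * p for x = 1, 2, 3, ... (Equation 5.9)."""
    return (1 - p) ** (x - 1) * p

def geom_cdf(x, p):
    """P(X <= x) = 1 - (1 - p)**x (Equation 5.11)."""
    return 1 - (1 - p) ** x

p = 0.12  # probability that a person called has moved in the past year

# Part (a): the first success occurs on the fourth call.
prob_a = geom_pmf(4, p)

# Part (b): at least six calls are needed, using the complement rule.
prob_b = 1 - geom_cdf(5, p)

# Consistency check: Equation 5.11 agrees with summing Equation 5.9 over x = 1..5.
cdf_by_sum = sum(geom_pmf(k, p) for k in range(1, 6))

print(round(prob_a, 4))  # 0.0818
print(round(prob_b, 4))  # 0.5277
print(abs(cdf_by_sum - geom_cdf(5, p)) < 1e-12)  # True
```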

Figure 5.16 $P(X = 4)$ using the probability mass function.

Figure 5.17 $P(X \ge 6)$ using cumulative probability.

TRY IT NOW GO TO EXERCISE 5.118

The Poisson distribution is named after the French mathematician Siméon Denis Poisson (1781–1840).

The Poisson probability distribution has many practical applications and is often associated with rare events. A Poisson random variable is a count of the number of occurrences of a certain event in a given unit of time, space, volume, distance, etc., for example, the number of arrivals to a hospital Emergency Room in a certain 30-minute period, the number of asteroids that pass through Earth’s orbit during a given year, or the number of bacteria in a milliliter of drinking water.

Properties of a Poisson Experiment
1. The probability that a single event occurs in a given interval (of time, length, volume, etc.) is the same for all intervals.
2. The number of events that occur in any interval is independent of the number that occur in any other interval.

These properties are often referred to as a Poisson process and can be difficult to verify.

The Poisson Random Variable The Poisson random variable is a count of the number of times the specific event occurs during a given interval.

The Poisson distribution is completely determined by the mean, denoted by the Greek letter lambda, $\lambda$. Because the Poisson distribution is often used to count rare events, the mean number of events per interval is usually small. The probability distribution is given below.


The Poisson Probability Distribution

Suppose X is a Poisson random variable with mean $\lambda$. Then

$$p(x) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x = 0, 1, 2, 3, \ldots \tag{5.12}$$

$$\mu = \lambda \qquad \sigma^2 = \lambda \tag{5.13}$$

A CLOSER LOOK

1. The Poisson random variable is discrete. The number of possible values is countably infinite: 0, 1, 2, 3, . . . .
2. The Poisson distribution is completely characterized by only one parameter, $\lambda$. The mean and the variance are both equal to the same value, $\lambda$.
3. Equation 5.12 is a valid probability distribution. All of the probabilities are between 0 and 1, and the sum of all the probabilities is 1 (another infinite series).
4. e in Equation 5.12 is the base of the natural logarithm. $e \approx 2.71828$ is an irrational number, and most calculators have this special constant built in.
5. The denominator of Equation 5.12 contains x! (x factorial). Recall: $x! = x(x - 1)(x - 2) \cdots (3)(2)(1)$ and $0! = 1$.
6. Table 2 in Appendix A contains values for $P(X \le x)$ (cumulative probability) for various values of $\lambda$.

Example 5.19 Look Out Below!

Skydiving is the ultimate thrill for some, and there were over 3.1 million jumps in 2012. Despite an improved safety record, there are approximately two skydiving fatalities each month.29 Suppose this is the mean number of fatalities per month and a random month is selected.
a. Find the probability that exactly three fatalities will occur.
b. Find the probability that at least five fatalities will occur.
c. Find the probability that the number of fatalities will be within one standard deviation of the mean.

Solution Trail 5.19
KEYWORDS: Mean number of fatalities per month; Two each month
TRANSLATION: Fixed time; $\lambda = 2$
CONCEPTS: Poisson probability distribution
VISION: Consider a Poisson distribution. The probability of a single fatality is the same every month, and the number of fatalities in any month is independent of the number of fatalities in any other month. The mean of the Poisson distribution is given.

SOLUTION

Let X be the number of skydiving fatalities per month. X has a Poisson distribution with $\lambda = 2$.

a. The probability of exactly three means $P(X = 3)$.

$$P(X = 3) = \frac{e^{-2} 2^3}{3!} = 0.1804 \qquad \text{Use Equation 5.12.}$$

Or, convert to cumulative probability:

$$\begin{aligned}
P(X = 3) &= P(X \le 3) - P(X \le 2) && \text{Convert to cumulative probability.} \\
&= 0.8571 - 0.6767 = 0.1804 && \text{Use Table II in the Appendix.}
\end{aligned}$$

b. At least five means five or more: $X \ge 5$.

$$\begin{aligned}
P(X \ge 5) &= 1 - P(X < 5) && \text{The complement rule.} \\
&= 1 - P(X \le 4) && \text{The first value X takes on that is less than 5 is 4.} \\
&= 1 - 0.9473 = 0.0527 && \text{Use Table II in the Appendix.}
\end{aligned}$$


c. Within one standard deviation of the mean is the interval $(\mu - \sigma, \mu + \sigma)$.

$$\mu = 2 = \sigma^2 \;\rightarrow\; \sigma = \sqrt{2} = 1.4142$$

$$\begin{aligned}
P(\mu - \sigma \le X \le \mu + \sigma) &= P(2 - 1.4142 \le X \le 2 + 1.4142) && \text{Use values for } \mu \text{ and } \sigma. \\
&= P(0.5858 \le X \le 3.4142) && \text{Compute the difference and sum.} \\
&= P(1 \le X \le 3) && \text{Use properties of the Poisson distribution.} \\
&= P(X \le 3) - P(X \le 0) && \text{Convert to cumulative probability.} \\
&= 0.8571 - 0.1353 && \text{Use Table II in the Appendix.} \\
&= 0.7218 && \text{Compute the difference.}
\end{aligned}$$

Figures 5.18 through 5.20 show technology solutions.
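As a cross-check on the table lookups in Example 5.19, the three answers can be recomputed directly from Equation 5.12. The sketch below is illustrative only: poisson_pmf and poisson_cdf are hypothetical helper names, and the cumulative probability is obtained by summing the probability mass function, which stands in for Table II.

```python
from math import ceil, exp, factorial, floor, sqrt

def poisson_pmf(x, lam):
    """P(X = x) = e**(-lam) * lam**x / x! for x = 0, 1, 2, ... (Equation 5.12)."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), found by summing the pmf over 0, 1, ..., x."""
    return sum(poisson_pmf(k, lam) for k in range(0, x + 1))

lam = 2  # mean number of skydiving fatalities per month

# Part (a): exactly three fatalities.
prob_a = poisson_pmf(3, lam)

# Part (b): at least five fatalities, by the complement rule.
prob_b = 1 - poisson_cdf(4, lam)

# Part (c): within one standard deviation of the mean.
sigma = sqrt(lam)               # the variance equals lam, so sigma = sqrt(lam)
low = ceil(lam - sigma)         # smallest integer value in the interval: 1
high = floor(lam + sigma)       # largest integer value in the interval: 3
prob_c = poisson_cdf(high, lam) - poisson_cdf(low - 1, lam)

print(round(prob_a, 4))  # 0.1804
print(round(prob_b, 4))  # 0.0527
print(round(prob_c, 4))  # 0.7218
```

Rounding the interval endpoints inward with ceil and floor is the code equivalent of the step "Use properties of the Poisson distribution": X takes only whole-number values, so only 1, 2, and 3 lie between 0.5858 and 3.4142.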

Figure 5.18 $P(X = 3)$.

Figure 5.19 $P(X \ge 5)$.

Figure 5.20 $P(1 \le X \le 3)$.

TRY IT NOW GO TO EXERCISE 5.119

The hypergeometric probability distribution arises from an experiment in which there is sampling without replacement from a finite population. Each element in the population is labeled a success or failure. The hypergeometric random variable is a count of the number of successes in the sample. For example, consider a shipment of 12 automobile tires, of which two are defective, and a random sample of four tires. A hypergeometric random variable may be defined as a count of the number of good tires selected.

Properties of a Hypergeometric Experiment
1. The population consists of N objects, of which M are successes and $N - M$ are failures.
2. A sample of n objects is selected without replacement.
3. Each sample of size n is equally likely.

The Hypergeometric Random Variable The hypergeometric random variable is a count of the number of successes in a random sample of size n.

The hypergeometric probability distribution is completely determined by n, N, and M. The probability of obtaining x successes is derived using many concepts introduced earlier: independence, the multiplication rule, equally likely outcomes, and combinations.


The Hypergeometric Probability Distribution
Suppose X is a hypergeometric random variable characterized by sample size n, population size N, and number of successes M. Then

\[
p(x) = P(X = x) = \frac{\dbinom{M}{x}\dbinom{N-M}{n-x}}{\dbinom{N}{n}},
\qquad \max(0,\, n - N + M) \le x \le \min(n,\, M)
\tag{5.14}
\]

\[
\mu = n\,\frac{M}{N}, \qquad
\sigma^2 = \left(\frac{N-n}{N-1}\right) n\,\frac{M}{N}\left(1 - \frac{M}{N}\right)
\tag{5.15}
\]
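As an illustration of the mean and variance formulas, consider the tire shipment described earlier: N = 12 tires, of which two are defective, and a sample of n = 4. Counting good tires as successes gives M = 10. The arithmetic below is a worked sketch under those values only; the decimals are rounded to four places.

\[
\mu = n\,\frac{M}{N} = 4 \cdot \frac{10}{12} \approx 3.3333
\]

\[
\sigma^2 = \left(\frac{N-n}{N-1}\right) n\,\frac{M}{N}\left(1 - \frac{M}{N}\right)
= \left(\frac{12-4}{12-1}\right)(4)\left(\frac{10}{12}\right)\left(\frac{2}{12}\right)
\approx 0.4040,
\qquad \sigma \approx 0.6356
\]

On average, about 3.33 of the four tires selected are good, which is reasonable because most of the shipment is good.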

A CLOSER LOOK

1. Here is an explanation for the strange restriction on the possible values for the random variable X.
   max(0, n − N + M) ≤ x: x must be at least 0 or n − N + M, whichever is bigger. There are only N − M failures in the population, so if n − N + M is positive, it is impossible to obtain fewer than n − N + M successes.
   x ≤ min(n, M): x can be at most n or M, whichever is smaller. The greatest number of successes possible is either n or the total number of successes in the population.
   Suppose n = 5, N = 10, and M = 6. Then
   max(0, n − N + M) = max(0, 5 − 10 + 6) = max(0, 1) = 1 and min(n, M) = min(5, 6) = 5,
   so 1 ≤ x ≤ 5. It is impossible to obtain fewer than 1 success. Also, the greatest number of successes possible is 5.
2. The hypergeometric random variable is discrete. All of the probabilities are between 0 and 1, and the probabilities do sum to 1.

Solution Trail 5.20

KEYWORDS
■ 10 twin-size comforters
■ 2 stitched incorrectly
■ 4 selected at random

TRANSLATION
■ N = 10 (finite population)
■ 2 failures, therefore M = 8 successes
■ Sample size n = 4

CONCEPTS
■ Hypergeometric probability distribution

VISION
All four comforters are selected at random without replacement, and there are 10 comforters to choose from. Each comforter selected is either a success (stitched correctly) or a failure. Consider a hypergeometric distribution.

3. Recall: $\binom{n}{r}$ is a combination. The number of combinations of size r is given by

\[
{}_nC_r = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}
\]
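The three remarks above can be checked together with a short computation. This is a minimal sketch using the values n = 5, N = 10, and M = 6 from the first remark; the function name `hypergeom_pmf` is an illustrative choice, and Python's `math.comb` supplies the combination.

```python
from math import comb

def hypergeom_pmf(x: int, n: int, N: int, M: int) -> float:
    """P(X = x) from Equation 5.14: x successes in a sample of size n drawn
    without replacement from N objects, of which M are successes."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

n, N, M = 5, 10, 6             # values from the illustration above
lowest = max(0, n - N + M)     # fewest successes possible
highest = min(n, M)            # most successes possible
support = range(lowest, highest + 1)

probs = {x: hypergeom_pmf(x, n, N, M) for x in support}
total = sum(probs.values())                  # should be 1
mean = sum(x * p for x, p in probs.items())  # should agree with n * M / N

print(lowest, highest)
print(round(total, 10), round(mean, 10), n * M / N)
```

Because `math.comb(4, 5)` is 0, the formula also assigns probability 0 to x = 0, which agrees with the lower bound of 1.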

Example 5.20 Hello Kitty
A Target store has 10 Hello Kitty twin-size comforters for sale. Two of the 10 comforters have been stitched incorrectly at the factory and will split open when used. Suppose four of the comforters are randomly selected.
a. What is the probability that exactly two comforters will be stitched correctly?
b. What is the probability that at least three comforters will be stitched correctly?

SOLUTION Let X be the number of successes in the sample. X has a hypergeometric distribution with n = 4, N = 10, and M = 8. Translate each question into a probability statement, convert to cumulative probability if necessary, and use Equation 5.14 and/or technology.

a. Exactly two means X = 2.

\[
\begin{aligned}
P(X = 2) &= \frac{\dbinom{M}{x}\dbinom{N-M}{n-x}}{\dbinom{N}{n}}
= \frac{\dbinom{8}{2}\dbinom{10-8}{4-2}}{\dbinom{10}{4}} && \text{Use Equation 5.14.} \\
&= \frac{\dbinom{8}{2}\dbinom{2}{2}}{\dbinom{10}{4}}
= \frac{(28)(1)}{210} = 0.1333 && \text{Use the formula for a combination.}
\end{aligned}
\]

In the numerator, from the 8 good comforters, choose 2; from the 2 bad comforters, choose 2. In the denominator, $\binom{10}{4}$ is the total number of ways to choose 4 comforters from 10.

The probability of selecting exactly two correctly stitched comforters is 0.1333.

b. At least three means three or more. The maximum number of successes is min(n, M) = min(4, 8) = 4. In this case, three or more means 3 or 4.

\[
\begin{aligned}
P(X \ge 3) &= P(X = 3) + P(X = 4) && \text{Consider the values } X \text{ can assume that are} \ge 3. \\
&= \frac{\dbinom{8}{3}\dbinom{2}{1}}{\dbinom{10}{4}} + \frac{\dbinom{8}{4}\dbinom{2}{0}}{\dbinom{10}{4}} && \text{Use Equation 5.14.} \\
&= \frac{(56)(2)}{210} + \frac{(70)(1)}{210} && \text{Use the formula for a combination.} \\
&= 0.5333 + 0.3333 = 0.8666
\end{aligned}
\]

Note: This problem can also be solved using cumulative probability: P(X ≥ 3) = 1 − P(X ≤ 2).

Figures 5.21 and 5.22 show technology solutions.

Figure 5.21 Minitab session window output using the Hypergeometric Distribution input window.

Figure 5.22 P(X ≥ 3) using the Minitab command language.

TRY IT NOW GO TO EXERCISE 5.123
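The hand calculations in Example 5.20 can also be reproduced with exact fractions. This is a sketch only, not the Minitab approach shown in the figures; `hypergeom_pmf` is an illustrative helper name, and `fractions.Fraction` is used so that no rounding occurs until the final conversion to a decimal.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x: int, n: int, N: int, M: int) -> Fraction:
    """Exact P(X = x) from Equation 5.14, returned as a fraction."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

n, N, M = 4, 10, 8   # sample size, population size, correctly stitched comforters

# Part (a): exactly two stitched correctly.
p_exactly_two = hypergeom_pmf(2, n, N, M)

# Part (b): at least three, computed directly and by the complement rule.
p_at_least_three = hypergeom_pmf(3, n, N, M) + hypergeom_pmf(4, n, N, M)
lowest = max(0, n - N + M)   # smallest possible value of X (here 2)
p_at_most_two = sum(hypergeom_pmf(x, n, N, M) for x in range(lowest, 3))

print(p_exactly_two, float(p_exactly_two))
print(p_at_least_three, float(p_at_least_three))
print(1 - p_at_most_two == p_at_least_three)
```

The exact answer to part (b) is 182/210 = 13/15, or 0.8667 to four decimal places. The value 0.8666 in the hand calculation comes from adding the two terms after each has been rounded to four places.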

Technology Corner Procedure: Compute probabilities associated with a geometric, Poisson, or hypergeometric distribution. Reconsider: Examples 5.18, 5.19, and 5.20, solutions, and interpretations.

CrunchIt! CrunchIt! has a built-in function to compute probabilities associated with a geometric and a Poisson variable. The CrunchIt! geometric random variable is defined to be the number of failures until the first success is achieved.
1. Suppose X is a geometric random variable with P(S) = p. Select Distribution Calculator; Geometric. Enter the value for p and select an appropriate inequality symbol or the equals sign. Enter the value for x − 1 and click Calculate. See Figures 5.23 and 5.24.


Figure 5.23 The probability mass function; use x − 1.

Figure 5.24 Right-tail probability.

2. Suppose X is a Poisson random variable with mean λ. Select Distribution Calculator; Poisson. Enter the value for lambda and select an appropriate inequality symbol or the equals sign. Enter the value for x and click Calculate. See Figures 5.25 and 5.26.

Figure 5.25 The probability mass function.

Figure 5.26 Cumulative probability.

TI-84 Plus C There are built-in functions to compute cumulative probability and to evaluate the probability mass function associated with a geometric and Poisson random variable. Use the built-in function for combinations to compute probabilities associated with a hypergeometric random variable.
1. Suppose X is a geometric random variable with P(S) = p. Use the functions in the DISTR ; DISTR menu. Use the function geometpdf to find the probability that X takes on a single value and geometcdf to find cumulative probability. In each case, enter a value of p (p) and x value (x). Position the cursor on Paste and tap ENTER . The appropriate calculator command is copied to the Home screen. Tap ENTER again to compute the desired probability. Refer to Figures 5.16 and 5.17, page 227.
2. Suppose X is a Poisson random variable with mean λ. Use the functions in the DISTR ; DISTR menu. Use the function poissonpdf to find the probability that X takes on a single value and poissoncdf to find cumulative probability. In each case, enter a value of λ (λ) and x value (x). Position the cursor on Paste and tap ENTER . The appropriate calculator command is copied to the Home screen. Tap ENTER again to compute the desired probability. Refer to Figures 5.18 through 5.20, page 229.

Minitab There are built-in functions to compute cumulative probability and to evaluate the probability mass function associated with a geometric, Poisson, or hypergeometric random variable. These functions may be accessed through a graphical input window: Calc; Probability Distributions, or by using the command language.
1. Suppose X is a geometric random variable with P(S) = p. In a session window, use the commands PDF or CDF to evaluate the probability mass function or to compute cumulative probability. See Figures 5.27 and 5.28.

Figure 5.27 P(X = 4) in Example 5.18.

Figure 5.28 P(X ≥ 6) in Example 5.18.

2. Suppose X is a Poisson random variable with mean λ. In a session window, use the command PDF or CDF to evaluate the probability mass function or to compute cumulative probability. See Figures 5.29 and 5.30.

Figure 5.29 P(X = 3) in Example 5.19.

Figure 5.30 P(X ≥ 5) in Example 5.19.

3. Suppose X is a hypergeometric random variable with parameters n, N, and M. In a session or Calc; Probability Distributions input window, evaluate the probability mass function or compute cumulative probability. Refer to Figures 5.21 and 5.22, page 231.

Excel There are built-in functions to compute cumulative probability and to evaluate the probability mass function associated with a Poisson and a hypergeometric random variable. Use the built-in function for a binomial distribution, BINOM.DIST, to find probabilities associated with the geometric distribution.
1. Suppose X is a geometric random variable with P(S) = p. Use the following formulas:
   P(X = x) = BINOM.DIST(1, x, p, FALSE)/x
   P(X ≤ x) = 1 − BINOM.DIST(0, x, p, FALSE)
   See Figure 5.31.
2. Suppose X is a Poisson random variable with mean λ. To evaluate the probability mass function, use the function POISSON.DIST with the last argument set to False. To compute cumulative probability, use the function POISSON.DIST with the last argument set to True. See Figure 5.32.
3. Suppose X is a hypergeometric random variable with parameters n, N, and M. To evaluate the probability mass function, use the function HYPGEOM.DIST with the last argument set to False. To compute cumulative probability, use the function HYPGEOM.DIST with the last argument set to True. See Figure 5.33.
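The two Excel formulas for the geometric distribution rest on identities between the binomial and geometric probability mass functions: a binomial probability of exactly one success in x trials is x·p·(1 − p)^(x−1), and a binomial probability of no successes in x trials is (1 − p)^x. The sketch below checks both identities numerically. It does not call Excel; `binom_pmf` is a hypothetical stand-in for the quantity BINOM.DIST(k, n, p, FALSE) returns, and p = 0.35 is an arbitrary illustrative value.

```python
import math
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial P(Y = k) for n trials; stands in for BINOM.DIST(k, n, p, FALSE)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(x: int, p: float) -> float:
    """Geometric P(X = x): the first success occurs on trial x."""
    return (1 - p)**(x - 1) * p

def geom_cdf(x: int, p: float) -> float:
    """Geometric P(X <= x), summed directly from the probability mass function."""
    return sum(geom_pmf(k, p) for k in range(1, x + 1))

p = 0.35  # illustrative probability of success (an assumption for this check)

# P(X = x) = BINOM.DIST(1, x, p, FALSE) / x
pmf_matches = all(
    math.isclose(binom_pmf(1, x, p) / x, geom_pmf(x, p)) for x in range(1, 11)
)
# P(X <= x) = 1 - BINOM.DIST(0, x, p, FALSE)
cdf_matches = all(
    math.isclose(1 - binom_pmf(0, x, p), geom_cdf(x, p)) for x in range(1, 11)
)
print(pmf_matches, cdf_matches)
```

The same two equalities are the subject of Exercise 5.137, which asks for a general explanation.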

Figure 5.31 Probabilities associated with a geometric random variable.

Figure 5.32 Probabilities associated with a Poisson random variable.

Figure 5.33 Probabilities associated with a hypergeometric random variable.


Note: Similar calculations are available using JMP. See Figures 5.34–5.36.

Figure 5.34 Probabilities associated with a geometric random variable.

Figure 5.35 Probabilities associated with a Poisson random variable.

Figure 5.36 Probabilities associated with a hypergeometric random variable.

SECTION 5.5 EXERCISES

Concept Check

5.106 True/False In a geometric experiment, the probability of a success varies from trial to trial.

5.107 True/False A geometric experiment ends when the first success is observed.

5.108 True/False The number of possible values for a geometric random variable is infinite.

5.109 True/False For a Poisson random variable, the mean is equal to the variance.

5.110 Fill in the Blank A Poisson random variable is often used to count ______________.

5.111 Short Answer Explain the difference between a hypergeometric experiment and a binomial experiment.

Practice

5.112 Suppose X is a geometric random variable with probability of success 0.35. Find the following probabilities.
a. P(X = 4) b. P(X ≥ 3) c. P(X ≤ 2) d. P(X ≥ μ)

5.113 Suppose X is a geometric random variable with mean μ = 4. Find the following probabilities.
a. P(X = 1) b. P(3 ≤ X ≤ 7) c. P(X > μ + 2σ)

5.114 Suppose X is a Poisson random variable with λ = 2. Find the following probabilities.
a. P(X = 0) b. P(2 ≤ X ≤ 8) c. P(X > 5) d. P(X ≤ 6)

5.115 Suppose X is a Poisson random variable with λ = 4.5. Find the following probabilities.
a. P(X > μ) b. P(X = 2) c. The probability X is either 4 or 5. d. P(X ≤ μ + 2σ)

5.116 Suppose X is a hypergeometric random variable with n = 5, N = 12, and M = 6.
a. Find P(X = 2).
b. Find P(X = 5).
c. Find the mean, variance, and standard deviation of X.

5.117 Suppose X is a hypergeometric random variable with sample size 8, population size 16, and number of successes in the population 12.
a. List the possible values for X.
b. Find the mean, variance, and standard deviation of X.
c. Find P(X = 5).
d. Find P(X = 8).

Applications

5.118 According to the Anxiety and Depression Association of America (ADAA), approximately 4% of all adults have attention-deficit/hyperactive disorder (ADHD).30 An experiment consists of selecting adults at random and asking them if they have ADHD.
a. What is the probability that the fifth adult selected will be the first with ADHD?
b. What is the probability that at least eight adults will be selected before identifying a person with ADHD?
c. What is the mean number of adults that must be selected before identifying a person with ADHD?
d. Suppose the 35th adult is the first with ADHD. Is there any evidence to suggest the ADAA claim is false? Justify your answer.

5.119 Psychology and Human Behavior According to recent FBI statistics, the mean number of bank robberies per day in the Southern Region of the United States is 4.32.31 Suppose a day is selected at random.
a. What is the probability of exactly two bank robberies in the Southern Region? Write a Solution Trail for this problem.
b. What is the probability that there will be more than eight bank robberies on that day in the Southern Region?
c. Suppose two days are selected at random. What is the probability that there will be no robberies in the Southern Region on both days?

5.120 Technology and the Internet According to a recent study, 30% of all computers in the United States are infected with some form of malware.32 Suppose a computer repair specialist carefully checks every machine left at his store.
a. What is the probability that the second computer examined will be the first to have malware? Write a Solution Trail for this problem.
b. What is the probability that the tenth machine examined will be the first to have malware?
c. What is the mean number of computers examined before one will be infected with malware?
d. What is the probability that at least five computers will be examined before one will have malware?

5.121 Physical Sciences Of all cities in the United States, Amherst, New York, has the fewest number of days per year clear of clouds, 4.4.33 Other cities with very few clear days include Buffalo, New York, Lakewood, Washington, and Seattle, Washington. Suppose a random year is selected.
a. What is the probability that Amherst will have exactly three days clear of clouds?
b. What is the probability of fewer than six days clear of clouds?
c. What is the probability of at least nine days clear of clouds?
d. Suppose that between 2 and 10 (inclusive) days are clear of clouds. What is the probability of more than five days clear of clouds?

5.122 Travel and Transportation Bad weather is to blame for some of the worst highway crashes in Canada. In December 2012 there was a 27-vehicle pile-up on Highway 40, and in February 2013 a 50-car pile-up shut down Highway 401 near Woodstock, Ontario. Highway 63 in Alberta has a notorious reputation; there are approximately four accidents every week on this road.34 Suppose a week is randomly selected.
a. Find the probability that there are no more than four crashes.
b. Find the probability that the number of crashes is more than μ + 2σ.
c. To obtain government funding for safety improvements, there must be five weeks in a row with six or more crashes. What is the probability of this happening?

5.123 Physical Sciences On a Friday night in late March 2013, there were hundreds of reports of meteor sightings along the East Coast of the United States. People from North Carolina to Canada contacted the American Meteor Society to report what they saw and heard. Suppose that in a group of 25 people at the National Mall, 15 actually saw the meteor. A patrolman randomly selects five people from this group.
a. What is the probability that none of the five people saw the meteor?
b. What is the probability that at least four people saw the meteor?
c. What is the probability that at most two people saw the meteor?

5.124 Sports and Leisure The NCAA Men’s and Women’s Swimming and Diving Committee recently recommended a no recall false-start rule.35 This proposal means that unless a false start is blatant, the race will continue. The student-athlete committing the false start will be disqualified following the event. Suppose the probability of a false start in any swimming event is 0.07, and swimming events are selected at random.
a. What is the probability that the first false start will occur in the fourth event selected?
b. What is the probability that the first false start will occur after the 15th event selected?
c. What is the mean number of events selected before a false start occurs?
d. Suppose there are 26 events at a swimming and diving meet. What is the probability that the first false start will occur at this meet?

5.125 Manufacturing and Product Development Flat-panel TV displays in televisions and computer monitors often develop dead pixels, pixels that become locked in one state (for example, red at all times). Manufacturers maintain that dead pixels are a natural defect and there are various return policies after the discovery of a dead pixel. Suppose the mean number of dead pixels in a new Samsung 64-inch plasma TV is 2.5. One of these TVs is randomly selected and inspected for dead pixels.
a. What is the probability that there will be no dead pixels?
b. If the number of dead pixels is more than μ + 3σ, the assembly line is automatically stopped and examined. What is the probability that the assembly line will be stopped?
c. What is the probability that the number of dead pixels will be within two standard deviations of the mean?

5.126 Economics and Finance Most banks charge a monthly fee for a checking account, in addition to ATM, overdraft, and other fees. However, 72% of credit unions offer checking accounts with no monthly fees.36 Suppose a bank auditor selects credit unions at random.
a. What is the probability that the second credit union selected will be the first to offer free checking?
b. What is the probability that the first credit union to offer free checking will be one of the first three?
c. Suppose the first credit union to offer free checking is the 10th selected. Is there any evidence to suggest that the claim (72%) is false? Justify your answer.

5.127 Business and Management Fifteen lobstermen have their boats anchored at a small pier along the New Hampshire coast. Five of these lobstermen have been fined within the past year for commercial lobster-size violations. Suppose four lobstermen are selected at random.
a. What is the probability that exactly two have been fined for violations within the past year? Write a Solution Trail for this problem.
b. What is the probability that all four have been fined for violations within the past year?
c. What is the probability that at least one has been fined for violations within the past year?

5.128 Demographics and Population Statistics Buchtal, a manufacturer of ceramic tiles, reports 3.9 job-related accidents per year. Accident categories include trip, fall, struck by equipment, transportation, and handling. Suppose a year is selected at random.
a. What is the probability that there will be no job-related accidents?
b. What is the probability that the number of accidents that year will be between two and five (inclusive)?
c. If the number of accidents is more than three standard deviations above the mean, the company insurance carrier will raise the rates. What is the probability of an increase in the company’s insurance bill?

5.129 Sports and Leisure Amusement park rides are great family fun, but over 4400 children are injured on rides every year. According to a recent study, on average, one child is treated in a hospital Emergency Room every two hours as a result of an injury from an amusement park ride.37 Suppose a two-hour period is selected at random.
a. What is the probability that no children will be treated in an Emergency Room as a result of an injury from an amusement park ride?
b. What is the probability that at most three children will be treated in an Emergency Room as a result of an injury from an amusement park ride?
c. Suppose that six children are treated in an Emergency Room as a result of an injury on an amusement park ride. Is there any evidence to suggest that the claim (of one every two hours) is wrong? Justify your answer.

Extended Applications

5.130 Marketing and Consumer Behavior The Sweet Leaf Iced Teas Company is sponsoring a conventional bottle-cap sweepstakes game. Under each bottle cap there is a note saying either “You are not a winner,” or the prize awarded. Suppose there are 20 of the game bottles on a shelf in the supermarket, and two of them are winners. A customer randomly selects six bottles from the shelf.
a. What is the probability of selecting no winning bottles?
b. What is the probability of selecting both winning bottles?
c. What is the mean number of winning bottles selected?
d. How many bottles would the customer have to purchase in order to expect one winning bottle?

5.131 Physical Sciences There are over 1 million earthquakes worldwide of magnitude 2–2.9 each year. However, the mean number of earthquakes of magnitude 8 or higher is approximately one per year.38 Suppose a random year is selected.
a. What is the probability of exactly two earthquakes of magnitude 8 or higher?
b. What is the probability of at most four earthquakes of magnitude 8 or higher?
c. Suppose there are three earthquakes of magnitude 8 or higher. Is there any evidence to suggest that the mean is different from one? Justify your answer.

5.132 Business and Management Managers at CafePress acknowledge that a variety of errors may occur in customer orders received via telephone. A recent audit revealed that the probability of some type of error in a telephone order is 0.20. In an attempt to correct these errors, a supervisor randomly selects telephone orders and carefully inspects each one.
a. What is the probability that the third telephone order selected will be the first to contain an error?
b. What is the probability that the supervisor will inspect between two and six (inclusive) telephone orders before finding an error?
c. What is the probability that the inspector will examine at least seven orders before finding an error?
d. What is the probability that the first error will be on the fourth telephone order or later?
e. Suppose the first four telephone orders contain no errors. What is the probability that the first error will be on the eighth order or later?

5.133 Economics and Finance The manager of Capitol Park Plaza, an apartment complex in Washington, DC, collects the rent from each tenant on the first day of every month. Past records indicate that the mean number of tenants who do not pay the rent on time in any given month is 4.7. Consider the rent collection for the next month.
a. Find the probability that every tenant will pay the rent on time.
b. Find the probability that at least seven tenants will be late with their rent.
c. Suppose the number of delinquent rent payments in a month is independent of the number in every other month. What is the probability that at most three tenants will be late with their rent in two consecutive months?

5.134 Public Health and Nutrition The mean number of commercially prepared meals per week for a typical American is 4.39 This might seem a little high. But consider young professionals who eat lunch out several times each week, and families that order out for pizza or stop at a fast-food restaurant routinely.
a. What is the probability that a randomly selected American does not eat out during a week? Eats out once during the week? Twice during the week?
b. Suppose two Americans are selected at random. What is the probability that the total number of meals for the two Americans during a week is 0? 1? 2?
c. Suppose the mean number of times an American eats out during a two-week period is 8. What is the probability that a randomly selected American does not eat out during a two-week period? Once during the two-week period? Twice during the two-week period?
d. How do your answers in parts (b) and (c) compare? What property does this suggest about a Poisson random variable?

5.135 Psychology and Human Behavior The percentage of Americans who claim to have no religious affiliation is the highest since 1930, approximately 20%.40 In a group of 30 police officers, 6 have no religious affiliation. Suppose 4 officers from this group are selected at random.
a. What is the probability that exactly 1 officer will have no religious affiliation?
b. What is the probability that at most 2 officers will have no religious affiliation?
c. Suppose the group consists of 50 police officers, 10 with no religious affiliation. Find the probabilities in parts (a) and (b) given this new, larger group.
d. Suppose 4 officers are selected at random from across the country. What is the probability that exactly 1 will have no religious affiliation? At most 2 will have no religious affiliation?
e. Compare all of these probabilities. Explain how the hypergeometric distribution is related to the binomial distribution.

Challenge

5.136 Approaching Poisson Suppose X is a Poisson random variable with λ = 2. Let the random variable Y have a probability distribution as given in the following table.

y | P(Y = y)
0 | P(X = 0) = 0.1353
1 | P(X = 1) = 0.2707
2 | P(X = 2) = 0.2707
3 | P(X ≥ 3) = 0.3233

Find the expected value of Y.

Suppose the distribution of Y is changed slightly, at the right tail, as given in the following table.

y | P(Y = y)
0 | P(X = 0) = 0.1353
1 | P(X = 1) = 0.2707
2 | P(X = 2) = 0.2707
3 | P(X = 3) = 0.1804
4 | P(X ≥ 4) = 0.1429

Find the expected value of Y.

Suppose the distribution of Y is changed again, once more at the right tail.

y | P(Y = y)
0 | P(X = 0) = 0.1353
1 | P(X = 1) = 0.2707
2 | P(X = 2) = 0.2707
3 | P(X = 3) = 0.1804
4 | P(X = 4) = 0.0902
5 | P(X ≥ 5) = 0.0527

Find the expected value of Y.

Continue in this manner. To what number is E(Y) converging, and why does this make sense?

5.137 A Committed Relationship Suppose X is a geometric random variable with probability of success p = 0.40 and Y is a binomial random variable with the same probability of success p = 0.40. For a = 1, 2, 3, …, 10, construct a table with the following probabilities.
a. P(X = a)
b. P(Y = 1)/a, where Y ~ B(a, 0.40)
c. P(X ≤ a)
d. 1 − P(Y = 0), where Y ~ B(a, 0.40)

Carefully examine the table and write a general formula to explain each equality. Can you prove these results?

CHAPTER 5 SUMMARY

Concept | Page | Notation / Formula / Description
Random variable | 188 | A function that assigns a unique numerical value to each outcome in a sample space.
Discrete random variable | 190 | The set of all possible values is finite, or countably infinite.
Continuous random variable | 190 | The set of all possible values is an interval of numbers.
Probability distribution for a discrete random variable | 193 | A method for conveying all the possible values of the random variable and the probability associated with each value.
Mean, or expected value, of a discrete random variable X | 203 | $\mu = E(X) = \sum_{\text{all } x} [x \cdot p(x)]$
Variance of a discrete random variable X | 204 | $\sigma^2 = \mathrm{Var}(X) = \sum_{\text{all } x} [(x - \mu)^2 \cdot p(x)]$
Properties of a binomial experiment | 212 | 1. n identical trials. 2. Each trial can result in only a success (S) or a failure (F). 3. Trials are independent. 4. Probability of a success is constant from trial to trial.
Binomial random variable | 212 | The number of successes in n trials.
Binomial probability distribution | 214 | If $X \sim B(n, p)$ then $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, 2, 3, \ldots, n$, where $\mu = np$, $\sigma^2 = np(1-p)$, and $\sigma = \sqrt{np(1-p)}$.
Cumulative probability | 215 | $P(X \le x)$
Geometric random variable | 225 | The number of trials necessary to realize the first success.
Geometric probability distribution | 226 | $p(x) = (1-p)^{x-1} p$, $x = 1, 2, 3, \ldots$, where $\mu = \frac{1}{p}$, $\sigma^2 = \frac{1-p}{p^2}$, and $\sigma = \sqrt{\frac{1-p}{p^2}}$.
Poisson random variable | 227 | A count of the number of times a specific event occurs during a given interval.
Poisson probability distribution | 228 | $p(x) = \frac{e^{-\lambda}\lambda^x}{x!}$, $x = 0, 1, 2, 3, \ldots$, where $\mu = \lambda$, $\sigma^2 = \lambda$, and $\sigma = \sqrt{\lambda}$.
Hypergeometric random variable | 229 | A count of the number of successes in a random sample of size n from a population of size N.
Hypergeometric probability distribution | 230 | $p(x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$, $\max(0, n - N + M) \le x \le \min(n, M)$, where $\mu = n\frac{M}{N}$, $\sigma^2 = \left(\frac{N-n}{N-1}\right) n \frac{M}{N}\left(1 - \frac{M}{N}\right)$, and $\sigma = \sqrt{\left(\frac{N-n}{N-1}\right) n \frac{M}{N}\left(1 - \frac{M}{N}\right)}$.

CHAPTER 5 EXERCISES


APPLICATIONS

5.138 Public Health and Nutrition Emergency defibrillators are now located in many public buildings. However, the U.S. Food and Drug Administration (FDA) is concerned about the reliability of these devices. Approximately 45,000 devices failed during the past seven years.41 Suppose the FDA claims that the probability of an emergency defibrillator working correctly is 0.90, and 30 of these devices are selected at random and tested.
a. Find the probability that exactly 28 devices will work correctly.
b. Find the probability that at least 25 devices will work correctly.
c. Suppose only 20 of the devices work correctly. Is there any evidence to suggest that the proportion of emergency defibrillators that work correctly has changed? Justify your answer.

5.139 Business and Management IKEA is a Swedish company that sells ready-to-assemble furniture. Shoppers contact customer service with regard to finding a nearby store, online shipping questions, and even for help with assembly. IKEA classifies all telephone calls to its customer support staff by the amount of time the customer is on hold. If the customer is on hold for no more than 60 seconds, then the call is classified as successful (actually, this sounds like a miracle). The supervisor in technical support claims 80% of all calls are successful. Suppose 25 calls to technical support are selected at random.
a. Find the mean, variance, and standard deviation of the number of successful calls.
b. Find the probability that at least 18 calls will be successful.
c. Suppose 21 calls are successful. Is there any evidence to suggest the supervisor’s claim is false? Justify your answer.

5.140 Economics and Finance Bank overdraft fees range from $10 to $38, consumers believe they are annoying and excessive, and these fees are a huge revenue source for banks.42 The Overdraft Protection Act of 2013 is designed to limit overdraft fees in a variety of ways and to require fees to be reasonable and proportional to the amount of the overdraft. Let X be the amount of an overdraft fee for a randomly selected bank. The probability distribution for X is given in the table below.

x

10

12

15

20

25

p(x)

0.02

0.06

0.08

0.10

0.16

x

27

30

35

38

p(x)

0.28

0.15

0.07

0.08

Chapter 5

a. Find the mean, variance, and standard deviation of the

overdraft amount. b. Find the probability that a randomly selected bank has an overdraft fee greater than $25. c. Find the probability that a randomly selected bank has an overdraft fee less than m 2 s . d. Suppose three banks are selected at random. What is the probability that at least one bank has an overdraft fee less than $20? 5.141 Manufacturing and Product Development Thales Alenia Space is a European company that manufactures communications satellites. Researchers at the company have determined that the most common reason for a satellite to fail once it is in orbit is a problem related to opening and initiating the solar panels. Suppose the probability of a failure related to the solar panels is 0.08. a. What is the probability that the fifth satellite launched will be the first to fail due to a solar-panel problem? b. What is the expected number of satellites launched until the first one fails due to a solar-panel problem? c. Thales Alenia Space is preparing an advertising campaign in which they claim to have had 20 successful launches in a row. What is the probability that the first failure due to a solar-panel problem will occur after the 20th launch? 5.142 Business and Management An easy-assembly,

no-tools-required, gas grill comes with detailed step-by-step instructions. Even though each grill is carefully packaged, there are often missing pieces. This can aggravate the customer and increase the cost to the producer, who must provide phone support and ship the missing parts. Suppose the mean number of missing pieces per packaged grill is 0.7, and one grill is randomly selected from the stockroom. a. What is the probability that there will be no missing pieces in the package? b. If there are more than five missing pieces, the producer identifies the packager and issues a warning. What is the probability of a warning being issued at the packaging plant? c. Suppose three grills are randomly selected. What is the probability that each will have no more than one missing piece? 5.143 Travel and Transportation Because of forced spending cuts in Spring 2013, the Federal Aviation Administration identified 149 air traffic control towers that would be closed.43 In a group of 20 air traffic control towers in the Midwest, five will be closed. Suppose six of the 20 air traffic control towers are selected at random. a. What is the probability that none of the six air traffic control towers will be closed? b. What is the probability that at most two of the air traffic control towers will be closed? c. What is the probability that at least four will be closed? 5.144 Sports and Leisure Over a period of 12 years, there

were approximately 2.42 shark attacks per year at the beaches in North Carolina.44 Consider the number of shark attacks during the thirteenth year.

a. What is the probability that there will be no shark attacks? b. What is the probability that between two and five shark attacks inclusive will occur? c. If there is evidence to suggest that the mean number of shark attacks per year has increased, the Coast Guard will begin more patrols to adequately protect the public. Suppose there are eight shark attacks in the thirteenth year. Is there evidence to suggest the need for more patrols? Justify your answer. d. Find out exactly how many shark attacks there were in a recent year in North Carolina. Determine the probability of this occurrence. e. Florida has the highest number of shark attacks per year, approximately 22.5, followed by Hawaii (4.33 per year), and California (3.17 per year). What is the probability that there will be no attacks in all four states in a given year? 5.145 Technology and the Internet In February 2013, the

Chinese Ministry of Industry and Information Technology announced plans to expand broadband connections (or faster) to 70% of Chinese households.45 Suppose the plan is successful and 40 Chinese households are selected at random. a. What is the probability that exactly 25 households have broadband coverage? b. What is the probability that at most 30 households have broadband coverage? c. What is the probability that more than 33 households have broadband coverage? d. Suppose the number of households that have broadband coverage is within two standard deviations of the mean. What is the probability that the actual number that have broadband coverage is within one standard deviation of the mean? 5.146 Sports and Leisure In 2012–2013, Carnival Cruise

Lines experienced onboard problems with four ships: Triumph, Elation, Dream, and Legend. The technical issues included loss of power, steering problems, and even a fire in an engine room. Despite these setbacks, the cruise industry contributes almost $38 billion to the U.S. economy, and approximately 20% of all people in the United States have taken a cruise.46 Suppose 25 people from the United States are selected at random. a. Find the probability that exactly three people have been on a cruise. b. Find the probability that at most two people have been on a cruise. c. Suppose seven people have taken a cruise. Is there any evidence to suggest that the percentage of people who have taken a cruise has increased? Justify your answer. 5.147 Psychology and Human Behavior The army

emphasizes cleanliness and neatness in a military barracks. Each cadet is responsible for maintaining his or her area in top condition. Periodic inspections are held, and those receiving top scores are rewarded. Suppose the mean number of violations discovered per cadet during a barracks inspection is 2.7. a. What is the probability that a randomly selected cadet will have exactly three violations during an inspection?

b. If a cadet has six or more violations, he or she is assigned to KP duty for one week. What is the probability that a randomly selected cadet will be assigned to KP duty following a barracks inspection? c. If every member of a 10-cadet unit has no violations, then each will receive a weekend pass. What is the probability of this happening following a barracks inspection? 5.148 Marketing and Consumer Behavior In the country’s

2012 budget, Canada decided to stop production of the penny.47 The government indicated that it costs 1.6 cents to produce each penny, some see the penny as a burden to the economy, and approximately 10% of all Canadians believe the penny is a nuisance. Suppose 50 Canadians are selected at random and asked if they believe the penny is a nuisance. a. Find the mean, variance, and standard deviation of the number of Canadians who believe the penny is a nuisance. b. What is the probability that at most three Canadians believe the penny is a nuisance? c. Find the probability that the number of Canadians who believe the penny is a nuisance will be within two standard deviations of the mean. 5.149 Marketing and Consumer Behavior The movie

Les Misérables, an adaptation of Victor Hugo’s novel, starred Hugh Jackman, Russell Crowe, Anne Hathaway, and Amanda Seyfried, and won many awards. The Flixster movie site, Rotten Tomatoes, rated the movie at 74% on the Tomatometer. However, 81% of all people who saw the movie liked it.48 Suppose 30 people who saw the movie are selected at random. a. What is the probability that at most 20 people liked the movie? b. What is the probability that at least 25 people liked the movie? c. Suppose 18 people liked the movie. Is there any evidence to suggest that the claim (81%) is wrong? Justify your answer. 5.150 Public Policy and Political Science In a recent

nationwide study, it was reported that U.S. adults continue to believe that big companies and lobbyists have too much power. In particular, 85% of those polled indicated that PACs (Political Action Committees) have too much influence in Washington. Suppose 50 U.S. adults are selected at random. a. What is the probability that at least 45 U.S. adults think PACs have too much power? b. What is the probability that between 38 and 42 (inclusive) U.S. adults think PACs have too much power? c. Suppose 35 U.S. adults think PACs have too much power. Is there any evidence to suggest the poll results are wrong? Justify your answer. 5.151 The CBS show NCIS stars Mark Harmon as Special

Agent Leroy Jethro Gibbs, whose team investigates military-related criminal cases. The character Abigail Sciuto is frequently seen drinking the high-energy, caffeine-laden drink Caf-Pow (she keeps a spare in the lab refrigerator). The mean number of Caf-Pows Abby consumes per show is 3. Suppose a 2013 NCIS episode is selected at random. a. What is the probability that Abby has no Caf-Pows on the show? b. What is the probability that she is shown having at most 3 Caf-Pows on the show? c. Suppose Abby has 6 Caf-Pows. Is there any evidence that the number of Caf-Pows per show has changed? Justify your answer.

EXTENDED APPLICATIONS

5.152 Discrete Uniform Random Variable Suppose X is a random variable with probability distribution given by

\[ p(x) = \frac{1}{5}, \qquad x = 1, 2, 3, 4, 5 \]

a. Find the mean, variance, and standard deviation of X.

b. Suppose \(p(x) = 1/6\), \(x = 1, 2, 3, 4, 5, 6\). Find the mean, variance, and standard deviation of X.

c. Suppose \(p(x) = 1/n\), \(x = 1, 2, 3, \ldots, n\). Find the mean, variance, and standard deviation of X in terms of n.

5.153 Medicine and Clinical Studies According to a Pew Research Center survey, approximately 35% of Americans attempt to diagnose a medical condition online.49 Highmark insurance company is concerned about the rising number of online diagnosers and the resulting failure to consult a physician. They have decided to select 25 policyholders at random. If the number of online diagnosers is 11 or fewer, then no action will be taken. Otherwise, they will begin a campaign to remind policyholders that they should always consult a physician to confirm a medical condition. a. Suppose the true proportion of online diagnosers is 0.35. What is the probability that Highmark will begin a new reminder campaign? b. Suppose the true proportion of online diagnosers is 0.40. What is the probability that no action will be taken? What if the true proportion is 0.50? c. Suppose the decision rule is changed such that if the number of online diagnosers is 12 or fewer, then no action will be taken. Answer parts (a) and (b) using this rule. 5.154 Marketing and Consumer Behavior Kohl’s is

running a sale in which customers may save as much as 40% on any purchase. Once a customer decides to make a purchase, he or she selects two sales prize tickets at random from a large bin at the front of the store. Each ticket has a percentage marked on it, and the probability of selecting each ticket is given in the table below.

Percentage     10%     20%     30%     40%
Probability    0.50    0.35    0.10    0.05

The larger of the two percentages selected is used for the purchase. a. Let X be the maximum of the two prize ticket percentages. Find the probability distribution for X. b. Find the mean, variance, and standard deviation of X. c. What is the probability that a customer will receive at least 20% off on his or her purchase?

5.155 Medicine and Clinical Studies A recent study suggests that total knee joint replacement surgery may be related to weight gain. Researchers who studied records from the Mayo Clinic Health system report that 30% of these patients gained 5% or more of their body weight following surgery.50 Suppose this percentage is the same at all hospitals in the United States and 40 people who had total knee joint replacement surgery are selected at random. a. What is the probability that exactly 14 will experience a weight gain? b. Find the largest value w such that the probability of w or fewer patients who experience a weight gain is at most 0.20. c. Suppose 16 patients experience a weight gain. Is there any evidence to suggest that the proportion of patients with knee joint replacement surgery who experience a weight gain has changed? Justify your answer. 5.156 Public Health and Nutrition A recent study suggested that 75% of all meals marketed specifically to babies and toddlers and sold in grocery stores had high sodium content.51 This is of deep concern because increased salt in the diet can cause hypertension, which may lead to cardiovascular disease. Suppose 30 toddler meals are randomly selected and the sodium content in each is carefully measured. a. What is the probability that exactly 23 meals will have high sodium content? b. What is the probability that at least 25 will have high sodium content? c. Suppose at most 20 meals have high sodium content. What is the probability that at most 15 will have high sodium content?

CHALLENGE

5.157 A Day on the Dock Two crews work on a receiving dock at a fabric manufacturing plant. The first crew unloads four shipments every day and the second crew unloads seven shipments every day. A supervisor records whether each shipment is complete (a success) or missing items (a failure). Suppose \(X_1\) is a binomial random variable, representing the number of complete shipments for crew 1, with parameters \(n_1 = 4\) and \(p = 0.6\). Similarly, let \(X_2\) be a binomial random variable, representing the number of complete shipments for crew 2, with parameters \(n_2 = 7\) and \(p = 0.6\). Assume \(X_1\) and \(X_2\) are independent.

a. Use technology to generate a random observation for \(X_1\) (the number of complete shipments for crew 1) and a random observation for \(X_2\) (the number of complete shipments for crew 2). Add these two values to compute a random total number of complete shipments for crews 1 and 2. Repeat this process to generate 1000 random totals of complete shipments for crews 1 and 2. Compute the relative frequency of occurrence of each observation. Suppose Y is a binomial random variable with \(n = 11\) and \(p = 0.6\). Use technology to construct a table of probabilities for \(Y = 0, 1, 2, 3, \ldots, 11\). Compare these probabilities with the relative frequencies obtained above.

b. Suppose a new receiving crew is added and it unloads five shipments each day. Let \(X_3\) be a binomial random variable, representing the number of complete shipments for crew 3, with parameters \(n_3 = 5\) and \(p = 0.6\). Use technology to generate random observations for \(X_1\), \(X_2\), and \(X_3\). Add these three values to compute a random total number of complete shipments for crews 1, 2, and 3. Repeat this process to generate 1000 random totals of complete shipments for crews 1, 2, and 3. Compute the relative frequency of occurrence of each observation. Suppose Y is a binomial random variable with \(n = 16\) and \(p = 0.6\). Use technology to construct a table of probabilities for \(Y = 0, 1, 2, 3, \ldots, 16\). Compare these probabilities with the relative frequencies obtained above.

c. Suppose another receiving crew is added and it unloads nine shipments each day. Let \(X_4\) be a binomial random variable, representing the number of complete shipments for crew 4, with parameters \(n_4 = 9\) and \(p = 0.6\). Let Y represent the total number of complete shipments for all four crews. i. Find \(P(Y = 15)\) (the probability of exactly 15 total complete shipments). ii. Find \(P(Y \le 12)\). iii. Find \(P(Y > 16)\). iv. How many total complete shipments can be expected?

LAST STEP

5.158 Is a flu shot really effective? In 2012–2013 the CDC reported that the flu vaccine was 56% effective. That is, 56% of all people receiving the flu vaccine who were exposed to a flu virus did not contract the flu. To check this claim (56% effective), a random sample of 50 at-risk people who received a flu shot was selected. During the flu season, all 50 were exposed to the flu and 29 actually contracted the disease. Is there any evidence to suggest the claim is false, that the chance of contracting the flu is greater than 44%?

6 Continuous Probability Distributions

Looking Back

■ Remember how to completely describe and compute probabilities associated with a discrete random variable.

■ Recall the characteristics of and probability computations associated with the binomial, geometric, Poisson, and hypergeometric random variables.

Looking Forward

■ Learn how to completely describe a continuous random variable.

■ Compute probabilities associated with a continuous random variable.

■ Understand the characteristics of the normal distribution and compute probabilities involving a normal random variable.

What’s better, faster or slower? LTE, or long-term evolution, is a wireless communications standard for mobile phones and data terminals. The first LTE service became available in 2009 in Oslo and Stockholm. Since then, countries all over the world have adopted this technology. Verizon was the first wireless company to offer LTE service in the United States, in 2010. For mobile phone users, data download speed is perhaps the most important characteristic of an LTE network. Faster transmission rates mean web pages, songs, and videos download more quickly. OpenSignal uses a mobile application to measure wireless download speed, and in 2013 AT&T was the fastest at 13 Mbps (megabits per second), Verizon Wireless’s mean download speed was 10 Mbps, Sprint Nextel’s was 7.7 Mbps, and MetroPCS was in last place with a speed of 1.2 Mbps.1 According to OpenSignal, the mean download speed in the United States in 2013 was 9.6 Mbps. Suppose the standard deviation is 2.3 Mbps and the distribution of download speeds is approximately normal. The concepts presented in this chapter will allow us to determine the most reasonable download speeds for customers and to decide when a customer has a legitimate complaint about download speed.

CONTENTS

6.1 Probability Distributions for a Continuous Random Variable
6.2 The Normal Distribution
6.3 Checking the Normality Assumption
6.4 The Exponential Distribution

6.1 Probability Distributions for a Continuous Random Variable

Suppose X is a continuous random variable; X takes on any value in some interval of numbers. A continuous probability distribution completely describes the random variable and is used to compute probabilities associated with the random variable.

Definition A probability distribution for a continuous random variable X is given by a smooth curve called a density curve, or probability density function (pdf). The curve is defined so that the probability that X takes on a value between a and b (\(a < b\)) is the area under the curve between a and b.

A CLOSER LOOK

1. Probability in a continuous world is area under a curve. Figures 6.1–6.3 illustrate the correspondence between the probability of an event (defined in terms of a continuous random variable) and the area under the density curve.

Figure 6.1 The probability \(P(a \le X \le b)\) is the area of the shaded region.

Figure 6.2 The shaded area is \(P(X \le a)\).

Figure 6.3 The shaded area is \(P(X \ge b)\).

2. The density curve, or probability density function, is usually denoted by f. It is a function, defined for all real numbers. \(f(x)\) is not the probability that the random variable X equals the specific value x. Rather, the function f leads to, or conveys, probability through area.

Note: Remember that \(f(x)\) is not a probability. The density function leads to probability.

3. The shape of the graph of a density function can vary considerably. However, a density function must satisfy the following two properties:

a. f must be defined so that the total area under the curve is 1. The total probability associated with any random variable must be 1. \(f(x)\), a specific value of the density function, may be greater than 1 (while the total area under the curve is still exactly 1).

b. \(f(x) \ge 0\) for all x. Therefore, the entire graph lies on or above the x axis. See Figure 6.4.

Figure 6.4 A valid probability density function. The total area under the curve must be 1. The graph of a probability density function extends without end in both directions and lies on or above the x axis.

4. If X is a continuous random variable with density function f, the probability that X equals any one specific value is 0. That is, \(P(X = a) = 0\) for any a. The reason: There is no area under a single point. This seems like a contradiction. Certainly we can observe specific values of X, yet the probability of observing any single value is 0. Recall: Probability is a limiting relative frequency. There are an (uncountably) infinite number of values for any continuous random variable. Therefore, the limiting relative frequency of occurrence of any single value is 0. Because no probability is associated with a single point, the following four probabilities are all the same:

\[ P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b) \tag{6.1} \]

Note: This is not necessarily true for a discrete random variable.

Note: \(P(X = a)\) translated: Find the area under the curve between a and a. This is asking for the area of a line segment. There is no second dimension. Hence the area is 0.

In fact, we can remove as many single points as we want from any interval, and the probability will stay the same. The only reasonable probability questions concerning continuous random variables involve intervals. And we can almost always sketch a graph to visualize these probabilities, or regions.

So, how do we find area under a curve, and therefore probability? In general, this is a calculus question. Don’t panic. We’ll use a little geometry, tables, and technology to find the necessary area (probability). The (continuous) uniform distribution provides a good opportunity to illustrate the connection between area under the curve and probability. For this random variable, the total probability, 1, is distributed evenly, or uniformly, between two points. Computing probabilities associated with this random variable reduces to finding the area of a rectangle.

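Definition The random variable X has a uniform distribution on the interval [a, b] if

\[ f(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \qquad -\infty < a < b < \infty \tag{6.2} \]

\[ \mu = \frac{a + b}{2} \qquad \sigma^2 = \frac{(b - a)^2}{12} \tag{6.3} \]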

A CLOSER LOOK

1. a and b can be any real numbers, as long as a is less than b (\(a < b\)).

2. All of the probability (action) is between a and b. The probability density function is the constant \(1/(b - a)\) between a and b, and zero outside of this interval. Hence, there is no area and no probability outside the interval [a, b]. Figure 6.5 shows a graph of the uniform probability density function.

Figure 6.5 The graph of the probability density function for a uniform random variable.

3. Equation 6.2 is a valid probability density function because \(f(x) \ge 0\) for all x, and the total area under the curve is 1. The area under the curve for \(x < a\) is zero, and the area under the curve for \(x > b\) is zero. Between a and b, the area under the curve is the area of a rectangle (area = width × height):

\[ \text{Area} = (b - a) \times \frac{1}{b - a} = 1 \]

See Figure 6.6.

Figure 6.6 The total area under the curve is 1. Here, the density curve consists of three line segments.
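The two properties of a valid density can also be checked numerically. The following Python sketch is only an illustration, not part of the text's method: the function names and the endpoints a = 2 and b = 6 are assumptions chosen for the example, and the area is approximated by adding up many thin rectangles rather than by the exact geometry used above.

```python
def uniform_pdf(x, a, b):
    """Density from Equation 6.2: 1/(b - a) on [a, b], and 0 elsewhere."""
    if a >= b:
        raise ValueError("requires a < b")
    return 1.0 / (b - a) if a <= x <= b else 0.0


def area_under_pdf(a, b, lo, hi, steps=100_000):
    """Approximate the area under the uniform density between lo and hi
    by summing thin rectangles evaluated at their midpoints."""
    width = (hi - lo) / steps
    heights = (uniform_pdf(lo + (i + 0.5) * width, a, b) for i in range(steps))
    return sum(heights) * width


# Hypothetical endpoints, chosen only for illustration.
a, b = 2.0, 6.0

# Integrate well beyond [a, b]; the extra range contributes no area.
total_area = area_under_pdf(a, b, lo=a - 10, hi=b + 10)

print(round(total_area, 3))      # approximately 1
print(uniform_pdf(1.0, a, b))    # outside [a, b], so the density is 0
print(uniform_pdf(3.0, a, b))    # inside [a, b], so the density is 1/(b - a)
```

The approximate total area should be very close to 1 for any choice of a < b, which is the numerical counterpart of the width × height argument.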

The following example involves a uniform distribution and illustrates visualizing and calculating probabilities associated with a continuous random variable.

Solution Trail 6.1

KEYWORDS
■ Uniform distribution
■ Between 5 and 25 minutes

TRANSLATION
■ Uniform random variable
■ a = 5, b = 25

CONCEPTS
■ Uniform probability distribution
■ Probability is area under the density curve

VISION
The time it takes to reach a dive site has a uniform distribution between 5 and 25 minutes. The two most important components for solving this kind of problem are the probability distribution and the probability statement. Use Equation 6.2 to draw the density function, and sketch a graph corresponding to each probability statement.

Example 6.1 Reef Dives

Bonaire, an island in the Dutch Caribbean, is considered one of the top 10 diving destinations in the world. There is generally 60–100 feet of visibility, the current is mild, and at least 58 dive sites can be reached from shore. Guides take tourist groups on commercial boats to scuba dive and snorkel at selected locations. A careful examination of boat records has shown that the time it takes to reach a randomly selected dive site has a uniform distribution between 5 and 25 minutes. Suppose a dive site is selected at random.

a. Carefully sketch a graph of the probability density function.
b. Find the probability that it takes at most 10 minutes to reach the dive site.
c. Find the probability that it takes between 10 and 20 minutes to reach the dive site.
d. Find the mean time it takes to reach a dive site, and the variance and standard deviation.

SOLUTION

a. Let X be the time it takes to reach a dive site. X is uniform between the times a = 5 and b = 25. Use Equation 6.2 to find

\[ \frac{1}{b - a} = \frac{1}{25 - 5} = \frac{1}{20} = 0.05 \]

The probability density function is

\[ f(x) = \begin{cases} 0.05 & \text{if } 5 \le x \le 25 \\ 0 & \text{otherwise} \end{cases} \]

Figure 6.7 shows the graph of the probability density function.

b. We have the distribution of X. Translate the question in part (b) into a probability statement, sketch the region corresponding to the probability statement, and find the area of that region. At most means up to and including 10. We need the probability that X is less than or equal to 10: \(P(X \le 10)\). See Figure 6.8.

Figure 6.7 The graph of the probability density function for a uniform random variable on the interval a = 5 to b = 25.

Figure 6.8 The area of the shaded region is \(P(X \le 10)\).

Note: The probability statement \(P(X \le 10)\) simplifies to \(P(5 \le X \le 10)\) in this case because there is no probability (area) for x less than 5.

\[ \begin{aligned} P(X \le 10) &= \text{area under the density curve between 5 and 10} \\ &= \text{area of a rectangle} = \text{width} \times \text{height} \\ &= (5)(0.05) = 0.25 \end{aligned} \]

The probability that it takes at most 10 minutes is 0.25.

Note: \(P(10 \le X \le 20) = P(10 < X \le 20) = P(10 \le X < 20) = P(10 < X < 20)\)

c. The probability that it takes between 10 and 20 minutes to reach a dive site in terms of the random variable X is \(P(10 \le X \le 20)\). Even though the word inclusive is not used in the question, we chose to write the interval including the endpoints. It doesn’t really matter! Remember: In a continuous world, single values contribute no probability and do not change the probability calculation.

\[ \begin{aligned} P(10 \le X \le 20) &= \text{area under the density curve between 10 and 20} \\ &= \text{area of a rectangle} = \text{width} \times \text{height} \\ &= (10)(0.05) = 0.50 \end{aligned} \]

The probability that it takes between 10 and 20 minutes is 0.50. See Figure 6.9.

Figure 6.9 The area of the shaded region is \(P(10 \le X \le 20)\).

d. Use Equation 6.3 to find the mean and variance.

\[ \mu = \frac{a + b}{2} = \frac{5 + 25}{2} = \frac{30}{2} = 15 \]

The mean time it takes to reach a dive site is 15 minutes. Because the uniform distribution is symmetric, the mean is the middle of the distribution, and the mean is equal to the median.

\[ \sigma^2 = \frac{(b - a)^2}{12} = \frac{(25 - 5)^2}{12} = \frac{20^2}{12} = \frac{400}{12} \approx 33.3 \]

\[ \sigma = \sqrt{\sigma^2} = \sqrt{33.3} \approx 5.8 \]

The standard deviation is approximately 5.8 minutes.

Challenge: Find a length of time t such that 90% of all dive sites are reached within t minutes.

TRY IT NOW GO TO EXERCISE 6.12
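The arithmetic in Example 6.1 can be double-checked with a few lines of Python. This is a minimal sketch under the example's stated assumptions (X uniform on 5 to 25 minutes). The helper name `rectangle_probability` is hypothetical and simply encodes "width times height", with any part of the interval outside [a, b] clipped away because it carries no area.

```python
import math


def rectangle_probability(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b], computed as width times height."""
    lo, hi = max(lo, a), min(hi, b)      # there is no area outside [a, b]
    width = max(hi - lo, 0.0)
    height = 1.0 / (b - a)               # constant density from Equation 6.2
    return width * height


a, b = 5.0, 25.0                          # minutes, from Example 6.1

p_at_most_10 = rectangle_probability(-math.inf, 10.0, a, b)   # part (b)
p_10_to_20 = rectangle_probability(10.0, 20.0, a, b)          # part (c)
mean = (a + b) / 2                                            # Equation 6.3
variance = (b - a) ** 2 / 12                                  # Equation 6.3
std_dev = math.sqrt(variance)

print(round(p_at_most_10, 2), round(p_10_to_20, 2))
print(mean, round(variance, 1), round(std_dev, 1))
```

Run as written, the rounded results should agree with the worked solution: 0.25, 0.50, a mean of 15, a variance of about 33.3, and a standard deviation of about 5.8.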

To find a probability associated with any continuous random variable, the probability statement is often rewritten to use cumulative probability. From Chapter 5, cumulative probability means the probability accumulated up to and including a fixed value: find all the area under the density curve to the left of the fixed value. For a continuous random variable X, the cumulative probability up to x is \(P(X \le x)\). Figure 6.10 illustrates this cumulative probability. Suppose X is a continuous random variable, and a and b are constants. Here are some typical probability statements involving X, and equivalent expressions using cumulative probability.

Note: It doesn’t matter whether we use \(\le\) or \(<\), because one point contributes no probability. However, for consistency and accuracy throughout this text, cumulative probability will mean up to and including x; we use \(\le\) (not \(<\)).

\[ P(X \ge b) = 1 - P(X < b) = 1 - P(X \le b) \qquad \text{(Figure 6.11)} \]

The first step is the complement rule. The second step holds because a single value contributes no probability.

\[ P(a \le X \le b) = P(X \le b) - P(X < a) = P(X \le b) - P(X \le a) \qquad \text{(Figure 6.12)} \]

The first step rewrites the statement using cumulative probability. The second step holds because a single value contributes no probability.

Find all the probability up to b, find all the probability up to a, and subtract. The difference is the probability that X lies in the interval from a to b.

Figure 6.10 The shaded area is the cumulative probability \(P(X \le x)\).

Figure 6.11 Use the complement rule to convert to cumulative probability.

Figure 6.12 The shaded area is \(P(X \ge b)\).
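The two conversions to cumulative probability translate directly into code. In the Python sketch below, every name is hypothetical: `cdf` stands for any function that returns \(P(X \le x)\), and the uniform distribution on [0, 1] is assumed purely so that there is something concrete to evaluate.

```python
from typing import Callable

Cdf = Callable[[float], float]   # any function that returns P(X <= x)


def prob_at_least(cdf: Cdf, b: float) -> float:
    """P(X >= b) = 1 - P(X <= b), by the complement rule."""
    return 1.0 - cdf(b)


def prob_between(cdf: Cdf, a: float, b: float) -> float:
    """P(a <= X <= b) = P(X <= b) - P(X <= a)."""
    return cdf(b) - cdf(a)


def unit_uniform_cdf(x: float) -> float:
    """Cumulative probability for a uniform random variable on [0, 1]
    (an assumed distribution, used only to exercise the two helpers)."""
    return min(max(x, 0.0), 1.0)


print(round(prob_at_least(unit_uniform_cdf, 0.8), 4))
print(round(prob_between(unit_uniform_cdf, 0.25, 0.75), 4))
```

Because a single point contributes no probability for a continuous random variable, the same two helpers also cover the strict versions \(P(X > b)\) and \(P(a < X < b)\).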

Here is one more way to picture cumulative probability. As x moves from left to right, we accumulate more and more probability. As x increases, cumulative probability also increases. Imagine starting at an altitude of zero and walking up a (smooth) hill. At any point along the walk, measure the altitude. This distance is the cumulative probability. Figure 6.13 shows the relationship between the area under the density curve and the altitude.

Figure 6.13 Picturing cumulative probability: The altitude is equal to the shaded area. (Panels: density curve with shaded area \(P(X \le x)\); cumulative distribution function.)

A CLOSER LOOK

1. The drawing on the left in Figure 6.13 is a graph of a cumulative distribution function. This function starts at 0 and never decreases as x increases, rising to a maximum value of 1.

Note: Why is 1 the maximum value of the cumulative distribution function?

2. The mean, \(\mu\), and the variance, \(\sigma^2\), for a continuous random variable are computed using calculus. Although we will not consider any of these calculations, we will interpret and use these values as usual. \(\mu\) is a measure of the center of the distribution, and \(\sigma^2\) (or \(\sigma\)) is a measure of the spread, or variability, of the distribution. Figure 6.14 shows density functions for the random variables X and Y.

a. The mean of X is less than the mean of Y, \(\mu_X < \mu_Y\), because the center of the distribution of X is to the left of the center of the distribution of Y.

b. The standard deviation of X is greater than the standard deviation of Y, \(\sigma_X > \sigma_Y\), because the distribution of X is more spread out, and thus has more variability, than the distribution of Y.

Figure 6.14 Density functions for X and Y. The mean and the variance (and standard deviation) convey the same information as before about the center and variability.
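Although the text does not carry out the calculus, technology can approximate it. The sketch below assumes the standard definitions (not spelled out in this section) that the mean is the area under \(x f(x)\) and the variance is the area under \((x - \mu)^2 f(x)\). It approximates both areas with thin rectangles for the uniform density on 5 to 25 and can be compared with Equation 6.3. The function names are hypothetical.

```python
def uniform_density(x, a=5.0, b=25.0):
    """Uniform density on [a, b]; the defaults match the dive-site example."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(g, lo, hi, steps=20_000):
    """Approximate the area under g between lo and hi with midpoint rectangles."""
    width = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * width) for i in range(steps)) * width


# Mean: area under x * f(x). Variance: area under (x - mean)^2 * f(x).
mean = integrate(lambda x: x * uniform_density(x), 5.0, 25.0)
variance = integrate(lambda x: (x - mean) ** 2 * uniform_density(x), 5.0, 25.0)

print(round(mean, 3), round(variance, 3))
print((5.0 + 25.0) / 2, round((25.0 - 5.0) ** 2 / 12, 3))   # Equation 6.3
```

The two printed lines should agree closely, which suggests that the formulas in Equation 6.3 are simply the result of doing these area calculations exactly.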

The following example illustrates the use of cumulative probability to compute probability associated with a continuous random variable.

Example 6.2 Keeping Good Time

Each Citizen Eco-Drive wristwatch is carefully tested for accuracy before being packaged and shipped. If the watch gains or loses time during the 24-hour testing period, it is sent to a technician for adjustment. The time inconsistency (in seconds) is a random variable, X, with the probability density function shown in Figure 6.15. A negative value of X indicates that the watch lost time, and a positive value indicates that the watch gained time. Cumulative probability for X is illustrated in Figure 6.16 and can be computed using the equation that follows. The cumulative probability (the area of the shaded region in Figure 6.16) is

\[ P(X \le x) = \frac{1}{1 + e^{-x/2}} \quad \text{for all } x \tag{6.4} \]

Note: Recall: e is the base of the natural logarithm; \(e \approx 2.71828\). Most calculators have a specific key for e.

Figure 6.15 The probability density function for the time inconsistency of a wristwatch.

Figure 6.16 The cumulative probability for the time inconsistency of a wristwatch.

Suppose a watch is randomly selected.

a. What is the probability that the watch is 5 seconds slow or slower?
b. What is the probability that the watch is more than 10 seconds fast?
c. What is the probability that the watch is between 3 seconds slow and 3 seconds fast?

SOLUTION

a. If the watch is 5 seconds slow or slower, this means \(X \le -5\). The probability \(P(X \le -5)\) is cumulative probability already. The calculation and the graphical interpretation (Figure 6.17) follow.

Figure 6.17 The cumulative probability for watch inconsistency, \(P(X \le -5)\).

\[ P(X \le -5) = \frac{1}{1 + e^{-(-5)/2}} = 0.0759 \qquad \text{Use Equation 6.4.} \]

The probability that a randomly selected watch is 5 seconds slow or slower is 0.0759.

b. If the watch is more than 10 seconds fast, this means \(X > 10\). To compute the corresponding probability, use the complement rule to convert to an expression involving cumulative probability, and use Equation 6.4. See Figure 6.18.

Figure 6.18 The probability for watch inconsistency, \(P(X > 10)\).

\[ \begin{aligned} P(X > 10) &= 1 - P(X \le 10) && \text{The complement rule.} \\ &= 1 - \frac{1}{1 + e^{-10/2}} && \text{Use Equation 6.4.} \\ &= 1 - 0.9933 = 0.0067 && \text{Simplify.} \end{aligned} \]
The probability that a randomly selected watch is more than 10 seconds fast is 0.0067.

c. If the watch is between 3 seconds slow and 3 seconds fast, this means \(-3 \le X \le 3\). To compute the corresponding probability, find the difference between two cumulative probabilities. See Figure 6.19.

Figure 6.19 The probability for watch inconsistency, \(P(-3 \le X \le 3)\).

\[ \begin{aligned} P(-3 \le X \le 3) &= P(X \le 3) - P(X < -3) \\ &= P(X \le 3) - P(X \le -3) && \text{A single point contributes no probability.} \\ &= \left( \frac{1}{1 + e^{-3/2}} \right) - \left( \frac{1}{1 + e^{-(-3)/2}} \right) && \text{Use Equation 6.4.} \\ &= 0.8176 - 0.1824 = 0.6352 && \text{Simplify.} \end{aligned} \]

The probability that a randomly selected watch is between 3 seconds slow and 3 seconds fast is 0.6352.

TRY IT NOW GO TO EXERCISE 6.22
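The three answers in Example 6.2 can be reproduced with technology. The Python sketch below assumes only Equation 6.4. The function name `watch_cdf` and the variable names are hypothetical labels chosen for this illustration.

```python
import math


def watch_cdf(x):
    """Equation 6.4: P(X <= x) = 1 / (1 + e^(-x/2))."""
    return 1.0 / (1.0 + math.exp(-x / 2.0))


p_slow = watch_cdf(-5)                      # part (a): P(X <= -5)
p_fast = 1.0 - watch_cdf(10)                # part (b): complement rule
p_within_3 = watch_cdf(3) - watch_cdf(-3)   # part (c): difference of cumulative probabilities

print(round(p_slow, 4), round(p_fast, 4), round(p_within_3, 4))
```

Parts (a) and (b) should match the solution to four decimal places. For part (c), carrying full precision gives about 0.6351; the value 0.6352 in the solution comes from subtracting the already rounded values 0.8176 and 0.1824.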

In some problems a probability is given and we need to work backward to find a solution. Consider the following example.

Example 6.3 Voting Time The time it takes to vote may affect the outcome of an election. For example, some people may decide not to vote in districts with long lines and delays. The mean time to vote in the 2012 election was approximately 14 minutes, and Florida voters had the longest wait.2 Suppose the time to vote for a randomly selected person in Florida, X, has a uniform distribution between 10 and 60 minutes. Find the time t such that 75% of all people have to wait at most t minutes to vote.

SOLUTION
This is a backward problem, because we know the probability (0.75) and need to find a starting point (t), a value of X that produces this probability.

STEP 1 Because X has a uniform distribution with a = 10 and b = 60, the probability density function is

$$f(x) = \begin{cases} \dfrac{1}{50} = 0.02 & 10 \le x \le 60 \\ 0 & \text{otherwise} \end{cases}$$

We need to find the value of t such that $P(X \le t) = 0.75$.

STEP 2 From Figure 6.20, the probability that it takes at most t minutes to vote is

$$P(X \le t) = \text{area under the density curve from 10 to } t = \text{area of a rectangle} = \text{width} \times \text{height} = (t - 10)(0.02)$$

Figure 6.20 The area of the shaded region is $P(X \le t)$.

STEP 3 Set the expression for probability equal to 0.75 and solve for t.

$$\begin{aligned}
(t - 10)(0.02) &= 0.75 \\
t - 10 &= \frac{0.75}{0.02} = 37.5 && \text{Divide both sides by 0.02.} \\
t &= 37.5 + 10 = 47.5 && \text{Add 10 to both sides.}
\end{aligned}$$

Seventy-five percent of all voters vote within 47.5 minutes.

TRY IT NOW

GO TO EXERCISE 6.15
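The backward step in Example 6.3 generalizes: for a uniform random variable on [a, b], solving (t - a)/(b - a) = p for t gives t = a + p(b - a). The sketch below applies that formula to the voting-time numbers; `uniform_quantile` is a hypothetical helper name introduced here for illustration.

```python
def uniform_quantile(p: float, a: float, b: float) -> float:
    """Return t such that P(X <= t) = p when X is uniform on [a, b]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if a >= b:
        raise ValueError("need a < b")
    # Area of the rectangle from a to t is (t - a) * 1/(b - a); set it equal to p.
    return a + p * (b - a)

t = uniform_quantile(0.75, 10, 60)
print(t)  # the time by which 75% of voters have voted

# Check by going forward: width times height should recover 0.75.
forward = (t - 10) * (1 / 50)
```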

SECTION 6.1 EXERCISES

Concept Check
6.1 True/False The graph of a probability density function may extend below the x axis.
6.2 True/False For a continuous random variable X with probability density function f, $P(X = x) = f(x)$.
6.3 True/False For a continuous random variable there is no probability associated with a single value.
6.4 True/False The mean, $\mu$, and the variance, $\sigma^2$, of a continuous random variable describe the center and spread of the distribution.
6.5 Short Answer Explain how to compute probabilities for a continuous random variable.
6.6 Short Answer Explain why a cumulative distribution function can never have a value greater than 1.

Practice
6.7 Suppose X is a uniform random variable with a = 0 and b = 16.
a. Carefully sketch a graph of the probability density function for X.
b. Find the mean, variance, and standard deviation of X.
c. Find $P(X \ge 4)$.
d. Find $P(2 \le X < 12)$.
e. Find $P(X \le 7)$.

6.8 Suppose X is a uniform random variable with a = -25 and b = 25.
a. Carefully sketch a graph of the probability density function for X.
b. Find the mean, variance, and standard deviation of X.
c. Find $P(-10 < X < 21)$.
d. Find $P(X > 0)$ and $P(X \ge 0)$.
e. Find $P(X \ge 20 \mid X \ge 10)$.

6.9 Suppose X is a uniform random variable with a = 50 and b = 100.
a. Find the mean, variance, and standard deviation of X.
b. Find $P(\mu - \sigma \le X \le \mu + \sigma)$.
c. Find $P(X \ge \mu + 2\sigma)$.
d. Find a value c such that $P(X \le c) = 0.20$.

6.10 Suppose X is a uniform random variable with a = 25 and b = 65.
a. Find the mean, variance, and standard deviation of X.
b. Find the probability that X is more than two standard deviations from the mean.
c. Find a value c such that $P(X \ge c) = 0.40$.
d. Suppose two values of X are selected at random. What is the probability that both values are between 30 and 40?

6.11 Suppose X is a continuous random variable with probability density function given by

$$f(x) = \begin{cases} \dfrac{x}{8} & \text{if } 0 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$$

(Figure: graph of f(x) against x.)

a. Find $P(X \le 1)$.
b. Find $P(X > 3)$.
c. Find $P(X > 4)$.
d. Find $P(2 \le X \le 3)$.
e. Find $P(X \le 2 \mid X \le 3)$.
f. Find a value c such that $P(X \le c) = 0.5$. Explain why c is not equal to 2.

Applications
6.12 Manufacturing and Product Development A Gold Canyon candle is designed to last nine hours. However, depending on the wind, air bubbles in the wax, the quality of the wax, and the number of times the candle is re-lit, the actual burning time (in hours) is a uniform random variable with a = 6.5 and b = 10.5. Suppose one of these candles is randomly selected.
a. Find the probability that the candle burns at least seven hours.
b. Find the probability that the candle burns at most eight hours.
c. Find the mean burning time and the probability that the burning time of a randomly selected candle will be within one standard deviation of the mean.
d. Find a time t such that 25% of all candles burn longer than t hours.

6.13 Sports and Leisure According to Major League Baseball rules, a baseball should weigh between 5 and 5.25 ounces and have a circumference of between 9 and 9.25 inches. Suppose the weight of a baseball (in ounces) has a uniform distribution with a = 5.085 and b = 5.155, and the circumference (in inches) has a uniform distribution with a = 9.0 and b = 9.1.
a. Find the probability that a randomly selected baseball has a weight greater than 5.14 ounces. Write a Solution Trail for this problem.
b. Find the probability that a randomly selected baseball has a circumference less than 9.03 inches.
c. Suppose the weight and the circumference are independent. Find the probability that a randomly selected baseball will have a weight between 5.11 and 5.13 ounces and a circumference between 9.04 and 9.06 inches.

6.14 Manufacturing and Product Development Premanufactured wooden roof trusses allow builders to complete projects faster and with lower on-site labor costs. The connector plates for trusses are made from Grade A steel and are hot-dip galvanized. The thickness of a truss connector (in inches) varies slightly and has a uniform distribution with a = 0.036 and b = 0.050.
a. If the manufacturer will only use connectors with a minimum thickness of 0.04 inch, what proportion of connectors is rejected?
b. Suppose a truss connector is selected at random. Find the probability that the truss connector has a thickness between 0.042 and 0.045 inch.
c. Find the mean, variance, and standard deviation of the thickness of a truss connector.

6.15 Travel and Transportation When the Department of Transportation (DOT) repaints the center lines, edge lines, or no-passing-zone lines on a highway, epoxy paint is sometimes applied. This paint is more expensive than latex but lasts longer. If this paint splashes onto a vehicle, it has to be completely sanded off, and that area of the vehicle has to be repainted. The DOT has warned motorists that the drying time for this epoxy paint (in minutes) has a uniform distribution with a = 30 and b = 60. Suppose epoxy paint is applied to a small section of center line.
a. What is the probability that the paint will be dry within 45 minutes?
b. What is the probability that the paint will be dry in between 40 and 50 minutes?
c. Find a value t such that the probability of the paint taking at least t minutes to dry is 0.75.
d. If the DOT road crew removes all of the cones on the center line 55 minutes after painting, what is the probability that the paint will still be wet?

6.16 Physical Sciences In Grafton, a rural area in Vermont, the distance (in meters) between telephone poles has a uniform distribution with a = 40 and b = 65. Suppose two consecutive telephone poles are selected at random.
a. What is the probability that the distance between the poles is less than 60 meters?
b. What is the probability that the distance between the poles is between 45 and 55 meters? Write a Solution Trail for this problem.
c. Any distance between poles greater than 50 meters is considered to be environmentally friendly. What is the probability that the distance is environmentally friendly?

Extended Applications
6.17 Psychology and Human Behavior Some of the common medications to help people with insomnia fall asleep include Ambien, Lunesta, and Sonata. Ambien CR is an extended-release variation and is formulated so that an individual will fall asleep within 30 minutes.3 The probability density function for X, the time (in minutes) it takes to fall asleep after taking an Ambien CR tablet, is given below.

$$f(x) = \begin{cases} 0.05 & \text{if } 0 \le x \le 10 \\ -0.0025(x - 30) & \text{if } 10 < x \le 30 \\ 0 & \text{otherwise} \end{cases}$$

(Figure: graph of f(x) for x from 0 to 30.)

a. Verify that this is a valid probability density function.
b. If a randomly selected person takes an Ambien CR tablet at bedtime, what is the probability that he will fall asleep within 5 minutes?
c. What is the probability that the person will fall asleep between 20 and 30 minutes after taking a tablet?
d. Find a value t such that the probability of falling asleep within t minutes after taking a tablet is 0.75.
e. If it takes less than 15 minutes to fall asleep after taking a tablet, people consider the medication a success. Suppose 20 people are selected at random. What is the probability that exactly 14 people fall asleep successfully? What is the probability that at least 16 people fall asleep successfully? What is the probability that at most 10 people fall asleep successfully?

6.18 Marketing and Consumer Behavior The city of Kingston, Ontario, has approximately 3700 parking spaces. Metered parking is available on some streets and in certain parking garages, and the maximum length of stay at a metered spot is between one and three hours.4 The probability density function for the length of time a car is parked (in hours) at a metered spot in a certain lot is given below. Suppose a car parked at a metered spot in this lot is selected at random.

(Figure: graph of the probability density function f(x) against x, in hours.)

a. What is the probability that the car is parked for less than 2 hours?
b. What is the probability that the car is parked for less than 1.4 hours?
c. What is the probability that the car is parked for more than 2.6 hours?
d. What is the probability that the car is parked for between 1.4 and 2.6 hours?

6.19 Marketing and Consumer Behavior Marini’s candy store on the beach boardwalk in Santa Cruz sells candy in bulk. Customers can mix products from over 100 barrels. The probability distribution for the number of pounds of candy purchased by a randomly selected customer is shown below.

(Figure: graph of the probability density function f(x) against x, in pounds.)

a. Verify that this is a valid probability density function.
b. Find the probability that the next customer buys at most 2 pounds of candy.
c. Find the probability that the next customer buys more than 1 pound of candy.
d. Suppose the next customer buys at most 1.5 pounds of candy. What is the probability that she buys at most 0.5 pound of candy?

6.20 Economics and Finance On any given trading day, the fluctuation, or change, in the price (in dollars) of JP Morgan Chase stock, listed on the New York Stock Exchange, is between -2.00 and 2.00. Suppose the change in price is a random variable with the probability density function shown below.

(Figure: graph of the probability density function f(x) against x, in dollars.)

a. Verify that this is a valid probability density function.
b. What is the probability that the stock price increases by at least $1.00 on a randomly selected day?
c. What is the probability that the change in stock price is between -1.00 and 1.00?
d. Find a value c such that $P(-c \le X \le c) = 0.90$.

6.21 Medicine and Clinical Studies Although we all experience inflammation associated with a bruise or sprain, inflammation can also affect the cells in the body. C-reactive protein (CRP) is a measure of inflammation and can be part of a routine blood test. For healthy adults, the CRP level is less than 5 milligrams per liter of blood. Suppose the probability distribution for X, the CRP level in healthy adults, is given as follows:

$$f(x) = \begin{cases} -0.08(x - 5) & \text{if } 0 \le x \le 5 \\ 0 & \text{otherwise} \end{cases}$$

(Figure: graph of f(x) against x.)

a. Verify that this is a valid probability distribution.
b. What is the probability that a randomly selected healthy adult will have a CRP level less than 2.5?
c. What is the probability that a randomly selected healthy adult will have a CRP level between 2 and 3?
d. Find a value c such that 5% of healthy adults have a CRP level of at least c.
e. If a patient has a CRP level of at least 4, then additional testing is done. What is the probability that a healthy adult will need additional testing?
f. What is the probability that the fifth healthy adult will be the first to need additional testing?

Challenge
6.22 Marketing and Consumer Behavior Dinner customers at the Primanti Brothers restaurant in Pittsburgh, Pennsylvania, often experience a long wait for a table. For a randomly selected customer who arrives at the restaurant between 6:00 P.M. and 7:00 P.M., the waiting time (in minutes) is a continuous random variable such that

$$P(X \le x) = \begin{cases} 1 - e^{-0.05x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Suppose a dinner customer is randomly selected.
a. What is the probability that the person must wait for a table for at most 20 minutes?
b. What is the probability that the person must wait for a table for over one half-hour?
c. What is the probability that the person must wait for a table for between 15 and 30 minutes?

6.23 Psychology and Human Behavior Parents with children under age 16 often spend a lot of time during the day driving their kids to various places, for example, to/from after-school activities, music practice, sports practices and games, the library, and a friend’s home. Suppose a family has k child(ren) under 16 (k = 1, 2, 3, 4, 5), and let the random variable $X_k$ be the time (in hours) spent taxiing during the day. $X_k$ has a uniform distribution with a = 0 and b = k. For example, for a family with two children, $X_2$ has a uniform distribution with a = 0 and b = 2.
a. For a family with three children, what is the probability that parents will spend less than one hour driving kids on a randomly selected day?
b. For a family of four children, what is the mean number of hours spent driving kids? What is the probability that the driving time will be greater than two standard deviations from the mean?
c. For a family with five children under 16, find a time t such that the probability of driving kids more than t hours is 0.25.
d. Suppose five families are selected at random, the first with one child under 16, the second with two children under 16, etc. What is the probability that all five families drive less than 30 minutes on a randomly selected day? What is the probability that all five families drive more than 90 minutes on a randomly selected day?

6.24 Suppose X is a continuous random variable such that

$$P(X \le x) = \begin{cases} 1 - e^{-x^2/8} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

a. Find $P(X \le 4)$.
b. Find $P(X > 2)$.
c. Find $P(1 \le X \le 3)$.
d. Find $P(X \le 2 \mid X \le 4)$.

6.25 Sports and Leisure A figure skating routine is designed to last six minutes. The amount of time (in minutes) less than or greater than six minutes is a random variable, X, with a probability density function given by

$$f(x) = \begin{cases} \sqrt{\dfrac{2}{\pi} - x^2} & \text{if } -\sqrt{\dfrac{2}{\pi}} \le x \le \sqrt{\dfrac{2}{\pi}} \\ 0 & \text{otherwise} \end{cases}$$

If the value of X is negative, then the routine was shorter than six minutes; if the value of X is positive, the routine went too long.
a. Carefully sketch a graph of the density function.
b. Find the probability that a randomly selected performance is within $1/\sqrt{\pi}$ minutes of 6. That is, find

$$P\left(-\frac{1}{\sqrt{\pi}} \le X \le \frac{1}{\sqrt{\pi}}\right)$$


6.2 The Normal Distribution

The normal probability distribution is very common and is the most important distribution in all of statistics. This bell-shaped density curve can be used to model many natural phenomena, and the normal distribution is used extensively in statistical inference. Recall that a random variable is completely described by certain parameters: for example, a binomial random variable by n and p, and a Poisson random variable by $\lambda$. A normal distribution is completely characterized, or determined, by its mean $\mu$ and variance $\sigma^2$ (or by its mean $\mu$ and standard deviation $\sigma$).

The Normal Probability Distribution
Suppose X is a normal random variable with mean $\mu$ and variance $\sigma^2$. The probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \tag{6.5}$$

and

$$-\infty < x < \infty \qquad -\infty < \mu < \infty \qquad \sigma^2 > 0 \tag{6.6}$$

A CLOSER LOOK

1. In this probability density function, e is the base of the natural logarithm; $e \approx 2.71828$. (We’ve seen e before, in the Poisson distribution.) $\pi$ is another constant, commonly used in trigonometry; $\pi \approx 3.14159$.
2. We use the shorthand notation $X \sim N(\mu, \sigma^2)$ to indicate that X is (distributed as) a normal random variable with mean $\mu$ and variance $\sigma^2$. For example, $X \sim N(5, 36)$ means that X is a normal random variable with mean $\mu = 5$ and variance $\sigma^2 = 36$ (and $\sigma = 6$).
3. Equation 6.6 means that x can be any real number (the density curve continues forever in both directions), the mean $\mu$ can be any real number (positive or negative), and the variance can be any positive real number.
4. For any mean $\mu$ and variance $\sigma^2$, the density curve is symmetric about the mean $\mu$, unimodal, and bell-shaped as shown in Figure 6.21. Bell-shaped means: Place a bell on a table and pass a plane (a piece of paper) through the bell perpendicular to the table. The intersection of the plane and the bell is a bell-shaped curve.

Figure 6.21 Graph of the probability density function for a normal random variable with mean $\mu$ and variance $\sigma^2$. The curve is concave down between $\mu - \sigma$ and $\mu + \sigma$ and concave up outside that interval.

▲ The graph of the probability density function changes concavity at $x = \mu - \sigma$ and again at $x = \mu + \sigma$.
▲ The mean is equal to the median because the normal distribution is symmetric.
▲ It can be shown (using calculus) that the total area under this density curve is 1 (even though it extends forever in both directions, getting closer and closer to the x axis but never touching it).

5. The mean $\mu$ is a location parameter, and the variance $\sigma^2$ determines the spread of the distribution. As the variance increases, the total area under the probability density function (1) is rearranged. The graph is compressed down and pushed out (on the tails). Figures 6.22 and 6.23 show the effects of $\mu$ and $\sigma^2$ on the location (center) and spread of the density curve.

Figure 6.22 Normal probability density function with $\mu = 7$ and small $\sigma^2$.

Figure 6.23 Normal probability density function with $\mu = 12$ and large $\sigma^2$.
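To make the roles of the mean and variance concrete, the density in Equation 6.5 can be evaluated numerically. The sketch below is illustrative only: the text gives the means 7 and 12 for Figures 6.22 and 6.23 but no numeric variances, so the standard deviations 1 and 3 are assumed values, and the helper names `normal_pdf` and `area_under_pdf` are our own. The area check uses a simple midpoint rule rather than calculus.

```python
import math

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Equation 6.5: density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_under_pdf(mu: float, sigma: float, lo: float, hi: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the area under the density on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(normal_pdf(lo + (i + 0.5) * h, mu, sigma) for i in range(n))

# Assumed standard deviations (not from the text): 1 for "small", 3 for "large".
for mu, sigma in ((7.0, 1.0), (12.0, 3.0)):
    total = area_under_pdf(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
    peak = normal_pdf(mu, mu, sigma)
    print(f"mu={mu}, sigma={sigma}: area ~ {total:.6f}, peak height = {peak:.4f}")
```

The larger standard deviation produces a lower peak while the total area stays at 1, which is the "compressed down and pushed out" behavior described in item 5.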

Suppose X is a normal random variable with mean $\mu$ and variance $\sigma^2$: $X \sim N(\mu, \sigma^2)$. The probability that X lies in some interval, for example $[a, b]$, is the area under the density curve between a and b (Figure 6.24).

Figure 6.24 The shaded region corresponds to $P(a \le X \le b)$.

The shaded region in Figure 6.24 is not a simple geometric figure; it’s bounded by a curve! Consequently, there is no simple formula for the area of this region, corresponding to $P(a \le X \le b)$. However, a probability statement associated with any normal random variable can be transformed into an equivalent expression involving a standard normal random variable (defined below). Cumulative probabilities associated with the standard normal distribution are provided in Appendix A, Table III.

The Standard Normal Random Variable
The normal distribution with $\mu = 0$ and $\sigma^2 = 1$ (and $\sigma = 1$) is called the standard normal distribution. A random variable that has a standard normal distribution is called a standard normal random variable, usually denoted Z. The probability density function for Z is given by

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad -\infty < z < \infty \tag{6.7}$$

(Let $\mu = 0$ and $\sigma = 1$ in Equation 6.5.)

A CLOSER LOOK

1. In Equation 6.7 the independent variable z is used to define the probability density function simply because the standard normal random variable is usually denoted by Z.
2. Figure 6.25 shows a graph of the probability density function for a standard normal random variable. The mean is $\mu = 0$ and the standard deviation is $\sigma = 1$. Note that most of the probability (area) is within three standard deviations of the mean, between -3 and 3. The shorthand notation $Z \sim N(0, 1)$ means Z is a normal random variable with mean 0 and variance 1. We will often refer to a standard normal distribution as a Z world.

Figure 6.25 Graph of the probability density function for a standard normal random variable.

3. The standard normal distribution is not common, but it is used extensively as a reference distribution. Any probability statement involving any normal random variable can be transformed into an equivalent expression (with the same probability) involving a Z random variable. We will learn how to standardize shortly. Therefore, you need to become an expert at computing probabilities in the Z world. Probabilities associated with Z are computed using cumulative probability, as shown below. Figure 6.26 shows the steps for computing probabilities associated with a normal random variable.

Probability statement involving $X \sim N(\mu, \sigma^2)$ → Standardize → Probability statement involving $Z \sim N(0, 1)$ → Use cumulative probability → Final answer

Figure 6.26 Strategy for computing a probability associated with any normal random variable.

Probabilities associated with a standard normal random variable, Z, are computed using cumulative probability. Table III in Appendix A contains values for $P(Z \le z)$ for selected values of z. Figure 6.27 shows the geometric region corresponding to $P(Z \le z)$, and Figure 6.28 illustrates the use of Table III in Appendix A. Locate the units and tenths digits in z along the left side of the table. Find the hundredths digit in z across the top row. The intersection of this row and column, in the body of the table, contains the cumulative probability.

Figure 6.27 The shaded area under the standard normal density curve corresponds to $P(Z \le z)$.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
⋮
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
⋮

Figure 6.28 $P(Z \le 1.23) = 0.8907$ in Table III, Appendix A.
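A table entry such as the one highlighted in Figure 6.28 can also be computed directly. One standard identity expresses the standard normal cumulative probability through the error function: $P(Z \le z) = \tfrac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$. The sketch below uses that identity with Python's `math.erf`; this is not necessarily how Table III was produced, and the function name `phi` is our own.

```python
import math

def phi(z: float) -> float:
    """Cumulative probability P(Z <= z) for a standard normal Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Row 1.2, column 0.03 of the table excerpt in Figure 6.28.
p = phi(1.23)
print(round(p, 4))  # 0.8907
```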

The following example illustrates the use of Table III in Appendix A to find probabilities associated with Z.

Example 6.4 Probability Calculations Associated with the Standard Normal Distribution
Use Table III in Appendix A to find each probability associated with the standard normal distribution.
a. $P(Z \le 1.45)$
b. $P(Z \ge -0.6)$
c. $P(-1.25 \le Z \le 2.13)$
d. Find the value b such that $P(Z \le b) = 0.90$.

SOLUTION
a. This expression is already cumulative probability. Go directly to Table III in the Appendix, and find the intersection of row 1.4 and column 0.05. See Figure 6.29.

$$P(Z \le 1.45) = 0.9265 \qquad \text{Cumulative probability; use Table III in Appendix A.}$$

Figure 6.30 shows a technology solution.

Figure 6.29 The area of the shaded region is $P(Z \le 1.45)$.

Figure 6.30 $P(Z \le 1.45)$.

b. This is a right-tail probability. Convert to cumulative probability and use Table III in the Appendix. See Figure 6.31.

$$\begin{aligned}
P(Z \ge -0.6) &= 1 - P(Z < -0.6) && \text{The complement rule.} \\
&= 1 - P(Z \le -0.6) && \text{One value doesn’t matter.} \\
&= 1 - 0.2743 = 0.7257 && \text{Use Table III in the Appendix.}
\end{aligned}$$

Figure 6.32 shows a technology solution.

Figure 6.31 The area of the shaded region is $P(Z \ge -0.6)$.

Figure 6.32 $P(Z \ge -0.6)$.

c. Find all the probability up to 2.13, find all the probability up to -1.25, and subtract. The difference is the probability that Z lies in this interval. See Figure 6.33.

$$\begin{aligned}
P(-1.25 \le Z \le 2.13) &= P(Z \le 2.13) - P(Z < -1.25) && \text{Use cumulative probability.} \\
&= P(Z \le 2.13) - P(Z \le -1.25) && \text{One value doesn’t matter.} \\
&= 0.9834 - 0.1056 = 0.8778 && \text{Use Table III in the Appendix.}
\end{aligned}$$

Figure 6.34 shows a technology solution.

Figure 6.33 The area of the shaded region is $P(-1.25 \le Z \le 2.13)$.

Figure 6.34 $P(-1.25 \le Z \le 2.13)$.

d. In this problem, we need to work backward to find the solution. This is an inverse cumulative probability problem. The cumulative probability is given. We need the value b such that the cumulative probability is 0.90. See Figure 6.35. Search the body of Table III in Appendix A to find a cumulative probability as close to 0.90 as possible. Read the row and column entries to find b. In the body of Table III, the closest cumulative probability to 0.90 is 0.8997. This corresponds to $b \approx 1.28$. Figure 6.36 shows a technology solution.

Figure 6.35 The area of the shaded region is $0.90 = P(Z \le b)$.

Figure 6.36 Inverse cumulative probability.

Note: Linear interpolation can be used to find a more exact answer. The technology solution presented in Figure 6.36 uses a special inverse cumulative probability function.

TRY IT NOW

GO TO EXERCISES 6.32 AND 6.34
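The four parts of Example 6.4 can be checked with software in place of Table III. The text's technology figures are not reproduced here, so the following is a separate sketch using `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the variable names are ours.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal random variable

p_a = Z.cdf(1.45)                   # a. already cumulative
p_b = 1.0 - Z.cdf(-0.6)             # b. right tail via the complement rule
p_c = Z.cdf(2.13) - Z.cdf(-1.25)    # c. difference of cumulative probabilities
b = Z.inv_cdf(0.90)                 # d. inverse cumulative probability

print(f"a: {p_a:.4f}  b: {p_b:.4f}  c: {p_c:.4f}  d: b = {b:.4f}")
```

For part (d), software returns a value slightly above the table reading of 1.28 because it does not stop at the nearest tabled probability.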

Interpolation
Recall: interpolation is a method of approximation. It is often used to estimate a value at a position between two given values in a table. Linear interpolation assumes that the two known values lie on a straight line. In Example 6.4(d), 0.90 is between the Table III known cumulative probabilities 0.8997 and 0.9015. Suppose the two points (1.28, 0.8997) and (1.29, 0.9015) lie on a straight line. The approximate z value corresponding to the cumulative probability 0.90 is

$$1.28 + (0.01)\,\frac{0.90 - 0.8997}{0.9015 - 0.8997} = 1.2817$$

The following rule provides the connection between any normal random variable and the standard normal random variable.

Standardization Rule
If X is a normal random variable with mean $\mu$ and variance $\sigma^2$, then a standard normal random variable is given by

$$Z = \frac{X - \mu}{\sigma} \tag{6.8}$$

A CLOSER LOOK

1. The process of converting from X to Z is called standardization. Z is a standardized random variable. There are other types of standardization; $Z = (X - \mu)/\sigma$ is the most common.
2. Using this rule, any probability involving a normal random variable can be transformed into an equivalent expression involving a Z random variable. We can then convert to cumulative probability if necessary, and use Table III in the Appendix.
3. The rule above is illustrated in Figure 6.37, using cumulative probability.

Figure 6.37 An illustration of standardization. The region corresponding to $P(X \le x)$ under the $N(\mu, \sigma^2)$ density is mapped by $Z = (X - \mu)/\sigma$ to the region corresponding to $P(Z \le z)$ under the $N(0, 1)$ density. The areas of the shaded regions are equal.

The following calculation shows why the two shaded regions in Figure 6.37 have the same area, and how to use the rule to compute probabilities involving any normal random variable. Assume $X \sim N(\mu, \sigma^2)$.

Remember the phrase: Whatever you do to one side of the inequality, you have to do to the other side.

$$\begin{aligned}
& P(X \le x) && \text{The original (cumulative) probability statement.} \\
&= P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) && \text{Standardize.} \\
&= P(Z \le z) && \text{Apply the standardization rule.}
\end{aligned}$$

In the second line, work within the probability statement: subtract the mean of X and divide by the standard deviation of X, on both sides of the inequality. In the third line, the expression with X is transformed into Z, and the expression with x becomes some fixed value $z = (x - \mu)/\sigma$. Use Table III in the Appendix to find this probability.
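The equality of the two shaded areas can be confirmed numerically. The sketch below compares a cumulative probability computed directly for X with the one computed after standardizing. It borrows the distribution $N(5, 36)$ from the shorthand-notation example earlier in the section; the cutoff x = 8 is a hypothetical value chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 5.0, 6.0      # X ~ N(5, 36), so the standard deviation is 6
x = 8.0                   # hypothetical cutoff, not from the text

# Direct cumulative probability for X.
p_direct = NormalDist(mu, sigma).cdf(x)

# Standardize the cutoff, then use the standard normal distribution.
z = (x - mu) / sigma
p_standardized = NormalDist(0.0, 1.0).cdf(z)

print(z, p_direct, p_standardized)
```

The two probabilities agree up to floating-point rounding, which is what the standardization rule asserts.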

The examples below involve normal random variables and standardization. The hardest part of these types of problems is (as before) (1) to define and identify the probability distribution, and (2) to write a probability statement. Given a probability statement involving a normal random variable, all we have to do is standardize and use cumulative probability. Even for backward problems (with a known probability), we still standardize and still use cumulative probability. Note that the technology solutions presented do not require standardization.

Example 6.5 Probability Calculations Associated with a Normal Random Variable
Suppose X is a normal random variable with mean 10 and variance 4: $X \sim N(10, 4)$, and $\sigma = \sqrt{4} = 2$.
a. Find $P(X > 12.5)$.
b. Find $P(9 \le X \le 10)$.
c. Find the value b such that $P(X \le b) = 0.75$.

SOLUTION
a. X is normal. We know the mean and standard deviation. Standardize and use cumulative probability associated with Z.

$$\begin{aligned}
P(X > 12.5) &= P\left(\frac{X - 10}{2} > \frac{12.5 - 10}{2}\right) && \text{Standardize.} \\
&= P(Z > 1.25) && \text{Equation 6.8; simplify.} \\
&= 1 - P(Z \le 1.25) && \text{The complement rule.} \\
&= 1 - 0.8944 = 0.1056 && \text{Use Table III in the Appendix.}
\end{aligned}$$

Figure 6.38 illustrates this solution.

Figure 6.38 Example 6.5 part (a) standardization illustrated: 10 is transformed to 0, and 12.5 is transformed to 1.25. The areas of the shaded regions, $P(X > 12.5)$ under $N(10, 4)$ and $P(Z > 1.25)$ under $N(0, 1)$, are the same.

b. Standardize again. Work within the probability statement to write an equivalent expression involving Z.

$$\begin{aligned}
P(9 \le X \le 10) &= P\left(\frac{9 - 10}{2} \le \frac{X - 10}{2} \le \frac{10 - 10}{2}\right) && \text{Standardize.} \\
&= P(-0.5 \le Z \le 0) && \text{Use Equation 6.8; simplify.} \\
&= P(Z \le 0) - P(Z < -0.5) && \text{Use cumulative probability.} \\
&= P(Z \le 0) - P(Z \le -0.5) && \text{One value doesn’t matter.} \\
&= 0.5000 - 0.3085 = 0.1915 && \text{Use Table III in the Appendix.}
\end{aligned}$$

Standardization illustrated, part (b): the region for $P(9 \le X \le 10)$ corresponds to the region for $P(-0.5 \le Z \le 0)$.

c. Convert the expression into cumulative probability involving Z. Because the probability is already given, this is an inverse cumulative probability problem. Work backward in Appendix Table III.

$$\begin{aligned}
P(X \le b) &= P\left(\frac{X - 10}{2} \le \frac{b - 10}{2}\right) && \text{Standardize.} \\
&= P\left(Z \le \frac{b - 10}{2}\right) = 0.75 && \text{Equation 6.8.}
\end{aligned}$$

There is no other simplification within the probability statement. However, the resulting probability statement involves Z, and is cumulative probability. Find a value in the body of Table III, Appendix A, as close to 0.75 as possible. Set the corresponding z equal to $(b - 10)/2$, and solve for b.

$$\begin{aligned}
\frac{b - 10}{2} &= 0.6745 && \text{Table III; interpolation.} \\
b - 10 &= 1.349 && \text{Multiply both sides by 2.} \\
b &= 11.349 && \text{Add 10 to both sides.}
\end{aligned}$$

Therefore, $P(X \le 11.349) = 0.75$ and hence b = 11.349.

Standardization illustrated, part (c): $P(X \le b) = 0.75$, where b = 11.349 on the x axis corresponds to z = 0.6745 on the z axis through $Z = (b - 10)/2$.

Figures 6.39–6.41 show technology solutions:

Figure 6.39 $P(X \ge 12.5)$.

Figure 6.40 $P(9 \le X \le 10)$.

Figure 6.41 Inverse cumulative probability.

TRY IT NOW

GO TO EXERCISES 6.39 AND 6.40
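As the text notes, technology solutions do not require standardization. One way to see this, sketched here with Python's standard `statistics.NormalDist` rather than the software shown in the text's figures, is to work directly with $X \sim N(10, 4)$. Note that `NormalDist` takes the standard deviation (2), not the variance (4).

```python
from statistics import NormalDist

X = NormalDist(mu=10.0, sigma=2.0)   # variance 4, so standard deviation 2

p_a = 1.0 - X.cdf(12.5)              # a. P(X > 12.5)
p_b = X.cdf(10.0) - X.cdf(9.0)       # b. P(9 <= X <= 10)
b = X.inv_cdf(0.75)                  # c. value b with P(X <= b) = 0.75

print(f"a: {p_a:.4f}  b: {p_b:.4f}  c: b = {b:.3f}")
```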

Example 6.6 Seat Pitch Seat pitch on a passenger airline is the distance from the back of one seat to the front of the one directly behind it. The greater the seat pitch, the more comfortable the seat and the less likely you are to travel with your knees against your chest. Some seats—for example, bulkhead seats—have a larger seat pitch.5 However, the seat pitch for all economy seats is normally distributed with mean 34 inches and standard deviation 0.5 inch.

a. For a randomly selected economy seat, find the probability that the seat pitch is between 33.25 and 34.75 inches (considered comfortable).
b. Any seat with a seat pitch less than 33 inches is considered constricted. Find the probability that a randomly selected economy seat is constricted.

Solution Trail 6.6
KEYWORDS
■ Normally distributed
■ Mean
■ Standard deviation
TRANSLATION
■ Normal random variable
■ $\mu = 34$
■ $\sigma = 0.5$
CONCEPTS
■ Normal probability distribution
■ Standardization
VISION
Define a normal random variable and translate each question into a probability statement. Standardize and use cumulative probability associated with Z if necessary.

SOLUTION
a. Let X be the seat pitch in inches. The keywords in the problem suggest $X \sim N(34, 0.25)$, $\sigma = 0.5$. Between 33.25 and 34.75 means in the interval [33.25, 34.75] (whether it is closed or open doesn’t matter). Recall that [33.25, 34.75] means $33.25 \le X \le 34.75$. Find the probability that X lies in this interval.

$$\begin{aligned}
P(33.25 \le X \le 34.75) &= P\left(\frac{33.25 - 34}{0.5} \le \frac{X - 34}{0.5} \le \frac{34.75 - 34}{0.5}\right) && \text{Standardize.} \\
&= P(-1.50 \le Z \le 1.50) && \text{Equation 6.8; simplify.} \\
&= P(Z \le 1.50) - P(Z \le -1.50) && \text{Use cumulative probability.} \\
&= 0.9332 - 0.0668 = 0.8664 && \text{Use Table III in the Appendix.}
\end{aligned}$$

The probability that a randomly selected economy seat has seat pitch between 33.25 and 34.75 inches is 0.8664.

b. A seat is constricted if the value of X is less than 33 inches. Find $P(X < 33)$.

$$\begin{aligned}
P(X < 33) &= P\left(\frac{X - 34}{0.5} < \frac{33 - 34}{0.5}\right) && \text{Standardize.} \\
&= P(Z < -2.00) && \text{Equation 6.8; simplify.} \\
&= 0.0228 && \text{Cumulative probability; use Table III in the Appendix.}
\end{aligned}$$

The probability that a randomly selected economy seat is constricted is 0.0228. Figure 6.42 shows a technology solution.

Standardization illustrated, part (b): the region for $P(X < 33)$ corresponds to the region for $P(Z < -2.00)$.

Figure 6.42 Normal probability calculations using JMP.

TRY IT NOW

GO TO EXERCISE 6.43
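As a cross-check on the table lookups in Example 6.6, the same two probabilities can be computed with Python's standard-library `statistics.NormalDist`. This is a sketch rather than the text's method (the text uses Table III and JMP); the variable names are illustrative. Note that `NormalDist` takes the standard deviation (0.5), not the variance (0.25).

```python
from statistics import NormalDist

# Seat pitch X ~ N(34, 0.25): mean 34 inches, standard deviation 0.5 inch.
seat_pitch = NormalDist(mu=34, sigma=0.5)

# Part (a): P(33.25 <= X <= 34.75) as a difference of cumulative probabilities.
p_comfortable = seat_pitch.cdf(34.75) - seat_pitch.cdf(33.25)

# Part (b): P(X < 33) is already a cumulative probability.
p_constricted = seat_pitch.cdf(33)

print(round(p_comfortable, 4), round(p_constricted, 4))
```

Small differences from the table answers in the fourth decimal place can occur because Table III values are rounded before subtraction.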

Example 6.7 Backpacks and Back Pain

Chronic back pain has become common in children because so many carry overfilled and overweight backpacks. Heavy school books, notebooks, calculators, and computer equipment, all crammed into a backpack and lugged around all day, increase the chance of neck and shoulder muscle spasms and lower-back pain. Research has shown that the total weight carried is directly related to the volume of a backpack. The volume of a randomly selected backpack sold commercially is normally distributed with mean 600 cubic inches and standard deviation 100 cubic inches. Find a symmetric interval about the mean volume, $[\mu - b, \mu + b]$, such that 95% of all backpack volumes lie in this interval.

Solution Trail 6.7

KEYWORDS
■ Normally distributed
■ Mean
■ Standard deviation
■ 95%

TRANSLATION
■ Normal random variable
■ $\mu = 600$
■ $\sigma = 100$
■ Probability 0.95

CONCEPTS
■ Normal probability distribution
■ Standardization

VISION
Define a random variable and translate the question into a probability statement. A probability is given (0.95), suggesting an inverse cumulative probability question.

SOLUTION

STEP 1 Let X be the volume (in cubic inches, in³) of a randomly selected backpack. The information given indicates that X is a normal random variable with mean $\mu = 600$ and standard deviation $\sigma = 100$: $X \sim N(600, 10{,}000)$.

Find a symmetric interval about the mean such that 95% of all backpack volumes lie in this interval translates as: Find a value of b such that $P(600 - b \le X \le 600 + b) = 0.95$. Figure 6.43 illustrates this probability statement.

Figure 6.43 A graphical representation of the probability statement. [The curve $N(600, 10{,}000)$ with area 0.95 between $600 - b$ and $600 + b$ and area 0.025 in each tail.]

STEP 2 This problem reduces to finding the value for b. This question involves a normal random variable, so we will certainly have to standardize. And, because the probability is given, this is a backward problem. To use Table III in the Appendix, we need a cumulative probability statement. We need another interpretation of Figure 6.43 involving cumulative probability and b. Here are two possibilities.

a. $P(X \le 600 - b) = 0.025$. The area (or probability) in the tails of the distribution is $1 - 0.95 = 0.05$ (the complement rule). The distribution is symmetric, so the probability to the left of $(600 - b)$ is $0.05/2 = 0.025$.

b. $P(X \le 600 + b) = 0.975$. The probability to the left of $(600 + b)$ is $0.95 + 0.025 = 0.975$.

STEP 3 We’ll use the expression in (a).

\[
\begin{aligned}
P(X \le 600 - b) &= P\left(\frac{X - 600}{100} \le \frac{(600 - b) - 600}{100}\right) && \text{Standardize.} \\
&= P\left(Z \le \frac{-b}{100}\right) = 0.025 && \text{Use Equation 6.8; simplify.}
\end{aligned}
\]

There is no further simplification within the probability statement. The resulting expression involves Z and is a cumulative probability. Find a value in the body of Table III in the Appendix as close to 0.025 as possible. Set the corresponding z equal to $\frac{-b}{100}$, and solve for b.

\[
\begin{aligned}
\frac{-b}{100} &= -1.96 && \text{Table III in the Appendix.} \\
-b &= -196.00 && \text{Multiply both sides by 100.} \\
b &= 196.00 && \text{Multiply both sides by } -1.
\end{aligned}
\]

[Margin figure: Standardization illustrated. The area $P(X \le 600 - b) = 0.025$ under $f(x)$ corresponds to the area $P\left(Z \le \frac{-b}{100}\right) = 0.025$ under $f(z)$, to the left of $z = -1.96$.]

The value of b is 196 and the symmetric interval about the mean is

\[
P(600 - b \le X \le 600 + b) = P(600 - 196 \le X \le 600 + 196) = P(404 \le X \le 796) = 0.95
\]

95% of all backpacks have a volume between 404 and 796 in³. Technology can be used to find the endpoints of the interval without solving for b. See Figure 6.44.

Figure 6.44 A technology solution: Use inverse cumulative probability to find each endpoint.

TRY IT NOW GO TO EXERCISE 6.46
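The remark that technology can find the endpoints directly can be illustrated with a short Python sketch. It assumes the standard-library `statistics.NormalDist` as the technology (the text's Figure 6.44 uses a different tool), and the variable names are illustrative. Each endpoint is an inverse cumulative probability: 0.025 for the lower endpoint and 0.975 for the upper.

```python
from statistics import NormalDist

# Backpack volume X ~ N(600, 10,000): mean 600, standard deviation 100.
volume = NormalDist(mu=600, sigma=100)

# Endpoints of the symmetric 95% interval via inverse cumulative probability.
lower = volume.inv_cdf(0.025)   # P(X <= lower) = 0.025
upper = volume.inv_cdf(0.975)   # P(X <= upper) = 0.975

# Half-width b of the interval [600 - b, 600 + b].
b = volume.mean - lower

# Check: the probability between the endpoints should be 0.95.
coverage = volume.cdf(upper) - volume.cdf(lower)

print(round(lower, 1), round(upper, 1), round(b, 2), round(coverage, 4))
```

The computed b is slightly less than 196 because Table III rounds the critical value to 1.96.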


Technology Corner

Procedure: Solve probability questions involving a normal random variable.
Reconsider: Example 6.5, solutions, and interpretations.


CrunchIt! There is a built-in function to compute probabilities associated with a normal random variable. Select Distribution Calculator; Normal. To find cumulative probability or right-tail probability, select the Probability tab. Choose an appropriate inequality symbol and enter a value for the endpoint. To solve an inverse cumulative probability problem, select the Quantile tab and enter a cumulative probability.

1. Select Distribution Calculator; Normal. Enter the mean, 10, and the standard deviation, 2. Under the Probability tab, select > and enter 12.5 for the endpoint. See Figure 6.45.
2. To find the probability that X takes on a value in an interval, use cumulative probability. $P(9 \le X \le 10) = P(X \le 10) - P(X \le 9)$.
3. Select Distribution Calculator; Normal. Enter the mean, 10, and the standard deviation, 2. Under the Quantile tab, enter the cumulative probability and click Calculate. See Figure 6.46.

Figure 6.45 $P(X \ge 12.5)$.

Figure 6.46 A solution to $P(X \le b) = 0.75$.

TI-84 Plus C The built-in function normalcdf is used to find (calculator) cumulative probability: the probability that X takes on a value between a and b. This function takes four arguments: a (lower), b (upper), $\mu$, and $\sigma$. The built-in function invNorm takes three arguments: p (area), $\mu$, and $\sigma$. This function returns a value x such that $P(X \le x) = p$. The default values for $\mu$ and $\sigma$ are 0 and 1, respectively.

1. Select DISTR ; DISTR; normalcdf. Enter the left endpoint (lower), 12.5, the right endpoint (upper), 1E99 (calculator infinity), the mean ($\mu$), 10, and the standard deviation ($\sigma$), 2. Highlight Paste and tap ENTER . Refer to Figure 6.39.
2. Select DISTR ; DISTR; normalcdf. Enter the left endpoint (lower), 9, the right endpoint (upper), 10, the mean ($\mu$), 10, and the standard deviation ($\sigma$), 2. Highlight Paste and tap ENTER . Refer to Figure 6.40.
3. Select DISTR ; DISTR; invNorm. Enter the cumulative probability (area), 0.75, the mean ($\mu$), 10, and the standard deviation ($\sigma$), 2. Highlight Paste and tap ENTER . Refer to Figure 6.41.

Minitab There are several built-in functions to compute cumulative probability, tail probability, and inverse cumulative probability. These functions may be accessed through a graphical input window or by using the command language.


1. In a session window, use the function CDF and the complement rule to find $P(X > 12.5)$. See Figure 6.47.
2. Select Graph; Probability Distribution Plot; View Probability. In the Distribution menu, select Normal. Enter the Mean and Standard deviation. Under the Shaded Area tab, choose X Value and Middle. Enter the X value 1 (9) and X value 2 (10). Click OK. Minitab displays a distribution plot with the shaded area corresponding to probability. See Figure 6.48.
3. Select Calc; Probability Distributions; Normal. Choose Inverse cumulative probability, enter the Mean, 10, and the Standard deviation, 2. Select Input constant (p), and enter 0.75. Click OK. The value of x is displayed in the session window. See Figure 6.49.

Figure 6.47 $P(X > 12.5)$ using the command language.

Figure 6.48 $P(9 \le X \le 10)$ using Probability Distribution Plot.

Figure 6.49 Inverse cumulative distribution function output.

Excel There are built-in functions to compute cumulative probability associated with a standard normal random variable (NORM.S.DIST) and a normal random variable with arbitrary mean and standard deviation (NORM.DIST). The functions NORM.S.INV and NORM.INV are the corresponding inverse cumulative probability functions.

1. Use the function NORM.DIST to find $P(X \le 12.5)$. Use the complement rule to find $P(X > 12.5)$. See Figure 6.50.
2. Use the function NORM.DIST to find $P(X \le 9)$ and $P(X \le 10)$. Compute the difference to find $P(9 \le X \le 10)$. See Figure 6.51.
3. Use the function NORM.INV. Enter the cumulative probability, 0.75, the mean, 10, and the standard deviation, 2. See Figure 6.52.

Figure 6.50 $P(X > 12.5)$.

Figure 6.51 $P(9 \le X \le 10)$.

Figure 6.52 Inverse cumulative probability.
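For readers without any of the packages above, the same three calculations can be sketched in Python using only the standard library. The helper names `normalcdf` and `inv_norm` below are hypothetical; they are written only to mirror the argument order of the calculator functions described above and are not part of any package named in this Technology Corner.

```python
from statistics import NormalDist

def normalcdf(lower: float, upper: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability that a normal X takes a value between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Value x such that P(X <= x) = area."""
    return NormalDist(mu, sigma).inv_cdf(area)

# 1. Right-tail probability P(X >= 12.5); 1e99 plays the role of "calculator infinity".
p1 = normalcdf(12.5, 1e99, 10, 2)

# 2. Interval probability P(9 <= X <= 10).
p2 = normalcdf(9, 10, 10, 2)

# 3. Inverse cumulative probability: b such that P(X <= b) = 0.75.
b = inv_norm(0.75, 10, 2)

print(round(p1, 4), round(p2, 4), round(b, 3))
```

As with the calculator, omitting the last two arguments gives the standard normal distribution, so `inv_norm(0.975)` returns approximately 1.96.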

SECTION 6.2 EXERCISES

Concept Check

6.26 True/False The probability density function for any normal random variable is bell-shaped.

6.27 True/False The mean and variance of a normal random variable determine the location and spread of the distribution.

6.28 Fill in the Blank The standard normal random variable has mean _____________ and variance _____________.

6.29 Fill in the Blank Any probability statement involving a normal random variable can be converted to an equivalent statement involving a standard normal random variable through the process of _____________.

6.30 Multiple Choice For any normal random variable X, the statement $P(X \le x)$ is (a) cumulative probability; (b) inverse cumulative probability; (c) standardized.

Practice

6.31 Let the random variable Z have a standard normal distribution. Find each of the following probabilities and carefully sketch a graph corresponding to each expression. a. $P(Z \le 2.16)$ b. $P(Z < 2.16)$ c. $P(Z \le -0.47)$ d. $P(0.73 > Z)$ e. $P(-1.75 \ge Z)$ f. $P(-0.35 \le Z \le 0.65)$ g. $P(Z < 5)$ h. $P(Z \le -4)$ i. $P(Z \le 4)$

6.32 Let the random variable Z have a standard normal distribution. Find each of the following probabilities and carefully sketch a graph corresponding to each expression. a. $P(-1.33 > Z)$ b. $P(Z < 2.35)$ c. $P(Z > 2.59)$ d. $P(-1.56 < Z < -0.56)$ e. $P(0.13 < Z < 2.44)$ f. $P(-0.05 < Z < 0.76)$ g. $P(Z \ge 2.67)$ h. $P(Z \le 1.42)$ i. $P(Z \le -2.00 \cup Z \ge 2.00)$ j. $P(-1.82 < Z \le -0.94)$

6.33 Let the random variable Z have a standard normal distribution. Find each of the following probabilities. a. $P(-1.00 \le Z \le 1.00)$ b. $P(-2.00 \le Z \le 2.00)$ c. $P(-3.00 \le Z \le 3.00)$ Do you recognize these three probabilities? What rule are they associated with?

6.34 Let the random variable Z have a standard normal distribution. Solve each expression for b. Carefully sketch a graph corresponding to each probability statement. a. $P(Z \le b) = 0.8686$ b. $P(Z < b) = 0.1867$ c. $P(Z < b) = 0.0016$ d. $P(Z \ge b) = 0.2643$ e. $P(Z > b) = 0.9382$ f. $P(Z \ge b) = 0.5000$ g. $P(b < Z) = 0.0192$ h. $P(b > Z) = 0.9938$ i. $P(-b < Z < b) = 0.7995$ j. $P(-b \le Z \le b) = 0.5527$

6.35 Let the random variable Z have a standard normal distribution. Solve each expression for b. Carefully sketch a graph corresponding to each probability statement. a. $P(Z \le b) = 0.5100$ b. $P(Z > b) = 0.1080$ c. $P(Z \ge b) = 0.0500$ d. $P(Z \le b) = 0.0100$ e. $P(-b \le Z \le b) = 0.8000$ f. $P(-b < Z < b) = 0.6535$

6.36 Let the random variable Z have a standard normal distribution. Recall the definition for percentiles. $P(Z \le 1.0364) = 0.85$, so 1.0364 is the 85th percentile. Find each of the following percentiles for a standard normal distribution. a. 10th b. 27th c. 85th d. 40th e. 49th f. 61st

6.37 Let Z be a standard normal random variable and recall the calculations necessary to construct a box plot. a. Find the first and third quartiles for a standard normal distribution. b. Find the inner fences for a standard normal distribution. c. Find the probability that Z is beyond the inner fences. d. Find the outer fences for a standard normal distribution. e. Find the probability that Z is beyond the outer fences.

6.38 Compute each probability and carefully sketch a graph corresponding to each expression. a. $X \sim N(3, 0.0225)$, $P(X \le 3.25)$ b. $X \sim N(52, 49)$, $P(X > 60)$ c. $X \sim N(27, 1)$, $P(X \le 24.5)$ d. $X \sim N(235, 121)$, $P(X > 200)$ e. $X \sim N(242, 132)$, $P(X \ge 350)$ f. $X \sim N(1.17, 3.94)$, $P(X < 21.45)$

6.39 Use technology to compute each probability and to carefully sketch a graph corresponding to each expression. a. $X \sim N(3.7, 4.55)$, $P(3.0 \le X \le 4.0)$ b. $X \sim N(62, 100)$, $P(50 < X < 70)$ c. $X \sim N(32, 30)$, $P(X \ge 45)$ d. $X \sim N(77, 0.01)$, $P(X < 76.95)$ e. $X \sim N(-50, 16)$, $P(X < -55 \cup X > -45)$ f. $X \sim N(7.6, 12)$, $P(8 \le X \le 9)$

6.40 Use technology to solve each expression for b. a. $X \sim N(17, 28)$, $P(X < b) = 0.75$ b. $X \sim N(303, 70)$, $P(X \le b) = 0.05$ c. $X \sim N(0, 25)$, $P(-b \le X \le b) = 0.90$ d. $X \sim N(212, 2)$, $P(X > b) = 0.35$ e. $X \sim N(37, 2.25)$, $P(\mu - b \le X \le \mu + b) = 0.68$ f. $X \sim N(26.35, 7.21)$, $P(X < b) = 0.11$

6.41 Suppose X is a normal random variable with mean 25 and standard deviation 6: $X \sim N(25, 36)$. a. Find the first and third quartiles for X. b. Find the inner fences for X. c. Find the probability that X is beyond the inner fences. d. Find the outer fences for X. e. Find the probability that X is beyond the outer fences.

Applications

6.42 Economics and Finance San Francisco is one of the most expensive cities in which to live in the United States. As of February 2013, the mean rent for a one-bedroom apartment in the Mission District was $2600.⁶ Assume that the distribution of rents is approximately normal and the standard deviation is $200. A one-bedroom apartment in the Mission District is selected at random. a. Find the probability that the rent is less than $2450. b. Find the probability that the rent is between $2500 and $2650. c. Find a rent r such that 90% of all rents are less than r dollars per month. Write a Solution Trail for this problem.

6.43 Marketing and Consumer Behavior According to an annual survey conducted by TheKnot.com, the mean cost of a wedding in 2012 was $28,427.⁷ This is still less than the high in 2008, but it reflects increasing confidence in the economy. Suppose the cost for a wedding is normally distributed, with a standard deviation of $1500, and a wedding is selected at random. a. Find the probability that the wedding costs more than $31,000. b. Find the probability that the wedding costs between $26,000 and $30,000. c. Find the probability that the wedding costs less than $25,000.

6.44 Public Policy and Political Science The President of

the United States gives a State of the Union Address every year in late January or early February. Since Lyndon Johnson’s address in 1966, Richard Nixon gave some of the shortest messages, and Bill Clinton presented a 1-hour and 28-minute address in 2000. The mean length for these addresses is 51.75 minutes and the standard deviation is 14.37 minutes. Assume the length of a State of the Union Address is normally distributed. a. What is the probability that the next State of the Union Address will be between 45 and 55 minutes long? b. What is the probability that the next State of the Union Address will be more than 90 minutes long? c. What is the probability that the next two State of the Union Addresses will be less than 30 minutes long?

6.45 Manufacturing and Product Development A standard Versa-lok block used in residential and commercial retaining wall systems has mean weight 37.19 kg.⁸ Assume the standard deviation is 0.8 kg and the distribution is approximately normal. A standard block unit is selected at random. a. What is the probability that the block weighs more than 38 kg? Write a Solution Trail for this problem. b. What is the probability that the block weighs between 36 and 37 kg? c. If the block weighs less than 35.5 kg, it cannot be used in certain commercial construction projects. What is the probability that the block cannot be used?

6.46 Marketing and Consumer Behavior Movie trailers are designed to entice audiences by showing scenes from coming attractions. Several trailers are usually shown in a theater before the start of the main feature, and most are available via the Internet. The duration of a movie trailer is approximately normal, with mean 150 seconds and standard deviation 30 seconds. a. What is the probability that a randomly selected trailer lasts less than 1 minute? b. Find the probability that a randomly selected trailer lasts between 2 minutes and 3 minutes 15 seconds. c. Any movie trailer that lasts beyond 4 minutes and 30 seconds is considered too long. What proportion of movie trailers is too long? d. Find a symmetric interval about the mean such that 99% of all movie trailer durations lie in this interval.

6.47 Biology and Environmental Science The salinity, or salt content, in the ocean is expressed in parts per thousand (ppt). The number varies with depth, rainfall, evaporation, river runoff, and ice formation. The mean salinity of the oceans is 35 ppt.⁹ Suppose the distribution of salinity is normal and the standard deviation is 0.52 ppt, and suppose a random sample of ocean water from a region in the tropical Pacific Ocean is obtained. a. What is the probability that the salinity is more than 36 ppt? b. What is the probability that the salinity is less than 33.5 ppt? c. A certain species of fish can only survive if the salinity is between 33 and 35 ppt. What is the probability that this species can survive in a randomly selected area? d. Find a symmetric interval about the mean salinity such that 50% of all salinity levels lie in this interval. What are the endpoints of this interval called?

6.48 Public Health and Nutrition Many people grab a granola bar for breakfast or for a snack to make it through the afternoon slump at work. A Kashi GoLean Crisp Chocolate Caramel bar is 45 grams, and the mean amount of protein in each bar is 8 grams.¹⁰ Suppose the distribution of protein in a bar is normally distributed and the standard deviation is 0.15 gram, and a random Kashi bar is selected. a. What is the probability that the amount of protein is less than 7.75 grams? b. What is the probability that the amount of protein is between 7.8 and 8.2 grams? c. Suppose the amount of protein is at least 8.1 grams. What is the probability that it is more than 8.3 grams? d. Suppose three bars are selected at random. What is the probability that all three will be between 7.7 and 8.3 grams?

6.49 Public Health and Nutrition Many typical household cleaners contain toxic chemicals. 2-Butoxyethanol is found in multipurpose cleaners and is a very powerful solvent. The EPA has a safety standard for this chemical when used in the workplace, but cleaning at home in a confined area can cause levels to rise well above this standard. The mean percent of 2-butoxyethanol in Rain-X Glass Cleaner is 3.¹¹ Suppose the distribution is approximately normal and the standard deviation is 1%, and a random bottle of Rain-X Glass Cleaner is selected. a. What is the probability that the percentage of 2-butoxyethanol is less than 2.5? b. What is the probability that the percentage of 2-butoxyethanol is between 2.2 and 3.5? c. Suppose the EPA has established a limit of 5% 2-butoxyethanol in all consumer products. What is the probability that a bottle exceeds this limit?

6.50 Psychology and Human Behavior In many U.S. families, both parents work outside the home, while children spend time at daycare centers or are cared for by other relatives. The mean amount of time fathers spend with their child(ren) is 7.3 hours per week.¹² Suppose this time distribution is approximately normal with standard deviation 0.75 hour, and suppose a father is randomly selected. a. What is the probability that the father spends at least 8 hours with his child in a given week? b. What is the probability that the father spends between 6 and 7 hours with his child in a given week? c. If the child sees his or her father for less than 6 hours per week, the parental bond is weakened. What is the probability that this special bond will be weakened in a given week? d. What is the probability that the father will spend at least 9 hours with his child in each of five randomly selected weeks?


Extended Applications

6.51 Sports and Leisure People who ride in hot-air balloons

usually fly just above the treetops at 200–500 feet. In populated areas, however, they usually stay at an altitude of at least 1000 feet. The amount of flying time possible in a hot-air balloon depends on many factors, including the number of propane burners, the number of people in the basket, and the weather. Assume the time spent aloft is normally distributed with mean 1.5 hours and standard deviation 0.45 hour. Suppose a hot-air balloon flight is selected at random. a. What is the probability that the flight time is between 1 and 2 hours? b. What is the probability that the flight time is more than 1 hour and 15 minutes? c. Find a value t such that 10% of all flights last less than t hours. d. Suppose a person offering hot-air balloon rides charges $50 for each ride of at least 1 hour, and $1.00 for every minute after 1 hour. What proportion of rides costs more than $100? 6.52 Sports and Leisure The Daytona 500, often referred to

as The Great American Race, is a spectacular sporting event, complete with a pre-race show. Jimmie Johnson won this race in 2013, when the mean speed per lap for all racers was 159.25 mph.¹³ Assume the speed is normally distributed with a standard deviation of 16 mph, and a driver and lap are selected at random. a. What is the probability that the speed on this lap is less than 155 mph? b. What is the probability that the speed on this lap is between 140 and 150 mph? c. The fastest recorded speed is 212 mph at Talladega in 1986. What is the probability that the speed on this lap will set a new record? d. What is the probability that the four leaders will all have a speed of at least 165 on this lap?

6.53 Biology and Environmental Science The amount of

timber harvested and sold is associated with the housing market and the general economy. In 2012, the total amount of timber harvested in the United States was 2,500,321 mbf (thousand board feet).¹⁴ It takes approximately 11 mbf to construct a typical 1900-square-foot home. Assume the volume of timber harvested per acre is normally distributed with mean 30 mbf and standard deviation 6.25 mbf. Suppose an acre of timber is selected at random. a. What is the probability that the volume of timber harvested is between 25 and 40 mbf? b. What is the probability that the volume of timber harvested is less than 20 mbf? c. Suppose the acre has already produced 35 mbf. What is the probability that the volume harvested will be more than 40 mbf? d. A logging company selects three random acres to harvest during a week. The company will make a profit if all three acres produce more than 32 mbf. What is the probability that the logging company makes a profit?

6.54 Marketing and Consumer Behavior Kraft Foods recently announced that the Kool-Aid mascot, that big red pitcher of the powdered drink mix with arms and legs, will receive a makeover as they unveil a new liquid mix. Kool-Aid bursts are distributed in various flavors including tropical punch, berry blue, and grape. Kraft Foods claims that the mean amount of Kool-Aid in each burst bottle is 200 milliliters (ml).¹⁵ Assume the amount of drink in each bottle is normally distributed with standard deviation 1.25 ml. Suppose a bottle of berry blue is selected at random. a. What is the probability that the amount of drink will be between 199 and 201 ml? b. If the amount of drink is more than 202 ml, when the bottle is opened there will be a spill. What is the probability of a spill? c. Suppose there are 196 ml in the bottle of berry blue. Is there any evidence to suggest the claim made by Kraft Foods is false? Justify your answer.

6.55 Biology and Environmental Science Many backyard gardeners prefer Silver Queen Hybrid corn. This late-season variety is very sweet and has tender, white kernels. In some locations in the Northeast, gardeners have trouble harvesting this variety because of its longer growing time. The temperature of the soil should be at least 65°F before planting, and the growing time is approximately normal with mean 92 days and standard deviation 5 days. a. What is the probability that a randomly selected seed will mature in less than 90 days? b. What is the probability that a randomly selected seed will mature in between 95 and 100 days? c. Suppose a row in a backyard garden contains 12 plants. What is the probability that four will be ready for dinner by the 95th day? d. Find a value h such that 99% of all plants are ready to be harvested within h days.

6.56 Manufacturing and Product Development Violin

bows are made from various woods to accommodate musicians’ preferences and demands. Some commonly used woods include snakewood, ironwood, hakia, and pernambuco. While the bows are carefully handcrafted, they vary slightly in weight. Suppose a bowmaker claims the weight of his bows is normally distributed with mean 60 grams and standard deviation 3.2 grams. a. What is the probability that the weight of a randomly selected ironwood bow is between 58 and 62 grams? b. Good musicians can detect an unacceptable bow weight, i.e., a weight that differs from the mean by more than two standard deviations. What is the probability that a bow weight is unacceptable? c. Any manufactured bow that weighs more than 66 grams is reworked in order to decrease the weight. What is the probability that a randomly selected ironwood bow will need rework? d. Suppose the weight of a randomly selected bow is 55 grams. Is there any evidence to suggest the mean weight is less than 60 grams? Justify your answer.


6.57 Medicine and Clinical Studies Repeated industrial tasks often cause work-related muscle disorders. Measurements of joint angles (of the shoulder and elbow, for example) required to complete a certain task can be used to predict future injuries. The shoulder joint angle required to fasten an aluminum door frame on an assembly line varies according to the worker’s height, arm length, and location. The shoulder joint angle for this task is normally distributed with mean 23.7 degrees and standard deviation 1.9 degrees. Suppose an employee is randomly selected. a. What is the probability that the shoulder joint angle will be between 20 and 25 degrees? b. What is the probability that the joint angle will be less than 18 degrees? c. If the joint angle is more than 28 degrees, there is a good chance the employee will suffer from a muscle disorder. What is the probability that the employee will suffer from a muscle disorder? d. If the joint angle is between 21.7 and 25.7 degrees, then management believes the ergonomics of the task are adequate. If five employees are randomly selected, what is the probability that four of the five have adequate ergonomics?

6.58 Biology and Environmental Science Many lakes are carefully monitored for pH concentration, total phosphorus, chlorophyll, nitrogen, and total suspended solids. These data are used to characterize the condition of the lake and to chart year-to-year variability. Based on information from the Lake Partner Program, Ontario Ministry of the Environment, Aberdeen Lake has a mean total phosphorus concentration of 14.6 mg/liter and standard deviation 5.8 mg/liter. Suppose a day is selected at random, and a total phosphorus measure from Aberdeen Lake is obtained. a. What is the probability that the total phosphorus is less than 13 mg/liter? b. What is the probability that the total phosphorus differs from the mean by more than 5 mg/liter? c. Suppose the total phosphorus is less than 20 mg/liter. What is the probability that it is less than 14 mg/liter? d. If the total phosphorus measurement is 27 mg/liter, is there any evidence to suggest the mean has increased?

6.59 Manufacturing and Product Development High-pressure washers have become popular for cleaning siding, decks, and windows. This equipment is available in various engine types and horsepower. Suppose the power rating (in horsepower, hp) for a residential pressure washer is normally distributed with mean 20 hp and standard deviation $\sigma$. a. The probability that a randomly selected power rating is within 2.5 hp of the mean is 0.7229. Find the value of $\sigma$. b. A leading consumer magazine advised its readers to purchase pressure washers with a power rating of 15 hp or more. What proportion of pressure washers have this rating? c. If the power rating is more than 26.5 hp, the pressure washer will crack, or even break, certain windows. What is the probability that a pressure washer could break a window?

6.60 Physical Sciences Hydroelectric projects are carefully monitored, and their energy capability is predicted for several years into the future. Suppose the Klamath Hydro Project, located on the upper Klamath River in south-central Oregon, generates electricity according to a normal distribution. The Pacific Northwest Utilities Conference Committee claims the mean electricity generated per year is 35 megawatts (MW). a. The probability that the Klamath Hydro Project generates less than 34 MW during any randomly selected year is 0.3540. Find the standard deviation. b. Suppose the years are independent, and the hydro project will record a profit in a given year if it is able to generate at least 37.8 MW that year. What is the probability that the project will record a profit for four consecutive years? c. Suppose the electricity generated during a certain year is 33.5 MW. Is there any evidence to suggest that the claim by the Pacific Northwest Utilities Conference Committee is false? Justify your answer.

6.61 Manufacturing and Product Development Dining-

room chairs come in many different woods, styles, and shapes. The height of the seat of a randomly selected oak dining-room chair is approximately normal with mean 85 centimeters (cm) and standard deviation 1.88 cm. a. Find a value h such that 99% of all dining-room chairs have height less than h. b. Consumer testing indicates that any chair seat higher than 90 cm is uncomfortable to use when eating. What is the probability that a randomly selected dining-room chair is uncomfortable? c. Find the first and third quartiles of the dining-room chair height distribution. d. There is some evidence to suggest that, after five years of use, the mean height of these chairs has decreased, due to wear, erosion, and humidity. Suppose that after five years, the probability the height is more than 86 cm is 0.0718. Find the mean height after five years. 6.62 Sports and Leisure Tianlang Guan was the youngest

person ever to participate in the Masters Golf Tournament. The 14-year-old from China played the Augusta, Georgia, course with the confidence of a professional, but very slowly. He was warned about his slow play and was assessed a one-stroke penalty on the par-4 17th hole on the second day of the tournament.¹⁶ The PGA tour maintains a 40-second time limit to play a stroke, but also has several exceptions to this rule that allow for an additional 20 seconds. The mean time for all golf shots is 38 seconds.¹⁷ Assume the time for all golf shots is normally distributed with standard deviation 9 seconds, and suppose a golf shot is selected at random. a. What is the probability that a randomly selected shot takes between 25 and 35 seconds? b. If the shot takes more than 60 seconds, the golfer is assessed a penalty stroke. What is the probability that the golfer will be assessed a penalty stroke? c. Suppose a golfer takes 72 strokes to complete the round. What is the probability that at least 60 of these shots take less than 45 seconds?


CHAPTER 6

Continuous Probability Distributions

Challenge

6.63 Sports and Leisure The International Tennis Federation (ITF) establishes the specifications for tennis balls. The diameter of a tennis ball used in any tournament must be between 2.5 and 2.625 inches. Suppose the diameter of a tennis ball is approximately normal with mean 2.5625 inches and standard deviation 0.04 inch.
a. What is the probability that a randomly selected tennis ball will meet ITF diameter specifications?
b. Suppose six tennis balls will be used in a tournament game. What is the probability that exactly one will not meet ITF diameter specifications? Assume independence.

6.3 Checking the Normality Assumption

Almost every inferential statistics procedure requires certain assumptions, for example, that observations are selected independently or that variances are equal (for analysis of variance). And many statistical techniques are valid only if the observations are from a normal distribution. If an inference procedure requires normality, and the population distribution is not normal, then the conclusions are worthless. Therefore, it seems reasonable to be able to perform some kind of check for normality, to make sure there is no evidence to refute this assumption.

Until now we have been using the normal distribution as a model for describing the variability of a random variable $X$, and we have been assuming that we know the values of the population mean $\mu$ and the population variance $\sigma^2$. If those values are not known, the sample mean $\bar{x}$ and the sample standard deviation $s$ can be used as estimates of the unknown parameters $\mu$ and $\sigma$. However, we still cannot be sure that the normal distribution is an appropriate model to describe a particular set of observations. We need a way to check whether a set of observations does seem to come from a population with a normal distribution.

There are four different methods we can use to look for any evidence of non-normality. Three of them use techniques that we have seen before; the fourth one is a new method. Given a set of observations, $x_1, x_2, x_3, \ldots, x_n$, the following four methods may be used to check for any evidence of non-normality, for example, a distribution that is not bell-shaped, a skewed distribution, or a distribution with heavy tails.

1. Graphs

Construct a histogram, a stem-and-leaf plot, and/or a dot plot. Examine the shape of the distribution for any indications that the distribution is not bell-shaped and symmetric. In a random sample, the distribution of the sample should be similar to the distribution of the population.

2. Backward Empirical Rule

To use the empirical rule to test for normality, find the mean, the standard deviation, and the three symmetric intervals about the mean $(\bar{x} - ks,\ \bar{x} + ks)$, $k = 1, 2, 3$. Compute the actual proportion of observations in each interval. If the actual proportions are close to 0.68, 0.95, and 0.997, then normality seems reasonable. Otherwise, there is evidence to suggest that the shape of the distribution is not normal.

3. IQR/s

Find the interquartile range, IQR, and standard deviation, $s$, for the sample, and compute the ratio $\mathrm{IQR}/s$. If the data are approximately normal, then $\mathrm{IQR}/s \approx 1.3$. Here is some justification for this ratio. Consider a standard normal random variable, $Z$ ($\mu = 0$, $\sigma = 1$). The first quartile for $Z$ is $-0.6745$ and the third quartile is $0.6745$; that is, $P(Z \le -0.6745) = 0.25$ and $P(Z \le 0.6745) = 0.75$. The interquartile range divided by the standard deviation is $[0.6745 - (-0.6745)]/1 = 1.349$. In a random sample, the interquartile range should be close to the population interquartile range, and the standard deviation should be close to the population standard deviation. Any normal distribution can be standardized, or compared to $Z$, so $\mathrm{IQR}/s \approx 1.3$.


4. Normal Probability Plot

A normal probability plot is a scatter plot of each observation versus its corresponding standardized normal score. For a normal distribution, the points will fall along a straight line. The standardized normal scores are expected values. For example, in repeated samples of size $n$ from the $Z$ distribution, on average the smallest value is $z_1$, on average the next largest value is $z_2$, etc., on average the largest value is $z_n$.

How to Construct a Normal Probability Plot

Suppose $x_1, x_2, \ldots, x_n$ is a set of observations.

1. Order the observations from smallest to largest and let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ represent the set of ordered observations.

2. Find the standardized normal scores for a sample of size $n$ in Table IV in the Appendix: $z_1, z_2, \ldots, z_n$.

3. Plot the ordered pairs $(z_i, x_{(i)})$.

Most of the standardized normal scores are between $-2.0$ and $+2.0$, because approximately 95% of all observations lie within two standard deviations of the mean.

If the scatter plot is nonlinear, there is evidence to suggest the data did not come from a normal distribution. Most statistical software (the TI-84 Plus C and Minitab included) automatically computes the expected Z values. Table IV in the Appendix provides standardized normal scores for some values of n. Figures 6.53–6.56 are examples of normal probability plots.
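The three construction steps can also be sketched in a few lines of Python. Table IV is not reproduced here, so this sketch assumes the approximation $(i - 3/8)/(n + 1/4)$ for the normal scores, the same formula that appears in the Excel instructions of the Technology Corner. The function names `normal_scores` and `probability_plot_points` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate standardized normal scores z_1, ..., z_n.

    Assumption: z_i is taken as the standard normal quantile at
    (i - 3/8) / (n + 1/4), standing in for a printed table of scores.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def probability_plot_points(observations):
    """Return the ordered pairs (z_i, x_(i)) for a normal probability plot."""
    ordered = sorted(observations)        # step 1: order the observations
    scores = normal_scores(len(ordered))  # step 2: normal scores for this n
    return list(zip(scores, ordered))     # step 3: the pairs to plot


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sample = [12.1, 9.8, 15.3, 11.0, 13.4, 10.5, 14.2, 8.7]
    for z, x in probability_plot_points(sample):
        print(f"{z:6.2f}  {x:5.1f}")
```

Passing the resulting pairs to any scatter plot routine gives the normal probability plot; swapping the two coordinates puts the data on the horizontal axis instead.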

Figure 6.53 A normal probability plot. The points lie along an approximate straight line. There is no evidence of non-normality.

Figure 6.54 A normal probability plot. The curved graph suggests that the distribution is not normal and is skewed.

Figure 6.55 A normal probability plot. The plot suggests that the distribution is not normal and that the data set contains an outlier.

Figure 6.56 A normal probability plot. The plot suggests that the distribution is not normal and has heavy tails.

The data axis can be horizontal or vertical. To use a horizontal data axis, plot the points $(x_{(i)}, z_i)$. Figure 6.57 shows a normal probability plot with the data plotted on the vertical axis and Figure 6.58 shows a normal probability plot (using the same data and standardized normal scores) with the data plotted on the horizontal axis.


Figure 6.57 A normal probability plot with the data plotted on the vertical axis.

Figure 6.58 A normal probability plot with the data plotted on the horizontal axis.

Interpretation of a normal probability plot is very subjective, and even if the axes are reversed, we are still looking for the points to lie along a straight line. All four methods can be used to check the normality assumption, and any one (or several) may suggest the data did not come from a normal distribution. Because we are searching for evidence of non-normality, even if we fail to reject the normality assumption in each test, we still cannot say with absolute certainty that the data came from a normal distribution.

Example 6.8 Copper Mining

DATA SET COPPER

In April 2013 there was a huge landslide at the Kennecott Utah Copper mine, one of the world’s deepest open pits, visible from space. Mine officials had anticipated this landslide, but operations were suspended indefinitely. Prior to the landslide, Kennecott produced approximately 753 tons of refined copper each day.18 A random sample of days was selected and the amount of refined copper was recorded for each. The 20 observations are given in the following table:

757  751  749  753  745  749  738  746  732  750
741  743  760  741  758  762  745  752  735  767

Is there any evidence to suggest that this distribution is not normal?

SOLUTION

STEP 1 Figure 6.59 shows a frequency histogram for these data. There are no obvious outliers, and the distribution seems approximately normal.

Figure 6.59 Frequency histogram for the copper mine data (horizontal axis: copper mined; vertical axis: frequency).


STEP 2 The sample mean and the sample standard deviation are $\bar{x} = 748.70$ and $s = 9.17$. The following table lists three symmetric intervals about the mean, the number of observations in each interval, and the proportion of observations in each interval (recall that $n = 20$).

Interval                                                  Frequency   Proportion
$(\bar{x} - s,\ \bar{x} + s) = (739.53, 757.87)$          13          0.65
$(\bar{x} - 2s,\ \bar{x} + 2s) = (730.36, 767.04)$        20          1.00
$(\bar{x} - 3s,\ \bar{x} + 3s) = (721.19, 776.21)$        20          1.00

The actual proportions are close to those given by the empirical rule (0.68, 0.95, and 0.997).

STEP 3 The quartiles are $Q_1 = 742$, $Q_3 = 755$.

$\mathrm{IQR}/s = (755 - 742)/9.17 = 1.4177$

This ratio is close to 1.3.

STEP 4 The table below lists each observation along with the corresponding normal score from Table IV in the Appendix.

Observation   Normal score   Observation   Normal score
732           -1.87          749            0.06
735           -1.40          750            0.19
738           -1.13          751            0.31
741           -0.92          752            0.45
741           -0.74          753            0.59
743           -0.59          757            0.74
745           -0.45          758            0.92
745           -0.31          760            1.13
746           -0.19          762            1.40
749           -0.06          767            1.87

Plot these points to obtain the normal probability plot, as shown in Figure 6.60. Figure 6.61 shows a technology solution. The points lie along an approximately straight line.

Figure 6.60 Normal probability plot for the copper mine data.

Figure 6.61 Normal probability plot.

The histogram, backward empirical rule, IQR/s, and normal probability plot show no significant evidence of non-normality. Remember, however, that this decision is very subjective.

TRY IT NOW GO TO EXERCISE 6.76
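The arithmetic in Steps 2 and 3 of Example 6.8 can be checked with a short Python sketch. The helper names below are hypothetical, and the quartiles are computed as the medians of the lower and upper halves of the ordered data, an assumption chosen because it reproduces the quartiles quoted in the example; other software may use a different rule and give slightly different values. Because the sketch uses the unrounded standard deviation, the ratio IQR/s may differ from the value in the text in the fourth decimal place.

```python
from statistics import NormalDist, mean, median, stdev

# Copper production data from Example 6.8 (tons of refined copper per day).
copper = [757, 751, 749, 753, 745, 749, 738, 746, 732, 750,
          741, 743, 760, 741, 758, 762, 745, 752, 735, 767]


def empirical_rule_proportions(data):
    """Proportion of observations inside (xbar - k*s, xbar + k*s) for k = 1, 2, 3."""
    xbar, s = mean(data), stdev(data)  # stdev is the sample standard deviation
    return [sum(xbar - k * s < x < xbar + k * s for x in data) / len(data)
            for k in (1, 2, 3)]


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data.

    Assumption: this convention matches the quartiles quoted in the example.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr_over_s(data):
    """Ratio of the interquartile range to the sample standard deviation."""
    q1, q3 = quartiles(data)
    return (q3 - q1) / stdev(data)


if __name__ == "__main__":
    print("mean, s:", round(mean(copper), 2), round(stdev(copper), 2))
    # Compare these proportions with 0.68, 0.95, and 0.997.
    print("proportions:", empirical_rule_proportions(copper))
    print("IQR/s:", round(iqr_over_s(copper), 4))
    # Reference value for a normal distribution: twice the third quartile of Z.
    print("normal IQR/sigma:", round(2 * NormalDist().inv_cdf(0.75), 3))
```

The last line recomputes the reference ratio from the standard normal distribution, so the sample ratio can be compared with it directly.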

Example 6.9 Chemotherapy Protocol

DATA SET DOSAGE

A certain protocol for chemotherapy states that the total dose for patients under the age of 12 is no greater than 450 mg/m$^2$ within six months. A random sample of 30 patients undergoing this form of chemotherapy was obtained, and their medical records were examined to determine the total dose of the drug over the past six months. The data are given in the following table.

350  351  352  353  354  358  361  364  371  376
377  378  387  396  399  402  406  408  412  424
427  430  432  437  440  441  443  446  447  449

Is there any evidence to suggest the distribution of six-month total dosage is not normal?

SOLUTION

STEP 1 Figure 6.62 shows a frequency histogram for these data. Although the graph seems symmetric, it is not bell-shaped. Most of the data are concentrated in the tails of the distribution. This suggests the data are not from a normal distribution.

Figure 6.62 Frequency histogram for the cumulative chemotherapy dose data (horizontal axis: total dose; vertical axis: frequency).

STEP 2 The sample mean is $\bar{x} = 399.03$ and the sample standard deviation is $s = 34.94$. The following table lists three symmetric intervals about the mean, the number of observations in each interval, and the proportion of observations in each interval (computed using $n = 30$).

Interval                                                  Frequency   Proportion
$(\bar{x} - s,\ \bar{x} + s) = (364.09, 433.97)$          15          0.50
$(\bar{x} - 2s,\ \bar{x} + 2s) = (329.15, 468.91)$        30          1.00
$(\bar{x} - 3s,\ \bar{x} + 3s) = (294.21, 503.85)$        30          1.00

The first two proportions (0.50 and 1.00) are significantly different from those given by the empirical rule (0.68 and 0.95). This suggests the population of total chemotherapy doses is not normal.


STEP 3 The quartiles are $Q_1 = 364.00$ and $Q_3 = 432.00$.

$\mathrm{IQR}/s = (432.00 - 364.00)/34.94 = 1.9462$

This ratio is significantly different from 1.3, so there is more evidence to suggest the underlying population is not normal.

STEP 4 The following table lists each observation along with the corresponding normal score.

Observation   Normal score   Observation   Normal score   Observation   Normal score
350           -2.04          377           -0.38          427            0.47
351           -1.61          378           -0.29          430            0.57
352           -1.36          387           -0.21          432            0.67
353           -1.18          396           -0.12          437            0.78
354           -1.02          399           -0.04          440            0.89
358           -0.89          402            0.04          441            1.02
361           -0.78          406            0.12          443            1.18
364           -0.67          408            0.21          446            1.36
371           -0.57          412            0.29          447            1.61
376           -0.47          424            0.38          449            2.04

The normal probability plot is shown in Figure 6.63. The points do not lie along a straight line. Each tail is flat, which makes the graph look S-shaped. This suggests that the underlying population is not normal. Figure 6.64 shows a technology solution.

Figure 6.63 Normal probability plot for the chemotherapy dose data.

Figure 6.64 JMP normal probability plot.

The histogram, backward empirical rule, IQR/s, and the normal probability plot all indicate that this sample did not come from a normal population.

TRY IT NOW GO TO EXERCISE 6.79
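The flat tails described in Step 4 of Example 6.9 can also be seen numerically. If the data came from a normal distribution, the points $(z_i, x_{(i)})$ would fall near a line whose slope is roughly the standard deviation, so the slope between neighboring groups of points should be about the same everywhere. The following Python sketch is an informal illustration only, not part of the procedure in this section: the helper `slope`, the choice of which points to compare, and the approximation $(i - 3/8)/(n + 1/4)$ used in place of Table IV are all assumptions.

```python
from statistics import NormalDist

# Six-month total doses from Example 6.9, already in increasing order.
dose = [350, 351, 352, 353, 354, 358, 361, 364, 371, 376,
        377, 378, 387, 396, 399, 402, 406, 408, 412, 424,
        427, 430, 432, 437, 440, 441, 443, 446, 447, 449]

n = len(dose)
# Assumption: normal scores approximated by the quantile at (i - 3/8) / (n + 1/4),
# the formula given in the Excel instructions, rather than read from Table IV.
z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def slope(i, j):
    """Slope of the segment joining ordered points i and j (1-indexed) on the plot."""
    return (dose[j - 1] - dose[i - 1]) / (z[j - 1] - z[i - 1])


lower_tail = slope(1, 5)    # across the five smallest observations
middle = slope(14, 17)      # across observations around the median
upper_tail = slope(26, 30)  # across the five largest observations

if __name__ == "__main__":
    print(f"lower tail: {lower_tail:.1f}")
    print(f"middle:     {middle:.1f}")
    print(f"upper tail: {upper_tail:.1f}")
```

Both tail slopes come out far smaller than the slope through the middle of the plot, which is the numerical counterpart of the S shape seen in Figure 6.63.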


TECHNOLOGY CORNER Procedure: Construct a normal probability plot. Reconsider: Example 6.8, solution, and interpretations.


CrunchIt! A QQ Plot (quantile plot) is used to construct a normal probability plot.
1. Enter the data into a column. Rename the column if desired.
2. Select Graphics; QQ Plot. Select the Sample (name of the column) from the drop-down menu. Click Calculate to view the graph. See Figure 6.65. CrunchIt! adds a straight line to this plot for visual reference.

Figure 6.65 A QQ Plot constructed using CrunchIt!.

TI-84 Plus C A normal probability plot is one of the six built-in statistical plots.
1. Enter the data into list L1.
2. Choose STATPLOT; STAT PLOTS; Plot1. Turn the plot On, select Type normal probability plot (the last graph icon), enter the Data List, L1, set the Data Axis to Y, choose a Mark (for the points on the graph), and select a Color.
3. Enter appropriate window settings and press GRAPH to view the normal probability plot. Refer to Figure 6.61.

Minitab Use the built-in function NSCOR to compute the normal scores. Construct a scatter plot of the data versus the normal scores.
1. Enter the data into column C1.
2. Compute the normal scores in a session window (or by using the Minitab Calculator) and store the results in column C2: LET C2 = NSCOR(C1).
3. Construct a scatter plot.
a. In a session window: PLOT C1*C2.
b. In a graphical input window: Graph; Scatterplot. Select Simple and let the Y variable be the data column (C1) and the X variable be the normal scores column (C2). See Figure 6.66.


Figure 6.66 A normal probability plot constructed using Minitab.

Excel To construct a normal probability plot, compute the normal scores using the formula below. Construct a scatter plot of the data versus the normal scores.
1. Enter the data into column C, in increasing order, and the numbers 1 to n = 20 in column A.
2. Set the cell B1 equal to NORM.S.INV((A1 - 3/8)/(20 + 1/4)). Copy this result and paste into the cells B2–B20. These are (approximately) the normal scores.
3. Highlight the data range B1:C20. Under the Insert tab, select Scatter; Scatter. See Figure 6.67.

Figure 6.67 A normal probability plot constructed using Excel.

SECTION 6.3 EXERCISES

Concept Check

6.64 Short Answer Name four methods to search for evidence of non-normality.

6.65 True/False In a normal probability plot, the data axis

6.68 Fill in the Blank For a normal distribution,

IQR/ s