Chapter One
Methods of Collecting and Presenting Data
People make decisions every day, with decision-making logically based on some form of data. A person who accepts a job and moves to a new city needs to know how long it will take him/her to drive to work. The person could guess the time by knowing the distance and considering the traffic likely to be encountered along the route that will be traveled, or the new employee could drive the route at the anticipated regular time of departure for a few days before the first day of work.
With the second option, an experiment is performed. If the test runs are made under normal road and weather conditions, they should give a better estimate of the typical driving time than could be obtained merely from knowing the distance and the route to be traveled.
Similarly, engineers conduct statistically designed experiments to obtain valuable information that will enable processes and products to be improved, and much space is devoted to statistically designed experiments in Chapter 12.
Of course, engineering data are also available without a designed experiment having been performed, but such data generally require more careful analysis than data from designed experiments. In his provocative paper, "Launching the Space Shuttle Challenger: Disciplinary Deficiencies in the Analysis of Engineering Data," F. F. Lighthall (1991) contended that "analysis of field data and reasoning were flawed" and that "staff engineers and engineering managers ... were unable to frame basic questions of covariation among field variables, and thus unable to see the relevance of routinely gathered field data to the issues they debated before the Challenger launch." Lighthall then states "Simple analyses of field data available to both Morton Thiokol and NASA at launch time and months before the Challenger launch are presented to show that the arguments against launching at cold temperatures could have been quantified...." The author's contention is that there was a "gap in the education of engineers." (Whether or not the Columbia disaster will be similarly viewed by at least some authors as being a deficiency in data analysis remains to be seen.)
Perhaps many would disagree with Lighthall, but the bottom line is that failure to properly analyze available engineering data or failure to collect necessary data can endanger lives: on a space shuttle, on a bridge that spans a river, on an elevator in a skyscraper, and in many other scenarios.
Intelligent analysis of data requires much thought, however, and there are no shortcuts. This is because analyzing data and solving associated problems in engineering and other areas is more of an art than a science. Consequently, it would be impractical to attempt to give a specific step-by-step guide to the use of the statistical methods presented in succeeding chapters, although general guidelines can still be provided and are provided in subsequent chapters. It is desirable to try to acquire a broad knowledge of the subject matter and position oneself to be able to solve problems with powers of reasoning coupled with subject matter knowledge.
The importance of avoiding the memorization of rules or steps for solving problems is perhaps best stated by Professor Emeritus Herman Chernoff of the Harvard Statistics Department in his online algebra text, Algebra 1 for Students Comfortable with Arithmetic (http://www.stat.harvard.edu/People/Faculty/Herman Chernoff/ Herman Chernoff Algebra 1.pdf).
"Memorizing rules for solving problems is usually a way of avoiding understanding. Without understanding, great feats of memory are required to handle a limited class of problems, and there is no ability to handle new types of problems."
My approach to this issue has always been to draw a rectangle on a blackboard and then make about 15-20 dots within the rectangle. The dots represent specific types of problems; the rectangle represents the body of knowledge that is needed to solve not only the types of problems represented by the dots, but also any type of problem that would fall within the rectangle. This is essentially the same as what Professor Chernoff is saying.
This is an important distinction that undoubtedly applies to any quantitative subject and should be understood by students and instructors, in general.
Semiconductor manufacturing is one area in which statistics is used extensively. International SEMATECH (SEmiconductor MAnufacturing TECHnology), located in Austin, Texas, is a nonprofit research and development consortium of the following 13 semiconductor manufacturers: Advanced Micro Devices, Conexant, Hewlett-Packard, Hyundai, Infineon Technologies, IBM, Intel, Lucent Technologies, Motorola, Philips, STMicroelectronics, TSMC, and Texas Instruments. Intel, in particular, uses statistics extensively.
The importance of statistics in these and other companies is exemplified by the NIST/SEMATECH e-Handbook of Statistical Methods (Croarkin and Tobias, 2002), a joint effort of International SEMATECH and NIST (National Institute of Standards and Technology), with the assistance of various other professionals. The stated goal of the handbook, which is the equivalent of approximately 3,000 printed pages, is to provide a Web-based guide for engineers, scientists, businesses, researchers, and teachers who use statistical techniques in their work. Because of its sheer size, the handbook is naturally much more inclusive than this textbook, although there is some overlap of material. Of course, the former is not intended for use as a textbook and, for example, does not contain any exercises or problems, although it does contain case studies. It is a very useful resource, however, especially since it is almost an encyclopedia of statistical methods. It can be accessed at www.itl.nist.gov/div898/handbook and will henceforth often be referred to as the e-Handbook of Statistical Methods or simply as the e-Handbook.
There are also numerous other statistics references and data sets that are available on the Web, including some general purpose Internet statistics textbooks. Much information, including many links, can be found at the following websites: http://www.utexas.edu/cc/stat/world/softwaresites.html and http://my.execpc.com/~helberg/statistics.html. The Journal of Statistics Education is a free, online statistics publication devoted to statistics education. It can be found at http://www.amstat.org/publications/jse.
Statistical education is a two-way street, however, and much has been written about how engineers view statistics relative to their work. At one extreme, Brady and Allen (2002) stated: "There is also abundant evidence – for example, Czitrom (1999) – that most practicing engineers fail to consistently apply the formal data collection and analysis techniques that they have learned and in general see their statistical education as largely irrelevant to their professional life." (It is worth noting that the first author is an engineering manager in industry.) The Accreditation Board for Engineering and Technology (ABET) disagrees with this sentiment and several years ago decreed that all engineering majors must have training in probability and statistics. Undoubtedly, many engineers would disagree with Brady and Allen (2002), although historically this has been a common view.
One relevant question concerns the form in which engineers and engineering students believe that statistical exposition should be presented to them. Lenth (2002), in reviewing a book on experimental design that was written for engineers and engineering managers and emphasizes hand computation, touches on two extremes by first stating that "... engineers just will not believe something if they do not know how to calculate it ...," and then stating "After more thought, I realized that engineers are quite comfortable these days – in fact, far too comfortable – with results from the blackest of black boxes: neural nets, genetic algorithms, data mining, and the like."
So have engineers progressed past the point of needing to see how to perform all calculations that produce statistical results? (Of course, a world of black boxes is undesirable.) This book was written with the knowledge that users of statistical methods simply do not perform hand computation anymore to any extent, but many computing formulas are nevertheless given for interested readers, with some formulas given in chapter appendices.
1.1 OBSERVATIONAL DATA AND DATA FROM DESIGNED EXPERIMENTS
Sports statistics are readily available from many sources and are frequently used in teaching statistical concepts. Assume that a particular college basketball player has a very poor free throw shooting percentage, and his performance is charted over a period of several games to see if there is any trend. This would constitute observational data: we have simply observed the numbers. Now assume that since the player's performance is so poor, some action is taken to improve his performance. This action may consist of extra practice, visualization, and/or instruction from a professional specialist. If different combinations of these tasks were employed, this could be in the form of a designed experiment. In general, if improvement is to occur, there should be experimentation. Otherwise, any improvement that seems to occur might be only accidental and not be representative of any real change.
Similarly, W. Edwards Deming (1900-1993) coined the terms analytic studies and enumerative studies and often stated that "statistics is prediction." He meant that statistical methods should be used to improve future products, processes, and so on, rather than simply "enumerating" the current state of affairs as is exemplified, for example, by the typical use of sports statistics. If a baseball player's batting average is .274, does that number tell us anything about what the player should do to improve his performance? Of course not, but when players go into a slump they try different things; that is, they experiment. Thus, experimentation is essential for improvement.
This is not to imply, however, that observational data (i.e., enumerative studies) have no value. Obviously, if one is to travel/progress to "point B," it is necessary to know the starting point, and in the case of the baseball player who is batting .274, to determine if the starting point is one that has some obvious flaws.
When we use designed experiments, we must have a way of determining if there has been a "significant" change. For example, let's say that an industrial engineer wants to determine if a new manufacturing process is having a significant effect on throughput. He/she obtains data from the new process and compares this against data that are available for the old process. So now there are two sets of data and information must be extracted from those two sets and a decision reached. That is, the engineer must compute statistics (such as averages) from each set of data that would be used in reaching a decision. This is an example of inferential statistics, a subject that is covered extensively in Chapters 4-15.
DEFINITION
A statistic is a summary measure computed from a set of data.
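As a minimal sketch of this definition, the following Python snippet computes a few statistics from two small samples, in the spirit of the old-process versus new-process comparison described above. The throughput values are hypothetical numbers invented purely for illustration, and the helper name `summarize` is likewise an assumption, not something taken from the text.

```python
import statistics

def summarize(data):
    """Return a few common statistics (summary measures) for one sample."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation (n - 1 divisor)
    }

# Hypothetical throughput values (units per hour), invented for illustration only.
old_process = [52, 48, 50, 51, 49]
new_process = [55, 53, 56, 54, 57]

old_summary = summarize(old_process)
new_summary = summarize(new_process)
difference_in_means = new_summary["mean"] - old_summary["mean"]

print("old process:", old_summary)
print("new process:", new_summary)
print("difference in means:", difference_in_means)
```

Each number printed is a statistic. Whether a difference in means of this size reflects a real change, rather than chance variation, is a question for the inferential methods of later chapters.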
One point that cannot be overemphasized (so the reader will see it discussed in later chapters) is that experimentation should generally not be a one-time effort, but rather should be repetitive and sequential. Specifically, as is illustrated in Figure 1.1, experimentation should in many applications be a never-ending learning process. Mark Price has the highest free throw percentage in the history of the National Basketball Association (NBA) at .904, whereas in his four-year career at Georgia Tech his best year was .877 and he does not even hold the career Georgia Tech free throw percentage record (which is held by Roger Kaiser at .858). How could his professional percentage be considerably higher than his college percentage, despite the rigors of NBA seasons that are much longer than college seasons? Obviously, he had to experiment to determine what worked best for him.
1.2 POPULATIONS AND SAMPLES
Whether data have been obtained as observational data or from a designed experiment, we have obtained a sample from a population.
DEFINITION
A sample is a subset of observations obtained from a larger set, termed a population.
To the layperson, a population consists of people, but a statistical population can consist of virtually anything. For example, the collection of desks on a particular college campus could be defined as a population. Here we have a finite population and one could, for example, compute the average age of desks on campus. What is the population if we toss a coin ten times and record the number of heads? Here the population is conceptually infinite as it would consist of all of the tosses of the coin that could be made. Similarly, for a manufacturing scenario the population could be all of the items of a particular type produced by the current manufacturing process: past, present, and future.
If our sample consists of observational data, the question arises as to how the sample should be obtained. In particular, should we require that our sample be random, or will we be satisfied if our sample is simply representative?
DEFINITION
A random sample of a specified size is one for which every possible sample of that size has the same chance of being selected from the population.
A simple example will be given to illustrate this concept. Suppose a population is defined to consist of the numbers 1, 2, 3, 4, 5, and 6, and you wish to obtain a random sample of size two from this population. How might this be accomplished? What about listing all of the possible samples of size two and then randomly selecting one? There are 15 such samples and they are given below.
12   15   24   34   45
13   16   25   35   46
14   23   26   36   56
Following the definition just given, a random sample of size two from this population is such that each of the possible samples has the same probability of being selected.
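The listing approach can be sketched in a few lines of Python. This is only an illustration of the idea described above: the seed value is arbitrary, and the variable names are assumptions. The count of 15 is the number of ways of choosing 2 items from 6, that is, 6!/(2! 4!) = 15.

```python
import itertools
import random

population = [1, 2, 3, 4, 5, 6]

# List every possible sample of size two; order within a sample does not matter.
all_samples = list(itertools.combinations(population, 2))
print(len(all_samples))  # number of possible samples: C(6, 2) = 15
print(all_samples)

# Select one of the listed samples so that each has the same chance (1/15).
rng = random.Random(2007)  # arbitrary seed, used only to make the run repeatable
chosen = rng.choice(all_samples)
print(chosen)
```

Listing all samples is practical only for very small populations, since the number of possible samples grows rapidly with the population size.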
There are various ways to obtain a random sample, once a frame, a list of all of the elements of a population, is available. Obviously, one approach would be to use a software program that generates random numbers. Another approach would be to use a random number table such as Table A at the end of the book. That table could be used as follows. In general, the elements in the population would have to be numbered in some way. In this example the elements are numbers, and since the numbers are single-digit numbers, only one column of Table A need be used. If we arbitrarily select the first column in the first set of four columns, we could proceed down that column; the first number observed is 1 and the second is 5. Thus, our sample of size two would consist of those two numbers.
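The software approach might look like the following sketch, which uses Python's standard `random.sample` to draw two distinct elements from the frame. The seed and the number of repetitions are arbitrary assumptions, and the second half is only a rough empirical check that each of the 15 possible samples turns up about equally often.

```python
import random
from collections import Counter

frame = [1, 2, 3, 4, 5, 6]   # the frame: a list of every element of the population
rng = random.Random(1)       # arbitrary seed, used only to make the run repeatable

# Draw a sample of size two without replacement.
sample = rng.sample(frame, 2)
print(sorted(sample))

# Rough check: repeat the draw many times and tabulate how often each pair occurs.
draws = 60_000
counts = Counter(tuple(sorted(rng.sample(frame, 2))) for _ in range(draws))
for pair in sorted(counts):
    # Each relative frequency should be close to 1/15 (about 0.0667).
    print(pair, round(counts[pair] / draws, 4))
```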
Now how would we proceed if our population is defined to consist of all transistors of a certain type manufactured in a given day at a particular facility? Could a random sample be obtained?
In general, to obtain a random sample we do need a frame, which, as has been stated, is a list of all of the elements in the population. It would certainly be impractical to "number" all of the transistors so that a random sample could be taken. Consequently, a convenience sample is frequently used instead of a random sample. The important point is that the sample should be representative, and should more or less emulate a random sample, since common statistical theory is based on the assumption of random sampling.
For example, we might obtain samples of five units from an assembly line every 30 minutes. With such a scheme, which is called systematic sampling and is typical when control charts (see Chapter 11) are constructed, not every item produced has the same probability of being included in one of the samples.
Such a sampling approach could produce disastrous results if, unbeknown to the person performing the sampling, there was some cyclicality in the data. This was clearly shown in McCoun (1949, 1974) in regard to a tooling problem. If you imagine data that would graph approximately as a sine curve, and if the sampling coincided with the periodicity of the curve, the variability of the data could be greatly underestimated and the trend that would be clearly visible for the entire set of data would be hidden.
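The sine-curve scenario can be simulated with a short Python sketch. Everything numeric here is an assumption made for illustration (a 30-minute cycle, a baseline of 10, one reading per minute over 8 hours); none of it comes from McCoun's data.

```python
import math
import statistics

period = 30           # assumed length of the cycle, in minutes
total_minutes = 480   # assumed 8-hour run with one reading per minute

def reading(t):
    """Hypothetical measurement at minute t: a baseline of 10 plus a sine cycle."""
    return 10.0 + math.sin(2 * math.pi * t / period)

all_readings = [reading(t) for t in range(total_minutes)]

# Systematic sample whose interval coincides with the period of the cycle.
aligned = [reading(t) for t in range(0, total_minutes, period)]

# Systematic sample whose interval (7 minutes) does not coincide with the period.
unaligned = [reading(t) for t in range(0, total_minutes, 7)]

# pstdev: standard deviation of the values in hand, using an n divisor.
sd_all = statistics.pstdev(all_readings)
sd_aligned = statistics.pstdev(aligned)
sd_unaligned = statistics.pstdev(unaligned)

print("all readings:     ", round(sd_all, 4))
print("aligned sample:   ", round(sd_aligned, 4))
print("unaligned sample: ", round(sd_unaligned, 4))
```

Under these assumptions the aligned sample always lands at the same point of the cycle, so its values are essentially identical and the computed variability is essentially zero, even though the full set of readings varies considerably. The sample taken at an interval that does not match the period gives a spread much closer to that of the full data.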
Excerpted from Modern Engineering Statistics by Thomas P. Ryan. Copyright © 2007 by John Wiley & Sons, Inc. Excerpted by permission.