Inaccuracies in Data

How and Where Mistakes Arise: The only way to avoid mistakes is, of course, to work carefully, but a general knowledge of the nature of mistakes and how they arise helps us to do so. Most mistakes arise at the stage of copying from the original material to the worksheet or from one worksheet to another, transferring from the worksheet onto the calculating machine or vice versa, and reading from mathematical tables. It is a good idea in any computational program to cut down copying and transferring operations as much as possible. A person who computes should always do things neatly in the first instance and never indulge in the habit of doing “rough work” and then making a fair copy. Computational steps should be broken up into the minimum possible number of unit operations, that is, operations that can be carried out on the calculating machine without having to write down any intermediate answer. Finally, the work should be so arranged that it is not necessary to refer to mathematical tables every now and then. As far as possible, all references to such tables should be made together at the same time; this minimizes the possibility of referring to a wrong page and of making gross mistakes in reading similar numbers from the same table. In many mathematical tables, when the first few digits occur repeatedly, they are separated from the body of the table and put separately in a corner; a change in these leading digits in the middle of a row is indicated by a line or some other suitable symbol. We should be careful to read the leading digits correctly from such tables.

Classification of Mistakes: Mistakes in copying, transferring and reading fall into three broad classes: digit substitution, juxtaposition and repetition. The first type of mistake is to substitute hurriedly one digit for another in a number, for instance, 0 for 6, 0 for 9, 1 for 7, 1 for 4, 3 for 8, or 7 for 9. The only remedy is to write the digits distinctly. The second type is to alter the arrangement of the digits in a number, for instance to write 32 for 23 or 547 for 457. The third type occurs when the same number or digit occurs repeatedly. For instance, 12,225 may be copied as 1,225, or in the series of numbers 71, 63, 64, 64, 64, ... one or more of the 64’s may be omitted. We should be especially careful to avoid these mistakes.

Precautions: Certain general precautions should be taken to avoid mistakes in computations. Whenever possible, we should make provision for checking the accuracy of computation. One way is to make use of mathematical identities and compute the same quantity by different methods. Computations should be properly laid out, in tabular form, with check columns whenever possible. Further, before starting on the detailed computations, a few extra minutes may be taken to compute a rough answer mentally. This serves as a check on the final computation. To summarize, we may lay down the following five principles for avoiding mistakes in computation:
·         Write the digits distinctly.
·         Cut down copying and transferring operations.
·         Use tabular arrangement for computations.
·         Keep provision for checking.
·         Guess the answer beforehand.
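The principle of keeping provision for checking can be illustrated with a small sketch. It computes a sum of squared deviations in two ways that are linked by a mathematical identity and compares the results. The data values and the function names are made up for illustration only.

```python
from statistics import mean

def sum_sq_dev_direct(xs):
    """Sum of squared deviations from the mean, computed from the definition."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def sum_sq_dev_shortcut(xs):
    """The same quantity via the identity sum(x^2) - (sum x)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

# Hypothetical data, chosen only to illustrate the check.
data = [12, 15, 11, 18, 14]

direct = sum_sq_dev_direct(data)
shortcut = sum_sq_dev_shortcut(data)

# The two routes should agree up to rounding; a large gap signals a mistake.
agree = abs(direct - shortcut) < 1e-9
print(direct, shortcut, agree)
```

If the two figures disagree, at least one of the two computations contains a mistake, although the check does not say which one.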

A last word of warning may be helpful. If a mistake is made, it is almost impossible to locate and correct the mistake by going through the original computation, even if this is done a number of times. The best way out is to work the whole thing afresh, perhaps using a different computational layout altogether.

It is convenient at the start to make a distinction between different types of inaccuracies in computational work. A blunder is a gross inaccuracy arising through ignorance. A statistician who knows his theory rarely commits a blunder. But even when he knows the procedure in detail and uses machines for computations, he sometimes makes mistakes. There is a third type of inaccuracy, which we shall call an error. This is different from the other two types in that it is usually impracticable and sometimes even impossible to avoid. In the context of data, an error is an observation which is incorrect, perhaps because it was recorded wrongly in the first place or because it has been copied or typed incorrectly at some stage. An outlier is a ‘wild’ or extreme observation which does not appear to be consistent with the rest of the data. Outliers arise for a variety of reasons and can create severe problems. Errors and outliers are often confused. An error may or may not be an outlier, while an outlier may or may not be an error.
The search for errors and outliers is an important part of Initial Data Analysis. The terms data editing and data cleaning are used to denote procedures for detecting and correcting errors. Generally this is an iterative and ongoing process.
Some checks can be made ‘by hand’, but a computer can readily be programmed to make other routine checks and this should be done. The main checks are for credibility, consistency and completeness. Credibility checks include carrying out a range test on each variable. Here a credible range of possible values is prespecified for each variable and every observation is checked to ensure that it lies within the required range. These checks pick up gross outliers as well as impossible values. Bivariate and multivariate checks are also possible. A set of checks, called ‘if-then’ checks, can be made to assess credibility and consistency between variables.
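A minimal sketch of such routine checks is given below. The variable names (age, sex, pregnancies), the credible ranges and the single ‘if-then’ rule are all hypothetical and would in practice come from people who know the subject matter.

```python
# Hypothetical records; None marks a value that was not recorded.
records = [
    {"id": 1, "age": 34, "sex": "F", "pregnancies": 2},
    {"id": 2, "age": 212, "sex": "M", "pregnancies": 0},
    {"id": 3, "age": 8, "sex": "M", "pregnancies": 1},
    {"id": 4, "age": None, "sex": "F", "pregnancies": 0},
]

# Assumed credible range (inclusive) for each numeric variable.
CREDIBLE_RANGE = {"age": (0, 110), "pregnancies": (0, 20)}

def check_record(rec):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Completeness: every field must have a value.
    for name, value in rec.items():
        if value is None:
            problems.append(f"{name}: missing")
    # Credibility: range test on each variable that has a value.
    for name, (lo, hi) in CREDIBLE_RANGE.items():
        value = rec.get(name)
        if value is not None and not (lo <= value <= hi):
            problems.append(f"{name}: {value} outside [{lo}, {hi}]")
    # Consistency: an 'if-then' check between two variables.
    if rec.get("sex") == "M" and (rec.get("pregnancies") or 0) > 0:
        problems.append("if sex is M then pregnancies must be 0")
    return problems

report = {rec["id"]: check_record(rec) for rec in records}
for rec_id, problems in report.items():
    print(rec_id, problems)
```

The sketch only flags suspect values; deciding what to do with them remains a matter for the analyst.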
Another simple, but useful, check is to get a printout of the data and examine it by eye. Although it may be impractical to check every digit visually, the human eye is very efficient at picking out suspect values in a data array provided they are printed in strict column formation in a suitably rounded form. When a suspect value has been detected, the analyst must decide what to do about it. It may be possible to go back to the original data records and use them to make any necessary corrections. In some cases, such as occasional computer malfunctions, correction may not be possible and an observation which is known to be an error may have to be treated as a missing observation.
Extreme observations which, while large, could still be correct, are more difficult to handle. The tests for deciding whether an outlier is significant provide little information as to whether an observation is actually an error. Rather, external subject-matter considerations become paramount. It is essential to get advice from people in the field as to which suspect values are obviously silly or impossible, and which, while physically possible, are extremely unlikely and should be viewed with caution. Sometimes additional data may resolve the problem. It is sometimes sensible to remove an outlier, or treat it as a missing observation, but this outright rejection of an observation is rather drastic, particularly if there is evidence of a long tail in the distribution. Sometimes the outliers are the most interesting observations.
An alternative approach is to use robust methods of estimation which automatically downweight extreme observations. For example, one possibility for univariate data is to use Winsorization, in which an extreme observation is adjusted towards the overall mean, perhaps to the second most extreme value (either large or small as appropriate). However, many analysts prefer a diagnostic approach which highlights unusual observations for further study. Whatever amendments are required, there needs to be a clear, and preferably simple, sequence of steps for making the required changes to the data.
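The simplest form of Winsorization mentioned above can be sketched as follows. The sketch assumes that both the smallest and the largest value are pulled in to their nearest neighbours; adjusting only one tail, or more than one value per tail, are equally valid variants. The function name and the data are illustrative.

```python
def winsorize_one(xs):
    """Pull the smallest and largest values in to the second most extreme ones.

    Assumption: one value is adjusted in each tail. Values keep their
    original order; samples with fewer than three values are returned as is.
    """
    if len(xs) < 3:
        return list(xs)
    ordered = sorted(xs)
    lo, hi = ordered[1], ordered[-2]
    # Clamp every value into the interval [lo, hi].
    return [min(max(x, lo), hi) for x in xs]

# Hypothetical measurements with one wild value.
data = [9.8, 10.1, 10.4, 9.9, 10.0, 47.3]
cleaned = winsorize_one(data)
print(cleaned)  # [9.9, 10.1, 10.4, 9.9, 10.0, 10.4]
```

Unlike outright rejection, the sample size is unchanged; the extreme observation still counts, but with much less influence on the mean.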
Missing observations arise for a variety of reasons. A respondent may forget to answer all the questions, an animal may be killed accidentally before a treatment has shown any effect, a scientist may forget to record all the necessary variables, or a patient may drop out of a clinical trial. It is important to find out why an observation is missing. This is best done by asking ‘people in the field’. In particular, there is a world of difference between observations lost through random events and situations where missing observations are created deliberately. Further, the probability that an observation, y, is missing may depend on the value of y and/or on the values of explanatory variables. Only if the probability depends on neither are the observations said to be missing completely at random. For multivariate data, it is sometimes possible to infer missing values from other variables, particularly if redundant variables are included (e.g. age can be inferred from date of birth).
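The age example can be sketched in a few lines. The field names (age, date_of_birth), the survey date and the extra flag marking an inferred value are assumptions made for this illustration.

```python
from datetime import date

def age_on(birth, ref):
    """Age in completed years on the reference date."""
    years = ref.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the ref year.
    if (ref.month, ref.day) < (birth.month, birth.day):
        years -= 1
    return years

def fill_age(rec, survey_date):
    """Return a copy of the record with a missing age inferred where possible."""
    if rec.get("age") is None and rec.get("date_of_birth") is not None:
        inferred = age_on(rec["date_of_birth"], survey_date)
        return {**rec, "age": inferred, "age_inferred": True}
    return {**rec, "age_inferred": False}

# Hypothetical record with age missing but date of birth present.
record = {"id": 7, "age": None, "date_of_birth": date(1990, 6, 15)}
filled = fill_age(record, survey_date=date(2020, 6, 14))
print(filled["age"], filled["age_inferred"])  # 29 True
```

Keeping a flag for inferred values makes it possible to separate recorded from reconstructed observations later in the analysis.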

Errors may arise from one or more of the following sources: (a) the mathematical formulation is only an idealized and very seldom an exact description of reality; (b) parameters occurring in mathematical formulae are almost always subject to errors of estimation; (c) many mathematical problems can only be solved by an infinite process, whereas all computations have to be terminated after a finite number of steps; (d) because of the limited digit capacity of computing equipment, computations have to be carried out with numbers rounded off conveniently. However, it is not necessary to try to avoid all errors, because usually the final answer need be correct only to a certain number of figures. The theory of calculations with approximate numbers involves the following concepts:

Rounding Off: Because of the limited digit capacity of all computing equipment, computations are generally carried out with numbers rounded off suitably. To round off a number to n digits, replace all digits to the right of the n-th digit by zeros. If the discarded number contributes less than half a unit in the n-th place, leave the n-th digit unaltered; if it is greater than half a unit, increase the n-th digit by unity; if it is exactly half a unit, leave the n-th digit unaltered when it is an even number and increase it by unity when it is an odd number. For example, the numbers 237.582, 46.85 and 3.735, when rounded off to three digits, become 238, 46.8 and 3.74, respectively.
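The rule can be tried out with the following sketch, which uses Python's decimal module and its ROUND_HALF_EVEN mode. The function name round_sig is an assumption made for illustration. Numbers are passed as strings because a value such as 46.85 cannot be stored exactly as a binary floating-point number, so the "exactly half a unit" case would otherwise be lost.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(text, n):
    """Round the decimal number written in `text` to n digits.

    Ties (a discarded part of exactly half a unit) go to the even digit.
    """
    d = Decimal(text)
    if d == 0:
        return d
    # adjusted() is the power of ten of the leading digit, e.g. 2 for 237.582.
    exponent = d.adjusted() - n + 1
    quantum = Decimal(1).scaleb(exponent)  # one unit in the n-th place
    return d.quantize(quantum, rounding=ROUND_HALF_EVEN)

for text in ["237.582", "46.85", "3.735"]:
    print(text, "->", round_sig(text, 3))
# 237.582 -> 238
# 46.85 -> 46.8
# 3.735 -> 3.74
```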

Significant Figures: In a rounded-off number, significant figures are the digits 1, 2, ..., 9. Zero (0) is also a significant figure except when it is used to fix the decimal point or to fill the places of unknown or discarded digits. Thus in 0.002603, the number of significant figures is only four. Given a number like 58,100 we cannot say whether the zeros are significant figures or not; to be specific we should write it in the form $5.81 \times 10^4$, $5.810 \times 10^4$ or $5.8100 \times 10^4$ to indicate respectively that the number of significant figures is three, four or five.
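The same convention is available on a computer through exponent notation. The short sketch below prints 58,100 with three, four and five significant figures; the helper name sci is an assumption made for illustration.

```python
def sci(value, sig):
    """Write `value` in exponent notation with `sig` significant figures."""
    # One digit precedes the decimal point, so sig - 1 digits follow it.
    return f"{float(value):.{sig - 1}e}"

for sig in (3, 4, 5):
    print(sig, sci(58100, sig))
# 3 5.81e+04
# 4 5.810e+04
# 5 5.8100e+04
```

Here e+04 stands for multiplication by ten to the power four, so the number of digits shown before the exponent is exactly the number of significant figures claimed.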

Error Involved in the Use of Approximate Numbers: If $u$ is the true value of a number and $u_0$ an approximation to it, then the error involved is $E = u - u_0$. The relative error is $e = E/u$ and the percentage error is $p = 100e$.
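A small numerical sketch may make these quantities concrete. It assumes the usual definitions, with the relative error taken as the error divided by the true value and the percentage error as one hundred times the relative error; the function name and the figures are illustrative.

```python
def approximation_errors(u, u0):
    """Return (E, e, p) for true value u and approximation u0.

    Assumed definitions: E = u - u0, e = E / u, p = 100 * e.
    """
    if u == 0:
        raise ValueError("relative error is undefined when the true value is 0")
    E = u - u0
    e = E / u
    p = 100 * e
    return E, e, p

# Hypothetical example: a true value of 200 approximated by 198.
E, e, p = approximation_errors(200, 198)
print(E, e, p)  # 2 0.01 1.0
```

The error keeps its sign under these definitions, so an approximation that is too large gives negative values of all three quantities.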