Statistical Averages

Summary Statistics

Once the data have been checked for quality, the first step of the analysis is usually descriptive statistics. The general aim is to summarize the data, iron out any peculiarities, and perhaps get ideas for a more sophisticated analysis. The data summary may help to suggest a suitable model, which in turn suggests an appropriate inferential procedure. This first phase is described as the initial examination of the data, or initial data analysis. It has much in common with exploratory data analysis, which includes a variety of graphical and numerical techniques for exploring data. Exploratory data analysis is thus an essential part of nearly every analysis. It provides a reasonably systematic way of digesting and summarizing the data, although its exact form naturally varies widely from problem to problem. In initial and exploratory data analysis, the following are given due importance.

Measures of Central Tendency

One of the most important aspects of describing a distribution is the central value around which the observations are distributed. Any arithmetical measure which is intended to represent the center or central value of a set of observations is known as a measure of central tendency.

The Arithmetic Mean (or simply Mean)

Suppose that $n$ observations are obtained for a sample from a population. Denote the values of the $n$ observations by $x_1, x_2, \ldots, x_n$, with $x_1$ being the value of the first sample observation, $x_2$ that of the second observation, and so on. The arithmetic mean (or mean, or average), denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The symbol $\Sigma$ (read as 'sigma') means sum the individual values $x_1, x_2, \ldots, x_n$ of the variable $X$. Usually the limits of the summation are not written, since it is understood that the summation is over all $n$ values. Hence we can write

$$\bar{x} = \frac{\sum x}{n}$$

The above formula gives the mean when the values $x_1, x_2, \ldots, x_n$ of $n$ discrete observations are available. Sometimes the data are given in the form of a frequency distribution table; the formula is then as follows.

Arithmetic Mean of Grouped Data

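Suppose that there are $k$ classes or intervals. Let $x_1, x_2, \ldots, x_k$ denote the class mid-points of these $k$ intervals and let $f_1, f_2, \ldots, f_k$ denote the corresponding frequencies of these classes. Then the arithmetic mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{1}{N}\sum_{i=1}^{k} f_i x_i, \qquad N = \sum_{i=1}^{k} f_i$$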
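As an illustration, the grouped-data mean (the sum of frequency times mid-point, divided by the total frequency) can be sketched in Python. The function name `grouped_mean` and the frequency table below are hypothetical, chosen only to show the arithmetic.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    total = sum(frequencies)  # N, the total frequency
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total


# Hypothetical table with classes 0-10, 10-20, 20-30, 30-40
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]
mean = grouped_mean(midpoints, frequencies)  # (10 + 75 + 200 + 175) / 20
print(mean)
```

Because every observation in a class is replaced by the class mid-point, the result is an approximation to the mean of the underlying raw data.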

Properties of the arithmetic mean

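(a) The sum of the deviations of a set of $n$ observations $x_1, x_2, \ldots, x_n$ from their mean $\bar{x}$ is zero. Let $d_i = x_i - \bar{x}$ be the deviation of $x_i$ from $\bar{x}$; then

$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = 0$$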

(b) If $x_1, x_2, \ldots, x_n$ are $n$ observations, $\bar{x}$ is their mean and $d_i = x_i - A$ is the deviation of $x_i$ from a given number $A$, then

$$\bar{x} = A + \frac{1}{n}\sum_{i=1}^{n} d_i$$

(d) If in a frequency distribution all the $k$ class intervals are of the same width $c$, and $d_i = x_i - A$ denotes the deviation of $x_i$ from $A$, where $A$ is the value of a certain mid-point and $x_1, x_2, \ldots, x_k$ are the class mid-points of the $k$ classes, then $d_i = c\,u_i$ where $u_i = 0, \pm 1, \pm 2, \ldots$ and

$$\bar{x} = A + c\,\frac{\sum_{i=1}^{k} f_i u_i}{\sum_{i=1}^{k} f_i}$$
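These properties can be checked numerically. The sketch below uses made-up data and an arbitrarily chosen assumed mean A; it verifies that deviations from the mean sum to zero, that the mean can be recovered from deviations about A, and that the step-deviation shortcut for equal-width classes agrees with the direct grouped mean.

```python
import math

x = [12, 15, 11, 18, 14]        # hypothetical raw observations
n = len(x)
mean = sum(x) / n               # 70 / 5

# (a) Deviations from the mean sum to zero.
sum_dev = sum(xi - mean for xi in x)

# (b) Mean recovered from deviations about an arbitrary number A.
A = 10
mean_via_A = A + sum(xi - A for xi in x) / n

# (d) Step deviations for equal-width classes (hypothetical table).
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]
c = 10                          # common class width
A_mid = 25                      # assumed mean, chosen as one of the mid-points
u = [(m - A_mid) / c for m in midpoints]   # -2, -1, 0, 1
N = sum(freqs)
mean_step = A_mid + c * sum(f * ui for f, ui in zip(freqs, u)) / N
mean_direct = sum(f * m for f, m in zip(freqs, midpoints)) / N

print(sum_dev, mean_via_A, mean_step, mean_direct)
```

The step-deviation form mainly saves arithmetic when working by hand, since the u values are small integers.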

The Median

The median of a set of $n$ measurements or observations $x_1, x_2, \ldots, x_n$ is the middle value when the measurements are arranged in an array according to their order of magnitude. If $n$ is odd, the middle value is the median. If $n$ is even, there are two middle values and the average of these values is the median. The median is the value which divides the set of observations into two equal halves, such that 50% of the observations lie below the median and 50% above it. The median depends not on the actual values of the observations but on their positions in the ordered array.
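The odd and even cases can be sketched as follows. The function name `median` is illustrative; Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values


print(median([7, 3, 9]))       # odd n
print(median([1, 2, 3, 4]))    # even n
```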

The Median of Grouped Data

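For grouped data, the median is obtained by interpolation within the median class, that is, the class in which the cumulative frequency first reaches $N/2$:

$$\text{Median} = L + \frac{N/2 - cf}{f} \times c$$

where $L$ is the lower limit of the median class, $N = \sum f_i$ is the total frequency, $cf$ is the cumulative frequency of the classes preceding the median class, $f$ is the frequency of the median class and $c$ is the class width.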
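A minimal sketch of the grouped median follows, assuming the usual linear-interpolation formula Median = L + ((N/2 - cf) / f) × c, where L is the lower limit of the median class, cf the cumulative frequency before it, f its frequency and c the class width. The function name and the table are hypothetical.

```python
def grouped_median(lower_limits, width, frequencies):
    """Interpolated median of a frequency distribution with equal class width."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cumulative = 0                      # cf: frequency accumulated before the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 2, 5, 8, 5
value = grouped_median([0, 10, 20, 30], 10, [2, 5, 8, 5])
print(value)   # 20 + (10 - 7) / 8 * 10
```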

The Mode

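The mode is the observation which occurs most frequently in a set. In grouped data, the mode is commonly worked out as

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$

where $L$ is the lower limit of the modal class (the class with the highest frequency), $f_1$ is the frequency of the modal class, $f_0$ and $f_2$ are the frequencies of the classes preceding and succeeding it, and $c$ is the class width.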

The mode can be determined analytically in the case of a continuous distribution. For a symmetrical distribution, the mean, median and mode coincide. For a distribution skewed to the left (a negatively skewed distribution), the mean, the median and the mode occur in that order (as they appear in the dictionary), and for a distribution skewed to the right (a positively skewed distribution) they occur in the reverse order: mode, median and mean. For a moderately asymmetrical (skewed) distribution there is an empirical relation: Mean - Mode = 3 (Mean - Median).
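For ungrouped data the mode can be found by counting, and the empirical relation can be rearranged to Mode = 3 × Median - 2 × Mean to give a rough estimate. The sketch below uses hypothetical helper names and illustrative values; the empirical estimate is only an approximation for moderately skewed data.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values (data may have more than one mode)."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, count in counts.items() if count == highest)


def empirical_mode(mean, median):
    """Rearrangement of Mean - Mode = 3 (Mean - Median)."""
    return 3 * median - 2 * mean


print(modes([5, 6, 7, 7, 8, 9]))                          # single mode
print(modes(["clay", "loam", "clay", "sand", "loam"]))    # qualitative data, two modes
print(empirical_mode(mean=20, median=18))
```

Counting works for qualitative categories as well as numbers, which is why the mode is the only one of these averages available for such data.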

The Geometric Mean

There are two other averages, the geometric mean and the harmonic mean, which are sometimes used. The Geometric Mean (GM) of a set of observations is such that its logarithm equals the arithmetic mean of the logarithms of the values of the observations.

$$\text{GM} = (x_1 x_2 \cdots x_n)^{1/n}, \qquad \log \text{GM} = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$

In the case of a frequency distribution,

$$\text{GM} = \left(x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}\right)^{1/N}, \qquad \log \text{GM} = \frac{1}{N}\sum_{i=1}^{k} f_i \log x_i, \qquad N = \sum_{i=1}^{k} f_i$$

The geometric mean can be obtained only if the values assumed by the observations are positive (greater than zero).
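The logarithmic definition translates directly into code. The following is a sketch with a hypothetical function name; it accepts optional frequencies, in which case the logarithms are weighted by frequency and divided by the total frequency.

```python
import math


def geometric_mean(values, frequencies=None):
    """exp of the (frequency-weighted) arithmetic mean of the logarithms."""
    if frequencies is None:
        frequencies = [1] * len(values)      # raw data: each value counted once
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires strictly positive values")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    log_gm = sum(f * math.log(x) for x, f in zip(values, frequencies)) / total
    return math.exp(log_gm)


print(geometric_mean([1, 10, 100]))          # cube root of 1000
print(geometric_mean([2, 8], [3, 1]))        # (2^3 * 8)^(1/4)
```

Working through logarithms also avoids overflow when the product of many observations would be very large.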

Harmonic mean

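The Harmonic Mean (HM) of a set of observations is such that its reciprocal is the arithmetic mean of the reciprocals of the values of the observations:

$$\frac{1}{\text{HM}} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}, \qquad \text{HM} = \frac{n}{\sum_{i=1}^{n} (1/x_i)}$$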

The harmonic mean is rarely computed for a frequency distribution.
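A short sketch of the definition: the harmonic mean is the number of observations divided by the sum of their reciprocals. The function name is illustrative, and restricting the input to positive values is an assumption made here for simplicity.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals (positive values assumed)."""
    if not values:
        raise ValueError("harmonic mean of empty data is undefined")
    if any(x <= 0 for x in values):
        raise ValueError("this sketch assumes strictly positive values")
    return len(values) / sum(1 / x for x in values)


hm = harmonic_mean([1, 2, 4])   # 3 / (1 + 0.5 + 0.25) = 12/7
print(hm)
```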

Weighted Mean
If there are $n$ observations $x_1, x_2, x_3, \ldots, x_n$ with corresponding weights $w_1, w_2, w_3, \ldots, w_n$, then the weighted mean is given by

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

In computing the mean of a frequency distribution, we take the frequency of a class as its weight. That is,

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

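Hence, the arithmetic mean of a frequency distribution is a special case of the weighted mean. For positive observations, the arithmetic, geometric and harmonic means are related by

A.M. ≥ G.M. ≥ H.M.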
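Both statements can be illustrated with a small sketch: a weighted mean that reduces to the ordinary mean under equal weights, and a numerical check of A.M. ≥ G.M. ≥ H.M. on one made-up positive data set. The function name and all values are hypothetical, and a single example illustrates the inequality rather than proving it.

```python
import math


def weighted_mean(values, weights):
    """sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


wm = weighted_mean([10, 20, 30], [1, 2, 1])     # (10 + 40 + 30) / 4
equal = weighted_mean([10, 20, 30], [1, 1, 1])  # equal weights give the ordinary mean

x = [2, 4, 8]
n = len(x)
am = sum(x) / n                                   # arithmetic mean
gm = math.exp(sum(math.log(v) for v in x) / n)    # geometric mean
hm = n / sum(1 / v for v in x)                    # harmonic mean
print(wm, equal, am, gm, hm)
```

The three means coincide only when all observations are equal.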

Important characteristics of a good average

Since an average is a representative item of a distribution it should possess the following properties :

1. It should take all items into consideration.

2. It should not be affected by extreme values.

3. It should be stable from sample to sample.

4. It should be capable of being used for further statistical analysis.

The mean satisfies all of these properties except that it is affected by the presence of extreme items. For example, if the items are 5, 6, 7, 7, 8 and 9, then the mean, median and mode are all equal to 7. If the last value is 30 instead of 9, the mean becomes 10.5, whereas the median and mode are unchanged. Though the median and mode are better in this respect, they do not satisfy the other properties. Hence the mean is the best average among these three.
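The effect of a single extreme value can be reproduced with Python's standard `statistics` module, using the same six items as in the example above.

```python
import statistics

original = [5, 6, 7, 7, 8, 9]
with_extreme = [5, 6, 7, 7, 8, 30]   # last value replaced by an extreme item

for data in (original, with_extreme):
    print(
        data,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

Only the mean moves (from 7 to 63/6 = 10.5); the median and mode stay at 7 because they depend on position and frequency rather than magnitude.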

When to use different averages

The proper average to be used depends upon the nature of the data, nature of the frequency distribution and the purpose.

If the data are qualitative, only the mode can be computed. For example, when we are interested in knowing the typical soil type in a locality or the typical cropping pattern in a region, we can use the mode. On the other hand, if the data are quantitative, we can use any one of the averages.

If the data are quantitative, then we have to consider the nature of the frequency distribution. When the frequency distribution is skewed (not symmetrical), the median or mode is the proper average. In the case of raw data in which extreme values, either small or large, are present, the median or mode is also the proper average. In the case of a symmetrical distribution, the mean, median or mode can be used. However, as seen already, the mean is preferred over the other two.

When we are dealing with rates, speeds and prices, we use the harmonic mean. If we are interested in relative change, as in the case of bacterial growth, cell division, etc., the geometric mean is the most appropriate average.
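Two small made-up cases show why these choices matter. For a journey made of two equal distances at different speeds, the harmonic mean of the speeds equals total distance over total time. For successive growth factors, the geometric mean is the constant factor that would produce the same overall growth. The sketch assumes Python 3.8 or later, where `statistics.geometric_mean` is available; the speeds, distance and growth factors are hypothetical.

```python
import math
import statistics

# Average speed over two equal distances at 40 km/h and 60 km/h.
speeds = [40, 60]
avg_speed = statistics.harmonic_mean(speeds)
leg = 120                                            # km per leg (any equal distance works)
direct = 2 * leg / (leg / speeds[0] + leg / speeds[1])   # total distance / total time

# Average growth factor when a culture doubles, then grows eightfold.
factors = [2, 8]
avg_factor = statistics.geometric_mean(factors)
overall = math.prod(factors)                         # total growth over both periods

print(avg_speed, direct)
print(avg_factor, avg_factor ** len(factors), overall)
```

The arithmetic mean would give 50 km/h and a growth factor of 5, both of which overstate the true averages in these cases.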