Skewness and Kurtosis in Statistics

The average and a measure of dispersion describe a distribution, but they are not sufficient to describe its shape. For this purpose we use two further concepts, skewness and kurtosis.
[Figure: curves of symmetrical and skewed distributions]
Skewness
Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are evenly distributed on both sides of the mean. For example, the following distribution is symmetrical about its mean 3.
x              :    1    2    3    4    5
frequency (f)  :    5    9   12    9    5

In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.
Several measures are used to express the direction and extent of skewness of a distribution. The most important are those given by Pearson. The first is the coefficient of skewness:
$Sk = \dfrac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$

For a symmetric distribution Sk = 0. If the distribution is negatively skewed then Sk is negative and if it is positively skewed then Sk is positive. The range for Sk is from -3 to 3.
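The claim that Sk = 0 for a symmetric distribution can be checked on the frequency table given earlier. The sketch below assumes the common form Sk = 3(mean - median) / S.D., which is consistent with the stated range of -3 to 3, and uses the population form of the standard deviation; both choices are assumptions of this example.

```python
import statistics

# Frequency table from the text: value -> frequency
freq = {1: 5, 2: 9, 3: 12, 4: 9, 5: 5}

# Expand the table into the 40 raw observations
data = [x for x, f in freq.items() for _ in range(f)]

mean = statistics.fmean(data)
median = statistics.median(data)
mode = statistics.mode(data)
sd = statistics.pstdev(data)  # population form (divides by n)

# Assumed form of Pearson's coefficient of skewness
sk = 3 * (mean - median) / sd

print("mean:", mean, "median:", median, "mode:", mode)
print("Sk:", sk)
```

Mean, median and mode all come out as 3, so the coefficient is zero, as expected for a symmetrical distribution.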

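The other measure uses the β (read 'beta') coefficient, which is given by $\beta_1 = \mu_3^2 / \mu_2^3$, where $\mu_2$ and $\mu_3$ are the second and third central moments. The second central moment $\mu_2$ is nothing but the variance. The sample estimate of this coefficient is $b_1 = m_3^2 / m_2^3$, where $m_2$ and $m_3$ are the sample central moments given by
$m_2 = \dfrac{1}{n}\sum (x - \bar{x})^2, \qquad m_3 = \dfrac{1}{n}\sum (x - \bar{x})^3$

For a symmetrical distribution $\beta_1 = 0$. Since $\beta_1$ itself cannot be negative, the direction of skewness is read from $\mu_3$: skewness is positive or negative depending upon whether $\mu_3$ is positive or negative.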
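A minimal sketch of the moment-based measure, assuming the usual definitions b1 = m3^2 / m2^3 with sample central moments that divide by n. The helper names and the right-tailed data set are made up for illustration.

```python
def central_moment(data, k):
    """k-th sample central moment, dividing by n (assumed convention)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness_b1(data):
    """Return (b1, m3), where b1 = m3^2 / m2^3."""
    m2 = central_moment(data, 2)
    m3 = central_moment(data, 3)
    return m3 ** 2 / m2 ** 3, m3


# Symmetric table from the text, expanded into raw observations
symmetric = [1] * 5 + [2] * 9 + [3] * 12 + [4] * 9 + [5] * 5
# Hypothetical data with a long right tail
right_tailed = [1, 1, 2, 2, 3, 10]

b1_sym, m3_sym = skewness_b1(symmetric)
b1_right, m3_right = skewness_b1(right_tailed)

print("symmetric:    b1 =", b1_sym, " m3 =", m3_sym)
print("right-tailed: b1 =", b1_right, " m3 =", m3_right)
```

For the symmetric table both values are zero. For the right-tailed data the third moment is positive, which is what indicates positive skewness, since b1 alone carries no sign.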

Kurtosis
A measure of the peakedness or convexity of a curve is known as kurtosis.

[Figure: three curves symmetrical about the same mean, labelled (1) mesokurtic, (2) leptokurtic and (3) platykurtic]

All three curves (1), (2) and (3) in the figure are symmetrical about the mean, yet they are not of the same type: each has a different peak. Curve (1) is known as mesokurtic (normal curve), curve (2) as leptokurtic (peaked curve) and curve (3) as platykurtic (flat curve). Kurtosis is measured by Pearson's coefficient $\beta_2$ (read 'beta two'), given by $\beta_2 = \mu_4 / \mu_2^2$.
The sample estimate of this coefficient is $b_2 = m_4 / m_2^2$,
where $m_4$ is the fourth sample central moment given by $m_4 = \dfrac{1}{n}\sum (x - \bar{x})^4$.

The distribution is called mesokurtic (normal) if $\beta_2 = 3$. When $\beta_2$ is more than 3 the distribution is said to be leptokurtic. If $\beta_2$ is less than 3 the distribution is said to be platykurtic.
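The classification rule can be sketched as follows, assuming the usual sample form b2 = m4 / m2^2 with central moments that divide by n. The function names and the tolerance used to test for "equal to 3" are choices made for this example only.

```python
def central_moment(data, k):
    """k-th sample central moment, dividing by n (assumed convention)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def kurtosis_b2(data):
    """Sample estimate b2 = m4 / m2^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


def classify(b2, tol=1e-9):
    """Label a b2 value using the thresholds stated in the text."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


# Symmetric frequency table from the skewness section, expanded
data = [1] * 5 + [2] * 9 + [3] * 12 + [4] * 9 + [5] * 5

b2 = kurtosis_b2(data)
print(round(b2, 4), classify(b2))
```

For this small table b2 falls below 3, so the rule labels it platykurtic even though the distribution is perfectly symmetrical: skewness and kurtosis describe different aspects of shape.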

Measures of Dispersion in Statistics


We know that averages are representative of a frequency distribution, but they fail to give a complete picture of it. They tell us nothing about the scatter of the observations within the distribution.

Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each.

The distribution may be as follows:
Variety I          45        42        42        41        40
Variety II        54        48        42        33        30

For the figures tabulated above, the mean yield is 42 kg for Variety I and 41.4 kg for Variety II, so the two means are practically equal. But we cannot say that the two varieties perform alike. The yields of the first variety are much more uniform, whereas those of the second variety are more variable. The first variety may be preferred since it is more consistent in yield performance. This example shows that a measure of central tendency alone is not sufficient to describe a frequency distribution; we also need a measure of the scatter of the observations. The scatter or variation of observations about their average is called dispersion. There are different measures of dispersion, such as the range, the quartile deviation, the mean deviation and the standard deviation.
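The comparison can be reproduced directly from the tabulated yields. This is a minimal sketch using the range as a first rough indicator of spread; the `summary` helper is a name introduced here. Note that the Variety II figures as tabulated sum to 207, so their mean works out to 41.4 kg rather than exactly 42 kg.

```python
import statistics

# Yields (kg per plot) exactly as tabulated in the text
variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]


def summary(yields):
    """Return (mean, range) for a list of yields."""
    return statistics.fmean(yields), max(yields) - min(yields)


for name, yields in (("Variety I", variety_1), ("Variety II", variety_2)):
    mean, spread = summary(yields)
    print(f"{name}: mean = {mean:.1f} kg, range = {spread} kg")
```

The means are close to each other, while the spread of Variety II is several times that of Variety I.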

Range
The simplest measure of dispersion is the range. The range is the difference between the maximum and minimum values in a group of observations. For example, suppose that the yields (kg per plot) of a variety from five plots are 8, 9, 8, 10 and 11. The range is (11 - 8) = 3 kg. In practice the range is often reported as 8 - 11 kg.
The range takes only the maximum and minimum values into account and ignores all the others. Hence it is a very unstable and unreliable indicator of the amount of variation, and it is strongly affected by extreme values. In the above example, if we have 15 instead of 11, the range becomes (15 - 8) = 7 kg. To avoid these difficulties another measure of dispersion, the quartile deviation, is preferred.
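A short sketch of the example above, showing how a single extreme value changes the range. The function name `value_range` is introduced here for illustration.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


yields = [8, 9, 8, 10, 11]        # yields from the text (kg per plot)
with_extreme = [8, 9, 8, 10, 15]  # same data with 11 replaced by 15

print(value_range(yields))        # 3
print(value_range(with_extreme))  # 7
```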

Quartile Deviation
The values below the first quartile and the values above the third quartile can be set aside, on the assumption that unusually extreme values are eliminated in this way. We can then take the mean of the deviations of the two quartiles from the second quartile (median). That is,
$\text{Q.D.} = \dfrac{(Q_3 - Q_2) + (Q_2 - Q_1)}{2} = \dfrac{Q_3 - Q_1}{2}$
This quantity is known as the quartile deviation (Q.D.).
The quartile deviation is more stable than the range as it depends on two intermediate values. It is not affected by extreme values, since these have already been removed. However, the quartile deviation also fails to take all the observations into account.
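The stability of the quartile deviation can be illustrated on the yields from the range example. The sketch assumes Q.D. = (Q3 - Q1) / 2 and uses the "inclusive" method of Python's `statistics.quantiles`; several quartile conventions exist, and with so few observations they can give different values.

```python
import statistics


def quartile_deviation(values):
    """Q.D. = (Q3 - Q1) / 2, quartiles by the 'inclusive' method."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2


yields = [8, 9, 8, 10, 11]        # yields from the range example
with_extreme = [8, 9, 8, 10, 15]  # largest value replaced by 15

print(quartile_deviation(yields))
print(quartile_deviation(with_extreme))
```

Under this convention both calls return the same value, whereas the range of the same two data sets changes from 3 to 7.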

Mean Deviation
Mean deviation is the mean of the absolute deviations of individual values from their average. The average may be either the mean or the median. For raw data the mean deviation from the median is the least, so the median is considered the most suitable average for this purpose. In practice, however, the mean is usually used. The mean deviation is given by
M.D. = $\dfrac{\sum |x - \bar{x}|}{n}$ for raw data and M.D. = $\dfrac{\sum f\,|x - \bar{x}|}{\sum f}$ for grouped data
All positive and negative differences are treated as positive values. Hence we use the modulus symbol | |, and $|x - \bar{x}|$ is read as "modulus of $x - \bar{x}$". If we take $x - \bar{x}$ as such, the sum of the deviations, $\sum (x - \bar{x})$, will be 0. Hence, if the signs are not eliminated the mean deviation will always be 0, which tells us nothing about the spread.

The steps of computation are as follows :
Step 1: If the classes are not continuous we have to make them continuous.
Step 2: Find out the mid values of the classes (mid-value = x).
Step 3: Compute the mean.
Step 4: Find out $|x - \bar{x}|$ for all values of x.
Step 5: Multiply each $|x - \bar{x}|$ by the corresponding frequency.
Step 6: Use the formula.
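The steps above can be sketched as follows for grouped data, assuming the formula M.D. = sum(f * |x - mean|) / sum(f). The classes and frequencies are hypothetical and already continuous, so Step 1 needs no work here.

```python
# Hypothetical continuous classes (lower, upper) and their frequencies
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 9, 12, 9, 5]


def grouped_mean_deviation(classes, freqs):
    """Return (mean, mean deviation about the mean) for grouped data."""
    mids = [(lo + hi) / 2 for lo, hi in classes]         # Step 2: mid values
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / n   # Step 3: mean
    abs_dev = [abs(x - mean) for x in mids]              # Step 4: |x - mean|
    weighted = [f * d for f, d in zip(freqs, abs_dev)]   # Step 5: f * |x - mean|
    return mean, sum(weighted) / n                       # Step 6: formula


mean, md = grouped_mean_deviation(classes, freqs)
print("mean =", mean, " M.D. =", md)
```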

The mean deviation takes all the values into consideration. It is fairly stable compared to the range or the quartile deviation. However, since the mean deviation ignores the signs of the deviations, it is not suitable for further statistical analysis, and it is not as stable as the standard deviation, which is defined next.

Standard Deviation
Simply ignoring the signs of the deviations is not mathematically sound. Instead, we may square each deviation so that negative values become positive. After calculating the average of the squared deviations, the result can be expressed in the original units by taking its square root. This measure of variation is known as the standard deviation.
The standard deviation is defined as the square root of the mean of the squared deviations of individual values from their mean. Symbolically,
Standard Deviation (S.D.) $= \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ or $\sqrt{\dfrac{\sum x^2 - (\sum x)^2 / n}{n}}$
It is called the standard deviation because it indicates a sort of group standard spread of values around their mean. For grouped data it is given as
Standard Deviation (S.D.) $= \sqrt{\dfrac{\sum f (x - \bar{x})^2}{n}}$ or $\sqrt{\dfrac{\sum f x^2 - (\sum f x)^2 / n}{n}}$, where $n = \sum f$
When the standard deviation is computed from a sample in order to estimate the population standard deviation, n - 1 is substituted for n in the formula. This makes the sample variance an unbiased estimate of the population variance. Thus, the sample standard deviation is written as
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
For grouped data it is given by

$s = C \times \sqrt{\dfrac{\sum f d^2 - (\sum f d)^2 / n}{n - 1}}$
where,
            C = class interval
            d = (x - A) / C as given under mean.

The square of the standard deviation is known as the variance. In the analysis of variance technique, the term $\sum (x - \bar{x})^2$ is called the sum of squares, and the variance is called the mean square. The standard deviation is denoted by s in the case of a sample, and by σ (read 'sigma') in the case of a population.
The standard deviation is the most widely used measure of dispersion. It takes all the items into consideration. It is more stable compared to other measures. However, like the mean, it is inflated by extreme items.
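As a small worked sketch, the sum of squares, the variance and the standard deviation can be computed for the Variety I yields used earlier. It assumes the usual definitions: divide the sum of squares by n for the population form and by n - 1 for the sample form.

```python
import math
import statistics

yields = [45, 42, 42, 41, 40]  # Variety I yields from the text (kg per plot)

n = len(yields)
mean = sum(yields) / n

# Sum of squares: squared deviations from the mean, added up
ss = sum((x - mean) ** 2 for x in yields)

pop_var = ss / n           # population variance (divide by n)
sample_var = ss / (n - 1)  # sample variance, the "mean square" (divide by n - 1)

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print("sum of squares:", ss)
print("population S.D.:", round(pop_sd, 4))
print("sample S.D.:    ", round(sample_sd, 4))

# Cross-check against the standard library
print(math.isclose(pop_sd, statistics.pstdev(yields)))
print(math.isclose(sample_sd, statistics.stdev(yields)))
```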

The standard deviation has some additional special characteristics. It is not affected by adding a constant to, or subtracting a constant from, each observed value. It is affected by multiplying or dividing each observation by a constant: when the observations are multiplied by a constant, the resulting standard deviation equals the original standard deviation multiplied by the absolute value of that constant. (Note that division of all observations by a constant C is equivalent to multiplication by its reciprocal, 1/C. Subtracting a constant C is equivalent to adding the constant -C.)
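These two properties are easy to verify numerically. The sketch below uses the Variety I yields and arbitrary constants (10 and 3) chosen only for illustration.

```python
import math
import statistics

yields = [45, 42, 42, 41, 40]  # Variety I yields from the text

sd = statistics.stdev(yields)

shifted = [x + 10 for x in yields]  # add a constant to every value
scaled = [x * 3 for x in yields]    # multiply every value by a constant

sd_shifted = statistics.stdev(shifted)
sd_scaled = statistics.stdev(scaled)

print(math.isclose(sd_shifted, sd))     # adding a constant leaves S.D. unchanged
print(math.isclose(sd_scaled, 3 * sd))  # multiplying by 3 triples the S.D.
```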
Standard deviations can be pooled. If the sum of squares for the first distribution with $n_1$ observations is $SS_1$, and the sum of squares for the second distribution with $n_2$ observations is $SS_2$, then the pooled standard deviation is given by
$s_p = \sqrt{\dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}}$
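As a sketch, assuming the usual pooled form sqrt((SS1 + SS2) / (n1 + n2 - 2)), the two paddy varieties from the earlier example can be pooled as follows. The helper names are introduced here for illustration.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def pooled_sd(sample_1, sample_2):
    """Pooled S.D. under the assumed form sqrt((SS1 + SS2) / (n1 + n2 - 2))."""
    ss1 = sum_of_squares(sample_1)
    ss2 = sum_of_squares(sample_2)
    return math.sqrt((ss1 + ss2) / (len(sample_1) + len(sample_2) - 2))


variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

print(round(pooled_sd(variety_1, variety_2), 4))
```

Each sum of squares is taken about its own group mean, so a difference between the two means does not inflate the pooled value.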

           
Measures of Relative Dispersion
Suppose that the two distributions to be compared are expressed in the same units and their means are equal or nearly equal. Then their variability can be compared directly by using their standard deviations. However, if their means are widely different or if they are expressed in different units of measurement, we cannot use the standard deviations as such for comparing their variability. We have to use relative measures of dispersion in such situations.
There are relative measures of dispersion based on the range, the quartile deviation, the mean deviation and the standard deviation. Of these, the coefficient of variation, which is based on the standard deviation, is the most important. The coefficient of variation is given by
C.V. = (S.D. / Mean) × 100
The C.V. is a unit-free measure. It is always expressed as a percentage. The C.V. is small when the variation is small. Of two groups, the one with the smaller C.V. is said to be more consistent.
The coefficient of variation is unreliable if the mean is near zero. It is also unstable if the measurements are not on a ratio scale. The C.V. is informative when it is reported along with the mean and standard deviation; otherwise it may be misleading.
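A minimal sketch of the C.V. calculation for the two paddy varieties from the earlier example. Using the sample standard deviation (n - 1 divisor) is an assumption of this sketch, as the formula above says only "S.D.".

```python
import statistics


def coefficient_of_variation(values):
    """C.V. = (S.D. / Mean) x 100, using the sample S.D. (assumed)."""
    return statistics.stdev(values) / statistics.fmean(values) * 100


variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

cv_1 = coefficient_of_variation(variety_1)
cv_2 = coefficient_of_variation(variety_2)

print(f"Variety I:  C.V. = {cv_1:.2f}%")
print(f"Variety II: C.V. = {cv_2:.2f}%")
```

Variety I has the smaller C.V., in line with the earlier observation that it is the more consistent of the two.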