Measures of Dispersion in Statistics


We know that averages are representative of a frequency distribution, but they fail to give a complete picture of the distribution: they tell us nothing about how the observations are scattered within it.

Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each.

The distribution may be as follows:
Variety I          45        42        42        41        40
Variety II        54        48        42        33        30

It can be seen that the mean yield for both varieties is 42 kg, but we cannot say that the performance of the two varieties is the same. There is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second variety. The first variety may be preferred since it is more consistent in yield performance. From the above example, it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution. In addition, we need a measure of the scatter of the observations. The scatter or variation of observations from their average is called the dispersion. There are different measures of dispersion, such as the range, the quartile deviation, the mean deviation and the standard deviation.

Range
The simplest measure of dispersion is the range. The range is the difference between the maximum and minimum values in a group of observations. For example, suppose that the yields (kg per plot) of a variety from five plots are 8, 9, 8, 10 and 11. The range is (11 - 8) = 3 kg. In practice the range is often reported as 8 - 11 kg.
The range takes only the maximum and minimum values into account and not all the values; hence it is a very unstable or unreliable indicator of the amount of variation. It is affected by extreme values. In the above example, if we had 15 instead of 11, the range would be (15 - 8) = 7 kg. In order to avoid these difficulties, another measure of dispersion called the quartile deviation is preferred.

Quartile Deviation
We can delete the values below the first quartile and the values above the third quartile; it is assumed that the unusually extreme values are eliminated in this way. We can then take the mean of the deviations of the two quartiles from the second quartile (median). That is,
Q.D. = [(Q2 - Q1) + (Q3 - Q2)] / 2 = (Q3 - Q1) / 2
This quantity is known as the quartile deviation (Q.D.).
The quartile deviation is more stable than the range as it depends on two intermediate values. It is not affected by extreme values, since the extreme values have already been removed. However, the quartile deviation still fails to take all the values into account.
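As a rough illustration, the range and the quartile deviation for the Variety II yields given earlier can be computed in Python with numpy; note that numpy's default percentile method is only one of several conventions for estimating quartiles, so the result may differ slightly from a hand calculation.

# Sketch: range and quartile deviation for the Variety II yields above.
import numpy as np

yields = np.array([54, 48, 42, 33, 30])          # Variety II (kg per plot)

data_range = yields.max() - yields.min()         # range = max - min
q1, q3 = np.percentile(yields, [25, 75])         # first and third quartiles
quartile_deviation = (q3 - q1) / 2               # Q.D. = (Q3 - Q1) / 2

print(data_range, quartile_deviation)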

Mean Deviation
Mean deviation is the mean of the deviations of individual values from their average. The average may be either mean or median. For raw data the mean deviation from the median is the least. Therefore, median is considered to be most suitable for raw data. But usually the mean is used to find out the mean deviation. The mean deviation is given by
M.D. = Σ|x - x̄| / n for raw data and M.D. = Σf|x - x̄| / Σf for grouped data
All positive and negative differences are treated as positive values; hence we use the modulus symbol | |, and |x - x̄| is read as "modulus of (x - x̄)". If we took the deviations (x - x̄) with their signs, their sum Σ(x - x̄) would be 0. Hence, if the signs are not eliminated the mean deviation will always be 0, which is not correct.

The steps of computation are as follows :
Step 1: If the classes are not continuous we have to make them continuous.
Step 2: Find out the mid-values of the classes (mid-value = x).
Step 3: Compute the mean.
Step 4: Find out |x - x̄| for all values of x.
Step 5: Multiply each |x - x̄| by the corresponding frequency f.
Step 6: Substitute in the formula M.D. = Σf|x - x̄| / Σf.
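A small Python sketch of these steps for a hypothetical grouped frequency table (the class limits and frequencies below are made up for illustration) might look like this:

# Sketch: mean deviation about the mean for grouped data, following Steps 1-6.
import numpy as np

lower = np.array([0, 10, 20, 30, 40])    # continuous class lower limits
upper = np.array([10, 20, 30, 40, 50])   # class upper limits
f = np.array([3, 7, 12, 6, 2])           # class frequencies

x = (lower + upper) / 2                  # Step 2: mid-values of the classes
mean = np.sum(f * x) / np.sum(f)         # Step 3: mean of the grouped data
abs_dev = np.abs(x - mean)               # Step 4: |x - mean| for each class
weighted = f * abs_dev                   # Step 5: multiply by the frequencies
md = np.sum(weighted) / np.sum(f)        # Step 6: M.D. = sum(f|x - mean|) / sum(f)

print(round(md, 2))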

The mean deviation takes all the values into consideration. It is fairly stable compared with the range or the quartile deviation. However, since the mean deviation ignores the signs of the deviations, it is not suitable for further statistical analysis, and it is not as stable as the standard deviation, which is defined below.

Standard Deviation
Ignoring the signs of the deviations is mathematically unsatisfactory. Instead, we may square the deviations to make the negative values positive. After calculating the average of the squared deviations, it can be expressed in the original units by taking its square root. This measure of variation is known as the standard deviation.
The standard deviation is defined as the square root of the mean of the squared deviations of individual values from their mean. Symbolically,
Standard Deviation (S.D.) = √[ Σ(x - x̄)² / n ]  or  √[ (Σx² - (Σx)² / n) / n ]
It is called the standard deviation because it indicates a sort of group standard spread of values around their mean. For grouped data it is given as
Standard Deviation (S.D.) = √[ Σf(x - x̄)² / Σf ]  or  √[ (Σfx² - (Σfx)² / Σf) / Σf ]
Since we use the sample standard deviation to estimate the population standard deviation, it should be an unbiased estimate of the population value. For this, n - 1 is substituted for n in the formula. Thus, the sample standard deviation is written as
s = √[ Σ(x - x̄)² / (n - 1) ]
For grouped data it is given by

s = C √[ (Σf d² - (Σf d)² / n) / (n - 1) ]
where,
            C = class interval
            d = (x - A) / C as given under mean.
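For the raw-data case, the sample standard deviation with the n - 1 divisor can be computed directly or with numpy's std function (ddof=1 selects the n - 1 divisor); the sketch below uses the plot yields from the range example.

# Sketch: sample standard deviation with the n - 1 divisor.
import numpy as np

x = np.array([8, 9, 8, 10, 11])                  # plot yields from the range example

n = len(x)
s_manual = np.sqrt(np.sum((x - x.mean())**2) / (n - 1))
s_numpy = np.std(x, ddof=1)                      # ddof=1 gives the n - 1 divisor

print(s_manual, s_numpy)                         # both give the same value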

The square of the standard deviation is known as the variance. In the analysis of variance technique, the term Σ(x - x̄)² is called the sum of squares, and the variance is called the mean square. The standard deviation is denoted by s in the case of a sample and by σ (read 'sigma') in the case of a population.
The standard deviation is the most widely used measure of dispersion. It takes all the items into consideration and is more stable than the other measures. However, like the mean, it is inflated by extreme items.

The standard deviation has some additional special characteristics. It is not affected by adding or subtracting a constant to each observed value, but it is affected by multiplying or dividing each observation by a constant: when the observations are multiplied by a constant, the resulting standard deviation equals the product of the original standard deviation and that constant. (Note that dividing all observations by a constant C is equivalent to multiplying by its reciprocal 1/C, and subtracting a constant C is equivalent to adding the constant -C.)
The standard deviations can be pooled. If the sum of squares for the first distribution with n1 observations is SS1, and the sum of squares for the second distribution with n2 observations is SS2,  then the pooled standard deviation is given by,

Pooled S.D. = √[ (SS1 + SS2) / (n1 + n2 - 2) ]
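A minimal Python sketch of this pooling, using made-up sums of squares and sample sizes:

# Sketch: pooled standard deviation from two sums of squares.
import numpy as np

def pooled_sd(ss1, n1, ss2, n2):
    """Pooled s = sqrt((SS1 + SS2) / (n1 + n2 - 2))."""
    return np.sqrt((ss1 + ss2) / (n1 + n2 - 2))

# Illustrative figures only: SS1 = 40 from 10 observations, SS2 = 55 from 12 observations.
print(pooled_sd(40.0, 10, 55.0, 12))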
Measures of Relative Dispersion
Suppose that the two distributions to be compared are expressed in the same units and their means are equal or nearly equal. Then their variability can be compared directly by using their standard deviations. However, if their means are widely different or if they are expressed in different units of measurement, we can not use the standard deviations as such for comparing their variability. We have to use the relative measures of dispersion in such situations.
There are measures of relative dispersion based on the range, the quartile deviation, the mean deviation and the standard deviation. Of these, the coefficient of variation, which is based on the standard deviation, is the most important. The coefficient of variation is given by
C.V. = (S.D. / Mean) x 100
The C.V. is a unit-free measure and is always expressed as a percentage. The C.V. is small when the variation is small, and of two groups, the one with the smaller C.V. is said to be more consistent.
The coefficient of variation is unreliable if the mean is near zero, and it is unstable unless the measurement scale is a ratio scale. The C.V. is informative when it is given along with the mean and standard deviation; otherwise, it may be misleading.
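As an illustration, the coefficients of variation of the two paddy varieties from the opening example can be compared in Python (the sample standard deviation with the n - 1 divisor is used here; using the n divisor would change the values slightly but not the comparison):

# Sketch: coefficient of variation for the two paddy varieties above.
import numpy as np

variety1 = np.array([45, 42, 42, 41, 40])
variety2 = np.array([54, 48, 42, 33, 30])

def cv(x):
    return np.std(x, ddof=1) / np.mean(x) * 100   # C.V. = (S.D. / mean) * 100

print(round(cv(variety1), 1), round(cv(variety2), 1))
# Variety I has the smaller C.V., so it is the more consistent variety.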

How to Choose Sample Size for a Simple Random Sample

Before we proceed to the concept, consider the following problem. You are conducting a survey. The sampling method is simple random sampling, without replacement. You want your survey to provide a specified level of precision.

To choose the right sample size for a simple random sample, you need to define the following inputs.

  • Specify the desired margin of error ME. This is your measure of precision.
  • Specify alpha.
    For a hypothesis test, alpha is the significance level.
    For an estimation problem, alpha is: 1 - Confidence level.
  • Find the critical standard score z.
    For an estimation problem or for a two-tailed hypothesis test, the critical standard score (z) is the value for which the cumulative probability is 1 - alpha/2.
    For a one-tailed hypothesis test, the critical standard score (z) is the value for which the cumulative probability is 1 - alpha.
  • Unless the population size is very large, you need to specify the size of the population (N).

    Given these inputs, the following formulas find the smallest sample size that provides the desired level of precision.

Sample statistic    Population size    Sample size
Mean                Known              n = { z² * σ² * [ N / (N - 1) ] } / { ME² + [ z² * σ² / (N - 1) ] }
Mean                Unknown            n = ( z² * σ² ) / ME²
Proportion          Known              n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]
Proportion          Unknown            n = [ ( z² * p * q ) + ME² ] / ME²
This approach works when the sample size is relatively large (greater than or equal to 30). Use the first or third formulas when the population size is known. When the population size is large but unknown, use the second or fourth formulas.

For proportions, the sample size requirements vary, based on the value of the proportion. If you are unsure of the right value to use, set p equal to 0.5. This will produce a conservative sample size estimate; that is, the sample size will produce at least the precision called for and may produce better precision.
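The formulas in the table can be wrapped in a small helper function; the Python sketch below is illustrative rather than a standard library routine, and it rounds the result up to the next whole subject.

# Sketch: smallest sample size for estimating a proportion with a given
# margin of error, using the proportion formulas from the table above.
import math

def sample_size_proportion(z, p, me, N=None):
    q = 1 - p
    if N is None:                                   # population size unknown or very large
        n = (z**2 * p * q + me**2) / me**2
    else:                                           # finite population of size N
        n = (z**2 * p * q + me**2) / (me**2 + z**2 * p * q / N)
    return math.ceil(n)                             # round up to be conservative

# Reproduces the reading-test example below: z = 1.96, p = 0.75, ME = 0.04, N = 100,000
print(sample_size_proportion(1.96, 0.75, 0.04, N=100_000))   # -> 450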

Sample Problem

At the end of every school year, the state administers a reading test to a simple random sample drawn without replacement from a population of 100,000 third graders. Over the last five years, students who took the test correctly answered 75% of the test questions.

What sample size should you use to achieve a margin of error equal to plus or minus 4%, with a confidence level of 95%?

Solution: To solve this problem, we follow the steps outlined above.

Specify the margin of error. This was given in the problem definition. The margin of error is plus or minus 4% or 0.04.
Specify the confidence level. This was also given. The confidence level is 95% or 0.95.
Compute alpha. Alpha is equal to one minus the confidence level. Thus, alpha = 1 - 0.95 = 0.05.
Determine the critical standard score (z). Since this is an estimation problem, the critical standard score is the value for which the cumulative probability is 1 - alpha/2 = 1 - 0.05/2 = 0.975. 

To find that value, we use the Normal Calculator. Recall that the distribution of standard scores has a mean of 0 and a standard deviation of 1. Therefore, we plug the following entries into the normal calculator: Value = 0.975; Mean = 0; and Standard deviation = 1. The calculator tells us that the value of the standard score is 1.96.
And finally, we assume that the population proportion p is equal to its past value over the previous 5 years. That value is 0.75. Given these inputs, we can find the smallest sample size n that will provide the required margin of error.
n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]
n = [ (1.96)² * 0.75 * 0.25 + 0.0016 ] / [ 0.0016 + (1.96)² * 0.75 * 0.25 / 100,000 ]
n = (0.7203 + 0.0016) / (0.0016 + 0.0000072) = 449.2
Therefore, to achieve a margin of error of plus or minus 4 percent, we will need to survey 450 students, using simple random sampling.

Introduction to Normal Distribution in Statistics

A continuous random variable has an infinite number of possible values that can be represented by an interval on the number line. Its probability distribution is called a continuous probability distribution. In this article, we will study the most important continuous probability distribution in statistics, the normal distribution.
A normal distribution is a continuous probability distribution for a random variable, x.  The graph of a normal distribution is called the normal curve.  A normal distribution has the following properties. 
1. The mean, median and mode are equal. 
2. The normal curve is bell-shaped and is symmetric about the mean.
3. The total area under the normal curve is equal to 1.
4. The normal curve approaches, but never touches the x-axis as it extends farther and farther away from the mean.
5. Between μ - σ and μ + σ (in the center of the curve) the graph curves downward. The graph curves upward to the left of μ - σ and to the right of μ + σ. The points at which the curve changes from curving upward to curving downward are called inflection points.
6. A normal distribution can have any mean and any positive standard deviation. These two parameters, μ and σ, completely determine the shape of a normal curve. The mean gives the location of the line of symmetry and the standard deviation describes how much the data are spread out.
In a graph showing two normal curves, the line of symmetry of each curve marks its mean, and the flatter, more spread-out curve has the greater standard deviation. That is the difference.

Understanding Mean & Standard Deviation
Which normal curve has a greater mean?
Which normal curve has a greater standard deviation?
The line of symmetry of curve A occurs at x = 15.  The line of symmetry of curve B occurs at x = 12.  So, curve A has a greater mean.
Curve B is more spread out than curve A, so curve B has a greater standard deviation. 

The Empirical Rule
In a normal distribution with mean μ and standard deviation σ, you can approximate areas under the normal curve as follows:
1. About 68% of the area lies between μ - σ and μ + σ.
2. About 95% of the area lies between μ - 2σ and μ + 2σ.
3. About 99.7% of the area lies between μ - 3σ and μ + 3σ.
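These percentages can be checked against the standard normal distribution; the short Python sketch below uses scipy purely as an illustration.

# Sketch: checking the empirical rule with the standard normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    # area between mu - k*sigma and mu + k*sigma for any normal distribution
    area = norm.cdf(k) - norm.cdf(-k)
    print(k, round(area, 4))
# prints roughly 0.6827, 0.9545 and 0.9973, i.e. about 68%, 95% and 99.7%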


Storytelling with Data - Web Analysis

Why Storytelling?
Visual analysis means exploring data visually. A story unfolds as you navigate from one visual summary to another.
You and your team have sorted through and analyzed a dense data set, made industry-relevant discoveries and created data visualizations that allow you to share those insights with others—whether other team members, current or potential clients, or the community at large. Before you present your work, think about your audience and the goals you want to achieve.
Keep in mind that the people you want to communicate with may not have the same background or technical knowledge that you do. Although eager to understand, audience members may need additional explanation, simplified insights or a more slowly paced presentation to grasp your point of view. Making the data accessible to your audience is your responsibility.
A tried-and-true method of connecting with your audience is to embed your data insights within a story. The story framework will capture your audience's attention and help you meet the business objectives driving your work.

Story Framework
Before you embark on your data journey, identify the questions you want to answer. You may have a specific question in mind or a general area you'd like to explore. As you dig into your data, you may find an entirely new or unexpected story, but it helps to have a starting point.

[Figure: an example story framework for a web analysis scenario, showing how to visualise online campaign performance, together with the resulting output.]

Statistical p-values

When results of studies or research are reported, important decisions are made on the basis of these results. For example, new varieties are often tested against standard varieties to determine whether the new varieties are more effective. Several methods of manufacturing may be compared to select the technique that produces the best product. Various pieces of evidence may be examined to determine whether there is a possible link between an activity and an outcome. In such studies, results are summarized by a statistical test, and a decision about the significance of the result is based on a p-value. Therefore, it is important for the reader to know what the p-value is all about.
To describe how the p-value works, we will use a common statistical test as an example: Student's t-test for independent groups. For this test, subjects are randomly assigned to one of two groups. Some treatment is given to the subjects in one group, while the other group acts as a control and receives no treatment or a standard treatment. For this example, suppose group one is given a new drug and group two is given the standard drug. Time to relief is measured for both groups. The outcome measure is assumed to be a continuous, normally distributed variable, and the population variance of the measure is assumed to be the same for both groups.
For this example, the sample mean for group one is 10 and the sample mean for group two is 12. The sample standard deviation for group one is 1.8 and the sample standard deviation for group two is 1.9. The sample size for both groups is 12. Entering these data into a statistical program produces a t-statistic and a p-value: the calculated t is -2.65 with 22 degrees of freedom, and the p-value is 0.0147. This means you have evidence that the mean time to relief for group one was significantly different from that for group two.
To interpret this p-value, you must first know how the test was structured. In the case of this two-sided t-test, the hypotheses are:
Ho: μ1 = μ2 (null hypothesis: the means of the two groups are equal)
Ha: μ1 ≠ μ2 (alternative: the means of the two groups are not equal)
A low p-value points to rejection of the null hypothesis, because it indicates how unlikely it is that a test statistic as extreme as, or more extreme than, the one given by these data would be observed if the null hypothesis were true. Since p = 0.015, this means that if the population means were equal, as hypothesized under the null, there is only a 15 in 1,000 chance of obtaining a test statistic this extreme or more extreme from data drawn from this population. If you agree that there is enough evidence to reject the null hypothesis, you conclude that there is significant evidence to support the alternative hypothesis.
The researcher decides what significance level to use, i.e., what cutoff point will decide significance. The most commonly used level of significance is 0.05. When the significance level is set at 0.05, any test resulting in a p-value under 0.05 is significant, and you would reject the null hypothesis in favor of the alternative hypothesis. Since you are comparing only two groups, you can look at the sample means to see which is larger. The sample mean of group one is the smaller of the two, so you conclude that medicine one acted significantly faster, on average, than medicine two. This would be reported in an article using a phrase like this: "The mean time to relief for group one was significantly smaller than for group two (two-sided t-test, t(22) = -2.65, p = 0.015)."
P-values do not simply provide a yes-or-no answer: they provide a sense of the strength of the evidence against the null hypothesis. The lower the p-value, the stronger the evidence.
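For illustration, the t-test described above can be reproduced from the summary statistics alone; the sketch below uses scipy's ttest_ind_from_stats with equal variances assumed, matching the pooled-variance test in the example.

# Sketch: reproducing the two-sample t-test above from summary statistics.
# equal_var=True matches the equal-population-variance assumption in the text.
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=10, std1=1.8, nobs1=12,
                            mean2=12, std2=1.9, nobs2=12,
                            equal_var=True)
print(round(t, 2), round(p, 4))   # roughly t = -2.65, p = 0.015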

Cautions about Regression and Correlation

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, you must be aware of their limitations.

■ Correlation and regression lines describe only linear relationships. You can do the calculations for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern.
■ Correlation and least-squares regression lines are not resistant. Always plot your data and look for observations that may be influential.
■ Extrapolation. Suppose that you have data on a child's growth between 3 and 8 years of age. You find a strong linear relationship between age x and height y. If you fit a regression line to these data and use it to predict height at age 25 years, you will predict that the child will be 8 feet tall. Growth slows down and then stops at maturity, so extending the straight line to adult ages is foolish. Few relationships are linear for all values of x. Don't make predictions far outside the range of x that actually appears in your data.
■ Lurking variables. The relationship between two variables can often be understood only by taking other variables into account. Lurking variables can make a correlation or regression misleading.
You should always think about possible lurking variables before you draw conclusions based on correlation or regression.

EXTRAPOLATION
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.

LURKING VARIABLE
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

Statistical Terminology - Explained

We often make use of statistical techniques in the analysis of findings. Some are highly sophisticated and complex, but those most often used are easy to understand. The most common are measures of central tendency (ways of calculating averages) and correlation coefficients (measures of the degree to which one variable relates consistently to another). There are three methods of calculating averages, each of which has certain advantages and shortcomings. Take as an example the amount of personal wealth (including all assets such as houses, cars, bank accounts and investments) owned by 13 individuals. Suppose they own the following amounts:

1. £0 (zero)
2. £5,000
3. £10,000
4. £20,000
5. £40,000
6. £40,000
7. £40,000
8. £80,000
9. £100,000
10. £150,000
11. £200,000
12. £400,000
13. £10,000,000
The mean corresponds to the average, arrived at by adding together the personal wealth of all 13 people and dividing the result by 13. The total is £11,085,000; dividing this by 13, we reach a mean of £852,692.31. The mean is often a useful calculation because it is based on the whole range of data provided. However, it can be misleading where one or a small number of cases are very different from the majority. In the above example, the mean is not in fact an appropriate measure of central tendency, because the presence of one very large figure, £10,000,000, skews the picture. One might get the impression when using the mean to summarize these data that most of the people own far more than they actually do. In such instances, one of two other measures may be used.
The mode is the figure that occurs most frequently in a given set of data. In our example, it is £40,000. The problem with the mode is that it does not take into account the overall distribution of the data - i.e., the range of figures covered. The most frequently occurring case in a set of figures is not necessarily representative of their distribution as a whole and thus may not be a useful average. In this case, £40,000 is too close to the lower end of the figures.
The third measure is the median, which is the middle value of any set of figures; here, this would be the seventh figure, again £40,000. Our example has an odd number of figures (13). If there had been an even number - for instance, 12 - the median would be calculated by taking the mean of the two middle cases. Like the mode, the median gives no idea of the actual range of the data measured.
Sometimes a researcher will use more than one measure of central tendency to avoid giving a deceptive picture of the average. More often, he will calculate the standard deviation for the data in question. This is a way of calculating the degree of dispersal, or the range, of a set of figures - which in this case goes from zero to £10,000,000. Correlation coefficients offer a useful way of expressing how closely connected two (or more) variables are. Where two variables correlate completely, we can speak of a perfect positive correlation, expressed as 1. Where no relation is found between two variables - they have no consistent connection at all - the coefficient is zero. A perfect negative correlation, expressed as -1, exists when two variables are in a completely inverse relation to one another. Correlations of the order of 0.6 or more, whether positive or negative, are usually regarded as indicating a strong degree of connection between whatever variables are being analysed. Positive correlations on this level might be found between, say, social class background and voting behaviour.
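As an illustration, the three averages for the wealth figures listed above can be obtained with Python's built-in statistics module:

# Sketch: the three averages for the wealth figures listed above.
import statistics

wealth = [0, 5_000, 10_000, 20_000, 40_000, 40_000, 40_000,
          80_000, 100_000, 150_000, 200_000, 400_000, 10_000_000]

print(statistics.mean(wealth))     # 852692.3... - pulled up by the one very large value
print(statistics.median(wealth))   # 40000 - the middle (7th) figure
print(statistics.mode(wealth))     # 40000 - the most frequent figure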

Multicollinearity

The use and interpretation of a multiple regression model depends implicitly on the assumption that the explanatory variables are not strongly interrelated. In most regression applications the explanatory variables are not orthogonal. Usually the lack of orthogonality is not serious enough to affect the analysis. However, in some situations the explanatory variables are so strongly interrelated that the regression results are ambiguous. Typically, it is impossible to estimate the unique effects of individual variables in the regression equation. The estimated values of the coefficients are very sensitive to slight changes in the data and to the addition or deletion of variables in the equation. The regression coefficients have large sampling errors which affect both inference and forecasting that is based on the regression model. The condition of severe non-orthogonality is also referred to as the problem of multicollinearity.

The presence of multicollinearity has a number of potentially serious effects on the least squares estimates of regression coefficients. Multicollinearity also tends to produce least squares estimates that are too large in absolute value.

Remedial Measures

i) Collection of additional data: Collecting additional data has been suggested as one of the methods of combating multicollinearity. The additional data should be collected in a manner designed to break up the multicollinearity in the existing data.

ii) Model respecification: Multicollinearity is often caused by the choice of model, such as when two highly correlated regressors are used in the regression equation. In these situations some respecification of the regression equation may lessen the impact of multicollinearity. One approach to respecification is to redefine the regressors. For example, if x1, x2 and x3 are nearly linearly dependent it may be possible to find some function such as x = (x1+x2)/x3 or x = x1x2x3 that preserves the information content in the original regressors but reduces the multicollinearity.

iii) Ridge Regression: When the method of least squares is used, the parameter estimates are unbiased. A number of procedures have been developed for obtaining biased estimators of the regression coefficients that tackle the problem of multicollinearity. One of these procedures is ridge regression. The ridge estimators are found by solving a slightly modified version of the normal equations, in which a small quantity is added to each of the diagonal elements of the X'X matrix.
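A deliberately simplified Python sketch of the idea follows; the data are artificial, the intercept is penalised along with the other coefficients, and the regressors are not standardised, all of which a real ridge analysis would normally handle differently.

# Sketch: ridge estimator obtained by adding a small constant k to the
# diagonal of X'X before solving the normal equations. Data are artificial.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 2 + 3 * x1 + 1 * x2 + rng.normal(size=n)

k = 0.1                                        # ridge constant
p = X.shape[1]
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                    # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)  # ridge estimate

print(beta_ols)     # the coefficients of x1 and x2 can be unstable and inflated
print(beta_ridge)   # the ridge estimates are shrunken and typically more stable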

Diagnostics and Remedial Measures

The interpretation of data based on analysis of variance (ANOVA) is valid only when the following assumptions are satisfied:
1. Additive Effects: Treatment effects and block (environmental) effects are additive.
2. Independence of errors: Experimental errors are independent.
3. Homogeneity of Variances: Errors have common variance.
4. Normal Distribution: Errors follow a normal distribution.
Also the statistical tests t, F, z, etc. are valid under the assumption of independence of errors and normality of errors. The departures from these assumptions make the interpretation based on these statistical techniques invalid. Therefore, it is necessary to detect the deviations and apply the appropriate remedial measures.
• The assumption of independence of errors means that the error of one observation is not related to, or dependent upon, that of another. This assumption is usually assured by the use of a proper randomization procedure. However, if there is any systematic pattern in the arrangement of treatments from one replication to another, the errors may be non-independent. This may be handled by using nearest-neighbour methods in the analysis of the experimental data.
• The assumption of additive effects can be defined and detected in the following manner:

Additive Effects: The effects of two factors, say treatment and replication, are said to be additive if the effect of one factor remains constant over all the levels of the other factor. A hypothetical set of data from a randomized complete block (RCB) design, with 2 treatments and 2 replications and with additive effects, is given below:
Treatment                   Replication I    Replication II    Replication Effect (I - II)
A                           190              125               65
B                           170              105               65
Treatment Effect (A - B)    20               20
Here, the treatment effect is equal to 20 for both replications and replication effect is 65 for both treatments.
When the effect of one factor is not constant at all the levels of other factor, the effects are said to be non-additive.

Normality of Errors: The assumptions of homogeneity of variances and normality are generally violated together. To test the validity of the normality of errors for the character under study, one can use the normal probability plot, the Anderson-Darling test, D'Agostino's test, the Shapiro-Wilk test, the Ryan-Joiner test, the Kolmogorov-Smirnov test, etc. In general, moderate departures from normality are of little concern in the fixed-effects ANOVA, as the F-test is only slightly affected, but the random-effects case is more severely impacted by non-normality. Significant deviations of the errors from normality make the inferences invalid, so before analysing the data it is necessary to convert them to a scale on which they follow a normal distribution.
For data from designed field experiments, we do not test the original observations directly for normality or homogeneity, because they are embedded with the treatment effects and other effects such as block, row and column effects. These effects must therefore be eliminated from the data before testing the assumptions of normality and homogeneity of variances. To eliminate the treatment and other effects, we fit the model corresponding to the design adopted and estimate the residuals. These residuals are then used for testing the normality of the observations. In other words, we test the null hypothesis H0: errors are normally distributed against the alternative hypothesis H1: errors are not normally distributed. SAS and SPSS commonly use the Shapiro-Wilk and Kolmogorov-Smirnov tests; MINITAB offers three tests, viz. Anderson-Darling, Ryan-Joiner and Kolmogorov-Smirnov, for testing the normality of data.
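As an illustrative sketch (with artificial data and hypothetical column names), one could fit the RCB model with statsmodels and apply the Shapiro-Wilk test to the residuals:

# Sketch: fitting an RCB model to remove treatment and block effects,
# then testing the residuals for normality with the Shapiro-Wilk test.
# The data below are artificial, generated only for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import shapiro

rng = np.random.default_rng(1)
treatments = ["A", "B", "C", "D"]
blocks = ["I", "II", "III"]
data = pd.DataFrame([(t, b) for b in blocks for t in treatments],
                    columns=["treatment", "block"])
data["yield_"] = 50 + rng.normal(scale=2, size=len(data))

model = smf.ols("yield_ ~ C(treatment) + C(block)", data=data).fit()
stat, p_value = shapiro(model.resid)     # H0: residuals are normally distributed
print(round(stat, 3), round(p_value, 3))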

Homogeneity of Error Variances: A crude method for detecting the heterogeneity of variances is based on scatter plots of means and variance or range of observations or errors, residual vs fitted values, etc.
Based on these scatter plots, the heterogeneity of variances can be classified into two types:
1. Where the variance is functionally related to mean.
2. Where there is no functional relationship between the variance and the mean.
The scatter diagram of the means and variances of observations for each treatment across the replications gives only a preliminary idea about the homogeneity of error variances. Statistically, the homogeneity of error variances is tested using Bartlett's test for normally distributed errors and Levene's test for non-normal errors.
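An illustrative Python sketch of both tests on three made-up treatment groups:

# Sketch: testing homogeneity of error variances across three treatment groups.
# Bartlett's test assumes normal errors; Levene's test is more robust.
from scipy.stats import bartlett, levene

group1 = [12.1, 11.8, 12.5, 12.0, 11.9]
group2 = [14.2, 13.9, 14.8, 14.1, 14.5]
group3 = [16.0, 15.6, 16.3, 16.1, 15.9]

b_stat, b_p = bartlett(group1, group2, group3)
l_stat, l_p = levene(group1, group2, group3)
print(round(b_p, 3), round(l_p, 3))   # large p-values here: no evidence of heterogeneity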

Remedial Measures: Data transformation is the most appropriate remedial measure in the situation where the variances are heterogeneous and are some function of the means. With this technique, the original data are converted to a new scale, resulting in a new data set that is expected to satisfy the homogeneity of variances. Because a common transformation scale is applied to all observations, the comparative values between treatments are not altered and comparisons between them remain valid.
Error partitioning is the remedial measure for the heterogeneity that usually occurs in experiments where, owing to the nature of the treatments tested, some treatments have errors that are substantially higher (or lower) than others.
Here, we shall concentrate on those situations where character under study is non-normal and variances are heterogeneous. Depending upon the functional relationship between variances and means, suitable transformation is adopted. The transformed variate should satisfy the following:
1. The variances of the transformed variate should be unaffected by changes in the means. This is also called the variance stabilizing transformation.
2. It should be normally distributed.
3. It should be one for which effects are linear and additive.
4. The transformed scale should be such for which an arithmetic average from the sample is an efficient estimate of true mean.
The three transformations used most commonly are the logarithmic, square root and angular (arc sine) transformations.

Web Analytics - An Overview

Web analytics is the practice of measuring, collecting, analysing and reporting online data for the purposes of understanding how a web site is used by its visitors and how to optimise its usage. The focus of web analytics is to understand a site’s users, their behaviour and activities.
The study of online user behaviour and activities generates valuable marketing intelligence and provides:
• performance measures of the website against targets
• insights on user behaviours and needs, and how the site is meeting those needs
• optimisation ability to make modifications to improve the website based on the results

An average web analytics tool offers hundreds of metrics. All are interesting, but only a few will be useful for measuring your website's performance. To get meaningful insights, focus on what is important: start your web analytics initiative by defining realistic and measurable objectives for your site.
A business will not be successful if the customers are not satisfied. The same applies to your website. You must provide a compelling customer experience that creates value for your users and persuades them to take action. Each website may have a number of different users. To create a compelling user experience you must study each user segment in detail. Create user profiles for each segment that answer:
• Who is your target market?
• Why would they visit your site?
• What do they wish to accomplish on your site?
• What are the barriers to their satisfaction?

Key performance indicators or KPIs are a simple and practical technique widely used to measure performance. They are often expressed in rates, ratios, averages, percentages. The challenge is to choose the KPIs that will drive action and challenge you to continually optimise your site to achieve your objectives. It is important to understand the difference between an interesting metric and an insightful KPI. Peterson, in his book, suggests:
KPIs should never be met with a blank stare. Ask yourself “If this number improves by 10% who should I congratulate?” and “If this number declines by 10% who do I need to scream at?” If you don’t have a good answer for both questions, likely the metric is interesting but not a key performance indicator.

How is the user activity data collected?
There are two distinct methods to collect user activity data:
Web server log files – Web servers are capable of logging “user requests”, or a user’s movements around a website. These files can be used to perform analysis and create reports on website traffic.
Tracking scripts inserted into web pages – With this approach, a small JavaScript snippet is inserted into a web page; every time the page is downloaded into a user's browser, the script executes and captures information about the activity performed. Since the web page contains the tracking script, it will execute each time the page is downloaded in the user's browser, regardless of how the page is served.

Standard user activity data can be enriched through:
URL tracking parameters – Tracking parameters are added to a web page's URL so you can collect additional information about site usage. For example, to understand what users are searching for, you can put the keywords being searched for into the URL of the search results page. That results page's URL will then look like this: "search_results.html?keyword=public holidays".
Cookies - Cookies are small packets of data deposited on the computer hard disk of the user when the person visits a website. Cookies can contain all sorts of information, such as visitor's unique identification number for that site; the last time that person visited the site and so on. Your web analytics solution can be configured to detect the cookie for identifying returning users and read its content for more advanced reporting such as recency of a visit.
Online forms - Forms often constitute low-cost/high-value interaction points for websites. They are part of shopping carts, they facilitate many online processes such as applications, subscriptions, registrations, or they are simply used to seek feedback. Your web analytics solution can be configured to capture certain information collected from web forms through custom fields for more advanced reporting such as demographic profiling.

Qualitative data
In Web Analytics, to understand the ‘why’ behind an issue revealed by quantitative data, we turn to qualitative data. Sources of qualitative data include:
Surveys – Online or offline surveys are one way to capture information on what customers think and how they feel.
Web Site Testing – Testing could take place in a lab or online, where participants are asked to undertake a task.

How do web analytics tools identify users?
Web analytics tools need a way of identifying users to be able to report on user sessions (also referred to as visits). There are different techniques to identify users such as IP addresses, user agent and IP address combination, cookies, authenticated user. Nowadays, the most common user identification technique is via cookies which are small packets of data that are usually deposited on the computer hard disk of the user when the person visits a website.
There are several types of cookies:
First Party Cookie is served from the website being visited.
Third Party Cookie is served by a third party organisation such as ad agencies or web analytics vendors on behalf of the website being visited.
Session Cookie is not saved to the computer and expires at the end of the session.
Increased cookie blocking and deletion practices, whereby users configure their browsers not to accept cookies or manually remove cookies from their computers, present a challenge for web analytics tools in accurately identifying users.