April 2010

2:59 PM

By statisticalconcepts

In: Statistics

Statistical Averages

Summary Statistics

After the data have been properly checked for quality, the first analysis is usually descriptive statistics. The general aim is to summarize the data, iron out any peculiarities and perhaps get ideas for a more sophisticated analysis. The data summary may help to suggest a suitable model, which in turn suggests an appropriate inferential procedure. This first phase of the analysis is described as the initial examination of the data, or initial data analysis. It has much in common with exploratory data analysis, which includes a variety of graphical and numerical techniques for exploring data. Exploratory data analysis is thus an essential part of nearly every analysis. It provides a reasonably systematic way of digesting and summarizing the data, although its exact form naturally varies widely from problem to problem. In initial and exploratory data analysis, the following are given due importance.

Measures of Central Tendency

One of the most important aspects of describing a distribution is the central value around which the observations are distributed. Any arithmetical measure which is intended to represent the center or central value of a set of observations is known as a measure of central tendency.

The Arithmetic Mean (or simply Mean)

Suppose that n observations are obtained for a sample from a population. Denote the values of the n observations by x₁, x₂, ..., x_n, with x₁ being the value of the first sample observation, x₂ that of the second observation, and so on. The arithmetic mean (or mean, or average), denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The symbol Σ (read as ‘sigma’) means sum the individual values x₁, x₂, ..., x_n of the variable X. Usually the limits of the summation are not written, since it is understood that the summation is over all n values. Hence we can write

$$\bar{x} = \frac{\sum x_i}{n}$$

The above formula enables us to find the mean when the values x₁, x₂, ..., x_n of n discrete observations are available. When the data are instead given in the form of a frequency distribution table, the formula is as follows:

Arithmetic Mean of Grouped Data

Suppose that there are k classes or intervals. Let x₁, x₂, ..., x_k denote the class mid-points of these k intervals and let f₁, f₂, ..., f_k denote the corresponding frequencies of these classes. Then the arithmetic mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
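As a rough illustration of the two cases (raw observations: the sum of the values divided by n; grouped data: the sum of f_i x_i divided by the sum of f_i), here is a minimal Python sketch. The function names and the sample numbers are made up for illustration and do not come from the original post.

```python
def arithmetic_mean(values):
    """Mean of raw observations: sum of the values divided by n."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("need one frequency per class mid-point")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
raw = [5, 6, 7, 7, 8, 9]
midpoints = [5, 15, 25, 35]   # class mid-points x_i
frequencies = [2, 5, 2, 1]    # class frequencies f_i

print(arithmetic_mean(raw))                  # 42 / 6 = 7.0
print(grouped_mean(midpoints, frequencies))  # 170 / 10 = 17.0
```

Note that the grouped mean is only an approximation of the mean of the underlying raw data, because every observation in a class is treated as if it sat at the class mid-point.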

Properties of the arithmetic mean

(a) The sum of the deviations of a set of n observations x₁, x₂, ..., x_n from their mean $\bar{x}$ is zero. Let $d_i = x_i - \bar{x}$ be the deviation of x_i from $\bar{x}$; then

$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

(b) If x₁, x₂, ..., x_n are n observations, $\bar{x}$ is their mean and d_i = x_i - A is the deviation of x_i from a given number A, then

$$\bar{x} = A + \frac{1}{n}\sum_{i=1}^{n} d_i$$

(d) If in a frequency distribution all the k class intervals are of the same width c, and d_i = x_i - A denotes the deviation of x_i from A, where A is the value of a certain mid-point and x₁, x₂, ..., x_k are the class mid-points of the k classes with frequencies f₁, f₂, ..., f_k, then d_i = c u_i where u_i = 0, ±1, ±2, ... and

$$\bar{x} = A + c\,\frac{\sum_{i=1}^{k} f_i u_i}{\sum_{i=1}^{k} f_i}$$
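These properties can be checked numerically. The sketch below assumes the standard forms of the results: deviations from the mean sum to zero, the mean equals A plus the average deviation from A, and for equal class widths the mean equals A + c · Σ f_i u_i / Σ f_i. The data and the choices of A are hypothetical.

```python
raw = [5, 6, 7, 7, 8, 9]
n = len(raw)
mean = sum(raw) / n

# (a) Deviations from the mean sum to zero (up to floating-point rounding).
deviations_sum = sum(x - mean for x in raw)

# (b) With an arbitrary reference value A: mean = A + sum(x_i - A) / n.
A = 6
mean_from_assumed = A + sum(x - A for x in raw) / n

# Equal class width c: u_i = (x_i - A) / c and mean = A + c * sum(f*u) / sum(f).
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 2, 1]
A_mid, c = 15, 10
u = [(x - A_mid) / c for x in midpoints]  # -1.0, 0.0, 1.0, 2.0
step_mean = A_mid + c * sum(f * ui for f, ui in zip(frequencies, u)) / sum(frequencies)

print(deviations_sum)     # 0.0
print(mean_from_assumed)  # 7.0, the same as the direct mean
print(step_mean)          # 17.0, the same as sum(f*x) / sum(f)
```

The step-deviation form mattered most for hand calculation, since the u_i are small integers; with a computer the direct formula is usually just as convenient.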

The Median

The median of a set of n measurements or observations x₁, x₂, ..., x_n is the middle value when the measurements are arranged in an array according to their order of magnitude. If n is odd, the middle value is the median. If n is even, there are two middle values and the average of these values is the median. The median is the value which divides the set of observations into two equal halves, such that 50% of the observations lie below the median and 50% above it. The median depends on the positions of the observations in the ordered array rather than on their actual values.
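The odd and even cases described above can be sketched in a few lines of Python. The function name and the sample data are illustrative only.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 5, 7]))            # odd n: the middle value, 7
print(median([5, 6, 7, 7, 8, 9]))   # even n: (7 + 7) / 2 = 7.0
print(median([5, 6, 7, 7, 8, 30]))  # still 7.0: only positions matter
```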

The Median of Grouped Data

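The median of grouped data is given by

$$\text{Median} = L + \frac{N/2 - cf}{f} \times c$$

where L is the lower limit of the median class, N is the total frequency, cf is the cumulative frequency of the classes before the median class, f is the frequency of the median class and c is the class width.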
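A minimal sketch of the grouped-data median follows. It assumes the commonly used interpolation formula Median = L + ((N/2 - cf) / f) × c, where L is the lower limit of the median class, N the total frequency, cf the cumulative frequency before the median class, f the frequency of the median class and c the class width. The class limits and frequencies are hypothetical.

```python
def grouped_median(lower_limits, frequencies, width):
    """Interpolated median of a frequency distribution with equal class widths."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_limits, frequencies):
        if cumulative + freq >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cumulative) * width / freq
        cumulative += freq
    raise ValueError("frequencies and class limits do not match")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
lower_limits = [0, 10, 20, 30]
frequencies = [2, 5, 2, 1]

print(grouped_median(lower_limits, frequencies, width=10))  # 10 + (5 - 2) * 10 / 5 = 16.0
```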

The Mode

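The mode is the observation which occurs most frequently in a set. For grouped data, a commonly used formula is

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$

where L is the lower limit of the modal class (the class with the highest frequency), f₁ is its frequency, f₀ and f₂ are the frequencies of the classes immediately before and after it, and c is the class width.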

The mode can be determined analytically in the case of a continuous distribution. For a symmetrical distribution, the mean, median and mode coincide. For a distribution skewed to the left (negatively skewed), the mean, the median and the mode occur in that order (as they appear in the dictionary), and for a distribution skewed to the right (positively skewed) they occur in the reverse order: mode, median and mean. For a moderately skewed distribution there is an empirical relationship: Mean - Mode = 3 (Mean - Median).
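The sketch below illustrates three ways of arriving at a mode. The grouped-data function assumes one commonly used interpolation formula, Mode = L + (f₁ - f₀) / (2f₁ - f₀ - f₂) × c, which may differ from the one the author had in mind. The empirical estimate simply rearranges Mean - Mode = 3 (Mean - Median). All names and numbers are illustrative.

```python
from collections import Counter


def mode(values):
    """A most frequent observation in raw data."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    return Counter(values).most_common(1)[0][0]


def grouped_mode(lower_limits, frequencies, width):
    """Interpolated mode using the modal class and its two neighbours."""
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_limits[i] + (f1 - f0) * width / denominator


def empirical_mode(mean, median):
    """Rearranged empirical relation: Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


print(mode([5, 6, 7, 7, 8, 9]))                        # 7
print(grouped_mode([0, 10, 20, 30], [2, 5, 2, 1], 10))  # 10 + 3 * 10 / 6 = 15.0
print(empirical_mode(mean=17.0, median=16.0))           # 3 * 16 - 2 * 17 = 14.0
```

The empirical estimate (14.0) and the interpolated value (15.0) differ because the relation is only approximate and is meant for moderately skewed distributions.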

The Geometric Mean

There are two other averages, the geometric mean and the harmonic mean, which are sometimes used. The Geometric Mean (GM) of a set of observations is such that its logarithm equals the arithmetic mean of the logarithms of the values of the observations:

$$GM = (x_1 x_2 \cdots x_n)^{1/n}, \qquad \log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$

In the case of a frequency distribution, with n = Σ f_i,

$$GM = \left(x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}\right)^{1/n}, \qquad \log GM = \frac{1}{n}\sum_{i=1}^{k} f_i \log x_i$$

The geometric mean can be obtained only if the values assumed by the observations are positive (greater than zero).
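The logarithmic definition translates directly into code. This is a minimal sketch in which the optional `frequencies` argument (a name chosen here for illustration) covers the frequency-distribution case, with n taken as the sum of the frequencies.

```python
import math


def geometric_mean(values, frequencies=None):
    """exp of the (frequency-weighted) arithmetic mean of the logarithms."""
    if frequencies is None:
        frequencies = [1] * len(values)
    if len(values) != len(frequencies):
        raise ValueError("need one frequency per value")
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    log_gm = sum(f * math.log(x) for x, f in zip(values, frequencies)) / n
    return math.exp(log_gm)


print(geometric_mean([2, 8]))          # square root of 16, about 4.0
print(geometric_mean([2, 8], [3, 1]))  # (2**3 * 8) ** (1/4), about 2.83
```

Working through logarithms also avoids overflow when the product of many observations would be too large to represent.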

Harmonic mean

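The Harmonic Mean (HM) of a set of observations is such that its reciprocal is the arithmetic mean of the reciprocals of the values of the observations:

$$\frac{1}{HM} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}, \qquad HM = \frac{n}{\sum_{i=1}^{n} 1/x_i}$$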

The harmonic mean is rarely computed for a frequency distribution.
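For raw data the definition amounts to n divided by the sum of the reciprocals. The following sketch uses a hypothetical function name and data, and restricts itself to strictly positive values for simplicity.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals, so that 1/HM is the mean of the 1/x_i."""
    if not values:
        raise ValueError("harmonic mean of an empty data set is undefined")
    if any(x <= 0 for x in values):
        raise ValueError("this sketch assumes strictly positive values")
    return len(values) / sum(1 / x for x in values)


print(harmonic_mean([1, 2, 4]))  # 3 / 1.75, about 1.714
```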

Weighted Mean
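If there are n observations x₁, x₂, x₃, …, x_n with corresponding weights w₁, w₂, w₃, …, w_n, then the weighted mean is given by

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

In computing the mean of a frequency distribution, we take the frequency of a class as its weight. That is,

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Hence, it is a special case of the weighted mean. For positive observations, the arithmetic, geometric and harmonic means are related by

A.M. ≥ G.M. ≥ H.M.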
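The weighted mean, Σ w_i x_i / Σ w_i, and the ordering of the arithmetic, geometric and harmonic means can both be checked numerically. This sketch uses Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later); the values and weights are made up.

```python
import statistics


def weighted_mean(values, weights):
    """sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("need one weight per value")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical scores with weights 1, 2 and 3.
print(weighted_mean([70, 80, 90], [1, 2, 3]))  # 500 / 6, about 83.33

# Equal weights reduce to the ordinary arithmetic mean.
print(weighted_mean([5, 6, 7], [1, 1, 1]))     # 6.0

# Ordering of the three means for positive data.
data = [1, 2, 4]
am = statistics.mean(data)
gm = statistics.geometric_mean(data)
hm = statistics.harmonic_mean(data)
print(am >= gm >= hm)                          # True
```

The three means coincide only when all observations are equal.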

Important characteristics of a good average

Since an average is a representative item of a distribution it should possess the following properties :

1. It should take all items into consideration.

2. It should not be affected by extreme values.

3. It should be stable from sample to sample.

4. It should be capable of being used for further statistical analysis.

The mean satisfies all of these properties except that it is affected by the presence of extreme items. For example, if the items are 5, 6, 7, 7, 8 and 9, then the mean, median and mode are all equal to 7. If the last value is 30 instead of 9, the mean becomes 10.5, whereas the median and mode are unchanged. Though the median and mode are better in this respect, they do not satisfy the other properties. Hence the mean is the best average among these three.
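The effect of the extreme value in this example can be reproduced with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

original = [5, 6, 7, 7, 8, 9]
with_outlier = [5, 6, 7, 7, 8, 30]  # last value replaced by an extreme item

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

For the original items all three averages equal 7. With the extreme value the sum is 63, so the mean rises to 63 / 6 = 10.5, while the median and mode stay at 7.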

When to use different averages

The proper average to be used depends upon the nature of the data, nature of the frequency distribution and the purpose.

If the data are qualitative, only the mode can be computed. For example, when we are interested in knowing the typical soil type in a locality or the typical cropping pattern in a region, we can use the mode. On the other hand, if the data are quantitative, we can use any one of the averages.

If the data are quantitative, then we have to consider the nature of the frequency distribution. When the frequency distribution is skewed (not symmetrical), the median or mode is the proper average. In the case of raw data in which extreme values, either small or large, are present, the median or mode is the proper average. In the case of a symmetrical distribution, the mean, median or mode can be used. However, as seen already, the mean is preferred over the other two.

When we are dealing with rates, speeds and prices, we use the harmonic mean. If we are interested in relative change, as in the case of bacterial growth, cell division and so on, the geometric mean is the most appropriate average.
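Two small worked cases may help show why these choices matter. Both scenarios are hypothetical: a trip driven at two speeds over equal distances, and a population that grows by different factors in two periods. The sketch uses the standard `statistics` module (`geometric_mean` requires Python 3.8 or later).

```python
import statistics

# Speeds over two legs of EQUAL distance (assumed scenario).
speeds = [40, 60]  # km/h
average_speed = statistics.harmonic_mean(speeds)
print(average_speed)            # about 48, not the arithmetic mean of 50

# Growth factors per period (assumed scenario): x2, then x8, i.e. x16 overall.
growth_factors = [2, 8]
average_growth = statistics.geometric_mean(growth_factors)
print(average_growth)           # about 4, since 4 * 4 = 16
print(statistics.mean(growth_factors))  # 5, which would overstate growth (5 * 5 = 25)
```

In the first case, covering a distance d at 40 km/h and again at 60 km/h takes d/40 + d/60 = d/24 hours for 2d kilometres, which is 48 km/h. In the second, only a constant factor of 4 per period reproduces the overall sixteen-fold growth.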

9:38 AM

By statisticalconcepts

In: Data Mining

Database Marketing

“Good information is essential for fact-based decision-making”

In the good old days, many “savvy” corporate CEOs and other assorted head honchos in the public and private sector routinely made critical decisions by the seat of their pants. They relied on their experience, their intuition and their “gut” to determine a course of action that could make or break the organization. Sometimes they were right, sometimes they were wrong, and sometimes the organization went down the drain. Today, more and more of these C-level decision-makers are turning to analytics for help in the decision-making process. The stakes are just too high and the competition is just too fierce to rely on your “gut.” Instead of shouting, “show me the money,” savvy CEOs are now shouting, “show me the data and the statistical analysis first . . . and then I’ll show our shareholders the money.” The trend toward data-based decision-making is being driven, of course, by astronomical increases in data, statistical modeling capabilities and computing power. Analytics is defined as “the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions.” That encompasses the work of hundreds of thousands if not millions of “analysts” of all stripes around the world.

Using and evaluating data are important steps in the improvement process. Data are any information about an organization that can be gathered, reviewed and analyzed in order to produce useful knowledge. Looking at pieces of knowledge and facts in combination, whether they concern demographics, achievement, test scores, or climate, helps an organization formulate hypotheses and decide how best to use the information. Basing educated guesses upon data is the first step in creating an effective and efficient improvement process. An organization can focus its attention on the specific indicators displayed by the data and identify the priority areas on which to concentrate. Once the priority areas are narrowed down, realistic goals are set and moving into action becomes the next step. Reviewing data, forming hypotheses, and creating action plans help to move toward the goal of creating positive changes.

The following is a list of the 23 essential techniques used in database marketing. Anyone who works in marketing today has to be familiar with and be able to use all of these methods.

1) LTV. Customer Lifetime Value can be calculated in any industry, business to business or business to consumer. It is used to direct marketing strategy. In the early days of database marketing few knew how to calculate it or how to use it. Today it is widely practiced. It is powerful and it works.

2) RFM (Recency, Frequency, Monetary Analysis) is a highly successful way of predicting which customers will respond to promotions. It has been around for fifty years, but even today many marketers do not understand it or use it properly. It is a versatile tool that has helped to make database marketing successful.

3) Customer Communications. Personalized customer communications, based on data in a database, can be shown (using tests and controls) to increase customer retention, loyalty, cross sales, up sales and referrals. They are effective and they work. They are the principal reason why you build a marketing database.

4) Appended Data. It is possible today to append data to any name and address file to learn age, income, home value, home ownership, presence of children, length of residence, and about forty other valuable pieces of information about any household. This information can be used to create customer segments, and guide strategy designed to create powerful customer communications. Similar information can be appended to business to business files: SIC code, number of employees and annual sales.

5) Predictive Models. Using appended demographic and behavioral data, it is possible to create models that predict, accurately, which customers are most likely to defect, and which customers are most likely to respond to new initiatives. Modeling, combined with customer communications, can be a very powerful technique that can increase response and reduce your attrition rate.

6) Relational Databases. Putting customer databases in a relational form makes it possible to store an unlimited amount of information about any customer or prospect, and retrieve it in an instant in a hundred different ways. Relational databases are essential to modern database marketing. Marketers need to understand the principles involved.

7) Caller ID. Set up originally as a call routing device, Caller ID linked to a customer marketing database permits customer service to get a customer’s complete record up on the screen before taking a call. As a result, the CSR can speak to the customer as if she knew her, bonding with her and building close rapport. This helps deliver on the promise of database marketing.

8) Websites. The web has revolutionized database marketing. A modern website, with cookies can do almost everything that a live operator can do, and much more, showing and enabling customers to print pictures of the product, maps, instructions, background information and details. Web sites are not wonderful at selling. They are a tremendous research tool and customer bonding and ordering tool. No database marketer can be really successful without a personalized website with cookies.

9) Email. Despite the SPAM, emails have emerged as a powerful database marketing tool. The ability to contact customers immediately “Your product was shipped today. Here is the tracking number…” makes for vastly improved customer relationships leading to retention and increased sales.

10) Tests and Controls. Since 1980, marketers have been sending out direct mail and measuring the response to each campaign. Today, we can use our database to measure much more. By setting aside customers in a control group, we can measure with pinpoint accuracy the short- and long-term effect of any marketing initiative.

11) Loyalty Programs. Most customers are delighted to participate in well designed loyalty programs. Airlines have been outstandingly successful in these programs. Their use has spread to supermarkets, hotels, retail stores, and a variety of industries. They are part of the mix of retention building services that database marketing has made possible.

12) Analytical Software. It used to be that after a campaign, you got canned printed reports showing what happened. Today, marketers have very sophisticated analytical software linked to their database so that each analyst can do any type of standard or ad hoc report before, during and after a campaign, with the results printed on his PC printer. We have “hands on” marketing which has made database marketing very powerful.

13) Web Access to the database. Today the marketing database is in a relational format on a server which is accessed online over the web by anyone in the company, from any location. Instead of a couple of analysts working with the data, it is available to management, sales, customer service, marketing, and market research. Web access has made marketing databases a useful tool throughout the enterprise.

14) Rented Lists. In the past, most companies kept their customer lists strictly private. Today, most lists are shared, exchanged or rented. Sharing of lists created the catalog industry, and has spurred the growth of hundreds of other direct response industries.

15) Campaign Management Software. Direct marketing campaigns used to be generated by memoranda to a service bureau: “Select these groups, divide them into these segments with these codes, and fax me the counts”. The process of getting the mail out the door took three to six weeks. Today, marketers have campaign management software linked to their database so that they can do the planning and the actual selections themselves in an afternoon. It cuts weeks off of the direct mail time, resulting in higher response rates.

16) Profitability Analysis. We used to know that some customers were more profitable to us than others, but it was hard to measure. Today banks, supermarkets, insurance firms, business to business enterprises, and many others can compute the monthly profitability of each customer. They have discovered that many customers are unprofitable. As a result they have changed their marketing and pricing strategy to increase their profits.

17) Customer Segmentation. There used to be so few customers that sales and marketers could keep needed information about them in their heads. Today, companies have many more customers – some in the millions. A database is needed to store the information. To develop marketing strategies for all these customers, you have to divide them into segments usually based on demographics and behavior. Success comes from creating useful segments, and developing customer marketing strategies for each segment.

18) Multi-channel marketing. Customers buy through multiple channels: retail, catalog, and web. We have learned that multi-channel customers buy more than single channel buyers. To be successful, you need a database that provides a 360 degree picture of your customer, coupled with strategies that recognize and communicate personally with the customer when she shows up in any of the three channels.

19) Treating customers differently. All businesses have Gold customers – a small percentage that provides 80% of your revenue and profit. With a marketing database, you can identify these Gold customers. Then you develop programs designed to retain them. You use resources that you could not afford to spend on all of your customers. Profits come from working to retain the best, and encouraging others to move up to higher status levels.

20) Next Best Product. The database is used to determine what customers in each segment normally buy. From this, you can determine anomalies: customers who are not buying what the others are buying (usually because they are buying this product from somewhere else). This is their Next Best Product. The NBP is put into the customer database record and used by customer service and sales in communicating with customers.

21) Penetration Analysis. Using a database and online analytical software, marketers can do their own penetration analysis. What percent of sales do we have in each zip code, or SIC code, or income level, or age group? This is a versatile tool that can help you to locate retail stores, place advertising, and direct your sales force.

22) Cluster Coding. In many industries, using clusters with penetration analysis can help you identify who is buying your products, and who isn’t. It can be a creative tool to use in improving your marketing and sales.

23) Status Levels. The airlines started it: Platinum, Gold, and Silver. It has spread to other industries. Customers now understand their status, and work to move up to a higher level. Companies provide special benefits, rewards and services for higher status customers. In a democracy, it is an egalitarian method of customer differentiation which assists in building customer loyalty and company profits.

If you are not familiar with and using all 23 techniques in your work, you may not be getting the level of customer retention, cross sales, up sales, referrals and profits that others are getting.
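As one concrete illustration of the list above, the control-group idea in technique 10 can be sketched as a comparison of response rates. This is only a minimal sketch under stated assumptions: the function names, the group sizes and the response counts are all hypothetical, and a real campaign analysis would also test whether the difference is statistically significant.

```python
def response_rate(responders, group_size):
    """Share of a group that responded."""
    if group_size <= 0:
        raise ValueError("group size must be positive")
    return responders / group_size


def incremental_lift(test_responders, test_size, control_responders, control_size):
    """Response rate of the promoted group minus that of the held-out control group."""
    return (response_rate(test_responders, test_size)
            - response_rate(control_responders, control_size))


# Hypothetical campaign: 10,000 customers received the offer, 10,000 were held out.
lift = incremental_lift(300, 10_000, 200, 10_000)
print(round(lift, 4))  # 0.01, i.e. one extra response per 100 customers contacted
```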

12:14 PM

By statisticalconcepts

In: Statistics

Interpretation of Correlation

Correlation refers to a technique used to measure the relationship between two or more variables. When two things are correlated, it means that they vary together. Positive correlation means that high scores on one are associated with high scores on the other, and that low scores on one are associated with low scores on the other. Negative correlation, on the other hand, means that high scores on the first thing are associated with low scores on the second. Negative correlation also means that low scores on the first are associated with high scores on the second. An example is the correlation between body weight and the time spent on a weight-loss program. If the program is effective, the higher the amount of time spent on the program, the lower the body weight. Also, the lower the amount of time spent on the program, the higher the body weight.
Pearson r is a statistic that is commonly used to calculate bivariate correlations.
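For readers who want to see how Pearson r is obtained, here is a minimal sketch using the standard definition: the sum of products of deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The weight-loss figures are invented to echo the example above and are not real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: months on a weight-loss program vs. body weight in kg.
months = [1, 2, 3, 4, 5]
weight = [90, 88, 85, 83, 80]

r = pearson_r(months, weight)
print(r)  # strongly negative: more time on the program, lower weight
```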

For example, suppose the correlation between class size and mean reading score is Pearson r = -0.80, p < .001. What does this mean?

To interpret correlations, four pieces of information are necessary.
1. The numerical value of the correlation coefficient. Correlation coefficients range from -1.0 to +1.0, so their magnitude (absolute value) varies between 0.0 and 1.0. The closer the magnitude is to 1.0, the stronger the relationship between the two variables. A correlation of 0.0 indicates the absence of a linear relationship. In the example, the correlation coefficient is -0.80, which indicates the presence of a strong relationship.

2. The sign of the correlation coefficient. A positive correlation coefficient means that as variable 1 increases, variable 2 increases, and conversely, as variable 1 decreases, variable 2 decreases. In other words, the variables move in the same direction when there is a positive correlation. A negative correlation means that as variable 1 increases, variable 2 decreases and vice versa. In other words, the variables move in opposite directions when there is a negative correlation. In the example, the negative sign indicates that as class size increases, mean reading scores decrease.

3. The statistical significance of the correlation. A statistically significant correlation is conventionally indicated by a probability value of less than 0.05. This means that, if there were no real relationship, a correlation at least this strong would arise by chance fewer than five times out of 100, so the result indicates the presence of a relationship. For r = -0.80 in the example, there is a statistically significant negative relationship between class size and reading score (p < .001), such that the probability of this correlation occurring by chance is less than one time out of 1000.

4. The effect size of the correlation. For correlations, the effect size is called the coefficient of determination and is defined as r². The coefficient of determination can vary from 0 to 1.00 and indicates the proportion of variation in the scores that can be predicted from the relationship between the two variables. For r = -0.80 the coefficient of determination is 0.64, which means that 64% of the variation in mean reading scores among the different classes can be predicted from the relationship between class size and reading scores. (Conversely, 36% of the variation in mean reading scores cannot be explained.)
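The four pieces of information can be collected in a small helper. This is a sketch only: the function name, the dictionary keys and the p-value of 0.0005 (standing in for "p < .001") are assumptions made for illustration, and the 0.05 threshold is the conventional cut-off mentioned above.

```python
def describe_correlation(r, p_value, alpha=0.05):
    """Summarize magnitude, direction, significance and r squared for a correlation."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return {
        "magnitude": abs(r),             # 1. strength: closer to 1 is stronger
        "direction": direction,          # 2. sign of the coefficient
        "significant": p_value < alpha,  # 3. statistical significance
        "r_squared": r * r,              # 4. coefficient of determination
    }


summary = describe_correlation(-0.80, p_value=0.0005)
print(summary["direction"])            # negative
print(summary["significant"])          # True
print(round(summary["r_squared"], 2))  # 0.64, i.e. 64% of the variation
```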

A correlation can only indicate the presence or absence of a relationship, not the nature of the relationship. Correlation is not causation. There is always the possibility that a third variable influenced the results. For example, perhaps the students in the small classes were higher in verbal ability than the students in the large classes or were from higher income families or had higher quality teachers.