Statistical Concepts and Analytics Explained

4:32 PM

By statisticalconcepts

In: Data Mining

What is Segmentation?

Segmentation is the process of partitioning markets into groups of customers and prospects with similar needs and/or characteristics who are likely to exhibit similar purchase behavior. Strategic segmentation is for planning business and marketing strategy. Tactical segmentation is when a marketing manager wishes to prioritize marketing activities across a fairly large customer base, targeting certain products, offers and creative to certain customers. A good segmentation strategy supports both strategic and tactical activity.
Literature suggests the following steps:

Criteria for Market Segmentation

There are a huge number of variables that could be used for market segmentation in theory. They comprise easy to determine demographic factors as well as variables on user behavior or customer preferences. In addition, there are differences between private customers and businesses. The following table shows the most important traditional variables for segmentation.

Five criteria for an effective segmentation:

1) Measurable: It has to be possible to determine the values of the variables used for segmentation with justifiable effort. This is important especially for demographic and geographic variables. For an organization with direct sales (without intermediaries), its own customer database could deliver valuable information on buying behavior (frequency, volume, product groups, mode of payment, etc.).

2) Relevant: The size and profit potential of a market segment have to be large enough to economically justify separate marketing activities for this segment.

3) Accessible: The segment has to be accessible and servable for the organization. That means, for instance, that there are target-group-specific advertising media, such as magazines or websites the target audience likes to use.

4) Distinguishable: The market segments have to be diverse enough that they show different reactions to different marketing mixes.

5) Feasible: It has to be possible to approach each segment with a particular marketing program and to draw advantages from that.

Benefits of segmentation

• Segmentation is the best way to make marketing relevant, better matching customer needs.

• By segmenting markets the target consumer can be reached more often and at a lower cost.

• By marketing products that appeal to different customer preferences a business can retain customers who might otherwise switch to competing products and brands.

• Provides a framework for the company to make more money. It provides a lens through which marketing can be continuously improved in the future.

• Manage different types of customers differently across the various customer touch points and through the customer lifecycle.

• Allocate different resources and investment levels to segments and deliver superior value to distinct groups of customers.

Segments need to be
•    Large enough to be economic
•    Made up of customers with similar attributes
•    Reachable or targetable
•    Relevant to business needs
•    Actionable

Each part of a good segmentation solution will typically yield between four and nine segments. Too few segments tend to be very general, and too many result in lots of small segments that may not be usable or meaningful. Segmentation development should be driven by incremental economic gain; for example, the benefit of a new email segmentation must be more than the cost of any extra creative or analysis required.

The segmentation process
The process itself begins with narrowing the universe to be studied into a specific market now served by the company and obtaining basic information on competing products or services now on offer. Once this step has been completed, variables to be used are identified, reviewed, and tested. At the most basic level such variables, for example, might involve income and demographic characteristics of the consumers.
With these preparations completed, actual market research is organized to collect and to analyze data on the selected broad body of consumers. Analysis of the data will begin to cluster the consumers into distinct groupings based on the variables. Additional analysis, possibly involving more research, will next be conducted to develop detailed profiles of each segment already identified.
If the right variables were chosen at the outset and the market research was competently done, the resulting groupings will have characteristics distinct enough, and documented well enough, to permit the company to select one or more segments which will be easiest or most profitable to serve. The company's own strategy will play a role. Its aim, for example, may be to use its capacity more fully, and the company will therefore select a segment which will purchase the largest volume; alternatively, the company's aim may be low production levels with high profits, leading to a focus on another segment.
The last stage of the segmentation process will be the development of product and marketing plans based on the segment(s) most closely matching the company's "ideal" situation.
In general, customers are willing to pay a premium for a product that meets their needs more specifically than does a competing product. Thus marketers who successfully segment the overall market and adapt their products to the needs of one or more smaller segments stand to gain in terms of increased profit margins and reduced competitive pressures. Small businesses, in particular, may find market segmentation to be a key in enabling them to compete with larger firms.

Statistical Method
Cluster analysis: groups individuals or organizations together in such a way that the components in each cluster are more similar to one another than they are to the components of other clusters.
- Hierarchical partitioning
- Nonhierarchical partitioning
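
To make nonhierarchical partitioning concrete, here is a minimal sketch of k-means, one common nonhierarchical method (the text above does not name a specific algorithm, so this choice is an assumption). The customer data, the function name `k_means`, and the deterministic seeding with the first k points are all illustrative only.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def k_means(points, k, iterations=20):
    """Partition points into k clusters; returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points so the run is
    # repeatable; real analyses usually use random or k-means++ seeding.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # leave an empty cluster's centroid in place
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)))
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (annual spend in thousands, purchases per year).
customers = [(1.0, 2.0), (9.0, 20.0), (1.5, 3.0),
             (8.0, 18.0), (1.2, 2.5), (9.5, 22.0)]
centroids, labels = k_means(customers, k=2)
print(labels)
```

In practice the variables would be put on a common scale before clustering, since otherwise the one with the largest range dominates the distance.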

Recency Frequency Monetary Modeling (RFM): RFM is a process for marketing to your loyal customers that uses purchase behavior, measured by recency, frequency and monetary value, to determine which offers work for which types of customers. Generally, only a small percentage of customers respond to typical offers. With RFM, you can target the set of customers who are most likely to respond. RFM is a widely used segmentation method for predicting customer response, and it can improve both response rates and profits. It is used primarily for targeted campaigning, customer acquisition, cross-sell, up-sell, retention, etc., and supports campaign effectiveness and optimization.
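
One simple way to turn RFM into segments is to score each customer from 1 to 3 on each dimension. The sketch below assumes rank-based scoring into three equal bins; the purchase records, the reference date `as_of`, and the helper `score` are hypothetical, and other binning rules (fixed thresholds, quintiles) are just as common.

```python
from datetime import date


def score(values, higher_is_better=True, bins=3):
    """Rank-based score from 1 (worst) to `bins` (best) for each value."""
    # Sort so that the worst value comes first; ties keep their input order.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * bins // len(values) + 1
    return scores


# Hypothetical purchase history: (customer, purchase date, amount).
purchases = [
    ("A", date(2024, 6, 25), 120.0),
    ("A", date(2024, 5, 10), 80.0),
    ("A", date(2024, 3, 2), 60.0),
    ("B", date(2024, 1, 15), 40.0),
    ("C", date(2024, 6, 1), 300.0),
    ("C", date(2024, 4, 20), 150.0),
]
as_of = date(2024, 6, 30)  # assumed reporting date

customers = sorted({c for c, _, _ in purchases})
# Recency: days since the most recent purchase (fewer is better).
recency = [min((as_of - d).days for c, d, _ in purchases if c == cust)
           for cust in customers]
# Frequency: number of purchases. Monetary: total amount spent.
frequency = [sum(1 for c, _, _ in purchases if c == cust) for cust in customers]
monetary = [sum(a for c, _, a in purchases if c == cust) for cust in customers]

r = score(recency, higher_is_better=False)
f = score(frequency)
m = score(monetary)
rfm = {cust: (r[i], f[i], m[i]) for i, cust in enumerate(customers)}
print(rfm)
```

Customers with high scores on all three dimensions are the natural first targets for an offer; those with low recency scores are candidates for a win-back campaign.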

7:23 AM

By statisticalconcepts

In: Statistics

Role of Statistics

Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of numerical data. Statisticians apply their mathematical and statistical knowledge to the design of surveys and experiments; the collection, processing, and analysis of data; and the interpretation of the experiment and survey results. Opinion polls, statements of accuracy on scales and other measuring devices, and information about average earnings in an occupation are all usually the work of statisticians.

Statisticians may apply their knowledge of statistical methods to a variety of subject areas, such as biology, economics, engineering, medicine, public health, psychology, marketing, education, and sports. Many economic, social, political, and military decisions cannot be made without statistical techniques, such as the design of experiments to gain Federal approval of a newly manufactured drug. Statistics might be needed to show whether the seemingly good results of a drug were likely because of the drug rather than just the effect of random variation in patient outcomes.

One technique that is especially useful to statisticians is sampling—obtaining information about a population of people or group of things by surveying a small portion of the total. For example, to determine the size of the audience for particular programs, television-rating services survey only a few thousand families, rather than all viewers. Statisticians decide where and how to gather the data, determine the type and size of the sample group, and develop the survey questionnaire or reporting form. They also prepare instructions for workers who will collect and tabulate the data. Finally, statisticians analyze, interpret, and summarize the data using computer software.

In business and industry, statisticians play an important role in quality control and in product development and improvement. In an automobile company, for example, statisticians might design experiments to determine the failure time of engines exposed to extreme weather conditions by running individual engines until failure and breakdown. Working for a pharmaceutical company, statisticians might develop and evaluate the results of clinical trials to determine the safety and effectiveness of new medications. At a computer software firm, statisticians might help construct new statistical software packages to analyze data more accurately and efficiently. In addition to product development and testing, some statisticians also are involved in deciding what products to manufacture, how much to charge for them, and to whom the products should be marketed. Statisticians also may manage assets and liabilities, determining the risks and returns of certain investments.

Statisticians also are employed by nearly every government agency. Some government statisticians develop surveys that measure population growth, consumer prices, or unemployment. Other statisticians work for scientific, environmental, and agricultural agencies and may help figure out the average level of pesticides in drinking water, the number of endangered species living in a particular area, or the number of people afflicted with a particular disease. Statisticians also are employed in national defense agencies, determining the accuracy of new weapons and the likely effectiveness of defense strategies.

Because statistical specialists are employed in so many work areas, specialists who use statistics often have different professional designations. For example, a person using statistical methods to analyze economic data may have the title econometrician, while statisticians in public health and medicine may hold titles such as biostatistician or biometrician.

6:59 AM

By statisticalconcepts

In: Statistics

Sampling - Concepts and Definitions

a) Population

The collection of all units of a specified type in a given region at a particular point or period of time is termed as a population or universe. Thus, we may consider a population of persons, families, farms, cattle in a region or a population of trees or birds in a forest or a population of fish in a tank etc. depending on the nature of data required.

b) Sampling Unit

Elementary units or group of such units which besides being clearly defined, identifiable and observable, are convenient for purpose of sampling are called sampling units. For instance, in a family budget enquiry, usually a family is considered as the sampling unit since it is found to be convenient for sampling and for ascertaining the required information. In a crop survey, a farm or a group of farms owned or operated by a household may be considered as the sampling unit.

c) Sampling Frame

A list of all the sampling units belonging to the population to be studied with their identification particulars or a map showing the boundaries of the sampling units is known as sampling frame. Examples of a frame are a list of farms and a list of suitable area segments like villages in India or counties in the United States. The frame should be up to date and free from errors of omission and duplication of sampling units.

d) Random Sample

One or more sampling units selected from a population according to some specified procedures are said to constitute a sample. The sample will be considered as random or probability sample, if its selection is governed by ascertainable laws of chance. In other words, a random or probability sample is a sample drawn in such a manner that each unit in the population has a predetermined probability of selection. For example, if a population consists of the N sampling units U1, U2,…,Ui,…,UN then we may select a sample of n units by selecting them unit by unit with equal probability for every unit at each draw with or without replacing the sampling units selected in the previous draws.
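
As a small illustration of equal-probability selection with and without replacement, the sketch below uses Python's standard `random` module. The population of ten units and the fixed seed are assumptions made only so the example is repeatable.

```python
import random

population = [f"U{i}" for i in range(1, 11)]  # N = 10 hypothetical units
rng = random.Random(42)  # fixed seed only to make the sketch repeatable

n = 4
# With replacement: every draw picks from all N units, so a unit may repeat.
with_replacement = rng.choices(population, k=n)
# Without replacement: a selected unit is not available for later draws.
without_replacement = rng.sample(population, k=n)
print(with_replacement, without_replacement)
```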

e) Non-random sample

A sample selected by a non-random process is termed as non-random sample. A non-random sample which is drawn using a certain amount of judgment, with a view to getting a representative sample, is termed as judgment or purposive sample. In purposive sampling, units are selected by considering the available auxiliary information more or less subjectively with a view to ensuring a reflection of the population in the sample. This type of sampling is seldom used in large-scale surveys mainly because it is not generally possible to get strictly valid estimates of the population parameters under consideration and of their sampling errors due to the risk of bias in subjective selection and the lack of information on the probabilities of selection of the units.

f) Population parameters

Suppose a finite population consists of the N units U1, U2,…,UN and let Yi be the value of the variable y, the characteristic under study, for the ith unit Ui, (i=1,2,…,N). For instance, the unit may be a farm and the characteristic under study may be the area under a particular crop. Any function of the values of all the population units (or of all the observations constituting a population) is known as a population parameter or simply a parameter. Some of the important parameters usually required to be estimated in surveys are population total

$$Y = \sum_{i=1}^{N} Y_i$$

and population mean

$$\bar{Y} = \frac{Y}{N} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$

g) Statistic, Estimator and Estimate

Suppose a sample of n units is selected from a population of N units according to some probability scheme and let the sample observations be denoted by y1,y2,…,yn. Any function of these values which is free from unknown population parameters is called a statistic.

An estimator is a statistic obtained by a specified procedure for estimating a population parameter. The estimator is a random variable and its value differs from sample to sample and the samples are selected with specified probabilities. The particular value, which the estimator takes for a given sample, is known as an estimate.

h) Sample design

A clear specification of all possible samples of a given type with their corresponding probabilities is said to constitute a sample design. For example, suppose we select a sample of n units with equal probability with replacement, the sample design consists of $N^n$ possible samples (taking into account the orders of selection and repetitions of units in the sample) with $1/N^n$ as the probability of selection for each of them, since in each of the n draws any one of the N units may get selected. Similarly, in sampling n units with equal probability without replacement, the number of possible samples (ignoring orders of selection of units) is

$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$$

and the probability of selecting each of the samples is

$$1 \Big/ \binom{N}{n}$$
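
For a small population these counts can be checked by listing every possible sample. The sketch below assumes a hypothetical population of N = 4 units and samples of size n = 2.

```python
from itertools import combinations, product
from math import comb

units = ["U1", "U2", "U3", "U4"]  # N = 4 hypothetical units
n = 2

# With replacement, keeping order and repeats: N**n possible samples.
ordered_with_replacement = list(product(units, repeat=n))
# Without replacement, ignoring order: "N choose n" possible samples.
unordered_without_replacement = list(combinations(units, n))

print(len(ordered_with_replacement))       # 16
print(len(unordered_without_replacement))  # 6
```

Under equal-probability selection each of the 16 ordered samples has probability 1/16, and each of the 6 unordered samples has probability 1/6.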

i) Unbiased Estimator

Let the probability of getting the i-th sample be Pi and let ti be the estimate, that is, the value of an estimator t of the population parameter θ based on this sample (i=1,2,…,Mo), Mo being the total number of possible samples for the specified probability scheme. The expected value or the average of the estimator t is given by

$$E(t) = \sum_{i=1}^{M_o} P_i t_i$$

An estimator t is said to be an unbiased estimator of the population parameter θ if its expected value is equal to θ irrespective of the y-values. In case the expected value of the estimator is not equal to the population parameter, the estimator t is said to be a biased estimator of θ, and the difference E(t) − θ is its bias. The estimator t is said to be positively or negatively biased for the population parameter according as the value of the bias is positive or negative.

j) Measures of error

Since a sample design usually gives rise to different samples, the estimates based on the sample observations will, in general, differ from sample to sample and also from the value of the parameter under consideration. The difference between the estimate ti based on the i-th sample and the parameter θ, namely (ti − θ), may be called the error of the estimate and this error varies from sample to sample. An average measure of the divergence of the different estimates from the true value is given by the expected value of the squared error, which is

$$MSE(t) = E(t - \theta)^2 = \sum_{i=1}^{M_o} P_i (t_i - \theta)^2$$

and this is known as mean square error (MSE) of the estimator. The MSE may be considered to be a measure of the accuracy with which the estimator t estimates the parameter.

The expected value of the squared deviation of the estimator from its expected value is termed sampling variance. It is a measure of the divergence of the estimator from its expected value and is given by

$$V(t) = \sigma^2(t) = E[t - E(t)]^2 = \sum_{i=1}^{M_o} P_i [t_i - E(t)]^2$$

This measure of variability may be termed as the precision of the estimator t.

The MSE of t can be expressed as the sum of the sampling variance and the square of the bias. In the case of an unbiased estimator, the MSE and the sampling variance are the same. The square root of the sampling variance, σ(t), is termed as the standard error (SE) of the estimator t. In practice, the actual value of σ(t) is not generally known and hence it is usually estimated from the sample itself.
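
The definitions of expected value, bias, sampling variance and MSE can be verified by brute force on a tiny population. The sketch below assumes a hypothetical population of four values, equal-probability sampling of two units without replacement, and two estimators of the population mean: the sample mean and, as a deliberately poor choice, the sample maximum. Exact fractions are used so the identity "MSE = variance + bias squared" holds without rounding.

```python
from fractions import Fraction
from itertools import combinations

Y = [2, 4, 6, 12]                    # hypothetical population values
theta = Fraction(sum(Y), len(Y))     # parameter of interest: population mean
samples = list(combinations(Y, 2))   # every possible sample of n = 2 units
P = Fraction(1, len(samples))        # each sample is equally likely


def summarize(estimator):
    """Return (expected value, bias, sampling variance, MSE) of an estimator."""
    estimates = [Fraction(estimator(s)) for s in samples]
    expected = sum(P * t for t in estimates)
    bias = expected - theta
    variance = sum(P * (t - expected) ** 2 for t in estimates)
    mse = sum(P * (t - theta) ** 2 for t in estimates)
    return expected, bias, variance, mse


mean_stats = summarize(lambda s: Fraction(sum(s), len(s)))
max_stats = summarize(max)

for name, stats in [("sample mean", mean_stats), ("sample max", max_stats)]:
    expected, bias, variance, mse = stats
    print(name, expected, bias, variance, mse, mse == variance + bias ** 2)
```

The sample mean comes out with zero bias, so its MSE equals its sampling variance; the sample maximum is positively biased and its MSE exceeds its variance by exactly the squared bias.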

k) Confidence interval

The frequency distribution of the samples according to the values of the estimator t based on the sample estimates is termed as the sampling distribution of the estimator t. It is important to mention that though the population distribution may not be normal, the sampling distribution of the estimator t is usually close to normal, provided the sample size is sufficiently large. If the estimator t is unbiased and is normally distributed, the interval

$$[\,t - K \cdot SE(t),\; t + K \cdot SE(t)\,]$$

is expected to include the parameter θ in P% of the cases, where P is the proportion of the area between –K and +K of the distribution of the standard normal variate. The interval considered is said to be a confidence interval for the parameter θ with a confidence coefficient of P% and with the confidence limits t – K SE(t) and t + K SE(t). For example, if a random sample of the records of batteries in routine use in a large factory shows an average life t = 394 days, with a standard error SE(t) = 4.6 days, the chances are 99 in 100 that the average life in the population of batteries lies between

tL = 394 - (2.58)(4.6) = 382 days

tU = 394 + (2.58)(4.6) = 406 days

The limits, 382 days and 406 days, are called the lower and upper confidence limits of the 99% confidence interval for θ. With a single estimate from a single survey, the statement “θ lies between 382 and 406 days” is not certain to be correct. The “99% confidence” figure implies that if the same sampling plan were used many times in a population, a confidence statement being made from each sample, about 99% of these statements would be correct and 1% wrong.
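
The battery example can be reproduced in a few lines. This sketch assumes the estimator is unbiased and normally distributed, as the text requires, and takes K from the standard normal distribution via Python's `statistics.NormalDist`; the function name `confidence_interval` is illustrative.

```python
from statistics import NormalDist


def confidence_interval(t, se, confidence):
    """Normal-theory interval from t - K*SE(t) to t + K*SE(t)."""
    # K is the standard normal quantile that leaves (1 - confidence) / 2
    # of the area in each tail; about 2.58 for 99% confidence.
    k = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return t - k * se, t + k * se


lower, upper = confidence_interval(394, 4.6, 0.99)
print(round(lower), round(upper))  # 382 406
```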

l) Sampling and Non-sampling error

The error arising due to drawing inferences about the population on the basis of observations on a part (sample) of it is termed sampling error. The sampling error is non-existent in a complete enumeration survey since the whole population is surveyed.

The errors other than sampling errors, such as those arising through non-response, incompleteness and inaccuracy of response, are termed non-sampling errors and are likely to be more widespread and important in a complete enumeration survey than in a sample survey. Non-sampling errors arise due to various causes right from the beginning stage when the survey is planned and designed to the final stage when the data are processed and analyzed.

The sampling error usually decreases with increase in sample size (number of units selected in the sample) while the non-sampling error is likely to increase with increase in sample size.

As regards the non-sampling error, it is likely to be more in the case of a complete enumeration survey than in the case of a sample survey since it is possible to reduce the non-sampling error to a great extent by using better organization and suitably trained personnel at the field and tabulation stages in the latter than in the former.

6:48 PM

By statisticalconcepts

In: Web Analytics

Visitors vs. Visits

Let us first distinguish how these two web analysis metrics are different:

Unique Visitors
A unique visitor is an individual who has visited a site at least once within a specified time period. A visitor may have come to the site 10 times in a week, but if the reporting period is that week, the visitor is counted only once for that week. Once that week is over, the same visitor can be counted again for a new time period. The primary method of calculating unique visitors is by setting a persistent cookie on the visitor’s browser to uniquely identify the visitor. Cookie technology helps to avoid common pitfalls, for example, IP pooling, caching, or tracking visitors behind a firewall, when counting unique visitors.

Visits
A visit begins when a person first views a page on your website. The visit will continue until that person stops all activity on the site for 30 minutes, or until the maximum visit length occurs, which is 12 hours. For example, if you visit a page on www.xxxx.com, you have one instance of a visit that lasts until you have incurred 30 minutes of inactivity. Closing your browser window does not automatically end the current visit. If you don’t view any pages for more than 30 minutes, and then resume viewing pages, then a new visit is registered. The maximum length of any visit is 12 hours. No visit is allowed to extend beyond that period. If additional activity occurs for a period longer than 12 hours, with no period of inactivity exceeding 30 minutes, the visit will be counted twice. The new visit is counted at the first page view after the 12 hour length has been exceeded. The 30 minute visit timeout period is an industry standard. It is used by most web analytics products and it is recommended by industry analysts.
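
The visit rules described above (a new visit after more than 30 minutes of inactivity, and a 12-hour cap on any single visit) can be sketched as a small counting function. This is an illustration of those rules only, not the logic of any particular analytics product; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

INACTIVITY_TIMEOUT = timedelta(minutes=30)
MAX_VISIT_LENGTH = timedelta(hours=12)


def count_visits(page_view_times):
    """Count visits for one visitor from their page-view timestamps."""
    visits = 0
    visit_start = None
    last_view = None
    for ts in sorted(page_view_times):
        new_visit = (
            visit_start is None
            or ts - last_view > INACTIVITY_TIMEOUT  # more than 30 idle minutes
            or ts - visit_start > MAX_VISIT_LENGTH  # 12-hour cap exceeded
        )
        if new_visit:
            visits += 1
            visit_start = ts
        last_view = ts
    return visits


start = datetime(2024, 1, 1, 9, 0)
# Page views at 9:00, 9:10, 9:50 and 10:00: the 40-minute gap splits them.
views = [start + timedelta(minutes=m) for m in (0, 10, 50, 60)]
print(count_visits(views))  # 2
```

Note that the same person still counts as one unique visitor here, however many visits the function returns.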
Both numbers are important (although they may not be critical – depending on the outcomes you need to measure) and they provide similar information, but there are some important differences.
1) Every Visit represents an opportunity to persuade or convert a visitor to a customer.
2) Measuring visits is based on fairly established industry standards
3) Unique Visitors are less accurate than Visits - Most web analytics tools, in the absence of cookie setting, fall back on IP address and user agent. This introduces significant variability in your Unique Visitor counts and can skew your true site performance and reach.
4) Unique Visitors mask your true conversion opportunities - A single Unique Visitor can account for several Visits, and each Visit is a separate opportunity to convert a customer. As such, using Unique Visitors as the denominator in most performance calculations actually overstates the effectiveness of your site. For example, if I visit a retail site 4 times in one week, and purchase twice - what is my conversion rate? If you use weekly unique visitors, my conversion rate is 200% (2 purchases / 1 visitor). If you use visits, my conversion rate is 50% (2 purchases / 4 visits). Which is a better representation of site effectiveness? Clearly the 50% is much more valuable in understanding where your site may or may not be performing optimally. With the 50% conversion metric, I have the opportunity to analyze which visits did not convert…what happened? Is it a navigational issue? A cross-sell problem? Or perhaps a remarketing opportunity? If you used Unique Visitors, you’d never get this visibility.
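
The effect of the denominator is easy to check with the four-visit, two-purchase week from the example: one person, four visits, two purchases. The variable names below are illustrative.

```python
unique_visitors = 1  # one person in the reporting week
visits = 4           # that person's visits during the week
purchases = 2

per_visitor = purchases / unique_visitors  # purchases per unique visitor
per_visit = purchases / visits             # purchases per visit
print(f"{per_visitor:.0%} per unique visitor, {per_visit:.0%} per visit")
```

Only the per-visit rate stays between 0% and 100%, and it points directly at the two visits that did not convert.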
Unique Visitors has often been viewed as one of the most strategic web metrics. Countless companies and site operators have insisted on knowing how many unique visitors came to their site on any given day. By contrast, Visits has largely been the neglected stepchild of web metrics. Most folks know it’s there, but many prefer to ignore it in favor of the more popular Unique Visitor metric. Generally speaking, a visit starts when someone reaches your website, and is considered complete after 30 minutes of inactivity. It is also commonly referred to as a “session”.
It is good to know how many people have come to your site, just as it is good to know how many people walk into your store in the mall. It gives you an idea of the total number of customers/potential customers that you are drawing in, and allows you to compare trends over time to spot opportunities or problems.
But there’s a big difference between a person poking their head into your store on their way to the food court, never to return again, and a person who repeatedly makes the trip to your store, even if they don’t purchase something every time. And this is where I think visits may provide more relevant, actionable information than visitors for this client.

It is interesting to know how many people visit the site and to hopefully see this grow over time, but more critical in this case is the number of visits. Now for sake of argument, let’s say you can only report one of these metrics to your executives - which is it going to be? Unique Visitors? or Visits?
Leaving aside the argument over whether either of these satisfies the criteria to become a real KPI, let’s consider the uses of each metric.

Let’s assume 5% of your visitors delete cookies…that would imply a 5% level of inaccuracy around unique visitors, right? Wrong. The fallback method for unique visitor determination is most commonly IP and user agent string - a *much* less reliable approach than cookies. This was actually a key reason log file solutions fell out of favor - because most relied on IP and user agent and hence were highly inaccurate. Because of the inaccuracy of IP and user agent, your 5% of cookie-rejecting visitors can actually skew your traffic numbers many times over. So you may find that your 5% is actually 15% of your unique visitors. And because it’s nearly impossible to reconcile this number (outside of triangulating with registered user counts), you have little hope in relying on unique visitors as a true measure of “visitors”.

Furthermore, assuming you can set a persistent cookie, you’re only measuring a computer - not a person. Multiple people use single computers. Single people use multiple computers. So what is your true unique visitor count?

You see, a single visit could, and very likely will, result in multiple hits and page views.
If you left and came back in an hour, it could count as another visit but not another visitor. That’s where it gets a little more complicated. Things that must be considered are how long you were inactive and how long it took you to come back.

There’s more. In order to count as a unique visitor, the visitor must have a browser that accepts cookies.
OK, the truth is, they can both help you so depending on who you ask, you might get a different opinion. If one is reporting on Visits then he doesn’t have to worry about whether or not a user’s browser accepts cookies. But ultimately, it’s more valuable to look at visits since those numbers include repeat visitors.

So the number of visitors is interesting, the number of visits may be more so, but we need to get to the real reason our site exists: conversions. In this case, purchases. And to make decisions about optimization and resource allocation, we need to understand the efficiency of various channels bringing visits to our site and this means: conversion rate. And to get a conversion rate that makes sense, we need to have the most appropriate denominator.

Which brings us back to visitors vs. visits. Yes, it can be useful to know what percentage of unique visitors in a month made a purchase, but wouldn’t it be more useful, for a site selling repeat-purchase products, to know the percentage of visits that resulted in a purchase?

7:18 AM

By statisticalconcepts

In: Statistics

Need for Statistical Data

Since the beginning of the twentieth century the economic and social life of the people and the functional system of industry and business, educational and medical facilities and other activities of the community have undergone substantial changes due to spectacular developments in the field of science and technology. Now the emphasis is on specialization in mass production and utilization of goods and services of a given type with a view to getting the maximum possible benefit per unit of cost. Considerable planning is required in large-scale projects, and any rational decision regarding efficient formulation and execution of suitable plans and projects or an objective assessment of their effectiveness, whether in the field of industry, business or governmental activities, has necessarily to be based on objective data regarding resources and needs. There is, therefore, a need for various types of statistical (quantified) information to be collected and analyzed in an objective manner and presented suitably so as to serve as a sound basis for taking policy decisions in different fields of human activity. In modern times, the primary users of statistical data are the state, industry, business, scientific institutions, public organizations and international agencies.

For instance, to execute its various responsibilities, the state is in need of a variety of information regarding different sectors of the economy, sections of people and geographical regions in the country as well as information on the available resources such as manpower, cultivable land, forests, water, minerals and oil. If the resources were unlimited, planning would be relatively simple as it would consist in just providing each one with what he needs in terms of money, material, employment, education etc. But such a situation is only hypothetical, as in reality the resources are limited and the needs are usually not well defined and are elastic.

Therefore, for the purpose of proper planning, fairly detailed data on the available resources and on the needs are to be collected. For example, the country is in need of data on production and consumption of different types of products to enable it to take objective decisions regarding its import and export policies. Statistical information on the cost of living of different categories of people living in various parts of the country is of importance in shaping its policies in respect of wage and price levels.

Complete enumeration survey
One way of obtaining the required information at regional and country level is to collect the data for each and every unit (person, household, field, factory, shop, etc., as the case may be) belonging to the population or universe, which is the aggregate of all units of a given type under consideration, and this procedure of obtaining information is termed complete enumeration survey. The effort, money and time required for carrying out complete enumeration surveys to obtain the different types of data will, generally, be extremely large. However, if the information is required for each and every unit in the domain of study, a complete enumeration survey is clearly necessary. Examples of such situations are income tax assessment, where the income of each individual is assessed and taxed, preparation of voters’ lists for election purposes, and recruitment of personnel in an establishment. But there are many situations where only summary figures are required for the domain of study as a whole or for groups of units, and in such situations collection of data for every unit is only a means to an end and not the end itself. It is worth mentioning that exact planning for the future is not possible, since this would need accurate information on the resources that would be available and on the needs that would have to be satisfied in future. In general, past data are used to forecast the resources and the needs of the future and hence there is some element of uncertainty in planning. Because of this uncertainty, only broad (and not exact) allocations of the resources are usually attempted. Thus some margin of error may be permitted in the data needed for planning, provided this error is not large enough to affect the broad allocations.

Sampling
Considering that some margin of error is permissible in the data needed for practical purposes, an effective alternative to a complete enumeration survey can be a sample survey where only some of the units selected in a suitable manner from the population are surveyed and an inference is drawn about the population on the basis of observations made on the selected units. It can be easily seen that compared to a sample survey, a complete enumeration survey is time-consuming, expensive, has less scope in the sense of restricted subject coverage and is subject to greater coverage, observational and tabulation errors. In certain investigations, it may be essential to use specialized equipment or highly trained field staff for data collection making it almost impossible to carry out such investigations except on a sampling basis. Besides, in case of destructive surveys, a complete enumeration survey is just not practicable. Thus, if the interest is to obtain the average life of electric bulbs in a batch then one will have to confine the observations, of necessity, to a part (or a sample) of the population or universe and to infer about the population as a whole on the basis of the observations on the sample. However, since an inference is made about the whole from a part in a sample survey, the results are likely to be different from the population values and the differences would depend on the selected part or sample. Thus the information provided by a sample is subject to a kind of error which is known as sampling error. On the other hand, as only a part of the population is to be surveyed, there is greater scope for eliminating the ascertainment or observational errors by proper controls and by employing trained personnel than is possible in a complete enumeration survey. 
It is of interest to note that if a sample survey is carried out according to certain specified statistical principles, it is possible not only to estimate the value of the characteristic for the population as a whole on the basis of the sample data, but also to get a valid estimate of the sampling error of the estimate. There are various steps involved in the planning and execution of a sample survey. One of the principal steps in a sample survey relates to the methods of data collection.
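As a rough illustration of estimating both a population value and its sampling error from a sample, the sketch below draws a simple random sample without replacement and reports the sample mean with its estimated standard error. The simulated population, the function name `srs_estimate` and the sample size are assumptions made for this example only; simple random sampling is just one of many possible designs.

```python
import math
import random
import statistics

def srs_estimate(population, n, seed=0):
    """Estimate the population mean from a simple random sample of size n
    drawn without replacement, together with its estimated standard error."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    N = len(population)
    mean = statistics.mean(sample)
    s2 = statistics.variance(sample)  # sample variance (divisor n - 1)
    # Estimated sampling error of the mean, with finite population correction
    se = math.sqrt((1 - n / N) * s2 / n)
    return mean, se

# Hypothetical population of 10,000 units (simulated, not real survey data)
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]

est, se = srs_estimate(population, n=400)
print(f"sample estimate = {est:.2f}, estimated standard error = {se:.2f}")
print(f"complete enumeration mean = {statistics.mean(population):.2f}")
```

When the sample covers the whole population (n equal to N), the finite population correction makes the estimated sampling error zero, which matches the idea that a complete enumeration has no sampling error.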

8:27 AM

By statisticalconcepts

In: Statistics

Transformation of Data

Note: It should be emphasized that transformation of data in statistics, if needed, must take place right at the beginning of the statistical analysis.
The validity of analysis of variance depends on certain important assumptions: normality of errors and random effects, independence of errors, homoscedasticity of errors, and additivity of effects. The analysis is likely to lead to faulty conclusions when some of these assumptions are violated. A very common violation concerns the assumption of constant error variance. One alternative in such cases is a weighted analysis of variance, in which each observation is weighted by the inverse of its variance. This requires an estimate of the variance of each observation, which may not always be feasible. Quite often, the data are instead subjected to a scale transformation such that, in the transformed scale, the constant variance assumption holds. Some of these transformations can also correct for departures from normality, because unequal variance is often related to the distribution of the variable. The major aims of transforming data are to bring the data closer to a normal distribution, to reduce the relationship between mean and variance, to reduce the influence of outliers, to improve linearity in regression, to reduce interaction effects, and to reduce skewness and kurtosis. Methods are available for identifying the transformation needed for a particular data set, but one may also resort to certain standard forms of transformation depending on the nature of the data. The most commonly used transformations in the analysis of experimental data are the arcsine, logarithmic and square root transformations. They can be carried out as described below.

Arcsine Transformation: The arcsine transformation is appropriate for data on proportions, i.e., data obtained from a count and expressed as decimal fractions or percentages. The distribution of such percentages is binomial, and the arcsine transformation makes the distribution approximately normal. Since the role of the arcsine transformation is not properly understood, there is a tendency to transform any percentage using it. But only percentage data derived from count data, such as % barren tillers (which is derived from the ratio of the number of non-bearing tillers to the total number of tillers), should be transformed, and not percentage data such as % protein or % carbohydrates, which are not derived from count data.

In the case of proportions derived from frequency data, the observed proportion p can be changed to a new form

$\theta = \sin^{-1}\sqrt{p}$

This type of transformation is known as the angular or arcsine transformation. However, when nearly all values in the data lie between 0.3 and 0.7, there is no need for such a transformation. It may be noted that the angular transformation is not applicable to proportion or percentage data which are not derived from counts. For example, percentage of marks, percentage of profit, percentage of protein in grains, oil content in seeds, etc., cannot be subjected to angular transformation. The angular transformation does not work well when the data contain 0 or 1 values for p. The transformation in such cases is improved by replacing 0 with (1/4n) and 1 with [1-(1/4n)] before taking angular values, where n is the number of observations on which p is based for each group.

ASIN gives the arcsine of a number. The arcsine is the angle whose sine is the given number, and this number must be from -1 to 1. The returned angle is given in radians in the range -π/2 to π/2. To express the arcsine in degrees, multiply the result by 180/PI(). For this, go to the cell where the transformation is required, write =ASIN(SQRT(cell reference of the proportion to be transformed))*180/PI() and press ENTER; the square root is needed because the angular transformation is the arcsine of the square root of the proportion. Then copy it for all observations.

Example: ASIN(0.5) equals 0.5236 (π/6 radians) and ASIN(0.5)*180/PI() equals 30 (degrees).
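For readers working outside a spreadsheet, the angular transformation can be sketched in Python as follows. This assumes the usual form of the transformation, the arcsine of the square root of the proportion, and applies the 1/(4n) adjustment for proportions of 0 or 1 described above. The function names are hypothetical.

```python
import math

def angular_transform(count, n):
    """Arcsine (angular) transform, in degrees, of the proportion count / n.

    Proportions of exactly 0 or 1 are replaced by 1/(4n) and 1 - 1/(4n)
    before the transform is taken.
    """
    p = count / n
    if p == 0:
        p = 1 / (4 * n)
    elif p == 1:
        p = 1 - 1 / (4 * n)
    return math.degrees(math.asin(math.sqrt(p)))

def back_transform(theta_degrees):
    """Inverse of the angular transform: p = (sin theta)^2."""
    return math.sin(math.radians(theta_degrees)) ** 2

# 25 barren tillers out of 100 gives p = 0.25, sqrt(p) = 0.5, angle = 30 degrees
theta = angular_transform(25, 100)
print(round(theta, 4), round(back_transform(theta), 4))
```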

Logarithmic Transformation: The logarithmic transformation is suitable for data where the variance is proportional to the square of the mean, i.e., where the coefficient of variation (S.D./mean) is constant, or where effects are multiplicative. These conditions are generally found in data that are whole numbers and cover a wide range of values. This is usually the case when analyzing growth measurements. For data of this nature, the logarithmic transformation is recommended. It squeezes the bigger values and stretches the smaller values. A simple plot of group means against group standard deviations will show linearity in such cases. A good example is data from an experiment involving various types of insecticides. For an effective insecticide, insect counts on the treated experimental unit may be small, while for the ineffective ones the counts may range from 100 to several thousand. When zeros are present in the data, it is advisable to add 1 to each observation before making the transformation. The log transformation is particularly effective in normalizing positively skewed distributions. It is also used to achieve additivity of effects in certain cases.

LN gives the natural logarithm of a positive number. Natural logarithms are based on the constant e (approximately 2.718). For this, go to the cell where the transformation is required, write =LN(cell reference of the value to be transformed) and press ENTER. Then copy it for all observations.

Example: LN(86) equals 4.45, LN(2.718) is approximately 1, LN(EXP(3)) equals 3 and EXP(LN(4)) equals 4. Further, EXP returns e raised to the power of a given number, LOG returns the logarithm of a number to a specified base and LOG10 returns the base-10 logarithm of a number.
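The same idea can be sketched in Python. The example below uses made-up insect counts (not data from any experiment) in which the standard deviation grows with the mean, and shows the group standard deviations before and after a natural-log transformation. Because a zero is present, 1 is added to every observation first. The function name and the data are assumptions for illustration.

```python
import math
import statistics

def log_transform(values):
    """Natural-log transform; adds 1 to every observation if any zero is present."""
    shift = 1 if any(v == 0 for v in values) else 0
    return [math.log(v + shift) for v in values]

# Hypothetical insect counts per plot under three insecticides
groups = {
    "effective": [0, 5, 3, 8, 4],
    "moderate": [20, 55, 31, 78, 42],
    "ineffective": [210, 540, 300, 820, 400],
}

# Decide on the shift from the whole data set, then transform each group
all_values = [v for g in groups.values() for v in g]
shift = 1 if 0 in all_values else 0

for name, counts in groups.items():
    logged = [math.log(c + shift) for c in counts]
    print(name,
          "raw SD =", round(statistics.stdev(counts), 2),
          "log SD =", round(statistics.stdev(logged), 2))
```

The raw standard deviations differ by orders of magnitude across groups, while those on the log scale are of similar size, which is the behaviour the transformation is meant to achieve.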

Square Root Transformation: This transformation is appropriate for data sets where the variance is proportional to the mean. Here, the data consist of small whole numbers, for example, data obtained in counting rare events. Such data generally follow the Poisson distribution, and the square root transformation brings the Poisson closer to a normal distribution. Bringing the original observations to the square root scale by taking the square root of each observation is known as the square root transformation. Whether the variance is proportional to the mean can be seen from a graph of group variances against group means. A linear relationship between mean and variance is commonly observed when the data are in the form of small whole numbers (e.g., counts of wildlings per quadrat, weeds per plot, earthworms per square metre of soil, insects caught in traps, etc.). When the observed values fall within the range of 1 to 10, and especially when zeros are present, the transformation should be $\sqrt{y + 0.5}$, where y is the original observation.

SQRT gives the square root of a positive number. For this, go to the cell where the transformation is required, write =SQRT(cell reference of the value to be transformed + 0.5) and press ENTER. Then copy it for all observations. However, if the number is negative, SQRT returns the #NUM! error value.

Example: SQRT(16) equals 4, SQRT(-16) equals #NUM! and SQRT(ABS(-16)) equals 4.
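A minimal Python counterpart is sketched below. It assumes the common convention of adding 0.5 to each count before taking the square root when the counts are small or include zeros. The function name and the `add_half` flag are hypothetical. Like SQRT in Excel, `math.sqrt` rejects negative input, raising a `ValueError` rather than returning #NUM!.

```python
import math

def sqrt_transform(counts, add_half=True):
    """Square root transform; optionally uses sqrt(y + 0.5) for small counts."""
    offset = 0.5 if add_half else 0.0
    return [math.sqrt(c + offset) for c in counts]

# Hypothetical weed counts per plot, including zeros
weeds = [0, 2, 1, 4, 0, 3]
print([round(v, 3) for v in sqrt_transform(weeds)])

# Without the offset this is the plain square root
print(sqrt_transform([16], add_half=False))
```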

Box-Cox Transformation:

By now we know that if the relation between the variance of the observations and the mean is known, then this information can be utilized in selecting the form of the transformation.

We now elaborate on this point and show how it is possible to estimate the form of the required transformation from the data.

Box-Cox transformation is a power transformation of the original data.

Let $y_{ut}$ be the observation pertaining to the u-th plot. The power transformation then implies that we use the $y_{ut}$'s as

$y_{ut}^{*} = y_{ut}^{\lambda}$ --- eq(1)

Box and Cox (1964) have shown how the transformation parameter $\lambda$ in eq(1) may be estimated simultaneously with the other model parameters (overall mean and treatment effects) using the method of maximum likelihood. The procedure consists of performing, for various values of $\lambda$, a standard analysis of variance on

$y_{ut}^{(\lambda)} = \dfrac{y_{ut}^{\lambda} - 1}{\lambda\,\dot{y}^{\lambda - 1}}$ for $\lambda \neq 0$, and $y_{ut}^{(\lambda)} = \dot{y}\,\ln y_{ut}$ for $\lambda = 0$ --- eq(A)

where $\dot{y}$ is the geometric mean of the observations. The maximum likelihood estimate of $\lambda$ is the value for which the error sum of squares, say $SS_e(\lambda)$, is minimum. Notice that we cannot select the value of $\lambda$ by directly comparing the error sums of squares from analyses of variance on $y^{\lambda}$, because for each value of $\lambda$ the error sum of squares is measured on a different scale. Equation (A) rescales the responses so that the error sums of squares are directly comparable.

Therefore, $\lambda$ can be estimated in three different ways, i.e., by minimizing these error sums of squares.

This is a very general transformation and the commonly used transformations follow as particular cases. The particular cases for different values of $\lambda$ are given below.

λ	Transformation
1	No Transformation
1/2	Square Root
0	Log
-1/2	Reciprocal Square Root
-1	Reciprocal

If any one of the observations is zero, then the geometric mean is zero (and undefined when computed through logarithms). In Equation (A) the geometric mean appears in the denominator, so the expression cannot be computed. To solve this problem, we add a small quantity to each of the observations.
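The grid-search procedure can be sketched as follows, assuming the usual rescaled form of the Box-Cox transformation, $(y^{\lambda} - 1)/(\lambda\,\dot{y}^{\lambda-1})$ with $\dot{y}\ln y$ at $\lambda = 0$, and a one-way layout in which the error sum of squares is the within-treatment sum of squares. The data and function names are made up for illustration, and all observations are assumed to be strictly positive.

```python
import math

def geometric_mean(values):
    """Geometric mean, computed through logarithms (requires positive values)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized_power(y, lam, gm):
    """Rescaled power transformation, comparable across values of lambda."""
    if lam == 0:
        return gm * math.log(y)
    return (y ** lam - 1) / (lam * gm ** (lam - 1))

def error_ss(groups, lam):
    """Error (within-treatment) sum of squares of a one-way ANOVA
    carried out on the rescaled responses."""
    gm = geometric_mean([y for g in groups for y in g])
    total = 0.0
    for g in groups:
        z = [normalized_power(y, lam, gm) for y in g]
        mean = sum(z) / len(z)
        total += sum((v - mean) ** 2 for v in z)
    return total

def estimate_lambda(groups, grid):
    """Pick the lambda on the grid with the smallest error sum of squares."""
    return min(grid, key=lambda lam: error_ss(groups, lam))

# Hypothetical data with multiplicative treatment effects
groups = [[2, 3, 4], [20, 30, 40], [200, 300, 400]]
grid = [-1, -0.5, 0, 0.5, 1]
for lam in grid:
    print(lam, round(error_ss(groups, lam), 1))
print("selected lambda:", estimate_lambda(groups, grid))
```

In practice a finer grid would be used and the selected value rounded to a convenient entry of the table above. For these data the treatment effects are exactly multiplicative, so the search favours the log transformation.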

Once the transformation has been made, the analysis is carried out on the transformed data and all conclusions are drawn in the transformed scale. However, when presenting the results, the means and their standard errors are transformed back into the original units. In transforming back, certain corrections have to be made for the means. In the case of log-transformed data, if the mean value is $\bar{y}$, the mean value in the original units will be antilog($\bar{y} + 1.15\,V(\bar{y})$) instead of antilog($\bar{y}$). If the square root transformation had been used, the mean in the original scale would be $(\bar{y} + V(\bar{y}))^2$ instead of $(\bar{y})^2$, where $V(\bar{y})$ represents the variance of $\bar{y}$. No such correction is generally made in the case of the angular transformation. The inverse of the angular transformation is $p = (\sin\theta)^2$, where $\theta$ is the transformed angle.

Note: Examples discussed are for MS-Excel.

2:11 PM

By statisticalconcepts

In: Statistics

Conjoint Analysis

Conjoint analysis is a popular marketing research technique that marketers use to determine what features a new product should have and how it should be priced. It is a multivariate analysis technique that was introduced to marketers in the 1970s. Conjoint analysis is basically a data decomposition technique which tries to map the output data onto the joint space of the importance of each attribute. The important point to note is that the consumer is not asked to assign scores to the different attributes separately. The main steps involved in using conjoint analysis are: determining the salient attributes of the given product from the consumers' point of view, assigning a set of discrete levels or a range of continuous values to each attribute, using a fractional factorial design of experiment to design the stimuli, physically designing the stimuli, data collection, conjoint analysis, and determination of part-worth utilities. Possible applications of conjoint analysis include product design, market segmentation, SWOT analysis, etc.
In its original form, conjoint analysis is a main-effects analysis-of-variance problem with an ordinal scale-of-measurement dependent variable. Conjoint analysis decomposes rankings or rating-scale evaluation judgments of products into components based on qualitative attributes of the products. Attributes can include price, color, guarantee, environmental impact, and so on. A numerical utility or part-worth utility value is computed for each level of each attribute. The goal is to compute utilities such that the rank ordering of the sums of each product's set of utilities is the same as the original rank ordering or violates that ordering as little as possible. When a monotonic transformation of the judgments is requested, a nonmetric conjoint analysis is performed. Nonmetric conjoint analysis models are fit iteratively.
When the judgments are not transformed, a metric conjoint analysis is performed. Metric conjoint analysis models are fit directly with ordinary least squares. When all of the attributes are nominal, the metric conjoint analysis problem is a simple main-effects ANOVA model. In both metric and nonmetric conjoint analysis, the respondents are typically not asked to rate all possible combinations of the attributes. For example, with five attributes, three with three levels and two with two levels, there are 3×3×3×2×2 = 108 possible combinations. Rating that many combinations would be difficult for consumers, so typically only a small fraction of the combinations are rated. Typically, combinations are chosen from an orthogonal array which is a fractional-factorial design. The statistical technique of Fractional Factorial Design of Experiment finds out the minimum number of product designs which are necessary to use in the study and yet provide us all the information that we originally sought. These designs are also mutually independent (orthogonal) to avoid any redundancy in the data and allow the representation of each of the attributes and their respective levels in an unbiased manner.
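To make the main-effects idea concrete, the sketch below counts the profiles in the 3×3×3×2×2 example and then computes part-worths for one hypothetical respondent who rated every profile of a small 3×2 design. It relies on an assumption worth stating: in a balanced full factorial, the least-squares main-effects estimates (with effects coding) reduce to each level's mean rating minus the grand mean. For a fractional orthogonal array with unequal level counts or an unbalanced design, a regression fit would be needed instead. The attribute names and ratings are invented.

```python
from itertools import product
from statistics import mean

# Number of possible profiles for three 3-level and two 2-level attributes
full = list(product(range(3), range(3), range(3), range(2), range(2)))
print("possible combinations:", len(full))

def part_worths(profiles, ratings, attribute_names):
    """Main-effects part-worths for a balanced design:
    level mean minus grand mean, for each level of each attribute."""
    grand = mean(ratings)
    worths = {}
    for i, name in enumerate(attribute_names):
        for level in sorted({p[i] for p in profiles}):
            level_ratings = [r for p, r in zip(profiles, ratings) if p[i] == level]
            worths[(name, level)] = mean(level_ratings) - grand
    return grand, worths

# Hypothetical respondent rating all 3 x 2 = 6 profiles (brand, price)
profiles = list(product(["A", "B", "C"], ["low", "high"]))
ratings = [9, 6, 7, 4, 5, 2]

grand, worths = part_worths(profiles, ratings, ["brand", "price"])
for key, value in worths.items():
    print(key, value)

# Under the additive model, a profile's utility is the grand mean plus its part-worths
predicted = grand + worths[("brand", "A")] + worths[("price", "low")]
print("predicted rating for (A, low):", predicted)
```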

Conjoint Analysis Steps

1. The respondent is given a set of stimulus profiles (constructed along factorial design principles in the full profile case). In the two-factor approach, pairs of factors are presented, each appearing approximately an equal number of times.

2. The respondents rank or rate the stimuli according to some overall criterion, such as preference, acceptability, or likelihood of purchase.

3. In the analysis of the data, part-worths are identified for the factor levels such that each specific combination of part-worths equals the total utility of any given profile. A set of part-worths is derived for each respondent.

4. The goodness-of-fit criterion relates the derived ranking or rating of stimulus profiles to the original ranking or rating data.

5. A set of objects is defined for the choice simulator. Based on the previously determined part-worths for each respondent, the simulator computes a utility value for each of the objects defined as part of the simulation.

6. Choice simulator models are invoked which rely on decision rules (first choice model, average probability model or logit model) to estimate the respondent's object of choice. Overall choice shares are computed for the sample.
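Steps 5 and 6 can be sketched with a toy choice simulator. The part-worths below are invented for three hypothetical respondents, and the two decision rules shown (first choice and logit) are written in their simplest textbook form; commercial simulators may scale utilities or handle ties differently.

```python
import math

def total_utility(part_worths, obj):
    """Additive utility of an object: sum of the part-worths of its levels."""
    return sum(part_worths[(attr, level)] for attr, level in obj.items())

def first_choice_shares(respondents, objects):
    """Each respondent is assumed to choose the object with the highest utility."""
    counts = [0] * len(objects)
    for pw in respondents:
        utils = [total_utility(pw, o) for o in objects]
        counts[utils.index(max(utils))] += 1
    return [c / len(respondents) for c in counts]

def logit_shares(respondents, objects):
    """Choice probability proportional to exp(utility), averaged over respondents."""
    shares = [0.0] * len(objects)
    for pw in respondents:
        expu = [math.exp(total_utility(pw, o)) for o in objects]
        denom = sum(expu)
        for i, e in enumerate(expu):
            shares[i] += e / denom / len(respondents)
    return shares

# Hypothetical part-worths for three respondents
respondents = [
    {("brand", "A"): 1.0, ("brand", "B"): -1.0, ("price", "low"): 0.5, ("price", "high"): -0.5},
    {("brand", "A"): -1.0, ("brand", "B"): 1.0, ("price", "low"): 2.0, ("price", "high"): -2.0},
    {("brand", "A"): 0.5, ("brand", "B"): -0.5, ("price", "low"): 1.0, ("price", "high"): -1.0},
]

# Two objects defined for the simulation
objects = [
    {"brand": "A", "price": "high"},
    {"brand": "B", "price": "low"},
]

print("first choice shares:", first_choice_shares(respondents, objects))
print("logit shares:", [round(s, 3) for s in logit_shares(respondents, objects)])
```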

How to conduct Conjoint

While specific research objectives will dictate the direction of conjoint research, there are several components common to all conjoint engagements. These steps include: definition of attributes; establishment of attribute levels; choice of conjoint methodology; design of experiment; data collection; data analysis; and development of the market simulator.

Step 1: Definition of Attributes

To replicate the decision-making process, it is necessary to understand each of the attributes consumers consider when making an actual purchasing decision. Experience, previous research, and/or the specific research objectives will determine which attributes are of particular importance, and whether all product features should be displayed or only those most relevant to differentiating a product from competitive offerings.

Step 2: Establishment of Attribute Levels

Once attributes for the conjoint research have been defined, it must be determined how attributes will vary from one product concept to the next. This step involves the establishment of attribute levels. Attribute levels must be comprehensive enough to capture all of the products that exist, or will soon exist, within the marketplace. However, as with the definition of attributes, care must be taken to avoid respondent fatigue, so only the most prevalent attribute levels will be chosen for testing (typically 3-5 levels per attribute). Further, the number of attribute levels chosen has a direct impact on the number of concepts respondents will be asked to evaluate. The optimal number of attribute levels tested will be that which ensures research objectives are satisfied while minimizing the burden faced by respondents.

Step 3: Choice of Conjoint Methodology

Because no two product and/or service categories are exactly the same, there are a number of conjoint methodologies at a marketing researcher's disposal. The three primary methods used today include: conjoint value analysis (CVA), adaptive conjoint analysis (ACA), and choice-based conjoint analysis (CBC), with adaptive choice-based conjoint (ACBC) emerging as a new generation of conjoint analysis. For the purposes of this whitepaper, we will focus on CBC analysis, by far the most popular conjoint methodology currently used by researchers. Some types of Conjoint Methodologies include: 1. Choice-Based Conjoint (CBC) 2. Conjoint Value Analysis (CVA) 3. Adaptive Conjoint Analysis (ACA)

Step 4: Design of Experiment

Having established the methodology, attributes, and attribute levels to be tested, we can then create concept profiles (i.e., descriptions of product concepts using the attributes and attribute levels to be used in the research). Respondents are asked to evaluate a number of these concepts and, in the case of CBC, determine which, if any, they would choose to purchase given the opportunity. Fortunately, it is not necessary that every potential product offering be evaluated. In fact, this would be practically impossible, as there are typically thousands of potential product configurations in any given study. For example, there are 1800 hypothetical products in the energy bar study (3 brands x 5 protein levels x 6 carbohydrate levels x 4 flavors x 5 price levels). However, with a carefully constructed conjoint design, we are able to calculate respondent preference for each attribute and attribute level. Therefore, assuming a simple additive model (i.e., product preference is the sum of preference for its attributes), we can estimate how respondents would react to any product offering.
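The arithmetic behind this can be checked in a few lines. The sketch below multiplies out the level counts of the energy bar example and, as an added illustration, counts the free parameters of a main-effects additive model (one intercept plus one fewer than the number of levels for each attribute). The gap between the two numbers is what makes it feasible to show respondents only a fraction of the profiles.

```python
import math

# Level counts from the energy bar example
levels = {"brand": 3, "protein": 5, "carbohydrate": 6, "flavor": 4, "price": 5}

# Every possible product is one combination of levels
n_profiles = math.prod(levels.values())

# A main-effects additive model needs an intercept plus (levels - 1) per attribute
n_parameters = 1 + sum(k - 1 for k in levels.values())

print("hypothetical products:", n_profiles)
print("parameters in the additive model:", n_parameters)
```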

Step 5: Data Collection

An online survey is recommended for almost all conjoint research engagements, as it provides the most effective, cost-efficient, and timely solution with the highest data quality. Because respondents are required to consider a great deal of information, allowing them to visually assess the stimuli results in more reliable findings. An online presentation of product concepts and conjoint tasks allows respondents to complete the survey at their own pace, allowing time for thoughtful and accurate responses. With over 70% of U.S. adults accessing the Internet via computers at home, work, or school (Source: Pew Internet and American Life Project), an online methodology allows for data collection from a large sample set.

Step 6: Data Analysis

With a carefully constructed conjoint survey, we can statistically deduce the consumer values for each feature that respondents may be subconsciously using to evaluate concepts. Analysis of conjoint data yields a series of scores for each respondent for each attribute level. These scores, known as part-worths, are measured in an arbitrary unit of the utility consumers associate with a product and its attributes. Each score reflects the value the respondent associates with an attribute level, and is the building block from which all analysis is conducted. By assuming a simple additive model, we are able to build products and pricing structures, and then calculate the value consumers find in each product. By comparing this to other potential products in the marketplace, we can begin to understand how consumers will choose products in the real world.

Step 7: Development of Market Simulator

While preliminary analysis of conjoint data results in valuable insight regarding consumers and their preferences, the real value of conjoint analysis comes from the market simulators developed at the conclusion of the research engagement. The market simulator is a software program, similar to a spreadsheet, which allows users to conduct "what-if" analyses with data collected during conjoint fielding. As mentioned above, respondents can be asked to evaluate only a small fraction of concept profiles, yet still reveal how they would respond to any product offering. Therefore, it is possible to aggregate the preferences of all consumers to reveal how the market as a whole will respond to any product offering. Furthermore, we can assess how the marketplace will respond to two or more competing products by calculating the market’s share of preference for every product of interest.

Conjoint Analysis Survey Examples