Statistical Concepts and Analytics Explained

9:07 PM

By Analytics Concepts and Solutions

In: Data Mining, Statistics

Python Packages for Data Science

In order to do data analysis in Python, you should know a little about the main packages relevant to analysis. A Python library is a collection of functions and methods that lets you perform many actions without writing the underlying code yourself. Libraries usually contain built-in modules providing different functionality, which you can use directly, and some extensive libraries offer a broad range of facilities.

Infographic vector created by rawpixel dot com on freepik dot com

We have divided the Python data analysis libraries into three groups:

Scientific Computing Libraries

i. Pandas offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table with column and row labels, called a DataFrame. It is designed to provide easy indexing functionality.

ii. The NumPy library uses arrays for its inputs and outputs. It can be extended to objects for matrices, and with minor coding changes, developers can perform fast array processing.

iii. SciPy includes functions for some advanced math problems, such as integration, optimization and linear algebra.

Using data visualization methods is one of the best ways to communicate with others, showing them the meaningful results of an analysis.

Libraries to create graphs, charts and maps

i. The Matplotlib package is the most well-known library for data visualization. It is great for making graphs and plots. The graphs are also highly customizable.

ii. Seaborn is based on Matplotlib. It makes it very easy to generate various plots such as heat maps, time series, and violin plots.

Machine Learning algorithms:

i. The Scikit-learn library contains tools for statistical modeling, including regression, classification, clustering and so on. This library is built on NumPy, SciPy and Matplotlib.

ii. StatsModels is also a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
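
As a small illustration of the DataFrame and array ideas described above, here is a minimal sketch. It assumes pandas and NumPy are installed; the column names and the numbers are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: price and units sold for five weeks (illustrative values).
df = pd.DataFrame(
    {"price": [10.0, 9.5, 9.0, 8.5, 8.0], "units": [100, 110, 125, 130, 150]},
    index=["w1", "w2", "w3", "w4", "w5"],
)

# Label-based indexing: a single cell, then a summary of a whole column.
units_w3 = df.loc["w3", "units"]
mean_price = df["price"].mean()

# NumPy operates on whole arrays at once, with no explicit loop.
revenue = np.asarray(df["price"]) * np.asarray(df["units"])
total_revenue = float(revenue.sum())

print(units_w3, mean_price, total_revenue)
```

The row labels (`w1` to `w5`) and column labels are what make the table a DataFrame rather than a bare array.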

8:40 PM

By Analytics Concepts and Solutions

In: Statistics

Statistical Tools to Kick Start with Data Science

Below are some of the most common and convenient statistical tools, such as correlation, regression, the box plot and the line of best fit, to get started with data science.

Histogram

Histograms are a statistical way of representing the frequencies of data values in particular intervals. The more traditional description is that a histogram is a plot of a frequency table, where the height of each bar tells us how many data points are in that interval.
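
The frequency table behind a histogram can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the interval width and the sample values are all assumptions, not part of any particular library.

```python
from collections import Counter

def frequency_table(values, start, width, bins):
    """Count how many values fall in each interval [lo, lo + width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < bins:          # values outside the covered range are ignored
            counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(bins)]

# Hypothetical sample of 12 measurements.
data = [2, 3, 5, 7, 8, 11, 12, 12, 14, 15, 18, 19]
table = frequency_table(data, start=0, width=5, bins=4)

# A text histogram: the bar length is the count in each interval.
for lo, hi, n in table:
    print(f"{lo:>2}-{hi:<2} | {'#' * n}")
```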

Box Plot

Box plots, sometimes referred to as 'Box and Whisker Plots', are a visual way of summarizing the basic characteristics of a data set. Box plots show the highest and lowest values (that are not outliers), the middle value (the median), and the values at the first and third quartiles. Outliers are shown as dots beyond the 'whiskers'. This gives us a simple way to quickly understand the spread of the data and is great for quickly comparing two data sets.

Think of the quartiles this way: if your data set were ranked from its lowest to its highest value, Q1 would be the middle of the low values (those below the median) and Q3 would be the middle of the high values (those above the median). Box plots break a data set into four parts: below Q1, Q1 to the median (Q2), the median to Q3, and above Q3. The interquartile range (IQR) is defined as Q3 - Q1.

The lower edge of the box is Q1, the line inside the box is the median, and the upper edge of the box is Q3.

Outliers are typically defined using the distance 1.5 * (Q3 - Q1): a data point that lies more than 1.5 times the interquartile range below Q1 or above Q3 (the edges of the box) is considered an outlier. The whiskers of the box plot are lines from each end of the box out to the farthest data point that is not an outlier.
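
The numbers a box plot draws can be computed directly. The sketch below uses Python's standard `statistics.quantiles` with the "inclusive" method; quartile conventions differ between tools, so other software may give slightly different Q1 and Q3. The function name `box_plot_summary` and the data are illustrative assumptions.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # farthest non-outlier points
        "outliers": outliers,
    }

# Hypothetical data set with one unusually large value.
summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the value 30 lies above Q3 + 1.5 * IQR, so it is reported as an outlier and the upper whisker stops at 9.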

Scatter Plot

Scatter Plots are charts that visualize the relationship between two sets of data. Every data point for one variable is plotted against a corresponding data point for another variable and the resultant pattern looks like scattered data points.

One of the most common relationships that can be shown in Scatter Plots is the relationship between a product's price and the number of units sold. Typically when the price goes down (a sale), more units are sold and when the price goes up, people buy less of a product. There are exceptions but that tends to be the case.

The thing we want to reduce uncertainty about goes on the Y axis. We want to know what our sales will be if our price changes. The variable on the Y axis, sales, is the dependent variable. The variable on the X axis is the independent variable and it's the thing we can change, increase or decrease, to see what happens to Y.

To plot a Scatter Plot, we take each value of Y and X for a particular observation e.g., a day, a week, a month and we mark them on a graph. Done.

A Scatter Plot showing many dots following the line between 11 and 5 on a clock shows a negative linear relationship, e.g., when price goes up, sales go down. A Scatter Plot running from 1 to 7 on a clock shows a positive relationship: when X goes up, so does Y.

Correlation

Correlation is a numerical way of describing the relationship between two variables. A regression analysis uses the 'least squares method' to fit a line through a scatter plot, and the quality of that fit is measured by R Squared.

The correlation coefficient, r, is the square root of R Squared, taking on the sign (+ or -) of the slope of the fitted line. When r is high and positive, we say that there is a positive correlation, e.g., when X goes up, Y goes up.

Correlation, or r, measures the tightness of the scatter dots around the line of best fit, and its sign tells us whether Y goes up or down as X increases. An r of 0 (zero) means that there is no linear relationship between X and Y. An r of 1 means that there is a perfect positive relationship (when X goes up, Y goes up), and an r of -1 means that there is a perfect negative relationship.
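
To make r concrete, here is a minimal sketch that computes the correlation coefficient from its definition. The price and sales figures are hypothetical, chosen to mimic the price example above, and `pearson_r` is just an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: as the price falls, more units are sold.
price = [10, 9, 8, 7, 6]
units = [100, 115, 120, 140, 150]

r = pearson_r(price, units)
print(round(r, 4))
```

The result is close to -1: a strong negative relationship, as expected when lower prices go with higher sales.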

Line of Best Fit

A line of best fit is a line drawn through a scatter plot so that the total squared vertical distance between the line and the scatter data points is as small as possible. This is traditionally called a 'Least Squares Line' and it follows the formula y = mX + C, where m is the slope and C is the intercept. Imagine a line running through a scatter plot. Each point on that line has an X and a Y value. The least squares method says that, for each scatter dot, we subtract the line's y value at that X from the dot's Y value. Our intention is to sum all these differences; however, some of them will be negative, because some scatter dots lie below the line, so we square each difference before summing.

This summation is called the 'Sum of Squared Errors' (SSE). The Line of Best Fit is the one for which the SSE is the lowest number it can be. When the dots are all on or very close to that line, the line is a good way of predicting Y values, given values of X.
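
The least squares idea can be sketched as follows. The closed-form slope and intercept used here are the standard least squares solution for one X variable; the data and the function names are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Slope m and intercept c of the line y = m*x + c that minimizes the SSE."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - m * mean_x
    return m, c

def sum_squared_errors(xs, ys, m, c):
    """Sum of squared vertical distances between the dots and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical price (X) and units sold (Y).
price = [10, 9, 8, 7, 6]
units = [100, 115, 120, 140, 150]

m, c = least_squares_line(price, units)
best_sse = sum_squared_errors(price, units, m, c)
other_sse = sum_squared_errors(price, units, -10.0, 205.0)  # some other line
print(m, c, best_sse, other_sse)
```

Any other line, such as the second one tried here, gives a larger SSE than the least squares line.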

Regression

In a scatter plot with a line of best fit running through it, we can use regression to assess how well the X variable explains the changes in the Y variable, e.g., what percentage of the variation in Sales is explained by changes in Price?

When we find a high 'R Squared', in percentage terms, the changes in Y are largely explained by the changes in X. If the R Squared is 95%, we say that, 95% of the variation in Sales is explained by the change in Price and the rest, the 5%, is due to error. The error referred to here is the distances between the dots in the scatter and the line of best fit.
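
As a rough sketch of where R Squared comes from, it can be computed as one minus the ratio of the unexplained variation (the SSE around the fitted line) to the total variation of Y around its mean. The data below are hypothetical and `r_squared` is an illustrative name.

```python
def r_squared(xs, ys):
    """Share of the variation in ys explained by a least squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - m * mean_x
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - mean_y) ** 2 for y in ys)                   # total
    return 1 - sse / sst

# Hypothetical price (X) and units sold (Y).
price = [10, 9, 8, 7, 6]
units = [100, 115, 120, 140, 150]

r2 = r_squared(price, units)
print(f"{r2:.1%} of the variation in units sold is explained by price")
```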

There are many other components to a Regression table output including, confidence intervals, P Values and t ratios.

2:22 PM

By statisticalconcepts

In: Web Analytics

Adobe Analytics Connector - Microsoft Power BI Desktop

In December 2017, Microsoft Power BI is releasing a new connector for Adobe Analytics (formerly Omniture). This new connector will allow you to import and analyze your Adobe Analytics data within Power BI.
Because the connector is in beta, you first have to enable it under Options and Settings >> Options >> Preview Features before the Adobe connector appears under Online Services.

The Adobe Analytics connector can then be found in the Get Data dialog, under the Online Services category.

After the connection is established, you can view and select multiple dimensions and measures within the Navigator dialog box to create a tabular output. You can also provide input parameters for the selected items.

After making your selection, you can load the data into Power BI Desktop, or use the Edit option in this dialog to apply additional data transformations and filters in the Query Editor.

Once you select your required dimensions and measures, the table connectors are created automatically in Power BI, and you will see the fields that were created as well as the editor window.

10:37 AM

By statisticalconcepts

In: Statistics

Skewness and Kurtosis in Statistics

The average and a measure of dispersion can describe a distribution, but they are not sufficient to describe its nature (its shape). For this purpose we use two other concepts, known as skewness and kurtosis. Symmetrical and skewed distributions can be shown as curves.

Skewness

Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are distributed in the same way on both sides of the mean. For example, the following distribution is symmetrical about its mean 3.

x : 1 2 3 4 5

frequency (f ) : 5 9 12 9 5

In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.

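Several measures are used to express the direction and extent of the skewness of a distribution. The most important are those given by Pearson. The first is the coefficient of skewness:

$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$$

or, using the empirical relation Mean - Mode ≈ 3 (Mean - Median),

$$S_k = \frac{3\,(\text{Mean} - \text{Median})}{\text{S.D.}}$$

For a symmetric distribution S_k = 0. If the distribution is negatively skewed then S_k is negative and if it is positively skewed then S_k is positive. The range for S_k (in the median form) is from -3 to 3.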

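The other measure uses the β (read ‘beta’) coefficient, which is given by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

where μ₂ and μ₃ are the second and third central moments. The second central moment μ₂ is nothing but the variance. The sample estimate of this coefficient is

$$b_1 = \frac{m_3^2}{m_2^3}$$

where m₂ and m₃ are the sample central moments given by

$$m_2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad m_3 = \frac{\sum (x - \bar{x})^3}{n}$$

(for a frequency distribution, each term is weighted by its frequency f and n = Σf).

For a symmetrical distribution b₁ = 0. Skewness is positive or negative depending upon whether m₃ is positive or negative.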

Kurtosis

A measure of the peakedness or convexity of a curve is known as kurtosis.

It is clear from the above figure that all three curves, (1), (2) and (3), are symmetrical about the mean. Still, they are not of the same type: each has a different peak from the others. Curve (1) is known as mesokurtic (normal curve), Curve (2) as leptokurtic (peaked curve) and Curve (3) as platykurtic (flat curve). Kurtosis is measured by Pearson’s coefficient β₂ (read ‘beta two’). It is given by

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

The sample estimate of this coefficient is

$$b_2 = \frac{m_4}{m_2^2}$$

where m₄ is the fourth sample central moment, given by

$$m_4 = \frac{\sum (x - \bar{x})^4}{n}$$

The distribution is called normal if b₂ = 3. When b₂ is more than 3 the distribution is said to be leptokurtic. If b₂ is less than 3 the distribution is said to be platykurtic.
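
The moment-based coefficients can be checked on the symmetrical frequency distribution given earlier (x = 1 to 5 with frequencies 5, 9, 12, 9, 5). This is a minimal sketch that assumes the sample moments are computed with divisor n = Σf; the function name `central_moment` is illustrative.

```python
def central_moment(xs, fs, k):
    """k-th central moment of a frequency distribution (divisor n = sum of f)."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(xs, fs)) / n

x = [1, 2, 3, 4, 5]
f = [5, 9, 12, 9, 5]

m2 = central_moment(x, f, 2)   # the variance
m3 = central_moment(x, f, 3)
m4 = central_moment(x, f, 4)

b1 = m3 ** 2 / m2 ** 3         # skewness coefficient
b2 = m4 / m2 ** 2              # kurtosis coefficient
print(m2, m3, m4, b1, round(b2, 4))
```

Because the distribution is symmetrical, m₃ and b₁ are 0. The value of b₂ is below 3, so by the rule above this small distribution would be classed as platykurtic.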

11:33 AM

By statisticalconcepts

In: Statistics

Measures of Dispersion in Statistics

We know that averages are representative of a frequency distribution, but they fail to give a complete picture of it. They tell us nothing about the scatter of the observations within the distribution.

Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each.

The distribution may be as follows:

Variety I 45 42 42 41 40

Variety II 54 48 42 33 30

The mean yields of the two varieties are almost the same (42 kg for Variety I and 41.4 kg for Variety II), but we cannot say that the two varieties perform equally. There is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second. The first variety may be preferred since it is more consistent in yield performance. From this example, it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution; we should also have a measure of the scatter of the observations. The scatter or variation of observations around their average is called dispersion. There are different measures of dispersion, such as the range, the quartile deviation, the mean deviation and the standard deviation.
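
The comparison of the two varieties can be reproduced with Python's standard `statistics` module. This is only a quick sketch; the dictionary layout is an arbitrary choice.

```python
import statistics

varieties = {
    "Variety I": [45, 42, 42, 41, 40],
    "Variety II": [54, 48, 42, 33, 30],
}

# Similar averages, very different scatter.
summary = {
    name: {
        "mean": statistics.mean(yields),
        "range": max(yields) - min(yields),
        "sd": statistics.pstdev(yields),   # population standard deviation
    }
    for name, yields in varieties.items()
}

for name, stats in summary.items():
    print(name, stats)
```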

Range

The simplest measure of dispersion is the range. The range is the difference between the maximum and minimum values in a group of observations. For example, suppose that the yields (kg per plot) of a variety from five plots are 8, 9, 8, 10 and 11. The range is (11 - 8) = 3 kg. In practice the range is indicated as 8 - 11 kg.

The range takes only the maximum and minimum values into account and not all the values. Hence it is a very unstable or unreliable indicator of the amount of deviation. It is affected by extreme values. In the above example, if we have 15 instead of 11, the range will be (15 - 8) = 7 kg. To avoid these difficulties, another measure of dispersion, called the quartile deviation, is preferred.

Quartile Deviation

We can delete the values below the first quartile and the values above the third quartile. It is assumed that the unusually extreme values are eliminated in this way. We can then take the mean of the deviations of the two quartiles from the second quartile (median). That is,

$$\text{Q.D.} = \frac{(Q_3 - Q_2) + (Q_2 - Q_1)}{2} = \frac{Q_3 - Q_1}{2}$$

This quantity is known as the quartile deviation (Q.D.).

The quartile deviation is more stable than the range as it depends on two intermediate values. It is not affected by extreme values since the extreme values are already removed. However, the quartile deviation also fails to take all the values into account.
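
A minimal sketch of the quartile deviation, using `statistics.quantiles` from the Python standard library with the "inclusive" method (other quartile conventions can give slightly different values). The yields are hypothetical.

```python
import statistics

def quartile_deviation(values):
    """Half the distance between the third and first quartiles."""
    q1, _q2, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return (q3 - q1) / 2

# Hypothetical yields (kg per plot) from nine plots.
yields = [30, 33, 36, 40, 42, 45, 48, 50, 54]
qd = quartile_deviation(yields)

# Replacing the largest value by an extreme one leaves Q.D. unchanged.
qd_extreme = quartile_deviation(yields[:-1] + [90])
print(qd, qd_extreme)
```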

Mean Deviation

Mean deviation is the mean of the deviations of individual values from their average. The average may be either the mean or the median. For raw data the mean deviation from the median is the least. Therefore, the median is considered most suitable for raw data. Usually, however, the mean is used to find the mean deviation. The mean deviation is given by

$$\text{M.D.} = \frac{\sum |x - \bar{x}|}{n}$$

for raw data and

$$\text{M.D.} = \frac{\sum f\,|x - \bar{x}|}{\sum f}$$

for grouped data.

All positive and negative differences are treated as positive values; hence we use the modulus symbol | |. We read $|x - \bar{x}|$ as “modulus $x - \bar{x}$”. If we take $x - \bar{x}$ as such, the sum of the deviations, $\sum (x - \bar{x})$, will be 0. Hence, if the signs are not eliminated, the mean deviation will always be 0, which is not correct.

The steps of computation are as follows :

Step 1: If the classes are not continuous we have to make them continuous.

Step 2: Find out the mid values of the classes (mid - X = x).

Step 3: Compute the mean.

Step 4: Find out $|x - \bar{x}|$ for all values of x.

Step 5: Multiply each $|x - \bar{x}|$ by the corresponding frequency.

Step 6: Use the formula.

The mean deviation takes all the values into consideration. It is fairly stable compared to the range or the quartile deviation. However, since the mean deviation ignores the signs of the deviations, it is not well suited to further statistical analysis, and it is not as stable as the standard deviation, which is defined below.
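
The computation can be sketched for both raw and grouped data. The raw values are the five yields from the range example, and the grouped values are the symmetrical frequency distribution from the skewness article; `mean_deviation` is an illustrative name.

```python
def mean_deviation(values, freqs=None):
    """Mean of the absolute deviations from the mean (optionally weighted by f)."""
    if freqs is None:
        freqs = [1] * len(values)          # raw data: each value counts once
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

# Raw data: yields (kg per plot) from five plots.
raw_md = mean_deviation([8, 9, 8, 10, 11])

# Grouped data: values x with frequencies f.
grouped_md = mean_deviation([1, 2, 3, 4, 5], [5, 9, 12, 9, 5])

# Without the modulus, the signed deviations cancel out.
yields = [8, 9, 8, 10, 11]
signed_sum = sum(y - sum(yields) / len(yields) for y in yields)
print(raw_md, grouped_md, signed_sum)
```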

Standard Deviation

Ignoring the signs of the deviations is mathematically not correct. Instead, we may square each deviation, which makes the negative values positive. After calculating the average of the squared deviations, we can express it in the original units by taking its square root. This measure of variation is known as the standard deviation.

The standard deviation is defined as the square root of the mean of the squared deviations of individual values from their mean. Symbolically,

$$\text{Standard Deviation (S.D.) or } \sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

This is called the standard deviation because it indicates a sort of group standard spread of values around their mean. For grouped data it is given as

$$\text{Standard Deviation (S.D.) or } \sigma = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{\sum f}}$$

The sample standard deviation is used to estimate the population standard deviation. To make the sample variance an unbiased estimate of the population variance, we substitute n - 1 for n in the formula. Thus, the sample standard deviation is written as

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

For grouped data it is given by

$$s = C \times \sqrt{\frac{\sum f d^2 - \frac{(\sum f d)^2}{n}}{n - 1}}$$

where,

C = class interval

d = (x - A) / C as given under mean, with A the assumed mean (working origin) and n = Σf.

The square of the standard deviation is known as the variance. In the analysis of variance technique, the term $\sum (x - \bar{x})^2$ is called the sum of squares, and the variance is called the mean square. The standard deviation is denoted by s in the case of a sample, and by σ (read ‘sigma’) in the case of a population.

The standard deviation is the most widely used measure of dispersion. It takes all the items into consideration. It is more stable compared to other measures. However, it will be inflated by extreme items as is the mean.

The standard deviation has some additional special characteristics. It is not affected by adding or subtracting a constant value to each observed value. It is affected by multiplying or dividing each observation by a constant. When the observations are multiplied by a constant, the resulting standard deviation is the product of the original standard deviation and the (absolute value of the) constant. (Note that division of all observations by a constant C is equivalent to multiplication by its reciprocal, 1/C. Subtracting a constant C is equivalent to adding a constant, - C.)
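
The definitions and the two properties above can be verified with Python's standard `statistics` module, where `pstdev` divides by n and `stdev` divides by n - 1. The data are the five yields from the range example; the shift of 100 and the factor of 3 are arbitrary choices.

```python
import statistics

yields = [8, 9, 8, 10, 11]

pop_sd = statistics.pstdev(yields)     # population form, divisor n
sample_sd = statistics.stdev(yields)   # sample form, divisor n - 1

# Adding a constant leaves the standard deviation unchanged ...
shifted_sd = statistics.stdev([y + 100 for y in yields])
# ... while multiplying by a constant multiplies it by that constant.
scaled_sd = statistics.stdev([3 * y for y in yields])

print(round(pop_sd, 4), round(sample_sd, 4), round(shifted_sd, 4), round(scaled_sd, 4))
```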

Standard deviations can be pooled. If the sum of squares for the first distribution with n₁ observations is SS₁, and the sum of squares for the second distribution with n₂ observations is SS₂, then the pooled standard deviation is given by

$$s_p = \sqrt{\frac{SS_1 + SS_2}{n_1 + n_2 - 2}}$$
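
As a sketch of pooling, assuming the usual form in which the combined sum of squares is divided by n₁ + n₂ - 2, the two paddy varieties from the start of this article can be pooled as follows. The function names are illustrative.

```python
import math

def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def pooled_sd(a, b):
    """Pooled standard deviation of two samples."""
    return math.sqrt((sum_of_squares(a) + sum_of_squares(b))
                     / (len(a) + len(b) - 2))

variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

ss1 = sum_of_squares(variety_1)
ss2 = sum_of_squares(variety_2)
sp = pooled_sd(variety_1, variety_2)
print(ss1, round(ss2, 1), round(sp, 4))
```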

Measures of Relative Dispersion

Suppose that the two distributions to be compared are expressed in the same units and their means are equal or nearly equal. Then their variability can be compared directly by using their standard deviations. However, if their means are widely different or if they are expressed in different units of measurement, we can not use the standard deviations as such for comparing their variability. We have to use the relative measures of dispersion in such situations.

There are relative measures of dispersion based on the range, the quartile deviation, the mean deviation, and the standard deviation. Of these, the coefficient of variation, which is related to the standard deviation, is the most important. The coefficient of variation is given by,

C.V. = (S.D. / Mean) x 100

The C.V. is a unit-free measure. It is always expressed as a percentage. The C.V. will be small if the variation is small. Of two groups, the one with the smaller C.V. is said to be more consistent.

The coefficient of variation is unreliable if the mean is near zero. Also it is unstable if the measurement scale used is not ratio scale. The C.V. is informative if it is given along with the mean and standard deviation. Otherwise, it may be misleading.
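
A short sketch of the coefficient of variation, applied to the two paddy varieties from the start of this article. It assumes the sample standard deviation (divisor n - 1) is used; the function name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """C.V. = (S.D. / Mean) x 100, using the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv_1 = coefficient_of_variation([45, 42, 42, 41, 40])   # Variety I
cv_2 = coefficient_of_variation([54, 48, 42, 33, 30])   # Variety II
print(f"Variety I: {cv_1:.2f}%  Variety II: {cv_2:.2f}%")
```

Variety I has the smaller C.V., so it is the more consistent of the two.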

7:17 AM

By statisticalconcepts

In: Statistics

How to Choose Sample Size for a Simple Random Sample


Before we proceed to the concept, consider the following problem. You are conducting a survey. The sampling method is simple random sampling, without replacement. You want your survey to provide a specified level of precision.

To choose the right sample size for a simple random sample, you need to define the following inputs.

1. Specify the desired margin of error ME. This is your measure of precision.
2. Specify alpha.
   - For a hypothesis test, alpha is the significance level.
   - For an estimation problem, alpha is 1 - confidence level.
3. Find the critical standard score z.
   - For an estimation problem or for a two-tailed hypothesis test, the critical standard score (z) is the value for which the cumulative probability is 1 - alpha/2.
   - For a one-tailed hypothesis test, the critical standard score (z) is the value for which the cumulative probability is 1 - alpha.
4. Unless the population size is very large, specify the size of the population (N).

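Given these inputs, the following formulas find the smallest sample size that provides the desired level of precision. In these formulas, σ is the population standard deviation, p is the population proportion, and q = 1 - p.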

Sample statistic	Population size	Sample size
Mean	Known	n = { z² * σ² * [ N / (N - 1) ] } / { ME² + [ z² * σ² / (N - 1) ] }
Mean	Unknown	n = ( z² * σ² ) / ME²
Proportion	Known	n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]
Proportion	Unknown	n = [ ( z² * p * q ) + ME² ] / ( ME² )

This approach works when the sample size is relatively large (greater than or equal to 30). Use the first or third formulas when the population size is known. When the population size is large but unknown, use the second or fourth formulas.

For proportions, the sample size requirements vary, based on the value of the proportion. If you are unsure of the right value to use, set p equal to 0.5. This will produce a conservative sample size estimate; that is, the sample size will produce at least the precision called for and may produce better precision.

Sample Problem

At the end of every school year, the state administers a reading test to a simple random sample drawn without replacement from a population of 100,000 third graders. Over the last five years, students who took the test correctly answered 75% of the test questions.

What sample size should you use to achieve a margin of error equal to plus or minus 4%, with a confidence level of 95%?

Solution: To solve this problem, we follow the steps outlined above.

Specify the margin of error. This was given in the problem definition. The margin of error is plus or minus 4% or 0.04.

Specify the confidence level. This was also given. The confidence level is 95% or 0.95.

Compute alpha. Alpha is equal to one minus the confidence level. Thus, alpha = 1 - 0.95 = 0.05.

Determine the critical standard score (z). Since this is an estimation problem, the critical standard score is the value for which the cumulative probability is 1 - alpha/2 = 1 - 0.05/2 = 0.975.

To find that value, we use the Normal Calculator. Recall that the distribution of standard scores has a mean of 0 and a standard deviation of 1. Therefore, we plug the following entries into the normal calculator: cumulative probability = 0.975; Mean = 0; and Standard deviation = 1. The calculator tells us that the value of the standard score is 1.96.

And finally, we assume that the population proportion p is equal to its past value over the previous 5 years. That value is 0.75, so q = 1 - p = 0.25. Given these inputs, we can find the smallest sample size n that will provide the required margin of error.

n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]

n = [ (1.96)² * 0.75 * 0.25 + 0.0016 ] / [ 0.0016 + (1.96)² * 0.75 * 0.25 / 100,000 ]

n = (0.7203 + 0.0016) / (0.0016 + 0.0000072) = 449.2

Therefore, to achieve a margin of error of plus or minus 4 percent, we will need to survey 450 students, using simple random sampling.
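
The steps of this solution can be sketched in Python, using the proportion formulas from the table above and `NormalDist` from the standard `statistics` module in place of the normal calculator. The function name and parameter names are illustrative assumptions.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p, margin, confidence, population=None):
    """Smallest sample size for estimating a proportion (formulas from the table)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical standard score
    q = 1 - p
    numerator = z ** 2 * p * q + margin ** 2
    if population is None:                    # population size unknown
        denominator = margin ** 2
    else:                                     # population size known
        denominator = margin ** 2 + z ** 2 * p * q / population
    return math.ceil(numerator / denominator)  # always round up

n = sample_size_proportion(p=0.75, margin=0.04, confidence=0.95,
                           population=100_000)
n_unknown = sample_size_proportion(p=0.75, margin=0.04, confidence=0.95)
print(n, n_unknown)
```

Rounding up rather than to the nearest integer ensures that the sample is never smaller than the formula requires.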

3:42 PM

By statisticalconcepts

In: Statistics

Introduction to Normal Distribution in Statistics

A continuous random variable has an infinite number of values that can be represented by an interval on the number line. Its probability distribution is called a continuous probability distribution. In this article, we look at the most important continuous probability distribution in statistics, the normal distribution.

A normal distribution is a continuous probability distribution for a random variable, x. The graph of a normal distribution is called the normal curve. A normal distribution has the following properties.

1. The mean, median and mode are equal.

2. The normal curve is bell-shaped and is symmetric about the mean.

3. The total area under the normal curve is equal to 1.

4. The normal curve approaches, but never touches the x-axis as it extends farther and farther away from the mean.

5. Between μ - σ and μ + σ (in the center of the curve) the graph curves downward. The graph curves upward to the left of μ - σ and to the right of μ + σ. The points at which the curve changes from curving upward to curving downward are called inflection points.

6. A normal distribution can have any mean and any positive standard deviation. These two parameters, μ and σ, completely determine the shape of a normal curve. The mean gives the location of the line of symmetry and the standard deviation describes how much the data are spread out.

See the line of symmetry for each curve? That’s the mean. If a curve is fatter, its standard deviation is greater. That’s the difference.

Understanding Mean & Standard Deviation

Which normal curve has a greater mean?

Which normal curve has a greater standard deviation?

The line of symmetry of curve A occurs at x = 15. The line of symmetry of curve B occurs at x = 12. So, curve A has a greater mean.

Curve B is more spread out than curve A, so curve B has a greater standard deviation.

The Empirical Rule

In a normal distribution with mean μ and standard deviation σ, you can approximate areas under the normal curve as follows:

1. About 68% of the area lies between μ - σ and μ + σ
2. About 95% of the area lies between μ - 2σ and μ + 2σ
3. About 99.7% of the area lies between μ - 3σ and μ + 3σ
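
The three percentages can be checked with `NormalDist` from Python's standard `statistics` module. This is a small sketch; the function name and the second example (mean 15, standard deviation 4) are arbitrary choices.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")

# The areas do not depend on which mean and standard deviation are used.
same_area = area_within(2, mu=15, sigma=4)
```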

10:40 PM

By statisticalconcepts

In: Web Analytics

Storytelling with Data - Web Analysis

Why Storytelling?
Visual analysis means exploring data visually. A story unfolds as you navigate from one visual summary into another.
You and your team have sorted through and analyzed a dense data set, made industry-relevant discoveries and created data visualizations that allow you to share those insights with others—whether other team members, current or potential clients, or the community at large. Before you present your work, think about your audience and the goals you want to achieve.
Keep in mind that the people you want to communicate with may not have the same background or technical knowledge that you do. Although eager to understand, audience members may need additional explanation, simplified insights or a more slowly paced presentation to grasp your point of view. Making the data accessible to your audience is your responsibility.
A tried-and-true method of connecting with your audience is to embed your data insights within a story. The story framework will capture your audience's attention and help you meet the business objectives driving your work.

Story Framework
Before you embark on your data journey, identify the questions you want to answer. You may have a specific question in mind or a general area you'd like to explore. As you dig into your data, you may find an entirely new or unexpected story, but it helps to have a starting point.

Below is an example of one story framework for web analysis, showing how to visualize online campaign performance.

and the output...

6:47 AM

By statisticalconcepts

In: Statistics

Statistical p-values

When the results of studies or research are reported, important decisions are made on the basis of these results. For example, new varieties are often tested against standard varieties to determine whether the new varieties are more effective. Several methods of manufacturing may be compared to select the best technique to manufacture the best product. Several pieces of evidence may be examined to determine whether there is a possible link between an activity and a result. In such studies, results are summarized by a statistical test, and a decision about the significance of the result is based on a p-value. Therefore, it is important for the reader to know what the p-value is all about.
To describe how the p-value works, we'll use a common statistical test as an example, the Student's t-test for independent groups. For this test, subjects are randomly assigned to one of two groups. Some treatment is performed on the subjects in one group, and the other group acts as a control where no treatment or a standard treatment is given. For this example, suppose group one is given a new drug and group two is given the standard drug. Time to relief is measured for both groups. The outcome measurement is assumed to be a continuous variable which is normally distributed, and it is assumed that the population variance for the measure is the same for both groups.
For this example the sample mean for group one is 10 and the sample mean for group two is 12. The sample standard deviation for group one is 1.8 and the sample standard deviation for group two is 1.9. The sample size for both groups is 12. Entering this data into a statistical program will produce a t-statistic and a p-value: calculated t = -2.65 with 22 degrees of freedom, and a p-value of 0.0147. This means that you have evidence that the mean time to relief for group one was significantly different from that for group two.
To interpret this p-value, you must first know how the test was structured. In the case of this two-sided t-test, the hypotheses are:
H₀: μ₁ = μ₂ (Null hypothesis: means of the two groups are equal)
Hₐ: μ₁ ≠ μ₂ (Alternative: means of the two groups are not equal)
A low p-value for the statistical test points to rejection of the null hypothesis because it indicates how unlikely it is that a test statistic as extreme as or more extreme than the one given by this data would be observed from this population if the null hypothesis were true. Since p=0.015, this means that if the population means were equal as hypothesized (under the null), there is a 15 in 1000 chance that a test statistic at least this extreme would be obtained using data from this population. If you agree that there is enough evidence to reject the null hypothesis, you conclude that there is significant evidence to support the alternative hypothesis.
The researcher decides what significance level to use, i.e., what cutoff point will decide significance. The most commonly used level of significance is 0.05. When the significance level is set at 0.05, any test resulting in a p-value under 0.05 would be significant. Therefore, you would reject the null hypothesis in favor of the alternative hypothesis. Since you are comparing only two groups, you can look at the sample means to see which is larger. The sample mean of group one is smaller, so you conclude that medicine one acted significantly faster, on average, than medicine two. This would be reported in an article using a phrase like this: "The mean time to relief for group one was significantly smaller than for group two (two-sided t-test, t(22) = -2.65, p=0.015)."
P-values do not simply provide you with a Yes or No answer; they provide a sense of the strength of the evidence against the null hypothesis. The lower the p-value, the stronger the evidence.
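
The t-statistic and p-value quoted in this example can be reproduced from the summary figures alone. A statistics package would normally do this (SciPy, for example, provides `scipy.stats.ttest_ind_from_stats`); the sketch below uses only the standard library, with the p-value obtained by numerically integrating the t density using Simpson's rule. The function names are illustrative.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance (pooled) two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    b = abs(t)
    h = b / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # density is symmetric about 0
    return 1 - central_area

t_stat, df = pooled_t(10, 1.8, 12, 12, 1.9, 12)
p_value = two_sided_p(t_stat, df)
print(round(t_stat, 2), df, round(p_value, 4))
```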

8:44 AM

By statisticalconcepts

In: Statistics

Cautions about Regression and Correlation

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, you must be aware of their limitations.

■ Correlation and regression lines describe only linear relationships. You can do the calculations for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern.
■ Correlation and least-squares regression lines are not resistant. Always plot your data and look for observations that may be influential.
■ Extrapolation. Suppose that you have data on a child’s growth between 3 and 8 years of age. You find a strong linear relationship between age x and height y. If you fit a regression line to these data and use it to predict height at age 25 years, you will predict that the child will be 8 feet tall. Growth slows down and then stops at maturity, so extending the straight line to adult ages is foolish. Few relationships are linear for all values of x. Don’t make predictions far outside the range of x that actually appears in your data.
■ Lurking variable. The relationship between two variables can often be understood only by taking other variables into account. Lurking variables can make a correlation or regression misleading.

You should always think about possible lurking variables before you draw conclusions based on correlation or regression.

EXTRAPOLATION
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.

LURKING VARIABLE
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
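
The danger of extrapolation can be illustrated with a small sketch of the child-growth example. The heights below are hypothetical values invented for illustration (a steady 2.5 inches per year between ages 3 and 8), not real measurements.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = m*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return m, mean_y - m * mean_x

# Hypothetical heights (inches) for one child at ages 3 to 8.
ages = [3, 4, 5, 6, 7, 8]
heights = [37.5, 40.0, 42.5, 45.0, 47.5, 50.0]

m, c = fit_line(ages, heights)

inside = m * 6 + c     # prediction within the observed age range
outside = m * 25 + c   # extrapolation far beyond the data
print(f"age 6: {inside} in; age 25: {outside} in ({outside / 12:.1f} ft)")
```

Inside the observed range the line predicts well, but at age 25 it predicts a height of nearly 8 feet, because the line knows nothing about growth stopping at maturity.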