Statistical Tools to Kick Start with Data Science

This article introduces some of the most common and convenient statistical tools for getting started with data science: histograms, box plots, scatter plots, correlation, the line of best fit and regression.

Histogram

Histograms are a statistical way of representing the frequencies of data values in particular intervals. The more traditional description is that a histogram is a plot of a frequency table in which the height of each bar tells us how many data points fall in that interval.
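
As a rough illustration of the frequency table behind a histogram, the sketch below counts how many values land in each interval and prints a text version of the bars. The ages are made-up data and frequency_table is a hypothetical helper written for this example, not a library function.

```python
# Hypothetical data: ages of 12 survey respondents (made up for illustration).
ages = [22, 25, 27, 31, 34, 35, 38, 41, 44, 47, 52, 58]


def frequency_table(values, start, width, n_bins):
    """Count how many values fall in each interval [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the covered range
            counts[index] += 1
    return counts


counts = frequency_table(ages, start=20, width=10, n_bins=4)

# Text histogram: one row per interval, one '#' per data point.
for i, count in enumerate(counts):
    lo = 20 + i * 10
    print(f"{lo}-{lo + 9}: {'#' * count} ({count})")
```

The choice of interval width matters: wider intervals smooth the picture, narrower ones show more detail but can look noisy.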

Box Plot

Box plots, sometimes referred to as 'Box and Whisker Plots', are a visual way of summarizing the basic characteristics of a data set. Box Plots show the highest and lowest values (that are not outliers), the middle value (the median) and the values at the first and third quartiles. Outliers are shown as dots beyond the 'whiskers'. This gives us a simple way to quickly understand the spread of the data and is great for quickly comparing two data sets.

Think of the quartiles this way: if your data set were ranked from its lowest to its highest value, Q1 would be the middle of the low values (those below the median) and Q3 would be the middle of the high values (those above the median). Box Plots break a data set into 4 parts: below Q1, Q1 to the Median (Q2), the Median to Q3, and above Q3. The interquartile range is defined as Q3 - Q1.

The lower edge of the box in the Box Plot is Q1, the line inside the box is the median, and the upper edge of the box is Q3.

Outliers are typically defined using a distance of 1.5 * (Q3 - Q1), that is, 1.5 times the interquartile range. If a data point lies more than this distance below Q1 or above Q3 (the edges of the box), it is considered an outlier. The whiskers of the box plot are lines from each end of the box out to the farthest data point that is not an outlier.
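
The numbers behind a box plot can be worked out by hand. The sketch below uses the "middle of each half" description of the quartiles and the 1.5 * IQR rule for outliers. The data is made up, and box_plot_summary is a hypothetical helper. Statistical packages use several slightly different quartile conventions, so their Q1 and Q3 may not match these exactly.

```python
def median(sorted_values):
    """Middle value of an already sorted list (mean of the two middle values if even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers, using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]           # values below the median
    upper_half = data[half + n % 2:]   # values above the median
    q1, q2, q3 = median(lower_half), median(data), median(upper_half)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr         # anything below this is an outlier
    high_fence = q3 + 1.5 * iqr        # anything above this is an outlier
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical data: orders per day over ten days (made up for illustration).
daily_orders = [2, 4, 5, 7, 8, 9, 10, 12, 13, 30]
summary = box_plot_summary(daily_orders)
print(summary)
```

Here the value 30 sits above Q3 + 1.5 * IQR, so it would be drawn as a dot, and the upper whisker stops at 13.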

Scatter Plot

Scatter Plots are charts that visualize the relationship between two sets of data. Every data point for one variable is plotted against a corresponding data point for another variable and the resultant pattern looks like scattered data points.

One of the most common relationships that can be shown in Scatter Plots is the relationship between a product's price and the number of units sold. Typically when the price goes down (a sale), more units are sold and when the price goes up, people buy less of a product. There are exceptions but that tends to be the case.

The thing we want to reduce uncertainty about goes on the Y axis. We want to know what our sales will be if our price changes. The variable on the Y axis, sales, is the dependent variable. The variable on the X axis is the independent variable and it's the thing we can change, increase or decrease, to see what happens to Y.

To draw a Scatter Plot, we take the X and Y values for each observation (e.g., a day, a week or a month) and mark that point on a graph. Done.

A Scatter Plot in which the dots follow the line between 11 and 5 on a clock face shows a negative linear relationship, e.g., when price goes up, sales go down. A Scatter Plot in which the dots run along the line between 7 and 1 on a clock face shows a positive relationship: when X goes up, so does Y.

Correlation

Correlation is a numerical way of describing the relationship between two variables. A Regression analysis uses the 'least squares method' to fit a line through a scatter plot, and the quality of that fit is measured by R Squared.

For a regression with a single X variable, the Correlation coefficient, r, is the square root of R Squared, taking on the sign (+ or -) of the slope of the line. When r is high and positive, we say that there is a positive correlation, e.g., when X goes up, Y goes up.

Correlation, or r, measures the tightness of the scatter dots to the line of best fit, and its sign tells us whether Y goes up or down as X goes up. It always lies between -1 and 1. An r of 0 (zero) means that there is no linear relationship between X and Y. When r is 1, there is a perfect positive relationship between X and Y: when X goes up, Y goes up, and every dot sits exactly on the line. When r is -1, the relationship is just as perfect but negative.
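
To make r concrete, here is a minimal sketch that computes the correlation coefficient directly from its usual textbook (Pearson) definition. The price and sales figures are made up, and pearson_r is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: price per unit and units sold (made up for illustration).
prices = [1, 2, 3, 4, 5]
sales = [10, 8, 7, 4, 3]

r_negative = pearson_r(prices, sales)                       # sales fall as price rises
r_perfect = pearson_r(prices, [2 * x + 1 for x in prices])  # dots exactly on a rising line

print(f"price vs sales: r = {r_negative:.3f}")
print(f"perfect line:   r = {r_perfect:.3f}")
```

The first value comes out close to -1 because the made-up sales fall almost in a straight line as price rises. The second is 1 because every point lies exactly on a rising line.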

Line of Best Fit

A line of best fit is a line drawn through a scatter plot so that the total distance between the line and the scatter data points is as small as possible. This is traditionally called a 'Least Squares Line' and it follows the formula y = mX + C, where m is the slope of the line and C is the value of y when X is 0. Imagine a line running through a scatter plot. Each point on that line will have an X and a Y value. The least squares method says that, for each scatter dot, we take the line's y value at that X and subtract it from the dot's Y value. Our intention is to sum all these differences. However, some of them will be negative because some scatter dots lie below the line, so we square each difference before we sum them.

The Line of Best Fit is the one for which that sum of the squares is the lowest number it can be, i.e., no other straight line sits closer to the dots overall. This summation is called the 'Sum of Squared Errors' (SSE). When the SSE is at its lowest, and especially when it is small, our line is a good way of predicting Y values, given values of X.
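
The least squares idea can be sketched in a few lines. The code below uses the standard closed-form formulas for m and C, then computes the SSE for the fitted line and for a deliberately worse line. The data is made up, and fit_line and sse are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def sse(xs, ys, m, c):
    """Sum of squared differences between each dot's Y and the line's y."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: price per unit and units sold (made up for illustration).
prices = [1, 2, 3, 4, 5]
sales = [10, 8, 7, 4, 3]

m, c = fit_line(prices, sales)
best_sse = sse(prices, sales, m, c)
other_sse = sse(prices, sales, m + 0.5, c)  # a different line, for comparison

print(f"y = {m:.2f} * X + {c:.2f}")
print(f"SSE of fitted line: {best_sse:.2f}")
print(f"SSE of other line:  {other_sse:.2f}")
```

Changing the slope or intercept in any direction increases the SSE, which is what makes the fitted line the "best" one under this criterion.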

Regression

In a scatter plot with a line of best fit running through it, we can assess how well the X variable explains the changes in the Y variable using Regression, e.g., what percentage of the variation in Sales is explained by changes in Price?

When we find a high 'R Squared', expressed as a percentage, the changes in Y are largely explained by the changes in X. If the R Squared is 95%, we say that 95% of the variation in Sales is explained by the change in Price and the remaining 5% is due to error. The error referred to here is the distance between the dots in the scatter and the line of best fit.
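
One common way to compute R Squared is as 1 minus the ratio of the SSE to the total variation of Y around its mean. The sketch below does that for made-up price and sales figures and then recovers r as the square root of R Squared with the sign of the slope. All variable names are illustrative.

```python
from math import copysign, sqrt

# Hypothetical data: price per unit and units sold (made up for illustration).
prices = [1, 2, 3, 4, 5]
sales = [10, 8, 7, 4, 3]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(sales) / n

# Least squares slope and intercept of the line of best fit.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, sales))
         / sum((x - mean_x) ** 2 for x in prices))
intercept = mean_y - slope * mean_x

# Error: squared distances between the dots and the line.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(prices, sales))
# Total variation: squared distances between the dots and the mean of Y.
sst = sum((y - mean_y) ** 2 for y in sales)

r_squared = 1 - sse / sst
r = copysign(sqrt(r_squared), slope)  # r takes the sign of the slope

print(f"R Squared = {r_squared:.1%}")
print(f"r = {r:.3f}")
```

For this data R Squared is about 97.6%, so almost all of the variation in the made-up sales is explained by price, and r is negative because the line slopes downwards.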

There are many other components to a Regression table output, including confidence intervals, P Values and t ratios.