Standard Data Visualizations

 BAR CHART

The classic bar chart uses either horizontal or vertical bars to show discrete, numerical comparisons among categories. One axis of the chart shows the specific categories being compared, and the other axis represents a discrete value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others show the bars divided into subparts to show cumulative effect (stacked bar graphs).


DEVIATION BAR CHART
When graphs display deviation relationships, they communicate how one or more sets of metric values differ from a primary set of values. Deviation relationships can be effectively displayed using the following objects:
- Horizontal bars (except when combined with a time-series relationship)
- Vertical bars
DUAL AXIS BAR CHART
A Dual Axis bar chart uses either horizontal or vertical bars to show discrete, numerical comparisons among categories. It can combine bars and a line across three axes: one axis shows the categories, and the other two axes show their respective values.

STACKED BAR CHART
Stacked Bar Graphs segment the bars of multiple datasets on top of each other. They are used to show how a larger category is divided into smaller categories and how each part relates to the total amount. There are two types of Stacked Bar Graphs:
- Simple Stacked Bar Graphs
- 100% Stacked Bar Graphs


LINE CHART
Line charts are used to display quantitative values over a continuous interval or time span. They are most frequently used to show trends and relationships (when grouped with other lines). This gives the "big picture" over an interval, to see how it has developed over that period. Line graphs are drawn by first plotting data points on a Cartesian coordinate grid, then connecting a line between the points. Typically, the y-axis has a quantitative value, while the x-axis has either a category or sequenced scale.

PIE CHART
Pie charts help show proportions and percentages between categories, by dividing a circle into proportional segments. Each arc length represents a proportion of each category; the full circle represents the total sum of all the data, equal to 100%. Pie charts are used for making part-to-whole comparisons with discrete or continuous data. They are most impactful with a small data set.
BUBBLE CHART
Bubble Charts are typically used to compare and show the relationships between labeled/categorized circles, by the use of positioning and proportions. The overall picture of Bubble Charts can be used to analyze patterns/correlations. Bubble Charts use a Cartesian coordinate system to plot points along a grid where the X and Y axis are separate variables. Each point is assigned a label or category (either displayed alongside or on a legend). Each plotted point then represents a third variable by the area of its circle. Colors can also be used to distinguish between categories or to represent an additional data variable.

SCATTER PLOT
A Scatter plot can help you identify the relationships that exist between different values. By displaying a variable on each axis, you can detect whether a relationship or correlation between the two variables exists. The various types of correlation that can be interpreted are positive (values increase together), negative (one value decreases as the other increases), null (no correlation), linear, exponential and U-shaped. The strength of the correlation can be judged by how closely packed the points are to each other on the graph.

AREA CHART
Area Charts are Line Charts with the area below the line filled in with a certain color or texture. Area Graphs are drawn by first plotting data points on a cartesian coordinate grid, then joining a line between the points and finally filling in the space below the completed line. Like Line Charts, Area Charts are used to display the development of quantitative values over an interval or time period. They are most commonly used to show trends and relationships.

BOX PLOT
A box plot is a convenient way to visually display groups of numerical data through their quartiles. It shows distribution of data based on minimum, maximum, median, and percentiles. Typically used in descriptive statistics, box plots are a great way to quickly examine one or more data sets graphically. Although they may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or data sets.

GANTT CHART
Gantt charts (also referred to as project timelines) are bar charts that help plan and monitor project development or resource allocation on a horizontal time scale. They are essentially horizontal bar charts which provide graphical illustration of a schedule that can help users plan, coordinate, and track specific tasks in a project. The data analyzed in a Gantt chart has a defined starting and ending value; for example, Project A begins 4/15/06 and ends 5/10/06.

HILOW STOCK / CANDLESTICK
This chart control displays financial data as a series of candlesticks representing the high, low, opening, and closing values of a data series (four metrics). The top and bottom of the vertical line in each candlestick represent the high and low values for the data point, while the top and bottom of the filled box represent the opening and closing values.

HISTOGRAM
A histogram visualizes the distribution of data over a continuous interval or a certain time period. Each bar in a histogram represents the tabulated frequency at each interval/bin. With bins of equal width, the total area of the histogram is proportional to the total number of data points. Histograms help give an estimate of where values are concentrated, what the extremes are and whether there are any gaps or unusual values.
PARETO CHART
A Pareto chart is designed to help identify the cause of a quality problem or loss. It includes a bar chart that shows how often a specific problem occurs or the different types of problems that occur. In general, Pareto charts allow you to display the specific areas in which improvement or investigation is necessary. The chart combines bars and a line: the values are represented by bars in descending order, and the running percentage of the total is represented by the line. It therefore shows both the actual values and the cumulative progress toward 100%.
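As a rough sketch of the numbers behind a Pareto chart, the snippet below sorts categories in descending order and computes the running percentage of the total. The `pareto_table` helper and the defect counts are hypothetical, invented purely for illustration.

```python
def pareto_table(counts):
    """Return (category, value, cumulative_percent) rows, largest value first."""
    total = sum(counts.values())
    rows = []
    running = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for category, value in ordered:
        running += value
        rows.append((category, value, round(100 * running / total, 1)))
    return rows


# Hypothetical defect counts, invented for illustration only.
defects = {"dent": 25, "scratch": 40, "other": 5, "misprint": 20, "crack": 10}
rows = pareto_table(defects)
for row in rows:
    print(row)
```

The second element of each row would drive the bars and the third would drive the line.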

POLAR CHART / RADAR CHART
Radar Charts are a way of comparing multiple quantitative variables. This makes them useful for seeing which variables have similar values or if there are any outliers amongst each variable. They are also useful for seeing which variables are scoring high or low within a dataset, making them ideal for displaying performance.

WATERFALL
A Waterfall visualization highlights the increments and decrements of the values of metrics over time. Analysts can use the widget to identify aspects of their business that are contributing to the fluctuations in the values. The visualization can also be used to perform “what-if” analyses, for example on % Revenue Y/Y Variance by Month. It shows how different aspects of the business positively or negatively affect the bottom line.
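A waterfall is essentially a starting value followed by a running total of increments and decrements. The sketch below shows that calculation; the `waterfall_steps` helper and the monthly figures are hypothetical, and this is not tied to any particular charting widget.

```python
from itertools import accumulate


def waterfall_steps(start, changes):
    """Return (label, change, running_total) for each increment or decrement."""
    totals = list(accumulate((change for _, change in changes), initial=start))
    # totals[0] is the starting value; totals[1:] line up with the changes.
    return [
        (label, change, total)
        for (label, change), total in zip(changes, totals[1:])
    ]


# Hypothetical monthly changes, invented for illustration only.
steps = waterfall_steps(100, [("Jan", 20), ("Feb", -15), ("Mar", 5)])
for step in steps:
    print(step)
```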

GAUGES
A Gauge visualization is a simple status indicator that displays a needle that moves within a range of numbers displayed on its outside edges. A real-world example of a gauge is a car's speedometer. Like the Cylinder and Thermometer widgets, this type of visualization is designed to display the value of a single metric. The needle within the gauge is a visual representation of that single metric value.

TIME SERIES
A Time Series Slider is an area graph that allows a document analyst to choose which section of the graph to view at a time. The visualization consists of two related graphs, one positioned above the other. The top graph is the controller, and contains a slider. The bottom graph is the primary graph. You use the slider on the controller to select some portion of the controller, which determines the range of data visible in the primary graph. It allows users to see a high-level trend of one or more metrics and a detailed view by varying the window of the visible data, for example a Revenue trend by Date.
MAPS
A Map allows users to visualize the data so they can identify and analyze relationships, patterns, and trends in their data. Some of the functionalities available are:
- Displaying areas, points, and data that are color-coded based on metric values
- Using image markers, bubble markers, density maps, or color-coded areas to visualize data on the map
- Zooming/panning on the map and data
- Displaying an Information window with additional data for a marker or area
- Providing the ability to customize the Information window, such as providing additional details or metric information, including demographic content from the mapping service
- Drilling up to summary levels of data and down to detailed levels of data

HEAT MAP
A Heat Map presents a combination of colored rectangles, each representing an attribute element, that allow you to quickly grasp the state and impact of a large number of variables. Heat Maps are often used in the financial services industry to review the status of a portfolio. The rectangles contain a variety of colors and shadings that emphasize the status of the various components. In a Heat Map, the size of each rectangle represents its relative weight and the color represents the relative change in the value of that rectangle. You can hover over each rectangle to see which attribute element the rectangle represents and its metric values.

FUNNEL
A Funnel helps to quickly analyze various trends across several metric values. It is a variation of a stacked percent bar chart that displays data that adds up to 100%. Therefore, it can allow analysts to visualize the percent contribution of sales data. It can also show the stages in a sales process and reveal the amount of potential revenue for each stage. When the visualization is used to analyze a sales process, analysts can use the widget to drill down to key metrics such as deal size, profit potential, and probability of closing. The size of the area is determined by the series value as a percentage of the total of all values.

MICRO CHARTS
Micro chart visualizations give the trend of a metric at a glance without requiring many additional details. The bar, sparkline, and bullet microcharts convey information that an analyst can understand just by looking at the chart once. They are compact representations of data that allow analysts to quickly visualize trends. A user can, at a glance, determine the trend of a metric over time or how a metric is performing compared to forecasted figures.


DATA CLOUD
A Data Cloud displays attribute elements in various sizes to depict the differences in metric values between the elements. This type of visualization is similar to a Heat Map in that they both allow an analyst to quickly identify the most significant, positive, or negative contributions. A Data Cloud widget is basically a list of attribute elements. The first metric on the template determines the font size for the attribute elements. A bigger font for an element indicates a larger metric value.


Python Packages for Data Science

In order to do data analysis in Python, you should know a little bit about the main packages relevant to analysis. A Python library is a collection of functions and methods that allow you to perform lots of actions without writing the code yourself. The libraries usually contain built-in modules providing different functionalities, which you can use directly. There are also extensive libraries offering a broad range of facilities.




We have divided the Python data analysis libraries into three groups:

Scientific Computing Libraries
i. Pandas offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table with column and row labels, called a DataFrame. It is designed to provide easy indexing functionality.

ii. The NumPy library uses arrays for its inputs and outputs. It can be extended to objects for matrices, and with minor coding changes, developers can perform fast array processing.

iii. SciPy includes functions for some advanced math problems, such as integration and optimization.

Libraries to create graphs, charts and maps
Using data visualization methods is one of the best ways to communicate with others, showing them meaningful results of analysis.
i. The Matplotlib package is the most well-known library for data visualization. It is great for making graphs and plots. The graphs are also highly customizable.

ii. Seaborn: It is based on Matplotlib. It's very easy to generate various plots such as heat maps, time series, and violin plots.

Machine Learning algorithms
i.  The Scikit-learn library contains tools for statistical modeling, including regression, classification, clustering and so on. This library is built on NumPy, SciPy and Matplotlib.

ii. StatsModels is also a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.

Statistical Tools to Kick Start with Data Science

Some of the most common and convenient statistical tools to get started with data science are correlation, regression, the box plot, and the line of best fit.


Histogram

Histograms are a statistical way of representing the frequencies of data values in particular intervals. The more traditional description is that a histogram is a plot of a frequency table, where the height of each bar tells us how many data points are in each interval.
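The frequency table behind a histogram can be sketched in a few lines. The `frequency_table` helper, the equal-width half-open bins and the sample values are all assumptions made for illustration.

```python
def frequency_table(values, start, width, n_bins):
    """Count how many values fall into each half-open bin [lo, lo + width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical measurements, invented for illustration only.
values = [1, 2, 2, 3, 5, 6, 6, 7, 9]
counts = frequency_table(values, start=0, width=5, n_bins=2)
for i, count in enumerate(counts):
    # One '#' per data point stands in for the height of the bar.
    print(f"[{i * 5}, {(i + 1) * 5}): {'#' * count}")
```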

Box Plot

Box plots, sometimes referred to as 'Box and Whisker Plots', are a visual way of summarizing basic characteristics of a data set. Box Plots show the highest and lowest values (that are not outliers), the middle value and the values at the 1st quarter and 3rd quarter mark. Outliers are shown as dots beyond the 'whiskers'. This gives us a simple way to quickly understand the spread of the data and is great for quickly comparing two data sets.

Think of the quartiles this way: if your data set were ranked from its lowest to its highest value, Q1 would be the middle of the low values (below the median) and Q3 the middle of the high values (above the median). Box Plots break a data set into 4 parts: before Q1, Q1 to the Median (Q2), the Median to Q3, and after Q3. The interquartile range is defined as Q3 - Q1.

The lowest part of the box in the Box Plot is Q1, the line inside the box is the median, and the other end of the box is Q3.

Outliers are typically defined using 1.5 * (Q3 - Q1): if a data point lies more than 1.5 times the interquartile range below Q1 or above Q3 (the edges of the box), it is considered an outlier. The whiskers of the box plot are lines from each end of the box out to the farthest data point that is not an outlier.
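The quantities described above can be computed directly. This is a minimal sketch that takes Q1 and Q3 as the medians of the lower and upper halves, following the description in the text. Other quartile conventions exist and give slightly different values. The `box_plot_summary` helper and the data are hypothetical.

```python
from statistics import median


def box_plot_summary(values):
    """Quartiles, IQR, outliers and whisker ends using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of points the median itself is left out of both halves.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
        "whiskers": (inside[0], inside[-1]),
    }


# Hypothetical data with one unusually large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```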


Scatter Plot

Scatter Plots are charts that visualize the relationship between two sets of data. Every data point for one variable is plotted against a corresponding data point for another variable and the resultant pattern looks like scattered data points.

One of the most common relationships that can be shown in Scatter Plots is the relationship between a product's price and the number of units sold. Typically when the price goes down (a sale), more units are sold and when the price goes up, people buy less of a product. There are exceptions but that tends to be the case.  

The thing we want to reduce uncertainty about goes on the Y axis. We want to know what our sales will be if our price changes. The variable on the Y axis, sales, is the dependent variable. The variable on the X axis is the independent variable and it's the thing we can change, increase or decrease, to see what happens to Y.

To plot a Scatter Plot, we take each value of Y and X for a particular observation e.g., a day, a week, a month and we mark them on a graph. Done. 

A Scatter Plot showing many dots following the line between  11 and 5 on a clock is a negative linear relationship e.g., when price goes up, sales go down. A Scatter Plot running from 1 to 7 on a clock shows a positive relationship, when X goes up, so does Y.

Correlation

Correlation is a numerical way of interpreting the relationship between two variables. A Regression analysis uses the 'least squares method' to fit a line through a scatter plot, and how well that line fits is measured by R Squared.

The Correlation coefficient is the square root of R Squared, taking on the sign (+ or -) of the slope of the data. When R is high and positive, we say that there is a positive correlation e.g., when X goes up, Y goes up.

Correlation, or r, measures the tightness of the scatter dots to the line of best fit, and its sign tells us whether Y goes up or down with changes in X. An r of 0 (zero) means that there is no linear relationship between X and Y. When r is 1, there is a perfect positive relationship: when X goes up, Y goes up. When r is -1, there is a perfect negative relationship: when X goes up, Y goes down.
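The text does not spell out a formula for r. The sketch below assumes the usual Pearson product-moment definition, and the price and unit figures are hypothetical values chosen to lie exactly on a falling line.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of X and Y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical prices and units sold: every price rise costs two units.
price = [1, 2, 3, 4, 5]
units = [10, 8, 6, 4, 2]
r = pearson_r(price, units)
print(r)
```

Because the points fall exactly on a downward line, r comes out at the negative end of its range.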

Line of Best Fit

A line of best fit is a line drawn through a scatter plot so that the total squared vertical distance between the line and the scatter data points is as small as possible. This is traditionally called a 'Least Squares Line' and it follows the formula y = mX + C. Imagine a line running through a scatter plot. Each point on that line will have an X and a Y value. The least squares method says that we take each line y value and subtract it from the corresponding scatter dot's Y value. Our intention is to sum all these values; however, some of them will be negative, as some scatter dots will be below the line, so we square each value before we sum them.

The Line of Best Fit is the one where that sum of the squares is the lowest number it can be, i.e., the dots are as close to the line as possible. This summation is called the 'Sum of Squared Errors' (SSE). When the SSE is at its lowest, our line is the best straight-line predictor of Y values, given values of X.
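As a sketch of the least squares idea, the snippet below uses the standard closed-form expressions for m and C (an assumption, since the text gives only y = mX + C) and then computes the SSE for the fitted line. The data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return slope m and intercept c of the line minimising the SSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def sum_squared_errors(xs, ys, m, c):
    """Sum of squared vertical distances between the dots and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = least_squares_line(xs, ys)
sse = sum_squared_errors(xs, ys, m, c)
print(m, c, sse)
```

Nudging m or c away from the fitted values and recomputing `sum_squared_errors` should never give a smaller total.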

Regression

In a scatter plot with a line of best fit running through it, we can assess how well the X variable explains the changes in the Y variable using Regression e.g., What percentage of Sales is explained by changes in Price?

When we find a high 'R Squared', in percentage terms, the changes in Y are largely explained by the changes in X. If the R Squared is 95%, we say that, 95% of the variation in Sales is explained by the change in Price and the rest, the 5%, is due to error. The error referred to here is the distances between the dots in the scatter and the line of best fit. 
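R Squared can be sketched as one minus the ratio of the unexplained variation (SSE) to the total variation of Y around its mean. That definition is the usual one for a simple regression and is assumed here. The `r_squared` helper and the data are hypothetical.

```python
def r_squared(xs, ys):
    """Share of the variation in Y explained by a least squares line on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))  # error part
    sst = sum((y - mean_y) ** 2 for y in ys)                   # total variation
    return 1 - sse / sst


# Hypothetical observations, invented for illustration only.
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"{r2:.0%} of the variation in Y is explained by X")
```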

There are many other components to a Regression table output including, confidence intervals, P Values and t ratios.

Adobe Analytics Connector - Microsoft Power BI Desktop

In December 2017, Microsoft Power BI is releasing a new connector for Adobe Analytics (formerly Omniture). This new connector allows you to import and analyze your Adobe Analytics data within Power BI.
Because the connector is a beta version, you first have to enable it under Options and Settings >> Options >> Preview Features before the Adobe connector appears under Online Services.

This Adobe Analytics connector can be found in the Get Data dialog, under the Online Services category.


After the connection is established, you can view and select multiple dimensions and measures within the Navigator dialog box to create a tabular output. You can also provide input parameters for the selected items.

After making your selection, you can load the data into Power BI Desktop, or apply additional data transformations and filters within the Query Editor UX through the Edit option in this dialog.

Once you select the required Dimensions and Measures, the table connections are created automatically in Power BI, and you will see the fields created along with the editor window.




Skewness and Kurtosis in Statistics

The average and the measures of dispersion can describe a distribution, but they are not sufficient to describe its shape. For this purpose we use two further concepts, known as Skewness and Kurtosis.
Skewness
Skewness means lack of symmetry. A distribution is said to be symmetrical when the values are evenly distributed on both sides of the mean. For example, the following distribution is symmetrical about its mean 3.
x               :    1    2    3    4    5
frequency (f)   :    5    9   12    9    5

In a symmetrical distribution the mean, median and mode coincide, that is, mean = median = mode.
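The claim can be checked for the frequency table above with the standard library. The snippet expands the table into individual observations, which is only a convenience for illustration.

```python
from statistics import mean, median, mode

xs = [1, 2, 3, 4, 5]
freqs = [5, 9, 12, 9, 5]

# Repeat each x value as many times as its frequency.
observations = [x for x, f in zip(xs, freqs) for _ in range(f)]

centre = (mean(observations), median(observations), mode(observations))
print(centre)
```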
Several measures are used to express the direction and extent of the skewness of a distribution. The most important are those given by Pearson. The first one is the Coefficient of Skewness:

$$S_k = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$$

For a symmetric distribution $S_k = 0$. If the distribution is negatively skewed then $S_k$ is negative and if it is positively skewed then $S_k$ is positive. The range for $S_k$ is from -3 to 3.

The other measure uses the $\beta$ (read ‘beta’) coefficient, which is given by $\beta_1 = \mu_3^2 / \mu_2^3$, where $\mu_2$ and $\mu_3$ are the second and third central moments. The second central moment $\mu_2$ is nothing but the variance. The sample estimate of this coefficient is $b_1 = m_3^2 / m_2^3$, where $m_2$ and $m_3$ are the sample central moments given by

$$m_2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad m_3 = \frac{\sum (x - \bar{x})^3}{n}$$

For a symmetrical distribution $b_1 = 0$. Skewness is positive or negative depending upon whether $m_3$ is positive or negative.

Kurtosis
A measure of the peakedness or convexity of a curve is known as Kurtosis.


           
Three curves, (1), (2) and (3), can all be symmetrical about the mean and still not be of the same type, because one has a different peak compared to the others. Curve (1) is known as mesokurtic (normal curve); Curve (2) is known as leptokurtic (peaked curve) and Curve (3) is known as platykurtic (flat curve). Kurtosis is measured by Pearson’s coefficient, $\beta_2$ (read ‘beta-two’). It is given by $\beta_2 = \mu_4 / \mu_2^2$.
The sample estimate of this coefficient is $b_2 = m_4 / m_2^2$,
where $m_4$ is the fourth central moment given by $m_4 = \sum (x - \bar{x})^4 / n$.

The distribution is called normal if b2 = 3. When b2 is more than 3 the distribution is said to be leptokurtic. If b2 is less than 3 the distribution is said to be platykurtic.
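As a sketch, the moment-based coefficients can be computed for the symmetrical frequency table used earlier in this section. It assumes the common definitions b1 = m3^2 / m2^3 and b2 = m4 / m2^2, with each sample central moment divided by n. The `central_moment` helper is a hypothetical name.

```python
def central_moment(xs, freqs, k):
    """k-th central moment of a frequency table, dividing by n."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(xs, freqs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(xs, freqs)) / n


xs = [1, 2, 3, 4, 5]
freqs = [5, 9, 12, 9, 5]

m2 = central_moment(xs, freqs, 2)  # the variance
m3 = central_moment(xs, freqs, 3)
m4 = central_moment(xs, freqs, 4)

b1 = m3 ** 2 / m2 ** 3  # skewness coefficient
b2 = m4 / m2 ** 2       # kurtosis coefficient

shape = "leptokurtic" if b2 > 3 else "platykurtic" if b2 < 3 else "mesokurtic"
print(b1, round(b2, 3), shape)
```

For this table the third moment vanishes, as expected for a symmetrical distribution, and b2 falls under 3.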

Measures of Dispersion in Statistics


We know that averages are representatives of a frequency distribution but they fail to give a complete picture of the distribution. They do not tell anything about the scatterness of observations within the distribution.

Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each.

The distribution may be as follows:
Variety I          45        42        42        41        40
Variety II        54        48        42        33        30

It can be seen that the mean yield of both varieties is about 42 kg (42.0 kg and 41.4 kg respectively). But we cannot say that the performance of the two varieties is the same. There is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second variety. The first variety may be preferred since it is more consistent in yield performance. From the above example, it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution. In addition to it we should have a measure of the scatterness of observations. The scatterness or variation of observations from their average is called the dispersion. There are different measures of dispersion like the range, the quartile deviation, the mean deviation and the standard deviation.
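A quick check of the two varieties, using the figures exactly as listed in the table, shows similar averages but very different spreads. The population standard deviation (`pstdev`) is used here only as one convenient measure of spread.

```python
from statistics import mean, pstdev

variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

for name, yields in (("Variety I", variety_1), ("Variety II", variety_2)):
    print(name, "mean:", mean(yields), "S.D.:", round(pstdev(yields), 2))
```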

Range
The simplest measure of dispersion is the range. The range is the difference between the maximum and minimum values in a group of observations. For example, suppose that the yields (kg per plot) of a variety from five plots are 8, 9, 8, 10 and 11. The range is (11 - 8) = 3 kg. In practice the range is indicated as 8 - 11 kg.
The range takes only the maximum and minimum values into account and not all the values. Hence it is a very unstable or unreliable indicator of the amount of deviation. It is affected by extreme values. In the above example, if we have 15 instead of 11, the range will be (15 - 8) = 7 kg. In order to avoid these difficulties another measure of dispersion called quartile deviation is preferred.

Quartile Deviation
We can delete the values below the first quartile and the values above the third quartile. It is assumed that the unusually extreme values are eliminated in this way. We can then take the mean of the deviations of the two quartiles from the second quartile (median). That is,

$$\text{Q.D.} = \frac{(Q_3 - Q_2) + (Q_2 - Q_1)}{2} = \frac{Q_3 - Q_1}{2}$$

This quantity is known as the quartile deviation (Q.D.).
The quartile deviation is more stable than the range as it depends on two intermediate values. It is not affected by extreme values since the extreme values are already removed. However, the quartile deviation also fails to take the values of all deviations into account.
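The contrast between the two measures can be sketched as follows. The first pair of lists is the yield example from the text. The nine-value sample is hypothetical, and the quartiles are taken as medians of the lower and upper halves, which is one convention among several.

```python
from statistics import median


def value_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)


def quartile_deviation(values):
    """(Q3 - Q1) / 2 with quartiles taken as medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (median(upper) - median(lower)) / 2


yields = [8, 9, 8, 10, 11]
with_extreme = [8, 9, 8, 10, 15]
print(value_range(yields), value_range(with_extreme))

# Hypothetical larger sample: replace the top value with an extreme one.
sample = [8, 8, 9, 9, 10, 10, 11, 11, 12]
stretched = sample[:-1] + [30]
print(value_range(sample), quartile_deviation(sample))
print(value_range(stretched), quartile_deviation(stretched))
```

In this sample the extreme value changes the range a great deal while the quartile deviation stays put.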

Mean Deviation
Mean deviation is the mean of the deviations of individual values from their average. The average may be either mean or median. For raw data the mean deviation from the median is the least. Therefore, median is considered to be most suitable for raw data. But usually the mean is used to find out the mean deviation. The mean deviation is given by
M.D. = $\frac{\sum |x - \bar{x}|}{n}$ for raw data and M.D. = $\frac{\sum f\,|x - \bar{x}|}{\sum f}$ for grouped data
All positive and negative differences are treated as positive values. Hence we use the modulus symbol | |. We have to read $|x - \bar{x}|$ as “modulus $(x - \bar{x})$”. If we take $(x - \bar{x})$ as such, the sum of the deviations, $\sum (x - \bar{x})$, will be 0. Hence, if the signs are not eliminated the mean deviation will always be 0, which is not correct.

The steps of computation are as follows :
Step 1: If the classes are not continuous we have to make them continuous.
Step 2: Find out the mid values of the classes (mid - X = x).
Step 3: Compute the mean.
Step 4: Find out $|x - \bar{x}|$ for all values of x.
Step 5: Multiply each $|x - \bar{x}|$ by the corresponding frequency.
Step 6: Use the formula.
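The steps above can be sketched in code. The grouped formula is assumed to be the sum of f times |x - mean| divided by the sum of f, the class limits and frequencies are hypothetical, and the classes are taken to be continuous already (Step 1).

```python
def mean_deviation_raw(values):
    """Mean of the absolute deviations of raw values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


def mean_deviation_grouped(class_limits, freqs):
    """Mean deviation about the mean for grouped data."""
    mids = [(lo + hi) / 2 for lo, hi in class_limits]        # Step 2
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(mids, freqs)) / n       # Step 3
    abs_devs = [abs(x - mean) for x in mids]                 # Step 4
    weighted = [f * d for f, d in zip(freqs, abs_devs)]      # Step 5
    return sum(weighted) / n                                 # Step 6


# Hypothetical continuous classes and frequencies.
md_grouped = mean_deviation_grouped([(0, 10), (10, 20), (20, 30)], [2, 5, 3])

# Raw yields from the range example.
yields = [8, 9, 8, 10, 11]
md_raw = mean_deviation_raw(yields)

# Without the modulus, the signed deviations cancel out.
signed_sum = sum(x - sum(yields) / len(yields) for x in yields)
print(md_grouped, md_raw, signed_sum)
```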

The mean deviation takes all the values into consideration. It is fairly stable compared to the range or the quartile deviation. However, since the mean deviation ignores the signs of the deviations, it is not suitable for further statistical analysis, and it is not as stable as the standard deviation, which is defined below.

Standard Deviation
Ignoring the signs of the deviations is mathematically not correct. We may square the deviation to make a negative value as positive. After calculating the average squared deviations, it can be expressed in original units by taking its square root. This type of the measure of variation is known as Standard Deviation.
The standard deviation is defined as the square root of the mean of the squared deviations of individual values from their mean. Symbolically,
S.D. = $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ or $\sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n}}$
This is called standard deviation because of the fact that it indicates a sort of group standard spread of values around their mean. For grouped data it is given as
S.D. = $\sqrt{\frac{\sum f (x - \bar{x})^2}{n}}$ or $\sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{n}}{n}}$, where $n = \sum f$
The sample standard deviation should be an unbiased estimate of the population standard deviation because we use the sample standard deviation to estimate the population standard deviation. For this we substitute n - 1 for n in the formula. Thus, the sample standard deviation is written as

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

For grouped data it is given by

$$s = C \times \sqrt{\frac{\sum f d^2 - \frac{(\sum f d)^2}{n}}{n - 1}}$$

where,
            C = class interval
            d = (x - A) / C as given under mean.

The square of the standard deviation is known as the variance. In the analysis of variance technique, the term $\sum (x - \bar{x})^2$ is called the sum of squares, and the variance is called the mean square. The standard deviation is denoted by s in case of a sample, and by σ (read ‘sigma’) in case of a population.
The standard deviation is the most widely used measure of dispersion. It takes all the items into consideration. It is more stable compared to other measures. However, it will be inflated by extreme items as is the mean.
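The definitions can be sketched directly and cross-checked against Python's `statistics` module. The definitional form, the sum-of-squares shortcut and the n - 1 version are the standard ones and are assumed here. The yields are those from the range example.

```python
from math import sqrt
from statistics import pstdev, stdev


def population_sd(values):
    """Square root of the mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)


def population_sd_shortcut(values):
    """Same quantity from the sum of x and the sum of x squared."""
    n = len(values)
    sum_of_squares = sum(x * x for x in values) - sum(values) ** 2 / n
    return sqrt(sum_of_squares / n)


def sample_sd(values):
    """Sample version: divide the sum of squares by n - 1 instead of n."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


yields = [8, 9, 8, 10, 11]
print(population_sd(yields), population_sd_shortcut(yields), sample_sd(yields))
```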

The standard deviation has some additional special characteristics. It is not affected by adding or subtracting a constant value to each observed value. It is affected by multiplying or dividing each observation by a constant. When the observations are multiplied by a constant, the resulting standard deviation will be equivalent to the product of the actual standard deviation and the constant. (Note that division of all observations by a constant, C is equivalent to multiplication by its reciprocal, 1/C. Subtracting a constant C is equivalent of adding a constant, - C.)
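These two properties are easy to check numerically. The constants 5 and 3 and the use of `pstdev` are arbitrary choices made for illustration.

```python
from statistics import pstdev

yields = [8, 9, 8, 10, 11]

base = pstdev(yields)
shifted = pstdev([x + 5 for x in yields])  # add a constant to every value
scaled = pstdev([3 * x for x in yields])   # multiply every value by a constant

print(base, shifted, scaled)
```

The shifted data should give the same spread as the original, while the scaled data should give three times as much.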
The standard deviations can be pooled. If the sum of squares for the first distribution with $n_1$ observations is $SS_1$, and the sum of squares for the second distribution with $n_2$ observations is $SS_2$, then the pooled standard deviation is given by

$$s_p = \sqrt{\frac{SS_1 + SS_2}{n_1 + n_2 - 2}}$$
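A minimal sketch of pooling, assuming the usual formula in which the two sums of squares are added and divided by n1 + n2 - 2 before taking the square root. The two paddy varieties from earlier in this section serve as input, and the helper names are hypothetical.

```python
from math import sqrt


def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def pooled_sd(first, second):
    """Pooled S.D.: combined sums of squares over combined degrees of freedom."""
    combined = sum_of_squares(first) + sum_of_squares(second)
    return sqrt(combined / (len(first) + len(second) - 2))


variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]
pooled = pooled_sd(variety_1, variety_2)
print(round(pooled, 3))
```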

           
Measures of Relative Dispersion
Suppose that the two distributions to be compared are expressed in the same units and their means are equal or nearly equal. Then their variability can be compared directly by using their standard deviations. However, if their means are widely different or if they are expressed in different units of measurement, we can not use the standard deviations as such for comparing their variability. We have to use the relative measures of dispersion in such situations.
There are relative measures of dispersion corresponding to the range, the quartile deviation, the mean deviation, and the standard deviation. Of these, the coefficient of variation, which is related to the standard deviation, is the most important. The coefficient of variation is given by,
C.V. = (S.D. / Mean) x 100
The C.V. is a unit-free measure. It is always expressed as a percentage. The C.V. will be small if the variation is small. Of two groups, the one with the smaller C.V. is said to be more consistent.
The coefficient of variation is unreliable if the mean is near zero. Also it is unstable if the measurement scale used is not ratio scale. The C.V. is informative if it is given along with the mean and standard deviation. Otherwise, it may be misleading.
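As a small sketch, the C.V. of the two paddy varieties from the dispersion example can be compared. The sample standard deviation (`stdev`) is used here, which is an assumption since the text does not say which form of S.D. to plug in.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """C.V. = (S.D. / Mean) x 100, expressed as a percentage."""
    return stdev(values) / mean(values) * 100


variety_1 = [45, 42, 42, 41, 40]
variety_2 = [54, 48, 42, 33, 30]

cv_1 = coefficient_of_variation(variety_1)
cv_2 = coefficient_of_variation(variety_2)
print(round(cv_1, 2), round(cv_2, 2))
```

The variety with the smaller C.V. is the more consistent of the two.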