## Degrees of Freedom (df)

Degrees of freedom (df) denotes the number of values in a calculation that are free to vary. It rests on the idea that, once certain quantities are fixed, one cannot freely choose every value in the sample.
The concept can be explained by an analogy:
X+Y = 10 (1)
In the above equation you have freedom to choose a value for X or Y but not both because when you choose one, the other is fixed. If you choose 8 for X, then Y has to be 2. So the degree of freedom here is 1.
X+Y+Z = 15 (2)
In formula (2), one can choose values for two variables but not all three. If you choose 8 for X and 2 for Y, then Z is fixed at 5, so the degrees of freedom here is 2. In many tests, df is calculated by subtracting 1 from the size of each group, but the method of calculation varies with the test used.
“Degrees of freedom” refers to the number of scores that are free to vary. It is required when one is working with estimates of population values (sample statistics). It is often abbreviated as df. It is the number of observations in the data collection that are free to vary after the sample statistics have been calculated. For example, in calculating a sample standard deviation, we must subtract the sample mean from each of the n data observations in order to get the deviations from the mean. But once we have completed the next-to-last subtraction, the final deviation is automatically determined, since the deviations prior to squaring must sum to zero. Therefore, the last deviation is not free to vary; only n-1 are free to vary. As a rule of thumb, every estimate of a population value in a formula costs 1 degree of freedom.
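The zero-sum property of the deviations can be checked directly. The following is a minimal sketch, assuming a hypothetical sample of five observations (the values are illustrative only): it shows that the first n-1 deviations determine the last one.

```python
import statistics

# Hypothetical sample of n = 5 observations (illustrative values only).
sample = [3, 4, 5, 6, 7]
n = len(sample)
mean = statistics.mean(sample)

# Deviations from the sample mean, before squaring.
deviations = [x - mean for x in sample]
total = sum(deviations)

# Because the deviations must sum to zero, the first n - 1 of them
# determine the last one: it is the negative of their sum.
implied_last = -sum(deviations[:-1])

print("deviations:", deviations)
print("sum of deviations:", total)
print("last deviation implied by the others:", implied_last)
print("matches actual last deviation:", implied_last == deviations[-1])
```

Only n-1 of the deviations carry independent information, which is why the sample standard deviation divides by n-1.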
The practical impact of using degrees of freedom is found when one considers small samples. Consider the t distribution, for example. With small samples it gets flatter; put another way, more of the area underneath the curve lies farther from the mean, so the region of rejection begins farther out. This makes using degrees of freedom a more conservative and accurate approach, since small samples tend to underestimate the population value.
Practical Example:
Consider a situation in which the scores that make up a distribution are unknown, and I ask you to guess the first score in a distribution of 5 numbers. Your guesses will be wild because that score could be any number. The same is true for the second, third, and fourth scores. All of these scores have complete freedom to vary. But if I tell you that the first four scores in the distribution are 3, 4, 5, and 6, and I tell you the mean of the distribution (5, in this case), then the last score has to be 7, because the five scores must sum to 5 × 5 = 25. In other words, if the mean is known, the missing score is determined by your knowledge of the other four. Therefore n-1 scores are free to vary. The (-1) is often called a “restriction.” Note that by making the denominator smaller (n-1 instead of n), the standard deviation becomes larger. A larger standard deviation means that the sampling distribution is flatter, and flatter means more values are farther from the mean. It’s tougher to find significance.
In conclusion, when the statistical formula is concerned with description, degrees of freedom is n.  When the formula is concerned with inference, some restrictions apply. The idea is to adjust for a small sample size’s tendency to underestimate the population parameter.  As n gets larger, this becomes less of a problem because the distribution becomes less flat and more normal, but we still use the sample formula to calculate the statistic.
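To make the difference between the descriptive and inferential formulas concrete, here is a small sketch using the five scores from the practical example (3, 4, 5, 6, 7). The variable names and the helper `correction_ratio` are illustrative assumptions, not part of any standard API; the results are compared against Python's `statistics.pstdev` (divides by n) and `statistics.stdev` (divides by n-1).

```python
import math
import statistics

scores = [3, 4, 5, 6, 7]
n = len(scores)
mean = statistics.mean(scores)

# Sum of squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Descriptive formula: divide by n.
sd_descriptive = math.sqrt(sum_of_squares / n)

# Inferential formula: divide by n - 1 (one restriction for the estimated mean).
sd_inferential = math.sqrt(sum_of_squares / (n - 1))


def correction_ratio(sample_size: int) -> float:
    """Ratio of the n-1 standard deviation to the n standard deviation."""
    return math.sqrt(sample_size / (sample_size - 1))


print("divide by n:    ", round(sd_descriptive, 3))
print("divide by n - 1:", round(sd_inferential, 3))
for size in (5, 50, 500):
    print(size, round(correction_ratio(size), 4))
```

The ratio shrinks toward 1 as the sample grows, which matches the point that the adjustment matters most for small samples.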

## Web Analytics Terms

A
Abandonment: A measure of visitors who exit or leave a conversion process on a website and do not return later during the session.
A/B Testing: A method of testing banner ads, emails, and landing pages in which a baseline control sample is compared to a variety of single-variable test samples.
Accuracy: The ability of a measurement to match the actual value of the quantity being measured. Accuracy is the foundation upon which your web analytics should be built. If you can't trust that your data is accurate, you can't make confident decisions. In statistical terms, accuracy is the width of the confidence interval for a desired confidence level.
Acknowledgement Page: A page displayed after a visitor completes an action or transaction, for example a thank-you or receipt page. Acknowledgement pages are often important in Scenario Analysis, where they indicate a completed scenario.
Acquisition: The process of gaining customers through the means of different marketing strategies. For the purposes of web analytics, it often refers specifically to the process of attracting visitors to a web site.
ACT: After-Click Tracking is the recording of the activity path of a visitor to a site after they have clicked on an email link.
Actionable Data: Information that allows you to make a decision or can be made use of in any way.
Ad: A link that takes a visitor to a web site when clicked on, usually graphic or text.
Ad Click: A click on an advertisement on a website which takes a user to another site.
Ad Hoc Query: A non-standard inquiry posed to a database of information as the need arises.
Ad View: A web page that presents an ad. There may be more than one ad on an ad view. Once visitors have viewed an ad, they can click on it.
Affiliate Marketing: A method of promoting web businesses in which an affiliate is rewarded for providing customers. Compensation could be made based on a value for visits, subscriptions, leads, sales, and so on.
Aggregate Data: A summary of collected information which groups data together without individual-level statistics.
Algorithm: A mathematical formula used by search engines to determine which web sites in their database to present in search results, in which order. While search engine algorithms change regularly, primary on-page factors include keyword density and source code optimization. The primary off-page factor is link popularity.
API: Application Programming Interface is a system that a computer or application supplies in order to allow requests for service to be made of it by other computer programs. APIs allow data to be exchanged between computer programs, and a standard software API method includes Open Database Connectivity (ODBC).
ASP: Active Server Pages are a set of software components that run on a web server and let developers build dynamic web pages.
Attrition: The erosion of your customer base over time. The opposite of customer retention.
Authentication: The technique by which access to Internet or intranet resources requires the user to enter a username and password as identification.
Average Lifetime Value: The average of the lifetime value of a visitor or multiple visitors during the reporting period, where each visitor's lifetime value is the total monetary value of a visitor's past orders since visitor tracking began.
B
Bandwidth: Measure, in kilobytes of data transferred, of the traffic on a site.
Banner Ad: An advertisement embedded on a web page, usually intended to drive traffic to a different website by linking to the advertiser's site. The Interactive Advertising Bureau (IAB) has defined a standard set of banner ad sizes (Medium Rectangle, Rectangle, Leaderboard, Wide Skyscraper) in a set of guidelines called the Universal Ad Package.
Benchmark: A standard by which something can be measured or judged. For example, you benchmark your Key Performance Indicators to ensure everyone in your organization is measuring performance against the same goals.
Blog/Web Logs: A self-published, managed or maintained Web diary. Usually updated daily or weekly, blogs have historically been personal, but gained notoriety after the 2004 election as an influential media outlet. Companies now use blogs to extend their brand and improve their organic search visibility.
Bounce Rate: The percentage of entrances on a web page that result in an immediate exit from the web site.
Browser: A program used to locate and view HTML documents.
Business Intelligence: While some would claim it's an oxymoron, business intelligence refers to a category of software and tools designed to gather, store, analyze, and deliver data in a user-friendly format to help organizations make more informed business decisions. Software types include dashboarding, data mining, data warehouses, and other information systems.
C
Campaign Analysis: A measure that tracks activity originating from a marketing campaign, so you can compare your campaigns and evaluate their effectiveness.
Client: The browser used by a website visitor.
Client Error: An error that occurs because of an invalid request by the visitor's browser.
Cloaking: In search engine marketing, the act of getting a search engine to record content for a URL that is different from what a searcher will ultimately see. It can be done in many technical ways. Several search engines have explicit rules against unapproved cloaking. Those violating these guidelines might find their pages penalized or banned from a search engine's index. Approved cloaking generally happens only with search engines offering paid inclusion programs.
Click Fraud: A type of internet crime that occurs in pay per click online advertising when a person, automated script, or computer program imitates a legitimate user of a web browser clicking on an ad, for the purpose of generating a charge per click without having actual interest in the target of the ad's link.
Content Management System (CMS): A software platform that aids in the management of content on a web site.
Contextual Link Ads/Inventory: To supplement their business models, certain text-link advertising networks (like Google) have expanded their network distribution to include "contextual inventory". Most vendors of "search engine traffic" have expanded the definition of Search Engine Marketing to include this contextual inventory. Contextual or content inventory is generated when listings are displayed on pages of Web sites (usually not search engines), where the written content on the page indicates to the ad-server that the page is a good match to specific keywords and phrases. Often this matching method is validated by measuring the number of times a viewer clicks on the displayed ad. These ads typically do not perform as well as traditional text ads on search engines, but the lower cost justifies the expense.
Conversion: An action that signifies the completion of a specified activity. For many sites, a user converts if they buy a product, sign up for a newsletter, or download a file. The conversion rate is the percentage of visitors who convert. Cookie deletion can affect your conversion rate: if cookies are systematically deleted, repeat visitors will be under-counted and new visitors over-counted, skewing the conversion rate metric by which you analyze your site's overall effectiveness.
Conversion Funnel: The series of steps that move a visitor towards a specified conversion event, such as an order or registration signup. See also abandonment.
Conversion Rate: The relationship between visitors to a web site and actions considered to be a "conversion," such as a sale or request to receive more information. This metric is often expressed as a percentage.
Cookie: A small text file stored by the visitor's browser that includes a tracking ID used to identify returning visitors. In analytics, this information is typically transmitted to a data collection facility via a 1x1 pixel GIF image request. Contrary to some industry speculation, cookies cannot be used for malicious purposes such as privacy tapping. See also first-party and third-party cookies.
Cost-per-Click (CPC): System where an advertiser pays an agreed amount for each click someone makes on a link leading to their web site. Also known as PPC or paid listings.
Cost-per-Thousand (CPM): System where an advertiser pays an agreed amount for the number of times their ad is seen by a consumer, regardless of the consumer's subsequent action. This term is heavily used in print, broadcasting and direct marketing, as well as with online banner ad sales. CPM stands for "cost per thousand," since ad views are often sold in blocks of 1,000. The M in CPM is the Roman numeral for 1,000 (from the Latin mille).
Crawler/Spider/Robot: Component of search engine that indexes web sites automatically. A search engine's crawler (also called a spider or robot), copies web page source code into its index database and follows links to other web pages.
CTR: Click Through Rate. A click through rate is the rate at which visitors "click through" from one website page or property to the next. A good indication of an ad's effectiveness.
Customer Segment: A powerful aspect of relationship marketing in which you target a sub-section or group of customers who share a specific trait or set of behaviors. See also demographics and psychographics.
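Several of the entries above (Conversion Rate, Cost-per-Click, Cost-per-Thousand, CTR) reduce to simple arithmetic. The sketch below is illustrative only: the function names and all traffic and price figures are hypothetical, and CTR is computed as clicks divided by impressions, which is a common convention assumed here rather than something the glossary states.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Percentage of visitors who completed a conversion."""
    return 100.0 * conversions / visitors if visitors else 0.0


def click_through_rate(clicks: int, impressions: int) -> float:
    """Percentage of ad impressions that led to a click (assumed definition)."""
    return 100.0 * clicks / impressions if impressions else 0.0


def cpc_cost(clicks: int, price_per_click: float) -> float:
    """Total cost when paying an agreed amount per click."""
    return clicks * price_per_click


def cpm_cost(impressions: int, price_per_thousand: float) -> float:
    """Total cost when paying an agreed amount per 1,000 ad views."""
    return impressions / 1000 * price_per_thousand


# Hypothetical campaign figures, for illustration only.
impressions, clicks, conversions = 50_000, 400, 20

print("CTR (%):            ", click_through_rate(clicks, impressions))
print("Conversion rate (%):", conversion_rate(conversions, clicks))
print("Cost at 0.50 CPC:   ", cpc_cost(clicks, 0.50))
print("Cost at 4.00 CPM:   ", cpm_cost(impressions, 4.00))
```

Comparing the CPC and CPM totals for the same campaign is one way to see which pricing model would have been cheaper for a given click-through rate.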
D
Dashboard: A web analytics dashboard provides all of your critical metrics in one place to help you understand the health or performance of your business.
Data Warehouse: A logical collection of information gathered from many different operational databases and used to create business intelligence that supports business analysis activities and decision-making tasks. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favor efficient data analysis and reporting.
Demographics: The physical characteristics of human populations and segments of populations, often used to identify consumer markets. Demographics can include information such as age, gender, marital status, education, and geographic location. See also psychographics.
Directories: A type of search engine where listings are gathered or reviewed by humans, rather than by search engine crawlers. In directories, web sites are often reviewed, summarized in about 25 words and placed in a particular category. The largest and most popular directory site is Yahoo!
Domain: An area in the Internet specified by a URL address. The top-level domain is at the end after the dot and the second-level domain comes before it, and shows where in the top-level domain the address can be found. For example in www.xyz.com, ".com" is the top-level domain and "xyz" is the second level domain.
Domain Name: The text name that corresponds to a numeric IP address of a computer on the Internet.
Doorway/Landing/Gateway/Bridge/Jump Pages: A web page created expressly in hopes of ranking well for a term in a search engine's organic/non-paid listings and which itself does not deliver much information to those viewing it. Instead, visitors will often see only some enticement on the doorway page leading them to other pages, or they may be seamlessly redirected to a real page within the existing web site. With cloaking, visitors may never see the doorway page at all. Several search engines have guidelines against doorway pages, though they are more commonly allowed in through paid inclusion programs.
E
E-commerce: The act of selling goods and services online via a standalone site or through an online auction center.
Email Bounce: The number of e-mails that were sent but never reached the intended receiver.
Entry Page: The first viewed page on a visitor's path through a site.
Exit Page: The last page viewed on a visitor's path through a site.
F
First Party Cookie: For most business models, first-party cookies are regarded as the most reliable method to measure visitor activity. Whereas a third-party cookie is usually set by an analytics vendor (an entity with whom the user does not have a relationship), a first-party cookie is set by the business, an organization with whom the web site visitor has specifically chosen to do business. Because of this relationship, first-party cookies are deemed more secure by the user. See also cookies.
Frequency: The number of times a visitor has visited a site during a reporting period. Average Frequency is the average of frequencies of all the visitors during the reporting period. Frequency is a retention metric and is part of RFM (recency, frequency, monetary) analysis.
H
Hit: Any request for a file from a web server. A single page view likely generates multiple hits, as multiple image and text files are downloaded from the web server.
Home Page: The main page of a web site. The home page provides visitors with an overview and links to the rest of the site.
I
Impression: The display of an online advertisement (usually a banner ad) to a web site visitor.
Index: The collection of information (contained in a large database) a search engine has that searchers can query against. With crawler-based search engines, the index is typically copies of all the web pages they have found from crawling the web. With human-powered directories, the index contains the summaries of all web sites that have been categorized.
Inbound/Back Link: A text or graphical hyperlink from one site to another. Google and other search engines' algorithms consider a site's popularity based on the quality and quantity of inbound links from relevant third party sites to help determine search positioning.
Internet: The Internet is the publicly accessible global system of interconnected computer networks that transmit data via a standardized Internet Protocol. See also World Wide Web.
J
JavaScript: A scripting language developed by Netscape. While it is often used for websites, it is also used to enable scripting access to objects embedded in other applications.
K
Keyword: Terms entered into the search field of a web search engine. See also organic search and PPC.
KPI: Key Performance Indicators. Key Performance Indicators are typically kept in dashboards and provide customers with an understanding of how the site is performing.
L
Latency: The average number of days between visits for a given visitor during a reporting period. For example, a visitor who returns on average every seven days has a latency of seven days.
Link: On a web page, text or an image that has been coded to take a browser from one page to another or from one site to another.
Log File: A file created by a web or proxy server which contains all of the access information regarding the activity on that server.
Long Tail: A term coined by Chris Anderson in an October 2004 Wired magazine article to describe the niche strategy of certain businesses such as Amazon.com or Netflix. In relation to search engine marketing (SEM), the Long Tail refers to keyword phrases that are highly detailed and specific and may each generate low volumes of searches and traffic, but together add up to a majority of traffic for sites with deep content or many product SKUs.
LTV: Life-Time Value. Life-Time Value is a metric used to describe the value a specific customer has over the life of their relationship with you.
M
Meta Search Engine: A search engine that gets listings from two or more other search engines, rather than through its own efforts.
Meta Tags: Information placed in a web page not intended for users to see but instead which typically passes information to search engine crawlers, browser software and some other applications.
Meta Description Tag: Allows page authors to say how they would like their pages described when listed by search engines. Not all search engines use the tag.
Meta Keywords Tag: Allows page authors to add text to a page to help with the search engine ranking process.
Meta Robots Tag: Allows page authors to keep their web pages from being indexed by search engines, especially helpful for those who cannot create robots.txt files. The Robots Exclusion page provides official details.
Metrics: Metrics are a system of parameters or ways of quantitative assessment of a process that is to be measured, along with the processes to carry out such measurement. Metrics define what is to be measured.
Mobile Search: An evolving branch of information retrieval services that is centered on the convergence of mobile platforms and mobile handsets or other mobile devices. The services allow users to find mobile content interactively on mobile websites, and mobile content shows a media shift toward mobile multimedia.
Multivariate Testing: A process by which more than one component of a website may be tested in a live environment. It can be thought of in simple terms as numerous split tests or A/B tests performed on one page at the same time.
N
Navigation: The act of moving from location to location within a web site, or between web sites, accomplished by clicking on links. Navigation can also refer to the overall structure of the links on the site, comprising the paths available to the visitor.
Non-referrals: Visitors who arrive at a site by typing a domain into an address bar, using a bookmark, or clicking on an emailed URL.
O
Online Reputation Management (ORM): The act of monitoring, addressing or mitigating undesirable search engine results or mentions in online media for a company or product. Techniques include generating new content and creating posts on existing content.
OpenSearch: A collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. Originally developed by Amazon and recently adopted by Yahoo!, OpenSearch relies on abstract-based microformats (dataRSS, eRDF, FOAF, GeoRSS, hCard, hEvent, hReview, hAtom, MediaRSS, RDFa, XFN, etc.) to integrate syndicated content into search results.
Opt-in: This permission-based email communication requires customers to verify the opt-in method before their e-mail addresses can be used to communicate with them.
Organic/Natural Listings: Listings that search engines do not sell (unlike paid listings). Instead, sites appear solely because a search engine has deemed it editorially important for them to be included, regardless of payment. Paid inclusion content is also often considered "organic" even though it is paid for. This is because that content usually appears intermixed with unpaid organic results.
Organic Search: A type of search in which web users find sites having unpaid listings, as opposed to using the pay-per-click advertisement listings displayed among the search results.
Outbound Links: Links on a particular web page leading to other web pages on a different domain.
P
Page: A document provided by the server, including HTML, scripts, and text files. Images, sound files and video are not considered pages. Documents are defined by the system administrator, but generally include all static content, such as complete html pages. Dynamic pages are created with variables and do not exist anywhere in a static form. Forms are scripted pages which get information from a visitor that gets passed back to the server.
Page Tag: A piece of JavaScript code embedded on a web page and executed by the browser when the page is viewed. See also log files.
Page View: Generally defined as a request to load a single page of a website. On the web, a page request would result from a web surfer clicking on a link on another page that points to the page in question. See also hit.
Parameters: Name=value pairs located in the URL after a question mark. Each consists of a parameter name followed by an equal sign and a value.
Path: A path is the click pattern a visitor uses as they traverse through multiple pages.
PPC: Pay Per Click, or paid search, is advertising on search keywords in which you pay a certain amount each time a customer clicks on your listing for that term to get to your site. See also organic search.
Q
Query: A question or inquiry used to find answers about certain metrics.
Query Parameter: An individual piece of a query string consisting of a parameter name and a value for the parameter.
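The name=value structure described under Parameters and Query Parameter can be pulled apart with Python's standard `urllib.parse` module. This is a minimal sketch; the URL and its parameter names (`keyword`, `campaign`, `page`) are made up for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical URL; everything after the question mark is the query string.
url = "http://www.xyz.com/search?keyword=shoes&campaign=spring&page=2"

query_string = urlsplit(url).query

# Each query parameter is one name=value pair.
params = parse_qsl(query_string)

for name, value in params:
    print(f"{name} = {value}")

# A dict is convenient when each name appears only once.
lookup = dict(params)
print(lookup["campaign"])
```

Values come back as strings, so a numeric parameter such as `page` would need an explicit conversion before any arithmetic.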
R
Reach: The size of the audience reading, viewing, hearing, or interacting with a message in a given period of time. Reach can be understood as either an absolute number, or a fraction of a population.
Rear-View Mirror Metrics: Metrics that measure what has already occurred. For example, campaign response metrics tell you how a campaign performed.
Recency: The number of days since a visitor's most recent visit during a reporting period.
Referrals: The location that visitors come from, particularly the sites, search engines or directories.
Relationship Marketing: Relationship marketing is a type of marketing that traces its roots to direct response marketing. It emphasizes building long-term relationships with customers rather than individual transactions. It requires understanding customer needs as they go through life cycles of interacting and purchasing from organizations, and requires that marketers accurately determine customer intent in order to provide them the right message at the right time.
RFM Analysis: Recency, Frequency, Monetary analysis.
ROI: Return on Investment
Robot: An automated process that performs mundane, repeatable tasks to provide information. Search engine robots or bots provide such functions, cataloging the internet for searchers to find information.
RSS: Really Simple Syndication is a type of web syndication used by news sites and weblogs which provides summaries of information with links to the complete content.
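The Recency, Frequency, and Latency definitions in this glossary, together with RFM Analysis, can be sketched for a single visitor as follows. The function `rfm_summary`, the visit dates, and the order values are hypothetical. Two assumptions are made: recency is measured from the last visit to the end of the reporting period, and the monetary component is the total order value.

```python
from datetime import date


def rfm_summary(visit_dates, order_values, period_end):
    """Summarize one visitor's activity within a reporting period."""
    visits = sorted(visit_dates)
    # Days between consecutive visits.
    gaps = [(later - earlier).days for earlier, later in zip(visits, visits[1:])]
    return {
        # Days since the most recent visit (assumed: counted to period end).
        "recency_days": (period_end - visits[-1]).days,
        # Number of visits during the period.
        "frequency": len(visits),
        # Average days between visits; undefined with fewer than two visits.
        "latency_days": sum(gaps) / len(gaps) if gaps else None,
        # Assumed monetary measure: total value of orders in the period.
        "monetary": sum(order_values),
    }


# Hypothetical visitor who returns once a week.
visits = [date(2024, 3, 1), date(2024, 3, 8), date(2024, 3, 15), date(2024, 3, 22)]
summary = rfm_summary(visits, order_values=[20.0, 35.5], period_end=date(2024, 3, 31))
print(summary)
```

Running the same summary for every visitor and then bucketing the three values is the usual starting point for building customer segments.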
S
Sampling: In statistics, the selection of individual observations intended to yield knowledge about a population, especially for the purposes of statistical inference.
Search Engine: A program or service that helps users find information on the web or in a specialized database. Web search engines generally have paid listings and organic listings. Organic listings typically come from crawling the web, though often human-powered directory listings are also optionally offered. Top tier search engines include Google, MSN, Teoma and Yahoo!
Search Engine Marketing (SEM): The act of marketing a web site via search engines, whether by improving rank in organic listings (search engine optimization), purchasing paid listings (PPC management) or a combination of these and other search engine-related activities (e.g., affiliate programs, shopping feeds or link development).
Search Engine Optimization (SEO): The act of altering a web site so that it does well in the organic, crawler-based listings of search engines. In the past, the term has also been used for any type of search engine marketing activity, though now the term search engine marketing is more commonly used as an umbrella term.
Search Terms: The words (or phrase) a searcher enters into a search engine's search box. Also used to refer to the terms a search engine marketer hopes a particular page will be found for. Also called keywords, query terms or query.
Segment: A grouping of customers, defined by website activity or other data, which can be used to target them effectively.
Social Media Marketing (SMM): A form of internet marketing which seeks to achieve branding and marketing communication goals through the participation in various social media networks (MySpace, Facebook, LinkedIn), social bookmarking (Digg, Stumbleupon), social media sharing (Flickr, YouTube), review/ratings sites (ePinions, BizRate), blogs, forums, news aggregators and virtual 3D networks (SecondLife, ActiveWorlds). Each social media site can be optimized to generate awareness or traffic.
Social Media Optimization (SMO): A set of methods for generating publicity through social media, online communities and community websites. Methods of SMO include adding RSS feeds, adding a "Digg This" button, blogging and incorporating third party community functionalities like Flickr photo slides and galleries or YouTube videos. Social media optimization is a form of search engine marketing.
Spam: Any search engine marketing method that a search engine deems to be detrimental to its efforts to deliver relevant, quality search results. Some search engines have written guidelines about what they consider to be spamming, but ultimately any activity a particular search engine deems harmful may be considered spam, whether or not there are published guidelines against it. Examples of spam include the creation of nonsensical doorway pages designed to please search engine algorithms rather than human visitors, or a heavy repetition of search terms on a page to increase keyword density.
Spider: An automated software program that gathers pages from the Internet.
Submission: The act of submitting a URL for inclusion in a search engine's index. Unless done through paid inclusion, submission generally does not guarantee listing. In addition, submission does not help with rank improvement on crawler-based search engines unless search engine optimization efforts have been implemented. Submission can be done manually (i.e., you fill out an online form and submit) or automatically, where a software program or online service processes the forms behind the scenes.
Suffix: The last part of a domain that can be used to identify the type of organization or location of a site.
T
Third-party cookie: Hosted web analytics services track visitor behavior by inserting a small piece of tracking code onto each page of a site. Because the cookie is served by an analytics vendor rather than your own site, the cookie is considered third-party.
Traffic: On the web, traffic refers to the amount of data sent and received by visitors to a website.
U
URL: A Uniform Resource Locator is a means of identifying an exact location on the Internet.
User Agent: Fields in an extended web server log file identifying the browser and platform used by a visitor.
User Session: A period of activity (all hits) for one user of a website. A unique user is determined by the IP address or cookie. Typically, a user session is terminated when a user is inactive for more than 30 minutes.
Unique Visitors: A measure captured by web analytics solutions that tracks the interactions a single user has with a website over time.
V
Viral Marketing: Any marketing technique that induces Web sites or users to pass on a marketing message to other sites or users, creating a potentially exponential growth in the message's visibility and effect.
Visitor: Similar to unique visitor, visitor refers to an individual that visits a website. A visitor or unique visitor can have multiple visits.
Visitor Session: Interaction by a site visitor. The session ends when the visitor leaves the site.
Visit: A visit is an interaction a unique visitor has with a website over a specified period of time or activity. If a visitor has left a site or has not executed a click within 30 minutes, the visit session will terminate.
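The 30-minute inactivity rule described under User Session and Visit can be sketched as a small grouping routine. The function `split_into_visits` and the timestamps are hypothetical. The sketch assumes all hits already belong to one visitor (identified, for example, by cookie) and starts a new visit only when the gap between consecutive hits exceeds the timeout.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_visits(hit_times, timeout=SESSION_TIMEOUT):
    """Group one visitor's hit timestamps into visits using an inactivity timeout."""
    visits = []
    for hit in sorted(hit_times):
        # Continue the current visit if the visitor was inactive
        # for no more than the timeout; otherwise start a new visit.
        if visits and hit - visits[-1][-1] <= timeout:
            visits[-1].append(hit)
        else:
            visits.append([hit])
    return visits


# Hypothetical hits from a single visitor on one morning.
hits = [
    datetime(2024, 3, 1, 9, 0),
    datetime(2024, 3, 1, 9, 10),
    datetime(2024, 3, 1, 9, 35),
    datetime(2024, 3, 1, 10, 30),  # 55 minutes after the previous hit
    datetime(2024, 3, 1, 10, 40),
]

for number, visit in enumerate(split_into_visits(hits), start=1):
    print(f"visit {number}: {len(visit)} hits, started {visit[0]:%H:%M}")
```

One visitor can therefore account for several visits in a day, which is why visit counts are usually higher than unique visitor counts.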
W
W3C: World Wide Web Consortium develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential.
Web 2.0: The use of World Wide Web technology and web design that aims to facilitate creativity, information sharing, and, most notably, collaboration among users. These concepts have led to the development and evolution of web-based communities and hosted services, such as social-networking sites, wikis, blogs, and folksonomies.
Web Analytics: The measurement of data as it relates to an Internet site, including the behavior of visitors, the amount of traffic, the conversion rates, web server performance, user experience, and other information, in order to understand and prove results and continually improve the performance of a site toward a set of objectives.
Website: A website is a collection of web pages on a particular domain name or sub-domain on the World Wide Web. Usually it is made up of a set of web pages created using HTML and accessible via HTTP.
What if: A type of analysis that allows an end-user to pose hypothetical situations against their data to model or predict outcomes.
World Wide Web: Also called the web, this is a global information space through which people can communicate via computers connected to the Internet. Some people use "internet" and "the web" interchangeably, even though the web is a service that operates over the internet.
X
XML: Extensible Markup Language is a World Wide Web Consortium (W3C) recommended general-purpose markup language for creating special-purpose markup languages, capable of describing many different kinds of data.
XML Feeds: A form of paid inclusion where a search engine is "fed" information about pages via XML, rather than gathering that information by crawling the actual pages. Marketers can pay to have their pages included in a spider-based search index, either annually per URL or on a CPC basis, using an XML document that represents each page on the client site. New media types are being introduced into paid inclusion, including graphics, video, audio, and rich media. These feeds are commonly used for Shopping Feeds.
Z
Zero Latency: Latency is a time delay between the moment something is started, and the moment one of the effects of that event begins. When there is no time lapse between the event and the effect, it's called zero latency. In analytics, this term is used to describe instantaneous receipt of data and the ability to analyze and act on that data.
Zero-page Visit: A visit that included no page views. This is possible if a visit consisted of at least one request for a non-page file (such as a graphic) but no requests for page files (such as .htm, .asp, .jsp, or .cfm).

## Statistics Formulas

Notation
Capitalization
In general, capital letters refer to population attributes (i.e., parameters); and lower-case letters refer to sample attributes (i.e., statistics). For example,
- P refers to a population proportion; and p, to a sample proportion.
- X refers to a set of population elements; and x, to a set of sample elements.
- N refers to population size; and n, to sample size.
Greek vs. Roman Letters
Like capital letters, Greek letters refer to population attributes. Their sample counterparts, however, are usually Roman letters. For example,
- μ refers to a population mean; and x̄, to a sample mean.
- σ refers to the standard deviation of a population; and s, to the standard deviation of a sample.
Population Parameters
- μ refers to a population mean.
- σ refers to the standard deviation of a population.
- σ² refers to the variance of a population.
- P refers to the proportion of population elements that have a particular attribute.
- Q refers to the proportion of population elements that do not have a particular attribute, so Q = 1 - P.
- ρ is the population correlation coefficient, based on all of the elements from a population.
- N is the number of elements in a population.
Sample Statistics
- x̄ refers to a sample mean.
- s refers to the standard deviation of a sample.
- s² refers to the variance of a sample.
- p refers to the proportion of sample elements that have a particular attribute.
- q refers to the proportion of sample elements that do not have a particular attribute, so q = 1 - p.
- r is the sample correlation coefficient, based on all of the elements from a sample.
- n is the number of elements in a sample.
Simple Linear Regression
- β0 is the intercept constant in a population regression line.
- β1 is the regression coefficient (i.e., slope) in a population regression line.
- R² refers to the coefficient of determination.
- b0 is the intercept constant in a sample regression line.
- b1 refers to the regression coefficient in a sample regression line (i.e., the slope).
- sb1 refers to the standard error of the slope of a regression line.
Probability
- P(A) refers to the probability that event A will occur.
- P(A|B) refers to the conditional probability that event A occurs, given that event B has occurred.
- P(A') refers to the probability of the complement of event A.
- P(A ∩ B) refers to the probability of the intersection of events A and B.
- P(A ∪ B) refers to the probability of the union of events A and B.
- E(X) refers to the expected value of random variable X.
- b(x; n, P) refers to binomial probability.
- b*(x; n, P) refers to negative binomial probability.
- g(x; P) refers to geometric probability.
- h(x; N, n, k) refers to hypergeometric probability.
Counting
- n! refers to the factorial value of n.
- nPr refers to the number of permutations of n things taken r at a time.
- nCr refers to the number of combinations of n things taken r at a time.
Set Theory
- A ∩ B refers to the intersection of events A and B.
- A ∪ B refers to the union of events A and B.
- {A, B, C} refers to the set of elements consisting of A, B, and C.
- ∅ (also written { }) refers to the null set.
Hypothesis Testing
- H0 refers to a null hypothesis.
- H1 or Ha refers to an alternative hypothesis.
- α refers to the significance level.
- β refers to the probability of committing a Type II error.
Random Variables
- Z or z refers to a standardized score, also known as a z score.
- zα refers to the standardized score that has a cumulative probability equal to 1 - α.
- tα refers to the t score that has a cumulative probability equal to 1 - α.
- fα refers to an f statistic that has a cumulative probability equal to 1 - α.
- fα(v1, v2) is an f statistic with a cumulative probability of 1 - α, and v1 and v2 degrees of freedom.
- χ² refers to a chi-square statistic.
Special Symbols
- Σ is the summation symbol, used to compute sums over a range of values.
- Σx or Σxi refers to the sum of a set of n observations. Thus, Σxi = Σx = x1 + x2 + . . . + xn.
- sqrt refers to the square root function. Thus, sqrt(4) = 2 and sqrt(25) = 5.
- Var(X) refers to the variance of the random variable X.
- SD(X) refers to the standard deviation of the random variable X.
- SE refers to the standard error of a statistic.
- ME refers to the margin of error.
- DF refers to the degrees of freedom.

Formulas
Parameters
- Population mean = μ = ( Σ Xi ) / N
- Population standard deviation = σ = sqrt [ Σ ( Xi - μ )² / N ]
- Population variance = σ² = Σ ( Xi - μ )² / N
- Variance of population proportion = σP² = PQ / n
- Standardized score = Z = (X - μ) / σ
- Population correlation coefficient = ρ = [ 1 / N ] * Σ { [ (Xi - μX) / σx ] * [ (Yi - μY) / σy ] }

Statistics
Unless otherwise noted, these formulas assume simple random sampling.
- Sample mean = x̄ = ( Σ xi ) / n
- Sample standard deviation = s = sqrt [ Σ ( xi - x̄ )² / ( n - 1 ) ]
- Sample variance = s² = Σ ( xi - x̄ )² / ( n - 1 )
- Variance of sample proportion = sp² = pq / (n - 1)
- Pooled sample proportion = p = (p1 * n1 + p2 * n2) / (n1 + n2)
- Pooled sample standard deviation = sp = sqrt { [ (n1 - 1) * s1² + (n2 - 1) * s2² ] / (n1 + n2 - 2) }
- Sample correlation coefficient = r = [ 1 / (n - 1) ] * Σ { [ (xi - x̄) / sx ] * [ (yi - ȳ) / sy ] }
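
As a rough illustration of how the sample formulas differ from their population counterparts, the following Python sketch computes them by hand and cross-checks two of them against the standard library. The data values and variable names are hypothetical, chosen only for the example.

```python
import math
import statistics

# Hypothetical paired observations, used only for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 9.0]

n = len(x)
x_bar = sum(x) / n  # sample mean of x
y_bar = sum(y) / n  # sample mean of y

# Sample variance and standard deviation divide by n - 1.
s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_x = math.sqrt(s2_x)
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Population variance divides by N (here the same data treated as a full population).
sigma2_x = sum((xi - x_bar) ** 2 for xi in x) / n

# Sample correlation: sum of products of standardized scores, divided by n - 1.
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

# Cross-check the hand-written formulas against the statistics module.
assert math.isclose(s2_x, statistics.variance(x))
assert math.isclose(sigma2_x, statistics.pvariance(x))

print(x_bar, s2_x, sigma2_x, r)
```

The only difference between the two variance lines is the divisor, which is the n - 1 degrees-of-freedom adjustment.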

Simple Linear Regression
- Simple linear regression line: ŷ = b0 + b1x
- Regression coefficient = b1 = Σ [ (xi - x̄) (yi - ȳ) ] / Σ [ (xi - x̄)² ]
- Regression intercept = b0 = ȳ - b1 * x̄
- Regression coefficient = b1 = r * (sy / sx)
- Standard error of regression slope = sb1 = sqrt [ Σ(yi - ŷi)² / (n - 2) ] / sqrt [ Σ(xi - x̄)² ]
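
The regression formulas can be sketched in a few lines of Python. The function name `simple_linear_regression` and the data are assumptions made for this example, not part of the original formula sheet.

```python
import math


def simple_linear_regression(x, y):
    """Return (b0, b1, se_b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                       # sum of squared x deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of cross products
    b1 = sxy / sxx                # slope
    b0 = y_bar - b1 * x_bar       # intercept
    # Residual sum of squares, divided by n - 2 degrees of freedom.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2)) / math.sqrt(sxx)
    return b0, b1, se_b1


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, se_b1 = simple_linear_regression(x, y)
print(b0, b1, se_b1)
```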

Counting
- n factorial: n! = n * (n-1) * (n - 2) * . . . * 3 * 2 * 1. By convention, 0! = 1.
- Permutations of n things, taken r at a time: nPr = n! / (n - r)!
- Combinations of n things, taken r at a time: nCr = n! / [ r! * (n - r)! ] = nPr / r!
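
For a quick check of the counting formulas, Python's standard `math` module (version 3.8 or later) provides `factorial`, `perm`, and `comb`. The values of `n` and `r` below are arbitrary.

```python
import math

n, r = 5, 2  # arbitrary illustrative values

n_factorial = math.factorial(n)   # n!
permutations = math.perm(n, r)    # ordered selections of r items from n
combinations = math.comb(n, r)    # unordered selections of r items from n

# The library results agree with the factorial formulas.
assert permutations == math.factorial(n) // math.factorial(n - r)
assert combinations == permutations // math.factorial(r)
assert math.factorial(0) == 1     # 0! = 1 by convention

print(n_factorial, permutations, combinations)
```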

Probability
- Rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- Rule of multiplication: P(A ∩ B) = P(A) P(B|A)
- Rule of subtraction: P(A') = 1 - P(A)

Random Variables
In the following formulas, X and Y are random variables, and a and b are constants.
- Expected value of X = E(X) = μx = Σ [ xi * P(xi) ]
- Variance of X = Var(X) = σ² = Σ [ xi - E(x) ]² * P(xi) = Σ [ xi - μx ]² * P(xi)
- Normal random variable = z-score = z = (X - μ)/σ
- Chi-square statistic = χ² = [ ( n - 1 ) * s² ] / σ²
- f statistic = f = [ s1²/σ1² ] / [ s2²/σ2² ]
- Expected value of sum of random variables = E(X + Y) = E(X) + E(Y)
- Expected value of difference between random variables = E(X - Y) = E(X) - E(Y)
- Variance of the sum of independent random variables = Var(X + Y) = Var(X) + Var(Y)
- Variance of the difference between independent random variables = Var(X - Y) = Var(X) + Var(Y)
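
The expected-value and variance rules can be checked numerically by enumerating a small discrete distribution. The sketch below assumes X and Y are two independent fair six-sided dice, a hypothetical example. It shows that the variances of independent variables add for both the sum and the difference.

```python
import math

# Hypothetical example: X and Y are independent fair six-sided dice.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6


def expected_value(vals, ps):
    """E(X) = sum of x_i * P(x_i)."""
    return sum(v * p for v, p in zip(vals, ps))


def variance(vals, ps):
    """Var(X) = sum of (x_i - E(X))^2 * P(x_i)."""
    mu = expected_value(vals, ps)
    return sum((v - mu) ** 2 * p for v, p in zip(vals, ps))


def combine(op):
    """Outcomes and probabilities of op(X, Y), enumerating all pairs."""
    outcomes, weights = [], []
    for x, px in zip(values, probs):
        for y, py in zip(values, probs):
            outcomes.append(op(x, y))
            weights.append(px * py)  # independence: joint probability is the product
    return outcomes, weights


e_x = expected_value(values, probs)
var_x = variance(values, probs)
e_sum = expected_value(*combine(lambda x, y: x + y))
var_sum = variance(*combine(lambda x, y: x + y))
e_diff = expected_value(*combine(lambda x, y: x - y))
var_diff = variance(*combine(lambda x, y: x - y))

print(e_x, var_x, e_sum, var_sum, e_diff, var_diff)
```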

Sampling Distributions
- Mean of sampling distribution of the mean = μx̄ = μ
- Mean of sampling distribution of the proportion = μp = P
- Standard deviation of proportion = σp = sqrt[ P * (1 - P)/n ] = sqrt( PQ / n )
- Standard deviation of the mean = σx̄ = σ/sqrt(n)
- Standard deviation of difference of sample means = σd = sqrt[ (σ1² / n1) + (σ2² / n2) ]
- Standard deviation of difference of sample proportions = σd = sqrt{ [P1(1 - P1) / n1] + [P2(1 - P2) / n2] }

Standard Error
- Standard error of proportion = SEp = sp = sqrt[ p * (1 - p)/n ] = sqrt( pq / n )
- Standard error of difference for proportions = SEp = sp = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
- Standard error of the mean = SEx̄ = sx̄ = s/sqrt(n)
- Standard error of difference of sample means = SEd = sd = sqrt[ (s1² / n1) + (s2² / n2) ]
- Standard error of difference of paired sample means = SEd = sd = { sqrt [ Σ(di - d̄)² / (n - 1) ] } / sqrt(n)
- Pooled sample standard error = spooled = sqrt { [ (n1 - 1) * s1² + (n2 - 1) * s2² ] / (n1 + n2 - 2) }
- Standard error of difference of sample proportions = sd = sqrt{ [p1(1 - p1) / n1] + [p2(1 - p2) / n2] }
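
Several of the standard-error formulas translate directly into small helper functions. The function names below are hypothetical and the inputs in the usage lines are made-up summary statistics.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def se_mean(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)


def se_diff_means(s1, n1, s2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation, weighting each variance by its n - 1."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


# Made-up summary statistics for illustration.
print(se_proportion(0.5, 100))
print(se_mean(10.0, 25))
print(se_diff_means(3.0, 9, 4.0, 16))
print(pooled_sd(2.0, 10, 2.0, 20))
```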

Discrete Probability Distributions
- Binomial formula: P(X = x) = b(x; n, P) = nCx * P^x * (1 - P)^(n - x) = nCx * P^x * Q^(n - x)
- Mean of binomial distribution = μx = n * P
- Variance of binomial distribution = σx² = n * P * ( 1 - P )
- Negative Binomial formula: P(X = x) = b*(x; r, P) = (x-1)C(r-1) * P^r * (1 - P)^(x - r)
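- Mean of negative binomial distribution = μx = r / P, where x counts the trials needed to reach the rth success (the mean number of failures before the rth success is rQ / P)
- Variance of negative binomial distribution = σx² = r * Q / P²
- Geometric formula: P(X = x) = g(x; P) = P * Q^(x - 1)
- Mean of geometric distribution = μx = 1 / P, where x counts the trials needed to reach the first success (the mean number of failures before the first success is Q / P)
- Variance of geometric distribution = σx² = Q / P²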
- Hypergeometric formula: P(X = x) = h(x; N, n, k) = [ kCx ] [ (N-k)C(n-x) ] / [ NCn ]
- Mean of hypergeometric distribution = μx = n * k / N
- Variance of hypergeometric distribution = σx² = n * k * ( N - k ) * ( N - n ) / [ N² * ( N - 1 ) ]
- Poisson formula: P(x; μ) = e^(-μ) * μ^x / x!
- Mean of Poisson distribution = μx = μ
- Variance of Poisson distribution = σx² = μ
- Multinomial formula: P = [ n! / ( n1! * n2! * . . . * nk! ) ] * ( p1^n1 * p2^n2 * . . . * pk^nk )
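
The probability mass functions can be written out with `math.comb` as a minimal sketch. The function names are hypothetical. The negative binomial and geometric versions here assume x counts trials (including the final success), matching the formulas as written.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x): exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def negative_binomial_pmf(x, r, p):
    """P(X = x): the r-th success occurs on trial x."""
    return math.comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)


def geometric_pmf(x, p):
    """P(X = x): the first success occurs on trial x."""
    return p * (1 - p) ** (x - 1)


def hypergeometric_pmf(x, N, n, k):
    """P(X = x): x successes in n draws without replacement from N items, k of them successes."""
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)


def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


# The binomial probabilities sum to 1 and their mean equals n * P.
n, p = 10, 0.3
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
print(total, mean)
```

With r = 1 the negative binomial formula reduces to the geometric formula, which is a convenient sanity check.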

Linear Transformations
For the following formulas, assume that Y is a linear transformation of the random variable X, defined by the equation: Y = aX + b.
- Mean of a linear transformation = E(Y) = Ȳ = aX̄ + b.
- Variance of a linear transformation = Var(Y) = a² * Var(X).
- Standardized score = z = (x - μx) / σx.
- t-score = t = (x̄ - μx) / [ s/sqrt(n) ].

Estimation
- Confidence interval: Sample statistic ± Critical value * Standard error of statistic
- Margin of error = (Critical value) * (Standard deviation of statistic)
- Margin of error = (Critical value) * (Standard error of statistic)

Hypothesis Testing
- Standardized test statistic = (Statistic - Parameter) / (Standard deviation of statistic)
- One-sample z-test for proportions: z-score = z = (p - P0) / sqrt( p * q / n )
- Two-sample z-test for proportions: z-score = z = [ (p1 - p2) - d ] / SE
- One-sample t-test for means: t-score = t = (x̄ - μ) / SE
- Two-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - d ] / SE
- Matched-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - D ] / SE = (d̄ - D) / SE
- Chi-square test statistic = χ² = Σ[ (Observed - Expected)² / Expected ]
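
Two of these test statistics are simple enough to sketch directly. The function names and the counts and summary values below are hypothetical. Turning a statistic into a p-value additionally requires the matching distribution and degrees of freedom, which this sketch does not cover.

```python
import math


def one_sample_t(x_bar, mu0, s, n):
    """t = (sample mean - hypothesized mean) / standard error of the mean."""
    return (x_bar - mu0) / (s / math.sqrt(n))


def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up inputs for illustration.
t = one_sample_t(x_bar=52.0, mu0=50.0, s=4.0, n=16)
chi2 = chi_square_statistic(observed=[50, 30, 20], expected=[40, 40, 20])
print(t, chi2)
```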

Degrees of Freedom

The correct formula for degrees of freedom (DF) depends on the situation (the nature of the test statistic, the number of samples, underlying assumptions, etc.).

- One-sample t-test: DF = n - 1
- Two-sample t-test: DF = (s1²/n1 + s2²/n2)² / { [ (s1² / n1)² / (n1 - 1) ] + [ (s2² / n2)² / (n2 - 1) ] }
- Two-sample t-test, pooled standard error: DF = n1 + n2 - 2
- Simple linear regression, test slope: DF = n - 2
- Chi-square goodness of fit test: DF = k - 1
- Chi-square test for homogeneity: DF = (r - 1) * (c - 1)
- Chi-square test for independence: DF = (r - 1) * (c - 1)
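
The unpooled two-sample formula, commonly called the Welch-Satterthwaite approximation, is the least obvious entry in this list, so a short sketch may help. The function names are hypothetical. When both samples have the same size and standard deviation, the result matches the pooled value n1 + n2 - 2; otherwise it is smaller.

```python
def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for a two-sample t-test with unequal variances."""
    v1 = s1 ** 2 / n1  # variance of the first sample mean
    v2 = s2 ** 2 / n2  # variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def pooled_df(n1, n2):
    """Degrees of freedom for a two-sample t-test with a pooled standard error."""
    return n1 + n2 - 2


# Made-up inputs for illustration.
print(welch_df(2.0, 10, 2.0, 10), pooled_df(10, 10))
print(welch_df(1.0, 10, 5.0, 20), pooled_df(10, 20))
```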

Sample Size
Below, the first two formulas find the smallest sample sizes required to achieve a fixed margin of error, using simple random sampling. The third formula assigns sample to strata, based on a proportionate design. The fourth formula, Neyman allocation, uses stratified sampling to minimize variance, given a fixed sample size. And the last formula, optimum allocation, uses stratified sampling to minimize variance, given a fixed budget.

- Mean (simple random sampling): n = { z² * σ² * [ N / (N - 1) ] } / { ME² + [ z² * σ² / (N - 1) ] }
- Proportion (simple random sampling): n = [ ( z² * p * q ) + ME² ] / [ ME² + z² * p * q / N ]
- Proportionate stratified sampling: nh = ( Nh / N ) * n
- Neyman allocation (stratified sampling): nh = n * ( Nh * σh ) / [ Σ ( Ni * σi ) ]
- Optimum allocation (stratified sampling): nh = n * [ ( Nh * σh ) / sqrt( ch ) ] / [ Σ ( Ni * σi ) / sqrt( ci ) ]
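
The sample-size formula for a mean and the three allocation rules can be sketched as follows. The function names and input values are hypothetical, and rounding the required sample size up to a whole number is a choice made for this example. The allocation functions return unrounded values.

```python
import math


def sample_size_mean(z, sigma, me, N):
    """Smallest n for a target margin of error, simple random sampling from N elements."""
    numerator = z ** 2 * sigma ** 2 * (N / (N - 1))
    denominator = me ** 2 + z ** 2 * sigma ** 2 / (N - 1)
    return math.ceil(numerator / denominator)  # round up to a whole observation


def proportionate_allocation(n, sizes):
    """n_h = (N_h / N) * n for each stratum."""
    N = sum(sizes)
    return [n * Nh / N for Nh in sizes]


def neyman_allocation(n, sizes, sds):
    """n_h proportional to N_h * sigma_h."""
    weights = [Nh * sh for Nh, sh in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


def optimum_allocation(n, sizes, sds, costs):
    """n_h proportional to N_h * sigma_h / sqrt(c_h), where c_h is the per-unit cost."""
    weights = [Nh * sh / math.sqrt(ch) for Nh, sh, ch in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Made-up strata for illustration: sizes, standard deviations, unit costs.
sizes, sds, costs = [600, 400], [10.0, 30.0], [1.0, 4.0]
print(sample_size_mean(z=1.96, sigma=15.0, me=2.0, N=10000))
print(proportionate_allocation(100, sizes))
print(neyman_allocation(100, sizes, sds))
print(optimum_allocation(100, sizes, sds, costs))
```

With equal unit costs, optimum allocation reduces to Neyman allocation; with equal standard deviations as well, both reduce to proportionate allocation.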