2010

5:30 PM

By statisticalconcepts

In: Statistics

Degrees of Freedom (df)

Degrees of freedom denotes the number of values that a statistician is free to choose. The concept rests on the idea that, once a constraint is imposed, one cannot freely choose every value.

The concept can be explained by an analogy:

X+Y = 10 (1)

In the above equation you have freedom to choose a value for X or Y but not both because when you choose one, the other is fixed. If you choose 8 for X, then Y has to be 2. So the degree of freedom here is 1.

X+Y+Z = 15 (2)

In equation (2), one can choose values for two of the variables but not all three. You are free to choose 8 for X and 2 for Y; once you do, Z is fixed at 5. So the degrees of freedom here is 2. In many tests, df is calculated by subtracting 1 from the size of each group, but the method of calculating df varies with the test used.

“Degrees of freedom” refers to the number of scores that are free to vary. It is required when one is working with sample statistics, that is, estimates of population values. It is often abbreviated as df. It is the number of observations in the data collection that are free to vary after the sample statistics have been calculated. For example, in calculating a sample standard deviation, we must subtract the sample mean from each of the n data observations in order to get the deviations from the mean. But once we have completed the next-to-last subtraction, the final deviation is automatically determined, since the deviations prior to squaring must sum to zero. Therefore, the last deviation is not free to vary; only n-1 are free to vary. As a rule of thumb, every estimate of a population value used in a formula costs 1 degree of freedom.
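
The argument above can be checked with a minimal Python sketch. The scores are made up for illustration, and the variable names are assumptions of this sketch rather than anything from a particular library.

```python
# Illustrative scores (assumed for this sketch, not from a real data set).
scores = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(scores)
mean = sum(scores) / n

# Deviations from the sample mean always sum to zero.
deviations = [x - mean for x in scores]

# Knowing the first n-1 deviations therefore pins down the last one.
last_from_others = -sum(deviations[:-1])

# The sample variance divides the squared deviations by n-1, not n.
sample_variance = sum(d * d for d in deviations) / (n - 1)
```

Here `last_from_others` matches `deviations[-1]`, which is why only n-1 deviations are counted as free to vary.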

The practical impact of degrees of freedom shows up with small samples. Consider the t distribution, for example. With small samples (few degrees of freedom) it gets flatter; put another way, more of the area under the curve lies farther from the mean, so the region of rejection starts farther out. This makes using degrees of freedom a more conservative and accurate approach, since small samples tend to underestimate the population value.

Practical Example:

Consider a situation in which the scores that make up the distribution are unknown, and I tell you to guess at the first score in a distribution of 5 numbers. Your guesses will be wild because that score could be any number. The same is true for the second score, third score, and the fourth score. All of these scores have complete freedom to vary. But if I tell you that the first four scores in the distribution are 3, 4, 5, and 6, and I tell you the mean of the distribution (5, in this case), then the last score in this distribution has to be 7. In other words, if the mean is known, the missing score is determined by your knowledge of the other four. Therefore n-1 scores are free to vary. The (-1) is often called a “restriction.” Note that by making the denominator smaller (n-1 instead of n), the standard deviation becomes larger. A larger standard deviation means that the sampling distribution is flatter, and flatter means more values are farther from the mean. It’s tougher to find significance.
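
As a quick check of this example, here is a short Python sketch using the standard library's `statistics` module. The variable names are illustrative; `pstdev` divides by n and `stdev` divides by n-1.

```python
import statistics

known_scores = [3, 4, 5, 6]   # the four scores given in the example
n = 5                         # size of the whole distribution
mean = 5                      # the stated mean

# The total must equal n * mean, so the fifth score is whatever is left over.
missing_score = n * mean - sum(known_scores)

scores = known_scores + [missing_score]

# Dividing by n (pstdev) versus n - 1 (stdev): the smaller denominator
# gives the larger standard deviation.
sd_n = statistics.pstdev(scores)
sd_n_minus_1 = statistics.stdev(scores)
```

The missing score comes out as 7, and `sd_n_minus_1` is larger than `sd_n`, which is the more conservative value.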

In conclusion, when the statistical formula is concerned with description, degrees of freedom is n. When the formula is concerned with inference, some restrictions apply. The idea is to adjust for a small sample size’s tendency to underestimate the population parameter. As n gets larger, this becomes less of a problem because the distribution becomes less flat and more normal, but we still use the sample formula to calculate the statistic.

2:33 PM

By statisticalconcepts

In: Web Analytics

Web Analytics Terms

Abandonment: A measure of how often a visitor exits or leaves a conversion process on a website and does not return later during the session.

A/B Testing: A method of testing banner ads, emails and landing pages in which a baseline control sample is compared to a variety of single-variable test samples.

Accuracy: The ability of a measurement to match the actual value of the quantity being measured. Accuracy is the foundation upon which your web analytics should be built. If you can't trust that your data is accurate, you can't make confident decisions. In statistical terms, accuracy is the width of the confidence interval for a desired confidence level.

Acknowledgement Page: A page displayed after a visitor completes an action or transaction, for example a thank-you or receipt page. Acknowledgement pages are often important in Scenario Analysis, where they indicate a completed scenario.

Acquisition: The process of gaining customers through the means of different marketing strategies. For the purposes of web analytics, it often refers specifically to the process of attracting visitors to a web site.

ACT: After-Click Tracking is the recording of the activity path of a visitor to a site after they have clicked on an email link.

Actionable Data: Information that allows you to make a decision or can be made use of in any way.

Ad: A link that takes a visitor to a web site when clicked on, usually graphic or text.

Ad Click: A click on an advertisement on a website which takes a user to another site.

Ad Hoc Query: A non-standard inquiry posed to a database of information as the need arises.

Ad View: A web page that presents an ad. There may be more than one ad on an ad view. Once visitors have viewed an ad, they can click on it.

Affiliate Marketing: A method of promoting web businesses in which an affiliate is rewarded for providing customers. Compensation could be made based on a value for visits, subscriptions, leads, sales, and so on.

Aggregate Data: A summary of collected information which groups data together without individual-level statistics.

Algorithm: A mathematical formula used by search engines to determine which web sites in their database to present in search results, and in which order. While search engine algorithms change regularly, primary on-page factors include keyword density and source code optimization. The primary off-page factor is link popularity.

API: Application Programming Interface is a system that a computer or application supplies in order to allow requests for service to be made of it by other computer programs. APIs allow data to be exchanged between computer programs, and a standard software API method includes Open Database Connectivity (ODBC).

ASP: Active Server Pages are a set of software components that run on a web server and let developers build dynamic web pages.

Attrition: The erosion of your customer base over time. The opposite of customer retention.

Authentication: The technique by which access to Internet or intranet resources requires the user to enter a username and password as identification.

Average Lifetime Value: The average of the lifetime value of a visitor or multiple visitors during the reporting period, where each visitor's lifetime value is the total monetary value of a visitor's past orders since visitor tracking began.

Bandwidth: Measure, in kilobytes of data transferred, of the traffic on a site.

Banner Ad: An advertisement embedded on a web page, usually intended to drive traffic to a different website by linking to the advertiser's site. The Interactive Advertising Bureau (IAB) has gathered a standard set of banner ad sizes (Medium Rectangle, Rectangle, Leaderboard, Wide Skyscraper) into a set of guidelines called the Universal Ad Package.

Benchmark: A standard by which something can be measured or judged. For example, you benchmark your Key Performance Indicators to ensure everyone in your organization is measuring performance against the same goals.

Blog/Web Logs: A self-published, managed or maintained Web diary. Usually updated daily or weekly, blogs have historically been personal, but gained notoriety after the 2004 election as an influential media outlet. Companies now use blogs to extend their brand and improve their organic search visibility.

Bounce Rate: The percentage of entrances on a web page that result in an immediate exit from the web site.

Browser: A program used to locate and view HTML documents.

Business Intelligence: While some would claim it's an oxymoron, business intelligence refers to a category of software and tools designed to gather, store, analyze, and deliver data in a user-friendly format to help organizations make more informed business decisions. Software types include dashboarding, data mining, data warehouses, and other information systems.

Campaign Analysis: A measure that tracks activity originating from a marketing campaign, so you can compare your campaigns and evaluate their effectiveness.

Client: The browser used by a website visitor.

Client Error: An error that occurs because of an invalid request by the visitor's browser.

Cloaking: In terms of search engine marketing, this is the act of getting a search engine to record content for a URL that is different from what a searcher will ultimately see. It can be done in many technical ways. Several search engines have explicit rules against unapproved cloaking. Those violating these guidelines might find their pages penalized or banned from a search engine's index. As for approved cloaking, this generally only happens with search engines offering paid inclusion programs.

Click Fraud: A type of internet crime that occurs in pay per click online advertising when a person, automated script, or computer program imitates a legitimate user of a web browser clicking on an ad, for the purpose of generating a charge per click without having actual interest in the target of the ad's link.

Content Management System (CMS): A software platform that aids in the management of content on a Web site.

Contextual Link Ads/Inventory: To supplement their business models, certain text-link advertising networks (like Google) have expanded their network distribution to include "contextual inventory". Most vendors of "search engine traffic" have expanded the definition of Search Engine Marketing to include this contextual inventory. Contextual or content inventory is generated when listings are displayed on pages of Web sites (usually not search engines), where the written content on the page indicates to the ad-server that the page is a good match to specific keywords and phrases. Often this matching method is validated by measuring the number of times a viewer clicks on the displayed ad. These ads typically do not perform as well as traditional text ads on search engines, but the lower cost justifies the expense.

Conversion: An action that signifies a completion of a specified activity. For many sites, a user converts if they buy a product, sign up for a newsletter, or download a file. The conversion rate is the percentage of visitors who do convert. Cookie deletion can have an impact on your conversion rate: if cookies are being systematically deleted, repeat visitor rates will be under-counted and new visitor rates will be over-counted, thus skewing the conversion rate metric by which you analyze your site's overall effectiveness.

Conversion Funnel: The series of steps that move a visitor towards a specified conversion event, such as an order or registration signup. See also abandonment.

Conversion Rate: The relationship between visitors to a web site and actions considered to be a "conversion," such as a sale or request to receive more information. This metric is often expressed as a percentage.
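
A minimal Python sketch of how a conversion rate and step-by-step abandonment might be computed for a conversion funnel. The step names and visitor counts are hypothetical, and the helper `conversion_rate` is an assumption of this sketch, not part of any analytics product.

```python
# Hypothetical visitor counts at each step of a checkout funnel.
funnel = [("cart", 1000), ("shipping", 400), ("payment", 250), ("order placed", 200)]

def conversion_rate(conversions, visitors):
    """Conversions as a percentage of visitors (0.0 when there are no visitors)."""
    return 100.0 * conversions / visitors if visitors else 0.0

# Percentage of visitors lost between each pair of consecutive steps.
abandonment = {
    f"{a} -> {b}": 100.0 - conversion_rate(nb, na)
    for (a, na), (b, nb) in zip(funnel, funnel[1:])
}

# Overall conversion rate: completed orders relative to funnel entries.
overall = conversion_rate(funnel[-1][1], funnel[0][1])
```

With these made-up counts, the overall conversion rate is 20 percent and the largest drop-off is between the cart and shipping steps.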

Cookie: A text file that transmits information to a data collection facility via a 1x1 pixel GIF image request and includes a tracking ID that is used to identify returning visitors. Contrary to some industry speculation, cookies cannot be used for malicious purposes such as privacy tapping. See also first and third-party cookies.

Cost-per-Click (CPC): System where an advertiser pays an agreed amount for each click someone makes on a link leading to their web site. Also known as PPC or paid listings.

Cost-per-Thousand (CPM): System where an advertiser pays an agreed amount for the number of times their ad is seen by a consumer, regardless of the consumer's subsequent action. This term is heavily used in print, broadcasting and direct marketing, as well as with online banner ad sales. CPM stands for "cost per thousand," since ad views are often sold in blocks of 1,000. The M in CPM is the Roman numeral for 1,000 (from the Latin mille).

Crawler/Spider/Robot: Component of search engine that indexes web sites automatically. A search engine's crawler (also called a spider or robot), copies web page source code into its index database and follows links to other web pages.

CTR: Click Through Rate. A click through rate is the rate at which visitors "click through" from one website page or property to the next. A good indication of an ad's effectiveness.
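
To make the CPC, CPM and CTR entries concrete, here is a small Python sketch. The campaign figures and function names are hypothetical, and CTR is computed as clicks divided by impressions, which is a common working definition assumed here.

```python
def cpm_cost(impressions, cpm):
    """Cost when paying `cpm` for every 1,000 ad views."""
    return impressions / 1000 * cpm

def cpc_cost(clicks, cpc):
    """Cost when paying `cpc` for every click."""
    return clicks * cpc

def click_through_rate(clicks, impressions):
    """Clicks as a percentage of impressions (0.0 when there are none)."""
    return 100.0 * clicks / impressions if impressions else 0.0

# Hypothetical campaign figures.
impressions, clicks = 50_000, 400
ctr = click_through_rate(clicks, impressions)
cost_if_cpm = cpm_cost(impressions, cpm=2.00)
cost_if_cpc = cpc_cost(clicks, cpc=0.25)
```

With these numbers the two pricing models cost the same; a higher click through rate would make the CPM deal cheaper per click, and a lower one would favor CPC.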

Customer Segment: A powerful aspect of relationship marketing in which you target a sub-section or group of customers who share a specific trait or set of behaviors. See also demographics and psychographics.

Dashboard: A web analytics dashboard provides all of your critical metrics in one place to help you understand the health or performance of your business.

Data Warehouse: A logical collection of information gathered from many different operational databases and used to create business intelligence that supports business analysis activities and decision-making tasks. It is primarily a record of an enterprise's past transactional and operational information, stored in a database designed to favor efficient data analysis and reporting.

Demographics: The physical characteristics of human populations and segments of populations, often used to identify consumer markets. Demographics can include information such as age, gender, marital status, education, and geographic location. See also psychographics.

Directories: A type of search engine where listings are gathered or reviewed by humans, rather than by search engine crawlers. In directories, web sites are often reviewed, summarized in about 25 words and placed in a particular category. The largest and most popular directory site is Yahoo!

Domain: An area in the Internet specified by a URL address. The top-level domain is at the end after the dot and the second-level domain comes before it, and shows where in the top-level domain the address can be found. For example in www.xyz.com, ".com" is the top-level domain and "xyz" is the second level domain.
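
The www.xyz.com example can be sketched in Python as follows. The helper `split_domain` is hypothetical and deliberately naive: it assumes the top-level domain is the single label after the last dot, so it would not handle multi-part suffixes such as .co.uk.

```python
def split_domain(hostname):
    """Return (second_level, top_level) for a simple hostname like www.xyz.com."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError("expected at least two dot-separated labels")
    # Last label is the top-level domain; the one before it is second-level.
    return labels[-2], "." + labels[-1]

second_level, top_level = split_domain("www.xyz.com")
```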

Domain Name: The text name that corresponds to a numeric IP address of a computer on the Internet.

Doorway/Landing/Gateway/Bridge/Jump Pages: A web page created expressly in hopes of ranking well for a term in a search engine's organic/non-paid listings and which itself does not deliver much information to those viewing it. Instead, visitors will often see only some enticement on the doorway page leading them to other pages, or they may be seamlessly redirected to a real page within the existing web site. With cloaking, visitors may never see the doorway page at all. Several search engines have guidelines against doorway pages, though they are more commonly allowed in through paid inclusion programs.

E-commerce: The act of selling goods and services online via a standalone site or through an online auction center.

Email Bounce: The number of e-mails that were sent but never reached the intended receiver.

Entry Page: The first viewed page on a visitor's path through a site.

Exit Page: The last page viewed on a visitor's path through a site.

First Party Cookie: For most business models, first-party cookies are regarded as the most reliable method to measure visitor activity. Whereas a third-party cookie is usually set by an analytics vendor (an entity with whom the user does not have a relationship), a first-party cookie is set by the business, an organization with whom the Web site visitor has specifically chosen to do business. Because of this relationship, first-party cookies are deemed more secure by the user. See also cookies.

Frequency: The number of times a visitor has visited a site during a reporting period. Average Frequency is the average of frequencies of all the visitors during the reporting period. Frequency is a retention metric and is part of RFM (recency, frequency, monetary) analysis.

Hit: Any request for a file from a web server. A single page view likely generates multiple hits, as multiple image and text files are downloaded from the web server.

Home Page: The main page of a web site. The home page provides visitors with an overview and links to the rest of the site.

Impression: The display of an online advertisement (usually a banner ad) to a web site visitor.

Index: The collection of information (contained in a large database) a search engine has that searchers can query against. With crawler-based search engines, the index is typically copies of all the web pages they have found from crawling the web. With human-powered directories, the index contains the summaries of all web sites that have been categorized.

Inbound/Back Link: A text or graphical hyperlink from one site to another. Google and other search engines' algorithms consider a site's popularity based on the quality and quantity of inbound links from relevant third party sites to help determine search positioning.

Internet: The Internet is the publicly accessible global system of interconnected computer networks that transmit data via a standardized Internet Protocol. See also World Wide Web.

JavaScript: A scripting language developed by Netscape. While it is often used for websites, it is also used to enable scripting access to objects embedded in other applications.

Keyword: Terms entered into the search field of a web search engine. See also organic search and PPC.

KPI: Key Performance Indicators. Key Performance Indicators are typically kept in dashboards and provide customers with an understanding of how the site is performing.

Latency: The average number of days between visits for a given visitor during a reporting period. For example, those who visit on average every seven days.
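
One way latency might be computed for a single visitor, as a minimal Python sketch. The visit dates and the function name `latency_days` are hypothetical.

```python
from datetime import date

def latency_days(visit_dates):
    """Average number of days between consecutive visits (None if fewer than 2)."""
    ordered = sorted(visit_dates)
    if len(ordered) < 2:
        return None  # no gap between visits to measure
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical visitor who returns once a week during the reporting period.
visits = [date(2010, 3, 1), date(2010, 3, 8), date(2010, 3, 15), date(2010, 3, 22)]
latency = latency_days(visits)
```

For this visitor the latency is 7 days, matching the "every seven days" example.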

Link: On a web page, text or an image that has been coded to take a browser from one page to another or from one site to another.

Log File: A file created by a web or proxy server which contains all of the access information regarding the activity on that server.

Long Tail: First coined by Chris Anderson in an October 2004 Wired magazine article to describe the niche strategy of certain businesses such as Amazon.com or Netflix. In relation to search engine marketing (SEM), the Long Tail refers to the keyword phrases that are highly detailed and specific and may generate low volumes of searches and traffic, but add up to generate a majority of traffic for sites with deep content or product SKUs.

LTV: Life-Time Value. Life-Time Value is a metric used to describe the value a specific customer has over the life of their relationship with you.

Meta Search Engine: A search engine that gets listings from two or more other search engines, rather than through its own efforts.

Meta Tags: Information placed in a web page not intended for users to see but instead which typically passes information to search engine crawlers, browser software and some other applications.

Meta Description Tag: Allows page authors to say how they would like their pages described when listed by search engines. Not all search engines use the tag.

Meta Keywords Tag: Allows page authors to add text to a page to help with the search engine ranking process.

Meta Robots Tag: Allows page authors to keep their web pages from being indexed by search engines, especially helpful for those who cannot create robots.txt files. The Robots Exclusion page provides official details.

Metrics: Metrics are a system of parameters or ways of quantitative assessment of a process that is to be measured, along with the processes to carry out such measurement. Metrics define what is to be measured.

Mobile Search: An evolving branch of information retrieval services that is centered on the convergence of mobile platforms and mobile handsets or other mobile devices. The services allow users to find mobile content interactively on mobile websites, and mobile content shows a media shift toward mobile multimedia.

Multivariate Testing: A process by which more than one component of a website may be tested in a live environment. It can be thought of in simple terms as numerous split tests or A/B tests performed on one page at the same time.

Navigation: The act of moving from location to location within a web site, or between web sites, accomplished by clicking on links. Navigation can also refer to the overall structure of the links on the site, comprising the paths available to the visitor.

Non-referrals: Visitors who arrive at a site by typing a domain into an address bar, using a bookmark, or clicking on an emailed URL.

Online Reputation Management (ORM): The act of monitoring, addressing or mitigating undesirable search engine results or mentions in online media for a company or product. Techniques include generating new content and creating posts on existing content.

OpenSearch: A collection of technologies that allow publishing of search results in a format suitable for syndication and aggregation. It is a way for websites and search engines to publish search results in a standard and accessible format. Originally developed by Amazon and recently adopted by Yahoo!, OpenSearch relies on abstract-based microformats (dataRSS, eRDF, FOAF, GeoRSS, hCard, hEvent, hReview, hAtom, MediaRSS, RDFa, XFN, etc.) to integrate syndicated content into search results.

Opt-in: This permission-based email communication requires customers to verify the opt-in method before their e-mail addresses can be used to communicate with them.

Organic/Natural Listings: Listings that search engines do not sell (unlike paid listings). Instead, sites appear solely because a search engine has deemed it editorially important for them to be included, regardless of payment. Paid inclusion content is also often considered "organic" even though it is paid for. This is because that content usually appears intermixed with unpaid organic results.

Organic Search: A type of search in which web users find sites having unpaid listings, as opposed to using the pay-per-click advertisement listings displayed among the search results.

Outbound Links: Links on a particular web page leading to other web pages on a different domain.

Page: A document provided by the server, including HTML, scripts, and text files. Images, sound files and video are not considered pages. Documents are defined by the system administrator, but generally include all static content, such as complete html pages. Dynamic pages are created with variables and do not exist anywhere in a static form. Forms are scripted pages which get information from a visitor that gets passed back to the server.

Page Tag: A piece of JavaScript code embedded on a web page and executed by the browser when the page is viewed. See also log files.

Page View: Generally defined as a request to load a single page of a website. On the web, a page request would result from a web surfer clicking on a link on another page that points to the page in question. See also hit.

Parameters: Name and value pairs located in the URL after a question mark, each written as a name, an equals sign, and a value (name=value).

Path: A path is the click pattern a visitor uses as they traverse through multiple pages.

PPC: Pay Per Click or paid search uses search keywords that cost a certain amount for each customer click on that term in order to get to your site. See also organic search.

Query: A question or inquiry used to find answers about certain metrics.

Query Parameter: An individual piece of a query string consisting of a parameter name and a value for the parameter.
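
A short Python sketch of pulling query parameters out of a URL with the standard library's `urllib.parse`. The URL and the parameter names `term` and `page` are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL; the parameter names are invented for this sketch.
url = "http://www.xyz.com/search?term=shoes&page=2"

# The query string is the text after the question mark.
query_string = urlsplit(url).query

# Each name=value pair becomes one entry; values stay as strings.
parameters = dict(parse_qsl(query_string))
```

Here `parameters` holds two query parameters, `term` with the value `shoes` and `page` with the value `2`.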

Reach: The size of the audience reading, viewing, hearing, or interacting with a message in a given period of time. Reach can be understood as either an absolute number, or a fraction of a population.

Rear-View Mirror Metrics: Metrics that measure what has already occurred. For example, campaign response metrics tell you how a campaign performed.

Recency: The number of days since a visitor's most recent visit during a reporting period.

Referrals: The location that visitors come from, particularly the sites, search engines or directories.

Relationship Marketing: Relationship marketing is a type of marketing that traces its roots to direct response marketing. It emphasizes building long-term relationships with customers rather than individual transactions. It requires understanding customer needs as they go through life cycles of interacting and purchasing from organizations, and requires that marketers accurately determine customer intent in order to provide them the right message at the right time.

RFM Analysis: Recency, Frequency, Monetary analysis.

ROI: Return on Investment

Robot: An automated process that performs mundane, repeatable tasks to provide information. Search engine robots or bots provide such functions, cataloging the internet for searchers to find information.

RSS: Really Simple Syndication is a type of web syndication used by news sites and weblogs which provides summaries of information with links to the complete content.

Sampling: In statistics, the selection of individual observations intended to yield knowledge about a population, especially for the purposes of statistical inference.

Search Engine: A program or service that helps you find information on the web or in a specialized database of information. Web search engines generally have paid listings and organic listings. Organic listings typically come from crawling the web, though often human-powered directory listings are also optionally offered. Top tier search engines include Google, MSN, Teoma and Yahoo!

Search Engine Marketing (SEM): The act of marketing a web site via search engines, whether this be improving rank in organic listings (search engine optimization), purchasing paid listings (PPC management) or a combination of these and other search engine-related activities (e.g. affiliate programs, shopping feeds or link development).

Search Engine Optimization (SEO): The act of altering a web site so that it does well in the organic, crawler-based listings of search engines. In the past, the term has also been used for any type of search engine marketing activity, though now the term search engine marketing is more commonly used as an umbrella term.

Search Terms: The words (or phrase) a searcher enters into a search engine's search box. Also used to refer to the terms a search engine marketer hopes a particular page will be found for. Also called keywords, query terms or query.

Segment: A grouping of customers, defined by website activity or other data, which can be used to target them effectively.

Social Media Marketing (SMM): A form of internet marketing which seeks to achieve branding and marketing communication goals through the participation in various social media networks (MySpace, Facebook, LinkedIn), social bookmarking (Digg, Stumbleupon), social media sharing (Flickr, YouTube), review/ratings sites (ePinions, BizRate), blogs, forums, news aggregators and virtual 3D networks (SecondLife, ActiveWorlds). Each social media site can be optimized to generate awareness or traffic.

Social Media Optimization (SMO): A set of methods for generating publicity through social media, online communities and community websites. Methods of SMO include adding RSS feeds, adding a "Digg This" button, blogging and incorporating third party community functionalities like Flickr photo slides and galleries or YouTube videos. Social media optimization is a form of search engine marketing.

Spam: Any search engine marketing method that a search engine deems to be detrimental to its efforts to deliver relevant, quality search results. Some search engines have written guidelines about what they consider to be spamming, but ultimately any activity a particular search engine deems harmful may be considered spam, whether or not there are published guidelines against it. Examples of spam include the creation of nonsensical doorway pages designed to please search engine algorithms rather than human visitors, or a heavy repetition of search terms on a page to increase keyword density.

Spider: An automated software program that gathers pages from the Internet.

Submission: The act of submitting a URL for inclusion in a search engine's index. Unless done through paid inclusion, submission generally does not guarantee listing. In addition, submission does not help with rank improvement on crawler-based search engines unless search engine optimization efforts have been implemented. Submission can be done manually (i.e., you fill out an online form and submit) or automatically, where a software program or online service may process the forms behind the scenes.

Suffix: The last part of a domain that can be used to identify the type of organization or location of a site.

Third-party cookie: Hosted web analytics services track visitor behavior by inserting a small piece of tracking code onto each page of a site. Because the cookie is served by an analytics vendor rather than your own site, the cookie is considered third-party.

Traffic: On the web, traffic refers to the amount of data sent and received by visitors to a website.

URL: A Uniform Resource Locator is a means of identifying an exact location on the Internet.

User Agent: Fields in an extended web server log file identifying the browser and platform used by a visitor.

User Session: A period of activity (all hits) for one user of a website. A unique user is determined by the IP address or cookie. Typically, a user session is terminated when a user is inactive for more than 30 minutes.

Unique Visitors: A measure captured by web analytics solutions that track the interaction a single user has with a website over time.

Viral Marketing: Any marketing technique that induces Web sites or users to pass on a marketing message to other sites or users, creating a potentially exponential growth in the message's visibility and effect.

Visitor: Similar to unique visitor, visitor refers to an individual that visits a website. A visitor or unique visitor can have multiple visits.

Visitor Session: Interaction by a site visitor. The session ends when the visitor leaves the site.

Visit: A visit is an interaction a unique visitor has with a website over a specified period of time or activity. If a visitor has left a site or has not executed a click within 30 minutes, the visit session will terminate.
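
The 30-minute rule can be sketched in Python as below. This is one possible reading of the definition, not how any specific analytics product counts visits; the function name `count_visits` and the request times are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_visits(timestamps):
    """Count visits for one visitor; a gap of more than 30 minutes starts a new visit."""
    visits = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > SESSION_TIMEOUT:
            visits += 1
        previous = ts
    return visits

# Hypothetical request times for a single visitor.
requests = [
    datetime(2010, 1, 1, 9, 0),
    datetime(2010, 1, 1, 9, 10),
    datetime(2010, 1, 1, 9, 55),   # 45 idle minutes, so a new visit begins
    datetime(2010, 1, 1, 10, 5),
]
visits = count_visits(requests)
```

These four requests count as two visits by the same visitor.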

W3C: World Wide Web Consortium develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential.

Web 2.0: The use of World Wide Web technology and web design that aims to facilitate creativity, information sharing, and, most notably, collaboration among users. These concepts have led to the development and evolution of web-based communities and hosted services, such as social-networking sites, wikis, blogs, and folksonomies.

Web Analytics: The measurement of data as it relates to an Internet site, including the behavior of visitors, the amount of traffic, the conversion rates, web server performance, user experience, and other information, in order to understand and prove results and continually improve the results of a site towards a set of objectives.

Website: A website is a collection of web pages on a particular domain name or sub-domain on the World Wide Web. Usually it is made up of a set of web pages created using HTML and accessible via HTTP.

What if: A type of analysis that allows an end-user to pose hypothetical situations against their data to model or predict outcomes.

World Wide Web: Also called the web, this is a global information space through which people can communicate via computers connected to the Internet. Some people use "internet" and "the web" interchangeably, even though the web is a service that operates over the internet.

XML: Extensible Markup Language is a World Wide Web Consortium (W3C) recommended general-purpose markup language for creating special-purpose markup languages, capable of describing many different kinds of data.

XML Feeds: A form of paid inclusion where a search engine is "fed" information about pages via XML, rather than gathering that information through crawling actual pages. Marketers can pay to have their pages included in a spider based search index either annually per URL or on a CPC basis based on an XML document representing each page on the client site. New media types are being introduced into paid inclusion, including graphics, video, audio, and rich media. These feeds are commonly used for Shopping Feeds.

Zero Latency: Latency is a time delay between the moment something is started, and the moment one of the effects of that event begins. When there is no time lapse between the event and the effect, it's called zero latency. In analytics, this term is used to describe instantaneous receipt of data and the ability to analyze and act on that data.

Zero-page Visit: A visit that included no page views. This is possible if a visit consisted of at least one request for a non-page file (such as a graphic) but no page files (such as .htm, .asp, .jsp, or .cfm).

8:30 AM

By statisticalconcepts

In: Statistics

Statistics Formulas

Notation

Capitalization
In general, capital letters refer to population attributes (i.e., parameters); and lower-case letters refer to sample attributes (i.e., statistics). For example,
- P refers to a population proportion; and p, to a sample proportion.
- X refers to a set of population elements; and x, to a set of sample elements.
- N refers to population size; and n, to sample size.
Greek vs. Roman Letters
Like capital letters, Greek letters refer to population attributes. Their sample counterparts, however, are usually Roman letters. For example,
- μ refers to a population mean; and x̄, to a sample mean.
- σ refers to the standard deviation of a population; and s, to the standard deviation of a sample.
Population Parameters
- μ refers to a population mean.
- σ refers to the standard deviation of a population.
- σ^2 refers to the variance of a population.
- P refers to the proportion of population elements that have a particular attribute.
- Q refers to the proportion of population elements that do not have a particular attribute, so Q = 1 - P.
- ρ is the population correlation coefficient, based on all of the elements from a population.
- N is the number of elements in a population.
Sample Statistics
- x̄ refers to a sample mean.
- s refers to the standard deviation of a sample.
- s^2 refers to the variance of a sample.
- p refers to the proportion of sample elements that have a particular attribute.
- q refers to the proportion of sample elements that do not have a particular attribute, so q = 1 - p.
- r is the sample correlation coefficient, based on all of the elements from a sample.
- n is the number of elements in a sample.
Simple Linear Regression
- β0 is the intercept constant in a population regression line.
- β1 is the regression coefficient (i.e., slope) in a population regression line.
- R^2 refers to the coefficient of determination.
- b0 is the intercept constant in a sample regression line.
- b1 refers to the regression coefficient in a sample regression line (i.e., the slope).
- sb1 refers to the standard error of the slope of a regression line.
Probability
- P(A) refers to the probability that event A will occur.
- P(A|B) refers to the conditional probability that event A occurs, given that event B has occurred.
- P(A') refers to the probability of the complement of event A.
- P(A ∩ B) refers to the probability of the intersection of events A and B.
- P(A ∪ B) refers to the probability of the union of events A and B.
- E(X) refers to the expected value of random variable X.
- b(x; n, P) refers to binomial probability.
- b*(x; n, P) refers to negative binomial probability.
- g(x; P) refers to geometric probability.
- h(x; N, n, k) refers to hypergeometric probability.
Counting
- n! refers to the factorial value of n.
- nPr refers to the number of permutations of n things taken r at a time.
- nCr refers to the number of combinations of n things taken r at a time.
Set Theory
- A ∩ B refers to the intersection of events A and B.
- A ∪ B refers to the union of events A and B.
- {A, B, C} refers to the set of elements consisting of A, B, and C.
- ∅ refers to the null set.
Hypothesis Testing
- H0 refers to a null hypothesis.
- H1 or Ha refers to an alternative hypothesis.
- α refers to the significance level.
- β refers to the probability of committing a Type II error.
Random Variables
- Z or z refers to a standardized score, also known as a z score.
- zα refers to the standardized score that has a cumulative probability equal to 1 - α.
- tα refers to the t score that has a cumulative probability equal to 1 - α.
- fα refers to an f statistic that has a cumulative probability equal to 1 - α.
- fα(v1, v2) is an f statistic with a cumulative probability of 1 - α, and v1 and v2 degrees of freedom.
- Χ^2 refers to a chi-square statistic.
Special Symbols
- Σ is the summation symbol, used to compute sums over a range of values.
- Σx or Σxi refers to the sum of a set of n observations. Thus, Σxi = Σx = x1 + x2 + . . . + xn.
- sqrt refers to the square root function. Thus, sqrt(4) = 2 and sqrt(25) = 5.
- Var(X) refers to the variance of the random variable X.
- SD(X) refers to the standard deviation of the random variable X.
- SE refers to the standard error of a statistic.
- ME refers to the margin of error.
- DF refers to the degrees of freedom.

Formulas
Parameters
- Population mean = μ = ( Σ Xi ) / N
- Population standard deviation = σ = sqrt [ Σ ( Xi - μ )^2 / N ]
- Population variance = σ^2 = Σ ( Xi - μ )^2 / N
- Variance of population proportion = σP^2 = PQ / n
- Standardized score = Z = (X - μ) / σ
- Population correlation coefficient = ρ = [ 1 / N ] * Σ { [ (Xi - μX) / σx ] * [ (Yi - μY) / σy ] }

Statistics
Unless otherwise noted, these formulas assume simple random sampling.
- Sample mean = x̄ = ( Σ xi ) / n
- Sample standard deviation = s = sqrt [ Σ ( xi - x̄ )^2 / ( n - 1 ) ]
- Sample variance = s^2 = Σ ( xi - x̄ )^2 / ( n - 1 )
- Variance of sample proportion = sp^2 = pq / (n - 1)
- Pooled sample proportion = p = (p1 * n1 + p2 * n2) / (n1 + n2)
- Pooled sample standard deviation = sp = sqrt { [ (n1 - 1) * s1^2 + (n2 - 1) * s2^2 ] / (n1 + n2 - 2) }
- Sample correlation coefficient = r = [ 1 / (n - 1) ] * Σ { [ (xi - x̄) / sx ] * [ (yi - ȳ) / sy ] }
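
The sample formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the two small data lists are made up for the example, and the divisor n - 1 follows the sample variance formula in this list.

```python
import math

def sample_mean(xs):
    # Sample mean: sum of observations divided by n.
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Sample variance: squared deviations from the mean, divided by n - 1.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    # Sample standard deviation: square root of the sample variance.
    return math.sqrt(sample_variance(xs))

def sample_correlation(xs, ys):
    # r = [1 / (n - 1)] * sum of products of standardized deviations.
    n = len(xs)
    mx, my = sample_mean(xs), sample_mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [2, 4, 4, 4, 5, 5, 7, 9]
ys = [1, 3, 2, 5, 4, 6, 8, 9]

print(sample_mean(xs), sample_variance(xs), sample_sd(xs))
print(sample_correlation(xs, ys))
```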

Simple Linear Regression
- Simple linear regression line: ŷ = b0 + b1x
- Regression coefficient = b1 = Σ [ (xi - x̄) (yi - ȳ) ] / Σ [ (xi - x̄)^2 ]
- Regression intercept = b0 = ȳ - b1 * x̄
- Regression coefficient = b1 = r * (sy / sx)
- Standard error of regression slope = sb1 = sqrt [ Σ (yi - ŷi)^2 / (n - 2) ] / sqrt [ Σ (xi - x̄)^2 ]
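
As a rough sketch, the slope, intercept and standard error of the slope can be computed directly from the formulas above. The function name `fit_line` and the five data points are assumptions made for this example.

```python
import math

def fit_line(xs, ys):
    """Return (b0, b1, se_b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross products.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares around the fitted line.
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(residual_ss / (n - 2)) / math.sqrt(sxx)
    return b0, b1, se_b1

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, se_b1 = fit_line(xs, ys)
print(b0, b1, se_b1)
```

The slope would be tested against a t distribution with n - 2 degrees of freedom, which is why n - 2 appears in the standard error.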

Counting
- n factorial: n! = n * (n-1) * (n - 2) * . . . * 3 * 2 * 1. By convention, 0! = 1.
- Permutations of n things, taken r at a time: nPr = n! / (n - r)!
- Combinations of n things, taken r at a time: nCr = n! / [ r! (n - r)! ] = nPr / r!
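
A quick check of the counting formulas in Python, using n = 5 and r = 2 as arbitrary example values. The variable names are illustrative; `math.perm` and `math.comb` are standard library functions available from Python 3.8.

```python
import math

n, r = 5, 2

# Permutations: n! / (n - r)!
n_p_r = math.factorial(n) // math.factorial(n - r)

# Combinations: permutations divided by r!, since order does not matter.
n_c_r = n_p_r // math.factorial(r)

# The standard library gives the same quantities directly.
assert n_p_r == math.perm(n, r)
assert n_c_r == math.comb(n, r)

print(n_p_r, n_c_r)  # 20 10
```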

Probability
- Rule of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- Rule of multiplication: P(A ∩ B) = P(A) P(B|A)
- Rule of subtraction: P(A') = 1 - P(A)

Random Variables
In the following formulas, X and Y are random variables, and a and b are constants.
- Expected value of X = E(X) = μx = Σ [ xi * P(xi) ]
- Variance of X = Var(X) = σ^2 = Σ [ xi - E(X) ]^2 * P(xi) = Σ [ xi - μx ]^2 * P(xi)
- Normal random variable = z-score = z = (X - μ)/σ
- Chi-square statistic = Χ^2 = [ ( n - 1 ) * s^2 ] / σ^2
- f statistic = f = [ s1^2/σ1^2 ] / [ s2^2/σ2^2 ]
- Expected value of sum of random variables = E(X + Y) = E(X) + E(Y)
- Expected value of difference between random variables = E(X - Y) = E(X) - E(Y)
- Variance of the sum of independent random variables = Var(X + Y) = Var(X) + Var(Y)
- Variance of the difference between independent random variables = Var(X - Y) = Var(X) + Var(Y)
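
The expected value and variance formulas for a discrete random variable can be illustrated as follows. The distribution used (number of heads in two fair coin flips) and the function names are assumptions for the example. The last part builds the distribution of X - Y for two independent copies and confirms numerically that the variances add.

```python
def expected_value(values, probs):
    """E(X) = sum of xi * P(xi)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var(X) = sum of (xi - E(X))^2 * P(xi)."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# Hypothetical distribution: number of heads in two fair coin flips.
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]

# Distribution of D = X - Y for independent X and Y with the same distribution:
# each pair (x, y) occurs with probability P(x) * P(y).
diff_values = [x - y for x in values for y in values]
diff_probs = [px * py for px in probs for py in probs]

e_x = expected_value(values, probs)
var_x = variance(values, probs)
var_diff = variance(diff_values, diff_probs)

print(e_x, var_x, var_diff)
```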

Sampling Distributions
- Mean of sampling distribution of the mean = μx̄ = μ
- Mean of sampling distribution of the proportion = μp = P
- Standard deviation of proportion = σp = sqrt[ P * (1 - P)/n ] = sqrt( PQ / n )
- Standard deviation of the mean = σx̄ = σ/sqrt(n)
- Standard deviation of difference of sample means = σd = sqrt[ (σ1^2 / n1) + (σ2^2 / n2) ]
- Standard deviation of difference of sample proportions = σd = sqrt{ [P1(1 - P1) / n1] + [P2(1 - P2) / n2] }

Standard Error
- Standard error of proportion = SEp = sp = sqrt[ p * (1 - p)/n ] = sqrt( pq / n )
- Standard error of difference for proportions = SEp = sp = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
- Standard error of the mean = SEx̄ = sx̄ = s/sqrt(n)
- Standard error of difference of sample means = SEd = sd = sqrt[ (s1^2 / n1) + (s2^2 / n2) ]
- Standard error of difference of paired sample means = SEd = sd = sqrt [ Σ (di - d̄)^2 / (n - 1) ] / sqrt(n)
- Pooled sample standard error = spooled = sqrt { [ (n1 - 1) * s1^2 + (n2 - 1) * s2^2 ] / (n1 + n2 - 2) }
- Standard error of difference of sample proportions = sd = sqrt{ [p1(1 - p1) / n1] + [p2(1 - p2) / n2] }

Discrete Probability Distributions
- Binomial formula: P(X = x) = b(x; n, P) = nCx * P^x * (1 - P)^(n - x) = nCx * P^x * Q^(n - x)
- Mean of binomial distribution = μx = n * P
- Variance of binomial distribution = σx^2 = n * P * ( 1 - P )
- Negative binomial formula: P(X = x) = b*(x; r, P) = (x-1)C(r-1) * P^r * (1 - P)^(x - r)
- Mean of negative binomial distribution (x counts trials) = μx = r / P
- Variance of negative binomial distribution = σx^2 = r * Q / P^2
- Geometric formula: P(X = x) = g(x; P) = P * Q^(x - 1)
- Mean of geometric distribution (x counts trials) = μx = 1 / P
- Variance of geometric distribution = σx^2 = Q / P^2
- Hypergeometric formula: P(X = x) = h(x; N, n, k) = [ kCx ] [ (N-k)C(n-x) ] / [ NCn ]
- Mean of hypergeometric distribution = μx = n * k / N
- Variance of hypergeometric distribution = σx^2 = n * k * ( N - k ) * ( N - n ) / [ N^2 * ( N - 1 ) ]
- Poisson formula: P(x; μ) = (e^-μ) (μ^x) / x!
- Mean of Poisson distribution = μx = μ
- Variance of Poisson distribution = σx^2 = μ
- Multinomial formula: P = [ n! / ( n1! * n2! * . . . * nk! ) ] * ( p1^n1 * p2^n2 * . . . * pk^nk )
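
For readers who prefer to see the formulas executed, here is a small sketch of the binomial, Poisson and hypergeometric probability formulas using only the Python standard library. The function names and the example arguments are assumptions for the illustration.

```python
import math

def binomial_pmf(x, n, p):
    # nCx * P^x * (1 - P)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    # e^(-mu) * mu^x / x!
    return math.exp(-mu) * mu ** x / math.factorial(x)

def hypergeometric_pmf(x, N, n, k):
    # [kCx] * [(N-k)C(n-x)] / [NCn]
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)

# Probability of exactly 2 successes in 4 trials with P = 0.5.
print(binomial_pmf(2, 4, 0.5))

# Probability of exactly 1 marked item when drawing 3 from 10 items, 4 of them marked.
print(hypergeometric_pmf(1, 10, 3, 4))

# The binomial mean n * P can be recovered by summing x * P(X = x).
binomial_mean = sum(x * binomial_pmf(x, 10, 0.3) for x in range(11))
print(binomial_mean)
```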

Linear Transformations
For the following formulas, assume that Y is a linear transformation of the random variable X, defined by the equation: Y = aX + b.
- Mean of a linear transformation = E(Y) = Ȳ = aX̄ + b.
- Variance of a linear transformation = Var(Y) = a^2 * Var(X).
- Standardized score = z = (x - μx) / σx.
- t-score = t = (x̄ - μx) / [ s/sqrt(n) ].

Estimation
- Confidence interval: Sample statistic ± Critical value * Standard error of statistic
- Margin of error = (Critical value) * (Standard deviation of statistic)
- Margin of error = (Critical value) * (Standard error of statistic)
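
A minimal sketch of a confidence interval for a mean, built as sample statistic plus or minus the margin of error. It assumes a normal (z) critical value for simplicity; with a small sample a t critical value would normally be used instead. The function name and the sample values are made up for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, confidence=0.95):
    """Return (low, high) for the mean, using a z critical value (assumption)."""
    n = len(xs)
    standard_error = stdev(xs) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = critical_value * standard_error
    centre = mean(xs)
    return centre - margin_of_error, centre + margin_of_error

# Hypothetical measurements.
sample = [34.1, 35.6, 33.8, 36.2, 34.9, 35.0, 33.5, 36.4]
low, high = mean_confidence_interval(sample)
print(low, high)
```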

Hypothesis Testing
- Standardized test statistic = (Statistic - Parameter) / (Standard deviation of statistic)
- One-sample z-test for proportions: z-score = z = (p - P0) / sqrt( p * q / n )
- Two-sample z-test for proportions: z-score = z = [ (p1 - p2) - d ] / SE
- One-sample t-test for means: t-score = t = (x̄ - μ) / SE
- Two-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - d ] / SE
- Matched-sample t-test for means: t-score = t = [ (x̄1 - x̄2) - D ] / SE = (d̄ - D) / SE
- Chi-square test statistic = Χ^2 = Σ[ (Observed - Expected)^2 / Expected ]
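
Two of the test statistics above, sketched in Python under simple assumptions: the one-sample t statistic uses SE = s / sqrt(n), and the chi-square statistic takes observed and expected counts as plain lists. The function names and data are hypothetical, and the sketch stops at the statistic; looking up a p-value is not shown.

```python
import math
from statistics import mean, stdev

def one_sample_t(xs, mu0):
    # t = (sample mean - hypothesized mean) / standard error of the mean
    standard_error = stdev(xs) / math.sqrt(len(xs))
    return (mean(xs) - mu0) / standard_error

def chi_square_statistic(observed, expected):
    # Sum over cells of (Observed - Expected)^2 / Expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

scores = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat = one_sample_t(scores, 4)

observed = [10, 20, 30]
expected = [20, 20, 20]
chi2 = chi_square_statistic(observed, expected)

print(t_stat, chi2)
```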

Degrees of Freedom

The correct formula for degrees of freedom (DF) depends on the situation (the nature of the test statistic, the number of samples, underlying assumptions, etc.).

- One-sample t-test: DF = n - 1
- Two-sample t-test: DF = (s1^2/n1 + s2^2/n2)^2 / { [ (s1^2 / n1)^2 / (n1 - 1) ] + [ (s2^2 / n2)^2 / (n2 - 1) ] }
- Two-sample t-test, pooled standard error: DF = n1 + n2 - 2
- Simple linear regression, test slope: DF = n - 2
- Chi-square goodness of fit test: DF = k - 1
- Chi-square test for homogeneity: DF = (r - 1) * (c - 1)
- Chi-square test for independence: DF = (r - 1) * (c - 1)
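
The two-sample (unpooled) degrees-of-freedom formula is the least obvious item in this list, so here is a short sketch of it. The function name `welch_df` and the example inputs are assumptions. With equal standard deviations and equal sample sizes the result coincides with the pooled value n1 + n2 - 2.

```python
def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for a two-sample t-test without pooling."""
    v1 = s1 ** 2 / n1  # variance of the first sample mean
    v2 = s2 ** 2 / n2  # variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Equal spread and equal sizes: matches the pooled DF of 18.
print(welch_df(2.0, 10, 2.0, 10))

# Unequal spread and sizes: the DF falls below n1 + n2 - 2.
print(welch_df(1.0, 8, 4.0, 15))
```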

Sample Size

Below, the first two formulas find the smallest sample sizes required to achieve a fixed margin of error, using simple random sampling. The third formula assigns sample to strata, based on a proportionate design. The fourth formula, Neyman allocation, uses stratified sampling to minimize variance, given a fixed sample size. And the last formula, optimum allocation, uses stratified sampling to minimize variance, given a fixed budget.

- Mean (simple random sampling): n = { z^2 * σ^2 * [ N / (N - 1) ] } / { ME^2 + [ z^2 * σ^2 / (N - 1) ] }
- Proportion (simple random sampling): n = [ ( z^2 * p * q ) + ME^2 ] / [ ME^2 + z^2 * p * q / N ]
- Proportionate stratified sampling: nh = ( Nh / N ) * n
- Neyman allocation (stratified sampling): nh = n * ( Nh * σh ) / [ Σ ( Ni * σi ) ]
- Optimum allocation (stratified sampling): nh = n * [ ( Nh * σh ) / sqrt( ch ) ] / Σ [ ( Ni * σi ) / sqrt( ci ) ]
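
To make the stratified allocation formulas concrete, the sketch below splits a total sample of 100 across three strata, first proportionately and then by Neyman allocation. The stratum sizes, standard deviations and function names are invented for the example, and the results are left unrounded; in practice they would be rounded to whole units.

```python
def proportionate_allocation(n, stratum_sizes):
    # nh = (Nh / N) * n
    N = sum(stratum_sizes)
    return [n * Nh / N for Nh in stratum_sizes]

def neyman_allocation(n, stratum_sizes, stratum_sds):
    # nh = n * (Nh * sigma_h) / sum(Ni * sigma_i)
    weights = [Nh * sd for Nh, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

sizes = [500, 300, 200]   # hypothetical stratum population sizes
sds = [10.0, 20.0, 30.0]  # hypothetical stratum standard deviations

proportional = proportionate_allocation(100, sizes)
neyman = neyman_allocation(100, sizes, sds)

print(proportional)
print(neyman)
```

Neyman allocation moves sample toward the more variable strata, which is why the smaller strata receive more than their proportionate share here.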

12:29 PM

By statisticalconcepts

In: Statistics

Use of Statistical Analysis?

One sample t-test

A one sample t-test tests whether a sample mean differs significantly from a hypothesized value.

One sample median test

One sample median test is used to test whether a sample median differs significantly from a hypothesized value.

Binomial test

Binomial test is used to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value.

Chi-square goodness of fit

A chi-square goodness of fit test is used to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, let's suppose that we believe that the general population consists of x% Hispanic, y% Asian, z% African American and a% White folks. We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions.

Two independent samples t-test

An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups.

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal).

Chi-square test

A chi-square test is used when you want to see if there is a relationship between two categorical variables. Remember that the chi-square test assumes that the expected value for each cell is five or higher.

Fisher's exact test

The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is.
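
The rule of thumb about expected frequencies can be checked before choosing between the two tests. The sketch below computes expected cell counts from the row and column totals and suggests a test accordingly. The function names and the two example tables are hypothetical, and the threshold of five is the one quoted above.

```python
def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def suggest_test(table, threshold=5):
    # Use the chi-square test only if every expected count reaches the threshold.
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest >= threshold else "Fisher's exact"

large_table = [[10, 20], [30, 40]]
small_table = [[1, 2], [3, 4]]

print(expected_counts(large_table), suggest_test(large_table))
print(expected_counts(small_table), suggest_test(small_table))
```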

One-way ANOVA

A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.

Kruskal Wallis test

The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a generalized form of the Mann-Whitney test method since it permits two or more groups.

Paired t-test

A paired (samples) t-test is used when you have two related observations (i.e., two observations per subject) and you want to see if the means on these two normally distributed interval variables differ from one another.

Wilcoxon signed rank sum test

The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal).

One-way repeated measures ANOVA

You would perform a one-way repeated measures analysis of variance if you had one categorical independent variable and a normally distributed interval dependent variable that was repeated at least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more levels of the categorical variable. This tests whether the mean of the dependent variable differs by the categorical variable.

Factorial ANOVA

A factorial ANOVA has two or more categorical independent variables (either with or without the interactions) and a single normally distributed interval dependent variable.

Friedman test

You perform a Friedman test when you have one within-subjects independent variable with two or more levels and a dependent variable that is not interval and normally distributed (but at least ordinal). For example, you could use this test to determine whether there is a difference among reading, writing and math scores. The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e., reading, writing and math) is the same.

Factorial logistic regression

A factorial logistic regression is used when you have two or more categorical independent variables and a dichotomous dependent variable.

Correlation

A correlation is useful when you want to see the relationship between two (or more) normally distributed interval variables. By squaring the correlation and then multiplying by 100, you can determine what percentage of the variability is shared.

Simple linear regression

Simple linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable.

Non-parametric correlation

A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted to ranks and then correlated.

Simple logistic regression

Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1).

Multiple regression

Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor variable in the equation.

Analysis of covariance

Analysis of covariance is like ANOVA, except that in addition to the categorical predictors you also have continuous predictors.

Multiple logistic regression

Multiple logistic regression is like simple logistic regression, except that there are two or more predictors. The predictors can be interval variables or dummy variables, but cannot be categorical variables. If you have categorical predictors, they should be coded into one or more dummy variables.

Discriminant analysis

Discriminant analysis is used when you have one or more normally distributed interval independent variables and a categorical dependent variable. It is a multivariate technique that considers the latent dimensions in the independent variables for predicting group membership in the categorical dependent variable.

One-way MANOVA

MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables.

Multivariate multiple regression

Multivariate multiple regression is used when you have two or more variables that are to be predicted from two or more predictor variables.

Canonical correlation

Canonical correlation is a multivariate technique used to examine the relationship between two groups of variables. For each set of variables, it creates latent variables and looks at the relationships among the latent variables. It assumes that all variables in the model are interval and normally distributed.

Factor analysis

Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the number of variables in a model or to detect relationships among variables. All variables involved in the factor analysis need to be interval and are assumed to be normally distributed. The goal of the analysis is to try to identify factors which underlie the variables. There may be fewer factors than variables, but there may not be more factors than variables.

12:21 AM

By statisticalconcepts

In: Statistics

Results of Statistical Analysis

Results of statistical analyses can indicate the precision of the results, give further description of the data or demonstrate the statistical significance of comparisons. Where statistical significance is referred to in text, the reference should be included in such a way as to minimize disruption to the flow of the text. Significance probabilities can either be presented by reference to conventional levels, e.g. (P < 0.05) or, more informatively, by stating the exact probability, e.g. (P = 0.023). An alternative to including a large number of statements about significance is to include an overall covering sentence at the beginning of the results section, or some other suitable position. An example of such a sentence is:- 'All treatment differences referred to in the results are statistically significant at least at the 5% level unless otherwise stated.'

Descriptive Statistics

When simply describing a set of data with summary statistics, useful statistics to present are the mean, the number of observations and a measure of the variation or "scatter" of the observations, as well as the units of measurement. The range and the standard deviation (SD) are useful measures of the variation in the data. The standard error (SE) is not relevant in this context, since it measures the precision with which the mean of the data estimates the mean of a larger population. If there are a large number of variables to be described, the means, SDs etc. should be presented in a table. However, if there are only one or two variables, these results can be included in the text.

For example:- 'The initial weights of 48 ewes in the study had a mean of 34.7 kg and ranged from 29.2 to 38.6 kg.' or

'The mean initial weight of ewes in the study was 34.7 kg (n = 48, SD = 2.61).'
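
A summary sentence of this kind can be generated directly from the data. The sketch below is only an illustration: the five weights are invented (they are not the ewe data quoted above), and the function name `describe` is hypothetical.

```python
from statistics import mean, stdev

def describe(values, unit):
    """Build a one-line summary: mean, n, SD and range, with units."""
    return (
        f"mean = {mean(values):.1f} {unit} "
        f"(n = {len(values)}, SD = {stdev(values):.2f}, "
        f"range {min(values):.1f} to {max(values):.1f} {unit})"
    )

# Hypothetical weights in kg.
weights = [31.2, 34.7, 36.1, 33.5, 38.0]
summary = describe(weights, "kg")
print(summary)
```

Note that the SD is shown with one more decimal place than the mean, and that the standard error is left out because the aim here is description rather than estimation.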

Analyses of Variance

In most situations, the only candidates from the analysis of variance table for presentation are the significance probabilities of the various factors and interactions and sometimes the residual variance or standard deviation. When included, they should be within the corresponding table of means, rather than in a separate table.

In general, authors should present relevant treatment means, a measure of their precision, and maybe significance probabilities. The treatment means are the primary information to be presented, with measures of precision being of secondary importance. The layout of the table should reflect these priorities; the secondary information should not obscure the main message of the data. The layout of tables of means depends, as is shown below, on the design of the experiment, in particular on:-

* whether the design is balanced i.e. equal numbers of observations per treatment;

* whether the treatments have a factorial structure.

* for factorial designs, whether or not there are interactions.

Measures of Precision

The measure of precision should be either a standard error of a mean (SE), a standard error of the difference between two means (SED), or a least significant difference (LSD). In the last case, the significance level used should be stated, e.g. 5% LSD. The SED and LSD are usually only suitable for balanced designs. For balanced designs, 5% LSD ≈ 2 × SED and SED = √2 × SE ≈ 1½ × SE.

Only one of these three statistics is necessary and it is important to make it clear which is being used. Preference is for the standard error (SE). It is the simplest. We can always multiply by 1½ or 3 to give the SED or 5% LSD, and we can use the SE for both balanced and unbalanced situations. Measures of precision are usually presented with one more decimal place than the means. This is not a strict rule. For example, a mean of 74 with a standard error of 32 is fine, but a mean of 7.4 with a standard error of 0.3 should have the extra decimal place, so that the standard error is given as, say, 0.32.
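
The multipliers of roughly 1½ and 3 come from simple arithmetic, sketched below for a balanced design. The function names and the SE value are assumptions for the example, and the critical value of 2 is the usual approximation for a 5% test with a reasonable number of residual degrees of freedom, not an exact t value.

```python
import math

def sed_from_se(se):
    """SED between two means with equal replication: sqrt(2) * SE."""
    return math.sqrt(2) * se

def lsd_from_sed(sed, t_critical=2.0):
    """LSD = t * SED; a t value of about 2 gives the approximate 5% LSD."""
    return t_critical * sed

se = 0.32  # hypothetical standard error of a treatment mean
sed = sed_from_se(se)
lsd = lsd_from_sed(sed)

print(sed / se)  # close to 1.5
print(lsd / se)  # close to 3
```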

Some researchers like to include the results of a multiple comparison procedure such as Fisher's LSD. These are added as a column with a series of letters, (a, b, c, etc) where treatments with the same letter are not significantly different. Often, though, these methods are abused. The common multiple comparison procedures are only valid when there is no "structure" in the set of treatments, e.g. when a number of different plant accessions or sources of protein are being compared.

In addition a single standard error or LSD is given in the balanced case, individual standard errors in the unbalanced case. The results from a multiple comparison procedure are additional to, and not a substitute for, the reporting of the standard errors.

Single Factor Experiments

The most straightforward case is a balanced design with simple treatments. Here each treatment has the same precision, so only one SE (or SED or LSD) per variable is needed. In the table of results, each row should present means for one treatment; results for different variables are presented in columns. The statistical analysis results are presented as one or two additional rows: one giving SEs (or SEDs or LSDs) and the other possibly giving significance probabilities. If the F-probabilities are given, we suggest that the actual probabilities be reported, rather than just the levels of significance, e.g. 5% (or 0.05), 0.01 or 0.001. In particular, reporting that a result was "not significant", often written as "ns", is not helpful. In interpreting the results, it is sometimes useful to know if the level of significance was 6% or 60%. If specified contrasts are of interest, e.g. polynomial contrasts for quantitative treatments, their individual F-test probabilities should be presented with or instead of the overall F-test.

For unbalanced experiments each treatment mean has a different precision. Then the best way of presenting results is to include a separate column of standard errors next to each column of means and also a column containing the number of observations. If the number of observations per group is the same for the different variables, then only one column of numbers should be presented (usually the first column). Otherwise a separate column will be needed for each variable. If there is little variation in the number of observations per treatment, and therefore little variation in the standard errors, it is sometimes possible to use an "average" SE or SED and present results as if the experiment was balanced. This should only be done if it does not distort the results, and it should be clearly stated that this procedure has been used. Alternatively, if the numbers per group are not too unequal, another method is to present the residual standard deviation for each variable instead of a column of individual standard errors. If the groups are of very unequal size, the above compromises should not be used and the number of variables presented in any one table should be reduced.

Factorial Experiments

For factorial experiments there is usually more statistical information to present. Also the means to be presented will depend on whether or not there are interactions which are both statistically significant and of practical importance.

This section discusses two-factor experiments, but the recommendations can be easily extended to more complex cases. It also assumes a balanced experiment with equal numbers of observations for each treatment. However, the recommendations can be combined in a fairly straightforward manner with those above for the unbalanced case.

If there is no interaction then the "main effect" means should be presented. For example a 3*2 factorial experiment on sheep nutrition might have three "levels" of supplementation (None, Medium and High) and two levels of parasite control (None and Drenched), giving six treatments in total. There are five main effect means: three means for the levels of supplementation, averaged over the two levels of parasite control, and two means for the levels of parasite control. In this example there would also be two SEs and two significance probabilities for each variable, corresponding to the two factors.

If there are interactions which are statistically significant and of practical importance, then main effect means alone are of limited use. In this case, the individual treatment means should be presented. For a balanced design, there is now only one SE per variable (except for split-plot designs), but three rows giving F-test probabilities for the two main effects and the interaction. Additional rows for F-test probabilities can be used for results of polynomial contrasts for quantitative factors or other pre-planned contrasts.

Regression Analysis

The key results of a linear regression analysis are usually the regression coefficient (b), its standard error, the intercept or constant term, the correlation coefficient (r) and the residual standard deviation. For multiple regression there will be a number of coefficients and SEs, and the coefficient of determination (R²) will replace r. If a number of similar regression analyses have been done the relevant results can be presented in a table, with one column for each parameter.

If results of just one or two regression analyses are presented, they can be incorporated in the text. This can either be done by presenting the regression equation as in :-

'The regression equation relating dry matter yield (DM, kg/ha) to amount of phosphorus applied (P, kg/ha) is DM = 1815 + 32.1P (r = 0.74, SE of regression coefficient = 8.9).'

or presenting individual parameters as in:-

'Linear regression analysis showed that increasing the amount of phosphorus applied by 1 kg/ha increased dry matter yield by 32.1 kg/ha (SE = 8.9). The correlation coefficient was 0.74.'

It is often revealing to present a graph of the regression line. If there is only one line to present on a graph, the individual points should also be included. With more than one line, including the points is not always necessary and tends to be confusing. Details of the regression equation(s) and correlation coefficient(s) can be included with the graph if there is sufficient space. If this information would obscure the message of the graph, then it should be presented elsewhere.

Error Bars on Graphs and Charts

Error bars displayed on graphs or charts are sometimes very informative, while in other cases they obscure the trends which the picture is meant to demonstrate. The decision on whether or not to include error bars within the chart, or give the information as part of the caption, should depend on whether they make the graph clearer or not.

If error bars are displayed, it must be clear whether the bars refer to standard deviations, standard errors, least significant differences, confidence intervals or ranges. Where error bars representing, say, standard errors are presented, then one of the two methods below should be used.

(a) the bar is centred on the mean, with one SE above the mean and one SE below the mean. i.e. the bar has a total length of twice the SE.

(b) the bar appears either completely above or completely below the mean, and represents one SE.

If the error bar has the same length for all points in the graph, then it should be drawn only once and placed to one side of the graph, rather than on the points. This occurs with the results of balanced experiments.

2:59 PM

By statisticalconcepts

In: Statistics

Statistical Averages

Summary Statistics

After the data have been properly checked for quality, the first analysis is usually a descriptive one. The general aim is to summarize the data, iron out any peculiarities and perhaps get ideas for a more sophisticated analysis. The data summary may help to suggest a suitable model, which in turn suggests an appropriate inferential procedure. This first phase of the analysis is described as the initial examination of the data, or initial data analysis. It has much in common with exploratory data analysis, which includes a variety of graphical and numerical techniques for exploring data. Exploratory data analysis is thus an essential part of nearly every analysis. It provides a reasonably systematic way of digesting and summarizing the data, although its exact form naturally varies widely from problem to problem. In initial and exploratory data analysis, the following are given due importance.

Measures of Central Tendency

One of the most important aspects of describing a distribution is the central value around which the observations are distributed. Any arithmetical measure which is intended to represent the center or central value of a set of observations is known as a measure of central tendency.

The Arithmetic Mean (or simply Mean)

Suppose that n observations are obtained for a sample from a population. Denote the values of the n observations by x₁, x₂, ..., x_n, x₁ being the value of the first sample observation, x₂ that of the second observation and so on. The arithmetic mean (or mean, or average), denoted by x̄, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The symbol Σ (read as 'sigma') means sum the individual values x₁, x₂, ..., x_n of the variable X. Usually the limits of the summation are not written, since it is always understood that the summation is over all n values. Hence we can write

$$\bar{x} = \frac{\sum x}{n}$$

The above formula enables us to find the mean when the values x₁, x₂, ..., x_n of n discrete observations are available. Sometimes the data are given in the form of a frequency distribution table; the formula is then as follows:

Arithmetic Mean of Grouped Data

Suppose that there are k classes or intervals. Let x₁, x₂, ..., x_k denote the class mid-points of these k intervals and let f₁, f₂, ..., f_k denote the corresponding frequencies of these classes. Then the arithmetic mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$
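The two forms of the arithmetic mean can be sketched in Python as below. The function names and the grouped data (class mid-points and frequencies) are assumptions made up for illustration.

```python
def arithmetic_mean(values):
    """Mean of raw observations: sum of the values divided by n."""
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    total_frequency = sum(frequencies)
    return sum(f * x for x, f in zip(midpoints, frequencies)) / total_frequency


# Raw observations (hypothetical sample).
raw = [5, 6, 7, 7, 8, 9]

# Hypothetical frequency table: classes 0-10, 10-20, ..., 40-50.
midpoints = [5, 15, 25, 35, 45]
frequencies = [2, 5, 8, 4, 1]

print(arithmetic_mean(raw))                  # 42 / 6
print(grouped_mean(midpoints, frequencies))  # 470 / 20
```

The grouped mean is an approximation, since every observation in a class is treated as if it sat at the class mid-point.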

Properties of the arithmetic mean

(a) The sum of the deviations of a set of n observations x₁, x₂, ..., x_n from their mean x̄ is zero. Let d_i = x_i - x̄ be the deviation of x_i from x̄; then

$$\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

(b) If x₁, x₂, ..., x_n are n observations, x̄ is their mean and d_i = x_i - A is the deviation of x_i from a given number A, then

$$\bar{x} = A + \frac{1}{n}\sum_{i=1}^{n} d_i$$

(d) If in a frequency distribution all the k class intervals are of the same width c, and d_i = x_i - A denotes the deviation of x_i from A, where A is the value of a certain mid-point and x₁, x₂, ..., x_k are the class mid-points of the k classes, then d_i = c u_i where u_i = 0, ±1, ±2, ... and

$$\bar{x} = A + c\,\frac{\sum_{i=1}^{k} f_i u_i}{\sum_{i=1}^{k} f_i}$$
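These properties can be checked numerically. The sketch below assumes the usual forms of the results: deviations from the mean sum to zero, the mean equals A plus the average deviation from A, and for equal class widths the mean equals A + c Σf·u / Σf. The data and the choices of A and c are invented for illustration.

```python
# Raw observations (hypothetical).
values = [5, 6, 7, 7, 8, 9]
n = len(values)
mean = sum(values) / n

# Deviations from the mean sum to zero.
sum_dev = sum(x - mean for x in values)

# Mean recovered from deviations about an arbitrary number A.
A = 6
mean_via_A = A + sum(x - A for x in values) / n

# Step-deviation method for a frequency table with equal class width c.
midpoints = [5, 15, 25, 35, 45]    # hypothetical class mid-points
frequencies = [2, 5, 8, 4, 1]      # hypothetical class frequencies
A_mid, c = 25, 10                  # A is one of the mid-points, c the class width
u = [(x - A_mid) // c for x in midpoints]   # exact here: -2, -1, 0, 1, 2
mean_step = A_mid + c * sum(f * ui for f, ui in zip(frequencies, u)) / sum(frequencies)

# Direct grouped mean for comparison.
mean_direct = sum(f * x for x, f in zip(midpoints, frequencies)) / sum(frequencies)

print(sum_dev, mean_via_A, mean_step, mean_direct)
```

The step-deviation form was mainly a hand-calculation shortcut, since the u values are small integers; it gives the same answer as the direct formula.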

The Median

The median of a set of n measurements or observations x₁, x₂, ..., x_n is the middle value when the measurements are arranged in an array according to their order of magnitude. If n is odd, the middle value is the median. If n is even, there are two middle values and the average of these values is the median. The median is the value which divides the set of observations into two equal halves, such that 50% of the observations lie below the median and 50% above the median. The median depends on the positions of the observations in the ordered array rather than on their actual magnitudes, so it is not affected by extreme values.
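The odd and even cases can be written out in a few lines of Python. This is a minimal sketch with a hypothetical function name; the standard library's statistics.median does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 5, 7]))             # odd n: the single middle value
print(median([5, 6, 7, 7, 8, 9]))    # even n: average of the two middle values
print(median([5, 6, 7, 7, 8, 30]))   # an extreme value does not move the median
```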

The Median of Grouped Data

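The median of grouped data is given by

$$\text{Median} = L + \frac{N/2 - cf}{f} \times c$$

where L is the lower boundary of the median class (the class containing the N/2-th observation), N is the total frequency, cf is the cumulative frequency of the classes before the median class, f is the frequency of the median class and c is the class width.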
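A minimal Python sketch of the usual interpolation formula for the grouped median, L + ((N/2 - cf) / f) × c, is given below. Here L is the lower boundary of the median class, N the total frequency, cf the cumulative frequency before that class, f its frequency and c the class width. It assumes equal class widths; the function name and the frequency table are hypothetical.

```python
def grouped_median(lower_bounds, frequencies, width):
    """Interpolated median of a frequency table with equal class widths."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_bounds, frequencies):
        if cumulative + freq >= half:
            # This is the median class: interpolate within it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("lower_bounds and frequencies must have the same length")


# Hypothetical table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [2, 5, 8, 4, 1]

print(grouped_median(lower_bounds, frequencies, width=10))
```

With these figures N/2 = 10, the median class is 20-30 (cumulative frequency 7 before it, frequency 8), so the result is 20 + (3/8) × 10.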

The Mode

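The mode is the observation which occurs most frequently in a set. In grouped data, the mode is commonly worked out by interpolation within the modal class (the class with the highest frequency) as

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$

where L is the lower boundary of the modal class, f₁ is its frequency, f₀ and f₂ are the frequencies of the classes immediately before and after it, and c is the class width.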

The mode can be determined analytically in the case of a continuous distribution. For a symmetrical distribution, the mean, median and mode coincide. For a distribution skewed to the left (or negatively skewed distribution), the mean, the median and the mode occur in that order (as they appear in the dictionary), and for a distribution skewed to the right (or positively skewed distribution) they occur in the reverse order: mode, median and mean. For a moderately asymmetrical distribution there is an empirical relation: Mean - Mode = 3 (Mean - Median).
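For raw data the mode can be found by counting, and the empirical relation can be rearranged to Mode = 3 Median - 2 Mean to give a rough estimate of the mode from the other two averages. The Python sketch below illustrates both; the function names and the numbers are hypothetical.

```python
from collections import Counter


def modes(values):
    """All values that share the highest frequency (there may be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


def empirical_mode(mean, median):
    """Rearrangement of Mean - Mode = 3 (Mean - Median)."""
    return 3 * median - 2 * mean


print(modes([5, 6, 7, 7, 8, 9]))       # a single mode
print(modes([1, 1, 2, 2, 3]))          # two values tie for most frequent
print(empirical_mode(mean=23.5, median=23.75))
```

The empirical relation is only a rule of thumb for moderately skewed distributions; it should not be expected to hold for strongly skewed data.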

The Geometric Mean

There are two other averages, the geometric mean and the harmonic mean, which are sometimes used. The Geometric Mean (GM) of a set of observations is such that its logarithm equals the arithmetic mean of the logarithms of the values of the observations:

$$GM = (x_1 x_2 \cdots x_n)^{1/n}, \qquad \log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$

In the case of a frequency distribution,

$$\log GM = \frac{1}{n}\sum_{i=1}^{k} f_i \log x_i, \qquad n = \sum_{i=1}^{k} f_i$$

The geometric mean can be obtained only if the values assumed by the observations are positive (greater than zero).
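The logarithmic definition translates directly into code. The Python sketch below computes the geometric mean through logarithms, for raw values and for a frequency distribution, and rejects non-positive values. The function names and data are hypothetical.

```python
import math


def geometric_mean(values):
    """exp of the arithmetic mean of the logs; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def grouped_geometric_mean(values, frequencies):
    """log GM = sum(f_i * log x_i) / n, where n is the total frequency."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    n = sum(frequencies)
    return math.exp(sum(f * math.log(v) for v, f in zip(values, frequencies)) / n)


print(geometric_mean([2, 8]))                     # square root of 2 * 8
print(grouped_geometric_mean([2, 8], [3, 1]))     # same as geometric_mean([2, 2, 2, 8])
```

Working through logarithms avoids overflow when many large values are multiplied together.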

Harmonic mean

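The Harmonic Mean (HM) of a set of observations is such that its reciprocal is the arithmetic mean of the reciprocals of the values of the observations:

$$\frac{1}{HM} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}, \qquad HM = \frac{n}{\sum_{i=1}^{n} 1/x_i}$$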

The harmonic mean is rarely computed for a frequency distribution.
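A minimal Python sketch of the definition follows: the harmonic mean is n divided by the sum of the reciprocals. The function name and the data are hypothetical; the standard library offers statistics.harmonic_mean for the same purpose.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals; values are assumed positive."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    return len(values) / sum(1 / v for v in values)


print(harmonic_mean([1, 2, 4]))    # 3 / (1 + 0.5 + 0.25)
```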

Weighted Mean

If there are n observations x₁, x₂, x₃, …, x_n with corresponding weights w₁, w₂, w₃, …, w_n, then the weighted mean is given by

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

In computing the mean of grouped data, we take the frequency of a class as its weight. That is,

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Hence, it is a special case of the weighted mean. The three means are related by

A.M. ≥ G.M. ≥ H.M.
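The weighted mean and the ordering of the three means can be illustrated with a short Python sketch. The weighted_mean function and all data are hypothetical; the arithmetic, geometric and harmonic means are taken from the standard library's statistics module.

```python
import statistics


def weighted_mean(values, weights):
    """sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Grouped mean as a weighted mean: class frequencies act as the weights.
midpoints = [5, 15, 25, 35, 45]
frequencies = [2, 5, 8, 4, 1]
print(weighted_mean(midpoints, frequencies))

# The three means of the same positive data.
data = [2, 4, 8]
am = statistics.mean(data)
gm = statistics.geometric_mean(data)
hm = statistics.harmonic_mean(data)
print(am, gm, hm)
```

The three means are equal only when all observations are equal; otherwise the inequalities are strict for positive data.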

Important characteristics of a good average

Since an average is a representative item of a distribution it should possess the following properties :

1. It should take all items into consideration.

2. It should not be affected by extreme values.

3. It should be stable from sample to sample.

4. It should be capable of being used for further statistical analysis.

The mean satisfies all of these properties except that it is affected by the presence of extreme items. For example, if the items are 5, 6, 7, 7, 8 and 9, then the mean, median and mode are all equal to 7. If the last value is 30 instead of 9, the mean becomes 10.5 (63/6), whereas the median and mode are unchanged. Though the median and mode are better in this respect, they do not satisfy the other properties. Hence the mean is the best average among these three.
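The effect of an extreme value on the three averages can be checked with Python's standard statistics module, using the same six items and then replacing the last value by 30.

```python
import statistics

original = [5, 6, 7, 7, 8, 9]
with_outlier = [5, 6, 7, 7, 8, 30]   # last value replaced by an extreme item

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```

Only the mean moves: the sum changes from 42 to 63, so the mean goes from 7 to 10.5, while the median and mode stay at 7.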

When to use different averages

The proper average to be used depends upon the nature of the data, nature of the frequency distribution and the purpose.

If the data are qualitative, only the mode can be computed. For example, when we are interested in knowing the typical soil type in a locality or the typical cropping pattern in a region, we can use the mode. On the other hand, if the data are quantitative, we can use any one of the averages.

If the data are quantitative, then we have to consider the nature of the frequency distribution. When the frequency distribution is skewed (not symmetrical), the median or mode is the proper average. In the case of raw data in which extreme values, either small or large, are present, the median or mode is the proper average. In the case of a symmetrical distribution, the mean, median or mode can be used. However, as seen already, the mean is preferred over the other two.

When we are dealing with rates, speeds and prices, we use the harmonic mean. If we are interested in relative change, as in the case of bacterial growth, cell division etc., the geometric mean is the most appropriate average.
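Two small worked cases show why these averages suit rates and relative change. The Python sketch below uses invented figures: a journey covering equal distances at two speeds, and a culture whose size is multiplied by a different factor in each of two periods.

```python
import statistics

# Equal distances travelled at different speeds (hypothetical figures).
speeds = [40.0, 60.0]     # km/h on the outward and return legs
distance = 60.0           # km per leg
total_time = sum(distance / s for s in speeds)
true_average_speed = len(speeds) * distance / total_time
print(true_average_speed, statistics.harmonic_mean(speeds), statistics.mean(speeds))

# Growth by a different factor in each period (hypothetical figures).
growth_factors = [2.0, 8.0]
final_size = 1.0
for factor in growth_factors:
    final_size *= factor
average_factor = statistics.geometric_mean(growth_factors)
print(final_size, average_factor ** len(growth_factors))
```

The harmonic mean reproduces the true average speed (total distance over total time), which the arithmetic mean overstates. The geometric mean is the constant factor that, applied in every period, gives the same final size.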