Statistical tests and their importance
A test statistic is a statistic used in Statistical Hypothesis Testing (SHT). Statistical hypothesis testing predates the computer era, and although p-values and hypothesis testing have different historical roots, today they are used under the same umbrella of SHT. Business, government, and the hard and soft sciences all use, and often abuse, these statistical tests. How? It is all about answering, correctly or incorrectly, the question
“Could these observations really have occurred by chance?”
based on samples and given some expectation (hypothesis).
A Common Methodology
How do we do this?
- We devise a test statistic that models the outcome of our observations
- We model the expected occurrence of values from this quantity
- We measure the actual value of the test statistic in the real world
- We see how extreme this measure is, with respect to the expected
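The four steps above can be sketched with a simple Monte Carlo coin-flip example (the numbers here are made up for illustration): the test statistic is the number of heads, its expected behaviour under chance is modeled by simulation, and we check how extreme an observed count of 60 heads is.

```python
import random

random.seed(0)

# 1. Test statistic: number of heads in 100 fair-coin flips
# 2. Model its expected occurrence under pure chance by simulation
# 3. Suppose we actually observed 60 heads in the real world
# 4. See how extreme 60 is with respect to the simulated distribution
observed = 60
n_flips, n_sims = 100, 20_000
sims = [sum(random.random() < 0.5 for _ in range(n_flips)) for _ in range(n_sims)]
extreme = sum(s >= observed for s in sims)
print(f"Fraction of simulations at least as extreme: {extreme / n_sims:.4f}")
```

The printed fraction (around 0.03) is an empirical version of a one-sided p-value: 60 heads out of 100 is unusual, but not impossible, under chance alone.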
Hypothesis Testing and Data Science
Tests are all about repetition of observations. So, we need multiple observations, the larger the sample size the more confident we may be. But how do we form a hypothesis?
A hypothesis is a statement, not a question.
Given that, we try to formulate a statement that we can study and test, that can in principle be disproved (it is falsifiable), and that can be an evolution of a previous hypothesis. Below are the formal steps we follow:
Step 1: Formulate all hypotheses
H₀: The NULL hypothesis states that the observations are the result of pure chance; this is usually what we want to reject.
H₁: The ALTERNATE hypothesis states that the observations reflect a real effect, plus some component of chance.
Step 2: Choose the appropriate test statistic that will be used as evidence against the null hypothesis, and set the decision rule: the statement that designates the statistical conditions necessary for rejecting the null hypothesis.
Step 3: Compute the p-value, a probability statement that answers the question: if the null hypothesis were true, what is the probability of observing a test statistic at least as extreme as ours? The smaller the p-value, the stronger the evidence against H₀.
Step 4: Compare the p-value to a fixed significance level α, the threshold below which we consider the effect statistically significant and reject the null hypothesis. In scientific work a fixed α-level of 0.05 or 0.01 is often used.
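The four formal steps can be run end to end for the coin-flip case with an exact binomial computation (hypothetical numbers, using only the standard library):

```python
from math import comb

# Step 1 - H0: the coin is fair (p = 0.5); H1: the coin is biased toward heads
# Step 2 - Test statistic: heads out of n flips; decision rule: reject H0 if p-value < alpha
n, heads, alpha = 100, 60, 0.05

# Step 3 - One-sided exact p-value: P(X >= 60 | H0 true)
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n
print(f"p-value = {p_value:.4f}")

# Step 4 - Compare to the significance level
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Here the exact p-value is about 0.028, below α = 0.05, so we reject the null hypothesis of a fair coin.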
Which test should I use?
There are general guidelines for correctly choosing a statistical test, based more or less on the number and type of dependent variables, the number and type of independent variables, and their descriptive statistics (their distributions). It is worth taking time to summarize and visualize the data first, because the summary results and the charts can reveal patterns and outliers that help in deciding the next steps.
A good place to start, with some extra help on choosing the correct statistical test, is the UCLA guide listed in the references.
Decision analysis of Hypothesis Testing
Remember the smoke detector at your home (if you happen to have one despite your absent-mindedness) that sounds its siren every time you burn your toast? We can think of hypothesis testing and the significance level in terms of this smoke detector.
This is called a Type I error: an alarm without a fire; a false positive, or the incorrect rejection of a true NULL hypothesis.
On the contrary, a Type II error is a fire without an alarm: a false negative, or the failure to reject a false NULL hypothesis.
We can summarize this in a 2x2 decision table:

                      H₀ is true          H₀ is false
  Reject H₀           Type I error        Correct decision
  Fail to reject H₀   Correct decision    Type II error
And finally, a Type III error: correctly rejecting the NULL hypothesis, but for the wrong reason.
Confidence Intervals
How confident are we? Are we experimenting on a small or a large data set? Most research uses sample data to make inferences about a wider population. For example, in a paired t-test, hypothesis testing and the p-value tell us whether there is significant evidence of a difference or not, while the confidence interval of the difference gives an indication of the size of that difference.
Thus we can use either p-values or confidence intervals to determine whether our results are statistically significant. They are both inferential methods, and their results should agree.
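To make the complementarity concrete, here is a minimal sketch computing a 95% confidence interval for a mean paired difference, using made-up data and SciPy's t distribution:

```python
from statistics import mean, stdev
from math import sqrt
from scipy import stats

# Hypothetical paired differences (after - before) for 10 subjects
diffs = [2.1, 1.8, 2.5, 0.9, 1.4, 2.0, 1.1, 1.7, 2.3, 1.6]
n = len(diffs)
m, se = mean(diffs), stdev(diffs) / sqrt(n)

# 95% confidence interval for the mean difference (t distribution, df = n - 1)
t_crit = stats.t.ppf(0.975, df=n - 1)
low, high = m - t_crit * se, m + t_crit * se
print(f"mean difference = {m:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the interval excludes 0, it agrees with a paired t-test that would report a significant difference, and it additionally conveys how large the difference is.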
Most common statistical tests
Independent (unpaired) or Student's t-test
A t-test is used to compare the means of two independent groups. Independent groups means that different people are in each group.
Independent variable: Binary
Dependent variable: Continuous
Assumptions: Normality, Homogeneity of variance
Visualize with: Box-plots or Confidence Interval plots
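A minimal sketch of an independent t-test with SciPy, using hypothetical reaction-time samples from two independent groups:

```python
from scipy import stats

# Hypothetical reaction times (seconds) from two independent groups
group_a = [0.82, 0.91, 0.77, 0.85, 0.88, 0.79, 0.90, 0.84]
group_b = [0.95, 1.02, 0.98, 0.91, 1.05, 0.99, 0.93, 1.01]

# Student's t-test for two independent samples
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the group means differ significantly")
```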
Mann-Whitney U test
The Mann–Whitney U test is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one population will be less than or greater than a randomly selected value from a second population. It is the non-parametric equivalent to the unpaired t-test.
Independent variable: Binary
Dependent variable: Ordinal/Continuous
Assumptions: -
Visualize with: Histograms of the two groups
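A minimal sketch of the Mann–Whitney U test with SciPy, on hypothetical ordinal satisfaction ratings (1–5) from two groups:

```python
from scipy import stats

# Hypothetical satisfaction ratings (ordinal 1-5) from two independent groups
group_1 = [1, 2, 2, 3, 3, 3, 2]
group_2 = [4, 4, 5, 3, 5, 4, 5]

# Non-parametric alternative to the unpaired t-test
u_stat, p_value = stats.mannwhitneyu(group_1, group_2, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```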
Dependent or paired t-test
A paired samples t-test can only be used when the data is paired or matched. It compares the means of two related groups to determine whether there is a statistically significant difference between these means.
Independent variable: Categorical that has only 2 groups
Dependent variable: Continuous (at least interval)
Assumptions: Normality
Visualize with: Histogram of differences
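A minimal sketch of a paired t-test with SciPy, using hypothetical before/after blood-pressure readings for the same subjects:

```python
from scipy import stats

# Hypothetical systolic blood pressure for the same 8 subjects, before and after treatment
before = [140, 135, 150, 145, 138, 142, 148, 136]
after  = [132, 130, 144, 140, 133, 138, 141, 130]

# Paired (dependent) samples t-test
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```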
Wilcoxon Signed Rank test
The Wilcoxon signed rank test is used to compare two related samples, matched samples or repeated measurements on a single sample to assess whether their population mean ranks differ. It is a paired difference test and is the non-parametric equivalent to the paired t-test.
Independent variable: Binary
Dependent variable: Ordinal/Continuous
Assumptions: -
Visualize with: Histogram of differences
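A minimal sketch of the Wilcoxon signed rank test with SciPy, on hypothetical paired pain scores (ordinal 0–10):

```python
from scipy import stats

# Hypothetical pain scores (0-10) for the same 10 patients, before and after treatment
before = [7, 6, 8, 5, 7, 6, 8, 7, 6, 5]
after  = [5, 4, 6, 4, 5, 4, 6, 5, 4, 4]

# Non-parametric alternative to the paired t-test
w_stat, p_value = stats.wilcoxon(before, after)
print(f"W = {w_stat}, p = {p_value:.4f}")
```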
Chi-squared test
The chi-squared goodness-of-fit test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. The chi-squared test of independence can be used to attempt rejection of the null hypothesis that two categorical variables are independent. It is a non-parametric test.
Independent variable: Categorical
Dependent variable: Categorical
Assumptions: The data in the cells should be frequencies, or counts of cases rather than percentages or some other transformation of the data. The levels (or categories) of the variables are mutually exclusive.
Visualize with: Stacked/ multiple bar chart with percentages
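A minimal sketch of a chi-squared test of independence with SciPy, on a hypothetical 2x2 contingency table of counts (note the assumption above: cells must be frequencies, not percentages):

```python
from scipy import stats

# Hypothetical contingency table of counts:
# rows = treatment / control, columns = improved / not improved
observed = [[30, 10],
            [15, 25]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```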
One-way ANOVA test
One-way analysis of variance (ANOVA) is a technique used to detect a difference in the means of three or more independent groups. It can be thought of as an extension of the t-test to three or more independent groups.
Independent variable: Categorical (at least 3 categories)
Dependent variable: Continuous
Assumptions: Residuals should be normally distributed, Homogeneity of variance
Visualize with: Box-plots or Confidence Interval plots
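A minimal sketch of one-way ANOVA with SciPy, using hypothetical measurements from three independent groups:

```python
from scipy import stats

# Hypothetical measurements from three independent groups
group_1 = [5, 6, 5, 7, 6]
group_2 = [8, 9, 8, 10, 9]
group_3 = [12, 11, 13, 12, 11]

# One-way ANOVA: are the group means all equal?
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```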
Kruskal-Wallis test
The Kruskal–Wallis test compares the medians of two or more samples to determine whether the samples come from different populations. It is an extension of the Mann–Whitney U test to three or more groups and the non-parametric equivalent of one-way ANOVA.
Independent variable: Categorical
Dependent variable: Ordinal/Continuous
Assumptions: -
Visualize with: Box-plots
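A minimal sketch of the Kruskal–Wallis test with SciPy, on hypothetical exam scores from three independent groups:

```python
from scipy import stats

# Hypothetical exam scores from three independent groups
group_1 = [55, 60, 58, 62, 57]
group_2 = [70, 72, 68, 75, 71]
group_3 = [85, 88, 84, 90, 86]

# Non-parametric alternative to one-way ANOVA
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```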
Experimental Design
The design of an experiment often spells success or failure!
Common setup problems include:
- Sample selections
- Sample sizes
- Randomization
- Replication
- Data collection
- Data exclusion
- Blind evaluation
References
- The Cartoon Guide to Statistics, by Larry Gonick and Woollcott Smith
- https://stats.idre.ucla.edu/other/mult-pkg/whatstat/