A/B testing, a form of randomized controlled experiment, is a method of comparing two or more versions of a product or service to determine which performs better. A/B testing is commonly used in marketing, web design, and product development to make data-driven decisions that can improve user experience, conversion rates, and overall performance.
A digital marketing team wants to determine if a new version of their website’s landing page leads to a higher conversion rate. They randomly assign website visitors to either the original landing page (version A) or the new landing page (version B) and record the conversion rates for each group. By comparing the conversion rates, the team can determine which version of the landing page is more effective.
Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. It involves setting up a null hypothesis (H₀), which represents the “status quo” or “no effect” scenario, and an alternative hypothesis (H₁), which represents the effect or relationship being tested.
The objective of hypothesis testing is to determine if there is enough evidence in the sample data to reject the null hypothesis in favor of the alternative hypothesis. To do this, a test statistic is calculated and compared to a critical value or p-value.
A pharmaceutical company wants to determine if a new drug is more effective at treating a specific condition than the current standard treatment. They conduct a clinical trial and use hypothesis testing to evaluate if there is a statistically significant difference in treatment outcomes between the new drug and the standard treatment.
The p-value is a measure of the evidence against the null hypothesis. It is the probability of observing data at least as extreme as the sample data, assuming the null hypothesis is true. Smaller p-values indicate stronger evidence against the null hypothesis.
A predetermined significance level (α), such as 0.05, is used as a threshold to decide whether to reject or fail to reject the null hypothesis. If the p-value is less than the significance level, the null hypothesis is rejected, and the result is considered statistically significant.
In Python, we can use the scipy.stats module to perform hypothesis tests and obtain p-values:
import scipy.stats as stats
# Example: One-sample t-test against a hypothesized population mean of 3
data = [2.3, 3.1, 2.8, 3.4, 2.6]
null_mean = 3
t_stat, p_value = stats.ttest_1samp(data, null_mean)
# p_value is the probability of seeing data at least this extreme if the null is true
In the A/B testing example, the marketing team would use the p-value to determine if there is a statistically significant difference in conversion rates between the two landing page versions. If the p-value is below the significance level, they can conclude that the new landing page has a significantly different conversion rate than the original.
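As a minimal sketch of such an analysis (the conversion counts below are made up), one common choice is a two-proportion z-test, available in statsmodels:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and total visitors for versions A and B
conversions = np.array([120, 156])
visitors = np.array([2400, 2400])
# Two-proportion z-test of the null hypothesis that both versions convert at the same rate
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
# If p_value < α (e.g., 0.05), the difference in conversion rates is statistically significant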
In hypothesis testing, there are two types of errors that can occur:
- Type I error (false positive): rejecting the null hypothesis when it is actually true.
- Type II error (false negative): failing to reject the null hypothesis when it is actually false.
The significance level (α) is the probability of making a Type I error, while the power of a test (1 - β) is the probability of correctly rejecting a false null hypothesis, where β is the probability of making a Type II error.
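Neither error rate can be observed directly in a single experiment, but both can be approximated by simulation. The following sketch (with arbitrary choices: samples of size 30, a true effect of 0.5 standard deviations, and 10,000 simulated experiments) estimates the Type I error rate under a true null hypothesis and the power under a true effect:

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
alpha, n_sims, n = 0.05, 10_000, 30
# Rejection rate when the null is true (true mean = 0) estimates the Type I error rate
false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
        false_positives += 1
print(f"Estimated Type I error rate: {false_positives / n_sims:.3f}")  # close to 0.05
# Rejection rate when the null is false (true mean = 0.5) estimates the power (1 - β)
true_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.5, scale=1.0, size=n)
    if stats.ttest_1samp(sample, 0.0).pvalue < alpha:
        true_positives += 1
print(f"Estimated power: {true_positives / n_sims:.3f}")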
In medical diagnostics, Type I and Type II errors have different consequences. A Type I error may result in unnecessary treatments for patients, while a Type II error may result in a failure to diagnose a condition, leading to a lack of necessary treatment. Balancing these errors is crucial to minimize adverse outcomes for patients.
The power of a hypothesis test is the probability of correctly rejecting a false null hypothesis. It is affected by the sample size, the true effect size, and the significance level. A higher power increases the likelihood of detecting a true effect when it exists.
Sample size plays a crucial role in hypothesis testing, as larger sample sizes lead to more precise estimates and increased power. Calculating the required sample size for a desired power can help researchers design experiments that are more likely to detect true effects.
In Python, we can use the statsmodels library to calculate the required sample size for a two-sample t-test:
import statsmodels.stats.power as smp
effect_size = 0.5  # expected standardized difference between group means (Cohen's d)
alpha = 0.05       # significance level (Type I error rate)
power = 0.8        # desired probability of detecting a true effect
# Required sample size per group; solve_power returns a float, so round up in practice
sample_size = smp.TTestIndPower().solve_power(effect_size, power=power, alpha=alpha)
A social scientist wants to determine if a new educational intervention has a significant impact on student test scores. By calculating the required sample size, they can design a study that has adequate power to detect a true effect, if it exists, while minimizing the likelihood of Type II errors.
The t-test is a common hypothesis test used to compare the means of two groups or the mean of a single group to a known value. There are three main types of t-tests:
- One-sample t-test: compares the mean of a single group to a known value.
- Independent two-sample t-test: compares the means of two independent groups.
- Paired t-test: compares the means of two related measurements, such as before-and-after scores for the same subjects.
In Python, we can use the scipy.stats module to perform t-tests:
import scipy.stats as stats
# Example: Independent two-sample t-test of the null hypothesis that the group means are equal
group1 = [3.1, 3.5, 2.8, 3.2, 2.9]
group2 = [2.1, 2.5, 1.8, 2.2, 1.9]
t_stat, p_value = stats.ttest_ind(group1, group2)
An agricultural researcher wants to determine if two different fertilizers have different effects on crop yields. They can use a two-sample t-test to compare the average yields of crops treated with each fertilizer and determine if there is a significant difference in performance.
Resampling is a statistical technique that involves repeatedly drawing samples from the original data and recalculating the test statistic. Resampling methods, such as the bootstrap and permutation tests, provide nonparametric alternatives to traditional parametric tests, making them more robust and less reliant on distributional assumptions.
A data scientist is analyzing customer satisfaction scores for a service company. Since the scores are not normally distributed, they decide to use a resampling technique like the bootstrap to estimate the confidence interval for the mean satisfaction score, allowing them to make inferences about the true population mean without relying on parametric assumptions.
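A minimal bootstrap sketch along those lines (the satisfaction scores below are made up) resamples the data with replacement and uses percentiles of the resampled means as a confidence interval:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical, non-normally distributed satisfaction scores (1-10 scale)
scores = np.array([2, 3, 3, 4, 4, 5, 5, 5, 6, 9, 10, 10])
# Resample with replacement and record the mean of each bootstrap sample
boot_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()
    for _ in range(10_000)
])
# 95% percentile bootstrap confidence interval for the population mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")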
The permutation test is a nonparametric method for testing the null hypothesis by comparing the observed test statistic to the distribution of test statistics obtained by rearranging the data. It involves randomly permuting the group assignments and recalculating the test statistic for each permutation.
In Python, we can implement a permutation test using the numpy library:
import numpy as np
def permutation_test(group1, group2, num_permutations=1000):
    # Pool the data so group labels can be reassigned at random
    combined = np.concatenate([group1, group2])
    # Observed difference in group means
    observed_diff = np.mean(group1) - np.mean(group2)
    count = 0
    for _ in range(num_permutations):
        np.random.shuffle(combined)
        perm_diff = np.mean(combined[:len(group1)]) - np.mean(combined[len(group1):])
        # Count permutations at least as extreme as the observed difference
        if abs(perm_diff) >= abs(observed_diff):
            count += 1
    # The two-sided p-value is the fraction of permutations as extreme as the observed one
    return count / num_permutations
A sports scientist wants to determine if there is a significant difference in the performance of two groups of athletes who followed different training programs. Since the data may not be normally distributed, they decide to use a permutation test to assess the statistical significance of the observed difference in performance.
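For instance, with hypothetical performance scores for the two training groups, the permutation_test function defined above could be applied directly:

program_a = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]  # hypothetical times, program A
program_b = [11.9, 11.5, 11.7, 11.4, 12.0, 11.6]  # hypothetical times, program B
p_value = permutation_test(program_a, program_b, num_permutations=10_000)
print(f"Permutation p-value: {p_value:.3f}")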
The chi-squared test is a statistical test used to determine if there is a significant association between two categorical variables in a contingency table. It compares the observed frequencies in the table to the frequencies that would be expected under the assumption of independence.
In Python, we can use the scipy.stats module to perform a chi-squared test:
import numpy as np
import scipy.stats as stats
# Contingency table of observed frequencies (rows: groups, columns: categories)
observed = np.array([[10, 20, 30], [20, 30, 40]])
chi2_stat, p_value, _, _ = stats.chi2_contingency(observed)
A market researcher wants to determine if there is a significant association between the type of advertisement and customer purchasing behavior. They can use a chi-squared test to analyze the relationship between the two categorical variables and draw conclusions about their independence.
Fisher’s exact test is a nonparametric test used to determine the significance of the association between two categorical variables in a 2x2 contingency table. It is especially useful when the sample size is small, and the assumptions of the chi-squared test are not met.
In Python, we can use the scipy.stats module to perform Fisher’s exact test:
import numpy as np
import scipy.stats as stats
# 2x2 contingency table, e.g., genetic marker (rows) vs. disease status (columns)
observed = np.array([[8, 2], [1, 5]])
# fisher_exact returns the odds ratio and the exact p-value
odds_ratio, p_value = stats.fisher_exact(observed)
A medical researcher wants to determine if there is a significant association between a rare disease and a specific genetic marker. Due to the rarity of the disease, the sample size is small. They can use Fisher’s exact test to analyze the relationship between the presence of the genetic marker and the disease without relying on the chi-squared test’s assumptions.
The multiple testing problem arises when performing many hypothesis tests simultaneously, which increases the likelihood of making at least one Type I error. To keep this inflated error rate under control, the significance level can be adjusted using methods such as the Bonferroni correction or False Discovery Rate (FDR) control.
A genomics researcher is analyzing gene expression data to identify differentially expressed genes between two conditions. Since thousands of genes are being tested simultaneously, the multiple testing problem arises. The researcher can apply the Bonferroni correction or FDR control methods to adjust the significance level and minimize the chance of identifying false positives among the differentially expressed genes.
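As a small sketch with made-up p-values, the multipletests function in statsmodels applies both kinds of adjustment:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five simultaneous hypothesis tests
p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.20])
# Bonferroni correction: controls the family-wise error rate
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
# Benjamini-Hochberg procedure: controls the false discovery rate
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print("Bonferroni rejects:", reject_bonf)
print("FDR (BH) rejects:", reject_fdr)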
In summary, Chapter 3 of Practical Statistics for Data Scientists - Second Edition provides an overview of statistical experiments and significance testing, covering concepts such as A/B testing, hypothesis testing, p-value, Type I and Type II errors, power and sample size, t-test, resampling, permutation test, chi-squared test, Fisher’s exact test, and the multiple testing problem. These concepts are essential for data scientists to perform rigorous statistical analyses and make valid inferences from sample data.