# Methodological considerations

## 1. P-value

In hypothesis testing, the p-value is the probability of obtaining data at least as extreme as those actually observed, assuming the null hypothesis is true. A very small p-value means that such an outcome would be very unlikely under the null hypothesis. Reporting p-values of statistical tests is common practice in academic publications in many quantitative fields (1)(2). By convention, an outcome is called statistically significant when its p-value is below 0.05, and strongly significant when it is below 0.01.
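As an illustration, the sketch below computes a one-sided p-value for an exact binomial test. The scenario (60 heads in 100 coin flips, with a fair coin as the null hypothesis) and the function name `binomial_p_value` are assumptions chosen for this example; they do not come from the cited sources.

```python
import math

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        math.comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 60 heads in 100 flips, null hypothesis "the coin is fair".
p = binomial_p_value(60, 100)
print(f"one-sided p-value: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The sum runs over every outcome at least as extreme as the observed one, which is exactly the "at least as extreme" part of the definition.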

## 2. Multiple testing

In hypothesis testing, one may need to test several hypotheses at once or to check multiple endpoints. Doing so gives rise to the so-called multiple testing problem (also known as the multiple comparisons problem): the more tests are performed, the higher the risk 𝛼 of wrongly rejecting at least one true null hypothesis. To overcome this, a better clinical design and/or a correction of the p-values (or of alpha) of the tests is required (1).
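The following sketch shows how quickly the risk grows and how two common corrections, Bonferroni and Holm, decide which hypotheses to reject. The family-wise error formula assumes independent tests, and the p-values are made up for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to progressively looser thresholds."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

# Hypothetical p-values from five tests.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
print(f"{family_wise_error_rate(0.05, 10):.3f}")
print(bonferroni_reject(p_values))
print(holm_reject(p_values))
```

Holm controls the same family-wise error rate as Bonferroni but never rejects fewer hypotheses, which is why it is often preferred when no stronger assumptions are available.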

## 3. Sample size calculation

Sample size calculation is a basic statistical principle: the sample size is defined before a clinical study starts so as to avoid bias in interpreting the results. If too few subjects are included, the results cannot be generalized to the population because the sample will not adequately represent the target population. Furthermore, the study may then be unable to detect the difference between test groups, making the study unethical. On the other hand, if more subjects are studied than required, more individuals are exposed to the risk of the intervention, which also makes the study unethical and wastes precious resources. Generally, the sample size for any study depends on the following: the acceptable level of significance, the statistical power of the study, the expected effect size, the underlying event rate in the population, and the standard deviation in the population (1).
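To show how these ingredients combine, here is a minimal sketch for one specific design: comparing the means of two equally sized groups with a two-sided test. It uses the normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)² σ² / δ² per group. The design, the function name, and the example numbers are assumptions for illustration; other designs (proportions, survival endpoints) need different formulas.

```python
import math
from statistics import NormalDist

def required_n_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of means.

    Normal approximation; assumes equal group sizes and a common, known sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma**2 / effect**2
    return math.ceil(n)

# Hypothetical inputs: expected difference of 0.5 units, standard deviation 1.0.
print(required_n_per_group(effect=0.5, sigma=1.0))
```

Halving the expected effect size roughly quadruples the required sample, which is why the effect-size assumption usually deserves the most scrutiny.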

## 4. Imbalanced classes

When dealing with categorical variables, one can easily run into the imbalanced classes issue. The threshold for deciding whether the distribution of observations across the classes of a categorical variable is imbalanced is not well defined, as it depends on the number of groups and the number of observations per group in the sample. However, classes with far fewer observations than the others will frequently be underrepresented in overall analyses. This is the problem that imbalanced-class methods address. Several such methods exist, and the appropriate one depends on which statistical method is being employed to analyze the data (1).
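One of the simpler remedies is to reweight observations by inverse class frequency, so that each class contributes equally in total. The sketch below is only one possible approach, and the labels are invented; whether reweighting is appropriate depends on the analysis method in use.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical sample: 90 observations of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the summed weight of each class is the same, so the minority class is no longer dominated in any weighted analysis.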

## 5. Bootstrapping

Bootstrapping is any test or metric that uses random sampling with replacement (i.e. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods (1). (See Statistical power below)
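A minimal sketch of a percentile bootstrap confidence interval is given below. The data, the number of resamples, and the fixed seed are assumptions for the example; the percentile method is one of several ways to turn bootstrap estimates into an interval.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(data)
    # Resample with replacement, recompute the statistic each time, then sort.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2, 2.7, 2.6]
low, high = bootstrap_ci(data)
print(f"sample mean: {statistics.mean(data):.2f}")
print(f"95% bootstrap CI: ({low:.2f}, {high:.2f})")
```

Passing a different function as `stat` (for example `statistics.median`) gives an interval for that statistic without any change to the resampling logic.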

## 6. Statistical power

In hypothesis testing, the power of a test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false. The higher the statistical power for a given experiment, the lower the probability of making a type II (false negative) error. Experimental results with too little statistical power can lead to invalid conclusions about the meaning of the results. It is common to design experiments with a statistical power of 80% (1). (See Bootstrapping above)
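Power can also be estimated by simulation: generate many datasets under an assumed true effect and count how often the test rejects. The sketch below does this for a two-sample z-test with known standard deviation. The test, the effect size, and the group size are assumptions chosen for illustration, not values from the cited source.

```python
import math
import random
from statistics import NormalDist

def simulated_power(effect, sigma, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated two-sample z-tests (known sigma) that reject H0."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = sigma * math.sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / std_err) > z_crit:
            rejections += 1
    return rejections / n_sims

# Hypothetical scenario: true difference 0.5, sigma 1.0, 63 subjects per group.
print(f"estimated power: {simulated_power(0.5, 1.0, 63):.2f}")
# With no true effect, the rejection rate should be close to alpha.
print(f"false positive rate: {simulated_power(0.0, 1.0, 63):.2f}")
```

The same approach extends to tests without a closed-form power formula, since only the data-generating step and the test itself need to change.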