# Statistical modeling

This section describes the most popular statistical modeling approaches and techniques, which are detailed below.

## 1. Regression

Regression models describe the potential relationship between at least two variables by fitting a prediction function (trend) to the observed data. Regression can serve to build a predictive model from apparently random data and to reveal trends in it, for example in cancer diagnoses or biomarkers (1, 2, 3).

## 2. Linear regression

Simple linear regression models assume a linear relationship between a single predictor and the response. To this end, the parameters “slope” and “y-intercept” of a slope-intercept equation are fitted to the observed data. When more than one predictor is considered, a set of “slopes” is estimated, one per predictor (1, 2, 3).
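The fitting of slope and y-intercept can be illustrated with a minimal sketch of ordinary least squares for a single predictor. The function name `fit_simple_linear_regression` and the data values are assumptions made up for illustration; they do not come from the cited sources.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, invented for this example only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(x, y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With several predictors, the same least-squares idea yields one slope per predictor, which is usually solved with a linear algebra library rather than by hand.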

## 3. Logistic regression

Logistic regression is a classification and predictive analytics algorithm. It is used to predict a categorical response from a set of independent variables. The idea is to model the relationship between the features and the probability of a particular outcome (1, 2).

## 3.1 Binomial logistic regression

In binomial logistic regression, one predicts a binary outcome (0 or 1) based on a set of independent variables (1).
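As a rough sketch of how a binary outcome can be modeled, the example below fits a one-predictor logistic model by plain gradient ascent on the log-likelihood. The data, the learning rate, the iteration count, and all function names are assumptions chosen for illustration; statistical packages typically use more robust optimizers.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_binomial_logistic(x, y, learning_rate=0.1, n_iterations=5000):
    """Estimate (intercept, slope) by gradient ascent on the mean log-likelihood."""
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(n_iterations):
        grad_intercept = 0.0
        grad_slope = 0.0
        for xi, yi in zip(x, y):
            error = yi - sigmoid(intercept + slope * xi)
            grad_intercept += error
            grad_slope += error * xi
        intercept += learning_rate * grad_intercept / n
        slope += learning_rate * grad_slope / n
    return intercept, slope


def predict_probability(intercept, slope, xi):
    """Predicted probability that the outcome equals 1 for predictor value xi."""
    return sigmoid(intercept + slope * xi)


# Hypothetical data: one predictor and a binary (0 or 1) outcome.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

intercept, slope = fit_binomial_logistic(x, y)
print(predict_probability(intercept, slope, 1.0))
print(predict_probability(intercept, slope, 4.5))
```

A predicted probability above a chosen threshold, commonly 0.5, is then read as outcome 1.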

## 3.2 Multinomial logistic regression

In multinomial logistic regression, the response variable can take three or more nominal (unordered) values (1).
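One common formulation turns one linear score per class into class probabilities with the softmax function. The sketch below shows only this prediction step; the coefficients, intercepts, and feature values are hypothetical numbers, not fitted estimates, and the function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    max_score = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class_probabilities(features, weights, intercepts):
    """Compute one linear score per class, then normalize the scores with softmax."""
    scores = [
        b + sum(w * f for w, f in zip(class_weights, features))
        for class_weights, b in zip(weights, intercepts)
    ]
    return softmax(scores)


# Hypothetical coefficients for three nominal classes and two features.
weights = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
intercepts = [0.0, 0.1, -0.2]
features = [2.0, 1.0]

probabilities = predict_class_probabilities(features, weights, intercepts)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```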

## 3.3 Ordinal logistic regression

Ordinal logistic regression is used when the response variable is ordinal, that is, when it has more than two categories or levels with a meaningful order. Examples of such variables include the extent of tumor growth in cancer (graded tumor growth), answers in an opinion poll (Agree/Neutral/Disagree), or clinical score measurements (Severe/Moderate/Mild/Absent) (1).

## 4. ANOVA

Analysis of variance (ANOVA) is a statistical technique used to check whether the means of two or more groups differ significantly from each other. It assesses the impact of one or more factors by comparing the means of different samples. It is often applied after fitting a statistical model such as a linear regression. One-way ANOVA is the most basic form. Other forms suit different situations, including two-way ANOVA, factorial ANOVA, Welch’s F-test ANOVA and ranked ANOVA; the Games-Howell test is a related post-hoc procedure for pairwise comparisons (1,2).
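For one-way ANOVA, the comparison of group means comes down to an F statistic: the variance between group means divided by the variance within groups. The following is a minimal sketch with invented data; the function name `one_way_anova_f` is an assumption, and obtaining a p-value would additionally require the F distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of observations."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (value - sum(group) / len(group)) ** 2 for group in groups for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = one_way_anova_f(groups)
print(f_statistic)  # approximately 13 for this data
```

A large F statistic indicates that the group means differ more than the within-group scatter alone would suggest.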

## 5. AIC

The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from, and thereby for comparing the quality of a set of statistical models. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. The best-fit model according to AIC is the one that explains the greatest amount of variation using the fewest possible independent variables. As with the BIC (see below), among several alternative models the preferred one is the model with the minimum AIC value (1, 2, 3).
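The standard definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood of the model. The sketch below applies it to two candidate models; the log-likelihood values, parameter counts, and model names are hypothetical and serve only to show the comparison.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "simple": (-120.0, 3),
    "complex": (-118.5, 6),
}

aic_values = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best_model = min(aic_values, key=aic_values.get)
print(aic_values, best_model)
```

Here the more complex model fits slightly better, but not by enough to offset its three extra parameters, so the simpler model has the lower AIC.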

## 6. BIC

Along with the AIC, the Bayesian information criterion (BIC) is a well-known general approach to model selection that favors more parsimonious models over more complex ones (i.e. it adds a penalty based on the number of parameters estimated in the model). The difference between BIC and AIC shows when k parameters (regressors and/or the intercept) are added to increase the goodness of fit of the model. In that case, the BIC penalizes the additional parameters more heavily than the AIC does for all but very small samples, because its per-parameter penalty grows with the sample size. As with the AIC, among several alternative models the preferred one is the model with the minimum BIC value (1).
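The difference in penalty can be made concrete with the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where n is the number of observations. Each extra parameter costs 2 under AIC but ln(n) under BIC, which exceeds 2 once n is 8 or more. The sketch below uses hypothetical log-likelihoods, parameter counts, and sample size.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_parameters * math.log(n_observations) - 2 * log_likelihood


# Hypothetical values: two models fitted to the same 100 observations.
n_observations = 100
simple = (-120.0, 3)    # (maximized log-likelihood, number of parameters)
complex_ = (-118.5, 6)

aic_simple, aic_complex = aic(*simple), aic(*complex_)
bic_simple = bic(*simple, n_observations)
bic_complex = bic(*complex_, n_observations)

# The gap in favor of the simpler model is wider under BIC than under AIC.
print(aic_complex - aic_simple, bic_complex - bic_simple)
```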