# Quantitative validation methods

# Patient level data

Here we present the measures used when handling patient-level data in the context of validation:

## 1. Pearson correlation test

See Pearson correlation test in the Hypothesis testing section.

## 2. Spearman’s rank test

See Spearman's rank test in the Hypothesis testing section.

## 3. Kendall rank correlation test

See Kendall rank correlation test in the Hypothesis testing section.

## 4. Mean Bias Error

The Mean Bias Error (MBE) compares forecasted outputs ŷ (the predicted time series) with observed data y (the measured time series) by averaging their differences. MBE is not a good indicator of model reliability on its own, because positive and negative errors often cancel each other out, but it shows whether the model tends to overestimate or underestimate the observations, and by how much (1).
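The cancellation effect described above can be illustrated with a minimal sketch. It assumes the common convention MBE = (1/n) Σ (ŷᵢ − yᵢ), so that a positive value means overestimation; the text does not fix the sign convention, and the function name and data below are purely illustrative.

```python
def mean_bias_error(predicted, observed):
    """Mean of (predicted - observed); assumed sign convention:
    positive = the model overestimates, negative = it underestimates."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - o for p, o in zip(predicted, observed)) / len(predicted)


# Illustrative data: the errors are +1, -1, +1, -1.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

mbe = mean_bias_error(predicted, observed)

# Mean absolute error, shown only for contrast: it does not let errors cancel.
mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

print(f"MBE = {mbe}, MAE = {mae}")
```

Here every prediction is off by one unit, yet the MBE is zero because the signed errors cancel, which is why MBE is better read as a measure of systematic bias than of overall accuracy.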

## 5. ROC

A receiver operating characteristic curve (ROC curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters at different classification thresholds: the True Positive Rate (TPR), the fraction of actual positives classified as positive, and the False Positive Rate (FPR), the fraction of actual negatives classified as positive. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives (1).
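To make the threshold sweep concrete, here is a small sketch in plain Python that computes one (FPR, TPR) point per distinct score threshold. The function name, the "score >= threshold means positive" rule, and the toy data are assumptions made for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest.

    labels: 1 for positive, 0 for negative.
    An item is classified positive when its score >= threshold (assumed rule).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data, purely illustrative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

As the threshold decreases, neither rate can go down, so the points trace a path from (0, 0) to (1, 1).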

## 6. AUC

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the ROC curve (as in integral calculus) from (0,0) to (1,1). It provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0 (1).
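One way to compute this area without drawing the curve is the pairwise formulation: the AUC equals the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as one half. The sketch below uses that formulation; the function name and the toy data are illustrative assumptions.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, purely illustrative.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_pairwise(labels, scores)
auc_perfect = auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
auc_inverted = auc_pairwise([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
print(auc, auc_perfect, auc_inverted)
```

The two extreme cases reproduce the bounds given above (1.0 for a perfect ranking, 0.0 for a fully inverted one). A model that ranks items at random would be expected to score around 0.5.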

## 7. Bland-Altman analysis (MA plot or normalized prediction distribution error analysis)

The Bland-Altman (B&A) analysis quantifies the agreement between two quantitative measurements by studying the mean difference and constructing limits of agreement. The B&A plot is a simple way to evaluate the bias (the mean difference between the two methods) and to estimate an agreement interval within which 95% of the differences between the second method and the first one fall (1)(2)(3).
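The bias and the limits of agreement can be sketched as follows. This assumes the usual construction of the limits as the mean difference ± 1.96 standard deviations of the differences, which presumes the differences are approximately normally distributed; the function name and the paired measurements are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    Differences are taken as method_b - method_a (second minus first).
    Limits of agreement assume roughly normal differences (bias +/- 1.96 SD).
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [b - a for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired measurements from two methods.
method_a = [10.0, 12.0, 14.0, 16.0, 18.0]
method_b = [10.5, 12.5, 13.5, 16.5, 18.5]

bias, lower, upper = bland_altman(method_a, method_b)
print(f"bias = {bias:.3f}, limits of agreement = [{lower:.3f}, {upper:.3f}]")
```

In a B&A plot, each pair would be drawn at x = (a + b) / 2 and y = b − a, with horizontal lines at the bias and at the two limits.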

# Population level data

Here we detail the measures used when handling population-level data in the context of validation. Some of these methods are already presented in the Hypothesis testing section. Visit that section for more information.

## 1. Parametric tests

See Parametric tests in the Hypothesis testing section.

## 1.1 Z-test

See Z-test in the Hypothesis testing section.

## 1.2 Student’s t-test

See Student’s t-test in the Hypothesis testing section.

## 1.3 Chi-squared test

See Chi-squared test in the Hypothesis testing section.

## 2. Non-parametric tests

See Non-parametric tests in the Hypothesis testing section.

## 2.1 Kolmogorov-Smirnov test

See Kolmogorov-Smirnov test in the Hypothesis testing section.

## 2.2 Mann-Whitney U test

See Mann-Whitney U test in the Hypothesis testing section.

## 2.3 Wilcoxon signed-rank test

See Wilcoxon signed-rank test in the Hypothesis testing section.

## 3. Coverage and Precision metrics (Nova’s method)

When only summary data are available, we can evaluate the Computational Model’s (CM) ability to reproduce a time series or a Kaplan-Meier-like curve using visual predictive checking (VPC) combined with two metrics: coverage and precision. Both metrics are based on the widths of the predicted interval (Prediction Interval) and of the observed interval (Confidence Interval):

## 3.1 Coverage

The coverage represents the model’s ability to reproduce the range of data observed in real life.

Coverage = (Obs. interval ∩ Pred. interval)/Obs. interval

## 3.2 Precision

The precision represents the model’s ability to provide results with a reasonable variability.

Precision = (Obs. interval ∩ Pred. interval)/Pred. interval

Both metrics range from 0 to 100%. Coverage and precision of at least 70% are considered acceptable, and 80% or more is considered good (1).
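As a minimal sketch of the two formulas above, the code below computes coverage and precision for a single pair of intervals, each given as a (lower, upper) tuple. How values are aggregated across the time points of a curve is not specified here, so the sketch deliberately handles one time point only; the function names and numbers are illustrative.

```python
def interval_overlap(obs, pred):
    """Width of the intersection of two (lower, upper) intervals (0 if disjoint)."""
    lower = max(obs[0], pred[0])
    upper = min(obs[1], pred[1])
    return max(0.0, upper - lower)


def coverage_and_precision(obs, pred):
    """Coverage = overlap / observed width; precision = overlap / predicted width."""
    obs_width = obs[1] - obs[0]
    pred_width = pred[1] - pred[0]
    if obs_width <= 0 or pred_width <= 0:
        raise ValueError("intervals must have positive width")
    overlap = interval_overlap(obs, pred)
    return overlap / obs_width, overlap / pred_width


# Illustrative intervals at one time point.
observed_ci = (2.0, 6.0)    # observed confidence interval
predicted_pi = (3.0, 8.0)   # model prediction interval

coverage, precision = coverage_and_precision(observed_ci, predicted_pi)
print(f"coverage = {coverage:.0%}, precision = {precision:.0%}")
```

In this example the overlap is 3 units wide, giving a coverage of 75% (acceptable under the thresholds above) and a precision of 60% (below the acceptable threshold), because the prediction interval is wider than the part it shares with the observed interval.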

## 4. Ratio of AUCs test (Nova’s method)

When comparing two curves (observed versus simulated), the ratio of AUCs can be useful. This metric is the area between the two curves (the region below the upper curve and above the lower curve) divided by the area under the observed curve. We consider 0.3 or below an acceptable threshold for this metric: on a plot of simulated data with a limited amount of noise, an AUC ratio of 0.3 is equivalent to a difference of 20% between simulated and observed values (see reference 1 for more detail).
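The ratio can be sketched numerically as shown below, assuming both curves are sampled on the same time grid and that areas are approximated with the trapezoidal rule. Integrating the absolute difference is only an approximation of the area between the curves when they cross between two grid points. The function names and data are illustrative, not the reference implementation.

```python
def trapezoid(x, y):
    """Trapezoidal-rule approximation of the area under y(x)."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(x) - 1))


def auc_ratio(t, observed, simulated):
    """Area between the two curves divided by the area under the observed curve."""
    if not (len(t) == len(observed) == len(simulated)) or len(t) < 2:
        raise ValueError("curves must share a time grid of at least two points")
    gap = [abs(s - o) for s, o in zip(simulated, observed)]
    return trapezoid(t, gap) / trapezoid(t, observed)


# Illustrative curves on a shared time grid.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [10.0, 8.0, 6.0, 4.0, 2.0]
simulated = [11.0, 9.0, 6.0, 3.0, 1.0]

ratio = auc_ratio(t, observed, simulated)
print(f"AUC ratio = {ratio:.3f}, acceptable = {ratio <= 0.3}")
```

Here the area between the curves is 3 and the area under the observed curve is 24, so the ratio is 0.125, which falls under the 0.3 threshold mentioned above.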