deep-significance: Easy and Better Significance Testing for Deep Neural Networks
Contents
- :interrobang: Why
- :inbox_tray: Installation
- :bookmark: Examples
- Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
- Scenario 1: Comparing multiple runs of two models
- Scenario 2: Comparing multiple runs across datasets
- Scenario 3: Comparing sample-level scores
- Scenario 4: Comparing more than two models
- How to report results
- Sample size
- Other features
- General Recommendations & other notes
- :mortar_board: Cite
- :medal_sports: Acknowledgements
- :people_holding_hands: Papers using deep-significance
- :books: Bibliography
:interrobang: Why?
Although Deep Learning has undergone spectacular growth over the last decade, a large portion of experimental evidence is not supported by statistical hypothesis tests. Instead, conclusions are often drawn based on single performance scores.
This is problematic: Neural networks display highly non-convex loss surfaces (Li et al., 2018), and their performance depends on the specific hyperparameters that were found, or on stochastic factors like Dropout masks, making comparisons between architectures difficult. Based on comparing only (the mean of) a few scores, we often cannot conclude that one model type or algorithm is better than another. This endangers progress in the field, as apparent success due to random chance might lead practitioners astray.
For instance, a recent study in Natural Language Processing by Narang et al. (2021) found that many modifications proposed for transformers do not actually improve performance. Similar issues are known to plague other fields, such as Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017).
To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing:
- Statistical significance tests such as Almost Stochastic Order (del Barrio et al., 2017; Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989).
- Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936).
- Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size.
All functions are fully tested and compatible with common deep learning data structures, such as PyTorch / Tensorflow tensors as well as NumPy and Jax arrays. For usage examples, consult the documentation here, the scenarios in the section Examples, or the demo Jupyter notebook.
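As a quick sketch of this compatibility, the call below passes PyTorch tensors directly to `aso()` (the ASO test introduced in the Examples section); the scores are simulated purely for illustration:

```python
# Sketch: the significance tests accept framework-native score containers.
import torch
from deepsig import aso

scores_a = torch.randn(5) + 0.5  # five runs of model A (toy numbers)
scores_b = torch.randn(5)        # five runs of model B
min_eps = aso(scores_a, scores_b, seed=42)  # works the same with NumPy arrays
```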
:inbox_tray: Installation
The package can simply be installed using pip by running

```bash
pip3 install deepsig
```
Another option is to clone the repository and install the package locally:

```bash
git clone https://github.com/Kaleidophon/deep-significance.git
cd deep-significance
pip3 install -e .
```
Warning: Installed like this, imports will fail if the cloned repository is moved.
:bookmark: Examples
tl;dr: Use `aso()` to compare scores for two models. If the returned `eps_min < 0.5`, A is better than B. The lower `eps_min`, the more confident the result (we recommend checking `eps_min < 0.2` and recording `eps_min` alongside experimental results).
:warning: Testing models with only one set of hyperparameters and only one test set will not be able to guarantee superiority in all settings. See General Recommendations & other notes.
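A minimal sketch of this workflow, with scores simulated for illustration:

```python
import numpy as np
from deepsig import aso

seed = 1234
np.random.seed(seed)

# Pretend these are the test scores of two models across N random seeds
N = 5
my_model_scores = np.random.normal(loc=0.9, scale=0.8, size=N)
baseline_scores = np.random.normal(loc=0.0, scale=1.0, size=N)

min_eps = aso(my_model_scores, baseline_scores, seed=seed)
# min_eps < 0.5 indicates the first model is better; report the exact
# value alongside your experimental results.
```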
In the following, we will lay out four scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction to statistical hypothesis testing, please refer to resources such as this blog post for a general overview or Dror et al. (2018) for an NLP-specific point of view.
We assume that we have two sets of scores we would like to compare, $S_A$ and $S_B$, for instance obtained by running two models $A$ and $B$ multiple times with a different random seed. We can then define a one-sided test statistic $\delta(S_A, S_B)$ based on the gathered observations. An example of such a test statistic is the difference in observation means:

$$\delta(S_A, S_B) = \frac{1}{|S_A|}\sum_{s \in S_A} s \; - \; \frac{1}{|S_B|}\sum_{s \in S_B} s$$

We then formulate the following null hypothesis:

$$H_0: \delta(S_A, S_B) \le 0$$
That means that we actually assume the opposite of our desired case, namely that $A$ is not better than $B$, but equally good or worse, as indicated by the value of the test statistic. Usually, the goal becomes to reject this null hypothesis using a statistical significance test (SST). p-value testing is a frequentist method in the realm of SSTs. It introduces the notion of data that could have been observed if we were to repeat our experiment again under the same conditions, which we write with a superscript $\text{rep}$ in order to distinguish it from our actually observed scores (Gelman et al., 2021). We then define the p-value as the probability that, under the null hypothesis, the test statistic computed on the replicated observations is larger than or equal to the observed test statistic:

$$p = P\big(\delta(S_A^\text{rep}, S_B^\text{rep}) \ge \delta(S_A, S_B) \mid H_0\big)$$
We can interpret this expression as follows: Assuming that $A$ is not better than $B$, the test assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does our observed test statistic $\delta(S_A, S_B)$ fit in here? This is what the p-value expresses: when the probability is high, $\delta(S_A, S_B)$ is in line with what we expected under the null hypothesis, so we cannot reject it; in other words, we *cannot* conclude $A$ to be better than $B$. If the probability is low, the observed $\delta(S_A, S_B)$ is quite unlikely under the null hypothesis and the reverse case is more likely, i.e. it is likely larger than $\delta(S_A^\text{rep}, S_B^\text{rep})$, and we conclude that $A$ is indeed better than $B$. Note that the p-value does not express whether the null hypothesis is true. To decide whether or not to reject the null hypothesis, we typically determine a threshold that the p-value has to fall below: the significance level $\alpha$, often set to 0.05. However, it has been argued that a better practice involves reporting the p-value alongside the results without pigeonholing them into significant and non-significant (Wasserstein et al., 2019).
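To make this definition concrete, here is a self-contained sketch of a one-sided permutation test that estimates the p-value by simulating replicated statistics under the null hypothesis (an illustration in plain NumPy, not the package's implementation):

```python
import numpy as np

def permutation_p_value(scores_a, scores_b, num_permutations=10_000, seed=0):
    """Estimate p = P(delta_rep >= delta | H0) for delta = mean(A) - mean(B)."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    delta = scores_a.mean() - scores_b.mean()  # observed test statistic
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # under H0, the model labels are exchangeable
        delta_rep = pooled[:n_a].mean() - pooled[n_a:].mean()
        hits += delta_rep >= delta
    return hits / num_permutations
```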
Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks
Deep neural networks are highly non-linear models whose performance depends heavily on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide whether model A is better than model B. In fact, even aggregating more statistics like standard deviation, minimum or maximum might not be enough to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) introduced Almost Stochastic Order (ASO), a test to compare two score distributions.
It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant by comparing their cumulative distribution functions:
Here, the CDF of A is given in red and that of B in green. If the CDF of A is lower than that of B for every $x$, we know that algorithm A scores higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of almost stochastic dominance by quantifying the extent to which stochastic order is being violated (red area):
ASO returns a value $\epsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. If $\epsilon_\text{min} < \tau$ (where $\tau$ is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, and the corresponding algorithm can be declared superior. We can also interpret $\epsilon_\text{min}$ as a confidence score: the lower it is, the more sure we can be that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis is formulated as

$$H_0: \epsilon_\text{min} \ge \tau$$
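For intuition, the violation ratio that ASO builds on can be sketched with empirical quantile functions as below. The actual test additionally uses bootstrapping to turn this estimate into the upper bound $\epsilon_\text{min}$, so this is a rough illustration rather than the package's implementation:

```python
import numpy as np

def violation_ratio(scores_a, scores_b, num_points=1000):
    """Rough sketch of the stochastic-order violation ratio (del Barrio et al., 2017)."""
    t = np.linspace(0.001, 0.999, num_points)
    q_a = np.quantile(scores_a, t)  # empirical quantile function (inverse CDF) of A
    q_b = np.quantile(scores_b, t)
    diff = q_b - q_a
    # squared 2-Wasserstein distance between the two empirical distributions
    w2_sq = np.mean(diff ** 2)
    # only regions where A fails to dominate B (q_a < q_b) count as violations
    violation = np.mean(np.clip(diff, 0.0, None) ** 2)
    # identical distributions: no dominance in either direction
    return violation / w2_sq if w2_sq > 0 else 0.5
```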
If we want to be more confident about the result of ASO, we can also set the rejection threshold $\tau$ to be lower than 0.5 (see the discussion in this section). Furthermore, the significance level $\alpha$ is determined as an input argument when running ASO and actively influences the resulting $\epsilon_\text{min}$.
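In code, setting both knobs might look as follows. The exact keyword for the significance level is an assumption here; consult the documentation of your installed version:

```python
import numpy as np
from deepsig import aso

scores_a = np.random.normal(0.6, 0.3, size=5)  # toy per-run scores for A
scores_b = np.random.normal(0.5, 0.3, size=5)  # toy per-run scores for B

# NOTE: the significance-level keyword (confidence_level) is assumed here;
# check the documentation for the exact argument name in your version.
min_eps = aso(scores_a, scores_b, confidence_level=0.95, seed=1234)

tau = 0.2  # stricter than the default decision threshold of 0.5
if min_eps < tau:
    print(f"A is almost stochastically dominant over B (eps_min={min_eps:.3f})")
```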