deep-significance

An open-source solution for significance testing of deep neural networks

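deep-significance provides fully tested significance testing functions, including the Almost Stochastic Order (ASO) test, the bootstrap test and the permutation-randomization test. These are combined with Bonferroni correction and sample size analysis, and are compatible with PyTorch, TensorFlow and NumPy data structures. The package supports comparisons across multiple models, across multiple datasets and at the sample level, helping users assess model performance accurately and avoid wrong conclusions caused by random factors.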

deep-significance: Easy and Better Significance Testing for Deep Neural Networks

License: GPL v3. Code style: black.

:interrobang: Why?

Although Deep Learning has undergone spectacular growth in the recent decade, a large portion of experimental evidence is not supported by statistical hypothesis tests. Instead, conclusions are often drawn based on single performance scores.

This is problematic: Neural networks display highly non-convex loss surfaces (Li et al., 2018), and their performance depends on the specific hyperparameters that were found and on stochastic factors like Dropout masks, which makes comparisons between architectures more difficult. Based on comparing only (the mean of) a few scores, we often cannot conclude that one model type or algorithm is better than another. This endangers progress in the field, as seeming success due to random chance might lead practitioners astray.

For instance, a recent study in Natural Language Processing by Narang et al. (2021) found that many modifications proposed to transformers do not actually improve performance. Similar issues are known to plague other fields as well, e.g. Reinforcement Learning (Henderson et al., 2018) and Computer Vision (Borji, 2017).

To help mitigate this problem, this package supplies fully-tested re-implementations of useful functions for significance testing:

  • Statistical significance tests such as Almost Stochastic Order (del Barrio et al., 2017; Dror et al., 2019), bootstrap (Efron & Tibshirani, 1994) and permutation-randomization (Noreen, 1989).
  • Bonferroni correction methods for multiplicity in datasets (Bonferroni, 1936).
  • Bootstrap power analysis (Yuan & Hayashi, 2003) and other functions to determine the right sample size.

All functions are fully tested and compatible with common deep learning data structures, such as PyTorch / TensorFlow tensors as well as NumPy and Jax arrays. For usage examples, consult the documentation, the scenarios in the Examples section, or the demo Jupyter notebook.
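To make the multiplicity correction from the list above more concrete, here is a minimal sketch of the classic Bonferroni adjustment in plain Python. It is not the package's own implementation; the function name `bonferroni_adjust` and the p-values are hypothetical and only serve as an illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    num_comparisons = len(p_values)
    return [min(1.0, p * num_comparisons) for p in p_values]


# Hypothetical p-values from comparing two models on three datasets.
p_values = [0.01, 0.04, 0.30]
adjusted = bonferroni_adjust(p_values)

# After the adjustment, only the first comparison stays below alpha.
alpha = 0.05
significant = [p < alpha for p in adjusted]
```

Without the correction, the second comparison (0.04) would also have fallen below the 0.05 threshold, which is exactly the kind of false positive the adjustment guards against when several datasets are tested at once.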

:inbox_tray: Installation

The package can simply be installed using pip by running

pip3 install deepsig

Another option is to clone the repository and install the package locally:

git clone https://github.com/Kaleidophon/deep-significance.git
cd deep-significance
pip3 install -e .

Warning: Installed like this, imports will fail when the cloned repository is moved.

:bookmark: Examples


tl;dr: Use aso() to compare scores for two models. If the returned eps_min < 0.5, A is better than B. The lower eps_min, the more confident the result (we recommend checking eps_min < 0.2 and recording eps_min alongside experimental results).

:warning: Testing models with only one set of hyperparameters and only one test set will never be able to guarantee superiority in all settings. See General Recommendations & other notes.


In the following, we will lay out three scenarios that describe common use cases for ML practitioners and how to apply the methods implemented in this package accordingly. For an introduction to statistical hypothesis testing, please refer to resources such as this blog post for a general overview or Dror et al. (2018) for an NLP-specific point of view.

We assume that we have two sets of scores we would like to compare, $\mathbb{S}_\mathbb{A}$ and $\mathbb{S}_\mathbb{B}$, for instance obtained by running two models $\mathbb{A}$ and $\mathbb{B}$ multiple times with different random seeds. We can then define a one-sided test statistic $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ based on the gathered observations. An example of such a test statistic is the difference in observation means. We then formulate the following null hypothesis:

$$H_0: \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) \le 0$$

That means that we actually assume the opposite of our desired case, namely that $\mathbb{A}$ is not better than $\mathbb{B}$, but equally as good or worse, as indicated by the value of the test statistic. Usually, the goal is to reject this null hypothesis using a statistical significance test (SST). p-value testing is a frequentist method in the realm of SST. It introduces the notion of data that could have been observed if we were to repeat our experiment under the same conditions, which we will write with the superscript $\text{rep}$ in order to distinguish them from our actually observed scores (Gelman et al., 2021). We then define the p-value as the probability that, under the null hypothesis, the test statistic computed on replicated observations is larger than or equal to the observed test statistic:

$$p\big(\delta(\mathbb{S}_\mathbb{A}^\text{rep}, \mathbb{S}_\mathbb{B}^\text{rep}) \ge \delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B}) \mid H_0\big)$$

We can interpret this expression as follows: Assuming that $\mathbb{A}$ is not better than $\mathbb{B}$, the test assumes a corresponding distribution of statistics that $\delta$ is drawn from. So how does the observed test statistic $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ fit in here? This is what the p-value expresses: When the probability is high, $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is in line with what we expected under the null hypothesis, so we cannot reject the null hypothesis. In other words, we cannot conclude $\mathbb{A}$ to be better than $\mathbb{B}$. If the probability is low, the observed $\delta(\mathbb{S}_\mathbb{A}, \mathbb{S}_\mathbb{B})$ is quite unlikely under the null hypothesis and the reverse case is more likely, i.e. that it is likely larger than $0$, and we conclude that $\mathbb{A}$ is indeed better than $\mathbb{B}$. Note that the p-value does not express whether the null hypothesis is true. To make our decision about whether or not to reject the null hypothesis, we typically determine a threshold, the significance level $\alpha$ (often set to 0.05), that the p-value has to fall below. However, it has been argued that a better practice is to report the p-value alongside the results without pigeonholing them into significant and non-significant (Wasserstein et al., 2019).
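The p-value described above can be approximated with a permutation-randomization test. The following is a minimal sketch in plain Python, using the difference in means as the test statistic. It is an illustration only and not the package's implementation; the function name `permutation_p_value`, its parameters and the example scores are assumptions made here. Note that shuffling the pooled scores simulates the boundary case of the null hypothesis, in which both models are equally good.

```python
import random
from statistics import mean


def permutation_p_value(scores_a, scores_b, num_permutations=2000, seed=0):
    """Estimate a one-sided p-value for mean(scores_a) > mean(scores_b)."""
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    num_a = len(scores_a)

    count = 0
    for _ in range(num_permutations):
        # Reassign scores to the two models at random ("replicated" data).
        rng.shuffle(pooled)
        replicated = mean(pooled[:num_a]) - mean(pooled[num_a:])
        if replicated >= observed:
            count += 1

    # Add-one smoothing so the estimate is never exactly zero.
    return (count + 1) / (num_permutations + 1)


# Hypothetical accuracy scores from five seeds per model.
scores_a = [0.90, 0.91, 0.92, 0.93, 0.94]
scores_b = [0.50, 0.51, 0.52, 0.53, 0.54]

p_clear = permutation_p_value(scores_a, scores_b)  # A clearly ahead: small p
p_same = permutation_p_value(scores_b, scores_b)   # identical scores: large p
```

With only five scores per model, the number of distinct reassignments is small, so the smallest reachable p-value is limited. This is one reason to collect more runs before drawing conclusions.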

Intermezzo: Almost Stochastic Order - a better significance test for Deep Neural Networks

Deep neural networks are highly non-linear models whose performance depends heavily on hyperparameters, random seeds and other (stochastic) factors. Therefore, comparing the means of two models across several runs might not be enough to decide whether a model A is better than B. In fact, even aggregating more statistics like standard deviation, minimum or maximum might not be enough to make a decision. For this reason, del Barrio et al. (2017) and Dror et al. (2019) introduced Almost Stochastic Order (ASO), a test to compare two score distributions.

It builds on the concept of stochastic order: We can compare two distributions and declare one as stochastically dominant by comparing their cumulative distribution functions:

In the original figure, the CDF of A is given in red and that of B in green. If the CDF of A is lower than that of B for every $x$, we know algorithm A to score higher. However, in practice these cases are rarely so clear-cut (imagine e.g. two normal distributions with the same mean but different variances). For this reason, del Barrio et al. (2017) and Dror et al. (2019) consider the notion of almost stochastic dominance by quantifying the extent to which stochastic order is being violated (shown as a red area in the original figure).

ASO returns a value $\varepsilon_\text{min}$, which expresses (an upper bound to) the amount of violation of stochastic order. If $\varepsilon_\text{min} < \tau$ (where $\tau$ is 0.5 or less), A is stochastically dominant over B in more cases than vice versa, and the corresponding algorithm can be declared superior. We can also interpret $\varepsilon_\text{min}$ as a confidence score: the lower it is, the more sure we can be that A is better than B. Note: ASO does not compute p-values. Instead, the null hypothesis is formulated as

$$H_0: \varepsilon_\text{min} \ge \tau$$

If we want to be more confident about the result of ASO, we can also set the rejection threshold to be lower than 0.5 (see the discussion in this section). Furthermore, the significance level $\alpha$ is determined as an input argument when running ASO and actively influences the resulting $\varepsilon_\text{min}$.
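To give some intuition for the "amount of violation of stochastic order", the sketch below computes a plain empirical violation ratio from two sets of scores by comparing their quantile functions. It is a simplified illustration under stated assumptions and not the package's aso() function: it omits the bootstrap step that ASO uses to obtain an upper bound at a given significance level, and the names `quantile` and `violation_ratio`, the grid size and the fallback value of 0.5 for identical samples are choices made here.

```python
import math


def quantile(sorted_scores, t):
    """Empirical quantile function: smallest score whose ECDF is >= t."""
    n = len(sorted_scores)
    index = min(n - 1, max(0, math.ceil(t * n) - 1))
    return sorted_scores[index]


def violation_ratio(scores_a, scores_b, num_steps=1000):
    """Share of the squared quantile gap where A falls below B.

    0.0 means A dominates B everywhere, 1.0 means B dominates A everywhere.
    """
    a, b = sorted(scores_a), sorted(scores_b)
    violation = 0.0
    total = 0.0
    for i in range(num_steps):
        t = (i + 0.5) / num_steps  # midpoint grid over (0, 1)
        diff = quantile(a, t) - quantile(b, t)
        total += diff ** 2
        if diff < 0:  # A scores lower than B at this quantile
            violation += diff ** 2
    if total == 0.0:
        return 0.5  # identical quantiles: no evidence either way (assumption)
    return violation / total


# Hypothetical scores: A is worse only in its lowest run.
scores_a = [0.4, 0.9, 1.0]
scores_b = [0.5, 0.6, 0.7]
ratio_ab = violation_ratio(scores_a, scores_b)
ratio_ba = violation_ratio(scores_b, scores_a)
```

In this example A falls below B only in its lowest third of runs and by a small margin, so the ratio for "A over B" is well below 0.5, while the ratio for the reverse direction is correspondingly close to 1.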
