Awesome-Data-Centric-AI
A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we selectively include papers that represent a range of distinct ideas. Contributions that further enrich and refine this list are welcome.
:loudspeaker: News: Please check out our open-sourced Large Time Series Model (LTSM)!
If you want to contribute to this list, please feel free to send a pull request. Also, you can contact daochen.zha@rice.edu.
- Survey paper: Data-centric Artificial Intelligence: A Survey
- Perspective paper (SDM 2023): Data-centric AI: Perspectives and Challenges
- Data-centric AI: Techniques and Future Perspectives (KDD 2023 Tutorial): [Website] [Slides] [Video] [Paper]
- Blogs:
- Chinese-language overview (中文解读):
- Graph Structure Learning (GSL) is a data-centric AI direction in graph neural networks (GNNs):
- Check our latest Knowledge Graphs (KGs) based paper search engine DiscoverPath: https://github.com/ynchuang/DiscoverPath
Want to discuss with others who are also interested in data-centric AI? There are three options:
- Join our Slack channel
- Join our QQ group (183116457). Password: datacentric
- Join the WeChat group below (if the QR code is expired, please add WeChat ID: zdcwhu with a note indicating that you want to join the Data-centric AI group)!
What is Data-centric AI?
Data-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.
Data-centric AI vs. Model-centric AI
In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.
It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
Why Data-centric AI?
Two motivating examples from GPT models highlight the central role of data in AI.
- First, large and high-quality training data drove the recent successes of GPT models, while the model architectures stayed largely similar, aside from having more parameters.
- Second, once the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model held fixed.
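The prompting point above can be sketched with a toy few-shot prompt builder. The task, template, and in-context examples here are purely illustrative assumptions, and the (fixed) model call itself is omitted:

```python
# Illustrative sketch: with the model fixed, we only vary the prompt
# (inference data). The sentiment task and examples are assumptions.

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from in-context examples and a query."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
# The model stays fixed; only this string (the inference data) changes.
prompt = build_few_shot_prompt(examples, "A delightful surprise.")
```

Swapping examples in and out of `examples` is exactly the kind of inference-data engineering the text describes: behavior changes without touching a single model weight.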
Another example is Segment Anything, a foundation model for computer vision. The core of training Segment Anything lies in the large amount of annotated data, containing more than 1 billion masks, 400 times more than any existing segmentation dataset.
What is the Data-centric AI Framework?
The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals.
- The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.
- The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
- The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.
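As a rough illustration of the three goals, the toy functions below sketch one tiny instance of each. The function names, data, and slice condition are assumptions for illustration, not part of the survey:

```python
# Toy sketch of the three goals of the data-centric AI framework.
# Every name and threshold below is an illustrative assumption.

def develop_training_data(raw_rows):
    """Training data development: clean by dropping rows with missing values."""
    return [r for r in raw_rows if None not in r.values()]

def develop_inference_data(test_rows, slice_fn):
    """Inference data development: build an evaluation slice that probes
    one specific model capability."""
    return [r for r in test_rows if slice_fn(r)]

def maintain_data(rows, expected_keys):
    """Data maintenance: continuously validate the schema over time."""
    return all(set(r) == set(expected_keys) for r in rows)

raw = [{"x": 1.0, "y": 0}, {"x": None, "y": 1}]
train = develop_training_data(raw)                      # cleaning
eval_slice = develop_inference_data(train, lambda r: r["x"] > 0.5)
healthy = maintain_data(train, {"x", "y"})              # quality check
```

Real systems replace each toy step with the techniques surveyed below (labeling, augmentation, slice discovery, continuous quality monitoring, and so on).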
Cite this Work
Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.
@article{zha2023data-centric-survey,
title={Data-centric Artificial Intelligence: A Survey},
author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
journal={arXiv preprint arXiv:2303.10158},
year={2023}
}
Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.
@inproceedings{zha2023data-centric-perspectives,
title={Data-centric AI: Perspectives and Challenges},
author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
booktitle={SDM},
year={2023}
}
Table of Contents
Training Data Development
Data Collection
- Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [Paper] [Code]
- Dataset discovery in data lakes, ICDE 2020 [Paper]
- Aurum: A data discovery system, ICDE 2018 [Paper] [Code]
- Table union search on open data, VLDB 2018 [Paper]
- Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [Paper]
- To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [Paper]
- Data curation at scale: the data tamer system, CIDR 2013 [Paper]
- Data integration: A theoretical perspective, PODS 2002 [Paper]
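Several of the discovery and join papers above rest on measuring value overlap between columns. A minimal sketch of that idea, using Jaccard similarity over column values (tables, column names, and the threshold are all invented for illustration):

```python
# Toy data discovery: columns with high value overlap are candidate join
# keys, loosely in the spirit of join/union search systems.

def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def joinable_columns(table_a, table_b, threshold=0.5):
    """Return (col_a, col_b) pairs whose value overlap exceeds threshold."""
    pairs = []
    for name_a, col_a in table_a.items():
        for name_b, col_b in table_b.items():
            if jaccard(col_a, col_b) >= threshold:
                pairs.append((name_a, name_b))
    return pairs

users = {"user_id": [1, 2, 3, 4], "age": [21, 34, 45, 29]}
orders = {"uid": [2, 3, 4, 5], "amount": [9, 5, 7, 3]}
joinable_columns(users, orders)  # -> [("user_id", "uid")]
```

Production systems such as Aurum scale this idea with sketches and indexes rather than exact set comparisons.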
Data Labeling
- Segment Anything [Paper] [Code]
- Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [Paper]
- Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [Paper] [Code]
- Training language models to follow instructions with human feedback, NeurIPS 2022 [Paper]
- Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [Paper] [Code]
- A survey of deep active learning, ACM Computing Surveys 2021 [Paper]
- Adaptive rule discovery for labeling text data, SIGMOD 2021 [Paper]
- Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [Paper]
- Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [Paper] [Code]
- Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [Paper] [Code]
- Graph-based semi-supervised learning: A review, Neurocomputing 2020 [Paper]
- Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [Paper]
- Rethinking pre-training and self-training, NeurIPS 2020 [Paper]
- Multi-label dataless text classification with topic modeling, KIS 2019 [Paper]
- Data programming: Creating large training sets, quickly, NeurIPS 2016 [Paper]
- Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [Paper]
- Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [Paper]
- Democratic co-learning, ICTAI 2004 [Paper]
- Active learning with statistical models, JAIR 1996 [Paper]
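Data programming (as in Snorkel, listed above) combines many noisy heuristic labeling functions into training labels. The sketch below is a deliberately simplified majority-vote version; real systems additionally model each function's accuracy, and the labeling functions here are invented examples:

```python
# Simplified data programming: heuristic labeling functions vote, and
# each unlabeled text gets the majority label. All heuristics are toys.
from collections import Counter

ABSTAIN = None

def lf_contains_great(text):
    """Heuristic: 'great' suggests a positive label."""
    return "pos" if "great" in text.lower() else ABSTAIN

def lf_contains_awful(text):
    """Heuristic: 'awful' suggests a negative label."""
    return "neg" if "awful" in text.lower() else ABSTAIN

def majority_label(text, lfs):
    """Aggregate non-abstaining votes by simple majority."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_contains_great, lf_contains_awful]
majority_label("A great movie", lfs)  # -> "pos"
```

The labels produced this way are noisy, which is why frameworks like Snorkel learn per-function accuracies instead of weighting every vote equally.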
Data Preparation
- DataFix: Adversarial Learning for Feature Shift Detection and Correction, NeurIPS 2023 [Paper] [Code]
- OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
- TSFEL: Time series feature extraction library, SoftwareX 2020 [Paper] [Code]
- Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [Paper] [Code]
- Introduction to Scikit-learn, Book 2019 [Paper] [Code]
- Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [Paper]
- Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [Paper]
- Time series classification from scratch with deep neural networks: A strong baseline, IJCNN 2017 [Paper]
- Missing data imputation: focusing on single imputation, ATM 2016 [Paper]
- Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [Paper]
- Data normalization and standardization: a technical report, MLTR 2014 [Paper]
- CrowdER: crowdsourcing entity resolution, VLDB 2012 [Paper]
- Imputation of Missing Data Using Machine Learning Techniques, KDD 1996
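Two preparation steps recurring in the list above, missing-value imputation and normalization, can be sketched in pure Python. This is a hedged illustration; a real pipeline would typically use scikit-learn's `SimpleImputer` and `StandardScaler`:

```python
# Toy preparation pipeline: single mean imputation, then z-score
# standardization. The input column is an invented example.
import math

def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Z-score: subtract the mean, divide by the population std dev."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

col = [1.0, None, 3.0]
filled = mean_impute(col)   # -> [1.0, 2.0, 3.0]
scaled = standardize(filled)
```

Mean imputation is the simplest of the single-imputation strategies surveyed above; model-based imputation replaces the column mean with a learned predictor.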