Awesome-Data-Centric-AI
A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we selectively include papers that represent a range of distinct ideas. Contributions that further enrich and refine this list are welcome.
:loudspeaker: News: Please check out our open-sourced Large Time Series Model (LTSM)!
If you want to contribute to this list, please feel free to send a pull request. Also, you can contact daochen.zha@rice.edu.
- Survey paper: Data-centric Artificial Intelligence: A Survey
- Perspective paper (SDM 2023): Data-centric AI: Perspectives and Challenges
- Data-centric AI: Techniques and Future Perspectives (KDD 2023 Tutorial): [Website] [Slides] [Video] [Paper]
- Blogs:
- Chinese-language overview (中文解读):
- Graph Structure Learning (GSL) is a data-centric AI direction in graph neural networks (GNNs):
- Check our latest Knowledge Graphs (KGs) based paper search engine DiscoverPath: https://github.com/ynchuang/DiscoverPath
Want to discuss with others who are also interested in data-centric AI? There are three options:
- Join our Slack channel
- Join our QQ group (183116457). Password: datacentric
- Join the WeChat group below (if the QR code is expired, please add WeChat ID: zdcwhu with a note indicating that you want to join the Data-centric AI group)!
What is Data-centric AI?
Data-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.
Data-centric AI vs. Model-centric AI
In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.
It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
Why Data-centric AI?
Two motivating examples from GPT models highlight the central role of data in AI.
- First, large and high-quality training data drove the recent successes of GPT models, while the model architectures stayed largely similar, aside from having more parameters.
- Second, once the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model held fixed.
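The prompting point above can be sketched with a toy few-shot prompt builder. The task, template, and in-context examples here are purely illustrative assumptions, and the (fixed) model call itself is omitted:

```python
# Illustrative sketch: with the model fixed, we only vary the prompt
# (inference data). The sentiment task and examples are assumptions.

def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from in-context examples and a query."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
# The model stays fixed; only this string (the inference data) changes.
prompt = build_few_shot_prompt(examples, "A delightful surprise.")
```

Swapping examples in and out of `examples` is exactly the kind of inference-data engineering the text describes: behavior changes without touching a single model weight.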
Another example is Segment Anything, a foundation model for computer vision. The core of training Segment Anything lies in the large amount of annotated data, containing more than 1 billion masks, 400 times more than any existing segmentation dataset.
What is the Data-centric AI Framework?
The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals.
- The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.
- The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
- The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.
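As a rough illustration of the three goals, the toy functions below sketch one tiny instance of each. The function names, data, and slice condition are assumptions for illustration, not part of the survey:

```python
# Toy sketch of the three goals of the data-centric AI framework.
# Every name and threshold below is an illustrative assumption.

def develop_training_data(raw_rows):
    """Training data development: clean by dropping rows with missing values."""
    return [r for r in raw_rows if None not in r.values()]

def develop_inference_data(test_rows, slice_fn):
    """Inference data development: build an evaluation slice that probes
    one specific model capability."""
    return [r for r in test_rows if slice_fn(r)]

def maintain_data(rows, expected_keys):
    """Data maintenance: continuously validate the schema over time."""
    return all(set(r) == set(expected_keys) for r in rows)

raw = [{"x": 1.0, "y": 0}, {"x": None, "y": 1}]
train = develop_training_data(raw)                      # cleaning
eval_slice = develop_inference_data(train, lambda r: r["x"] > 0.5)
healthy = maintain_data(train, {"x", "y"})              # quality check
```

Real systems replace each toy step with the techniques surveyed below (labeling, augmentation, slice discovery, continuous quality monitoring, and so on).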
Cite this Work
Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.
@article{zha2023data-centric-survey,
title={Data-centric Artificial Intelligence: A Survey},
author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
journal={arXiv preprint arXiv:2303.10158},
year={2023}
}
Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.
@inproceedings{zha2023data-centric-perspectives,
title={Data-centric AI: Perspectives and Challenges},
author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
booktitle={SDM},
year={2023}
}
Table of Contents
Training Data Development
Data Collection
- Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [Paper] [Code]
- Dataset discovery in data lakes, ICDE 2020 [Paper]
- Aurum: A data discovery system, ICDE 2018 [Paper] [Code]
- Table union search on open data, VLDB 2018 [Paper]
- Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [Paper]
- To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [Paper]
- Data curation at scale: the data tamer system, CIDR 2013 [Paper]
- Data integration: A theoretical perspective, PODS 2002 [Paper]
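Several of the discovery and join papers above rest on measuring value overlap between columns. A minimal sketch of that idea, using Jaccard similarity over column values (tables, column names, and the threshold are all invented for illustration):

```python
# Toy data discovery: columns with high value overlap are candidate join
# keys, loosely in the spirit of join/union search systems.

def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def joinable_columns(table_a, table_b, threshold=0.5):
    """Return (col_a, col_b) pairs whose value overlap exceeds threshold."""
    pairs = []
    for name_a, col_a in table_a.items():
        for name_b, col_b in table_b.items():
            if jaccard(col_a, col_b) >= threshold:
                pairs.append((name_a, name_b))
    return pairs

users = {"user_id": [1, 2, 3, 4], "age": [21, 34, 45, 29]}
orders = {"uid": [2, 3, 4, 5], "amount": [9, 5, 7, 3]}
joinable_columns(users, orders)  # -> [("user_id", "uid")]
```

Production systems such as Aurum scale this idea with sketches and indexes rather than exact set comparisons.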
Data Labeling
- Segment Anything [Paper] [Code]
- Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [Paper]
- Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [Paper] [Code]
- Training language models to follow instructions with human feedback, NeurIPS 2022 [Paper]
- Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [Paper] [Code]
- A survey of deep active learning, ACM Computing Surveys 2021 [Paper]
- Adaptive rule discovery for labeling text data, SIGMOD 2021 [Paper]
- Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [Paper]
- Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [Paper] [Code]
- Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [Paper] [Code]
- Graph-based semi-supervised learning: A review, Neurocomputing 2020 [Paper]
- Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [Paper]
- Rethinking pre-training and self-training, NeurIPS 2020 [Paper]
- Multi-label dataless text classification with topic modeling, KIS 2019 [Paper]
- Data programming: Creating large training sets, quickly, NeurIPS 2016 [Paper]
- Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [Paper]
- Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [Paper]
- Democratic co-learning, ICTAI 2004 [Paper]
- Active learning with statistical models, JAIR 1996 [Paper]
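Data programming (as in Snorkel, listed above) combines many noisy heuristic labeling functions into training labels. The sketch below is a deliberately simplified majority-vote version; real systems additionally model each function's accuracy, and the labeling functions here are invented examples:

```python
# Simplified data programming: heuristic labeling functions vote, and
# each unlabeled text gets the majority label. All heuristics are toys.
from collections import Counter

ABSTAIN = None

def lf_contains_great(text):
    """Heuristic: 'great' suggests a positive label."""
    return "pos" if "great" in text.lower() else ABSTAIN

def lf_contains_awful(text):
    """Heuristic: 'awful' suggests a negative label."""
    return "neg" if "awful" in text.lower() else ABSTAIN

def majority_label(text, lfs):
    """Aggregate non-abstaining votes by simple majority."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_contains_great, lf_contains_awful]
majority_label("A great movie", lfs)  # -> "pos"
```

The labels produced this way are noisy, which is why frameworks like Snorkel learn per-function accuracies instead of weighting every vote equally.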
Data Preparation
- DataFix: Adversarial Learning for Feature Shift Detection and Correction, NeurIPS 2023 [Paper] [Code]
- OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
- TSFEL: Time series feature extraction library, SoftwareX 2020 [Paper] [Code]
- Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [Paper] [Code]
- Introduction to Scikit-learn, Book 2019 [Paper] [Code]
- Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [Paper]
- Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [Paper]
- Time series classification from scratch with deep neural networks: A strong baseline, IJCNN 2017 [Paper]
- Missing data imputation: focusing on single imputation, ATM 2016 [Paper]
- Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [Paper]
- Data normalization and standardization: a technical report, MLTR 2014 [Paper]
- CrowdER: crowdsourcing entity resolution, VLDB 2012 [Paper]
- Imputation of Missing Data Using Machine Learning Techniques, KDD 1996
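Two preparation steps recurring in the list above, missing-value imputation and normalization, can be sketched in pure Python. This is a hedged illustration; a real pipeline would typically use scikit-learn's `SimpleImputer` and `StandardScaler`:

```python
# Toy preparation pipeline: single mean imputation, then z-score
# standardization. The input column is an invented example.
import math

def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Z-score: subtract the mean, divide by the population std dev."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

col = [1.0, None, 3.0]
filled = mean_impute(col)   # -> [1.0, 2.0, 3.0]
scaled = standardize(filled)
```

Mean imputation is the simplest of the single-imputation strategies surveyed above; model-based imputation replaces the column mean with a learned predictor.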