Brief Summary
Training data is the fuel of modern artificial intelligence (AI), fundamentally shaping the capabilities, limitations, and biases of AI systems. The emergence of large-scale generative models has elevated the importance of understanding how data influences their behaviors, bringing the field of data attribution to the forefront. This survey provides a comprehensive overview of data attribution, covering its methods, applications, and evaluation protocols, with a particular emphasis on the challenges and opportunities arising in the era of generative AI. We start by introducing a conceptual framework for attribution centered on three core questions: what to attribute (model behaviors), attribute to what (training entities), and how to attribute (influence measures). Within this framework, we systematically review major attribution approaches, including those based on influence functions, weighted marginal contributions, training dynamics, and simulators. We then examine key applications of data attribution, such as data selection, fact tracing, adversarial attacks and defenses, and the emerging data economy. Finally, we critically assess common evaluation criteria, including the quality of counterfactual predictions, utility in downstream tasks, and computational efficiency. We conclude with a forward-looking perspective on the future of data attribution, highlighting key open challenges and promising directions for future research.
Citation
@article{deng2025survey,
  title   = {A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI},
  author  = {Deng, Junwei and Hu, Yuzheng and Hu, Pingbang and Li, Ting-Wei and Liu, Shixuan and Wang, Jiachen T. and Ley, Dan and Dai, Qirun and
             Huang, Benhao and Huang, Jin and Jiao, Cathy and Just, Hoang Anh and Pan, Yijun and Shen, Jingyan and Tu, Yiwen and Wang, Weiyi and
             Wang, Xinhe and Zhang, Shichang and Zhang, Shiyuan and Jia, Ruoxi and Lakkaraju, Himabindu and Peng, Hao and Tang, Weijing and
             Xiong, Chenyan and Zhao, Jieyu and Tong, Hanghang and Zhao, Han and Ma, Jiaqi W.},
  year    = {2025},
  journal = {SSRN},
  note    = {Available at SSRN: \url{https://ssrn.com/abstract=5451054}},
  doi     = {10.2139/ssrn.5451054}
}