At #TPDL2023 right now, @martinklein is presenting “It's Not Just GitHub: Identifying Data and Software Sources Included in Publications”
The authors trained a classifier that labels open-access data and software (OADS) URLs extracted from research papers as either dataset or code. Archivists can then use these URLs to preserve the referenced datasets and code for reproducibility.
Paper: https://doi.org/10.1007/978-3-031-43849-3_17
Preprint: https://arxiv.org/abs/2307.14469