Making Research Software FAIR: How SoFAIR Tackles the Reproducibility Challenge

Reproducibility is one of the most significant challenges of contemporary science. A landmark survey revealed that more than 70% of researchers had failed to reproduce another scientist’s experiments, and more than 50% had been unable to reproduce their own (Baker, 2016).

“Single occurrences that cannot be reproduced are of no significance to science.” (Popper, 1935)

One of the less visible, but increasingly critical, factors contributing to this problem is the limited visibility and availability of research software. Software now underpins almost every stage of the research lifecycle, from data collection and analysis to simulation, modelling, and visualisation. Yet despite its centrality, research software often remains hidden within manuscripts, mentioned fleetingly or omitted altogether. Without proper identification and registration, software is rarely linked back to the publications that introduced or used it, leaving its role in scientific discovery under-recognised and its reusability limited.

The unavailability of software has been identified as the sixth most significant reason for irreproducibility (Baker, 2016). Unlike articles and datasets, software has too often failed to meet the FAIR principles of being Findable, Accessible, Interoperable, and Reusable, the very principles that enable others to build on previous work.

This is precisely where the SoFAIR project positions itself. At its core lies a deceptively simple research problem: if software is to be treated as a first-class citizen of the scholarly ecosystem, it must be made visible. That means moving beyond casual references in footnotes and supplementary materials, towards systems of persistent identification and robust registration. Software without a traceable identity becomes effectively lost to the scholarly record. SoFAIR integrates software more deliberately into scholarly communication, linking it not only to publications but also to datasets and other research objects in a way that reflects its true role in the production of knowledge.

Testing Software Detection Models

In the early stages of SoFAIR, the annotation of a fully multi-disciplinary corpus had not yet been completed. To move forward, the team conducted experiments using existing cross-disciplinary data. The aim was to develop models capable of detecting software mentions across any research domain. For these experiments, a subset of the SoMeSci dataset was used, consisting of over 47,000 sentences drawn from 1,367 documents and containing more than 3,700 unique software mentions (Schindler et al., 2021). A separate set of 100 full-text documents sourced from CORE was used to evaluate the models.

To assess how input granularity influenced model performance, the texts were segmented in three ways during pre-processing: by sentence, by paragraph, or as a complete document. Two pipelines were tested: a simple pipeline that analysed text chunks directly, and an iterative pipeline that applied multiple rounds of keyword-based and semantic filtering to discover and verify software mentions.
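As an illustration of what this pre-processing might look like, the sketch below shows the three segmentation granularities and the two pipeline shapes in Python. It is a minimal approximation, not the project's actual code: the keyword list, the `detect` callable, and the naive sentence splitter are all placeholder assumptions.

```python
# Minimal sketch of segmentation granularities and the two pipeline shapes.
# The sentence splitter and keyword list are deliberately naive placeholders.
import re


def segment(text: str, granularity: str = "sentence") -> list[str]:
    """Split a document into chunks at the requested granularity."""
    if granularity == "document":
        return [text]
    if granularity == "paragraph":
        # Paragraphs are assumed to be separated by blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if granularity == "sentence":
        # Naive split on terminal punctuation followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown granularity: {granularity}")


# Illustrative keyword filter only; the real iterative pipeline also applies
# semantic filtering over several rounds.
SOFTWARE_KEYWORDS = {"software", "package", "toolkit", "library", "version"}


def simple_pipeline(chunks: list[str], detect) -> list[str]:
    """Run the detector over every chunk directly."""
    return [m for chunk in chunks for m in detect(chunk)]


def iterative_pipeline(chunks: list[str], detect) -> list[str]:
    """Filter chunks by keywords first, then verify candidates with the detector."""
    candidates = [c for c in chunks if any(k in c.lower() for k in SOFTWARE_KEYWORDS)]
    return [m for chunk in candidates for m in detect(chunk)]
```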

The experiments revealed that sentence-level segmentation consistently produced the best results across models. Among the models tested, Llama3:70b achieved the highest overall performance, comparable to previous state-of-the-art models even without fine-tuning. This demonstrates the potential of large language models in extracting software mentions accurately.
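The following sketch shows one way sentence-level extraction with a locally served Llama 3 model could be wired up, here using Ollama's REST endpoint. The prompt, the JSON output format, and the `detect_software` function are illustrative assumptions rather than the prompts used in the SoFAIR experiments.

```python
# Hedged sketch: ask a locally served Llama 3 model (via Ollama) to extract
# software mentions from a single sentence. Prompt and parsing are illustrative.
import json

import requests

PROMPT = (
    "List every software mentioned in the following sentence as a JSON array "
    "of strings. Return [] if there is none.\n\nSentence: {sentence}"
)


def detect_software(sentence: str, model: str = "llama3:70b") -> list[str]:
    """Return the software names the model finds in one sentence."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT.format(sentence=sentence), "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["response"]
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        # The model did not return valid JSON; treat as no mentions found.
        return []


if __name__ == "__main__":
    print(detect_software("All statistics were computed with SPSS version 25."))
```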

However, computational costs remain significant. Processing all newly published articles would require substantial resources, highlighting the importance of scalable solutions for practical adoption in open scholarly infrastructures. Strategies to achieve this might include trading off a small drop in performance for faster processing speeds, deploying distributed computing, or a combination of both.
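As a rough sketch of the distributed-computing option, the snippet below fans the per-document work out across worker processes. The `process_document` function is a hypothetical placeholder, and the sketch assumes each worker can reach its own model instance or inference endpoint; in practice the model itself remains the bottleneck.

```python
# Rough sketch of spreading per-document extraction across worker processes.
from concurrent.futures import ProcessPoolExecutor


def process_document(doc_id: str) -> list[str]:
    """Placeholder for the full per-document pipeline: fetch text, segment, detect."""
    return []


def process_corpus(doc_ids: list[str], workers: int = 8) -> dict[str, list[str]]:
    """Process many documents in parallel and collect mentions per document."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(doc_ids, pool.map(process_document, doc_ids)))
```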

Looking ahead, SoFAIR aims to embed its outputs into open infrastructures like CORE, ensuring that software is discoverable as part of the broader research ecosystem. By integrating persistent identifiers for software alongside articles and datasets, SoFAIR contributes to a more interconnected and transparent scholarly record that supports reproducibility at scale.

The implications are significant. Science advances cumulatively, with each new insight relying on the reliability of what came before. When software is invisible, both transparency and accountability suffer, and researchers are denied the chance to reuse, adapt, or build upon essential tools. At the same time, software creators often go unrecognised for their contributions, their work folded into publications without proper attribution. SoFAIR addresses both of these challenges, reframing research software as something to be discovered, cited, and reused rather than forgotten in the background.

The reproducibility crisis will not be solved overnight. But making research software FAIR and fully integrated into the scholarly record is a concrete step toward more trustworthy, transparent, and cumulative science.

References

  • Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533, 452–454. https://doi.org/10.1038/533452a
  • Popper, K. (1935). The Logic of Scientific Discovery. London: Routledge.
  • Schindler, D., Bensmann, F., Dietze, S., & Krüger, F. (2021). SoMeSci: A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ’21). ACM.
