Title
SOMD: Software Mention Detection in Scholarly Publications
Abstract
Data-driven scientific processes rely heavily on software to collect and prepare data and to generate insights through automated analysis. Tracking the provenance of software artifacts is therefore becoming essential for transparency and reproducibility. In addition, aggregated observations of software citations can help measure software usage and impact in the long run. While the referencing of scientific articles follows well-established patterns, citation practices for code bases and software programs are less coherent. We therefore invite participants in our shared task to develop robust supervised information extraction models that detect and disambiguate software mentions and their relevant metadata in scholarly publications. The task uses SoMeSci (Software Mentions in Science), a knowledge graph of software mentions (Schindler et al., 2022). As a novelty of this task, SoMeSci will be extended to include more publications in the fields of Artificial Intelligence (AI) and Computer Science.
Subtasks
- Subtask 1: Software Mention Detection
- Subtask 2: Additional Information Detection
- Subtask 3: Relation Extraction
- Subtask 4: Disambiguation
- Subtask 5: End to End
Datasets
SoMeSci is a knowledge graph of software mentions comprising 399,942 triples to date. It describes 3,756 software mentions, including type information and extensive metadata, from 1,367 PubMed Central articles. The dataset will be expanded to include Computer Science publications following the SoMeSci schema.
Metrics
We will evaluate method performance using traditional IR metrics (precision, recall, and F1) on specific subtasks, such as 1) detection of software mentions and types, 2) detection of related attributes (e.g., version, developer), and 3) disambiguation of detected mentions. The final set of tasks has yet to be announced; details for a more exhaustive set of tasks and affiliated baselines can be found here.
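To illustrate how precision, recall, and F1 could be computed for mention detection, the following is a minimal sketch and not the official scorer. It assumes exact-match scoring, where a predicted mention counts as correct only if its document, character offsets, and type all match a gold mention; the matching criterion actually used in the task may differ. The function name, the tuple layout, and the example labels and offsets are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical mention representation: (doc_id, start_offset, end_offset, type_label).
Mention = Tuple[str, int, int, str]


def precision_recall_f1(
    gold: Iterable[Mention], predicted: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Compute exact-match precision, recall, and F1 over two sets of mentions."""
    gold_set = set(gold)
    pred_set = set(predicted)

    # A true positive is a predicted mention identical to a gold mention.
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only; the offsets and labels are made up for this sketch.
gold_mentions = [
    ("doc1", 10, 14, "Application"),
    ("doc1", 40, 46, "Application"),
    ("doc2", 5, 11, "PlugIn"),
]
predicted_mentions = [
    ("doc1", 10, 14, "Application"),  # exact match with a gold mention
    ("doc2", 5, 11, "Application"),   # correct span, wrong type: counted as an error
]

p, r, f = precision_recall_f1(gold_mentions, predicted_mentions)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Under this strict criterion a prediction with the right span but the wrong type counts as both a false positive and a false negative, which is why type errors lower precision and recall at the same time.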
Contact Persons
- Frank Krüger (HS Wismar)
- Stefan Dietze (GESIS)
- Saurav Karmakar (GESIS)
- Danilo Dessì (GESIS)
- Jennifer D’Souza (TIB)
References
- David Schindler, Felix Bensmann, Stefan Dietze, Frank Krüger. “The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central”. PeerJ Computer Science. 2022. https://doi.org/10.7717/peerj-cs.835
- David Schindler, Felix Bensmann, Stefan Dietze, Frank Krüger. “SoMeSci—A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles”. Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021). 2021. https://doi.org/10.1145/3459637.3482017