EDINA has begun its role of co-ordinating an international team of information scientists working on a two-year study to investigate how web links in scientific and other academic articles might fail to lead to the resources being referenced.
This is the focus of the Hiberlink project, in which a team from the University of Edinburgh and the Los Alamos National Laboratory (LANL) will assess the extent of ‘reference rot’, using a vast corpus of online scholarly work.
Increasingly, web-based scholarship includes links that point at resources needed or created in research activity, including software, datasets, websites, presentations, blogs and videos, as well as scientific workflows and ontologies. Unlike traditional scholarly articles, these referenced resources often evolve over time. The ‘reference rot’ problem occurs whenever the original version of a linked resource is no longer available.
The problem has two aspects. First, the web link that references a resource may no longer work. Second, the content at the end of the link may have evolved and may even have become dramatically different from what was originally referenced. As a result, when a researcher later revisits an online scholarly work and double-checks its references, whether to confirm evidence, to establish context, to inform policy and decision making or for any other practical purpose, the original information referenced on websites or in online databases may have changed or even ceased to exist, hampering or preventing the research process.
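The two aspects can be told apart mechanically. The following is a minimal sketch in Python, not the project's method: the function name `classify_reference`, its parameters and the three labels are hypothetical, and it assumes that a copy of the text as originally referenced is available for comparison.

```python
import hashlib
from typing import Optional


def classify_reference(
    status_code: Optional[int],
    original_text: str,
    current_text: Optional[str],
) -> str:
    """Label one reference as 'broken link', 'changed content' or 'intact'.

    All names here are illustrative assumptions, not part of Hiberlink.
    status_code is the HTTP status from fetching the link today
    (None if no response was received at all).
    """
    # First aspect: the link itself no longer works.
    if status_code is None or status_code >= 400 or current_text is None:
        return "broken link"

    # Second aspect: the link works, but the content has changed.
    original_digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    current_digest = hashlib.sha256(current_text.encode("utf-8")).hexdigest()
    if original_digest != current_digest:
        return "changed content"

    return "intact"
```

An exact digest comparison is deliberately strict: it treats even a trivial change, such as an updated footer date, as changed content. Judging whether a change is significant would need a more tolerant comparison.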
The ultimate goal of the Hiberlink project is to identify practical solutions to the ‘reference rot’ problem. Technical work at EDINA will start in June to develop approaches that can be integrated easily into the publication process. The project intends to work with academic publishers and other web-based publication venues to ensure more effective preservation of web-based resources so as to increase the prospect of continued access for future generations of researchers, students and their teachers.
Hiberlink builds directly upon a pilot study from LANL, powered by their Memento ‘Time Travel for the Web’ technology, which confirmed that as many as 30% of the web links in a selection of 400,000 arXiv.org papers did not function, and that 65% of the remaining links referred to a resource that was not archived, and hence in danger of disappearing without a trace.
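Because the 65% figure applies only to the links that still worked, the two percentages cannot simply be added together. A short worked calculation in Python, taking the upper figure of 30% at face value, shows how the pilot numbers combine. The function and variable names are illustrative only.

```python
from fractions import Fraction


def link_breakdown(broken_share: Fraction, unarchived_share_of_working: Fraction) -> dict:
    """Split all links into three non-overlapping shares.

    broken_share: share of all links that did not function.
    unarchived_share_of_working: share of the *working* links that
    pointed at a resource with no archived copy.
    """
    working = 1 - broken_share
    working_unarchived = working * unarchived_share_of_working
    working_archived = working - working_unarchived
    return {
        "broken": broken_share,
        "working_unarchived": working_unarchived,
        "working_archived": working_archived,
    }


# Figures reported for the pilot study ("as many as 30%", then 65% of the rest).
shares = link_breakdown(Fraction(30, 100), Fraction(65, 100))
for label, share in shares.items():
    print(f"{label}: {float(share):.1%}")
```

On these figures, 45.5% of all links worked but pointed at unarchived resources, and only 24.5% both worked and pointed at an archived resource.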
Using text mining and information extraction tools developed by the Language Technology Group (LTG) at the University of Edinburgh School of Informatics, the project will examine a large body of scholarly publications to assess which links still work as intended and which web content has been successfully archived and therefore preserved for use by future researchers and students.
Related EDINA activity to ensure continuing access to scholarly content will benefit from knowledge generated by the Hiberlink project. The Keepers Registry, with an international perspective, is monitoring the extent of e-journal archiving. Nationally, the UK LOCKSS Alliance has a cooperative focus on helping HE academic libraries ensure sustainable continuing access to scholarly work. The Hiberlink project will involve these initiatives as part of engaging relevant stakeholders, sharing experience, and considering best practice to stabilise continuing access to referenced resources.