Another year has passed, bringing plenty of good news, investigations and developments for CORE. Today we would like to tell you about one of them: Open Access (OA) Helper, an application for iOS devices developed by Claus Wolf. We asked Claus to tell us how he came up with OA Helper, and here is what he answered.
When, where and why did you decide to develop the OA Helper app?
In October 2018, I learned how open access discovery services connect users to legal Open Access copies of otherwise paywalled articles. None of the existing plugins supported Safari, my preferred browser, so I decided to try creating one.
We are happy to announce the release of CORE Reader, which provides a seamless experience for users wishing to read papers hosted by CORE. In this post, we provide an overview of what is new, and we encourage you to follow this development, as further functionality for the reader is on our roadmap.
This project began with the observation that most open access services do not yet provide a rich user experience for reading research papers. Determined to change this, we originally looked at whether CORE could render research papers as HTML, as has recently become trendy across publisher platforms. While such rendering remains one of our ultimate goals, we realised that it could only be achieved for a small fraction of the documents in CORE; more specifically, those that the data provider offers in machine-readable formats, such as LaTeX or JATS XML. While we want to encourage more repositories to support such formats (and this remains a Plan S recommendation), we wanted to improve the reading experience for all of our users across all of our content.
CORE Discovery helps users find freely accessible copies of research papers that might be behind a paywall on the publisher's website. It is backed by our huge dataset of millions of full-text open access papers, as well as content from widely used external services beyond CORE. The tool not only provides state-of-the-art coverage of freely available content, it is also the only discovery service which:
- delivers state-of-the-art performance compared to other discovery tools in terms of both content coverage (finding a freely available copy when it is available) and precision (reliably delivering a free copy of the paper on success);
- is run by researchers for researchers (as opposed to companies);
- has unrivalled coverage of content from the global network of open repositories;
- can offer readers other relevant freely accessible research papers even when a free version of the requested paper is not available anywhere on the web.
To satisfy the needs of CORE users, the world's largest aggregator of open access research papers now helps users access the articles they are interested in. Discovery tools can typically find free copies of papers for only about 15%-30% of published documents (slide 11), which means that in more than 70% of cases they bring the user nothing useful. What distinguishes CORE Discovery from other discovery services on the market is that it does not stop when an open access version is unavailable: it always aims to offer related open access articles to the end user, even in situations where other discovery tools come up empty.
Nick has worked in scholarly communications for over 10 years, currently as Open Research Advisor at the University of Leeds. Previously he was Research Services Advisor at Leeds Beckett University. Nick is interested in the effective dissemination of research through sustainable models of open access, including underlying data, and in potential synergies with open education and Open Educational Resources (OER), particularly in technology, software and the interoperability of systems.
Q: What does Open Access mean to you?
A: We live in the age of information, where the world's knowledge should be immediately and easily accessible to the majority of humanity. Instead, much primary research is restricted to those who can afford it, whether to read it under traditional subscription models or, under an APC-based model, to publish at all. Meanwhile, fake news propagates freely, with potentially disastrous consequences for our democracy, our ecology and global equality. Sustainable and affordable open access to research is essential for a well-informed global population, and the first step to building a better society.
With equity as the theme of this year's Open Access Week, we will be exploring issues of equality, including gender imbalance within the academy and how our University's research can better benefit the Global South. Early plans include a gender analysis of Leeds research outputs and a Wikimedia editathon focussing on women scientists and encouraging researchers of all genders to properly cite open access research on Wikipedia.
Gloria is a lecturer in the Centre for Language and Communication Studies, Institute of Lifelong Learning and Development Studies at Chinhoyi University of Technology in Zimbabwe. She has special responsibility for coordinating the Information Literacy Skills component of the Communication Skills module. She is a former Library Director at the same institution.
Q: What does Open Access mean to you?
A: Open access means unlimited access to research materials and tools to publicise research and make it more visible to a wider audience. Open access means knowing the research others are carrying out and making one’s research known to others.
In the past, the Library was at the forefront of celebrating OA Week. The celebrations entailed seminars with speakers from the Library and other experts from around the country.
George is an Institutional Repository Co-ordinator at the University of Strathclyde. His interests and expertise are in structured open data, especially within repositories and semantic web contexts, information retrieval, distributed digital repositories and human-computer interaction.
Q: What does Open Access mean to you?
A: Aside from the usual reasons why Open Access is important, I like to remember that Open Access is about resource discovery. It is about cracking open the sum total of human knowledge in a way that machines can understand and, by extension, providing it in a way which enables users to find scholarly content more easily and, of course, in an unrestricted way.
International Open Access Week is fast approaching but, to be honest, we don't tend to make plans for Open Access Week, because at Strathclyde every week is Open Access Week! I think there might be quite a few UK institutions that operate in a similar way. In the UK we are fortunate that there is a powerful regulatory aspect to the REF2021 Open Access Policy, which ensures researchers take better notice of the open science agenda.
David is the Open Access Officer at Brunel University London, based within the Scholarly Communication & Rights Management team. He is an advocate of OA publishing, and of building services that realise the movement's aims within local institutional communities. David has spoken about this topic at UKSG, NASIG, RLUK and Altmetric conferences in recent years. He is an ambassador for the CORE service.
Q: What does Open Access mean to you?
A: To us at Brunel, Open Access means many things – ideologically and practically. Most importantly, we consider Open Access to research output a critical, underpinning component on the journey toward an ‘Open Science’ world. Open Science encompasses many areas, aiming to enhance scientific and educational sectors.
As with many institutions, at Brunel we operate local OA services for our community within an ever-growing landscape of technological and policy drivers. Open Access means creating an environment that supports policy drivers, whilst taking advantage of new technologies for our community as they emerge.
Much progress is being driven by these factors. However, it is just as important to foster discussion and leadership among research communities. Open Access means researchers and students shaping and leading their subjects into new forms of science communication and practice.
At Brunel our role in supporting Open Access is to:
– Engage and inform our community about these issues as they evolve
– Build and tailor services to our community’s needs
– Recognise and celebrate ‘open’ activity by our researchers in all its forms
Milica has been a librarian at the Institute of Technical Sciences of the Serbian Academy of Sciences and Arts since 2007. Her educational background is in art history, and her previous work experience includes heritage policies and documentation standards, heritage-related civil society projects and digitisation, traditional librarianship and bibliography. Currently, her professional interests focus on Open Science, library services that support research activities, training on academic services and tools, information literacy and research ethics. Since November 2014, she has been serving as the EIFL Open Access country coordinator in Serbia. In this capacity, she designed and coordinated the project 'Revisiting open access journal policies and practices in Serbia', implemented with EIFL's support in 2016–2017. She has also been involved with institutional repositories since 2013, when her institution implemented the first fully functional institutional repository in Serbia. She is now a member of the Repository Development Team at the University of Belgrade Computer Centre, currently the leading force in repository development in Serbia.
I am currently working with Martin Klein, Matteo Cancellieri and Herbert Van de Sompel on a project funded by the European Open Science Cloud Pilot that aims to test and benchmark ResourceSync against OAI-PMH in a range of scenarios. The objective is to perform a quantitative evaluation that could then be used as evidence to convince data providers to adopt ResourceSync. During this work, we encountered a problem related to the scalability of ResourceSync and developed a solution to it in the form of an On Demand Resource Dump. The aim of this blog post is to explain the problem, how we arrived at the solution and how the solution works.
One of the scenarios we have been exploring deals with a situation where the resources to be synchronised are small metadata files (typically from a few bytes to several kilobytes). As it happens, this scenario is very common for metadata in repositories of academic manuscripts, research data (e.g. descriptions of images), cultural heritage, etc.
The problem stems from the fact that while most OAI-PMH implementations typically deliver 100-1,000 records per HTTP request, ResourceSync is designed in a way that requires resolving each resource individually. We have identified, and confirmed by testing, that for repositories with large numbers of metadata items this can have a very significant impact on harvesting performance, as the overhead of an HTTP request is considerable compared to the size of a metadata record.
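To make the problem concrete, here is a minimal sketch (for illustration only, not our benchmarking code) of what a baseline sequential ResourceSync client has to do in this scenario: parse a Resource List and then resolve every resource with its own HTTP GET.

```python
# Minimal sketch of a baseline sequential ResourceSync client; an
# illustration of the per-resource overhead, not our benchmarking code.
import requests
from xml.etree import ElementTree

# Resource Lists use the sitemap XML format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def harvest_sequentially(resource_list_url):
    """Download every resource in a Resource List, one GET per resource."""
    resource_list = requests.get(resource_list_url)
    root = ElementTree.fromstring(resource_list.content)
    records = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.find(SITEMAP_NS + "loc").text
        # One full HTTP round trip per record: for metadata records of a
        # few kilobytes, the request overhead dominates the transfer.
        records.append(requests.get(loc).content)
    return records
```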
More specifically, we ran tests over a sample of 357 repositories. The results show that while the speed of OAI-PMH harvesting ranges from 30 to 520 metadata records per second, depending largely on the repository platform, the speed of ResourceSync harvesting of the same content is only around 4 metadata records per second, using existing ResourceSync client/server implementations and a sequential downloading strategy. To put this in perspective, harvesting a repository of one million records at 4 records per second would take roughly 70 hours, compared with between about half an hour and nine hours over OAI-PMH. We are preparing a paper on this, so I am not going to disclose the exact details of the analysis at this stage.
As ResourceSync was created to overcome many of the problems of OAI-PMH, such as:
- being too flexible in terms of support for incremental harvesting, resulting in inconsistent implementations of this feature across data providers,
- some of its implementations being unstable and less suitable for exchanging large quantities of metadata and
- being designed only for metadata transfer, omitting much-needed support for content exchange,
it is important that ResourceSync performs well under all common scenarios, including the one we are dealing with.
Can Resource Dumps be the solution?
An obvious option for solving the problem, already offered by ResourceSync, is Resource Dumps. While a Resource Dump can speed up harvesting to levels far exceeding those of OAI-PMH, it introduces considerable extra complexity on the server side. The key problem is that the data must be periodically packaged as a Resource Dump, which basically means running a batch process to produce a compressed (zip) file containing the resources.
The number of Resource Dumps a source needs to maintain is equal to the number of Capability Lists it maintains times the size of the Resource Dump Index. The minimum practical operational size of a Resource Dump Index is 2, to ensure that a dump currently being downloaded by a client is not removed while a new dump is being created. As we have observed that a typical repository may contain about 250 OAI-PMH sets (Capability Lists in ResourceSync terminology), a source choosing to use Resource Dumps as part of the harvesting process would need to maintain 250 × 2 = 500 dumps, implying significant data duplication and a requirement to create Resource Dumps periodically.
On Demand Resource Dumps
To deal with the problem, we suggest an extension of ResourceSync to support the concept of an On Demand Resource Dump. An On Demand Resource Dump is a Resource Dump which is created, as the name suggests, whenever a client asks for it. More specifically, a client can scan through the list of resources presented in a Resource List or a Change List (without resolving them individually) and ask the source to package any subset of those resources as a Resource Dump. This approach speeds up harvesting and saves processing on both the source and the client side. Our initial tests show that it enables ResourceSync to perform as well as OAI-PMH in the metadata-only harvesting scenario when requests are sent sequentially (the most extreme scenario for ResourceSync). Moreover, because ResourceSync requests can be parallelised, whereas OAI-PMH requests cannot (each response must be received before the next request can be made, as it carries the resumption token), this makes ResourceSync a clear winner.
In the rest of this post, I will explain how this works and how it could be integrated with the ResourceSync specification.
There are essentially three steps:
- declaring that the server supports an On Demand Resource Dump,
- sending a POST request to the On Demand dump endpoint and
- receiving a response from the server that fully conforms to the Resource Dump specification.
I will first introduce steps 2 and 3 and then I will come back to step 1.
Step 2: sending a POST request to the On Demand dump endpoint
We have defined an endpoint at https://core.ac.uk/datadump. You can POST it a list of resource identifiers (which can be discovered in a Resource List). In the example below, I am using curl to send it a JSON list of resource identifiers that I want resolved. The approach is not limited to JSON resources; it can be used for any resource listed in a Resource List, regardless of its type. Try it by executing the code below in your terminal.
curl -d '["https://core.ac.uk/api-v2/articles/get/42138752","https://core.ac.uk/api-v2/articles/get/32050"]' -H "Content-Type: application/json" https://core.ac.uk/datadump -X POST > on-demand-resource-dump.zip
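For clients that want to drive the endpoint programmatically, the sketch below shows one way this might look, using the same endpoint and resource identifiers as the curl example; the batch size of 100 is an arbitrary illustration, not part of any specification.

```python
# Sketch of a client using the On Demand Resource Dump endpoint shown
# above; a minimal illustration, not a reference implementation.
import requests

ENDPOINT = "https://core.ac.uk/datadump"  # endpoint introduced above

def fetch_dumps(resource_urls, out_prefix, batch_size=100):
    """POST batches of resource identifiers and save each returned zip.

    The batch size is an arbitrary choice for illustration.
    """
    for i in range(0, len(resource_urls), batch_size):
        batch = resource_urls[i:i + batch_size]
        # requests serialises the list as a JSON body and sets the
        # Content-Type: application/json header for us.
        response = requests.post(ENDPOINT, json=batch, timeout=120)
        response.raise_for_status()
        # Each response is a zip conforming to the Resource Dump
        # specification.
        with open(f"{out_prefix}-{i // batch_size}.zip", "wb") as f:
            f.write(response.content)

fetch_dumps(
    ["https://core.ac.uk/api-v2/articles/get/42138752",
     "https://core.ac.uk/api-v2/articles/get/32050"],
    "on-demand-resource-dump",
)
```

Because each batch is an independent request, a client is free to issue several of them concurrently (for instance with a thread pool), which is exactly the parallelisation advantage over OAI-PMH's resumption token discussed above.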