The team continues to work on improving CORE. This was a highly productive period for CORE in terms of growing and developing our products.
You can find more about:
- Improving access to the CORE Repository Dashboard
- Improving data provider registration
- The CORE team takes part in an international event on processing research papers
- CORE Discovery and repositories
- CORE’s work on product improvement
This blog post was originally posted at the arXiv blog.
arXiv readers now have a faster way to find articles relevant to their interests. From an article abstract page, readers can simply activate the CORE Recommender to find additional open access research on similar topics.
The Recommender, part of the arXivLabs toolset, was developed by CORE, a global aggregator of open access scientific content, which provides access to millions of full texts. CORE’s mission is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. In this way, CORE facilitates free unrestricted access to research for all.
CORE is happy to announce the release of a new version of the CORE Repository Dashboard. The update will be of particular interest to UK repositories as we are releasing with it a new tool to support REF2021 open access compliance assessment. The tool was developed for repository managers and research administrators to improve the harvesting of their repository outputs and ensure their content is visible to the world. Full details here.
During the last quarter of 2019 CORE released new updates for two of its services, CORE Reader and CORE Discovery, and initiated new collaborations, for example with LA Referencia. In addition, CORE was recognised for its contributions, receiving an award for outstanding impact of research on society and prosperity. Details about these and more news can be found below.
CORE won an Outstanding Impact of Research on Society and Prosperity Award
During the Research Excellence Awards 2019 Ceremony, dedicated to the 50th anniversary of The Open University, CORE was presented with the “Outstanding Impact of Research on Society and Prosperity” award. This important award reflects the clear value CORE represents to its millions of users as the largest repository of Open Access scientific knowledge.
LA Referencia, an aggregator of research papers from Latin America, is collaborating with CORE, a scholarly communications infrastructure that provides access to the world’s largest collection of open access research publications, acquired from a global network of repositories and journals.
CORE Recommender is now integrated within LA Referencia, allowing users to discover similar articles from across a network of thousands of open access data providers. CORE Recommender acts as a gateway to millions of open access research papers, suggesting relevant articles where the full text is guaranteed to be openly available. Moreover, the recommender delivers only free-to-read materials, i.e. materials that can be accessed without hitting a paywall.
CORE releases CORE Discovery in Mozilla and Opera browsers
CORE Discovery, a browser extension that offers one-click access to free copies of research papers whenever you might hit a paywall, is now published in the Mozilla and Opera stores. The plug-in was originally released as a Google Chrome extension.
CORE presents its full texts growth and introduces eduTDM at Open Science Fair 2019
CORE was active at the Open Science Fair 2019, an international event for all topics related to Open Science. CORE presented two posters at this event: a general poster about the CORE service, which updated the community on full-text growth and the wide usage of CORE services, and a second one about eduTDM.
CORE releases CORE Discovery tool
Our free CORE Discovery service provides you with:
- Highest coverage of freely available content. Our tests have shown CORE Discovery finding more free content than any other discovery system.
- Free service for researchers by researchers. CORE Discovery is the only free content discovery extension developed by researchers for researchers. There is no major publisher or enterprise controlling and profiting from your usage data.
- Best grip on open repository content. Due to CORE being a leader in harvesting open access literature, CORE Discovery has the best grip on open content from open repositories as opposed to other services that disproportionately focus only on content indexed in major commercial databases.
- Repository integration and discovering documents without a DOI. The only service offering seamless and free integration into repositories. CORE Discovery is also the only discovery system that can locate scientific content even for items with an unknown DOI or which do not have a DOI.
The tool is available as:
- A browser extension for researchers and anyone interested in reading scientific documents
- A plugin for repositories, enriching metadata-only pages in repositories with links to freely available copies of the paper
- An API for developers and third-party services
If you are interested in the CORE Discovery plugin do get in touch.
CORE receives Vannevar Bush Best Paper Award
The CORE team has also won the Vannevar Bush Best Paper Award at JCDL 2019, one of the most highly recognised digital libraries conferences in the world, for our work on analysing how soon authors deposit into repositories, which was driven by CORE data. A blog post about this is already available.
More specifically, over the last 3 months CORE had more than 25 million users, tripling our usage compared to 2017.
I am currently working with Martin Klein, Matteo Cancellieri and Herbert Van de Sompel on a project funded by the European Open Science Cloud Pilot that aims to test and benchmark ResourceSync against OAI-PMH in a range of scenarios. The objective is to perform a quantitative evaluation that could then be used as evidence to convince data providers to adopt ResourceSync. During this work, we have encountered a problem related to the scalability of ResourceSync and developed a solution to it in the form of an On Demand Resource Dump. The aim of this blog post is to explain the problem, how we arrived at the solution and how the solution works.
One of the scenarios we have been exploring deals with a situation where the resources to be synchronised are metadata files of a small data size (typically from a few bytes to several kilobytes). Coincidentally, this scenario is very common for metadata in repositories of academic manuscripts, research data (e.g. descriptions of images), cultural heritage, etc.
The problem is that while most OAI-PMH implementations deliver 100–1,000 records per HTTP request, ResourceSync is designed in a way that requires resolving each resource individually. We have identified, and confirmed by testing, that for repositories with large numbers of metadata items this can have a very significant impact on harvesting performance, as the overhead of the HTTP request is considerable compared to the size of the metadata record.
More specifically, we have run tests over a sample of 357 repositories. The results show that while the speed of OAI-PMH harvesting ranges from 30–520 metadata records per second, depending largely on the repository platform, the speed of ResourceSync harvesting is only around 4 metadata records per second for the same content, using existing ResourceSync client/server implementations and a sequential downloading strategy. We are preparing a paper on this, so I am not going to disclose the exact details of the analysis at this stage.
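To put these rates in perspective, here is a back-of-envelope calculation using the measured figures above; the one-million-record repository size is an illustrative assumption, not a number from our tests:

```python
# Back-of-envelope harvest times, using the throughput figures from the post.
# The one-million-record repository size is an illustrative assumption.

def harvest_hours(num_records, records_per_second):
    """Hours needed to harvest `num_records` at a given throughput."""
    return num_records / records_per_second / 3600

records = 1_000_000
print(f"OAI-PMH (fast platform):  {harvest_hours(records, 520):6.1f} h")  # ~0.5 h
print(f"OAI-PMH (slow platform):  {harvest_hours(records, 30):6.1f} h")   # ~9.3 h
print(f"ResourceSync, sequential: {harvest_hours(records, 4):6.1f} h")    # ~69.4 h
```

Even at OAI-PMH's slowest observed rate, a full harvest finishes overnight; at 4 records per second, the same harvest takes the better part of three days.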
As ResourceSync has been created to overcome many of the problems of OAI-PMH, such as:
- being too flexible in terms of support for incremental harvesting, resulting in inconsistent implementations of this feature across data providers,
- some of its implementations being unstable and less suitable for exchanging large quantities of metadata and
- being only designed for metadata transfer, omitting the much needed support for content exchange
it is important that ResourceSync performs well under all common scenarios, including the one we are dealing with.
Can Resource Dumps be the solution?
An obvious option for solving the problem, already offered by ResourceSync, is the Resource Dump. While a Resource Dump can speed up harvesting to levels far exceeding those of OAI-PMH, it creates considerable extra complexity on the side of the server. The key problem is that it creates the necessity to periodically package the data as a Resource Dump, which basically means running a batch process to produce a compressed (zip) file containing the resources.
The number of Resource Dumps a source needs to maintain is equal to the number of Capability Lists it maintains times the size of the Resource Dump Index. The minimum practical operational size of a Resource Dump Index is 2. This is to ensure we don’t remove a dump currently being downloaded by a client during the creation of a new dump. As we have observed that a typical repository may contain about 250 OAI-PMH sets (Capability Lists in the ResourceSync terminology), this implies significant data duplication and a requirement for periodic creation of Resource Dumps if a source chooses to use Resource Dumps as part of the harvesting process.
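The arithmetic behind this duplication concern can be made explicit; this is a minimal sketch using only the figures quoted above:

```python
# Number of Resource Dumps a source must maintain:
# one per Capability List, times the size of the Resource Dump Index.
capability_lists = 250   # typical repository (~250 OAI-PMH sets, per the post)
dump_index_size = 2      # minimum practical size, so an in-use dump survives rotation
dumps_to_maintain = capability_lists * dump_index_size
print(dumps_to_maintain)  # 500 dumps' worth of duplicated, periodically refreshed data
```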
On Demand Resource Dumps
To deal with the problem, we suggest an extension of ResourceSync that supports the concept of an On Demand Resource Dump. An On Demand Resource Dump is a Resource Dump which is created, as the name suggests, whenever a client asks for it. More specifically, a client can scan through the list of resources presented in a Resource List or a Change List (without resolving them individually) and ask the source to package any set of those resources as a Resource Dump. This approach speeds up harvesting and saves processing on the side of both the source and the client. Our initial tests show that this enables ResourceSync to perform as well as OAI-PMH in the metadata-only harvesting scenario when requests are sent sequentially (the most extreme scenario for ResourceSync). Moreover, since ResourceSync requests can be parallelised, unlike OAI-PMH requests (which rely on the resumption token), this makes ResourceSync a clear winner.
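The client-side workflow can be sketched as follows. All names here are hypothetical, and the batch size and degree of parallelism are illustrative assumptions: the client gathers identifiers from a Resource List, groups them into batches, and requests one On Demand Resource Dump per batch; because there is no resumption token, the dump requests can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def make_batches(resource_ids, batch_size):
    """Split the identifiers from a Resource List into dump-sized batches."""
    return [resource_ids[i:i + batch_size]
            for i in range(0, len(resource_ids), batch_size)]

def fetch_dump(batch):
    """Placeholder: POST `batch` to the dump endpoint and return the zip bytes."""
    ...

ids = [f"https://example.org/resource/{n}" for n in range(10)]
batches = make_batches(ids, 4)   # 3 batches: 4 + 4 + 2 identifiers
# Unlike OAI-PMH paging, these requests are independent of each other:
# with ThreadPoolExecutor(max_workers=4) as pool:
#     dumps = list(pool.map(fetch_dump, batches))
```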
In the rest of this post, I will explain how this works and how it could be integrated with the ResourceSync specification.
There are basically 3 steps:
- defining that the server supports an on-demand Resource Dump,
- sending a POST request to the on-demand dump endpoint and
- receiving a response from the server that fully conforms to the Resource Dump specification.
I will first introduce steps 2 and 3 and then I will come back to step 1.
Step 2: sending a POST request to the On Demand dump endpoint
We have defined an endpoint at https://core.ac.uk/datadump. You can POST it a list of resource identifiers (which can be discovered in a Resource List). In the example below, I am using curl to send it a list of resource identifiers in JSON which I want to get resolved. Obviously, the approach is not limited to JSON; it can be used for any resource listed in a Resource List regardless of its type. Try it by executing the code below in your terminal.
curl -d '["https://core.ac.uk/api-v2/articles/get/42138752","https://core.ac.uk/api-v2/articles/get/32050"]' -H "Content-Type: application/json" https://core.ac.uk/datadump -X POST > on-demand-resource-dump.zip
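The same request can be sketched in Python, assuming only the behaviour described above (POST a JSON array of identifiers, receive a zip conforming to the Resource Dump specification); `build_dump_request` is a hypothetical helper, not part of any CORE client library:

```python
import json
from urllib import request

# Hypothetical sketch of an On Demand Resource Dump client.
DUMP_ENDPOINT = "https://core.ac.uk/datadump"

def build_dump_request(resource_ids):
    """Build (but do not send) the POST request for an on-demand dump."""
    payload = json.dumps(resource_ids).encode("utf-8")
    return request.Request(
        DUMP_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dump_request([
    "https://core.ac.uk/api-v2/articles/get/42138752",
    "https://core.ac.uk/api-v2/articles/get/32050",
])
# To actually fetch the dump:
# with request.urlopen(req) as resp:
#     open("on-demand-resource-dump.zip", "wb").write(resp.read())
```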