CORE to become an independent Open Access service from August 2023

Jisc and The Open University have had a long-standing relationship delivering CORE (core.ac.uk) for over 10 years. During this time, the service has grown from a project into an important and widely used piece of Open Research infrastructure.

The current Jisc–OU contract for delivering CORE to the open scholarly community expires in July 2023. From then on, CORE will be operated by The Open University and will no longer receive direct funding from Jisc. The Open University is grateful to Jisc for its support of CORE over the last ten years.

Increasing the Speed of Harvesting with On Demand Resource Dumps

 

I am currently working with Martin Klein, Matteo Cancellieri and Herbert Van de Sompel on a project funded by the European Open Science Cloud Pilot that aims to test and benchmark ResourceSync against OAI-PMH in a range of scenarios. The objective is to perform a quantitative evaluation that could then be used as evidence to convince data providers to adopt ResourceSync. During this work, we have encountered a problem related to the scalability of ResourceSync and developed a solution to it in the form of an On Demand Resource Dump. The aim of this blog post is to explain the problem, how we arrived at the solution and how the solution works.

The problem

One of the scenarios we have been exploring deals with a situation where the resources to be synchronised are metadata files of small size (typically from a few bytes to several kilobytes). This scenario is very common for metadata in repositories of academic manuscripts, research data (e.g. descriptions of images), cultural heritage, etc.

The problem stems from the fact that while most OAI-PMH implementations deliver 100–1,000 metadata records per HTTP request, ResourceSync is designed in a way that requires resolving each resource individually. We have identified, and confirmed by testing, that for repositories with large numbers of metadata items this has a very significant impact on harvesting performance, as the overhead of an HTTP request is considerable compared to the size of a metadata record.

More specifically, we have run tests over a sample of 357 repositories. The results show that while the speed of OAI-PMH harvesting ranges from 30 to 520 metadata records per second, depending largely on the repository platform, harvesting the same content with existing ResourceSync client/server implementations and a sequential downloading strategy achieves only around 4 metadata records per second. We are preparing a paper on this, so I am not going to disclose the exact details of the analysis at this stage.
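To make the cost of per-resource resolution concrete, here is a minimal back-of-envelope model in Python. The round-trip and transfer times are illustrative assumptions, not our measured figures; the point is only that request overhead dominates when every small record needs its own HTTP request.

# Illustrative model only; the timing constants are assumptions, not measurements.
RECORDS = 100_000               # metadata records to harvest
ROUND_TRIP_S = 0.05             # assumed HTTP request/response overhead
TRANSFER_S_PER_RECORD = 0.001   # assumed transfer time for one small record

def harvest_time(records, records_per_request):
    # total time = one round trip per request + transfer time for every record
    requests_needed = -(-records // records_per_request)  # ceiling division
    return requests_needed * ROUND_TRIP_S + records * TRANSFER_S_PER_RECORD

print("Batched (OAI-PMH style, 500 per request):", harvest_time(RECORDS, 500), "seconds")
print("One record per request (ResourceSync):", harvest_time(RECORDS, 1), "seconds")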

As ResourceSync was created to overcome many of the problems of OAI-PMH, such as:

  • being too flexible in its support for incremental harvesting, resulting in inconsistent implementations of this feature across data providers,
  • some of its implementations being unstable and less suitable for exchanging large quantities of metadata, and
  • being designed only for metadata transfer, omitting much-needed support for content exchange,

it is important that ResourceSync performs well under all common scenarios, including the one we are dealing with.

Can Resource Dumps be the solution?

An obvious option for solving the problem, already offered by ResourceSync, is the Resource Dump. While a Resource Dump can speed up harvesting to levels far exceeding those of OAI-PMH, it adds considerable extra complexity on the side of the server. The key problem is that the data must be periodically packaged as a Resource Dump, which basically means running a batch process to produce a compressed (zip) file containing the resources.

The number of Resource Dumps a source needs to maintain equals the number of Capability Lists it maintains multiplied by the size of the Resource Dump Index. The minimum practical operational size of a Resource Dump Index is 2, to ensure that a dump currently being downloaded by a client is not removed while a new dump is being created. As we have observed that a typical repository may contain about 250 OAI-PMH sets (Capability Lists in ResourceSync terminology), this implies significant data duplication (250 Capability Lists × an index of 2 = 500 dumps to maintain) and a requirement to periodically recreate Resource Dumps if a source chooses to use them as part of the harvesting process.

On Demand Resource Dumps

To deal with the problem, we suggest an extension of ResourceSync that supports the concept of an On Demand Resource Dump. An On Demand Resource Dump is a Resource Dump which is created, as the name suggests, whenever a client asks for it. More specifically, a client can scan through the list of resources presented in a Resource List or a Change List (without resolving them individually) and ask the source to package any subset of those resources as a Resource Dump. This approach speeds up harvesting and saves processing on the side of both the source and the client. Our initial tests show that it enables ResourceSync to perform as well as OAI-PMH in the metadata-only harvesting scenario when requests are sent sequentially (the most extreme scenario for ResourceSync). However, because ResourceSync requests can be parallelised, whereas OAI-PMH requests cannot (due to OAI-PMH's reliance on the resumption token), this makes ResourceSync a clear winner.

In the rest of this post, I will explain how this works and how it could be integrated with the ResourceSync specification.

There are basically 3 steps:

  1. defining that the server supports an on-demand Resource Dump,
  2. sending a POST request to the on-demand dump endpoint and
  3. receiving a response from the server that 100% conforms to the Resource Dump specification.

I will first introduce steps 2 and 3 and then I will come back to step 1.

Step 2: sending a POST request to the On Demand dump endpoint

We have defined an endpoint at https://core.ac.uk/datadump . You can POST it a list of resource identifiers (which can be discovered in a Resource List). In the example below, I am using curl to send a list of resource identifiers, in JSON, that I want to get resolved. The approach is not limited to JSON; it can be used for any resource listed in a Resource List, regardless of its type. Try it by executing the command below in your terminal.

curl -d '["https://core.ac.uk/api-v2/articles/get/42138752","https://core.ac.uk/api-v2/articles/get/32050"]' -H "Content-Type: application/json" https://core.ac.uk/datadump -X POST > on-demand-resource-dump.zip
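The same request can of course be issued from a script. Below is a minimal Python sketch (an illustration mirroring the curl example above, not an official client) that POSTs a list of resource identifiers taken from a Resource List and stores the returned zip:

import requests

# Resource identifiers as they appear in the curl example above
resources = [
    "https://core.ac.uk/api-v2/articles/get/42138752",
    "https://core.ac.uk/api-v2/articles/get/32050",
]

# POST the JSON list of identifiers to the on-demand dump endpoint
response = requests.post("https://core.ac.uk/datadump", json=resources)
response.raise_for_status()

# The response body is a zip file conforming to the Resource Dump specification
with open("on-demand-resource-dump.zip", "wb") as f:
    f.write(response.content)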

CORE releases a new API version

We are very proud to announce the release of CORE API 2.0. The new API offers developers new opportunities to make use of the CORE open access aggregator in their applications.

The main new features are:

  • Support for looking up articles by a global identifier (DOI, OAI, arXiv, etc.) instead of just the CORE ID.
  • Access to new resource types, repositories and journals, with API methods organised by resource type.
  • Access to the original metadata exactly as it was harvested from the repository of origin.
  • Retrieval of the history of changes to the metadata as harvested by CORE.
  • Retrieval of citations extracted by CORE from the full text.
  • Support for batch requests for searching, recommending, accessing full texts, harvesting history, etc. (see the example request below).
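As a quick illustration, the sketch below retrieves a single article record from the v2 API using the articles/get endpoint that also appears in the curl example earlier. The apiKey parameter name and the need for a registered key are assumptions on my part; please check the API documentation for the authoritative details.

import requests

API_BASE = "https://core.ac.uk/api-v2"
API_KEY = "YOUR_API_KEY"   # assumption: requests are authenticated with a registered key
ARTICLE_ID = "42138752"    # CORE ID taken from the earlier example

# Fetch a single article record; the apiKey query parameter is an assumption,
# consult the official API documentation for the exact parameter names.
resp = requests.get(f"{API_BASE}/articles/get/{ARTICLE_ID}", params={"apiKey": API_KEY})
resp.raise_for_status()
print(resp.json())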

CORE among the top 10 search engines for research that go beyond Google

Using search engines effectively is now a key skill for researchers, but could more be done to equip young researchers with the tools they need? Here, Dr Neil Jacobs and Rachel Bruce from JISC's digital infrastructure team shared their top ten resources for researchers from across the web. CORE was placed among the top 10 search engines that go beyond Google.

More information is available on the JISC website.

Related content recommendation for EPrints

We have released the first version of a content recommendation package for EPrints, available via the EPrints Bazaar ( http://bazaar.eprints.org/ ). The functionality is offered through CORE and can be seen, for example, in Open Research Online EPrints ( http://oro.open.ac.uk/36256/ ) or on the European Library portal ( http://www.theeuropeanlibrary.org/tel4/record/2000004374192?query=data+mining ). I was wondering if any EPrints repository manager would be interested in getting in touch to test this in their repository. As the package is available via the EPrints Bazaar, the installation requires just a few clicks. We would be grateful for any suggestions for improvements and also for information on how this could be effectively provided to DSpace and Fedora repositories.

Final blog post

The main idea of this blog post is to provide a summary of the CORE outputs produced over the last 9 months and to report the lessons learned.

Outputs

The outputs can be divided into (a) technical, (b) content and service and (c) dissemination outputs.

(a) Technical outputs

According to our project management software, we have to date resolved 214 issues. Each issue corresponds to a new function or a fixed bug. In this section we will describe the new features and improvements we have developed. The technology on which the system is built has been described in our previous blog post.

Among the main outputs achieved during the project are:

  • Improved metadata and content harvesting to allow more efficient parallel processing.
  • The addition of new text mining tasks, including:
    • language detection
    • concept extraction
    • citation extraction
    • text classification
    • de-duplication (to be released soon)

Technical Approach

In the last six months, CORE has made a huge step forward in terms of its technology. According to our project management software, we have to date resolved 214 issues. Each issue corresponds to a new function or a fixed bug.

The idea of this blog post is to provide an overview of the technologies and standards CORE is using and to report on our experience with them during the development of CORE over the last months. We will provide more information about the new features and enhancements in the following blog posts.

Technologies

Tomcat Web server – CORE has used Tomcat as an application container since its start; however, relatively recently the CORE frontend has been deployed as a Tomcat cluster. This means that the application is deployed on multiple machines (currently just 2), with a load balancer redirecting web traffic to any of these servers. The advantage of this solution is not only performance, but also the reliability of the service. For example, it is now possible for us to redeploy the application while the CORE service is still running. At the same time, the architecture is prepared for future growth. So far, our experience with this solution has been generally positive.

Apache Nutch – We have adopted Apache Nutch in order to obey the information in the robots.txt file. Apache Nutch makes the implementation very simple and we have had a very positive experience with it.

SVM Light – Support Vector Machine classifiers (in particular SVM multiclass) have been used in CORE to perform a pilot set of text classification experiments on research papers. While the tool is extremely simple to set up and great to work with, it does not allow building models from a very large number of examples. Although we couldn't use all the examples we have, the tool was still good enough for carrying out the experiments. We are now looking at how to improve the scalability of the training phase to make use of a larger number of examples. We think that tools such as Apache Mahout might be able to provide the answer.
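For readers who want to try a simple version of this kind of pilot, here is a minimal linear-SVM text classification sketch. It uses scikit-learn rather than the SVM multiclass tool mentioned above, and the toy documents and labels are invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (invented for illustration); in practice the documents
# would be full texts or abstracts and the labels subject categories.
docs = [
    "support vector machines for text categorisation",
    "open access repositories and metadata harvesting",
    "protein folding and molecular dynamics simulation",
]
labels = ["computer science", "library science", "biology"]

# TF-IDF features feeding a linear SVM, a common baseline for text classification
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["harvesting metadata from open repositories"]))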

Google Charts – Google Charts have been used for the graphs in the Repository Analytics: very nice interactive graphs, and easy to set up.

Logback – used to improve logging in CORE and also to provide the logs in the Repository Analytics tool.

Apache Lucene – has been used previously and has proved to be a great tool – very fast and scalable.

Language detection software – The issue of language detection has become more important to resolve as the content in the CORE aggregation system has grown. Particularly with the aggregation of content from the Directory of Open Access Journals, it became important to distinguish publications in different languages. We originally tried to approach this problem using AlchemyAPI. AlchemyAPI offers its API for free for a low number (several thousand) of queries per day, which can be extended up to 30k queries for non-commercial academic providers. We asked AlchemyAPI for this extension, but learned that they require you to acknowledge AlchemyAPI in every publication about your project (even those that talk about completely different aspects of the system). Therefore, we have decided to use the Language Detection Library for Java available on Google Code. We are very happy with this decision.
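For illustration, the snippet below shows the same kind of per-document language detection using the langdetect package, a Python port of the Java library mentioned above (the sample texts are invented):

from langdetect import DetectorFactory, detect

# Make the detection deterministic across runs
DetectorFactory.seed = 0

# Invented sample metadata/abstract snippets
texts = [
    "A study of metadata aggregation across open access repositories.",
    "Une étude de l'agrégation des métadonnées dans les archives ouvertes.",
]

for text in texts:
    # detect() returns an ISO 639-1 language code such as 'en' or 'fr'
    print(detect(text), "-", text)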

ParsCit (citation extraction) – We have used ParsCit, an open-source CRF-based reference string and logical document structure parsing package, for the detection of basic metadata and citation parsing from full text. ParsCit provides reasonable performance and accuracy and we are quite happy with it.

AlchemyAPI – We have used this for language detection in the past and also for concept extraction. Overall, we feel this solution is not suitable for us due to its licensing restrictions and the lack of transparency of the extraction methods, and we are moving to a different solution.

Blekko – Blekko is a search engine which offers an API that allows you to query the Web free of charge at a maximum frequency of 1 query per second. This is fabulous in comparison to other search engines such as Google, Yahoo or Bing, which either severely restrict the use of their APIs or charge very high fees for their use. Unfortunately, the Blekko API doesn't provide indexed results for PDFs, something that would be very useful for the focused crawling functionality of CORE. Still, as far as we know, this is the best free search API available.

Additional tools on which CORE is based have been described in our previous blog post.

Standards

In terms of standards, CORE has been using the information in the robots.txt file to set the harvesting frequency and to obey the wishes of repository owners to restrict access to certain parts of their systems. However, we have noticed that certain archives specify different policies for different bots, for example allowing GoogleBot into sections that are invisible to other bots or assigning a lower crawl-delay to GoogleBot than to other bots. We consider such policies unfair and in violation of the principles of Open Access.
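The check itself is straightforward. The sketch below (using Python's standard-library robotparser rather than Apache Nutch, and a hypothetical repository URL) shows how a harvester can compare what robots.txt allows, and which crawl-delay it sets, for different user agents:

from urllib.robotparser import RobotFileParser

# Hypothetical repository; replace with a real robots.txt location
rp = RobotFileParser("https://repository.example.org/robots.txt")
rp.read()

page = "https://repository.example.org/oai/request"
for agent in ("Googlebot", "CORE"):
    # can_fetch() applies the rules for the given user agent,
    # crawl_delay() returns the Crawl-delay directive for it (or None)
    print(agent,
          "allowed:", rp.can_fetch(agent, page),
          "crawl-delay:", rp.crawl_delay(agent))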

We have developed the CORE API as a RESTful service (in addition to our SPARQL endpoint). While it might sound politically incorrect, we have found the RESTful service to be much more popular among developers than the SPARQL endpoint.

Techniques and approaches

During the project, we have improved the architecture of the system and taken steps towards helping it grow. The system has been divided into a front end (currently 2 machines), responsible for dealing with requests coming from the web, and a powerful back-end machine, responsible for all the harvesting and processing. The applications are synchronised using a database, and indexes from the back-end machine are synced daily to the front-end machines.

Another useful tool we have developed is a self-test module which periodically monitors the health of the system and provides information when something doesn't seem right.
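As an illustration of what such a self-test can look like, here is a minimal sketch that periodically polls a list of endpoints and reports any that fail. The URLs and interval are hypothetical, and this is not the actual CORE module:

import time
import requests

# Hypothetical endpoints to watch; a real self-test would also check internal services
ENDPOINTS = [
    "https://core.ac.uk/",
    "https://core.ac.uk/datadump",
]
CHECK_INTERVAL_S = 300  # assumed check interval

def healthy(url):
    # An endpoint counts as healthy if it answers without a server error
    try:
        return requests.get(url, timeout=10).status_code < 500
    except requests.RequestException:
        return False

while True:
    for url in ENDPOINTS:
        if not healthy(url):
            print("ALERT: health check failed for", url)
    time.sleep(CHECK_INTERVAL_S)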