The CORE Harvester system is now online! Although the system still requires more testing and improvement, a beta version is deployed at http://core.kmi.open.ac.uk.
The CORE system currently relies on the following technologies (this blog post will be updated to keep the information current):
– OCLC OAIHarvester2 – a set of Java classes for OAI-PMH metadata harvesting
– J2EE and Spring libraries for the development of the web-based interface of the application
– Apache Lucene – for the indexing of the metadata and full-text documents
– Apache Tika – for the extraction of text from PDF documents
– Sesame – as a triple store for exposing the extracted triples
– MySQL – as a backend for Sesame and the Harvester application
We are currently testing a newly developed component for metadata harvesting. This component is responsible for acquiring (1) metadata records and (2) the associated full-text content from Open Access repositories. Metadata is harvested using standard OAI-PMH requests to the repositories. A successful request returns an XML document that contains information about the papers stored in a repository.
The OAI-PMH protocol focuses on the transfer of metadata and is not directly concerned with downloading full-text content. However, it is good practice in repositories (unfortunately not consistently applied) to provide the URLs of the full-text documents as part of the metadata. These document URLs can thus be extracted and used to download full texts from repositories automatically over HTTP. The CORE system provides this functionality and is optimized for regular metadata harvesting and full-text downloading of large amounts of content. CORE caches the actual full-text content in order to process the documents and discover additional metadata. This distinguishes it from a number of other Open Access federated search systems, such as BASE or OAIster, that rely only on the metadata accessible through OAI-PMH.
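The URL extraction step described above can be sketched roughly as follows. This is only an illustration in Python, not the CORE implementation (which builds on the Java OAIHarvester2 classes). It assumes records in the common `oai_dc` metadata format, and it assumes that a full-text link is any `dc:identifier` value that is an HTTP(S) URL ending in `.pdf`. That heuristic, the sample response, the `repository.example.org` host and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, shortened ListRecords response used instead of a live request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:101</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:identifier>http://repository.example.org/101/</dc:identifier>
          <dc:identifier>http://repository.example.org/101/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repository.example.org:102</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A metadata-only record</dc:title>
          <dc:identifier>http://repository.example.org/102/</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def extract_fulltext_urls(xml_text):
    """Map each OAI record identifier to its candidate full-text URLs."""
    root = ET.fromstring(xml_text)
    results = {}
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        urls = [(el.text or "").strip()
                for el in record.iterfind(".//dc:identifier", NS)]
        # Assumed heuristic: treat HTTP(S) URLs ending in ".pdf" as full text.
        results[oai_id] = [
            u for u in urls
            if u.lower().startswith(("http://", "https://"))
            and u.lower().endswith(".pdf")
        ]
    return results


def resumption_token(xml_text):
    """Return the token for the next page of results, or None on the last page."""
    root = ET.fromstring(xml_text)
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return token.strip() if token and token.strip() else None


if __name__ == "__main__":
    for oai_id, urls in extract_fulltext_urls(SAMPLE_RESPONSE).items():
        print(oai_id, urls)
    print("next page token:", resumption_token(SAMPLE_RESPONSE))
```

The second sample record shows the inconsistency mentioned above: it carries no full-text URL, so a harvester can only keep its metadata. A real harvester would fetch each page over HTTP and repeat the request with the returned resumption token until none is left.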
Aims, Objectives and Final Output(s) of the project
The CORE project aims to facilitate access to, and navigation among, relevant scientific papers distributed across Open Access institutional repositories. Its objectives are to:
- Release a new open metadata collection in the Linked Data format describing the semantic relations between resources stored across a selection of UK institutional repositories. The project will assign dereferenceable URIs to all resources in the collection and will make them publicly available.
- Develop a web-service reusable by other Open Access repositories and a demonstrator tool for the Open Research Online (ORO) repository.
- Develop good practice for the uptake of the provided repository and service in collaboration with the Directory of Open Access Repositories (OpenDOAR) and UKOLN.
The CORE objectives will be achieved through the development of a CORE architecture which will consist of the following subsystems:
- Content Harvester – system for harvesting and indexing metadata and full-text content from institutional repositories
- Relation Analyzer – system for the discovery of semantic relations between full-text articles
- RDF Publisher – system for publishing the results in RDF with its associated services (demonstrator)
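To make the division of labour between the last two subsystems more concrete, here is a very small Python sketch. It is not the project's design. This post does not say how relations are discovered, so a simple word-overlap (Jaccard) score is used as a stand-in technique. The base URI, the `relatedTo` predicate, the threshold and all function names are hypothetical. The input dictionary stands for text the Content Harvester has already extracted.

```python
from itertools import combinations

BASE_URI = "http://example.org/resource/"        # hypothetical resource namespace
RELATED = "http://example.org/vocab/relatedTo"   # hypothetical predicate


def tokens(text):
    """Lower-case a text and return its set of alphabetic words."""
    return {w for w in text.lower().split() if w.isalpha()}


def jaccard(a, b):
    """Word-overlap score between two token sets, in the range 0..1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_relations(documents, threshold=0.3):
    """Relation Analyzer stand-in: pair up documents whose overlap is high."""
    token_sets = {doc_id: tokens(text) for doc_id, text in documents.items()}
    return [
        (left, right)
        for left, right in combinations(sorted(token_sets), 2)
        if jaccard(token_sets[left], token_sets[right]) >= threshold
    ]


def to_ntriples(relations):
    """RDF Publisher stand-in: serialise each relation as one N-Triples line."""
    return "\n".join(
        f"<{BASE_URI}{s}> <{RELATED}> <{BASE_URI}{t}> ." for s, t in relations
    )


# Text assumed to have been extracted already by the Content Harvester.
documents = {
    "doc1": "open access repositories expose metadata",
    "doc2": "metadata harvesting from open access repositories",
    "doc3": "protein folding simulation results",
}

if __name__ == "__main__":
    print(to_ntriples(find_relations(documents)))
```

Each resource gets its own URI under a single namespace, which is the precondition for making the URIs dereferenceable once a server answers requests for them.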
The functionality of the system is demonstrated in the following Figure:
In the first month of the project, we developed a first version of the Harvesting system, which has already been tested on the ORO repository, itself based on the EPrints system.