CORE has been accepted for presentation at the CERN Workshop on Innovations in Scholarly Communication (OAI7). OAI7 (http://indico.cern.ch/conferenceDisplay.py?confId=103325) takes place in June in Geneva and is one of the most important events in the Open Access publishing field. It is aimed at those involved in the development of Open Access (OA) repositories and those who can influence the direction of developments within their institution, their country or at an international level. That includes technical developers of OA bibliographic databases and connected services, research information policy developers at university or library level, funding bodies concerned with access to the results of their research, OA publishers, and influential researchers keen to lead OA developments in their own field.
In the last few weeks we have invested significant effort in the development of CORE administration tools. These tools will make it easier to maintain and analyse the flow of metadata and full-text content from Open Access repositories to CORE. They will also enable the inclusion of more Open Access repositories in the future. Our aim is that, after the end of the project, all regular maintenance tasks can be performed directly from the user interface, without requiring the administrator to have any knowledge of the source code. We believe that this is an important step towards the sustainability of CORE.
The CORE project will release the provided metadata and software in the following way:
– The metadata will be released under the Creative Commons Attribution license (CC BY). This license lets others distribute, remix, tweak, and build upon our work, even commercially, as long as they credit CORE for the original creation. It is the most accommodating of the licenses offered and is recommended for maximum dissemination and use of licensed materials.
– The software will be offered under the New BSD License or similar.
– The provided service will be offered free of charge to everybody.
The CORE Harvester system is now online! Though the system requires more testing and improvements, it is now deployed in its beta version at http://core.kmi.open.ac.uk.
The CORE system currently relies on the following technologies (this blog post will be updated to keep the information current):
– OCLC OAIHarvester2 – a set of Java classes for the OAI-PMH metadata harvesting
– J2EE and Spring libraries for the development of the web-based interface of the application
– Apache Lucene – for the indexing of the metadata and full-text documents
– Apache Tika – for the extraction of text from PDF documents
– Sesame – as a triple store for exposing the extracted triples
– MySQL – as a backend for Sesame and the Harvester application
We are currently testing a newly developed component for metadata harvesting. This component is responsible for acquiring (1) metadata records and (2) the associated full-text content from Open Access repositories. The metadata is harvested using standard OAI-PMH requests to the repositories. Successful requests return an XML document which contains information about the papers stored in a repository. The OAI-PMH protocol itself is not directly concerned with the downloading of full-text content, as it focuses on the transfer of metadata. However, a good practice in repositories (which is unfortunately not consistently applied) is to provide the URLs of the full-text documents as part of the metadata. Document URLs can thus be extracted and used to automatically download full texts from repositories over HTTP. The CORE system provides this functionality and is optimized for regular metadata harvesting and full-text downloading of large amounts of content. CORE caches the actual full-text content in order to process the documents and to discover additional metadata. This distinguishes it from a number of other Open Access federated search systems, such as BASE or OAIster, that rely only on the metadata accessible through OAI-PMH.
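The two steps described above (issuing an OAI-PMH request, then pulling document URLs out of the returned XML) can be sketched as follows. This is a minimal illustration in Python and not the CORE implementation, which is built on Java and OAIHarvester2. The function names, the example.org endpoint and the sample record are hypothetical, and treating any `dc:identifier` that ends in `.pdf` as a full-text link is an assumption made only for this sketch.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is supplied it is sent on its own, because
    OAI-PMH treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def extract_fulltext_urls(xml_text):
    """Map each record's OAI identifier to its candidate full-text URLs.

    Assumption for this sketch: a dc:identifier holding an http(s) URL
    that ends in ".pdf" points at the full text. Real repositories vary.
    """
    root = ET.fromstring(xml_text)
    results = {}
    for record in root.iter("{%s}record" % OAI_NS):
        oai_id = record.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        urls = []
        for el in record.iter("{%s}identifier" % DC_NS):
            value = (el.text or "").strip()
            lowered = value.lower()
            if lowered.startswith(("http://", "https://")) and lowered.endswith(".pdf"):
                urls.append(value)
        results[oai_id] = urls
    return results


# Hypothetical response fragment, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:identifier>http://example.org/1/</dc:identifier>
          <dc:identifier>http://example.org/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(extract_fulltext_urls(SAMPLE))
```

The extracted URLs would then be fetched over HTTP and cached. A record with no matching identifier simply maps to an empty list, which reflects the fact that not every repository exposes full-text links in its metadata.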
Aims, Objectives and Final Output(s) of the project
The CORE project aims to make it easier to access and navigate relevant scientific papers distributed across Open Access institutional repositories.
- Release a new open metadata collection in the Linked Data format describing the semantic relations between resources stored across a selection of UK institutional repositories. The project will assign dereferenceable URIs to all resources in the collection and will make them publicly available.
- Develop a web-service reusable by other Open Access repositories and a demonstrator tool for the Open Research Online (ORO) repository.
- Develop good practice for the uptake of the provided repository and service in collaboration with the Directory of Open Access Repositories (OpenDOAR) and UKOLN.
The CORE objectives will be achieved through the development of a CORE architecture which will consist of the following subsystems:
- Content Harvester – system for harvesting metadata and full-text content from institutional repositories and indexing
- Relation Analyzer – system for the discovery of semantic relations between full-text articles
- RDF Publisher – system for publishing the results in RDF with its associated services (demonstrator)
The functionality of the system is demonstrated in the following Figure:
In the first month of the project we developed a first version of the Harvesting system, which has already been tested on the ORO repository, itself based on the EPrints system.