In the two previous blog posts in this series (Finding fulltext and What does Google do?) I’ve described some of the challenges related to harvesting metadata and full text from institutional repositories. I’ve omitted some of the technical issues we’ve encountered (e.g. issues with OAI-PMH Resumption Tokens) as generally we’ve been able to work around these – although I may come back to these at some point in the future. Also worth a read is Nick Sheppard’s post on the UKCORR blog touching on some of these issues.
The CORE team is organising a workshop collocated with JCDL 2012, a major conference in the field of digital libraries. Our proposal to organise the 1st International Workshop on Mining Scientific Publications was accepted. The aim of the workshop is to bring together researchers, digital library developers and practitioners from government and industry to address the current challenges in the field of mining scientific publications and building the necessary infrastructure to support this. The topics of the workshop are directly related to the work carried out in both the ServiceCORE and DiggiCORE projects and are available on the workshop website.
This is the second post in a series about the issues CORE has encountered trying to harvest (and build services on) metadata and fulltext items from UK HE research repositories. The first post “Finding fulltext” looked at the problems of harvesting fulltext due to variations in how links are made (or not) from metadata records to fulltext content.
In this post I want to consider the question of what services like CORE are allowed or permitted to do with repository content. A third post will then describe some of the solutions to the various challenges we see.
In order to provide search functions, similarity measures and other functionality, CORE harvests both metadata and fulltext items from repositories. This raises questions about whether we are allowed to harvest metadata or fulltext items, and if so, what we are allowed to do with them once we have harvested them. In the first phase of CORE we relied on OAI-PMH to harvest metadata, and then used links from the harvested records to try to discover the related fulltext item.
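To make the OAI-PMH step concrete, here is a minimal sketch of a metadata harvesting loop. It is not CORE's actual harvester: the names `list_records`, `http_get` and the `fetch` parameter are made up for illustration, and it assumes the repository answers a standard `ListRecords` request and signals further pages with a `resumptionToken` element.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Default fetcher (illustrative): return the response body for a URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def list_records(base_url, fetch=http_get, metadata_prefix="oai_dc"):
    """Yield every <record> element, following resumption tokens page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(f".//{OAI}resumptionToken")
        # A missing or empty token means this was the last page.
        if token is None or not (token.text or "").strip():
            break
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing the fetcher in as a parameter is just a convenience for this sketch: it lets the loop be tried against canned responses before pointing it at a real repository.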
This is the first in a series of blog posts looking at these issues, the problems we’ve encountered and the solutions we have put in place (so far). In this post I’m going to focus on the question of finding fulltext items from the metadata. This wasn’t always straightforward. Not all repositories link to fulltext records from the metadata in the same way, and in many cases the metadata contains no direct link to the fulltext at all, only a link to the repository’s web page for the record.
The ServiceCORE team has now moved to an agile development lifecycle with a two-week release cycle. What is available in the new release that has just been published?
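As a rough illustration of the problem, the sketch below sorts the links found in a Dublin Core (`oai_dc`) record into likely direct fulltext links and likely landing pages. It is a guess at one possible heuristic, not the approach CORE uses: the function name `candidate_links`, the choice of the `dc:identifier` and `dc:relation` fields, and the list of file suffixes are all assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace of the Dublin Core elements carried in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Assumed suffixes that suggest a link points at a file, not a web page.
FULLTEXT_SUFFIXES = (".pdf", ".doc", ".docx", ".ps")


def candidate_links(record_xml):
    """Return (direct, landing): URLs that look like files vs. other URLs."""
    root = ET.fromstring(record_xml)
    direct, landing = [], []
    for tag in ("identifier", "relation"):
        for element in root.iter(DC + tag):
            url = (element.text or "").strip()
            # Skip DOIs, handles and other non-HTTP identifiers.
            if not url.lower().startswith(("http://", "https://")):
                continue
            path = urlparse(url).path.lower()
            if path.endswith(FULLTEXT_SUFFIXES):
                direct.append(url)
            else:
                landing.append(url)
    return direct, landing
```

When `direct` comes back empty, the only option left is to fetch one of the `landing` pages and look for the fulltext link there, which is exactly the case described above.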
– A new advanced search facility.
– Search snippets available on the results page. Snippets are created from the resource full-text where available.
– The system supports citation extraction (available for newly processed resources) and displays references mined from the article full-texts. CORE also provides direct links to them, if they are held in our repository (see http://core.kmi.open.ac.uk/display/41214).
– A new document preview feature
The ServiceCORE project is addressing a wide range of problems caused by the rapid increase of Open Access scientific papers stored across UK institutional repositories. These problems include:
- The difficulty of accessing real, full-text data from these distributed sources efficiently.
- The difficulty of generating data statistics (size, growth, subjects).
- The difficulty of searching, organising and navigating this distributed information.
- The difficulty of analysing the data.
- The difficulty of repurposing and reusing the data in other applications.
- The difficulty of building services on top of the UK Repository Infrastructure
The ServiceCORE project is responding to these challenges by developing a nationwide aggregation service for content stored across UK Open Access Repositories. The CORE system not only harmonises access to UK repository content, but also processes the full-text content using text-mining methods to enrich the existing metadata. This includes extraction of citation information, recommendation of similar content etc.
The subsequent development of the CORE system will be achieved through two new projects: ServiceCORE and DiggiCORE. The ServiceCORE project is focused on improving the CORE technical infrastructure and on developing new services, including an API for research publications, a subject-based classification system and good practice for publishing and exposing content in OARs. The DiggiCORE project, which is in collaboration with The European Library, aims to analyse the impact of Open Access research publications by constructing citation networks and analysing research trends. In this respect the two projects are complementary. More information about ServiceCORE can be found in the CORE project family section.
The KMi submission authored by Petr Knoth, Vojtech Robotka and Zdenek Zdrahal, entitled “Connecting Repositories in the Open Access Domain using Text Mining and Semantic Data”, won the Best Poster/Demo Award at the International Conference on Theory and Practice of Digital Libraries (TPDL 2011), which is taking place this week in Berlin, Germany.
The European Conference on Research and Advanced Technology for Digital Libraries (ECDL) has been the leading European scientific forum on digital libraries for 14 years. For its 15th year the conference was renamed the International Conference on Theory and Practice of Digital Libraries (TPDL).
The CORE Plugin was finally approved by the OU Library last week and has become part of the institutional repository Open Research Online. An example of the plugin can be seen at the bottom of this page.
What we have produced:
- CORE harvesting admin interface (http://core-project.kmi.open.ac.uk/node/9 – access to the admin system is not public)
- CORE Portal (http://core.kmi.open.ac.uk) – A general purpose federated search system for accessing articles harvested from OA repositories.
- CORE Mobile (https://market.android.com/details?id=org.xmlrpc&feature=search_result) – A native Android application, which can be used on both mobile and tablet devices. The application allows search and navigation across related papers stored in OA repositories and downloading full-text articles on the mobile device.
- CORE Plugin – Provides digital libraries with information about documents stored in other repositories that are semantically related to the document currently being viewed. Integrated into Open Research Online (oro.open.ac.uk – currently in the process of an internal OU approval).
- SPARQL endpoint (http://core.kmi.open.ac.uk:8081/COREWeb/squery) and RDF triple store (http://ckan.net/package/core)
Knoth, P., Robotka, V. and Zdrahal, Z. (2011) Connecting Repositories in the Open Access Domain using Text Mining and Semantic Data, International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011), Berlin, Germany
Knoth, P. and Zdrahal, Z. (2011) CORE: Connecting Repositories in the Open Access Domain, CERN workshop on Innovations in Scholarly Communication (OAI7), Geneva, Switzerland
YouTube video presentation:
Linked Data in Libraries event (London) presentation:
Core presentation from petrknoth
Next steps:
- Find ways to further develop CORE to enable the inclusion of larger amounts of content, i.e. the aggregation of content from more repositories.
- Integrate CORE with currently emerging research data management and repository systems to allow the linking of publications with data.
- Further disseminate the service to increase its adoption.
Evidence of Reuse:
- Data and services currently being reused by the Open Research Online Repository.
- Positive feedback received from participants of the OAI7 workshop, namely Astrid van Wesenbeeck (SPARC Europe).
- Positive feedback about CORE received by email from Graham Steel in response to the upload of the CORE video to YouTube.
- Our team has discovered a set of OAI-PMH base URLs that were not up to date in the OpenDOAR repository and provided this feedback to OpenDOAR. Bill Hubbard of OpenDOAR appreciated this collaboration.
The project has helped us to further develop skills needed to technically handle large amounts of data. It also increased our understanding of the current state-of-the-art technologies for access and retrieval of Open Access content. These skills will help us to further develop CORE in the future.