This blog post summarises the CORE outputs produced over the last nine months and reports the lessons learned.
Highlights
Outputs
The outputs can be divided into (a) technical, (b) content and service, and (c) dissemination outputs.
(a) Technical outputs
According to our project management software, we have to date resolved 214 issues, each corresponding to a new feature or a fixed bug. In this section we describe the new features and improvements we have developed. The technology on which the system is built has been described in our previous blog post.
Among the main outputs achieved during the project are:
- An improvement of the metadata and content harvesting to allow more efficient parallel processing.
- The addition of new text mining tasks including:
- language detection
- concept extraction
- citation extraction
- text classification
- de-duplication (to be released soon)
- Improvement of existing text-mining tasks: semantic similarity (support for metadata and external resources)
- Pilot development and testing of the text-classification module
- An update of the CORE infrastructure to increase uptime and scalability and to allow service maintenance while the application is running.
- An update of the infrastructure to enable more advanced scheduling of repository harvesting tasks.
- The development of a statistical module tracking the amount of harvested metadata and content
- New functionality allowing batch import of content from the filesystem or using protocols other than OAI-PMH.
- Support for manual and automatic record removal/deletion
- Support for focused crawling (still in testing).
- OpenDOAR synchronisation component
- Improved logging, making the CORE process transparent to CORE users.
- Optimised performance of metadata extraction, PDF-to-text extraction and content harvesting.
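To illustrate the harvesting side of the list above, here is a minimal sketch of parsing an OAI-PMH ListRecords response, the protocol on which most repository harvesting relies. The sample XML and the function below are illustrative, not CORE's actual implementation; a real harvester would fetch the XML over HTTP and keep issuing requests with the resumption token until none is returned.

```python
# Minimal sketch of the parsing step in OAI-PMH metadata harvesting.
# A real harvester would obtain this XML from a request such as
# http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header/" + OAI + "identifier").text
        title_el = rec.find(".//" + DC + "title")
        records.append((ident, title_el.text if title_el is not None else None))
    token_el = root.find(".//" + OAI + "resumptionToken")
    return records, token_el.text if token_el is not None else None

records, token = parse_list_records(SAMPLE)
print(records, token)
```

A non-empty resumption token signals that more pages of records are available, which is also what makes the "more advanced scheduling of repository harvesting tasks" mentioned above necessary: large repositories are harvested over many such requests.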
(b) Content harvesting and service
- CORE API (search, semantic similarity, PDF and plain-text download, de-duplication)
- Repository Analytics
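To give a feel for how an aggregator API like the CORE API is typically consumed, the sketch below composes a search request and extracts titles from a JSON response. The endpoint URL, parameter names and response fields are illustrative assumptions, not the actual CORE API specification.

```python
# Hypothetical client-side sketch for a search API. The base URL,
# query parameters and JSON layout below are assumptions for
# illustration only.
import json
from urllib.parse import urlencode

BASE = "http://example.org/api/search"  # hypothetical endpoint

def build_search_url(query, page=1, page_size=10):
    """Compose a search request URL with URL-encoded parameters."""
    return BASE + "?" + urlencode({"q": query, "page": page, "pageSize": page_size})

# A response shaped like typical JSON search APIs (assumed structure).
sample_response = json.dumps({
    "total": 2,
    "records": [
        {"id": "1", "title": "Citation extraction at scale"},
        {"id": "2", "title": "Semantic similarity for metadata"},
    ],
})

def titles(response_text):
    """Pull the record titles out of a search response."""
    return [r["title"] for r in json.loads(response_text)["records"]]

print(build_search_url("text mining"))
print(titles(sample_response))
```

In practice a client would fetch the built URL over HTTP and page through results; the point of the API is that developers get this aggregated, enriched data without harvesting repositories themselves.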
(c) Dissemination
- JCDL 2012 – organisation of a workshop with The European Library on Mining Scientific Publications. The proceedings are published in D-Lib Magazine. In addition, we wrote the guest editorial for this issue and submitted a CORE-related paper about visual exploratory search, which we hope to integrate with the CORE system.
- OR 2012 – Presentation at the text-mining workshop, poster presentation
- Submitted article to eScience 2012
- We are in contact with: British Educational Index (CORE API), UNESCO (CORE Plugin), The European Library/Europeana (CORE API), British Library, OpenDOAR (CORE Repository Analytics), UCLC (CORE Plugin), Los Alamos National Laboratory (CORE API), Cottagelabs (CORE API), OpenAIREPlus, UK RepositoryNet+
Lessons learned
Access to content for research purposes is a problem – During the 1st International Workshop on Mining Scientific Publications, collocated with JCDL 2012 (blog post URL), we asked researchers how they feel about accessing information in scholarly databases for research and development purposes. The results of a questionnaire indicated that access to raw data is limited, which is a problem. It is difficult for researchers to acquire and share datasets of publications or research data, and it is currently too complicated for developers to access and build applications on top of the available data. CORE is trying to give researchers this unrestricted access to research publications in the Open Access domain.
Users of APIs need flexible and reliable access to aggregated and enriched information – Users of APIs, mostly researchers and developers, need convenient access to content. More specifically, they need to be able to focus on carrying out experiments or developing applications while minimising the effort of acquiring and preprocessing data. It is important that APIs provide flexible access to content, i.e. access that allows the building of a wide range of applications, many of which might be unprecedented. It is also essential that APIs provide services that make it easier to acquire preprocessed data, saving the time of researchers and developers. In CORE we want to work with these groups, listen to them, learn what functionalities they require and do our best to support them.
Open services are the key (Open Source is not enough) – After working on CORE for about 1.5 years, our experience suggests that the software solution is only part of the problem. A significant proportion of our time has been spent monitoring the aggregation, providing a suitable hardware infrastructure, operating the CORE services and analysing statistical results. This makes us believe that delivering Open Source solutions is not sufficient for building the necessary infrastructure for Open Access. What we need are sustainable, transparent and reliable Open Services. The aim of CORE is to deliver such a service.