CORE Repositories Dashboard: An infrastructure to increase collaboration of Aggregators with Open Repositories

In an effort to improve the quality and transparency of the harvesting process of open access content and to create a two-way collaboration between the CORE project and the providers of this content, CORE is introducing the Repositories Dashboard. The aim of the Dashboard is to provide an online interface for repository providers and, through it, offer content providers valuable information about:

  • the content harvested from the repository, enabling its management, for example by requesting metadata updates or handling take-down requests,
  • the times and frequency of content harvesting, including all detected technical issues and suggestions for improving the efficiency of harvesting and the quality of metadata, including compliance with existing metadata guidelines,
  • statistics regarding the repository content, such as the distribution of content according to subject fields and types of research outputs, and the comparison of these with the national average.

In the CORE Dashboard there is a designated page for every institution, where repository managers will be able to add all the information that corresponds to their own repository, such as the institution’s logo, the repository name and email address.

[Screenshot: repository profile page]

The Dashboard allows repository managers to create accounts for other colleagues as well.

[Screenshot: adding accounts for colleagues]

With regard to managing the harvested outputs, the Dashboard enables repository managers to add or remove documents’ full text in the CORE collection without having to contact the CORE team.

[Screenshot: managing harvested documents]

[Screenshot: “Take Down” and “Take up” buttons]

These actions can now be completed immediately, simply by clicking on the “Take Down” or “Take up” buttons. It is also possible to download a CSV file of all the records harvested by CORE from a repository.

CORE can also be notified about metadata changes through the Dashboard. Repository managers can click on the blue “Update Metadata” button, and the metadata of that single item will be updated without the need for a full repository harvest.
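
Under the hood, refreshing a single record typically maps to an OAI-PMH GetRecord request rather than a full ListRecords harvest. The sketch below is only an illustration of that idea, not the actual CORE implementation; the endpoint URL and record identifier are placeholders.

```python
import xml.etree.ElementTree as ET
import requests

# Placeholders: a hypothetical OAI-PMH endpoint and record identifier.
OAI_BASE = "https://repository.example.org/oai"
RECORD_ID = "oai:repository.example.org:12345"

def fetch_single_record(base_url, identifier):
    """Re-fetch one metadata record via OAI-PMH GetRecord instead of
    re-harvesting the whole repository with ListRecords."""
    params = {"verb": "GetRecord", "identifier": identifier,
              "metadataPrefix": "oai_dc"}
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()
    return ET.fromstring(response.text)

record = fetch_single_record(OAI_BASE, RECORD_ID)
print(ET.tostring(record, encoding="unicode")[:500])
```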

In cases where a whole repository needs to be re-harvested, a re-harvesting button can be found on the main page for each repository.

[Screenshot: re-harvesting button]

The CORE Dashboard is also able to identify technical issues that may occur while harvesting a repository. We have divided these issues into three categories: Error, Warning and Info, and we provide an explanation for each category on the Dashboard.

[Screenshot: harvesting issue categories]

An example of how errors and warnings may appear:

[Screenshot: example of errors and warnings]

We believe that this information will be valuable to repository managers in their communication with the staff who support the repository technically. For example, the screenshot shows that CORE cannot harvest the repository because there is an entry in the robots.txt file that prohibits the download. The “Documents issues” section informs us that CORE was not able to harvest some documents, for instance because a login may have been required to access the PDF, or because the metadata record does not link to a PDF but to a video file. The last section, “Malformed PDF url”, singles out those records whose URLs lack an HTTP scheme or contain invalid characters.
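
As a rough illustration of the “Malformed PDF url” check, the sketch below flags URLs that lack an http(s) scheme or contain characters that are not valid in a URL. The rules and example URLs are assumptions for illustration, not CORE’s actual validation code.

```python
from urllib.parse import urlsplit

def classify_pdf_url(url):
    """Roughly mirror the 'Malformed PDF url' check: flag URLs that lack an
    http(s) scheme or contain characters that are not valid in a URL."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        return "missing or non-HTTP scheme"
    if not parts.netloc or any(ch in url for ch in ' <>"{}|\\^`'):
        return "invalid characters or missing host"
    return "ok"

# Hypothetical example URLs (placeholders).
for u in ("ftp://example.org/paper.pdf",
          "https://example.org/files/paper one.pdf",
          "https://example.org/files/paper.pdf"):
    print(u, "->", classify_pdf_url(u))
```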

We are currently working to include two important functionalities in the Dashboard. The first is IRUS-UK statistics, which will give repository managers access to statistics for their records through CORE. The second is RIOXX compliance, whereby repository managers will be able to check their repository’s compliance with the RIOXX guidelines.

If you have any questions or recommendations, or you would like to gain access to the CORE Dashboard, email us at

CORE releases a new API version

We are very proud to announce that CORE has now released CORE API 2.0. The new API offers new opportunities for developers to make use of the CORE open access aggregator in their applications.

The main new features are:

  • Support for looking up articles by a global identifier (DOI, OAI, arXiv, etc.) instead of just the CORE ID.
  • Access to new resource types, repositories and journals, and organisation of API methods according to resource type.
  • Access to the original metadata exactly as it was harvested from the repository of origin.
  • Retrieval of the changes to the metadata as harvested by CORE.
  • Retrieval of citations extracted from the full text by CORE.
  • Support for batch requests for searching, recommending, accessing full texts, harvesting history, etc.

The goals of the new API also include improving scalability, cleaning up and unifying the API responses and making it easier for developers to start working with it.

The API is implemented and documented using Swagger, which has the advantage that anybody can start playing with the API directly from our online client. The documentation of the API v2.0 is available and the API is currently in beta. Those interested in registering for a new API key can do so by completing the online form.
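
As a rough illustration, the sketch below shows how a search request against the API might look from Python. The endpoint path, parameter names and API key are placeholders, not the authoritative interface; the exact details are given in the Swagger documentation.

```python
import requests

# Placeholders: the endpoint path, parameter names and key below are
# illustrative only; consult the Swagger documentation for the real ones.
API_BASE = "https://core.ac.uk/api-v2"
API_KEY = "YOUR_API_KEY"

def search_articles(query, page=1, page_size=10):
    """Run a simple article search against the aggregator."""
    url = f"{API_BASE}/articles/search/{query}"
    params = {"apiKey": API_KEY, "page": page, "pageSize": page_size}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

results = search_articles("open access repositories")
print(results.get("totalHits"))
```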

Our previous version, the CORE API v1.0, will not be abandoned yet; it will run alongside the new version. However, the API v1.0 is deprecated and will eventually be replaced by the API v2.0.

What is CORE

The mission of CORE (COnnecting REpositories) is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. In this way CORE facilitates free, unrestricted access to research for all.

CORE:
  • supports the right of citizens and the general public to access the results of research towards which they have contributed by paying taxes,
  • facilitates access to open access content for all by offering services to the general public, academic institutions, libraries, software developers, researchers, etc.,
  • provides support to both content consumers and content providers by working with digital libraries, institutional and subject repositories and journals,
  • enriches the research content using state-of-the-art technology and provides access to it through a set of services including search, API and analytical tools,
  • contributes to a cultural change by promoting open access, a fast growing movement.

CORE among the top 10 search engines for research that go beyond Google

Using search engines effectively is now a key skill for researchers, but could more be done to equip young researchers with the tools they need? Here, Dr Neil Jacobs and Rachel Bruce from JISC’s digital infrastructure team shared their top ten resources for researchers from across the web. CORE was placed among the top 10 search engines that go beyond Google.

More information is available on the JISC website.

Related content recommendation for EPrints

We have released the first version of a content recommendation package for EPrints, available via the EPrints Bazaar ( ). The functionality is offered through CORE and can be seen, for example, in Open Research Online EPrints ( ) or on the European Library portal ( ). I was wondering if any EPrints repository manager would be interested in getting in touch to test this in their repository. As the package is available via the EPrints Bazaar, the installation requires just a few clicks. We would be grateful for any suggestions for improvements and also for information regarding how this could be effectively provided to DSpace and Fedora repositories.

Final blog post

The main idea of this blog post is to provide a summary of the CORE outputs produced over the last 9 months and report the lessons learned.


The outputs can be divided into (a) technical, (b) content and service and (c) dissemination outputs.

(a) Technical outputs

According to our project management software, to this day we have resolved 214 issues. Each issue corresponds to a new function or a fixed bug. In this section we will describe the new features and improvements we have developed. The technology on which the system is built has been described in our previous blog post.

Among the main outputs achieved during the project are:

  • An improvement of the metadata and content harvesting to allow more efficient parallel processing (a minimal sketch of the idea follows this list).
  • The addition of new text-mining tasks, including:
    • language detection
    • concept extraction
    • citation extraction
    • text classification
    • de-duplication (to be released soon)
  • Improvement of existing text-mining tasks: semantic similarity (support for metadata and external resources)
  • Pilot development and testing of the text-classification module
  • An update of the CORE infrastructure to increase uptime and scalability and to allow service maintenance while the application is running.
  • An update of the infrastructure to enable more advanced scheduling of repository harvesting tasks.
  • The development of a statistical module tracking the amount of harvested metadata and content
  • New functionality allowing batch import of content from the filesystem or using protocols other than OAI-PMH.
  • Support for manual and automatic record removal/deletion.
  • The addition of focused crawling (still in testing).
  • An OpenDOAR synchronisation component.
  • Improved logging, making the CORE processes transparent to CORE users.
  • Optimised performance of metadata extraction, PDF-to-text extraction and content harvesting.
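
The parallel-processing improvement mentioned in the first bullet above can be illustrated with a small sketch that contacts several OAI-PMH endpoints concurrently. This is only an illustration under assumed placeholder URLs, not the actual CORE harvester.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

# Placeholders: hypothetical OAI-PMH endpoints for three repositories.
ENDPOINTS = [
    "https://repo-a.example.org/oai",
    "https://repo-b.example.org/oai",
    "https://repo-c.example.org/oai",
]

def harvest_first_page(base_url):
    """Fetch the first page of record identifiers from one repository."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    response = requests.get(base_url, params=params, timeout=60)
    response.raise_for_status()
    return base_url, len(response.text)

# Harvest several repositories concurrently instead of one after another.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(harvest_first_page, url) for url in ENDPOINTS]
    for future in as_completed(futures):
        url, size = future.result()
        print(f"{url}: {size} bytes of ListIdentifiers response")
```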

(b) Content harvesting and service

  • Complete graphical redesign and refactoring of the CORE Portal
  • New version of CORE Research Mobile for iOS devices (Apple) including iPhone and iPad.
  • Update and new graphical design of the CORE Mobile application for Android.
  • The creation of new services:
    • Creation of a new version of the CORE Plugin
  • A significant increase in the harvested metadata, content and repositories: 8.5M harvested metadata records, 450k harvested full-text files, 1.5TB of data on disk, 35M RDF triples in the CORE Repository, and 232 supported repositories.
  • The CORE repository has been officially added to the LOD cloud.
(c) Dissemination

  • JCDL 2012 – organisation of a workshop with The European Library on Mining Scientific Publications. The proceedings are published in D-Lib Magazine. In addition, we have written the guest editorial for this issue and submitted a CORE-related paper about visual exploratory search, which we hope to integrate with the CORE system.
  • OR 2012 – presentation at the text-mining workshop and a poster presentation.
  • Submitted an article to eScience 2012.
  • We are in contact with: British Educational Index (CORE API), UNESCO (CORE Plugin), The European Library/Europeana (CORE API), British Library, OpenDOAR (CORE Repository Analytics), UCLC (CORE Plugin), Los Alamos National Laboratory (CORE API), Cottagelabs (CORE API), OpenAIREPlus, UK RepositoryNet+

Lessons learned

Access to content for research purposes is a problem – During the 1st International Workshop on Mining Scientific Publications, collocated with JCDL 2012, we asked researchers how they feel about accessing information in scholarly databases for research and development purposes. The results of a questionnaire indicated that access to raw data is limited, which is a problem. It is difficult for researchers to acquire datasets of publications or research data and share them, and it is currently too complicated for developers to access and build applications on top of the available data. CORE is trying to help researchers get this unrestricted access to research publications in the Open Access domain.

Users of APIs need flexible and reliable access to aggregated and enriched information – Users of APIs, mostly researchers and developers, need convenient access to content. More specifically, they need to be able to focus on carrying out experiments or developing applications while minimising the effort of acquiring and preprocessing data. It is important that APIs provide flexible access to content, i.e. access that allows the building of a wide range of applications, many of which might be unprecedented. It is also essential that APIs aim to provide services that make it easier to acquire preprocessed data, saving the time of researchers and developers. In CORE we want to work with these groups, listen to them and learn what functionalities they require. If we can, we will do our best to support them.

Open services are the key (Open Source is not enough) – After working on CORE for about 1.5 years, our experience suggests that the software solution is just a part of the whole problem. A significant proportion of our time has been spent monitoring the aggregation, providing a suitable hardware infrastructure, operating the CORE services and analysing statistical results. This makes us believe that delivering Open Source solutions is not sufficient for building the necessary infrastructure for Open Access. What we need are sustainable, transparent and reliable Open Services. The aim of CORE is to deliver such a service.

Technical Approach

In the last six months, CORE has made a huge step forward in terms of the technology solution. According to our project management software, to this day we have resolved 214 issues. Each issue corresponds to a new function or a fixed bug.

The idea of this blog post is to provide an overview of the technologies and standards CORE is using and to report on the experience we have had with them during the development of CORE in the last months. We will provide more information about the new features and enhancements in the following blog posts.


Tomcat Web server – CORE has been using Tomcat as an application container since its start; however, relatively recently the CORE frontend has been deployed as a Tomcat cluster. This means that the application is deployed on multiple (currently just 2) machines, and a load balancer redirects the web traffic to any of these servers. The advantage of this solution is not only performance, but also the reliability of the service. For example, it is now possible for us to redeploy the application while the CORE service is still running. At the same time, the architecture is prepared for future growth. So far, our experience with this solution is generally positive.

Apache Nutch – We have adopted Apache Nutch in order to obey the information in the robots.txt file. Apache Nutch makes the implementation very simple and we have a very positive experience with it.

SVM Light – Support Vector Machine classifiers (in particular SVM multiclass) have been used in CORE to perform a pilot set of tests for text classification of research papers. While the tool is extremely simple to set up and great to work with, it does not allow building models from a very large number of examples. Although we couldn’t utilise all the examples we have, the tool was still good enough for carrying out experiments. We are now looking at how to improve the scalability of the training phase to make use of a larger number of examples. We think that tools such as Apache Mahout might be able to provide the answer.
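
For illustration only, the sketch below expresses the same kind of text classification using scikit-learn’s linear SVM rather than the SVM multiclass command-line tool actually used in the pilot; the training texts and labels are toy placeholders.

```python
# A stand-in for the SVM multiclass pilot: the same idea expressed with
# scikit-learn's linear SVM. The texts and labels are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "gene expression in cancer cells",
    "deep neural networks for image recognition",
    "monetary policy and inflation expectations",
    "protein folding and molecular dynamics",
]
labels = ["biology", "computer science", "economics", "biology"]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(texts, labels)

print(classifier.predict(["convolutional networks for object detection"]))
```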

Google Charts – Google Charts have been used for the graphs in the Repository Analytics: very cool interactive graphs, easy to set up.

Logback – used to improve logging in CORE and also to provide logs in the Repository Analytics tool.

Apache Lucene – has been used previously and has proved to be a great tool – very fast and scalable.

Language detection software – The issue of language detection has become more important to resolve as the content in the CORE aggregation system has grown. Particularly with the aggregation of content from the Directory of Open Access Journals, it started to be important to distinguish publications in different languages. We originally tried to approach this problem using AlchemyAPI. AlchemyAPI offers their API for free for a low number (several thousand) of queries per day. This can be extended up to 30k queries for non-commercial academic providers. We asked AlchemyAPI to provide this, but learned that they require you to acknowledge AlchemyAPI in every publication about your project (even those that talk about completely different aspects of the system). Therefore, we have decided to use the Language Detection Library for Java available on Google Code. We are very happy with this decision.
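
The library we adopted is a Java library; the sketch below uses its Python port, langdetect, purely to illustrate the task on a few sample sentences.

```python
# Illustration only: the Python port ("langdetect") of the Java
# language-detection library mentioned above.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs

samples = [
    "Open access removes price barriers to research literature.",
    "El acceso abierto elimina las barreras de precio a la literatura.",
    "Der offene Zugang beseitigt Preisbarrieren in der Forschungsliteratur.",
]

for text in samples:
    print(detect(text), "-", text)
```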

ParsCit (citation extraction) – We have used ParsCit, an open-source CRF reference string and logical document structure parsing package, for detection of basic metadata and citation parsing from full text. ParsCit provides reasonable performance and accuracy and we are quite happy with it.

AlchemyAPI – We have used this for language detection in the past and also for concept extraction. Overall, we feel this solution is not suitable for us due to its licensing restrictions and the lack of transparency of the extraction methods, and we are moving to a different solution.

Blekko – Blekko is a search engine which offers an API that allows you to query the Web free of charge at a maximum frequency of 1 query per second. This is fabulous in comparison to other search engines, like Google, Yahoo or Bing, which either extremely restrict the use of their APIs or charge enormously high fees for their use. Unfortunately, the Blekko API doesn’t provide indexed results for PDFs, something that would be very useful for the focused crawling functionality of CORE. Still, as far as we know, this is the best free search API available.
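
A one-query-per-second limit is easy to honour on the client side. The sketch below shows a minimal rate-limited client; the URL is a placeholder standing in for the actual Blekko API endpoint, whose details are in its documentation.

```python
import time
import requests

class RateLimitedClient:
    """Minimal client that enforces a fixed delay between requests,
    e.g. the one-query-per-second limit mentioned above."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = 0.0

    def get(self, url, **kwargs):
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return requests.get(url, timeout=30, **kwargs)

# Placeholder URL standing in for the real search API endpoint.
client = RateLimitedClient(min_interval=1.0)
for query in ("open access", "text mining", "repositories"):
    response = client.get("https://api.example.org/search", params={"q": query})
    print(query, response.status_code)
```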

Additional tools on which CORE is based have been described in our previous blog post.


In terms of standards, CORE has been using the information in the robots.txt file to set the harvesting frequency and obey the wishes of the repository owner to restrict access to certain parts of the system. However, we have noticed that certain archives specify different policies for different bots, for example allowing GoogleBot into sections that are invisible to other bots or assigning a lower crawl-delay to GoogleBot than to other bots. We consider such policies unfair and contrary to the principles of Open Access.
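
Bot-specific policies of this kind are easy to inspect. The sketch below uses Python’s standard robots.txt parser to compare what two different user agents are allowed to fetch and with what crawl-delay; the repository URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Placeholder repository URL.
rp = RobotFileParser()
rp.set_url("https://repository.example.org/robots.txt")
rp.read()

# Compare what different crawlers may fetch and how fast they may crawl;
# differing answers reveal bot-specific policies.
for agent in ("Googlebot", "CORE"):
    allowed = rp.can_fetch(agent, "https://repository.example.org/files/paper.pdf")
    delay = rp.crawl_delay(agent)
    print(f"{agent}: allowed={allowed}, crawl-delay={delay}")
```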

We have developed the CORE API as a RESTful service (in addition to our SPARQL endpoint). While it might sound politically incorrect, we have found the RESTful service to be much more popular among developers than the SPARQL endpoint.
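
For comparison, a SPARQL query against an endpoint looks roughly as follows, using the SPARQLWrapper library. The endpoint URL and the property queried are placeholders, not the actual CORE endpoint or vocabulary.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint and property; the actual CORE endpoint address and
# vocabulary are given in the CORE documentation.
sparql = SPARQLWrapper("https://example.org/sparql")
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?article ?title
    WHERE { ?article dcterms:title ?title . }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["title"]["value"])
```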

Techniques and approaches

During the project, we have improved the architecture of the system and taken steps to help it grow. The system has been divided into a front-end (currently 2 machines), responsible for dealing with the requests coming from the web, and a powerful back-end machine, responsible for all the harvesting and processing. The applications are synchronised using a database. Indexes from the back-end machine are synced daily to the front-end machines.

Another useful tool we have developed is a self-test module which periodically monitors the health of the systems and provides information in case something doesn’t seem right.
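
A self-test of this kind can be as simple as periodically pinging the public-facing services and reporting anything unhealthy. The sketch below is a minimal illustration with placeholder URLs, not the actual CORE module.

```python
import requests

# Placeholder URLs; the real self-test covers the CORE services and indexes.
CHECKS = {
    "frontend": "https://core.example.org/",
    "api": "https://core.example.org/api/status",
}

def run_self_test(checks):
    """Ping each service and collect anything that does not look healthy."""
    problems = []
    for name, url in checks.items():
        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 200:
                problems.append(f"{name}: HTTP {response.status_code}")
        except requests.RequestException as exc:
            problems.append(f"{name}: {exc}")
    return problems

issues = run_self_test(CHECKS)
print("All checks passed" if not issues else "\n".join(issues))
```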

CORE Fights for Open Access in Scotland!

The 7th International Conference on Open Repositories (OR 2012) last week saw close to 500 participants, the highest number in its history. The theme and title of OR 2012 in Edinburgh – Open Services for Open Content: Local In for Global Out – reflects the current move towards open content, ‘augmented content’, distributed systems and data delivery infrastructures. A very good fit with what CORE ( ) offers.

The CORE system developed in KMi had a very active presence. Petr Knoth presented different aspects of the CORE system in a talk, at a poster session (with Owen Stephens) and also during the developers challenge. CORE was also discussed in a number of presentations by other participants not directly linked to the Open University, perhaps the most important being the UK RepositoryNet+ project presentation. UK RepositoryNet+ is a socio-technical infrastructure funded by JISC supporting the deposit, curation and exposure of Open Access research literature. UK RepositoryNet+ aims to provide a stable socio-technical infrastructure at the network level to maximize the value of that investment to UK HE, by supporting a mix of distributed and centrally delivered service components with pro-active management, operation and support. While this infrastructure is designed to meet the needs of UK research, it is set and must operate effectively within a global context. UK RepositoryNet+ considers the CORE system an important component in this infrastructure.

The similarity of the CORE approach with that of William Wallace, the Scottish hero in the picture, is the determination to fight for freedom; in this case, freedom of access to content. There is, hopefully, also one difference: we hope CORE will not end up the same way as William Wallace … We will see :-)