CORE will be at Open Repositories 2016

At this year’s Open Repositories 2016, an international conference for the scholarly communications community with a focus on repositories, open access, open data and open science, CORE had six items accepted: one paper, one workshop, one Repository Rave presentation, one poster and two showcases in the Developer Track and Ideas Challenge. The titles and summaries of our accepted proposals are:

Paper: Exploring Semantometrics: full text-based research evaluation for open repositories / Knoth, Petr; Herrmannova, Drahomira
In recent years there has been a growing interest in developing new scientometric measures that go beyond the traditional citation-based bibliometric measures. This interest is motivated on one side by the wider availability, or even emergence, of new information evidencing research performance, such as article downloads, views and Twitter mentions, and on the other side by the continued frustrations and problems surrounding the application of citation-based metrics to evaluate research performance in practice. Semantometrics are a new class of research evaluation metrics built on the premise that full text is needed to assess the value of a publication. This talk will present the results of an investigation into the properties of the semantometric contribution measure (Knoth & Herrmannova, 2014). We will provide a comparative evaluation of the contribution measure against traditional bibliometric measures based on citation counting. Our analysis also focuses on the potential application of semantometric measures in large databases of research papers.

Workshop: Mining Repositories: How to assist the research and academic community in their text and data mining needs – a workshop / Pontika, Nancy; Knoth, Petr; van Dijke, Hege; Anastasiou, Lucas
Over the past five years there has been significant interest in text and data mining (TDM) practices from the European Union (EU). In scholarly communication, TDM is already an established practice in some scientific fields, for example the life sciences and computer science. Nonetheless, after a call that we sent out to the United Kingdom Council of Research Repositories (UKCoRR) listserv, we discovered that only a limited number of TDM projects had used repositories’ collections as their primary source of information. To address this challenge, the EU-funded project OpenMinTeD aims to enable the creation of an infrastructure that fosters and facilitates the use of TDM technologies in the field of scientific publications, targeting both domain users and TDM experts. In this context we propose a three-hour workshop, where we will introduce the topic of TDM to the repositories community, explore how the OpenMinTeD project aims to assist with the adoption of TDM practices, and present existing TDM projects that were conducted using text and data from repositories.

Repository Rave presentation: Implementation of the RIOXX metadata guidelines in the UK’s repositories through a harvesting service / Cancellieri, Matteo; Pontika, Nancy
The COnnecting REpositories (CORE) project aims to aggregate content from open access repositories and journals and distribute this content through one central endpoint, facilitating the open access dissemination of scientific research. In an effort to improve the quality and transparency of the aggregation process and create a two-way collaboration between the CORE project and the providers of this content, CORE has created the Repositories Dashboard. The RIOXX metadata application profile aims to assist repository managers in tracking compliance with the Research Councils UK Policy on Open Access and Guidance. In this Repository Rave session we will present how CORE is implementing the RIOXX metadata in the CORE Dashboard.

Poster: Integration of IRUS-UK statistics in the CORE Repositories Dashboard / Pearce, Samuel; Pontika, Nancy
The COnnecting REpositories (CORE) project aims to aggregate content from open access repositories and journals, and distribute this content through one central point, facilitating the open access dissemination of scientific research. Institutional Repository Usage Statistics UK (IRUS-UK) is a Jisc-funded project that serves as a national repository usage statistics aggregation service, aiming to provide article download statistics from UK repositories. At CORE, we wanted to present the information about manuscript downloads to repository managers and, therefore, we have integrated the IRUS-UK statistics into the CORE Repositories Dashboard. In this poster we will present a) the submission process of the IRUS-UK statistics and b) how CORE retrieves these statistics and displays them to the UK Higher Education Institutions (HEIs).

Developer Track and Ideas Challenge:
Oxford vs Cambridge Contest: Collecting Open Research Evaluation Metrics for University Ranking / Knoth, Petr; Cancellieri, Matteo; Herrmannova, Drahomira
Universities increasingly pay for commercial tools that help them evidence the research impact of their academics. The popular tools of choice, such as Elsevier’s SciVal, Thomson Reuters’ Web of Science and, more recently, Altmetric, cost universities substantial amounts. However, many performance indicators, including bibliometric and usage-based data, can now be freely collected from datasets available on the Web and via APIs. This allows us to acquire both article-level and higher-level performance indicators to evidence impact for a given university based on the papers in its repository. It also makes it possible to compare the research performance of universities based on these metrics. In this demonstration, we will show, using the example of a traditional Oxford University vs Cambridge University contest, how to freely gather and compare the research performance of universities. Using the popular IPython Notebook environment, we will show code snippets and graphs demonstrating the practicality of our approach.
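To give a flavour of the kind of analysis the demonstration covers, here is a minimal sketch (with invented numbers, not real Oxford or Cambridge data) of aggregating freely collected article-level indicators into university-level figures:

```python
# Aggregate article-level indicators into university-level summary figures.
# All numbers are illustrative; in the real contest they would be collected
# from open datasets and APIs.
from statistics import mean

# (citations, downloads) per paper in each university's repository
articles = {
    "Oxford": [(12, 340), (3, 90), (25, 1200)],
    "Cambridge": [(8, 500), (17, 150), (4, 60)],
}

def summarise(records):
    """Summarise article-level indicators for one university."""
    citations = [c for c, _ in records]
    downloads = [d for _, d in records]
    return {
        "papers": len(records),
        "total_citations": sum(citations),
        "mean_citations": mean(citations),
        "mean_downloads": mean(downloads),
    }

comparison = {uni: summarise(recs) for uni, recs in articles.items()}
for uni, summary in comparison.items():
    print(uni, summary)
```

In the actual demonstration these indicators are fetched live from open sources and the comparison is plotted in the notebook.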


Mining Open Access Publications in CORE / Cancellieri, Matteo; Anastasiou, Lucas; Pearce, Samuel; Pontika, Nancy
In this session we will present our brand new API (v2), which allows developers to access and work with our data. We will present the new API and demonstrate, in an interactive live session, the capabilities and functionality that this new tool offers to researchers, developers and text miners.

We are very excited about this opportunity and are looking forward to seeing you at Open Repositories 2016.

Waving goodbye to API v1

Back in March, we announced the beta release of the CORE API v2. This added new features such as searching via DOI and retrieving citations for full texts.

This new API should be more reliable and have a higher quality metadata output compared to the old version.

Over the next few months we aim to finalise the API v2 and finally close access to v1. The scheduled date for the v1 switch-off is Monday, 25th April 2016.

We hope that most users have already had an opportunity to test v2 of the API but if not, we suggest that you check out the documentation here.

For those who are wondering:

  • API keys generated for API v1 will continue to work for v2. If you want to register or have forgotten your key, please fill out this form.
  • The Similarity service provided by our EPrints plugin and JavaScript library will be unaffected and will continue to work after this date.

We hope to make this transition as smooth as possible, so if there is something missing from our API v2 or you need assistance, please get in touch or write in the comments below.

How about them stats?

Every month Samuel Pearce, one of the CORE developers, collects the CORE statistics – perhaps a boring task, but useful for us to know where we stand as a service. A very brief report of the cumulative statistics for all the years that CORE has operated as a project, 2011–2015, follows.
Users can retrieve from CORE:

  • 25,363,829 metadata records and
  • 2,954,141 open access full-text records, 

from 689 repositories (institutional and subject) and 5,488 open access journals. In addition, 122 users have access to the CORE API.

In the playful Christmas spirit we attempted this time to have some fun with the statistics.

Since we harvest outputs in languages other than English, we created a top-20 list of the languages that appear in CORE’s full-text manuscripts.


We also investigated how much the amount of harvested metadata has grown over time.


And the amount of full-text we have in our collection.

The metadata and full-text harvest graphs in XKCD style were created based on this awesome Python notebook by Jake VanderPlas.

You may have noticed that the numbers in the graphs do not exactly match the numbers presented above. There are several reasons for this; for example, during the harvesting process CORE retrieves records with various types of inconsistencies, as well as duplicates, which we do not count in the “official” CORE collection. In addition, the numbers in the graphs include records that have since been deleted or disabled by the source repository. The graphs therefore illustrate what we actually harvest into CORE (what we have in our database), while the records we provide via our search engine have been filtered, so their numbers are slightly lower.

Finally, we calculated how far CORE’s collection would take us if we printed all the full text from our database on A3 pages. We discovered that all this paper would take us a third of the way to the moon.

Our next mission is to collect more full-text, enough to take us to the moon!

Merry Christmas!

*Note: Special thanks to Matteo Cancellieri for creating the images and the graphs.


7 tips for successful harvesting

The CORE (COnnecting REpositories) project aims to aggregate open access research outputs from open repositories and open journals, and make them available for dissemination via its search engine. The project indexes metadata records and harvests the full text of the outputs, provided that they are stored in PDF format and are openly available. Currently CORE hosts around 24 million open access articles from 5,488 open access journals and 679 repositories.

As in any type of partnership, the harvesting process is a two-way relationship, where the content provider and the aggregator need to be able to communicate and have a mutual understanding. For successful harvesting, it is recommended that content providers apply the following best practices (some of the recommendations relate to harvesting in general, while others are CORE-specific):

  1. Platform: For those who haven’t deployed a repository yet, it is highly advisable not to build the repository platform in-house, but to choose one of the industry-standard platforms. The benefit of choosing an existing platform is that it provides frequent software updates, ongoing support and extended repository functionality through plug-ins.
  2. Repository information status: Check that your repository is included in an international repositories list and that its Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) address is up to date. The primary directories for CORE are the Directory of Open Access Repositories, the Registry of Open Access Repositories and the Directory of Open Access Journals.
  3. Robots.txt: This file states whether a web crawler is allowed to access a repository. At CORE, a harvesting attempt often fails from the very beginning because the service is banned in this file. Additionally, it is highly recommended that repositories provide equal access levels, with the same conditions, to all crawlers, without making distinctions between services; for example, a repository should not provide complete access at full speed to commercial services while limiting access to some repository directories or imposing slower crawling speeds on non-commercial services. Currently, 98 UK research and dissertations repositories are listed in the CORE Dashboard, and 10 of these have a rule in the robots.txt file that prohibits CORE, which affects harvesting.

    An example of a robots.txt file that blocks the CORE service.
    An example of two robots.txt files and their rules.
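    For illustration, a robots.txt along the following lines would block a single harvester while leaving the repository open to everyone else (the user-agent name here is made up; it is not CORE’s actual crawler string):

```
# Block one specific crawler from the whole site
User-agent: ExampleHarvester
Disallow: /

# Allow all other crawlers everywhere (an empty Disallow means no restriction)
User-agent: *
Disallow:
```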
  4. Meta-tags: Meta-tags describe a webpage’s content in the code, which makes them extremely useful to machines rather than humans. Crawlers of harvesting services, like CORE, and of search engines, like Google Scholar, expect to find a list of meta-tags in the code of each webpage in order to harvest the content properly. When these tags are missing, crawlers have to guess at the content or omit it completely.
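    As an example of such tags, Google Scholar’s inclusion guidelines describe Highwire Press-style meta-tags like the following (the values here are invented):

```
<meta name="citation_title" content="An Example Paper">
<meta name="citation_author" content="Smith, Jane">
<meta name="citation_publication_date" content="2016/01/15">
<meta name="citation_pdf_url" content="http://repository.example.ac.uk/123/1/paper.pdf">
```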
  5. <dc:identifier> field: The aim of the Dublin Core metadata tags is to ensure the online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it identifies the resource of the harvested output. CORE expects to find in this field the direct URL of the PDF. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF, and success in finding it cannot be guaranteed. This also costs additional server processing time and bandwidth, both for the harvester and for the hosting institution.
    There are three additional points to consider with regard to the <dc:identifier>: a) this field should describe an absolute path to the file, b) it should contain an appropriate file-name extension, for example “.pdf”, and c) the full-text items should be stored under the same repository domain.

    A good metadata tag
    A <dc:identifier> tag including all three requirements
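    Putting the three requirements together, a well-formed <dc:identifier> might look like this (the domain and path are invented): an absolute URL, ending in “.pdf”, hosted under the repository’s own domain:

```
<dc:identifier>http://repository.example.ac.uk/456/1/accepted-version.pdf</dc:identifier>
```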
  6. URL external resolving: When external resolution services, such as Handle.Net® or DOI®, are used, it is important to ensure that the URL produced works properly and links to an actual document rather than a dead page.
  7. Is everything in order?: To ensure that everything is fine with your repository, use monitoring tools to check how your repository looks to the outside world. Such tools include the OAI-PMH validator, which will test your endpoint; Google Webmaster Tools, which will help improve your search rank; and the CORE Dashboard, which provides detailed information on the harvesting process, technical issues and how to improve the exposure of your metadata.
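To tie several of these tips together, here is a minimal sketch of what a harvester does with an OAI-PMH response (a real request would use the standard verb=ListRecords&metadataPrefix=oai_dc query; the sample record and repository URL below are invented):

```python
# Sketch: extract direct PDF links from a (minimal, made-up) OAI-PMH
# ListRecords response, the way an aggregator like CORE would.
import xml.etree.ElementTree as ET

SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
          <dc:identifier>http://repository.example.ac.uk/1234/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def pdf_identifiers(xml_text):
    """Return dc:identifier values that look like direct PDF links."""
    root = ET.fromstring(xml_text)
    dc_id = "{http://purl.org/dc/elements/1.1/}identifier"
    ids = [e.text for e in root.iter(dc_id)]
    return [i for i in ids if i and i.lower().endswith(".pdf")]

print(pdf_identifiers(SAMPLE))
```

When the <dc:identifier> follows the recommendations above, this kind of extraction succeeds immediately; otherwise the harvester has to fall back to crawling for the PDF.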

If we compare web crawling to harvesting, the latter is more helpful to repositories, since it deals with structured XML that contains all the bibliographic information fields. In that way, all this data can be reused for data analysis and data mining purposes.

* I would like to thank the CORE developers, Samuel Pearce, Lucas Anastasiou and Matteo Cancellieri for their technical input in this blog post.

CORE Repositories Dashboard: An infrastructure to increase collaboration of Aggregators with Open Repositories

In an effort to improve the quality and transparency of the harvesting process of open access content and to create a two-way collaboration between the CORE project and the providers of this content, CORE is introducing the Repositories Dashboard. The aim of the Dashboard is to provide an online interface for repository providers and to offer, through this interface, valuable information to content providers about:

  • the content harvested from the repository enabling its management, such as by requesting metadata updates or managing take-down requests,
  • the times and frequency of content harvesting, including all detected technical issues and suggestions for improving the efficiency of harvesting and the quality of metadata, including compliance with existing metadata guidelines,
  • statistics regarding the repository content, such as the distribution of content according to subject fields and types of research outputs, and the comparison of these with the national average.

In the CORE Dashboard there is a designated page for every institution, where repository managers will be able to add all the information that corresponds to their own repository, such as the institution’s logo, the repository name and email address.


The Dashboard allows repository managers to create accounts for other colleagues as well.


With regards to managing the harvested outputs, the Dashboard enables repository managers to add or remove documents’ full-text in the CORE collection without having to contact the CORE team.



These actions can now be completed immediately, simply by clicking on the “Take Down” or “Take Up” buttons. It is also possible to download a CSV file of all the records harvested by CORE from a repository.

CORE can also be notified about metadata changes through the Dashboard. Repository managers can click on the blue “Update Metadata” button and then the single item’s metadata will be updated without the need for a full repository harvest.

In cases where a whole repository needs to be re-harvested, a re-harvesting button can be found on the main page for each repository.


The CORE Dashboard is also able to identify technical issues that may occur while harvesting a repository. We have divided these issues into three categories: Error, Warning and Info, and we provide an explanation for each category on the Dashboard.



We believe that this information will be valuable to repository managers in their communication with the staff who support the repository technically. For example, the errors section may show that CORE cannot harvest a repository because an entry in the robots.txt file prohibits the download. The “Documents issues” section may inform us that CORE was not able to harvest a document because a login was required to access the PDF, or because the metadata record links not to a PDF but to a video file. The last section, “Malformed PDF url”, singles out records whose URL lacks a hyper-text transfer protocol prefix or contains invalid characters.
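As an illustration of the kind of check behind the “Malformed PDF url” category (the exact rules CORE applies are not documented here, so this is only a sketch):

```python
# Sketch: flag URLs that lack an http(s) scheme or contain characters
# that are not valid in a URL. The heuristics are illustrative only.
from urllib.parse import urlsplit

# Characters permitted in a URL (unreserved + reserved + percent sign)
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-._~:/?#[]@!$&'()*+,;=%"
)

def is_well_formed_pdf_url(url):
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # no hyper-text transfer protocol prefix
    if not parts.netloc:
        return False  # no host component
    return all(ch in ALLOWED for ch in url)  # no invalid characters

print(is_well_formed_pdf_url("http://repo.example.ac.uk/1/paper.pdf"))  # True
print(is_well_formed_pdf_url("repo.example.ac.uk/1/paper.pdf"))         # False
print(is_well_formed_pdf_url("http://repo.example.ac.uk/a paper.pdf"))  # False
```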

Currently we are working to include two important functionalities in the Dashboard. The first one is the IRUS-UK statistics, which will give repository managers access to their records’ statistics through CORE. The second one is RIOXX, whereby repository managers will be able to check repository compliance with the RIOXX guidelines.

If you have any questions or recommendations, or you would like to gain access to the CORE Dashboard, email us at dashboard [at] core [ dot ] ac [ dot ] uk.

  • Update, March 2016: We published a paper in the LIBER Quarterly journal providing more information and updates on the CORE Repositories Dashboard.

CORE releases a new API version

We are very proud to announce that CORE has now released CORE API 2.0. The new API offers new opportunities for developers to make use of the CORE open access aggregator in their applications.

The main new features are:

  • Support for looking up articles by a global identifier (DOI, OAI, arXiv, etc.) instead of just the CORE ID.
  • Access to new resource types, repositories and journals, and organisation of API methods according to resource type.
  • Access to the original metadata exactly as it was harvested from the repository of origin.
  • Retrieval of the changes to the metadata as it was harvested by CORE.
  • Retrieval of citations extracted from the full text by CORE.
  • Support for batch requests for searching, recommending, accessing full texts, harvesting history, etc.
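As a short sketch of how a client might put together a request for the new API: the base URL and route below are assumptions for illustration, not official documentation; the authoritative routes and parameters are in the Swagger documentation.

```python
# Build a search request URL for an API in the style of CORE API v2.
# BASE and the /articles/search/ route are illustrative assumptions;
# consult the official API documentation for the real routes.
from urllib.parse import quote, urlencode

BASE = "https://core.ac.uk/api-v2"  # assumed base URL

def search_url(query, api_key, page=1, page_size=10):
    """Return the URL for a paged article search."""
    params = urlencode({"page": page, "pageSize": page_size, "apiKey": api_key})
    return "%s/articles/search/%s?%s" % (BASE, quote(query), params)

url = search_url("open access", "YOUR_API_KEY")
print(url)
# The request itself could then be made with urllib.request.urlopen(url)
# and the JSON response parsed with the json module.
```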

The goals of the new API also include improving scalability, cleaning up and unifying the API responses and making it easier for developers to start working with it.

The API is implemented and documented using Swagger, which has the advantage that anybody can start playing with the API directly from our online client. The documentation of the API v2.0 is available and the API is currently in beta. Those interested in registering for a new API key can do so by completing the online form.

Our previous version, the CORE API v1.0, will not be abandoned yet; it will run alongside the new version. However, the API v1.0 is deprecated and will eventually be replaced by the API v2.0.

What is CORE

The mission of CORE (COnnecting REpositories) is to aggregate all open access research outputs from repositories and journals worldwide and make them available to the public. In this way CORE facilitates free unrestricted access to research for all.


CORE:

  • supports the right of citizens and the general public to access the results of research towards which they contributed by paying taxes,
  • facilitates access to open access content for all by offering services to general public, academic institutions, libraries, software developers, researchers, etc.,
  • provides support to both content consumers and content providers by working with digital libraries, institutional and subject repositories and journals,
  • enriches the research content using state-of-the-art technology and provides access to it through a set of services including search, API and analytical tools,
  • contributes to a cultural change by promoting open access, a fast growing movement.

CORE among the top 10 search engines for research that go beyond Google

Using search engines effectively is now a key skill for researchers, but could more be done to equip young researchers with the tools they need? Here, Dr Neil Jacobs and Rachel Bruce from Jisc’s digital infrastructure team shared their top ten resources for researchers from across the web. CORE was placed among the top 10 search engines that go beyond Google.

More information is available on the Jisc website.

Related content recommendation for EPrints

We have released the first version of a content recommendation package for EPrints, available via the EPrints Bazaar. The functionality is offered through CORE and can be seen, for example, in Open Research Online EPrints or on The European Library portal. I was wondering whether any EPrints repository managers would be interested in getting in touch to test this in their repositories. As the package is available via the EPrints Bazaar, the installation requires just a few clicks. We would be grateful for any suggestions for improvements, and also for information on how this could effectively be provided to DSpace and Fedora repositories.