CORE Discovery helps users find freely accessible copies of research papers that might be behind a paywall on the publisher’s website. It is backed by our huge dataset of millions of full text open access papers as well as content from widely used external services beyond CORE. The tool not only provides state-of-the-art coverage of freely available content, it is the only discovery service which:
delivers state-of-the-art performance compared to other discovery tools in terms of both content coverage (finding a freely available copy when it is available) and precision (reliably delivering a free copy of the paper on success);
is run by researchers for researchers (as opposed to companies);
has the best grip on content from the global network of open repositories;
can deliver to readers other relevant freely available research papers even in situations where a freely available version is not available from anywhere on the web.
To satisfy the needs of its users, CORE, the world’s largest aggregator of open access research papers, now helps users access the articles they are interested in. Discovery tools typically find free copies for only about 15%-30% of published documents (slide 11). This means that in at least 70% of cases they bring nothing useful to the user. CORE Discovery can offer the user relevant documents even where other discovery tools fail. What distinguishes CORE Discovery from other discovery services on the market is that it does not stop when an open access version is not available, but always aims to offer related open access articles to the end user.
* This post was authored by Matteo Cancellieri, Petr Knoth and Nancy Pontika.
Last month, CORE attended the JISC ORCID hackday events in Birmingham and London. (ORCID is a non-profit organisation that aims to solve the author disambiguation problem by offering unique author identifiers.) Following the discussions sparked at the two events, we decided to test the CORE data against ORCID’s API, and we discovered some information that we think is of interest to the scholarly community.
Currently, CORE has data for 5.5 million unique Digital Object Identifiers (DOIs) linked to records in our database (both metadata-only and full-text). Based on this number, we wanted to find out how many of these DOIs were connected to an ORCID id. We therefore set up a script that called the ORCID API while respecting its rate limit. After around 7 days we had collected the full results.
From the 5,523,577 articles with a DOI that existed in the CORE collection, we discovered that 196,713 different authors had an ORCID id, and 927,645 articles included at least one ORCID id.
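The script itself is not reproduced in this post. As a rough illustration of the idea, a minimal sketch of a throttled lookup loop might look like the following. The `lookup` callable stands in for whatever function queries the ORCID API, and `max_per_second` is a hypothetical parameter; neither reflects ORCID's real interface or limits.

```python
import time


def throttled_lookup(dois, lookup, max_per_second, clock=time.monotonic, sleep=time.sleep):
    """Yield (doi, result) pairs, never starting more than max_per_second lookups per second.

    `lookup` is a placeholder for the real API call (an assumption, not ORCID's client).
    `clock` and `sleep` are injectable so the pacing can be tested without waiting.
    """
    min_interval = 1.0 / max_per_second
    last_call = None
    for doi in dois:
        if last_call is not None:
            # Wait out the remainder of the interval since the previous call started.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        yield doi, lookup(doi)
```

Pacing the calls on the client side keeps a long-running job (ours took about a week) within the provider's limits instead of relying on error responses to slow it down.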
We found that roughly 16.8% of the DOIs in CORE (927,645 of 5,523,577) are connected to at least one author registered in ORCID. The following map shows the distribution of the ORCID ids discovered across the world.
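As a quick check, the share can be recomputed from the two counts reported above. This is a small sketch and the variable names are illustrative only.

```python
# Counts taken from the figures reported in this post.
dois_in_core = 5_523_577
dois_with_orcid = 927_645

# Share of CORE DOIs linked to at least one ORCID-registered author.
share = dois_with_orcid / dois_in_core
print(f"{share:.1%}")  # prints 16.8%
```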
Why is this useful? It enables us to assess ORCID’s coverage across a large multidisciplinary dataset of Open Access papers. With some more digging in the data (we haven’t done this yet), it would also be possible to analyse the growth of ORCID over time. These data can be sliced and diced according to various criteria, such as geographical coverage or repository, to understand how ORCID coverage can be improved.
Based on our results, the UK has the largest number of ORCID IDs. However, this result is somewhat skewed by the fact that CORE has excellent content coverage across UK repositories.
We also tried to find authors with ORCID IDs who deposited content in one of the UK repositories. Our result indicates that 68,849 ORCIDs were discovered from 254,467 unique DOIs. It was then very useful to look at the distribution of the top 15 repositories based on ORCID IDs across the UK repositories. This analysis can be extremely helpful in identifying repositories with low ORCID coverage and encouraging them to take appropriate action.
Repositories implementing RIOXX can already expose ORCID IDs through an attribute in the rioxxterms:author tag. While this opportunity exists, our quick survey showed that only a few repositories supporting RIOXX have implemented it. Thanks to John Salter, Software Developer at Leeds University, for his help in collecting the data and creating the chart. John is currently working on including the ORCID IDs in the White Rose repository, which “forced” us 🙂 to use a log-scale in the chart due to the widespread implementation of the ID attribute in their metadata (>6k ORCID IDs vs less than 10 IDs from the other repositories).
We have made the dataset available online on GitHub and it can be found here.
There are a few caveats in the data that must be taken into consideration. Our main challenge was that some of the aggregated DOIs were invalid or pointed to a journal instead of a paper. The ORCID API matched such DOIs only partially, and in the case of journal DOIs this meant that the results covered all the authors of the journal rather than those of one specific paper.
In this preliminary study we realised that the information we extracted from the data was useful to us and, perhaps, could be useful to repository managers. Our plan is to design and implement new functionality in the CORE Repositories Dashboard. We are planning to submit a proposal for this to OR2017 and we would really appreciate your feedback. If you are a repository manager and you want to know more, email us at firstname.lastname@example.org .
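To illustrate what a harvester can do with that attribute, here is a minimal sketch that reads author names and their `id` values from a RIOXX-style fragment. The namespace URI, the wrapping `record` element and the sample authors are assumptions made for the example, not data from our survey.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the rioxxterms prefix; check it against the profile you use.
RIOXXTERMS_NS = "http://www.rioxx.net/schema/v2.0/rioxxterms/"

# Made-up sample record: one author with an ORCID in the id attribute, one without.
SAMPLE = f"""<record xmlns:rioxxterms="{RIOXXTERMS_NS}">
  <rioxxterms:author id="http://orcid.org/0000-0002-1825-0097">Carberry, Josiah</rioxxterms:author>
  <rioxxterms:author>Doe, Jane</rioxxterms:author>
</record>"""


def authors_with_ids(xml_text):
    """Return (author name, id attribute or None) for every author element."""
    root = ET.fromstring(xml_text)
    tag = f"{{{RIOXXTERMS_NS}}}author"
    return [((el.text or "").strip(), el.get("id")) for el in root.iter(tag)]


for name, orcid in authors_with_ids(SAMPLE):
    print(name, "->", orcid or "no ORCID exposed")
```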
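Some of the invalid DOIs could be filtered out before querying the API. The sketch below is only a syntactic heuristic of our own (the pattern and the prefixes it strips are assumptions), and it cannot tell a journal-level DOI from an article-level one.

```python
import re

# Heuristic shape check: "10.", a 4-9 digit registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is wrapped in harvested metadata (an illustrative, incomplete list).
PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")


def normalise_doi(raw):
    """Return the bare DOI if it looks syntactically plausible, otherwise None."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi if DOI_PATTERN.match(doi) else None


print(normalise_doi("doi:10.1000/xyz123"))
print(normalise_doi("not-a-doi"))
```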
* Post updated on June 20th and June 23rd with links to presentations.
At this year’s Open Repositories 2016, an international conference for the scholarly communications community with a focus on repositories, open access, open data and open science, CORE had 6 items accepted: 1 Paper, 1 Workshop, 1 Repository Rave presentation, 1 Poster and 2 showcases in the Developer Track and Ideas Challenge. The titles and summaries of our accepted proposals are:
Paper: Exploring Semantometrics: full text-based research evaluation for open repositories / Knoth, Petr; Herrmannova, Drahomira
Over recent years, there has been a growing interest in developing new scientometric measures that could go beyond the traditional citation-based bibliometric measures. This interest is motivated on one side by the wider availability or even emergence of new information evidencing research performance, such as article downloads, views and Twitter mentions, and on the other side by the continued frustrations and problems surrounding the application of citation-based metrics to evaluate research performance in practice. Semantometrics are a new class of research evaluation metrics which build on the premise that full text is needed to assess the value of a publication. This talk will present the results of an investigation into the properties of the semantometric contribution measure (Knoth & Herrmannova, 2014). We will provide a comparative evaluation of the contribution measure with traditional bibliometric measures based on citation counting. Our analysis also focuses on the potential application of semantometric measures in large databases of research papers.
Workshop: Mining Repositories: How to assist the research and academic community in their text and data mining needs – a workshop / Pontika, Nancy; Knoth, Petr; van Dijke, Hege; Anastasiou, Lucas
Over the past five years there has been significant interest in text and data mining (TDM) practices from the European Union (EU). In scholarly communication, TDM is already a developed practice in some scientific fields, for example in the life sciences and computer science. Nonetheless, after a call that we sent out to the United Kingdom Council of Research Repositories (UKCoRR) listserv, we discovered that only a limited number of TDM projects used the repositories’ collections as their primary source of information. To address this challenge, the EU-funded project OpenMinTeD looks to enable the creation of an infrastructure that fosters and facilitates the use of TDM technologies in the scientific publications field, targeting both domain users and TDM experts. In this context we propose a three-hour workshop, where we will introduce the topic of TDM to the repositories community, explore how the OpenMinTeD project aims to assist with the adoption of TDM practices and present existing TDM projects that were conducted using text and data from repositories.
Repository Rave presentation: Implementation of the RIOXX metadata guidelines in the UK’s repositories through a harvesting service / Cancellieri, Matteo; Pontika, Nancy
The COnnecting REpositories (CORE) project aims to aggregate content from open access repositories and journals and distribute this content in one central end point facilitating the open access dissemination of the scientific research. In an effort to improve the quality and transparency of the aggregation process of the open access content and create a two-way collaboration between the CORE project and the providers of this content, CORE has created the Repositories Dashboard. The RIOXX Metadata application profile aims to assist repository managers in tracking compliance with the Research Councils UK Policy on Open Access and Guidance. In this Repositories Rave session we will present how CORE is implementing the RIOXX metadata in the CORE Dashboard.
Poster: Integration of IRUS-UK statistics in the CORE Repositories Dashboard / Pearce, Samuel; Pontika, Nancy
The COnnecting REpositories (CORE) project aims to aggregate content from open access repositories and journals, and distribute this content in one central point facilitating the open access dissemination of the scientific research. Institutional Repository Usage Statistics UK (IRUS-UK) is a Jisc-funded project that serves as a national repository usage statistics aggregation service, which aims to provide article download statistics from UK repositories. At CORE, we wanted to present the information regarding the manuscripts’ downloads to repository managers and therefore, we have integrated these statistics into the CORE Repositories Dashboard. In this poster we will present a) the submission process of the IRUS-UK statistics and b) how CORE retrieves these statistics and displays them to the UK Higher Education Institutions (HEIs).
The CORE (COnnecting REpositories) project aims to aggregate open access research outputs from open repositories and open journals, and make them available for dissemination via its search engine. The project indexes metadata records and harvests the full-text of the outputs, provided that they are stored in a PDF format and are openly available. Currently CORE hosts around 24 million open access articles from 5,488 open access journals and 679 repositories.
As in any type of partnership, the harvesting process is a two-way relationship, where the content provider and the aggregator need to be able to communicate and have a mutual understanding. For successful harvesting it is recommended that content providers apply the following best practices (some of the recommendations relate generally to harvesting, while some are CORE specific):
Platform: For those who haven’t deployed a repository yet, it is highly advised that the repository platform is not built in house, but that one of the industry standard platforms is chosen. The benefits of choosing one of the existing platforms are that they provide frequent content updates and constant support, and extend repository functionality through plug-ins.
Robots.txt: This file states whether a web crawler is allowed to access a repository. At CORE, a harvesting attempt often fails from the very beginning because the service is banned in this file. It is also highly recommended that repositories give all crawlers the same level of access under the same conditions, without distinguishing between services. An example of what to avoid is giving commercial services complete access at full speed while restricting non-commercial services to some repository directories or imposing a slower crawling speed on them. Currently 98 UK research and dissertations repositories are listed in the CORE Dashboard, and 10 of these have a rule in the robots.txt file that prohibits CORE, which affects harvesting.
Meta-tags: The meta-tags describe a webpage’s content in the code, which makes them extremely useful to machines rather than humans. Crawlers of harvesting services, like CORE, and search engines, like Google Scholar, expect to find a list of meta-tags in the code of each webpage in order to be able to harvest the content properly. When the meta-tags are missing, crawlers have to guess about the content or omit it completely.
<dc:identifier> field: The aim of the Dublin Core Metadata tags is to ensure online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it identifies the resource of the harvested output. CORE expects to find the direct URL of the PDF in this field. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF and the success of finding it cannot be guaranteed. This also causes additional server processing time and bandwidth both for the harvester and the hosting institution.
There are also three additional points that need to be considered with regard to the <dc:identifier>: a) this field should contain an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf” and c) the full-text items should be stored under the same repository domain.
URL external resolving: When external resolution services, such as Handle.Net® or doi®, are used it is important to ensure that the URL produced works properly and links to an actual document and not a dead page.
Is everything in order?: To ensure that everything is alright with your repository, use monitoring tools to check how your repository looks to the outside world. Such tools are the OAI-PMH validator, which will test your endpoint, the Google Webmaster Tools, which will help improve your search rank and the CORE Dashboard, which provides detailed information on the harvesting process, technical issues and how to improve the exposure of your metadata.
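Repository managers can test how their robots.txt treats different crawlers with Python's standard `urllib.robotparser`. The sketch below uses a made-up robots.txt, a hypothetical repository URL and assumes the crawler identifies itself as `CORE`; the real user-agent string should be checked before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that bans one crawler outright and restricts everyone else lightly.
ROBOTS_TXT = """\
User-agent: CORE
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


pdf_url = "https://repository.example.org/123/1/paper.pdf"  # hypothetical address
for agent in ("CORE", "Googlebot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, pdf_url))
```

With this file the first crawler is refused everywhere while the second can reach the PDF, which is exactly the kind of unequal treatment described above.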
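To see what a crawler can read from a page, the meta-tags can be listed with Python's standard `html.parser`. This is a minimal sketch: the page below is invented, and the `citation_*` names follow the convention used by Google Scholar rather than anything specific to CORE.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content of every <meta name="..." content="..."> tag, grouped by name."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content:
            self.tags.setdefault(name, []).append(content)


# Invented landing page for a repository item.
PAGE = """<html><head>
<meta name="citation_title" content="An Example Paper">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
<meta name="citation_pdf_url" content="https://repository.example.org/123/1/paper.pdf">
</head><body></body></html>"""

collector = MetaTagCollector()
collector.feed(PAGE)
print(collector.tags)
```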
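The three points can be turned into a simple self-check. The following sketch is our own illustration of such a check, not the logic CORE's harvester uses; the function name and the example host are hypothetical.

```python
from urllib.parse import urlparse


def check_identifier(identifier, repository_host):
    """Return a list of problems found in a <dc:identifier> value (empty if none)."""
    parsed = urlparse(identifier)
    problems = []
    # a) the value should be an absolute URL, and c) it should live on the repository's domain
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("not an absolute URL")
    elif parsed.hostname != repository_host.lower():
        problems.append("not hosted on the repository domain")
    # b) the path should end with an appropriate file name extension
    if not parsed.path.lower().endswith(".pdf"):
        problems.append("no .pdf file name extension")
    return problems


host = "repository.example.org"  # hypothetical repository domain
print(check_identifier("https://repository.example.org/1234/1/paper.pdf", host))
print(check_identifier("/1234/1/paper.pdf", host))
print(check_identifier("https://cdn.example.com/paper", host))
```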
In an effort to improve the quality and transparency of the harvesting process of the open access content and create a two-way collaboration between the CORE project and the providers of this content, CORE is introducing the Repositories Dashboard. The aim of the Dashboard is to provide an online interface that offers content providers valuable information about:
the content harvested from the repository enabling its management, such as by requesting metadata updates or managing take-down requests,
the times and frequency of content harvesting, including all detected technical issues and suggestions for improving the efficiency of harvesting and the quality of metadata, including compliance with existing metadata guidelines,
statistics regarding the repository content, such as the distribution of content according to subject fields and types of research outputs, and the comparison of these with the national average.
In the CORE Dashboard there is a designated page for every institution, where repository managers will be able to add all the information that corresponds to their own repository, such as the institution’s logo, the repository name and email address.
The Dashboard allows repository managers to create accounts for other colleagues as well.
With regards to managing the harvested outputs, the Dashboard enables repository managers to add or remove documents’ full-text in the CORE collection without having to contact the CORE team.
These actions can now be completed immediately, simply by clicking on the “Take Down” or “Take up” buttons. It is also possible to download a CSV file of all the records harvested by CORE from a repository.
CORE can also be notified about metadata changes through the Dashboard. Repository managers can click on the blue “Update Metadata” button and then the single item’s metadata will be updated without the need for a full repository harvest.
KMi and the European Library/Europeana jointly organised the 1st International Workshop on Mining Scientific Publications associated with JCDL 2012 – the most prestigious conference in the world of digital libraries. The workshop was attended by major players in the field including the National Library of Medicine, Library of Congress, CiteSeerX, Elsevier and British Library. Although Barack in the end didn’t come, the workshop was very successful, the only problem being the lack of chairs in the room. We (the workshop organisers – Petr Knoth, KMi; Zdenek Zdrahal, KMi and Andreas Juffinger, The European Library/Europeana) were motivated by the community’s positive response to the importance of the issues researchers face when mining research publications to improve the way research is carried out and evaluated.
A paper authored by Drahomira (aka Dasha) Herrmannova and Petr Knoth (both KMi) entitled “Visual search for supporting content exploration in large document collections”, presented by Dasha during the workshop, received encouraging feedback. Another KMi talk was given by Petr, who discussed the issues in current digital library aggregation systems, especially those focusing on Open Access, and explained the advantages offered by the CORE system in his presentation titled “COnnecting REpositories (CORE): Aggregating and Enriching Content to Support Open Access.”
All papers presented at the workshop are available on the workshop page.