Releasing a new CORE Discovery browser extension

CORE Discovery helps users find freely accessible copies of research papers that might be behind a paywall on the publisher’s website. It is backed by our huge dataset of millions of full text open access papers as well as content from widely used external services beyond CORE. The tool not only provides state-of-the-art coverage of freely available content, it is the only discovery service which:

  • delivers state-of-the-art performance compared to other discovery tools in terms of both content coverage (finding a freely available copy when it is available) and precision (reliably delivering a free copy of the paper on success);
  • is run by researchers for researchers (as opposed to companies);
  • has the best grip on content from the global network of open repositories;
  • can deliver to readers other relevant freely available research papers even in situations where a freely available version is not available from anywhere on the web.

To satisfy the needs of its users, CORE, the world’s largest aggregator of open access research papers, now helps them access the articles they are interested in. Discovery tools can typically find free copies of papers for about 15%-30% of published documents (slide 11). This means that in roughly 70% or more of cases, they bring nothing useful to the user. CORE Discovery can offer the user relevant documents even in situations where other discovery tools are not successful. What distinguishes CORE Discovery from other discovery services on the market is that it does not stop when an open access version is not available, but always aims to offer related open access articles to the end user. read more...

CORE releases a new website version

A couple of days ago we released a new version of our website and if you visit our main page it now looks slightly different.

Image: blickpixel @ pixabay https://pixabay.com/en/lego-legomaennchen-males-workers-568039/

One of our aims was to showcase the CORE testimonials more clearly, i.e. what others think of the project and how the community uses our products, mainly our API and Datasets. In an effort to give credit to the universities and companies that are using our services, such as our read more...

Analysing ORCID coverage across repositories through CORE

* This post was authored by Matteo Cancellieri, Petr Knoth and Nancy Pontika.

Last month, CORE attended the JISC ORCID hackday events in Birmingham and London. (ORCID is a non-profit organisation that aims to solve the author disambiguation problem by offering unique author identifiers.) Following the discussions sparked at the two events, we decided to test the CORE data against ORCID’s API, and we discovered some information that we think is of interest to the scholarly community.

Currently, CORE has data for 5.5 million unique Digital Object Identifiers (DOIs) linked to records in our database (both metadata-only and full-text). Based on this number, we wanted to find out how many of these DOIs were connected to an ORCID ID. We therefore set up a script that called the ORCID API while respecting its rate limit. After around 7 days we had collected the full results.
From the 5,523,577 articles with a DOI in the CORE collection, we discovered that 196,713 different authors had an ORCID ID, and that 927,645 articles included at least one ORCID ID.
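
A rate-limited lookup loop of the kind described above might look like the following sketch. It is an illustration only, not the script used for the study: `lookup` is a hypothetical callable that would wrap whatever ORCID API request is appropriate, and the `max_per_second` value is an assumption to be replaced with the limit the API actually publishes.

```python
import time
from typing import Callable, Dict, Iterable, List


def lookup_dois(
    dois: Iterable[str],
    lookup: Callable[[str], List[str]],
    max_per_second: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, List[str]]:
    """Call `lookup` once per DOI, never faster than `max_per_second`.

    `lookup` is a hypothetical function returning the ORCID IDs found
    for one DOI; `sleep` and `clock` are injectable so the pacing logic
    can be exercised without waiting or touching the network.
    """
    min_interval = 1.0 / max_per_second
    results: Dict[str, List[str]] = {}
    last_call = None
    for doi in dois:
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        results[doi] = lookup(doi)
    return results
```

Pacing the calls on the client side keeps a multi-day run within the limit without relying on the server to reject excess requests.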

map
Credits to Aristotelis Charalampous for helping us generate this really cool map visualisation

We found that 16% of the DOIs in CORE are connected to at least one author registered in ORCID. The map shows the distribution of the ORCID IDs discovered across the world.
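
The share quoted here can be recomputed from the two counts given earlier in the post. The snippet below is only a worked calculation; the helper name `share` is ours, and the result comes out slightly under 17%, in line with the roughly 16% reported.

```python
def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts taken from the post.
articles_with_doi = 5_523_577
articles_with_orcid = 927_645

coverage = share(articles_with_orcid, articles_with_doi)
print(f"{coverage:.1f}% of DOIs have at least one ORCID ID")
```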

Why is this useful? It enables us to assess ORCID’s coverage across a large multidisciplinary dataset of Open Access papers. With some more digging in the data (we haven’t done this yet), it would also be possible to analyse the growth of ORCID over time. These data can be sliced and diced according to various criteria, such as geographical coverage or repository, to understand how ORCID coverage can be improved.

Based on our results, the UK has the largest number of ORCID IDs. However, this result is somewhat skewed by the fact that CORE has excellent content coverage across UK repositories.

We also tried to find authors with ORCID IDs who deposited content in one of the UK repositories. Our results indicate that 68,849 ORCID IDs were discovered across 254,467 unique DOIs. It was then very useful to look at the distribution of the top 15 UK repositories by number of ORCID IDs. This analysis can be extremely helpful in identifying repositories with low ORCID coverage and encouraging them to take appropriate action.
orcid_by_repo

 

Repositories implementing RIOXX already have the option to expose ORCID IDs through an attribute in the rioxxterms:author tag. Although this option exists, our quick survey showed that only a few repositories supporting RIOXX have implemented it. Thanks to John Salter, Software Developer at Leeds University, for his help in collecting the data and creating the chart. John is currently working on including the ORCID IDs in the White Rose repository, which “forced” us 🙂 to use a log scale in the chart due to the widespread implementation of the ID attribute in their metadata (more than 6k ORCID IDs vs fewer than 10 IDs from the other repositories).

rioxx_vs_orcid_api
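
To make the RIOXX option above more concrete, here is a minimal sketch of how a harvester could read an ORCID ID from the author element of a record. The namespace URI, the `rioxxterms` prefix and the `id` attribute name are assumptions based on our reading of the RIOXX v2.0 profile and should be checked against it; the sample record is invented, and the ORCID ID in it is ORCID’s own fictitious test researcher.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Assumed namespace of the RIOXX v2.0 terms; verify against the profile.
RIOXXTERMS_NS = "http://www.rioxx.net/schema/v2.0/rioxxterms/"

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns:rioxxterms="http://www.rioxx.net/schema/v2.0/rioxxterms/">
  <rioxxterms:author id="https://orcid.org/0000-0002-1825-0097">Carberry, Josiah</rioxxterms:author>
  <rioxxterms:author>Doe, Jane</rioxxterms:author>
</record>"""


def authors_with_ids(record_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (author name, id attribute or None) for each author element."""
    root = ET.fromstring(record_xml)
    authors = []
    for element in root.iter(f"{{{RIOXXTERMS_NS}}}author"):
        name = (element.text or "").strip()
        authors.append((name, element.get("id")))
    return authors


def id_coverage(record_xml: str) -> float:
    """Fraction of author elements in the record that carry an id."""
    authors = authors_with_ids(record_xml)
    if not authors:
        return 0.0
    return sum(1 for _, author_id in authors if author_id) / len(authors)
```

Running `id_coverage` over every harvested record of a repository would give the kind of per-repository figure plotted in the chart.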

Dataset

We have made the dataset available online on GitHub and it can be found here.
There are a few caveats in the data that must be taken into consideration. Our main challenge was that some of the aggregated DOIs were not valid or pointed to a journal instead of a paper. The ORCID API returned only a partial match and, in the case of the journals’ DOIs, this meant that the results covered the ORCID IDs of all the authors of a journal instead of those of one specific paper.

What next?

In this preliminary study we realised that the information we extracted from the data was useful to us and, perhaps, could be useful to repository managers. Our plan is to design and implement a new functionality in the CORE Repositories Dashboard. We are planning to submit a proposal for this to OR2017 and we would really appreciate your feedback. If you are a repository manager and you want to know more, email us at dashboard@core.ac.uk.

Waving goodbye to API v1

Back in March, we announced the beta release of the CORE API v2. This added new features such as searching via DOI and retrieving citations for full texts.

This new API should be more reliable and produce higher-quality metadata output than the old version.
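
As a rough illustration of searching via DOI, the sketch below only builds a request URL and sends nothing. Everything specific in it is an assumption rather than the documented interface: the base URL is a placeholder, and the `/articles/search/` path, the `apiKey` parameter and the `doi:"..."` query syntax should all be checked against the API v2 documentation.

```python
from urllib.parse import quote, urlencode


def build_doi_search_url(base_url: str, doi: str, api_key: str) -> str:
    """Build a search URL for one DOI (path and parameters are assumed)."""
    query = f'doi:"{doi}"'
    # Percent-encode the whole query, including "/" inside the DOI,
    # so that it stays a single path segment.
    encoded_query = quote(query, safe="")
    params = urlencode({"apiKey": api_key})
    return f"{base_url.rstrip('/')}/articles/search/{encoded_query}?{params}"


# Placeholder host and key, for illustration only.
example_url = build_doi_search_url(
    "https://api.example.org/v2", "10.1000/xyz123", "YOUR_API_KEY"
)
```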

Over the next few months, we aim to finalise the API v2 and then close access to v1. The v1 switch-off is scheduled for Monday, 25th April 2016.

We hope that most users have already had an opportunity to test v2 of the API but if not, we suggest that you check out the documentation here.

For those who are wondering:

  • API keys generated for API v1 will continue to work for v2. If you want to register or have forgotten your key, please fill out this form.
  • The Similarity service provided by our EPrints plugin and JavaScript library will be unaffected and will continue to work after this date.

We hope to make this transition as smooth as possible, so if there is something missing from our API v2 or you need assistance, please get in touch or write in the comments below.

7 tips for successful harvesting

The CORE (COnnecting REpositories) project aims to aggregate open access research outputs from open repositories and open journals, and make them available for dissemination via its search engine. The project indexes metadata records and harvests the full text of the outputs, provided that they are stored in PDF format and are openly available. Currently CORE hosts around 24 million open access articles from 5,488 open access journals and 679 repositories.

As in any type of partnership, the harvesting process is a two-way relationship, where the content provider and the aggregator need to be able to communicate and have a mutual understanding. For successful harvesting, it is recommended that content providers apply the following best practices (some of the recommendations relate to harvesting in general, while others are CORE specific):

  • Platform: For those who haven’t deployed a repository yet, it is highly advised not to build the repository platform in house but to choose one of the industry-standard platforms. The benefit of choosing an existing platform is that it provides frequent content updates and constant support, and that its functionality can be extended through plug-ins.
  • Repository information status: Check that your repository is included in an international repositories list and that the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) address is up to date. The primary directories for CORE are the Directory of Open Access Repositories, the Registry of Open Access Repositories and the Directory of Open Access Journals.
  • Robots.txt: This file states whether a web crawler is allowed to access a repository. At CORE, a harvesting attempt often fails from the very beginning because the service is banned in this file. Additionally, it is highly recommended that repositories give all crawlers the same level of access under the same conditions, without making distinctions between services. A distinction would be, for example, providing complete access at full speed to commercial services while limiting access to some repository directories or imposing a slower crawling speed on non-commercial services. Currently 98 UK research and dissertations repositories are listed in the CORE Dashboard and 10 of these have a prohibiting rule for CORE in the robots.txt file, which affects harvesting.
    An example of a robots.txt file that blocks the CORE service.
    An example of two robots.txt files and their rules.
  • Meta-tags: The meta-tags describe a webpage’s content in the code, which makes them extremely useful to machines rather than humans. Crawlers of harvesting services, like CORE, and search engines, like Google Scholar, expect to find a list of meta-tags in the code of each webpage in order to be able to harvest the content properly. When the meta-tags are missing, crawlers have to guess what the content is or omit it completely.
  • <dc:identifier> field: The aim of the Dublin Core Metadata tags is to ensure online interoperability of metadata standards. The <dc:identifier> tag matters because it identifies the resource of the harvested output. CORE expects to find the direct URL of the PDF in this field. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF and the success of finding it cannot be guaranteed. This also causes additional server processing time and bandwidth use for both the harvester and the hosting institution.
    There are also three additional points that need to be considered with regard to the <dc:identifier>: a) this field should contain an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf”, and c) the full-text items should be stored under the same repository domain.

    A good metadata tag
    A <dc:identifier> tag including all three requirements
  • URL external resolving: When external resolution services, such as Handle.Net® or doi®, are used, it is important to ensure that the URL produced works properly and links to an actual document and not a dead page.
  • Is everything in order?: To ensure that everything is alright with your repository, use monitoring tools to check how your repository looks to the outside world. Such tools include the OAI-PMH validator, which will test your endpoint; the Google Webmaster Tools, which will help improve your search rank; and the CORE Dashboard, which provides detailed information on the harvesting process, technical issues and how to improve the exposure of your metadata.
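
For the robots.txt tip above, a repository manager can check how a given crawler would be treated with Python’s standard `urllib.robotparser` module. The sketch below is an illustration under stated assumptions: the two robots.txt texts are invented, and the user-agent token `CORE` is hypothetical, so substitute the token the crawler actually announces.

```python
from urllib.robotparser import RobotFileParser

# Invented example: bans one crawler (hypothetical token "CORE"),
# while every other crawler gets full access.
BLOCKS_CORE = """\
User-agent: CORE
Disallow: /

User-agent: *
Disallow:
"""

# Invented example: the same rules apply to every crawler.
EQUAL_ACCESS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


pdf_url = "https://repository.example.org/files/123/thesis.pdf"
for name, rules in (("blocking", BLOCKS_CORE), ("equal access", EQUAL_ACCESS)):
    print(name, is_allowed(rules, "CORE", pdf_url))
```

Checking the same URL for several user agents quickly reveals whether the file treats services differently.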
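
The three <dc:identifier> requirements listed above (absolute path, a file name extension such as “.pdf”, same repository domain) can be turned into a simple check. This is a minimal sketch, not CORE’s validation logic; the function name, the messages and the example URLs are ours.

```python
from typing import List
from urllib.parse import urlparse


def check_identifier(identifier: str, repository_host: str) -> List[str]:
    """Return the list of requirements that `identifier` fails to meet."""
    problems = []
    parsed = urlparse(identifier.strip())

    # a) the field should hold an absolute URL, not a relative path
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("not an absolute URL")

    # b) the URL should end with an appropriate file name extension
    if not parsed.path.lower().endswith(".pdf"):
        problems.append("no .pdf file name extension")

    # c) the file should live under the repository's own domain
    host = (parsed.hostname or "").lower()
    repo = repository_host.lower()
    if host != repo and not host.endswith("." + repo):
        problems.append("not hosted under the repository domain")

    return problems
```

For example, `check_identifier("https://repo.example.org/files/123/thesis.pdf", "repo.example.org")` returns an empty list, while a resolver link pointing at another host is reported as failing requirements b) and c).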

    CORE Repositories Dashboard: An infrastructure to increase collaboration of Aggregators with Open Repositories

    In an effort to improve the quality and transparency of the harvesting of open access content and to create a two-way collaboration between the CORE project and the providers of this content, CORE is introducing the Repositories Dashboard. The Dashboard is an online interface that offers content providers valuable information about:

    • the content harvested from the repository enabling its management, such as by requesting metadata updates or managing take-down requests,
    • the times and frequency of content harvesting, including all detected technical issues and suggestions for improving the efficiency of harvesting and the quality of metadata, including compliance with existing metadata guidelines,
    • statistics regarding the repository content, such as the distribution of content according to subject fields and types of research outputs, and the comparison of these with the national average.

    In the CORE Dashboard there is a designated page for every institution, where repository managers will be able to add all the information that corresponds to their own repository, such as the institution’s logo, the repository name and email address.

    The Dashboard allows repository managers to create accounts for other colleagues as well.

    With regard to managing the harvested outputs, the Dashboard enables repository managers to add or remove documents’ full text in the CORE collection without having to contact the CORE team.

    These actions can now be completed immediately, simply by clicking on the “Take Down” or “Take up” buttons. It is also possible to download a CSV file of all the records harvested by CORE from a repository.

    CORE can also be notified about metadata changes through the Dashboard. Repository managers can click on the blue “Update Metadata” button and then the single item’s metadata will be updated without the need for a full repository harvest. read more...

    Yes, we can! – The CORE team organises a workshop at JCDL 2012 in Washington, DC

    KMi and the European Library/Europeana jointly organised the 1st International Workshop on Mining Scientific Publications, held in association with JCDL 2012 – the most prestigious conference in the world of digital libraries. The workshop was attended by major players in the field including the National Library of Medicine, Library of Congress, CiteSeerX, Elsevier and British Library. Although Barack in the end didn’t come, the workshop was very successful, the only problem being the lack of chairs in the room. We (the workshop organisers – Petr Knoth, KMi; Zdenek Zdrahal, KMi and Andreas Juffinger, The European Library/Europeana) were motivated by the community’s positive response to the importance of the issues researchers face when mining research publications to improve the way research is carried out and evaluated.

    A paper authored by Drahomira (aka Dasha) Herrmannova and Petr Knoth (both KMi), entitled ‘Visual search for supporting content exploration in large document collections’ and presented by Dasha during the workshop, received encouraging feedback. Another KMi talk was given by Petr, who discussed the issues in current digital library aggregation systems, especially those focusing on Open Access, and explained the advantages offered by the CORE system in his presentation titled “COnnecting REpositories (CORE): Aggregating and Enriching Content to Support Open Access.”

    All papers presented at the workshop are available on the workshop page.