CORE Ambassador: David Walters

David is the Open Access Officer at Brunel University London, based within the Scholarly Communication & Rights Management team. He is an advocate of OA publishing and of building services that realise the movement within local institutional communities. David has spoken about this topic at UKSG, NASIG, RLUK and Altmetric conferences in recent years. David is an ambassador for the CORE service.

Q: What does Open Access mean to you?
A: To us at Brunel, Open Access means many things – ideologically and practically. Most importantly, we consider Open Access to research output a critical, underpinning component on the journey toward an ‘Open Science’ world. Open Science encompasses many areas, aiming to enhance scientific and educational sectors.
As with many institutions, at Brunel we operate local OA services for our community, within an ever-growing landscape of technological and policy drivers. Open Access means creating an environment that supports policy drivers, while taking advantage of new technologies for our community as they emerge.
Much progress is being driven by these factors. However, it is as important to foster discussion and leadership amongst research communities. Open Access means researchers and students shaping and leading their subjects into new forms of science communication and practice.
At Brunel our role in supporting Open Access is to:
– Engage and inform our community about these issues as they evolve
– Build and tailor services to our community’s needs
– Recognise and celebrate ‘open’ activity by our researchers in all its forms

CORE Ambassador: Milica Sevkusic

Milica has been a librarian at the Institute of Technical Sciences of the Serbian Academy of Sciences and Arts since 2007. Her educational background is in art history, and her previous work experience includes heritage policies and documentation standards, heritage-related civil society projects and digitisation, traditional librarianship and bibliography. Currently, her professional interests focus on Open Science, library services aimed at supporting research activities, training on academic services and tools, information literacy and research ethics. Since November 2014, she has been serving as the EIFL Open Access country coordinator in Serbia. In this capacity, she designed and coordinated the project Revisiting open access journal policies and practices in Serbia, which was implemented with EIFL’s support in 2016–2017. She has also been involved with institutional repositories since 2013, when her affiliated institution implemented the first fully functional institutional repository in Serbia. She is now a member of the Repository Development Team at the University of Belgrade Computer Centre, which is currently the leading force in repository development in Serbia.

Increasing the Speed of Harvesting with On Demand Resource Dumps

 

I am currently working with Martin Klein, Matteo Cancellieri and Herbert Van de Sompel on a project funded by the European Open Science Cloud Pilot that aims to test and benchmark ResourceSync against OAI-PMH in a range of scenarios. The objective is to perform a quantitative evaluation that could then be used as evidence to convince data providers to adopt ResourceSync. During this work, we have encountered a problem related to the scalability of ResourceSync and developed a solution to it in the form of an On Demand Resource Dump. The aim of this blog post is to explain the problem, how we arrived at the solution and how the solution works.

The problem

One of the scenarios we have been exploring deals with a situation where the resources to be synchronised are metadata files of a small data size (typically from a few bytes to several kilobytes). Coincidentally, this scenario is very common for metadata in repositories of academic manuscripts, research data (e.g. descriptions of images), cultural heritage, etc.

The problem stems from the fact that while most OAI-PMH implementations typically deliver 100–1,000 records per HTTP request, ResourceSync is designed in a way that requires resolving each resource individually. We have identified, and confirmed by testing, that for repositories with large numbers of metadata items this can have a very significant impact on harvesting performance, as the overhead of an HTTP request is considerable compared to the size of a metadata record.
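
To see why the per-record overhead matters, here is a back-of-the-envelope model; the timings, record count and batch size below are illustrative assumptions, not our measurements:

    # Illustrative model of harvesting time (all numbers are assumptions).
    RECORDS = 100_000
    REQUEST_OVERHEAD_S = 0.05      # assumed per-request latency (connection, headers, ...)
    TRANSFER_S_PER_RECORD = 0.002  # assumed transfer time for one small metadata record

    # Batched harvesting (OAI-PMH style): e.g. 500 records per request.
    BATCH = 500
    batched = (RECORDS / BATCH) * REQUEST_OVERHEAD_S + RECORDS * TRANSFER_S_PER_RECORD

    # Per-resource harvesting (baseline ResourceSync, sequential requests).
    per_record = RECORDS * (REQUEST_OVERHEAD_S + TRANSFER_S_PER_RECORD)

    print(f"batched:    {batched:6.0f} s")     # ~210 s
    print(f"per-record: {per_record:6.0f} s")  # ~5200 s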

More specifically, we have run tests over a sample of 357 repositories. The results show that while the speed of OAI-PMH harvesting ranges from 30 to 520 metadata records per second, depending largely on the repository platform, the speed of ResourceSync harvesting is only around 4 metadata records per second for the same content, using existing ResourceSync client/server implementations and a sequential downloading strategy. We are preparing a paper on this, so I am not going to disclose the exact details of the analysis at this stage.

As ResourceSync has been created to overcome many of the problems of OAI-PMH, such as:

  • being too flexible in terms of support for incremental harvesting, resulting in inconsistent implementations of this feature across data providers,
  • some of its implementations being unstable and less suitable for exchanging large quantities of metadata and
  • being only designed for metadata transfer, omitting the much needed support for content exchange

it is important that ResourceSync performs well under all common scenarios, including the one we are dealing with.

Can Resource Dumps be the solution?

An obvious option for solving the problem, and one already offered by ResourceSync, is Resource Dumps. While a Resource Dump can speed up harvesting to levels far exceeding those of OAI-PMH, it creates considerable extra complexity on the side of the server. The key problem is that the data must be periodically packaged as a Resource Dump, which basically means running a batch process to produce a compressed (zip) file containing the resources.

The number of Resource Dumps a source needs to maintain is equal to the number of Capability Lists it maintains times the size of the Resource Dump Index. The minimum practical operational size of a Resource Dump Index is 2; this ensures we don’t remove a dump that is currently being downloaded by a client while a new dump is being created. As we have observed that a typical repository may contain about 250 OAI-PMH sets (Capability Lists in ResourceSync terminology), this means maintaining around 500 dumps, implying significant data duplication and a requirement to periodically create Resource Dumps if a source chooses to use them as part of the harvesting process.

On Demand Resource Dumps

To deal with the problem, we suggest an extension of ResourceSync that supports the concept of an On Demand Resource Dump. An On Demand Resource Dump is a Resource Dump which is created, as the name suggests, whenever a client asks for it. More specifically, a client can scan through the list of resources presented in a Resource List or a Change List (without resolving them individually) and request that the source package any set of those resources as a Resource Dump. This approach speeds up harvesting and saves processing on the side of both the source and the client. Our initial tests show that it enables ResourceSync to perform as well as OAI-PMH in the metadata-only harvesting scenario when requests are sent sequentially (the most extreme scenario for ResourceSync). Moreover, since ResourceSync requests can be parallelised, unlike OAI-PMH (due to its reliance on the resumption token), this makes ResourceSync a clear winner.

In the rest of this post, I will explain how this works and how it could be integrated with the ResourceSync specification.

There are basically 3 steps:

  1. defining that the server supports an On Demand Resource Dump,
  2. sending a POST request to the On Demand dump endpoint and
  3. receiving a response from the server that fully conforms to the Resource Dump specification.

I will first introduce steps 2 and 3 and then I will come back to step 1.

Step 2: sending a POST request to the On Demand dump endpoint

We have defined an endpoint at https://core.ac.uk/datadump. You can POST to it a list of resource identifiers (which can be discovered in a Resource List). In the example below, I am using curl to send it a list of resource identifiers, in JSON, which I want to get resolved. Obviously, the approach is not limited to JSON; it can be used for any resource listed in a Resource List regardless of its type. Try it by executing the code below in your terminal.

curl -d '["https://core.ac.uk/api-v2/articles/get/42138752","https://core.ac.uk/api-v2/articles/get/32050"]' -H "Content-Type: application/json" https://core.ac.uk/datadump -X POST > on-demand-resource-dump.zip
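
The same request can also be made from a script. Below is a minimal sketch in Python using the requests library, assuming, as in the curl example above, that the endpoint accepts a JSON array of identifiers and returns a zip archive:

    # Minimal sketch: request an On Demand Resource Dump and save it to disk.
    # Assumes the endpoint accepts a JSON array of identifiers and returns a zip.
    import requests

    identifiers = [
        "https://core.ac.uk/api-v2/articles/get/42138752",
        "https://core.ac.uk/api-v2/articles/get/32050",
    ]

    response = requests.post("https://core.ac.uk/datadump", json=identifiers)
    response.raise_for_status()

    with open("on-demand-resource-dump.zip", "wb") as dump:
        dump.write(response.content)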

CORE visits Ethiopia and participates in an Open Science training session


In June 2017, EIFL invited the global open access full text aggregator CORE to take part in an Open Science train-the-trainer course for universities and research institutions in EIFL partner countries.

Watch the videos recorded during the workshop and read more

Solomon Mekonnen – Open Access Ethiopia 

Zaituni Kaijage – Open Access Tanzania

Dr Roshan Karn – Open Access Nepal

Dr Manisha Dhakal – Open Access Nepal

Simon Osei – Open Access Ghana

Gloria Kadyamatimba – Open Access Zimbabwe

It was a great experience travelling to Addis Ababa, and a big thank you to the workshop host, the Library of the University of Addis Ababa (Mesfin Gezehagn, Solomon Mekonnen and Girma Aweke), for their hospitality. It was also great to meet the trainers participating in the workshop, from Ghana (Lucy Adjoa Dzandu, Simon Kwame Osei, Benjamin Yao Folitse), Nepal (Dr Manisha Dhakal and Dr Roshan Kumar Karn), Tanzania (Zaituni Kokujona Kaijage, Paul Samwel Muneja, Bwire Wilson Bwire) and Zimbabwe (Gloria Kadyamatimba).

 

Implementing the CORE Recommender in Strathprints: a “whitehat” improvement to promote user interaction

by George Macgregor, Institutional Repository Coordinator, University of Strathclyde

This guest blog post briefly reviews why the CORE Recommender was quickly adopted on Strathprints and how it has become a central part of our quest to improve the interactive qualities of repositories.

Back in October 2016 my colleagues in the CORE Team released their Recommender plugin. The CORE Recommender plugin can be installed on repositories and journal systems to recommend similar scholarly content. On this very blog, Nancy Pontika, Lucas Anastasiou and Petr Knoth announced the release of the Recommender as a:

…great opportunity to improve the functionality of repositories by unleashing the power of recommendation over a huge collection of open-access documents, currently 37 million metadata records and more than 4 million full-text, available in CORE*.
(* Note from CORE Team: the up-to-date numbers are 80,097,014 metadata and 8,586,179 full-text records.).

When the CORE Recommender is deployed, a repository user viewing an article or abstract page within the repository will be presented with recommendations for other related research outputs, all mined from CORE. The Recommender sends data about the item the user is visiting to CORE. Such data include any identifiers and, where possible, accompanying metadata. CORE’s response then delivers its content recommendations, and a list of suggested related outputs is presented to the user in the repository user interface. The algorithm used to compute these recommendations is described in the original CORE Recommender blog post, but it is ultimately based on content-based filtering, citation graph analysis and analysis of the semantic relatedness between the articles in the CORE aggregation. It is therefore unlike most standard recommender engines and is an innovative application of open science in repositories.

Needless to say, we were among the first institutions to proudly implement the CORE Recommender on our EPrints repository. The implementation was on Strathprints, the University of Strathclyde’s institutional repository, and was rolled out as part of some wider work to improve repository visibility and web impact. The detail of this other work can be found in a poster presented at the 2017 Repository Fringe Conference and in a recent blog post.

CORE listed Number 1 in the list of top 21 free online journal and research databases

Image from the Scribendi website, 101 Free Online Journal and Research Databases for Academics.

An online editing and proofreading company, Scribendi, has recently put together a list of the top 21 freely available online databases. It is a pleasure to see CORE listed as the Number 1 resource in this list. CORE has been included thanks to its large volume of open access, free-of-cost content, offering 66 million bibliographic metadata records and 5 million full-text research outputs. Our content originates from open access journals and repositories, both institutional and disciplinary, and can be accessed via our search engine and API.

CORE’s open access and text mining services – 2016 growth (or, how about them stats – 2016 edition)

The past year has been productive for the CORE team; the number of harvested repositories and the volume of our open access content, both metadata and full-text, have increased massively. (You can see last year’s blog post with our 2015 achievements in numbers here.)

There was also progress with regard to our services: the number of our API users almost doubled in 2016, we now have about 200 registered CORE Dashboard users, and this past October we released a new version of our recommender and updated our dataset.

Around this time of the year, the joyful Christmas spirit of the CORE team increases along with our numbers. Thus, we decided to recalculate how far the CORE research outputs would stretch towards the moon if we printed them all (last year we made it a third of the way).

We are thrilled to see that this year we got CORE even closer to the moon! We would also like to thank all our data providers, who have helped us reach this goal.

Fear not: we will never actually print all our research outputs; we believe their mission is to be discoverable on the web as open access. Plus, we love trees.

Merry Christmas from the CORE Team!

* Note: Special thanks to Matteo Cancellieri for creating the CORE graphics.

 

CORE released a new Dataset

We are pleased to announce that we have released a new version of our dataset, which contains the data aggregated by CORE in a downloadable file.

It is intended for (possibly computationally intensive) data analysis. Here you can find the dataset description and the download page. If you need fresh data and your requirements are not computationally intensive, you can also use our API.

CORE Recommender

This post was authored by Nancy Pontika, Lucas Anastasiou and Petr Knoth.

The CORE team is thrilled to announce the release of a new version of our recommender; a plugin that can be installed in repositories and journal systems to suggest similar articles. This is a great opportunity to improve the functionality of repositories by unleashing the power of recommendation over a huge collection of open-access documents, currently 37 million metadata records and more than 4 million full-text, available in CORE.

Recommender systems and the CORE Plug-In

Typically, a recommender tracks a user’s preferences when browsing a website and then filters the user’s choices, suggesting similar or related items. For example, if I am looking for computer components at Amazon, the service might send me emails suggesting various computer components. Amazon is one of the pioneers of recommenders in industry, being one of the first adopters of item-item collaborative filtering (a method first introduced in 2001 by Sarwar et al. in a highly influential paper of modern computer science).

Over the years, many recommendation methods and their variations have been proposed and evaluated by both academia and industry. From a user’s perspective, recommenders are either personalised (recommendations targeted at a particular user, based on knowledge of the user’s preferences or past activity) or non-personalised (recommending the same items to every user).

From a technological perspective, there are two important classes of recommender systems: collaborative filtering and content-based filtering.

1. Collaborative filtering (CF):

Techniques in this category try to predict a user’s behaviour towards an item according to what other users have done in the past. They start by analysing a large amount of user interactions, ratings, visits and other sources of behaviour, and then build a model from these. The model then predicts a user’s behaviour according to what similar users – neighbour users – have done in the past (user-based collaborative filtering).

The basic assumption of CF is that a user might like an unseen item, if it is liked by other users similar to him/her. In a production system, the recommender output can then be described as, for example, ‘people similar to you also liked these items.’
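
To make the idea concrete, here is a minimal user-based collaborative filtering sketch; the ratings matrix, the cosine similarity measure and the neighbourhood size are illustrative choices, not a description of any production system:

    # Minimal user-based collaborative filtering sketch (illustrative only).
    import numpy as np

    # Rows = users, columns = items; 0 means "not rated / not seen".
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
    ], dtype=float)

    def cosine(u, v):
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / norm) if norm else 0.0

    def predict(user, item, k=2):
        """Predict a rating from the k most similar users who rated the item."""
        sims = [(cosine(ratings[user], ratings[other]), other)
                for other in range(len(ratings))
                if other != user and ratings[other, item] > 0]
        neighbours = sorted(sims, reverse=True)[:k]
        if not neighbours:
            return 0.0  # a "cold item": nobody has rated it yet
        total = sum(s for s, _ in neighbours)
        return sum(s * ratings[o, item] for s, o in neighbours) / total

    # User 0's closest neighbour disliked item 2, so the prediction is low (~1.7).
    print(predict(user=0, item=2))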

These techniques are now widely used and have proven extremely effective at supporting exploratory browsing, and hence boost sales. However, in order to work effectively, they need to build a sufficiently fine-grained model providing specific recommendations and, thus, they require a large amount of user-generated data. One of the consequences of an insufficient amount of data is that CF cannot recommend items that no user has acted upon yet, the so-called cold items. Therefore, the strategy of many recommender systems is to expose these items to users in some way, for example by blending them discreetly into a home page, or by applying content-based filtering to them, thereby decreasing the number of cold items in the database.

While CF can achieve state-of-the-art recommendation quality, it requires some sort of user profile to produce recommendations. It is therefore more challenging to apply on websites that do not require a user sign-on, such as CORE.

2. Content-based filtering (CBF)

CBF attempts to find related items based on attributes (features) of each item. These attributes could be, for example, the item’s name, description, dimensions, price, location, and so on.

For example, if you are looking at a TV in an online store, the store can recommend other TVs that are close in price and screen size, are a similar (or the same) brand, are also high-definition, and so on. The advantage of content-based recommendations is that they do not suffer from the cold-start problem described above. A further advantage is that content-based filtering can easily be used for both personalised and non-personalised recommendations.

The CORE recommendation system

There is a plethora of recommenders out there serving a broad range of purposes. At CORE, a service that provides access to millions of research articles, we need to support users in finding articles relevant to what they read. As a result, we have developed the CORE Recommender. It is deployed within the CORE system to suggest documents relevant to the one currently being visited.

In addition, we also have a recommender plugin that can be installed and integrated into a repository system, for example EPrints. When a repository user views an article page within the repository, the plugin sends information about the visited item to CORE. This can include the item’s identifier and, when possible, its metadata. CORE then replies to the repository system with a list of suggested articles for further reading, which is embedded in the page. These suggestions are generated by the CORE recommendation algorithm.

How does the CORE recommender algorithm work?

Because the CORE corpus is a large database of documents that mainly consist of text, we apply content-based filtering to produce the list of suggested items. In order to discover the semantic relatedness between the articles in our collection, we represent the content in a vector space, i.e. we transform the content into a set of term vectors and find similar documents by finding similar vectors.
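
As a rough illustration of this vector space approach (and not of the actual CORE implementation), here is a minimal content-based similarity sketch using TF-IDF term vectors and cosine similarity from scikit-learn; the documents and identifiers are made up:

    # Minimal content-based similarity sketch (illustrative, not CORE's production code).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = {
        "doc1": "Open access aggregation of research papers and metadata harvesting.",
        "doc2": "Harvesting repository metadata with OAI-PMH and ResourceSync.",
        "doc3": "Deep learning methods for image classification.",
    }
    ids = list(documents)

    # Represent each document as a TF-IDF term vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform([documents[i] for i in ids])

    # Pairwise cosine similarity between all term vectors.
    similarity = cosine_similarity(tfidf)

    # Recommend the documents most similar to doc1 (excluding itself).
    query = ids.index("doc1")
    ranked = sorted(((ids[j], similarity[query, j]) for j in range(len(ids)) if j != query),
                    key=lambda pair: pair[1], reverse=True)
    print(ranked)  # doc2 ranks above doc3, as it shares harvesting/metadata terms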

The CORE Recommender is deployed in various locations, such as the CORE Portal and various institutional repositories and journals. From these places, the recommender algorithm receives information as input, such as the identifier, title, authors, abstract, year, source URL, etc. In addition, we try to enrich these attributes with additional available data, such as citation counts, the number of downloads, whether the full text is available in CORE, and other related information. All of these form the set of features used to find the closest documents in the CORE corpus.

Of course, not every attribute is equally important. In our internal ranking algorithm we boost some attributes positively or negatively, which means that we weight some fields more or less heavily to achieve better recommendations. In the case of the year attribute, we go even further and apply a decay function over it: recent articles, or articles published a couple of years ago, get the same boost (offset), while we reduce the importance of older articles by 50% every N years (half-life). In this way recent articles retain their importance, while older articles contribute less to the recommendation results.

Decay function applied to the year attribute
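
A minimal sketch of how such a year decay could be implemented is shown below; the offset and half-life values are illustrative assumptions, not CORE’s actual parameters:

    # Illustrative year-decay weighting (parameter values are assumptions).
    from datetime import date

    def year_weight(year, offset=2, half_life=5, current_year=None):
        """Articles newer than `offset` years keep full weight; older articles
        lose half their remaining weight every `half_life` years."""
        current_year = current_year or date.today().year
        age_beyond_offset = max(0, current_year - year - offset)
        return 0.5 ** (age_beyond_offset / half_life)

    for y in (2016, 2014, 2009, 2004):
        print(y, year_weight(y, current_year=2016))
    # 2016 -> 1.0, 2014 -> 1.0, 2009 -> 0.5, 2004 -> 0.25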

Someone may ask:

How do you know which weight to put on each field you are using? How did you come up with the parameters used in the decay function?

7 tips for successful harvesting

The CORE (COnnecting REpositories) project aims to aggregate open access research outputs from open repositories and open journals, and make them available for dissemination via its search engine. The project indexes metadata records and harvests the full text of the outputs, provided that they are stored in PDF format and are openly available. Currently CORE hosts around 24 million open access articles from 5,488 open access journals and 679 repositories.

Like any type of partnership, the harvesting process is a two-way relationship, where the content provider and the aggregator need to be able to communicate and have a mutual understanding. For successful harvesting, it is recommended that content providers apply the following best practices (some of these recommendations relate to harvesting in general, while others are CORE specific):

  • Platform: For those who haven’t deployed a repository yet, it is highly advisable that the repository platform is not built in-house, but that one of the industry-standard platforms is chosen instead. The benefit of choosing an existing platform is frequent updates, constant support and extended repository functionality through plug-ins.
  • Repository information status: Check that your repository is included in an international repository directory and that its Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) address is up to date. The primary directories for CORE are the Directory of Open Access Repositories, the Registry of Open Access Repositories and the Directory of Open Access Journals.
  • Robots.txt: This file states whether a web crawler is allowed to access a repository. At CORE, a harvesting attempt often fails from the very beginning because the service is banned in this file. Additionally, it is highly recommended that repositories provide equal access levels, under the same conditions, to all crawlers, without making distinctions between services; for example, they should not provide complete access at full speed to commercial services while limiting access to some repository directories or imposing a slower crawling speed on non-commercial services. Currently 98 UK research and dissertation repositories are listed in the CORE Dashboard, and 10 of these have a prohibiting rule for CORE in their robots.txt file, which affects harvesting (a simple robots.txt check is sketched after this list).
    An example of robots.txt rules, including a rule that blocks the CORE service.
  • Meta-tags: Meta-tags describe a webpage’s content in its code, which makes them extremely useful to machines rather than humans. Crawlers of harvesting services, like CORE, and of search engines, like Google Scholar, expect to find a list of meta-tags in the code of each webpage in order to harvest the content properly. Failure to provide them results in crawlers making guesses about the content or omitting it completely.
  • <dc:identifier> field: The aim of the Dublin Core metadata tags is to ensure online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it identifies the resource of the harvested output. CORE expects to find the direct URL of the PDF in this field. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF, and success in finding it cannot be guaranteed. This also causes additional server processing time and bandwidth costs, both for the harvester and the hosting institution.
    There are also three additional points that need to be considered with regard to the <dc:identifier>: a) this field should describe an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf”, and c) the full-text items should be stored under the same repository domain (an identifier check is sketched after this list).

    A <dc:identifier> tag meeting all three requirements.
  • URL external resolving: When external resolution services, such as Handle.Net® or DOI®, are used, it is important to ensure that the URL produced works properly and links to an actual document rather than a dead page.
  • Is everything in order?: To ensure that everything is in order with your repository, use monitoring tools to check how your repository looks to the outside world. Such tools include the OAI-PMH validator, which will test your endpoint; Google Webmaster Tools, which will help improve your search ranking; and the CORE Dashboard, which provides detailed information on the harvesting process, technical issues and how to improve the exposure of your metadata.
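
As mentioned in the Robots.txt and <dc:identifier> tips above, here is a minimal sketch, in Python, of two of these checks; the user agent string and the example URLs are illustrative assumptions:

    # Illustrative checks for two of the tips above (values are assumptions).
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    def crawler_allowed(repository_base_url, user_agent="CORE"):
        """Check whether robots.txt allows the given user agent to crawl the
        repository (performs a network request to fetch robots.txt)."""
        robots = RobotFileParser(repository_base_url.rstrip("/") + "/robots.txt")
        robots.read()
        return robots.can_fetch(user_agent, repository_base_url)

    def identifier_ok(dc_identifier, repository_domain):
        """Check a <dc:identifier> value against the three requirements above:
        an absolute URL, a .pdf file name extension and the repository's own domain."""
        parsed = urlparse(dc_identifier)
        return (
            parsed.scheme in ("http", "https")        # absolute path
            and parsed.path.lower().endswith(".pdf")  # file name extension
            and parsed.netloc == repository_domain    # same repository domain
        )

    print(identifier_ok("https://repo.example.ac.uk/1234/1/article.pdf",
                        "repo.example.ac.uk"))  # True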