Best of both worlds

In the two previous blog posts in this series (Finding fulltext and What does Google do?) I’ve described some of the challenges related to harvesting metadata and full text from institutional repositories. I’ve omitted some of the technical issues we’ve encountered (e.g. issues with OAI-PMH Resumption Tokens) as generally we’ve been able to work around these – although I may come back to them at some point in the future. Also worth a read is Nick Sheppard’s post on the UKCORR blog touching on some of these issues.

Given the issues described in the previous posts, CORE faces the question of what to do about harvesting when permissions are unclear, inconsistent and not machine readable (and so cannot be applied purely through software), and when the location of the full-text items CORE wants to harvest is not necessarily given in the harvested metadata.

We would propose that for metadata the answer is simple: harvest it anyway, until explicitly asked not to. This may seem a glib and self-serving answer, but that is not the intention. The arguments for going ahead with the metadata harvest, regardless of the stated policy, are as follows:

These factors mean we assess the risk of any negative consequence (legal or reputational) to any party from our harvesting and use of the metadata as negligible, and we can react to individual cases and ‘takedown’ requests as necessary.

The question of full-text harvesting is more problematic. The copyright in the full text is unlikely to belong to the repository (or the hosting organisation), and the policies expressed by repositories show varied views on what a third party is permitted to do with the full-text content (and for what purposes). The services offered by CORE only work where full text is available and can be harvested and parsed by our software. In some cases we know this is permitted, but the problems of finding the relevant policies, and of understanding their full intentions and implications, remain.

Despite all these issues, web search engines such as Google are, in most cases, able to harvest full-text content from repositories. Rather than relying on understanding published policies, they rely instead on a simple control mechanism common to all websites – the robots.txt file. This gives control to the publisher, and offers a simple way of ensuring content is not crawled where this is not desired for some reason.
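As an illustration of how little is involved (the path and the user-agent name below are hypothetical, not anything CORE has defined), a repository could keep all crawlers out of an administrative area while blocking one named crawler entirely, with just a few lines in its robots.txt:

    # Keep all crawlers out of an administrative area
    User-agent: *
    Disallow: /cgi/admin/

    # Block one named crawler from the whole site (hypothetical name)
    User-agent: ExampleBot
    Disallow: /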

In our original discussions about what CORE should do, we discussed the possibility of proposing a simple method for repositories to tell CORE whether it could harvest content or not. We also felt strongly that CORE (a non-profit service designed to improve the discoverability of open access content for the academic community) should not be at a disadvantage to Google et al when it came to building services on top of Open Repository content. At this point we realised that if we brought these desires together, we had a straightforward answer – CORE should harvest full-text content but respect any restrictions in robots.txt – and so we would be competing on a level playing field with Google, and offering a clear mechanism to repositories to ‘opt-out’ of the process if they wish to do so.

At one point in our investigations I began to wonder whether the whole use of OAI-PMH was actually worth the time and effort – after all, search engines don’t rely on this mechanism, and if we think of repositories as simply web-based resources, why do we need a sector-specific protocol to ‘crawl’ or ‘harvest’ the records and content? I still think this is an important question, but since we have already put significant effort into our harvesting software and processes, and there are certainly advantages to using OAI-PMH (such as easily knowing about changes and deletions without recrawling the whole repository web presence), we are still going to use this for the metadata.
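To show why OAI-PMH is convenient for this, here is a rough sketch of an incremental metadata harvest (the endpoint URL is a placeholder and this is not the CORE harvester itself): the ListRecords verb with a ‘from’ date returns only records changed since that date, resumption tokens page through the results, and deletions are flagged in the record headers.

    # Minimal OAI-PMH incremental harvest sketch (illustrative only).
    # The base URL is hypothetical; a real harvester would also need error
    # handling, rate limiting and more careful treatment of datestamps.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    BASE_URL = "http://repository.example.ac.uk/cgi/oai2"  # hypothetical endpoint

    def list_records(base_url, from_date):
        """Yield <record> elements changed since from_date, following resumption tokens."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": from_date}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI_NS + "record"):
                yield record
            token = tree.find(".//" + OAI_NS + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break  # no more pages to fetch
            # Later requests carry only the verb and the resumption token
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for rec in list_records(BASE_URL, "2012-01-01"):
        header = rec.find(OAI_NS + "header")
        print(header.findtext(OAI_NS + "identifier"),
              "(deleted)" if header.get("status") == "deleted" else "")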

However, once we have the metadata, with a URL that may point either directly to a full-text resource or to a record page linking to one or more full-text resources, we are going to use web crawler technology to try to retrieve the full-text PDFs related to the metadata record. It is likely we will use the Apache Nutch crawler to do this, which brings with it two key advantages:

  • It should help us with the issue of the metadata record not linking directly to the full text – by crawling, say, two levels from the URL given in the metadata record, we can look for PDFs and link them back to the metadata record (a rough sketch of this follows the list)
  • Nutch will adhere to the robots.txt directives automatically, bringing us in line with other web crawlers
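We haven’t built this step yet, and we expect Nutch to do the heavy lifting rather than code like this, but the logic is roughly as follows (the user-agent name and depth limit are illustrative): start from the URL in the harvested metadata record, follow links up to two levels deep, collect anything that looks like a PDF, and check robots.txt before every fetch.

    # Rough sketch of the directed crawl step (illustrative, not our Nutch setup).
    import re
    import urllib.parse
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleHarvester"  # hypothetical name, not CORE's actual user agent

    def allowed_by_robots(url):
        """Check the host's robots.txt before fetching (re-read each time for
        simplicity; a real crawler such as Nutch caches this)."""
        parts = urllib.parse.urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def find_pdfs(start_url, max_depth=2):
        """Return URLs of PDFs reachable within max_depth link hops of start_url."""
        pdfs, seen, frontier = set(), set(), [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop()
            if url in seen or not allowed_by_robots(url):
                continue
            seen.add(url)
            if url.lower().endswith(".pdf"):
                pdfs.add(url)
                continue
            if depth >= max_depth:
                continue
            request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(request) as response:
                html = response.read().decode("utf-8", errors="replace")
            # Naive link extraction; a real crawler parses the HTML properly
            for href in re.findall(r'href="([^"]+)"', html):
                frontier.append((urllib.parse.urljoin(url, href), depth + 1))
        return pdfs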

We feel this approach gives us the best of both worlds – harvesting for the metadata, then directed crawling for the full text. This isn’t a solution I’ve seen elsewhere (although please get in touch, directly or in the comments on this post, if you know of similar implementations). For repository managers it means there is a straightforward mechanism to control access to full-text items, and a single place to limit access to specific items, whether from Google, Bing or CORE.

We are, of course, very interested in getting feedback on this proposed approach, especially from repository managers – so please get in touch and let us know what you think (directly, or in the comments on this post).

One thought on “Best of both worlds”

  1. I think you are right to go ahead with harvesting unless specifically told not to. The IR community can be over-cautious at times.
