What does Google do?

This is the second post in a series about the issues CORE has encountered trying to harvest (and build services on) metadata and fulltext items from UK HE research repositories. The first post “Finding fulltext” looked at the problems of harvesting fulltext due to variations in how links are made (or not) from metadata records to fulltext content.

In this post I want to consider the question of what services like CORE are permitted to do with repository content. A third post will then describe some of the solutions to the various challenges we see.

It may seem obvious that repositories offering the ability to harvest metadata should expect external services to do exactly that, and then make some use of the metadata. However, most UK HE research repositories have policies relating to what other services can do with both the metadata and the content they make available. This differentiation between metadata and fulltext content is deliberate, and repositories are often more permissive about what may be done with metadata than with fulltext content.

My starting point in investigating repository policies was OpenDOAR, a directory of academic open access repositories (http://www.opendoar.org/). OpenDOAR lists 155 ‘institutional’ repositories in the UK (not necessarily limited to HE), of which 125 offer OAI-PMH. I should note here that the data in this post (whether from OpenDOAR or elsewhere) was generally collected and analysed using a combination of ScraperWiki and Google Refine. All figures are based on a snapshot of data taken between 18th February and 18th March 2012, and any errors are of course mine alone. If I’ve made any errors in my interpretation or recording of policies or data please let me know on Twitter (http://twitter.com/ostephens) or by email at owen@ostephens.com (although note that the metadata policies and summaries I’m reporting from OpenDOAR are not mine to correct). A spreadsheet containing the data I refer to in this post is available at https://docs.google.com/spreadsheet/ccc?key=0ArKBuBr9wWc3dFM5Vi1QLWdOR0tjUGdwd1BFZlVqeVE

Looking at the ‘metadata’ policy summaries that OpenDOAR has recorded for these 125 repositories, the majority (57) say “Metadata re-use policy explicitly undefined”, which sometimes seems to mean that OpenDOAR has no record of a metadata re-use policy, and sometimes that OpenDOAR knows the repository has not defined an explicit metadata re-use policy. Of the remaining repositories, for a large proportion (47) OpenDOAR records “Metadata re-use permitted for not-for-profit purposes”, and for a further 18 “Commercial metadata re-use permitted”.

However, although OpenDOAR has made a substantial effort to collect and accurately reflect institutional policies (and indeed, has been behind a push to get repositories to formulate and state policies clearly via its Policies Tool), perhaps inevitably there are both errors and omissions. For example, the OpenDOAR record for Aston University says “Metadata re-use policy explicitly undefined” and notes “Policy not found”. However, if we refer directly to the repository’s response to the OAI-PMH ‘Identify’ request (a way of getting some XML about the OAI-PMH service for a repository) we find that there is a link to a web page which defines the Aston University repository policies. This states:

The metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided: the OAI Identifier or a link to the original metadata record are given; Aston University Research Archive is mentioned

We can see that re-use of the metadata is permitted, even though this was not recorded on OpenDOAR. My suspicion is that one reason this has happened is that Aston University do not include the policy in their XML response to the OAI-PMH ‘Identify’ request, but just link to the policy on another web page. However, in common with all other repositories I have looked at to date, the policy (whether in the XML response, or on a separate page) is designed to be human readable, not machine readable. The guidance on how to handle both record- and repository-level rights statements in OAI-PMH (http://www.openarchives.org/OAI/2.0/guidelines-rights.htm) has some examples where rights are linked to machine readable versions of common licenses, particularly the Creative Commons licenses. However, I have yet to come across a real-world example of this. Although I don’t doubt that there are examples, I think this is the exception rather than the rule.
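As an illustration of the kind of automated check a harvester might attempt, here is a minimal sketch (my own illustration, not CORE’s actual code) that fetches an OAI-PMH Identify response and pulls out any description blocks, which is where a repository might mention or link to its policies. The endpoint URL in the example is hypothetical.

# A minimal sketch of fetching an OAI-PMH Identify response and extracting
# any <description> blocks, which is where a repository might mention or
# link to its policies.
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def identify(base_url):
    # Fetch and parse the Identify response for an OAI-PMH endpoint
    with urllib.request.urlopen(base_url + "?verb=Identify") as response:
        return ET.fromstring(response.read())

def description_blocks(base_url):
    # Return the raw XML of each <description> element; any policy statement
    # or link still has to be picked out of this (human readable) text
    root = identify(base_url)
    return [ET.tostring(d, encoding="unicode")
            for d in root.iter(OAI_NS + "description")]

# Example usage (hypothetical endpoint URL):
# for block in description_blocks("http://eprints.example.ac.uk/cgi/oai2"):
#     print(block)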

This raises significant challenges for services such as CORE which rely on automated harvesting and processing of records – for such services the policies being made available by repositories are hard to make use of, as they don’t provide terms in a form that is easy for the software to extract and ‘understand’.

If we move on to the harvesting of fulltext content, the situation has a lot of similarities to the one I’ve described for metadata, but is unfortunately even less clear. OpenDOAR lists 54 repositories with the policy summary “Full data item policies explicitly undefined”, but after that the next most common policy summary (29 repositories, as recorded by OpenDOAR) is “Rights vary for the re-use of full data items” – more on this in a moment. OpenDOAR records “Re-use of full data items permitted for not-for-profit purposes” for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as “Harvesting full data items by robots prohibited”.

Once again, delving into individual repository policies can throw up conflicts with the policy OpenDOAR has recorded. For example OpenDOAR lists the University of Southampton policy on full data re-use as “Harvesting full data items by robots prohibited”, but querying the University of Southampton repository directly we see (amongst other statements):

Full items may be harvested by robots transiently. Where full items are harvested permanently permission must be sought from the University of Southampton.

As suggested by the presence of several policy summaries stating “Rights vary for the re-use of full data items”, it is quite possible (and I think it would be quite exceptional for it to be otherwise) that rights will vary depending on the fulltext item in question. This can, in theory, be expressed in a repository record and made available to those harvesting the records – for example in a <dc:rights> tag in the associated metadata record. However, this information seems to be absent more often than it is present, and even where the rights data has been entered in the repository, not all repositories output the rights metadata in the records accessible via OAI-PMH (in particular, it seems that EPrints does not output the dc:rights field via OAI-PMH by default).
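For a sense of how a harvester might measure this, here is a rough sketch (again illustrative rather than CORE’s actual code, assuming the oai_dc metadata format and a hypothetical endpoint) that takes the first page of a ListRecords response and counts how many records carry a dc:rights element at all.

# A rough sketch of checking what proportion of harvested records carry a
# dc:rights statement, using the first page of a ListRecords response.
import urllib.request
import xml.etree.ElementTree as ET

OAI_RECORD = "{http://www.openarchives.org/OAI/2.0/}record"
DC_RIGHTS = "{http://purl.org/dc/elements/1.1/}rights"

def rights_coverage(base_url):
    # Returns (records seen, records with at least one dc:rights element)
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    records = list(root.iter(OAI_RECORD))
    with_rights = sum(1 for record in records
                      if record.find(".//" + DC_RIGHTS) is not None)
    return len(records), with_rights

# Example usage (hypothetical endpoint URL):
# print(rights_coverage("http://eprints.example.ac.uk/cgi/oai2"))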

The problem of parsing rights information with software (whether repository-level policy or record-level ‘dc:rights’ information) remains for fulltext content, and if anything policies relating to the use of fulltext content are more nuanced than those for metadata. For example the University of Bristol policy states (amongst other things) that:

Full items must not be harvested by robots except transiently for full-text indexing or citation analysis.

In this case permission is being granted only for very specific types of use. Additionally this policy (and many others) uses several specific terms without defining them in detail. I would argue that definitions of ‘robot’, ‘transiently’, ‘full-text indexing’ and ‘citation analysis’ would be necessary to fully understand the implications of this policy.

So we are faced with conflicting information, spread across multiple services, expressed in a way that our software (and sometimes a human) cannot easily understand!

This brings us to the title of this post, “What does Google do?” (a play on the Jeff Jarvis book “What would Google do?”). Despite the complications I have described around policies relating to fulltext harvesting from repositories, Google (both http://google.com and http://scholar.google.com) and other search engines do include repository content: capturing metadata and fulltext, providing fulltext indexing and caching of pdf documents, and possibly offering other services or information based on machine parsing of content. Some repositories mention indexing by Google or Google Scholar in their promotional material e.g. http://www3.imperial.ac.uk/library/find/spiral/faq and http://www.lib.cam.ac.uk/repository/usecases/.

Given the complexity of the situation, as described above, we thought it was worth looking at what Google did, and how repositories related to Google (and by implication other web search engines). Neither Google nor Google Scholar use OAI-PMH to harvest metadata (or at least, not as far as they are public about what they do); instead they rely on crawling repository web pages, just like any other website, although Google Scholar has some specific provisions to help repositories (and publishers) get accurate metadata into the Google Scholar index.

Starting with the same list of repositories in OpenDOAR as previously, I created a Google search per repository that looked for any PDFs indexed in Google from that site – for example http://www.google.com/#hl=en&q=site:aura.abdn.ac.uk+filetype%3Apdf finds pdfs indexed by Google in the Aberdeen University Research Archive repository. This showed that only 8 out of the 125 repositories had zero pdfs in Google’s index, and while a few had trivial numbers (4 repositories had just a single pdf in the Google index), in general most repositories had some kind of presence in the Google index, with some having substantial numbers of pdfs indexed (e.g. the University of Southampton result for this search gave 13,700). It should be noted that the results Google gives for searches are not always accurate, but I think they give a flavour of the situation, and a zero is usually pretty conclusive.
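For anyone wanting to reproduce this, the searches were simply built from each repository’s web domain. The sketch below is my own illustration (it uses a slightly different but equivalent Google URL form, and a pair of example domains); it also generates the Google Scholar variant used later in this post.

# A sketch of generating the per-repository search URLs: restrict a Google
# query to a repository's domain and to PDF files, plus the Google Scholar
# equivalent used later in this post.
from urllib.parse import quote_plus

def pdf_search_url(domain):
    # Google web search for PDFs hosted on the given repository domain
    return "http://www.google.com/search?q=" + quote_plus(
        "site:" + domain + " filetype:pdf")

def scholar_search_url(domain):
    # Equivalent Google Scholar search for the same domain
    return "http://scholar.google.com/scholar?q=" + quote_plus("site:" + domain)

# Example usage with two of the repositories mentioned in this post
for domain in ["aura.abdn.ac.uk", "rose.bris.ac.uk"]:
    print(pdf_search_url(domain))
    print(scholar_search_url(domain))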

If we look at the results for the ROSE repository at the University of Bristol from the search http://www.google.com/#hl=en&q=site:rose.bris.ac.uk+filetype%3Apdf, we can see there are “About 1,320” results. In each case it seems that the PDF document is cached for display in the results page, and in some cases there is a ‘quickview’ option which displays the cached version of the document in Google Docs.

Similarly, looking at Google Scholar using searches of the form http://scholar.google.com/scholar?q=site:aura.abdn.ac.uk, we can see that only 8 repositories in the list give zero hits. Taking ROSE as an example again, the search http://scholar.google.com/scholar?q=site:rose.bris.ac.uk gives “about 242” results, but unlike with http://www.google.com, in this case the PDF is not obviously cached.

Both Google and Google Scholar offer advice on ensuring they index your site effectively. Google offer a wealth of advice, and tools, to help webmasters get their sites indexed correctly – all available from https://www.google.com/webmasters/tools/home. Google Scholar also offers advice at http://scholar.google.com.au/intl/en/scholar/inclusion.html, which is specifically aimed at the type of content (e.g. research papers) covered by Google Scholar. Of particular note, the Google Scholar guidelines include information on how to include bibliographic metadata for Google Scholar to harvest.
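By way of illustration (this is my own sketch, not anything taken from Google’s guidelines), a harvester or a repository manager could check whether a record page exposes this kind of bibliographic metadata by looking for meta tags whose names start with ‘citation_’, the form we will see on a University of Bristol record page below. The URL in the example is the Bristol record discussed later in this post.

# A sketch of checking whether a repository record page exposes the kind of
# bibliographic <meta> tags Google Scholar can harvest (citation_title,
# citation_authors and so on).
import urllib.request
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    # Collect name/content pairs from <meta> tags named 'citation_*'
    def __init__(self):
        super().__init__()
        self.citations = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.citations[name] = attributes.get("content") or ""

def citation_metadata(url):
    # Fetch a record page and return any citation_* metadata found in it
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.citations

# Example usage (the record page discussed below):
# print(citation_metadata("https://rose.bris.ac.uk/handle/1983/286"))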

For both Google and Google Scholar, it is made clear in the guidelines that the mechanism for marking content the repository does not want indexed is the ‘robots.txt’ file. This is a simple text file that can be placed on your web server, and gives simple instructions to web crawlers/spiders/robots as to what they should and should not index on the site, as well as some other aspects of site indexing behaviour. The robots.txt file is a long-standing convention (see the robots.txt entry on Wikipedia for more information), but is not strictly enforceable – it is up to the robot software crawling the site to read robots.txt and obey it. If an unscrupulous robot chooses to ignore the file, there is nothing inherent in robots.txt to prevent it doing what it wants (there are, of course, other methods of blocking particular pieces of software from accessing your site, but they are not related to robots.txt).
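Checking this programmatically is straightforward, which is part of the appeal of robots.txt compared with the policy pages discussed earlier. A minimal sketch (using Python’s standard robots.txt parser, with a couple of the repository domains from this post as examples) looks like this:

# A minimal sketch of checking, per repository, whether a general-purpose
# crawler is allowed to fetch pages, which is all that robots.txt expresses.
from urllib.robotparser import RobotFileParser

def crawl_allowed(domain, path="/", user_agent="*"):
    # Read http://<domain>/robots.txt and report whether the given path may
    # be crawled by the given user agent (a missing file means 'allowed')
    parser = RobotFileParser()
    parser.set_url("http://" + domain + "/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, "http://" + domain + path)

# Example usage with two repositories discussed in this post
for domain in ["bradscholars.brad.ac.uk", "rose.bris.ac.uk"]:
    print(domain, crawl_allowed(domain))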

Looking at the robots.txt files across all the repositories in the list, we can see the majority are not doing anything specific to block Google etc. from indexing their sites. There are a few exceptions, which account for some of the ‘zero’ result sets described above. For example the “Bradford Scholars” portal at the University of Bradford blocks access in robots.txt (http://bradscholars.brad.ac.uk/robots.txt) as follows:

User-agent: *
Disallow: /
Disallow: /browse
# Uncomment the following line only if sitemaps.org or HTML sitemaps are used
#Disallow: /browse-title

Sites blocking access are the exception rather than the rule.

If we go back to the ROSE repository at Bristol, and look at a specific record such as https://rose.bris.ac.uk/handle/1983/286, we can see in the html the following:

<meta content="Champneys, AR; Kuznetsov, YA; Sandstede, B" name="citation_authors" />
<meta content="homoclinic orbit; numerical analysis; continuation; bifurcation; Preprint" name="citation_keywords" />
<meta content="http://rose.bris.ac.uk/handle/1983/286" name="citation_abstract_html_url" />
<meta content="2006-01-31T17:34:08Z" name="citation_date" />
<meta content="A numerical toolbox for homoclinic bifurcation analysis" name="citation_title" />
<meta content="en" name="citation_language" />

This is the citation_* markup for bibliographic metadata described in the Google Scholar inclusion guidelines mentioned above – so it seems that Bristol is making some effort to appear in Google Scholar results. We can also look at the ROSE robots.txt file:

====
The contents of this file are subject to the license and copyright
detailed in the LICENSE and NOTICE files at the root of the source
tree and available online at

http://www.dspace.org/license/
====
User-agent: *

# Uncomment the following line ONLY if sitemaps.org or HTML sitemaps are used
# and you have verified that your site is being indexed correctly.
# Disallow: /browse

# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content:
# Disallow: /advanced-search
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# Disallow: /search

This basically means that robots.txt is saying any web crawler is free to access any part of the site, although as this looks like a default DSpace configuration, it is possible that this is not a deliberate choice on the part of the repository.

What do we conclude from all this? Google, Google Scholar, and other web search engines do not rely on repository-specific mechanisms to index repository content, and do not take any notice of repository policies (there are certainly examples where the fulltext policy listed on OpenDOAR explicitly says harvesting by robots is not allowed but the robots.txt file is permissive, although I have not yet tracked down a completely clear-cut example of this when looking at policies hosted directly on a repository website).

By adopting a clear, unambiguous mechanism for allowing a content owner to say whether the search engine can crawl the site, with an ‘opt-out’ rather than ‘opt-in’ approach (if robots.txt is not present, web crawlers will assume permission to crawl the website), internet search engines have made it possible to build indexes of large amounts of content on the web. It clearly would not be possible for Google to operate in its current form if it asked permission to index content each time its rights to do so were unclear. In the main, web content owners, including repository owners, accept this as a quid pro quo of sorts in order to make their content more discoverable. Ignoring the question of how Google and others profit from this (and whether this counts as non-commercial activity), the point for the repositories is to make their content discoverable.

What does this mean for CORE (and any similar services)? I bet you can’t wait for the next blog post in this series to find out!