Finding fulltext

To provide its search functions, similarity measures and other functionality, CORE harvests both metadata and fulltext items from repositories. This raises questions about whether we are allowed to harvest metadata or fulltext items, and if so, what we are allowed to do with them once we have harvested them. In the first phase of CORE we relied on OAI-PMH to harvest metadata, and then used links from the harvested records to try to discover the related fulltext item.

This is the first in a series of blog posts looking at these issues, the problems we’ve encountered and the solutions we have put in place (so far). In this post I’m going to focus on the question of finding fulltext items from the metadata. This wasn’t always straightforward. Not all repositories link to fulltext items from the metadata in the same way, and in many cases the metadata contains no direct link to the fulltext item at all, only a link to the repository’s web page for the record.

This (edited for brevity) example from the University of Cambridge (which uses the DSpace software) has a link in <dc:identifier>, which links to the html page describing the item. To get the fulltext, you then need to find the link to the pdf on that page and click through.
<record>
  <header>
    <identifier>oai:www.dspace.cam.ac.uk:1810/221924</identifier>
  </header>
  <metadata>
    <oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <dc:title>Reading Lists in Cambridge: A Standard System?</dc:title>
      <dc:creator>Jones, Huw</dc:creator>
      <dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/221924</dc:identifier>
      <dc:relation>1/4</dc:relation>
    </oai_dc:dc>
  </metadata>
</record>
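
As a rough illustration of that click-through step, the following Python sketch scans a landing page for links whose path ends in .pdf. It is not CORE’s actual implementation: the names `PdfLinkFinder` and `find_pdf_links`, the sample HTML and the file name `example.pdf` are all invented for illustration, and real repository pages may mark up their download links quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the landing page URL.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return every PDF-looking link found in the given HTML string."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    return finder.pdf_links


# Hypothetical landing-page fragment, invented for illustration only.
SAMPLE_PAGE = """
<html><body>
  <a href="/handle/1810/221924?show=full">Show full item record</a>
  <a href="/bitstream/1810/221924/1/example.pdf">View/Open</a>
</body></html>
"""

if __name__ == "__main__":
    landing_page = "http://www.dspace.cam.ac.uk/handle/1810/221924"
    for link in find_pdf_links(SAMPLE_PAGE, landing_page):
        print(link)
```

Matching on the file extension is only a heuristic: a fulltext link that does not end in .pdf would be missed, and a page may link to PDFs that are not the fulltext of the item.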

This (again edited) example from the University of Southampton, by contrast, links directly to the pdf from <dc:identifier>, and links to the html page for the item using <dc:relation>:
<record>
  <header>
    <identifier>oai:eprints.soton.ac.uk:66183</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>A methodology for developing high damping materials with application to noise reduction of railway track</dc:title>
      <dc:creator>Ahmad, Nazirah</dc:creator>
      <dc:format>application/pdf</dc:format>
      <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
      <dc:identifier>Ahmad, Nazirah (2009) A methodology for developing high damping materials with application to noise reduction of railway track. University of Southampton, Institute of Sound and Vibration Research, Doctoral Thesis, 250pp.</dc:identifier>
      <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
    </oai_dc:dc>
  </metadata>
</record>

The lack of consistency here obviously raises some challenges for those wishing to harvest fulltext items.
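
To make the inconsistency concrete, here is a minimal Python sketch that pulls the URLs out of <dc:identifier> and <dc:relation> and guesses which ones point straight at a pdf. It is an illustration rather than the method CORE uses: the function name `candidate_links` is made up, the two sample records are cut down to the relevant elements (with the Southampton citation shortened), and the "ends in .pdf" test is an assumption that will not hold for every repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Dublin Core element namespace, as declared in the oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"


def candidate_links(record_xml):
    """Split the URLs in dc:identifier / dc:relation into two lists:
    (links that look like a direct pdf, all other http(s) links)."""
    root = ET.fromstring(record_xml)
    direct, other = [], []
    for tag in ("identifier", "relation"):
        for element in root.iter(DC + tag):
            value = (element.text or "").strip()
            parsed = urlparse(value)
            if parsed.scheme not in ("http", "https"):
                continue  # e.g. a citation string or "1/4"
            if parsed.path.lower().endswith(".pdf"):
                direct.append(value)
            else:
                other.append(value)
    return direct, other


# Cut-down versions of the two records, keeping only the relevant elements.
CAMBRIDGE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/221924</dc:identifier>
  <dc:relation>1/4</dc:relation>
</oai_dc:dc>"""

SOUTHAMPTON = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
  <dc:identifier>Ahmad, Nazirah (2009) Doctoral Thesis (citation shortened)</dc:identifier>
  <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
</oai_dc:dc>"""

if __name__ == "__main__":
    for name, record in (("Cambridge", CAMBRIDGE), ("Southampton", SOUTHAMPTON)):
        direct, other = candidate_links(record)
        print(name, "direct:", direct, "other:", other)
```

For the Cambridge record this finds no direct link, only the handle URL to follow; for the Southampton record it finds the pdf immediately and treats the <dc:relation> URL as a secondary page.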

When I posted some questions around this topic to the ever-helpful code4lib mailing list, Godmar Back (http://people.cs.vt.edu/~gback/) pointed out that the OAI-PMH specification says: “To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.” (from http://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier)

Note that this does not state what type of identifier should be used, and where a URL is used it doesn’t say that the URL should resolve to the fulltext item in the browser (although it does suggest that it should identify the resource, not the description of the resource).

As part of the same discussion Raffaele Messuti (http://atomotic.com/) noted that in Italy records describing theses are required to do the following:

Publish metadata as MPEG DIDL (see http://www.dlib.org/dlib/november03/bekaert/11bekaert.html)
Populate dii:Identifier with a URL for the html web page (jump off page) describing the item
Use didl:Component to represent each full text document composing the Item

From what I can see looking at an example (http://amsdottorato.cib.unibo.it/cgi/oai2?verb=GetRecord&metadataPrefix=didl&identifier=oai:amsdottorato.cib.unibo.it:4182) the link to the actual resource is given in <didl:Resource> within <didl:Component>.
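
A harvester could read those links with something like the Python sketch below. Several things here are assumptions rather than facts taken from the Italian guidelines: the DIDL namespace URI used, the idea that the link sits in a `ref` attribute on <didl:Resource>, the function name `fulltext_resources`, and the sample document itself, which uses invented example.org URLs rather than the real Bologna record.

```python
import xml.etree.ElementTree as ET

# Assumed MPEG-21 DIDL namespace; check it against the records you harvest.
DIDL_NS = {"didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS"}


def fulltext_resources(didl_xml):
    """Return (url, mime_type) for each didl:Resource inside a didl:Component.

    Assumes the link is carried in the Resource's 'ref' attribute.
    """
    root = ET.fromstring(didl_xml)
    results = []
    for resource in root.findall(".//didl:Component/didl:Resource", DIDL_NS):
        ref = resource.get("ref")
        if ref:
            results.append((ref, resource.get("mimeType")))
    return results


# Hypothetical record with two fulltext files, invented for illustration.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
                     ref="http://repository.example.org/4182/1/thesis.pdf"/>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
                     ref="http://repository.example.org/4182/2/appendix.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""

if __name__ == "__main__":
    for url, mime_type in fulltext_resources(SAMPLE_DIDL):
        print(mime_type, url)
```

Because each file gets its own component, an item made up of several documents simply yields several results, with no guessing about which URL is the fulltext.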

This approach feels useful not just because it introduces consistency, but also because it clearly answers the question of what to link to where the item described consists of multiple files/parts.

Creating a standard approach may prove successful for a small, well-defined community, and I think it would be useful for UK HE repository managers to work towards a standard approach, similar to the Italian etheses example. However, this would only solve the problem for CORE for a particular subset of repositories. CORE is already looking at harvesting repositories from outside the UK, and the wider we cast our net for repositories to harvest, the more likely we are to hit a variety of practices across communities.

So what will CORE do? I’m going to come back to this in a later post – in the next post in this short series I want to look at policies on metadata and fulltext harvesting, and how ‘harvesting’ differs from ‘crawling’ (the latter being the approach that a web search engine like Google might take).

2 thoughts on “Finding fulltext”

Re inconsistency in the oai-dc data, I think one of the issues here is a (possibly FRBR-like) question of what is being identified. i.e. if we take the view that the oai-dc record is “about” a single thing, what is that thing?

Various possibilities, including:

– the article/paper etc as “abstract thing” (~= FRBR Work or Expression), which may be available in several formats
– the article/paper as “concrete item” (“the fulltext item”), in a single format (~= FRBR Item), of which there may be several available
– the metadata page (splash page, jump-off page etc) (“the description of the resource” (is that of the abstract thing? or of the concrete item?))

These were the sort of issues that the work on the Scholarly Works Application Profile (SWAP) sought to address, but for one reason or another, it wasn’t really taken up.

admin says:

20th March 2012 at 12:14 pm

Thanks Pete. I agree that this is an issue. It may not be straightforward, and it’s one of the reasons I like the MPEG DIDL approach taken by the Italian etheses community, as it allows a slightly more complex view.

For CORE specifically it doesn’t matter in some ways – we make assertions about similarities between specific documents, not between the more abstract entities that the metadata might represent. That is, we compare a pdf to another pdf, and we feel confident in stating a similarity measure between the two pdfs. Obviously it makes sense for us to present this in the context of the metadata (so the user can see some meaningful description) but in the end we are saying something more specific in the RDF.
