Metadata and content harvesting component testing

We are currently testing a newly developed harvesting component. This component is responsible for acquiring (1) metadata records and (2) the associated full-text content from Open Access repositories. Metadata is harvested using standard OAI-PMH requests to the repositories. A successful request returns an XML document that describes the papers stored in a repository. The OAI-PMH protocol focuses on the transfer of metadata and is not itself concerned with downloading full-text content. However, a good practice in repositories (unfortunately not consistently applied) is to include the URLs of the full-text documents in the metadata. Document URLs can thus be extracted and used to download the full texts automatically from repositories over HTTP. The CORE system provides this functionality and is optimized for regular metadata harvesting and full-text downloading of large amounts of content. CORE caches the actual full-text content in order to process the documents and discover additional metadata. This distinguishes it from a number of other Open Access federated search systems, such as BASE or OAISTER, which rely only on the metadata accessible through OAI-PMH.
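The URL extraction step described above can be sketched as follows. This is a minimal illustration and not the CORE implementation. It assumes the repository exposes records in the common oai_dc format and places full-text links in dc:identifier elements, which is not true of every repository. The function name extract_fulltext_urls and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def extract_fulltext_urls(xml_text):
    """Map each record identifier to the HTTP(S) URLs found in its metadata.

    Assumption: candidate full-text links appear in dc:identifier elements.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        record_id = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        urls = []
        for element in record.iterfind(".//dc:identifier", NS):
            value = (element.text or "").strip()
            # Keep only values that look like web addresses.
            if value.lower().startswith(("http://", "https://")):
                urls.append(value)
        result[record_id] = urls
    return result


# Hypothetical ListRecords response with a single record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample paper</dc:title>
          <dc:identifier>http://example.org/papers/1.pdf</dc:identifier>
          <dc:identifier>ISSN 0000-0000</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(extract_fulltext_urls(SAMPLE))
```

Records whose metadata carries no URL yield an empty list, which reflects the inconsistency noted above: for such records only the metadata can be harvested.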

The developed harvesting component, which is accompanied by a web user interface, supports the following functions:
– Adding, removing and editing repositories – This is done by specifying the repository's OAI-PMH harvesting URL within the CORE application user interface.
– Scheduling metadata and full-text harvesting – Depending on the size and importance of a particular repository, the CORE application allows the harvesting frequency to be set and updates to be scheduled flexibly.
– Monitoring the progress and the status of the harvesting – The user interface of the CORE system provides up-to-date information about the currently running and scheduled harvesting tasks and allows them to be managed.
– Providing statistics on the downloaded records and full-text items – The CORE system displays the number of downloaded metadata records and full-text documents per repository.
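
The per-repository scheduling function listed above could work along the following lines. This is only a sketch under stated assumptions: the text does not describe how CORE stores harvesting frequencies, so the dictionary fields (name, interval_days, last_harvested) and the function due_repositories are hypothetical.

```python
from datetime import datetime, timedelta


def due_repositories(repositories, now):
    """Return the names of repositories whose harvesting interval has elapsed.

    Each repository is a dict with hypothetical keys:
      name            - label of the repository
      interval_days   - how often it should be harvested
      last_harvested  - datetime of the last harvest, or None if never harvested
    """
    due = []
    for repo in repositories:
        last = repo["last_harvested"]
        interval = timedelta(days=repo["interval_days"])
        if last is None or now - last >= interval:
            due.append(repo["name"])
    return due


if __name__ == "__main__":
    now = datetime(2011, 6, 15)
    repos = [
        {"name": "large-repo", "interval_days": 1,
         "last_harvested": datetime(2011, 6, 13)},
        {"name": "small-repo", "interval_days": 30,
         "last_harvested": datetime(2011, 6, 1)},
        {"name": "new-repo", "interval_days": 7, "last_harvested": None},
    ]
    print(due_repositories(repos, now))
```

A large or important repository would simply be given a shorter interval than a small one.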

Current challenges:
– Harvesting full-text content requires a relatively large amount of storage space.
– Support for PDF download varies between repositories (some repositories protect their PDF files with a password even though they claim to be Open Access).
– Many files from Open Access repositories are scanned documents whose content cannot be read without Optical Character Recognition techniques.
– The harvesting system must support parallel downloads from many repositories to reduce the harvesting time.
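
The parallel download requirement in the last item can be illustrated with a thread pool. This is a minimal sketch, not the approach used in CORE, which the text does not specify. The names fetch and download_all and the default of eight workers are assumptions. Failed requests, for example to password-protected files, are collected separately so that one failure does not stop the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one document over HTTP and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_fn=fetch, max_workers=8):
    """Download many URLs concurrently.

    Returns (results, errors): results maps each URL to the downloaded
    bytes, errors maps each failed URL to the exception that was raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # e.g. HTTP 401/403 for protected files
                errors[url] = exc
    return results, errors
```

The fetch_fn parameter allows the network call to be replaced, for instance with a stub during testing. A production harvester would also need to limit the number of simultaneous requests sent to any single repository, which this sketch does not do.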