The CORE project has produced a number of tools that can be reused or adapted to solve specific problems. In this blog post, we explain how we envisage this happening and describe how our team can assist. Some of the answers were developed during the last Advisory Board meeting, which took place on Monday 25th July.
1) Development of subject-based repositories as aggregations of content from a set of existing Open Access repositories – the CORE harvesting software can easily be configured to harvest metadata and content from any set of OAI-PMH compliant repositories. Because CORE provides access to the full texts, we can apply different text mining and classification methods to filter the content that is finally presented to the user.
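To make the harvesting step more concrete, here is a minimal sketch of how the request URLs for an OAI-PMH harvest could be built in Java. It is not the CORE or OAIHarvester2 code: the class `OaiPmhRequests`, its method names and the `example.org` base URL are assumptions made for illustration. Only the `ListRecords` verb and the `metadataPrefix`, `set` and `resumptionToken` arguments come from the OAI-PMH protocol itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of CORE: builds OAI-PMH ListRecords request URLs.
class OaiPmhRequests {

    // First request of a harvest. "set" is optional and may be null.
    static String listRecordsUrl(String baseUrl, String metadataPrefix, String set) {
        StringBuilder url = new StringBuilder(baseUrl);
        url.append("?verb=ListRecords");
        url.append("&metadataPrefix=").append(encode(metadataPrefix));
        if (set != null && !set.isEmpty()) {
            url.append("&set=").append(encode(set));
        }
        return url.toString();
    }

    // Follow-up request. In OAI-PMH the resumptionToken is an exclusive
    // argument, so no metadataPrefix or set is sent alongside it.
    static String resumeUrl(String baseUrl, String resumptionToken) {
        return baseUrl + "?verb=ListRecords&resumptionToken=" + encode(resumptionToken);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed base URLs of the repositories feeding one subject-based aggregation.
        String[] repositories = { "http://example.org/oai", "http://example.org/other/oai" };
        for (String baseUrl : repositories) {
            System.out.println(listRecordsUrl(baseUrl, "oai_dc", null));
        }
    }
}
```

A subject-based aggregation would loop over its chosen base URLs in this way, follow the resumption tokens until each repository is exhausted, and then filter the harvested records by topic.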
The development of the CORE system has been rapid, and we were overcoming issues on a daily basis. Only now that the CORE system is fully functional can we evaluate the successes and comment on the issues we had to face.
Let us start with the challenges we faced and explain how we addressed them:
- Metadata harvesting – Our decision to reuse the OCLC OAIHarvester2 as a component in our system proved to be a good one. However, as the component was originally designed as a command line tool, it had to be slightly modified before it could be used reliably on Tomcat. This mainly meant updating some of the exception handling. Overall, these fixes required relatively minor effort.
- Getting the OAI-PMH base URL for British Open Access repositories – OpenDOAR and ROAR were used as authoritative lists of the OAI-PMH base URLs. We found that these URLs were not valid for a number of the British OA repositories (we will provide a list of these repositories in one of the subsequent blog posts). In a few cases we were able to resolve the issue by guessing the correct URL. We will provide detailed feedback on this to OpenDOAR.
- Downloading content from OA repositories – We have implemented a set of Java classes to carry out the downloading of pdf files. Our pragmatic decision here was to download only content in the pdf format. There were two challenges we had to face: 1) The file download has to be fast enough. We addressed this problem by downloading the content to a set of Open University servers on a very fast connection, by using appropriate BufferedStreams in order to fully exploit the connection potential, and by automatically cancelling the download when the remote server was very slow to respond (typically when it did not send any data for two minutes). 2) The second issue was the cost of data storage. Because CORE needs to download data from many Open Access repositories, the system requires a large amount of disk space. At the moment we have downloaded and processed more than 50k files, which amounts to about 200GB of data. We estimate that approximately 5TB might currently be required to carry out the same work for all OA repositories worldwide. At the time of writing the proposal, we believed that disk space was one of the cheapest hardware components. However, we realised that fast SAS disks are required in order to carry out system backups, keep the system responsive and integrate CORE with the OU infrastructure. We have negotiated with the OU technical admin team to buy another TB of disk space for CORE, to be covered from the OU central budget at a cost of £3,000. This will enable the long-term sustainability of the CORE system for British repositories, but won’t be sufficient for all Open Access repositories worldwide.
- PDF to text extraction – This was one of the most challenging parts of the CORE system development. We tested three systems for pdf to text extraction – iText, Apache Tika (PDFBox) and pdftotext. The issue with Apache Tika was that the extraction was very slow (about 30 seconds to 1 minute for an average pdf, which was prohibitive at the scale of the application), while the issue with iText and pdftotext was the quality of the text. To summarize, Apache Tika produced good quality text but too slowly, and the other tools were fast enough but produced inferior text. Eventually we managed to speed the extraction up by optimising the part of our system that communicates with Apache Tika, using BufferedStreams instead of pure Strings. At the moment we are able to extract text from about 500 PDFs per hour.
- Thumbnail generation – In order to develop a nice search web interface that would enable access to the harvested and processed articles, we wanted to generate an image thumbnail for each article. Originally we used PDFBox for this task as well, but we discovered that about one in a thousand pdfs caused PDFBox to crash the Java Virtual Machine. This is something that in theory shouldn’t happen and we have reported this issue to Apache (https://issues.apache.org/jira/browse/PDFBOX-1019). The bug is still being worked on, but it appears that the problem requires a fix in Oracle’s Java implementation. Though the problem appeared rarely, the consequence for us was that we had to restart the Tomcat server on which our application was running. To avoid this problem completely, we have implemented a different solution which uses ImageMagick (http://www.imagemagick.org/script/index.php) instead. Since then the issue has not reoccurred.
- Similarity calculation – Our team knew right from the start of the project that we would need a very well optimised version of our similarity calculation system to be able to discover relevant papers in a reasonable amount of time, because of the large number of document combinations. To make this task possible, we have not only optimised the calculation, but also developed a new heuristic that cuts the number of combinations to be taken into account using a document frequency cut criterion. The result is that the time complexity of the similarity calculation is approximately linear with respect to the number of items in the index (in contrast to the theoretical quadratic complexity), which allows the CORE system to scale. During the project we also had to face two other issues regarding the calculation: 1) The calculation results were poor due to low text quality. This problem has been fully resolved by optimising the text extraction system. 2) The similarity calculation and the impact of the heuristic were affected by a number of strings in the index that carried no meaning. These strings were the result of extracting text from mathematical formulae, numbers and other types of noisy data. To address this issue, we have developed our own TextAnalyzer and TextFilter on top of the Lucene library, which filter out these tokens.
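To illustrate the idea behind a document frequency cut, here is a minimal Java sketch. It shows one plausible reading of the heuristic, not the CORE implementation: the class `DfCutSimilarity`, the threshold `maxDf`, the use of plain term sets and the binary cosine measure are all assumptions made for this example. Pairs of documents are only compared if they share at least one term that occurs in no more than `maxDf` documents, so very common terms never generate candidate pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of candidate pruning with a document frequency cut.
class DfCutSimilarity {

    // Small convenience for building a document as a set of terms.
    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // Inverted index: term -> ids of the documents containing it.
    static Map<String, List<Integer>> buildIndex(List<Set<String>> docs) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id)) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    // Documents sharing at least one term whose document frequency is <= maxDf.
    static Set<Integer> candidates(int docId, List<Set<String>> docs,
                                   Map<String, List<Integer>> index, int maxDf) {
        Set<Integer> result = new TreeSet<>();
        for (String term : docs.get(docId)) {
            List<Integer> postings = index.get(term);
            if (postings.size() > maxDf) {
                continue; // the cut: too common to be a useful signal
            }
            for (int other : postings) {
                if (other != docId) {
                    result.add(other);
                }
            }
        }
        return result;
    }

    // Cosine similarity of two binary term vectors.
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (String term : a) {
            if (b.contains(term)) {
                shared++;
            }
        }
        return shared / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
                terms("the", "lucene", "index"),
                terms("the", "lucene", "search"),
                terms("the", "protein", "folding"));
        Map<String, List<Integer>> index = buildIndex(docs);
        for (int id = 0; id < docs.size(); id++) {
            for (int other : candidates(id, docs, index, 2)) {
                System.out.println(id + " ~ " + other + " : " + cosine(docs.get(id), docs.get(other)));
            }
        }
    }
}
```

In this sketch each posting list that is consulted has at most `maxDf` entries, so the work per document is bounded by its number of terms times `maxDf` rather than by the size of the collection. That is the kind of bound that keeps the total cost close to linear in the number of documents.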
Overall, we are glad to say that we were able to recover from all the major issues we encountered. We found it extremely useful to develop and test the system on a daily basis using agile development methodologies. Evidence of the very active development and involvement of the CORE project team is that we already have 575 code revisions in our SVN repository since the project started.
The project team has submitted a paper describing CORE to the International Conference on Theory and Practice of Digital Libraries (TPDL 2011) – http://www.tpdl2011.org/, to be held in September in Berlin. This conference is the main scientific forum on digital libraries in Europe. The paper has been accepted; the acceptance rate for this year was 33%.
The first version of the CORE dataset was released yesterday and registered in the Linked Data cloud (http://ckan.net/package/core). The CORE project exposes data about similarities between papers in the Open Access domain. We are providing links to the OAI repository. The similarities are calculated from the full text using Natural Language Processing techniques. This distinguishes CORE from other systems, such as Mendeley or MarcXimiL. The similarities are provided only for research articles with an accessible and machine-readable full text.
The CORE Harvester system is now online! Though the system requires more testing and improvements, it is now deployed in its beta version at http://core.kmi.open.ac.uk.