In the last six months, CORE has made a huge step forward in terms of the technology solution. According to our project management software, to this day, we have resolved 214 issues. Each issue corresponds to a new function or a fixed bug.
The idea of this blog post is to provide an overview of the technologies and standards CORE is using and to report on the experience we had with them during the development of CORE in the last months. We will provide more information about the new features and enhancements in the following blog posts.
Highlights
Technologies
Tomcat Web server – CORE has been using Tomcat as an application container since its start, however relatively recently the CORE frontend has been deployed as a Tomcat cluster. This means that the application is deployed on multiple (currently just 2) machines. Using a load balancer the web traffic is redirected to any of these servers. The advantage of this solution is not only performance, but also the reliability of the service. For example, it is now possible for us to redeploy the application while the CORE service is still running. At the same time, the architecture is prepared for growth in the future. So far, our experience with this solution is generally positive.
Apache Nutch – We have adopted Apache Nutch in order to obey the information in the robots.txt file. Apache Nutch makes the implementation very simple and we have a very positive experience with it.
SVM Light – Support Vector Machine (in particular SVM multiclass) classifiers have been used in CORE to perform a pilot set of tests for text classification of research papers. While the tool is extremely simple to set up and great to work with, it does not allow building models from a very large number of examples. Although we couldn’t utilise all examples we have, the tool was still good enough for carrying out experiments. We are now looking how to improve the scalability in the training phase to make use of a larger number of examples. We think that tools, such as Apache Mahout, might be able to provide the answer.
Google Charts – Google Charts have been used for graphs in the Repository Analytics. A very cool interactive graphs, easy to set up.
Logback – used to improve logging in CORE and also to provide logs in the Repository Analytics tool.
Apache Lucene – has been used previously and has proved to be a great tool – very fast and scalable.
Language detection software – The issue of language detection appeared more important to resolve as the content in the CORE aggregation system has growed. Particularly with the aggregation of content from the Directory of Open Access Journals, it started to be important to distinguish publications in different languages. We originally tried to approach this problem using the AlchemyAPI. AlchemyAPI offers their API for free for a low number (several thousand) of queries per day. This can be extended up to 30k queries for non-commercial academic providers. We asked AlchemyAPI to provide this, but learned that they require you to acknowledge AlchemyAPI on every publication about your project (even those that talk about completely different aspects of the system). Therefore, we have decided to use the Language Detection Library for Java available on Google code. We are very happy with this decision.
ParsCit (citation extraction) – we have used ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package for detection of basic metadata and citation parsing from full-text. ParsCit provides reasonable performance and accuracy and we are quite happy with it.
AlchemyAPI – We have used this for language detection in the past and also for concept extraction. Overall, we feel this solution is not suitable for us due to their licensing restrictions and lack of transparency of the extraction methods and we are moving to a different solution.
Blekko – Blekko is a search engine which offers an API that allows you to query the Web free of charge at a maximum frequency of 1 query per second. This is fabulous in comparison to other search enines like Google, Yahoo or Bing who either extremely restrict the use of their API or charge enormously high fees for its use. Unfortunately, the Blekko API doesn’t provide indexed results of pdfs. Something that would be very useful for the focused crawling functionality of CORE. Still, as far as we know, this is the best free search API available.
Additional tools on which CORE is based have been described in our previous blog post
Standards
In terms of standards, CORE has been using information in the robots.txt file to set the harvesting frequency and obey the wishes of the repository owner to restrict access to certain parts of the system. However, we have noticed that certain archives specify different policies for different bots, for example allowing GoogleBot into sections that are invisible to other bots or assigning a lower crawl-delay to the GoogleBot than to other bots. We considers such policies unfair and violating the principles of Open Access.
We have developed the CORE API as a RESTful service (in addition to our SPARQL endpoint). While it might sound politically incorrect, we have found the RESTful service to be much more popular among developers than the SPARQL endpoint.
Techniques and approaches
During the project, we have improved the architecture of the system and made steps towards helping it to grow. The system has been divided into a front-end (currently 2 machines), responsible for dealing with the requests coming from the web, and a powerful backend machine, responsible for all the harvesting and processing. The applications are synchronised using a database. Indexes from the back-end machine are daily synced to the front-end machines.
Another useful tool we have developed is a self-test module which periodically monitors the health of the systems and provides information in case something doesn’t seem right.