Detecting duplicate records and manuscript versions in your repository

There are many reasons why a repository may end up with multiple copies of an article. A common source of near-duplicate content is holding both the author’s original manuscript and the final post-review copy. Another is multiple co-authors depositing the same manuscript without being aware of each other. Detecting (near-)duplicates and distinguishing them from different versions of the same article is both challenging and time-consuming. We have seen that a typical repository holds hundreds of duplicate and near-duplicate records, which shows the scale of the issue.

To assist the community with this challenge, we have developed the new CORE Dashboard Versions and Duplicates module. It provides a simple interface for identifying versions and duplicates in your repository. Our system pinpoints different versions of your articles, allowing you to review them side by side and label them using the widely used NISO Journal Article Versions (JAV) taxonomy. Exact duplicates can also be reviewed and marked for removal from your repository. The markings can then be exported from the Dashboard in .csv format, enabling automation in your repository. The duplicates check runs every time CORE indexes content from your repository.

How it works

When users log in to the CORE Dashboard, they will find the new Versions / Duplicates tab. This shows the number of detected matches (potential duplicates or versions) and when the deduplication check was last performed.

Screenshot from the deduplication module

The user can then click on any of the identified matches. This brings up a side-by-side comparison view populated with the respective metadata records. We also provide information about the “reference record”, a Works entity that represents a research output irrespective of the repository it is stored in. This abstract entity is created for the paper using metadata from across CORE’s data providers.

For institutions that have configured the CORE OAI Resolver, which works out of the box for most EPrints repositories, the system lets you go directly to the page representing the record in your repository.

Detailed breakdown showing three potential duplicate articles

You can then compare records and review how they differ. Each record can be labelled with its version in the JAV taxonomy. Records can subsequently be marked as “to be removed” or “different from the reference paper”.

  • “To be removed”: this indicates that one of the other presented records in the repository is the same as this record, and you have decided that this is the one to be removed.
  • “Different from the reference paper”: this indicates that this record is not related to the reference paper, and you have decided to disassociate it. This will remove the connection between this record and the reference Works entity.
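
Once markings like these have been exported as .csv, a short script can turn them into work lists for your repository. The Python sketch below is only an illustration under stated assumptions: the column names (oai_id, jav_version, action), the action values and the sample rows are hypothetical and are not taken from the actual Dashboard export, so adjust them to match the file you download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export layout: the real Dashboard column names and
# action values may differ from the ones assumed here.
SAMPLE_EXPORT = """oai_id,jav_version,action
oai:example.org:101,AM,
oai:example.org:102,VoR,
oai:example.org:103,AM,to_be_removed
oai:example.org:204,,different_from_reference
"""


def group_by_action(csv_file):
    """Group record identifiers by the action marked in the export.

    Rows with an empty action column are collected under "keep".
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        action = (row.get("action") or "").strip() or "keep"
        groups[action].append(row["oai_id"])
    return dict(groups)


# Use an in-memory file here; replace with open("export.csv", newline="")
# when working with a real download.
groups = group_by_action(io.StringIO(SAMPLE_EXPORT))
for action, ids in groups.items():
    print(action, ids)
```

The resulting lists could then feed whatever bulk-edit or deletion workflow your repository platform offers. Reviewing the "to_be_removed" list by hand before deleting anything is a sensible precaution.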

How to access it

This service is now available in Beta. Supporting and Sustaining Members will find the new Versions / Duplicates tab when they log in to the CORE Dashboard.

We are currently looking for CORE community members who might be interested in evaluating the benefits of this module with us on their repository content. If you are interested, please drop us an email and we will be in touch.

We constantly work with repository managers, research managers and many others to improve the services we provide and we welcome feedback from users as this helps inform our development roadmap.

If your institution is not yet a Supporting or Sustaining Member of CORE, please do consider joining the amazing institutions that have committed to the ongoing sustainability of CORE.

Until next time… Thanks for reading.