A repository may end up with multiple copies of an article for many reasons. A common source of near-duplicate content is holding both the author’s original manuscript and the final post-review copy. Another is when multiple co-authors each deposit the same manuscript, unaware that the others have done so. Detecting (near-)duplicates and distinguishing them from different versions of the same article is both challenging and time-consuming. We have seen that a typical repository contains hundreds of duplicate and near-duplicate records, which indicates the scale of the issue.