Duplication Detector

Duplication Detector, created for Wikipedia:Copyright problems on the English Wikipedia, is a tool used to compare any two web pages to identify text which has been copied from one to the other. Either, neither, or both pages may be current or old revisions of a Wikipedia article.

Please supply the URLs of the two web pages to compare (with the advanced version, you can also upload either document from your computer). The tool supports text, HTML, and PDF documents. For other types of documents, check Google's cache for an HTML version by doing a Google search for "cache:URL". To make the tool run faster on very large documents, increase the minimum number of words to 3. For source documents containing scattered numerals, you may need to check "Remove numbers" to get the best matches.
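To give a rough idea of what the minimum-word setting and the "Remove numbers" option control, here is a minimal PHP sketch of this style of matching. It is not the tool's actual implementation: it simply reports word sequences of at least a given length that appear in both documents, optionally discarding purely numeric tokens first. The example URLs are placeholders.

<?php
// Minimal sketch (not Duplication Detector's actual code): list word
// sequences of at least $minWords words that occur in both documents.

function tokenize(string $text, bool $removeNumbers = false): array {
    $text = strtolower(strip_tags($text));      // crude HTML-to-text conversion
    preg_match_all('/[a-z0-9]+/', $text, $m);   // split into lowercase word tokens
    $words = $m[0];
    if ($removeNumbers) {
        // "Remove numbers": drop tokens that consist only of digits
        $words = array_values(array_filter($words, fn($w) => !ctype_digit($w)));
    }
    return $words;
}

function sharedPhrases(array $a, array $b, int $minWords = 2): array {
    // Index every $minWords-word sequence of document B...
    $index = [];
    for ($i = 0; $i + $minWords <= count($b); $i++) {
        $index[implode(' ', array_slice($b, $i, $minWords))] = true;
    }
    // ...then report the sequences of document A that also occur in B.
    $matches = [];
    for ($i = 0; $i + $minWords <= count($a); $i++) {
        $key = implode(' ', array_slice($a, $i, $minWords));
        if (isset($index[$key])) {
            $matches[$key] = true;
        }
    }
    return array_keys($matches);
}

// Placeholder URLs; raising the minimum word count reduces the number of
// matches that have to be reported.
$a = tokenize(file_get_contents('https://example.org/page1'));
$b = tokenize(file_get_contents('https://example.org/page2'), true);
print_r(sharedPhrases($a, $b, 3));
?>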

Note: On May 22, 2019, I moved dupdet from the gridengine backend to Kubernetes. This change should reduce the tool's downtime by allowing it to restart automatically after an error 500. If you run into any issues, please report them on my talk page. Thank you!

Duplication Detector can see article text hidden by templates like {{copyvio}}, since the text is still present in the HTML page source, but it cannot see text that has been removed from the article. In that case, use the URL of an old revision.
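An old revision can be addressed directly through MediaWiki's oldid parameter, using the revision ID shown in the article's history. A small PHP example (the title and revision ID are placeholders):

<?php
// Build the URL of a specific old revision of an English Wikipedia article.
// Both values are placeholders; take the real revision ID from the history page.
$title = 'Example_article';
$oldid = 123456789;
echo 'https://en.wikipedia.org/w/index.php?title=' . rawurlencode($title)
   . '&oldid=' . $oldid . "\n";
?>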

Simple version (generates pages that can be linked to):

Advanced version (allows uploads):

Things to do in the future:

The PHP source for Duplication Detector is available under the Simplified BSD License and was originally written by Derrick Coetzee. It does not require Tool Labs to run, so feel free to download it and run it yourself on your own web server or with the PHP command-line tool. Downloads are available as .tar.gz and .zip archives; the latest version is available from GitHub.
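For a quick local test, the PHP command-line tool's built-in web server is enough. The script name below (compare.php) is an assumption; check the contents of the archive you downloaded:

# Run this from the unpacked source directory.
php -S localhost:8000
# Then browse to http://localhost:8000/compare.php (adjust the script name
# if the archive uses a different one).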