I'm working on a new deduplication program. It's still experimental, so I don't want to release it publicly yet, and it's currently Linux-only. You can probably compile it for Windows, though; the only part likely to give you trouble is the memory-mapped IO. Anyone want to try it and evaluate?
It's a format-agnostic dedup. The short version: it looks for files that share long stretches of data. It'll pick up things like multiple edits of one document, multiple archives that all contain the same file (provided they use identical compression settings), or different versions of a file that differ in metadata but share most of their data. It scans a large collection of files, does some math, and returns a list of the ones that look similar.
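To give a rough idea of the shape of the thing, here's a toy sketch (not the real code; the 64-byte window, the FNV-1a hash, and the sampling rate are all placeholders): hash a window of bytes at every offset in every file, keep a sparse sample of the hashes as "anchors", and treat files that emit the same anchors as candidates for sharing a long stretch.

    /* Toy sketch: hash a WINDOW-byte slice at every offset and keep a
     * sparse sample of the hashes as "anchors".  Two files emitting the
     * same anchors very likely share a long run of identical bytes. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define WINDOW 64

    /* FNV-1a: a simple non-cryptographic hash, purely a placeholder. */
    static uint64_t fnv1a(const unsigned char *p, size_t len)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    static void emit_anchors(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) { perror(path); return; }
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *buf = malloc(len);
        if (!buf || fread(buf, 1, len, f) != (size_t)len) {
            fclose(f); free(buf); return;
        }
        fclose(f);

        for (long i = 0; i + WINDOW <= len; i++) {
            uint64_t h = fnv1a(buf + i, WINDOW);  /* O(WINDOW) per offset */
            if ((h & 0xff) == 0)                  /* keep ~1/256 as anchors */
                printf("%s %ld %016llx\n", path, i, (unsigned long long)h);
        }
        free(buf);
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            emit_anchors(argv[i]);
        return 0;
    }

Sort the output by hash and any anchor that shows up in two files points at a candidate shared stretch. Note that the inner fnv1a call rehashes the whole window at every offset; that O(window) cost per position is exactly what a rolling hash collapses to O(1), which is why Rabin comes up below.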
Anyway, the actual code:
https://pastebin.com/GNYGhZCf
This is barely tested. It's probably full of memory leaks and such; it's a proof of concept. And yes, Rabin would speed it up a little, but I'm struggling to learn the math behind it.
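For reference, the property Rabin buys you (as far as I can tell) is a window hash you can slide one byte at a time in O(1), instead of rehashing the whole window at each offset. The proper Rabin fingerprint does this with polynomial arithmetic over GF(2), which is the part I haven't wrapped my head around; the simpler modular (Rabin-Karp) variant of the same sliding trick looks like this (my sketch, constants are placeholders):

    /* Rabin-Karp style rolling hash over a WINDOW-byte window:
     *   H = b[0]*B^(W-1) + b[1]*B^(W-2) + ... + b[W-1]   (mod M)
     * Sliding one byte is O(1): subtract the outgoing byte's term,
     * multiply by B, add the incoming byte. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define WINDOW 16
    #define BASE   257ULL
    #define MOD    1000000007ULL   /* any large prime */

    int main(void)
    {
        const unsigned char *data = (const unsigned char *)
            "the quick brown fox jumps over the lazy dog";
        size_t len = strlen((const char *)data);
        if (len < WINDOW)
            return 0;

        /* B^(WINDOW-1) mod M: the weight of the byte leaving the window. */
        uint64_t top = 1;
        for (int i = 0; i < WINDOW - 1; i++)
            top = (top * BASE) % MOD;

        /* Hash of the first window, computed the slow way once. */
        uint64_t h = 0;
        for (int i = 0; i < WINDOW; i++)
            h = (h * BASE + data[i]) % MOD;
        printf("0: %llu\n", (unsigned long long)h);

        /* Roll: drop data[i], admit data[i + WINDOW]. */
        for (size_t i = 0; i + WINDOW < len; i++) {
            h = (h + MOD - (data[i] * top) % MOD) % MOD;
            h = (h * BASE + data[i + WINDOW]) % MOD;
            printf("%zu: %llu\n", i + 1, (unsigned long long)h);
        }
        return 0;
    }

Same trick, different arithmetic: as I understand it, the real Rabin fingerprint swaps the multiply-mod-prime for GF(2) polynomial math, which implementations turn into table lookups and XORs. Faster, but harder to follow.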